This is an overview of how distributed data grids enable data sharing across web servers and virtual cloud environments to deliver scalability and high availability. It also covers how distributed data grids support running MapReduce-style analysis across large data sets.
2. Agenda
• The Need for Memory-Based, Distributed Storage
• What Is a Distributed Data Grid (DDG)?
• Performance Advantages and Architecture
• Migrating Data to the Cloud and Across Global Sites
• Parallel Data Analysis
• Comparison of DDGs to File-Based Map/Reduce
2 WSTA Seminar
3. The Need for Memory-Based Storage
Example: a web server farm:
• A load-balancer directs incoming client requests to web servers.
• Web and application server farms build web pages and run business logic.
• A database server holds all mission-critical, LOB data.
• Server farms share fast-changing data using a DDG to avoid bottlenecks and maximize scalability.
[Diagram: Internet traffic flows through a load-balancer to a web server farm, then to application servers, all sharing a distributed, in-memory data grid over Ethernet; the database servers with their RAID disk array form the bottleneck]
4. The Need for Memory-Based Storage
Example: a cloud application:
• The application runs as multiple virtual servers (VS).
• Application instances store and retrieve LOB data from a cloud-based file system or database.
• Applications need fast, scalable storage for fast-changing data.
• The distributed data grid runs as multiple virtual servers to provide “elastic,” in-memory storage.
[Diagram: application VSes in the cloud backed by a distributed data grid of grid VSes, alongside cloud-based storage]
5. What is a Distributed Data Grid?
• A new “vertical” storage tier:
– Adds a missing layer to boost performance.
– Uses in-memory, out-of-process storage.
– Avoids repeated trips to backing storage.
• A new “horizontal” storage tier:
– Allows data sharing among servers.
– Scales performance and capacity.
– Adds high availability.
– Can be used independently of backing storage.
[Diagram: the storage hierarchy on each server — processor cache, L2 cache, in-process application memory, out-of-process distributed cache, and backing storage]
6. Distributed Data Grids: A Closer Look
• Incorporates a client-side, in-process cache (“near cache”):
– Transparent to the application.
– Holds recently accessed data.
• Boosts performance:
– Eliminates repeated network data transfers and deserialization.
– Reduces access times to near “in-process” latency.
– Is automatically updated if the distributed grid changes.
– Supports various coherency models (coherent, polled, event-driven).
[Diagram: in-process application memory and the client-side cache layered above the out-of-process distributed cache]
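The near-cache behavior above can be sketched in a few lines (Python is used here for brevity; the `DistributedCache` and `NearCache` classes are hypothetical illustrations, not a real product API). This sketch models coherency by checking an object version on each read; an event-driven grid would instead push invalidations to the client.

```python
class DistributedCache:
    """Stand-in for the out-of-process distributed grid (hypothetical API)."""
    def __init__(self):
        self._store = {}                       # key -> (value, version)

    def get(self, key):
        return self._store.get(key)

    def put(self, key, value):
        _, version = self._store.get(key, (None, 0))
        self._store[key] = (value, version + 1)


class NearCache:
    """Client-side, in-process cache that revalidates its copy against the grid."""
    def __init__(self, grid):
        self._grid = grid
        self._local = {}                       # key -> (value, version)

    def get(self, key):
        remote = self._grid.get(key)
        if remote is None:
            self._local.pop(key, None)
            return None
        cached = self._local.get(key)
        if cached is not None and cached[1] == remote[1]:
            return cached[0]                   # near-cache hit: no deserialization
        self._local[key] = remote              # refresh a stale or missing entry
        return remote[0]


grid = DistributedCache()
near = NearCache(grid)
grid.put("cart:42", ["book"])
first = near.get("cart:42")                    # pulled from the grid, kept locally
grid.put("cart:42", ["book", "pen"])           # a grid update bumps the version
second = near.get("cart:42")                   # stale local copy is refreshed
```

In a real grid the version check avoids re-shipping and deserializing the object body when the local copy is current, which is where the latency savings come from.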
7. Performance Benefit of Client-side Cache
• Eliminates repeated network data transfers.
• Eliminates repeated object deserialization.
[Chart: average response time for 10 KB objects at a 20:1 read/update ratio, measured in microseconds (0 to 3,500), comparing DDG and DBMS]
8. Top 5 Benefits of Distributed Data Grids
1. Faster access time for business logic state or database data.
2. Scalable throughput to match a growing workload and keep response times low.
3. High availability to prevent data loss if a grid server (or network link) fails.
4. Shared access to data across the server farm.
5. Advanced capabilities for quickly and easily mining data using scalable “map/reduce” analysis.
[Chart: access latency (msec) vs. throughput (accesses/sec), comparing grid and DBMS]
9. Scaling the Distributed Data Grid
• A distributed data grid must deliver scalable throughput.
• To do so, its architecture must eliminate bottlenecks to scaling:
– Avoid centralized scheduling to eliminate hot spots.
– Use data partitioning and maintain load balance to allow scaling.
– Use fixed (vs. full) replication to avoid n-fold overhead.
– Use low-overhead heart-beating.
• Example of linear throughput scaling:
[Chart: read/write throughput for 10 KB objects, in accesses per second (up to 80,000), scaling linearly from 4 to 64 nodes as the object count grows from 16,000 to 256,000]
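The partitioning idea can be sketched as a minimal example (Python for brevity; `partition_for` and `owner_of` are illustrative names, not a product API). Keys hash into a fixed set of partitions, and whole partitions, not individual keys, are assigned to servers, so every client computes an object's owner locally with no central scheduler:

```python
import hashlib

NUM_PARTITIONS = 16

def partition_for(key: str) -> int:
    # A stable hash spreads keys evenly across the fixed partition space.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

def owner_of(key: str, servers: list) -> str:
    # Partitions are dealt out round-robin; rebalancing moves whole partitions.
    return servers[partition_for(key) % len(servers)]

servers = ["grid-0", "grid-1", "grid-2", "grid-3"]
placements = {s: 0 for s in servers}
for i in range(1000):
    placements[owner_of(f"order:{i}", servers)] += 1
```

With a uniform hash, each of the four servers ends up owning roughly a quarter of the keys, which is what keeps hot spots from forming as the workload grows.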
10. Typical Commercial Distributed Data Grids
• Partition objects to scale throughput and avoid hot
spots.
• Synchronize access to objects across all servers.
• Dynamically rebalance objects to avoid hot spots.
• Replicate each cached object for high availability.
• Detect server or network failures and self-heal.
[Diagram: a client application uses a client library to retrieve a cached copy of an object from one of several cache services on the Ethernet; the distributed cache holds the object copy and a replica on different servers]
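The replicate-and-self-heal behavior can be sketched as follows (Python, illustrative only; `ReplicatedStore` is a hypothetical class, and a real grid would move just the affected partitions rather than re-placing everything after a failure):

```python
import zlib

class ReplicatedStore:
    """Keeps a primary and one replica copy of each object on different servers."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.data = {s: {} for s in self.servers}       # server -> {key: value}

    def _placement(self, key):
        i = zlib.crc32(key.encode("utf-8")) % len(self.servers)
        primary = self.servers[i]
        replica = self.servers[(i + 1) % len(self.servers)]
        return primary, replica

    def put(self, key, value):
        primary, replica = self._placement(key)
        self.data[primary][key] = value
        self.data[replica][key] = value                 # synchronous replication

    def get(self, key):
        primary, _ = self._placement(key)
        return self.data[primary].get(key)

    def fail(self, server):
        """Simulate a failure: drop the server, then restore full redundancy."""
        self.servers.remove(server)
        self.data.pop(server)
        # Every key still has at least one surviving copy; gather them all
        # and re-place everything under the new membership (simplification).
        items = {}
        for store in self.data.values():
            items.update(store)
        self.data = {s: {} for s in self.servers}
        for key, value in items.items():
            self.put(key, value)


store = ReplicatedStore(["grid-0", "grid-1", "grid-2", "grid-3"])
for i in range(100):
    store.put(f"obj:{i}", i)
store.fail("grid-2")                                    # no data is lost
```

Because every object lives on two different servers, losing one server never loses data; self-healing then re-creates the missing copies so a second failure is also survivable.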
11. Wide Range of Applications
Financial Services:
• Portfolio risk analysis
• VaR calculations
• Monte Carlo simulations
• Algorithmic trading
• Market message caching
• Derivatives trading
• Pricing calculations
• News story caching

E-commerce:
• Session-state storage
• Application state storage
• Online banking
• Loan applications
• Wealth management
• Online learning
• Hotel reservations

Other Applications:
• Edge servers: chat, email
• Online gaming servers
• Scientific computations
• Command and control
• Shopping carts
• Social networking
• Service call tracking
• Online surveys
12. Importance for Cloud Computing
• Cloud computing:
– Makes elastic resources readily available, but…
– Clouds have relatively slow interconnects.
• Distributed data grids add significant value in the cloud:
– Allow data sharing across a group of virtual servers.
– Elastically scale throughput as needed.
– Provide low-latency, object-oriented storage.
• Clouds provide the elastic platform for parallel data analysis.
• DDGs provide the efficiency and scalability needed to overcome the cloud’s limited interconnect speed.
13. DDGs Simplify Data Migration to the Cloud
• Distributed data grids can automatically bridge on-premise and cloud-based data grids to unify access.
• This enables seamless access to data across multiple sites.
[Diagram: a cloud-hosted distributed data grid (SOSS virtual servers) serving a cloud application is bridged to an on-premise distributed data grid (SOSS hosts) serving on-premise applications; data migrates automatically between the cloud-based cache and the on-premise cache and its backing store]
14. DDGs Enable Seamless Global Access
[Diagram: a global distributed data grid linking mirrored data centers and satellite data centers, each running its own distributed data grid of SOSS servers]
15. Introducing Parallel Data Analysis
• The goal:
– Quickly analyze a large set of data for patterns and trends.
– How? Run a method E (“eval”) across a set of objects D in parallel.
– Optionally merge the results using a method M (“merge”).
• Evolution of parallel analysis:
– '80s: “SIMD/SPMD” (Flynn, Hillis)
– '90s: “Domain decomposition” (Intel, IBM)
– '00s: “Map/reduce” (Google, Hadoop, Dryad)
• Applications:
– Search, financial services, business intelligence, simulation
[Diagram: method E evaluates a grid of objects D in parallel; method M merges the partial results into a final result]
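The eval/merge pattern itself is small enough to sketch directly (Python for brevity, with a toy dataset; `parallel_analyze` is an illustrative helper, not a product API):

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def parallel_analyze(objects, eval_fn, merge_fn):
    """Run eval_fn on every object in parallel, then fold the results with merge_fn."""
    with ThreadPoolExecutor() as pool:
        partial_results = list(pool.map(eval_fn, objects))
    return reduce(merge_fn, partial_results)

# Toy example: count even values across a partitioned dataset.
dataset = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
even_count = parallel_analyze(
    dataset,
    eval_fn=lambda part: sum(1 for x in part if x % 2 == 0),   # E: per-object analysis
    merge_fn=lambda a, b: a + b,                               # M: combine two results
)
```

In a DDG, E runs on the grid server that already holds each object, so the data never moves; only the small partial results travel over the network.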
16. Example in Financial Services
Analyze trading strategies across stock histories:
Why?
• Back-testing systems help guard against risks in deploying new
trading strategies.
• Performance is critical for “first to market” advantage.
• Uses a significant amount of market data and computation time.
How?
• Write method E to analyze trading strategies across a single
stock history.
• Write method M to merge two sets of results.
• Populate the data store with a set of stock histories.
• Run method E in parallel on all stock histories.
• Merge the results with method M to produce a report.
• Refine and repeat…
17. Stage the Data for Analysis
• Step 1: Populate the distributed data grid with objects each of which
represents a price history for a ticker symbol:
18. Code the Eval and Merge Methods
• Step 2: Write a method to evaluate a stock history based on parameters:
Results EvalStockHistory(StockHistory history, Parameters params)
{
<analyze trading strategy for this stock history>
return results;
}
• Step 3: Write a method to merge the results of two evaluations:
Results MergeResults(Results results1, Results results2)
{
<merge both results>
return results;
}
• Notes:
– This code runs as a sequential calculation on in-memory data.
– No explicit accesses to the distributed data grid are needed.
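A runnable sketch of Steps 2 and 3 might look like this (Python for brevity; the `StockHistory` and `Results` types and the dip-buying “strategy” are invented placeholders for whatever real back-testing logic fills the <analyze…> section):

```python
from dataclasses import dataclass

@dataclass
class StockHistory:
    ticker: str
    prices: list            # daily closing prices, oldest first

@dataclass
class Results:
    trades: int = 0
    pnl: float = 0.0        # total profit/loss across evaluated histories

def eval_stock_history(history: StockHistory, dip: float) -> Results:
    """Toy strategy: buy after any one-day drop larger than `dip`, sell a day later."""
    r = Results()
    prices = history.prices
    for prev, buy, sell in zip(prices, prices[1:], prices[2:]):
        if prev - buy > dip:
            r.trades += 1
            r.pnl += sell - buy
    return r

def merge_results(a: Results, b: Results) -> Results:
    """Combining two partial results is just summing the counters."""
    return Results(trades=a.trades + b.trades, pnl=a.pnl + b.pnl)

acme = eval_stock_history(StockHistory("ACME", [10, 8, 9, 9, 7, 10]), dip=1.0)
```

As the notes above say, nothing here touches the grid explicitly: the grid invokes the eval method on each stored history in place and applies the merge method to the partial results.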
19. Run the Analysis
• Step 4: Invoke parallel evaluation and merging of results:
Results Invoke(EvalStockHistory, MergeResults, querySpec, params);
[Diagram: the grid runs EvalStockHistory() on each stored object in parallel and combines the partial results with MergeResults()]
20. Start Parallel Analysis
[Diagram: the client starts the analysis; .eval() runs on each stock history in parallel to produce per-object results, which .merge() combines pairwise in successive rounds until a single result is returned to the client]
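The pairwise combining shown above can be sketched as a log-depth merge tree (Python; `tree_merge` is an illustrative helper, not a product API):

```python
def tree_merge(results, merge_fn):
    """Combine results pairwise in rounds, the way neighboring grid servers
    merge their partial results until one final result remains."""
    while len(results) > 1:
        next_round = [merge_fn(results[i], results[i + 1])
                      for i in range(0, len(results) - 1, 2)]
        if len(results) % 2:
            next_round.append(results[-1])     # odd one out waits a round
        results = next_round
    return results[0]

total = tree_merge([1, 2, 3, 4, 5, 6], lambda a, b: a + b)
```

Pairwise merging across the grid takes a logarithmic number of rounds, instead of the linear fold a single client would perform on its own.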
21. DDG Minimizes Data Motion
• File-based map/reduce must move data into memory for analysis:
[Diagram: M/R servers run E only after loading objects D from a file system or database into server memory]
• A memory-based DDG analyzes data in place:
[Diagram: grid servers run E directly on objects D held in the distributed data grid]
22. Start Parallel Analysis
[Diagram: the same eval/merge flow as before, but with file I/O required to load each stock history before .eval() and to stage intermediate results between .merge() rounds]
23. Performance Impact of Data Motion
Measured random access to DDG data to simulate file I/O:
24. Comparison of DDGs and File-Based M/R
                        DDG                          File-Based M/R
Data set size           Gigabytes to terabytes       Terabytes to petabytes
Data repository         In-memory                    File / database
Data view               Queried object collection    File-based key/value pairs
Development time        Low                          High
Automatic scalability   Yes                          Application-dependent
Best use                Quick-turn analysis of       Complex analysis of
                        memory-based data            large datasets
I/O overhead            Low                          High
Cluster mgt.            Simple                       Complex
High availability       Memory-based                 File-based
25. Walk-Away Points
• Developers need fast, scalable, highly available, and sharable memory-based storage for scaled-out applications.
• Distributed data grids (DDGs) address these needs with:
– Fast access time & scalable throughput
– Highly available data storage
– Support for parallel data analysis
• Cloud-based and globally distributed applications need DDGs to:
– Support scalable data access for “elastic” applications.
– Efficiently and easily migrate data across sites.
– Avoid relatively slow cloud I/O storage and interconnects.
• DDGs offer simple, fast “map/reduce” parallel analysis:
– Make it easy to develop applications and configure clusters.
– Avoid file I/O overhead for datasets that fit in memory-based grids.
– Deliver automatic, highly scalable performance.
26. Distributed Data Grids for
Server Farms & High Performance Computing
www.scaleoutsoftware.com