Using Distributed, In-Memory Computing for Fast Data Analysis
WSTA Seminar
September 14, 2011

Bill Bain (wbain@scaleoutsoftware.com)

Copyright © 2011 by ScaleOut Software, Inc.
Agenda
• The Need for Memory-Based, Distributed
  Storage
• What Is a Distributed Data Grid (DDG)
• Performance Advantages and Architecture
• Migrating Data to the Cloud and Across Global
  Sites
• Parallel Data Analysis
• Comparison of DDG to File-Based
  Map/Reduce


The Need for Memory-Based Storage
Example: Web server farm:
• Load-balancer directs incoming client requests to Web servers.
• Web and app. server farms build Web pages and run business logic.
• Database server holds all mission-critical, LOB data.
• Server farms share fast-changing data using a DDG to avoid bottlenecks
  and maximize scalability.
[Figure: server farm diagram showing the Internet, the load-balancer, Web servers, app. servers, and database servers with a RAID disk array (labeled “Bottleneck”), connected by Ethernet; a distributed, in-memory data grid is shown alongside the Web server tier and the app. server tier.]



The Need for Memory-Based Storage
Example: Cloud Application:
• Application runs as multiple, virtual servers (VS).
• Application instances store and retrieve LOB data from cloud-based
  file system or database.
• Applications need fast, scalable storage for fast-changing data.
• Distributed data grid runs as multiple, virtual servers to provide
  “elastic,” in-memory storage.
[Figure: a cloud application made up of App VS instances, a distributed data grid made up of Grid VS instances, and cloud-based storage.]

What is a Distributed Data Grid?
• A new “vertical” storage tier:
    – Adds missing layer to boost performance.
    – Uses in-memory, out-of-process storage.
    – Avoids repeated trips to backing storage.
• A new “horizontal” storage tier:
    – Allows data sharing among servers.
    – Scales performance & capacity.
    – Adds high availability.
    – Can be used independently of backing storage.
[Figure: storage hierarchy for two servers: processor cache, L2 cache, application memory (“in-process”), distributed cache (“out-of-process”), and backing storage.]
Distributed Data Grids: A Closer Look
• Incorporates a client-side, in-process cache (“near cache”):
    – Transparent to the application.
    – Holds recently accessed data.
• Boosts performance:
    – Eliminates repeated network data transfers & deserialization.
    – Reduces access times to near “in-process” latency.
    – Is automatically updated if the distributed grid changes.
    – Supports various coherency models (coherent, polled, event-driven).
[Figure: application memory (“in-process”), client-side cache (“in-process”), and distributed cache (“out-of-process”) shown as stacked layers.]
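
The near-cache behavior described above can be sketched in a few lines. This is only an illustration under stated assumptions: `RemoteGrid` and `NearCache` are hypothetical classes, not part of any product API, and the per-read version check is just one possible way to keep the local copy coherent.

```python
import pickle


class RemoteGrid:
    """Hypothetical stand-in for the out-of-process grid (serialized storage)."""

    def __init__(self):
        self._store = {}   # key -> (version, serialized bytes)
        self.fetches = 0   # counts full object transfers, for illustration

    def put(self, key, obj):
        version = self._store.get(key, (0, b""))[0] + 1
        self._store[key] = (version, pickle.dumps(obj))

    def version(self, key):
        return self._store[key][0]

    def fetch(self, key):
        self.fetches += 1
        return self._store[key]


class NearCache:
    """Client-side, in-process cache that holds deserialized objects."""

    def __init__(self, grid):
        self._grid = grid
        self._local = {}   # key -> (version, deserialized object)

    def get(self, key):
        current = self._grid.version(key)   # small version check, no payload
        cached = self._local.get(key)
        if cached is not None and cached[0] == current:
            return cached[1]                # hit: no transfer, no deserialization
        version, payload = self._grid.fetch(key)
        obj = pickle.loads(payload)
        self._local[key] = (version, obj)
        return obj


grid = RemoteGrid()
grid.put("IBM", {"last": 170.1})
cache = NearCache(grid)
first = cache.get("IBM")
second = cache.get("IBM")          # served from the near cache
grid.put("IBM", {"last": 171.4})
third = cache.get("IBM")           # version changed, so the object is re-fetched
```

Only two full transfers occur for the three reads; the second read returns the already deserialized object.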
Performance Benefit of Client-side Cache

• Eliminates repeated network data transfers.
• Eliminates repeated object deserialization.

[Chart: Average Response Time in microseconds (axis 0 to 3500) for DDG vs. DBMS; 10KB objects, 20:1 read/update ratio.]


Top 5 Benefits of Distributed Data Grids
1. Faster access time for business logic state or database data
2. Scalable throughput to match a growing workload and keep
   response times low
3. High availability to prevent data loss if a grid server (or network
   link) fails
4. Shared access to data across the server farm
5. Advanced capabilities for quickly and easily mining data using
   scalable, “map/reduce,” analysis.
[Chart: Access Latency vs. Throughput; access latency (msec) plotted against throughput (accesses / sec) for Grid and DBMS.]



Scaling the Distributed Data Grid
• Distributed data grid must deliver scalable throughput.
• To do so, its architecture must eliminate bottlenecks to
  scaling:
     – Avoid centralized scheduling to eliminate hot spots.
     – Use data partitioning and maintain load-balance to allow scaling.
     – Use fixed vs. full replication to avoid n-fold overhead.
     – Use low overhead heart-beating.
• Example of linear throughput scaling:
  [Chart: Read/Write Throughput, 10KB objects; accesses / second (axis 0 to 80,000) against grid size from 4 to 64 nodes, with the object count growing from 16,000 to 256,000.]
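
To make the partitioning and fixed-replication points concrete, here is a minimal sketch. It assumes simple hash-modulo placement with one backup copy per object; the function name `owners` and the node names are hypothetical, and a real grid may place and rebalance objects quite differently.

```python
import hashlib


def owners(key, nodes, replicas=1):
    """Return the primary node plus `replicas` backup nodes for a key.

    The replica count is fixed, so the number of copies per object stays
    constant as nodes are added (full replication would grow n-fold).
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(nodes)
    count = min(replicas + 1, len(nodes))
    return [nodes[(start + i) % len(nodes)] for i in range(count)]


nodes = [f"node{i}" for i in range(4)]
placement = {key: owners(key, nodes) for key in ("IBM", "MSFT", "ORCL", "GOOG")}

# Copies stored per object: fixed replication vs. full replication.
fixed_copies = len(placement["IBM"])   # stays at 2 regardless of grid size
full_copies = len(nodes)               # grows with every node added
```

Because any client can compute the owner of a key locally, no central scheduler sits on the access path.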

Typical Commercial Distributed Data Grids
• Partition objects to scale throughput and avoid hot
  spots.
• Synchronize access to objects across all servers.
• Dynamically rebalance objects to avoid hot spots.
• Replicate each cached object for high availability.
• Detect server or network failures and self-heal.
[Figure: a client application retrieves an object through the client library, which keeps a cached copy; the distributed cache consists of four cache services connected by Ethernet, with labels for the object, a copy, and a replica.]




Wide Range of Applications
Financial Services            E-commerce
• Portfolio risk analysis     • Session-state storage
• VaR calculations            • Application state storage
• Monte Carlo simulations     • Online banking
• Algorithmic trading         • Loan applications
• Market message caching      • Wealth management
• Derivatives trading         • Online learning
• Pricing calculations        • Hotel reservations
                              • News story caching
Other Applications
• Edge servers: chat, email   • Shopping carts
• Online gaming servers       • Social networking
• Scientific computations     • Service call tracking
• Command and control         • Online surveys

Importance for Cloud Computing
• Cloud computing:
     – Makes elastic resources readily available, but…
     – Clouds have relatively slow interconnects.
• Distributed data grids add significant value in the cloud:
     – Allow data sharing across a group of virtual servers.
     – Elastically scale throughput as needed.
     – Provide low-latency, object-oriented storage.
• Clouds provide the elastic platform for parallel data
  analysis.
• DDGs provide the efficiency and scalability needed to
  overcome the cloud’s limited interconnect speed.

DDGs Simplify Data Migration to the Cloud
• Distributed data grids can automatically bridge on-
  premise and cloud-based data grids to unify access.
• This enables seamless access to data across
  multiple sites.
[Figure: a cloud-hosted distributed data grid (SOSS VS instances) serving the App VS instances of a cloud application, and an on-premise distributed data grid (SOSS hosts, with a backing store) serving the server apps of the user’s on-premise applications; an arrow between the two grids is labeled “Automatically Migrate Data”.]



DDGs Enable Seamless Global Access


[Figure: mirrored data centers and satellite data centers, each running a distributed data grid of SOSS servers, joined into a global distributed data grid.]




Introducing Parallel Data Analysis
• The goal:
     – Quickly analyze a large set of data for patterns and trends.
     – How? Run a method E (“eval”) across a set of objects D in parallel.
     – Optionally merge the results using method M (“merge”).
• Evolution of parallel analysis:
     – '80s: “SIMD/SPMD” (Flynn, Hillis)
     – '90s: “Domain decomposition” (Intel, IBM)
     – '00s: “Map/reduce” (Google, Hadoop, Dryad)
• Applications:
     – Search, financial services, business intelligence, simulation
[Figure: method E is applied across a grid of data objects D, and method M combines the outputs into a single result.]
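
The E/M pattern can be sketched generically as follows. This is an illustration only: `run_parallel`, `eval_word`, and `merge_counts` are hypothetical names, the data set is a toy, and local threads stand in for the grid servers that would each evaluate their own share of the objects.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def run_parallel(eval_fn, merge_fn, objects, max_workers=4):
    """Apply eval_fn to every object in parallel, then fold with merge_fn.

    Assumes `objects` is non-empty, since no initial value is supplied.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partial_results = list(pool.map(eval_fn, objects))
    return reduce(merge_fn, partial_results)


def eval_word(word):
    """E: produce a small result for a single object."""
    return {"longest": word, "count": 1}


def merge_counts(a, b):
    """M: combine two results into one."""
    return {
        "longest": max(a["longest"], b["longest"], key=len),
        "count": a["count"] + b["count"],
    }


D = ["grid", "partition", "cache", "replica"]
result = run_parallel(eval_word, merge_counts, D)
```

Because M takes two results and returns one of the same shape, the merge order does not matter here, which is what allows the merging to be spread across servers.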
Example in Financial Services
Analyze trading strategies across stock histories:
Why?
• Back-testing systems help guard against risks in deploying new
  trading strategies.
• Performance is critical for “first to market” advantage.
• Uses a significant amount of market data and computation time.
How?
• Write method E to analyze trading strategies across a single
  stock history.
• Write method M to merge two sets of results.
• Populate the data store with a set of stock histories.
• Run method E in parallel on all stock histories.
• Merge the results with method M to produce a report.
• Refine and repeat…
Stage the Data for Analysis

• Step 1: Populate the distributed data grid with objects each of which
  represents a price history for a ticker symbol:
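
A minimal sketch of this staging step, assuming a plain dictionary stands in for the grid's key/value store. The `StockHistory` fields, the `populate` helper, the ticker symbols, and the prices are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class StockHistory:
    """Price history for one ticker symbol (field names are illustrative)."""
    ticker: str
    closes: list = field(default_factory=list)  # closing prices, oldest first


def populate(grid, histories):
    """Store one object per ticker symbol, keyed by the symbol."""
    for history in histories:
        grid[history.ticker] = history
    return grid


# A dict stands in for the distributed data grid; the prices are made up.
grid = populate({}, [
    StockHistory("AAA", [10.0, 10.5, 11.0]),
    StockHistory("BBB", [20.0, 19.0, 21.0]),
])
```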




Code the Eval and Merge Methods
•    Step 2: Write a method to evaluate a stock history based on parameters:
       Results EvalStockHistory(StockHistory history, Parameters params)
       {
           <analyze trading strategy for this stock history>
           return results;
       }

•    Step 3: Write a method to merge the results of two evaluations:
       Results MergeResults(Results results1, Results results2)
       {
           <merge both results>
           return results;
       }

•    Notes:
      – This code can be run as a sequential calculation on in-memory data.
      – No explicit accesses to the distributed data grid are used.
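
As the notes say, the two methods are ordinary functions that can be tested sequentially before any grid is involved. The sketch below is one hypothetical instance: the "buy after a dip, sell at the next close" rule, the `dip` parameter, and the sample prices are assumptions chosen only to make the example runnable.

```python
from functools import reduce


def eval_stock_history(history, params):
    """Evaluate one price history: buy after a drop of `dip` or more, sell next close."""
    ticker, closes = history
    profit, trades = 0.0, 0
    for i in range(1, len(closes) - 1):
        if closes[i - 1] - closes[i] >= params["dip"]:   # price fell since the prior close
            profit += closes[i + 1] - closes[i]           # sell at the following close
            trades += 1
    return {"profit": profit, "trades": trades, "tickers": [ticker]}


def merge_results(results1, results2):
    """Combine two result sets into one."""
    return {
        "profit": results1["profit"] + results2["profit"],
        "trades": results1["trades"] + results2["trades"],
        "tickers": results1["tickers"] + results2["tickers"],
    }


# Sequential run on in-memory data; no grid access is needed.
histories = [("AAA", [10.0, 9.0, 9.5, 9.0]), ("BBB", [20.0, 18.0, 19.0, 21.0])]
params = {"dip": 1.0}
report = reduce(merge_results, (eval_stock_history(h, params) for h in histories))
```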



Run the Analysis
 • Step 4: Invoke parallel evaluation and merging of results:
      Results Invoke(EvalStockHistory, MergeResults, querySpec,
      params);


[Figure: illustration labeled EvalStockHistory() and MergeResults().]


[Figure: the client starts the parallel analysis; .eval() runs on each stock history to produce per-history results, successive .merge() steps combine those results, and the final results are returned to the client.]
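
The flow in this diagram can be sketched as a level-by-level merge. This is a hedged illustration, not the product's implementation: the `invoke` function below only mirrors the shape of the Invoke call shown earlier, the extra `grid` argument, the predicate-style `query_spec`, and the toy data are assumptions, and a thread pool stands in for the grid servers.

```python
from concurrent.futures import ThreadPoolExecutor


def invoke(eval_fn, merge_fn, query_spec, params, grid, max_workers=4):
    """Evaluate every object selected by query_spec, then merge pairwise."""
    selected = [obj for obj in grid.values() if query_spec(obj)]
    if not selected:
        return None
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        level = list(pool.map(lambda obj: eval_fn(obj, params), selected))
        # Merge neighbours level by level until one result remains.
        while len(level) > 1:
            pairs = [level[i:i + 2] for i in range(0, len(level), 2)]
            level = list(pool.map(
                lambda pair: merge_fn(*pair) if len(pair) == 2 else pair[0],
                pairs,
            ))
    return level[0]


# Toy usage: highest recent closing price among tickers starting with "A".
grid = {
    "AAA": ("AAA", [10, 12, 11]),
    "ABC": ("ABC", [7, 9, 8]),
    "AXE": ("AXE", [30, 28, 31]),
    "BBB": ("BBB", [99, 98, 97]),
}
best = invoke(
    eval_fn=lambda history, params: (max(history[1][-params["window"]:]), history[0]),
    merge_fn=max,
    query_spec=lambda history: history[0].startswith("A"),
    params={"window": 2},
    grid=grid,
)
```

Merging in pairs keeps each merge step small and lets merges at the same level run concurrently, so the number of merge rounds grows with the logarithm of the number of results rather than linearly.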
DDG Minimizes Data Motion
• File-based map/reduce must move data to memory for analysis:
  [Figure: three M/R servers each run E in server memory, while the data objects D reside in a separate file system / database.]
• Memory-based DDG analyzes data in place:
  [Figure: three grid servers each run E directly on the data objects D held in the distributed data grid.]



[Figure: the same parallel analysis flow with file-based storage; “File I/O” is marked at the .eval() step and at the .merge() steps before the results are returned to the client.]
Performance Impact of Data Motion
     Measured random access to DDG data to simulate file I/O:




Comparison of DDGs and File-Based M/R
                        DDG                                         File-Based M/R
Data set size           Gigabytes -> terabytes                      Terabytes -> petabytes
Data repository         In-memory                                   File / database
Data view               Queried object collection                   File-based key/value pairs
Development time        Low                                         High
Automatic scalability   Yes                                         Application dependent
Best use                Quick-turn analysis of memory-based data    Complex analysis of large datasets
I/O overhead            Low                                         High
Cluster mgt.            Simple                                      Complex
High availability       Memory-based                                File-based

Walk-Away Points
• Developers need fast, scalable, highly available, and sharable
  memory-based storage for scaled-out applications.
• Distributed data grids (DDGs) address these needs with:
     – Fast access time & scalable throughput
     – Highly available data storage
     – Support for parallel data analysis
• Cloud-based and globally distributed applications need DDGs to:
     – Support scalable data access for “elastic” applications.
     – Efficiently and easily migrate data across sites.
     – Avoid relatively slow cloud I/O storage and interconnects.
• DDGs offer simple, fast “map/reduce” parallel analysis:
     – Make it easy to develop applications and configure clusters.
     – Avoid file I/O overhead for datasets that fit in memory-based grids.
     – Deliver automatic, highly scalable performance.
Distributed Data Grids for
Server Farms & High Performance Computing

        www.scaleoutsoftware.com

 
AWS Partner Presentation - Datapipe - Deploying Hybrid IT, AWS Summit 2012 - NYC
AWS Partner Presentation - Datapipe - Deploying Hybrid IT, AWS Summit 2012 - NYCAWS Partner Presentation - Datapipe - Deploying Hybrid IT, AWS Summit 2012 - NYC
AWS Partner Presentation - Datapipe - Deploying Hybrid IT, AWS Summit 2012 - NYC
 
Building a highly scalable and available cloud application
Building a highly scalable and available cloud applicationBuilding a highly scalable and available cloud application
Building a highly scalable and available cloud application
 
High Performance Databases
High Performance DatabasesHigh Performance Databases
High Performance Databases
 
Caching principles-solutions
Caching principles-solutionsCaching principles-solutions
Caching principles-solutions
 
Varrow Q4 Lunch & Learn Presentation - Virtualizing Business Critical Applica...
Varrow Q4 Lunch & Learn Presentation - Virtualizing Business Critical Applica...Varrow Q4 Lunch & Learn Presentation - Virtualizing Business Critical Applica...
Varrow Q4 Lunch & Learn Presentation - Virtualizing Business Critical Applica...
 
NetApp-ClusteredONTAP-Fall2012
NetApp-ClusteredONTAP-Fall2012NetApp-ClusteredONTAP-Fall2012
NetApp-ClusteredONTAP-Fall2012
 
Scale-out Storage on Intel® Architecture Based Platforms: Characterizing and ...
Scale-out Storage on Intel® Architecture Based Platforms: Characterizing and ...Scale-out Storage on Intel® Architecture Based Platforms: Characterizing and ...
Scale-out Storage on Intel® Architecture Based Platforms: Characterizing and ...
 
Database Virtualization: The Next Wave of Big Data
Database Virtualization: The Next Wave of Big DataDatabase Virtualization: The Next Wave of Big Data
Database Virtualization: The Next Wave of Big Data
 
Implementing Private Database Clouds
Implementing Private Database CloudsImplementing Private Database Clouds
Implementing Private Database Clouds
 
2018 jk
2018 jk2018 jk
2018 jk
 
AWS Summit 2011: Architecting in the cloud
AWS Summit 2011: Architecting in the cloudAWS Summit 2011: Architecting in the cloud
AWS Summit 2011: Architecting in the cloud
 
Storage for Microsoft®Windows Enfironments
Storage for Microsoft®Windows EnfironmentsStorage for Microsoft®Windows Enfironments
Storage for Microsoft®Windows Enfironments
 
SQL PASS Taiwan 七月份聚會-1
SQL PASS Taiwan 七月份聚會-1SQL PASS Taiwan 七月份聚會-1
SQL PASS Taiwan 七月份聚會-1
 
Caching for Microservices Architectures: Session I
Caching for Microservices Architectures: Session ICaching for Microservices Architectures: Session I
Caching for Microservices Architectures: Session I
 
ScaleBase Webinar: Scaling MySQL - Sharding Made Easy!
ScaleBase Webinar: Scaling MySQL - Sharding Made Easy!ScaleBase Webinar: Scaling MySQL - Sharding Made Easy!
ScaleBase Webinar: Scaling MySQL - Sharding Made Easy!
 
Apache ignite v1.3
Apache ignite v1.3Apache ignite v1.3
Apache ignite v1.3
 
Datasheet Virbak Abio V32
Datasheet Virbak Abio V32Datasheet Virbak Abio V32
Datasheet Virbak Abio V32
 

Kürzlich hochgeladen

2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 

Kürzlich hochgeladen (20)

2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 

Using Distributed In-Memory Computing for Fast Data Analysis

  • 1. Using Distributed, In-Memory Computing for Fast Data Analysis WSTA Seminar September 14, 2011 Bill Bain (wbain@scaleoutsoftware.com) Copyright © 2011 by ScaleOut Software, Inc.
  • 2. Agenda
      • The Need for Memory-Based, Distributed Storage
      • What Is a Distributed Data Grid (DDG)?
      • Performance Advantages and Architecture
      • Migrating Data to the Cloud and Across Global Sites
      • Parallel Data Analysis
      • Comparison of DDG to File-Based Map/Reduce
  • 3. The Need for Memory-Based Storage
      Example: Web server farm:
      • Load-balancer directs incoming client requests to Web servers.
      • Web and app. server farms build Web pages and run business logic.
      • Database server holds all mission-critical, LOB data.
      • Server farms share fast-changing data using a DDG to avoid bottlenecks and maximize scalability.
      [Diagram: Internet, load-balancer, Web servers, app. servers, distributed in-memory data grids, and database servers with a RAID disk array; the database tier is marked as the bottleneck]
  • 4. The Need for Memory-Based Storage
      Example: Cloud Application:
      • Application runs as multiple, virtual servers (VS).
      • Application instances store and retrieve LOB data from cloud-based file system or database.
      • Applications need fast, scalable storage for fast-changing data.
      • Distributed data grid runs as multiple, virtual servers to provide “elastic,” in-memory storage.
      [Diagram: cloud application made of App VS instances, distributed data grid made of Grid VS instances, and cloud-based storage]
  • 5. What is a Distributed Data Grid?
      • A new “vertical” storage tier:
          – Adds missing layer to boost performance.
          – Uses in-memory, out-of-process storage.
          – Avoids repeated trips to backing storage.
      • A new “horizontal” storage tier:
          – Allows data sharing among servers.
          – Scales performance & capacity.
          – Adds high availability.
          – Can be used independently of backing storage.
      [Diagram: storage tiers on each server: processor cache, L2 cache, application memory (“in-process”), distributed cache (“out-of-process”), and shared backing storage]
  • 6. Distributed Data Grids: A Closer Look
      • Incorporates a client-side, in-process cache (“near cache”):
          – Transparent to the application.
          – Holds recently accessed data.
      • Boosts performance:
          – Eliminates repeated network data transfers & deserialization.
          – Reduces access times to near “in-process” latency.
          – Is automatically updated if the distributed grid changes.
          – Supports various coherency models (coherent, polled, event-driven).
      [Diagram: application memory (“in-process”), client-side cache (“in-process”), distributed cache (“out-of-process”)]
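The near-cache behavior described on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: `GridStore` and `NearCache` are hypothetical names, not part of any product API, the "grid" is an in-process dictionary standing in for remote servers, and the sketch invalidates a stale local copy on change (an event-driven style) rather than pushing the new value.

```python
from typing import Any, Callable, Dict, List


class GridStore:
    """Stand-in for the out-of-process distributed grid (hypothetical)."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}
        self._listeners: List[Callable[[str], None]] = []
        self.remote_reads = 0  # counts simulated network round trips

    def subscribe(self, listener: Callable[[str], None]) -> None:
        self._listeners.append(listener)

    def put(self, key: str, value: Any) -> None:
        self._data[key] = value
        # Event-driven coherency: tell every near cache the key changed.
        for listener in self._listeners:
            listener(key)

    def get(self, key: str) -> Any:
        self.remote_reads += 1
        return self._data[key]


class NearCache:
    """In-process cache that serves repeat reads without a grid access."""

    def __init__(self, grid: GridStore) -> None:
        self._grid = grid
        self._local: Dict[str, Any] = {}
        grid.subscribe(self._invalidate)

    def _invalidate(self, key: str) -> None:
        # Drop the stale copy; the next read fetches the new value.
        self._local.pop(key, None)

    def get(self, key: str) -> Any:
        if key not in self._local:
            self._local[key] = self._grid.get(key)
        return self._local[key]


grid = GridStore()
cache = NearCache(grid)
grid.put("IBM", {"last": 170.5})
first = cache.get("IBM")          # miss: one remote read
second = cache.get("IBM")         # hit: served in-process
grid.put("IBM", {"last": 171.0})  # update invalidates the local copy
third = cache.get("IBM")          # miss again: second remote read
```

Three reads cost two simulated remote accesses here; a real near cache would also save the deserialization step that this sketch leaves out.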
  • 7. Performance Benefit of Client-side Cache
      • Eliminates repeated network data transfers.
      • Eliminates repeated object deserialization.
      [Chart: average response time in microseconds (axis 0 to 3500) for 10KB objects at a 20:1 read/update ratio, DDG vs. DBMS]
  • 8. Top 5 Benefits of Distributed Data Grids
      1. Faster access time for business logic state or database data
      2. Scalable throughput to match a growing workload and keep response times low
      3. High availability to prevent data loss if a grid server (or network link) fails
      4. Shared access to data across the server farm
      5. Advanced capabilities for quickly and easily mining data using scalable, “map/reduce,” analysis.
      [Chart: Access Latency vs. Throughput; access latency (msec) against throughput (accesses / sec), Grid vs. DBMS]
  • 9. Scaling the Distributed Data Grid
      • Distributed data grid must deliver scalable throughput.
      • To do so, its architecture must eliminate bottlenecks to scaling:
          – Avoid centralized scheduling to eliminate hot spots.
          – Use data partitioning and maintain load-balance to allow scaling.
          – Use fixed rather than full replication to avoid n-fold overhead.
          – Use low-overhead heart-beating.
      • Example of linear throughput scaling:
      [Chart: read/write throughput for 10KB objects, accesses / second (axis 0 to 80,000) against nodes (4 to 64), with the object count growing from 16,000 to 256,000]
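To make the partitioning and fixed-replication points concrete, here is a minimal sketch. The hash-modulo placement, the node names, and `REPLICAS = 1` are assumptions chosen for illustration; the slides do not say how any particular grid assigns objects, and this simple scheme does not attempt dynamic rebalancing when nodes join or leave.

```python
import hashlib
from collections import Counter
from typing import List

REPLICAS = 1  # fixed number of replicas per object (assumed value)


def stable_hash(key: str) -> int:
    """Hash that is stable across processes (unlike built-in hash())."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def placement(key: str, nodes: List[str], replicas: int = REPLICAS) -> List[str]:
    """Return [primary, replica...] nodes for a key.

    The primary is chosen by hash; replicas are the next nodes in ring
    order, so each object lives on 1 + replicas nodes no matter how
    large the grid grows.
    """
    if replicas >= len(nodes):
        raise ValueError("need more nodes than replicas")
    start = stable_hash(key) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replicas + 1)]


nodes = [f"node{i}" for i in range(4)]
keys = [f"object-{i}" for i in range(16000)]

# How many objects each node holds as primary: roughly even by hashing.
primaries = Counter(placement(k, nodes)[0] for k in keys)

# Copies stored per object stays constant as nodes are added.
copies_per_object = len(placement(keys[0], nodes))
```

With full replication every update would be written to all n nodes; with a fixed replica count the write cost per update stays constant, which is what lets throughput grow with the node count.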
  • 10. Typical Commercial Distributed Data Grids
      • Partition objects to scale throughput and avoid hot spots.
      • Synchronize access to objects across all servers.
      • Dynamically rebalance objects to avoid hot spots.
      • Replicate each cached object for high availability.
      • Detect server or network failures and self-heal.
      [Diagram: client application with client library retrieving a cached copy; cache services on each server connected by Ethernet form the distributed cache, holding the object copy and its replica]
  • 11. Wide Range of Applications
      Financial Services
      • Portfolio risk analysis
      • VaR calculations
      • Monte Carlo simulations
      • Algorithmic trading
      • Market message caching
      • Derivatives trading
      • Pricing calculations
      E-commerce
      • Session-state storage
      • Application state storage
      • Online banking
      • Loan applications
      • Wealth management
      • Online learning
      • Hotel reservations
      • News story caching
      • Shopping carts
      • Social networking
      • Service call tracking
      • Online surveys
      Other Applications
      • Edge servers: chat, email
      • Online gaming servers
      • Scientific computations
      • Command and control
  • 12. Importance for Cloud Computing
      • Cloud computing:
          – Makes elastic resources readily available, but…
          – Clouds have relatively slow interconnects.
      • Distributed data grids add significant value in the cloud:
          – Allow data sharing across a group of virtual servers.
          – Elastically scale throughput as needed.
          – Provide low-latency, object-oriented storage.
      • Clouds provide the elastic platform for parallel data analysis.
      • DDGs provide the efficiency and scalability needed to overcome the cloud’s limited interconnect speed.
  • 13. DDGs Simplify Data Migration to the Cloud
      • Distributed data grids can automatically bridge on-premise and cloud-based data grids to unify access.
      • This enables seamless access to data across multiple sites.
      [Diagram: the user’s on-premise applications and SOSS hosts form an on-premise distributed data grid; a cloud application of App VS and SOSS VS instances forms a cloud-hosted distributed data grid; data is automatically migrated between the two]
  • 14. DDGs Enable Seamless Global Access
      [Diagram: mirrored data centers and satellite data centers, each running a distributed data grid of SOSS servers, joined into a global distributed data grid]
  • 15. Introducing Parallel Data Analysis
      • The goal:
          – Quickly analyze a large set of data for patterns and trends.
          – How? Run a method E (“eval”) across a set of objects D in parallel.
          – Optionally merge the results using method M (“merge”).
      • Evolution of parallel analysis:
          – '80s: “SIMD/SPMD” (Flynn, Hillis)
          – '90s: “Domain decomposition” (Intel, IBM)
          – '00s: “Map/reduce” (Google, Hadoop, Dryad)
      • Applications:
          – Search, financial services, business intelligence, simulation
      [Diagram: method E applied to a set of objects D, with method M merging the outputs into a result]
  • 16. Example in Financial Services
      Analyze trading strategies across stock histories:
      Why?
      • Back-testing systems help guard against risks in deploying new trading strategies.
      • Performance is critical for “first to market” advantage.
      • Uses significant amount of market data and computation time.
      How?
      • Write method E to analyze trading strategies across a single stock history.
      • Write method M to merge two sets of results.
      • Populate the data store with a set of stock histories.
      • Run method E in parallel on all stock histories.
      • Merge the results with method M to produce a report.
      • Refine and repeat…
  • 17. Stage the Data for Analysis
      • Step 1: Populate the distributed data grid with objects, each of which represents a price history for a ticker symbol.
  • 18. Code the Eval and Merge Methods
      • Step 2: Write a method to evaluate a stock history based on parameters:
          Results EvalStockHistory(StockHistory history, Parameters params) {
              <analyze trading strategy for this stock history>
              return results;
          }
      • Step 3: Write a method to merge the results of two evaluations:
          Results MergeResults(Results results1, Results results2) {
              <merge both results>
              return results;
          }
      • Notes:
          – This code can be run as a sequential calculation on in-memory data.
          – No explicit accesses to the distributed data grid are used.
  • 19. Run the Analysis
      • Step 4: Invoke parallel evaluation and merging of results:
          Results Invoke(EvalStockHistory, MergeResults, querySpec, params);
      [Diagram: EvalStockHistory() and MergeResults() running across the grid]
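Steps 1 through 4 can be tried end to end with a small stand-in. The sketch below is not the grid product's API: `invoke` takes an extra `store` argument (a plain dictionary playing the role of the populated grid), the query spec is an ordinary predicate, and the "buy after a dip" rule is a toy strategy invented purely so there is something to evaluate. A thread pool stands in for the grid's parallel execution; in CPython it will not speed up CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from functools import reduce
from typing import Callable, Dict, List


@dataclass
class StockHistory:
    ticker: str
    closes: List[float]  # closing prices, oldest first


@dataclass
class Results:
    trades: int = 0
    profit: float = 0.0


def eval_stock_history(history: StockHistory, params: dict) -> Results:
    """Toy strategy: buy after a drop of `dip` or more, sell at the next close."""
    results = Results()
    closes = history.closes
    for i in range(1, len(closes) - 1):
        if closes[i - 1] - closes[i] >= params["dip"]:
            results.trades += 1
            results.profit += closes[i + 1] - closes[i]
    return results


def merge_results(a: Results, b: Results) -> Results:
    """Combine two partial results; addition keeps the merge associative."""
    return Results(a.trades + b.trades, a.profit + b.profit)


def invoke(eval_fn: Callable, merge_fn: Callable, query_spec: Callable,
           params: dict, store: Dict[str, StockHistory]) -> Results:
    """Evaluate every selected object in parallel, then merge the results."""
    selected = [h for h in store.values() if query_spec(h)]
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda h: eval_fn(h, params), selected))
    return reduce(merge_fn, partials, Results())


# Step 1 stand-in: populate the "grid" with one object per ticker symbol.
store = {
    "AAA": StockHistory("AAA", [10.0, 8.0, 9.0, 9.5]),
    "BBB": StockHistory("BBB", [20.0, 20.5, 18.0, 19.0]),
    "CCC": StockHistory("CCC", [5.0, 5.0, 5.0]),
}

report = invoke(eval_stock_history, merge_results,
                lambda h: h.ticker != "CCC", {"dip": 1.5}, store)
```

As the slide notes suggest, neither `eval_stock_history` nor `merge_results` touches the store directly, so both can be tested as plain sequential functions before being run in parallel.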
  • 20. [Diagram: starting from “Start parallel analysis,” .eval() runs on six stock histories and yields six results; three .merge() calls reduce them to three results; a further .merge() stage produces the results returned to the client]
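The staged merging in this diagram can be sketched as a pairwise tree reduction. The function names below are hypothetical, and the sketch runs the merges of each level one after another; the point is only that merges within a level do not depend on each other, so a grid could run them concurrently. It assumes the merge method is associative, as summing results is.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def tree_merge(results: List[T], merge: Callable[[T, T], T]) -> T:
    """Combine results pairwise, level by level, until one remains."""
    if not results:
        raise ValueError("nothing to merge")
    level = list(results)
    while len(level) > 1:
        next_level = []
        # Merge neighbors; merges within a level are independent of each
        # other, so they could be executed in parallel.
        for i in range(0, len(level) - 1, 2):
            next_level.append(merge(level[i], level[i + 1]))
        if len(level) % 2 == 1:
            next_level.append(level[-1])  # odd one out moves up unmerged
        level = next_level
    return level[0]


def count_levels(n: int) -> int:
    """Number of merge rounds needed for n partial results."""
    rounds = 0
    while n > 1:
        n = (n + 1) // 2
        rounds += 1
    return rounds


partials = [3, 1, 4, 1, 5, 9]  # six per-stock results (made-up numbers)
total = tree_merge(partials, lambda a, b: a + b)
rounds = count_levels(len(partials))
```

For n partial results the number of rounds grows roughly as log2(n), compared with n - 1 sequential steps for a single left-to-right merge.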
  • 21. DDG Minimizes Data Motion
      • File-based map/reduce must move data to memory for analysis:
      [Diagram: M/R servers copy data D from the file system / database into server memory before running E]
      • Memory-based DDG analyzes data in place:
      [Diagram: grid servers run E directly on data D held in the distributed data grid]
  • 22. [Diagram: the same eval/merge flow under file-based map/reduce, with File I/O steps added at the .eval() stage and between the .merge() stages before results are returned to the client]
  • 23. Performance Impact of Data Motion Measured random access to DDG data to simulate file I/O:
  • 24. Comparison of DDGs and File-Based M/R
                               DDG                          File-Based M/R
      Data set size            Gigabytes->terabytes         Terabytes->petabytes
      Data repository          In-memory                    File / database
      Data view                Queried object collection    File-based key/value pairs
      Development time         Low                          High
      Automatic scalability    Yes                          Application dependent
      Best use                 Quick-turn analysis of       Complex analysis of
                               memory-based data            large datasets
      I/O overhead             Low                          High
      Cluster mgt.             Simple                       Complex
      High availability        Memory-based                 File-based
  • 25. Walk-Away Points
      • Developers need fast, scalable, highly available and sharable memory-based storage for scaled out applications.
      • Distributed data grids (DDGs) address these needs with:
          – Fast access time & scalable throughput
          – Highly available data storage
          – Support for parallel data analysis
      • Cloud-based and globally distributed applications need DDGs to:
          – Support scalable data access for “elastic” applications.
          – Efficiently and easily migrate data across sites.
          – Avoid relatively slow cloud I/O storage and interconnects.
      • DDGs offer simple, fast “map/reduce” parallel analysis:
          – Make it easy to develop applications and configure clusters.
          – Avoid file I/O overhead for datasets that fit in memory-based grids.
          – Deliver automatic, highly scalable performance.
  • 26. Distributed Data Grids for Server Farms & High Performance Computing www.scaleoutsoftware.com