Technical Brief

       GridGain & Hadoop:
     Differences & Synergies
                          GridGain Systems, November 2012




Overview
This paper helps you understand how Hadoop and GridGain are different and how
they complement each other. It compares the main concepts of each product.

Hadoop is increasingly seen as an attractive platform for integrating and
analyzing data from multiple sources, especially when traditional databases hit their
limits. It provides a convenient and fast way to integrate and store data with
different structures, which is then batch processed for later analysis.

As more companies realize the competitive advantage they gain from these insights,
they are looking for solutions that offer faster analytic capabilities. Instead of
waiting for results from batch jobs running overnight or in off-hours, they want to
use their data in real time to maximize its business value and to enable additional
real-time functionality for internal or client-facing systems.

So far, Hadoop has been used in situations where high write speeds and the
integration of unstructured data matter most, so its lack of ACID transactions and
the latencies involved in data processing have not mattered much. However, with the
focus now shifting to real-time processing and live data analytics, companies are
looking for better ways to process live data in real time.

GridGain is a modern platform designed specifically for the high-performance
storage and processing of data in memory. It handles the processing of both
transactional and non-transactional live data with very low latencies. GridGain
typically resides between business, analytics, transactional or BI applications on
one side and long-term data storage such as RDBMS, ERP or Hadoop HDFS on the
other side.

As Java-based middleware for distributed in-memory processing, GridGain
integrates a fast in-memory MapReduce implementation with its advanced
in-memory data grid technology. It provides companies with a complete platform for
real-time processing and analytics, and it can also be integrated into their
existing architecture, databases or Hadoop data stores.

GridGain can process terabytes of data, on thousands of nodes, in real time. Its
architecture has been designed to integrate well with traditional databases
and unstructured data stores, and to scale with them.


GridGain In-Memory Compute Grid vs
Hadoop MapReduce
MapReduce is a programming model developed by Google for processing large data
sets stored on disk. Hadoop MapReduce is an implementation of that model. The
model is based on the fact that the data in a single file can be distributed
across multiple nodes, so the processing of that file has to be co-located
on the same nodes to avoid moving data around. The processing consists of
scanning files record by record in parallel on multiple nodes and then reducing the
results, also in parallel on multiple nodes. Because of that, standard disk-based
MapReduce is a good fit for problems that require analyzing every single record in
a file and a poor fit for cases where direct access to a particular data record is
required. Furthermore, because of Hadoop's offline batch orientation, it is not
suited to low-latency applications.

GridGain In-Memory Compute Grid (IMCG), on the other hand, is geared towards
in-memory computations and very low latencies. GridGain IMCG has its own
implementation of MapReduce, which is designed specifically for real-time
in-memory processing use cases and is very different from Hadoop's. Its main goal
is to split a task into multiple sub-tasks, load balance those sub-tasks among
the available cluster nodes, execute them in parallel, then aggregate the results
from those sub-tasks and return them to the user.




Splitting a task into multiple sub-tasks and assigning them to nodes is the mapping
step, and aggregating the results is the reducing step. However, no concept of
mandatory data is built into this design. It can work in the absence of any data
at all, which makes it a good fit for both stateless and stateful computations, such
as traditional HPC. When data is present, GridGain IMCG will also automatically
colocate computations with the nodes where the data resides to avoid redundant data
movement.
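
The split, execute and aggregate cycle described above can be sketched with the JDK alone. This is not GridGain code: the class name SplitAggregateSketch and the method sumOfSquares are hypothetical, and a local thread pool stands in for cluster nodes. The example deliberately computes over a numeric range rather than stored data, to show that the pattern does not need any data set to operate on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration: local threads stand in for cluster nodes.
final class SplitAggregateSketch {

    // Sums i*i for i in [1, n] by splitting the range into at most `parts` sub-tasks.
    static long sumOfSquares(int n, int parts) {
        ExecutorService pool = Executors.newFixedThreadPool(parts);
        try {
            // Map step: split the task into sub-tasks over contiguous chunks.
            List<Callable<Long>> subTasks = new ArrayList<>();
            int chunk = (n + parts - 1) / parts; // ceiling division
            for (int start = 1; start <= n; start += chunk) {
                final int from = start;
                final int to = Math.min(n, start + chunk - 1);
                subTasks.add(() -> {
                    long partial = 0;
                    for (int i = from; i <= to; i++) {
                        partial += (long) i * i;
                    }
                    return partial;
                });
            }
            // Execute in parallel, then reduce step: aggregate the partial results.
            long total = 0;
            for (Future<Long> result : pool.invokeAll(subTasks)) {
                total += result.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("sub-task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(1000, 4)); // prints 333833500
    }
}
```

In a real compute grid the sub-tasks would be serialized and shipped to remote nodes, and load balancing and failover would be handled by the grid rather than by a fixed-size pool.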

It is also worth mentioning that, unlike Hadoop, GridGain IMCG is very well suited
to computations that are very short-lived, e.g. below 100 milliseconds, and that
may not require any mapping or reducing.

Here is a simple Java example of GridGain IMCG which counts the number of
letters in a phrase by splitting it into words, assigning each word to a sub-task
for parallel remote execution in the map step, and then adding up all the lengths
received from the remote jobs in the reduce step.

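// g is assumed to be a GridGain grid instance obtained elsewhere.
int letterCount = g.reduce(
    BALANCE,
    // Mapper: one remote job per word, returning the word's length.
    new GridClosure<String, Integer>() {
        @Override public Integer apply(String s) {
            return s.length();
        }
    },
    // Input: the phrase split into words.
    Arrays.asList("GridGain Letter Count".split(" ")),
    // Reducer: sums the lengths returned by the remote jobs.
    F.sumIntReducer()
);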
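
For readers without a GridGain installation, the same letter count can be approximated with plain JDK streams. This is only an analogy: the class LetterCountSketch is hypothetical, and a parallel stream runs on local threads rather than on remote grid nodes.

```java
import java.util.Arrays;

// Hypothetical JDK-only analogue of the GridGain letter-count example.
final class LetterCountSketch {

    static int letterCount(String phrase) {
        return Arrays.stream(phrase.split(" "))
                .parallel()               // words may be processed concurrently
                .mapToInt(String::length) // map step: one length per word
                .sum();                   // reduce step: add up the lengths
    }

    public static void main(String[] args) {
        // "GridGain" (8) + "Letter" (6) + "Count" (5)
        System.out.println(letterCount("GridGain Letter Count")); // prints 19
    }
}
```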



GridGain In-Memory Data Grid vs Hadoop
Distributed File System
Hadoop Distributed File System (HDFS) is designed for storing large amounts of
data in files on disk. As in any file system, the data is mostly stored in textual
or binary formats. Finding a single record inside an HDFS file requires a file scan.
In addition, updating a single record within an HDFS file requires copying the
whole file, because files in HDFS can only be appended to. This makes
HDFS well suited to cases where data is appended at the end of a file, but not well
suited to cases where data needs to be located or updated in the middle of a
file. With indexing technologies like HBase or Impala, data access becomes
somewhat easier because keys can be indexed, but without the ability to index into
values (secondary indexes), only primitive query execution is possible.
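
The cost of locating and updating a record in an append-only file can be illustrated with a small JDK-only sketch. Nothing here uses the HDFS API: the class AppendOnlyFileSketch is hypothetical, a list of lines stands in for a file, and the "key,value" record format is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical model of an append-only file; each line is assumed to be "key,value".
final class AppendOnlyFileSketch {

    // Lookup requires scanning records one by one until the key is found.
    static Optional<String> findByScan(List<String> file, String key) {
        for (String line : file) {
            String[] kv = line.split(",", 2);
            if (kv.length == 2 && kv[0].equals(key)) {
                return Optional.of(kv[1]);
            }
        }
        return Optional.empty();
    }

    // An in-place update is not possible, so the whole file is copied
    // with the one changed record substituted.
    static List<String> rewriteWithUpdate(List<String> file, String key, String newValue) {
        List<String> copy = new ArrayList<>(file.size());
        for (String line : file) {
            copy.add(line.startsWith(key + ",") ? key + "," + newValue : line);
        }
        return copy;
    }
}
```

Both operations are linear in the size of the file, which is the behavior the paragraph above contrasts with direct key-based access.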

GridGain In-Memory Data Grid (IMDG), on the other hand, is an in-memory key-value
data store. IMDGs have their roots in distributed caching, but GridGain
IMDG also adds transactions, data partitioning, and SQL querying to cached data.
The main difference from HDFS (or the Hadoop ecosystem overall) is the ability to
transact on and update any data directly in real time. This makes GridGain IMDG well
suited to working on operational data sets, i.e. the data sets that are currently
being updated and queried, while HDFS is suited to working on historical data which
is constant and will never change.

Unlike a file system, GridGain IMDG works with the user's domain model by directly
caching user application objects. Objects are accessed and updated by key, which
allows IMDG to work with volatile data that requires direct key-based access.
GridGain IMDG allows indexing into both keys and values (i.e. primary and secondary
indices) and supports native SQL for data querying and processing. One of the unique
features of GridGain IMDG is its support for distributed joins, which make it
possible to execute complex SQL queries on the in-memory data without limitations.
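
The difference between a primary index on keys and a secondary index on values can be sketched with two maps. This is a conceptual illustration only, not the GridGain API: the class IndexedCacheSketch and its methods are hypothetical, and a real data grid would partition both structures across nodes.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical single-node cache with a primary and a secondary index.
final class IndexedCacheSketch {
    // Primary index: customer id -> city.
    private final Map<Integer, String> cityById = new HashMap<>();
    // Secondary index: city -> ids of customers in that city.
    private final Map<String, Set<Integer>> idsByCity = new HashMap<>();

    // Inserts or updates an entry and keeps the secondary index consistent.
    void put(int id, String city) {
        String previous = cityById.put(id, city);
        if (previous != null) {
            Set<Integer> ids = idsByCity.get(previous);
            ids.remove(id);
            if (ids.isEmpty()) {
                idsByCity.remove(previous);
            }
        }
        idsByCity.computeIfAbsent(city, c -> new TreeSet<>()).add(id);
    }

    // Direct key-based access through the primary index.
    String get(int id) {
        return cityById.get(id);
    }

    // Value-based query through the secondary index, without scanning all entries.
    Set<Integer> findByCity(String city) {
        return idsByCity.getOrDefault(city, Collections.emptySet());
    }
}
```

A query such as "all customers in Berlin" is answered from the secondary index directly, which is what makes SQL-style predicates on values practical.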


GridGain and Hadoop Working Together
To summarize:


    Hadoop is essentially a Big Data warehouse which is good for batch
    processing of historic data that never changes, while GridGain, on the other
    hand, is an In-Memory Data Platform which works with your current
    operational data set in a transactional fashion with very low latencies.
    Because they focus on very different use cases, GridGain and Hadoop
    complement each other well.
Up-Stream Integration
The diagram above shows integration between GridGain and Hadoop. Here GridGain
In-Memory Compute Grid and Data Grid work directly in real time with the
user application by partitioning and caching data within the data grid, and by
executing in-memory computations and SQL queries on it. Every so often, when data
becomes historic, it is snapshotted into HDFS, where it can be analyzed using
Hadoop MapReduce and analytical tools from the Hadoop ecosystem.
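
A minimal sketch of the up-stream flow, under stated assumptions: the class SnapshotSketch is hypothetical, a list stands in for an append-only HDFS file, "historic" is taken to mean "last updated before a cutoff timestamp", and snapshotted entries are also evicted from memory. The source does not specify any of these details.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model: live entries in memory, historic entries appended to an archive.
final class SnapshotSketch {
    // Operational data: key -> last update time (milliseconds, by assumption).
    final Map<String, Long> liveTimestamps = new ConcurrentHashMap<>();
    // Stand-in for an append-only file in long-term storage.
    final List<String> archive = new ArrayList<>();

    // Appends every entry older than the cutoff to the archive and evicts it.
    // Returns the number of entries moved.
    int snapshotOlderThan(long cutoff) {
        int moved = 0;
        Iterator<Map.Entry<String, Long>> it = liveTimestamps.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (entry.getValue() < cutoff) {
                archive.add(entry.getKey() + "," + entry.getValue());
                it.remove();
                moved++;
            }
        }
        return moved;
    }
}
```

Because the archive is only ever appended to, this direction of data flow matches the access pattern that HDFS handles well.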

Down-Stream Integration
Another way to integrate applies when data is already stored in HDFS but needs
to be loaded into IMDG for faster in-memory processing. For such cases, GridGain
provides fast loading mechanisms from HDFS into GridGain IMDG, where the data
can be further analyzed using GridGain in-memory MapReduce and indexed SQL
queries.
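
The down-stream direction can be sketched in the same spirit. The class BulkLoadSketch is hypothetical, the "key,value" line format is an assumption, and the list of lines stands in for records read from a file. GridGain's actual loading mechanisms are not shown here.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical bulk load of "key,value" records into an in-memory map.
final class BulkLoadSketch {

    static Map<String, String> load(List<String> lines) {
        Map<String, String> cache = new HashMap<>();
        for (String line : lines) {
            String[] kv = line.split(",", 2);
            if (kv.length == 2) {
                // Later records overwrite earlier ones, so appended updates win.
                cache.put(kv[0], kv[1]);
            }
        }
        return cache;
    }
}
```

Once loaded, each record is reachable by key in constant expected time instead of through a file scan.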


Conclusion
Integration between an in-memory data platform like GridGain and a disk-based data
platform like Hadoop allows businesses to get valuable insights into the whole data
set at once, including the volatile operational data set cached in memory as well as
the historic data set stored in Hadoop. This essentially eliminates the gaps in
processing time caused by the Extract-Transform-Load (ETL) process of copying data
from operational systems of record, such as standard databases, into historic data
warehouses like Hadoop. Data can now be analyzed and processed at any point in
its lifecycle, from the moment it enters the system until it is put
away into a warehouse.
