As Apache Hadoop adoption continues to advance, customers are depending more and more on Hadoop for critical tasks, and deploying Hadoop for use cases with more real-time requirements. In this session, we will discuss the desired performance characteristics of such a deployment and the corresponding challenges. Leave with an understanding of how performance-sensitive deployments can be accelerated using In-Memory technologies that merge the Big Data capabilities of Hadoop with the unmatched performance of In-Memory data management.
2. Hadoop: Pros and Cons
What is Hadoop?
> Hadoop is a batch system
> HDFS - Hadoop Distributed File System
> Parallel processing over data in HDFS
> Hive, Pig, HBase, Mahout...
Pros:
> Scales very well
> Fault tolerant and resilient
> Very active and rich eco-system
> Process TBs/PBs in parallel fashion
> Most popular data warehouse
Cons:
> Complex deployment
> Significant execution overhead
> Data must be ETL-ed into HDFS
> Batch oriented - real time not possible
> HDFS is IO and network bound
www.gridgain.com
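To make the "parallel processing over data in HDFS" point concrete, here is a toy sketch of the map/shuffle/reduce pattern that Hadoop MapReduce applies across HDFS blocks, compressed into a single JVM with Java parallel streams. This is an illustration of the paradigm only; it uses no Hadoop APIs, and the class name is invented:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative only: the map -> shuffle -> reduce pattern Hadoop runs
// across a cluster, shown here in one JVM with parallel streams.
public class WordCountSketch {
    public static Map<String, Long> count(List<String> lines) {
        return lines.parallelStream()
                // "map" phase: split each input line into words
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .filter(w -> !w.isEmpty())
                // "shuffle + reduce": group identical keys, sum their counts
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts = count(List.of("big data", "big memory"));
        System.out.println(counts.get("big")); // 2
    }
}
```

In real Hadoop the "map" tasks run on the nodes holding each HDFS block and the "shuffle" moves keys over the network, which is exactly the IO- and network-bound overhead the cons list above refers to.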
3. In-Memory Accelerator For Hadoop: Overview
Up To 100x Faster:
1. In-Memory File System
> 100% compatible with HDFS
> Boost HDFS performance by removing IO overhead
> Dual-mode: standalone or caching
> Blend into Hadoop ecosystem
2. In-Memory MapReduce
> Eliminate Hadoop MapReduce overhead
> Allow for embedded execution
> Record-based
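Because the in-memory file system (GGFS) is HDFS-compatible, plugging it in amounts to pointing Hadoop at a `ggfs://` URI in `core-site.xml`. The fragment below is a sketch from memory; the exact property names and implementation class differ between GridGain releases, so treat them as illustrative and check the documentation for your version:

```xml
<!-- core-site.xml (sketch): point Hadoop at GGFS instead of HDFS.
     Property values are illustrative and version-dependent. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>ggfs://ggfs@localhost/</value>
  </property>
  <property>
    <name>fs.ggfs.impl</name>
    <value>org.gridgain.grid.ggfs.hadoop.v1.GridGgfsHadoopFileSystem</value>
  </property>
</configuration>
```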
7. Benchmarks: GGFS vs HDFS
> 10-node cluster of Dell R610 servers
> Each with dual 8-core CPUs
> Ubuntu 12.04, Java 7
> 10 GbE network
> Stock Apache Hadoop 2.x
8. Comparison: Hadoop Accelerator vs. Spark
Hadoop Accelerator:
> No ETL required
  - Automatic HDFS read-through and write-through
  - Data is loaded on demand
> Per-block file caching
  - Only hot data blocks are kept in memory
> Strong management capabilities
  - GridGain Visor - Unified DevOps
Spark:
> Requires data to be ETL-ed into Spark
  - Explicit ETL step consumes time
  - Changes to data do not get propagated to HDFS
> Needs to have the full file loaded
  - If it does not fit, it gets offloaded to disk
> No management capabilities
www.gridgain.com
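The read-through, per-block caching behavior contrasted above can be sketched in miniature. This is a toy model, not GridGain code: a fixed-capacity LRU cache that loads a block from the backing store (standing in for HDFS) only on first access, so only hot blocks stay in memory:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Toy model of read-through, per-block caching (not GridGain code):
// on a miss the block is loaded from the backing store (HDFS stand-in);
// the least-recently-used block is evicted once capacity is exceeded.
public class BlockCache {
    private final Map<Long, byte[]> cache;
    private final Function<Long, byte[]> backingStore;
    int loads = 0; // counts round-trips to the backing store

    public BlockCache(int capacity, Function<Long, byte[]> backingStore) {
        this.backingStore = backingStore;
        // access-order LinkedHashMap gives LRU eviction for free
        this.cache = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, byte[]> e) {
                return size() > capacity; // keep only the hottest blocks
            }
        };
    }

    public byte[] readBlock(long blockId) {
        return cache.computeIfAbsent(blockId, id -> {
            loads++;                       // read-through on cache miss
            return backingStore.apply(id);
        });
    }
}
```

Spark's model, by contrast, corresponds to loading the whole file up front (the explicit ETL step) rather than faulting blocks in on demand.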
9. Customer Use Case: Task & Challenge
Task:
> Real-time search with MapReduce
> Dataset size is 5 TB
> Writes 80%, reads 20%
> Perceived real-time SLA (a few seconds)
Challenge:
> Hadoop MapReduce too slow (> 30 sec)
> Data scanning slow due to constant IO
> Overall job takes > 1 minute
www.gridgain.com
10. Customer Use Case: Solution
> Utilize existing servers
  - Start a GridGain data node on every server
> Only put highly utilized files in GGFS
  - User-controlled caching
> In-Memory MapReduce over GGFS
  - Embedded processing
> Results under 3 seconds
www.gridgain.com
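"Only put highly utilized files in GGFS" maps onto GGFS's per-path modes: hot paths are cached in memory while everything else passes straight through to HDFS. The Spring-XML fragment below sketches that idea; the bean class, property names, and mode values are recalled from memory and vary by GridGain version, so they should be verified against the documentation for your release:

```xml
<!-- Illustrative GGFS fragment (names are version-dependent): cache only
     the hot paths in memory, pass all other paths through to HDFS. -->
<bean class="org.gridgain.grid.ggfs.GridGgfsConfiguration">
    <property name="name" value="ggfs"/>
    <!-- default: pure pass-through to the underlying HDFS -->
    <property name="defaultMode" value="PROXY"/>
    <property name="pathModes">
        <map>
            <!-- hot files: cache in memory, write through to HDFS -->
            <entry key="/hot-data/**" value="DUAL_SYNC"/>
        </map>
    </property>
</bean>
```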