As Apache Hadoop adoption continues to advance, customers are depending more and more on Hadoop for critical tasks, and deploying Hadoop for use cases with more real-time requirements. In this session, we will discuss the desired performance characteristics of such a deployment and the corresponding challenges. Leave with an understanding of how performance-sensitive deployments can be accelerated using In-Memory technologies that merge the Big Data capabilities of Hadoop with the unmatched performance of In-Memory data management.
2. Hadoop: Pros and Cons
What is Hadoop?
> Hadoop is a batch system
> HDFS - Hadoop Distributed File System
> Parallel processing over data in HDFS
> Hive, Pig, HBase, Mahout...
Pros:
> Scales very well
> Fault tolerant and resilient
> Very active and rich eco-system
> Process TBs/PBs in parallel fashion
> Most popular data warehouse
Cons:
> Complex deployment
> Significant execution overhead
> Data must be ETL-ed into HDFS
> Batch oriented - real time not possible
> HDFS is IO and network bound
www.gridgain.com
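To make the "parallel processing over data in HDFS" point concrete, here is a toy sketch of the map/shuffle/reduce pattern that Hadoop MapReduce applies across HDFS blocks, compressed into a single JVM with Java parallel streams. This is an illustration of the paradigm only; it uses no Hadoop APIs, and the class name is invented:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative only: the map -> shuffle -> reduce pattern Hadoop runs
// across a cluster, shown here in one JVM with parallel streams.
public class WordCountSketch {
    public static Map<String, Long> count(List<String> lines) {
        return lines.parallelStream()
                // "map" phase: split each input line into words
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .filter(w -> !w.isEmpty())
                // "shuffle + reduce": group identical keys, sum their counts
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts = count(List.of("big data", "big memory"));
        System.out.println(counts.get("big")); // 2
    }
}
```

In real Hadoop the "map" tasks run on the nodes holding each HDFS block and the "shuffle" moves keys over the network, which is exactly the IO- and network-bound overhead the cons list above refers to.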
3. In-Memory Accelerator For Hadoop: Overview
Up To 100x Faster:
1. In-Memory File System
> 100% compatible with HDFS
> Boost HDFS performance by removing IO overhead
> Dual-mode: standalone or caching
> Blend into Hadoop ecosystem
2. In-Memory MapReduce
> Eliminate Hadoop MapReduce overhead
> Allow for embedded execution
> Record-based
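Because the in-memory file system (GGFS) is HDFS-compatible, plugging it in amounts to pointing Hadoop at a `ggfs://` URI in `core-site.xml`. The fragment below is a sketch from memory; the exact property names and implementation class differ between GridGain releases, so treat them as illustrative and check the documentation for your version:

```xml
<!-- core-site.xml (sketch): point Hadoop at GGFS instead of HDFS.
     Property values are illustrative and version-dependent. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>ggfs://ggfs@localhost/</value>
  </property>
  <property>
    <name>fs.ggfs.impl</name>
    <value>org.gridgain.grid.ggfs.hadoop.v1.GridGgfsHadoopFileSystem</value>
  </property>
</configuration>
```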
7. Benchmarks: GGFS vs HDFS
> 10-node cluster of Dell R610 servers
> Each with dual 8-core CPUs
> Ubuntu 12.04, Java 7
> 10 GbE network
> Stock Apache Hadoop 2.x
8. Comparison: Hadoop Accelerator vs. Spark
Hadoop Accelerator:
> No ETL required
  - Automatic HDFS read-through and write-through
  - Data is loaded on demand
> Per-block file caching
  - Only hot data blocks are kept in memory
> Strong management capabilities
  - GridGain Visor - Unified DevOps
Spark:
> Requires data to be ETL-ed into Spark
  - Explicit ETL step consumes time
  - Changes to data do not get propagated to HDFS
> Needs to have the full file loaded
  - If it does not fit, it gets offloaded to disk
> No management capabilities
www.gridgain.com
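The read-through, per-block caching behavior contrasted above can be sketched in miniature. This is a toy model, not GridGain code: a fixed-capacity LRU cache that loads a block from the backing store (standing in for HDFS) only on first access, so only hot blocks stay in memory:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Toy model of read-through, per-block caching (not GridGain code):
// on a miss the block is loaded from the backing store (HDFS stand-in);
// the least-recently-used block is evicted once capacity is exceeded.
public class BlockCache {
    private final Map<Long, byte[]> cache;
    private final Function<Long, byte[]> backingStore;
    int loads = 0; // counts round-trips to the backing store

    public BlockCache(int capacity, Function<Long, byte[]> backingStore) {
        this.backingStore = backingStore;
        // access-order LinkedHashMap gives LRU eviction for free
        this.cache = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, byte[]> e) {
                return size() > capacity; // keep only the hottest blocks
            }
        };
    }

    public byte[] readBlock(long blockId) {
        return cache.computeIfAbsent(blockId, id -> {
            loads++;                       // read-through on cache miss
            return backingStore.apply(id);
        });
    }
}
```

Spark's model, by contrast, corresponds to loading the whole file up front (the explicit ETL step) rather than faulting blocks in on demand.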
9. Customer Use Case: Task & Challenge
Task:
> Real-time search with MapReduce
> Dataset size is 5 TB
> Writes 80%, reads 20%
> Perceived real-time SLA (a few seconds)
Challenge:
> Hadoop MapReduce too slow (> 30 sec)
> Data scanning slow due to constant IO
> Overall job takes > 1 minute
www.gridgain.com
10. Customer Use Case: Solution
> Utilize existing servers
  - Start a GridGain data node on every server
> Only put highly utilized files in GGFS
  - User-controlled caching
> In-Memory MapReduce over GGFS
  - Embedded processing
> Results under 3 seconds
www.gridgain.com
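"Only put highly utilized files in GGFS" maps onto GGFS's per-path modes: hot paths are cached in memory while everything else passes straight through to HDFS. The Spring-XML fragment below sketches that idea; the bean class, property names, and mode values are recalled from memory and vary by GridGain version, so they should be verified against the documentation for your release:

```xml
<!-- Illustrative GGFS fragment (names are version-dependent): cache only
     the hot paths in memory, pass all other paths through to HDFS. -->
<bean class="org.gridgain.grid.ggfs.GridGgfsConfiguration">
    <property name="name" value="ggfs"/>
    <!-- default: pure pass-through to the underlying HDFS -->
    <property name="defaultMode" value="PROXY"/>
    <property name="pathModes">
        <map>
            <!-- hot files: cache in memory, write through to HDFS -->
            <entry key="/hot-data/**" value="DUAL_SYNC"/>
        </map>
    </property>
</bean>
```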