The Elephant in the Cloud: A Quest for the Next Generation
In this talk, I will go through the evolution of Hadoop and its ecosystem projects and try to peer into the crystal ball to predict what may be coming down the pike. I will discuss various ways of crunching data on Hadoop (MapReduce, OpenMPI, Spark, and various SQL engines) and how these tools complement each other.
Apache Hadoop is no longer just a faithful, open source, scalable implementation of two seminal papers that came out of Google 10 years ago. It has evolved into a project that provides enterprises with a reliable layer for storing massive amounts of unstructured data (HDFS) while allowing different computational frameworks to leverage those datasets.
The original computational framework (MapReduce) has evolved into a much more scalable set of general-purpose cluster management APIs collectively known as YARN. With YARN underneath, MapReduce is still there to support batch-oriented computations, but it is no longer the only game in town. With OpenMPI, Spark, and Tez rapidly becoming available, now is truly the most exciting time to be a developer in the Hadoop ecosystem. It is also a time when you don't have to be employed by Yahoo!, Facebook, or eBay to have access to mind-blowing compute power. That power is a credit card and a pivotal.io account away for anybody on the planet.
I will conclude by outlining some of the ongoing work that makes Hadoop and its ecosystem projects first-class citizens in cloud environments, based on the work that Pivotal engineers have done integrating Hadoop into the PivotalONE PaaS.
1. Elephant in the Cloud:
a quest for the next generation
Hadoop architecture
Roman Shaposhnik
Sr. Manager, Open Source Hadoop Platform @Pivotal
(Twitter: @rhatr)
2. Who’s this guy?
• Sr. Manager @Pivotal building a team of OS contributors
• Apache Software Foundation guy (VP of Apache Incubator, VP of
Apache Bigtop, committer on Hadoop, Giraph, Sqoop, etc)
• Used to be root@Cloudera
• Used to be PHB@Yahoo! (original Hadoop team)
• Used to be a hacker at Sun Microsystems (Sun Studio compilers
and tools)
11. Big Data Utility Gap
• 70% of data generated by customers
• 80% of data being stored
• 3% being prepared for analysis
• 0.5% being analyzed
• <0.5% being operationalized
(Average enterprises; 3 exabytes generated per day now, 40 trillion total gigabytes by 2020, or 162 iPhones of storage for every human)
15. HDFS: not a POSIX fs
• Huge blocks: 64MB (128MB)
• Mostly immutable files (append, truncate)
• Streaming data access
• Block replication
16. How do I use it?
$ hadoop fs -lsr /
# hadoop-fuse-dfs dfs://hadoop-hdfs /mnt
$ ls /mnt
# mount -t nfs -o vers=3,proto=tcp,nolock host:/ /mnt
$ ls /mnt
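Once HDFS is mounted via FUSE or NFS as above, any program can read it with ordinary file APIs and no Hadoop client libraries at all. A minimal sketch in plain Java (the class name is illustrative; the path you pass would be something under the /mnt mount point):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class MountedHdfsRead {
    // Reads a file through the local mount point with plain java.nio;
    // nothing Hadoop-specific is involved on this code path — the FUSE/NFS
    // layer translates the calls into HDFS operations underneath.
    static List<String> readLines(Path file) throws IOException {
        return Files.readAllLines(file);
    }
}
```

The trade-off is that the POSIX view inherits HDFS semantics: mostly-immutable files and streaming access, so random writes through the mount will not behave like a local filesystem.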
18. Pivotal’s Focus on Data Lakes
[Diagram: a Data Lake on HDFS fed by in-memory parallel ingest from traditional data sources (ERP, HR, SFDC, existing EDW/datamarts) and new machine-generated sources and formats. Raw “untouched” data sits alongside processed data; ELT processing with Hadoop (MapReduce/SQL/Pig/Hive) feeds analytical data marts/sandboxes, in-memory services, and BI/analytical tools, with data management (search engine) plus security and control spanning the stack. For business users: full transparency on the data with speed, all data accessible, and “Big Data” finally affordable.]
21. MapReduce
• Batch oriented (long jobs; final results)
• Brings the computation to the data
• Very constrained programming model
• Embarrassingly parallel programming model
• Used to be the only game in town for compute
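The constrained model boils down to two pure functions: a map that turns one input record into zero or more (key, value) pairs, and a reduce that folds together all values sharing a key. A local sketch of that contract in plain Java, with no Hadoop dependencies (class and method names are illustrative, not Hadoop's API):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LocalMapReduce {
    // "map" phase: one input record in, zero or more (word, 1) pairs out
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) pairs.add(Map.entry(word, 1));
        }
        return pairs;
    }

    // "shuffle" + "reduce" phase: group pairs by key, fold each group's values with +
    static Map<String, Integer> run(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                counts.merge(kv.getKey(), kv.getValue(), Integer::sum);
            }
        }
        return counts;
    }
}
```

Because map is applied independently per record and reduce independently per key, the framework is free to run both phases embarrassingly parallel across the cluster, moving the computation to wherever the data blocks live.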
28. How do I use it?
public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}
29. How do I use it?
public static class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable result = new IntWritable();
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
30. How do I run it?
$ hadoop jar hadoop-examples.jar wordcount input output
32. Hadoop’s childhood
• Compact (pretty much a single jar)
• Challenged in scalability and SPOFs
• Extremely batch oriented
• Hard for non-Java programmers
36. Hadoop 2.0
• HDFS 2.0
• Yet Another Resource Negotiator (YARN)
• MapReduce is just an “application” now
• Tez is another “application”
• Pivotal’s Hamster (OpenMPI) is yet another one
42. Hamster
• Hadoop and MPI on the same cluster
• OpenMPI Runtime on Hadoop YARN
• Hadoop Provides: Resource Scheduling,
Process monitoring, Distributed File System
• Open MPI Provides: Process launching,
Communication, I/O forwarding
52. Apache HBase
• Small mutable records vs. HDFS files
• HFiles kept in HDFS
• Memcached for HDFS
• Built on HDFS and Zookeeper
• Modeled on Google’s Bigtable
53. HBase data model
• Driven by the original Webtable use case: one row per URL (e.g. com.cnn.www), a contents: family holding the page html, and an anchor: family with one column per referring site (anchor:a.com, anchor:b.com) whose cell values are the link texts (CNN, CNN.co)
54. How do I use it?
HTable table = new HTable(config, "table");
Put p = new Put(Bytes.toBytes("row"));
p.add(Bytes.toBytes("family"),
      Bytes.toBytes("qualifier"),
      Bytes.toBytes("data"));
table.put(p);
60. GemFire XD: a better HBase?
• Closed source but extremely mature
• SQL/Objects/JSON data model
• High concurrency, high update load
• Mostly selective point queries (no scans)
• Tiered storage architecture
61. YCSB Benchmark: Throughput is 2-12x
[Charts: YCSB throughput (ops/sec, 0-800,000) for HBase vs. GemFire XD across workloads AU, BU, CU, D, FU, and LOAD, at 4, 8, 12, and 16 client threads; GemFire XD sustains 2-12x the throughput of HBase.]
64. Querying data
• MapReduce: “an assembly language”
• Apache Pig: a data manipulation DSL (now
Turing complete!)
• Apache Hive: a batch-oriented SQL on top
of Hadoop
65. How do I use Pig?
grunt> A = load './input.txt';
grunt> B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
grunt> C = group B by word;
grunt> D = foreach C generate COUNT(B), group;
66. How do I use Hive?
CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'text' OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
(SELECT explode(split(line, '\\s')) AS word FROM docs) w
GROUP BY word
ORDER BY word;
67. Can we short Oracle now?
• No indexing
• Batch oriented scheduling
• Optimization for long running queries
• Metadata management is still in flux
72. Getting data in: Flume
• Designed for collecting log data
• Flexible deployment topology
73. Sqoop: RDBMS connector
• Sqoop 1
• A MapReduce tool
• Must use Oozie for workflows
• Sqoop 2
• Well, 0.99.x really
• A standalone service
74. Spring XD
• Unified, distributed, extensible system for data
ingestion, real-time analytics, and data export
• Apache Licensed, not ASF
• A runtime service, not a library
• AKA “Oozie + Flume + Sqoop + Morphlines”
75. How do I use it?
# deployment: ./xd-singlenode
$ ./xd-shell
xd:> hadoop config fs --namenode hdfs://nn:8020
xd:> stream create --definition "time | hdfs" --name ticktock
xd:> stream destroy --name ticktock
76. Feeding the Elephant
[Diagram: the ecosystem feeding HDFS and YARN. ASF projects: MapReduce, Tez, Giraph, Hive, Pig, HBase, Phoenix, Crunch, Mahout, SolrCloud, Sqoop, Flume, Zookeeper, and Oozie (coordination and workflow management). FLOSS projects: Hue (Hadoop UI). Pivotal products: GemFire XD, Spring XD, Hamster, and Command Center.]
78. What’s wrong with MR?
[Image omitted; source: UC Berkeley Spark project]
79. Spark innovations
• Resilient Distributed Datasets (RDDs)
• Distributed on a cluster
• Manipulated via parallel operators (map, etc.)
• Automatically rebuilt on failure
• A parallel ecosystem
• A solution to iterative and multi-stage apps
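The "rebuilt on failure" point is the key trick: an RDD records its lineage (parent dataset plus transformation) rather than pinning the data itself, so a lost partition can simply be recomputed. A toy single-machine sketch of that lineage idea in Java (every name here is illustrative; none of this is Spark's actual API):

```java
import java.util.List;
import java.util.function.Function;
import java.util.function.Supplier;
import java.util.stream.Collectors;

// A toy "resilient dataset": it remembers how to recompute itself (its lineage)
// rather than pinning its materialized contents forever.
class ToyRDD<T> {
    private final Supplier<List<T>> lineage;
    private List<T> cache; // may vanish at any time (simulating a lost partition)

    private ToyRDD(Supplier<List<T>> lineage) { this.lineage = lineage; }

    static <T> ToyRDD<T> parallelize(List<T> data) {
        return new ToyRDD<>(() -> data);
    }

    // A transformation only records lineage: "take the parent's result, apply f"
    <R> ToyRDD<R> map(Function<T, R> f) {
        return new ToyRDD<>(() -> collect().stream().map(f).collect(Collectors.toList()));
    }

    // An action materializes the data, recomputing from lineage if it was lost
    List<T> collect() {
        if (cache == null) cache = lineage.get();
        return cache;
    }

    void loseData() { cache = null; } // simulate a node failure
}
```

Because recomputation walks the recorded chain of transformations, fault tolerance comes for free without replicating intermediate results, which is exactly what makes iterative and multi-stage apps cheap on Spark compared to writing every stage back to HDFS.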
82. An alternative backend
• Shark: a Hive on Spark
• Spork: a Pig on Spark
• MLlib: machine learning on Spark
• GraphX: Graph processing on Spark
• Also featuring its own streaming engine
83. How do I use it?
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
87. Hadoop Maturity
Stages of data-architecture maturity:
• ETL Offload: accommodate massive data growth with existing EDW investments
• Data Lakes: unify unstructured and structured data access
• Big Data Apps: build analytic-led applications impacting top-line revenue
• Data-Driven Enterprise: app dev and operational management on HDFS
88. Pivotal HD on Pivotal CF
• Enterprise PaaS management system
• Flexible multi-language ‘buildpack’ architecture
• Deployed applications enjoy built-in services
• On-premise Hadoop as a Service
• Single-cluster deployment of Pivotal HD
• Developers instantly bind to shared Hadoop clusters
• Speeds up time-to-value
89. Pivotal Data Fabric Evolution
[Diagram: the Pivotal Data Platform on a software-defined datacenter. Streaming services provide stream ingestion; SQL services back analytic data marts; an in-memory database powers operational intelligence and run-time applications; data management services form the data staging platform; new data fabrics (in-memory grid, etc.) extend the stack.]