Elephant in the Cloud: a quest for the next generation Hadoop architecture

Roman Shaposhnik
Sr. Manager, Open Source Hadoop Platform @Pivotal (Twitter: @rhatr)
Who’s this guy?	

•  Sr. Manager @Pivotal building a team of OS contributors	

•  Apache Software Foundation guy (VP of Apache Incubator, VP of
Apache Bigtop, committer on Hadoop, Giraph, Sqoop, etc)	

•  Used to be root@Cloudera	

•  Used to be PHB@Yahoo! (original Hadoop team)	

•  Used to be a hacker at Sun Microsystems (Sun Studio compilers and tools)
Agenda	

Long, long time ago…
[Diagram: the original stack, with MapReduce on top of HDFS. Legend: ASF Projects, FLOSS Projects, Pivotal Products.]
In a blink of an eye:
[Diagram: the ecosystem today. Components shown: HDFS, YARN, MapReduce, Tez, Hamster, Pig, Hive, Crunch, Mahout, Giraph, HBase, Phoenix, SolrCloud, Sqoop, Flume, Spark (Shark, Streaming, MLlib, GraphX), Impala, HAWQ, MADlib, PivotalR, GemFire XD, SpringXD, Command Center. Coordination and workflow management: Zookeeper, Oozie. Hadoop UI: Hue. Legend: ASF Projects, FLOSS Projects, Pivotal Products.]
Genesis of Hadoop	

• Google papers on GFS and MapReduce	

• A subproject of Apache Nutch	

• A bet by Yahoo!
Data brings value	

• What features to add to the product	

• Data analysis must enable decisions	

• V3: volume, velocity, variety
Big Data brings big value
Entering: Industrial Data
Big Data Utility Gap
[Infographic: 70% of data generated by customers; 80% of data being stored; 3% being prepared for analysis; 0.5% being analyzed; <0.5% being operationalized. Average enterprises: 3 exabytes per day now; 40 trillion total gigabytes in 2020 (or 162 iPhones of storage for every human).]
?
Hadoop’s childhood	

• HDFS: Hadoop Distributed Filesystem	

• MapReduce: computational framework
HDFS: not a POSIXfs	

• Huge blocks: 64 MB (128 MB)	

• Mostly immutable files (append, truncate)	

• Streaming data access	

• Block replication
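To make the block-size and replication bullets concrete, here is a minimal arithmetic sketch in Python. The function name `hdfs_footprint` is hypothetical, and the replication factor of 3 is an assumption (it is the usual HDFS default, but the slide does not state it). The sketch also assumes the last block is not padded, so raw usage is simply file size times replication.

```python
def hdfs_footprint(file_size_bytes, block_size_bytes=128 * 1024 * 1024, replication=3):
    """Return (number of blocks, raw bytes stored across the cluster).

    Assumptions (not from the slide): replication factor 3, and the last
    block only occupies as many bytes as it actually holds.
    """
    if file_size_bytes == 0:
        return 0, 0
    # Integer ceiling division: a partially filled last block still counts as a block.
    blocks = -(-file_size_bytes // block_size_bytes)
    raw_bytes = file_size_bytes * replication
    return blocks, raw_bytes


one_gib = 1024 ** 3
blocks, raw = hdfs_footprint(one_gib)
print(blocks, raw // one_gib)  # prints "8 3": 8 blocks, 3 GiB of raw storage
```

With the older 64 MB block size the same 1 GiB file would split into 16 blocks, which is one reason large blocks keep the amount of per-file metadata small.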
How do I use it?	

$ hadoop fs -lsr /

# hadoop-fuse-dfs dfs://hadoop-hdfs /mnt
$ ls /mnt

# mount -t nfs -o vers=3,proto=tcp,nolock host:/ /mnt
$ ls /mnt
Principle #1	

HDFS is the datalake
Pivotal’s Focus on Data Lakes
[Diagram: a data lake on HDFS. Inputs: traditional data sources (ERP, HR, SFDC), machine data, and new data sources/formats, arriving via in-memory parallel ingest. Inside the lake: raw “untouched” data, processed data, data management (search engine), in-memory services, and ELT processing with Hadoop (MapReduce/SQL/Pig/Hive). Alongside it: the existing EDW/datamarts, analytical data marts/sandboxes, BI/analytical tools, and security and control. Business user callouts: “Finally! I now have full transparency on the data with amazing speed!”, “All data is now accessible!”, “I can now afford ‘Big Data’”.]
HDFS enables the stack
[Diagram: the full ecosystem stack (HDFS, YARN, MapReduce, Tez, Pig, Hive, HBase, Spark, Impala, HAWQ, GemFire XD, SpringXD and the rest), with HDFS underneath. Legend: ASF Projects, FLOSS Projects, Pivotal Products.]
Principle #2	

Apps share their
internal state
MapReduce	

• Batch oriented (long jobs; final results)	

• Brings the computation to the data	

• Very constrained programming model	

• Embarrassingly parallel programming model	

• Used to be the only game in town for compute
MapReduce Overview	

• Record = (Key, Value)	

• Key : Comparable, Serializable	

• Value: Serializable	

• Logical Phases: Input, Map, Shuffle, Reduce,
Output
Map	

• Input: (Key1, Value1)	

• Output: List(Key2, Value2)	

• Projections, Filtering, Transformation
Shuffle	

• Input: List(Key2, Value2)	

• Output	

• Sort(Partition(List(Key2, List(Value2))))	

• Provided by Hadoop : Several
Customizations Possible
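The shuffle step above can be sketched in a few lines of Python. This is a toy, in-memory illustration rather than Hadoop's implementation: the function names `partition` and `shuffle` are hypothetical, and hashing the key modulo the number of reducers is an assumption modelled on the idea of a hash partitioner (one of the customization points the slide mentions).

```python
from collections import defaultdict
import zlib


def partition(key, num_reducers):
    # crc32 is used instead of Python's built-in hash() because str hashes
    # are salted per process; this keeps the sketch deterministic.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_output, num_reducers):
    """Turn List(Key2, Value2) into one sorted List(Key2, List(Value2)) per reducer."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in map_output:
        partitions[partition(key, num_reducers)][key].append(value)
    # Each reducer sees its keys in sorted order, with all values for a key grouped.
    return [sorted(p.items()) for p in partitions]


map_output = [("a", 1), ("b", 1), ("c", 1), ("a", 1), ("c", 1), ("a", 1)]
for reducer_id, records in enumerate(shuffle(map_output, num_reducers=2)):
    print(reducer_id, records)
```

Every occurrence of a key lands in the same partition, which is what lets each reducer aggregate its keys independently of the others.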
Reduce	

• Input: List(Key2, List(Value2))	

• Output: List(Key3, Value3)	

• Aggregations
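Putting the Map, Shuffle and Reduce slides together, the following Python sketch runs a word count end to end in a single process. It only illustrates the logical phases: the names `map_phase`, `reduce_phase` and `run_job` are hypothetical, a plain sort stands in for the distributed shuffle, and the three input lines are an assumption chosen for the example.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(line):
    # Input: one record's value. Output: List(Key2, Value2).
    return [(word, 1) for word in line.split()]


def reduce_phase(key, values):
    # Input: (Key2, List(Value2)). Output: List(Key3, Value3).
    return [(key, sum(values))]


def run_job(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    # Stand-in for the shuffle: sort by key so equal keys become adjacent.
    mapped.sort(key=itemgetter(0))
    output = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        output.extend(reduce_phase(key, [value for _, value in group]))
    return output


print(run_job(["a b c", "a c", "a"]))  # prints [('a', 3), ('b', 1), ('c', 2)]
```

The map and reduce functions never see more than one record or one key at a time, which is what makes the model both constrained and easy to run in parallel.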
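Anatomy of MapReduce
[Diagram: word count flowing HDFS → mappers → reducers → HDFS. Input lines such as "a b c" are read from HDFS; the mappers emit (a 1, b 1, c 1), (a 1, c 1) and (a 1); the shuffle groups them into a: 1 1 1, b: 1, c: 1 1; the reducers write a 3, b 1, c 2 back to HDFS.]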
MapReduce DataFlow
How do I use it?	

	

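import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {

  // Reused for every record to avoid allocating an object per token.
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  // Emits (token, 1) for each whitespace-separated token in the input line.
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}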
How do I use it?	

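import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public static class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  // Reused for every key to avoid allocating an object per result.
  private IntWritable result = new IntWritable();

  // Sums the counts collected for one word and emits (word, total).
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}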
How do I run it?	

	

$ hadoop jar hadoop-examples.jar wordcount input output
Principle #3	

MapReduce is assembly
language of Hadoop
Hadoop’s childhood	

• Compact (pretty much a single jar)	

• Challenged in scalability and SPOFs	

• Extremely batch oriented	

• Hard for non-Java programmers
Then, something happened
Hadoop 1.0
[Diagram: MapReduce running directly on HDFS. Legend: ASF Projects, FLOSS Projects, Pivotal Products.]
Hadoop 2.0
[Diagram: MapReduce, Tez and Hamster running on YARN, with HDFS underneath. Legend: ASF Projects, FLOSS Projects, Pivotal Products.]
Hadoop 2.0	

• HDFS 2.0	

• Yet Another Resource Negotiator (YARN)	

• MapReduce is just an “application” now	

• Tez is another “application”	

• Pivotal’s Hamster (OpenMPI) yet another one
MapReduce 1.0
[Diagram: one Job Tracker handing tasks (task1 … taskN) to Task Trackers, each co-located with HDFS.]
YARN (AKA MR2.0)
[Diagram: a Resource Manager, with the Job Tracker, its tasks and a Task Tracker now running underneath it.]
YARN	

• Yet Another Resource Negotiator	

• Resource Manager	

• Node Managers	

• Application Masters	

• Specific to paradigm, e.g. MR Application
master (aka JobTracker)
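The division of labour between the three YARN roles can be illustrated with a toy model in Python. This is only a sketch under loud assumptions: the class and method names are hypothetical and are not the YARN API, memory is the only resource tracked, and the "first node that fits" policy is made up for the example.

```python
class ResourceManager:
    """Toy cluster-wide arbiter: knows how much memory each node has free."""

    def __init__(self, node_capacity_mb):
        # node name -> free memory in MB, as a Node Manager would report it
        self.free = dict(node_capacity_mb)

    def allocate(self, memory_mb):
        """Grant one container on the first node with room, or None if none fits."""
        for node in sorted(self.free):
            if self.free[node] >= memory_mb:
                self.free[node] -= memory_mb
                return (node, memory_mb)
        return None

    def release(self, container):
        node, memory_mb = container
        self.free[node] += memory_mb


class ApplicationMaster:
    """Toy per-application master: asks the Resource Manager for containers."""

    def __init__(self, rm, name):
        self.rm = rm
        self.name = name
        self.containers = []

    def request(self, count, memory_mb):
        """Request up to `count` containers; return how many this app now holds."""
        for _ in range(count):
            container = self.rm.allocate(memory_mb)
            if container is None:
                break
            self.containers.append(container)
        return len(self.containers)


rm = ResourceManager({"node1": 4096, "node2": 4096})
mapreduce = ApplicationMaster(rm, "mapreduce")
tez = ApplicationMaster(rm, "tez")
print(mapreduce.request(3, 2048))  # prints 3
print(tez.request(2, 2048))        # prints 1: only one 2048 MB slot is left
```

The point of the split is that the Resource Manager knows nothing about MapReduce or Tez; each paradigm brings its own Application Master and competes for the same pool of containers.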
YARN: beyond MR
[Diagram: the Resource Manager scheduling MPI processes alongside other applications.]
Hamster	

•  Hadoop and MPI on the same cluster	

•  OpenMPI Runtime on Hadoop YARN	

•  Hadoop Provides: Resource Scheduling, 
Process monitoring, Distributed File System	

•  Open MPI Provides: Process launching, 
Communication, I/O forwarding
Hamster Components	

• Hamster Application Master	

• Gang Scheduler, YARN Application
Preemption	

• Resource Isolation (lxc Containers)	

• ORTE: Hamster Runtime	

• Process launching, Wireup, Interconnect
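Gang scheduling matters for MPI because all ranks of a job typically need to start together; a partial allocation is of no use. The Python sketch below shows the all-or-nothing idea only. It is not Hamster's scheduler: the function name `gang_allocate`, the slot-counting model and the node-filling order are all assumptions made for illustration.

```python
def gang_allocate(free_slots, ranks):
    """Allocate `ranks` slots across nodes, or nothing at all.

    free_slots: dict of node name -> free slot count (mutated on success).
    Returns a dict of node name -> slots granted, or None if the whole
    gang does not fit.
    """
    if sum(free_slots.values()) < ranks:
        return None  # never hold a partial allocation
    grant = {}
    remaining = ranks
    for node in sorted(free_slots):
        take = min(free_slots[node], remaining)
        if take:
            grant[node] = take
            remaining -= take
        if remaining == 0:
            break
    # Commit only once the full gang has been placed.
    for node, take in grant.items():
        free_slots[node] -= take
    return grant


cluster = {"n1": 2, "n2": 2}
print(gang_allocate(cluster, 5))  # prints None: 5 ranks do not fit in 4 slots
print(gang_allocate(cluster, 3))  # prints {'n1': 2, 'n2': 1}
```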
Hamster Architecture
Hadoop 2.0
[Diagram: MapReduce, Tez and Hamster running on YARN, with HDFS underneath. Legend: ASF Projects, FLOSS Projects, Pivotal Products.]
Hadoop ecosystem
[Diagram: the Hadoop 2.0 core surrounded by Pig, Hive, Crunch, Mahout, Giraph, HBase, Phoenix, SolrCloud, Sqoop, Flume and Command Center. Coordination and workflow management: Zookeeper, Oozie. Hadoop UI: Hue. Legend: ASF Projects, FLOSS Projects, Pivotal Products.]
There’s way too much stuff	

• Tracking dependencies	

• Integration testing	

• Optimizing the defaults	

• Rationalizing the behaviour
Wait! We’ve seen this!
[Diagram: GNU software layered on the Linux kernel, next to its counterpart labelled Apache Bigtop: the Hadoop ecosystem (HBase, Pig, Hive) layered on Hadoop (HDFS, YARN, MR).]
Principle #4	

Apache Bigtop is how
the Hadoop distros get
defined
The ecosystem	

• Apache HBase	

• Apache Crunch, Pig, Hive and Phoenix	

• Apache Giraph	

• Apache Oozie	

• Apache Mahout	

• Apache Sqoop and Flume
Apache HBase	

• Small mutable records vs. HDFS files	

• HFiles kept in HDFS	

• Memcached for HDFS	

• Built on HDFS and Zookeeper	

• Google’s Bigtable
HBase data model
• Driven by the original Webtable use case:
[Table: row key com.cnn.www; column content: holds "html..."; column anchor:a.com holds "CNN"; column anchor:b.com holds "CNN.co".]
How do I use it?	

HTable table = new HTable(config, "table");
Put p = new Put(Bytes.toBytes("row"));
p.add(Bytes.toBytes("family"),
      Bytes.toBytes("qualifier"),
      Bytes.toBytes("data"));
table.put(p);
Dataflow model
[Diagram: a producer and a consumer exchanging data through HBase, which sits on top of HDFS.]
When do I use it?	

• Serving up large amounts of data	

• Fast random access	

• Scan operations
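The access patterns above follow from how a Bigtable-style table is organised: rows are kept sorted by key, and each row maps "family:qualifier" columns to values. The Python sketch below is a toy in-memory model of that idea, not the HBase client API. The class `ToyTable` and its methods are hypothetical, cell versions (timestamps) are left out, and the second and third row keys are invented for the example.

```python
import bisect


class ToyTable:
    """Rows sorted by key; each row is a dict of 'family:qualifier' -> value."""

    def __init__(self):
        self._keys = []  # row keys, kept sorted to make range scans cheap
        self._rows = {}  # row key -> {column: value}

    def put(self, row, column, value):
        if row not in self._rows:
            bisect.insort(self._keys, row)
            self._rows[row] = {}
        self._rows[row][column] = value

    def get(self, row):
        # Random access by row key.
        return self._rows.get(row, {})

    def scan(self, start, stop):
        # Range scan over row keys in [start, stop).
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(key, self._rows[key]) for key in self._keys[lo:hi]]


table = ToyTable()
table.put("com.cnn.www", "anchor:a.com", "CNN")
table.put("org.example.www", "anchor:a.com", "Example")  # hypothetical row
table.put("com.example.www", "anchor:b.com", "Example")  # hypothetical row

print(table.get("com.cnn.www"))  # prints {'anchor:a.com': 'CNN'}
print([key for key, _ in table.scan("com.", "com/")])
# prints ['com.cnn.www', 'com.example.www']
```

Reversed domain names as row keys, as in the Webtable example, keep pages from the same site adjacent, so a scan over one key prefix reads them together.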
Principle #5	

HBase: when you
need OLAP + OLTP
What if it’s OLTP?
[Diagram: the Hadoop ecosystem stack (HDFS, YARN, MapReduce, Tez, Hamster, Pig, Hive, HBase, Phoenix and the rest). Legend: ASF Projects, FLOSS Projects, Pivotal Products.]
GemFire XD
[Diagram: the same stack with GemFire XD added.]
GemFire XD: a better HBase?	

• Closed source but extremely mature	

• SQL/Objects/JSON data model	

• High concurrency, high update load	

• Mostly selective point queries (no scans)	

• Tiered storage architecture
YCSB Benchmark; Throughput is 2-12X
[Charts: throughput (ops/sec, axis 0 to 800,000) for HBase and for GemFire XD across YCSB workloads AU, BU, CU, D, FU and LOAD, with series labelled 4, 8, 12 and 16.]
YCSB Benchmark; Latency is 2X – 20X better
[Charts: latency (μsec, axis 0 to 14,000) for HBase and for GemFire XD, with series labelled 4, 8, 12 and 16.]
Principle #6	

There are always 3
implementations
Querying data	

• MapReduce: “an assembly language”	

• Apache Pig: a data manipulation DSL (now
Turing complete!)	

• Apache Hive: a batch-oriented SQL on top
of Hadoop
How do I use Pig?	

grunt> A = load './input.txt';
grunt> B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
grunt> C = group B by word;
grunt> D = foreach C generate COUNT(B), group;
How do I use Hive?	

CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'text' OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
  (SELECT explode(split(line, '\\s')) AS word FROM docs) w
GROUP BY word
ORDER BY word;
Can we short Oracle now?	

• No indexing	

• Batch oriented scheduling	

• Optimization for long running queries	

• Metadata management is still in flux
[Close to] real-time SQL	

• Impala (inspired by Google’s F1)	

• Hive/Tez (AKA Stinger)	

• Facebook’s Presto (Hive’s lineage)	

• Pivotal’s HAWQ
HAWQ	

• Greenplum MPP database core	

• True ANSI SQL support	

• HDFS storage backend	

• Parquet support is coming
Principle #7	

SQL on Hadoop
Feeding the elephant
Getting data in: Flume	

• Designed for collecting log data	

• Flexible deployment topology
Sqoop: RDBMS connection	

• Sqoop 1	

• A MapReduce tool	

• Must use Oozie for workflows	

• Sqoop 2	

• Well, 0.99.x really	

• A standalone service
Spring XD	

• Unified, distributed, extensible system for data ingestion, real-time analytics and data export	

• Apache Licensed, not ASF	

• A runtime service, not a library	

• AKA “Oozie + Flume + Sqoop + Morphlines”
How do I use it?	

# deployment: ./xd-singlenode
$ ./xd-shell
xd:> hadoop config fs --namenode hdfs://nn:8020
xd:> stream create --definition "time | hdfs" --name ticktock
xd:> stream destroy --name ticktock
Feeding the Elephant
[Diagram: the ecosystem stack with GemFire XD and SpringXD added next to Sqoop and Flume. Legend: ASF Projects, FLOSS Projects, Pivotal Products.]
Spark the disruptor
[Diagram: the same stack with Spark and its components (Shark, Streaming, MLlib, GraphX) added.]
What’s wrong with MR?	

Source: UC Berkeley Spark project (just the image)
Spark innovations	

• Resilient Distributed Datasets (RDDs)	

• Distributed on a cluster	

• Manipulated via parallel operators (map, etc.)	

• Automatically rebuilt on failure	

• A parallel ecosystem	

• A solution to iterative and multi-stage apps
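"Automatically rebuilt on failure" works because an RDD remembers how it was derived (its lineage) rather than relying on replicated copies of its data. The Python sketch below illustrates that idea on a single machine. It is not Spark's API: `ToyRDD`, `lose` and the single-partition model are assumptions made for the example, and the log lines are invented.

```python
class ToyRDD:
    """A dataset defined by its parent and the function that derives it."""

    def __init__(self, compute, parent=None):
        self._compute = compute  # parent's data (or None for a source) -> list
        self._parent = parent
        self._cache = None       # materialized data, may be lost at any time

    def map(self, f):
        return ToyRDD(lambda data: [f(x) for x in data], self)

    def filter(self, pred):
        return ToyRDD(lambda data: [x for x in data if pred(x)], self)

    def collect(self):
        if self._cache is None:
            # Walk the lineage: recompute from the parent when data is missing.
            upstream = self._parent.collect() if self._parent else None
            self._cache = self._compute(upstream)
        return self._cache

    def lose(self):
        """Simulate losing the materialized data, e.g. a failed node."""
        self._cache = None


log = ToyRDD(lambda _: ["warning disk full", "info ok", "warning cpu hot"])
warnings = log.filter(lambda s: "warning" in s).map(lambda s: s.split(" ")[1])

print(warnings.collect())  # prints ['disk', 'cpu']
warnings.lose()
print(warnings.collect())  # prints ['disk', 'cpu'] again, rebuilt from lineage
```

Nothing is computed until `collect()` is called, which mirrors how transformations only describe the dataset while an action triggers the work.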
RDDs
warnings = textFile(…).filter(_.contains("warning"))
                      .map(_.split(' ')(1))
[Diagram: lineage chain HadoopRDD (path = hdfs://) → FilteredRDD (contains…) → MappedRDD (split…).]
Parallel operators	

• map, reduce	

• sample, filter	

• groupBy, reduceByKey	

• join, leftOuterJoin, rightOuterJoin	

• union, cross
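For readers who have not met the keyed operators before, here is a small Python sketch of what reduceByKey, join and leftOuterJoin compute, using plain lists of (key, value) pairs. These are local stand-ins with hypothetical snake_case names, not Spark calls, and they ignore partitioning entirely.

```python
from collections import defaultdict


def reduce_by_key(pairs, f):
    """Combine all values that share a key with the binary function f."""
    acc = {}
    for key, value in pairs:
        acc[key] = f(acc[key], value) if key in acc else value
    return sorted(acc.items())


def _index(pairs):
    index = defaultdict(list)
    for key, value in pairs:
        index[key].append(value)
    return index


def join(left, right):
    """Inner join: one output pair per matching (left, right) combination."""
    index = _index(right)
    return [(k, (lv, rv)) for k, lv in left for rv in index.get(k, [])]


def left_outer_join(left, right):
    """Like join, but unmatched left records are kept with None on the right."""
    index = _index(right)
    return [(k, (lv, rv)) for k, lv in left
            for rv in (index[k] if k in index else [None])]


counts = reduce_by_key([("a", 1), ("b", 1), ("a", 1)], lambda x, y: x + y)
print(counts)                                 # prints [('a', 2), ('b', 1)]
print(join(counts, [("a", "x")]))             # prints [('a', (2, 'x'))]
print(left_outer_join(counts, [("a", "x")]))  # prints [('a', (2, 'x')), ('b', (1, None))]
```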
An alternative backend	

• Shark: a Hive on Spark	

• Spork: a Pig on Spark	

• MLlib: machine learning on Spark	

• GraphX: Graph processing on Spark	

• Also featuring its own streaming engine
How do I use it?	

val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Principle #8	

Spark is the
technology of 2014
Where’s the cloud?
What’s new?	

• True elasticity	

• Resource partitioning	

• Security	

• Data marketplace	

• Data leaks/breaches
Hadoop Maturity
• ETL Offload: accommodate massive data growth with existing EDW investments
• Data Lakes: unify unstructured and structured data access
• Big Data Apps: build analytic-led applications impacting top line revenue
• Data-Driven Enterprise: app dev and operational management on HDFS data architecture
Pivotal HD on Pivotal CF
Ÿ Enterprise PaaS Management System
Ÿ Flexible multi-language ‘buildpack’
architecture
Ÿ Deployed applications enjoy built-in
services
Ÿ On-Premise Hadoop as a Service
Ÿ Single cluster deployment of Pivotal HD
Ÿ Developers instantly bind to shared
Hadoop Clusters
Ÿ Speeds up time-to-value
Pivotal Data Fabric Evolution
[Diagram: the Pivotal Data Platform on a software-defined datacenter. Labels: Analytic Data Marts (SQL Services); Operational Intelligence (In-Memory Database); Run-Time Applications; Data Staging Platform (Data Mgmt. Services); Stream Ingestion (Streaming Services); New Data-fabrics; In-Memory Grid; ...ETC.]
Principle #9	

Hadoop in the Cloud
is one of many
distributed
frameworks
2014 is the year of Hadoop
[Diagram: the full ecosystem stack once more, including HDFS, YARN, MapReduce, Tez, Hamster, Pig, Hive, HBase, Phoenix, SolrCloud, Spark (Shark, Streaming, MLlib, GraphX), Impala, HAWQ, MADlib, PivotalR, GemFire XD and SpringXD. Legend: ASF Projects, FLOSS Projects, Pivotal Products.]
A NEW PLATFORM FOR A NEW ERA
Credits	

• Apache Software Foundation	

• Milind Bhandarkar	

• Konstantin Boudnik	

• Robert Geiger	

• Susheel Kaushik	

• Mak Gokhale
Questions ?

Azure Monitor & Application Insight to monitor Infrastructure & Application
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 

Elephant in the cloud

  • 1. Elephant in the Cloud: a quest for the next generation Hadoop architecture Roman Shaposhnik Sr. Manager, Open Source Hadoop Platform @Pivotal (Twitter: @rhatr)
  • 2. Who’s this guy? •  Sr. Manager @Pivotal building a team of open source contributors •  Apache Software Foundation guy (VP of Apache Incubator, VP of Apache Bigtop, committer on Hadoop, Giraph, Sqoop, etc.) •  Used to be root@Cloudera •  Used to be PHB@Yahoo! (original Hadoop team) •  Used to be a hacker at Sun Microsystems (Sun Studio compilers and tools)
  • 5. Long, long time ago… [stack diagram] Legend: ASF Projects, FLOSS Projects, Pivotal Products. Components: HDFS, MapReduce.
  • 6. In a blink of an eye: [stack diagram] Legend: ASF Projects, FLOSS Projects, Pivotal Products. Components: HDFS, YARN, MapReduce, Tez, Pig, Hive, Crunch, Giraph, Mahout, HBase, Phoenix, SolrCloud, Sqoop, Flume, Spark (Shark, Streaming, MLlib, GraphX), Impala, HAWQ, GemFire XD, SpringXD, MADlib, Hamster, PivotalR, Command Center; coordination and workflow management: Zookeeper, Oozie; Hadoop UI: Hue.
  • 7. Genesis of Hadoop • Google papers on GFS and MapReduce • A subproject of Apache Nutch • A bet by Yahoo!
  • 8. Data brings value • What features to add to the product • Data analysis must enable decisions • V3: volume, velocity, variety
  • 9. Big Data brings big value
  • 11. Big Data Utility Gap • 70% of data generated by customers • 80% of data being stored • 3% being prepared for analysis • 0.5% being analyzed • <0.5% being operationalized • Average Enterprises • 3 exabytes per day now • 40 trillion total gigabytes in 2020 (or 162 iPhones of storage for every human)
  • 13. Hadoop’s childhood • HDFS: Hadoop Distributed Filesystem • MapReduce: computational framework
  • 15. HDFS: not a POSIX filesystem • Huge blocks: 64 MB (128 MB) • Mostly immutable files (append, truncate) • Streaming data access • Block replication
  • 16. How do I use it? $ hadoop fs -lsr / # hadoop-fuse-dfs dfs://hadoop-hdfs /mnt $ ls /mnt # mount -t nfs -o vers=3,proto=tcp,nolock host:/ /mnt $ ls /mnt
  • 17. Principle #1 HDFS is the datalake
  • 18. Pivotal’s Focus on Data Lakes [architecture diagram] Labels: Existing EDW / Datamarts; Raw “untouched” Data; In-Memory Parallel Ingest; Data Management (Search Engine); Processed Data; In-Memory Services; BI/Analytical Tools; Data Lake; ERP; HR; SFDC; New Data Sources/Formats; Machine; Traditional Data Sources; ELT Processing with Hadoop; HDFS; MapReduce/SQL/Pig/Hive; Analytical Data Marts/Sandboxes; Security and Control. Business user quotes: “Finally! I now have full transparency on the data with amazing speed!” “All data is now accessible!” “I can now afford ‘Big Data’”
  • 19. HDFS enables the stack [same stack diagram as slide 6]
  • 20. Principle #2 Apps share their internal state
  • 21. MapReduce • Batch oriented (long jobs; final results) • Brings the computation to the data • Very constrained programming model • Embarrassingly parallel programming model • Used to be the only game in town for compute
  • 22. MapReduce Overview • Record = (Key, Value) • Key: Comparable, Serializable • Value: Serializable • Logical Phases: Input, Map, Shuffle, Reduce, Output
  • 23. Map • Input: (Key1, Value1) • Output: List(Key2, Value2) • Projections, Filtering, Transformation
  • 24. Shuffle • Input: List(Key2, Value2) • Output: Sort(Partition(List(Key2, List(Value2)))) • Provided by Hadoop: several customizations possible
  • 25. Reduce • Input: List(Key2, List(Value2)) • Output: List(Key3, Value3) • Aggregations
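  • 26. Anatomy of MapReduce [diagram: word count flowing HDFS → mappers → reducers → HDFS. Mappers emit (a 1, b 1, c 1), (a 1, c 1) and (a 1); the shuffle groups them as a → 1 1 1, b → 1, c → 1 1; reducers write a 3, b 1, c 2.]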
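The Map, Shuffle and Reduce phases on the slides above can be imitated in a few lines of plain Java. This is only an in-memory sketch of the data flow, not Hadoop code: the class name `MiniMapReduce` and its methods are made up for illustration, and the three input strings are an assumed reading of the word-count diagram on slide 26.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process imitation of the MapReduce phases (no Hadoop involved).
public class MiniMapReduce {

    // Map: one input record (a line of text) becomes a list of (word, 1) pairs.
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(new SimpleImmutableEntry<>(token, 1));
            }
        }
        return out;
    }

    // Shuffle: group all values by key; the TreeMap keeps the keys sorted.
    public static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: aggregate the list of values that arrived for one key.
    public static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Drives the three phases over a list of input splits.
    public static TreeMap<String, Integer> wordCount(List<String> splits) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String split : splits) {
            mapped.addAll(map(split));
        }
        TreeMap<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(mapped).entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Three splits, one per imagined mapper.
        System.out.println(wordCount(Arrays.asList("a b c", "a c", "a")));
    }
}
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network; here everything happens in one loop, which is why the sketch says nothing about scalability.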
  • 28. How do I use it? public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(Object key, Text value, Context context) throws IOException, InterruptedException { StringTokenizer itr = new StringTokenizer(value.toString()); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); context.write(word, one); } } }
  • 29. How do I use it? public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> { private IntWritable result = new IntWritable(); public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } result.set(sum); context.write(key, result); } }
  • 30. How do I run it? $ hadoop jar hadoop-examples.jar wordcount input output
  • 31. Principle #3 MapReduce is the assembly language of Hadoop
  • 32. Hadoop’s childhood • Compact (pretty much a single jar) • Challenged in scalability and SPOFs • Extremely batch oriented • Hard for non-Java programmers
  • 34. Hadoop 1.0 [stack diagram] Legend: ASF Projects, FLOSS Projects, Pivotal Products. Components: HDFS, MapReduce.
  • 35. Hadoop 2.0 [stack diagram] Legend: ASF Projects, FLOSS Projects, Pivotal Products. Components: HDFS, YARN, MapReduce, Tez, Hamster.
  • 36. Hadoop 2.0 • HDFS 2.0 • Yet Another Resource Negotiator (YARN) • MapReduce is just an “application” now • Tez is another “application” • Pivotal’s Hamster (OpenMPI) yet another one
  • 40. YARN • Yet Another Resource Negotiator • Resource Manager • Node Managers • Application Masters • Specific to paradigm, e.g. MR Application master (aka JobTracker)
  • 42. Hamster •  Hadoop and MPI on the same cluster •  OpenMPI Runtime on Hadoop YARN •  Hadoop Provides: Resource Scheduling, Process monitoring, Distributed File System •  Open MPI Provides: Process launching, Communication, I/O forwarding
  • 43. Hamster Components • Hamster Application Master • Gang Scheduler, YARN Application Preemption • Resource Isolation (lxc Containers) • ORTE: Hamster Runtime • Process launching, Wireup, Interconnect
  • 45. Hadoop 2.0 [same stack diagram as slide 35]
  • 46. Hadoop ecosystem [stack diagram] Legend: ASF Projects, FLOSS Projects, Pivotal Products. Components: HDFS, YARN, MapReduce, Tez, Hamster, Pig, Hive, Crunch, Giraph, Mahout, HBase, Phoenix, SolrCloud, Sqoop, Flume, Command Center; coordination and workflow management: Zookeeper, Oozie; Hadoop UI: Hue.
  • 47. There’s way too much stuff • Tracking dependencies • Integration testing • Optimizing the defaults • Rationalizing the behaviour
  • 48. Wait! We’ve seen this! • GNU software • Linux kernel
  • 49. Apache Bigtop • Hadoop ecosystem (HBase, Pig, Hive) • Hadoop (HDFS, YARN, MR)
  • 50. Principle #4 Apache Bigtop is how the Hadoop distros get defined
  • 51. The ecosystem • Apache HBase • Apache Crunch, Pig, Hive and Phoenix • Apache Giraph • Apache Oozie • Apache Mahout • Apache Sqoop and Flume
  • 52. Apache HBase • Small mutable records vs. HDFS files • HFiles kept in HDFS • Memcached for HDFS • Built on HDFS and Zookeeper • Google’s Bigtable
  • 53. HBase data model • Driven by the original Webtable use case • Row key: com.cnn.www • Column content: → html... • Column anchor:a.com → CNN • Column anchor:b.com → CNN.co
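One way to picture the Webtable-style data model on this slide is as nested sorted maps: row key, then column (family:qualifier), then timestamp, then value. The sketch below is an assumption-laden illustration in plain Java; `WebtableSketch` and its methods are invented names, not the HBase client API, and real HBase stores bytes rather than strings.

```java
import java.util.Collections;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical in-memory picture of the data model: row -> column -> timestamp -> value.
public class WebtableSketch {

    // Rows are kept sorted by key, which is what makes range scans cheap.
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<>();

    // Store one cell version under (row, "family:qualifier", timestamp).
    public void put(String row, String column, long timestamp, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            // Versions are ordered newest-first so the latest value is the first entry.
            .computeIfAbsent(column, c -> new TreeMap<Long, String>(Collections.reverseOrder()))
            .put(timestamp, value);
    }

    // Fast random access: return the newest version of a cell, or null if absent.
    public String get(String row, String column) {
        NavigableMap<String, NavigableMap<Long, String>> columns = rows.get(row);
        if (columns == null) {
            return null;
        }
        NavigableMap<Long, String> versions = columns.get(column);
        return (versions == null || versions.isEmpty()) ? null : versions.firstEntry().getValue();
    }

    // Scan: all rows with startRow <= key < stopRow, in sorted key order.
    public NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> scan(
            String startRow, String stopRow) {
        return rows.subMap(startRow, true, stopRow, false);
    }

    public static void main(String[] args) {
        WebtableSketch table = new WebtableSketch();
        table.put("com.cnn.www", "anchor:a.com", 1L, "CNN");
        table.put("com.cnn.www", "content:", 1L, "html...");
        System.out.println(table.get("com.cnn.www", "anchor:a.com"));
        System.out.println(table.scan("com.cnn", "com.cno").keySet());
    }
}
```

Sorting rows by key is the design choice that lets a single structure serve both point lookups and scans, which matches the "fast random access" and "scan operations" use cases named later in the deck.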
  • 54. How do I use it? HTable table = new HTable(config, "table"); Put p = new Put(Bytes.toBytes("row")); p.add(Bytes.toBytes("family"), Bytes.toBytes("qualifier"), Bytes.toBytes("data")); table.put(p);
  • 56. When do I use it? • Serving up large amounts of data • Fast random access • Scan operations
  • 57. Principle #5 HBase: when you need OLAP + OLTP
  • 58. What if it’s OLTP? [same stack diagram as slide 46]
  • 59. GemFire XD [stack diagram of slide 46 with GemFire XD added]
  • 60. GemFire XD: a better HBase? • Closed source but extremely mature • SQL/Objects/JSON data model • High concurrency, high update load • Mostly selective point queries (no scans) • Tiered storage architecture
  • 61. YCSB Benchmark: throughput is 2-12X [two bar charts, HBase vs. GemFire XD: throughput in ops/sec (axis 0 to 800,000) for workloads AU, BU, CU, D, FU and LOAD; series labeled 4, 8, 12, 16]
  • 62. YCSB Benchmark: latency is 2X-20X better [two bar charts, HBase vs. GemFire XD: latency in μsec (axis 0 to 14,000); series labeled 4, 8, 12, 16]
  • 63. Principle #6 There are always 3 implementations
  • 64. Querying data • MapReduce: “an assembly language” • Apache Pig: a data manipulation DSL (now Turing complete!) • Apache Hive: a batch-oriented SQL on top of Hadoop
  • 65. How do I use Pig? grunt> A = load './input.txt'; grunt> B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word; grunt> C = group B by word; grunt> D = foreach C generate COUNT(B), group;
  • 66. How do I use Hive? CREATE TABLE docs (line STRING); LOAD DATA INPATH 'text' OVERWRITE INTO TABLE docs; CREATE TABLE word_counts AS SELECT word, count(1) AS count FROM (SELECT explode(split(line, '\s')) AS word FROM docs) w GROUP BY word ORDER BY word;
  • 67. Can we short Oracle now? • No indexing • Batch oriented scheduling • Optimization for long running queries • Metadata management is still in flux
  • 68. [Close to] real-time SQL • Impala (inspired by Google’s F1) • Hive/Tez (AKA Stinger) • Facebook’s Presto (Hive’s lineage) • Pivotal’s HAWQ
  • 69. HAWQ • Greenplum MPP database core • True ANSI SQL support • HDFS storage backend • Parquet support is coming
  • 72. Getting data in: Flume • Designed for collecting log data • Flexible deployment topology
  • 73. Sqoop: RDBMS connection • Sqoop 1: a MapReduce tool; must use Oozie for workflows • Sqoop 2 (well, 0.99.x really): a standalone service
  • 74. Spring XD • Unified, distributed, extensible system for data ingestion, real-time analytics and data export • Apache Licensed, not ASF • A runtime service, not a library • AKA “Oozie + Flume + Sqoop + Morphlines”
  • 75. How do I use it? # deployment: ./xd-singlenode $ ./xd-shell xd:> hadoop config fs --namenode hdfs://nn:8020 xd:> stream create --definition "time | hdfs" --name ticktock xd:> stream destroy --name ticktock
  • 76. Feeding the Elephant [stack diagram of slide 46 with GemFire XD and SpringXD added]
  • 77. Spark the disruptor [stack diagram of slide 46 with GemFire XD, SpringXD and the Spark stack (Spark, Shark, Streaming, MLlib, GraphX) added]
  • 78. What’s wrong with MR? Source: UC Berkeley Spark project (just the image)
  • 79. Spark innovations • Resilient Distributed Datasets (RDDs): distributed on a cluster; manipulated via parallel operators (map, etc.); automatically rebuilt on failure • A parallel ecosystem • A solution to iterative and multi-stage apps
  • 80. RDDs • warnings = textFile(…).filter(_.contains("warning")).map(_.split(' ')(1)) • Lineage: HadoopRDD (path = hdfs://) → FilteredRDD (contains…) → MappedRDD (split…)
  • 81. Parallel operators • map, reduce • sample, filter • groupBy, reduceByKey • join, leftOuterJoin, rightOuterJoin • union, cross
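The RDD slides above rest on one idea: a dataset remembers how it was derived (its lineage) instead of holding materialized data, so a lost result can be recomputed from the source. The plain-Java sketch below imitates that idea on a single machine. `LineageSketch` is an invented class, not Spark's API, and the log lines in `main` are made-up sample data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical lazy dataset that records its lineage and recomputes on demand.
public class LineageSketch<T> {

    private final String lineage;            // human-readable chain of derivations
    private final Supplier<List<T>> compute; // how to (re)build the data from scratch

    private LineageSketch(String lineage, Supplier<List<T>> compute) {
        this.lineage = lineage;
        this.compute = compute;
    }

    // A source dataset: the loader stands in for reading a file from HDFS.
    public static <T> LineageSketch<T> source(String name, Supplier<List<T>> loader) {
        return new LineageSketch<T>(name, loader);
    }

    // Transformation: nothing runs here, we only extend the lineage.
    public LineageSketch<T> filter(Predicate<T> keep) {
        return new LineageSketch<T>(lineage + " -> FilteredRDD", () -> {
            List<T> out = new ArrayList<>();
            for (T item : compute.get()) {
                if (keep.test(item)) {
                    out.add(item);
                }
            }
            return out;
        });
    }

    // Transformation: also lazy, also just one more link in the chain.
    public <R> LineageSketch<R> map(Function<T, R> f) {
        return new LineageSketch<R>(lineage + " -> MappedRDD", () -> {
            List<R> out = new ArrayList<>();
            for (T item : compute.get()) {
                out.add(f.apply(item));
            }
            return out;
        });
    }

    // Action: walks the lineage back to the source and recomputes everything.
    public List<T> collect() {
        return compute.get();
    }

    public String lineage() {
        return lineage;
    }

    public static void main(String[] args) {
        LineageSketch<String> warnings = LineageSketch
                .<String>source("HadoopRDD", () -> Arrays.asList(
                        "warning disk full", "info all good", "warning cpu hot"))
                .filter(line -> line.contains("warning"))
                .map(line -> line.split(" ")[1]);
        System.out.println(warnings.lineage());
        System.out.println(warnings.collect());
    }
}
```

Because every `collect()` replays the chain from the source, losing an intermediate result costs only recomputation. Spark adds partitioning, caching and distribution on top of this, none of which the sketch attempts.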
  • 82. An alternative backend • Shark: a Hive on Spark • Spork: a Pig on Spark • MLlib: machine learning on Spark • GraphX: graph processing on Spark • Also featuring its own streaming engine
  • 83. How do I use it? val file = spark.textFile("hdfs://...") val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _) counts.saveAsTextFile("hdfs://...")
  • 84. Principle #8 Spark is the technology of 2014
  • 86. What’s new? • True elasticity • Resource partitioning • Security • Data marketplace • Data leaks/breaches
  • 87. Hadoop Maturity • ETL Offload: accommodate massive data growth with existing EDW investments • Data Lakes: unify unstructured and structured data access • Big Data Apps: build analytic-led applications impacting top line revenue • Data-Driven Enterprise: app dev and operational management on HDFS data architecture
  • 88. Pivotal HD on Pivotal CF • Enterprise PaaS Management System • Flexible multi-language ‘buildpack’ architecture • Deployed applications enjoy built-in services • On-Premise Hadoop as a Service • Single cluster deployment of Pivotal HD • Developers instantly bind to shared Hadoop Clusters • Speeds up time-to-value
  • 89. Pivotal Data Fabric Evolution [diagram labels: Analytic Data Marts; SQL Services; Operational Intelligence; In-Memory Database; Run-Time Applications; Data Staging Platform; Data Mgmt. Services; Pivotal Data Platform; Stream Ingestion; Streaming Services; Software-Defined Datacenter; New Data-fabrics; In-Memory Grid; ...ETC]
  • 90. Principle #9 Hadoop in the Cloud is one of many distributed frameworks
  • 91. 2014 is the year of Hadoop [same stack diagram as slide 6]
  • 92. A NEW PLATFORM FOR A NEW ERA
  • 93. Credits • Apache Software Foundation • Milind Bhandarkar • Konstantin Boudnik • Robert Geiger • Susheel Kaushik • Mak Gokhale