Hadoop: The Default Machine Learning Platform?
Milind Bhandarkar
Chief Scientist, Pivotal
@techmilind
Wednesday, December 18, 2013
About Me
• http://www.linkedin.com/in/milindb
• Founding member of the Hadoop team at Yahoo! [2005-2010]
• Contributor to Apache Hadoop since v0.1
• Built and led the Grid Solutions Team at Yahoo! [2007-2010]
• Parallel Programming Paradigms [1989-today] (PhD cs.illinois.edu)
• Center for Development of Advanced Computing (C-DAC), National Center for Supercomputing Applications (NCSA), Center for Simulation of Advanced Rockets, Siebel Systems (acquired by Oracle), Pathscale Inc. (acquired by QLogic), Yahoo!, LinkedIn, and Pivotal (formerly Greenplum)
Acknowledgements
•Developers of various Open-Source and Proprietary Data Platforms
•Ex-Colleagues at Yahoo! Research
•Colleagues at Data Science Team, Pivotal
•Vijay Narayanan, Microsoft
Kryptonite: First Hadoop Cluster at Yahoo!
M45: Academic Collaboration
Analytics Workbench
[Figure: Single App (BATCH, HDFS) | Single App (INTERACTIVE) | Single App (BATCH, HDFS) | Single App (BATCH, HDFS) | Single App (ONLINE)]
Hadoop 1.0
(Image Courtesy Arun Murthy, Hortonworks)
MapReduce 1.0
(Image Courtesy Arun Murthy, Hortonworks)
ML in MapReduce: Why?
• High data throughput: 100TB/hr using 500 mappers
• Framework provides fault tolerance
• Monitors tasks and restarts them on other machines should one of the machines fail
• Excels in counting patterns over data records
• Built on relatively cheap, commodity hardware
• Large volumes of data already stored on Hadoop clusters running MapReduce
ML in MapReduce: Why?
• Learning can become limited by computation time and not data volume
• With large enough data and number of machines
• Reduces the need to down-sample data
• More accurate parameter estimates compared to learning on a single machine for the same amount of time
Learning Models in MapReduce
• Data parallel algorithms are most appropriate for MapReduce implementations
• Not necessarily the optimal implementation for a specific algorithm
• Other specialized non-MapReduce implementations exist for some algorithms, which may be better
• MR may not be the appropriate framework for exact solutions of non-data-parallel/sequential algorithms
• Approximate solutions using MR may be good enough
Types of Learning in MapReduce
•Parallel training of multiple models
•Train either in Mappers or Reducers
•Ensemble training methods
•Train multiple models and combine them
•Distributed learning algorithms
•Learn using both Mappers and Reducers
Parallel Training of Multiple Models
• Train multiple models simultaneously using a learning algorithm that can be learnt in memory
• Useful when individual models are trained using a subset, filtered version, or modification of raw data
• Train 1 model in each reducer
• Map:
• Input: All data
• Filters subset of data relevant for each model training
• Output: <model_index, subset of data for training this model>
• Reduce:
• Train model on data corresponding to that model_index
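The map and reduce roles above can be sketched in plain Python, with the shuffle simulated in memory. This is only an illustration: the record layout `(region, value)`, the routing rule, and the use of a mean as the "model" are assumptions, not part of the talk.

```python
from collections import defaultdict
from statistics import mean

def map_record(record):
    # Hypothetical routing rule: each region gets its own model.
    region, value = record
    # Emit <model_index, training example> for the model this record feeds.
    yield region, value

def reduce_train(model_index, examples):
    # Train one in-memory model per key; here "training" is just averaging.
    return model_index, mean(examples)

def run_job(records):
    # Simulated shuffle: group mapper output by model_index.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_record(record):
            groups[key].append(value)
    return dict(reduce_train(k, v) for k, v in groups.items())

models = run_job([("us", 1.0), ("eu", 4.0), ("us", 3.0), ("eu", 6.0)])
```

Each reducer sees only the examples for its own `model_index`, so the models train independently and in parallel.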
Distributed Learning
Algorithms
• Suitable for learning algorithms that are
• Compute-Intensive per data record
• One or few iterations for learning
• Do not transfer much data between iterations
• Typical algorithms
• Fit the Statistical query model (SQM)
• Divide and conquer
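As one example of an algorithm that fits the statistical query model, a least-squares line can be learned from sums alone. In this hedged sketch each "mapper" emits a small tuple of sufficient statistics and a single "reducer" adds them up. The function names and the toy data are assumptions for illustration.

```python
def map_partition(points):
    # Each mapper computes local sufficient statistics for y = slope*x + intercept.
    n = sx = sy = sxx = sxy = 0.0
    for x, y in points:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)

def reduce_stats(partials):
    # The reducer only adds the small per-partition tuples together.
    return tuple(sum(col) for col in zip(*partials))

def solve(stats):
    # Closed-form least-squares solution from the global sums.
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

partitions = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)]]
slope, intercept = solve(reduce_stats(map_partition(p) for p in partitions))
```

Only five numbers per partition cross the network, which matches the "do not transfer much data" criterion above.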
k-Means Clustering
• Choose k samples as initial cluster centroids
• In each MapReduce Iteration:
• Assign membership of each point to closest cluster
• Re-compute new cluster centroids using assigned members	

• Control program to
• Initialize the centroids
• random, initial clustering on sample etc.
• Run the MapReduce iterations
• Determine stopping criterion
k-Means Clustering in MapReduce
• Map
• Input data points: x_i
• Input cluster centroids: c_i
• Assign each data point to closest cluster
• Output
• Reduce
• Compute new centroids for each cluster
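A single k-means iteration in this map/reduce shape might look like the following in-memory sketch. The emitted key/value pair `<cluster index, point>` is an assumption (the slide leaves the map output unspecified), and `math.dist` requires Python 3.8 or later.

```python
import math
from collections import defaultdict

def closest(point, centroids):
    # Index of the nearest centroid by Euclidean distance.
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def kmeans_map(point, centroids):
    # Emit <cluster index, point>; this key choice is an assumption.
    return closest(point, centroids), point

def kmeans_reduce(points):
    # New centroid = coordinate-wise mean of the points assigned to the cluster.
    return tuple(sum(c) / len(points) for c in zip(*points))

def kmeans_iteration(points, centroids):
    groups = defaultdict(list)  # simulated shuffle
    for p in points:
        key, value = kmeans_map(p, centroids)
        groups[key].append(value)
    # Keep the old centroid if a cluster received no points.
    return [kmeans_reduce(groups[i]) if groups[i] else centroids[i]
            for i in range(len(centroids))]

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centroids = kmeans_iteration(points, [(0.0, 0.0), (10.0, 10.0)])
```

In a real job the centroids would be shipped to every mapper and the data points re-read from HDFS in each iteration.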
Complexity of k-Means
Clustering
• Each point is compared with each cluster
centroid
• Complexity = N*K*O(d) where O(d) is the
complexity of the distance metric
• Typical Euclidean distance is not a cheap
operation
• Can reduce complexity using an initial canopy
clustering to partition data cheaply
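The canopy idea can be sketched as follows: a cheap distance and two thresholds group points into overlapping canopies, so that the expensive distance later needs to be evaluated only within a canopy. The Manhattan metric, the threshold values, and the function names are assumptions chosen for illustration.

```python
def cheap_distance(a, b):
    # Manhattan distance as a stand-in for a cheap metric (an assumption).
    return sum(abs(x - y) for x, y in zip(a, b))

def canopies(points, t1, t2):
    # t1 is the loose threshold, t2 the tight one; t1 must exceed t2.
    assert t1 > t2
    remaining = list(points)
    result = []
    while remaining:
        center = remaining.pop(0)
        # Every point within t1 of the center joins this canopy (canopies may overlap).
        members = [p for p in points if cheap_distance(center, p) <= t1]
        result.append((center, members))
        # Points within t2 can no longer become centers of new canopies.
        remaining = [p for p in remaining if cheap_distance(center, p) > t2]
    return result

found = canopies([(0, 0), (1, 0), (9, 9), (10, 9)], t1=3, t2=2)
```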
Apache Mahout
• Goal
• Create scalable machine learning algorithms under the Apache license
• Contains both:
• Hadoop implementations of algorithms that scale linearly with data
• Fast sequential (non-MapReduce) algorithms
• Wiki:
• https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+Wiki
Algorithms in Mahout
• Classification:
• Logistic Regression, Naïve Bayes, Complementary Naïve
Bayes, Random Forests
• Clustering
• K-means, Fuzzy k-means, Canopy, Mean-shift clustering,
Dirichlet Process clustering, Latent Dirichlet allocation,
Spectral clustering
• Parallel FP growth
• Item based recommendations
• Stochastic Gradient Descent (sequential)
Challenges for ML with
MapReduce
•MapReduce is optimized for large batch data
processing
•Assumes data parallelism
•Ideal for shared-nothing computing
•Many learning algorithms are iterative
•Incur significant overheads per iteration
Challenges (contd.)
•Multiple scans of the same data
•Typically once per iteration: High I/O
overhead reading data into mappers per
iteration
•In some algorithms static data is read into
mappers in each iteration
•e.g. input data in k-means clustering.
Challenges (contd.)
•Need a separate controller outside the
framework to:
•Coordinate the multiple MapReduce jobs
for each iteration
•Perform some computations between
iterations and at the end
•Measure and implement stopping criterion
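A controller of this kind can be sketched as a plain driver loop. Here `run_iteration` is a hypothetical stand-in for submitting one MapReduce job and reading back its output, and the toy job (a Newton step toward the square root of 2) exists only to make the loop runnable.

```python
def run_until_converged(run_iteration, state, distance, tol=1e-6, max_iters=100):
    # run_iteration stands in for launching one MapReduce job per iteration.
    for i in range(1, max_iters + 1):
        new_state = run_iteration(state)
        # Stopping criterion: the state changed by at most tol.
        if distance(state, new_state) <= tol:
            return new_state, i
        state = new_state
    return state, max_iters

# Toy "job": one Newton step toward sqrt(2), used only to exercise the loop.
result, iterations = run_until_converged(
    run_iteration=lambda x: 0.5 * (x + 2.0 / x),
    state=1.0,
    distance=lambda a, b: abs(a - b),
)
```

The driver itself is cheap. The cost discussed on the following slides comes from what each call to `run_iteration` implies on a cluster: job setup, data loading, and disk I/O.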
Challenges (contd.)
• Incur multiple task initialization overheads
• Setup and tear down mapper and reducer
tasks per iteration
• Transfer/shuffle static data between mapper
and reducer repeatedly
• Intermediate data is transferred through
index/data files on local disks of mappers
and pulled by reducers
Challenges (contd.)
•Blocking architecture
•Reducers cannot start until all map tasks complete
•Availability of nodes in a shared environment
•Wait for mapper and reducer nodes to
become available in each iteration in a
shared computing cluster
Iterative Algorithms in MapReduce
Overhead per Iteration:
• Job setup
• Data Loading
• Disk I/O
Pass Result
Enhancements to
MapReduce
• Many proposals to overcome these challenges
• All try to retain the core strengths of data
partitioning and fault tolerance of Hadoop to
various degrees
• Proposed enhancements and alternatives to
Hadoop
• Worker/Aggregator framework, HaLoop, MapReduce Online, iMapReduce, Spark, Twister, Hadoop ML, ...
Worker/Aggregator
Advantages:
• Schedule once per Job
• Data stays in memory
• P2P communication
Final Result
HaLoop
• Programming model and architecture for iterations
• New APIs to express iterations in the framework
• Loop-aware task scheduling
• Physically co-locate tasks that use the same data
in different iterations
• Remember association between data and node
• Assign task to node that uses data cached in that
node
HaLoop (contd.)
• Caching for loop invariant data:
• Detect invariants in first iteration, cache on local
disk to reduce I/O and shuffling cost in subsequent
iterations
• Cache for Mapper inputs, Reducer Inputs, Reducer
outputs
• Caching to support fixpoint evaluation:
• Avoids the need for a dedicated MR step on each
iteration
Spark
• Open Source Cluster Computing model:
• Different from MapReduce, but retains some basic
character
• Optimized for:
• Iterative computations
• Applies to many learning algorithms
• Interactive data mining
• Load data once into multiple mappers and run
multiple queries
Spark (contd.)
• Programming model using working sets 
• applications reuse intermediate results in multiple parallel
operations
• preserves the fault tolerance of MapReduce
• Supports
• Parallel loops over distributed datasets
• Loads data into memory for (re)use in multiple iterations
• Access to shared variables accessible from multiple machines
• Implemented in Scala
• www.spark-project.org
Hadoop 2.0
(Image Courtesy Arun Murthy, Hortonworks)
HADOOP 1.0
• HDFS (redundant, reliable storage)
• MapReduce (cluster resource management & data processing)
• Pig (data flow), Hive (sql), Others (cascading)
HADOOP 2.0
• HDFS2 (redundant, reliable storage)
• YARN (cluster resource management)
• Tez (execution engine)
• MR (batch)
• RT Stream, Graph (Storm, Giraph)
• Services (HBase)
• Pig (data flow), Hive (sql), Others (cascading)
Applications Run Natively IN Hadoop
• BATCH (MapReduce)
• INTERACTIVE (Tez)
• STREAMING (Storm, S4, …)
• GRAPH (Giraph)
• IN-MEMORY (Spark)
• HPC MPI (OpenMPI)
• ONLINE (HBase)
• OTHER (Search) (Weave…)
YARN (Cluster Resource Management)
HDFS2 (Redundant, Reliable Storage)
YARN Platform
(Image Courtesy Arun Murthy, Hortonworks)
[Figure labels: ResourceManager, Scheduler, Client2, NodeManager (×12), AM 1, Container 1.1, Container 1.2, Container 1.3, AM2, Container 2.1, Container 2.2, Container 2.3, Container 2.4]
YARN Architecture
(Image Courtesy Arun Murthy, Hortonworks)
YARN
•Yet Another Resource Negotiator
•Resource Manager
•Node Managers
•Application Masters
•Specific to paradigm, e.g. MR Application
master (aka JobTracker)
MPP SQL On Hadoop
SQL-on-Hadoop
•Pivotal HAWQ
•Cloudera Impala, Facebook Presto, Apache Drill, Cascading Lingual, Optiq, Hortonworks Stinger
•Hadapt, Jethrodata, IBM BigSQL, Microsoft PolyBase
•More to come...
HAWQ & HDFS
• Master Servers: Planning & dispatch
• Network Interconnect
• Segment Servers: Query execution
• Storage: HDFS, HBase …
[Figure labels: Namenode, Meta Ops, replication, Rack1, Rack2, Datanode, Read/Write, Master host, Segment host, Segment, HAWQ Interconnect]
HAWQ vs Hive
Lower is Better
Provides data-parallel implementations
of mathematical, statistical and machine-learning
methods
for structured and unstructured data.
In-Database Analytics
MADlib Algorithms
MADlib Functions
• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• K-Means
• Association Rules
• Latent Dirichlet Allocation
• Naïve Bayes
• Elastic Net Regression
• Decision Trees / Random Forest
• Support Vector Machines
• Cox Proportional Hazards Regression
• Descriptive Statistics
• ARIMA
k-Means Usage
SELECT * FROM madlib.kmeanspp(
    'customers', -- name of the input table
    'features',  -- name of the feature array column
    2            -- k: number of clusters
);
centroids | objective_fn | frac_reassigned | …
------------------------------------------------------------------------+------------------+-----------------+ …
{{68.01668579784,48.9667382972952},{28.1452167573446,84.5992507653263}} | 586729.010675982 | 0.001 | …
Accessing HAWQ
Through R
Pivotal R
•Interface is R client
•Execution is in database
•Parallelism handled by PivotalR
•Supports a portion of R
R> x = db.data.frame("t1")
R> l = madlib.lm(interlocks ~ assets + nation, data = x)
A wrapper of MADlib
• Linear regression
• Logistic regression
• Elastic Net
• ARIMA
• Table summary
• Categorical variable: as.factor()
• $ [ [[ $<- [<- [[<-
• is.na
• + - * / %% %/% ^
• & | !
• == != > < >= <=
• merge
• by
• db.data.frame
• as.db.data.frame
• preview
• sort
• c mean sum sd var min max length colMeans colSums
• db.connect db.disconnect db.list db.objects db.existsObject delete
• dim names
• content
And more ... (SQL wrapper)
• predict
Pivotal Confidential–Internal Use Only 49
db.obj
• db.data.frame (db.table, db.view): wrapper of objects in database, e.g. x = db.data.frame("table")
• db.Rquery: resides in R only, e.g. x[,1:2], merge(x, y, by="column"), etc.
• Most operations yield a db.Rquery
• as.db.data.frame(...) converts back to a db.data.frame
• MADlib wrapper functions
• preview
In-Database Execution
•All data stays in DB: R objects merely point
to DB objects
•All model estimation and heavy lifting done
in DB by MADlib
•R → SQL translation done in the R client
•Only strings of SQL and model output
transferred across ODBC/DBI
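The pattern of a client object that merely points at a table and composes SQL strings can be illustrated with a small sketch. This is not PivotalR's API: the `LazyTable` class, its methods, and the table and column names are all hypothetical, and the sketch does no quoting or escaping.

```python
class LazyTable:
    """Hypothetical client-side object that points at a database table."""

    def __init__(self, table, columns=("*",), where=None):
        self.table = table
        self.columns = tuple(columns)
        self.where = where

    def select(self, *columns):
        # Returns a new object; no query is sent to the database.
        return LazyTable(self.table, columns, self.where)

    def filter(self, condition):
        # Combine conditions with AND; still nothing is executed.
        where = condition if self.where is None else f"({self.where}) AND ({condition})"
        return LazyTable(self.table, self.columns, where)

    def to_sql(self):
        # Only this string would cross the connection to the database.
        sql = f"SELECT {', '.join(self.columns)} FROM {self.table}"
        return sql if self.where is None else f"{sql} WHERE {self.where}"

query = LazyTable("customers").select("id", "features").filter("age > 30")
sql = query.to_sql()
```

The data never leaves the database in this scheme. The client holds only the description of the query until a result is requested.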
Beyond MapReduce
•Apache Giraph - BSP & Graph Processing
•Storm on YARN - Streaming Computation
•HOYA - HBase on YARN
•Hamster - MPI on Hadoop
•More to come ...
Hamster
• Hadoop and MPI on the same
cluster
• OpenMPI Runtime on Hadoop
YARN
• Hadoop Provides: Resource
Scheduling, Process monitoring,
Distributed File System
• Open MPI Provides: Process
launching, Communication, I/O
forwarding
Hamster Components
•Hamster Application Master
•Gang Scheduler, YARN Application Preemption
•Resource Isolation (lxc Containers)
•ORTE: Hamster Runtime
•Process launching, Wireup, Interconnect
Hamster Architecture
[Figure labels: Client, Resource Manager (Scheduler, AMService), Node Manager (×3), Aux Srvcs, Framework Daemon, NS, Proc/Container, MPI AM, MPI Scheduler, HNP; links: Client-RM, RM-AM, AM-NM, RM-NodeManager]
Hamster Scalability
•Sufficient for small to medium HPC workloads
•Job launch time gated by YARN resource scheduler
        | Launch  | WireUp  | Collectives | Monitor
OpenMPI | O(logN) | O(logN) | O(logN)     | O(logN)
Hamster | O(N)    | O(logN) | O(logN)     | O(logN)
GraphLab + Hamster
on Hadoop
About GraphLab
•Graph-based, high-performance distributed computation framework
•Started by Prof. Carlos Guestrin at CMU in 2009
•Recently founded Graphlab Inc to commercialize Graphlab.org
GraphLab Features
•Topic Modeling (e.g. LDA)
•Graph Analytics (PageRank, Triangle counting)
•Clustering (K-Means)
•Collaborative Filtering
•Linear Solvers
•etc...
Only Graphs are not Enough
• Full data processing workflow requires ETL/Postprocessing, Visualization, Data Wrangling, Serving
• MapReduce excels at data wrangling
• OLTP/NoSQL Row-Based stores excel at Serving
• GraphLab should co-exist with other Hadoop frameworks
Questions?
Writing Scalable Software in Java
 
Asd 2015
Asd 2015Asd 2015
Asd 2015
 

Hadoop: The Default Machine Learning Platform ?

  • 1. Hadoop: The Default Machine Learning Platform? • Milind Bhandarkar, Chief Scientist, Pivotal • @techmilind • Wednesday, December 18, 2013
  • 2. About Me • http://www.linkedin.com/in/milindb • Founding member of Hadoop team at Yahoo! [2005-2010] • Contributor to Apache Hadoop since v0.1 • Built and led Grid Solutions Team at Yahoo! [2007-2010] • Parallel Programming Paradigms [1989-today] (PhD cs.illinois.edu) • Center for Development of Advanced Computing (C-DAC), National Center for Supercomputing Applications (NCSA), Center for Simulation of Advanced Rockets, Siebel Systems (acquired by Oracle), Pathscale Inc. (acquired by QLogic), Yahoo!, LinkedIn, and Pivotal (formerly Greenplum)
  • 3. Acknowledgements • Developers of various open-source and proprietary data platforms • Ex-colleagues at Yahoo! Research • Colleagues at Data Science Team, Pivotal • Vijay Narayanan, Microsoft
  • 5. Kryptonite: First Hadoop Cluster at Yahoo!
  • 10. MapReduce 1.0 (Image Courtesy Arun Murthy, Hortonworks)
  • 16. ML in MapReduce: Why? • High data throughput: 100 TB/hr using 500 mappers • Framework provides fault tolerance • Monitors and restarts tasks on other machines should one of the machines fail • Excels in counting patterns over data records • Built on relatively cheap, commodity hardware • Large volumes of data already stored on Hadoop clusters running MapReduce
  • 17. ML in MapReduce: Why? • Learning can become limited by computation time and not data volume • With large enough data and number of machines • Reduces the need to down-sample data • More accurate parameter estimates compared to learning on a single machine for the same amount of time
  • 18. Learning Models in MapReduce • Data-parallel algorithms are most appropriate for MapReduce implementations • Not necessarily the optimal implementation for a specific algorithm • Other specialized non-MapReduce implementations exist for some algorithms, which may be better • MR may not be the appropriate framework for exact solutions of non-data-parallel/sequential algorithms • Approximate solutions using MR may be good enough
  • 19. Types of Learning in MapReduce • Parallel training of multiple models • Train either in Mappers or Reducers • Ensemble training methods • Train multiple models and combine them • Distributed learning algorithms • Learn using both Mappers and Reducers
  • 20. Parallel Training of Multiple Models • Train multiple models simultaneously using a learning algorithm that can be learnt in memory • Useful when individual models are trained using a subset, filtered version, or modification of the raw data • Train 1 model in each reducer • Map: • Input: All data • Filters subset of data relevant for each model training • Output: <model_index, subset of data for training this model> • Reduce: • Train model on data corresponding to that model_index
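
The map and reduce steps of slide 20 can be sketched in plain Python, with an in-memory dictionary standing in for the shuffle. This is a minimal sketch, not code from the talk: the record fields, the `MODEL_FILTERS` table, and the use of a per-group mean as the "model" are all illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical filters: one predicate per model index (an assumption for illustration).
MODEL_FILTERS = {
    0: lambda rec: rec["country"] == "US",
    1: lambda rec: rec["country"] == "IN",
    2: lambda rec: rec["age"] >= 30,
}


def map_record(rec):
    """Map: emit <model_index, record> for every model whose filter accepts the record."""
    for model_index, keep in MODEL_FILTERS.items():
        if keep(rec):
            yield model_index, rec


def reduce_train(model_index, records):
    """Reduce: train one in-memory model per key. Here the 'model' is just a mean."""
    return {
        "model_index": model_index,
        "n": len(records),
        "mean_spend": mean(r["spend"] for r in records),
    }


def run_job(data):
    """Simulate map, shuffle (group by key), and reduce on a single machine."""
    groups = defaultdict(list)
    for rec in data:
        for key, value in map_record(rec):
            groups[key].append(value)
    return {key: reduce_train(key, values) for key, values in groups.items()}


DATA = [
    {"country": "US", "age": 25, "spend": 10.0},
    {"country": "US", "age": 40, "spend": 30.0},
    {"country": "IN", "age": 35, "spend": 20.0},
]
MODELS = run_job(DATA)
```

Because each reducer sees only the records for its own `model_index`, the models train independently, which is what allows one model per reducer.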
  • 21. Distributed Learning Algorithms • Suitable for learning algorithms that are • Compute-intensive per data record • One or few iterations for learning • Do not transfer much data between iterations • Typical algorithms • Fit the Statistical Query Model (SQM) • Divide and conquer
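
One way to read the SQM bullet is that the learner only needs sums over the data, so each mapper can emit a small tuple of partial sums and a reducer can add them up. The sketch below assumes simple one-variable least-squares regression as the example; the function names and the toy partitions are hypothetical, not taken from the talk.

```python
def map_partition(points):
    """Map: reduce one data partition to five partial sums (n, Σx, Σy, Σx², Σxy)."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in points:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)


def reduce_sums(partials):
    """Reduce: add the partial sums component-wise."""
    return tuple(sum(column) for column in zip(*partials))


def solve(stats):
    """Driver: closed-form least-squares fit from the aggregated sums."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Two partitions of points lying on y = 2x + 1 (toy data).
PARTITIONS = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)]]
SLOPE, INTERCEPT = solve(reduce_sums(map_partition(p) for p in PARTITIONS))
```

Only five numbers per partition cross the network, regardless of partition size, which matches the "do not transfer much data" criterion on the slide.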
  • 22. k-Means Clustering • Choose k samples as initial cluster centroids • In each MapReduce iteration: • Assign membership of each point to closest cluster • Re-compute new cluster centroids using assigned members • Control program to • Initialize the centroids • random, initial clustering on sample etc. • Run the MapReduce iterations • Determine stopping criterion
  • 25. k-Means Clustering in MapReduce • Map • Input data points: x_i • Input cluster centroids: c_i • Assign each data point to closest cluster • Output • Reduce • Compute new centroids for each cluster
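
The map step, reduce step, and outer control program described for k-means can be sketched as follows. This is a single-machine simulation under stated assumptions: the map output is taken to be <cluster index, point> (the slide's own output format did not survive extraction), the stopping criterion is maximum centroid movement, and an empty cluster keeps its previous centroid.

```python
import math
from collections import defaultdict


def kmeans_map(point, centroids):
    """Map: assign a point to its closest centroid; emit <cluster_index, point>."""
    index = min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))
    return index, point


def kmeans_reduce(points):
    """Reduce: the new centroid is the coordinate-wise mean of the assigned points."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def kmeans_iteration(data, centroids):
    """One simulated MapReduce job: map every point, group by key, reduce per cluster."""
    groups = defaultdict(list)
    for point in data:
        key, value = kmeans_map(point, centroids)
        groups[key].append(value)
    # Assumption: a cluster that received no points keeps its old centroid.
    return [
        kmeans_reduce(groups[i]) if i in groups else centroids[i]
        for i in range(len(centroids))
    ]


def kmeans(data, centroids, max_iter=20, tol=1e-9):
    """Control program: run iterations until the centroids stop moving."""
    for _ in range(max_iter):
        updated = kmeans_iteration(data, centroids)
        shift = max(math.dist(old, new) for old, new in zip(centroids, updated))
        centroids = updated
        if shift <= tol:
            break
    return centroids


POINTS = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
CENTROIDS = kmeans(POINTS, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

Note that `kmeans_iteration` rereads all of `data` on every pass; on a real cluster each pass is a separate job, which is the per-iteration overhead discussed later in the deck.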
  • 26. Complexity of k-Means Clustering • Each point is compared with each cluster centroid • Complexity = N*K*O(d) for N points and K clusters, where O(d) is the cost of one distance computation • Typical Euclidean distance is not a cheap operation • Can reduce complexity using an initial canopy clustering to partition data cheaply
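
A rough sketch of the canopy idea mentioned above: a cheap distance and two thresholds group points into overlapping canopies, so that the expensive distance only needs to be evaluated within a canopy. The choice of Manhattan distance as the cheap metric and the threshold values are assumptions for illustration, not details from the talk.

```python
def cheap_distance(a, b):
    """Manhattan distance: no squares or square roots (assumed cheap metric)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def canopies(points, t1, t2):
    """Build canopies with a loose threshold t1 and a tight threshold t2 (t1 > t2).

    Points within t1 of a center join its canopy; points within t2 are also
    removed from the candidate list, so they cannot start a new canopy.
    """
    if t1 <= t2:
        raise ValueError("t1 must be larger than t2")
    remaining = list(points)
    result = []
    while remaining:
        center = remaining.pop(0)
        members = [center] + [p for p in remaining if cheap_distance(center, p) < t1]
        remaining = [p for p in remaining if cheap_distance(center, p) >= t2]
        result.append((center, members))
    return result


POINTS = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
CANOPIES = canopies(POINTS, t1=5.0, t2=3.0)
```

With the canopies in hand, a k-means mapper would compare a point only against centroids that share a canopy with it, instead of all K.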
  • 27. Apache Mahout • Goal • Create scalable machine learning algorithms under the Apache license • Contains both: • Hadoop implementations of algorithms that scale linearly with data • Fast sequential (non-MapReduce) algorithms • Wiki: • https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+Wiki
  • 28. Algorithms in Mahout • Classification: • Logistic Regression, Naïve Bayes, Complementary Naïve Bayes, Random Forests • Clustering: • K-means, Fuzzy k-means, Canopy, Mean-shift clustering, Dirichlet Process clustering, Latent Dirichlet allocation, Spectral clustering • Parallel FP growth • Item based recommendations • Stochastic Gradient Descent (sequential)
  • 29. Challenges for ML with MapReduce • MapReduce is optimized for large batch data processing • Assumes data parallelism • Ideal for shared-nothing computing • Many learning algorithms are iterative • Incur significant overheads per iteration
  • 30. Challenges (contd.) • Multiple scans of the same data • Typically once per iteration: high I/O overhead reading data into mappers per iteration • In some algorithms static data is read into mappers in each iteration • e.g. input data in k-means clustering
  • 31. Challenges (contd.) • Need a separate controller outside the framework to: • Coordinate the multiple MapReduce jobs for each iteration • Perform some computations between iterations and at the end • Measure and implement stopping criterion
  • 32. Challenges (contd.) • Incur multiple task initialization overheads • Set up and tear down mapper and reducer tasks per iteration • Transfer/shuffle static data between mapper and reducer repeatedly • Intermediate data is transferred through index/data files on local disks of mappers and pulled by reducers
  • 33. Challenges (contd.) • Blocking architecture • Reducers cannot start till all map tasks complete • Availability of nodes in a shared environment • Wait for mapper and reducer nodes to become available in each iteration in a shared computing cluster
  • 35. Iterative Algorithms in MapReduce • Overhead per iteration: • Job setup • Data loading • Disk I/O • [Figure label: Pass Result]
  • 36. Enhancements to MapReduce • Many proposals to overcome these challenges • All try to retain the core strengths of data partitioning and fault tolerance of Hadoop to various degrees • Proposed enhancements and alternatives to Hadoop • Worker/Aggregator framework, HaLoop, MapReduce Online, iMapReduce, Spark, Twister, Hadoop ML, ...
  • 38. Worker/Aggregator • Advantages: • Schedule once per job • Data stays in memory • P2P communication • [Figure label: Final Result]
  • 39. HaLoop • Programming model and architecture for iterations • New APIs to express iterations in the framework • Loop-aware task scheduling • Physically co-locate tasks that use the same data in different iterations • Remember association between data and node • Assign task to node that uses data cached in that node
  • 40. HaLoop (contd.) • Caching for loop-invariant data: • Detect invariants in first iteration, cache on local disk to reduce I/O and shuffling cost in subsequent iterations • Cache for Mapper inputs, Reducer inputs, Reducer outputs • Caching to support fixpoint evaluation: • Avoids the need for a dedicated MR step on each iteration
  • 41. Spark • Open-source cluster computing model: • Different from MapReduce, but retains some basic character • Optimized for: • Iterative computations • Applies to many learning algorithms • Interactive data mining • Load data once into multiple mappers and run multiple queries
  • 42. Spark (contd.) • Programming model using working sets • Applications reuse intermediate results in multiple parallel operations • Preserves the fault tolerance of MapReduce • Supports • Parallel loops over distributed datasets • Loads data into memory for (re)use in multiple iterations • Access to shared variables accessible from multiple machines • Implemented in Scala • www.spark-project.org
  • 43. Hadoop 2.0 [Figure: HADOOP 1.0 and HADOOP 2.0 stacks side by side; component labels not recoverable] (Image Courtesy Arun Murthy, Hortonworks)
  • 45. YARN Architecture [Figure: component labels not recoverable] (Image Courtesy Arun Murthy, Hortonworks)
  • 46. YARN • Yet Another Resource Negotiator • Resource Manager • Node Managers • Application Masters • Specific to paradigm, e.g. MR Application Master (aka JobTracker)
  • 47. MPP SQL on Hadoop
  • 48. SQL-on-Hadoop • Pivotal HAWQ • Cloudera Impala, Facebook Presto, Apache Drill, Cascading Lingual, Optiq, Hortonworks Stinger • Hadapt, Jethrodata, IBM BigSQL, Microsoft PolyBase • More to come...
  • 49. HAWQ & HDFS [Figure] • Master servers: planning & dispatch • Segment servers: query execution • Network interconnect • Storage: HDFS, HBase, …
  • 50. [Figure labels: Namenode, B replication, Rack1, Rack2, Datanode, Read/Write, Master host, Meta Ops, Segment host, Segment, HAWQ Interconnect]
  • 51. HAWQ vs Hive [Chart: lower is better]
  • 52. In-Database Analytics • Provides data-parallel implementations of mathematical, statistical and machine-learning methods for structured and unstructured data
  • 54. MADlib Functions • Linear Regression • Logistic Regression • Multinomial Logistic Regression • K-Means • Association Rules • Latent Dirichlet Allocation • Naïve Bayes • Elastic Net Regression • Decision Trees / Random Forest • Support Vector Machines • Cox Proportional Hazards Regression • Descriptive Statistics • ARIMA
  • 55. k-Means Usage
      SELECT * FROM madlib.kmeanspp(
          'customers',  -- name of the input table
          'features',   -- name of the feature array column
          2             -- k: number of clusters
      );
      centroids | objective_fn | frac_reassigned | …
      ------------------------------------------------------------------------+------------------+-----------------+ …
      {{68.01668579784,48.9667382972952},{28.1452167573446,84.5992507653263}} | 586729.010675982 | 0.001 | …
  • 57. Pivotal R • Interface is R client • Execution is in database • Parallelism handled by PivotalR • Supports a portion of R
      R> x = db.data.frame("t1")
      R> l = madlib.lm(interlocks ~ assets + nation, data = x)
  • 61. A wrapper of MADlib • Linear regression • Logistic regression • Elastic Net • ARIMA • Table summary • Categorical variable: as.factor() • $ [ [[ $<- [<- [[<- • is.na • + - * / %% %/% ^ • & | ! • == != > < >= <= • merge • by • db.data.frame • as.db.data.frame • preview • sort • c mean sum sd var min max length colMeans colSums • db.connect db.disconnect db.list db.objects db.existsObject delete • dim names • content • And more ... (SQL wrapper) • predict
  • 71. [Diagram of PivotalR object classes] • Classes: db.obj, db.data.frame, db.Rquery, db.table, db.view • db.data.frame: wrapper of objects in database, e.g. x = db.data.frame("table") • db.Rquery: resides in R only, e.g. x[,1:2], merge(x, y, by="column"), etc. • Diagram labels: Most operations, as.db.data.frame(...), MADlib wrapper functions, preview
  • 72. In-Database Execution • All data stays in DB: R objects merely point to DB objects • All model estimation and heavy lifting done in DB by MADlib • R → SQL translation done in the R client • Only strings of SQL and model output transferred across ODBC/DBI
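
The idea that client objects "merely point to DB objects" and that only SQL strings travel to the database can be illustrated with a small lazy query builder. This is a sketch of the general pattern only, not PivotalR's implementation: the `LazyTable` class and its methods are hypothetical, and the table and column names are reused from the slides purely as sample values.

```python
class LazyTable:
    """Client-side handle to a database table; holds a description, never the rows."""

    def __init__(self, source, columns=("*",), condition=None):
        self.source = source
        self.columns = tuple(columns)
        self.condition = condition

    def select(self, *columns):
        """Return a new handle with a column projection; no query is executed."""
        return LazyTable(self.source, columns, self.condition)

    def where(self, condition):
        """Return a new handle with a row filter; no query is executed."""
        return LazyTable(self.source, self.columns, condition)

    def to_sql(self):
        """Translate the accumulated operations into a single SQL string.

        Plain string formatting is used for brevity; real code would need
        proper identifier quoting and parameter binding.
        """
        sql = f"SELECT {', '.join(self.columns)} FROM {self.source}"
        if self.condition:
            sql += f" WHERE {self.condition}"
        return sql


customers = LazyTable("customers")
subset = customers.select("assets", "nation").where("assets > 100")
SQL_TEXT = subset.to_sql()
```

Here `subset` is still only a description held by the client; the generated string in `SQL_TEXT` is the only thing that would be sent to the database for execution.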
  • 73. Beyond MapReduce • Apache Giraph - BSP & Graph Processing • Storm on YARN - Streaming Computation • HOYA - HBase on YARN • Hamster - MPI on Hadoop • More to come ...
  • 74. Hamster • Hadoop and MPI on the same cluster • Open MPI runtime on Hadoop YARN • Hadoop provides: resource scheduling, process monitoring, distributed file system • Open MPI provides: process launching, communication, I/O forwarding
  • 75. Hamster Components • Hamster Application Master • Gang Scheduler, YARN Application Preemption • Resource Isolation (lxc Containers) • ORTE: Hamster Runtime • Process launching, Wireup, Interconnect
  • 76. Hamster Architecture [Figure labels: Client, Resource Manager, Scheduler, AMService, Node Manager, Aux Srvcs, MPI AM, MPI Scheduler, HNP, Framework Daemon, NS, Proc/Container; links: Client-RM, RM-AM, AM-NM, RM-NodeManager]
  • 77. Hamster Scalability • Sufficient for small to medium HPC workloads • Job launch time gated by YARN resource scheduler
              | Launch   | WireUp   | Collectives | Monitor
      OpenMPI | O(log N) | O(log N) | O(log N)    | O(log N)
      Hamster | O(N)     | O(log N) | O(log N)    | O(log N)
  • 78. GraphLab + Hamster on Hadoop!
  • 79. About GraphLab • Graph-based, high-performance distributed computation framework • Started by Prof. Carlos Guestrin at CMU in 2009 • Recently founded GraphLab Inc. to commercialize Graphlab.org
  • 80. GraphLab Features • Topic Modeling (e.g. LDA) • Graph Analytics (PageRank, Triangle counting) • Clustering (K-Means) • Collaborative Filtering • Linear Solvers • etc.
  • 81. Only Graphs are not Enough • Full data processing workflow required: ETL/postprocessing, visualization, data wrangling, serving • MapReduce excels at data wrangling • OLTP/NoSQL row-based stores excel at serving • GraphLab should co-exist with other Hadoop frameworks