8/9/2013 © MapR Confidential 1
MapR Architecture and Machine Learning
Outline
• MapR system overview
• Map-reduce review
• MapR architecture
• Performance Results
• Map-reduce on MapR
• Machine learning on MapR
Map-Reduce
[Figure: map-reduce data flow with stages labeled Input, Shuffle, and Output]
Bottlenecks and Issues
• Read-only files
• Many copies in I/O path
• Shuffle based on HTTP
• Can’t use new technologies
• Eats file descriptors
• Spills go to local file space
• Bad for skewed distribution of sizes
MapR Improvements
• Faster file system
• Fewer copies
• Multiple NICs
• No file descriptor or page-buf competition
• Faster map-reduce
• Uses distributed file system
• Direct RPC to receiver
• Very wide merges
MapR Innovations
• Volumes
• Distributed management
• Data placement
• Read/write random access file system
• Allows distributed meta-data
• Improved scaling
• Enables NFS access
• Application-level NIC bonding
• Transactionally correct snapshots and mirrors
MapR's Containers
Files/directories are sharded into blocks, which are placed into mini NNs (containers) on disks
• Each container contains
• Directories & files
• Data blocks
• Replicated on servers
• No need to manage directly
Containers are 16-32 GB segments of disk, placed on nodes
Container locations and replication
Container location database (CLDB) keeps track of nodes hosting each container
[Figure: the CLDB lists the hosting nodes for each container (N1, N2; N3, N2; N1, N2; N1, N3; N3, N2) across nodes N1, N2, and N3]
MapR Scaling
Containers represent 16 - 32GB of data
• Each can hold up to 1 billion files and directories
• 100M containers = ~2 exabytes (a very large cluster)
250 bytes DRAM to cache a container
• 25GB to cache all containers for 2EB cluster
• But not necessary, can page to disk
• Typical large 10PB cluster needs 2GB
Container-reports are 100x - 1000x smaller than HDFS block-reports
• Serve 100x more data-nodes
• Increase container size to 64G to serve 4EB cluster
• Map/reduce not affected
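The container arithmetic on this slide can be checked with a few lines of Python. This is only a back-of-envelope sketch: the function names are made up for illustration, and decimal units (1 GB = 10^9 bytes) are an assumption, since the slide does not say which convention it uses.

```python
# Back-of-envelope check of the container scaling figures quoted on the slide.
# Decimal units are an assumption; the slide does not say GB vs GiB.
GB = 10**9
EB = 10**18


def cluster_capacity_bytes(num_containers, container_size_gb):
    """Total data addressed by num_containers containers of a given size."""
    return num_containers * container_size_gb * GB


def cldb_cache_bytes(num_containers, bytes_per_container=250):
    """DRAM needed to cache the location entry of every container."""
    return num_containers * bytes_per_container


containers = 100_000_000  # 100M containers

low = cluster_capacity_bytes(containers, 16) / EB
high = cluster_capacity_bytes(containers, 32) / EB
cache_gb = cldb_cache_bytes(containers) / GB

print(f"capacity: {low:.1f} to {high:.1f} EB")
print(f"cache for all containers: {cache_gb:.0f} GB")
```

With 16 to 32 GB containers the capacity range brackets the "~2 exabytes" on the slide, and 250 bytes per container gives the quoted 25GB cache.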
MapR's Streaming Performance
[Figure: read and write throughput in MB per sec (axis 0 to 2250) for Hardware, MapR, and Hadoop; two panels: 11 x 7200rpm SATA and 11 x 15Krpm SAS. Higher is better.]
Tests: i. 16 streams x 120GB ii. 2000 streams x 1GB
Terasort on MapR
[Figure: elapsed time (mins) for MapR and Hadoop; two panels: 1.0 TB (axis 0 to 60) and 3.5 TB (axis 0 to 300). Lower is better.]
10+1 nodes: 8 core, 24GB DRAM, 11 x 1TB SATA 7200 rpm
MUCH faster for some operations
[Figure: file create rate versus # of files (millions), with the annotation "Test stopped here"]
Same 10 nodes …
NFS mounting models
• Export to the world
• NFS gateway runs on selected gateway hosts
• Local server
• NFS gateway runs on local host
• Enables local compression and checksumming
• Export to self
• NFS gateway runs on all data nodes, mounted from localhost
Export to the world
[Figure: an NFS client and four NFS servers]
Local server
[Figure: a client host running the application and an NFS server, connected to the cluster nodes]
Universal export to self
[Figure: a cluster node running both the application and an NFS server, alongside the other cluster nodes]
[Figure: three cluster nodes, each running an NFS server and an application]
Nodes are identical
Sharded text indexing
• Mapper assigns document to shard
• Shard is usually hash of document id
• Reducer indexes all documents for a shard
• Indexes created on local disk
• On success, copy index to DFS
• On failure, delete local files
• Must avoid directory collisions
• can’t use shard id!
• Must manage local disk space
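The shard assignment and the directory-collision point can be sketched as follows. This is a minimal illustration, not the indexing code behind the talk: `shard_for`, `local_index_dir`, and the attempt id argument are hypothetical names, and MD5 is simply one stable hash that could be swapped for any other.

```python
import hashlib
import os


def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a document id to a shard with a stable hash.

    hashlib is used rather than the built-in hash(), which is salted
    per process for strings and so could differ between mapper tasks.
    """
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def local_index_dir(base_dir: str, shard: int, attempt_id: str) -> str:
    """Working directory for one reducer attempt.

    Including the (hypothetical) attempt id keeps a retried or
    speculative attempt for the same shard from colliding with the
    leftovers of an earlier one, which a path built from the shard id
    alone would not.
    """
    return os.path.join(base_dir, f"shard-{shard:05d}", attempt_id)
```

A reducer for a given shard would build its index under `local_index_dir(...)`, copy it to the DFS on success, and delete the directory on failure.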
Conventional data flows
[Figure: input documents pass through map and reducer to local disk, then to clustered index storage, then to the search engine's local disk]
Failure of a reducer causes garbage to accumulate in the local disk.
Failure of search engine requires another download of the index from clustered storage.
Simplified NFS data flows
[Figure: input documents pass through map and reducer to clustered index storage, which the search engine reads]
Failure of a reducer is cleaned up by map-reduce framework.
Search engine reads mirrored index directly.
Application to machine learning
• So now we have the hammer
• Let’s see some nails!
K-means
• Classic E-M based algorithm
• Given cluster centroids,
• Assign each data point to nearest centroid
• Accumulate new centroids
• Rinse, lather, repeat
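The assign-then-accumulate loop above can be written out in plain Python. This is a single-machine sketch for illustration only; the function names are invented here, and leaving an empty cluster's centroid in place is one choice among several.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])),
    )


def kmeans_step(points, centroids):
    """One iteration: assign every point, then recompute each centroid."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for point in points:
        k = nearest(point, centroids)
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += point[d]
    # A centroid that attracted no points is left where it was.
    return [
        [s / counts[k] for s in sums[k]] if counts[k] else list(centroids[k])
        for k in range(len(centroids))
    ]


def kmeans(points, centroids, max_iters=100):
    """Repeat the step until the centroids stop moving."""
    for _ in range(max_iters):
        updated = kmeans_step(points, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids
```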
K-means, the movie
[Figure: a loop in which the input and the current centroids feed "Assign to nearest centroid", followed by "Aggregate new centroids", which produces the next centroids]
But …
Parallel Stochastic Gradient Descent
[Figure: the input feeds "Train sub model", followed by "Average models", which produces the model]
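The train-then-average flow can be sketched with a toy one-dimensional linear model. Everything here is an assumption made for illustration: the model form, the learning rate, and the function names are not from the talk, and the partitions are trained sequentially in one process rather than in parallel mappers.

```python
import random


def sgd_linear(samples, epochs=200, rate=0.05, seed=0):
    """Train a 1-D linear model y = w*x + b with plain SGD on one partition."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(samples)
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            err = (w * x + b) - y
            w -= rate * err * x
            b -= rate * err
    return w, b


def average_models(models):
    """Average the parameters of independently trained sub-models."""
    n = len(models)
    return tuple(sum(m[i] for m in models) / n for i in range(len(models[0])))


def parallel_sgd(partitions):
    """Train one sub-model per partition (the map side), then average them."""
    sub_models = [sgd_linear(part, seed=i) for i, part in enumerate(partitions)]
    return average_models(sub_models)
```

Each sub-model sees only its own partition, so the only data exchanged between workers is the small parameter tuple that gets averaged.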
Variational Dirichlet Assignment
[Figure: the input and the current model feed "Gather sufficient statistics", followed by "Update model", which produces the next model]
Old tricks, new dogs
• Mapper
• Assign point to cluster
• Emit cluster id, (1, point)
• Combiner and reducer
• Sum counts, weighted sum of points
• Emit cluster id, (n, sum/n)
• Output to HDFS
Read from HDFS to local disk by distributed cache
Written by map-reduce
Read from local disk from distributed cache
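The mapper and the combiner/reducer on this slide can be simulated in one process to show why emitting (n, sum/n) lets the same function serve as both combiner and reducer. This is a sketch under stated assumptions: `run_iteration` and the in-memory "shuffle" are stand-ins invented here, not Hadoop or MapR APIs.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])),
    )


def mapper(point, centroids):
    """Assign the point to a cluster and emit (cluster id, (1, point))."""
    yield nearest(point, centroids), (1, tuple(point))


def combine(values):
    """Merge (n, mean) pairs into a single (n, mean) pair.

    Works as combiner and reducer alike: the n-weighted sum of means is
    the sum of the points, so dividing by the total count gives the mean.
    """
    total = 0
    weighted = None
    for n, mean in values:
        total += n
        if weighted is None:
            weighted = [n * m for m in mean]
        else:
            weighted = [w + n * m for w, m in zip(weighted, mean)]
    return total, tuple(w / total for w in weighted)


def run_iteration(splits, centroids):
    """Simulate one job in-process: map and combine per split, then reduce."""
    shuffled = defaultdict(list)
    for split in splits:
        local = defaultdict(list)
        for point in split:
            for key, value in mapper(point, centroids):
                local[key].append(value)
        for key, values in local.items():
            shuffled[key].append(combine(values))  # combiner output
    return {key: combine(values) for key, values in shuffled.items()}
```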
Old tricks, new dogs
• Mapper
• Assign point to cluster
• Emit cluster id, 1, point
• Combiner and reducer
• Sum counts, weighted sum of points
• Emit cluster id, n, sum/n
• Output to HDFS
MapR FS
Read from NFS
Written by map-reduce
Click modeling architecture
[Figure: the input feeds "Feature extraction and down sampling", then a "Data join" with side-data, then "Sequential SGD Learning"; the diagram also carries the labels "Map-reduce" and "Now via NFS"]
Poor man’s Pregel
• Mapper
• Lines in bold can use conventional I/O via NFS
while not done:
    read and accumulate input models
    for each input:
        accumulate model
    write model
    synchronize
    reset input format
emit summary
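A runnable sketch of this loop, using only ordinary file I/O, might look like the following. Several things are assumptions made for illustration: a temporary directory stands in for an NFS-mounted cluster path, the file naming scheme is invented, the "model" is a toy running-mean estimate, and all workers run sequentially in one process so that finishing a step plays the role of "synchronize".

```python
import json
import os
import tempfile


def read_models(shared_dir, step):
    """Read every model written in the given superstep with plain file I/O."""
    models = []
    for name in sorted(os.listdir(shared_dir)):
        if name.startswith(f"step{step}-"):
            with open(os.path.join(shared_dir, name)) as f:
                models.append(json.load(f))
    return models


def write_model(shared_dir, step, worker, model):
    """Write under a temporary name, then rename, so a reader does not
    pick up a half-written file."""
    final = os.path.join(shared_dir, f"step{step}-worker{worker}.json")
    tmp = os.path.join(shared_dir, f".tmp-step{step}-worker{worker}")
    with open(tmp, "w") as f:
        json.dump(model, f)
    os.replace(tmp, final)


def superstep(shared_dir, step, worker, inputs):
    """Loop body for one worker: read and accumulate models, update, write."""
    previous = read_models(shared_dir, step - 1)
    estimate = sum(m["mean"] for m in previous) / len(previous) if previous else 0.0
    for x in inputs:
        estimate += 0.5 * (x - estimate)  # toy update toward each input
    write_model(shared_dir, step, worker, {"mean": estimate})


def run(partitions, steps=20):
    """Drive all workers in one process and return the averaged final model."""
    with tempfile.TemporaryDirectory() as shared_dir:
        for step in range(1, steps + 1):
            for worker, inputs in enumerate(partitions):
                superstep(shared_dir, step, worker, inputs)
        final = read_models(shared_dir, steps)
    return sum(m["mean"] for m in final) / len(final)
```

On a real cluster each worker would be a long-running mapper, and the shared directory would be a path under the NFS mount rather than a temporary directory.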
Trivial visualization interface
• Map-reduce output is visible via NFS
• Legacy visualization just works
$ R
> x <- read.csv("/mapr/my.cluster/home/ted/data/foo.out")
> plot(error ~ t, x)
> q(save="no")
Conclusions
• We used to know all this
• Tab completion used to work
• 5 years of work-arounds have clouded our memories
• We just have to remember the future

Ähnlich wie Lawrence Livermore Labs talk 2011

Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...
Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...
Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...NoSQLmatters
 
Dealing with an Upside Down Internet
Dealing with an Upside Down InternetDealing with an Upside Down Internet
Dealing with an Upside Down InternetMapR Technologies
 
How the Internet of Things are Turning the Internet Upside Down
How the Internet of Things are Turning the Internet Upside DownHow the Internet of Things are Turning the Internet Upside Down
How the Internet of Things are Turning the Internet Upside DownDataWorks Summit
 
How the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownHow the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownTed Dunning
 
Dunning time-series-2015
Dunning time-series-2015Dunning time-series-2015
Dunning time-series-2015Ted Dunning
 
Dealing with an Upside Down Internet With High Performance Time Series Database
Dealing with an Upside Down Internet  With High Performance Time Series DatabaseDealing with an Upside Down Internet  With High Performance Time Series Database
Dealing with an Upside Down Internet With High Performance Time Series DatabaseDataWorks Summit
 
Performance Models for Apache Accumulo
Performance Models for Apache AccumuloPerformance Models for Apache Accumulo
Performance Models for Apache AccumuloSqrrl
 
TriHUG - Beyond Batch
TriHUG - Beyond BatchTriHUG - Beyond Batch
TriHUG - Beyond Batchboorad
 
PhillyDB Talk - Beyond Batch
PhillyDB Talk - Beyond BatchPhillyDB Talk - Beyond Batch
PhillyDB Talk - Beyond Batchboorad
 
Apache Cassandra For Java Developers - Why, What and How. LJC @ UCL October 2014
Apache Cassandra For Java Developers - Why, What and How. LJC @ UCL October 2014Apache Cassandra For Java Developers - Why, What and How. LJC @ UCL October 2014
Apache Cassandra For Java Developers - Why, What and How. LJC @ UCL October 2014Johnny Miller
 
Bryan Thompson, Chief Scientist and Founder, SYSTAP, LLC at MLconf ATL
Bryan Thompson, Chief Scientist and Founder, SYSTAP, LLC at MLconf ATLBryan Thompson, Chief Scientist and Founder, SYSTAP, LLC at MLconf ATL
Bryan Thompson, Chief Scientist and Founder, SYSTAP, LLC at MLconf ATLMLconf
 
MapR-DB – The First In-Hadoop Document Database
MapR-DB – The First In-Hadoop Document DatabaseMapR-DB – The First In-Hadoop Document Database
MapR-DB – The First In-Hadoop Document DatabaseMapR Technologies
 
MapR 5.2: Getting More Value from the MapR Converged Community Edition
MapR 5.2: Getting More Value from the MapR Converged Community EditionMapR 5.2: Getting More Value from the MapR Converged Community Edition
MapR 5.2: Getting More Value from the MapR Converged Community EditionMapR Technologies
 
Container and Kubernetes without limits
Container and Kubernetes without limitsContainer and Kubernetes without limits
Container and Kubernetes without limitsAntje Barth
 
MapR M7: Providing an enterprise quality Apache HBase API
MapR M7: Providing an enterprise quality Apache HBase APIMapR M7: Providing an enterprise quality Apache HBase API
MapR M7: Providing an enterprise quality Apache HBase APImcsrivas
 
Building HBase Applications - Ted Dunning
Building HBase Applications - Ted DunningBuilding HBase Applications - Ted Dunning
Building HBase Applications - Ted DunningMapR Technologies
 
Demonstrating 100 Gbps in and out of the Clouds
Demonstrating 100 Gbps in and out of the CloudsDemonstrating 100 Gbps in and out of the Clouds
Demonstrating 100 Gbps in and out of the CloudsIgor Sfiligoi
 
Apache geode
Apache geodeApache geode
Apache geodeYogesh BG
 

Ähnlich wie Lawrence Livermore Labs talk 2011 (20)

Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...
Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...
Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...
 
Dealing with an Upside Down Internet
Dealing with an Upside Down InternetDealing with an Upside Down Internet
Dealing with an Upside Down Internet
 
How the Internet of Things are Turning the Internet Upside Down
How the Internet of Things are Turning the Internet Upside DownHow the Internet of Things are Turning the Internet Upside Down
How the Internet of Things are Turning the Internet Upside Down
 
How the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownHow the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside Down
 
Dunning time-series-2015
Dunning time-series-2015Dunning time-series-2015
Dunning time-series-2015
 
Dealing with an Upside Down Internet With High Performance Time Series Database
Dealing with an Upside Down Internet  With High Performance Time Series DatabaseDealing with an Upside Down Internet  With High Performance Time Series Database
Dealing with an Upside Down Internet With High Performance Time Series Database
 
Performance Models for Apache Accumulo
Performance Models for Apache AccumuloPerformance Models for Apache Accumulo
Performance Models for Apache Accumulo
 
TriHUG - Beyond Batch
TriHUG - Beyond BatchTriHUG - Beyond Batch
TriHUG - Beyond Batch
 
PhillyDB Talk - Beyond Batch
PhillyDB Talk - Beyond BatchPhillyDB Talk - Beyond Batch
PhillyDB Talk - Beyond Batch
 
Apache Cassandra For Java Developers - Why, What and How. LJC @ UCL October 2014
Apache Cassandra For Java Developers - Why, What and How. LJC @ UCL October 2014Apache Cassandra For Java Developers - Why, What and How. LJC @ UCL October 2014
Apache Cassandra For Java Developers - Why, What and How. LJC @ UCL October 2014
 
Bryan Thompson, Chief Scientist and Founder, SYSTAP, LLC at MLconf ATL
Bryan Thompson, Chief Scientist and Founder, SYSTAP, LLC at MLconf ATLBryan Thompson, Chief Scientist and Founder, SYSTAP, LLC at MLconf ATL
Bryan Thompson, Chief Scientist and Founder, SYSTAP, LLC at MLconf ATL
 
MapR-DB – The First In-Hadoop Document Database
MapR-DB – The First In-Hadoop Document DatabaseMapR-DB – The First In-Hadoop Document Database
MapR-DB – The First In-Hadoop Document Database
 
MapR 5.2: Getting More Value from the MapR Converged Community Edition
MapR 5.2: Getting More Value from the MapR Converged Community EditionMapR 5.2: Getting More Value from the MapR Converged Community Edition
MapR 5.2: Getting More Value from the MapR Converged Community Edition
 
Container and Kubernetes without limits
Container and Kubernetes without limitsContainer and Kubernetes without limits
Container and Kubernetes without limits
 
MapR M7: Providing an enterprise quality Apache HBase API
MapR M7: Providing an enterprise quality Apache HBase APIMapR M7: Providing an enterprise quality Apache HBase API
MapR M7: Providing an enterprise quality Apache HBase API
 
Building HBase Applications - Ted Dunning
Building HBase Applications - Ted DunningBuilding HBase Applications - Ted Dunning
Building HBase Applications - Ted Dunning
 
Cmu 2011 09.pptx
Cmu 2011 09.pptxCmu 2011 09.pptx
Cmu 2011 09.pptx
 
HBase with MapR
HBase with MapRHBase with MapR
HBase with MapR
 
Demonstrating 100 Gbps in and out of the Clouds
Demonstrating 100 Gbps in and out of the CloudsDemonstrating 100 Gbps in and out of the Clouds
Demonstrating 100 Gbps in and out of the Clouds
 
Apache geode
Apache geodeApache geode
Apache geode
 

Mehr von MapR Technologies

Converging your data landscape
Converging your data landscapeConverging your data landscape
Converging your data landscapeMapR Technologies
 
ML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationMapR Technologies
 
Self-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataSelf-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataMapR Technologies
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureMapR Technologies
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...MapR Technologies
 
ML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsMapR Technologies
 
Machine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMachine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMapR Technologies
 
Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action MapR Technologies
 
Live Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsLive Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsMapR Technologies
 
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageBringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageMapR Technologies
 
Live Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionLive Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionMapR Technologies
 
An Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformAn Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformMapR Technologies
 
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...MapR Technologies
 
Best Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareBest Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareMapR Technologies
 
Geo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsGeo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsMapR Technologies
 
MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Technologies
 
3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data AnalyticsMapR Technologies
 
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsCisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsMapR Technologies
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR Technologies
 
Evolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLEvolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLMapR Technologies
 

Mehr von MapR Technologies (20)

Converging your data landscape
Converging your data landscapeConverging your data landscape
Converging your data landscape
 
ML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & Evaluation
 
Self-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataSelf-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your Data
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data Capture
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
 
ML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning Logistics
 
Machine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMachine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model Management
 
Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action
 
Live Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsLive Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIs
 
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageBringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
 
Live Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionLive Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn Prediction
 
An Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformAn Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data Platform
 
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
 
Best Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareBest Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in Healthcare
 
Geo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsGeo-Distributed Big Data and Analytics
Geo-Distributed Big Data and Analytics
 
MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Product Update - Spring 2017
MapR Product Update - Spring 2017
 
3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics
 
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsCisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT Better
 
Evolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLEvolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQL
 

Lawrence Livermore Labs talk 2011

  • 1. MapR Architecture and Machine Learning (8/9/2013, © MapR Confidential)
  • 2. Outline
      • MapR system overview
      • Map-reduce review
      • MapR architecture
      • Performance Results
      • Map-reduce on MapR
      • Machine learning on MapR
  • 3. Map-Reduce (diagram labels: Input, Shuffle, Output)
  • 4. Bottlenecks and Issues
      • Read-only files
      • Many copies in I/O path
      • Shuffle based on HTTP
      • Can't use new technologies
      • Eats file descriptors
      • Spills go to local file space
      • Bad for skewed distribution of sizes
  • 5. MapR Improvements
      • Faster file system
      • Fewer copies
      • Multiple NICs
      • No file descriptor or page-buf competition
      • Faster map-reduce
      • Uses distributed file system
      • Direct RPC to receiver
      • Very wide merges
  • 6. MapR Innovations
      • Volumes
      • Distributed management
      • Data placement
      • Read/write random access file system
      • Allows distributed meta-data
      • Improved scaling
      • Enables NFS access
      • Application-level NIC bonding
      • Transactionally correct snapshots and mirrors
  • 7. MapR's Containers
      • Files/directories are sharded into blocks, which are placed into mini NNs (containers) on disks
      • Containers are 16-32 GB segments of disk, placed on nodes
      • Each container contains
      • Directories & files
      • Data blocks
      • Replicated on servers
      • No need to manage directly
  • 8. Container locations and replication
      • Container location database (CLDB) keeps track of nodes hosting each container
      (diagram: CLDB entries "N1, N2", "N3, N2", "N1, N2", "N1, N3", "N3, N2" for nodes N1, N2, N3)
  • 9. MapR Scaling
      • Containers represent 16 - 32GB of data
      • Each can hold up to 1 Billion files and directories
      • 100M containers = ~ 2 Exabytes (a very large cluster)
      • 250 bytes DRAM to cache a container
      • 25GB to cache all containers for 2EB cluster
      • But not necessary, can page to disk
      • Typical large 10PB cluster needs 2GB
      • Container-reports are 100x - 1000x < HDFS block-reports
      • Serve 100x more data-nodes
      • Increase container size to 64G to serve 4EB cluster
      • Map/reduce not affected
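The scaling figures on slide 9 can be checked with a few lines of arithmetic. The sketch below is only a back-of-envelope calculation; the function names are illustrative, and decimal units (1 GB = 10^9 bytes) are an assumption because the slide does not say which convention it uses.

```python
# Back-of-envelope check of the container-cache figures quoted on the slide.
# Assumption: decimal units (1 GB = 1e9 bytes, 1 EB = 1e18 bytes).
GB = 10**9
EB = 10**18


def cache_bytes(num_containers, bytes_per_container=250):
    """DRAM needed to cache location info for every container."""
    return num_containers * bytes_per_container


def cluster_bytes(num_containers, container_size_bytes):
    """Total data addressed by the given number of containers."""
    return num_containers * container_size_bytes


containers = 100_000_000  # 100M containers, as on the slide
print(cache_bytes(containers) / GB)             # 25.0
print(cluster_bytes(containers, 16 * GB) / EB)  # 1.6
print(cluster_bytes(containers, 32 * GB) / EB)  # 3.2
```

With 16 to 32 GB per container, 100M containers span roughly 1.6 to 3.2 EB, which matches the slide's "~ 2 Exabytes", and 250 bytes per container gives the quoted 25 GB of cache.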
  • 10. MapR's Streaming Performance (chart: read and write throughput in MB per sec, axis 0 to 2250, for Hardware, MapR and Hadoop; panels: 11 x 7200rpm SATA and 11 x 15Krpm SAS; tests: i. 16 streams x 120GB, ii. 2000 streams x 1GB; higher is better)
  • 11. Terasort on MapR (chart: elapsed time in mins for MapR and Hadoop; panels: 1.0 TB, axis 0 to 60, and 3.5 TB, axis 0 to 300; 10+1 nodes: 8 core, 24GB DRAM, 11 x 1TB SATA 7200 rpm; lower is better)
  • 12. MUCH faster for some operations (chart: create rate against # of files (millions); annotation: "Test stopped here"; same 10 nodes …)
  • 13. NFS mounting models
      • Export to the world
      • NFS gateway runs on selected gateway hosts
      • Local server
      • NFS gateway runs on local host
      • Enables local compression and checksumming
      • Export to self
      • NFS gateway runs on all data nodes, mounted from localhost
  • 14. Export to the world (diagram labels: four NFS Servers, one NFS Client)
  • 15. Local server (diagram labels: Client, Application, NFS Server, Cluster Nodes)
  • 16. Universal export to self (diagram labels: Cluster Node, Application, NFS Server, Cluster Nodes)
  • 17. Nodes are identical (diagram: three Cluster Nodes, each with an Application and an NFS Server)
  • 18. Sharded text indexing
      • Mapper assigns document to shard
      • Shard is usually hash of document id
      • Reducer indexes all documents for a shard
      • Indexes created on local disk
      • On success, copy index to DFS
      • On failure, delete local files
      • Must avoid directory collisions
      • Can't use shard id!
      • Must manage local disk space
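The conventional flow on slide 18 can be sketched as follows. This is a minimal illustration, not the code from the talk: `shard_for`, `index_shard` and the `build_index` callback are hypothetical names, MD5 is just one possible stable hash, and a plain directory stands in for the DFS target.

```python
import hashlib
import os
import shutil
import tempfile


def shard_for(doc_id: str, num_shards: int) -> int:
    """Stable hash of the document id, reduced to a shard number."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def index_shard(shard, docs, dfs_dir, build_index):
    """Build one shard's index locally, then publish it or clean up."""
    # mkdtemp gives a unique directory per attempt; naming it after the
    # shard id alone could collide with a retried or speculative attempt.
    local_dir = tempfile.mkdtemp(prefix=f"shard-{shard}-")
    try:
        build_index(docs, local_dir)  # hypothetical indexer callback
        target = os.path.join(dfs_dir, f"shard-{shard}")
        shutil.copytree(local_dir, target)  # on success, copy index to "DFS"
        return target
    finally:
        # Runs on success and on failure, so local disk space is reclaimed.
        shutil.rmtree(local_dir, ignore_errors=True)
```

The bookkeeping in `index_shard` (unique local directories, copy on success, delete on failure) is what the reducer has to carry in the conventional flow.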
  • 19. Conventional data flows (diagram labels: Input documents, Map, Reducer, Local disk, Clustered index storage, Local disk, Search Engine)
      • Failure of a reducer causes garbage to accumulate on the local disk
      • Failure of search engine requires another download of the index from clustered storage
  • 20. Simplified NFS data flows (diagram labels: Input documents, Map, Reducer, Clustered index storage, Search Engine)
      • Failure of a reducer is cleaned up by map-reduce framework
      • Search engine reads mirrored index directly
  • 21. Application to machine learning
      • So now we have the hammer
      • Let's see some nails!
  • 22. K-means
      • Classic E-M based algorithm
      • Given cluster centroids,
      • Assign each data point to nearest centroid
      • Accumulate new centroids
      • Rinse, lather, repeat
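The loop described on slide 22 can be sketched in a few lines of plain Python. This is an in-memory illustration only; the function names, the squared Euclidean distance, the fixed iteration count and the handling of empty clusters are assumptions, not details from the talk.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])),
    )


def kmeans_step(points, centroids):
    """One iteration: assign points, then average each cluster's members."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for point in points:
        k = nearest(point, centroids)
        counts[k] += 1
        for i, value in enumerate(point):
            sums[k][i] += value
    # Assumption: an empty cluster keeps its old centroid.
    return [
        [s / counts[k] for s in sums[k]] if counts[k] else list(centroids[k])
        for k in range(len(centroids))
    ]


def kmeans(points, centroids, iterations=10):
    """Repeat the assign-and-accumulate step a fixed number of times."""
    for _ in range(iterations):
        centroids = kmeans_step(points, centroids)
    return centroids
```

For example, `kmeans([(0, 0), (0, 2), (10, 10), (10, 12)], [(0, 0), (10, 10)])` settles on the centroids `[0.0, 1.0]` and `[10.0, 11.0]`.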
  • 23. K-means, the movie (diagram labels: Input, Assign to Nearest centroid, Aggregate new centroids, Centroids)
  • 24. But …
  • 25. Parallel Stochastic Gradient Descent (diagram labels: Input, Train sub model, Average models, Model)
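The "train sub model, average models" pattern on slide 25 can be illustrated with a toy example. The sketch below assumes a one-variable linear model trained by plain SGD on each shard; the model, learning rate and function names are all illustrative choices, and the sub-models are trained one after another rather than in parallel mappers.

```python
def sgd_linear(shard, epochs=50, rate=0.01):
    """Sequential SGD for y ≈ w*x + b on one shard of (x, y) pairs."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in shard:
            err = (w * x + b) - y  # prediction error for this sample
            w -= rate * err * x
            b -= rate * err
    return w, b


def average_models(models):
    """Average the parameters of the sub-models component by component."""
    n = len(models)
    return tuple(sum(m[i] for m in models) / n for i in range(len(models[0])))


def parallel_sgd(shards):
    # In a map-reduce job each shard would be trained by its own mapper;
    # here the sub-models are trained one after another for simplicity.
    return average_models([sgd_linear(shard) for shard in shards])
```

Averaging needs only one small model per shard to cross the network, which is why the pattern fits a single map-reduce pass.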
  • 26. Variational Dirichlet Assignment (diagram labels: Input, Gather sufficient statistics, Update model, Model)
  • 27. Old tricks, new dogs
      • Mapper
      • Assign point to cluster
      • Emit cluster id, (1, point)
      • Combiner and reducer
      • Sum counts, weighted sum of points
      • Emit cluster id, (n, sum/n)
      • Output to HDFS
      (diagram annotations: Read from HDFS to local disk by distributed cache; Written by map-reduce; Read from local disk from distributed cache)
  • 28. Old tricks, new dogs
      • Mapper
      • Assign point to cluster
      • Emit cluster id, 1, point
      • Combiner and reducer
      • Sum counts, weighted sum of points
      • Emit cluster id, n, sum/n
      • Output to HDFS MapR FS
      (diagram annotations: Read from NFS; Written by map-reduce)
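The mapper and combiner/reducer outlined on slides 27 and 28 can be simulated in memory. This is a sketch under assumptions: `mapper`, `combine` and `run_iteration` are illustrative names, the shuffle is a plain dictionary, and no Hadoop or MapR API is involved.

```python
from collections import defaultdict


def mapper(point, centroids):
    """Emit (cluster id, (1, point)) for the nearest centroid."""
    k = min(
        range(len(centroids)),
        key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centroids[j])),
    )
    return k, (1, tuple(point))


def combine(values):
    """Merge (n, mean) pairs into one (n, mean) pair via a weighted sum."""
    values = list(values)
    total = sum(n for n, _ in values)
    dim = len(values[0][1])
    mean = tuple(sum(n * m[i] for n, m in values) / total for i in range(dim))
    return total, mean


def run_iteration(splits, centroids):
    """Simulate one job: map and combine per split, then reduce per cluster."""
    shuffled = defaultdict(list)
    for split in splits:
        local = defaultdict(list)
        for point in split:
            k, value = mapper(point, centroids)
            local[k].append(value)
        for k, values in local.items():
            shuffled[k].append(combine(values))  # combiner output
    return {k: combine(values) for k, values in shuffled.items()}  # reducer
```

Because a (count, mean) pair can be merged with other such pairs in any grouping, the same function serves as both combiner and reducer, which is the point of emitting `(n, sum/n)` rather than a bare mean.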
  • 29. Click modeling architecture (diagram labels: Input, Feature extraction and down sampling, Side-data, Data join, Sequential SGD Learning; annotations: Map-reduce, Now via NFS)
  • 30. Poor man's Pregel
      • Mapper
      • Lines in bold can use conventional I/O via NFS
        while not done:
            read and accumulate input models
            for each input:
                accumulate model
            write model
            synchronize
            reset input format
        emit summary
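One way to read the loop on slide 30 is that each mapper exchanges models with its peers through ordinary files on a shared mount. The sketch below follows that reading under several assumptions: the model is a single float, the file naming scheme and helper names are hypothetical, and the barrier is a simple polling loop over a shared directory.

```python
import json
import os
import time


def model_path(shared_dir, step, worker):
    """Hypothetical naming scheme: one file per worker per step."""
    return os.path.join(shared_dir, f"step-{step}-worker-{worker}.json")


def read_models(shared_dir, step, num_workers):
    """Read and average every worker's model from the given step."""
    models = []
    for w in range(num_workers):
        with open(model_path(shared_dir, step, w)) as f:
            models.append(json.load(f)["model"])
    return sum(models) / len(models)


def write_model(shared_dir, step, worker, model):
    """Write to a temporary file, then rename, so readers see whole files."""
    final = model_path(shared_dir, step, worker)
    tmp = final + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"model": model}, f)
    os.replace(tmp, final)


def synchronize(shared_dir, step, num_workers, poll=0.1):
    """Barrier: wait until every worker has written its model for this step."""
    while not all(
        os.path.exists(model_path(shared_dir, step, w))
        for w in range(num_workers)
    ):
        time.sleep(poll)


def superstep(shared_dir, step, worker, num_workers, inputs, rate=0.5):
    """One pass of the loop body, minus the barrier, for a single worker."""
    model = read_models(shared_dir, step - 1, num_workers) if step > 0 else 0.0
    for x in inputs:  # accumulate model
        model += rate * (x - model)
    write_model(shared_dir, step, worker, model)
    return model
```

A worker would call `superstep` and then `synchronize` once per iteration; all of the file operations are conventional I/O, so they work unchanged on a directory that is mounted via NFS.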
  • 31. Trivial visualization interface
      • Map-reduce output is visible via NFS
      • Legacy visualization just works
        $ R
        > x <- read.csv("/mapr/my.cluster/home/ted/data/foo.out")
        > plot(error ~ t, x)
        > q(save="n")
  • 32. Conclusions
      • We used to know all this
      • Tab completion used to work
      • 5 years of work-arounds have clouded our memories
      • We just have to remember the future