RESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING 
Matei Zaharia et al.
University of California, Berkeley
Alessandro Menabò, Politecnico di Torino, Italy
INTRODUCTION
Motivations 
Interactive (real-time) data mining 
Reuse of intermediate results (iterative algorithms)
Examples:
Machine learning
K-means clustering
PageRank
Limitations of current frameworks
Data reuse usually through disk storage
Disk IO latency and serialization
Too high-level abstractions
Implicit memory management
Implicit work distribution
Fault tolerance through data replication and logging
High network traffic
Goals 
Keep frequently used data in main memory
Efficient fault recovery
Log data transformations rather than data itself
User control
RESILIENT DISTRIBUTED DATASETS (RDDs)
What is an RDD?
Read-only, partitioned collection of records (in key-value form for key-based operations such as join)
Created through transformations
From stored data or other RDDs
Coarse-grained: same operation on the whole dataset
Examples: map, filter, join
Lineage: sequence of transformations that created the RDD
Key to efficient fault recovery
Used through actions
Return a result or store data
Examples: count, collect, save
What is an RDD? (cont’d)
Lazy computation
RDDs are computed only when the first action is invoked
Persistence control
Choose RDDs to be reused, and how to store them (e.g. in memory)
Partitioning control
Define how to distribute RDDs across cluster nodes
Minimize inter-node communication
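
The ideas on these two slides (transformations, lineage, actions, lazy computation) can be sketched in a few lines of plain Python. This is a single-process toy, not Spark's API: the class name `ToyRDD` and its `evaluations` counter are hypothetical and exist only for illustration.

```python
from typing import Any, Callable, Iterable, Iterator, List, Optional


class ToyRDD:
    """Hypothetical single-process stand-in for an RDD (not Spark's API)."""

    def __init__(self,
                 source: Optional[Iterable[Any]] = None,
                 parent: Optional["ToyRDD"] = None,
                 op: str = "source",
                 fn: Optional[Callable[[Iterator[Any]], Iterator[Any]]] = None):
        self._source = source
        self._parent = parent
        self.op = op
        self._fn = fn
        self.evaluations = 0  # how many times this node was actually computed

    # Transformations: only record what to do and return a new ToyRDD.
    def map(self, f: Callable[[Any], Any]) -> "ToyRDD":
        return ToyRDD(parent=self, op="map",
                      fn=lambda items: (f(x) for x in items))

    def filter(self, pred: Callable[[Any], bool]) -> "ToyRDD":
        return ToyRDD(parent=self, op="filter",
                      fn=lambda items: (x for x in items if pred(x)))

    def lineage(self) -> List[str]:
        """Sequence of operations that leads to this dataset."""
        node, ops = self, []
        while node is not None:
            ops.append(node.op)
            node = node._parent
        return list(reversed(ops))

    def _compute(self) -> Iterator[Any]:
        self.evaluations += 1
        if self._parent is None:
            return iter(self._source)
        return self._fn(self._parent._compute())

    # Actions: trigger the computation and return a result.
    def collect(self) -> List[Any]:
        return list(self._compute())

    def count(self) -> int:
        return sum(1 for _ in self._compute())


numbers = ToyRDD(source=range(10))
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
computed_before_action = numbers.evaluations  # nothing has run yet
result = evens_squared.collect()              # the action triggers the work
```

Because nothing is persisted in this toy, every action recomputes the whole lineage from the source; that recomputation is what persistence control is meant to avoid.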
Implementation 
Apache Spark cluster computing framework
Open source 
Can use the Hadoop Distributed File System (HDFS) (by Apache) as stable storage
Scala programming language
Runs on the Java Virtual Machine (JVM), compiles to Java bytecode
Object-oriented and functional programming
Statically typed, efficient and concise
Spark programming interface
Driver program
Defines and invokes actions on RDDs
Tracks RDDs’ lineage
Assigns workload to workers
Workers
Persistent processes on cluster nodes
Perform actions on data
Can store partitions of RDDs in RAM
Example: PageRank 
Iterative algorithm 
Updates document rank based on contributions from documents that link to it
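
The update rule can be sketched in plain Python as follows. The damping factor of 0.85, the uniform initial ranks and the three-page example graph are assumptions made for illustration; the slide only states that a rank is updated from the contributions of the documents linking to it.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]],
             iterations: int = 10,
             damping: float = 0.85) -> Dict[str, float]:
    """Iteratively update ranks from the contributions of linking pages.

    Assumes every link target is also a key of `links`.
    """
    n = len(links)
    ranks = {page: 1.0 / n for page in links}
    for _ in range(iterations):
        contribs = {page: 0.0 for page in links}
        for page, outgoing in links.items():
            if not outgoing:
                continue  # a page without links sends no contributions
            share = ranks[page] / len(outgoing)
            for dest in outgoing:
                contribs[dest] += share
        ranks = {page: (1 - damping) / n + damping * contribs[page]
                 for page in links}
    return ranks


# Hypothetical three-page graph: a links to b and c, b to c, c to a.
example_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
example_ranks = pagerank(example_links)
```

Each iteration reads the same `links` structure and the previous `ranks`, which is the reuse of intermediate results that motivates keeping both in memory.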
Example: PageRank (cont’d)
The lineage graph grows with the number of iterations
Replicate some intermediate results to speed up fault recovery
Reduce communication overhead
Partition both links and ranks by URL in the same way
Joining them can be done on the same node
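
A minimal sketch of why identical partitioning makes the join local. The CRC32-based partitioner and the helper names are assumptions chosen for illustration, not Spark's partitioner; the point is only that the same function is applied to both datasets.

```python
import zlib
from typing import Any, Dict, Iterable, List, Tuple


def partition_of(url: str, num_partitions: int) -> int:
    """Deterministic hash partitioner (hypothetical, CRC32-based)."""
    return zlib.crc32(url.encode("utf-8")) % num_partitions


def partition_by_url(pairs: Iterable[Tuple[str, Any]],
                     num_partitions: int) -> List[Dict[str, Any]]:
    """Place each (url, value) pair in the partition chosen by its URL."""
    parts: List[Dict[str, Any]] = [dict() for _ in range(num_partitions)]
    for url, value in pairs:
        parts[partition_of(url, num_partitions)][url] = value
    return parts


def local_join(links_parts: List[Dict[str, Any]],
               ranks_parts: List[Dict[str, Any]]) -> Dict[str, Tuple[Any, Any]]:
    """Join partition i of links with partition i of ranks only.

    A KeyError here would mean the matching rank lives in another
    partition, i.e. the join would need network communication.
    """
    joined = {}
    for links_part, ranks_part in zip(links_parts, ranks_parts):
        for url, outgoing in links_part.items():
            joined[url] = (outgoing, ranks_part[url])
    return joined


links = [("a.org", ["b.org", "c.org"]), ("b.org", ["c.org"]), ("c.org", ["a.org"])]
ranks = [("a.org", 0.4), ("b.org", 0.2), ("c.org", 0.4)]

links_parts = partition_by_url(links, 4)
ranks_parts = partition_by_url(ranks, 4)
joined = local_join(links_parts, ranks_parts)
```

Since both datasets use the same partitioner and partition count, every URL's links and rank end up at the same partition index, so no record has to move for the join.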
RDD representation 
Goals 
Easily track lineage
Support rich set of transformations
Keep system as simple as possible (uniform interface, avoid ad-hoc logic)
Graph-based structure
Set of partitions (pieces of the dataset)
Set of dependencies on parent RDDs
Function for computing the dataset from parent RDDs
Metadata about partitioning and data location
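
The pieces of information listed above could be held in a structure like the following. All field and function names are hypothetical, and `materialize` assumes one-to-one dependencies (partition i of a child reads partition i of each parent), which is a simplification.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class RDDNode:
    """Hypothetical node of the lineage graph; field names are illustrative."""
    name: str
    num_partitions: int                            # set of partitions
    compute: Callable[[int, List[list]], list]     # function of the parents' data
    parents: List["RDDNode"] = field(default_factory=list)   # dependencies
    partitioner: Optional[str] = None              # partitioning metadata
    preferred_locations: Dict[int, List[str]] = field(default_factory=dict)


def materialize(node: RDDNode, index: int) -> list:
    """Compute one partition by first computing the same partition of each parent."""
    parent_data = [materialize(parent, index) for parent in node.parents]
    return node.compute(index, parent_data)


# Two stored blocks stand in for a file kept on two nodes.
blocks = [["error: disk", "ok"], ["ok", "error: net"]]

lines = RDDNode("lines", 2,
                compute=lambda i, _parents: blocks[i],
                preferred_locations={0: ["node-a"], 1: ["node-b"]})
errors = RDDNode("errors", 2,
                 compute=lambda i, parents: [l for l in parents[0]
                                             if l.startswith("error")],
                 parents=[lines])

second_partition = materialize(errors, 1)
```

Every dataset exposes the same small interface, so a scheduler can walk `parents` and call `compute` without knowing which transformation produced the node.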
Dependencies 
Narrow dependencies
Each partition of the parent is used by at most one partition of the child
Example: map, filter, union
Wide dependencies
Each partition of the parent may be used by many partitions of the child
Example: join, groupByKey
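
The two definitions translate directly into a check on which child partitions read each parent partition. The function name and the example mappings below are illustrative assumptions.

```python
from typing import Dict, Set


def classify_dependency(parent_to_children: Dict[int, Set[int]]) -> str:
    """Return 'narrow' if every parent partition feeds at most one child partition."""
    if all(len(children) <= 1 for children in parent_to_children.values()):
        return "narrow"
    return "wide"


# map over three partitions: parent i feeds child i only.
map_dependency = {0: {0}, 1: {1}, 2: {2}}

# groupByKey from two parent partitions into two child partitions:
# any key may hash to any child, so each parent may feed both children.
group_by_key_dependency = {0: {0, 1}, 1: {0, 1}}

map_kind = classify_dependency(map_dependency)
group_kind = classify_dependency(group_by_key_dependency)
```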
Dependencies (cont’d)
Normal execution
Narrow: pipelined (e.g. map + filter one element at a time)
Wide: serial (all parents need to be available before computation starts)
Fault recovery
Narrow: fast (only one parent partition has to be recomputed)
Wide: full (one failed node may require all parents to be recomputed)
OVERVIEW OF SPARK
Scheduling 
Tracks in-memory partitions
On action request:
Examines lineage and builds a DAG of execution stages
Each stage contains as many transformations with narrow dependencies as possible
Stage boundaries correspond to wide dependencies, or already computed partitions
Launches tasks to compute missing partitions until the desired RDD is computed
Tasks assigned according to in-memory data locality
Otherwise assign to RDD’s preferred location (user-specified)
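
A simplified sketch of how stages could be cut from a lineage. It assumes a linear chain of RDDs rather than a general DAG and ignores already computed partitions; the function name and the word-count style example are hypothetical.

```python
from typing import List, Optional, Tuple


def build_stages(chain: List[Tuple[str, Optional[str]]]) -> List[List[str]]:
    """Group a linear lineage into stages, cutting at wide dependencies.

    `chain` lists (rdd_name, dependency_on_previous) in lineage order, where
    the dependency is None for the source, or 'narrow' or 'wide'.
    """
    stages: List[List[str]] = []
    current: List[str] = []
    for name, dependency in chain:
        if dependency == "wide" and current:
            stages.append(current)   # a wide dependency closes the stage
            current = []
        current.append(name)
    if current:
        stages.append(current)
    return stages


lineage_chain = [
    ("lines", None),
    ("words", "narrow"),    # e.g. a flatMap-like step
    ("pairs", "narrow"),    # e.g. map
    ("counts", "wide"),     # e.g. a by-key aggregation
    ("top", "narrow"),      # e.g. filter
]
stages = build_stages(lineage_chain)
```

Everything inside one stage can be pipelined on a single node per partition; only the boundary between the two stages requires data from all parent partitions.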
Scheduling (cont’d)
On task failure, re-run it on another node if all parents are still available
If stages become unavailable, re-run parent tasks in parallel
Scheduler failures not addressed
Replicate lineage graph?
Interactivity 
Desirable given low-latency in-memory capabilities
Scala shell integration
Each line is compiled into a Java class and run in the JVM
Bytecode shipped to workers via HTTP
Memory management 
Storage modes for persistent RDDs:
In-memory, deserialized object: fastest (native support by JVM)
In-memory, serialized object: more memory-efficient, but slower
On-disk: if RDD does not fit into RAM, but too costly to recompute every time
LRU eviction policy at the level of RDDs when a new partition does not fit into RAM
Unless the new partition belongs to the LRU RDD
Separate memory space on each node
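
The eviction rule can be sketched as below. The sketch assumes that capacity is counted in partitions and that one partition at a time is evicted from the least recently used RDD; the class `PartitionCache` and its methods are hypothetical, not Spark's block manager.

```python
from collections import OrderedDict
from typing import List


class PartitionCache:
    """Hypothetical per-node cache with LRU ordering kept per RDD."""

    def __init__(self, capacity: int):
        self.capacity = capacity   # maximum number of cached partitions
        self.rdds: "OrderedDict[str, List[int]]" = OrderedDict()  # oldest RDD first

    def _size(self) -> int:
        return sum(len(parts) for parts in self.rdds.values())

    def access(self, rdd_id: str) -> None:
        """Mark an RDD as most recently used."""
        if rdd_id in self.rdds:
            self.rdds.move_to_end(rdd_id)

    def put(self, rdd_id: str, partition: int) -> bool:
        """Cache a partition; return False if it could not be stored."""
        while self._size() >= self.capacity:
            victim = next(iter(self.rdds), None)
            if victim is None or victim == rdd_id:
                # Never evict from the RDD the new partition belongs to.
                return False
            self.rdds[victim].pop()
            if not self.rdds[victim]:
                del self.rdds[victim]
        self.rdds.setdefault(rdd_id, []).append(partition)
        self.rdds.move_to_end(rdd_id)
        return True


cache = PartitionCache(capacity=3)
cache.put("A", 0)
cache.put("A", 1)
cache.put("B", 0)
stored_b1 = cache.put("B", 1)   # full: evicts one partition of A (the LRU RDD)
stored_c0 = cache.put("C", 0)   # full: evicts the last partition of A
cache.access("B")               # C is now the least recently used RDD
stored_c1 = cache.put("C", 1)   # refused: C itself is the LRU RDD
```

Refusing to evict from the RDD that owns the new partition avoids cycling partitions of the same dataset in and out of memory.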
Checkpointing 
Save intermediate RDDs to disk (replication)
Speed up recovery of RDDs with long lineage or wide dependencies
Pointless with short lineage or narrow dependencies (recomputing partitions in parallel is less costly than replicating the whole RDD)
Not strictly required, but nice to have
Easy because RDDs are read-only
No consistency issues or distributed coordination required
Done in the background, programs do not have to be suspended
Controlled by the user, no automatic checkpointing yet
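
A back-of-the-envelope sketch of why checkpoints shorten recovery for long lineages. It assumes a linear lineage in which every transformation costs the same to replay; the function name and the numbers are illustrative only.

```python
from typing import Iterable


def steps_to_replay(lineage_length: int, checkpoints: Iterable[int]) -> int:
    """Transformations to re-run to rebuild the final RDD after a loss.

    `checkpoints` holds the lineage positions (1-based) saved to stable
    storage; recovery restarts from the latest usable one.
    """
    usable = [c for c in checkpoints if 0 < c <= lineage_length]
    return lineage_length - max(usable, default=0)


without_checkpoint = steps_to_replay(10, [])
with_checkpoints = steps_to_replay(10, [4, 8])
```

With a short lineage the saving is small, which matches the slide's remark that checkpointing is pointless in that case.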
EVALUATION
Testing environment
Amazon Elastic Compute Cloud (EC2)
m1.xlarge nodes
4 cores / node
15 GB of RAM / node 
HDFS with 256 MB blocks
Iterative machine learning 
10 iterations on 100 GB of data
Run on 25, 50, 100 nodes
Iterative machine learning (cont’d)
Different algorithms
K-means is more compute-intensive
Logistic regression is more sensitive to IO and deserialization
Minimum overhead in Spark
25.3× / 20.7× with logistic regression
3.2× / 1.9× with K-means
Outperforms even HadoopBinMem (in-memory binary data)
PageRank 
10 iterations on a 54 GB Wikipedia dump
Approximately 4 million articles
Run on 30 and 60 nodes
Linear speedup with number of nodes
2.4× with in-memory storage only
7.4× with partitioning control too
Fault recovery 
10 iterations of K-means with 100 GB of data on 75 nodes
Failure at 6th iteration
Fault recovery (cont’d)
Loss of tasks and partitions on failed node
Tasks rescheduled on different nodes
Missing partitions recomputed in parallel
Lineage graphs less than 10 KB
Checkpointing would require
Running several iterations again
Replicating all 100 GB over the network
Consuming twice the memory or writing all 100 GB to disk
Low memory
Logistic regression with various amounts of RAM
Graceful degradation with less space
Interactive data mining 
1 TB of Wikipedia page view logs (2 years of data)
Run on 100 m2.4xlarge nodes
8 cores and 68 GB of RAM per node
True interactivity (less than 7 s)
Querying from disk took 170 s
CONCLUSIONS
Applications 
Nothing new under the sun
In-memory computing, lineage tracking, partitioning and fast recovery are already available in other frameworks (separately)
RDDs can provide all these features in a single framework
RDDs can express existing cluster programming models
Same output, better performance
Examples: MapReduce, SQL, Google’s Pregel, batched stream processing (periodically updating results with new data)
Advantages 
Dramatic speedup with reused data (depending on application)
Fast fault recovery thanks to lightweight logging of transformations
Efficiency under control of user (storage, partitioning)
Graceful performance degradation with low RAM
High expressivity 
Versatility 
Interactivity 
Open source 
Limitations 
Not suited for fine-grained transformations
Overhead from logging too many lineage graphs
Traditional data logging and checkpointing perform better
Thanks!
