3. Introduction
Cluster computing frameworks like MapReduce do not perform
well on iterative machine learning and graph algorithms
because of data replication, disk I/O, and serialization overhead.
4. Introduction
Pregel is a system for iterative graph computations that
keeps intermediate data in memory, while HaLoop
offers an iterative MapReduce interface,
but these systems only support specific computation patterns.
They do not provide abstractions for more general
reuse.
5. Introduction
RDDs define a programming interface that provides
fault tolerance efficiently.
RDDs vs. distributed shared memory:
coarse-grained transformations
(e.g., map, filter, and join)
vs. fine-grained updates to mutable state;
recovery through lineage
6. Resilient Distributed Datasets (RDDs)
RDD transformations are lazy operations that define a
new RDD, while actions launch a computation to
return a value to the program or write data to external
storage.
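A minimal sketch in plain Scala (no Spark, hypothetical names): a lazy collection view stands in for a transformation, and forcing it stands in for an action.

```scala
object LazyDemo {
  val lines = Seq("ERROR disk failure", "INFO startup ok", "ERROR timeout")
  // "Transformation": lazily defined, nothing is computed yet.
  val errors = lines.view.filter(_.startsWith("ERROR"))
  // "Action": forces the computation and returns a value to the program.
  def count: Int = errors.size
  def main(args: Array[String]): Unit = println(count)
}
```

As with RDDs, defining `errors` costs nothing; work happens only when `count` demands a result.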
8. Resilient Distributed Datasets (RDDs)
An RDD is a read-only, partitioned collection of records that
can only be created from (1) data in stable storage or (2) other
RDDs.
lines = spark.textFile("hdfs://...")            // RDD backed by stable storage
errors = lines.filter(_.startsWith("ERROR"))    // lazy transformation
errors.count()                                  // action: triggers the computation
9. Resilient Distributed Datasets (RDDs)
RDD1: lines = spark.textFile("hdfs://...")
RDD2: errors = lines.filter(_.startsWith("ERROR"))
Long: number = errors.count()
RDD1 → RDD2 is a transformation; RDD2 → Long is an action.
11. Resilient Distributed Datasets (RDDs)
RDD1: lines = spark.textFile("hdfs://...")
RDD2: errors = lines.filter(_.startsWith("ERROR"))
RDD3: errors.persist() (or errors.cache())
The persisted RDD3 stays in memory after it is first computed.
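A rough analogy in plain Scala (no Spark, hypothetical names): persisting is like memoizing, so repeated actions reuse the in-memory result instead of recomputing it.

```scala
object PersistDemo {
  def run(): Int = {
    var computations = 0
    def compute(): Seq[String] = { computations += 1; Seq("ERROR a", "ERROR b") }
    // Without persist: each action recomputes the dataset from scratch.
    compute().size
    compute().size
    // "Persisted": the first use materializes it; later uses reuse it.
    lazy val persisted = compute()
    persisted.size
    persisted.size
    computations // 2 unpersisted recomputations + 1 persisted computation
  }
  def main(args: Array[String]): Unit = println(run())
}
```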
12. Resilient Distributed Datasets (RDDs)
Lineage: fault tolerance
transformation → action: RDD1 → RDD2 → Long
If RDD2 is lost, Spark reapplies the transformation
to RDD1 to produce a new RDD2.
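The recovery idea can be sketched in plain Scala (no Spark, hypothetical names): lineage is the recorded transformation, so a lost dataset is rebuilt from its parent rather than restored from a replica.

```scala
object LineageDemo {
  val rdd1 = Seq("ERROR a", "INFO b", "ERROR c")   // parent data (stable storage)
  val lineage: Seq[String] => Seq[String] =
    _.filter(_.startsWith("ERROR"))                // recorded transformation
  // If the derived dataset is lost (None), recompute it via lineage.
  def recover(rdd2: Option[Seq[String]]): Seq[String] =
    rdd2.getOrElse(lineage(rdd1))
  def main(args: Array[String]): Unit =
    println(recover(None).size)                    // simulate losing RDD2
}
```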
13. Resilient Distributed Datasets (RDDs)
Spark provides the RDD abstraction through a
language-integrated API in Scala,
a functional programming language for the Java VM.
14. Representing RDDs
dependencies between RDDs
narrow dependencies: each parent partition is used by at most
one child partition, allowing pipelined execution on
one cluster node
wide dependencies: require data from all parent
partitions to be available and to be shuffled across the
nodes using a MapReduce-like operation
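The two dependency shapes can be sketched in plain Scala (no Spark, hypothetical names): partitions are modeled as nested sequences, per-partition map/filter stands in for narrow dependencies, and a groupBy over all records stands in for the shuffle.

```scala
object DependencyDemo {
  // Two "partitions" of a parent dataset.
  val partitions = Seq(Seq(1, 2, 3), Seq(4, 5, 6))
  // Narrow: each child partition comes from exactly one parent partition,
  // so map and filter can be pipelined on a single node.
  val narrow: Seq[Seq[Int]] = partitions.map(p => p.map(_ * 2).filter(_ > 4))
  // Wide: grouping by key needs records from ALL parent partitions,
  // so the data must first be gathered (shuffled) across partitions.
  val wide: Map[Int, Seq[Int]] = partitions.flatten.groupBy(_ % 2)
  def main(args: Array[String]): Unit = {
    println(narrow.map(_.size).sum)  // records surviving the narrow pipeline
    println(wide.size)               // groups produced after the shuffle
  }
}
```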
17. Resilient Distributed Datasets (RDDs)
Each stage contains as many pipelined transformations
with narrow dependencies as possible,
because this avoids shuffling data across the nodes within a stage.
19. Evaluation
Workloads: logistic regression and k-means, 10 iterations
on 100 GB datasets using 25–100 machines.
Logistic regression is less compute-intensive and thus more
sensitive to time spent in deserialization and I/O.
27. Conclusion
RDDs: an efficient, general-purpose, and fault-tolerant
abstraction for sharing data in cluster applications.
RDDs offer an API based on coarse-grained
transformations that lets them recover data efficiently
using lineage.
Spark vs. Hadoop: up to 20× faster in iterative applications, and
Spark can be used interactively to query hundreds of gigabytes
of data.