SlideShare ist ein Scribd-Unternehmen logo
1 von 30
C H A P T E R 0 4 : W O R K I N G W I T H K E Y / V A L U E
P A I R S .
Learning Spark
by Holden Karau et. al.
Overview: Working with Key/Value Pairs
 Motivation
 Creating Pair RDDs
 Transformations on Pair RDDs
 Aggregations
 Grouping Data
 Joins
 Sorting Data
 Actions Available on Pair RDDs
 Data Partitioning (Advanced)
 Determining an RDD’s Partitioner
 Operations That Benefit from Partitioning
 Operations That Affect Partitioning
 Example: PageRank
 Custom Partitioners
 Conclusion
4.1 Motivation
 Spark provides special operations on RDDs containing key/value
pairs. These RDDs are called pair RDDs
 Pair RDDs are a useful building block in many programs, as they
expose operations that allow you to act on each key in parallel or
regroup data across the network
Edx and Coursera Courses
 Introduction to Big Data with Apache Spark
 Spark Fundamentals I
 Functional Programming Principles in Scala
4.2 Creating Pair RDDs
 There are a number of ways to get pair RDDs in Spark
 We can do this by running a map() function that returns key/value pairs.
 The way to build key-value RDDs differs by language [python , java, scala]
 Examples:
 Creating a pair RDD using the first word as the key in Python
 pairs = lines.map(lambda x: (x.split(" ")[0], x))
 Creating a pair RDD using the first word as the key in Scala
 val pairs = lines.map(x => (x.split(" ")(0), x))
 Creating a pair RDD using the first word as the key in Java
 JavaPairRDD<String, String> pairs = lines.mapToPair(keyData);
 When creating a pair RDD from an in-memory collection in Scala and
Python, we only need to call SparkContext.parallelize() on a collection of
pairs. In java, we instead use SparkContext.parallelizePairs().
4.3 Transformations on Pair RDDs
 Pair RDDs are allowed to use all the transformations available to
standard RDDs.
4.3 Transformations on Pair RDDs
4.3 Transformations on Pair RDDs
4.3.1 Aggregations
 When datasets are described in terms of key/value pairs, it is
common to want to aggregate statistics across all elements with the
same key
 Example : Per-key average with reduceByKey() and mapValues() in Scala
 rdd.mapValues(x => (x, 1)).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
4.3.1 Aggregations
 Example 4-10. Word count using map() , reduceByKey() in Scala
 val input = sc.textFile("s3://...")
 val words = input.flatMap(x => x.split(" "))
 val result = words.map(x => (x, 1)).reduceByKey((x, y) => x + y)
 Example 4-13. Per-key average using combineByKey() in Scala
 val result = input.combineByKey(
 (v) => (v, 1),
 (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1),
 (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)
 ).map{ case (key, value) => (key, value._1 / value._2.toFloat) }
 result.collectAsMap().map(println(_))
 …
4.3.2 Grouping Data
 With keyed data a common use case is grouping our data by key—
for example, viewingall of a customer’s orders together.
 If our data is already keyed in the way we want, groupByKey() will
group our data using the key in our RDD. On an RDD consisting of
keys of type K and values of typeV, we get back an RDD of type [K,
Iterable[V]].
 groupBy() works on unpaired data or data where we want to use a
different condition besides equality on the current key. It takes a
function that it applies to every element in the source RDD and uses
the result to determine the key.
4.3.3 Joins
 Some of the most useful operations we get with keyed data comes
from using it together with other keyed data. Joining data together
is probably one of the most common operations on a pair RDD, and
we have a full range of options including right and left outer joins,
cross joins, and inner joins.
 The simple join operator is an inner join.1 Only keys that are
present in both pair RDDs are output. When there are multiple
values for the same key in one of the inputs, the resulting pair RDD
will have an entry for every possible pair of values with that key
from the two input RDDs
4.3.3 Joins
4.3.4 Sorting Data
 Having sorted data is quite useful in many cases, especially when
you’re producing downstream output. We can sort an RDD with
key/value pairs provided that there is an ordering defined on the
key. Once we have sorted our data, any subsequent call on the
sorted data to collect() or save() will result in ordered data.
 Since we often want our RDDs in the reverse order, the sortByKey()
function takes a parameter called ascending indicating whether we
want it in ascending order (it defaults to true). Sometimes we want a
different sort order entirely, and to support this we can provide our
own comparison function. we will sort our RDD by converting the
integers to strings and using the string comparison functions.
4.3.4 Sorting Data
4.4 Actions Available on Pair RDDs
 As with the transformations, all of the traditional actions available
on the base RDD are also available on pair RDDs. Some additional
actions are available on pair RDDs to take advantage of the
key/value nature of the data; these are listed in Table 4-3.
4.5 Data Partitioning (Advanced)
 The final Spark feature we will discuss in this chapter is how to
control datasets’ partitioning across nodes.
 In a distributed program, communication is very expensive, so laying out data to
minimize network traffic can greatly improve performance
 program needs to choose the right data structure for a collection of records, Spark
programs can choose to control their RDDs’ partitioning to reduce communication
 Spark’s partitioning is available on all RDDs of key/value pairs, and causes the system
to group elements based on a function of each key.
 Example:
4.5 Data Partitioning (Advanced)
Without using partitionBy
4.5 Data Partitioning (Advanced)
4.5 Data Partitioning (Advanced)
 Use the partitionBy() transformation on userData to hash-partition it at the
start of the program. We do this by passing a spark.HashPartitioner object to
partitionBy, as shown in Example 4-23.
4.5.1 Determining an RDD’s Partitioner
 In Scala and Java, you can determine how an RDD is partitioned using its partitioner
property (or partitioner() method in Java).2 This returns a scala.Option object, which is a
Scala class for a container that may or may not contain one item. You can call isDefined()
on the Option to check whether it has a value, and get() to get this value. If present, the
value will be a spark.Partitioner object. This is essentially a function telling the RDD
which partition each key goes into; we’ll talk more about this later.
4.5.2 Operations That Benefit from Partitioning
 Many of Spark’s operations involve shuffling data by key across the
network. All of these will benefit from partitioning. As of Spark 1.0, the
operations that benefit from partitioning are cogroup(), groupWith(),
join(), leftOuterJoin(), rightOuterJoin(), groupByKey(), reduceByKey(),
combineByKey(), and lookup().
 such as reduceByKey(), running on a prepartitioned RDD will cause all the values
for each key to be computed locally on a single machine, requiring only the final,
locally reduced value to be sent from each worker node back to the master. For
binary operations,
 such as cogroup() and join(), pre-partitioning will cause at least one of the RDDs
(the one with the known partitioner) to not be shuffled. If both RDDs have the
same partitioner, and if they are cached on the same machines (e.g., one was
created using mapValues() on the other, which preserves keys and partitioning)
or if one of them has not yet been computed, then no shuffling across the network
will occur.
4.5.3 Operations That Affect Partitioning
 Spark knows internally how each of its operations affects partitioning,
and automatically sets the partitioner on RDDs created by operations
that partition the data
 For example:
 suppose you called join() to join two RDDs; because the elements with the same key
have been hashed to the same machine, Spark knows that the result is hash-
partitioned, and operations like reduceByKey() on the join result are going to be
significantly faster.
 Finally, for binary operations, which partitioner is set on the output
depends on the parent RDDs’ partitioners. By default, it is a hash
partitioner, with the number of partitions set to the level of parallelism
of the operation. However, if one of the parents has a partitioner set, it
will be that partitioner; and if both parents have a partitioner set, it will
be the partitioner of the first parent.
4.5.4 Example: PageRank
 1. Initialize each page’s rank to 1.0.
 2. On each iteration, have page p send a contribution of rank(p)/numNeighbors(p) to its
neighbors (the pages it has links to).
 3. Set each page’s rank to 0.15 + 0.85 * contributionsReceived.
4.5.5 Custom Partitioners
 While Spark’s HashPartitioner and RangePartitioner are well
suited to many use cases, Spark also allows you to tune how
an RDD is partitioned by providing a custom Partitioner
object. This can help you further reduce communication by
taking advantage of domain-specific knowledge.
 For example,
 suppose we wanted to run the PageRank algorithm in the previous
section on a set of web pages. Here each page’s ID (the key in our RDD)
will be its URL. Using a simple hash function to do the partitioning,
pages with similar URLs (e.g., http://www.cnn.com/WORLD and
http://www.cnn.com/US) might be hashed to completely different
nodes. However, we know that web pages within the same domain tend
to link to each other a lot. Because PageRank needs to send a message
from each page to each of its neighbors on each iteration, it helps to
group these pages into the same partition. We can do this with a custom
Partitioner that looks at just the domain name instead of the whole URL
4.5.5 Custom Partitioners
 To implement a custom partitioner, you need to subclass the
org.apache.spark.Partitioner class and implement three methods:
 numPartitions: Int, which returns the number of partitions you will create.
 getPartition(key: Any): Int, which returns the partition ID (0 to numPartitions-1) for a
given key.
 equals(), the standard Java equality method. This is important to implement because
Spark will need to test your Partitioner object against other instances of itself when it
decides whether two of your RDDs are partitioned the same way!
4.5.5 Custom Partitioners
4.6 Conclusion
 In this chapter, we have seen how to work with
key/value data using the specialized functions
available in Spark.
Learn More about Apache Spark
END

Weitere ähnliche Inhalte

Was ist angesagt?

Advanced spark training advanced spark internals and tuning reynold xin
Advanced spark training advanced spark internals and tuning reynold xinAdvanced spark training advanced spark internals and tuning reynold xin
Advanced spark training advanced spark internals and tuning reynold xincaidezhi655
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveSachin Aggarwal
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark TutorialAhmet Bulut
 
Strata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingStrata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingDatabricks
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksDatabricks
 
Spark overview
Spark overviewSpark overview
Spark overviewLisa Hua
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkDatabricks
 
Apache Spark An Overview
Apache Spark An OverviewApache Spark An Overview
Apache Spark An OverviewMohit Jain
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkSamy Dindane
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDsDean Chen
 
Using pySpark with Google Colab & Spark 3.0 preview
Using pySpark with Google Colab & Spark 3.0 previewUsing pySpark with Google Colab & Spark 3.0 preview
Using pySpark with Google Colab & Spark 3.0 previewMario Cartia
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Databricks
 
Transformations and actions a visual guide training
Transformations and actions a visual guide trainingTransformations and actions a visual guide training
Transformations and actions a visual guide trainingSpark Summit
 
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonApache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonChristian Perone
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Xuan-Chao Huang
 

Was ist angesagt? (19)

Advanced spark training advanced spark internals and tuning reynold xin
Advanced spark training advanced spark internals and tuning reynold xinAdvanced spark training advanced spark internals and tuning reynold xin
Advanced spark training advanced spark internals and tuning reynold xin
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
 
Apache Spark 101
Apache Spark 101Apache Spark 101
Apache Spark 101
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Strata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingStrata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark Streaming
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Spark overview
Spark overviewSpark overview
Spark overview
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Apache Spark An Overview
Apache Spark An OverviewApache Spark An Overview
Apache Spark An Overview
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
 
Using pySpark with Google Colab & Spark 3.0 preview
Using pySpark with Google Colab & Spark 3.0 previewUsing pySpark with Google Colab & Spark 3.0 preview
Using pySpark with Google Colab & Spark 3.0 preview
 
SPARK ARCHITECTURE
SPARK ARCHITECTURESPARK ARCHITECTURE
SPARK ARCHITECTURE
 
Spark architecture
Spark architectureSpark architecture
Spark architecture
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
 
Transformations and actions a visual guide training
Transformations and actions a visual guide trainingTransformations and actions a visual guide training
Transformations and actions a visual guide training
 
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonApache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130
 
Scala and spark
Scala and sparkScala and spark
Scala and spark
 

Andere mochten auch

B Power - Deakin Transcript
B Power - Deakin TranscriptB Power - Deakin Transcript
B Power - Deakin TranscriptBenjamin Power
 
Pair learning and activities report (repaired)
Pair learning and activities report (repaired)Pair learning and activities report (repaired)
Pair learning and activities report (repaired)Christine Watts
 
HBase In Action - Chapter 10 - Operations
HBase In Action - Chapter 10 - OperationsHBase In Action - Chapter 10 - Operations
HBase In Action - Chapter 10 - Operationsphanleson
 
Mobile Security - Wireless hacking
Mobile Security - Wireless hackingMobile Security - Wireless hacking
Mobile Security - Wireless hackingphanleson
 
Authentication in wireless - Security in Wireless Protocols
Authentication in wireless - Security in Wireless ProtocolsAuthentication in wireless - Security in Wireless Protocols
Authentication in wireless - Security in Wireless Protocolsphanleson
 
Two Case Studies Big-Data and Machine Learning at Scale Solutions in Iran
Two Case Studies Big-Data and Machine Learning at Scale Solutions in IranTwo Case Studies Big-Data and Machine Learning at Scale Solutions in Iran
Two Case Studies Big-Data and Machine Learning at Scale Solutions in IranAmir Sedighi
 
Big Data and Machine Learning Workshop - Day 3 @ UTACM
Big Data and Machine Learning Workshop - Day 3 @ UTACMBig Data and Machine Learning Workshop - Day 3 @ UTACM
Big Data and Machine Learning Workshop - Day 3 @ UTACMAmir Sedighi
 
Session 1 Tp1
Session 1 Tp1Session 1 Tp1
Session 1 Tp1phanleson
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟datastack
 
Hibernate Tutorial
Hibernate TutorialHibernate Tutorial
Hibernate TutorialRam132
 
Map(), flatmap() and reduce() are your new best friends: simpler collections,...
Map(), flatmap() and reduce() are your new best friends: simpler collections,...Map(), flatmap() and reduce() are your new best friends: simpler collections,...
Map(), flatmap() and reduce() are your new best friends: simpler collections,...Chris Richardson
 
local anesthesia / /certified fixed orthodontic courses by Indian dental ac...
 local anesthesia  / /certified fixed orthodontic courses by Indian dental ac... local anesthesia  / /certified fixed orthodontic courses by Indian dental ac...
local anesthesia / /certified fixed orthodontic courses by Indian dental ac...Indian dental academy
 
Firewall - Network Defense in Depth Firewalls
Firewall - Network Defense in Depth FirewallsFirewall - Network Defense in Depth Firewalls
Firewall - Network Defense in Depth Firewallsphanleson
 
Hacking web applications
Hacking web applicationsHacking web applications
Hacking web applicationsphanleson
 
Introduction to hibernate
Introduction to hibernateIntroduction to hibernate
Introduction to hibernatehr1383
 
Intro To Hibernate
Intro To HibernateIntro To Hibernate
Intro To HibernateAmit Himani
 

Andere mochten auch (20)

B Power - Deakin Transcript
B Power - Deakin TranscriptB Power - Deakin Transcript
B Power - Deakin Transcript
 
Pair learning and activities report (repaired)
Pair learning and activities report (repaired)Pair learning and activities report (repaired)
Pair learning and activities report (repaired)
 
HBase In Action - Chapter 10 - Operations
HBase In Action - Chapter 10 - OperationsHBase In Action - Chapter 10 - Operations
HBase In Action - Chapter 10 - Operations
 
Mobile Security - Wireless hacking
Mobile Security - Wireless hackingMobile Security - Wireless hacking
Mobile Security - Wireless hacking
 
Authentication in wireless - Security in Wireless Protocols
Authentication in wireless - Security in Wireless ProtocolsAuthentication in wireless - Security in Wireless Protocols
Authentication in wireless - Security in Wireless Protocols
 
Two Case Studies Big-Data and Machine Learning at Scale Solutions in Iran
Two Case Studies Big-Data and Machine Learning at Scale Solutions in IranTwo Case Studies Big-Data and Machine Learning at Scale Solutions in Iran
Two Case Studies Big-Data and Machine Learning at Scale Solutions in Iran
 
Big Data and Machine Learning Workshop - Day 3 @ UTACM
Big Data and Machine Learning Workshop - Day 3 @ UTACMBig Data and Machine Learning Workshop - Day 3 @ UTACM
Big Data and Machine Learning Workshop - Day 3 @ UTACM
 
Session 1 Tp1
Session 1 Tp1Session 1 Tp1
Session 1 Tp1
 
COM Introduction
COM IntroductionCOM Introduction
COM Introduction
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟
 
enterprise java bean
enterprise java beanenterprise java bean
enterprise java bean
 
Hibernate Tutorial
Hibernate TutorialHibernate Tutorial
Hibernate Tutorial
 
Map(), flatmap() and reduce() are your new best friends: simpler collections,...
Map(), flatmap() and reduce() are your new best friends: simpler collections,...Map(), flatmap() and reduce() are your new best friends: simpler collections,...
Map(), flatmap() and reduce() are your new best friends: simpler collections,...
 
local anesthesia / /certified fixed orthodontic courses by Indian dental ac...
 local anesthesia  / /certified fixed orthodontic courses by Indian dental ac... local anesthesia  / /certified fixed orthodontic courses by Indian dental ac...
local anesthesia / /certified fixed orthodontic courses by Indian dental ac...
 
Firewall - Network Defense in Depth Firewalls
Firewall - Network Defense in Depth FirewallsFirewall - Network Defense in Depth Firewalls
Firewall - Network Defense in Depth Firewalls
 
Hacking web applications
Hacking web applicationsHacking web applications
Hacking web applications
 
JPA and Hibernate
JPA and HibernateJPA and Hibernate
JPA and Hibernate
 
Introduction to hibernate
Introduction to hibernateIntroduction to hibernate
Introduction to hibernate
 
Intro To Hibernate
Intro To HibernateIntro To Hibernate
Intro To Hibernate
 
Hibernate performance tuning
Hibernate performance tuningHibernate performance tuning
Hibernate performance tuning
 

Ähnlich wie Learning spark ch04 - Working with Key/Value Pairs

Big Data Analytics with Apache Spark
Big Data Analytics with Apache SparkBig Data Analytics with Apache Spark
Big Data Analytics with Apache SparkMarcoYuriFujiiMelo
 
04 spark-pair rdd-rdd-persistence
04 spark-pair rdd-rdd-persistence04 spark-pair rdd-rdd-persistence
04 spark-pair rdd-rdd-persistenceVenkat Datla
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introductionsudhakara st
 
Hot-Spot analysis Using Apache Spark framework
Hot-Spot analysis Using Apache Spark frameworkHot-Spot analysis Using Apache Spark framework
Hot-Spot analysis Using Apache Spark frameworkSupriya .
 
Bigdata processing with Spark - part II
Bigdata processing with Spark - part IIBigdata processing with Spark - part II
Bigdata processing with Spark - part IIArjen de Vries
 
Apache Spark Introduction.pdf
Apache Spark Introduction.pdfApache Spark Introduction.pdf
Apache Spark Introduction.pdfMaheshPandit16
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax EnablementVincent Poncet
 
Spark and scala..................................... ppt.pptx
Spark and scala..................................... ppt.pptxSpark and scala..................................... ppt.pptx
Spark and scala..................................... ppt.pptxshivani22y
 
Berlin buzzwords 2018
Berlin buzzwords 2018Berlin buzzwords 2018
Berlin buzzwords 2018Matija Gobec
 
Big Data Transformation Powered By Apache Spark.pptx
Big Data Transformation Powered By Apache Spark.pptxBig Data Transformation Powered By Apache Spark.pptx
Big Data Transformation Powered By Apache Spark.pptxKnoldus Inc.
 
Big Data Transformations Powered By Spark
Big Data Transformations Powered By SparkBig Data Transformations Powered By Spark
Big Data Transformations Powered By SparkKnoldus Inc.
 
Introduction to Spark - DataFactZ
Introduction to Spark - DataFactZIntroduction to Spark - DataFactZ
Introduction to Spark - DataFactZDataFactZ
 
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015 Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015 Sameer Farooqui
 
Quick Guide to Refresh Spark skills
Quick Guide to Refresh Spark skillsQuick Guide to Refresh Spark skills
Quick Guide to Refresh Spark skillsRavindra kumar
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
 
spark interview questions & answers acadgild blogs
 spark interview questions & answers acadgild blogs spark interview questions & answers acadgild blogs
spark interview questions & answers acadgild blogsprateek kumar
 

Ähnlich wie Learning spark ch04 - Working with Key/Value Pairs (20)

Big Data Analytics with Apache Spark
Big Data Analytics with Apache SparkBig Data Analytics with Apache Spark
Big Data Analytics with Apache Spark
 
04 spark-pair rdd-rdd-persistence
04 spark-pair rdd-rdd-persistence04 spark-pair rdd-rdd-persistence
04 spark-pair rdd-rdd-persistence
 
Spark
SparkSpark
Spark
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
 
Spark learning
Spark learningSpark learning
Spark learning
 
Hot-Spot analysis Using Apache Spark framework
Hot-Spot analysis Using Apache Spark frameworkHot-Spot analysis Using Apache Spark framework
Hot-Spot analysis Using Apache Spark framework
 
Bigdata processing with Spark - part II
Bigdata processing with Spark - part IIBigdata processing with Spark - part II
Bigdata processing with Spark - part II
 
Apache Spark Introduction.pdf
Apache Spark Introduction.pdfApache Spark Introduction.pdf
Apache Spark Introduction.pdf
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
 
Spark and scala..................................... ppt.pptx
Spark and scala..................................... ppt.pptxSpark and scala..................................... ppt.pptx
Spark and scala..................................... ppt.pptx
 
Spark core
Spark coreSpark core
Spark core
 
Berlin buzzwords 2018
Berlin buzzwords 2018Berlin buzzwords 2018
Berlin buzzwords 2018
 
Big Data Transformation Powered By Apache Spark.pptx
Big Data Transformation Powered By Apache Spark.pptxBig Data Transformation Powered By Apache Spark.pptx
Big Data Transformation Powered By Apache Spark.pptx
 
Big Data Transformations Powered By Spark
Big Data Transformations Powered By SparkBig Data Transformations Powered By Spark
Big Data Transformations Powered By Spark
 
Introduction to Spark - DataFactZ
Introduction to Spark - DataFactZIntroduction to Spark - DataFactZ
Introduction to Spark - DataFactZ
 
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015 Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
 
Apache Spark
Apache Spark Apache Spark
Apache Spark
 
Quick Guide to Refresh Spark skills
Quick Guide to Refresh Spark skillsQuick Guide to Refresh Spark skills
Quick Guide to Refresh Spark skills
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
spark interview questions & answers acadgild blogs
 spark interview questions & answers acadgild blogs spark interview questions & answers acadgild blogs
spark interview questions & answers acadgild blogs
 

Mehr von phanleson

E-Commerce Security - Application attacks - Server Attacks
E-Commerce Security - Application attacks - Server AttacksE-Commerce Security - Application attacks - Server Attacks
E-Commerce Security - Application attacks - Server Attacksphanleson
 
HBase In Action - Chapter 04: HBase table design
HBase In Action - Chapter 04: HBase table designHBase In Action - Chapter 04: HBase table design
HBase In Action - Chapter 04: HBase table designphanleson
 
Hbase in action - Chapter 09: Deploying HBase
Hbase in action - Chapter 09: Deploying HBaseHbase in action - Chapter 09: Deploying HBase
Hbase in action - Chapter 09: Deploying HBasephanleson
 
Hướng Dẫn Đăng Ký LibertaGia - A guide and introduciton about Libertagia
Hướng Dẫn Đăng Ký LibertaGia - A guide and introduciton about LibertagiaHướng Dẫn Đăng Ký LibertaGia - A guide and introduciton about Libertagia
Hướng Dẫn Đăng Ký LibertaGia - A guide and introduciton about Libertagiaphanleson
 
Lecture 1 - Getting to know XML
Lecture 1 - Getting to know XMLLecture 1 - Getting to know XML
Lecture 1 - Getting to know XMLphanleson
 
Lecture 4 - Adding XTHML for the Web
Lecture  4 - Adding XTHML for the WebLecture  4 - Adding XTHML for the Web
Lecture 4 - Adding XTHML for the Webphanleson
 
Lecture 2 - Using XML for Many Purposes
Lecture 2 - Using XML for Many PurposesLecture 2 - Using XML for Many Purposes
Lecture 2 - Using XML for Many Purposesphanleson
 
SOA Course - SOA governance - Lecture 19
SOA Course - SOA governance - Lecture 19SOA Course - SOA governance - Lecture 19
SOA Course - SOA governance - Lecture 19phanleson
 
Lecture 18 - Model-Driven Service Development
Lecture 18 - Model-Driven Service DevelopmentLecture 18 - Model-Driven Service Development
Lecture 18 - Model-Driven Service Developmentphanleson
 
Lecture 15 - Technical Details
Lecture 15 - Technical DetailsLecture 15 - Technical Details
Lecture 15 - Technical Detailsphanleson
 
Lecture 10 - Message Exchange Patterns
Lecture 10 - Message Exchange PatternsLecture 10 - Message Exchange Patterns
Lecture 10 - Message Exchange Patternsphanleson
 
Lecture 9 - SOA in Context
Lecture 9 - SOA in ContextLecture 9 - SOA in Context
Lecture 9 - SOA in Contextphanleson
 
Lecture 07 - Business Process Management
Lecture 07 - Business Process ManagementLecture 07 - Business Process Management
Lecture 07 - Business Process Managementphanleson
 
Lecture 04 - Loose Coupling
Lecture 04 - Loose CouplingLecture 04 - Loose Coupling
Lecture 04 - Loose Couplingphanleson
 
Lecture 2 - SOA
Lecture 2 - SOALecture 2 - SOA
Lecture 2 - SOAphanleson
 
Lecture 3 - Services
Lecture 3 - ServicesLecture 3 - Services
Lecture 3 - Servicesphanleson
 
Lecture 01 - Motivation
Lecture 01 - MotivationLecture 01 - Motivation
Lecture 01 - Motivationphanleson
 

Mehr von phanleson (17)

E-Commerce Security - Application attacks - Server Attacks
E-Commerce Security - Application attacks - Server AttacksE-Commerce Security - Application attacks - Server Attacks
E-Commerce Security - Application attacks - Server Attacks
 
HBase In Action - Chapter 04: HBase table design
HBase In Action - Chapter 04: HBase table designHBase In Action - Chapter 04: HBase table design
HBase In Action - Chapter 04: HBase table design
 
Hbase in action - Chapter 09: Deploying HBase
Hbase in action - Chapter 09: Deploying HBaseHbase in action - Chapter 09: Deploying HBase
Hbase in action - Chapter 09: Deploying HBase
 
Hướng Dẫn Đăng Ký LibertaGia - A guide and introduciton about Libertagia
Hướng Dẫn Đăng Ký LibertaGia - A guide and introduciton about LibertagiaHướng Dẫn Đăng Ký LibertaGia - A guide and introduciton about Libertagia
Hướng Dẫn Đăng Ký LibertaGia - A guide and introduciton about Libertagia
 
Lecture 1 - Getting to know XML
Lecture 1 - Getting to know XMLLecture 1 - Getting to know XML
Lecture 1 - Getting to know XML
 
Lecture 4 - Adding XTHML for the Web
Lecture  4 - Adding XTHML for the WebLecture  4 - Adding XTHML for the Web
Lecture 4 - Adding XTHML for the Web
 
Lecture 2 - Using XML for Many Purposes
Lecture 2 - Using XML for Many PurposesLecture 2 - Using XML for Many Purposes
Lecture 2 - Using XML for Many Purposes
 
SOA Course - SOA governance - Lecture 19
SOA Course - SOA governance - Lecture 19SOA Course - SOA governance - Lecture 19
SOA Course - SOA governance - Lecture 19
 
Lecture 18 - Model-Driven Service Development
Lecture 18 - Model-Driven Service DevelopmentLecture 18 - Model-Driven Service Development
Lecture 18 - Model-Driven Service Development
 
Lecture 15 - Technical Details
Lecture 15 - Technical DetailsLecture 15 - Technical Details
Lecture 15 - Technical Details
 
Lecture 10 - Message Exchange Patterns
Lecture 10 - Message Exchange PatternsLecture 10 - Message Exchange Patterns
Lecture 10 - Message Exchange Patterns
 
Lecture 9 - SOA in Context
Lecture 9 - SOA in ContextLecture 9 - SOA in Context
Lecture 9 - SOA in Context
 
Lecture 07 - Business Process Management
Lecture 07 - Business Process ManagementLecture 07 - Business Process Management
Lecture 07 - Business Process Management
 
Lecture 04 - Loose Coupling
Lecture 04 - Loose CouplingLecture 04 - Loose Coupling
Lecture 04 - Loose Coupling
 
Lecture 2 - SOA
Lecture 2 - SOALecture 2 - SOA
Lecture 2 - SOA
 
Lecture 3 - Services
Lecture 3 - ServicesLecture 3 - Services
Lecture 3 - Services
 
Lecture 01 - Motivation
Lecture 01 - MotivationLecture 01 - Motivation
Lecture 01 - Motivation
 

Kürzlich hochgeladen

Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...christianmathematics
 
An Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdfAn Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdfSanaAli374401
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfagholdier
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.christianmathematics
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfChris Hunter
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxDenish Jangid
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxVishalSingh1417
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxnegromaestrong
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.pptRamjanShidvankar
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin ClassesCeline George
 

Kürzlich hochgeladen (20)

Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
An Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdfAn Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdf
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdf
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 

Learning spark ch04 - Working with Key/Value Pairs

  • 1. C H A P T E R 0 4 : W O R K I N G W I T H K E Y / V A L U E P A I R S . Learning Spark by Holden Karau et. al.
  • 2. Overview: Working with Key/Value Pairs  Motivation  Creating Pair RDDs  Transformations on Pair RDDs  Aggregations  Grouping Data  Joins  Sorting Data  Actions Available on Pair RDDs  Data Partitioning (Advanced)  Determining an RDD’s Partitioner  Operations That Benefit from Partitioning  Operations That Affect Partitioning  Example: PageRank  Custom Partitioners  Conclusion
  • 3. 4.1 Motivation  Spark provides special operations on RDDs containing key/value pairs. These RDDs are called pair RDDs  Pair RDDs are a useful building block in many programs, as they expose operations that allow you to act on each key in parallel or regroup data across the network
  • 4. Edx and Coursera Courses  Introduction to Big Data with Apache Spark  Spark Fundamentals I  Functional Programming Principles in Scala
  • 5. 4.2 Creating Pair RDDs  There are a number of ways to get pair RDDs in Spark  We can do this by running a map() function that returns key/value pairs.  The way to build key-value RDDs differs by language [python , java, scala]  Examples:  Creating a pair RDD using the first word as the key in Python  pairs = lines.map(lambda x: (x.split(" ")[0], x))  Creating a pair RDD using the first word as the key in Scala  val pairs = lines.map(x => (x.split(" ")(0), x))  Creating a pair RDD using the first word as the key in Java  JavaPairRDD<String, String> pairs = lines.mapToPair(keyData);  When creating a pair RDD from an in-memory collection in Scala and Python, we only need to call SparkContext.parallelize() on a collection of pairs. In java, we instead use SparkContext.parallelizePairs().
  • 6. 4.3 Transformations on Pair RDDs  Pair RDDs are allowed to use all the transformations available to standard RDDs.
  • 9. 4.3.1 Aggregations  When datasets are described in terms of key/value pairs, it is common to want to aggregate statistics across all elements with the same key  Example : Per-key average with reduceByKey() and mapValues() in Scala  rdd.mapValues(x => (x, 1)).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
  • 10. 4.3.1 Aggregations  Example 4-10. Word count using map() , reduceByKey() in Scala  val input = sc.textFile("s3://...")  val words = input.flatMap(x => x.split(" "))  val result = words.map(x => (x, 1)).reduceByKey((x, y) => x + y)  Example 4-13. Per-key average using combineByKey() in Scala  val result = input.combineByKey(  (v) => (v, 1),  (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1),  (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)  ).map{ case (key, value) => (key, value._1 / value._2.toFloat) }  result.collectAsMap().map(println(_))  …
  • 11. 4.3.2 Grouping Data  With keyed data a common use case is grouping our data by key— for example, viewingall of a customer’s orders together.  If our data is already keyed in the way we want, groupByKey() will group our data using the key in our RDD. On an RDD consisting of keys of type K and values of typeV, we get back an RDD of type [K, Iterable[V]].  groupBy() works on unpaired data or data where we want to use a different condition besides equality on the current key. It takes a function that it applies to every element in the source RDD and uses the result to determine the key.
  • 12. 4.3.3 Joins  Some of the most useful operations we get with keyed data comes from using it together with other keyed data. Joining data together is probably one of the most common operations on a pair RDD, and we have a full range of options including right and left outer joins, cross joins, and inner joins.  The simple join operator is an inner join.1 Only keys that are present in both pair RDDs are output. When there are multiple values for the same key in one of the inputs, the resulting pair RDD will have an entry for every possible pair of values with that key from the two input RDDs
  • 14. 4.3.4 Sorting Data  Having sorted data is quite useful in many cases, especially when you’re producing downstream output. We can sort an RDD with key/value pairs provided that there is an ordering defined on the key. Once we have sorted our data, any subsequent call on the sorted data to collect() or save() will result in ordered data.  Since we often want our RDDs in the reverse order, the sortByKey() function takes a parameter called ascending indicating whether we want it in ascending order (it defaults to true). Sometimes we want a different sort order entirely, and to support this we can provide our own comparison function. we will sort our RDD by converting the integers to strings and using the string comparison functions.
  • 16. 4.4 Actions Available on Pair RDDs  As with the transformations, all of the traditional actions available on the base RDD are also available on pair RDDs. Some additional actions are available on pair RDDs to take advantage of the key/value nature of the data; these are listed in Table 4-3.
  • 17. 4.5 Data Partitioning (Advanced)  The final Spark feature we will discuss in this chapter is how to control datasets’ partitioning across nodes.  In a distributed program, communication is very expensive, so laying out data to minimize network traffic can greatly improve performance  program needs to choose the right data structure for a collection of records, Spark programs can choose to control their RDDs’ partitioning to reduce communication  Spark’s partitioning is available on all RDDs of key/value pairs, and causes the system to group elements based on a function of each key.  Example:
  • 18. 4.5 Data Partitioning (Advanced) Without using partitionBy
  • 19. 4.5 Data Partitioning (Advanced)
  • 20. 4.5 Data Partitioning (Advanced)  Use the partitionBy() transformation on userData to hash-partition it at the start of the program. We do this by passing a spark.HashPartitioner object to partitionBy, as shown in Example 4-23.
  • 21. 4.5.1 Determining an RDD’s Partitioner  In Scala and Java, you can determine how an RDD is partitioned using its partitioner property (or partitioner() method in Java).2 This returns a scala.Option object, which is a Scala class for a container that may or may not contain one item. You can call isDefined() on the Option to check whether it has a value, and get() to get this value. If present, the value will be a spark.Partitioner object. This is essentially a function telling the RDD which partition each key goes into; we’ll talk more about this later.
  • 22. 4.5.2 Operations That Benefit from Partitioning  Many of Spark’s operations involve shuffling data by key across the network. All of these will benefit from partitioning. As of Spark 1.0, the operations that benefit from partitioning are cogroup(), groupWith(), join(), leftOuterJoin(), rightOuterJoin(), groupByKey(), reduceByKey(), combineByKey(), and lookup().  such as reduceByKey(), running on a prepartitioned RDD will cause all the values for each key to be computed locally on a single machine, requiring only the final, locally reduced value to be sent from each worker node back to the master. For binary operations,  such as cogroup() and join(), pre-partitioning will cause at least one of the RDDs (the one with the known partitioner) to not be shuffled. If both RDDs have the same partitioner, and if they are cached on the same machines (e.g., one was created using mapValues() on the other, which preserves keys and partitioning) or if one of them has not yet been computed, then no shuffling across the network will occur.
  • 23. 4.5.3 Operations That Affect Partitioning  Spark knows internally how each of its operations affects partitioning, and automatically sets the partitioner on RDDs created by operations that partition the data  For example:  suppose you called join() to join two RDDs; because the elements with the same key have been hashed to the same machine, Spark knows that the result is hash- partitioned, and operations like reduceByKey() on the join result are going to be significantly faster.  Finally, for binary operations, which partitioner is set on the output depends on the parent RDDs’ partitioners. By default, it is a hash partitioner, with the number of partitions set to the level of parallelism of the operation. However, if one of the parents has a partitioner set, it will be that partitioner; and if both parents have a partitioner set, it will be the partitioner of the first parent.
  • 24. 4.5.4 Example: PageRank  1. Initialize each page’s rank to 1.0.  2. On each iteration, have page p send a contribution of rank(p)/numNeighbors(p) to its neighbors (the pages it has links to).  3. Set each page’s rank to 0.15 + 0.85 * contributionsReceived.
  • 25. 4.5.5 Custom Partitioners  While Spark’s HashPartitioner and RangePartitioner are well suited to many use cases, Spark also allows you to tune how an RDD is partitioned by providing a custom Partitioner object. This can help you further reduce communication by taking advantage of domain-specific knowledge.  For example,  suppose we wanted to run the PageRank algorithm in the previous section on a set of web pages. Here each page’s ID (the key in our RDD) will be its URL. Using a simple hash function to do the partitioning, pages with similar URLs (e.g., http://www.cnn.com/WORLD and http://www.cnn.com/US) might be hashed to completely different nodes. However, we know that web pages within the same domain tend to link to each other a lot. Because PageRank needs to send a message from each page to each of its neighbors on each iteration, it helps to group these pages into the same partition. We can do this with a custom Partitioner that looks at just the domain name instead of the whole URL
  • 26. 4.5.5 Custom Partitioners  To implement a custom partitioner, you need to subclass the org.apache.spark.Partitioner class and implement three methods:  numPartitions: Int, which returns the number of partitions you will create.  getPartition(key: Any): Int, which returns the partition ID (0 to numPartitions-1) for a given key.  equals(), the standard Java equality method. This is important to implement because Spark will need to test your Partitioner object against other instances of itself when it decides whether two of your RDDs are partitioned the same way!
  • 28. 4.6 Conclusion  In this chapter, we have seen how to work with key/value data using the specialized functions available in Spark.
  • 29. Learn More about Apache Spark
  • 30. END

Hinweis der Redaktion

  1. Spark is a “computational engine” that is responsible for scheduling, distributing, and mon‐ itoring applications consisting of many computational tasks across many worker machines, or a computing cluster. First, all libraries and higher- level components in the stack benefit from improvements at the lower layers. Second, the costs associated with running the stack are minimized, because instead of running 5–10 independent software systems, an organization needs to run only one. Finally, one of the largest advantages of tight integration is the ability to build appli‐ cations that seamlessly combine different processing models.