Key-Value RDD
Key-Value RDD
Transformations on Pair RDDs
keys()
Returns an RDD with the keys of each tuple.
>>> var m = sc.parallelize(List((1, 2), (3, 4))).keys
>>> m.collect()
Array[Int] = Array(1, 3)
Key-Value RDD
Transformations on Pair RDDs
values()
Returns an RDD with the values of each tuple.
>>> var m = sc.parallelize(List((1, 2), (3, 4))).values
>>> m.collect()
Array(2, 4)
Key-Value RDD
Transformations on Pair RDDs
groupByKey()
Group values with the same key.
var rdd = sc.parallelize(List((1, 2), (3, 4), (3, 6)))
var rdd1 = rdd.groupByKey()
var vals = rdd1.collect()
for (i <- vals) {
  for (k <- i.productIterator) {
    println("\t" + k)
  }
}
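A quicker way to inspect the grouped output (a minimal sketch, assuming the same spark-shell session where sc and rdd above are available): mapValues(_.toList) turns each grouped Iterable into a List, so collect() shows the groups directly.
rdd.groupByKey().mapValues(_.toList).collect()
Expected output (key order may vary): Array((1,List(2)), (3,List(4, 6)))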
Key-Value RDD
Questions - Set Operations
What will be the result of the following?
var rdd = sc.parallelize(Array(("a", 1), ("b", 1), ("a", 1)));
rdd.groupByKey().mapValues(_.size).collect()
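For reference (this answer is not shown on the original slide): groupByKey gathers both "a" values into one group and the single "b" value into another, so mapValues(_.size) should return something like Array((a,2), (b,1)), with key order possibly varying.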
Key-Value RDD
Transformations on Pair RDDs
combineByKey(createCombiner, mergeValue, mergeCombiners, numPartitions=None)
Combine values with the same key using a different result type.
Turns RDD[(K, V)] into a result of type RDD[(K, C)]
createCombiner, which turns a V into a C (e.g., creates a one-element list)
mergeValue, to merge a V into a C (e.g., adds it to the end of a list)
mergeCombiners, to combine two C's into a single one.
var myrdd = sc.parallelize(List(1,2,3,4,5)).map(("x", _))
def cc(x:Int):String = x.toString
def mv(x:String, y:Int):String = {x + "," + y}
def mc(x:String, y:String):String = {x + ", " + y}
myrdd.combineByKey(cc, mv, mc).collect()
Array((x,1, 2, 3, 4,5))
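To make the three callbacks concrete, here is a per-key-average sketch (a minimal illustration, assuming a spark-shell session where sc is available; the names scores and sumCount are only for this example):
var scores = sc.parallelize(List(("a", 10), ("a", 20), ("b", 5)))
var sumCount = scores.combineByKey(
  (v: Int) => (v, 1),                               // createCombiner: start a (sum, count) pair from the first value
  (c: (Int, Int), v: Int) => (c._1 + v, c._2 + 1),  // mergeValue: fold another value into (sum, count)
  (c1: (Int, Int), c2: (Int, Int)) => (c1._1 + c2._1, c1._2 + c2._2))  // mergeCombiners: add partial (sum, count) pairs
sumCount.mapValues(c => c._1.toDouble / c._2).collect()
Expected output (key order may vary): Array((a,15.0), (b,5.0))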
Key-Value RDD
var myrdd = sc.parallelize(List(1,2,3), 2).map(("x", _))
Example: combineByKey
[Diagram: the values 1, 2, 3 end up combined into a single string, e.g. "1, 2, 3"]
Key-Value RDD
var myrdd = sc.parallelize(List(1,2,3), 2).map(("x", _))
[Diagram: the values 1, 2, 3 are split across the two partitions as (1, 2) and (3)]
Example: combineByKey
Key-Value RDD
var myrdd = sc.parallelize(List(1,2,3), 2).map(("x", _))
def cc(x:Int):String = x.toString
[Diagram: in the first partition, cc(1) creates the initial combiner "1"]
Example: combineByKey
Key-Value RDD
var myrdd = sc.parallelize(List(1,2,3), 2).map(("x", _))
def cc(x:Int):String = x.toString
[Diagram: cc(1) creates the combiner "1" in the first partition; cc(3) creates "3" in the second]
Example: combineByKey
Key-Value RDD
var myrdd = sc.parallelize(List(1,2,3), 2).map(("x", _))
def cc(x:Int):String = x.toString
def mv(x:String, y:Int):String = {x + "," + y.toString}
Example: combineByKey
[Diagram: mv("1", 2) merges the value 2 into the first partition's combiner, giving "1,2"; the second partition's combiner is still "3"]
Key-Value RDD
var myrdd = sc.parallelize(List(1,2,3), 2).map(("x", _))
def cc(x:Int):String = x.toString
def mv(x:String, y:Int):String = {x + "," + y.toString}
def mc(x:String, y:String):String = {x + ", " + y}
[Diagram: mc("1,2", "3") merges the two partition combiners into the final string "1,2,3"]
Example: combineByKey
Key-Value RDD
var myrdd = sc.parallelize(List(1,2,3), 2).map(("x", _))
def cc(x:Int):String = x.toString
def mv(x:String, y:Int):String = {x + ":" + y.toString}
def mc(x:String, y:String):String = {x + "," + y}
myrdd.combineByKey(cc, mv, mc)
[Diagram: the same walkthrough as before: cc and mv build the per-partition combiners, then mc merges them into one string]
Example: combineByKey
Key-Value RDD
var myrdd = sc.parallelize(List(1,2,3), 2).map(("x", _))
def cc(x:Int):String = x.toString
def mv(x:String, y:Int):String = {x + "," + y.toString}
def mc(x:String, y:String):String = {x + ", " + y}
myrdd.combineByKey(cc, mv, mc).collect()(0)._2
String = 1,2,3
[Diagram: the same walkthrough: cc and mv build the per-partition combiners, mc merges them into the final string]
Example: combineByKey
Key-Value RDD
Questions - Set Operations
What will be the result of the following?
def cc (v): return ("[" , v , "]");
def mv (c, v): return c[0:-1] + (v, "]")
def mc(c1,c2): return c1[0:-1] + c2[1:]
mc(mv(cc(1), 2), cc(3))
Key-Value RDD
Questions - Set Operations
('[', 1, 2, 3, ']')
What will be the result of the following?
def cc (v): return ("[" , v , "]");
def mv (c, v): return c[0:-1] + (v, "]")
def mc(c1,c2): return c1[0:-1] + c2[1:]
mc(mv(cc(1), 2), cc(3))
Key-Value RDD
Questions - Set Operations
What will be the result of the following?
def cc (v): return ("[" , v , "]");
def mv (c, v): return c[0:-1] + (v, "]")
def mc(c1, c2): return c1[0:-1] + c2[1:]
rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
rdd.combineByKey(cc,mv, mc).collect()
Key-Value RDD
Questions - Set Operations
[('a', ('[', 1, 3, ']')), ('b', ('[', 2, ']'))]
What will be the result of the following?
def cc (v): return ("[" , v , "]");
def mv (c, v): return c[0:-1] + (v, "]")
def mc(c1, c2): return c1[0:-1] + c2[1:]
rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
rdd.combineByKey(cc,mv, mc).collect()
Key-Value RDD
Transformations on Pair RDDs
sortByKey(ascending=true, numPartitions=current partitions)
Sorts this RDD, which is assumed to consist of (key, value) pairs.
Key-Value RDD
>>> var tmp = List(('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5))
>>> sc.parallelize(tmp).sortByKey().collect()
Array((1,3), (2,5), (a,1), (b,2), (d,4))
Transformations on Pair RDDs
sortByKey(ascending=true, numPartitions=current partitions)
Sorts this RDD, which is assumed to consist of (key, value) pairs.
Key-Value RDD
>>> var tmp = List(('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5))
>>> sc.parallelize(tmp).sortByKey(true, 1).collect()
Array((1,3), (2,5), (a,1), (b,2), (d,4))
Transformations on Pair RDDs
sortByKey(ascending=true, numPartitions=current partitions)
Sorts this RDD, which is assumed to consist of (key, value) pairs.
Key-Value RDD
>>> var tmp = List(('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5))
>>> sc.parallelize(tmp).sortByKey(ascending=false,
numPartitions=2).collect()
Array((d,4), (b,2), (a,1), (2,5), (1,3))
Transformations on Pair RDDs
sortByKey(ascending=true, numPartitions=current partitions)
Sorts this RDD, which is assumed to consist of (key, value) pairs.
Key-Value RDD
Transformations on Pair RDDs
subtractByKey(other, numPartitions=None)
Return each (key, value) pair in self that has no pair with matching key in other.
>>> var x = sc.parallelize(List(("a", 1), ("b", 4), ("b", 5), ("a", 2)))
>>> var y = sc.parallelize(List(("a", 3), ("c", None)))
>>> x.subtractByKey(y).collect()
Array((b,4), (b,5))
Key-Value RDD
Transformations on Pair RDDs
join(other, numPartitions=None)
Return an RDD containing all pairs of elements with matching keys in self and
other.
Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in
self and (k, v2) is in other.
Key-Value RDD
Transformations on Pair RDDs
join(other, numPartitions=None)
Return an RDD containing all pairs of elements with matching keys in self and
other.
Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in
self and (k, v2) is in other.
>>> var x = sc.parallelize(List(("a", 1), ("b", 4), ("c", 5)))
>>> var y = sc.parallelize(List(("a", 2), ("a", 3), ("d", 7)))
>>> x.join(y).collect()
Array((a,(1,2)), (a,(1,3)))
Key-Value RDD
Transformations on Pair RDDs
leftOuterJoin(other, numPartitions=None)
Perform a left outer join of self and other.
For each element (k, v) in self, the resulting RDD will either contain all pairs (k, (v,
w)) for w in other, or the pair (k, (v, None)) if no elements in other have key k.
Hash-partitions the resulting RDD into the given number of partitions.
>>> var x = sc.parallelize(List(("a", 1), ("b", 4)))
>>> var y = sc.parallelize(List(("a", 2)))
>>> x.leftOuterJoin(y).collect()
Array((a,(1,Some(2))), (b,(4,None)))
Key-Value RDD
x = sc.parallelize(
[(1, "sandeep"), ("2", "sravani")])
y = sc.parallelize(
[(1, "ryan"), (3, "giri")])
x.leftOuterJoin(y).collect()
Questions - Set Operations
What will be the result of the following?
LEFT OUTER JOIN
Key-Value RDD
Questions - Set Operations
[(1, ('sandeep', 'ryan')), ('2', ('sravani', None))]
What will be the result of the following?
x = sc.parallelize(
[(1, "sandeep"), ("2", "sravani")])
y = sc.parallelize(
[(1, "ryan"), (3, "giri")])
x.leftOuterJoin(y).collect()
LEFT OUTER JOIN
Key-Value RDD
Transformations on Pair RDDs
rightOuterJoin(other, numPartitions=None)
Perform a right outer join of self and other.
For each element (k, w) in other, the resulting RDD will either contain all pairs
(k, (v, w)) for v in this, or the pair (k, (None, w)) if no elements in self have key
k.
Hash-partitions the resulting RDD into the given number of partitions.
>>> x = sc.parallelize([("a", 1), ("b", 4)])
>>> y = sc.parallelize([("a", 2)])
>>> y.rightOuterJoin(x).collect()
[('a', (2, 1)), ('b', (None, 4))]
Key-Value RDD
x = sc.parallelize(
[(1, "sandeep"), ("2", "sravani")])
y = sc.parallelize(
[(1, "ryan"), (3, "giri")])
x.rightOuterJoin(y).collect()
Questions - Set Operations
What will be the result of the following?
RIGHT OUTER JOIN
Key-Value RDD
Questions - Set Operations
[(1, ('sandeep', 'ryan')), (3, (None, 'giri'))]
What will be the result of the following?
x = sc.parallelize(
[(1, "sandeep"), ("2", "sravani")])
y = sc.parallelize(
[(1, "ryan"), (3, "giri")])
x.rightOuterJoin(y).collect()
RIGHT OUTER JOIN
Key-Value RDD
Transformations on Pair RDDs
>>> var x = sc.parallelize(List(("a", 1), ("b", 4)))
>>> var y = sc.parallelize(List(("a", 2), ("a", 3)))
>>> var cg = x.cogroup(y)
>>> var cgl = cg.collect()
Array((a,(CompactBuffer(1),CompactBuffer(2, 3))),
(b,(CompactBuffer(4),CompactBuffer())))
This is basically same as:
((a, ([1], [2,3])), (b, ([4], [])))
cogroup(other, numPartitions=None)
For each key k in self or other, return a resulting RDD that contains a tuple with the
list of values for that key in self as well as other.
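A small follow-up sketch (assuming the same spark-shell session and the cg variable above): converting the CompactBuffers to Lists makes the collected output match the simplified form shown above.
cg.mapValues { case (xs, ys) => (xs.toList, ys.toList) }.collect()
Expected output: Array((a,(List(1),List(2, 3))), (b,(List(4),List())))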
Key-Value RDD
Actions Available on Pair RDDs
countByKey()
Count the number of elements for each key, and return the result to the master as a
dictionary.
>>> var rdd = sc.parallelize(List(("a", 1), ("b", 1), ("a", 1), ("a", 10)))
>>> rdd.countByKey()
Map(a -> 3, b -> 1)
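If the per-key counts themselves are large, a hedged alternative (a sketch using reduceByKey, which this slide does not cover) keeps the counts as an RDD instead of returning a map to the driver:
rdd.mapValues(_ => 1).reduceByKey(_ + _).collect()
Expected output (key order may vary): Array((a,3), (b,1))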
Key-Value RDD
Actions Available on Pair RDDs
lookup(key)
Return the list of values in the RDD for key. This operation is done efficiently if the
RDD has a known partitioner by only searching the partition that the key maps to.
var lr = sc.parallelize(1 to 1000).map(x => (x, x) )
lr.lookup(42)
Job 24 finished: lookup at <console>:28, took 0.037469 s
WrappedArray(42)
var sorted = lr.sortByKey()
sorted.lookup(42) // fast
Job 21 finished: lookup at <console>:28, took 0.008917 s
ArrayBuffer(42)
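To illustrate the "known partitioner" point, here is a minimal sketch (assuming the same spark-shell session and the lr variable above; the partition count 10 is arbitrary): an RDD partitioned with an explicit HashPartitioner lets lookup scan only the single partition the key hashes to.
import org.apache.spark.HashPartitioner
var partitioned = lr.partitionBy(new HashPartitioner(10)).cache()
partitioned.lookup(42)  // only the partition that 42 hashes to is searched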
Thank you!
Basics of RDD