Cascading
Nathan Marz
BackType

What is Cascading?

Cascading is a Java library that makes development of
complex Hadoop MapReduce workflows easy

Why Hadoop?

• Process large amounts of data in a scalable,
fault-tolerant way

Why Cascading?

Tool              How you feel
Hadoop MapReduce  [image not preserved]
Cascading         [image not preserved]

Tuples
Cascading represents all data as “Tuples”

(“the man sat”, 25)
(“hello dolly”, 42)
(“say hello”, 1)
(“the woman sat”, 10)

Tuples
Tuples consist of named, ordered fields

[“sentence”, “value”]
(“the man sat”, 25)
(“hello dolly”, 42)
(“say hello”, 1)
(“the woman sat”, 10)
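To make "named, ordered fields" concrete, here is a minimal sketch in plain Java. The `NamedTuple` class is hypothetical and written only for illustration; it is not Cascading's own tuple class, and it assumes field names are unique within a tuple.

```java
import java.util.List;

// Hypothetical stand-in for a tuple; NOT part of the Cascading API.
final class NamedTuple {
    private final List<String> fields; // ordered field names, e.g. ["sentence", "value"]
    private final List<Object> values; // values in the same order as the field names

    NamedTuple(List<String> fields, List<Object> values) {
        if (fields.size() != values.size()) {
            throw new IllegalArgumentException("field/value count mismatch");
        }
        this.fields = fields;
        this.values = values;
    }

    // Fields are ordered, so a value can be read by position.
    Object get(int index) {
        return values.get(index);
    }

    // Fields are named, so a value can also be read by field name.
    Object get(String field) {
        int index = fields.indexOf(field);
        if (index < 0) {
            throw new IllegalArgumentException("unknown field: " + field);
        }
        return values.get(index);
    }
}
```

With fields [“sentence”, “value”] and values (“the man sat”, 25), both `get("value")` and `get(1)` would return 25.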

Flow
A flow is a sequence of manipulations on
pipes of tuple streams
• A flow compiles to one or more MapReduce
jobs
• Inputs and outputs are called “Taps”
• Each Tap produces or receives a pipe of
tuples with the same format
• A flow can have multiple inputs and multiple outputs

Example

[“sentence”, “value”] -> [“word”, “sum”]

Get the sum of the values for each word

Example
[“sentence”, “value”]
Split(“sentence”) -> “word”
[“word”, “value”]
GroupBy(“word”)
[“word”, list<[“value”]>]
Sum(“value”) -> “sum”

[“word”, “sum”]

Example
Split(“sentence”) -> “word”

Input: [“sentence”, “value”]
(“the man sat”, 25)
(“hello dolly”, 42)
(“say hello”, 1)
(“the woman sat”, 10)

Output: [“word”, “value”]
(“the”, 25)
(“man”, 25)
(“sat”, 25)
(“hello”, 42)
(“dolly”, 42)
(“say”, 1)
(“hello”, 1)
(“the”, 10)
(“woman”, 10)
(“sat”, 10)
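The Split step can be imitated in plain Java to show what happens to each tuple. This is only a sketch: `SplitStep` and its `split` method are hypothetical names, tuples are modelled as `Map.Entry` pairs, and splitting on whitespace is an assumption, since the slides do not say how sentences are tokenized. It does not use the Cascading API.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the Split step; NOT the Cascading API.
final class SplitStep {
    // Emits one (word, value) pair for every whitespace-separated word
    // in each (sentence, value) pair, carrying the value along unchanged.
    static List<Map.Entry<String, Integer>> split(List<Map.Entry<String, Integer>> input) {
        List<Map.Entry<String, Integer>> output = new ArrayList<>();
        for (Map.Entry<String, Integer> tuple : input) {
            for (String word : tuple.getKey().trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    output.add(new SimpleEntry<>(word, tuple.getValue()));
                }
            }
        }
        return output;
    }
}
```

One input tuple can produce several output tuples, which is why four sentences become ten (word, value) pairs.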

Example
GroupBy(“word”)

Input: [“word”, “value”]
(“the”, 25)
(“man”, 25)
(“sat”, 25)
(“hello”, 42)
(“dolly”, 42)
(“say”, 1)
(“hello”, 1)
(“the”, 10)
(“woman”, 10)
(“sat”, 10)

Output: [“word”, list<[“value”]>]
(“the”, [25, 10])
(“man”, [25])
(“sat”, [25, 10])
(“hello”, [42, 1])
(“dolly”, [42])
(“say”, [1])
(“woman”, [10])
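The grouping above can be sketched in plain Java as follows. `GroupByStep` and `groupByWord` are hypothetical names, and the sketch runs in memory on a single machine; it keeps groups in first-seen order to match the slide, which is an assumption about presentation and not a statement about how Hadoop orders keys.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the GroupBy step; NOT the Cascading API.
final class GroupByStep {
    // Collects the values of all (word, value) pairs that share a word.
    // LinkedHashMap keeps the groups in the order each word first appears.
    static Map<String, List<Integer>> groupByWord(List<Map.Entry<String, Integer>> input) {
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> tuple : input) {
            groups.computeIfAbsent(tuple.getKey(), k -> new ArrayList<>())
                  .add(tuple.getValue());
        }
        return groups;
    }
}
```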

Example
Sum(“value”) -> “sum”

Input: [“word”, list<[“value”]>]
(“the”, [25, 10])
(“man”, [25])
(“sat”, [25, 10])
(“hello”, [42, 1])
(“dolly”, [42])
(“say”, [1])
(“woman”, [10])

Output: [“word”, “sum”]
(“the”, 35)
(“man”, 25)
(“sat”, 35)
(“hello”, 43)
(“dolly”, 42)
(“say”, 1)
(“woman”, 10)
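The Sum step reduces each group to a single number. A minimal plain-Java sketch, with the hypothetical names `SumStep` and `sumValues` and no use of the Cascading API, could look like this:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the Sum step; NOT the Cascading API.
final class SumStep {
    // Replaces each word's list of values with the total of that list.
    static Map<String, Integer> sumValues(Map<String, List<Integer>> groups) {
        Map<String, Integer> sums = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> group : groups.entrySet()) {
            int sum = 0;
            for (int value : group.getValue()) {
                sum += value;
            }
            sums.put(group.getKey(), sum);
        }
        return sums;
    }
}
```

Because each group is summed independently of the others, this is the kind of per-key aggregation that fits the reduce side of a MapReduce job.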

More functionality

• Inner and outer joins natively supported
• Seamlessly branch and merge pipes of
tuples
• Integrate diverse data sources

Why not Pig?

• Pig is a custom language for writing
MapReduce workflows
• Because it’s a custom language, intermixing
“plain logic” in between flows is painful
• Not nearly as flexible as Cascading for
custom needs

Learn more

• Tutorial: http://blog.rapleaf.com/dev/?p=33
• Website: http://www.cascading.org
