Cascading
 Nathan Marz
  BackType
What is Cascading?

Cascading is a Java library that makes development of
   complex Hadoop MapReduce workflows easy
Why Hadoop?


• Process large amounts of data in a scalable,
  fault-tolerant way
Why Cascading?
    Tool                How you feel
    Hadoop MapReduce    [image not preserved]
    Cascading           [image not preserved]
Tuples
Cascading represents all data as “Tuples”

       (“the man sat”,   25)
       (“hello dolly”,   42)
       (“say hello”,      1)
       (“the woman sat”, 10)
Tuples
Tuples are ordered lists of named fields

     [“sentence”, “value”]
     (“the man sat”,   25)
     (“hello dolly”,   42)
     (“say hello”,      1)
     (“the woman sat”, 10)
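
As a rough illustration of named, ordered fields, here is a minimal sketch in plain Python. It is not the Cascading API; `FIELDS`, `get_field`, and `as_dict` are hypothetical names chosen for this example, and the data is the sample from the slide.

```python
# Field names are declared once, in order; each tuple holds values in that order.
FIELDS = ("sentence", "value")

TUPLES = [
    ("the man sat", 25),
    ("hello dolly", 42),
    ("say hello", 1),
    ("the woman sat", 10),
]


def get_field(fields, tup, name):
    """Return the value stored under a field name, located by the field's position."""
    return tup[fields.index(name)]


def as_dict(fields, tup):
    """Pair each field name with the value at the same position."""
    return dict(zip(fields, tup))


if __name__ == "__main__":
    for tup in TUPLES:
        print(as_dict(FIELDS, tup))
```

Keeping the names separate from the values mirrors the slide: the field list describes the whole stream, and every tuple in it follows the same layout.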
Flow
  A flow is a sequence of manipulations on
           pipes of tuple streams
• A flow compiles to one or more MapReduce
  jobs
• Inputs and outputs are called “Taps”
• Each Tap produces or receives a pipe of
  tuples with the same format
• A flow can have multiple inputs and multiple outputs
Example

[“sentence”, “value”]  ->  [“word”, “sum”]

      Get the sum of the values for each word
Example
[“sentence”, “value”]
        Split(“sentence”) -> “word”
[“word”, “value”]
        GroupBy(“word”)
[“word”, list<[“value”]>]
        Sum(“value”) -> “sum”
[“word”, “sum”]
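
The three steps above can be sketched in plain Python to show what each one does to the tuple stream. This is only an in-memory illustration of the data flow, not Cascading code; the function names (`split`, `group_by`, `sum_values`, `word_sums`) are assumptions made for this example, and splitting on whitespace is an assumed definition of Split.

```python
from collections import defaultdict


def split(tuples):
    """Split("sentence") -> "word": emit one (word, value) tuple per word."""
    for sentence, value in tuples:
        for word in sentence.split():
            yield (word, value)


def group_by(tuples):
    """GroupBy("word"): collect the values for each word, in first-seen order."""
    groups = defaultdict(list)
    for word, value in tuples:
        groups[word].append(value)
    return list(groups.items())


def sum_values(groups):
    """Sum("value") -> "sum": reduce each group's list of values to one number."""
    return [(word, sum(values)) for word, values in groups]


def word_sums(tuples):
    """Chain the three steps: [sentence, value] -> [word, sum]."""
    return sum_values(group_by(split(tuples)))


SENTENCES = [
    ("the man sat", 25),
    ("hello dolly", 42),
    ("say hello", 1),
    ("the woman sat", 10),
]

RESULT = word_sums(SENTENCES)

if __name__ == "__main__":
    for row in RESULT:
        print(row)
```

Here the groups come out in first-seen order because everything runs in one process. On a Hadoop cluster the grouping happens across machines, so the output order should not be relied on.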
Example
Split(“sentence”) -> “word”

[“sentence”, “value”]      [“word”, “value”]

(“the man sat”,   25)      (“the”,   25)
                           (“man”,   25)
                           (“sat”,   25)
(“hello dolly”,   42)      (“hello”, 42)
                           (“dolly”, 42)
(“say hello”,      1)      (“say”,    1)
                           (“hello”,  1)
(“the woman sat”, 10)      (“the”,   10)
                           (“woman”, 10)
                           (“sat”,   10)
Example
GroupBy(“word”)

[“word”, “value”]      [“word”, list<[“value”]>]

(“the”,   25)          (“the”,   [25, 10])
(“man”,   25)          (“man”,   [25])
(“sat”,   25)          (“sat”,   [25, 10])
(“hello”, 42)          (“hello”, [42, 1])
(“dolly”, 42)          (“dolly”, [42])
(“say”,    1)          (“say”,   [1])
(“hello”,  1)          (“woman”, [10])
(“the”,   10)
(“woman”, 10)
(“sat”,   10)
Example
Sum(“value”) -> “sum”

[“word”, list<[“value”]>]      [“word”, “sum”]

(“the”,   [25, 10])            (“the”,   35)
(“man”,   [25])                (“man”,   25)
(“sat”,   [25, 10])            (“sat”,   35)
(“hello”, [42, 1])             (“hello”, 43)
(“dolly”, [42])                (“dolly”, 42)
(“say”,   [1])                 (“say”,    1)
(“woman”, [10])                (“woman”, 10)
More functionality

• Inner and outer joins natively supported
• Seamlessly branch and merge pipes of
  tuples
• Integrate diverse data sources
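
To make the difference between inner and outer joins concrete, here is a small plain-Python sketch that joins two tuple streams on their first field. It is not how Cascading implements joins; the `join` function, its `how` parameter, and the `SUMS` and `LABELS` sample data are all hypothetical and exist only for this illustration.

```python
def join(left, right, how="inner"):
    """Join two lists of (key, value) tuples on key.

    how="inner" keeps only keys present on both sides.
    how="outer" also keeps unmatched tuples, filling the missing side with None.
    """
    if how not in ("inner", "outer"):
        raise ValueError("how must be 'inner' or 'outer'")

    # Index the right side by key so each left tuple can find its matches.
    right_by_key = {}
    for key, value in right:
        right_by_key.setdefault(key, []).append(value)

    out = []
    matched = set()
    for key, lvalue in left:
        if key in right_by_key:
            matched.add(key)
            for rvalue in right_by_key[key]:
                out.append((key, lvalue, rvalue))
        elif how == "outer":
            out.append((key, lvalue, None))

    if how == "outer":
        # Emit right-side tuples whose key never appeared on the left.
        for key, rvalues in right_by_key.items():
            if key not in matched:
                for rvalue in rvalues:
                    out.append((key, None, rvalue))
    return out


# Hypothetical sample streams: ["word", "sum"] and ["word", "label"].
SUMS = [("the", 35), ("man", 25), ("say", 1)]
LABELS = [("the", "article"), ("man", "noun"), ("dolly", "name")]

if __name__ == "__main__":
    print(join(SUMS, LABELS, how="inner"))
    print(join(SUMS, LABELS, how="outer"))
```

The inner join drops “say” and “dolly” because each appears in only one stream, while the outer join keeps both and marks the missing side with `None`.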
Why not Pig?

• Pig is a custom language for writing
  MapReduce workflows
• Because it’s a custom language, intermixing
  “plain logic” in between flows is painful
• Not nearly as flexible as Cascading for
  custom needs
Learn more


• Tutorial: http://blog.rapleaf.com/dev/?p=33
• Website: http://www.cascading.org
Questions?
