Why Hadoop MapReduce Needs Scala
A Look at Scoobi and Scalding
Scala DSLs for Hadoop

@agemooij
Obligatory “About Me” Slide
Hadoop Rocks!
But programming Hadoop kinda Sucks!
Hello World Word Count using Hadoop MapReduce
•   Split lines into words
•   Turn each word into a Pair(word, 1)
•   Group by word (?)
•   For each word, sum the 1s to get the total
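The four steps above can be sketched without any Hadoop dependency. What follows is a minimal in-memory illustration in plain Java, not the Hadoop MapReduce code the slide annotates; the class name `WordCountSteps` and the method `countWords` are made up for this sketch, and the "phases" are ordinary loops rather than distributed tasks.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, in-memory illustration of the four word count steps.
// It mimics the map, shuffle and reduce phases but does not use Hadoop.
public class WordCountSteps {

    public static Map<String, Integer> countWords(List<String> lines) {
        // Steps 1 and 2 (map phase): split lines into words and
        // turn each word into a (word, 1) pair.
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(new SimpleEntry<>(word, 1));
                }
            }
        }

        // Step 3 (shuffle phase): group the pairs by word.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }

        // Step 4 (reduce phase): for each word, sum the 1s to get the total.
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            int sum = 0;
            for (int one : group.getValue()) {
                sum += one;
            }
            totals.put(group.getKey(), sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello world", "hello hadoop");
        System.out.println(countWords(lines)); // prints {hadoop=1, hello=2, world=1}
    }
}
```

In a real Hadoop job, steps 1 and 2 would live in a Mapper class, step 3 is done by the framework, and step 4 would live in a Reducer class.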
•   Lots of small unintuitive Mapper and Reducer classes
•   Lots of Hadoop intrusiveness (Context, Writables, Exceptions, etc.)
•   Low level glue code
•   Actually runs the code on the cluster
This does not make me a happy Hadoop developer!
Especially for things that are a little bit more complicated than counting words

•   Unintuitive, invasive programming model
•   Hard to compose/chain jobs into real, more complicated programs
•   Lots of low-level boilerplate code
•   Branching, Joins, CoGroups, etc. hard to implement
What Are the Alternatives?
Counting Words using Apache Pig

Nice!
Already a lot better, but anything more complex gets hard pretty fast.
•   Pig is hard to customize/extend
•   Handy for quick exploration of data!
And the same goes for Hive
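package cascadingtutorial.wordcount;

// Imports assume the Cascading 1.x package layout.
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.Scheme;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Lfs;
import cascading.tap.Tap;
import cascading.tuple.Fields;

/**
 * Wordcount example in Cascading
 */
public class Main
  {
  public static void main( String[] args )
    {
      String inputPath = args[0];
      String outputPath = args[1];

      // Each input record has two named fields: the byte offset and the line text.
      Scheme inputScheme = new TextLine(new Fields("offset", "line"));
      Scheme outputScheme = new TextLine();

      // Paths with a URI scheme go to HDFS, plain paths to the local file system.
      Tap sourceTap = inputPath.matches("^[^:]+://.*") ?
        new Hfs(inputScheme, inputPath) :
        new Lfs(inputScheme, inputPath);
      Tap sinkTap = outputPath.matches("^[^:]+://.*") ?
        new Hfs(outputScheme, outputPath) :
        new Lfs(outputScheme, outputPath);

      // Split each line on whitespace and emit one "word" tuple per token.
      Pipe wcPipe = new Each("wordcount",
          new Fields("line"),
          new RegexSplitGenerator(new Fields("word"), "\\s+"),
          new Fields("word"));

      // Group by word, then count the tuples in each group.
      wcPipe = new GroupBy(wcPipe, new Fields("word"));
      wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));

      Properties properties = new Properties();
      FlowConnector.setApplicationJarClass(properties, Main.class);

      // Connect source, sink and pipe assembly into a flow and run it.
      Flow parsedLogFlow = new FlowConnector(properties)
        .connect(sourceTap, sinkTap, wcPipe);
      parsedLogFlow.start();
      parsedLogFlow.complete();
    }
  }

Very powerful!
•   Record Model
•   Pipes & Filters
•   Joins & CoGroups

•   Not very intuitive
•   Strange new abstraction
•   Lots of boilerplate code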
Meh...  I’m lazy
I want more power with less work!
How would we count words in plain Scala?
(My current language of choice)
Nice!
Familiar, intuitive
What if...?
But that code doesn’t scale to my cluster!
Or does it?

Meanwhile at Google...
Introducing Scoobi & Scalding
Scala DSLs for Hadoop MapReduce

NOTE: My relative familiarity with either platform:
•   Scalding: 5%
•   Scoobi: 95%
http://github.com/nicta/scoobi

A Scala library that implements a higher level programming model for Hadoop MapReduce
Counting Words using Scoobi

•   Split lines into words
•   Turn each word into a Pair(word, 1)
•   Group by word
•   For each word, sum the 1s to get the total
•   Actually runs the code on the cluster
Scoobi is...
•   A distributed collections abstraction:
    •   Distributed collection objects abstract data in HDFS
    •   Methods on these objects abstract map/reduce
        operations
    •   Programs manipulate distributed collection objects
    •   Scoobi turns these manipulations into MapReduce jobs
    •   Based on Google’s FlumeJava / Cascades
•   A source code generator (it generates Java code!)
•   A job plan optimizer
•   Open sourced by NICTA
•   Written in Scala (W00t!)
DList[T]
•   Abstracts storage of data and files on HDFS
•   Calling methods on DList objects to transform and
    manipulate them abstracts the mapper, combiner,
    sort-and-shuffle, and reducer phases of MapReduce
•   Persisting a DList triggers compilation of the graph
    into one or more MR jobs and their execution
•   Very familiar: like standard Scala Lists
•   Strongly typed
•   Parameterized with rich types and Tuples
•   Easy list manipulation using typical higher order
    functions like map, flatMap, filter, etc.
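The point that persisting a DList is what triggers execution can be illustrated with a small deferred-execution sketch. This is plain Java with a made-up `LazyList` class, not Scoobi's API: transformations only record a plan, and the work runs when `persist()` is called. It runs in memory, whereas Scoobi compiles the recorded graph into MapReduce jobs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical sketch of deferred execution. This is not Scoobi's API:
// a real DList is compiled into MapReduce jobs instead of running in memory.
public final class LazyList<T> {
    private final Supplier<List<T>> compute; // the deferred work
    private final List<String> plan;         // names of the recorded steps

    private LazyList(Supplier<List<T>> compute, List<String> plan) {
        this.compute = compute;
        this.plan = plan;
    }

    // Wraps an in-memory list; stands in for reading a file from HDFS.
    public static <T> LazyList<T> from(List<T> source) {
        return new LazyList<T>(() -> new ArrayList<T>(source),
                               Collections.singletonList("read"));
    }

    private List<String> extend(String step) {
        List<String> extended = new ArrayList<String>(plan);
        extended.add(step);
        return extended;
    }

    // Records a map step; f is not called until persist() runs.
    public <R> LazyList<R> map(Function<T, R> f) {
        return new LazyList<R>(() -> {
            List<R> out = new ArrayList<R>();
            for (T item : compute.get()) {
                out.add(f.apply(item));
            }
            return out;
        }, extend("map"));
    }

    // Records a flatMap step; f is not called until persist() runs.
    public <R> LazyList<R> flatMap(Function<T, List<R>> f) {
        return new LazyList<R>(() -> {
            List<R> out = new ArrayList<R>();
            for (T item : compute.get()) {
                out.addAll(f.apply(item));
            }
            return out;
        }, extend("flatMap"));
    }

    public List<String> plan() {
        return plan;
    }

    // Executes every recorded step and returns the result.
    public List<T> persist() {
        return compute.get();
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        LazyList<Integer> lengths = LazyList.from(Arrays.asList("a bb", "ccc"))
            .flatMap(line -> Arrays.asList(line.split(" ")))
            .map(word -> { calls.incrementAndGet(); return word.length(); });
        System.out.println(calls.get());       // prints 0: only the plan exists
        System.out.println(lengths.plan());    // prints [read, flatMap, map]
        System.out.println(lengths.persist()); // prints [1, 2, 3]
    }
}
```

Because the whole plan is known before anything runs, a library built this way has room to optimize the job graph before execution.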
DList[T]
IO
•   Can read/write text files, Sequence files and Avro files
•   Can influence sorting (raw, secondary)

Serialization
•   Serialization of custom types through Scala type classes and WireFormat[T]
•   Scoobi implements WireFormat[T] for primitive types, strings, tuples, Option[T], Either[A, B], Iterable[T], etc.
•   Out of the box support for serialization of Scala case classes
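The type class idea behind WireFormat[T] can be approximated in Java as an interface with one instance per type, where the format for a custom type is composed from the formats of its fields. The sketch below is an analogy under that assumption, not Scoobi's implementation; `WireFormatSketch`, `WordCount`, and `roundTrip` are names made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical Java analogue of a serialization type class.
public class WireFormatSketch {

    // One instance of this interface describes how to (de)serialize one type.
    public interface WireFormat<T> {
        void toWire(T value, DataOutput out) throws IOException;
        T fromWire(DataInput in) throws IOException;
    }

    public static final WireFormat<String> STRING = new WireFormat<String>() {
        public void toWire(String value, DataOutput out) throws IOException {
            out.writeUTF(value);
        }
        public String fromWire(DataInput in) throws IOException {
            return in.readUTF();
        }
    };

    public static final WireFormat<Integer> INT = new WireFormat<Integer>() {
        public void toWire(Integer value, DataOutput out) throws IOException {
            out.writeInt(value);
        }
        public Integer fromWire(DataInput in) throws IOException {
            return in.readInt();
        }
    };

    // A custom type, comparable to a small Scala case class.
    public static final class WordCount {
        public final String word;
        public final int count;

        public WordCount(String word, int count) {
            this.word = word;
            this.count = count;
        }
    }

    // The format for the custom type reuses the formats of its fields.
    public static final WireFormat<WordCount> WORD_COUNT = new WireFormat<WordCount>() {
        public void toWire(WordCount value, DataOutput out) throws IOException {
            STRING.toWire(value.word, out);
            INT.toWire(value.count, out);
        }
        public WordCount fromWire(DataInput in) throws IOException {
            String word = STRING.fromWire(in);
            int count = INT.fromWire(in);
            return new WordCount(word, count);
        }
    };

    // Serializes a value to bytes and reads it back with the same format.
    public static <T> T roundTrip(T value, WireFormat<T> format) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            format.toWire(value, new DataOutputStream(bytes));
            return format.fromWire(
                new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        WordCount copy = roundTrip(new WordCount("hadoop", 3), WORD_COUNT);
        System.out.println(copy.word + " " + copy.count); // prints hadoop 3
    }
}
```

In Scala the matching instance is found implicitly by the compiler, so the format does not have to be passed around by hand as it is here.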
IO/Serialization I
IO/Serialization II
For normal (i.e. non-case) classes
Further Info
Version 0.4 released today (!)
•   Avro, Sequence Files
•   Materialized DObjects
•   DList reduction methods (product, min, etc.)
•   Vastly improved testing support
•   Less overhead
•   Much more

http://nicta.github.com/scoobi/

scoobi-dev@googlegroups.com
scoobi-users@googlegroups.com
Scalding!

http://github.com/twitter/scalding

A Scala library that implements a higher level programming model for Cascading (rather than for Hadoop MapReduce directly)
Counting Words using Scalding
Scalding is...
•   A distributed collections abstraction

•   A wrapper around Cascading (i.e. no source code
    generation)

•   Based on the same record model (i.e. named fields)

•   Less strongly typed

•   Uses Kryo Serialization

•   Used by Twitter in production

•   Written in Scala (W00t!)
Further Info
Current version: 0.5.4



http://github.com/twitter/scalding
https://github.com/twitter/scalding/wiki

@scalding

cascading-user@googlegroups.com

http://blog.echen.me/2012/02/09/movie-recommendations-and-more-via-mapreduce-and-scalding/
How do they compare?
•   Different approaches, similar power
•   Small feature differences, which will even out over time
•   Scoobi gets a little closer to idiomatic Scala
•   Twitter is definitely a bigger fish than NICTA, so Scalding gets all the attention
•   Both open sourced (last year)
•   Scoobi has better docs!
Which one should I use?
Ehm...

    ...I’m extremely prejudiced!
Questions?
