SlideShare ist ein Scribd-Unternehmen logo
1 von 34
@Scalding
https://github.com/twitter/scalding


          Oscar Boykin
            Twitter
            @posco
#hadoopsummit
I encourage live tweeting (mention @posco/@scalding)
• What is Scalding?
• Why Scala for Map/Reduce?
• How is it used at Twitter?
• What’s next for Scalding?
Yep, we’re counting
               words:
Scalding jobs
subclass Job
Yep, we’re counting
                words:

Logic is in the
 constructor
Yep, we’re counting
              words:


Functions can
 be called or
defined inline
Scalding Model

• Source objects read and write data (from
  HDFS, DBs, MemCache, etc...)
• Pipes represent the flows of the data in the
  job. You can think of Pipe as a distributed
  list.
Yep, we’re counting
               words:


  Read and
  Write data
   through
Source objects
Yep, we’re counting
               words:


Data is modeled
  as streams of
named Tuples (of
     objects)
Why Scala
• The scala language has a lot of built-in
  features that make domain-specific
  languages easy to implement.
• Map/Reduce is already within the functional
  paradigm.
• Scala’s collection API covers almost all usual
  use cases.
Word Co-occurrence
Word Co-occurrence


 We can use
standard scala
  containers
Word Co-occurrence


We can do real
  logic in the
mapper without
external UDFs.
Word Co-occurrence

  Generalized
“plus” handles
lists/sets/maps
   and can be
  customized
  (implement
  Monoid[T])
GroupBuilder: enabling
 parallel reductions
           • groupBy takes a
             function that mutates a
             GroupBuilder.
           • GroupBuilder adds
             fields which are
             reductions of
             (potentially different)
             inputs.
           • On the left, we add 7
             fields.
scald.rb
• driver script that compiles the job and runs
  it locally or transfers and runs remotely.
• we plan to add EMR support.
Most functions in the API
have very close analogs in
 scala.collection.Iterable.
Cascading
• is the java library that handles most of the map/
  reduce planning for scalding.
• has years of production use.
• is used, tested, and optimized by many teams
  (Concurrent Inc., DSLs in Scala, Clojure, Python
  @Twitter. Ruby at Etsy).
• has a (very fast) local mode that runs without
  Hadoop.
• flow planner designed to be portable (cascading
  on Spark? Storm?)
mapReduceMap
• We abstract Cascading’s map-side
  aggregation ability with a function called
  mapReduceMap.
• If only mapReduceMaps are called, map-side
  aggregation works. If a foldLeft is called
  (which cannot be done map-side), scalding
  falls back to pushing everything to the
  reducers.
Most Reductions are
 mapReduceMap
Optimized Joins
• mapside join is called joinWithTiny.
  Implements left or inner join with a very
  small pipe.
• blockJoin: deals with data skew by
  replicating the data (useful for walking the
  Twitter follower graph, where everyone
  follows Gaga/Bieber/Obama).
• coming: combine the above to dynamically
  set replication on a per key basis: only Gaga
  is replicated, and just the right amount.
Scalding @Twitter
• Revenue quality team (ads targeting, market
  insight, click-prediction, traffic-quality) uses
  scalding for all our work.
• Scala engineers throughout the company
  use it (i.e. storage, platform).
• More than 60 in-production scalding jobs,
  more than 200 ad-hoc jobs.
• Not our only tool: Pig, PyCascading,
  Cascalog, Hive are also used.
Example: finding
         similarity
• A simple recommendation algorithm is
  cosine similarity.
• Represent user-tweet interaction as a
  vector, then find the users whose vectors
  point in directions near the user in
  question.
• We’ve developed a Matrix library on top of
  scalding to make this easy.
Cosine Similarity

Matrices are
 strongly
  typed.
Cosine Similarity
   Col,Row
types (Int,Int)
     can be
   anything
 comparable.
  Strings are
useful for text
    indices.
Cosine Similarity

    Value
(Double) can
 be anything
with a Ring[T]
 (plus/times)
Cosine Similarity

  Operator
 overloading
gives intuitive
    code.
Matrix in foreground,
        map/reduce behind

    With this
  syntax, we can
  focus on logic,
not how to map
linear algebra to
     Hadoop
Example uses:
• Do random-walks on the following graph.
  Matrix power iteration until convergence:
  (m * m * m * m).
• Dimensionality reduction of follower graph
  (Matrix product by a lower dimensional
  projection matrix).
• Triangle counting: (M*M*M).trace / 3
What is next?

• Improve the logical flow planning (reorder
  commuting filters/projections before maps,
  etc...).
• Improve Matrix flow planning to narrow
  the gap to hand optimized code.
One more thing:


• Type-safety geeks can relax: we just pushed
  a type-safe API to scalding 0.6.0 analogous
  to Scoobi/Scrunch/Spark, so relax.
That’s it.


• follow and mention: @scalding @posco
• pull reqs: http://github.com/twitter/scalding

Weitere ähnliche Inhalte

Was ist angesagt?

Spark schema for free with David Szakallas
Spark schema for free with David SzakallasSpark schema for free with David Szakallas
Spark schema for free with David SzakallasDatabricks
 
Mapreduce in Search
Mapreduce in SearchMapreduce in Search
Mapreduce in SearchAmund Tveit
 
Scala introduction
Scala introductionScala introduction
Scala introductionvito jeng
 
Spark Schema For Free with David Szakallas
 Spark Schema For Free with David Szakallas Spark Schema For Free with David Szakallas
Spark Schema For Free with David SzakallasDatabricks
 
An Introduction to Higher Order Functions in Spark SQL with Herman van Hovell
An Introduction to Higher Order Functions in Spark SQL with Herman van HovellAn Introduction to Higher Order Functions in Spark SQL with Herman van Hovell
An Introduction to Higher Order Functions in Spark SQL with Herman van HovellDatabricks
 
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...CloudxLab
 
Kotlin @ Coupang Backend 2017
Kotlin @ Coupang Backend 2017Kotlin @ Coupang Backend 2017
Kotlin @ Coupang Backend 2017Sunghyouk Bae
 
Kotlin @ Coupang Backed - JetBrains Day seoul 2018
Kotlin @ Coupang Backed - JetBrains Day seoul 2018Kotlin @ Coupang Backed - JetBrains Day seoul 2018
Kotlin @ Coupang Backed - JetBrains Day seoul 2018Sunghyouk Bae
 
Algebird : Abstract Algebra for big data analytics. Devoxx 2014
Algebird : Abstract Algebra for big data analytics. Devoxx 2014Algebird : Abstract Algebra for big data analytics. Devoxx 2014
Algebird : Abstract Algebra for big data analytics. Devoxx 2014Samir Bessalah
 
Intro To Cascading
Intro To CascadingIntro To Cascading
Intro To CascadingNate Murray
 
Data profiling with Apache Calcite
Data profiling with Apache CalciteData profiling with Apache Calcite
Data profiling with Apache CalciteJulian Hyde
 
Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013Samir Bessalah
 
ACM DBPL Keynote: The Graph Traversal Machine and Language
ACM DBPL Keynote: The Graph Traversal Machine and LanguageACM DBPL Keynote: The Graph Traversal Machine and Language
ACM DBPL Keynote: The Graph Traversal Machine and LanguageMarko Rodriguez
 
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Spark Summit
 
Pragmatic Real-World Scala (short version)
Pragmatic Real-World Scala (short version)Pragmatic Real-World Scala (short version)
Pragmatic Real-World Scala (short version)Jonas Bonér
 

Was ist angesagt? (20)

Scalding
ScaldingScalding
Scalding
 
Spark schema for free with David Szakallas
Spark schema for free with David SzakallasSpark schema for free with David Szakallas
Spark schema for free with David Szakallas
 
Mapreduce in Search
Mapreduce in SearchMapreduce in Search
Mapreduce in Search
 
Scala introduction
Scala introductionScala introduction
Scala introduction
 
Meet scala
Meet scalaMeet scala
Meet scala
 
Scala+data
Scala+dataScala+data
Scala+data
 
Spark Schema For Free with David Szakallas
 Spark Schema For Free with David Szakallas Spark Schema For Free with David Szakallas
Spark Schema For Free with David Szakallas
 
An Introduction to Higher Order Functions in Spark SQL with Herman van Hovell
An Introduction to Higher Order Functions in Spark SQL with Herman van HovellAn Introduction to Higher Order Functions in Spark SQL with Herman van Hovell
An Introduction to Higher Order Functions in Spark SQL with Herman van Hovell
 
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
 
Kotlin @ Coupang Backend 2017
Kotlin @ Coupang Backend 2017Kotlin @ Coupang Backend 2017
Kotlin @ Coupang Backend 2017
 
Kotlin @ Coupang Backed - JetBrains Day seoul 2018
Kotlin @ Coupang Backed - JetBrains Day seoul 2018Kotlin @ Coupang Backed - JetBrains Day seoul 2018
Kotlin @ Coupang Backed - JetBrains Day seoul 2018
 
Algebird : Abstract Algebra for big data analytics. Devoxx 2014
Algebird : Abstract Algebra for big data analytics. Devoxx 2014Algebird : Abstract Algebra for big data analytics. Devoxx 2014
Algebird : Abstract Algebra for big data analytics. Devoxx 2014
 
Intro To Cascading
Intro To CascadingIntro To Cascading
Intro To Cascading
 
Data profiling with Apache Calcite
Data profiling with Apache CalciteData profiling with Apache Calcite
Data profiling with Apache Calcite
 
Joy of scala
Joy of scalaJoy of scala
Joy of scala
 
Requery overview
Requery overviewRequery overview
Requery overview
 
Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013
 
ACM DBPL Keynote: The Graph Traversal Machine and Language
ACM DBPL Keynote: The Graph Traversal Machine and LanguageACM DBPL Keynote: The Graph Traversal Machine and Language
ACM DBPL Keynote: The Graph Traversal Machine and Language
 
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
 
Pragmatic Real-World Scala (short version)
Pragmatic Real-World Scala (short version)Pragmatic Real-World Scala (short version)
Pragmatic Real-World Scala (short version)
 

Andere mochten auch

Scala Talk at FOSDEM 2009
Scala Talk at FOSDEM 2009Scala Talk at FOSDEM 2009
Scala Talk at FOSDEM 2009Martin Odersky
 
Metaprogramming in Scala 2.10, Eugene Burmako,
Metaprogramming  in Scala 2.10, Eugene Burmako, Metaprogramming  in Scala 2.10, Eugene Burmako,
Metaprogramming in Scala 2.10, Eugene Burmako, Vasil Remeniuk
 
Scala in practice
Scala in practiceScala in practice
Scala in practiceTomer Gabel
 
Procedure Typing for Scala
Procedure Typing for ScalaProcedure Typing for Scala
Procedure Typing for Scalaakuklev
 
Spark at Twitter - Seattle Spark Meetup, April 2014
Spark at Twitter - Seattle Spark Meetup, April 2014Spark at Twitter - Seattle Spark Meetup, April 2014
Spark at Twitter - Seattle Spark Meetup, April 2014Sriram Krishnan
 
BigData, Hadoop과 Node.js, R2
BigData, Hadoop과 Node.js, R2BigData, Hadoop과 Node.js, R2
BigData, Hadoop과 Node.js, R2고포릿 default
 
Scalding - Hadoop Word Count in LESS than 70 lines of code
Scalding - Hadoop Word Count in LESS than 70 lines of codeScalding - Hadoop Word Count in LESS than 70 lines of code
Scalding - Hadoop Word Count in LESS than 70 lines of codeKonrad Malawski
 
Elixirと他言語の比較的紹介 ver.2
Elixirと他言語の比較的紹介ver.2Elixirと他言語の比較的紹介ver.2
Elixirと他言語の比較的紹介 ver.2Tsunenori Oohara
 
Go言語によるwebアプリの作り方
Go言語によるwebアプリの作り方Go言語によるwebアプリの作り方
Go言語によるwebアプリの作り方Yasutaka Kawamoto
 
PHP7はなぜ速いのか
PHP7はなぜ速いのかPHP7はなぜ速いのか
PHP7はなぜ速いのかYoshio Hanawa
 
java 8 람다식 소개와 의미 고찰
java 8 람다식 소개와 의미 고찰java 8 람다식 소개와 의미 고찰
java 8 람다식 소개와 의미 고찰Sungchul Park
 
地獄のElixir(目黒スタートアップ勉強会)
地獄のElixir(目黒スタートアップ勉強会)地獄のElixir(目黒スタートアップ勉強会)
地獄のElixir(目黒スタートアップ勉強会)Tsunenori Oohara
 
PHP7で変わること ——言語仕様とエンジンの改善ポイント
PHP7で変わること ——言語仕様とエンジンの改善ポイントPHP7で変わること ——言語仕様とエンジンの改善ポイント
PHP7で変わること ——言語仕様とエンジンの改善ポイントYoshio Hanawa
 
Apache Spark超入門 (Hadoop / Spark Conference Japan 2016 講演資料)
Apache Spark超入門 (Hadoop / Spark Conference Japan 2016 講演資料)Apache Spark超入門 (Hadoop / Spark Conference Japan 2016 講演資料)
Apache Spark超入門 (Hadoop / Spark Conference Japan 2016 講演資料)NTT DATA OSS Professional Services
 
Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)
Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)
Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)NTT DATA OSS Professional Services
 
GoによるWebアプリ開発のキホン
GoによるWebアプリ開発のキホンGoによるWebアプリ開発のキホン
GoによるWebアプリ開発のキホンAkihiko Horiuchi
 

Andere mochten auch (19)

Scala Talk at FOSDEM 2009
Scala Talk at FOSDEM 2009Scala Talk at FOSDEM 2009
Scala Talk at FOSDEM 2009
 
Scala test
Scala testScala test
Scala test
 
Metaprogramming in Scala 2.10, Eugene Burmako,
Metaprogramming  in Scala 2.10, Eugene Burmako, Metaprogramming  in Scala 2.10, Eugene Burmako,
Metaprogramming in Scala 2.10, Eugene Burmako,
 
Scala in practice
Scala in practiceScala in practice
Scala in practice
 
Scala profiling
Scala profilingScala profiling
Scala profiling
 
Procedure Typing for Scala
Procedure Typing for ScalaProcedure Typing for Scala
Procedure Typing for Scala
 
Spark at Twitter - Seattle Spark Meetup, April 2014
Spark at Twitter - Seattle Spark Meetup, April 2014Spark at Twitter - Seattle Spark Meetup, April 2014
Spark at Twitter - Seattle Spark Meetup, April 2014
 
BigData, Hadoop과 Node.js, R2
BigData, Hadoop과 Node.js, R2BigData, Hadoop과 Node.js, R2
BigData, Hadoop과 Node.js, R2
 
Scalding - Hadoop Word Count in LESS than 70 lines of code
Scalding - Hadoop Word Count in LESS than 70 lines of codeScalding - Hadoop Word Count in LESS than 70 lines of code
Scalding - Hadoop Word Count in LESS than 70 lines of code
 
RESTful Java
RESTful JavaRESTful Java
RESTful Java
 
Elixirと他言語の比較的紹介 ver.2
Elixirと他言語の比較的紹介ver.2Elixirと他言語の比較的紹介ver.2
Elixirと他言語の比較的紹介 ver.2
 
Go言語によるwebアプリの作り方
Go言語によるwebアプリの作り方Go言語によるwebアプリの作り方
Go言語によるwebアプリの作り方
 
PHP7はなぜ速いのか
PHP7はなぜ速いのかPHP7はなぜ速いのか
PHP7はなぜ速いのか
 
java 8 람다식 소개와 의미 고찰
java 8 람다식 소개와 의미 고찰java 8 람다식 소개와 의미 고찰
java 8 람다식 소개와 의미 고찰
 
地獄のElixir(目黒スタートアップ勉強会)
地獄のElixir(目黒スタートアップ勉強会)地獄のElixir(目黒スタートアップ勉強会)
地獄のElixir(目黒スタートアップ勉強会)
 
PHP7で変わること ——言語仕様とエンジンの改善ポイント
PHP7で変わること ——言語仕様とエンジンの改善ポイントPHP7で変わること ——言語仕様とエンジンの改善ポイント
PHP7で変わること ——言語仕様とエンジンの改善ポイント
 
Apache Spark超入門 (Hadoop / Spark Conference Japan 2016 講演資料)
Apache Spark超入門 (Hadoop / Spark Conference Japan 2016 講演資料)Apache Spark超入門 (Hadoop / Spark Conference Japan 2016 講演資料)
Apache Spark超入門 (Hadoop / Spark Conference Japan 2016 講演資料)
 
Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)
Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)
Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)
 
GoによるWebアプリ開発のキホン
GoによるWebアプリ開発のキホンGoによるWebアプリ開発のキホン
GoによるWebアプリ開発のキホン
 

Ähnlich wie Scalding: Twitter's Scala DSL for Hadoop/Cascading

Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsDataStax Academy
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...Jose Quesada (hiring)
 
Big data week presentation
Big data week presentationBig data week presentation
Big data week presentationJoseph Adler
 
Spark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin OderskySpark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin OderskySpark Summit
 
Why hadoop map reduce needs scala, an introduction to scoobi and scalding
Why hadoop map reduce needs scala, an introduction to scoobi and scaldingWhy hadoop map reduce needs scala, an introduction to scoobi and scalding
Why hadoop map reduce needs scala, an introduction to scoobi and scaldingXebia Nederland BV
 
Intro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezIntro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezMapR Technologies
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopAmanda Casari
 
BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internalDavid Lauzon
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark TutorialAhmet Bulut
 
Cascading on starfish
Cascading on starfishCascading on starfish
Cascading on starfishFei Dong
 
Learn about SPARK tool and it's componemts
Learn about SPARK tool and it's componemtsLearn about SPARK tool and it's componemts
Learn about SPARK tool and it's componemtssiddharth30121
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Chris Baglieri
 
How Concur uses Big Data to get you to Tableau Conference On Time
How Concur uses Big Data to get you to Tableau Conference On TimeHow Concur uses Big Data to get you to Tableau Conference On Time
How Concur uses Big Data to get you to Tableau Conference On TimeDenny Lee
 
Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onBig Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onDony Riyanto
 
Front Range PHP NoSQL Databases
Front Range PHP NoSQL DatabasesFront Range PHP NoSQL Databases
Front Range PHP NoSQL DatabasesJon Meredith
 
The Why and How of Scala at Twitter
The Why and How of Scala at TwitterThe Why and How of Scala at Twitter
The Why and How of Scala at TwitterAlex Payne
 
Getting started with Hadoop, Hive, and Elastic MapReduce
Getting started with Hadoop, Hive, and Elastic MapReduceGetting started with Hadoop, Hive, and Elastic MapReduce
Getting started with Hadoop, Hive, and Elastic MapReduceobdit
 
Scala final ppt vinay
Scala final ppt vinayScala final ppt vinay
Scala final ppt vinayViplav Jain
 

Ähnlich wie Scalding: Twitter's Scala DSL for Hadoop/Cascading (20)

Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
Big data week presentation
Big data week presentationBig data week presentation
Big data week presentation
 
Spark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin OderskySpark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin Odersky
 
Why hadoop map reduce needs scala, an introduction to scoobi and scalding
Why hadoop map reduce needs scala, an introduction to scoobi and scaldingWhy hadoop map reduce needs scala, an introduction to scoobi and scalding
Why hadoop map reduce needs scala, an introduction to scoobi and scalding
 
Intro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezIntro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco Vasquez
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code Workshop
 
ASPgems - kappa architecture
ASPgems - kappa architectureASPgems - kappa architecture
ASPgems - kappa architecture
 
BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internal
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Cascading on starfish
Cascading on starfishCascading on starfish
Cascading on starfish
 
Learn about SPARK tool and it's componemts
Learn about SPARK tool and it's componemtsLearn about SPARK tool and it's componemts
Learn about SPARK tool and it's componemts
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
 
How Concur uses Big Data to get you to Tableau Conference On Time
How Concur uses Big Data to get you to Tableau Conference On TimeHow Concur uses Big Data to get you to Tableau Conference On Time
How Concur uses Big Data to get you to Tableau Conference On Time
 
Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onBig Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-on
 
Front Range PHP NoSQL Databases
Front Range PHP NoSQL DatabasesFront Range PHP NoSQL Databases
Front Range PHP NoSQL Databases
 
Scalding intro 20141125
Scalding intro 20141125Scalding intro 20141125
Scalding intro 20141125
 
The Why and How of Scala at Twitter
The Why and How of Scala at TwitterThe Why and How of Scala at Twitter
The Why and How of Scala at Twitter
 
Getting started with Hadoop, Hive, and Elastic MapReduce
Getting started with Hadoop, Hive, and Elastic MapReduceGetting started with Hadoop, Hive, and Elastic MapReduce
Getting started with Hadoop, Hive, and Elastic MapReduce
 
Scala final ppt vinay
Scala final ppt vinayScala final ppt vinay
Scala final ppt vinay
 

Kürzlich hochgeladen

SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????blackmambaettijean
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 

Kürzlich hochgeladen (20)

SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 

Scalding: Twitter's Scala DSL for Hadoop/Cascading

  • 1. @Scalding https://github.com/twitter/scalding Oscar Boykin Twitter @posco
  • 2. #hadoopsummit I encourage live tweeting (mention @posco/@scalding)
  • 3. • What is Scalding? • Why Scala for Map/Reduce? • How is it used at Twitter? • What’s next for Scalding?
  • 4. Yep, we’re counting words: Scalding jobs subclass Job
  • 5. Yep, we’re counting words: Logic is in the constructor
  • 6. Yep, we’re counting words: Functions can be called or defined inline
  • 7. Scalding Model • Source objects read and write data (from HDFS, DBs, MemCache, etc...) • Pipes represent the flows of the data in the job. You can think of Pipe as a distributed list.
  • 8. Yep, we’re counting words: Read and Write data through Source objects
  • 9. Yep, we’re counting words: Data is modeled as streams of named Tuples (of objects)
  • 10. Why Scala • The scala language has a lot of built-in features that make domain-specific languages easy to implement. • Map/Reduce is already within the functional paradigm. • Scala’s collection API covers almost all usual use cases.
  • 12. Word Co-occurrence We can use standard scala containers
  • 13. Word Co-occurrence We can do real logic in the mapper without external UDFs.
  • 14. Word Co-occurrence Generalized “plus” handles lists/sets/maps and can be customized (implement Monoid[T])
  • 15. GroupBuilder: enabling parallel reductions • groupBy takes a function that mutates a GroupBuilder. • GroupBuilder adds fields which are reductions of (potentially different) inputs. • On the left, we add 7 fields.
  • 16. scald.rb • driver script that compiles the job and runs it locally or transfers and runs remotely. • we plan to add EMR support.
  • 17.
  • 18. Most functions in the API have very close analogs in scala.collection.Iterable.
  • 19. Cascading • is the java library that handles most of the map/ reduce planning for scalding. • has years of production use. • is used, tested, and optimized by many teams (Concurrent Inc., DSLs in Scala, Clojure, Python @Twitter. Ruby at Etsy). • has a (very fast) local mode that runs without Hadoop. • flow planner designed to be portable (cascading on Spark? Storm?)
  • 20. mapReduceMap • We abstract Cascading’s map-side aggregation ability with a function called mapReduceMap. • If only mapReduceMaps are called, map-side aggregation works. If a foldLeft is called (which cannot be done map-side), scalding falls back to pushing everything to the reducers.
  • 21. Most Reductions are mapReduceMap
  • 22.
  • 23. Optimized Joins • mapside join is called joinWithTiny. Implements left or inner join with a very small pipe. • blockJoin: deals with data skew by replicating the data (useful for walking the Twitter follower graph, where everyone follows Gaga/Bieber/Obama). • coming: combine the above to dynamically set replication on a per key basis: only Gaga is replicated, and just the right amount.
  • 24. Scalding @Twitter • Revenue quality team (ads targeting, market insight, click-prediction, traffic-quality) uses scalding for all our work. • Scala engineers throughout the company use it (i.e. storage, platform). • More than 60 in-production scalding jobs, more than 200 ad-hoc jobs. • Not our only tool: Pig, PyCascading, Cascalog, Hive are also used.
  • 25. Example: finding similarity • A simple recommendation algorithm is cosine similarity. • Represent user-tweet interaction as a vector, then find the users whose vectors point in directions near the user in question. • We’ve developed a Matrix library on top of scalding to make this easy.
  • 27. Cosine Similarity Col,Row types (Int,Int) can be anything comparable. Strings are useful for text indices.
  • 28. Cosine Similarity Value (Double) can be anything with a Ring[T] (plus/times)
  • 29. Cosine Similarity Operator overloading gives intuitive code.
  • 30. Matrix in foreground, map/reduce behind With this syntax, we can focus on logic, not how to map linear algebra to Hadoop
  • 31. Example uses: • Do random-walks on the following graph. Matrix power iteration until convergence: (m * m * m * m). • Dimensionality reduction of follower graph (Matrix product by a lower dimensional projection matrix). • Triangle counting: (M*M*M).trace / 3
  • 32. What is next? • Improve the logical flow planning (reorder commuting filters/projections before maps, etc...). • Improve Matrix flow planning to narrow the gap to hand optimized code.
  • 33. One more thing: • Type-safety geeks can relax: we just pushed a type-safe API to scalding 0.6.0 analogous to Scoobi/Scrunch/Spark, so relax.
  • 34. That’s it. • follow and mention: @scalding @posco • pull reqs: http://github.com/twitter/scalding

Hinweis der Redaktion

  1. \n
  2. \n
  3. \n
  4. \n
  5. \n
  6. \n
  7. \n
  8. \n
  9. \n
  10. \n
  11. \n
  12. \n
  13. \n
  14. \n
  15. \n
  16. \n
  17. \n
  18. \n
  19. \n
  20. \n
  21. \n
  22. \n
  23. \n
  24. \n
  25. \n
  26. \n
  27. \n
  28. \n
  29. \n
  30. \n
  31. \n
  32. \n
  33. \n
  34. \n