Scoobi

Ben Lever
@bmlever
Me
• Machine learning, software systems, computer vision, optimisation, networks, control and signal processing
• Haskell DSL for development of computer vision algorithms targeting GPUs
• Predictive analytics for the enterprise
Hadoop app development – wish list
• Quick dev cycles
• Expressive
• Reusability
• Type safety
• Reliability
Bridging the “tooling” gap

Scoobi sits on top of the Java APIs and Hadoop MapReduce and provides:
• Implementation – the DList and DObject abstractions for building MapReduce pipelines
• Testing – ScalaCheck-based specifications
At a glance
•   Scoobi = Scala for Hadoop
•   Inspired by Google’s FlumeJava
•   Developed at NICTA
•   Open-sourced Oct 2011
•   Apache V2
Hadoop MapReduce – word count

Input splits 323–326 contain lines such as “... cat ...”, “... hello ... cat ...”, “... hello ... fire ...” and “... fire ... cat ...”.

(k1, v1) → [(k2, v2)]       Mappers: each emits (word, 1) pairs, e.g. (cat, 1), (hello, 1), (fire, 1)
[(k2, v2)] → [(k2, [v2])]   Sort and shuffle: aggregate values by key, e.g. (hello, [1, 1]), (cat, [1, 1, 1]), (fire, [1, 1])
(k2, [v2]) → [(k3, v3)]     Reducers: sum the values for each key, giving (hello, 2), (cat, 3), (fire, 2)
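To make the (k, v) signatures above concrete, here is a plain-Scala sketch of the same three stages over ordinary in-memory collections; it is purely illustrative (no Hadoop, no Scoobi) and the input lines are invented for the example.

// Illustrative only: the three word-count stages over in-memory collections.
object WordCountStages extends App {
  val docs = Seq("hello cat", "cat", "hello fire", "fire cat")

  // (k1, v1) -> [(k2, v2)]: "mappers" emit (word, 1) pairs
  val mapped: Seq[(String, Int)] = docs.flatMap(_.split(" ").map(w => (w, 1)))

  // [(k2, v2)] -> [(k2, [v2])]: "sort and shuffle" groups values by key
  val shuffled: Map[String, Seq[Int]] =
    mapped.groupBy(_._1).map { case (w, pairs) => (w, pairs.map(_._2)) }

  // (k2, [v2]) -> [(k3, v3)]: "reducers" sum the values for each key
  val reduced: Map[String, Int] = shuffled.map { case (w, counts) => (w, counts.sum) }

  println(reduced)  // e.g. Map(hello -> 2, cat -> 3, fire -> 2)
}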
Java style

public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
  }
}

Source: http://wiki.apache.org/hadoop/WordCount
DList abstraction

A Distributed List (DList) is an abstraction over data stored on HDFS; programs are written as transformations from one DList to another.

DList type                          Abstraction for
DList[String]                       Lines of text files
DList[(Int, String, Boolean)]       CSV files of the form “37,Joe,M”
DList[(Float, Map[String, Int])]    Avro files with schema: {record { int, map }}
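As a small, hedged illustration of the first two rows (using only fromTextFile, map and DList, which appear elsewhere in this deck; the CSV field layout “age,name,gender” is an assumption made for the example):

// Illustrative only: building the first two DList shapes from text input.
val lines: DList[String] = fromTextFile(args(0))

val people: DList[(Int, String, Boolean)] =
  lines map { line =>
    val Array(age, name, gender) = line.split(",")   // assumed "37,Joe,M" layout
    (age.toInt, name, gender == "M")
  }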
Scoobi style

import com.nicta.scoobi.Scoobi._

// Count the frequency of words from a corpus of documents
object WordCount extends ScoobiApp {
  def run() {
    val lines: DList[String] = fromTextFile(args(0))

    val freqs: DList[(String, Int)] =
      lines.flatMap(_.split(" "))   // DList[String]
           .map(w => (w, 1))        // DList[(String, Int)]
           .groupByKey              // DList[(String, Iterable[Int])]
           .combine(_+_)            // DList[(String, Int)]

    persist(toTextFile(freqs, args(1)))
  }
}
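For contrast with the Java version, the same pipeline can also be written with every intermediate value named and typed explicitly (a sketch using only operations already shown above); this is where the compile-time type safety from the wish list shows up:

// Sketch: the word-count pipeline with each intermediate DList named and typed.
val lines:  DList[String]                  = fromTextFile(args(0))
val words:  DList[String]                  = lines.flatMap(_.split(" "))
val pairs:  DList[(String, Int)]           = words.map(w => (w, 1))
val groups: DList[(String, Iterable[Int])] = pairs.groupByKey
val freqs:  DList[(String, Int)]           = groups.combine(_+_)
persist(toTextFile(freqs, args(1)))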
DList trait

trait DList[A] {
  /* Abstract methods */
  def parallelDo[B](dofn: DoFn[A, B]): DList[B]

  def ++(that: DList[A]): DList[A]

  def groupByKey[K, V]
      (implicit ev: A <:< (K, V)): DList[(K, Iterable[V])]

  def combine[K, V]
      (f: (V, V) => V)
      (implicit ev: A <:< (K, Iterable[V])): DList[(K, V)]

  /* All other methods are derived, e.g. 'map' */
}
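The comment above says the remaining combinators are derived from these primitives. A hedged sketch of how 'map' could be expressed inside the trait via parallelDo follows; the exact DoFn/Emitter shape is an assumption made for illustration, not quoted from Scoobi's source:

// Sketch only: deriving 'map' from the abstract 'parallelDo' primitive.
// The setup/process/cleanup interface of DoFn is assumed for this example.
def map[B](f: A => B): DList[B] =
  parallelDo(new DoFn[A, B] {
    def setup() {}
    def process(input: A, emitter: Emitter[B]) { emitter.emit(f(input)) }
    def cleanup(emitter: Emitter[B]) {}
  })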
Under the hood
fromTextFile               LD

                   lines           HDFS
    flatMap                PD

                  words

        map                PD    MapReduce Job


                  word1

 groupByKey                GBK

                  wordG            HDFS

    combine                CV

                   freq
    persist
Removing less than the average

import com.nicta.scoobi.Scoobi._

// Remove all integers that are less than the average integer
object BetterThanAverage extends ScoobiApp {
  def run() {
    val ints: DList[Int] =
      fromTextFile(args(0)) collect { case AnInt(i) => i }

    val total: DObject[Int] = ints.sum
    val count: DObject[Int] = ints.size

    val average: DObject[Int] =
      (total, count) map { case (t, c) => t / c }

    val bigger: DList[Int] =
      (average join ints) filter { case (a, i) => i > a }

    persist(toTextFile(bigger, args(1)))
  }
}
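AnInt above is an extractor that keeps only the lines that parse as integers. A minimal sketch of such an extractor follows; whether Scoobi's own version looks exactly like this is an assumption:

// Sketch: an extractor that matches strings which parse as Int.
object AnInt {
  def unapply(s: String): Option[Int] =
    try Some(s.trim.toInt) catch { case _: NumberFormatException => None }
}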
Under the hood, this pipeline compiles into several steps:

• A first MapReduce job loads the input from HDFS (LD), applies the parallelDo for ints (PD), then runs two groupByKey/combine branches (GBK, CV) and materialises their results (M) as total and count.
• A client computation (OP) combines total and count into average, which is written to HDFS and the distributed cache (DCache).
• A second MapReduce job (PD nodes) joins average against ints to produce bigger, which is written to HDFS.
DObject abstraction

A DObject[A] is a distributed value stored in HDFS and the distributed cache. Client-side computations are expressed with map, and a DObject[A] can be joined with a DList[B] to produce a DList[(A, B)] (backed by HDFS plus the distributed cache):

trait DObject[A] {
  def map[B](f: A => B): DObject[B]
  def join[B](list: DList[B]): DList[(A, B)]
}
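A small usage sketch combining the two methods, reusing only operations that appear elsewhere in this deck (the input parsing mirrors the earlier example):

// Sketch: distribute a client-side total back across a DList.
val nums:  DList[Int]   = fromTextFile(args(0)) collect { case AnInt(i) => i }
val total: DObject[Int] = nums.sum

val withTotal: DList[(Int, Int)] = total join nums   // (total, element) pairs
val fractions: DList[Double]     = withTotal map { case (t, n) => n.toDouble / t }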
Mirroring the Scala Collection API

DList => DList:     flatMap, map, filter, filterNot, groupBy, partition, flatten, distinct, ++, keys, values
DList => DObject:   reduce, product, sum, length, size, count, max, maxBy, min, minBy
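For example (a sketch using only methods from the two lists above):

// Sketch: collection-style methods over DLists.
val lines: DList[String]        = fromTextFile(args(0))
val words: DList[String]        = lines.flatMap(_.split(" "))
val longWords: DList[String]    = words filter (_.length > 5)
val distinctLong: DObject[Int]  = longWords.distinct.size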
Building abstractions

Functional programming – functions as procedures and functions as parameters – gives composability and reusability.
Composing

// Compute the average of a DList of "numbers"
def average[A : Numeric](in: DList[A]): DObject[A] =
  (in.sum, in.size) map { case (sum, size) => sum / size }

// Compute histogram
def histogram[A](in: DList[A]): DList[(A, Int)] =
  in.map(x => (x, 1)).groupByKey.combine(_+_)

// Throw away words with less-than-average frequency
def betterThanAvgWords(lines: DList[String]): DList[String] = {
  val words    = lines.flatMap(_.split(" "))
  val wordCnts = histogram(words)
  val avgFreq  = average(wordCnts.values)
  (avgFreq join wordCnts) collect { case (avg, (w, f)) if f > avg => w }
}
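These pieces compose directly into an application (a sketch reusing ScoobiApp, fromTextFile, toTextFile and persist from the earlier slides):

// Sketch: wiring the composed functions into a runnable application.
object BetterThanAvgWordsApp extends ScoobiApp {
  def run() {
    val lines = fromTextFile(args(0))
    persist(toTextFile(betterThanAvgWords(lines), args(1)))
  }
}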
Unit-testing ‘histogram’

// Specification for the histogram function (properties are written with ScalaCheck)
class HistogramSpec extends HadoopSpecification {

  "Histogram from DList" >> {

    "Sum of bins must equal size of DList" >> { implicit c: SC =>
      Prop.forAll { list: List[Int] =>
        val hist   = histogram(list.toDList)
        val binSum = persist(hist.values.sum)
        binSum == list.size
      }
    }

    "Number of bins must equal number of unique values" >> { implicit c: SC =>
      Prop.forAll { list: List[Int] =>
        val input   = list.toDList
        val bins    = histogram(input).keys.size
        val uniques = input.distinct.size
        val (b, u)  = persist(bins, uniques)
        b == u
      }
    }
  }
}
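A further property in the same style, not from the original deck and reusing only toDList, keys, size and the two-argument persist shown above, could check that the histogram is insensitive to input order:

// Sketch: one more ScalaCheck property following the pattern above.
"Histogram does not depend on input order" >> { implicit c: SC =>
  Prop.forAll { list: List[Int] =>
    val (b1, b2) = persist(histogram(list.toDList).keys.size,
                           histogram(list.reverse.toDList).keys.size)
    b1 == b2
  }
}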
sbt integration

> test-only *Histogram* -- exclude cluster
[info] HistogramSpec
[info]
[info] Histogram from DList
[info] + Sum of bins must equal size of DList
[info] No cluster execution time
[info] + Number of bins must equal number of unique values
[info] No cluster execution time
[info]
[info]
[info] Total for specification BoundedFilterSpec
[info] Finished in 12 seconds, 600 ms
[info] 2 examples, 4 expectations, 0 failure, 0 error
[info]
[info] Passed: : Total 2, Failed 0, Errors 0, Passed 2, Skipped 0
>
> test-only *Histogram*
> test-only *Histogram* -- scoobi verbose
> test-only *Histogram* -- scoobi verbose.warning

Note: dependent JARs are copied (once) to a directory on the cluster (~/libjars by default).
Other features
• Grouping:
  – API for controlling Hadoop’s sort-and-shuffle
  – Useful for implementing secondary sorting
• Join and Co-group helper methods
• Matrix multiplication utilities
• I/O:
  – Text, sequence, Avro
  – Roll your own
Want to know more?
• http://nicta.github.com/scoobi
• Mailing lists:
  – http://groups.google.com/group/scoobi-users
  – http://groups.google.com/group/scoobi-dev
• Twitter:
  – @bmlever
  – @etorreborre
• Meet me:
  – Will also be at Hadoop Summit (June 13-14)
  – Keen to get feedback
