Writing Hadoop Jobs in Scala using Scalding
@tonicebrian
How much storage can $100 buy you?

1980: 1 photo
1990: 5 songs
2000: 7 movies
2010: 600 movies / 170,000 songs / 5 million photos
From single drives… to clusters…
Data Science

“A mathematician is a device for turning coffee into theorems”
— Alfréd Rényi

“A data scientist is a device for turning coffee and data into insights”
Hadoop = Distributed File System (Storage) + Map/Reduce (Programming Model)
Word Count

Raw:
Hello cruel world
Say hello! Hello!

Map:
(hello, 1) (cruel, 1) (world, 1)
(say, 1) (hello, 2)

Reduce:
hello: 1 + 2
cruel: 1
world: 1
say: 1

Result:
hello 3
cruel 1
world 1
say 1
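The same phases can be sketched with ordinary Scala collections — a minimal, in-memory illustration of the diagram above, with no Hadoop involved:

// In-memory sketch of the map and reduce phases above.
val raw = List("Hello cruel world", "Say hello! Hello!")

// Map phase: emit (word, 1) for every token.
val mapped: List[(String, Int)] =
  raw.flatMap(_.toLowerCase.split("\\W+")).map(word => (word, 1))

// Reduce phase: sum the counts per word.
val reduced: Map[String, Int] =
  mapped.groupBy(_._1).map { case (word, ones) => word -> ones.map(_._2).sum }

// reduced == Map("hello" -> 3, "cruel" -> 1, "world" -> 1, "say" -> 1)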
4 Main Characteristics of Scala

• JVM
• Statically Typed
• Object Oriented
• Functional Programming
def map[B](f: (A) ⇒ B): List[B]
Builds a new collection by applying a function to all elements of this list.

def reduce[A1 >: A](op: (A1, A1) ⇒ A1): A1
Reduces the elements of this list using the specified associative binary operator.
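For instance, on an ordinary List the two compose into a one-line aggregation:

val xs = List(1, 2, 3, 4)
val doubled = xs.map(_ * 2)          // List(2, 4, 6, 8)
val total   = doubled.reduce(_ + _)  // 20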
Recap

Map/Reduce
• Programming paradigm that employs concepts from Functional Programming

Scala
• Functional Language that runs on the JVM

Hadoop
• Open Source Implementation of MR in the JVM
So in what language is Hadoop
implemented?
The Result?

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
  }
}
High level approaches

• SQL
• Data Transformations

input_lines = LOAD 'myfile.txt' AS (line:chararray);
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
filtered_words = FILTER words BY word MATCHES '\\w+';
word_groups = GROUP filtered_words BY word;
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';
User defined functions (UDF)

Pig:

-- myscript.pig
REGISTER myudfs.jar;
A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float);
B = FOREACH A GENERATE myudfs.UPPER(name);
DUMP B;

Java:

package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.WrappedIOException;

public class UPPER extends EvalFunc<String> {
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0)
      return null;
    try {
      String str = (String) input.get(0);
      return str.toUpperCase();
    } catch (Exception e) {
      throw WrappedIOException.wrap("Caught exception processing input row ", e);
    }
  }
}
WordCount in Cascading

package impatient;

import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class Main {
  public static void main( String[] args )
  {
    String docPath = args[ 0 ];
    String wcPath = args[ 1 ];

    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

    // create source and sink taps
    Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
    Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

    // specify a regex operation to split the "document" text lines into a token stream
    Fields token = new Fields( "token" );
    Fields text = new Fields( "text" );
    RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
    // only returns "token"
    Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

    // determine the word counts
    Pipe wcPipe = new Pipe( "wc", docPipe );
    wcPipe = new GroupBy( wcPipe, token );
    wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef()
      .setName( "wc" )
      .addSource( docPipe, docTap )
      .addTailSink( wcPipe, wcTap );

    // write a DOT file and run the flow
    Flow wcFlow = flowConnector.connect( flowDef );
    wcFlow.writeDOT( "dot/wc.dot" );
    wcFlow.complete();
  }
}
Good parts
• Data Flow Programming Model
• User Defined Functions

Bad
• Still Java
• Objects for Flows
package com.twitter.scalding.examples

import com.twitter.scalding._

class WordCountJob(args : Args) extends Job(args) {
  TextLine( args("input") )
    .flatMap('line -> 'word) { line : String => tokenize(line) }
    .groupBy('word) { _.size }
    .write( Tsv( args("output") ) )

  // Split a piece of text into individual words.
  def tokenize(text : String) : Array[String] = {
    // Lowercase each word and remove punctuation.
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
  }
}
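For comparison, a minimal sketch of the same job against Scalding's typed API (TypedPipe), reusing the tokenize helper shown above:

import com.twitter.scalding._

class TypedWordCountJob(args : Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))
    .flatMap(tokenize)         // one token per record
    .groupBy(identity)         // key each token by itself
    .size                      // count occurrences per key
    .write(TypedTsv[(String, Long)](args("output")))

  def tokenize(text : String) : Array[String] =
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
}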
TDD Cycle

Red → Green → Refactor

Broader view

The Red → Green → Refactor loop nests inside larger feedback cycles:
Unit Testing → Acceptance Testing → Continuous Deployment → Lean Startup

Big Data

Big Speed
A typical day working with Hadoop
Is Scalding of any help here?

0 Size of code
1 Types
2 Unit Testing
3 Local execution
1 Types
An extra cycle

Compilation Phase → Unit Testing → Acceptance Testing → Continuous Deployment → Lean Startup
Static typechecking makes you a better programmer™
Fail-fast with type errors
(Int,Int,Int,Int)
TypedPipe[(Meters,Miles,Celsius,Fahrenheit)]

val w = 5
val x = 5
val y = 5
val z = 5

w + x + y + z = 20

val w = Meters(5)
val x = Miles(5)
val y = Celsius(5)
val z = Fahrenheit(5)

w + x + y + z // => type error
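Scalding does not ship these unit types; a minimal sketch of how they could be defined as plain case classes that only add to their own kind:

// Hypothetical unit wrappers (not part of Scalding).
case class Meters(value: Int) { def +(that: Meters): Meters = Meters(value + that.value) }
case class Miles(value: Int)  { def +(that: Miles): Miles   = Miles(value + that.value) }

Meters(5) + Meters(10)   // Meters(15)
// Meters(5) + Miles(5)  // does not compile: type mismatch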
2 Unit Testing

How do you test a distributed algorithm without a distributed platform?
Source

Tap
// Scalding
import com.twitter.scalding._

class WordCountTest extends Specification with TupleConversions {
  "A WordCount job" should {
    JobTest("com.snowplowanalytics.hadoop.scalding.WordCountJob").
      arg("input", "inputFile").
      arg("output", "outputFile").
      source(TextLine("inputFile"), List("0" -> "hack hack hack and hack")).
      sink[(String,Int)](Tsv("outputFile")){ outputBuffer =>
        val outMap = outputBuffer.toMap
        "count words correctly" in {
          outMap("hack") must be_==(4)
          outMap("and") must be_==(1)
        }
      }.
      run.
      finish
  }
}
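JobTest wires the declared sources and sinks to in-memory buffers, so the whole flow runs inside a single JVM — no cluster, no HDFS — which is what makes unit tests like this fast enough for the TDD loop.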
3 Local Execution

HDFS

Local
SBT as a REPL

> run-main com.twitter.scalding.Tool MyJob --local
> run-main com.twitter.scalding.Tool MyJob --hdfs
More Scalding goodness

• Algebird
• Matrix library
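A small taste of Algebird (a sketch assuming the com.twitter.algebird artifact is on the classpath): its monoids make partial aggregates combine associatively, which is exactly what a reducer needs.

import com.twitter.algebird.Operators._

// Maps combine value-wise because Int has a Monoid:
Map("hello" -> 2, "world" -> 1) + Map("hello" -> 1, "say" -> 1)
// => Map("hello" -> 3, "world" -> 1, "say" -> 1)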