3. What is Cascading?
Tap / Pipe / Sink abstraction over Map / Reduce in Java
4. What is Scalding?
• Scala wrapper for Cascading
• Just like working with in-memory collections!
TextLine(args("input"))
  .flatMap('line -> 'word) { line: String => tokenize(line) }
  .groupBy('word) { _.size }
  .write(Tsv(args("output")))
• No more scripting and UDFs!
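To make the "in-memory collections" comparison concrete, here is a minimal sketch of the same word count over a plain Scala Seq, with no Scalding or Hadoop involved. The tokenize helper is an assumption (the slide does not show its body); here it lowercases the line and splits on non-word characters.

```scala
object InMemoryWordCount {
  // Hypothetical tokenizer: the slide's tokenize is not shown, so this
  // simply lowercases the line and splits on runs of non-word characters.
  def tokenize(line: String): Seq[String] =
    line.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq

  // Same shape as the Scalding pipeline: flatMap to words, group, count.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(tokenize)
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }
}
```

The flatMap / groupBy / size chain reads almost the same as the Scalding job; the difference is that Scalding plans those steps as Map / Reduce stages instead of running them in memory.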
5. Hands on
• Clone the skeleton repository
• Get IntelliJ IDEA and the Scala plugin
• Open the project
• Compile, wait for dependencies to download
• Create a run configuration …
• Create a specs2 configuration for tests
7. Building and Deploying
• Get sbt
• sbt assembly produces a jar file in target/scala-2.10
• sbt s3-upload produces the jar and uploads it to S3
• Configure TeamCity
8. Running on EMR
• hadoop fs -get s3://dev-adform-temp-results/wordcount-job.jar job.jar
• hadoop jar job.jar
    com.twitter.scalding.Tool (entry class)
    com.adform.dspr.WordCountJob (Scalding job class)
    --hdfs (run in HDFS mode)
    --input s3://adform-dsp-metadata/countries/countries.txt (parameter)
    --output s3://dev-adform-temp-results/wordcount (parameter)
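The --input and --output parameters are what the job reads back through args("input") and args("output"). As a rough illustration of that mapping, here is a simplified sketch of turning "--key value" tokens into a lookup table. It is not Scalding's actual argument parser, and ArgsSketch and parse are hypothetical names.

```scala
object ArgsSketch {
  // Groups "--key v1 v2 ..." tokens into key -> values.
  // Tokens seen before any "--key" are collected under the empty key.
  def parse(tokens: Seq[String]): Map[String, List[String]] =
    tokens
      .foldLeft(List.empty[(String, List[String])]) {
        case (acc, t) if t.startsWith("--") =>
          (t.drop(2), List.empty[String]) :: acc   // start a new key
        case ((key, values) :: rest, t) =>
          (key, values :+ t) :: rest               // append to the latest key
        case (Nil, t) =>
          List(("", List(t)))                      // positional token
      }
      .toMap
}
```

With this sketch, a flag such as --hdfs ends up as a key with no values, while --input and --output each carry a single path.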
12. Development
• Fields:
  • No need to parse columns
  • Redundant
  • No IDE support like auto-completion
• Typed:
  • All benefits of types
  • More manual work with parsing
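The "manual work with parsing" on the typed side usually means turning each raw line into a case class by hand. A minimal sketch in plain Scala, assuming a hypothetical two-column, tab-separated record (the Country schema is invented for illustration and is not taken from the slides):

```scala
object TypedParsingSketch {
  // Hypothetical record type; the real column layout is not shown in the slides.
  final case class Country(code: String, name: String)

  // Parses one tab-separated line; malformed lines yield None instead of throwing.
  def parse(line: String): Option[Country] =
    line.split("\t", -1) match {
      case Array(code, name) => Some(Country(code, name))
      case _                 => None
    }
}
```

Once lines are parsed, the compiler and the IDE know every field name and type, which is the benefit the typed approach trades the extra parsing code for.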
14. My Experience
• Running the job locally is a HUGE time saver
• Programming in Scala is amazing (no more UDFs)
• Type safety, IDE support!
• Debugging!
• Better-optimized job plans
15. My Experience
• A lot of configuring and googling random issues
• Scarce documentation, had to read source code
• IntelliJ is slow
• Boilerplate code for parsing data
16. Use cases
• Easy jobs → Hive
• Non-trivial jobs → Scalding
• Optional: Scalding is nice for matrix calculations. Twitter also
provides many monoids (algorithms) for good approximations, e.g.
HyperLogLog, CountMinSketch, etc. (see Algebird).
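Monoids fit Map / Reduce well because an associative combine with a neutral element lets partial results be merged in any grouping. The sketch below shows the idea in plain Scala; the Monoid trait and its instances are written here for illustration and are not Algebird's API.

```scala
object MonoidSketch {
  // Illustrative type class: a neutral element plus an associative combine.
  trait Monoid[A] {
    def zero: A
    def plus(x: A, y: A): A
  }

  object Monoid {
    implicit val intSum: Monoid[Int] = new Monoid[Int] {
      def zero: Int = 0
      def plus(x: Int, y: Int): Int = x + y
    }

    // Maps merge key by key, combining values with the value monoid.
    implicit def mapMerge[K, V](implicit vm: Monoid[V]): Monoid[Map[K, V]] =
      new Monoid[Map[K, V]] {
        def zero: Map[K, V] = Map.empty
        def plus(x: Map[K, V], y: Map[K, V]): Map[K, V] =
          y.foldLeft(x) { case (acc, (k, v)) =>
            acc.updated(k, vm.plus(acc.getOrElse(k, vm.zero), v))
          }
      }
  }

  // Folds any sequence whose element type has a Monoid instance.
  def sum[A](xs: Seq[A])(implicit m: Monoid[A]): A =
    xs.foldLeft(m.zero)(m.plus)
}
```

Approximate structures such as HyperLogLog and CountMinSketch follow the same pattern: each partition builds its own sketch, and the sketches are then combined with plus.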
17. process-logs-rtb
• Had to hack Scalding:
  • WritableMultiSinkTap
  • Records
  • CompressedTsv
  • ModelKryoInstantiator
• Uses the typed API
• Helpers like FluentJob