Scalding by Adform Research, Alex Gryzlov


Vasil Remeniuk

What is Cascading ?
Tap / Pipe / Sink abstraction over Map / Reduce in Java

What is Scalding ?
• Scala wrapper for Cascading
• Just like working with in-memory collections!
TextLine( args("input") )
  .flatMap('line -> 'word) { line : String => tokenize(line) }
  .groupBy('word) { _.size }
  .write( Tsv( args("output") ) )
• No more scripting and UDFs!
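To make the "in-memory collections" comparison concrete, here is a minimal sketch of the same word count written against plain Scala collections, with no Scalding or Hadoop involved. The `tokenize` helper is an assumption (the slide does not show its body), and the object name is purely illustrative.

```scala
object InMemoryWordCount {
  // Hypothetical tokenizer (not shown on the slide): lower-case the line
  // and split it on runs of non-word characters.
  def tokenize(line: String): Seq[String] =
    line.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq

  // Same shape as the Scalding pipeline: flatMap lines to words,
  // group by word, then count the size of each group.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(tokenize)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }
}
```

The Scalding version reads almost the same, except that the input and output are taps (`TextLine`, `Tsv`) and the work is planned as Map/Reduce steps rather than run in memory.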

Hands on
• Clone the skeleton repository
• Get IntelliJ IDEA and the Scala plugin
• Open the project
• Compile, wait for dependencies to download
• Create a run configuration …
• Create a specs2 configuration for tests

Run the WordCountJob in local mode with the given input and output

Building and Deploying
• Get sbt
• sbt assembly produces a jar file in target/scala-2.10
• sbt s3-upload builds the jar and uploads it to S3
• Configure TeamCity

Running on EMR
• hadoop fs -get s3://dev-adform-temp-results/wordcount-job.jar job.jar
• hadoop jar job.jar
    com.twitter.scalding.Tool (entry class)
    com.adform.dspr.WordCountJob (Scalding job class)
    --hdfs (run in HDFS mode)
    --input s3://adform-dsp-metadata/countries/countries.txt (parameter)
    --output s3://dev-adform-temp-results/wordcount (parameter)
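The `--input` and `--output` parameters above are what the job later reads as `args("input")` and `args("output")`. The following is a simplified stand-in for that mapping, written in plain Scala; it is not Scalding's actual `Args` implementation, and the object and method names are hypothetical.

```scala
object ArgsSketch {
  // Turns "--key value ..." tokens into a lookup table.
  // A flag with no value (e.g. --hdfs) maps to an empty list.
  def parse(argv: Seq[String]): Map[String, List[String]] = {
    val start = (Map.empty[String, List[String]], Option.empty[String])
    val (result, _) = argv.foldLeft(start) {
      // A new "--key" starts collecting values for that key.
      case ((acc, _), token) if token.startsWith("--") =>
        val key = token.drop(2)
        (acc.updated(key, acc.getOrElse(key, Nil)), Some(key))
      // A plain token is appended to the most recent key.
      case ((acc, Some(key)), token) =>
        (acc.updated(key, acc(key) :+ token), Some(key))
      // Tokens before any key are ignored in this sketch.
      case ((acc, None), _) =>
        (acc, None)
    }
    result
  }
}
```

With the command line above, `ArgsSketch.parse(...)("input")` would hold the S3 path of the countries file, and `"hdfs"` would be present with no values.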

Under the covers
• sbt "run-main com.twitter.scalding.Tool com.adform.dspr.WordCountJob --hdfs --tool.graph --input dummy --output dummy"
• dot -Tpng com.adform.dspr.WordCountJob0.dot -o logical_plan.png
• dot -Tpng com.adform.dspr.WordCountJob0_steps.dot -o mr_plan.png

Development
• Different APIs:
• Fields – everything is a string
• Typed – working with classes, e.g. Request/Transaction

Development
• Fields:
• No need to parse columns
• Redundant
• No IDE support like auto-completion
• Typed:
• All benefits of types
• More manual work with parsing
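The trade-off between the two styles can be sketched in plain Scala, without any Scalding dependency. The `Request` record and its columns are hypothetical; the slides mention Request/Transaction classes but do not show their fields.

```scala
object TypedParsingSketch {
  // Hypothetical record type; the field names are illustrative only.
  final case class Request(id: Long, country: String, price: Double)

  // Fields style: no parsing step, columns stay strings looked up by name.
  def parseFields(line: String): Map[String, String] =
    Seq("id", "country", "price").zip(line.split("\t", -1)).toMap

  // Typed style: parse once at the edge (the extra manual work),
  // then the compiler and IDE know every field and its type.
  def parse(line: String): Option[Request] =
    line.split("\t", -1) match {
      case Array(id, country, price) =>
        scala.util.Try(Request(id.toLong, country, price.toDouble)).toOption
      case _ => None // wrong number of columns
    }

  // Downstream logic works with typed values; malformed lines are skipped.
  def totalPriceByCountry(lines: Seq[String]): Map[String, Double] =
    lines
      .flatMap(line => parse(line))
      .groupBy(_.country)
      .map { case (country, requests) => country -> requests.map(_.price).sum }
}
```

In the fields style a misspelled column name such as `"cuontry"` only fails at run time, whereas `_.cuontry` on a `Request` would not compile.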

Resources
• https://github.com/twitter/scalding
• https://github.com/twitter/scalding/tree/develop/tutorial
• https://github.com/twitter/scalding/wiki
• http://www.slideshare.net/AntwnisChalkiopoulos/scalding-presentation
• http://www.slideshare.net/ktoso/scalding-the-notsobasics-scaladays-2014
• https://gitz.adform.com/dspr/data-processing/tree/develop/jobs/process-logs-rtb

My Experience
• Running the job locally is a HUGE time saver
• Programming in Scala is amazing (no more UDFs)
• Type safety, IDE support!
• Debugging !!!!111
• More optimal job plans

My Experience
• A lot of configuring and googling random issues
• Scarce documentation, had to read source code
• IntelliJ is slow
• Boilerplate code for parsing data

Use cases
• Easy jobs → Hive
• Non-trivial jobs → Scalding
• Optional: Scalding is nice for matrix calculations. Twitter also provides many monoids (algorithms) for approximations, e.g. HyperLogLog, CountMinSketch, etc. (see Algebird).
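For readers new to the term, a monoid is a type with an associative combine operation and an identity element, which is what lets partial results from different mappers be merged in any grouping. The sketch below defines its own small `Monoid` trait for illustration; it is not Algebird's API, and it uses an exact set union where HyperLogLog would use an approximate sketch.

```scala
object MonoidSketch {
  // A monoid: an identity element plus an associative combine operation.
  trait Monoid[T] {
    def zero: T
    def plus(a: T, b: T): T
  }

  object Monoid {
    // Integer addition, e.g. for counting.
    implicit val intSum: Monoid[Int] = new Monoid[Int] {
      def zero: Int = 0
      def plus(a: Int, b: Int): Int = a + b
    }

    // Set union: an exact stand-in for distinct counting. An approximate
    // structure keeps the same interface but uses bounded memory.
    implicit def setUnion[A]: Monoid[Set[A]] = new Monoid[Set[A]] {
      def zero: Set[A] = Set.empty[A]
      def plus(a: Set[A], b: Set[A]): Set[A] = a ++ b
    }
  }

  // Generic aggregation: works for any type that has a Monoid instance.
  def sum[T](xs: Iterable[T])(implicit m: Monoid[T]): T =
    xs.foldLeft(m.zero)(m.plus)
}
```

Because the combine step is associative, `sum` gives the same answer whether the values are folded in one pass or merged from per-partition partial sums.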

process-logs-rtb
• Had to hack scalding:
• WritableMultiSinkTap
• Records
• CompressedTsv
• ModelKryoInstantiator
• Uses typed API
• Helpers like FluentJob
