SlideShare ist ein Scribd-Unternehmen logo
1 von 30
Downloaden Sie, um offline zu lesen
Scio
A Scala API for Google Cloud Dataflow
Neville Li @sinisa_lyh
Who am I?
Origin Story
Scalding and Spark
ML, recommendations, analytics
50+ users, 400+ unique jobs
Moving to
Google Cloud
Early 2015 - Dataflow Scala hack project
What is Dataflow?
Data model
Spark
• RDD for batch, DStream for streaming
• Explicit caching semantics
• Two sets ofAPIs
Dataflow
• PCollection for both batch and streaming
• Windowed and timestamped values
• One unifiedAPI
Execution
Spark
• Driver and executors
• Dynamic execution from driver
• Transforms and actions
Dataflow
• No master
• Static execution planning
• Transforms only, no actions
Why Dataflow?
Why not Scalding on GCE
Pros
• Community

Twitter, eBay, Etsy, Stripe, LinkedIn, …
• Stable and proven
Why not Scalding on GCE
Cons
• Hadoop cluster operations
• Multi-tenancy

resource contention and utilization
• No streaming mode (Summingbird?)
Why not Spark on GCE
Pros
• Batch, streaming, interactive and SQL
• MLlib, GraphX
• Scala, Python, and R support
• Zeppelin, spark-notebook, Hue
Why not Spark on GCE
Cons
• Hard to tune and scale
• Cluster lifecycle management
Why Dataflow with Scala
Dataflow
• Hosted solution, no operations
• Ecosystem

GCS, BigQuery, PubSub, Bigtable, …
• Unified batch and streaming model
Why Dataflow with Scala
Scala
• High level DSL

easytransition for developers
• Reusable and composable code via FP
• Numerical libraries: Breeze,Algebird
Scio
Ecclesiastical Latin IPA: /ˈʃi.o/, [ˈʃiː.o], [ˈʃi.i̯o]
Verb: I can, know, understand, have knowledge.
github.com/spotify/scio
WordCount
Almost identical to Spark version
val sc = ScioContext()
sc.textFile("shakespeare.txt")
.flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
.countByValue()
.saveAsTextFile("wordcount.txt")
PageRank
def pageRank(in: SCollection[(String, String)]) = {
val links = in.groupByKey()
var ranks = links.mapValues(_ => 1.0)
for (i <- 1 to 10) {
val contribs = links.join(ranks).values
.flatMap { case (urls, rank) =>
val size = urls.size
urls.map((_, rank / size))
}
ranks = contribs.sumByKey.mapValues((1 - 0.85) + 0.85 * _)
}
ranks
}
Spotify Running
60 million tracks
30m users * 10 tempo buckets * 25 tracks
Audio: tempo, energy, time signature ...
Metadata: genres, categories, …
Latent vectors from collaborative filtering
Personalized new releases
• Pre-computed weekly on Hadoop

(on-premise cluster)
• 100GB recommendations

from HDFS to Bigtable in US+EU
• 250GB Bloom filters from Bigtable to HDFS
• 200 LOC
User conversion analysis
• For marketing and campaigning strategies
• Track usertransitions through products
• Aggregated for simulation and projection
• 150GB BigQuery in and out
Demo Time!
Design and Implementation
• Simplicity over premature optimization
• Usability over Python/Java inter-op
• Ser/de: ☑kryo/chill ☒Coder[T]
• Closure cleaner
What’s next?
• Apache Beam donation
• Migrating internal teams
• BigQuery SQL-2011 dialect
• Better streaming support
• PRs and issues welcome!
Neville Li
@sinisa_lyh
Thank you!

Weitere ähnliche Inhalte

Was ist angesagt?

Cassandra spark connector
Cassandra spark connectorCassandra spark connector
Cassandra spark connectorDuyhai Doan
 
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterSpark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterDon Drake
 
Using PostgreSQL with Bibliographic Data
Using PostgreSQL with Bibliographic DataUsing PostgreSQL with Bibliographic Data
Using PostgreSQL with Bibliographic DataJimmy Angelakos
 
SparkR - Scalable machine learning - Utah R Users Group - U of U - June 17th
SparkR - Scalable machine learning - Utah R Users Group - U of U - June 17thSparkR - Scalable machine learning - Utah R Users Group - U of U - June 17th
SparkR - Scalable machine learning - Utah R Users Group - U of U - June 17thAlton Alexander
 
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...randyguck
 
Pivoting Data with SparkSQL by Andrew Ray
Pivoting Data with SparkSQL by Andrew RayPivoting Data with SparkSQL by Andrew Ray
Pivoting Data with SparkSQL by Andrew RaySpark Summit
 
Spark meetup v2.0.5
Spark meetup v2.0.5Spark meetup v2.0.5
Spark meetup v2.0.5Yan Zhou
 
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran LonikarExploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran LonikarSpark Summit
 
DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopHakka Labs
 
Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Julian Hyde
 
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...Chris Fregly
 
Assessing Graph Solutions for Apache Spark
Assessing Graph Solutions for Apache SparkAssessing Graph Solutions for Apache Spark
Assessing Graph Solutions for Apache SparkDatabricks
 
DataSource V2 and Cassandra – A Whole New World
DataSource V2 and Cassandra – A Whole New WorldDataSource V2 and Cassandra – A Whole New World
DataSource V2 and Cassandra – A Whole New WorldDatabricks
 
Streaming ML on Spark: Deprecated, experimental and internal ap is galore!
Streaming ML on Spark: Deprecated, experimental and internal ap is galore!Streaming ML on Spark: Deprecated, experimental and internal ap is galore!
Streaming ML on Spark: Deprecated, experimental and internal ap is galore!Holden Karau
 
Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...jaxLondonConference
 
PySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark MeetupPySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark MeetupFrens Jan Rumph
 
SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)wqchen
 
Graph databases: Tinkerpop and Titan DB
Graph databases: Tinkerpop and Titan DBGraph databases: Tinkerpop and Titan DB
Graph databases: Tinkerpop and Titan DBMohamed Taher Alrefaie
 

Was ist angesagt? (20)

Cassandra spark connector
Cassandra spark connectorCassandra spark connector
Cassandra spark connector
 
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterSpark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
 
Using PostgreSQL with Bibliographic Data
Using PostgreSQL with Bibliographic DataUsing PostgreSQL with Bibliographic Data
Using PostgreSQL with Bibliographic Data
 
SparkR - Scalable machine learning - Utah R Users Group - U of U - June 17th
SparkR - Scalable machine learning - Utah R Users Group - U of U - June 17thSparkR - Scalable machine learning - Utah R Users Group - U of U - June 17th
SparkR - Scalable machine learning - Utah R Users Group - U of U - June 17th
 
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...
 
Pivoting Data with SparkSQL by Andrew Ray
Pivoting Data with SparkSQL by Andrew RayPivoting Data with SparkSQL by Andrew Ray
Pivoting Data with SparkSQL by Andrew Ray
 
Spark meetup v2.0.5
Spark meetup v2.0.5Spark meetup v2.0.5
Spark meetup v2.0.5
 
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran LonikarExploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
 
DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL Workshop
 
Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14
 
Building data pipelines
Building data pipelinesBuilding data pipelines
Building data pipelines
 
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
 
Assessing Graph Solutions for Apache Spark
Assessing Graph Solutions for Apache SparkAssessing Graph Solutions for Apache Spark
Assessing Graph Solutions for Apache Spark
 
DataSource V2 and Cassandra – A Whole New World
DataSource V2 and Cassandra – A Whole New WorldDataSource V2 and Cassandra – A Whole New World
DataSource V2 and Cassandra – A Whole New World
 
Streaming ML on Spark: Deprecated, experimental and internal ap is galore!
Streaming ML on Spark: Deprecated, experimental and internal ap is galore!Streaming ML on Spark: Deprecated, experimental and internal ap is galore!
Streaming ML on Spark: Deprecated, experimental and internal ap is galore!
 
Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...
 
Apache spark Intro
Apache spark IntroApache spark Intro
Apache spark Intro
 
PySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark MeetupPySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark Meetup
 
SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)
 
Graph databases: Tinkerpop and Titan DB
Graph databases: Tinkerpop and Titan DBGraph databases: Tinkerpop and Titan DB
Graph databases: Tinkerpop and Titan DB
 

Andere mochten auch

Only the First Drop: Changing the Way Startups are Funded by Denes Ban (OurCr...
Only the First Drop: Changing the Way Startups are Funded by Denes Ban (OurCr...Only the First Drop: Changing the Way Startups are Funded by Denes Ban (OurCr...
Only the First Drop: Changing the Way Startups are Funded by Denes Ban (OurCr...Leadel
 
Smarter campus workshop Part I - Amit Sinha and Heidi Riley - Smarter planet ...
Smarter campus workshop Part I - Amit Sinha and Heidi Riley - Smarter planet ...Smarter campus workshop Part I - Amit Sinha and Heidi Riley - Smarter planet ...
Smarter campus workshop Part I - Amit Sinha and Heidi Riley - Smarter planet ...Smarter Planet Students for a
 
SCIO – Explore Me, IoT Israel 2014
SCIO – Explore Me, IoT Israel 2014SCIO – Explore Me, IoT Israel 2014
SCIO – Explore Me, IoT Israel 2014iotisrael
 
Open Spectrum - Physics, Engineering, Commerce and Politics
Open Spectrum - Physics, Engineering, Commerce and PoliticsOpen Spectrum - Physics, Engineering, Commerce and Politics
Open Spectrum - Physics, Engineering, Commerce and PoliticsBrough Turner
 
Refactoring workshop (Campus Party Quito 2014)
Refactoring workshop (Campus Party Quito 2014)Refactoring workshop (Campus Party Quito 2014)
Refactoring workshop (Campus Party Quito 2014)Maria Gomez
 
Bringing iot data to life, IoT Israel 2014
Bringing iot data to life, IoT Israel 2014Bringing iot data to life, IoT Israel 2014
Bringing iot data to life, IoT Israel 2014iotisrael
 
Dr. Jimmy Schwarzkopf main tent trends 2016
Dr. Jimmy Schwarzkopf  main tent trends 2016Dr. Jimmy Schwarzkopf  main tent trends 2016
Dr. Jimmy Schwarzkopf main tent trends 2016Dr. Jimmy Schwarzkopf
 
Linux Kernel Exploitation
Linux Kernel ExploitationLinux Kernel Exploitation
Linux Kernel ExploitationScio Security
 
Sensors candidated dkim_v2
Sensors candidated dkim_v2Sensors candidated dkim_v2
Sensors candidated dkim_v2David Yushin KIM
 
STKI Israeli IT market study 2016 V2
STKI Israeli IT  market study 2016 V2STKI Israeli IT  market study 2016 V2
STKI Israeli IT market study 2016 V2Dr. Jimmy Schwarzkopf
 
The Digital Health Tech Vision 2016
The Digital Health Tech Vision 2016The Digital Health Tech Vision 2016
The Digital Health Tech Vision 2016accenture
 
Video is Changing the World
Video is Changing the WorldVideo is Changing the World
Video is Changing the Worldaccenture
 
Chemicals: Smarter Investments, Outstanding Results
Chemicals: Smarter Investments, Outstanding ResultsChemicals: Smarter Investments, Outstanding Results
Chemicals: Smarter Investments, Outstanding Resultsaccenture
 
Unlocking the Power of RegTech
Unlocking the Power of RegTechUnlocking the Power of RegTech
Unlocking the Power of RegTechaccenture
 
Mastering The Fourth Industrial Revolution
Mastering The Fourth Industrial Revolution Mastering The Fourth Industrial Revolution
Mastering The Fourth Industrial Revolution Monty C. M. Metzger
 

Andere mochten auch (20)

Only the First Drop: Changing the Way Startups are Funded by Denes Ban (OurCr...
Only the First Drop: Changing the Way Startups are Funded by Denes Ban (OurCr...Only the First Drop: Changing the Way Startups are Funded by Denes Ban (OurCr...
Only the First Drop: Changing the Way Startups are Funded by Denes Ban (OurCr...
 
Smarter campus workshop Part I - Amit Sinha and Heidi Riley - Smarter planet ...
Smarter campus workshop Part I - Amit Sinha and Heidi Riley - Smarter planet ...Smarter campus workshop Part I - Amit Sinha and Heidi Riley - Smarter planet ...
Smarter campus workshop Part I - Amit Sinha and Heidi Riley - Smarter planet ...
 
SCIO – Explore Me, IoT Israel 2014
SCIO – Explore Me, IoT Israel 2014SCIO – Explore Me, IoT Israel 2014
SCIO – Explore Me, IoT Israel 2014
 
Open Spectrum - Physics, Engineering, Commerce and Politics
Open Spectrum - Physics, Engineering, Commerce and PoliticsOpen Spectrum - Physics, Engineering, Commerce and Politics
Open Spectrum - Physics, Engineering, Commerce and Politics
 
Refactoring workshop (Campus Party Quito 2014)
Refactoring workshop (Campus Party Quito 2014)Refactoring workshop (Campus Party Quito 2014)
Refactoring workshop (Campus Party Quito 2014)
 
Nutrition and It's Importance
Nutrition and It's ImportanceNutrition and It's Importance
Nutrition and It's Importance
 
Bringing iot data to life, IoT Israel 2014
Bringing iot data to life, IoT Israel 2014Bringing iot data to life, IoT Israel 2014
Bringing iot data to life, IoT Israel 2014
 
Dr. Jimmy Schwarzkopf main tent trends 2016
Dr. Jimmy Schwarzkopf  main tent trends 2016Dr. Jimmy Schwarzkopf  main tent trends 2016
Dr. Jimmy Schwarzkopf main tent trends 2016
 
Linux Kernel Exploitation
Linux Kernel ExploitationLinux Kernel Exploitation
Linux Kernel Exploitation
 
Sensors candidated dkim_v2
Sensors candidated dkim_v2Sensors candidated dkim_v2
Sensors candidated dkim_v2
 
STKI Israeli IT market study 2016 V2
STKI Israeli IT  market study 2016 V2STKI Israeli IT  market study 2016 V2
STKI Israeli IT market study 2016 V2
 
Scio
ScioScio
Scio
 
Molecular Sensor from SCIO
Molecular Sensor from SCIOMolecular Sensor from SCIO
Molecular Sensor from SCIO
 
Ansible + Hadoop
Ansible + HadoopAnsible + Hadoop
Ansible + Hadoop
 
The Future of Digital Health
The Future of Digital HealthThe Future of Digital Health
The Future of Digital Health
 
The Digital Health Tech Vision 2016
The Digital Health Tech Vision 2016The Digital Health Tech Vision 2016
The Digital Health Tech Vision 2016
 
Video is Changing the World
Video is Changing the WorldVideo is Changing the World
Video is Changing the World
 
Chemicals: Smarter Investments, Outstanding Results
Chemicals: Smarter Investments, Outstanding ResultsChemicals: Smarter Investments, Outstanding Results
Chemicals: Smarter Investments, Outstanding Results
 
Unlocking the Power of RegTech
Unlocking the Power of RegTechUnlocking the Power of RegTech
Unlocking the Power of RegTech
 
Mastering The Fourth Industrial Revolution
Mastering The Fourth Industrial Revolution Mastering The Fourth Industrial Revolution
Mastering The Fourth Industrial Revolution
 

Ähnlich wie Scio

Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...Holden Karau
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopAmanda Casari
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Michael Rys
 
From stream to recommendation using apache beam with cloud pubsub and cloud d...
From stream to recommendation using apache beam with cloud pubsub and cloud d...From stream to recommendation using apache beam with cloud pubsub and cloud d...
From stream to recommendation using apache beam with cloud pubsub and cloud d...Neville Li
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsDataStax Academy
 
20160512 apache-spark-for-everyone
20160512 apache-spark-for-everyone20160512 apache-spark-for-everyone
20160512 apache-spark-for-everyoneAmanda Casari
 
How Concur uses Big Data to get you to Tableau Conference On Time
How Concur uses Big Data to get you to Tableau Conference On TimeHow Concur uses Big Data to get you to Tableau Conference On Time
How Concur uses Big Data to get you to Tableau Conference On TimeDenny Lee
 
Artigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfArtigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfWalmirCouto3
 
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Michael Rys
 
NoSQL: Why, When, and How
NoSQL: Why, When, and HowNoSQL: Why, When, and How
NoSQL: Why, When, and HowBigBlueHat
 
Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014
Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014
Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014Roger Huang
 
OCF.tw's talk about "Introduction to spark"
OCF.tw's talk about "Introduction to spark"OCF.tw's talk about "Introduction to spark"
OCF.tw's talk about "Introduction to spark"Giivee The
 
2015 Data Science Summit @ dato Review
2015 Data Science Summit @ dato Review2015 Data Science Summit @ dato Review
2015 Data Science Summit @ dato ReviewHang Li
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDsDean Chen
 
Big data workloads using Apache Sparkon HDInsight
Big data workloads using Apache Sparkon HDInsightBig data workloads using Apache Sparkon HDInsight
Big data workloads using Apache Sparkon HDInsightNilesh Gule
 
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Michael Rys
 
Why hadoop map reduce needs scala, an introduction to scoobi and scalding
Why hadoop map reduce needs scala, an introduction to scoobi and scaldingWhy hadoop map reduce needs scala, an introduction to scoobi and scalding
Why hadoop map reduce needs scala, an introduction to scoobi and scaldingXebia Nederland BV
 

Ähnlich wie Scio (20)

Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code Workshop
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
 
From stream to recommendation using apache beam with cloud pubsub and cloud d...
From stream to recommendation using apache beam with cloud pubsub and cloud d...From stream to recommendation using apache beam with cloud pubsub and cloud d...
From stream to recommendation using apache beam with cloud pubsub and cloud d...
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
 
20160512 apache-spark-for-everyone
20160512 apache-spark-for-everyone20160512 apache-spark-for-everyone
20160512 apache-spark-for-everyone
 
How Concur uses Big Data to get you to Tableau Conference On Time
How Concur uses Big Data to get you to Tableau Conference On TimeHow Concur uses Big Data to get you to Tableau Conference On Time
How Concur uses Big Data to get you to Tableau Conference On Time
 
Artigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfArtigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdf
 
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
NoSQL: Why, When, and How
NoSQL: Why, When, and HowNoSQL: Why, When, and How
NoSQL: Why, When, and How
 
Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014
Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014
Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014
 
Scala 20140715
Scala 20140715Scala 20140715
Scala 20140715
 
OCF.tw's talk about "Introduction to spark"
OCF.tw's talk about "Introduction to spark"OCF.tw's talk about "Introduction to spark"
OCF.tw's talk about "Introduction to spark"
 
2015 Data Science Summit @ dato Review
2015 Data Science Summit @ dato Review2015 Data Science Summit @ dato Review
2015 Data Science Summit @ dato Review
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
 
Big data workloads using Apache Sparkon HDInsight
Big data workloads using Apache Sparkon HDInsightBig data workloads using Apache Sparkon HDInsight
Big data workloads using Apache Sparkon HDInsight
 
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
 
Sandish3Certs
Sandish3CertsSandish3Certs
Sandish3Certs
 
Why hadoop map reduce needs scala, an introduction to scoobi and scalding
Why hadoop map reduce needs scala, an introduction to scoobi and scaldingWhy hadoop map reduce needs scala, an introduction to scoobi and scalding
Why hadoop map reduce needs scala, an introduction to scoobi and scalding
 

Kürzlich hochgeladen

Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Steffen Staab
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️Delhi Call girls
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AIABDERRAOUF MEHENNI
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionSolGuruz
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...Health
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsAndolasoft Inc
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceanilsa9823
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...OnePlan Solutions
 

Kürzlich hochgeladen (20)

Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 

Scio

  • 1. Scio A Scala API for Google Cloud Dataflow Neville Li @sinisa_lyh
  • 3. Origin Story Scalding and Spark ML, recommendations, analytics 50+ users, 400+ unique jobs
  • 4. Moving to Google Cloud Early 2015 - Dataflow Scala hack project
  • 6. Data model Spark • RDD for batch, DStream for streaming • Explicit caching semantics • Two sets ofAPIs Dataflow • PCollection for both batch and streaming • Windowed and timestamped values • One unifiedAPI
  • 7. Execution Spark • Driver and executors • Dynamic execution from driver • Transforms and actions Dataflow • No master • Static execution planning • Transforms only, no actions
  • 9. Why not Scalding on GCE Pros • Community
 Twitter, eBay, Etsy, Stripe, LinkedIn, … • Stable and proven
  • 10. Why not Scalding on GCE Cons • Hadoop cluster operations • Multi-tenancy
 resource contention and utilization • No streaming mode (Summingbird?)
  • 11. Why not Spark on GCE Pros • Batch, streaming, interactive and SQL • MLlib, GraphX • Scala, Python, and R support • Zeppelin, spark-notebook, Hue
  • 12. Why not Spark on GCE Cons • Hard to tune and scale • Cluster lifecycle management
  • 13. Why Dataflow with Scala Dataflow • Hosted solution, no operations • Ecosystem
 GCS, BigQuery, PubSub, Bigtable, … • Unified batch and streaming model
  • 14. Why Dataflow with Scala Scala • High level DSL
 easytransition for developers • Reusable and composable code via FP • Numerical libraries: Breeze,Algebird
  • 15.
  • 16. Scio Ecclesiastical Latin IPA: /ˈʃi.o/, [ˈʃiː.o], [ˈʃi.i̯o] Verb: I can, know, understand, have knowledge.
  • 18. WordCount Almost identical to Spark version val sc = ScioContext() sc.textFile("shakespeare.txt") .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty)) .countByValue() .saveAsTextFile("wordcount.txt")
  • 19. PageRank def pageRank(in: SCollection[(String, String)]) = { val links = in.groupByKey() var ranks = links.mapValues(_ => 1.0) for (i <- 1 to 10) { val contribs = links.join(ranks).values .flatMap { case (urls, rank) => val size = urls.size urls.map((_, rank / size)) } ranks = contribs.sumByKey.mapValues((1 - 0.85) + 0.85 * _) } ranks }
  • 20. Spotify Running 60 million tracks 30m users * 10 tempo buckets * 25 tracks Audio: tempo, energy, time signature ... Metadata: genres, categories, … Latent vectors from collaborative filtering
  • 21.
  • 22.
  • 23.
  • 24.
  • 25. Personalized new releases • Pre-computed weekly on Hadoop
 (on-premise cluster) • 100GB recommendations
 from HDFS to Bigtable in US+EU • 250GB Bloom filters from Bigtable to HDFS • 200 LOC
  • 26. User conversion analysis • For marketing and campaigning strategies • Track usertransitions through products • Aggregated for simulation and projection • 150GB BigQuery in and out
  • 28. Design and Implementation • Simplicity over premature optimization • Usability over Python/Java inter-op • Ser/de: ☑kryo/chill ☒Coder[T] • Closure cleaner
  • 29. What’s next? • Apache Beam donation • Migrating internal teams • BigQuery SQL-2011 dialect • Better streaming support • PRs and issues welcome!