3. 3
Current event delivery system
[Diagram labels: Client (×4), Gateway, Syslog, Syslog Producer, Any Data Centre, Groupers, Realtime Brokers, ETL job, Checkpoint Monitor, Hadoop, Hadoop Data Centre, Service Discovery, ACK, Brokers, Syslog Consumer, Liveness Monitor, Brokers]
11. 11
Keep it simple
[Diagram labels: Client (×4), Gateway, Syslog, File Tailer, Any data centre, Event Delivery Service, Reliable Persistent Queue, ETL, Hadoop]
18. 18
Event delivery with Kafka 0.8
[Diagram labels: Client (×4), Gateway, Syslog, File Tailer, Event Delivery Service, Any data centre, Brokers, Mirror Makers, Brokers, Camus (ETL), Hadoop, Hadoop data centre]
51. 51
Origin story
Scalding and Spark popular for ML, recommendations, analytics @ Spotify
50+ users, 400+ unique jobs
Early 2015 - Dataflow Scala hack project
52. 52
Why not Scalding on GCE
Pros
● Big community - Twitter, eBay, Etsy, Stripe, LinkedIn, SoundCloud
● Stable and proven
Cons
● Hadoop cluster operations
● Multi-tenancy, resource contention and utilization
● No streaming mode
53. 53
Why not Spark on GCE
Pros
● Batch, streaming, interactive and SQL
● MLlib, GraphX
● Scala, Python, and R support
Cons
● Hard to tune and scale
● Cluster lifecycle management
54. 54
Why Dataflow with Scala
Dataflow
● Hosted solution, no operations
● Ecosystem: GCS, BigQuery, Pub/Sub, Datastore, Bigtable
● Simple unified model for batch and streaming
Scala
● High level DSL, easy transition for developers
● Reusable and composable code via functional programming
● Numerical libraries: Breeze, Algebird
56. 56
Scio
Ecclesiastical Latin IPA: /ˈʃi.o/, [ˈʃiː.o], [ˈʃi.i̯o]
Verb: I can, know, understand, have knowledge.
Core API similar to spark-core, some ideas from scalding
github.com/spotify/scio
57. 57
WordCount
Almost identical to Spark version
val sc = ScioContext()
sc.textFile("shakespeare.txt")
  .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
  .countByValue()
  .saveAsTextFile("wordcount.txt")
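To make the semantics of the pipeline above concrete, here is a minimal sketch of the same tokenize-and-count logic on plain Scala collections. It does not use Scio or Dataflow; the object name `WordCountSketch` and the sample input are assumptions made for illustration only.

```scala
// Plain Scala collections only; no Scio or Dataflow dependency.
object WordCountSketch {
  // Mirrors the slide's flatMap + countByValue steps on an in-memory Seq of lines.
  def countWords(lines: Seq[String]): Map[String, Long] =
    lines
      .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty)) // tokenize, drop empty tokens
      .groupBy(identity)                                   // word -> all of its occurrences
      .map { case (word, occurrences) => word -> occurrences.size.toLong }

  def main(args: Array[String]): Unit =
    countWords(Seq("to be or not to be", "that is the question"))
      .toSeq
      .sortBy(_._1)
      .foreach { case (word, n) => println(s"$word\t$n") }
}
```

The `filter(_.nonEmpty)` step matters because `split` yields an empty first token when a line starts with a non-letter character.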
58. 58
PageRank in 13 lines
def pageRank(in: SCollection[(String, String)]) = {
  val links = in.groupByKey()
  var ranks = links.mapValues(_ => 1.0)
  for (i <- 1 to 10) {
    val contribs = links.join(ranks).values
      .flatMap { case (urls, rank) =>
        val size = urls.size
        urls.map((_, rank / size))
      }
    ranks = contribs.sumByKey.mapValues((1 - 0.85) + 0.85 * _)
  }
  ranks
}
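The update rule in the function above can be checked on a tiny graph without a cluster. The following is a sketch on in-memory Scala collections, assuming the same 10 iterations and 0.85 damping factor as the slide; `PageRankSketch` and its method signature are illustrative names, not part of Scio.

```scala
// In-memory sketch of the slide's PageRank loop; plain Scala, no Scio.
object PageRankSketch {
  // `edges` holds (source, destination) pairs.
  def pageRank(edges: Seq[(String, String)], iterations: Int = 10): Map[String, Double] = {
    // Equivalent of groupByKey: source -> outgoing links.
    val links: Map[String, Seq[String]] =
      edges.groupBy(_._1).map { case (src, es) => src -> es.map(_._2) }
    // Every page with outgoing links starts at rank 1.0.
    var ranks: Map[String, Double] = links.map { case (src, _) => src -> 1.0 }
    for (_ <- 1 to iterations) {
      // Inner join of links and ranks: a page contributes only if it currently has a rank.
      val contribs: Seq[(String, Double)] = links.toSeq.flatMap { case (src, urls) =>
        ranks.get(src).toSeq.flatMap(rank => urls.map(url => (url, rank / urls.size)))
      }
      // Equivalent of sumByKey followed by the damping step.
      ranks = contribs.groupBy(_._1).map { case (url, cs) =>
        url -> ((1 - 0.85) + 0.85 * cs.map(_._2).sum)
      }
    }
    ranks
  }
}
```

On a three-page cycle (a to b, b to c, c to a) every page keeps a rank of 1.0, which is a quick sanity check. As in the slide's version, a page with no inbound links drops out of `ranks` after the first iteration because of the inner join.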
59. 59
SQL and Big Data Pipelines
SQL is easier to write than data pipelines, but
Hive with TSV or Avro
● Row based storage, inefficient full scan
● No integration with other frameworks
Parquet
● Inspired by Google Dremel which powers BigQuery
● Immature Hive integration, hard to scale with Spark SQL
● Poor impedance matching with Scalding, Avro, etc.
60. 60
BigQuery and Scio
BigQuery
● Slicing and dicing, aggregation, etc.
● Scaling independently
● Web UI, Tableau, QlikView etc.
Scio
● Custom logic hard to express in SQL
● Seamless integration with BigQuery IO
● Scala macros for type safety
61. 61
JSON vs Type Safe BigQuery
JSON approach, a.k.a. everything is Object
sc.bigQuerySelect("...").map { r =>
  (r.get("track").asInstanceOf[TableRow]
     .get("name").asInstanceOf[String],
   r.get("audio").asInstanceOf[TableRow]
     .get("tempo").toString.toInt)
}
Compile → Run job → Wait → NullPointerException or ClassCastException → Repeat
Type safe approach
@BigQueryType.fromQuery("...")
class TrackTempo
sc.typedBigQuery[TrackTempo]().map { t =>
  (t.track.name, t.audio.tempo.getOrElse(-1))
}
Compile → Run → Profit
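The difference between the two approaches can be reproduced without BigQuery. The sketch below uses a plain `Map[String, Any]` as a stand-in for an untyped row and hand-written case classes as a stand-in for the generated type; the names `TypedRowSketch`, `Track`, `Audio` and the field layout are assumptions for illustration, not what the Scio macro actually generates.

```scala
// Plain Scala stand-ins; nothing here calls BigQuery or Scio.
object TypedRowSketch {
  // Hypothetical schema mirroring the fields used on the slide.
  case class Track(name: String)
  case class Audio(tempo: Option[Int]) // nullable column modelled as Option
  case class TrackTempo(track: Track, audio: Audio)

  // Untyped access: every lookup and cast is checked only at runtime,
  // so a missing or null field fails after the job has started.
  def untyped(row: Map[String, Any]): (String, Int) =
    (row("track").asInstanceOf[Map[String, Any]]("name").asInstanceOf[String],
     row("audio").asInstanceOf[Map[String, Any]]("tempo").toString.toInt)

  // Typed access: field names and types are checked by the compiler,
  // and the missing tempo must be handled explicitly.
  def typed(t: TrackTempo): (String, Int) =
    (t.track.name, t.audio.tempo.getOrElse(-1))
}
```

With a null tempo, `untyped` throws a `NullPointerException` at the `toString` call, while `typed` returns the `-1` default.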
62. 62
Spotify Running
60 million tracks
30 million users * 10 tempo buckets * 25 personalized tracks
Audio: tempo, energy, time signature ...
Metadata: genres, categories
Latent vectors from collaborative filtering
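For a sense of scale, the product of the three figures above can be worked out directly. This is a small arithmetic sketch; the value names are illustrative.

```scala
object RunningScale {
  // Figures taken from the slide; Long is needed because the product exceeds Int.MaxValue.
  val users: Long = 30000000L
  val tempoBuckets: Long = 10L
  val tracksPerBucket: Long = 25L

  // Total personalized (user, bucket, track) entries.
  val recommendations: Long = users * tempoBuckets * tracksPerBucket

  def main(args: Array[String]): Unit =
    println(recommendations) // prints 7500000000
}
```

That is 7.5 billion entries in total.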
64. 64
Spotify Running
SELECT user_id, vector
FROM UserEntity
WHERE ...
SELECT track_id, audio.tempo ...
FROM TrackEntity
WHERE ...
[Pipeline diagram labels: most popular per recording; top N tracks per artist; bucket by tempo; vector LSH per bucket; GBK, GBK, GBK, RBK; top tracks per user + bucket; side input; Cloud Datastore]
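Two of the pipeline steps named on this slide, bucketing by tempo and keeping the top tracks per bucket, can be sketched on plain Scala collections. The slide does not say how buckets are defined or how tracks are ranked, so the fixed-width buckets, the `popularity` field and all names below are assumptions for illustration only.

```scala
// Plain Scala sketch of "bucket by tempo" and "top N per bucket"; not the production pipeline.
object TempoBucketSketch {
  // Hypothetical record; field names are illustrative.
  case class Track(id: String, tempo: Double, popularity: Long)

  // Assumed bucketing scheme: fixed-width tempo ranges (10 BPM wide by default).
  def bucketOf(tempo: Double, width: Double = 10.0): Int =
    math.floor(tempo / width).toInt

  // Group tracks by tempo bucket, then keep the n most popular tracks in each bucket.
  def topPerBucket(tracks: Seq[Track], n: Int): Map[Int, Seq[Track]] =
    tracks.groupBy(t => bucketOf(t.tempo)).map { case (bucket, ts) =>
      bucket -> ts.sortBy(-_.popularity).take(n)
    }
}
```

In a distributed pipeline the `groupBy` here would correspond to a shuffle step, which is presumably what the GBK (group by key) labels in the diagram mark.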