Given at Data Day Seattle 2015.
Bitly generates over 9 billion clicks on shortened links a month, as well as over 100 million unique link shortens. Analyzing data of this scale is not without its challenges. At Bitly, we have started adopting Apache Spark as a way to process our data. In this talk, I’ll elaborate on how I use Spark as part of my data science workflow. I’ll cover how Spark fits into our existing architecture, the kind of problems I’m solving with Spark, and the benefits and challenges of using Spark for large-scale data science.
Data Science at Scale: Using Apache Spark for Data Science at Bitly
1. Data Science at Scale: Using Apache Spark for Data Science at Bitly
Sarah Guido
Data Day Seattle 2015
2. Overview
• About me/Bitly
• Spark overview
• Using Spark for data science
• When it works, it’s great! When it works…
3. About me
• Data scientist at Bitly
• NYC Python/PyGotham co-organizer
• O’Reilly Media author
• @sarah_guido
4. About this talk
• This talk is:
– Description of my workflow
– Exploration of within-Spark tools
• This talk is not:
– In-depth exploration of algorithms
– Building new tools on top of Spark
– Any sort of ground truth for how you should be using Spark
5. A bit of background
• Need for big data analysis tools
• MapReduce for exploratory data analysis ==
• Iterate/prototype quickly
• Overall goal: understand how people use not only our app, but the Internet!
6. Bitly data!
• Legit big data
• 1 hour of decodes is 10 GB
• 1 day is 240 GB
• 1 month is ~7 TB
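The volumes above follow from the hourly figure. A quick sanity check, assuming a 30-day month and decimal terabytes (both are assumptions, not stated on the slide):

```python
# Back-of-the-envelope check of the volumes quoted on the slide.
GB_PER_HOUR = 10  # hourly decode volume from the slide

gb_per_day = GB_PER_HOUR * 24        # 24 hours in a day
gb_per_month = gb_per_day * 30       # assumption: 30-day month
tb_per_month = gb_per_month / 1000   # assumption: 1 TB = 1000 GB

print(gb_per_day, "GB per day")
print(round(tb_per_month, 1), "TB per month")
```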
7. What is Spark?
• Large-scale distributed data processing tool
• SQL and streaming tools
• Faster than Hadoop MapReduce
• Python API
8. How does Spark work?
• Partitions your data to operate over in parallel
– A partition by default is 64 MB
• Supports map/reduce-style operations
• Lazy – transformations only run when an action is called
– Ex. collect() or writing to a file
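The two ideas on this slide, partitioning and lazy evaluation, can be sketched in plain Python. This is only an analogy and does not use the Spark API: `num_partitions` and `parse` are hypothetical helpers, the 64 MB figure is the one quoted on the slide, and Python's built-in `map` stands in for a lazy transformation while `list()` plays the role of an action such as `collect()`.

```python
import math

PARTITION_MB = 64  # default partition size quoted on the slide


def num_partitions(size_gb, partition_mb=PARTITION_MB):
    """Rough partition count for size_gb of data (assuming 1 GB = 1024 MB)."""
    return math.ceil(size_gb * 1024 / partition_mb)


calls = []  # records each time the mapped function actually runs


def parse(line):
    calls.append(line)
    return line.upper()


lines = ["a", "b", "c"]
mapped = map(parse, lines)   # lazy: parse has not been called yet
ran_before = len(calls)      # still 0 at this point
result = list(mapped)        # forces evaluation, similar in spirit to collect()
ran_after = len(calls)       # parse has now run once per element

print(num_partitions(10), "partitions for one hour of decodes")
print(ran_before, ran_after, result)
```

Under these assumptions, one hour of decodes (10 GB) would split into roughly 160 partitions, and no work happens until the result is requested.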
22. Exploratory data analysis
• Problem: what’s going on with my decodes?
• Solution: DataFrames!
– Similar to Pandas: describe, drop, fill, aggregate functions
– You can actually convert to a Pandas DataFrame!
23. Exploratory data analysis
• Get a sense of what’s going on in the data
• Look at distributions, frequencies
• Mostly categorical data here
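For categorical data, "looking at frequencies" mostly means counting values per field. A minimal sketch in plain Python rather than Spark DataFrames; the records, the field names (`country`, `referrer`), and the `frequencies` helper are all invented for illustration and are not Bitly's actual decode schema:

```python
from collections import Counter

# Hypothetical decode records; field names and values are made up.
decodes = [
    {"country": "US", "referrer": "direct"},
    {"country": "US", "referrer": "twitter"},
    {"country": "US", "referrer": "direct"},
    {"country": "GB", "referrer": "direct"},
    {"country": "DE", "referrer": "facebook"},
]


def frequencies(records, field):
    """Return (value, count, share) tuples for one categorical field."""
    counts = Counter(r.get(field, "missing") for r in records)
    total = sum(counts.values())
    return [(value, n, n / total) for value, n in counts.most_common()]


for value, n, share in frequencies(decodes, "country"):
    print(f"{value}: {n} ({share:.0%})")
```

The same group-and-count idea is what an aggregate over a DataFrame column would compute at scale.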
25. Topic modeling
• Problem: we have so many links but no way to classify them into certain kinds of content
• Solution: LDA (latent Dirichlet allocation)
– Sort of – compare to other solutions
26. Topic modeling
• Oh, the JVM…
– LDA only in Scala
• Scala jar file
• Store script in S3
27. Topic modeling
• LDA in Spark
– Generative model
– Several different methods
– Term frequency vector as input
• “Note: LDA is a new feature with some missing functionality...”
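The "term frequency vector as input" point can be illustrated without Spark. The sketch below builds one count vector per document over a shared vocabulary; it is plain Python with hypothetical helper names (`build_vocabulary`, `term_frequency_vector`) and toy documents, and it does not show how the vectors were actually produced in the talk.

```python
from collections import Counter


def build_vocabulary(docs):
    """Map each distinct term to a column index (alphabetical order)."""
    terms = sorted({tok for doc in docs for tok in doc})
    return {term: i for i, term in enumerate(terms)}


def term_frequency_vector(doc, vocab):
    """Count how often each vocabulary term occurs in one tokenized document."""
    counts = Counter(doc)
    vec = [0] * len(vocab)
    for term, n in counts.items():
        if term in vocab:  # ignore terms outside the vocabulary
            vec[vocab[term]] = n
    return vec


# Toy tokenized documents (invented for illustration).
docs = [["spark", "data", "spark"], ["python", "data"]]
vocab = build_vocabulary(docs)
vectors = [term_frequency_vector(doc, vocab) for doc in docs]

print(vocab)
print(vectors)
```

Each row of `vectors` is the kind of per-document count vector a topic model such as LDA takes as input.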
32. Topic modeling
• Why not??
– Means to an end
– Currently unable to scrape at large scale
33. Architecture
• Right now: not in production
– Buy-in
• Streaming applications for parts of the app
• Python or Scala?
– Scala by force (LDA, GraphX)
34. Some issues
• Hadoop servers
• JVM
• gzip
• Spark 1.4
• Resource allocation
• Really only got it to this stage very recently
35. Where to go next?
• Spark in production!
• Use for various parts of our app
• Use for R&D and prototyping purposes, with the potential to expand into the product