Data Science at Scale: Using Apache Spark for Data Science at Bitly

Sarah Guido
Data Day Seattle 2015

Overview
• About me/Bitly
• Spark overview
• Using Spark for data science
• When it works, it’s great! When it works…

About me
• Data scientist at Bitly
• NYC Python/PyGotham co-organizer
• O’Reilly Media author
• @sarah_guido

About this talk
• This talk is:
– Description of my workflow
– Exploration of within-Spark tools
• This talk is not:
– In-depth exploration of algorithms
– Building new tools on top of Spark
– Any sort of ground truth for how you should be
using Spark

A bit of background
• Need for big data analysis tools
• MapReduce is a poor fit for exploratory data analysis
• Iterate/prototype quickly
• Overall goal: understand how people use not
only our app, but the Internet!

Bitly data!
• Legit big data
• 1 hour of decodes is 10 GB
• 1 day is 240 GB
• 1 month is ~7 TB
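The volumes above follow from the hourly figure. A quick back-of-the-envelope check, assuming a 30-day month and decimal units (both assumptions, not stated on the slide):

```python
# Back-of-the-envelope check of the decode volumes quoted on the slide.
GB_PER_HOUR = 10  # from the slide: 1 hour of decodes is 10 GB

gb_per_day = GB_PER_HOUR * 24
gb_per_month = gb_per_day * 30       # assumption: 30-day month
tb_per_month = gb_per_month / 1000   # assumption: 1 TB = 1000 GB

print(gb_per_day, "GB/day")
print(tb_per_month, "TB/month")
```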

What is Spark?
• Large-scale distributed data processing tool
• SQL and streaming tools
• Faster than Hadoop MapReduce for many workloads (in-memory processing)
• Python API

How does Spark work?
• Partitions your data to operate over in parallel
– A partition by default is one input block (64 MB with older HDFS defaults)
• Supports map/reduce-style operations
• Lazy – transformations only run when an action is called
– Ex. collect() or writing to a file
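Laziness can be illustrated without a cluster. The sketch below is only an analogy using a plain Python generator, not Spark itself: building the pipeline does no work, and consuming it (loosely comparable to calling collect()) triggers the computation. The names parse, calls, and pipeline are made up for illustration.

```python
calls = []

def parse(line):
    """Stand-in for a per-record transformation; records that it ran."""
    calls.append(line)
    return line.upper()

lines = ["a", "b", "c"]

# Defining the pipeline does no work yet (like chaining transformations).
pipeline = (parse(line) for line in lines)
work_before = len(calls)

# Consuming the pipeline forces evaluation (loosely like an action).
result = list(pipeline)
work_after = len(calls)

print(work_before, work_after, result)
```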

Why Spark?
• Fast. Really fast.
• SQL layer – kind of like Hive
• Distributed scientific tools
• Python! Sometimes.
• Cutting edge technology

Setting up the workflow
• Spark journey
– Hadoop server: Spark 1.2
– EMR: Spark 1.3
– EMR: Spark 1.4

How do I use it?
• EMR!
• spark-submit on the cluster
• Can add script as a step to cluster launch

Creating a cluster
• aws emr create-cluster
• --bootstrap-actions
• --steps
• --auto-terminate
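A rough sketch of how those flags fit together, assembled as an argument list in Python. Every value here (cluster name, bucket, script paths) is a hypothetical placeholder rather than Bitly's configuration, and other options the command needs (release or AMI version, instance settings, roles) are deliberately left out.

```python
import shlex

# Hypothetical placeholders; not taken from the talk.
BUCKET = "s3://my-bucket"

command = [
    "aws", "emr", "create-cluster",
    "--name", "spark-decodes-demo",
    # Script run on each node at startup (flag is plural in the AWS CLI).
    "--bootstrap-actions", "Path={}/bootstrap.sh".format(BUCKET),
    # Run a Spark script as a step once the cluster is up.
    "--steps", "Type=Spark,Name=decodes,Args=[{}/job.py]".format(BUCKET),
    # Shut the cluster down when the steps finish.
    "--auto-terminate",
    # Required options such as instance settings are omitted in this sketch.
]

print(" ".join(shlex.quote(part) for part in command))
```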

Let’s set the stage…
• Understanding user behavior
• How do I extract, explore, and model a subset
of our data using Spark?

Data
{"a": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/600.4.10 (KHTML, like Gecko) Version/8.0.4 Safari/600.4.10",
"c": "US",
"nk": 0,
"tz": "America/Los_Angeles",
"g": "1HfTjh8",
"h": "1HfTjh7",
"u": "http://www.nytimes.com/2015/03/22/opinion/sunday/why-health-care-tech-is-still-so-bad.html?smid=tw-share",
"t": 1427288425,
"cy": "Seattle"}

Data processing
• Problem: I want to retrieve NYT decodes
• Solution: well, there are two…
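Whichever approach is used, the core of the task is a per-record test on the "u" field. A minimal sketch in plain Python, assuming one JSON record per line; the function name is_nyt_decode and the second and third sample lines are made up for illustration. A predicate like this is the kind of function one could hand to a filter step.

```python
import json
from urllib.parse import urlparse

def is_nyt_decode(raw_line):
    """Return True if a JSON decode record points at nytimes.com."""
    try:
        record = json.loads(raw_line)
    except ValueError:
        return False  # skip malformed lines
    host = urlparse(record.get("u", "")).hostname or ""
    return host == "nytimes.com" or host.endswith(".nytimes.com")

sample_lines = [
    '{"c": "US", "u": "http://www.nytimes.com/2015/03/22/opinion/sunday/'
    'why-health-care-tech-is-still-so-bad.html?smid=tw-share"}',
    '{"c": "US", "u": "http://example.com/not-nytimes.com"}',  # hypothetical
    "not json",                                                # hypothetical
]

nyt_decodes = [line for line in sample_lines if is_nyt_decode(line)]
print(len(nyt_decodes))
```

Matching on the parsed hostname rather than a substring avoids false positives from URLs that merely mention the domain in their path.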

Data processing
• SparkSQL: 8 minutes
• Pure Spark: 4 minutes!!!

Data processing
• Yes, we’re going to do a live demo of this!

Exploratory data analysis
• Problem: what’s going on with my decodes?
• Solution: DataFrames!
– Similar to Pandas: describe, drop, fill, aggregate
functions
– You can actually convert to a Pandas DataFrame!

• Get a sense of what’s going on in the data
• Look at distributions, frequencies
• Mostly categorical data here
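For categorical fields, "frequencies" mostly means counting values. The sketch below uses plain Python as a stand-in for a DataFrame group-and-count; the records are hypothetical, and only the field names "c" (country) and "cy" (city) come from the sample record.

```python
from collections import Counter

# Hypothetical decode records for illustration.
records = [
    {"c": "US", "cy": "Seattle"},
    {"c": "US", "cy": "New York"},
    {"c": "GB", "cy": "London"},
    {"c": "US", "cy": "Seattle"},
]

# Count how often each country code appears.
country_counts = Counter(record["c"] for record in records)
total = sum(country_counts.values())

# Print each country with its count and share of all decodes.
for country, count in country_counts.most_common():
    print(country, count, round(count / total, 2))
```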

• Yet another live demo

Topic modeling
• Problem: we have so many links but no way to
classify them into certain kinds of content
• Solution: LDA (latent Dirichlet allocation)
– Sort of – compare to other solutions

Topic modeling
• Oh, the JVM…
– LDA only in Scala
• Scala jar file
• Store script in S3

Topic modeling
• LDA in Spark
– Generative model
– Several different methods
– Term frequency vector as input
• “Note: LDA is a new feature with some missing
functionality...”

Topic modeling
• Term frequency vector (rows: documents, columns: terms)
          python  data  hot dogs  baseball  zoo
doc_1     1       3     0         0         0
doc_2     0       0     4         1         0
doc_3     4       0     0         0         5
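A term frequency vector like the ones in the table can be built by counting tokens against a fixed vocabulary. A minimal sketch, assuming the documents are already tokenized; the token lists are invented so that the counts match the table, and term_frequency_vector is a hypothetical helper name.

```python
from collections import Counter

vocabulary = ["python", "data", "hot dogs", "baseball", "zoo"]

# Invented token lists chosen to reproduce the table's counts.
documents = {
    "doc_1": ["python"] + ["data"] * 3,
    "doc_2": ["hot dogs"] * 4 + ["baseball"],
    "doc_3": ["python"] * 4 + ["zoo"] * 5,
}

def term_frequency_vector(tokens, vocabulary):
    """Count each vocabulary term in tokens, in vocabulary order."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

vectors = {
    name: term_frequency_vector(tokens, vocabulary)
    for name, tokens in documents.items()
}

for name, vector in vectors.items():
    print(name, vector)
```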

Topic modeling
• Why not??
– Means to an end
– Currently unable to scrape at large scale

Architecture
• Right now: not in production
– Buy-in
• Streaming applications for parts of the app
• Python or Scala?
– Scala by force (LDA, GraphX)

Some issues
• Hadoop servers
• JVM
• gzip
• Spark 1.4
• Resource allocation
• Really only got it to this stage very recently

Where to go next?
• Spark in production!
• Use for various parts of our app
• Use for R&D and prototyping purposes, with
the potential to expand into the product

Current/future projects
• Trend detection
• Device prediction
• User affinities
– GraphX!
• A/B testing

Resources
• spark.apache.org - documentation
• Databricks blog
• Cloudera blog
