6. Why do we care about data?
How is Hadoop helping us to harness the power of the data?
What are some of the tools we built on top of the Hadoop platform?
28. Why Qubole?
• Hadoop & Spark as managed services
• Tight integration with Hive
• Graceful cluster scaling
• API for simplified executor abstraction
• Advanced support for spot instances
• Baked AMI customization
29. Scale of Processing
● Scale:
o 50 billion Pins
o Hundreds of workflows
o Thousands of jobs
o 500+ jobs in a workflow
o 3 petabytes processed daily
● Support:
o Hadoop, Cascading, Hive, Spark …
31. Why Pinball?
● Requirements
o Simple abstractions
o Extensible in the future
o Reliable stateless computing
o Easy to debug
o Scales horizontally
o Can be upgraded w/o aborting workflows
o Rich features like auto-retries, per-job emails, overrun policies…
● Options
o Apache Oozie, Azkaban, Luigi
33. Workflow Model
● Workflow
o A directed graph of nodes called jobs
● Edge
o A "run after" dependence
● Node
o A job is a node
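The model can be captured in a few lines of Python. The sketch below is illustrative only (the class and method names are not Pinball's actual API); it shows jobs as nodes and "run after" edges between them.

    class Job:
        def __init__(self, name):
            self.name = name
            self.upstream = []          # jobs this job must run after

        def run_after(self, other):     # add a "run after" edge
            self.upstream.append(other)

    class Workflow:
        def __init__(self, name):
            self.name = name
            self.jobs = []

        def add(self, job):
            self.jobs.append(job)
            return job

        def runnable(self, done):
            # a job is runnable once every upstream job has finished
            return [j for j in self.jobs
                    if j not in done and all(u in done for u in j.upstream)]

    # example: load runs after transform, which runs after extract
    wf = Workflow('w1')
    extract, transform, load = (wf.add(Job(n)) for n in ('extract', 'transform', 'load'))
    transform.run_after(extract)
    load.run_after(transform)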
34. Job State
● Job state is captured in a token
● Tokens are named hierarchically
Example job token (held on the master):
version: 123
name: /workflow/w1/job
owner: worker_0
expiration: 1234567
data: JobTemplate(....)
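As a rough illustration, the token above could be represented with a plain data container like the one below; the field names mirror the slide, while the class itself is hypothetical.

    from dataclasses import dataclass

    @dataclass
    class Token:
        version: int     # bumped on every state change
        name: str        # hierarchical path, e.g. /workflow/w1/job
        owner: str       # worker currently holding the claim
        expiration: int  # lease expiry for the claim
        data: str        # serialized job template / payload

    job_token = Token(version=123, name='/workflow/w1/job', owner='worker_0',
                      expiration=1234567, data='JobTemplate(...)')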
36. Master Worker Interaction
● Master keeps the state
● Workers claim and execute tasks
● Horizontally scalable
[Diagram: the Worker sends 1: request to the Master, the Master issues 2: update to the Persistent Store, then returns 3: ack to the Worker]
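A worker's side of this interaction might look like the loop below. This is a conceptual sketch: claim_token and release_token are hypothetical names standing in for the request/ack exchange with the master, not Pinball's real API.

    import time

    def worker_loop(master, worker_id, execute):
        while True:
            token = master.claim_token(worker_id)   # 1: request a runnable token
            if token is None:                       # nothing claimable right now
                time.sleep(5)
                continue
            result = execute(token)                 # run the job the token describes
            master.release_token(worker_id, token, result)  # master persists the
                                                            # update (2) and acks (3)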
37. Master
● Entire state is kept in memory
● Each state update is synchronously persisted before the master replies to the client
● Master runs on a single thread – no concurrency issues
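A minimal sketch of that update path, assuming a simple persistent store with a save() call; because the master is single-threaded, the in-memory mutation and the synchronous persist need no locking. Names are illustrative, not Pinball's implementation.

    class Master:
        def __init__(self, store):
            self.tokens = {}       # the entire state lives in memory
            self.store = store     # persistent backing store

        def update_token(self, token):
            self.tokens[token.name] = token   # mutate in-memory state (single thread, no locks)
            self.store.save(token)            # synchronously persist the change...
            return 'ack'                      # ...and only then reply to the client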
47. User Interface
Pinomaly
• Anomalous metric tracking
• Email alerts
Reporting
• Formatted dashboards
• PDF printing
• Duplicated weekly
Metric Manipulation
• Metric Composer
• Global operations (segmentation, rollup/aggregation, etc.)
48. Data Model
Date, seg1, seg2, ... => value
• Store the value for every possible segmentation
• On-the-fly aggregation
E.g.
• 2015-01-01, US, Male => 1
• 2015-01-01, US, Female => 2
• 2015-01-01, UK, Male => 3
• 2015-01-01, UK, Female => 4
• 2015-01-01, UK, * => 7
• 2015-01-01, *, Male => 4
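The on-the-fly aggregation can be illustrated directly from the example rows above: "*" in a query position simply means "sum over that segment". The snippet below is a toy sketch, not the Pinalytics implementation.

    rows = {
        ('2015-01-01', 'US', 'Male'):   1,
        ('2015-01-01', 'US', 'Female'): 2,
        ('2015-01-01', 'UK', 'Male'):   3,
        ('2015-01-01', 'UK', 'Female'): 4,
    }

    def aggregate(date, country, gender):
        # '*' acts as a wildcard over that segment
        return sum(v for (d, c, g), v in rows.items()
                   if d == date and country in ('*', c) and gender in ('*', g))

    print(aggregate('2015-01-01', 'UK', '*'))    # 7  (3 + 4)
    print(aggregate('2015-01-01', '*', 'Male'))  # 4  (1 + 3)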
49. Backend Architecture
[Architecture diagram: the Webapp Server sends 1: request to the Pinalytics Thrift Service, which issues 2: readMetrics() against HBase; Region Servers 1..N each run a coprocessor (CP) per region of the metric table to 3: Scan & Aggregate, 4: Region aggregation combines the per-region results, and 5: metrics are returned]
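The split between region-side work (Scan & Aggregate inside each coprocessor) and the final merge in the Thrift service can be sketched as a two-level aggregation. The code below is purely conceptual; the real implementation runs as HBase coprocessors, and the function names here are made up.

    from collections import Counter

    def region_aggregate(region_rows, metric):
        # partial aggregation performed next to the data, one result per region
        partial = Counter()
        for (name, date), value in region_rows.items():
            if name == metric:
                partial[date] += value
        return partial

    def read_metrics(regions, metric):
        # service-side merge of the per-region partial aggregates
        total = Counter()
        for region_rows in regions:
            total.update(region_aggregate(region_rows, metric))
        return dict(total)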
51. Fuzzy Row Filter
Composite row key
• METRIC|TIME|SEG1|SEG2|...
Filters rows given a row key and a fuzzy mask
• 0: match the byte, 1: don't match the byte
E.g. MAU of male users on 2015-01-01
• Start row: MAU|2015-01-01|
• End row: MAU|2015-01-01||
• Row key: MAU|2015-01-01|--|M-
• Fuzzy mask: 000|0000000000|11|00
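The mask semantics (0 = the byte must match, 1 = the byte is a wildcard) can be mimicked in a few lines; this sketch only illustrates the idea behind HBase's FuzzyRowFilter and is not the HBase API.

    def fuzzy_match(row_key, fuzzy_key, fuzzy_mask):
        if len(row_key) != len(fuzzy_key):
            return False
        # a '1' in the mask makes the byte a wildcard; otherwise it must match exactly
        return all(m == '1' or r == k
                   for r, k, m in zip(row_key, fuzzy_key, fuzzy_mask))

    key  = 'MAU|2015-01-01|--|M-'     # row key from the slide
    mask = '000|0000000000|11|00'     # country bytes are wildcards

    print(fuzzy_match('MAU|2015-01-01|US|M-', key, mask))  # True: male, any country
    print(fuzzy_match('MAU|2015-01-01|UK|F-', key, mask))  # False: female rows filtered out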
53. Reporter
Flexible Python client library for generating reports
• Arbitrary metrics and segments
Easy-to-access data
• Data is automatically copied to S3
• A Hive external table is generated
54. Reporter Example
WAU, WARC and MAU segmented by gender and country

class DemoWAUReport(PinalyticsWideReport):
    _METRIC_NAMES = ['wau', 'warc', 'mau']
    _SEGKEY_NAMES = ['gender', 'country']
    _QUERY_TEMPLATE = """
        SELECT dt, gender, country, wau, warc, mau
        FROM activity_metrics WHERE dt >= '2015-01-01';"""

• Sample query output
['2015-01-01', 'male', 'US', 102, 53, 110]
55. Core Metrics
• Pre-compute a lot of core metrics
  - Activity
  - Event counts
  - Retention
  - Signups
• Standard segmentation
  - Gender, Country, App
  - Spam-filtering