6. Why do we care about data?
How is Hadoop helping us to harness the power of the data?
What are some of the tools we built on top of the Hadoop platform?
28. Why Qubole?
• Hadoop & Spark as managed services
• Tight integration with Hive
• Graceful cluster scaling
• API for simplified executor abstraction
• Advanced support for spot instances
• Baked AMI customization
29. Scale of Processing
● Scale:
o 50 billion Pins
o Hundreds of workflows
o Thousands of jobs
o 500+ jobs in a workflow
o 3 petabytes processed daily
● Support:
o Hadoop, Cascading, Hive, Spark …
31. Why Pinball?
● Requirements
o Simple abstractions
o Extensible in the future
o Reliable stateless computing
o Easy to debug
o Scales horizontally
o Can be upgraded w/o aborting workflows
o Rich features like auto-retries, per-job emails, overrun policies…
● Options
o Apache Oozie, Azkaban, Luigi
33. Workflow Model
● Workflow
o A directed graph of nodes called jobs
● Edge
o A "run after" dependence
● Node
o A job is a node
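The model can be captured in a few lines of Python. The sketch below is illustrative only (the class and method names are not Pinball's actual API); it shows jobs as nodes and "run after" edges between them.

    class Job:
        def __init__(self, name):
            self.name = name
            self.upstream = []          # jobs this job must run after

        def run_after(self, other):     # add a "run after" edge
            self.upstream.append(other)

    class Workflow:
        def __init__(self, name):
            self.name = name
            self.jobs = []

        def add(self, job):
            self.jobs.append(job)
            return job

        def runnable(self, done):
            # a job is runnable once every upstream job has finished
            return [j for j in self.jobs
                    if j not in done and all(u in done for u in j.upstream)]

    # example: load runs after transform, which runs after extract
    wf = Workflow('w1')
    extract, transform, load = (wf.add(Job(n)) for n in ('extract', 'transform', 'load'))
    transform.run_after(extract)
    load.run_after(transform)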
34. Job State
● Job state is captured in a token
● Tokens are named hierarchically
Example job token (held on the master):
version: 123
name: /workflow/w1/job
owner: worker_0
expiration: 1234567
data: JobTemplate(....)
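As a rough illustration, the token above could be represented with a plain data container like the one below; the field names mirror the slide, while the class itself is hypothetical.

    from dataclasses import dataclass

    @dataclass
    class Token:
        version: int     # bumped on every state change
        name: str        # hierarchical path, e.g. /workflow/w1/job
        owner: str       # worker currently holding the claim
        expiration: int  # lease expiry for the claim
        data: str        # serialized job template / payload

    job_token = Token(version=123, name='/workflow/w1/job', owner='worker_0',
                      expiration=1234567, data='JobTemplate(...)')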
36. Master Worker Interaction
● Master keeps the state
● Workers claim and execute tasks
● Horizontally scalable
[Diagram: the Worker sends 1: request to the Master, the Master issues 2: update to the Persistent Store, then returns 3: ack to the Worker]
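A worker's side of this interaction might look like the loop below. This is a conceptual sketch: claim_token and release_token are hypothetical names standing in for the request/ack exchange with the master, not Pinball's real API.

    import time

    def worker_loop(master, worker_id, execute):
        while True:
            token = master.claim_token(worker_id)   # 1: request a runnable token
            if token is None:                       # nothing claimable right now
                time.sleep(5)
                continue
            result = execute(token)                 # run the job the token describes
            master.release_token(worker_id, token, result)  # master persists the
                                                            # update (2) and acks (3)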
37. Master
● Entire state is kept in memory
● Each state update is synchronously persisted before the master replies to the client
● Master runs on a single thread – no concurrency issues
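A minimal sketch of that update path, assuming a simple persistent store with a save() call; because the master is single-threaded, the in-memory mutation and the synchronous persist need no locking. Names are illustrative, not Pinball's implementation.

    class Master:
        def __init__(self, store):
            self.tokens = {}       # the entire state lives in memory
            self.store = store     # persistent backing store

        def update_token(self, token):
            self.tokens[token.name] = token   # mutate in-memory state (single thread, no locks)
            self.store.save(token)            # synchronously persist the change...
            return 'ack'                      # ...and only then reply to the client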
47. User Interface
Pinomaly
• Anomalous metric tracking
• Email alerts
Reporting
• Formatted dashboards
• PDF printing
• Duplicated weekly
Metric Manipulation
• Metric Composer
• Global operations (segmentation, rollup/aggregation, etc.)
48. Data Model
Date, seg1, seg2, ... => value
• Store the value for every possible segmentation
• On-the-fly aggregation
E.g.
• 2015-01-01, US, Male => 1
• 2015-01-01, US, Female => 2
• 2015-01-01, UK, Male => 3
• 2015-01-01, UK, Female => 4
• 2015-01-01, UK, * => 7
• 2015-01-01, *, Male => 4
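The on-the-fly aggregation can be illustrated directly from the example rows above: "*" in a query position simply means "sum over that segment". The snippet below is a toy sketch, not the Pinalytics implementation.

    rows = {
        ('2015-01-01', 'US', 'Male'):   1,
        ('2015-01-01', 'US', 'Female'): 2,
        ('2015-01-01', 'UK', 'Male'):   3,
        ('2015-01-01', 'UK', 'Female'): 4,
    }

    def aggregate(date, country, gender):
        # '*' acts as a wildcard over that segment
        return sum(v for (d, c, g), v in rows.items()
                   if d == date and country in ('*', c) and gender in ('*', g))

    print(aggregate('2015-01-01', 'UK', '*'))    # 7  (3 + 4)
    print(aggregate('2015-01-01', '*', 'Male'))  # 4  (1 + 3)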
49. Backend Architecture
[Architecture diagram: the Webapp Server sends 1: request to the Pinalytics Thrift Service, which issues 2: readMetrics() against HBase; Region Servers 1..N each run a coprocessor (CP) per region of the metric table to 3: Scan & Aggregate, 4: Region aggregation combines the per-region results, and 5: metrics are returned]
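The split between region-side work (Scan & Aggregate inside each coprocessor) and the final merge in the Thrift service can be sketched as a two-level aggregation. The code below is purely conceptual; the real implementation runs as HBase coprocessors, and the function names here are made up.

    from collections import Counter

    def region_aggregate(region_rows, metric):
        # partial aggregation performed next to the data, one result per region
        partial = Counter()
        for (name, date), value in region_rows.items():
            if name == metric:
                partial[date] += value
        return partial

    def read_metrics(regions, metric):
        # service-side merge of the per-region partial aggregates
        total = Counter()
        for region_rows in regions:
            total.update(region_aggregate(region_rows, metric))
        return dict(total)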
51. Fuzzy Row Filter
Composite row key
• METRIC|TIME|SEG1|SEG2|...
Filters rows given a row key and a fuzzy mask
• 0: match the byte, 1: don't match the byte
E.g. MAU of male users on 2015-01-01
• Start row: MAU|2015-01-01|
• End row: MAU|2015-01-01||
• Row key: MAU|2015-01-01|--|M-
• Fuzzy mask: 000|0000000000|11|00
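The mask semantics (0 = the byte must match, 1 = the byte is a wildcard) can be mimicked in a few lines; this sketch only illustrates the idea behind HBase's FuzzyRowFilter and is not the HBase API.

    def fuzzy_match(row_key, fuzzy_key, fuzzy_mask):
        if len(row_key) != len(fuzzy_key):
            return False
        # a '1' in the mask makes the byte a wildcard; otherwise it must match exactly
        return all(m == '1' or r == k
                   for r, k, m in zip(row_key, fuzzy_key, fuzzy_mask))

    key  = 'MAU|2015-01-01|--|M-'     # row key from the slide
    mask = '000|0000000000|11|00'     # country bytes are wildcards

    print(fuzzy_match('MAU|2015-01-01|US|M-', key, mask))  # True: male, any country
    print(fuzzy_match('MAU|2015-01-01|UK|F-', key, mask))  # False: female rows filtered out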
53. Reporter
Flexible Python client library for generating reports
• Arbitrary metrics and segments
Easy-to-access data
• Data is automatically copied to S3
• A Hive external table is generated
54. Reporter Example
WAU, WARC and MAU segmented by gender and country

class DemoWAUReport(PinalyticsWideReport):
    _METRIC_NAMES = ['wau', 'warc', 'mau']
    _SEGKEY_NAMES = ['gender', 'country']
    _QUERY_TEMPLATE = """
        SELECT dt, gender, country, wau, warc, mau
        FROM activity_metrics WHERE dt >= '2015-01-01';"""

• Sample query output
['2015-01-01', 'male', 'US', 102, 53, 110]
55. Core Metrics
• Pre-compute a lot of core metrics
  - Activity
  - Event counts
  - Retention
  - Signups
• Standard segmentation
  - Gender, Country, App
  - Spam-filtering