1. Interactive Analytics in Human Time
Supreeth Rao, Sunil Gupta | June 4, 2014
2014 Hadoop Summit, San Jose, California
2. Interactive – How do we see it?
2 Yahoo Confidential & Proprietary
60B events, 3.5 TB of compressed data
Response 400 ms
Serve an ad and get insights < 2s
9. Data Restatement
Producer Consumer
Real time: quick path, lower amount of checks or reconciliation, typically no lookups
Batch: high latency path, checks and reconciliations, can have lookbacks and lookups
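One way to read this slide: the real-time path publishes provisional numbers quickly, and the batch path later restates them after its checks and reconciliation. A minimal sketch of that merge, assuming hourly buckets and a hypothetical `restate` helper (neither comes from the talk):

```python
def restate(realtime, batch):
    """Merge per-bucket metrics, letting batch values override real-time ones.

    Both arguments map a time bucket (hypothetical hourly keys) to a count.
    """
    merged = dict(realtime)  # start from the quick, provisional numbers
    merged.update(batch)     # reconciled batch numbers win where present
    return merged


# Illustrative numbers only: batch lags behind and has not covered 12:00 yet.
realtime = {"2014-06-04T10": 1020, "2014-06-04T11": 980, "2014-06-04T12": 450}
batch = {"2014-06-04T10": 1000, "2014-06-04T11": 975}

serving_view = restate(realtime, batch)
```

Buckets the batch path has covered show restated values; the newest bucket still shows the real-time figure.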
10. Human Time
< 1s (99th percentile)
Default time grain (< 300 ms)
Instant overlap (< 60s)
Data ingested, insights available (< 2s)
13. Logical View - Scope
Layers: Transformations/Aggs, Data Ingestion, Persistence, Runtime Compute, Caching, API, Optional Middleware, Business API, UI, Data Collection
Groupings: Data Ingestion or Collection; Data Pipelines; Data Warehouse/Analytics and Optimizations; Reporting Application/UI
Legend: Impacts; Out of scope
14. Logical View - Characteristics
Layers: Transformations/Aggs, Data Ingestion, Persistence, Runtime Compute, Caching, API, Optional Middleware, Business API, UI, Data Collection
Batch processing DAG, Real-time topology, SOX, Traffic protection, Late processing, Retention, Completeness Monitoring, PII cleansing/masking
Compatible with HDFS, Performance (Indexed, Columnar, Compression, Serialization, Flexibility, Concurrency, Grain of data stored)
Distributed/Stand-alone, Caching objects vs caching results
Access to data with group by, order by, etc.; SQL or SQL-like
Translate JSON to SQL (optional)
Legend: Impacts; Out of scope
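The optional "translate JSON to SQL" step can be pictured as a small middleware function. The sketch below is an assumption-heavy illustration: the request shape (`table`, `dimensions`, `metrics`, `order_by`, `limit`), the `SUM` aggregation, and the `ad_events` table are all hypothetical and not taken from the talk.

```python
import json
import re

_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _ident(name):
    """Reject anything that is not a plain identifier."""
    if not _IDENT.match(name):
        raise ValueError(f"invalid identifier: {name!r}")
    return name


def json_to_sql(request_json):
    """Translate a small JSON query description into a SQL string."""
    req = json.loads(request_json)
    dims = [_ident(d) for d in req.get("dimensions", [])]
    metrics = [f"SUM({_ident(m)}) AS {_ident(m)}" for m in req["metrics"]]
    sql = f"SELECT {', '.join(dims + metrics)} FROM {_ident(req['table'])}"
    if dims:
        sql += f" GROUP BY {', '.join(dims)}"
    if "order_by" in req:
        sql += f" ORDER BY {_ident(req['order_by'])} DESC"
    if "limit" in req:
        sql += f" LIMIT {int(req['limit'])}"
    return sql


# Hypothetical request: top 10 campaigns by impressions.
query = json_to_sql(json.dumps({
    "table": "ad_events",
    "dimensions": ["campaign"],
    "metrics": ["impressions"],
    "order_by": "impressions",
    "limit": 10,
}))
```

Checking identifiers against a strict pattern keeps the translation from passing arbitrary text into the SQL string; a real service would more likely validate against a known schema.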
15. Logical View - Choices
Layers: Transformations/Aggs, Data Ingestion, Persistence, Runtime Compute, Caching, API, Optional Middleware, Business API, UI, Data Collection
Hadoop MR/Pig/Oozie (Lotus)/Storm (Trident)
Druid, Shark, Hive, Oracle RAC, MySQL, HBase, Impala
memcached_y, Redis
JSON-REST API; JDBC; ODBC
Legend: Impacts; Out of scope
16. How we do what we do? Components of Advertising Data Warehouse
Druid
JDBC/ODBC
Data Warehouse-Persistence
Hive
Metrics Store
JSON-API Persistence and run-time compute
Computation and Ingestion
Quick cache (using a database for now)
Upstream: API layer, MSTR, Adhoc access, Identity Service, Ad-Serving manifests
Data Producers: Serving, Scoring, Booking, 3rd/1st Party Data
Real time and batch compute engine (Hadoop/Storm)
Data filtering/transformations: Transformations, format conversions
Custom Algorithms: computing recursive uniques, indexing
17. Human time, How?
Druid for interactive queries
Storm-Druid for quick ingestion and indexing
Specialized computation and processing for quicker response
› Sketches
› Feature sequence based overlaps
› Custom indexing
21. Overlap
Non-additive
› Requires access to raw (user-level) data to compute non-additive metrics
• Billions of events a day
• TBs of data a day
1-1 vs 1-n vs few-n
› What is the overlap between car commuters and vegans?
› For car commuters, which are the top overlap groups?
› For vegans and car commuters, what are the top overlap groups?
22. Re-stating motivation
Given two sets of identifiers, how can we compute exact overlaps in close to real time (< 1 min)?
Overlap is like an AND operation or a set intersection
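Read as a set operation, an exact overlap is the size of the intersection of two identifier sets. A tiny illustration, using the example users that appear later in the deck (the variable names are placeholders):

```python
# Identifier sets per audience group (example users from the deck).
car_commuters = {"Ram", "Sam"}
vegans = {"Tom", "Sam"}

# Exact overlap: users that belong to both groups.
overlap_users = car_commuters & vegans
overlap_count = len(overlap_users)
```

This is exact, but it needs the raw user-level identifiers for both groups in hand, which is the cost the rest of the deck tries to avoid at query time.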
23. Existing Approaches
● Use exact compute paradigms
o Do joins for intersections, which lead to exact results
Hive, Pig, and MR can all support efficient joins
Exact but not real time
● Use sketches
o Approximate algorithms
HLL, KMV: accuracy vs size, performance
Approximate, needs heavy performance tuning
Close to real time but not exact
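To make the sketch option concrete, here is a minimal KMV (k minimum values) estimator in Python. It is an illustration only: the SHA-1 hashing, the function names, and k=1024 are assumptions, not what the system in the talk used.

```python
import hashlib


def hash01(item):
    """Map an item to a pseudo-uniform float in [0, 1) using SHA-1."""
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_sketch(items, k=1024):
    """Keep the k smallest distinct hash values.

    Simple but not streaming: a production version would keep a bounded heap.
    """
    return sorted({hash01(x) for x in items})[:k]


def kmv_estimate(sketch, k=1024):
    """Estimate the number of distinct items behind a sketch."""
    if len(sketch) < k:  # fewer than k distinct values: the count is exact
        return float(len(sketch))
    return (k - 1) / sketch[k - 1]


def kmv_overlap(sketch_a, sketch_b, k=1024):
    """Estimate |A and B| as Jaccard(A, B) * |A or B|."""
    a, b = set(sketch_a), set(sketch_b)
    union = sorted(a | b)[:k]  # k smallest hashes of the union
    if not union:
        return 0.0
    shared = sum(1 for h in union if h in a and h in b)
    return shared / len(union) * kmv_estimate(union, k)


group_a = kmv_sketch(range(0, 20000))
group_b = kmv_sketch(range(10000, 30000))
approx_overlap = kmv_overlap(group_a, group_b)  # true overlap is 10000
```

The estimate lands near the true overlap but is not exact, and its error shrinks only as k (and therefore sketch size) grows, which is the accuracy-versus-size trade-off the slide refers to.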
24. Using Feature Sequences – 1/4
Feature sequence encoding
o Encode the sequence
{Ram} - { car commuter, soccer fan,...}
{Tom} - { soccer fan, vegan...}
{Sam} - { car commuter, soccer fan, vegan...}
….
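The encoding step can be sketched as mapping each user's feature set to an integer bitmask. The fixed feature order and the `encode` helper are assumptions for illustration, and only the features actually listed on the slide are used (the trailing "..." entries are left out):

```python
# Hypothetical fixed feature order; each feature gets one bit position.
FEATURES = ["car commuter", "soccer fan", "vegan"]
BIT = {name: 1 << i for i, name in enumerate(FEATURES)}


def encode(features):
    """Encode a set of feature names as an integer bitmask."""
    code = 0
    for name in features:
        code |= BIT[name]
    return code


users = {
    "Ram": {"car commuter", "soccer fan"},
    "Tom": {"soccer fan", "vegan"},
    "Sam": {"car commuter", "soccer fan", "vegan"},
}

encoded = {user: encode(features) for user, features in users.items()}
```

Users with the same features get the same code, which is what allows the user identifier to be dropped in the next step.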
25. Using Feature Sequences – 2/4
Eliminate the user; keep a count per encoded bitmap
{car commuter, soccer fan, vegan...} - count - c1 - #
{soccer fan, vegan...} - count - c2 - #
{car commuter, vegan...} - count - c3 - #
Counts become additive now
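Dropping the user and counting per feature sequence can be sketched with a counter keyed by the feature set. The user "Ann" is a hypothetical addition so that one sequence has a count above 1, and `overlap` is an illustrative helper:

```python
from collections import Counter

# Feature sets per user; "Ann" is hypothetical, added for illustration.
user_features = {
    "Ram": frozenset({"car commuter", "soccer fan"}),
    "Tom": frozenset({"soccer fan", "vegan"}),
    "Sam": frozenset({"car commuter", "soccer fan", "vegan"}),
    "Ann": frozenset({"soccer fan", "vegan"}),
}

# The user identifier disappears; only {feature_sequence} -> count remains.
sequence_counts = Counter(user_features.values())


def overlap(features, counts):
    """Sum the counts of every sequence that contains all requested features."""
    wanted = frozenset(features)
    return sum(n for seq, n in counts.items() if wanted <= seq)


cc_and_vegan = overlap({"car commuter", "vegan"}, sequence_counts)
sf_and_vegan = overlap({"soccer fan", "vegan"}, sequence_counts)
```

Each user falls into exactly one sequence, so summing counts across sequences never double counts a user. That is why the counts are additive.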
26. Using Feature Sequences – 3/4
● Store row qualifications into a bitmap
o Car commuter- Row1, Row3
1010000000
o Vegan - Row1, Row2, Row3
1110000000
● Load the bitmap into Druid using a custom indexer
o in-memory or memory mapped
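The row qualification bitmaps on this slide can be reproduced with a few lines of Python. The rows follow the three feature sequences shown earlier; the `row_bitmap` helper and the 10-bit width are illustrative choices that match the slide's strings. Loading into Druid through a custom indexer is not shown.

```python
# One row per distinct feature sequence (Row1, Row2, Row3 on the slide).
rows = [
    {"car commuter", "soccer fan", "vegan"},  # Row1
    {"soccer fan", "vegan"},                  # Row2
    {"car commuter", "vegan"},                # Row3
]


def row_bitmap(feature, rows, width=10):
    """Return a bit string whose leftmost bit is Row1; '1' means the row has the feature."""
    bits = "".join("1" if feature in row else "0" for row in rows)
    return bits.ljust(width, "0")  # pad unused row slots with zeros


car_commuter_bits = row_bitmap("car commuter", rows)
vegan_bits = row_bitmap("vegan", rows)
```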
27. Using Feature Sequences – 4/4
Data Structures
› {feature_sequence}->count
› Feature->row qualification bitmaps
AND is now an “AND” on bitmaps
› supported within Druid
› Very efficient
Works alongside topN and groupBys
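Putting the two data structures together, an overlap query becomes a bitwise AND over feature bitmaps followed by a sum of the surviving rows' counts. This is a plain-Python sketch, not Druid code; the counts are made-up placeholders for c1, c2, c3, and bit i of each integer stands for row i (zero-based).

```python
# {feature_sequence} -> count, stored as one count per row (hypothetical values).
counts = [120, 45, 30]  # Row1, Row2, Row3

# Feature -> row qualification bitmap; bit i set means row i has the feature.
bitmaps = {
    "car commuter": 0b101,  # Row1, Row3
    "soccer fan": 0b011,    # Row1, Row2
    "vegan": 0b111,         # Row1, Row2, Row3
}


def overlap(features, bitmaps, counts):
    """AND the bitmaps of all features, then sum counts of the rows still set."""
    mask = None
    for name in features:
        mask = bitmaps[name] if mask is None else mask & bitmaps[name]
    if mask is None:  # no features requested
        return 0
    return sum(c for i, c in enumerate(counts) if (mask >> i) & 1)


commuter_vegan = overlap(["car commuter", "vegan"], bitmaps, counts)
all_three = overlap(["car commuter", "soccer fan", "vegan"], bitmaps, counts)
```

The query touches only the row bitmaps and the per-row counts, never user-level data, which is what makes exact answers feasible at interactive latency.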
28. Comparison with existing algorithm
● 1-n – Bulk Overlap on grid
o 19 hours on grid
o Few-n requires a re-process
o 1-1 (< 1s)
● Instant Overlap
o < 60s (pre-processing 3-4 hours)
o Supports “exact” AND
o Flexible (few-n, 1-n)
o 1-1 (< 1s)
29. Summary
● Yahoo’s Advertising Data Warehouse
o Petabyte scale
o Normalized view across many systems
o Analytics and optimizations with specialized algorithms
o Data restatement - batch and real time
o Human time
31. Logical View - Scope
Layers: Transformations/Aggs, Data Ingestion, Persistence, Runtime Compute, Caching, API, Optional Middleware, Business API, UI, Data Collection
Groupings: Data Ingestion or Collection; Data Pipelines; Data Warehouse/Analytics and Optimizations; Reporting Application/UI
Legend: Impacts; Out of scope
Logical view of a typical data driven application architecture
SOX compliance
- Quick cache store for querying all metrics in a single fetch, to support a one-page-load UI architecture
- Hive for scheduled jobs and for ad hoc long-range generic queries that are not supported on the interactive interface
Bitmap indexed, columnar, can operate on compressed bitmaps, distributed