Interactive Analytics in Human Time
1. Interactive Analytics in Human Time
Supreeth Rao, Sunil Gupta ▪ June 4, 2014
2014 Hadoop Summit, San Jose, California
2. Interactive - How we see it?
2 Yahoo Confidential & Proprietary
60B events, 3.5TB of compressed data
Response 400ms
Serve an ad and get insights
< 2s
9. Data Restatement
Producer, Consumer
Real time: quick path, lower amount of checks or reconciliation, typically no lookups
Batch: high latency path, checks and reconciliations, can have lookbacks and lookups
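One way to picture restatement is a store where the real-time path writes a provisional value and the batch path later overwrites it with the reconciled one. The sketch below is only an illustration under that assumption; the `store` dictionary, the function names, and the numbers are hypothetical and not taken from the talk.

```python
# Hypothetical in-memory metrics store keyed by (hour, metric).
store = {}

def write_realtime(hour, metric, value):
    """Quick path: record a provisional value with no checks or lookups."""
    store[(hour, metric)] = {"value": value, "source": "realtime"}

def restate_from_batch(hour, metric, value):
    """High-latency path: overwrite the provisional value with the reconciled one."""
    store[(hour, metric)] = {"value": value, "source": "batch"}

# The real-time figure arrives first; the batch figure restates it later.
write_realtime("2014-06-04T10", "impressions", 1040)
restate_from_batch("2014-06-04T10", "impressions", 1000)
final = store[("2014-06-04T10", "impressions")]
```

Readers of the store always see the most trusted value available for a key, which is the property restatement is meant to provide.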
10. Human Time
< 1s (99th percentile)
Default time grain (< 300 ms)
Instant overlap (< 60s)
Data ingested, insights available (< 2s)
13. Data Ingestion or Collection
Transformations
Data Ingestion
Persistence
Runtime Compute
Caching
API
Optional Middleware
Business API
UI
Data Collection
Data Pipelines
Data Warehouse/Analytics and Optimizations
Reporting Application/UI
Logical View - Scope
Transformations/Aggs
Data Ingestion
Persistence
Runtime Compute
Caching
API
Optional Middleware
Business API
UI
Data Collection
Impacts
Out of scope
14. Transformations/Aggs
Data Ingestion
Persistence
Runtime Compute
Caching
API
Optional Middleware
Business API
UI
Data Collection
Batch processing DAG, Real-time topology, SOX, Traffic protection, Late processing, Retention, Completeness Monitoring, PII cleansing/masking
Compatible with HDFS, Performance (Indexed, Columnar, Compression, Serialization, Flexibility, Concurrency, Grain of data stored)
Distributed/Stand-alone, Caching objects vs caching results
Access to data with group by, order by, etc.; SQL or SQL-like
Translate JSON to SQL (optional)
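The optional "translate JSON to SQL" middleware can be sketched as a small function that turns a JSON request into a GROUP BY / ORDER BY query. This is a minimal illustration, not the middleware described in the talk: the request fields (`dimensions`, `metrics`, `order_by`), the `ad_metrics` table, and the column whitelist are all assumptions made for the example.

```python
import json

# Hypothetical schema: only these columns may appear in a request.
ALLOWED_COLUMNS = {"advertiser", "day", "impressions", "clicks"}

def json_to_sql(request_json, table="ad_metrics"):
    """Translate a JSON request into a SQL string with GROUP BY and ORDER BY."""
    request = json.loads(request_json)
    dimensions = request["dimensions"]
    metrics = request["metrics"]
    order_by = request.get("order_by")

    # Reject anything outside the whitelist so column names are never injected.
    for column in dimensions + metrics + ([order_by] if order_by else []):
        if column not in ALLOWED_COLUMNS:
            raise ValueError(f"unknown column: {column}")

    select_list = dimensions + [f"SUM({m}) AS {m}" for m in metrics]
    sql = f"SELECT {', '.join(select_list)} FROM {table}"
    if dimensions:
        sql += f" GROUP BY {', '.join(dimensions)}"
    if order_by:
        sql += f" ORDER BY {order_by} DESC"
    return sql

example = '{"dimensions": ["advertiser"], "metrics": ["clicks"], "order_by": "clicks"}'
sql = json_to_sql(example)
```

The resulting string could then be sent over JDBC/ODBC to whichever SQL or SQL-like engine sits behind the API.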
Logical View - Characteristics
Impacts
Out of scope
15. Transformations/Aggs
Data Ingestion
Persistence
Runtime Compute
Caching
API
Optional Middleware
Business API
UI
Data Collection
Hadoop MR/Pig/Oozie (Lotus)/Storm (Trident)
Druid, Shark, Hive, Oracle RAC, MySQL, HBase, Impala
memcached_y, Redis
JSON-REST API; JDBC; ODBC
Logical View - Choices
Impacts
Out of scope
16. How we do what we do? Components of Advertising Data Warehouse
Druid
JDBC/ODBC
Data Warehouse-Persistence
Hive
Metrics Store
JSON-API Persistence and run-time compute
Computation and Ingestion
Quick cache (using a database for now)
Upstream: API layer, MSTR, Adhoc access, Identity Service, Ad-Serving manifests
Data Producers: Serving, Scoring, Booking, 3rd/1st Party Data
Real time and batch compute engine (Hadoop/Storm)
Data filtering/transformations: Transformations, format conversions
Custom Algorithms: computing recursive uniques, indexing
17. Human time, How?
▪ Druid for interactive queries
▪ Storm-Druid for quick ingestion and index
▪ Specialized computation and processing for quicker response
  ► Sketches
  ► Feature sequence based overlaps
  ► Custom indexing
21. Overlap
▪ Non-additive
  ► Requires access to raw (user-level) data to compute non-additive metrics
    • Billions of events a day
    • TBs of data a day
▪ 1-1 vs 1-n vs few-n
  ► Between car commuter and vegan, what is the overlap?
  ► For car commuter, which are the top overlap groups?
  ► For vegan and car commuter, what are the top overlap groups?
22. Re-stating motivation
Given two sets of identifiers, how can we do exact overlaps in close to real time (< 1 min)?
Overlap is like an AND operation or a set intersection.
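The overlap question can be stated directly with sets of user identifiers. The toy data below reuses the user and segment names from the feature-sequence slides; the function names are hypothetical, and this naive version needs the raw user-level sets in memory, which is exactly what becomes expensive at billions of events a day.

```python
# Toy user-level data: segment -> set of user identifiers.
segments = {
    "car commuter": {"Ram", "Sam"},
    "soccer fan": {"Ram", "Tom", "Sam"},
    "vegan": {"Tom", "Sam"},
}

def overlap(a, b):
    """1-1 overlap: exact count of users who are in both segments."""
    return len(segments[a] & segments[b])

def top_overlaps(a, n=2):
    """1-n overlap: the n segments sharing the most users with segment a."""
    others = [(b, overlap(a, b)) for b in segments if b != a]
    return sorted(others, key=lambda pair: pair[1], reverse=True)[:n]

one_to_one = overlap("car commuter", "vegan")
one_to_n = top_overlaps("car commuter")
```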
23. Existing Approaches
• Use exact compute paradigms
  o Do joins for intersections, which lead to exact results
    ▪ Hive, Pig, MR can all support efficient joins
    ▪ Exact but not real time
• Use sketches
  o Approximate algorithms
    ▪ HLL, KMV: accuracy vs size, performance
    ▪ Approximate, needs heavy performance tuning
    ▪ Close to real time but not exact
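To make the accuracy-versus-size trade-off concrete, here is a minimal K-Minimum-Values (KMV) sketch in plain Python. It is a textbook-style illustration, not the implementation used in the talk: the hash choice (SHA-1 truncated to 64 bits), the value of `k`, and all function names are assumptions.

```python
import hashlib
import heapq

def hash01(item):
    """Map an item to a pseudo-uniform float in [0, 1)."""
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def kmv_sketch(items, k):
    """Keep only the k smallest distinct hash values, in ascending order."""
    return heapq.nsmallest(k, {hash01(x) for x in items})

def estimate_count(sketch, k):
    """Estimate distinct count as (k - 1) / kth smallest hash; exact if under k values."""
    if len(sketch) < k:
        return float(len(sketch))
    return (k - 1) / sketch[-1]

def estimate_overlap(sketch_a, sketch_b, k):
    """Estimate |A and B| as Jaccard(A, B) times the estimated size of the union."""
    union = sorted(set(sketch_a) | set(sketch_b))[:k]
    if not union:
        return 0.0
    both = set(sketch_a) & set(sketch_b)
    jaccard = sum(1 for h in union if h in both) / len(union)
    return jaccard * estimate_count(union, k)

k = 1024
sketch_a = kmv_sketch(range(0, 20000), k)
sketch_b = kmv_sketch(range(10000, 30000), k)
approx_overlap = estimate_overlap(sketch_a, sketch_b, k)  # true overlap is 10000
```

A larger `k` tightens the estimate at the cost of a bigger sketch, and the result is still approximate, which is the limitation the slide points out.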
24. Using Feature Sequences - 1/4
Feature sequence encoding
  o Encode the sequence
    ▪ {Ram} - {car commuter, soccer fan, ...}
    ▪ {Tom} - {soccer fan, vegan, ...}
    ▪ {Sam} - {car commuter, soccer fan, vegan, ...}
    ▪ …
25. Using Feature Sequences - 2/4
Eliminate the user on encoded bitmaps
  ▪ {car commuter, soccer fan, vegan...} - count - c1 - #
  ▪ {soccer fan, vegan...} - count - c2 - #
  ▪ {car commuter, vegan...} - count - c3 - #
Counts become additive now
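The step from per-user feature sets to per-sequence counts can be sketched as a simple group-and-count. The users Ram, Tom and Sam come from the slides; "Pat" is a hypothetical extra user added so that one sequence has a count above 1, and the variable names are illustrative only.

```python
from collections import Counter

# User -> set of features (the encoded feature sequence for that user).
user_features = {
    "Ram": {"car commuter", "soccer fan"},
    "Tom": {"soccer fan", "vegan"},
    "Sam": {"car commuter", "soccer fan", "vegan"},
    "Pat": {"soccer fan", "vegan"},  # hypothetical user, not in the slides
}

# Eliminate the user: keep one row per distinct feature sequence with its count.
sequence_counts = Counter(frozenset(f) for f in user_features.values())

# Each user lands in exactly one row, so row counts can simply be added up.
soccer_and_vegan = sum(
    count
    for sequence, count in sequence_counts.items()
    if {"soccer fan", "vegan"} <= sequence
)
```

Because no user is counted in two rows, summing the rows that contain both features gives the exact overlap without touching user-level data again.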
26. Using Feature Sequences - 3/4
• Store row qualifications into a bitmap
  o Car commuter - Row1, Row3
    ▪ 1010000000
  o Vegan - Row1, Row2, Row3
    ▪ 1110000000
• Load the bitmap into Druid using a custom indexer
  o in-memory or memory mapped
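The row-qualification bitmaps on this slide can be reproduced with plain Python integers, one bit per row. This only illustrates the data structure; it says nothing about how the custom Druid indexer stores or loads bitmaps, and the helper names are hypothetical.

```python
# One row per feature sequence, in the order used on the previous slide.
rows = [
    frozenset({"car commuter", "soccer fan", "vegan"}),  # Row1
    frozenset({"soccer fan", "vegan"}),                  # Row2
    frozenset({"car commuter", "vegan"}),                # Row3
]

def build_bitmaps(rows):
    """Return feature -> integer bitmap, with bit i set when row i has the feature."""
    bitmaps = {}
    for index, sequence in enumerate(rows):
        for feature in sequence:
            bitmaps[feature] = bitmaps.get(feature, 0) | (1 << index)
    return bitmaps

def render(bitmap, width=10):
    """Render a bitmap as a string with Row1 as the leftmost character."""
    return "".join("1" if (bitmap >> i) & 1 else "0" for i in range(width))

bitmaps = build_bitmaps(rows)
car_commuter_bits = render(bitmaps["car commuter"])
vegan_bits = render(bitmaps["vegan"])
```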
27. Using Feature Sequences - 4/4
▪ Data Structures
  ► {feature_sequence} -> count
  ► Feature -> row qualification bitmaps
▪ AND is now an "AND" on bitmaps
  ► supported within Druid
  ► Very efficient
▪ Works alongside topN and groupBys
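Putting the two data structures together, an overlap query becomes a bitwise AND over the feature bitmaps followed by a sum of the counts of the qualifying rows. The sketch below does this in plain Python rather than inside Druid; the counts are hypothetical placeholders and the function names are assumptions made for the example.

```python
from functools import reduce
from operator import and_

# {feature_sequence} -> count, stored as parallel lists (counts are made up).
rows = [
    frozenset({"car commuter", "soccer fan", "vegan"}),
    frozenset({"soccer fan", "vegan"}),
    frozenset({"car commuter", "vegan"}),
]
counts = [120, 45, 30]

# Feature -> row qualification bitmap (bit i set when row i has the feature).
bitmaps = {}
for index, sequence in enumerate(rows):
    for feature in sequence:
        bitmaps[feature] = bitmaps.get(feature, 0) | (1 << index)

def overlap(*features):
    """Exact overlap: AND the bitmaps, then add the counts of the rows left set."""
    qualifying = reduce(and_, (bitmaps[f] for f in features))
    return sum(c for i, c in enumerate(counts) if (qualifying >> i) & 1)

def top_overlaps(feature, n):
    """1-n query: the n other features with the largest overlap with `feature`."""
    others = [(o, overlap(feature, o)) for o in bitmaps if o != feature]
    return sorted(others, key=lambda pair: pair[1], reverse=True)[:n]

car_and_vegan = overlap("car commuter", "vegan")
top_for_car = top_overlaps("car commuter", 2)
```

The same `overlap` call handles 1-1 and few-n queries by passing two or more features, which mirrors the flexibility the talk attributes to the bitmap approach.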
28. Comparison with existing algorithm
• 1-n - Bulk Overlap on grid
  o 19 hours on grid
  o Few-n calls for a re-process
  o 1-1 (< 1s)
• Instant Overlap
  o < 60s (pre-processing 3-4 hours)
  o Supports "exact" AND
  o Flexible (few-n, 1-n)
  o 1-1 (< 1s)
29. Summary
• Yahoo's Advertising Data Warehouse
  o Petabyte scale
  o Normalized view across many systems
  o Analytics and optimizations with specialized algorithms
  o Data restatement - batch and realtime
  o Human time
31. Data Ingestion or Collection
Transformations
Data Ingestion
Persistence
Runtime Compute
Caching
API
Optional Middleware
Business API
UI
Data Collection
Data Pipelines
Data Warehouse/Analytics and Optimizations
Reporting Application/UI
Logical View - Scope
Transformations/Aggs
Data Ingestion
Persistence
Runtime Compute
Caching
API
Optional Middleware
Business API
UI
Data Collection
Impacts
Out of scope
Logical view of a typical data driven application architecture
SOX compliance
- Quick cache store for querying all metrics in a single fetch, to support a one-page-load UI architecture
- Hive for scheduled jobs and for ad hoc long-range generic queries that are not supported on the interactive interface
- Bitmap indexed, columnar, can operate on compressed bitmaps, distributed