- 1. Real-time “OLAP” for Big Data (+ use cases)
Cosmin Lehene | Adobe
#bigdataro - 30 January 2013
© 2012 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.
- 2. What we needed … and built
OLAP Semantics
Low Latency Ingestion
High Throughput
Real-time Query API
- 4. Logical Building Blocks
Dimensions, Metrics
Aggregations
Roll-up, drill-down, slicing and dicing, sorting
- 5. OLAP 101 – Queries example
Date        Country  City     OS       Browser  Sale
2012-05-21  USA      NY       Windows  FF        0.0
2012-05-21  USA      NY       Windows  FF       10.0
2012-05-22  USA      SF       OSX      Chrome   25.0
2012-05-22  Canada   Ontario  Linux    Chrome    0.0
2012-05-23  USA      Chicago  OSX      Safari   15.0
Summary:
5 visits, 3 days
2 countries (USA: 4, Canada: 1)
4 cities (NY: 2, SF: 1)
3 OS-es (Win: 2, OSX: 2)
3 browsers (FF: 2, Chrome: 2)
3 sales, 50.0 total
- 6. OLAP 101 – Queries example
Rolling up to country level:
SELECT COUNT(visits), SUM(sales)
GROUP BY country
Country  visits  sales
USA      4       $50
Canada   1       0
“Slice” by browser:
SELECT COUNT(visits), SUM(sales)
GROUP BY country
HAVING browser = “FF”
Country  visits  sales
USA      2       $10
Canada   0       0
Top browsers by sales:
SELECT SUM(sales), COUNT(visits)
GROUP BY browser
ORDER BY sales
Browser  sales  visits
Chrome   $25    2
Safari   $15    1
FF       $10    2
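The three queries on this slide can be reproduced over the five sample visits from the previous slide. The following is only a minimal sketch in plain Python, not SaasBase code; the `aggregate` helper and its parameters are hypothetical names introduced for illustration.

```python
# Sample rows from the slide: (date, country, city, os, browser, sale).
COLUMNS = ("date", "country", "city", "os", "browser", "sale")
VISITS = [
    ("2012-05-21", "USA", "NY", "Windows", "FF", 0.0),
    ("2012-05-21", "USA", "NY", "Windows", "FF", 10.0),
    ("2012-05-22", "USA", "SF", "OSX", "Chrome", 25.0),
    ("2012-05-22", "Canada", "Ontario", "Linux", "Chrome", 0.0),
    ("2012-05-23", "USA", "Chicago", "OSX", "Safari", 15.0),
]


def aggregate(rows, group_by, predicate=lambda row: True):
    """Hypothetical helper: return {group value: [visits, sales]}."""
    idx = COLUMNS.index(group_by)
    # Seed every group with zeros so that groups with no matching rows
    # still show up (the slide lists "Canada 0 0" for the FF slice).
    result = {row[idx]: [0, 0.0] for row in rows}
    for row in rows:
        if predicate(row):
            result[row[idx]][0] += 1       # COUNT(visits)
            result[row[idx]][1] += row[5]  # SUM(sales)
    return result


# Roll-up to country level.
by_country = aggregate(VISITS, "country")
# "Slice" by browser: keep only Firefox visits.
ff_by_country = aggregate(VISITS, "country", lambda r: r[4] == "FF")
# Top browsers by sales, highest first.
top_browsers = sorted(
    aggregate(VISITS, "browser").items(),
    key=lambda item: item[1][1],
    reverse=True,
)
```

Each result is computed by a full pass over the raw rows, which is the "aggregate at runtime" approach.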
- 7. OLAP – Runtime Aggregation vs. Pre-aggregation
Aggregate at runtime
  Pros: most flexible; fast (scatter-gather); space efficient
  But: I/O and CPU intensive; slow for larger data; low throughput
Pre-aggregate
  Pros: fast; efficient, O(1); high throughput
  But: more effort to process (latency); combinatorial explosion (space); no flexibility
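The trade-off can be illustrated with a small sketch: a runtime aggregation scans every event per query, while a pre-aggregated cube answers with a single lookup but must store one cell per combination of dimension values for every grouping set. The names `runtime_sum` and `build_cube` are hypothetical and only serve the illustration.

```python
from collections import defaultdict
from itertools import combinations

EVENTS = [
    {"country": "USA", "os": "Windows", "browser": "FF", "sale": 0.0},
    {"country": "USA", "os": "Windows", "browser": "FF", "sale": 10.0},
    {"country": "USA", "os": "OSX", "browser": "Chrome", "sale": 25.0},
    {"country": "Canada", "os": "Linux", "browser": "Chrome", "sale": 0.0},
    {"country": "USA", "os": "OSX", "browser": "Safari", "sale": 15.0},
]
DIMENSIONS = ("country", "os", "browser")


def runtime_sum(events, **filters):
    """Aggregate at runtime: flexible, but scans all events on each query."""
    return sum(
        e["sale"]
        for e in events
        if all(e[dim] == value for dim, value in filters.items())
    )


def build_cube(events, dimensions):
    """Pre-aggregate: one cell per grouping set and value combination."""
    cube = defaultdict(float)
    for size in range(len(dimensions) + 1):
        for group in combinations(dimensions, size):
            for e in events:
                cube[(group, tuple(e[d] for d in group))] += e["sale"]
    return cube


cube = build_cube(EVENTS, DIMENSIONS)
# Single dictionary lookup instead of a scan.
usa_sales = cube[(("country",), ("USA",))]
# The number of grouping sets doubles with every added dimension (2 ** d).
grouping_sets = {group for group, _ in cube}
```

With three dimensions there are already 2 ** 3 grouping sets; the cell count grows further with the cardinality of each dimension, which is the space explosion the slide refers to.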
- 9. SaasBase Domain Model Mapping
- 10. SaasBase - Domain Model Mapping
- 11. SaasBase - Ingestion, Processing, Indexing, Querying
- 12. SaasBase - Ingestion, Processing, Indexing, Querying
- 14. Ingestion(ETL) throughput vs. latency
Historical data (large batches)
Optimize for throughput
Increments (latest data, smaller)
Optimize for latency
- 16. Processing
Processing involves reading the input (files, tables, events), pre-aggregating it (reducing cardinality) and generating cubes that can be queried in real time
“Super Processor” code running in Storm, Map-Reduce, HBase
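As a rough illustration of the processing step, the sketch below folds events into a cube keyed by a chosen set of dimensions, so that many input events collapse into fewer cells. It assumes that the same function can be applied both to a large historical batch and to a small increment; `new_cube` and `process` are hypothetical names, not the SaasBase "Super Processor" API.

```python
from collections import defaultdict


def new_cube():
    """Empty cube: each cell holds the running metrics for one key."""
    return defaultdict(lambda: {"visits": 0, "sales": 0.0})


def process(cube, events, dimensions):
    """Fold a batch of events into the cube (pre-aggregation)."""
    for event in events:
        key = tuple(event[d] for d in dimensions)
        cell = cube[key]
        cell["visits"] += 1
        cell["sales"] += event["sale"]
    return cube


HISTORICAL = [
    {"date": "2012-05-21", "country": "USA", "sale": 0.0},
    {"date": "2012-05-21", "country": "USA", "sale": 10.0},
    {"date": "2012-05-22", "country": "USA", "sale": 25.0},
    {"date": "2012-05-22", "country": "Canada", "sale": 0.0},
]
INCREMENT = [
    {"date": "2012-05-23", "country": "USA", "sale": 15.0},
]

cube = new_cube()
process(cube, HISTORICAL, ("date", "country"))  # large batch: throughput
process(cube, INCREMENT, ("date", "country"))   # latest data: latency
```

Five input events end up in four cells here; with real traffic the reduction in cardinality is what makes the cube cheap to query.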
- 17. Processing for OLAP semantics
GROUP BY (process, query)
COUNT, SUM, AVG, etc. (process, query)
SORT (process, query)
HAVING (mostly query, can define pre-process constraints)
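The "(process, query)" split can be sketched as follows: processing has already produced a cube with counts and sums per (date, country, browser), and the query step only regroups, filters and sorts those cells. This is an illustrative assumption about how such a split might look; `CUBE`, `KEY_FIELDS` and `query` are hypothetical names.

```python
# Output of the processing step: key -> (visits, sales).
KEY_FIELDS = ("date", "country", "browser")
CUBE = {
    ("2012-05-21", "USA", "FF"): (2, 10.0),
    ("2012-05-22", "USA", "Chrome"): (1, 25.0),
    ("2012-05-22", "Canada", "Chrome"): (1, 0.0),
    ("2012-05-23", "USA", "Safari"): (1, 15.0),
}


def query(cube, group_by, having=lambda visits, sales: True):
    """Query-time GROUP BY / AVG / HAVING / SORT over pre-aggregated cells."""
    idx = KEY_FIELDS.index(group_by)
    totals = {}
    for key, (visits, sales) in cube.items():
        v, s = totals.get(key[idx], (0, 0.0))
        totals[key[idx]] = (v + visits, s + sales)
    # AVG is derived from the stored SUM and COUNT, never by averaging
    # averages. Every cell has at least one visit, so v is never zero.
    rows = [
        (group, v, s, s / v)
        for group, (v, s) in totals.items()
        if having(v, s)
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)


by_country = query(CUBE, "country")
selling_browsers = query(CUBE, "browser", having=lambda v, s: s > 0)
```

Keeping COUNT and SUM in the cube is what allows the query step to compute further roll-ups without touching raw events.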
- 18. SaasBase vs. SQL Views Comparison
- 19. Query Engine
Always reads indexed, compact data
Query parsing
Scan strategy
Single vs. multiple scans
Start/stop rows (prefixes, index positions, etc.)
Index selection (volatile indexes with incremental processing)
Deserialization
Post-aggregation, sorting, fuzzy-sorting etc.
Paging
Custom dimension/metric class loading
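The scan strategy and paging steps can be sketched with a sorted in-memory key list standing in for an index table. This is an assumption-laden illustration: the key layout (`report|date|value`), `stop_row` and `prefix_scan` are hypothetical, and no real HBase client API is used.

```python
from bisect import bisect_left

# Sorted (row key, visits) pairs, imitating a table ordered by row key.
ROWS = sorted([
    ("by_browser|2012-05-21|FF", 2),
    ("by_browser|2012-05-22|Chrome", 2),
    ("by_browser|2012-05-23|Safari", 1),
    ("by_country|2012-05-21|USA", 2),
    ("by_country|2012-05-22|Canada", 1),
    ("by_country|2012-05-22|USA", 1),
    ("by_country|2012-05-23|USA", 1),
])
KEYS = [key for key, _ in ROWS]


def stop_row(prefix):
    """Smallest key that sorts after every key starting with prefix."""
    return prefix[:-1] + chr(ord(prefix[-1]) + 1)


def prefix_scan(prefix, offset=0, limit=None):
    """Single scan between start and stop rows, with simple paging."""
    start = bisect_left(KEYS, prefix)
    stop = bisect_left(KEYS, stop_row(prefix))
    page = ROWS[start:stop][offset:]
    return page if limit is None else page[:limit]


one_day = prefix_scan("by_country|2012-05-22|")
second_page = prefix_scan("by_country|", offset=1, limit=2)
```

Because the keys are sorted, a query for one report and date range touches only a contiguous slice of the index rather than the whole data set.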
- 20. Adobe Business Catalyst
Online business presence: e-commerce, marketing, web analytics etc.
Use case: Web Analytics (visitors, channels, content, e-commerce, campaigns, etc.)
- 21. BC - Workflow
- 22. Adobe Business Catalyst - Stats
3 active datacenters
Raw data ~6TB (from ~1TB 18 months ago)
Visits table: ~1TB each (compressed)
OLAP cubes (stats): 49GB – 64GB (compressed)
~30 minutes latency (from actual pageview/sale to chart in UI)
10s – 100s of milliseconds latency for queries
~3000/s max concurrent OLAP queries (actual traffic is much lower)
- 23. Adobe Pass for TV Everywhere
Authentication & Authorization
Single sign-on to Programmer content (e.g. Turner, NBC, Hulu, MTV, etc.) with Cable operator credentials (e.g. Comcast, Dish, etc.)
- 24. Adobe Pass – Use Case
Analytics use case: Operational metrics (users, devices, latencies, etc.)
Real-time ingestion in HBase
High Frequency Map Reduce jobs (every 2 minutes)
- 25. Adobe Pass - Stats (London Olympics 2012)
67M streams ~ 5.3M hours
1.5M concurrent streams
> 7M unique users
1 Technical & Engineering Emmy Award ;)
- 26. Adobe Primetime – Real-time Video Analytics
Unified video platform (acquisition, transcoding, broadcast, ads, analytics)
- 27. Adobe Primetime – Use Case
Use Cases:
Audience metrics – latency of minutes is OK
Ads metrics – seconds to minutes is OK
Streaming QoS metrics – seconds are a must
Requirements:
Massive throughput (millions of streams, multiple heartbeats every 10 seconds)
Low latency (end-to-end)
- 29. Conclusions
OLAP semantics on a simple data model
Data as first class citizen
Domain Specific “Language” for Dimensions, Metrics, Aggregations
Framework for vertical analytics systems
Tunable performance, resource allocation
- 30. Thank you!
Cosmin Lehene @clehene
http://hstack.org
Editor's Notes
- How many HBase users?
- Data as first class citizen
- Add the real building blocks: HDFS, MapReduce, HBase, Storm
- Check contrast on projector
- Two approaches: RDBMS / OLAP
- Dimensions – read/transform/serialize/deserialize data attributes. Metrics – read/transform/aggregate/serialize. Constraints: ingestion filtering. Report: instrument dimension groups + metrics with aggregations, sorting.
- QUERY ENGINE -> INDEX (always realtime). What’s the difference between this and Hive/Pig/Impala?
- Process = aggregate, generate indexes (natural). Query = uses indexes, can do extra aggregation.
- LEFT: report definition, NOT a QUERY. LIKE A VIEW - CREATED - THEN QUERIED
- >100K/sec/thread. REALTIME
- ~12 hours to reprocess everything from scratch
- 2 datacenters (active-failover) on US West and East coasts (2NN + 19DN, 0.5PB total, 456 cores, 1.1TB RAM)
- Meeting notes (1/29/13 18:09): Olympics. Same SaasBase codebase running in Storm instead of Hadoop. Simpler aggregations, but strict latency requirements.
- Meeting notes (1/29/13 18:12): draw line between player and chart
- Data analysts work with familiar concepts. Meeting notes (1/29/13 18:12): Future: