This document discusses using an open source Lambda architecture with Kafka, Hadoop, Samza, and Druid to handle event data streams. It describes the problem of interactively exploring large volumes of time series data. It outlines how Druid was developed as a fast query layer for Hadoop to enable low-latency queries over aggregated data. The architecture ingests raw data streams in real-time via Kafka and Samza, aggregates the data in Druid, and enables reprocessing via Hadoop for reliability.
Open Source Lambda Architecture with Hadoop, Kafka, Samza and Druid
1. OPEN SOURCE LAMBDA ARCHITECTURE
KAFKA · HADOOP · SAMZA · DRUID
FANGJIN YANG · GIAN MERLINO · DRUID COMMITTERS
2. OVERVIEW
‣ PROBLEM: DEALING WITH EVENT DATA
‣ MOTIVATION: EVOLUTION OF A “REAL-TIME” STACK
‣ ARCHITECTURE: THE “RAD”-STACK
‣ NEXT STEPS: TRY IT OUT FOR YOURSELF
4. 2013
THE PROBLEM
‣ Arbitrary and interactive exploration of time series data
• Ad-tech, system/app metrics, network/website traffic analysis
‣ Multi-tenancy: lots of concurrent users
‣ Scalability: 10+ TB/day, ad-hoc queries on trillions of events
‣ Recency matters! Real-time analysis
5. FINDING A SOLUTION
‣ Load all your data into Hadoop. Query it. Done!
‣ Good job guys, let’s go home
7. PROBLEMS WITH THE NAIVE SOLUTION
‣ MapReduce can handle almost every distributed computing problem
‣ MapReduce over your raw data is flexible but slow
‣ Hadoop is not optimized for query latency
‣ To optimize queries, we need a query layer
10. MAKE QUERIES FASTER
‣ What types of queries to optimize for?
• Revenue over time broken down by demographic
• Top publishers by clicks over the last month
• Number of unique visitors broken down by any dimension
• Not dumping the entire dataset
• Not examining individual events
15. DRUID
‣ Druid project started in 2011, went open source in 2012
‣ Designed for low latency ingestion and ad-hoc aggregations
‣ Designed for keeping around a lot of history (years are ok)
‣ Growing Community
• ~100 contributors
• Used in production at numerous large and small organizations
18. RAW DATA
timestamp             publisher          advertiser  gender  country  click  price
2011-01-01T01:01:35Z  bieberfever.com    google.com  Male    USA      0      0.65
2011-01-01T01:03:63Z  bieberfever.com    google.com  Male    USA      0      0.62
2011-01-01T01:04:51Z  bieberfever.com    google.com  Male    USA      1      0.45
...
2011-01-01T01:00:00Z  ultratrimfast.com  google.com  Female  UK       0      0.87
2011-01-01T02:00:00Z  ultratrimfast.com  google.com  Female  UK       0      0.99
2011-01-01T02:00:00Z  ultratrimfast.com  google.com  Female  UK       1      1.53
19. ROLLUP DATA
timestamp             publisher          advertiser  gender  country  impressions  clicks  revenue
2011-01-01T01:00:00Z  ultratrimfast.com  google.com  Male    USA      1800         25      15.70
2011-01-01T01:00:00Z  bieberfever.com    google.com  Male    USA      2912         42      29.18
2011-01-01T02:00:00Z  ultratrimfast.com  google.com  Male    UK       1953         17      17.31
2011-01-01T02:00:00Z  bieberfever.com    google.com  Male    UK       3194         170     34.01
‣ Truncate timestamps
‣ GroupBy over string columns (dimensions)
‣ Aggregate numeric columns (metrics)
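The three rollup steps above can be sketched in a few lines of Python. This is only an illustration, not Druid's implementation: the sample events are made up in the shape of the RAW DATA slide, and it assumes each raw row counts as one impression and that revenue is the sum of price.

```python
from collections import defaultdict
from datetime import datetime

# Illustrative raw events shaped like the RAW DATA slide (values are assumptions).
RAW_EVENTS = [
    {"timestamp": "2011-01-01T01:01:35Z", "publisher": "bieberfever.com",
     "advertiser": "google.com", "gender": "Male", "country": "USA",
     "click": 0, "price": 0.65},
    {"timestamp": "2011-01-01T01:03:13Z", "publisher": "bieberfever.com",
     "advertiser": "google.com", "gender": "Male", "country": "USA",
     "click": 0, "price": 0.62},
    {"timestamp": "2011-01-01T01:04:51Z", "publisher": "bieberfever.com",
     "advertiser": "google.com", "gender": "Male", "country": "USA",
     "click": 1, "price": 0.45},
    {"timestamp": "2011-01-01T02:00:00Z", "publisher": "ultratrimfast.com",
     "advertiser": "google.com", "gender": "Female", "country": "UK",
     "click": 1, "price": 1.53},
]

DIMENSIONS = ("publisher", "advertiser", "gender", "country")
TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def truncate_to_hour(ts):
    """Step 1: truncate the timestamp to hourly granularity."""
    parsed = datetime.strptime(ts, TS_FORMAT)
    return parsed.replace(minute=0, second=0).strftime(TS_FORMAT)


def rollup(events):
    """Steps 2 and 3: group by (hour, dimensions) and aggregate the metrics."""
    totals = defaultdict(lambda: {"impressions": 0, "clicks": 0, "revenue": 0.0})
    for event in events:
        key = (truncate_to_hour(event["timestamp"]),) + tuple(
            event[d] for d in DIMENSIONS
        )
        row = totals[key]
        row["impressions"] += 1          # assumption: one raw row = one impression
        row["clicks"] += event["click"]
        row["revenue"] += event["price"]  # assumption: revenue = sum of price
    return dict(totals)


ROLLED_UP = rollup(RAW_EVENTS)
```

Four raw rows collapse into two rolled-up rows here; the coarser the time granularity and the fewer distinct dimension values, the larger the reduction.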
20. PARTITION DATA
timestamp             publisher          advertiser  gender  country  impressions  clicks  revenue
2011-01-01T01:00:00Z  ultratrimfast.com  google.com  Male    USA      1800         25      15.70
2011-01-01T01:00:00Z  bieberfever.com    google.com  Male    USA      2912         42      29.18
2011-01-01T02:00:00Z  ultratrimfast.com  google.com  Male    UK       1953         17      17.31
2011-01-01T02:00:00Z  bieberfever.com    google.com  Male    UK       3194         170     34.01
‣ Shard data by time
‣ Immutable chunks of data called “segments”
• Segment 2011-01-01T01/2011-01-01T02
• Segment 2011-01-01T02/2011-01-01T03
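As a rough sketch of time-based sharding, the snippet below buckets rolled-up rows into hourly chunks keyed by the same `start/end` interval labels shown on the slide. The function names are hypothetical, and freezing each chunk into a tuple only mimics immutability; it is not how Druid stores segments.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Rows taken from the slide, trimmed to the columns this sketch needs.
ROWS = [
    {"timestamp": "2011-01-01T01:00:00Z", "publisher": "ultratrimfast.com", "clicks": 25},
    {"timestamp": "2011-01-01T01:00:00Z", "publisher": "bieberfever.com", "clicks": 42},
    {"timestamp": "2011-01-01T02:00:00Z", "publisher": "ultratrimfast.com", "clicks": 17},
    {"timestamp": "2011-01-01T02:00:00Z", "publisher": "bieberfever.com", "clicks": 170},
]


def segment_interval(ts):
    """Return the hourly interval label (start/end) that contains ts."""
    start = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(minute=0, second=0)
    end = start + timedelta(hours=1)
    label = "%Y-%m-%dT%H"
    return f"{start.strftime(label)}/{end.strftime(label)}"


def build_segments(rows):
    """Shard rows by hour and freeze each shard so rows cannot be added later."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[segment_interval(row["timestamp"])].append(row)
    return {interval: tuple(chunk) for interval, chunk in buckets.items()}


SEGMENTS = build_segments(ROWS)
```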
21. IMMUTABLE SEGMENTS
‣ Fundamental storage unit in Druid
‣ Read consistency
‣ One thread scans one segment
‣ Multiple threads can access same underlying data
‣ Segments are sized so that scanning one completes in milliseconds
‣ Simplifies distribution & replication
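Because segments never change, each one can be scanned by its own thread without locking and the partial results merged afterwards. The sketch below shows that pattern with Python's standard thread pool; the segment contents and function names are assumptions, and it says nothing about how Druid actually schedules scans.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical immutable segments: tuples of rows keyed by time interval.
SEGMENTS = {
    "2011-01-01T01/2011-01-01T02": (
        {"publisher": "ultratrimfast.com", "clicks": 25},
        {"publisher": "bieberfever.com", "clicks": 42},
    ),
    "2011-01-01T02/2011-01-01T03": (
        {"publisher": "ultratrimfast.com", "clicks": 17},
        {"publisher": "bieberfever.com", "clicks": 170},
    ),
}


def scan_segment(rows):
    """One thread scans one segment and returns a partial aggregate."""
    return sum(row["clicks"] for row in rows)


def total_clicks(segments, max_workers=4):
    """Scan all segments in parallel, then merge the partial sums."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(scan_segment, segments.values()))
    return sum(partials)
```

Since the scans only read, the same segment could also be handed to several queries at once, which is the read-consistency point made above.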
22. COLUMN ORIENTATION
timestamp             publisher          advertiser  gender  country  impressions  clicks  revenue
2011-01-01T01:00:00Z  ultratrimfast.com  google.com  Male    USA      1800         25      15.70
2011-01-01T01:00:00Z  bieberfever.com    google.com  Male    USA      2912         42      29.18
‣ Scan/load only what you need
‣ Compression!
‣ Indexes!
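To make the three bullets concrete, here is a toy column store: rows are pivoted into one list per column, a string column is dictionary-encoded (a simple form of compression), and an inverted index maps each value to its row ids. It is a simplified sketch with hypothetical names, using plain lists rather than Druid's actual on-disk format.

```python
COLUMNS = ("timestamp", "publisher", "advertiser", "gender", "country",
           "impressions", "clicks", "revenue")

# Rows from the rollup example earlier in the deck.
ROWS = [
    ("2011-01-01T01:00:00Z", "ultratrimfast.com", "google.com", "Male", "USA", 1800, 25, 15.70),
    ("2011-01-01T01:00:00Z", "bieberfever.com", "google.com", "Male", "USA", 2912, 42, 29.18),
    ("2011-01-01T02:00:00Z", "ultratrimfast.com", "google.com", "Male", "UK", 1953, 17, 17.31),
    ("2011-01-01T02:00:00Z", "bieberfever.com", "google.com", "Male", "UK", 3194, 170, 34.01),
]


def to_columns(rows):
    """Pivot row tuples into one list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(COLUMNS)}


def dictionary_encode(values):
    """Replace repeated strings with small integer ids."""
    ids = {}
    encoded = [ids.setdefault(value, len(ids)) for value in values]
    return ids, encoded


def inverted_index(values):
    """Map each distinct value to the list of row ids where it occurs."""
    index = {}
    for row_id, value in enumerate(values):
        index.setdefault(value, []).append(row_id)
    return index


def clicks_for_publisher(columns, publisher):
    """Touch only two columns: publisher (via its index) and clicks."""
    row_ids = inverted_index(columns["publisher"]).get(publisher, [])
    return sum(columns["clicks"][i] for i in row_ids)


TABLE = to_columns(ROWS)
```

A query such as `clicks_for_publisher(TABLE, "bieberfever.com")` never reads the other six columns, which is the "scan/load only what you need" point.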
23. DRUID INGESTION
‣ Must have denormalized, flat data
‣ Druid cannot do stateful processing at ingestion time
‣ …like stream-stream joins
‣ …or user session reconstruction
‣ …or a bunch of other useful things!
‣ Many Druid users need an ETL pipeline
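A minimal sketch of the kind of ETL step this implies: nested fields are flattened and a lookup join is resolved before the event is handed to ingestion. The event shape, the `user_id` field, and the `USERS` lookup table are all assumptions made for illustration.

```python
# Hypothetical lookup table that the stream processor joins against.
USERS = {7: {"gender": "Male", "country": "USA"}}


def flatten(record, prefix=""):
    """Flatten nested dicts into a single level, joining keys with '_'."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def denormalize(event, users):
    """Resolve the user join and flatten, producing one flat row."""
    row = flatten({k: v for k, v in event.items() if k != "user_id"})
    row.update(users.get(event["user_id"], {}))
    return row


EVENT = {
    "timestamp": "2011-01-01T01:04:51Z",
    "user_id": 7,
    "ad": {"publisher": "bieberfever.com", "advertiser": "google.com"},
    "click": 1,
}

FLAT_ROW = denormalize(EVENT, USERS)
```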
42. WHY REPROCESS DATA?
‣ Bugs in processing code
‣ Imprecise streaming operations
‣ …like using short join windows
‣ Limitations of current software
‣ …Kafka 0.8.x, Samza 0.9.x can generate duplicate messages
‣ …Druid 0.7.x streaming ingestion is best-effort
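As one example of why reprocessing helps, duplicate deliveries inflate streaming counts, and a later batch pass over the full data can correct them. The sketch below assumes every event carries a unique `event_id`, which is a hypothetical field not mentioned in the slides.

```python
# Hypothetical stream in which message "b" was delivered twice.
STREAM = [
    {"event_id": "a", "click": 1},
    {"event_id": "b", "click": 1},
    {"event_id": "b", "click": 1},  # duplicate delivery
    {"event_id": "c", "click": 0},
]


def count_clicks(events):
    """Naive streaming aggregate: counts whatever arrives."""
    return sum(event["click"] for event in events)


def deduplicate(events):
    """Batch pass: keep the first occurrence of each event_id."""
    seen = set()
    unique = []
    for event in events:
        if event["event_id"] not in seen:
            seen.add(event["event_id"])
            unique.append(event)
    return unique


STREAMING_CLICKS = count_clicks(STREAM)                # includes the duplicate
REPROCESSED_CLICKS = count_clicks(deduplicate(STREAM))  # corrected by the batch pass
```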
44. LAMBDA ARCHITECTURES
‣ Advantages?
• Works as advertised
• Works with a huge variety of open software
• Druid supports batch-replace-by-time-range through Hadoop
45. LAMBDA ARCHITECTURES
‣ Disadvantages?
‣ Need code to run on two very different systems
‣ Maintaining two codebases is perilous
‣ …productivity loss
‣ …code drift
‣ …difficulty training new developers
49. KAPPA ARCHITECTURE
‣ Pure streaming
‣ Reprocess data by replaying the input stream
‣ Doesn’t require operating two systems
‣ Doesn’t overcome software limitations
‣ I don’t have much experience with this
‣ http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
51. NICE THINGS ABOUT KAFKA
‣ Scalable, replicated pub/sub
‣ Replayable message logs
‣ New consumers can read all old messages
‣ Existing consumers can reprocess all old messages
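The replay behaviour can be modelled with a toy in-memory log in which each consumer tracks its own offset. This is not the Kafka client API; `MessageLog` and `Consumer` are hypothetical classes that only illustrate why an append-only log lets new and existing consumers re-read old messages.

```python
class MessageLog:
    """Append-only log; messages are never removed when they are read."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        self._messages.append(message)
        return len(self._messages) - 1  # offset of the new message

    def read_from(self, offset):
        return self._messages[offset:]


class Consumer:
    """Each consumer owns its position in the log."""

    def __init__(self, log, offset=0):
        self.log = log
        self.offset = offset

    def poll(self):
        batch = self.log.read_from(self.offset)
        self.offset += len(batch)
        return batch

    def rewind(self, offset=0):
        """Reprocess by moving the offset back and polling again."""
        self.offset = offset


LOG = MessageLog()
for payload in ("e1", "e2", "e3"):
    LOG.append(payload)
```

An existing consumer that calls `rewind()` and a brand-new `Consumer(LOG)` both see `e1` through `e3` again, which is what makes stream reprocessing possible.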
52. NICE THINGS ABOUT SAMZA
‣ Multi-tenancy: one main thread per container
‣ Robustness: isolated containers limit slowness and failure
‣ Visibility
‣ Multistage jobs, lots of metrics per stage
‣ Can inspect the message queue in Kafka
‣ State is simple
‣ Logging and restoring handled for you
‣ Single-threaded programming
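The "state is simple" point can be illustrated with a toy store that appends every write to a changelog and can be rebuilt from it after a failure. This is a conceptual sketch only: `ChangelogStore` and `process` are hypothetical names, not Samza's API, and a plain list stands in for the changelog.

```python
class ChangelogStore:
    """Toy key-value store that logs every write so it can be restored."""

    def __init__(self, changelog=None):
        self.changelog = changelog if changelog is not None else []
        self.state = {}

    def put(self, key, value):
        self.changelog.append((key, value))  # log first, then apply
        self.state[key] = value

    def get(self, key, default=None):
        return self.state.get(key, default)

    @classmethod
    def restore(cls, changelog):
        """Rebuild the state by replaying the changelog in order."""
        store = cls(changelog)
        for key, value in changelog:
            store.state[key] = value
        return store


def process(store, message):
    """Single-threaded task logic: count clicks per publisher."""
    publisher = message["publisher"]
    store.put(publisher, store.get(publisher, 0) + message["click"])


MESSAGES = [
    {"publisher": "bieberfever.com", "click": 1},
    {"publisher": "ultratrimfast.com", "click": 0},
    {"publisher": "ultratrimfast.com", "click": 1},
]

STORE = ChangelogStore()
for message in MESSAGES:
    process(STORE, message)
```

Because only one thread runs `process`, the read-modify-write in it needs no locking.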
53. NICE THINGS ABOUT DRUID
‣ Fast ingestion, fast queries
‣ Seamlessly merge stream-ingested and batch-ingested data
‣ Batch loads can “replace” stream loads for the same time range
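One way to picture the "replace" behaviour is a timeline that serves only the newest version of the data for each time range. The sketch below is a simplified model under that assumption: the field names are hypothetical, intervals are matched exactly, and real deployments also have to handle partially overlapping ranges.

```python
# Hypothetical segment metadata: the batch load for hour 01 has a newer version.
SEGMENTS = [
    {"interval": "2011-01-01T01/2011-01-01T02", "version": 1, "source": "stream", "rows": 98},
    {"interval": "2011-01-01T02/2011-01-01T03", "version": 1, "source": "stream", "rows": 57},
    {"interval": "2011-01-01T01/2011-01-01T02", "version": 2, "source": "batch", "rows": 100},
]


def visible_segments(segments):
    """For each interval, keep only the segment with the highest version."""
    newest = {}
    for segment in segments:
        current = newest.get(segment["interval"])
        if current is None or segment["version"] > current["version"]:
            newest[segment["interval"]] = segment
    return newest


def queryable_rows(segments):
    """Total rows a query would see across stream and batch data."""
    return sum(s["rows"] for s in visible_segments(segments).values())


TIMELINE = visible_segments(SEGMENTS)
```

Queries see batch data for hour 01 and stream data for hour 02, so the two ingestion paths merge without the caller needing to know which produced which range.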
54. NICE THINGS ABOUT HADOOP
‣ Solid batch processing system
‣ Easy to partition and reprocess data by time range
‣ Jobs can process all data, or a pre-partitioned slice
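If raw data is laid out in hourly directories, selecting a slice to reprocess reduces to enumerating the partitions in a time range. The directory layout and the `/data/events` root below are assumptions for illustration, not a convention stated in the slides.

```python
from datetime import datetime, timedelta


def hourly_partitions(start, end, root="/data/events"):
    """List hourly partition paths covering [start, end)."""
    paths = []
    current = start.replace(minute=0, second=0, microsecond=0)
    while current < end:
        paths.append(f"{root}/{current.strftime('%Y/%m/%d/%H')}")
        current += timedelta(hours=1)
    return paths


# Reprocess two hours of data; a batch job would take these paths as input.
SLICE = hourly_partitions(datetime(2011, 1, 1, 1), datetime(2011, 1, 1, 3))
```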
55. MONITORING
‣ Kafka partition availability
‣ Kafka log cleaner
‣ Samza consumer offsets
‣ Druid ingestion process rate
‣ Druid ingestion drop rate
‣ Druid query latency
‣ System metrics: CPU, network, disk
‣ Event counts at various stages
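Comparing event counts between consecutive stages is a simple way to spot where data is being lost. The sketch below computes the fraction dropped at each hop and flags hops above a threshold; the stage names, counts, and the 1% threshold are all made-up values.

```python
def drop_rates(stage_counts):
    """Given ordered (stage, count) pairs, return the loss fraction per hop."""
    rates = {}
    for (prev_name, prev_count), (name, count) in zip(stage_counts, stage_counts[1:]):
        lost = prev_count - count
        rates[f"{prev_name}->{name}"] = lost / prev_count if prev_count else 0.0
    return rates


def hops_to_alert(rates, threshold=0.01):
    """Return the hops whose loss fraction exceeds the threshold."""
    return sorted(hop for hop, rate in rates.items() if rate > threshold)


# Hypothetical counts collected over the same time window at each stage.
COUNTS = [("kafka_in", 1000), ("samza_out", 1000), ("druid_ingested", 950)]
RATES = drop_rates(COUNTS)
ALERTS = hops_to_alert(RATES)
```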
62. TAKEAWAYS
‣ Consider Kafka for making your streams available
‣ Consider Samza for streaming data integration
‣ Consider Druid for interactive exploration of streams
‣ Metrics, metrics, metrics
‣ Have a reprocessing strategy if you’re interested in historical data