Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/15ACXCw.
Tyler Akidau from Google demonstrates Google's MillWheel, a streaming system that promises low latency, strong consistency, and flexibility without relying on the Lambda Architecture. Filmed at qconsf.com.
Tyler Akidau is a Senior Software Engineer at Google. The current Tech Lead for the MillWheel team, he's spent five years working on massive-scale streaming data processing systems.
Have Your Cake and Eat It Too -- Further Dispelling the Myths of the Lambda Architecture
1. Have Your Cake & Eat It Too
Further Dispelling the Myths of the Lambda Architecture
Tyler Akidau
Staff Software Engineer
Google Docs version of slides (with animations) available at: http://goo.gl/eX5kxa
Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/millwheel
Presented at QCon San Francisco
www.qconsf.com
6. MillWheel
- Slava Chernyak, Josh Haberman, Reuven Lax, Daniel Mills, Paul Nordstrom, Sam McVeety, Sam Whittle, and more...
Streaming Flume
- Robert Bradshaw, Daniel Mills, and more...
Cloud Dataflow
- Robert Bradshaw, Craig Chambers, Reuven Lax, Daniel Mills, Frances Perry, and more...
17. Why consistency is important
• Mostly correct is not good enough
• Required for exactly-once processing
• Required for repeatable results
• Cannot replace batch without it
31. Approaches to reasoning about time
1. Time-Agnostic Processing
2. Approximation
3. Stream Time Windowing
4. Event Time Windowing
32. 1. Time-Agnostic Processing - Filters
[Timeline: Stream Time, 10:00 to 16:00]
Example Input: Web server traffic logs
Example Output: All traffic from specific domains
Pros: Straightforward; Efficient
Cons: Limited utility
33. 1. Time-Agnostic Processing - Hash Join
[Timeline: Stream Time, 10:00 to 16:00]
Example Input: Query & Click traffic
Example Output: Joined stream of Query + Click pairs
Pros: Straightforward; Efficient
Cons: Limited utility
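The hash join above can be sketched as follows. This is a minimal illustration, not MillWheel code: the class name, the shared `queryId` key, and the unbounded in-memory buffers are all assumptions made for the example. Whichever side arrives first is buffered until its partner shows up, so no notion of time is needed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch: join two unbounded streams on a key, ignoring time entirely. */
public class StreamingHashJoin {
    // Records still waiting for their partner, keyed by a hypothetical queryId.
    private final Map<String, String> pendingQueries = new HashMap<>();
    private final Map<String, String> pendingClicks = new HashMap<>();
    private final List<String> joined = new ArrayList<>();

    public void onQuery(String queryId, String query) {
        String click = pendingClicks.remove(queryId);
        if (click != null) {
            joined.add(query + " -> " + click); // partner already buffered: emit the pair
        } else {
            pendingQueries.put(queryId, query); // otherwise wait for the click
        }
    }

    public void onClick(String queryId, String click) {
        String query = pendingQueries.remove(queryId);
        if (query != null) {
            joined.add(query + " -> " + click);
        } else {
            pendingClicks.put(queryId, click);
        }
    }

    public List<String> joined() {
        return joined;
    }
}
```

Because unmatched records are never evicted in this sketch, state grows without bound if a partner never arrives; a real pipeline would need some expiry policy.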
34. 2. Approximation via Online Algorithms
[Timeline: Stream Time, 10:00 to 16:00]
Example Input: Twitter hashtags
Example Output: Approximate top N hashtags per prefix
Pros: Efficient
Cons: Inexact; Complicated Algorithms
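As one concrete example of an online approximation algorithm, here is a sketch of the Misra-Gries frequent-items summary. The slides do not say which algorithm was used, so this is a stand-in chosen for illustration; the class and method names are hypothetical. With `k - 1` counters over a stream of `n` items, every item occurring more than `n / k` times is guaranteed to remain in the summary, and each reported count undershoots the true count by at most `n / k`.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** Sketch: Misra-Gries summary keeping at most k-1 counters over a stream. */
public class MisraGries {
    private final int k;
    private final Map<String, Long> counters = new HashMap<>();

    public MisraGries(int k) {
        this.k = k;
    }

    public void add(String item) {
        Long c = counters.get(item);
        if (c != null) {
            counters.put(item, c + 1);          // already tracked: increment
        } else if (counters.size() < k - 1) {
            counters.put(item, 1L);             // free slot: start tracking
        } else {
            // No free slot: decrement every counter, dropping those that hit zero.
            Iterator<Map.Entry<String, Long>> it = counters.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<String, Long> e = it.next();
                if (e.getValue() == 1L) {
                    it.remove();
                } else {
                    e.setValue(e.getValue() - 1);
                }
            }
        }
    }

    /** Candidate heavy hitters with (under)estimated counts. */
    public Map<String, Long> candidates() {
        return counters;
    }
}
```

Memory stays bounded by `k` regardless of stream length, which is the "Efficient" side of the trade-off; the undercounting is the "Inexact" side.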
35. 3. Windowing by Stream Time
[Timeline: Stream Time, 10:00 to 16:00]
Example Input: Web server request traffic
Example Output: Per-minute rate of received requests
Pros: Straightforward; Results reflect contents of stream
Cons: Results don't reflect events as they happened; If approximating event time, usefulness varies
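Windowing by stream time can be sketched as a counter keyed by the minute in which each request arrived. The class below is illustrative only; passing `arrivalMillis` in explicitly (rather than reading the system clock) is an assumption made so the example is deterministic.

```java
import java.util.Map;
import java.util.TreeMap;

/** Sketch: per-minute request counts, bucketed by arrival (stream) time. */
public class StreamTimeCounter {
    private static final long MINUTE_MS = 60_000L;
    // Window start (ms) -> number of requests seen in that minute.
    private final Map<Long, Long> countsByMinute = new TreeMap<>();

    /** arrivalMillis is when the system saw the request, not when it happened. */
    public void onRequest(long arrivalMillis) {
        long windowStart = arrivalMillis - Math.floorMod(arrivalMillis, MINUTE_MS);
        countsByMinute.merge(windowStart, 1L, Long::sum);
    }

    public Map<Long, Long> counts() {
        return countsByMinute;
    }
}
```

Since arrival time only moves forward, a window is complete as soon as its minute ends, which is why this approach is straightforward. A request delayed upstream simply lands in a later window, so the counts describe the stream rather than the events.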
36. 4. Windowing by Event Time - Fixed Windows
[Timelines: Event Time, 10:00 to 16:00; Stream Time, 10:00 to 16:00]
Example Input: Twitter hashtags
Example Output: Top N hashtags by prefix per hour
Pros: Reflects events as they occurred
Cons: More complicated buffering; Completeness issues
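A minimal sketch of fixed windows by event time, assuming each record carries its own event timestamp. The names are hypothetical, and for brevity the example counts hashtags per hour rather than computing a top N per prefix. Records are assigned to the hour in which they occurred, no matter when they arrive.

```java
import java.util.Map;
import java.util.TreeMap;

/** Sketch: hourly hashtag counts, bucketed by event time rather than arrival time. */
public class EventTimeFixedWindows {
    private static final long HOUR_MS = 3_600_000L;
    // Window start (ms) -> hashtag -> count.
    private final Map<Long, Map<String, Long>> windows = new TreeMap<>();

    /** Records may arrive in any order; eventTimeMillis decides the window. */
    public void onEvent(long eventTimeMillis, String hashtag) {
        long windowStart = eventTimeMillis - Math.floorMod(eventTimeMillis, HOUR_MS);
        windows.computeIfAbsent(windowStart, w -> new TreeMap<>())
               .merge(hashtag, 1L, Long::sum);
    }

    public long count(long windowStart, String hashtag) {
        Map<String, Long> counts = windows.get(windowStart);
        return counts == null ? 0L : counts.getOrDefault(hashtag, 0L);
    }
}
```

Note that every window has to stay open in the buffer because a late record can still change it. The sketch never decides when a window is finished, which is exactly the completeness issue listed under Cons.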
37. 4. Windowing by Event Time - Sessions
[Timelines: Event Time, 10:00 to 16:00; Stream Time, 10:00 to 16:00]
Example Input: User activity stream
Example Output: Per-session group of activities
Pros: Reflects events as they occurred
Cons: More complicated buffering; Completeness issues
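Session windows can be sketched as intervals of activity that merge whenever two of them fall within a gap of each other. The class below tracks sessions for a single user; the names and the inclusive gap rule are assumptions for illustration, not the MillWheel or Dataflow implementation.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: event-time sessions for one user, merged when within gapMillis. */
public class SessionWindows {
    private final long gapMillis;
    // Each session is a {start, end} pair in event time, kept sorted by start.
    private final List<long[]> sessions = new ArrayList<>();

    public SessionWindows(long gapMillis) {
        this.gapMillis = gapMillis;
    }

    public void onEvent(long eventTimeMillis) {
        long start = eventTimeMillis;
        long end = eventTimeMillis;
        List<long[]> kept = new ArrayList<>();
        for (long[] s : sessions) {
            // Absorb any session that lies within one gap of the growing session.
            if (s[1] + gapMillis >= start && s[0] - gapMillis <= end) {
                start = Math.min(start, s[0]);
                end = Math.max(end, s[1]);
            } else {
                kept.add(s);
            }
        }
        kept.add(new long[] {start, end});
        kept.sort((a, b) -> Long.compare(a[0], b[0]));
        sessions.clear();
        sessions.addAll(kept);
    }

    public List<long[]> sessions() {
        return sessions;
    }
}
```

A late event can bridge two sessions that had looked separate, so previously computed results may need revising. This is the extra buffering and completeness cost compared with fixed windows.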
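55. Triggers API
PCollection<KV<String, Long>> sums = Pipeline
    .begin()
    .read("userRequests")
    .apply(Window.into(new FixedWindows(2, MINUTE))
        .trigger(new SequenceOf(
            // Early results: fire every minute until the watermark passes.
            new RepeatUntil(
                new AtPeriod(1, MINUTE),
                new AtWatermark()),
            // On-time result: fire once at the watermark.
            new AtWatermark(),
            // Late results: fire per late record until 14 days of event time have passed.
            new RepeatUntil(
                new AfterCount(1),
                new AfterDelay(
                    14, DAYS, TimeDomain.EVENT_TIME)))))
    .apply(new Sum());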
56. Lambda vs Streaming
Low-latency, approximate results
Complete, correct results as soon as possible
Ability to deal with changes upstream
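58. Triggers API
PCollection<KV<String, Long>> sums = Pipeline
    .begin()
    .read("userRequests")
    .apply(Window.into(new Sessions(1, MINUTE))
        .trigger(new SequenceOf(
            // Early results: fire every minute until the watermark passes.
            new RepeatUntil(
                new AtPeriod(1, MINUTE),
                new AtWatermark()),
            // On-time result: fire once at the watermark.
            new AtWatermark(),
            // Late results: fire per late record until 14 days of event time have passed.
            new RepeatUntil(
                new AfterCount(1),
                new AfterDelay(
                    14, DAYS, TimeDomain.EVENT_TIME)))))
    .apply(new Sum());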
60. Summary
Lambda is great
Streaming by itself is better :-)
Strong Consistency = Correctness
Streaming = Aggregation + Windowing + Triggers
Tools For Reasoning About Time = Power + Flexibility