This document discusses the Lambda architecture, which is a design pattern for building data processing systems that require both batch and real-time processing. It describes the key components of a Lambda architecture, including batch and real-time data pipelines, serving layers, and a speed layer for low-latency queries. It also covers some of the main tools and frameworks used to implement Lambda architectures, such as Storm, Trident, Redis, and Summingbird, which provides a common API for both batch and real-time processing.
2. @fdouetteau#lambdataiku
Topics For Today
• WHAT is a lambda architecture?
• Examples - Principle
• Motivation – Hard Points
• HOW do you build a lambda architecture?
• Component by component
4. @fdouetteau#lambdataiku
ƛ : SOME USE CASES
• Online Advertising
• Keep track of the number of displays / clicks per position / campaign
• Recommender Systems
• Keep track of product displays / views / clicks / buys
• Statistical Time Line
• Keep track of the number of tweets per hashtag / hour
7. @fdouetteau#lambdataiku
E.g. counting twitter hashtags
EVENTS -> PROCESS -> STATE -> SERVE
Fmap(events) = { (#tag, time) -> count }
FReduce(hashmap, hashmap) = fuse the counts of the two maps
FDisplay(hashmap, events) = FReduce(hashmap, Fmap(events))
TWEET COUNTS
(2014-02-31 13, #foo) -> 3
(2014-02-31 13, #foo) -> 3
(2014-02-31 13, #foo) -> 3
(2014-02-31 13, #foo) -> 3
NEW TWEETS TABLE
2014-02-31 13:14 #foo bar
2014-02-31 13:14 #foo bar
2014-02-31 13:14 #foo bar
2014-02-31 13:14 #foo bar
2014-02-31 13:14 #foo bar
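The three functions on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the names `f_map`, `f_reduce` and `f_display`, the `(timestamp, text)` event tuples and the hour bucketing by string prefix are all hypothetical choices, not something the slides specify. Timestamps are kept as plain strings so the slide's sample rows can be used verbatim.

```python
from collections import Counter

def f_map(events):
    """Turn raw tweets into a map {(hour, #tag) -> count}."""
    counts = Counter()
    for timestamp, text in events:
        hour = timestamp[:13]  # "2014-02-31 13:14" -> "2014-02-31 13"
        for word in text.split():
            if word.startswith("#"):
                counts[(hour, word)] += 1
    return counts

def f_reduce(left, right):
    """Fuse two count maps by adding the counts key by key."""
    merged = Counter(left)
    merged.update(right)  # Counter.update adds counts, it does not overwrite
    return merged

def f_display(state, events):
    """What gets served: the stored state fused with the not-yet-stored events."""
    return f_reduce(state, f_map(events))
```

With a stored count of 3 for `#foo` and the five new tweets from the NEW TWEETS TABLE, `f_display` would report 8 for that hour and tag.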
8. @fdouetteau#lambdataiku
E.g. counting twitter hashtags in “SQL”
SERVE
TWEET COUNTS TABLE
(2014-02-31 13, #foo) -> 8
(2014-02-31 13, #foo2) -> 3
(2014-02-31 13, #foo3) -> 3
(2014-02-31 13, #foo4) -> 1
NEW TWEETS TABLE
2014-02-31 13:14 #foo bar
2014-02-31 13:14 #foo bar
2014-02-31 13:14 #foo bar
2014-02-31 13:14 #foo bar
2014-02-31 13:14 #foo bar
PARTIAL TWEET COUNT TABLE
(2014-02-31 13, #foo) -> 1
(2014-02-31 14, #foo) -> 3
(2014-02-31 14, #foo) -> 3
(2014-02-31 14, #foo) ->
NEW TWEET COUNT TABLE
(2014-02-31 13, #foo) -> 9
(2014-02-31 13, #foo) -> 3
(2014-02-31 13, #foo) -> 3
(2014-02-31 13, #foo) -> 3
CREATE … AS SELECT time, tag, COUNT(*) GROUP BY time, tag
CREATE AS
SELECT time, tag, SUM(counts)
FROM ( oldtable … UNION ALL partialtable )
GROUP BY time, tag
SELECT time, tag, SUM(c) FROM (
SELECT time, tag, c FROM oldtable WHERE tag = …
UNION ALL
SELECT time, tag, c FROM partialtable WHERE tag = …
) GROUP BY time, tag
INSERT VALUES …
RENAME TABLE …
EXECUTE EVERY 5 MINUTES
EXECUTE EVERY HOUR
10. @fdouetteau#lambdataiku
Backtype Story
Capture events and logs from twitter
25TB binary data
100 billion records
400 QPS Average
Scale 1 -> 150 on peak
Take off with a team of 3 engineers with seed funding in 2008
Christopher Golda
Michael Montano
Nathan Marz
Acquired by Twitter (powers Twitter trends …) in 2011
Cascalog
Storm
ElephantDB
11. @fdouetteau#lambdataiku
TWITTER HASHTAGS
EVENTS: 2014-02-31 13:14 #foo bar
REAL-TIME PROC -> REAL-TIME RESULT: (2014-02-31 13, #foo) -> 3
COMPUTE EVERY 5 MINUTES THE HASHTAG COUNTS FOR THE LAST 5 MINUTES (IN MEMORY)
BATCH PROC -> BATCH VIEW: (2014-02-31 13, #foo) -> 3
COMPUTE EVERY HOUR THE HASHTAG COUNTS FOR THE LAST HOUR (ON DISK)
FEDERATION (BATCH VIEW + REAL-TIME RESULT)
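The federation step on this slide can be pictured as a small merge at query time. The sketch below is hypothetical: `federate` and the dict-based views are illustrative names, and it assumes the real-time result only holds counts for events that the batch view has not absorbed yet, so the two can simply be added.

```python
def federate(batch_view, realtime_result):
    """Merge the on-disk batch view with the in-memory real-time result.

    Both arguments map (hour, #tag) -> count. Assumption: the real-time
    result only covers events newer than the last batch run.
    """
    merged = dict(batch_view)
    for key, count in realtime_result.items():
        merged[key] = merged.get(key, 0) + count
    return merged
```

For example, a batch count of 3 and a real-time count of 3 for the same hour and hashtag would be served as 6.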
14. @fdouetteau#lambdataiku
DRIVER 1: Support Smooth Evolution
EVENTS: 2014-02-31 13:14 #foo bar
BATCH PROC -> BATCH VIEW
REAL-TIME PROC -> REAL-TIME RESULT
FEDERATION (BATCH VIEW + REAL-TIME RESULT)
(1) RECOMPUTE THE NEW VERSION ON BATCH WHILE KEEPING THE OLD ONE
(2) THEN UPDATE THE ONLINE VERSION
VIEW ROWS SHOWN: (2014-02-31 13:14, #foo) -> 3 and (2014-02-31 13, #foo) -> 3
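One way to picture the two steps above is a batch view store that keeps several versions side by side and only switches the served one once the recomputation is done. Everything in this sketch is an assumption for illustration: the class `VersionedBatchView`, the version names, and the idea that the new version counts per minute while the old one counts per hour.

```python
from collections import Counter

def by_hour(event):
    timestamp, tag = event
    return (timestamp[:13], tag)   # "2014-02-31 13"

def by_minute(event):
    timestamp, tag = event
    return (timestamp[:16], tag)   # "2014-02-31 13:14"

class VersionedBatchView:
    def __init__(self):
        self._views = {}     # version name -> {key: count}
        self.online = None   # version currently served

    def recompute(self, version, events, key_fn):
        """Step 1: rebuild one version from the raw events; others are untouched."""
        self._views[version] = dict(Counter(key_fn(e) for e in events))

    def promote(self, version):
        """Step 2: switch the served version once it is fully computed."""
        if version not in self._views:
            raise KeyError(f"unknown version: {version}")
        self.online = version

    def get(self, key):
        return self._views[self.online].get(key, 0)
```

Because the raw events are kept, the old version keeps answering queries for as long as the recomputation runs.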
15. @fdouetteau#lambdataiku
DRIVER 2: Real-Time System Offline
EVENTS: 2014-02-31 13:14 #foo bar
BATCH PROC -> BATCH VIEW: (2014-02-31 13, #foo) -> 3
COMPUTE EVERY HOUR THE HASHTAG COUNTS FOR THE LAST HOUR (ON DISK)
REAL-TIME PROC -> REAL-TIME RESULT: (2014-02-31 13, #foo) -> 3
FEDERATION (BATCH VIEW + REAL-TIME RESULT)
FALL BACK TO THE PARTIAL RESULT WHEN THE REAL-TIME GRID IS OFFLINE
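The fallback described on this slide could look like the sketch below. It is a hypothetical illustration: `query`, the `realtime_lookup` callable and the use of `ConnectionError` to signal an offline real-time grid are assumptions, not part of the original deck.

```python
def query(key, batch_view, realtime_lookup):
    """Return (count, complete) for one (hour, #tag) key.

    batch_view is a dict read from disk; realtime_lookup is a callable that
    asks the real-time system and raises ConnectionError when it is offline.
    """
    batch_count = batch_view.get(key, 0)
    try:
        return batch_count + realtime_lookup(key), True
    except ConnectionError:
        # Real-time grid offline: serve the batch-only, partial result.
        return batch_count, False
```

The second element of the returned tuple lets the caller tell the user that the number may lag behind by up to one batch period.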
18. @fdouetteau#lambdataiku
PAIN POINT 1 : EXACTLY ONCE
2014-02-31 13:14 #foo bar
2014-02-31 13:15 toto
2014-02-31 13:15 tutu
2014-02-31 13:16 #two
…
…
Retry
19. @fdouetteau#lambdataiku
PAIN POINT 2 : DYNAMIC SCALE
START AT 100 events per second
HOW TO GROW TO 10k events per second without rebuilding everything?
20. @fdouetteau#lambdataiku
PAIN POINT 3 : SCHEMA CHANGE
EVENTS V1 and EVENTS V2 enter the same pipeline:
BATCH PROC -> BATCH VIEW
REAL-TIME PROC -> REAL-TIME RESULT
FEDERATION
MIX OF VERSION 1 AND VERSION 2 !!!!
22. @fdouetteau#lambdataiku
Lambda Architecture Building Blocks
• Message Queue
• Batch State
• Batch Pump
• Real-Time State
• Real-Time Views Service
• Federated View
• Batch Views Service
• Batch Processing
• Real-Time Processing
26. @fdouetteau#lambdataiku
TOPOLOGY : SINGLE PIPE
• Message Queue
• Batch State
• Batch Pump
• Real-Time State
• Real-Time Views Service
• Federated View
• Batch Views Service
• Batch Processing
• Real-Time Processing
STORM
STORM
38. @fdouetteau#lambdataiku
Exactly-Once in state
State before:
paul -> { car: 2, txid=2 }
pierre -> { car: 5, txid=3 }
Incoming batch, txid=3:
{user=paul, item=car, event=imp}
{user=pierre, item=car, event=imp}
{user=pierre, item=car, event=imp}
State after:
paul -> { car: 3, txid=3 }
pierre -> { car: 5, txid=3 }
Keep track of the last transaction in the state.
A transaction does not apply to state parts that are already at that txid or newer.
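The rule on this slide (store the last transaction id next to each value, and skip entries that already saw the transaction) can be sketched as follows. The function name `apply_batch` and the `{"counts": ..., "txid": ...}` layout are assumptions for the example; the sketch does not reproduce any specific Storm or Trident API.

```python
from collections import Counter

def apply_batch(state, txid, events):
    """Apply one batch of events to the state, at most once per state entry.

    state maps user -> {"counts": {item: n}, "txid": last applied txid}.
    Replaying the same txid (for example after a retry) leaves entries that
    already recorded it unchanged.
    """
    # Pre-aggregate the batch so each user entry is written once.
    deltas = {}
    for (user, item), n in Counter((e["user"], e["item"]) for e in events).items():
        deltas.setdefault(user, {})[item] = n

    for user, items in deltas.items():
        entry = state.setdefault(user, {"counts": {}, "txid": 0})
        if entry["txid"] >= txid:
            continue  # this entry already includes the transaction (or a newer one)
        for item, n in items.items():
            entry["counts"][item] = entry["counts"].get(item, 0) + n
        entry["txid"] = txid
    return state
```

Run on the slide's data, paul goes from 2 to 3 while pierre stays at 5, because pierre's entry was already at txid 3 when the batch was retried.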
39. @fdouetteau#lambdataiku
TOPOLOGY 1 : SHARE STATE
• Message Queue
• Batch State
• Batch Pump
• Real-Time State
• Real-Time Views Service
• Federated View
• Batch Views Service
• Batch Processing
• Real-Time Processing
USE A SINGLE NOSQL SERVICE FOR ALL USE CASES