Processing Big Data in Real-Time - Yanai Franchi, Tikal

1
Processing
“BIG-DATA”
In Real Time
Yanai Franchi, Tikal

2
Two years ago...

4
Vacation to Barcelona

5
After a Long Travel Day

7
Best Salsa Club
NOW
● Good Music
● Crowded –
Now!

8
Same Problem in “gogobot”

10
gogobot checkin
Heat Map Service
Let's Develop
“Gogobot Checkins Heat-Map”

11
Key Notes
● Collector Service - Collects check-ins as text addresses
– We need to use a GeoLocation Service
● Upon elapsed interval, the list of last locations will be
displayed as a Heat-Map in the GUI.
● Web-scale service – tens of thousands of check-ins per second from
all over the world (imaginary, but let's do it for the exercise).
● Accuracy – Sample data, NOT critical data.
– Proportionately representative
– Data volume is large enough to compensate for data loss.

12
Heat-Map Context
Text-Address
Checkins Heat-Map
Service
Gogobot System
Gogobot
Micro Service
Gogobot
Micro Service
Gogobot
Micro Service
Geo Location
Service
Get-GeoCode(Address)
Heat-Map
Last Interval Locations

13
Database
Persist Checkin
Intervals
Processing
Checkins
Read
Text Address
Check-in #1
Check-in #2
Check-in #3
Check-in #4
Check-in #5
Check-in #6
Check-in #7
Check-in #8
Check-in #9
...
Simulate Checkins with a File
Plan A
GET Geo
Location
Geo Location
Service

14
Tons of Addresses
Arriving Every Second

15
Architect - First Reaction...

19
Problems?
● Tedious: You spend time configuring where to send
messages, deploying workers, and deploying
intermediate queues.
● Brittle: There's little fault-tolerance.
● Painful to scale: Partitioning the running workers is
complicated.

20
What We Want?
● Horizontal scalability
● Fault-tolerance
● No intermediate message brokers!
● Higher level abstraction than message
passing
● “Just works”
● Guaranteed data processing (not in this
case)

21
Apache Storm
✔Horizontal scalability
✔Fault-tolerance
✔No intermediate message brokers!
✔Higher level abstraction than message
passing
✔“Just works”
✔Guaranteed data processing

23
What is Storm?
● CEP - Open source and distributed realtime
computation system.
– Makes it easy to reliably process unbounded streams of
tuples
– Doing for realtime processing what Hadoop did for batch
processing.
● Fast - 1M Tuples/sec per node.
– It is scalable, fault-tolerant, guarantees your data will be
processed, and is easy to set up and operate.

24
Streams
Tuple Tuple Tuple Tuple Tuple Tuple
Unbounded sequence of tuples

25
Spouts
Tuple
Tuple
Sources of Streams
Tuple Tuple

26
Bolts
Tuple
TupleTuple
Processes input streams and produces
new streams
Tuple
TupleTupleTuple
Tuple TupleTuple

27
Storm Topology
Network of spouts and bolts
Tuple
TupleTuple
TupleTuple TupleTuple
Tuple TupleTupleTuple
Tuple
Tuple
Tuple
Tuple TupleTupleTuple

28
Guarantee for Processing
● Storm guarantees the full processing of a tuple by
tracking its state
● In case of failure, Storm can re-process it.
● Source tuples with full “acked” trees are removed
from the system
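
One way to picture the tracking described above: Storm's acker keeps, for each source tuple, a single 64-bit value that is XORed with the id of every tuple in the tree once when that tuple is emitted and once when it is acked. The tree is fully processed when the value returns to zero. The sketch below simulates that bookkeeping in plain Java. The class `AckTracker` and its method names are hypothetical and are not part of Storm's API.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical simulation of XOR-based tuple-tree tracking; not Storm's API. */
final class AckTracker {
    // Source (spout) tuple id -> XOR of all tuple ids emitted or acked in its tree.
    private final Map<Long, Long> ackVals = new HashMap<>();

    /** Records that a tuple with the given id was emitted in the tree of rootId. */
    void emitted(long rootId, long tupleId) {
        ackVals.merge(rootId, tupleId, (a, b) -> a ^ b);
    }

    /** Records an ack; returns true once the whole tree has been processed. */
    boolean acked(long rootId, long tupleId) {
        Long current = ackVals.get(rootId);
        if (current == null) {
            throw new IllegalStateException("Unknown source tuple: " + rootId);
        }
        long updated = current ^ tupleId;
        if (updated == 0L) {
            ackVals.remove(rootId); // full "acked" tree: drop it from the system
            return true;
        }
        ackVals.put(rootId, updated);
        return false;
    }

    /** True while the tree of rootId still has un-acked tuples. */
    boolean isPending(long rootId) {
        return ackVals.containsKey(rootId);
    }
}
```

In this sketch, child tuples must be recorded with `emitted` before their parent is acked, otherwise the value could reach zero too early. A source tuple that stays pending past a timeout is the one Storm would re-process.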

29
Tasks (Bolt/Spout Instance)
Spouts and bolts execute as
many tasks across the cluster

30
Stream Grouping
When a tuple is emitted, which task
(instance) does it go to?

31
Stream Grouping
● Shuffle grouping: pick a random task
● Fields grouping: consistent hashing on a subset of
tuple fields
● All grouping: send to all tasks
● Global grouping: pick task with lowest id
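
The four groupings above can be sketched as functions that pick target task indexes. This is a plain-Java illustration under simplifying assumptions, not Storm's implementation: `GroupingSketch` is a hypothetical class, tasks are numbered from 0, and fields grouping is modelled as a hash of the grouping values modulo the task count.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.Random;

/** Hypothetical sketch of how each grouping could choose target tasks. */
final class GroupingSketch {
    private final int numTasks;
    private final Random random;

    GroupingSketch(int numTasks, long seed) {
        if (numTasks <= 0) {
            throw new IllegalArgumentException("numTasks must be positive");
        }
        this.numTasks = numTasks;
        this.random = new Random(seed);
    }

    /** Shuffle grouping: any task, chosen at random. */
    int shuffle() {
        return random.nextInt(numTasks);
    }

    /** Fields grouping: equal grouping values always map to the same task. */
    int fields(Object... groupingValues) {
        return Math.floorMod(Objects.hash(groupingValues), numTasks);
    }

    /** All grouping: every task receives a copy of the tuple. */
    List<Integer> all() {
        List<Integer> targets = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            targets.add(i);
        }
        return targets;
    }

    /** Global grouping: always the task with the lowest id. */
    int global() {
        return 0;
    }
}
```

For example, `new GroupingSketch(4, 42L).fields("Barcelona")` returns the same index on every call, which is what lets a bolt keep per-city state in one task.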

32
Tasks, Executors, Workers
Task Task Task
Worker Process
Spout /
Bolt
Spout /
Bolt
Spout /
Bolt
=
Executor Thread
JVM
Executor Thread

33
Bolt B Bolt B
Worker Process
Executor
Spout A
Executor
Node
Supervisor
Bolt C Bolt C
Executor
Bolt B Bolt B
Worker Process
Executor
Spout A
Executor
Node
Supervisor
Bolt C Bolt C
Executor

34
Nimbus
Supervisor Supervisor
Upload/Rebalance
Heat-Map Topology
ZooKeeper
Nodes
Storm Architecture
Master Node
(similar to Hadoop JobTracker)
NOT critical
for running topology

35
Nimbus
Upload/Rebalance
Heat-Map Topology
ZooKeeper
Storm Architecture
Used For Cluster Coordination
A few
nodes

36
Nimbus
Upload/Rebalance
Heat-Map Topology
ZooKeeper
Storm Architecture
Run Worker Processes

37
Assembling Heatmap Topology

38
HeatMap Input/Output Tuples
● Input Tuples: Timestamp and Text Address:
– (9:00:07 PM, “287 Hudson St New York NY 10013”)
● Output Tuple: Time interval, and a list of points for
it:
– (9:00:00 PM to 9:00:15 PM,
List((40.719,-73.987),(40.726,-74.001),(40.719,-73.987)))
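
To make the mapping from input timestamp to output interval concrete, here is a minimal sketch that assigns a timestamp to a 15-second interval, matching the 9:00:00 PM to 9:00:15 PM example above. The class `IntervalBuckets` and the use of epoch milliseconds are assumptions for illustration and do not come from the slides.

```java
/** Hypothetical helper: maps a check-in timestamp to its 15-second interval. */
final class IntervalBuckets {
    static final long INTERVAL_MILLIS = 15_000L;

    private IntervalBuckets() {
    }

    /** Index of the interval that contains the timestamp (epoch millis). */
    static long intervalIndex(long timestampMillis) {
        return Math.floorDiv(timestampMillis, INTERVAL_MILLIS);
    }

    /** Inclusive start of the interval, in epoch millis. */
    static long intervalStart(long timestampMillis) {
        return intervalIndex(timestampMillis) * INTERVAL_MILLIS;
    }

    /** Exclusive end of the interval, in epoch millis. */
    static long intervalEnd(long timestampMillis) {
        return intervalStart(timestampMillis) + INTERVAL_MILLIS;
    }
}
```

A check-in 7 seconds into an interval (for example 9:00:07 PM) and one 14 seconds in share the same index, so their points end up in the same output list.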

39
Checkins
Spout
Geocode
Lookup
Bolt
Heatmap
Builder
Bolt
Persistor
Bolt
(9:01 PM @ 287 Hudson st)
(9:01 PM, (40.736, -74.354))
Heat Map
Storm
Topology
(9:00 PM – 9:15 PM, List((40.73, -74.34),
(51.36, -83.33), (69.73, -34.24)))
Upon
Elapsed Interval

40
Checkins Spout
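// Imports assume Storm 1.0+ (org.apache.storm.*); earlier releases use backtype.storm.*.
import java.util.Date;
import java.util.List;
import java.util.Map;
import org.apache.commons.io.IOUtils;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class CheckinsSpout extends BaseRichSpout {
    private List<String> sampleLocations;   // state: each spout task runs on a single thread
    private int nextEmitIndex;
    private SpoutOutputCollector outputCollector;

    @Override
    public void open(Map map, TopologyContext topologyContext,
            SpoutOutputCollector spoutOutputCollector) {
        this.outputCollector = spoutOutputCollector;
        this.nextEmitIndex = 0;
        try {
            // Load the simulated check-in addresses from the classpath.
            sampleLocations = IOUtils.readLines(
                    ClassLoader.getSystemResourceAsStream("sample-locations.txt"));
        } catch (Exception e) {
            throw new RuntimeException("Cannot read sample-locations.txt", e);
        }
    }

    @Override
    public void nextTuple() {   // called repeatedly by Storm
        String address = sampleLocations.get(nextEmitIndex);
        String checkin = new Date().getTime() + "@ADDRESS:" + address;
        outputCollector.emit(new Values(checkin));
        nextEmitIndex = (nextEmitIndex + 1) % sampleLocations.size();
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("str"));
    }
}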
We hold state: no need for thread safety
Declare output fields
nextTuple() is called iteratively by Storm

41
Geocode Lookup Bolt
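// Imports assume Storm 1.0+ (org.apache.storm.*); earlier releases use backtype.storm.*.
import java.util.Map;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class GeocodeLookupBolt extends BaseBasicBolt {
    private LocatorService locatorService;

    @Override
    public void prepare(Map stormConf, TopologyContext context) {
        locatorService = new GoogleLocatorService();
    }

    @Override
    public void execute(Tuple tuple, BasicOutputCollector outputCollector) {
        // The spout emits "<epoch millis>@ADDRESS:<text address>".
        String str = tuple.getStringByField("str");
        String[] parts = str.split("@ADDRESS:", 2);
        Long time = Long.valueOf(parts[0]);
        String address = parts[1];
        // Get the geocode and create the DTO.
        LocationDTO locationDTO = locatorService.getLocation(address);
        String city = locationDTO.getCity();
        outputCollector.emit(new Values(city, time, locationDTO));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer fieldsDeclarer) {
        fieldsDeclarer.declare(new Fields("city", "time", "location"));
    }
}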
Get Geocode,
Create DTO
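
The spout sends each check-in as a single string of the form `<epoch millis>@ADDRESS:<text address>`, so the bolt's first job is to split it back into a time and an address. The sketch below isolates that parsing step in plain Java. The class `Checkin` and its fields are hypothetical names used only for this illustration.

```java
/** Hypothetical value type for one parsed check-in string. */
final class Checkin {
    private static final String SEPARATOR = "@ADDRESS:";

    final long time;       // epoch millis written by the spout
    final String address;  // free-text address to be geocoded

    private Checkin(long time, String address) {
        this.time = time;
        this.address = address;
    }

    /** Parses "<epoch millis>@ADDRESS:<text address>". */
    static Checkin parse(String checkin) {
        int at = checkin.indexOf(SEPARATOR);
        if (at < 0) {
            throw new IllegalArgumentException("Malformed check-in: " + checkin);
        }
        long time = Long.parseLong(checkin.substring(0, at));
        String address = checkin.substring(at + SEPARATOR.length());
        return new Checkin(time, address);
    }
}
```

Splitting on the full separator rather than on a bare `@` keeps the `ADDRESS:` prefix out of the text that is sent to the geocoding service.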

42
Tick Tuple – Repeating Mantra

43
Two Streams to Heat-Map Builder
On tick tuple, we flush our Heat-Map
Checkin 1 Checkin 4 Checkin 5 Checkin 6
HeatMap-
Builder Bolt

44
Tick Tuple in Action
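// Imports assume Storm 1.0+ (org.apache.storm.*); earlier releases use backtype.storm.*.
import java.util.List;
import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.Constants;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;

public class HeatMapBuilderBolt extends BaseBasicBolt {
    private Map<String, List<LocationDTO>> heatmaps;   // holds the latest intervals

    @Override
    public Map<String, Object> getComponentConfiguration() {
        Config conf = new Config();
        conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 60);   // tick interval in seconds
        return conf;
    }

    @Override
    public void execute(Tuple tuple, BasicOutputCollector outputCollector) {
        if (isTickTuple(tuple)) {
            // Emit accumulated intervals
        } else {
            // Add check-in info to the current interval in the Map
        }
    }

    private boolean isTickTuple(Tuple tuple) {
        return tuple.getSourceComponent().equals(Constants.SYSTEM_COMPONENT_ID)
                && tuple.getSourceStreamId().equals(Constants.SYSTEM_TICK_STREAM_ID);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("time-interval", "city", "locationsList"));
    }
}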
Tick interval
Hold latest intervals
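
The two branches that the slide leaves as comments can be sketched without any Storm types: ordinary tuples add a location to the current interval, and a tick tuple hands back everything collected so far and starts a fresh interval. This is a minimal illustration, not the author's implementation. `HeatMapAccumulator` is a hypothetical class, the map is assumed to be keyed by city, and points are stored as `{lat, lng}` arrays in place of `LocationDTO`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the bolt's accumulate and flush branches. */
final class HeatMapAccumulator {
    // City -> points collected during the current interval.
    private Map<String, List<double[]>> heatmaps = new HashMap<>();

    /** Non-tick branch: add a check-in location to the city's current list. */
    void add(String city, double lat, double lng) {
        heatmaps.computeIfAbsent(city, k -> new ArrayList<>())
                .add(new double[] {lat, lng});
    }

    /** Tick branch: return everything accumulated and start a new interval. */
    Map<String, List<double[]>> flush() {
        Map<String, List<double[]>> finished = heatmaps;
        heatmaps = new HashMap<>();
        return finished;
    }
}
```

In a bolt, the result of `flush()` would be emitted as one tuple per city. Because no locking is used here, the sketch relies on a single thread calling both methods, which holds when both branches run inside the same `execute` method.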

45
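Persistor Bolt
public class PersistorBolt extends BaseBasicBolt {
    private Jedis jedis;                 // assumed to be created in prepare() (not shown on the slide)
    private ObjectMapper objectMapper;   // Jackson mapper, also assumed to be set up in prepare()

    @Override
    public void execute(Tuple tuple, BasicOutputCollector outputCollector) {
        Long timeInterval = tuple.getLongByField("time-interval");
        String city = tuple.getStringByField("city");
        String dbKey = "checkins-" + timeInterval + "@" + city;
        try {
            String locationsList = objectMapper.writeValueAsString(
                    tuple.getValueByField("locationsList"));
            jedis.setex(dbKey, 3600 * 24, locationsList);   // persist in Redis for 24h
            jedis.publish("location-key", dbKey);           // publish the key for debugging
        } catch (IOException e) {
            throw new RuntimeException("Cannot serialize locations for " + dbKey, e);
        }
    }
    // declareOutputFields(...) is not shown on the slide.
}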
Publish in
Redis channel
for debugging
Persist in Redis
for 24h

46
Shuffle Grouping
Shuffle Grouping
Check-in #1
Check-in #2
Check-in #3
Check-in #4
Check-in #5
Check-in #6
Check-in #7
Check-in #8
Check-in #9
...
Sample Checkins File
Read
Text Addresses
Transforming the Tuples
Checkins
Spout
Geocode
Lookup
Bolt
Heatmap
Builder
Bolt
Database
Persistor
Bolt
Get Geo
Location
Geo Location
Service
Field Grouping(city)
Group by city

47
Heat Map Topology
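// Imports assume Storm 1.0+ (org.apache.storm.*); earlier releases use backtype.storm.*.
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class LocalTopologyRunner {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = buildTopology();
        // submitTopology declares checked exceptions, hence "throws Exception".
        StormSubmitter.submitTopology(
                "local-heatmap", new Config(), builder.createTopology());
    }

    private static TopologyBuilder buildTopology() {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("checkins", new CheckinsSpout());
        builder.setBolt("geocode-lookup", new GeocodeLookupBolt())
                .shuffleGrouping("checkins");
        builder.setBolt("heatmap-builder", new HeatMapBuilderBolt())
                .fieldsGrouping("geocode-lookup", new Fields("city"));
        builder.setBolt("persistor", new PersistorBolt())
                .shuffleGrouping("heatmap-builder");
        return builder;
    }
}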

50
Scaling the Topology
public static void main(String[] args) throws Exception {
    TopologyBuilder builder = buildTopology();
    Config conf = new Config();
    conf.setNumWorkers(2);   // no. of worker processes; e.g. conf.setNumWorkers(20) on a larger cluster
    StormSubmitter.submitTopology(
            "local-heatmap", conf, builder.createTopology());
}

private static TopologyBuilder buildTopology() {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("checkins", new CheckinsSpout(), 4);   // parallelism hint: 4 executors
    builder.setBolt("geocode-lookup", new GeocodeLookupBolt(), 8)
            .shuffleGrouping("checkins").setNumTasks(64);   // more tasks than executors, for future scaling
    builder.setBolt("heatmap-builder", new HeatMapBuilderBolt(), 4)
            .fieldsGrouping("geocode-lookup", new Fields("city"));
    builder.setBolt("persistor", new PersistorBolt(), 2)
            .shuffleGrouping("heatmap-builder").setNumTasks(4);
    return builder;
}
Parallelism hint
Increase Tasks
For Future
Set no. of workers
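
The numbers on this slide relate as follows: the parallelism hint sets the initial number of executors (threads), `setNumTasks` sets the number of tasks, and without `setNumTasks` Storm runs one task per executor. The sketch below works through the arithmetic of spreading tasks over executors. `ParallelismSketch` is a hypothetical helper, and the even round-robin spread is an assumption made for illustration.

```java
/** Hypothetical helper showing how many tasks each executor would run. */
final class ParallelismSketch {
    private ParallelismSketch() {
    }

    /** Spreads numTasks over numExecutors as evenly as possible. */
    static int[] tasksPerExecutor(int numTasks, int numExecutors) {
        if (numExecutors <= 0 || numTasks < numExecutors) {
            throw new IllegalArgumentException(
                    "need at least one executor and at least one task per executor");
        }
        int[] counts = new int[numExecutors];
        for (int task = 0; task < numTasks; task++) {
            counts[task % numExecutors]++;
        }
        return counts;
    }
}
```

With the values on the slide, `tasksPerExecutor(64, 8)` gives 8 tasks on each geocode-lookup executor and `tasksPerExecutor(4, 2)` gives 2 tasks on each persistor executor. The task count stays fixed for the lifetime of the topology, so the spare tasks are what allow the number of executors to grow later through a rebalance.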

51
Database
Storm Heat-Map
Topology
Persist Checkin
Intervals
GET Geo
Location
Check-in #1
Check-in #2
Check-in #3
Check-in #4
Check-in #5
Check-in #6
Check-in #7
Check-in #8
Check-in #9
...
Read
Text Address
Sample Checkins File
Recap – Plan A
Geo Location
Service

54
Plan B -
Kafka Spout & Bolt to HeatMap
Geocode
Lookup
Bolt
Heatmap
Builder
Bolt
Kafka
Checkins
Spout
Database
Persistor
Bolt
Geo Location
Service
Read
Text Addresses
Checkin
Kafka
Topic
Publish
Checkins
Locations
Topic
Kafka
Locations
Bolt

56
They all are Good
But not for all use-cases

57
Kafka
A little introduction

66
Stateless Broker &
Doesn't Fear the File System

70
Topics
● Logical collections of partitions (the physical files).
● A broker contains some of the partitions for a topic

71
A Partition is Consumed by
Exactly One Consumer in Each Group
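
The rule in this slide title can be illustrated by assigning the partitions of a topic to the consumers of one group so that every partition has exactly one owner. The sketch below uses a simple round-robin assignment, which is an assumption for illustration and not necessarily the strategy Kafka applies. `PartitionAssignmentSketch` is a hypothetical class.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical round-robin assignment of partitions within one consumer group. */
final class PartitionAssignmentSketch {
    private PartitionAssignmentSketch() {
    }

    /** Returns consumer -> owned partitions; each partition has exactly one owner. */
    static Map<String, List<Integer>> assign(int numPartitions, List<String> consumers) {
        if (consumers.isEmpty()) {
            throw new IllegalArgumentException("group has no consumers");
        }
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String consumer : consumers) {
            assignment.put(consumer, new ArrayList<>());
        }
        for (int partition = 0; partition < numPartitions; partition++) {
            String owner = consumers.get(partition % consumers.size());
            assignment.get(owner).add(partition);
        }
        return assignment;
    }
}
```

With 4 partitions and 2 consumers, each consumer owns 2 partitions. With 2 partitions and 3 consumers, the third consumer stays idle, which is why adding consumers beyond the partition count does not add throughput within a group.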

72
Distributed &
Fault-Tolerant

73
Broker 1 Broker 2 Broker 3
ZooKeeper
Consumer 1 Consumer 2
Producer 1 Producer 2

74
Broker 1 Broker 2 Broker 3 Broker 4
ZooKeeper

86
Performance Benchmark
3 Brokers
3 Producers
3 Consumers
Cheap Machines

● “Up to 2 million writes/sec on 3 cheap machines”
● Using 3 producers on 3 different machines, 3x async replication
– Only 1 producer/machine because NIC already saturated
● End-to-end latency is about 10ms for 99.9%
● Sustained throughput as stored data grows

88
Add Kafka to our Topology
...
...
builder.setSpout("checkins", new KafkaSpout(kafkaConfig) , 4);
...
builder.setBolt("kafkaProducer", new KafkaOutputBolt
( "localhost:9092",
"kafka.serializer.StringEncoder",
"locations-topic"))
.shuffleGrouping("persistor");
return builder;
}
}
Kafka Bolt
Kafka Spout

89
Checkin HTTP
Reactor
Publish
Checkins
Database
Checkin
Kafka
Topic
Consume Checkins
Storm Heat-Map
Topology
Locations
Kafka
Topic
Publish
Interval Key
Persist Checkin
Intervals
Geo Location
Service
GET Geo
Location
Text-Address

91
Summary
When You go out to Salsa Club...
● Good Music
● Crowded

92
More Conclusions..
● BigData – Also refers to the Velocity of data (not only
the Volume of data)
● Storm – Great for real-time BigData processing.
Complementary to Hadoop batch jobs.
● Kafka – Great messaging for logs/events data, and
serves as a good “source” for a Storm spout
