Apache Kafka is a distributed streaming platform that forms a key part of the infrastructure at many companies including Uber, Netflix and LinkedIn. In this talk, Matt gave a technical overview of Apache Kafka, discussed practical use cases of Kafka for IoT data and demonstrated how to ingest data from an MQTT server using Kafka Connect.
Pub/Sub Messaging Protocol (MQTT)
Pub/Sub Messaging System (Kafka, rethought as a distributed commit log)
Distributed Streaming Platform
● Pub Sub Messaging
● Event Storage
● Processing Framework
Problem Statement
Let’s build a system to:
• Transport OBD-II data over unreliable links from cars to the data center
• Capable of handling millions of devices*
• Extract information from, and respond to, this data in (near) real time (at scale)
• Handle surges in usage
• Potential for ad-hoc historical processing
* and also far fewer
Architecture / technology / methods applicable to many scenarios.
Publish / subscribe messaging protocol:
• Built on top of TCP/IP
• Features that make it well suited to poor connectivity / high latency scenarios
• Lightweight
• Efficient client implementations, low network overhead
• MQTT-SN for non-IP networks ('virtual connections')
• Many (open source) broker implementations
• Mosquitto, RabbitMQ, HiveMQ, VerneMQ
• Many Client Libraries
• C, C++, Java, C#, Python, JavaScript (incl. websockets), Arduino …
• Widely used (incl. phone apps!)
• Oil pipeline sensor via satellite link
• Facebook Messenger
• AWS IoT
MQTT Introduction
• Simple API
• Hierarchical topics
• myhome/kitchen/door/front/battery/level
• wildcard subscription: myhome/+/door/+/battery/level (+ matches one level, # matches all remaining levels)
• 3 qualities of service (on both produce and consume)
• At most once (QoS 0)
• At least once (QoS 1)
• Exactly once (QoS 2) [not universally supported]
• Persistent consumer sessions
• Important for QoS 1, QoS 2
• Last will and testament
• Last known good value (retained messages)
• Authorization, SSL/TLS
MQTT Features
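The hierarchical-topic matching rules above can be sketched in a few lines of Python (illustrative only; real brokers implement this for you):

```python
def topic_matches(pattern: str, topic: str) -> bool:
    """Match an MQTT topic against a subscription pattern.

    '+' matches exactly one topic level; '#' (which per the MQTT spec
    must be the final level of the pattern) matches all remaining levels.
    """
    p_levels = pattern.split("/")
    t_levels = topic.split("/")
    for i, p in enumerate(p_levels):
        if p == "#":
            return True
        if i >= len(t_levels):
            return False
        if p != "+" and p != t_levels[i]:
            return False
    return len(p_levels) == len(t_levels)

# The subscription from the slide matches any room/door combination:
print(topic_matches("myhome/+/door/+/battery/level",
                    "myhome/kitchen/door/front/battery/level"))  # True
```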
• Device Id
• GPS Location [lon, lat]
• Ignition on / off
• Speedometer reading
• Timestamp
• …plus a lot more
Assume: data sent via 3G wireless connection at ~30 second interval
OBD-II Data
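A single OBD-II reading of roughly this shape might look like the following (field names are illustrative assumptions, not taken from the talk):

```python
import json
import time

# Hypothetical shape of one OBD-II reading, sent every ~30 seconds over 3G.
obd_record = {
    "device_id": "dev-0042",          # later used as the Kafka message key
    "lon_lat": [-122.4194, 37.7749],  # GPS location [lon, lat]
    "ignition_on": True,
    "speedometer_kph": 57.0,
    "timestamp": int(time.time()),
}
print(json.dumps(obd_record))
```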
Deficiencies:
• A single MQTT server can handle maybe ~100K connections
• Can’t handle usage surges (no buffering)
• No storage of events or reprocess capability
[Diagram: cars publish to MQTT Server 1 on topic [deviceid]/obd; Processor 1, Processor 2, … consume directly]
Ingest Architecture V1
Apache Kafka
Distributed Streaming Platform:
• Pub Sub Messaging
• (typically clients are within data-center)
• Data Store
• Messages not deleted after delivery
• Stream Processing
• Low or high level libraries
• Data re-processing
● Persisted
● Append only
● Immutable
● Delete earliest data based on time / size / never
• Allows topics to scale past the constraints of a single server
• Message → partition_id mapping is deterministic; the choice of partitioning key is relevant to the application
• Ordering guarantees per partition, but not across partitions
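The deterministic message → partition_id mapping is typically a hash of the message key modulo the partition count. A minimal sketch (Kafka's default partitioner actually uses murmur2; crc32 here is just a stand-in stable hash):

```python
import zlib

NUM_PARTITIONS = 8  # assumed partition count for illustration

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Stable hash of the key: every record with the same key
    # (e.g. the same device_id) always lands on the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Deterministic: repeated calls with the same key give the same partition.
print(partition_for("dev-0042") == partition_for("dev-0042"))  # True
```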
Kafka Connect
• Use client library producers / consumers in custom applications.
• Often want to bulk transfer data between standard systems:
• Don’t re-invent the wheel – configure Kafka Connect
• Narrow scope: move data into & out of Kafka
• Off-the-shelf connectors
• Fault Tolerant
• Auto-balances load
• Pluggable Serialization
• Standalone and distributed modes of operation
• Configuration / management via REST API
MQTT Connector
https://github.com/evokly/kafka-connect-mqtt
• Single Task
• Single MQTT Broker
• Source only
Either:
• Start a bunch of these connectors (in one connect cluster), one per server, or:
• Implement a new multi-task connector, one task per MQTT broker.
• Communicate with the MQTT server coordinator
• user_id
• device_id
• name
• address
• phone_number
• speed_alert_level
• ...
Stored in a SQL DB table: User_Info
User Data
Example: Car Towed Alert
Detect movement of car when ignition off, send SMS alert
[Diagram: Kafka Broker 1 hosts OBD_Data partitions P1 and P5, Broker 2 hosts P3 and P7, …; Consumer 1 and Consumer 2 each keep the last location in an in-memory KV store, query the User Info DB only on alert, and send via an SMS Gateway]
Consumer Implementation
on_message(message m)
{
    var device_id = m.key;
    var obd_data = m.value;

    // only interested in movement while the ignition is off
    if (obd_data.ignition_on)
        return;

    // first reading for this device: just record its location
    if (!kv_store.contains(device_id)) {
        kv_store.add(device_id, obd_data.lon_lat);
        return;
    }

    var prev_lon_lat = kv_store.get(device_id);
    var dist = calc_dist(obd_data.lon_lat, prev_lon_lat);
    kv_store.set(device_id, obd_data.lon_lat);

    if (dist > alert_max_dist) {
        // infrequent: OK to hit the SQL DB here
        send_alert(SQL.get_phone_number(device_id));
    }
}
• Message can be from any partition assigned to this consumer
• Ordering guaranteed per partition, but not predictable across partitions
• All messages from a particular device guaranteed to arrive at the same consumer instance
Example: Speed Alert
• Scenario: Parent wants to monitor son/daughter driving and be alerted if they exceed a specified speed.
• In the Tow Alert example User_Info only needs to be queried in the event of an alert.
• In this example, the table needs to be queried for every OBD data record in every partition.
[Diagram: high-frequency OBD_data partitions (P1, …) each querying the User Info table, which can update at any time. Not scalable! Cache?]
Table can be represented as a stream of updates. Each update (right) is applied to the table of device_id → speed_limit (left):

Time = 0: {device_id=1, speed_limit=60} → table: 1→60
Time = 1: {device_id=2, speed_limit=80} → table: 1→60, 2→80
Time = 2: {device_id=3, speed_limit=70} → table: 1→60, 2→80, 3→70
Time = 3: {device_id=1, speed_limit=80} → table: 1→80, 2→80, 3→70
Time = 4: {device_id=1, speed_limit=65} → table: 1→65, 2→80, 3→70

Log compaction!
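Replaying such an update stream and keeping only the latest value per key is exactly what materializing a log-compacted topic amounts to. A minimal sketch:

```python
# The change-log stream of speed-limit updates from the slide above.
updates = [
    {"device_id": 1, "speed_limit": 60},
    {"device_id": 2, "speed_limit": 80},
    {"device_id": 3, "speed_limit": 70},
    {"device_id": 1, "speed_limit": 80},
    {"device_id": 1, "speed_limit": 65},
]

# Materialize: last write per key wins, like a compacted Kafka topic.
table = {}
for u in updates:
    table[u["device_id"]] = u["speed_limit"]

print(table)  # {1: 65, 2: 80, 3: 70}
```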
Debezium
Kafka Connector that turns database tables into streams of update records.
[Diagram: MySQL User Info table (key: userId) → debezium → User_Info changelog topic, partitions 1–6, partitioned by device_id]
Speed Alert: Message handler
on_message(message m)
{
    var device_id = m.key;
    var obd_data = m.value;

    // join against the locally materialized User_Info table
    var user_info = user_info_local.get(device_id);
    if (obd_data.speedometer > user_info.max_speed) {
        alert_user(device_id, user_info);
    }
}
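Because the User_Info changelog is co-partitioned with OBD_Data by device_id, each consumer can materialize just its share of the table and join in memory. A Python sketch under that assumption (handler and field names are illustrative):

```python
user_info_local = {}  # materialized slice of the User_Info changelog
alerts = []

def on_user_info_update(m):
    # Changelog record: the latest user info for this device wins.
    user_info_local[m["device_id"]] = m["value"]

def on_obd_message(m):
    # Local stream-table join: no per-message database round trip.
    info = user_info_local.get(m["device_id"])
    if info and m["speedometer"] > info["speed_limit"]:
        alerts.append((m["device_id"], m["speedometer"]))

on_user_info_update({"device_id": 1, "value": {"speed_limit": 60}})
on_obd_message({"device_id": 1, "speedometer": 72})
print(alerts)  # [(1, 72)]
```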
MQTT Phone Client Connectivity
[Diagram: an MQTT Server Coordinator assigns phone clients to MQTT Server 1/2/3; cars publish to [deviceid]/obd; Consumer 1, … publish alerts back out on [deviceid]/alert]
Speed Limit Alert: Rate limiting
[Diagram: app_state kafka topic, partitions 1–7]
• Prefer to rate limit on the server to minimize network overhead.
• Create a new Kafka topic app_state, partitioned on device_id.
• When an alert is triggered, store the alert time in this topic.
• [can use this topic as a general store for other per-device state info too]
• Materialize this change-log stream on consumers as necessary.
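The rate-limiting step can be sketched as: look up the last alert time in the materialized app_state view and suppress the alert if it falls within the window (the window length and function names here are assumptions):

```python
ALERT_WINDOW_SECONDS = 600  # assumed: at most one alert per 10 minutes

last_alert_time = {}  # materialized view of the app_state topic

def maybe_alert(device_id, now, send):
    prev = last_alert_time.get(device_id)
    if prev is not None and now - prev < ALERT_WINDOW_SECONDS:
        return False  # rate limited: alert suppressed
    # In the real pipeline this update would also be produced to app_state.
    last_alert_time[device_id] = now
    send(device_id)
    return True

sent = []
print(maybe_alert("dev-1", 1000, sent.append))  # True  (first alert goes out)
print(maybe_alert("dev-1", 1300, sent.append))  # False (within the window)
print(maybe_alert("dev-1", 1700, sent.append))  # True  (window elapsed)
```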
Example: Location Based Special Offers
When Car enters specific region, send available special offers to the user’s phone.
Require:
• User_Info
• Address – so we know whether they are local to their current location or not
• App_state
• Use to persist already sent offers
• Special_Offer_Info
• Table that stores the list of all special offers.
[Figure: map divided into a 7×6 grid of regions numbered 1–42, row by row]
Regions
• Regions may be simple (as depicted here) or complex
• F(lon, lat) → locationId
• Note: could also implement ride-share surge pricing using similar partitioning.
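For the simple grid shown, F(lon, lat) → locationId can be a direct cell computation. A sketch with an arbitrary illustrative bounding box (the real system's bounds and grid shape are assumptions):

```python
# Assumed bounding box covering a 7-column x 6-row grid, with cells
# numbered 1..42 row by row as in the figure (row 0 at the top).
LON_MIN, LON_MAX = -123.0, -116.0
LAT_MIN, LAT_MAX = 32.0, 38.0
COLS, ROWS = 7, 6

def location_id(lon, lat):
    # Clamp to the last cell so points on the max edge stay in range.
    col = min(int((lon - LON_MIN) / (LON_MAX - LON_MIN) * COLS), COLS - 1)
    row = min(int((LAT_MAX - lat) / (LAT_MAX - LAT_MIN) * ROWS), ROWS - 1)
    return row * COLS + col + 1

print(location_id(-123.0, 38.0))  # 1  (top-left cell)
print(location_id(-116.5, 32.5))  # 42 (bottom-right cell)
```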
Special Offer Change-log Stream
[Diagram: MySQL Special Offer Info table → debezium → Special_Offers changelog topic (compacted), partitions 1–6, partitioned by location_id]
Multi-stage Data Pipeline
[Diagram: consume OBD_Data (K: device_id, V: OBD record) → enrich with User_Info [address] → enrich with App_State [offers already sent] → (K: device_id, V: OBD record + address + offers_sent)]
Multi-stage Data Pipeline (continued)
[Diagram: repartition by location_id: records (K: device_id, V: OBD record + address + offers_sent) are re-keyed to K: location_id and written to the OBD_Data_By_Location topic (partitions P1, P2, P3, …)]
Data from a given device will still all be on the same partition (except when the region changes)
Multi-stage Data Pipeline (continued)
[Diagram: re-partitioned stream (K: location_id, V: OBD record + address + offers_sent) → enrich with Special_Offers → (K: location_id, V: OBD record + address + offers_sent + available_offers)]
Multi-stage Data Pipeline (continued)
[Diagram: filter: special offer available in location → filter: special offer not already sent → filter: user address near location? → publish to [deviceId]/alert via MQTT Server]
Discount code: kafcom17
Use the Apache Kafka community discount code to get $50 off
www.kafka-summit.org
Kafka Summit New York: May 8
Kafka Summit San Francisco: August 28