Neha Narkhede talks about LinkedIn's experience moving from batch-oriented ETL to real-time streams using Apache Kafka, and how the design and implementation of Kafka were driven by the goal of acting as a real-time platform for event data. She covers some of the challenges of scaling Kafka to hundreds of billions of events per day at LinkedIn while supporting thousands of engineers.
28. new world: streaming, real-time and scalable
[Diagram: approaches placed on two axes, real-time vs. batch and scale]
- EAI: real-time BUT not scalable
- ETL: scalable BUT batch
- Streaming Platform: real-time AND scalable
38. “To enable forward compatibility, redefine the T in ETL: Clean data in; Clean data out”
39. [Diagram: four sources of app logs feeding one ETL pipeline into the DWH]
#1: Extract as unstructured text
#2: Transform1 = data cleansing = “what is a product view”
#3: Load into DWH
#4: Transform2 = drop PII fields
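The four steps on this slide can be sketched as a small batch job. This is a minimal illustration, not code from the talk: the log format, the `LOG_PATTERN` regex, the `PII_FIELDS` set, and all function names are assumptions made up for the example, and a plain list stands in for the DWH.

```python
import re

# Hypothetical log format; the slides do not specify one.
LOG_PATTERN = re.compile(
    r"(?P<ts>\S+) user=(?P<user_id>\S+) email=(?P<email>\S+) "
    r"GET /product/(?P<product_id>\d+)"
)

PII_FIELDS = {"email"}  # assumed PII fields, for illustration only


def extract(raw_lines):
    """#1: Extract as unstructured text (here: strip and skip blank lines)."""
    return [line.strip() for line in raw_lines if line.strip()]


def cleanse(line):
    """#2: Transform1, decide what counts as a product view."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None  # not a product view
    event = match.groupdict()
    event["product_id"] = int(event["product_id"])
    return event


def drop_pii(event):
    """#4: Transform2, drop PII fields."""
    return {k: v for k, v in event.items() if k not in PII_FIELDS}


def run_pipeline(raw_lines):
    warehouse = []  # stand-in for the DWH
    for line in extract(raw_lines):
        event = cleanse(line)
        if event is not None:
            warehouse.append(event)  # #3: Load into DWH
    # #4 runs after the load, in the order numbered on the slide
    return [drop_pii(e) for e in warehouse]


logs = [
    "2017-05-01T10:00:00Z user=42 email=a@example.com GET /product/7",
    "2017-05-01T10:00:01Z user=43 email=b@example.com GET /cart",
]
views = run_pipeline(logs)
```

Only the first log line matches the assumed product-view pattern, so `views` holds a single event with the `email` field removed.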
40. [Diagram: the same ETL pipeline repeated once per destination]
DWH:
#1: Extract as unstructured text
#2: Transform1 = data cleansing = “what is a product view”
#3: Load cleansed data
#4: Transform2 = drop PII fields
Cassandra:
#1: Extract as unstructured text again
#2: Transform1 = data cleansing = “what is a product view”
#3: Load cleansed data
#4: Transform2 = drop PII fields
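To make the duplication on this slide concrete, the sketch below counts how often the cleansing step runs when every destination re-extracts the raw text, compared with cleansing once and loading the same clean events everywhere ("clean data in; clean data out"). The `cleanse` rule, the input lines, and the function names are hypothetical, and dictionaries of lists stand in for the DWH and Cassandra.

```python
def cleanse(line):
    """Hypothetical cleansing rule: keep only product-view lines."""
    parts = line.split()
    if len(parts) == 2 and parts[0] == "view":
        return {"type": "product_view", "product_id": int(parts[1])}
    return None


def per_sink_etl(raw_lines, sink_names):
    """Slide 40 layout: every sink re-extracts and re-cleanses the raw text."""
    sinks, cleanse_calls = {}, 0
    for name in sink_names:
        loaded = []
        for line in raw_lines:       # extract again for this sink
            cleanse_calls += 1
            event = cleanse(line)    # same transform, duplicated per sink
            if event is not None:
                loaded.append(event)
        sinks[name] = loaded
    return sinks, cleanse_calls


def shared_clean_stream(raw_lines, sink_names):
    """Alternative: cleanse once, then load the same clean events everywhere."""
    cleanse_calls = 0
    clean = []
    for line in raw_lines:
        cleanse_calls += 1
        event = cleanse(line)
        if event is not None:
            clean.append(event)
    sinks = {name: list(clean) for name in sink_names}
    return sinks, cleanse_calls


raw = ["view 7", "click 3", "view 9"]
dup_sinks, dup_calls = per_sink_etl(raw, ["DWH", "Cassandra"])
shared_sinks, shared_calls = shared_clean_stream(raw, ["DWH", "Cassandra"])
```

Both layouts load identical data, but the per-sink version runs the cleansing logic once per line per destination, so the cost and the risk of the copies drifting apart grow with every new sink.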
53. a short history of data integration
drawbacks of ETL
needs and requirements for a streaming platform
new, shiny future of ETL: a streaming platform
What does a streaming platform look like, and how does it enable streaming ETL?
71. Kafka’s Connect API = Connectors Made Easy!
- Scalability: Leverages Kafka for scalability
- Fault tolerance: Builds on Kafka’s fault tolerance model
- Management and monitoring: One way of monitoring all connectors
- Schemas: Offers an option for preserving schemas from source to sink
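As a rough illustration of what "connectors made easy" means in practice, a connector is typically described by configuration rather than custom code. The sketch below builds the JSON body that could be submitted to a Kafka Connect worker's REST API (`POST /connectors`) for the file source connector that ships with Kafka. The connector name, file path, and topic are made-up values, the helper function is hypothetical, and nothing is sent over the network here.

```python
import json


def file_source_connector(name, path, topic, tasks_max=1):
    """Build a Connect REST payload for the bundled FileStreamSource connector.

    The name, path, and topic passed in are illustrative assumptions.
    """
    return {
        "name": name,
        "config": {
            "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
            "tasks.max": str(tasks_max),  # upper bound on parallel tasks
            "file": path,                 # file to tail
            "topic": topic,               # Kafka topic to write lines to
        },
    }


payload = file_source_connector("app-logs-source", "/var/log/app.log", "app-logs")
body = json.dumps(payload, indent=2)
```

Because every connector is declared in this same shape, scaling, fault tolerance, and monitoring can be handled by the Connect framework instead of being rebuilt for each integration.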
91. All your data … everywhere … now
[Diagram: a central streaming platform surrounded by apps, a DWH, and Hadoop. Labels: request-response; messaging OR stream processing; streaming data pipelines; changelogs]
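One way to picture a single platform serving many apps, the DWH, and Hadoop at once is an append-only log that each consumer reads at its own position. The toy class below is a teaching stand-in under that assumption, not Kafka's API: `ToyLog`, its methods, and the consumer names are all invented for the example.

```python
class ToyLog:
    """In-memory append-only log; a teaching stand-in, not Kafka's API."""

    def __init__(self):
        self._records = []
        self._offsets = {}  # consumer name -> next offset to read

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def poll(self, consumer, max_records=10):
        """Return the next batch for this consumer and advance its offset."""
        start = self._offsets.get(consumer, 0)
        batch = self._records[start:start + max_records]
        self._offsets[consumer] = start + len(batch)
        return batch


log = ToyLog()
for event in ["signup:alice", "view:7", "view:9"]:
    log.append(event)

dwh_batch = log.poll("dwh")                       # reads everything so far
app_first = log.poll("fraud-app", max_records=1)  # reads at its own pace
app_rest = log.poll("fraud-app")
```

Each consumer sees the full stream in order, and a slow consumer does not hold back a fast one, which is what lets the same data flow to batch systems and real-time apps alike.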
92. VISION: All your data … everywhere … now
93. Kafka Summit SF – August 28
Conference Discount Code: kafcom17
($50 off conference pass)
www.kafka-summit.org