Synchronized video and slides, plus mp3 and slide downloads, are available at http://bit.ly/2gron5O.
Neha Narkhede talks about LinkedIn's experience moving from batch-oriented ETL to real-time streams using Apache Kafka, and how the design and implementation of Kafka were driven by the goal of acting as a real-time platform for event data. She covers some of the challenges of scaling Kafka to hundreds of billions of events per day at LinkedIn while supporting thousands of engineers. Filmed at qconsf.com.
Neha Narkhede is co-founder and CTO at Confluent, a company backing the popular Apache Kafka messaging system. Prior to founding Confluent, Neha led streams infrastructure at LinkedIn, where she was responsible for LinkedIn’s streaming infrastructure built on top of Apache Kafka and Apache Samza. She is one of the initial authors of Apache Kafka and a committer and PMC member on the project.
Watch the video with slide synchronization on InfoQ.com!
https://www.infoq.com/presentations/etl-streams
Presented at QCon San Francisco
www.qconsf.com
4. “Data and data systems have really changed in the past decade”
5. Old world: Two popular locations for data
- Operational databases (DB)
- Relational data warehouse (DWH)
6. “Several recent data trends are driving a dramatic change in the ETL architecture”
7. “#1: Single-server databases are replaced by a myriad of distributed data platforms that operate at company-wide scale”
8. “#2: Many more types of data sources beyond transactional data - logs, sensors, metrics...”
9. “#3: Stream data is increasingly ubiquitous; need for faster processing than daily”
10. “The end result? This is what data integration ends up looking like in practice”
30. New world: streaming, real-time and scalable
- EAI: real-time BUT not scalable
- ETL: scalable BUT batch
- Streaming Platform: real-time AND scalable
40. “To enable forward compatibility, redefine the T in ETL: Clean data in; Clean data out”
41. app logs to DWH
#1: Extract as unstructured text
#2: Transform1 = data cleansing = “what is a product view”
#3: Load into DWH
#4: Transform2 = drop PII fields
42. app logs to DWH and Cassandra
DWH
#1: Extract as unstructured text
#2: Transform1 = data cleansing = “what is a product view”
#3: Load cleansed data
#4: Transform2 = drop PII fields
Cassandra
#1: Extract as unstructured text again
#2: Transform1 = data cleansing = “what is a product view”
#3: Load cleansed data
#4: Transform2 = drop PII fields
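The “clean data in; clean data out” idea from slide 40 can be sketched roughly as follows: each raw log line is parsed and cleansed once, PII is dropped once, and the same cleansed record is handed to every destination instead of re-extracting unstructured text per sink. The `key=value` log format, the field names (`user_email`, `ip`, `action`, `product_id`), and the in-memory lists standing in for the DWH and Cassandra are assumptions made for illustration; none of them come from the talk.

```python
import json

# Hypothetical set of PII field names; a real pipeline would define its own.
PII_FIELDS = {"user_email", "ip"}


def parse_log_line(line):
    """Transform1: turn a raw 'key=value key=value' log line into a dict.

    The 'key=value' format is an assumption made for this sketch.
    """
    event = {}
    for token in line.split():
        if "=" in token:
            key, value = token.split("=", 1)
            event[key] = value
    return event


def is_product_view(event):
    """Data cleansing rule: decide 'what is a product view' in one place."""
    return event.get("action") == "view" and "product_id" in event


def drop_pii(event):
    """Transform2: remove PII fields before the data leaves the pipeline."""
    return {k: v for k, v in event.items() if k not in PII_FIELDS}


def clean_once_fan_out(lines, sinks):
    """Cleanse each line once and deliver the same record to every sink."""
    for line in lines:
        event = parse_log_line(line)
        if not is_product_view(event):
            continue
        cleaned = drop_pii(event)
        for sink in sinks:
            sink.append(cleaned)


raw_logs = [
    "action=view product_id=42 user_email=a@example.com ip=10.0.0.1",
    "action=login user_email=b@example.com",
]
dwh, cassandra = [], []  # in-memory stand-ins for the two destinations
clean_once_fan_out(raw_logs, [dwh, cassandra])
print(json.dumps(dwh))
```

Because the cleansing rules live in one place, adding a third destination only means adding another sink to the list; the definition of a product view and the PII policy do not need to be reimplemented.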
57. a short history of data integration
drawbacks of ETL
needs and requirements for a streaming platform
new, shiny future of ETL: a streaming platform
What does a streaming platform look like and how does it enable streaming ETL?
78. Kafka’s Connect API = Connectors Made Easy!
- Scalability: Leverages Kafka for scalability
- Fault tolerance: Builds on Kafka’s fault tolerance model
- Management and monitoring: One way of monitoring all connectors
- Schemas: Offers an option for preserving schemas from source to sink
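As a rough illustration of what a connector definition looks like, the sketch below builds the JSON payload that Kafka Connect's REST API accepts when a connector is created (`POST /connectors`). It uses the FileStream source connector from the Apache Kafka project as an example. The connector name, file path, and topic name are placeholders chosen for this sketch, and nothing is actually sent to a cluster.

```python
import json


def file_source_connector(name, path, topic, max_tasks=1):
    """Build a Kafka Connect connector definition as a plain dict.

    Config values are passed as strings, so max_tasks is converted.
    """
    return {
        "name": name,
        "config": {
            "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
            "tasks.max": str(max_tasks),
            "file": path,
            "topic": topic,
        },
    }


# Placeholder values, chosen only for illustration.
payload = file_source_connector("app-logs-source", "/var/log/app.log", "app-logs")
print(json.dumps(payload, indent=2))
```

The point of the sketch is that a connector is declared as configuration rather than written as custom extract code; scalability and fault tolerance are then left to the Connect framework, as the slide describes.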
83. 2 visions for stream processing
Real-time MapReduce VS Event-driven microservices
Real-time MapReduce
- Central cluster
- Custom packaging, deployment & monitoring
- Suitable for analytics-type use cases
Event-driven microservices
- Embedded library in any Java app
- Just Kafka and your app
- Makes stream processing accessible to any use case
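To make the “embedded library” vision a little more concrete, here is a minimal sketch of a stateful processor that runs inside an ordinary application process rather than on a central cluster. It is written in plain Python and is not the Kafka Streams API (which is a Java library); the `ViewCounter` class, the `run` helper, and the in-memory list standing in for a Kafka topic are all hypothetical names introduced for illustration.

```python
from collections import defaultdict


class ViewCounter:
    """A tiny stateful processor that lives inside the application process."""

    def __init__(self):
        self.counts = defaultdict(int)  # local state, keyed by product id

    def process(self, record):
        """Handle one (key, value) record and return the updated count."""
        key, _value = record
        self.counts[key] += 1
        return key, self.counts[key]


def run(records, processor):
    """Feed records to the processor one at a time, as a consumer loop would."""
    return [processor.process(record) for record in records]


# An in-memory list stands in for a Kafka topic of (product_id, event) records.
topic = [("p1", "view"), ("p2", "view"), ("p1", "view")]
updates = run(topic, ViewCounter())
print(updates)
```

In this model the processing logic is packaged, deployed, and monitored like any other application code, which is the contrast the slide draws with the central-cluster approach.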
104. VISION: All your data … everywhere … now
[Diagram: a streaming platform at the center, connected to apps, the DWH and Hadoop. Labels: request-response; messaging OR stream processing; streaming data pipelines; changelogs.]