Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2vn9JFy.
Anton Gorshkov focuses on "What Ifs" scenarios and how to evaluate and architect a streaming platform that has high level of resilience. He looks at Kafka and Spark Streaming as specific examples and shares his experience of using these frameworks to process financial transactions and shows examples of tools that they built along their streaming journey. Filmed at qconnewyork.com.
Anton Gorshkov is a Managing Director at Goldman Sachs Asset Management where he runs a global Core Platform team, focusing on GSAM’s data strategy and real-time services.
2. InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
streaming-kafka-spark
3. Presented at QCon New York
www.qconnewyork.com
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
4. KAFKA@GS
ABOUT US
• Investment Management Division
• $1.4 Trillion Assets Under Management
• Core Front-Office Platform Team
• Interface with 15+ teams
• Kafka, Event Topology, Data Fabric, Akka, etc…
• Sounds fun? Talk to me!
• What if…
5. KAFKA@GS
What If? This presentation is boring
• Topology Strategy, Context & Deployment Goals
• What If? Hosts / Network / DC Failures
• Deployment & Monitoring Strategy
• What If? Framework & Software Failure
6. KAFKA@GS
KAFKA DEPLOYMENT USAGE PATTERNS
• Most used clusters serve ~1.5Tb a week to
consumers
• However message count relatively low – order of
millions per week; avg. several hundred a second
• At peak periods
• ~1,500 messages produced/second
• ~2.5Mb produced/second
• ~12.5Mb consumed/second
7. KAFKA@GS
DEPLOYMENT GOALS
• No data-loss even in case of DC outage
• No Primary/Back-Up notion
• No “failover” scenarios
• Minimize Outage time
12. KAFKA@GS
H1
Datacenter A
Partition assignment
EXAMPLE Single Topic Setup
• 1 topic
• 3 partitions (p1-p3)
• Replication factor of 4
• Ensure even replicas between DCs
• Min.Insync.Replicas 3
• Cross-DC latency is low!
p1
H3
p1
p1r1
H5
p2
p3
p3
Datacenter B
p2
H2
p1
H4
p1
p1r1
H6
p2
p3
p3
p2
13. KAFKA@GS
H1
Datacenter A
What If? A host fails
• 1-5 times / year
• No impact to producers/consumers
(still able to satisfy 3 ISR)
• No manual recovery beyond
replacing host
p1
H3
p1
p1r1
H5
p2
p3
p3
Datacenter B
p2
H2
p1
H4
p1
p1r1
H6
p2
p3
p3
p2
14. KAFKA@GS
H1
Datacenter A
What If? Two hosts fail
• 1/year depending where hosts are
(e.g. bad hypervisor)
• Processing for some topics will be
halted
• Short-term: Add replicas for
affected partitions on remaining
hosts
• ASAP: Replace bad hosts
• GS Dynamic Compute allows
seamless VM replace with no need
to re-point dns aliases/change
kafka config
p1
H3
p1
p1r1
H5
p2
p3
p3
Datacenter B
p2
H2
p1
H4
p1
p1r1
H6
p2
p3
p3
p2
15. KAFKA@GS
H1
Datacenter A
What If? Two hosts fail
• 1/year depending where hosts are
(e.g. bad hypervisor)
• Processing for some topics will be
halted
• Short-term: Add replicas for
affected partitions on remaining
hosts
• ASAP: Replace bad hosts
• GS Dynamic Compute allows
seamless VM replace with no need
to re-point dns aliases/change
kafka config
H3
p1r1
H5
p2 p3
Datacenter B
H2
p1
H4
p1
p1r1
H6
p3
p2
p2 p3
p1
pr2 p3
p1
p2 p3
p1
p1
16. KAFKA@GS
H1
Datacenter A
What If? Three hosts fail
• 1 / few years
• Cluster processing halted as cannot
satisfy in-sync replica requirements
• Proceed with immediate host
replacement
p1
H3
p1
p1r1
H5
p2
p3
p3
Datacenter B
p2
H2
p1
H4
p1
p1r1
H6
p2
p3
p3
p2
17. KAFKA@GS
H1
Datacenter A
What If? Datacenter Failed / Network Partition
• Once a 20-year event
• Short-term strategy: add additional
machines in Datacenter A
• Largest impact on recovery time is
how long to get new hosts
provisioned
p1
H3
p1
p1r1
H5
p2
p3
p3
Datacenter B
p2
H2
p1
H4
p1
p1r1
H6
p2
p3
p3
p2
18. KAFKA@GS
H1
Datacenter A
After adding additional hosts
What If? Datacenter Failed / Network Partition
p1
H3
p1
p1r1
H5
p2
p3
p3
Datacenter B
p2
H2
p1
H4
p1
p1r1
H6
p2
p3
p3
p2
H7
p1
H8
p1
p1r1
H9
p2
p3
p3
p2
• Short-term
strategy: add
additional
machines in
Datacenter A
• Largest impact on
recovery time is
how long to get
new hosts
provisioned
• Stand-by Hosts?
20. KAFKA@GS
TOOLING CLUSTER DASHBOARD
Endpoints include:
• View messages on
topic
• Consumer lag
• Leader & ISRs for
topic
• Highwatermark for
topic
• Run-Time Broker &
zookeeper
configuration
21. KAFKA@GS
TOOLING HEALTHCHECK
• Topic sizes monitoring
• Alert when at risk of losing data due to truncation
• If partitions breach threshold, escalate via GS alerting
infrastructure
23. KAFKA@GS
What If? I need to Replay
• Message Remains on Queue?
• Delivery Semantics
• == 1
• =< 1
• >=1
• Unique ID on everything
• Making Replay Easy
• KIP-98
25. KAFKA@GS
SUMMARY
• Failure will occur
• Belt & Suspenders for everything
• Ultimate Throughput vs Ultimate Reliability
• Kafka has many Knobs, perhaps too many, hide
some
• Resilience through Transparency