When Streams Fail: From Flash Floods to Arroyos

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2vn9JFy.

Anton Gorshkov focuses on "What If" scenarios and on how to evaluate and architect a streaming platform that has a high level of resilience. He looks at Kafka and Spark Streaming as specific examples, shares his experience of using these frameworks to process financial transactions, and shows examples of tools his team built along their streaming journey. Filmed at qconnewyork.com.

Anton Gorshkov is a Managing Director at Goldman Sachs Asset Management where he runs a global Core Platform team, focusing on GSAM’s data strategy and real-time services.

Published in: Technology

  1. KAFKA@GS
     When Streams Fail
     Kafka Off the Shore
     What If?
     © 2017 Goldman Sachs. This presentation should not be relied upon or considered investment advice. Goldman Sachs does not warrant or guarantee to anyone the accuracy, completeness or efficacy of this presentation, and recipients should not rely on it except at their own risk. This presentation may not be forwarded or disclosed except with this disclaimer intact.
     These materials (“Materials”) are confidential and for discussion purposes only. The Materials are based on information that we consider reliable, but Goldman Sachs does not represent that it is accurate, complete and/or up to date, and it should not be relied on as such. The Materials do not constitute advice nor is Goldman Sachs recommending any action based upon them. Opinions expressed may not be those of Goldman Sachs unless otherwise expressly noted. As a condition to Goldman Sachs presenting the Materials to you, you agree to treat the Materials in a confidential manner and not disclose the contents thereof without the permission of Goldman Sachs. © Copyright 2017 The Goldman Sachs Group, Inc. All rights reserved.
  2. InfoQ.com: News & Community Site
     • 750,000 unique visitors/month
     • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese)
     • Post content from our QCon conferences
     • News 15-20 / week
     • Articles 3-4 / week
     • Presentations (videos) 12-15 / week
     • Interviews 2-3 / week
     • Books 1 / month
     Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/streaming-kafka-spark
  3. Presented at QCon New York (www.qconnewyork.com)
     Purpose of QCon
     - to empower software development by facilitating the spread of knowledge and innovation
     Strategy
     - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams
     - speakers and topics driving the evolution and innovation
     - connecting and catalyzing the influencers and innovators
     Highlights
     - attended by more than 12,000 delegates since 2007
     - held in 9 cities worldwide
  4. ABOUT US
     • Investment Management Division
     • $1.4 Trillion Assets Under Management
     • Core Front-Office Platform Team
     • Interface with 15+ teams
     • Kafka, Event Topology, Data Fabric, Akka, etc.
     • Sounds fun? Talk to me!
     • What if…
  5. What If? This presentation is boring
     • Topology Strategy, Context & Deployment Goals
     • What If? Hosts / Network / DC Failures
     • Deployment & Monitoring Strategy
     • What If? Framework & Software Failure
  6. KAFKA DEPLOYMENT USAGE PATTERNS
     • Most used clusters serve ~1.5Tb a week to consumers
     • However, message count is relatively low: order of millions per week; avg. several hundred a second
     • At peak periods:
       • ~1,500 messages produced/second
       • ~2.5Mb produced/second
       • ~12.5Mb consumed/second
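
The peak figures on this slide imply two derived numbers that help with sizing: the average message size and how many times each produced byte is read. A back-of-the-envelope sketch, assuming "Mb" on the slide means megabytes and that 1 MB is 1,000 KB (the function names are illustrative, not from the talk):

```python
# Peak figures quoted on the slide. "Mb" is read here as megabytes (an assumption).
PEAK_MESSAGES_PER_SEC = 1_500
PEAK_PRODUCED_MB_PER_SEC = 2.5
PEAK_CONSUMED_MB_PER_SEC = 12.5


def average_message_size_kb(produced_mb_per_sec, messages_per_sec):
    """Average payload per message in KB, taking 1 MB as 1,000 KB."""
    return produced_mb_per_sec * 1_000 / messages_per_sec


def consumer_fan_out(consumed_mb_per_sec, produced_mb_per_sec):
    """How many times, on average, each produced byte is consumed."""
    return consumed_mb_per_sec / produced_mb_per_sec


avg_kb = average_message_size_kb(PEAK_PRODUCED_MB_PER_SEC, PEAK_MESSAGES_PER_SEC)
fan_out = consumer_fan_out(PEAK_CONSUMED_MB_PER_SEC, PEAK_PRODUCED_MB_PER_SEC)

print(f"average message size at peak: ~{avg_kb:.1f} KB")
print(f"consumer fan-out at peak: ~{fan_out:.0f}x")
```

Under these assumptions a peak-time message is a little under 2 KB and is read roughly five times, which fits the picture of a modest message rate with many downstream consumers.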
  7. DEPLOYMENT GOALS
     • No data-loss even in case of DC outage
     • No Primary/Back-Up notion
     • No “failover” scenarios
     • Minimize Outage time
  8. DATALOSS 101
     • Tape Backup
     • Nightly Batch Replication
     • Async Replication
     • Sync Replication (ex: SRDF)
  9. Speed of Light: Network Roundtrips
     • NYC to SF: ~60ms
     • Virginia to Ohio: ~12ms
     • NYC to NJ: ~4ms
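
These roundtrip times matter because a synchronously replicated write cannot be acknowledged faster than one roundtrip to the remote site. A rough sketch of the resulting ceiling, assuming a single sender that waits for each acknowledgement before sending the next message (real Kafka producers batch and pipeline requests, so this is a worst case, not a benchmark):

```python
# Roundtrip times as quoted on the slide, in milliseconds.
ROUND_TRIP_MS = {
    "NYC to SF": 60.0,
    "Virginia to Ohio": 12.0,
    "NYC to NJ": 4.0,
}


def max_sequential_acked_writes_per_sec(round_trip_ms):
    """Upper bound on writes/second when each write waits one full roundtrip.

    Assumes strictly sequential sends with no batching or pipelining.
    """
    return 1000.0 / round_trip_ms


ceilings = {
    route: max_sequential_acked_writes_per_sec(rtt)
    for route, rtt in ROUND_TRIP_MS.items()
}

for route, ceiling in ceilings.items():
    print(f"{route}: at most ~{ceiling:.0f} sequential acked writes/second")
```

The gap between roughly 17 and 250 sequential writes per second is one way to see why nearby datacenters make synchronous replication practical.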
  10. What If? Datacenters were near…
      Treat multi-DC environment as a single redundant data-center where half could go down
  11. DEPLOYMENT STRATEGY
      [Diagram: physical cluster of virtual machines, each running a Broker and a ZK node, split between Datacenter A and Datacenter B; conceptual cluster showing the same Brokers and ZK nodes as one cluster]
  12. Partition assignment EXAMPLE
      Single Topic Setup
      • 1 topic
      • 3 partitions (p1-p3)
      • Replication factor of 4
      • Ensure even replicas between DCs
      • min.insync.replicas of 3
      • Cross-DC latency is low!
      [Diagram: hosts H1, H3, H5 in Datacenter A and H2, H4, H6 in Datacenter B, each holding replicas of partitions p1-p3]
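
The single-topic setup above can be written down as an explicit replica assignment. The sketch below places two replicas of every partition in each datacenter and emits the plan in the JSON layout accepted by the `kafka-reassign-partitions` tool. The topic name `orders`, the broker ids 1-6 standing in for hosts H1-H6, and the helper names are all assumptions for illustration; the talk does not show the actual assignment used.

```python
import json

# Assumed broker ids: odd ids in Datacenter A (H1, H3, H5), even in B (H2, H4, H6).
DC_A_BROKERS = [1, 3, 5]
DC_B_BROKERS = [2, 4, 6]


def assign_replicas(dc_a, dc_b, num_partitions, replicas_per_dc):
    """Give every partition `replicas_per_dc` replicas in each datacenter.

    Brokers are picked round-robin so replicas spread evenly over hosts.
    Kafka numbers partitions from 0, so the slide's p1-p3 become 0-2 here.
    """
    if replicas_per_dc > min(len(dc_a), len(dc_b)):
        raise ValueError("not enough brokers per datacenter for distinct replicas")
    assignment = {}
    for partition in range(num_partitions):
        replicas = []
        for i in range(replicas_per_dc):
            replicas.append(dc_a[(partition + i) % len(dc_a)])
            replicas.append(dc_b[(partition + i) % len(dc_b)])
        assignment[partition] = replicas
    return assignment


def to_reassignment_plan(topic, assignment):
    """Build the dict layout used by kafka-reassign-partitions JSON files."""
    return {
        "version": 1,
        "partitions": [
            {"topic": topic, "partition": partition, "replicas": replicas}
            for partition, replicas in sorted(assignment.items())
        ],
    }


# Replication factor 4 = 2 replicas per datacenter.
assignment = assign_replicas(DC_A_BROKERS, DC_B_BROKERS, num_partitions=3, replicas_per_dc=2)
plan_json = json.dumps(to_reassignment_plan("orders", assignment), indent=2)
print(plan_json)
```

Note that `min.insync.replicas` is only enforced for producers that send with `acks=all`, so the no-data-loss goal depends on producer configuration as well as on this layout.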
  13. What If? A host fails
      • 1-5 times / year
      • No impact to producers/consumers (still able to satisfy 3 ISR)
      • No manual recovery beyond replacing host
      [Diagram: partition replicas p1-p3 on hosts H1, H3, H5 (Datacenter A) and H2, H4, H6 (Datacenter B)]
  14. What If? Two hosts fail
      • 1/year, depending on where the hosts are (e.g. bad hypervisor)
      • Processing for some topics will be halted
      • Short-term: Add replicas for affected partitions on remaining hosts
      • ASAP: Replace bad hosts
      • GS Dynamic Compute allows seamless VM replacement with no need to re-point DNS aliases or change Kafka config
      [Diagram: partition replicas p1-p3 on hosts H1, H3, H5 (Datacenter A) and H2, H4, H6 (Datacenter B)]
  15. What If? Two hosts fail
      [Diagram: the same two-datacenter layout after the failure, with additional partition replicas shown on the remaining hosts]
  16. What If? Three hosts fail
      • 1 / few years
      • Cluster processing halted as it cannot satisfy in-sync replica requirements
      • Proceed with immediate host replacement
      [Diagram: partition replicas p1-p3 on hosts H1, H3, H5 (Datacenter A) and H2, H4, H6 (Datacenter B)]
  17. What If? Datacenter Failed / Network Partition
      • Once-in-20-years event
      • Short-term strategy: add additional machines in Datacenter A
      • Largest impact on recovery time is how long it takes to get new hosts provisioned
      [Diagram: partition replicas p1-p3 on hosts H1, H3, H5 (Datacenter A) and H2, H4, H6 (Datacenter B)]
  18. What If? Datacenter Failed / Network Partition: after adding additional hosts
      • Short-term strategy: add additional machines in Datacenter A
      • Largest impact on recovery time is how long it takes to get new hosts provisioned
      • Stand-by Hosts?
      [Diagram: the same layout with new hosts H7, H8 and H9 added, each holding replicas of partitions p1-p3]
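
The failure scenarios above all come down to one check: does each partition still have at least `min.insync.replicas` live replicas? A small simulation of that check follows. The replica layout is a hypothetical one that matches the stated parameters (3 partitions, replication factor 4, two replicas per datacenter); the slides' exact layout is not recoverable from the transcript, and the sketch assumes every live replica is also in sync.

```python
MIN_INSYNC_REPLICAS = 3

# Hypothetical layout: which datacenter each host is in, and where replicas live.
DATACENTER = {"H1": "A", "H3": "A", "H5": "A", "H2": "B", "H4": "B", "H6": "B"}
REPLICAS = {
    "p1": ["H1", "H2", "H3", "H4"],
    "p2": ["H3", "H4", "H5", "H6"],
    "p3": ["H5", "H6", "H1", "H2"],
}


def writable_partitions(failed_hosts):
    """Partitions that can still accept acks=all writes after the given failures.

    Assumes every surviving replica is in sync.
    """
    failed = set(failed_hosts)
    return sorted(
        partition
        for partition, hosts in REPLICAS.items()
        if len(set(hosts) - failed) >= MIN_INSYNC_REPLICAS
    )


scenarios = {
    "one host (H1)": ["H1"],
    "two hosts (H1, H2)": ["H1", "H2"],
    "three hosts (H1, H4, H5)": ["H1", "H4", "H5"],
    "all of Datacenter B": [h for h, dc in DATACENTER.items() if dc == "B"],
}

for name, failed in scenarios.items():
    ok = writable_partitions(failed)
    print(f"{name}: writable partitions = {ok or 'none'}")
```

With this layout, one failed host leaves all partitions writable, two failed hosts halt some partitions, and both the three-host case shown and a full datacenter loss halt everything until replicas are added. Which partitions stop in the two- and three-host cases depends on which hosts fail.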
  19. TOOLING OVERVIEW
      [Diagram components: Kafka Broker, Zookeeper, REST Service, Metrics Capture, Time-Series DB, GS CENTRALIZED ALERTING INFRASTRUCTURE, Phone/e-mail/IM alerts to team]
      Images are shown for illustrative purposes only and should not be relied upon as representative of actual or future information for Goldman Sachs products and services.
  20. TOOLING CLUSTER DASHBOARD
      Endpoints include:
      • View messages on topic
      • Consumer lag
      • Leader & ISRs for topic
      • High watermark for topic
      • Run-time Broker & Zookeeper configuration
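
Of the dashboard endpoints listed, consumer lag is the one that is derived rather than read directly: it is the distance between a partition's high watermark and the offset the consumer group has committed. A minimal sketch of that calculation, with hypothetical names and the assumption that a partition with no committed offset counts as fully unread:

```python
def consumer_lag(high_watermarks, committed_offsets):
    """Per-partition lag: messages written but not yet committed by the group.

    A partition missing from `committed_offsets` is treated as unread from
    offset 0 (an assumption; a real group may start from the latest offset).
    """
    return {
        partition: max(high_watermark - committed_offsets.get(partition, 0), 0)
        for partition, high_watermark in high_watermarks.items()
    }


# Illustrative offsets for a three-partition topic.
high_watermarks = {0: 1200, 1: 950, 2: 400}
committed_offsets = {0: 1200, 1: 900}

lag = consumer_lag(high_watermarks, committed_offsets)
print(lag)
print("total lag:", sum(lag.values()))
```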
  21. TOOLING HEALTHCHECK
      • Topic sizes monitoring
      • Alert when at risk of losing data due to truncation
      • If partitions breach threshold, escalate via GS alerting infrastructure
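
The healthcheck described here compares partition sizes against a limit and escalates when a partition gets close to it. A sketch of the comparison step, assuming size-based retention via the topic's `retention.bytes` setting (which applies per partition) and a hypothetical 80% warning threshold; how sizes are collected and how alerts are escalated is specific to the GS tooling and is not shown in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionSize:
    topic: str
    partition: int
    size_bytes: int


def partitions_at_risk(sizes, retention_bytes, threshold=0.8):
    """Return (topic, partition, fill_ratio) for partitions near truncation.

    `threshold` is a hypothetical warning level: 0.8 means alert once a
    partition holds 80% of the bytes its retention limit allows.
    """
    at_risk = []
    for size in sizes:
        fill_ratio = size.size_bytes / retention_bytes
        if fill_ratio >= threshold:
            at_risk.append((size.topic, size.partition, round(fill_ratio, 2)))
    return at_risk


# Illustrative measurements against a 1 GiB per-partition retention limit.
RETENTION_BYTES = 1_073_741_824
sizes = [
    PartitionSize("orders", 0, 950_000_000),
    PartitionSize("orders", 1, 400_000_000),
    PartitionSize("trades", 0, 1_000_000_000),
]

alerts = partitions_at_risk(sizes, RETENTION_BYTES)
for topic, partition, ratio in alerts:
    print(f"ALERT {topic}-{partition}: {ratio:.0%} of retention limit used")
```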
  22. SAMPLE DEPLOYMENT
      [Diagram components: Order Producers, VertX/REST APIs, Kafka, Direct Sink, Spark Streaming, Batch ETL, In-Memory RDBMS, Data Lake]
  23. What If? I need to Replay
      • Message Remains on Queue?
      • Delivery Semantics
        • == 1 (exactly once)
        • <= 1 (at most once)
        • >= 1 (at least once)
      • Unique ID on everything
      • Making Replay Easy
      • KIP-98
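
"Unique ID on everything" is what makes replay safe under at-least-once delivery: a consumer that remembers which IDs it has applied can be fed the same messages again without double-processing. A minimal sketch of that idea follows. The class and message IDs are hypothetical, and the seen-ID set is kept in memory only; a production consumer would need to persist it together with the side effect it guards.

```python
class IdempotentHandler:
    """Applies each message at most once, keyed by its unique ID."""

    def __init__(self, apply):
        self._apply = apply
        self._seen_ids = set()  # in-memory only; lost on restart

    def handle(self, message_id, payload):
        """Apply the payload unless this ID was already processed.

        Returns True if the message was applied, False if it was a duplicate.
        """
        if message_id in self._seen_ids:
            return False
        self._apply(payload)
        self._seen_ids.add(message_id)
        return True


applied = []
handler = IdempotentHandler(applied.append)

first_delivery = [("order-1", "BUY 100"), ("order-2", "SELL 50")]
# A replay re-sends everything from the start, plus one message not seen before.
replay = first_delivery + [("order-3", "BUY 10")]

results = [handler.handle(mid, body) for mid, body in first_delivery + replay]
print(results)
print(applied)
```

KIP-98, referenced on the slide, is the Kafka proposal for exactly-once delivery and transactional messaging, which moves part of this deduplication into the broker and producer.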
  24. What If? Spark Streaming Fails
  25. SUMMARY
      • Failure will occur
      • Belt & Suspenders for everything
      • Ultimate Throughput vs Ultimate Reliability
      • Kafka has many knobs, perhaps too many; hide some
      • Resilience through Transparency
  26. Q&A text time…