
Distributed Database Design Decisions to Support High Performance Event Streaming - Pulsar Summit SF 2022




Event streaming architectures launched a reexamination of applications and systems architectures across the board. We live in a world where answers are needed now, in a constant real-time flow. Yet beyond the event streaming system itself, what are the corequisites to ensure our large-scale distributed database systems can keep pace with this always-on, always-current, real-time flow of data? What are the requirements and expectations for this next tech cycle?



1. Pulsar Summit San Francisco, Hotel Nikko, August 18, 2022 [Ecosystem track]. Distributed Database Design Decisions to Support High Performance Event Streaming. Peter Corless, Director of Technical Advocacy, ScyllaDB
2. Peter Corless is the Director of Technical Advocacy at ScyllaDB, the company behind the monstrously fast and scalable NoSQL database. He is the editor of and a frequent contributor to the ScyllaDB blog, and program chair of ScyllaDB Summit and P99 CONF. He recently hosted the Distributed Systems Masterclass, co-sponsored by StreamNative and ScyllaDB.
3. Distributed Database Design Decisions to Support High Performance Event Streaming: Requirements for This Next Tech Cycle
4. This Next Tech Cycle [timeline, 2000 → 2025+]
   + Transistor counts: 42M Pentium 4 (2000), 228M Pentium D (2005), 2.3B Xeon Nehalem-EX (2010), 10B SPARC M7 (2015), 39B Epyc Rome (2019), ~60B? Epyc Genoa (2022), ~80B? Epyc Bergamo (2023)
   + Core counts: 1 → 2 → 8 → 32 → 64 → 96 → 128
   + Broadband speeds: 1.5 Mbps (2002), 16 Mbps (2008), 105 Mbps (2014), 1 Gbps (2018), 3 Gbps (2021)
   + Wireless services: 3G (2002), 4G (2014), 5G (2018)
   + The Zettabyte Era (1 ZB = 10^21 bytes): 2 ZB data stored (2010), 1.2 ZB IP traffic (2016), 64 ZB data stored (2020), ~180 ZB data stored (2025)
   + Public cloud to multicloud: AWS (2006), GCP (2008), Azure (2010), Azure Arc
5. Hardware Infrastructure is Evolving
   + Compute
     + From 100+ cores → 1,000+ cores per server
     + From multicore CPUs → full System on a Chip (SoC) designs (CPU, GPU, Cache, Memory)
   + Memory
     + Terabyte-scale RAM per server
     + DDR4 — 800 to 1600 MHz, 2011-present
     + DDR5 — 4600 MHz in 2020, 8000 MHz by 2024
     + DDR6 — 9600 MHz by 2025
   + Storage
     + Petabyte-scale storage per server
     + NVMe 2.0 [2021] — separation of base and transport
6. Databases are Evolving
   + Consistency Models [CAP Model: AP vs. CP]
     + Strong, Eventual, Tunable
     + ACID vs. BASE
   + Data Model / Query Languages [SQL vs. NoSQL]
     + RDBMS / SQL
     + NoSQL [Document, Key-Value, Wide-Column, Graph]
   + Big Data → HUGE Data
     + Data Stored: Gigabytes? Terabytes? Petabytes? Exabytes?
     + Payload Sizes: Kilobytes? Megabytes?
     + OPS / TPS: Hundreds of thousands? Millions?
     + Latencies: Sub-millisecond? Single-digit milliseconds?
7. Databases are [or should be] designed for specific kinds of data, specific kinds of workloads, and specific kinds of queries. How closely a database's design and implementation aligns with your specific use case determines the resistance of the system. Variable resistors, anyone?
8. Sure, you can use various databases for tasks they were never designed for — but should you? [Image caption: DATA ENGINEERS]
9. ΔData / t, with t ~ n × 0.001 s. For a database to be appropriate for event streaming, it needs to support managing changes to data over time in "real time," measured in single-digit milliseconds or less, and where changes to data can be produced at a rate of hundreds of thousands or millions of events per second [and at greater rates in the future].
10. Considerations:
   + Cloud Native Qualities: DBaaS, Single-cloud vs. Multi-cloud?, Multi-datacenter, Elasticity, Serverless, Orchestration, DevSecOps
   + All the "-ilities": Scalability, Reliability, Durability, Manageability, Observability, Flexibility, Facility / Usability, Compatibility, Interoperability, Linearizability
   + Event-Driven: "Batch" → "Stream", Change Data Capture (CDC), Sink & Source, Time Series, Event Streaming, Event Sourcing* [* ≠ Event Streaming]
   + Best Fit to Use Case: SQL or NoSQL?, Query Language, Data Model, Data Distribution, Workload [R/W], Speed, Price/TCO/ROI
11. While many database systems have been incrementally adapted to cloud native environments, they still have underlying architectural limits or presumptions:
   + Strong consistency / record-locking — limits latencies & throughput
   + Single primary server for reads/writes — replicas are read-only or only for failover; bottlenecks write-heavy workloads
   + Local clustering / single-datacenter design — inappropriate for high availability; hampers global distribution; lack of topology-awareness induces fragility
12. Two flavors of responses:
   + NoSQL — designed for non-relational data models, various query languages, and high-availability distributed systems
     + Key-value, document, wide-column, graph, etc.
   + NewSQL — still RDBMS, still SQL, but designed to operate as a highly available distributed system
13. Database-as-a-Service (DBaaS)
   + Lift-and-Shift to Cloud — same base offering as the on-premises version, offered as a cloud-hosted managed service
     + Easy/fast to bring to market, but no fundamental design changes
   + Cloud Native — designed from the ground up for cloud [only] usage
     + Elasticity — dynamic provisioning, scale up/down for throughput and storage
     + Serverless — do I need to know what hardware I'm running on?
     + Microservices & API integration — app integration, connectors, DevEx
     + Billing — making it easy to consume & measure ROI/TCO
     + Governance: Privacy Compliance / Data Localization
14. What does a database need to be, or have, or do, to properly support event streaming in 2022?
   + High Availability ["Always On"]
   + Impedance Match of Database to Event Streaming Systems
     + Similar characteristics for throughput, latency
   + All the Appropriate "Goesintos/Goesouttas"
     + Sink Connector
     + Change Data Capture (CDC) / Source Connector
   + Supports your favorite streaming flavor of the day
     + Kafka, Pulsar, RabbitMQ Streams, etc.
15. Event Streaming Journey of a NoSQL Database: ScyllaDB
16. ScyllaDB: Building on "Good Bones"
   + Performant: Shard-per-core, async-everywhere, shared-nothing architecture
   + Scalable: both horizontal [100s/1000s of nodes] & vertical [100s/1000s of cores]
   + Available: Peer-to-peer, active-active; no single point of failure
   + Distribution: Multi-datacenter clustering & replication, auto-sharding
   + Consistency: Tunable; primarily eventual, but also Lightweight Transactions (LWT)
   + Topology Aware: Shard-aware, node-aware, rack-aware, datacenter-aware
   + Compatible: Cassandra CQL & Amazon DynamoDB APIs
17. ScyllaDB Journey to Event Streaming — Starting with Kafka
   + Shard-Aware Kafka Sink Connector [January 2020]
     + GitHub: https://github.com/scylladb/kafka-connect-scylladb
     + Blog: https://www.scylladb.com/2020/02/18/introducing-the-kafka-scylla-connector/
18. ScyllaDB Journey to Event Streaming — Starting with Kafka (continued)
   + Change Data Capture [January 2020 – October 2021]
     + January 2020: ScyllaDB Open Source 3.2 — experimental
     + Over the course of 2020: 3.3, 3.4, 4.0, 4.1, 4.2 — experimental iterations
     + January 2021: 4.3 — production-ready, new API
     + March 2021: 4.4 — new API
     + October 2021: 4.5 — performance & stability
19. ScyllaDB Journey to Event Streaming — Starting with Kafka (continued)
   + CDC Kafka Source Connector [April 2021] (a hedged configuration sketch follows below)
     + GitHub: https://github.com/scylladb/scylla-cdc-source-connector
     + Blog: https://debezium.io/blog/2021/09/22/deep-dive-into-a-debezium-community-connector-scylla-cdc-source-connector/
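A minimal sketch of what registering this CDC source connector with a Kafka Connect cluster might look like. It assumes a Debezium-style deployment; the connector class and property names used here (scylla.cluster.ip.addresses, scylla.name, scylla.table.names) are recalled from the connector's documentation and should be verified against the GitHub README linked above.

    {
      "name": "scylla-cdc-source",
      "config": {
        "connector.class": "com.scylladb.cdc.debezium.connector.ScyllaConnector",
        "scylla.cluster.ip.addresses": "scylla-node1:9042",
        "scylla.name": "my-scylla-cluster",
        "scylla.table.names": "ks.tbl"
      }
    }

POSTing a JSON body along these lines to the Kafka Connect REST API (/connectors) registers the connector; each CDC-enabled table then surfaces as a topic of change events that downstream consumers (Kafka or, via Kafka compatibility, Pulsar) can read.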
20. Distributed Database Design Decisions to Support High Performance Event Streaming
21. ScyllaDB Journey to Event Streaming with Pulsar
   + Pulsar Consumer: Cassandra Sink Connector (a hedged setup sketch follows below)
     + Comes by default with Pulsar
     + ScyllaDB is Cassandra CQL compatible
     + Docs: https://pulsar.apache.org/docs/io-cassandra-sink/
     + GitHub: https://github.com/apache/pulsar/blob/master/site2/docs/io-cassandra-sink.md
   + Pulsar Producer: Can use the ScyllaDB CDC Source Connector using Kafka compatibility
     + Pulsar makes it easy to bring Kafka topics into Pulsar
     + Docs: https://pulsar.apache.org/docs/adaptors-kafka/
   + Potential Developments:
     + Native Pulsar Shard-Aware ScyllaDB Consumer Connector — even faster ingestion
     + Native CDC Pulsar Producer — unwrap your topics
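Since ScyllaDB speaks CQL, Pulsar's built-in Cassandra sink can be pointed at a ScyllaDB cluster. A hedged sketch of that setup, assuming the configuration keys (roots, keyspace, columnFamily, keyname, columnName) shown in the io-cassandra-sink docs linked above; host names, topic names, and the file name are illustrative, so verify everything against your Pulsar version.

    configs:
        roots: "scylla-node1:9042"
        keyspace: "pulsar_test_keyspace"
        columnFamily: "pulsar_test_table"
        keyname: "key"
        columnName: "col"

With that file saved as, say, scylla-sink.yml (a hypothetical name), the sink could be created and attached to a topic with something like:

    bin/pulsar-admin sinks create --sink-type cassandra --sink-config-file scylla-sink.yml --inputs my-ingest-topic --name scylla-sink

Messages published to my-ingest-topic would then be written into the target table as key/value pairs.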
22. ScyllaDB CDC: How Does It Work?
23. ScyllaDB Quickstart: Create a Table and Enable CDC

    CREATE TABLE ks.tbl (
        pk int,
        ck int,
        val int,
        col set<int>,
        PRIMARY KEY (pk, ck)
    ) WITH cdc = { 'enabled': true };
24. CDC Options - Record Types
   + Delta [What was changed?]
     + 'keys': only the primary key of the change will be recorded
     + 'full': contains information about every modified column
   + Preimage [What was before?]
     + 'false': disables the feature
     + 'true': contains only the columns that were changed by the write
     + 'full': contains the entire row (how it was before the write was made)
   + Postimage [What's the end result?]
     + 'false': disables the feature
     + 'true': shows the affected row's state after the write; the postimage row always contains all columns, whether or not they were affected by the change
25. CDC Options - Enabled & TTL (see the combined CQL sketch below)
   + Enabled
     + 'true': enables the CDC feature
     + 'false': disables the CDC feature
   + TTL
     + 86400: in seconds; by default, records in the CDC log table expire within 24 hours
     + If set to 0, a separate cleaning mechanism is recommended
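Putting the options together: a minimal CQL sketch, assuming the ks.tbl table from the quickstart slide. The option names and values follow the slides above; exact value syntax and defaults may vary by ScyllaDB version, so check the CDC documentation before relying on this.

    ALTER TABLE ks.tbl WITH cdc = {
        'enabled': true,      -- turn CDC on for this table
        'delta': 'full',      -- record every modified column ('keys' records only the primary key)
        'preimage': true,     -- capture changed columns as they were before the write (or 'full' for the whole row)
        'postimage': true,    -- capture the full row state after the write
        'ttl': 86400          -- CDC log records expire after 24 hours (the default)
    };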
26. The CDC Log Table (an example read follows below)

    cqlsh> DESC TABLE ks.tbl_scylla_cdc_log;

    CREATE TABLE ks.tbl_scylla_cdc_log (
        "cdc$stream_id" blob,          -- partition key
        "cdc$time" timeuuid,           -- sorted by time
        "cdc$batch_seq_no" int,        -- batch sequence
        "cdc$deleted_col" boolean,
        "cdc$deleted_elements_col" frozen<set<int>>,
        "cdc$deleted_val" boolean,
        "cdc$end_of_batch" boolean,
        "cdc$operation" tinyint,
        "cdc$ttl" bigint,
        ck int,
        col frozen<set<int>>,
        pk int,
        val int,
        PRIMARY KEY ("cdc$stream_id", "cdc$time", "cdc$batch_seq_no")
    )
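A small, hedged walk-through of reading the log: insert a row into the base table, then query the auto-generated log table for recent change records. The values are illustrative; real consumers page through the log by "cdc$stream_id" and "cdc$time" rather than scanning it.

    INSERT INTO ks.tbl (pk, ck, val) VALUES (1, 10, 100);

    SELECT "cdc$stream_id", "cdc$time", "cdc$operation", pk, ck, val
    FROM ks.tbl_scylla_cdc_log
    LIMIT 10;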
27. How Do NoSQL CDC Implementations Compare?

                             Cassandra       DynamoDB             MongoDB              ScyllaDB
    Consumer location        on-node         off-node             off-node             off-node
    Replication              duplicated      deduplicated         deduplicated         deduplicated
    Deltas                   yes             no                   partial              optional
    Pre-image                no              yes                  no                   optional
    Post-image               no              yes                  yes                  optional
    Slow consumer reaction   table stopped   consumer loses data  consumer loses data  consumer loses data
    Ordering                 no              yes                  yes                  yes
28. Writing to the Base Table [No CDC]: the CQL write (INSERT INTO base_table(...)) goes to a coordinator node.
29. Writing to the Base Table [No CDC]: the coordinator node creates replicated write calls to the replica nodes.
30. Writing to a CDC-Enabled Table: the CQL write (INSERT INTO base_table(...)) goes to a coordinator node.
31. Writing to a CDC-Enabled Table (pre-/postimage): if required, the coordinator reads the existing row data for pre-/postimage generation (an optional preimage read).
32. Writing to a CDC-Enabled Table: the coordinator creates CDC log table writes and piggybacks them on the base table writes to the same replica nodes. While the data size written is larger, the number of write requests does not change.
33. CDC Streams (a stream-discovery query sketch follows below)
   + CDC data is grouped into streams
     + Streams divide the token ring space
     + Each stream represents a tokenization "slot" in the current topology
     + The stream is the log partition key
     + The stream for a given write is chosen based on the base table primary key's tokenization
   + A consumer can read from all, one, or some streams at a time
     + Allows "round-robin" traversal of the data space to avoid overly large or cross-node queries
   [Diagram: token ring with streams 1, 2, 3, 4...]
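For illustration only, a consumer discovering the current stream set might query a system table along these lines. The table name system_distributed.cdc_streams_descriptions_v2 is an assumption from memory and differs across ScyllaDB versions; in practice the CDC client libraries and connectors perform this discovery for you.

    -- Table name is an assumption; consult your ScyllaDB version's CDC documentation.
    SELECT * FROM system_distributed.cdc_streams_descriptions_v2 LIMIT 10;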
34. CDC Streams: the CDC Java driver handles the round-robin traversal of the streams, feeding the Kafka source connector, which publishes to the Kafka broker. [Diagram: token ring → CDC Java driver → Kafka source connector → Kafka broker]
35. Learn NoSQL for free at university.scylladb.com. Change Data Capture (CDC) lesson here: https://university.scylladb.com/courses/scylla-operations/lessons/change-data-capture-cdc/
36. Thank you! Peter Corless, peter@scylladb.com, @petercorless. Pulsar Summit San Francisco, Hotel Nikko, August 18, 2022.
