Data Policies for the Kafka-API with WebAssembly
https://github.com/vectorizedio/redpanda
background
● I've worked on streaming systems for 12+ years
● developer, founder & CEO of Vectorized, hacking on Redpanda, a modern streaming platform for mission-critical workloads
● previously principal engineer at Akamai; co-founder & CTO of concord.io, a high-performance stream processing engine built in C++ and acquired by Akamai in 2016
alex gallego
@emaxerrno
keystone problem in streaming
[figure: "The Data", "What you care about"]
keystone problem in streaming: consume what you want
The Good Parts Of Streaming
○ Streaming data is immutable
■ Good for understandability
■ Built-in auditing
○ Ability to replay data
○ The essence of a streaming platform is to decouple producers from consumers
■ Focuses on the data rather than on who produced it, what produced it, and when
○ Proven at scale
The *Not so* Good Parts Of Streaming
○ You consume *at least* as much as you produce
■ in the same order
■ consuming 10x more than you produce is common
○ Up-front architecture cost of splitting your streams, types, and data contracts for things like privacy, etc.
○ Can easily saturate your network (simply shifting the bottleneck)
■ Forces developers to create specialized clusters
● Mission Critical Prod
● Analytics/Dashboard Prod
● ML sample clusters
keystone problem in streaming: consume what you want
[figure: "The Data", "What you care about" 🥳🤯🎉 - hooray!]
performance improvement (timeline: 2007, 2011, 2020)
● before: SSD at $2500/TB; typical instance with 4 cores
● now: SSD at $200/TB (1000x faster, 10x cheaper); 225 core VMs (30x more cores); 100Gbps NICs (100x more throughput)
● eras, in order: first open source solutions (take advantage of cheap disk); disaggregate compute and storage; modern hardware + cloud native
● 30x taller computers + 1000x faster disks
thread per core
architecture
● explicit scheduling everywhere
○ IO groups
○ x-core groups (smp)
○ memory throttling
● ONLY supports async interfaces
○ requires library rewrites for the threading model to work well
future<>
● viral primitive (like actors, Orleans, Akka, Pony, etc.) - mix, map-reduce, filter, chain, fail, complete, generate, fulfill, sleep, expire futures, etc.
● fundamentally about program structure: with concurrent structure, parallelism is a free variable
● one pinned thread per core - must express parallelism and concurrency explicitly
● no locks on the hot path - network of SPSC queues
new way to build software: async-only cooperative scheduling framework
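A minimal sketch of the "network of SPSC queues" idea, assuming a fixed-size ring buffer with exactly one producer and one consumer. The `SpscQueue` class and its method names are hypothetical and are not taken from Redpanda; the point is only that the producer writes `tail`, the consumer writes `head`, and neither side takes a lock.

```javascript
// Hypothetical single-producer/single-consumer ring buffer (illustration only).
class SpscQueue {
  constructor(capacity) {
    // One slot stays empty so that "full" and "empty" are distinguishable.
    this.size = capacity + 1;
    this.slots = new Int32Array(new SharedArrayBuffer(4 * this.size));
    // index[0] = head (next slot to read), index[1] = tail (next slot to write)
    this.index = new Int32Array(new SharedArrayBuffer(8));
  }

  // Called by the producer only. Returns false when the queue is full.
  push(value) {
    const tail = Atomics.load(this.index, 1);
    const next = (tail + 1) % this.size;
    if (next === Atomics.load(this.index, 0)) return false;
    this.slots[tail] = value;
    Atomics.store(this.index, 1, next); // publish only after the slot is written
    return true;
  }

  // Called by the consumer only. Returns undefined when the queue is empty.
  pop() {
    const head = Atomics.load(this.index, 0);
    if (head === Atomics.load(this.index, 1)) return undefined;
    const value = this.slots[head];
    Atomics.store(this.index, 0, (head + 1) % this.size);
    return value;
  }
}

const queue = new SpscQueue(2);
const pushed = [queue.push(10), queue.push(20), queue.push(30)]; // third push finds the queue full
const popped = [queue.pop(), queue.pop(), queue.pop()]; // third pop finds the queue empty
```

In a thread-per-core design there would be one such queue per ordered pair of cores, so every queue keeps a single writer and a single reader.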
no virtual memory
buddy allocator
● preallocate 100% of memory; split across N cores for thread-local allocation/access
● create pools by dividing the memory of the layer above by 2 and creating a new pool
● large allocations (above 64KB) are not pooled
● buddy allocator pools for all object sizes below 64KB
● full free-lists are recycled
● difficult to use this technique in practice; requires developer retraining and accounting for every single byte present in the system at all times
○ forces developers to pay additional attention to all hash-maps, allocations, pooling, etc.
[diagram labels: memory global/N cores; memory core local (usually around 2GB+); memory/2, memory/2, ...; Pools of 64KB; Pools of 16KB; Pools of 8KB; Pool 0 - large object pool, above 64KB+1]
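The pool selection above can be sketched roughly as follows. This is an illustration under stated assumptions, not Redpanda's allocator: the function names are hypothetical, pool sizes are assumed to be powers of two (as in a classic buddy allocator), and a request of exactly 64KB is assumed to still be pooled.

```javascript
// Hypothetical sketch of per-core memory split and pool selection.
const LARGE_OBJECT_THRESHOLD = 64 * 1024; // allocations above 64KB are not pooled

// Memory is preallocated up front and split evenly across cores.
function bytesPerCore(totalBytes, coreCount) {
  return Math.floor(totalBytes / coreCount);
}

// Returns the pool size that would serve the request, or null when the
// request goes to the large object pool instead.
function poolSizeFor(requestBytes) {
  if (requestBytes > LARGE_OBJECT_THRESHOLD) return null;
  let size = 1;
  while (size < requestBytes) size *= 2; // round up to the next power of two
  return size;
}

const perCore = bytesPerCore(16 * 2 ** 30, 8); // 16GiB across 8 cores: 2147483648
const smallPool = poolSizeFor(5000);           // 8192
const exactPool = poolSizeFor(64 * 1024);      // 65536
const largePool = poolSizeFor(64 * 1024 + 1);  // null (large object pool)
```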
iobuf - TPC buffer management
src: https://vectorized.io/blog/tpc-buffers/
request pipelining per partition
● parallelism model == number of cores/pthreads in the system
● read full request metadata and assign each subrequest to a physical core
● for all non-overlapping cores, execute in parallel
● for all overlapping cores, pipeline per *partition* (enqueue writes in order)
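One way to picture the assignment step is to split a request into per-core work lists. The sketch below is hypothetical: `planRequest` is an invented name, and the rule that a partition is owned by core `partition % coreCount` is an assumption made only for illustration.

```javascript
// Hypothetical sketch: split one request into per-core work lists.
function planRequest(writes, coreCount) {
  const perCore = new Map();
  for (const write of writes) {
    const core = write.partition % coreCount; // assumed ownership rule
    if (!perCore.has(core)) perCore.set(core, []);
    // Appending keeps writes that land on the same core in request order.
    perCore.get(core).push(write);
  }
  return perCore;
}

const plan = planRequest(
  [
    { partition: 0, payload: "a" },
    { partition: 1, payload: "b" },
    { partition: 4, payload: "c" },
    { partition: 0, payload: "d" },
  ],
  4
);
// Lists for different cores can run in parallel; each list runs in order.
const core0 = plan.get(0).map((w) => w.payload); // ["a", "c", "d"]
const core1 = plan.get(1).map((w) => w.payload); // ["b"]
```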
core-local metadata piggybacking (...pandabacking?)
● maintain core-local metadata cache of
○ bytes written per partition (for future readers)
○ latencies from the remote core (could be highly contended and we need TCP backpressure)
○ per TCP-connection read-ahead pointers on disk for O(1) access/assignment
[diagram labels: copy-on-read cache; x-shard metadata for low latency access]
core-local v8::isolate per topic/partition/policy
● maintain a v8::isolate *per core*
● maintain v8::context per topic/partition
○ thread_local v8::isolate
■ Low latency access
■ Preemption
● Timebound
● CPU Cycles
● Cooperative scheduling
○ No cross-core communications
○ Small memory footprint
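A rough analogy using Node.js, not the embedding Redpanda uses: contexts created with the `node:vm` module live inside the process's single V8 isolate, and the `timeout` option gives time-bound preemption. The `contextFor` and `runPolicy` helpers and the topic names are hypothetical.

```javascript
const vm = require("node:vm");

// One context per topic, all inside the single V8 isolate of this process.
const contexts = new Map();

function contextFor(topic) {
  if (!contexts.has(topic)) {
    contexts.set(topic, vm.createContext({ state: {} }));
  }
  return contexts.get(topic);
}

// Run policy source in the topic's context, giving up after `timeoutMs`.
function runPolicy(topic, source, timeoutMs = 50) {
  return vm.runInContext(source, contextFor(topic), { timeout: timeoutMs });
}

runPolicy("orders", "state.count = (state.count || 0) + 1");
runPolicy("orders", "state.count = (state.count || 0) + 1");
const ordersCount = runPolicy("orders", "state.count");            // state persists per context
const paymentsType = runPolicy("payments", "typeof state.count");  // other topics see nothing

// A runaway policy is interrupted instead of stalling the thread.
let timedOut = false;
try {
  runPolicy("orders", "while (true) {}", 10);
} catch (err) {
  timedOut = true;
}
```

Swapping contexts inside one isolate is what keeps the per-topic cost small: the heap and compiled runtime are shared, and only the global scope differs.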
applying a .wasm or .js to a Kafka Topic
> bin/kafka-topics.sh 
    --alter --topic my_topic_name
    --config x-data-policy={...}
(redpanda has a tool called `rpk` that is similar to kafka-topics.sh)
● Must be a pure function
○ limit of global state per core is 1MB
● On TCP connection
○ Look up the data policy associated with the topic
○ Instantiate a v8::context
○ Perform reads from disk and subsystems as normal
○ Before sending TCP bytes:
■ Call v8::isolate
■ Swap v8::context
■ Transform payload
■ Re-checksum payload
■ Return new RecordBatch
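The read path above can be sketched in a few lines. Everything here is a simplified stand-in: `beforeSend`, `checksumOf`, the in-memory `policies` map, and the batch shape are hypothetical, and the checksum covers only the concatenated record values rather than a real Kafka batch layout. CRC-32C is used because it is the checksum Kafka record batches carry.

```javascript
// CRC-32C (Castagnoli), bitwise variant: slow but dependency-free.
function crc32c(bytes) {
  let crc = 0xffffffff;
  for (const byte of bytes) {
    crc ^= byte;
    for (let i = 0; i < 8; i++) {
      crc = (crc >>> 1) ^ (0x82f63b78 & -(crc & 1));
    }
  }
  return (crc ^ 0xffffffff) >>> 0;
}

// Simplification: checksum only the concatenated record values.
function checksumOf(records) {
  return crc32c(records.flatMap((record) => Array.from(record.value)));
}

// Hypothetical policy table: this policy redacts every byte to "*" (42).
const policies = new Map([
  ["my_topic_name", (record) => ({ ...record, value: record.value.map(() => 42) })],
]);

function beforeSend(topic, batch, policyTable) {
  const policy = policyTable.get(topic);             // look up the data policy
  if (!policy) return batch;                         // no policy: send unchanged
  const records = batch.records.map(policy);         // transform payload
  return { records, checksum: checksumOf(records) }; // re-checksum, new batch
}

const stored = { records: [{ value: Uint8Array.from([104, 105]) }] };
stored.checksum = checksumOf(stored.records);
const sent = beforeSend("my_topic_name", stored, policies);
const untouched = beforeSend("other_topic", stored, policies);
```

The stored batch is never modified; the consumer receives a new batch whose checksum matches the transformed bytes.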
flow
import { InlineTransform } from "@vectorizedio/InlineTransform";

const transform = new InlineTransform();
transform.topics([{"input": "lowercase", "output": "uppercase"}]);
...
// Uppercase every ASCII letter in the record value (an array of bytes).
const uppercase = async (record) => {
  const newRecord = {
    ...record,
    value: record.value.map((char) => {
      // 97..122 are the ASCII codes for 'a'..'z'; both bounds are inclusive.
      if (char >= 97 && char <= 122) {
        return char - 32;
      } else {
        return char;
      }
    }),
  };
  return newRecord;
};
[diagram labels: lowercase, UPPERCASE, Filter]
*using the Kafka-API in your favorite programming language (JS, Py, Java, C++, etc.): full compatibility with all your tools, no code changes
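To try the byte mapping from the flow slide without the `@vectorizedio/InlineTransform` package, a stand-alone sketch is enough. The helper name `toUpperAscii` is hypothetical, and it assumes the record value is a `Uint8Array` of ASCII bytes.

```javascript
// Stand-alone version of the byte mapping; no Redpanda package required.
// Bytes 97..122 ('a'..'z', inclusive) are shifted down by 32 to 'A'..'Z'.
const toUpperAscii = (bytes) =>
  bytes.map((char) => (char >= 97 && char <= 122 ? char - 32 : char));

const input = new TextEncoder().encode("lowercase a-z, 123");
const output = new TextDecoder().decode(toUpperAscii(input));
// output === "LOWERCASE A-Z, 123"
```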
check out the code for yourself!
● https://github.com/vectorizedio/redpanda
● ask the maintainers questions at https://vectorized.io/slack
● say hi on twitter https://twitter.com/vectorizedio
● wasm+kafka-api https://vectorized.io/blog/wasm-architecture/

Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
