AI Big Data Conference: Jeffrey Ricker, Kappa Architecture

KAPPA ARCHITECTURE (AND BEYOND)

JEFFREY RICKER
Co-founder of Ricker Lyman Robotic
Lives in New York
Clients include hedge funds, pharmaceutical companies, and retailers
Amazon big data
Distributed Instruments
US Defense HPC Modernization

AGENDA
Review history that led to Kappa architecture
Investigate the power of Kappa
Review what comes next

“In the beginning was the command line …”
Neal Stephenson

MAPREDUCE
GOOD
Hadoop Distributed File System (HDFS)
Distributed and massively parallel
Move the compute to the data
YARN
NOT SO GOOD
Finite data set
Batch process
Begin to chain jobs together
Failure?
Recovery?
Idempotent?

WORKFLOWS
Oozie
Luigi
Azkaban
Airflow
Pinball
Cascading
Taskflow

AND MORE
Apache Spark 2014
Apache Samza 2014
Apache Flink 2015
Apache NiFi 2015
Apache Gearpump 2016
Apache Apex 2016
Kafka Streams 2016
Akka Streams 2016

STREAMS ARE DIFFERENT
Data is infinite
Continuous processing
There is no now
Eventual consistency vs false sense of consistency
Closer to reality

CONSISTENCY
Trade data arrives at end of day (EOD)
Processing runs to create EOD status of trades
Corrections exist for previous days
Previous EOD is also changed
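The correction problem above can be sketched as a replay over an event log. This is a minimal illustration, not the speaker's implementation: the event tuple layout, the dates, and the `eod_status` helper are all hypothetical. It assumes a correction is simply a later event for the same trade that overwrites the earlier quantity.

```python
from collections import defaultdict

# Hypothetical trade events: (arrival_day, trade_day, trade_id, quantity).
# A correction is an event that arrives after the trade day it applies to.
events = [
    ("2024-01-02", "2024-01-02", "T1", 100),
    ("2024-01-02", "2024-01-02", "T2", 50),
    ("2024-01-03", "2024-01-03", "T3", 70),
    ("2024-01-03", "2024-01-02", "T2", 40),  # correction to the previous day
]

def eod_status(log, as_of_day):
    """Rebuild total quantity per trade day from events that arrived by as_of_day."""
    latest = {}
    for arrival_day, trade_day, trade_id, qty in log:
        if arrival_day <= as_of_day:  # ISO dates compare correctly as strings
            latest[(trade_day, trade_id)] = qty  # later events overwrite earlier ones
    totals = defaultdict(int)
    for (trade_day, _trade_id), qty in latest.items():
        totals[trade_day] += qty
    return dict(totals)

first_view = eod_status(events, "2024-01-02")
second_view = eod_status(events, "2024-01-03")
print(first_view)
print(second_view)
```

Replaying the same log as of a later day yields a different answer for the earlier day, which is the "previous EOD is also changed" effect.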

LAMBDA
Enterprise SQL architectures have followed the same pattern for years
Requires maintaining two versions of the same logic
Joining the streaming with the batch is easier said than done

THE MISSING PIECE
All distributed computing has three components:
1. Data (or state)
2. Compute
3. Communication
We had
1. HDFS + Hive + HBase +++
2. YARN + Spark + Kubernetes +++
3. ?

ADVANTAGES
Works as a queue
Works as pub-sub
Works as a storage system
Scales
Fast
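How one log can act as a queue, as pub-sub, and as storage can be sketched with per-group read offsets. The `Log` class below is a hypothetical in-memory stand-in with a single partition. It is not the Kafka client API, and real Kafka divides work within a group by partition rather than record by record.

```python
class Log:
    """In-memory stand-in for a single-partition append-only log."""

    def __init__(self):
        self.records = []   # records are retained after being read (storage)
        self.offsets = {}   # consumer group name -> next offset to read

    def append(self, record):
        self.records.append(record)

    def poll(self, group):
        """Return the next unread record for this group, or None if caught up."""
        offset = self.offsets.get(group, 0)
        if offset >= len(self.records):
            return None
        self.offsets[group] = offset + 1
        return self.records[offset]

log = Log()
for record in ["a", "b", "c"]:
    log.append(record)

# Queue: two workers sharing one group split the records between them.
worker1 = [log.poll("billing")]
worker2 = [log.poll("billing"), log.poll("billing")]

# Pub-sub: a different group independently sees every record.
audit = [log.poll("audit") for _ in range(3)]
print(worker1, worker2, audit, len(log.records))
```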

DEFINITION OF KAPPA
Rather than using a relational DB like SQL or a key-value store like Cassandra, the canonical data store in a Kappa Architecture system is an append-only immutable log.
From the log, data is streamed through a computational system and fed into auxiliary stores for serving.

BASIC CONCEPT
Write immutable events to the append-only log
Recreate the state in (multiple) materialized views
Distribute the ability to maintain state across multiple systems in read-optimized formats
“Turn the database inside out”
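The concept above can be sketched in a few lines: one append-only list of immutable events and two views rebuilt from it. The `Event` type, the account names, and both view functions are hypothetical examples. In a real system the log would be durable and the views would live in separate read-optimized stores.

```python
from typing import NamedTuple

class Event(NamedTuple):
    """Immutable event (a NamedTuple cannot be modified after creation)."""
    account: str
    kind: str      # "deposit" or "withdrawal"
    amount: int

log = []  # the append-only log; events are never updated or deleted

def append(event):
    log.append(event)

def balances(events):
    """Materialized view 1: current balance per account."""
    view = {}
    for e in events:
        sign = 1 if e.kind == "deposit" else -1
        view[e.account] = view.get(e.account, 0) + sign * e.amount
    return view

def activity_counts(events):
    """Materialized view 2: number of events per kind."""
    view = {}
    for e in events:
        view[e.kind] = view.get(e.kind, 0) + 1
    return view

append(Event("alice", "deposit", 100))
append(Event("bob", "deposit", 50))
append(Event("alice", "withdrawal", 30))

print(balances(log))
print(activity_counts(log))
```

Either view can be dropped and rebuilt at any time by replaying the log from the beginning, which is why the log, not the view, is the canonical store.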

RESOURCE CONTENTION
Intraday: run the business (microservices)
Exoday: analyze the business (Hadoop)

MULTIPLE CLUSTERS
Kafka
Microservices
Hadoop
Streaming
Nifi

MICROSERVICE EXAMPLE
microservice → HBase

MICROSERVICE EXAMPLE
microservice → Kafka → HBase

MICROSERVICE EXAMPLE
microservice → Kafka → HBase, Hive, Druid

BOUNDARY LAYER
stream process A → Kafka → stream process B

DOMAIN KAPPA
Data is the current state
Compute changes the state
Stream publishes the state changes
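The three roles above can be sketched for a single service. `InventoryService` and its `adjust` method are hypothetical names, and a plain list stands in for the outbound stream. The sketch assumes each published event carries the new state of the item that changed.

```python
class InventoryService:
    """One domain: owns its state, changes it, and publishes every change."""

    def __init__(self, stream):
        self.state = {}        # data: the current state
        self.stream = stream   # stream: outbound log of state changes

    def adjust(self, item, delta):
        # compute: apply the change to the current state
        new_quantity = self.state.get(item, 0) + delta
        self.state[item] = new_quantity
        # stream: publish the resulting state change
        self.stream.append({"item": item, "quantity": new_quantity})
        return new_quantity

stream = []
service = InventoryService(stream)
service.adjust("widget", 5)
service.adjust("widget", -2)
service.adjust("gadget", 1)

# Another service can rebuild the observed state by replaying the stream.
observed = {}
for change in stream:
    observed[change["item"]] = change["quantity"]
print(observed)
```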

OBSERVED STATE
Stateful
• Service maintains an in-memory copy of the observed state of the other service by subscribing to the stream of the other service from the beginning.
Stateless
• Service reads the state from the other service by request-response.
Semi-stateless
• Hybrid of the other two. The service subscribes to the stream of the other service and keeps a cache of the observable state. The cache is limited in size through timeouts. If the service is missing a state in its local cache, then it reads the observable state from the other service and caches it.
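The semi-stateless option can be sketched as a cache fed by stream events, with timeouts and a request-response fallback. `ObservedStateCache`, `on_event`, and the `fetch` callback are hypothetical names. The clock is injectable so the timeout can be demonstrated without sleeping.

```python
import time

class ObservedStateCache:
    """Cache of another service's state: filled by its stream, expired by timeout."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self.fetch = fetch        # request-response call to the other service
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries = {}         # key -> (value, expiry time)

    def on_event(self, key, value):
        """Called for each state change received from the other service's stream."""
        self.entries[key] = (value, self.clock() + self.ttl)

    def get(self, key):
        entry = self.entries.get(key)
        if entry is not None and entry[1] > self.clock():
            return entry[0]                       # fresh cache hit
        value = self.fetch(key)                   # missing or expired: ask the service
        self.entries[key] = (value, self.clock() + self.ttl)
        return value

# Usage with a fake remote service and a fake clock.
remote = {"order-1": "SHIPPED", "order-2": "PENDING"}
calls = []

def fetch(key):
    calls.append(key)
    return remote[key]

now = [0.0]
cache = ObservedStateCache(fetch, ttl_seconds=60, clock=lambda: now[0])
cache.on_event("order-1", "SHIPPED")
hit = cache.get("order-1")      # served from the cache, no remote call
miss = cache.get("order-2")     # not cached, read from the other service
now[0] = 61.0
expired = cache.get("order-1")  # timed out, read again from the other service
print(calls)
```

This sketch only expires entries lazily on read. A production cache would also evict expired entries in the background to bound memory.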

SUMMARY
Canonical data store is an append-only immutable log
• Kappa is not dependent on Kafka
• Kafka is very good for implementing Kappa
From the log, data is streamed through a computational system and fed into auxiliary stores for serving
Auxiliary stores are materialized views
Multiple views of the same data, read optimized
Meets resource contention requirements of enterprise

1. KAFKA WILL EVOLVE
Kafka Streams
Exactly-once processing
Continuous queries

2. RESOURCE MANAGERS
YARN
• MapReduce jobs running on the cluster
• Long-running services like HBase running on the cluster
• Why not share the resources?
Kubernetes
• Distribute containers across a collection of servers
Mesos
• An operating system for the data center

SCHEDULERS
Apache YARN
Kubernetes
Mesos
Docker Swarm
HashiCorp Nomad
Microsoft Apollo

CON/DI/VERGENCE
Compute
• YARN will expand to run microservices and containers
• Microservice and container platforms will run Hadoop
Data storage
• HDFS or S3 or Ceph or ?
Messaging
• Kafka alternatives will arise

3. GPGPU
AI frameworks
• TensorFlow
• MXNet
• Caffe
Databases
• Kinetica
• MapD
• SQream
• Blazegraph

SUPERCOMPUTER
Hadoop
• 12 m4 nodes x 64 cores = 768 cores
GPU
• 1 p2 node x 16 GPUs x 2,496 cores = 39,936 cores
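The arithmetic on this slide can be checked directly. The variable names are illustrative. Note that this compares raw core counts only, and a GPU core is not equivalent to a CPU core.

```python
# Core counts taken from the slide.
hadoop_cores = 12 * 64        # 12 m4 nodes x 64 cores each
gpu_cores = 1 * 16 * 2496     # 1 p2 node x 16 GPUs x 2,496 cores each

ratio = gpu_cores / hadoop_cores
print(hadoop_cores, gpu_cores, ratio)
```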

A NEW LAYER
data at rest
stream (CPU) processing
GPU processing
