Topic of presentation: Kappa architecture (and beyond)
The main points of the presentation:
We will discuss the evolution of big data architecture from batch to Lambda to Kappa. I will walk through how to implement a Kappa architecture with practical examples, focusing on how to reach its full potential and avoid the pitfalls. We will finish by reviewing what lies ahead, including the inevitable consolidation of microservices, GPGPU and Hadoop.
http://dataconf.com.ua/index.php#agenda
#dataconf
#AIBDConference
2. JEFFREY RICKER
Co-founder of Ricker Lyman Robotic
Lives in New York
Clients include hedge funds, pharmaceutical, retail
Amazon big data
Distributed Instruments
US Defense HPC Modernization
3. AGENDA
Review history that led to Kappa architecture
Investigate the power of Kappa
Review what comes next
7. MAP REDUCE
GOOD
Hadoop Distributed File System (HDFS)
Distributed & massively parallel
Move the compute to the data
YARN
NOT SO GOOD
Finite data set
Batch process
Begin to chain jobs together
Failure?
Recovery?
Idempotent?
13. CONSISTENCY
Trade data arrives at end of day (EOD)
Processing runs to create EOD status of trades
Corrections exist for previous days
Previous EOD is also changed
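A minimal sketch of why a late correction changes a previous EOD. The event layout (business date, trade id, quantity) and the rule that the latest event per trade wins are assumptions made for illustration, not details from the talk.

```python
from collections import defaultdict

# Hypothetical trade events: (business_date, trade_id, quantity).
# A correction is modelled as a later event for the same date and trade.
events = [
    ("2024-01-02", "T1", 100),
    ("2024-01-02", "T2", 50),
    ("2024-01-03", "T3", 70),
    ("2024-01-02", "T1", 90),  # correction for a previous day, arriving late
]


def eod_totals(log):
    """Rebuild EOD totals per business date; the latest event per trade wins."""
    latest = {}
    for date, trade_id, qty in log:
        latest[(date, trade_id)] = qty
    totals = defaultdict(int)
    for (date, _trade_id), qty in latest.items():
        totals[date] += qty
    return dict(totals)


before = eod_totals(events[:3])  # EOD status as first published
after = eod_totals(events)       # EOD status once the correction is applied
print(before)
print(after)
```

Because the totals are rebuilt from the full list of events, the correction flows into the earlier day without special handling. A batch job that only processes the current day's file would miss it.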
17. LAMBDA
Enterprise SQL architectures have followed the same pattern for years
Requires maintaining two versions of the same logic
Joining the streaming with the batch is easier said than done
20. THE MISSING PIECE
All distributed computing has three components:
1. Data (or state)
2. Compute
3. Communication
We had
1. HDFS + Hive + HBase +++
2. YARN + Spark + Kubernetes +++
3. ?
25. DEFINITION OF KAPPA
Rather than using a relational DB like SQL or a key-value store like Cassandra, the canonical data store in a Kappa Architecture system is an append-only immutable log.
From the log, data is streamed through a computational system and fed into auxiliary stores for serving.
30. BASIC CONCEPT
Write immutable events to the append-only log
Recreate the state in (multiple) materialized views
Distribute the ability to maintain state in multiple systems in read-optimized formats
“Turn the database inside out”
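The concept above can be sketched in a few lines, assuming an in-memory list stands in for the log. The names AppendOnlyLog, Deposit, balance_view and count_view are hypothetical; a real system would use a durable log such as Kafka.

```python
from collections import defaultdict
from typing import NamedTuple


class Deposit(NamedTuple):
    """Immutable event (hypothetical schema)."""
    account: str
    amount: int


class AppendOnlyLog:
    """In-memory stand-in for the log: events can be appended and read, never changed."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)
        return len(self._events) - 1  # offset of the new event

    def read_from(self, offset=0):
        return iter(self._events[offset:])


def balance_view(log):
    """Materialized view 1: current balance per account."""
    balances = defaultdict(int)
    for event in log.read_from(0):
        balances[event.account] += event.amount
    return dict(balances)


def count_view(log):
    """Materialized view 2: number of deposits per account."""
    counts = defaultdict(int)
    for event in log.read_from(0):
        counts[event.account] += 1
    return dict(counts)


log = AppendOnlyLog()
log.append(Deposit("alice", 100))
log.append(Deposit("bob", 50))
log.append(Deposit("alice", -30))

balances = balance_view(log)
counts = count_view(log)
print(balances, counts)
```

Both views are derived from the same events and can be dropped and rebuilt at any time by replaying the log from offset 0.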
42. OBSERVED STATE
Stateful
• Service maintains an in-memory copy of the observed state of the other service by subscribing to the stream of the other service from the beginning.
Stateless
• Service reads the state from the other service by request-response.
Semi-stateless
• Hybrid of the other two. The service subscribes to the stream of the other service and keeps a cache of the observable state. The cache is limited in size through timeouts. If the service is missing a state in its local cache, then it reads the observable state from the other service and caches it.
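A minimal sketch of the semi-stateless option, assuming a time-to-live per cache entry and a fallback function for the request-response read. SemiStatelessCache, on_event and fetch_from_other_service are hypothetical names, and the clock is injected so the expiry can be shown without waiting.

```python
import time


class SemiStatelessCache:
    """Cache of another service's observable state, filled from its stream."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch      # request-response fallback (assumed callable)
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}       # key -> (value, expires_at)

    def on_event(self, key, value):
        """Called for each event received from the other service's stream."""
        self._entries[key] = (value, self._clock() + self._ttl)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]      # still fresh: served locally
        value = self._fetch(key)  # missing or timed out: ask the other service
        self._entries[key] = (value, now + self._ttl)
        return value


# Stand-in for the other service and a controllable clock.
remote = {"order-1": "SHIPPED", "order-2": "NEW"}
calls = []


def fetch_from_other_service(key):
    calls.append(key)
    return remote[key]


now = [0.0]
cache = SemiStatelessCache(fetch_from_other_service, ttl_seconds=30, clock=lambda: now[0])

cache.on_event("order-1", "PAID")  # state observed on the stream
first = cache.get("order-1")       # cache hit, no remote call
second = cache.get("order-2")      # never seen on the stream: remote read
now[0] = 31.0                      # let the cached entries time out
third = cache.get("order-1")       # expired: remote read and re-cache
```

The timeout bounds how long an entry lives, and therefore how large the cache grows, at the cost of an occasional request to the other service.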
44. SUMMARY
Canonical data store is an append-only immutable log
• Kappa is not dependent on Kafka
• Kafka is very good for implementing Kappa
From the log, data is streamed through a computational system and fed into auxiliary stores for serving
Auxiliary stores are materialized views
Multiple views of the same data, each read-optimized
Meets resource contention requirements of enterprise
48. RESOURCE MANAGERS
YARN
• MapReduce jobs running on cluster
• Long-running services like HBase running on cluster
• Why not share the resources?
Kubernetes
• Distribute containers across a collection of servers
Mesos
• An operating system for the data center
50. CON/DI/VERGENCE
Compute
• YARN will expand to run microservices and containers
• Microservice and container platforms will run Hadoop
Data storage
• HDFS or S3 or Ceph or ?
Messaging
• Kafka alternatives will arise