2. A little bit of history
● Data resided within operational databases.
● Demand grew for data analysis on a centralized warehouse dedicated to this purpose.
● ETL processes emerged.
● ETL - Extract Transform Load
3. Changes in the ETL process
● Data integration - integrating data between sources and destinations
● Single-server databases have been replaced by distributed data platforms
● The rise of big data forced ETL tools to handle more than just databases and data warehouses
● Today data comes from a wide range of sources: logs, sensors, metrics
● Demanding a change in approach toward continuous processing
● Processing needs to handle high throughput with low latency
4. Traditional ETL drawbacks
● Originally designed for the "niche" problem of connecting operational databases and data warehouses in a "batch" fashion
● Time consuming and resource intensive
● The "T" in Transform really stood for data cleansing rather than complex transformations which could include data enrichment
● Need for a global schema
5. It gets even messier...
● EAI - Enterprise Applications Integration
● Rising need for real-time integration between the different applications in our architecture.
● Used to be solved by traditional enterprise message queues
● Worked well at small scale but not at large scale
● Resulting in an inability to handle the volume and variety of modern data such as logs, sensors, real-time transactions, etc.
7. So what are we looking for?
● Ability to process high volumes of highly diverse data
● A real-time model from the get-go which supports continuous processing
● Transition to an "event-centric" paradigm (pub/sub)
● A forward-compatible data architecture: the ability to add multiple destinations that process the data differently
● Low latency
8. Keep looking…
● To enable forward compatibility, the "T" in ETL first needs to be redefined.
● Move from data cleansing to data transformations
● Moreover, transformations such as data enrichment should not run on the DWH but rather as continuous transformations on the streaming platform
● To achieve that we obviously need join, aggregation and windowing abilities
● So to summarize: we need to extract clean data once, transform it in many ways, and then load it into different destinations (sketched right after this list)
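A minimal sketch of that "extract once, transform in many ways, load to multiple destinations" idea, written with the Kafka Streams DSL that the later slides introduce; the topic names and the enrichment step are hypothetical placeholders, and default String serdes are assumed:

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;

StreamsBuilder builder = new StreamsBuilder();

// Extract once: a single stream of already-cleansed events
KStream<String, String> events = builder.stream("clean-events");

// Transform it one way for the warehouse loader (placeholder enrichment)...
events.mapValues(value -> value + ",region=EU")
      .to("events-for-dwh");

// ...and another way for a real-time alerting consumer
events.filter((key, value) -> value.contains("status=error"))
      .to("events-alerts");

Adding yet another destination later is just another branch off the same events stream, which is exactly the forward compatibility asked for above.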
9. Stream Processing
● Stream processing is really all about transformations on a continuous stream of data
● Transformations come in the form of filters, maps, joins and aggregations (a short join example follows this list)
● We can divide stream processing into 2 paradigms: Real Time MapReduce and
Event Driven Micro Services
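As one example of these transformations, a stream-table join can continuously enrich each event with the latest value from a table; a rough sketch with hypothetical topic names and plain strings standing in for real payloads (both topics are assumed to be keyed by the same user id):

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

StreamsBuilder builder = new StreamsBuilder();

// A continuous stream of purchases, keyed by user id
KStream<String, String> purchases = builder.stream("purchases");

// A table of user profiles, keyed by the same user id
KTable<String, String> profiles = builder.table("user-profiles");

// Stream-table join: every purchase is enriched with the current profile
purchases.join(profiles, (purchase, profile) -> purchase + " | " + profile)
         .to("purchases-enriched");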
10. Real Time MapReduce
● MapReduce has been with us for quite a long time
● The main challenge is fitting MapReduce to modern needs by building a real-time, continuous MapReduce layer, for example:
11. Real Time MapReduce
● Processing jobs run on a centralized, dedicated cluster
● Each platform requires its own custom packaging and deployment procedure
● Most suitable for long-running analytics on a large multi-tenant cluster, or for machine/deep learning purposes
● Tight coupling between dev teams and devops teams
● Business logic is split across two layers, since some of the logic is expressed in a processing job that must be deployed on the real-time MapReduce cluster
● At large scale this can cause a lot of friction
12. Event Driven Micro Services
● This paradigm correlates with the event-centric paradigm, where your streaming platform acts as a central nervous system
● The microservices layer also acts as the stream processing units
● Just Kafka and your app, via an embedded library
● Input and output are always streams
15. Kafka Streams Application Overview
● An application which uses the Kafka Streams API is just an ordinary Java application
● Making packaging and deployment as easy as it should be (a minimal example follows this list)
● Built on top of Kafka's fault-tolerance capabilities
● Streams are partitioned and replicated
● Stream tasks are also fault tolerant: if a task runs on a machine that fails, the Streams platform will automatically restart the task on one of the remaining instances
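A minimal sketch of such an ordinary Java application; the application id, broker address, topic names and the upper-casing logic are all hypothetical placeholders:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class UppercaseApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // The topology: read a topic, transform the values, write to another topic
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("input-topic");
        input.mapValues(value -> value.toUpperCase()).to("output-topic");

        // Just a plain JVM process: package and run it like any other Java app
        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}

Starting another copy of the same jar with the same application id is all it takes to scale out, which is what the next slide describes.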
16. Kafka Streams Application Overview
● Ability to run multiple instances of a Streams application
● Instances run independently and automatically discover each other
● Ability to elastically add or remove app instances during live processing
● When an instance fails, the other instances will take over its work
17. Stream Processors
● Stream processors are nodes in the processor topology
● They represent computational steps in the topology, which basically means they are responsible for the data transformations
● Transformations include: map, filter, aggregations, joins and windowing
● These processors come out of the box with the Streams API
● Processors receive data records from upstream processors, apply a transformation, and send the records to downstream processors
18. Stream Processors
● 2 special types of processors:
○ Source Processor - This special type of processor produces an input stream for the topology by consuming records from one or more Kafka topics. This stream is then forwarded to one or more downstream processors. Naturally, this processor sits at the root of the topology, so it is not connected to any upstream processors.
○ Sink Processor - This special processor doesn't have any downstream processors; it sends its output stream to a specified Kafka topic (see the topology sketch after this list).
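A rough Processor API sketch wiring all three node types into one topology; the names, topics and the upper-casing processor are hypothetical, and it assumes a Kafka Streams version that ships the org.apache.kafka.streams.processor.api interfaces:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;

Topology topology = new Topology();

// Source processor: roots the topology by consuming records from a Kafka topic
topology.addSource("Source",
    Serdes.String().deserializer(), Serdes.String().deserializer(), "input-topic");

// Regular stream processor: transforms each record and forwards it downstream
topology.addProcessor("Uppercase",
    () -> new Processor<String, String, String, String>() {
        private ProcessorContext<String, String> context;
        public void init(ProcessorContext<String, String> context) { this.context = context; }
        public void process(Record<String, String> record) {
            context.forward(record.withValue(record.value().toUpperCase()));
        }
    },
    "Source");

// Sink processor: has no downstream processors, writes its input to a target topic
topology.addSink("Sink", "output-topic",
    Serdes.String().serializer(), Serdes.String().serializer(), "Uppercase");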
20. State Stores
● State stores are used to store and query data
● They are really the backbone that enables "stateful stream processing"
● The Kafka Streams DSL automatically creates and uses state stores whenever they are required for stateful operations such as joins, aggregations and windowing
● State stores can be backed by a RocksDB database or an in-memory hash map
● Kafka Streams offers robust fault tolerance and recovery for local state stores
● Each state store is replicated by a changelog topic
● These changelog topics are also partitioned, so each task that accesses a store only has to restore its own changelog partition (a small aggregation sketch follows this list)
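A small sketch of the DSL creating a state store behind the scenes: counting clicks per page is a stateful aggregation, so it gets a local store (named here, hypothetically, "clicks-store") plus its changelog topic; the topic names are placeholders:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.state.KeyValueStore;

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> clicks = builder.stream("page-clicks");

// groupByKey + count is stateful: the DSL backs it with a local state store
// (RocksDB by default) and a replicated, compacted changelog topic
KTable<String, Long> clicksPerPage = clicks
    .groupByKey()
    .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("clicks-store"));

// Publish the continuously updated counts to a downstream topic
clicksPerPage.toStream().to("page-click-counts",
    Produced.with(Serdes.String(), Serdes.Long()));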
21. Fault Tolerance
● Kafka Streams is embedded with fault-tolerance capabilities which are integrated with Kafka itself
● Kafka streams are partitioned and replicated, just as Kafka topics are
● Stream tasks are monitored internally, so if a task runs on a machine that fails, Kafka Streams will automatically detect it and restart the task on another app instance
● As mentioned before, state stores are also fault tolerant: a replicated changelog is maintained for each store, tracking its state updates
● Actually these changelogs are also partitioned, so any task that requires state can rebuild it by replaying only its own changelog partition
22. Fault Tolerance
● Log compaction is enabled on the state stores' replicated changelogs, which prevents these changelog topics from growing indefinitely
23. Threading Model
● Kafka Streams allows configuring the number of threads that the library can use to parallelize processing (a configuration sketch follows this list)
● Each thread can run one or more stream tasks
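A tiny configuration sketch for that thread count; the value 4 is just an illustrative choice:

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
// One KafkaStreams instance runs this many processing threads;
// the topology's stream tasks are distributed across them
props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);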
ETL - extract data from databases, transform into destination’s warehouse schema, load into central data warehouse.
b2 - analysis ran on a separate data warehouse so as not to affect operational DB performance, which resulted in analysis after a meaningful time gap instead of in "real time"
b1 - there is also a need for EAI, Enterprise Application Integration (referenced in a couple of slides)
b3 - data enrichment can really only be implemented with joins and aggregations.