20. Problem statement
When a customer uses a credit card to make a transaction, the vendor needs a fast answer to the question, “Is this payment fraudulent?”
Real-time stream processing
21. Fraud Detection (Is the payment fraudulent? YES/NO)
[Diagram: payment events (Payment 1, Payment 2, …, Payment 1001, Payment 10002) flow through a Kafka cluster (Broker 1, Broker 2), with Kafka Connect bringing in data from an external system; each payment receives a YES/NO fraud verdict.]
22. Is a payment a fraudulent one?
● Analysis and forensics on historical data to build the machine learning models.
● Use the machine learning models to predict fraud on live streams, for example:
○ Card velocity
○ Average spending in the last 60 minutes > 10 × the card’s historical average spending per 60 minutes (e.g., if a card has averaged $20 per hour historically, more than $200 spent in the last hour is flagged).
23. Problem statement: Fraud detection
● POS Transaction Data (Live Stream)
● User Information
● User Transaction History
● Fraud Location Estimator
24. Let’s build a real-time fraud detection system for credit cards
[Diagram: three-step pipeline — Step 1 ingests POS transactions, Step 2 brings in the Customer Profile, Step 3 runs the Fraud Detector.]
25. Kafka Core APIs
1. Producer API
2. Consumer API
3. Connect API
4. Streams API
26. Step 1: Produce messages using Producer API
[Diagram: the web application publishes each transaction to POS_TRANSACTION_TOPIC.]
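A minimal sketch of Step 1 with the Java Producer API (the topic name comes from the slide; the broker address, the card key, and the JSON payload are illustrative assumptions):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class PosTransactionProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Key by card number so all transactions of one card land in the same
            // partition and keep their ordering.
            String cardNumber = "4321-xxxx"; // hypothetical key
            String txnJson = "{\"card\":\"4321-xxxx\",\"amount\":42.50,\"merchant\":\"store-17\"}";
            producer.send(new ProducerRecord<>("POS_TRANSACTION_TOPIC", cardNumber, txnJson));
        }
    }
}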
28. Step 2: Capture data from an external data source
[Diagram: Kafka Connect copies the Customer Profile from the external data source into CUSTOMER_RECORD_TOPIC, alongside POS_TRANSACTION_TOPIC.]
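A minimal sketch of what the Connect configuration for Step 2 might look like, assuming the Confluent JDBC source connector and a hypothetical customer_profile table; the connection details, column name, and topic prefix are illustrative assumptions:

name=customer-profile-source
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
# assumed database connection (illustrative only)
connection.url=jdbc:postgresql://localhost:5432/crm
connection.user=connect
connection.password=secret
# hypothetical source table holding card-holder profiles
table.whitelist=customer_profile
# copy new rows as they are added, tracked by an incrementing id column
mode=incrementing
incrementing.column.name=id
# rows land in the topic CUSTOMER_RECORD_customer_profile
topic.prefix=CUSTOMER_RECORD_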
32. How to use the Kafka Streams API?
Just three steps:
1. Create one or more streams from Kafka topic(s).
2. Compose transformations on these streams.
3. Write transformed streams back to Kafka.
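The three steps in code, as a minimal sketch (topic names come from the slides; the broker address, the FRAUD_ALERT_TOPIC output topic, and the looksFraudulent placeholder are assumptions):

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class FraudDetectorApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "fraud-detector"); // names the app's consumer group and state
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // 1. Create a stream from a Kafka topic.
        KStream<String, String> transactions = builder.stream("POS_TRANSACTION_TOPIC");
        // 2. Compose transformations (placeholder predicate).
        KStream<String, String> suspicious = transactions.filter((card, txn) -> looksFraudulent(txn));
        // 3. Write the transformed stream back to Kafka.
        suspicious.to("FRAUD_ALERT_TOPIC"); // hypothetical output topic

        new KafkaStreams(builder.build(), props).start();
    }

    private static boolean looksFraudulent(String txnJson) {
        return txnJson.contains("\"amount\":9999"); // stand-in for a real model
    }
}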
33. Creating source streams from Kafka
● Input topics to KStream
○ Each app instance gets a subset of the partitions of the input streams.
○ Specify the serializer and deserializer (serdes).
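Building on the sketch above, the serdes can also be passed explicitly when the stream is created (Consumed comes from org.apache.kafka.streams.kstream; the String value serde for the JSON payload is an assumption):

KStream<String, String> transactions = builder.stream(
    "POS_TRANSACTION_TOPIC",
    Consumed.with(Serdes.String(), Serdes.String())); // key serde, value serde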
34. Transform a Stream
● Stateless transformation
○ Don’t require state for processing.
○ Don’t require a state store with the stream processor.
○ E.g. Branch, Filter, Inverse Filter, FlatMap, Peek, Map, etc. (see the sketch below)
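A few of these stateless operators applied to the transaction stream from the sketch above (the amountOf helper that parses the payload is a hypothetical stand-in):

// Filter: keep only high-value transactions.
KStream<String, String> large = transactions.filter((card, txn) -> amountOf(txn) > 1000.0);
// Inverse filter (filterNot): drop everything the predicate matches.
KStream<String, String> small = transactions.filterNot((card, txn) -> amountOf(txn) > 1000.0);
// Peek: side effect only (e.g. logging); the stream passes through unchanged.
large.peek((card, txn) -> System.out.println("large txn for " + card));
// MapValues: reshape the payload without touching the key (avoids repartitioning).
KStream<String, Double> amounts = large.mapValues(txn -> amountOf(txn));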
35. Transform a Stream
● Stateful transformation
○ Depends on state for processing inputs and producing outputs.
○ Requires a state store with the stream processor.
○ State stores are fault-tolerant.
■ Aggregating
■ Joining
■ Windowing
■ Applying custom processors and transformers
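For example, a stateful stream-table join that enriches each transaction with the customer profile captured in Step 2 (a sketch; keying both topics by card number is an assumption about the data model):

// Customer profiles as a changelog-backed table (latest value per key),
// kept in a fault-tolerant state store.
KTable<String, String> customers = builder.table("CUSTOMER_RECORD_TOPIC");
// Stream-table join: each transaction is enriched with its matching profile.
KStream<String, String> enriched = transactions.join(
    customers,
    (txn, profile) -> txn + "|" + profile); // hypothetical merge of the two payloads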
36. Aggregating
○ Group the records by either groupByKey or groupBy.
○ KGroupedStream or KGroupedTable can be aggregated via operations like reduce.
○ Aggregation can be performed on windowed or non-windowed data.
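Applied to the card-velocity rule from slide 22, a windowed aggregation of per-card spending might look like this sketch (amountOf is the same hypothetical helper; TimeWindows, Materialized, and Windowed come from the Streams DSL, Duration from java.time):

// Sum each card's spending over tumbling one-hour windows.
// The result is a windowed KTable backed by a fault-tolerant state store.
KTable<Windowed<String>, Double> hourlySpend = transactions
    .groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofMinutes(60)))
    .aggregate(
        () -> 0.0,                                    // initializer: empty window total
        (card, txn, total) -> total + amountOf(txn),  // adder: accumulate spending
        Materialized.with(Serdes.String(), Serdes.Double()));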
In a way, every business generates a stream of events. Retail has streams of orders and shipments; finance has streams of stock tickers; Bitcoin exchanges have streams of exchange rates; websites have streams of impressions and clicks.
In today’s world, business is becoming more digital, and unbounded, unordered, large-scale data sets are increasingly common in day-to-day business.
Every application generates data in the form of user clicks, logs, or transactions.
Every byte has a story to tell. A single click on Amazon, behind the scenes, determines which item you will see next.
So the data that applications generate can be thought of as streams of events.
- Request-response
- Batch processing
- Real-time processing
To keep up with the need to process data as it arrives, companies have implemented data pipelines like this one, and it is very messy: applications talk to each other through various messaging queues, and custom ETL scripts are written to move data between sources and destinations. This ad hoc fashion of connecting sources and destinations to build real-time processing applications is pretty chaotic.
In this talk we will see how Apache Kafka cleans up the mess by providing a distributed streaming platform. The idea is to have Kafka as the central nervous system of your architecture: it collects data from a variety of sources and makes it available, in real time and at large scale, to any number of destinations as they come up.
Here is how you go about building a streaming platform.
Kafka as a Messaging System
It acts as a publish-subscribe system where publishers publish messages and consumers read messages from the server.
It is not limited to a pub-sub system, though. It is also a storage system that stores streams of data, with persistence and strict ordering: data written to Kafka is written to disk and replicated for fault tolerance.
You can think of Kafka as a kind of special-purpose distributed filesystem dedicated to high-performance, low-latency commit log storage, replication, and propagation.
Distributed by design:
- Replication
- Fault tolerance
- Partitioning
- Elastic scaling
- Scalability of file systems
It isn't enough to just read, write, and store streams of data; the purpose is to enable real-time processing of streams.
In Kafka, a stream processor is anything that takes continuous streams of data from input topics, performs some processing on this input, and produces continual streams of data to output topics.
- The unit of data is the message, which has a key and a value.
- It is like a record in a database.
- To Kafka it is just a byte array; the content carries no meaning for Kafka.
- A message can have an optional bit of metadata, referred to as the key.
- For efficiency, messages are written in batches.
- A batch is just a collection of messages.
- Batching is a trade-off between latency and throughput.
- Messages are categorized into topics.
- The closest analogy is a database table or a folder.
- Topics are broken down into partitions.
- Partitions are how Kafka provides redundancy and scalability.
- Each partition can be hosted on a different server, meaning a single topic can be scaled horizontally across multiple servers for performance.
- Ordering of messages is not guaranteed across multiple partitions, but it is maintained within a single partition.
Offset
- Another bit of metadata: an integer that increases continuously.
- Kafka adds the offset to each message as it is produced; the offset is unique within a single partition.
Broker:
A single Kafka server is called a broker.
The broker receives messages from producers, assigns offsets to them, and commits the messages to storage on disk.
It also serves consumers, responding to fetch requests for partitions with the messages that have been committed to disk.
Cluster:
Kafka brokers are designed to operate as part of a cluster.
Producer:
- A producer creates new messages and publishes them to a Kafka topic.
Consumer
- Subscribes to one or more topics and reads messages in the order in which they were produced.
- Keeps track of which messages have already been consumed by storing the offset of the last consumed message.
- With this, a consumer can stop and restart without losing its place.
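A minimal sketch of such a consumer loop with the Java Consumer API (the broker address and group id are assumptions; offsets are committed automatically here):

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PosTransactionConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "fraud-audit"); // hypothetical consumer group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("POS_TRANSACTION_TOPIC"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> rec : records) {
                    // The offset is the consumer's place in the partition; Kafka
                    // commits it periodically, so a restart resumes from here.
                    System.out.printf("partition=%d offset=%d key=%s%n",
                            rec.partition(), rec.offset(), rec.key());
                }
            }
        }
    }
}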
Apache Kafka uses ZooKeeper to store metadata about the Kafka cluster, as well as consumer client details.
I think we have covered enough theory, so let’s build our own simple credit card fraud detection system.
This is a very basic example. The idea is that whenever a cardholder uses a card, a transaction event is generated.
1. Ingest the transaction stream into Kafka from the web application using the Kafka Producer API.
2. Capture cardholder information from the external data source using Kafka Connect.
3. Process the stream for fraud detection using the Kafka Streams API.
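Putting the three steps together, one possible Streams topology for the fraud detector (a sketch under the same assumptions as the snippets above; isFraudulent is a hypothetical stand-in for the machine learning model, and FRAUD_DECISION_TOPIC is an illustrative output topic):

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> txns = builder.stream("POS_TRANSACTION_TOPIC");   // Step 1: ingested transactions
KTable<String, String> profiles = builder.table("CUSTOMER_RECORD_TOPIC"); // Step 2: profiles via Kafka Connect

// Step 3: enrich each transaction with the profile, score it, and publish the verdict.
txns.join(profiles, (txn, profile) -> txn + "|" + profile)
    .mapValues(enriched -> isFraudulent(enriched) ? "YES" : "NO")
    .to("FRAUD_DECISION_TOPIC");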