This document discusses new approaches for fraud detection using Apache Kafka and KSQL. It introduces KSQL, an open-source streaming SQL engine for Apache Kafka. KSQL can be used to perform streaming ETL, anomaly detection, and event monitoring using SQL-like queries on streaming data. The document demonstrates how to run KSQL locally or in a client-server configuration, and how Arcadia Data provides a visualization layer on top of KSQL to enable visual analytics on streaming data.
New Approaches for Fraud Detection on Apache Kafka and KSQL
1. Arcadia Data. Proprietary and Confidential
New Approaches for Fraud Detection
on Apache Kafka and KSQL
September 20, 2018
2. Featured Speakers
Dale Kim, Sr. Director, Products/Solutions, Arcadia Data
Chong Yan, Solutions Architect, Confluent
3. Before We Begin Our Presentation
• If you have any questions along the way, please type them into the chat window.
• If you have audio problems, please chat us for help.
• A recording of this presentation will be sent to you in a few days.
• Please live tweet! @arcadiadata @confluentinc
4. First, Let’s Review the Goals of Fraud Detection
Primary goals include
• Reduce losses due to fraud
• Reduce rate of fraudulent activity versus legitimate activity
• Cost of fraud often goes beyond the cost of the transaction
• Retain high rate of approved transactions
• Reduce false positives (in this case, legitimate activities flagged as potentially fraudulent)
• Customer experience could be impacted by false positives
What can be done?
• Enable a larger user base for monitoring for fraud
• Identify risky transactions sooner (i.e., in real-time)
• Evolve “better” algorithms (beyond scope of this talk)
5. Example Fraud Signals
Fraud is largely about anomaly detection
• Outlier or unexpected events that signal potential fraud
• Anomalies across a population, not only for individuals
Examples
• Unusually large transactions
• Unusual timing of transactions
• Consistent groupings of transactions
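As one illustration of the "unusually large transactions" signal, here is a minimal Python sketch that flags amounts far above the rest of a population. It is not from the presentation: the function name, the sample amounts, and the choice of a modified z-score (median and median absolute deviation, with the commonly used 3.5 cutoff) are all assumptions made for the example.

```python
from statistics import median

def flag_large_transactions(amounts, threshold=3.5):
    """Return the indices of amounts that are unusually large for the population.

    Hypothetical helper: scores each amount with a modified z-score built
    from the median and the median absolute deviation (MAD), which are less
    distorted by the outliers themselves than mean and standard deviation.
    """
    med = median(amounts)
    mad = median(abs(a - med) for a in amounts)
    if mad == 0:
        return []  # no spread in the data, so nothing stands out
    flagged = []
    for i, amount in enumerate(amounts):
        score = 0.6745 * (amount - med) / mad
        if score > threshold:  # only unusually LARGE amounts are flagged
            flagged.append(i)
    return flagged

# Made-up sample data: one amount is far above the others.
amounts = [20.0, 35.5, 18.0, 42.0, 25.0, 30.0, 5000.0, 27.5]
print(flag_large_transactions(amounts))  # [6]
```

A real system would score amounts per customer or merchant segment and on live data; this only shows the population-level idea.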
6. What about Trends/Patterns?
• Does a rise in transactions in the past 30 minutes look suspicious?
• Can this be captured in a batch environment?
• Fraud detection goes beyond just a fraud team
• Analyzing marketing data might lead to insights in fraud
7. Key Requirements for Fraud Detection
• Your approach should not be limited to static data and only known patterns and signatures
• Fraud detection must be holistic across all data, and requires exploration
• Your system should provide BI-style dashboards and reports
• Goals must be tied in with customer acquisition and revenue strategies
13. Traditional Streaming Architectures for Transactions
[Diagram 1: Data Sources; Kafka Cluster (Source Topics); Stream Processing System (Job Version N, Job Version N+1); Serving DB (Output Table N, Output Table N+1); Analytics App (Queries/Responses, Future Queries/Responses)]
[Diagram 2: Data Sources; Kafka Cluster (Source Topics); Analytics App (Stream Processing Framework, Custom End User Interface); Responses]
14. Native Streaming Visualizations Architecture
[Diagram: Data Sources; Kafka Cluster (Source Topics); KSQL Cluster (SQL engine); Visual Analytics / BI App (Queries/Responses)]
18. Way More Than Messaging
True Storage
Real-Time Processing
Scalability
Messaging done right.
19. Kafka is much more than messaging
Kafka is a blend of messaging, stream processing, ETL and modern database designs built around a distributed log.
[Diagram: Pub/Sub Messaging (IBM MQ, TIBCO, RabbitMQ); ETL/Connectors (Mulesoft, Talend, Informatica); Stream Processing (Spark, Flink, Beam)]
+ Distributed clustered storage
+ Streaming platform
+ Exactly once
+ Designed for the cloud
+ Inter-DC replication
+ Schema evolution
20. We are challenging old assumptions...
Big Data was: The More the Better
[Chart: Value of Data vs. Volume of Data]
Stream Data is: The Faster the Better
[Chart: Value of Data vs. Age of Data]
22. Confidential
KSQL: The streaming SQL engine for Apache Kafka from Confluent
KSQL is the simplest way to process streams of data in real time
✓ All you need is SQL
✓ No separate processing cluster required
✓ Powered by Kafka: elastic, scalable, distributed, battle tested
✓ Perfect for streaming ETL, anomaly detection, event monitoring and more
✓ Part of Confluent Open Source: https://github.com/confluentinc/ksql
CREATE TABLE possible_fraud AS
  SELECT card_number, count(*)
  FROM authorization_attempts
  WINDOW TUMBLING (SIZE 5 SECONDS)
  GROUP BY card_number
  HAVING count(*) > 3;
CREATE STREAM vip_actions AS
  SELECT userid, page, action
  FROM clickstream c
  LEFT JOIN users u
  ON c.userid = u.user_id
  WHERE u.level = 'Platinum';
23. KSQL: The streaming SQL engine for Apache Kafka from Confluent
• Enables stream processing with SQL-like syntax
• The simplest way to process streams of data in real time
• Powered by Kafka: scalable, distributed, battle tested
• All you need is Kafka: no complex deployments of bespoke systems for stream processing
ksql>
24. KSQL: The simplest way to do stream processing
-- A windowed aggregation produces a TABLE, so this is CREATE TABLE, not CREATE STREAM
CREATE TABLE possible_fraud AS
  SELECT card_number, count(*)
  FROM authorization_attempts
  WINDOW TUMBLING (SIZE 5 SECONDS)
  GROUP BY card_number
  HAVING count(*) > 3;
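To make the semantics of the possible_fraud query concrete, the following Python sketch reproduces its logic over a small in-memory list: bucket each authorization attempt into a 5-second tumbling window, count per card number, and keep counts above 3. This is a batch illustration only, not how KSQL executes the query (KSQL updates the result continuously as events arrive); the function name, the (timestamp, card) tuple shape, and the sample data are assumptions.

```python
from collections import Counter

WINDOW_MS = 5_000  # mirrors WINDOW TUMBLING (SIZE 5 SECONDS)
THRESHOLD = 3      # mirrors HAVING count(*) > 3

def possible_fraud(attempts):
    """attempts: iterable of (timestamp_ms, card_number) pairs (assumed shape).

    Returns {(card_number, window_start_ms): count} for every card with
    more than THRESHOLD attempts inside a single tumbling window.
    """
    counts = Counter()
    for ts, card in attempts:
        window_start = ts - (ts % WINDOW_MS)  # align to the window boundary
        counts[(card, window_start)] += 1     # GROUP BY card_number, per window
    return {key: n for key, n in counts.items() if n > THRESHOLD}

# Made-up events: card "4111" is tried four times within the first window.
attempts = [
    (1_000, "4111"), (1_500, "4111"), (2_000, "4111"), (4_900, "4111"),
    (5_100, "4111"),  # lands in the next window, so it is counted separately
    (1_200, "5500"), (3_000, "5500"),
]
print(possible_fraud(attempts))  # {('4111', 0): 4}
```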
25. KSQL: The simplest way to do stream processing
1. Streaming ETL
CREATE STREAM vip_actions AS
  SELECT userid, page, action
  FROM clickstream c
  LEFT JOIN users u ON c.userid = u.user_id
  WHERE u.level = 'Platinum';
2. Anomaly detection
CREATE TABLE possible_fraud AS
  SELECT card_number, count(*)
  FROM authorization_attempts
  WINDOW TUMBLING (SIZE 5 SECONDS)
  GROUP BY card_number
  HAVING count(*) > 3;
3. Monitoring
CREATE TABLE error_counts AS
  SELECT error_code, count(*)
  FROM monitoring_stream
  WINDOW TUMBLING (SIZE 1 MINUTE)
  WHERE type = 'ERROR'
  GROUP BY error_code;
26. KSQL Concepts
● STREAM and TABLE as first-class citizens
• Interpretations of topic content
● STREAM – data in motion
● TABLE – collected state of a stream
• One record per key (per window)
• Current values (compacted topic) ← Not yet in KSQL
● STREAM – TABLE Joins
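The STREAM/TABLE distinction above can be pictured with a small Python sketch: a table is what remains when a stream of keyed events is folded into the latest value per key, and a stream-table join looks each stream event up in that state. This is a conceptual illustration under assumed names and data, not KSQL's implementation.

```python
def materialize_table(stream):
    """Fold a stream of (key, value) events into a table: latest value per key."""
    table = {}
    for key, value in stream:
        table[key] = value  # a later event for the same key replaces the earlier one
    return table

def stream_table_join(clicks, users):
    """LEFT JOIN each (userid, page) click against the users table.

    Clicks with no matching user are kept, with None as the joined value.
    """
    for userid, page in clicks:
        yield userid, page, users.get(userid)

# Made-up data: alice's level changes over time; only the latest value survives.
user_events = [("alice", "Silver"), ("bob", "Gold"), ("alice", "Platinum")]
users = materialize_table(user_events)
print(users)  # {'alice': 'Platinum', 'bob': 'Gold'}

clicks = [("alice", "/home"), ("carol", "/cart")]
print(list(stream_table_join(clicks, users)))
# [('alice', '/home', 'Platinum'), ('carol', '/cart', None)]
```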
27. Window Aggregations
Three types supported (same as KStreams):
● TUMBLING: Fixed-size, non-overlapping, gapless windows
• SELECT ip, count(*) AS hits
FROM clickstream WINDOW TUMBLING (size 1 minute)
GROUP BY ip;
● HOPPING: Fixed-size, overlapping windows
• SELECT ip, SUM(bytes) AS bytes_per_ip_and_bucket
FROM clickstream WINDOW HOPPING ( size 20 second, advance by 5 second)
GROUP BY ip;
● SESSION: Dynamically-sized, non-overlapping, data-driven window
• SELECT ip, SUM(bytes) AS bytes_per_ip
FROM clickstream WINDOW SESSION (20 second)
GROUP BY ip;
More: http://docs.confluent.io/current/streams/developer-guide.html#windowing
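The three window types differ in how an event timestamp maps to windows. The Python sketch below shows one way to compute that mapping. It assumes integer timestamps in a single unit, windows aligned to time 0, and sessions that merge events no more than the gap apart; the function names are hypothetical and this is not the Kafka Streams implementation.

```python
def tumbling_window(ts, size):
    """Start of the one fixed-size window [start, start + size) containing ts."""
    return ts - ts % size

def hopping_windows(ts, size, advance):
    """Starts of every overlapping window [start, start + size) containing ts."""
    start = ts - ts % advance  # latest window start that is <= ts
    starts = []
    while start > ts - size and start >= 0:
        starts.append(start)
        start -= advance       # step back to the previous overlapping window
    return sorted(starts)

def session_windows(timestamps, gap):
    """Group timestamps into (first, last) sessions split by inactivity > gap."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)  # still within the inactivity gap
        else:
            sessions.append([ts])    # gap exceeded: a new session begins
    return [(s[0], s[-1]) for s in sessions]

print(tumbling_window(65, 60))                     # 60
print(hopping_windows(12, 20, 5))                  # [0, 5, 10]
print(session_windows([0, 10, 15, 50, 60], 20))    # [(0, 15), (50, 60)]
```

With size 20 and advance 5, the event at time 12 is counted in three windows, which is why hopping aggregates can report the same event more than once.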
28. 1) How to run KSQL: Standalone aka “local mode”
• Starts a CLI, an engine and a REST server all in the same JVM
• Ideal for laptop development
• Start with default settings:
> bin/ksql-cli local
• Or with customized settings:
> bin/ksql-cli local --properties-file foo/bar/ksql.properties
29. 2) How to run KSQL: Client-Server
• Start any number of server nodes
> bin/ksql-server-start
• Start any number of CLIs and specify “remote” server address
> bin/ksql-cli remote http://myserver:8090
• All running engines share the processing load
• Technically, instances of the same Kafka Streams applications
• Scale up/down without restart
30. 3) How to run KSQL: As an application
• Start any number of engine instances
• Pass a file of KSQL statements to execute
> bin/ksql-node query-file=foo/bar.sql
• Ideal for streaming ETL application deployment
• Version control your queries and transformations as code
• All running engines share the processing load
• Technically, instances of the same Kafka Streams applications
• Scale up/down without restart
33. Try Out the Software Yourself
Go to: https://www.arcadiadata.com/product/streaming-visualizations
34. San Francisco – October 16-17, 2018
Presented by
Kafka Community Discount Code
KS18COMM25 for 25% off
www.kafka-summit.org
35. Thank You!
Be sure to also visit:
Try Arcadia Instant (free download)
https://www.arcadiadata.com/instant
Get started with Arcadia Data on KSQL
https://www.arcadiadata.com/resources/knowledge-base/
Read Arcadia Data blog posts
https://www.arcadiadata.com/blog
@arcadiadata
Try Confluent KSQL (free download)
https://cnfl.io/ksql
Sign up for confluentcommunity
https://cnfl.io/slack #ksql
Read Confluent blog posts
https://cnfl.io/blog
@confluentinc