The workshop implements an innovative fraud detection solution as a PoC for a bank that provides payment processing services to merchant customers across the globe, helping them save costs by applying machine learning and advanced analytics to detect fraudulent transactions. Because the customers are worldwide, the right solution should minimize the latency they experience by distributing as much of the solution as possible, as close as possible, to the regions in which they use the service. The workshop designs a data pipeline that leverages Azure Cosmos DB both for scalable ingest of streaming data and for globally distributed serving of pre-scored data and machine learning models. Cosmos DB's major advantage when operating at global scale is high concurrency with low latency and predictable results.
This combination is unique to Cosmos DB and ideal for the bank's needs. The solution leverages the Cosmos DB change feed in concert with Azure Databricks Delta and Spark to enable a modern data warehouse that supports risk-reduction solutions scoring transactions for fraud in both an offline, batch approach and a near real-time, request/response approach. https://github.com/Microsoft/MCW-Cosmos-DB-Real-Time-Advanced-Analytics Takeaway: how to leverage Azure Cosmos DB and Azure Databricks, along with Spark ML, to build innovative advanced analytics pipelines.
2. Sri Chintala, Microsoft
Cosmos DB Real-time Advanced Analytics Workshop
#UnifiedDataAnalytics #SparkAISummit
3. Today’s customer scenario
• Woodgrove Bank provides payment processing services for commerce.
• Want to build a PoC of an innovative online fraud detection solution.
• Goal: Monitor fraud in real time across millions of transactions to prevent financial loss and detect widespread attacks.
4. Part 1: Customer Scenario
• Woodgrove Bank's customers – end merchants – are all around the world.
• The right solution would minimize the latency customers experience by distributing the solution as close as possible to the regions in which they use the service.
5. Part 1: Customer scenario
• Woodgrove has decades' worth of historical transaction data, including transactions identified as fraudulent.
• Data is in tabular format and can be exported to CSVs.
• The analysts are very interested in the recent notebook-driven approach to data science and data engineering tasks.
• They would prefer a solution that features notebooks to explore and prepare data, build models, and define the logic for scheduled processing.
6. Part 1: Customer needs
• Provide fraud detection services to merchant customers, using incoming payment transaction data to provide early warning of fraudulent activity.
• Schedule offline scoring of "suspicious activity" using the trained model, and make the results globally available.
• Store data from streaming sources in long-term storage without interfering with read jobs.
• Use a standard platform that supports near-term data pipeline needs and serves as a long-term standard for data science, data engineering, and development.
8. Part 2: Design the solution (10 min)
• Design a solution and prepare to present it to the target customer audience in a chalk-talk format.
13. Preferred solution – Data Ingest
• Payment transactions can be ingested in real time using Event Hubs or Azure Cosmos DB.
• Factors to consider:
  • rate of flow (how many transactions/second)
  • data source and compatibility
  • level of effort to implement
  • long-term storage needs
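The "rate of flow" factor above drives how much throughput to provision. As a back-of-envelope sketch (the transaction rate, document size, and per-write RU cost below are illustrative assumptions, not Woodgrove's real numbers):

```python
# Back-of-envelope sizing for the "rate of flow" factor.
# All numbers are illustrative assumptions.
TX_PER_SECOND = 5_000   # assumed peak ingest rate
RU_PER_1KB_WRITE = 5    # rough Cosmos DB cost for a ~1 KB insert
HEADROOM = 1.2          # 20% buffer for spikes and retries

def required_throughput(tx_per_second: int, ru_per_write: float, headroom: float) -> int:
    """Estimate the RU/s to provision for a steady write workload."""
    return int(tx_per_second * ru_per_write * headroom)

if __name__ == "__main__":
    print(required_throughput(TX_PER_SECOND, RU_PER_1KB_WRITE, HEADROOM))  # 30000
```

The same arithmetic works in reverse to sanity-check whether a provisioned RU/s budget can absorb an expected transaction rate.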
14. Preferred solution – Data Ingest
• Cosmos DB:
  • Optimized for high write throughput
  • Provides streaming through its change feed
  • TTL (time to live) – automatic document expiration saves on storage cost
• Event Hubs:
  • Data streams through and can be persisted (via Capture) to Blob storage or ADLS
• Both guarantee event ordering per partition, so how you partition your data matters with either service.
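The per-partition ordering guarantee can be illustrated with a toy model (the merchant IDs are made up, and hashing by key stands in for how both services route events): events sharing a partition key land in the same partition and keep their arrival order, while no ordering is promised across keys.

```python
from collections import defaultdict
from itertools import count

# Toy model of per-partition ordering: events with the same partition key
# (here, merchant ID) go to the same partition and keep arrival order.
# Ordering across different keys is not guaranteed.
seq = count()
events = [{"merchant": m, "seq": next(seq)}
          for m in ["A", "B", "A", "C", "B", "A"]]

partitions = defaultdict(list)
for e in events:
    partitions[e["merchant"]].append(e)  # same key -> same partition

# Within each partition, sequence numbers are strictly increasing.
for key, evs in partitions.items():
    seqs = [e["seq"] for e in evs]
    assert seqs == sorted(seqs)

print({k: [e["seq"] for e in v] for k, v in partitions.items()})
# {'A': [0, 2, 5], 'B': [1, 4], 'C': [3]}
```

This is why the partition key choice matters: ordering-sensitive processing (e.g., per-merchant fraud patterns) only works if related events share a key.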
15. Preferred solution – Data Ingest
• Cosmos DB is likely easier for Woodgrove to integrate because they are already writing payment transactions to a database.
• Cosmos DB multi-master accepts writes from any region (failover automatically redirects to the next available region).
• Event Hubs requires multiple instances in different geographies (failover requires more planning).
• Recommendation: Cosmos DB – think of it as a "persistent event store".
16. Preferred solution – Data pipeline processing
• Azure Databricks:
  • Managed Spark environment that can process streaming and batch data
  • Supports data science, data engineering, and development needs
• Features it provides on top of standard Apache Spark include:
  • AAD integration and RBAC
  • Collaborative features such as workspaces and Git integration
  • Scheduled jobs for automatic notebook/library execution
  • Integration with Azure Key Vault
  • Training and evaluating machine learning models at scale
17. Preferred solution – Data pipeline processing
• Azure Databricks can connect to both Event Hubs and Cosmos DB, using Spark connectors for each.
• Use Spark Structured Streaming to process real-time payment transactions into Databricks Delta tables.
• Be sure to set a checkpoint directory on your streams; this allows stream processing to restart if the job is stopped at any point.
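The streaming step could be sketched in a Databricks notebook roughly as follows. This is a hedged sketch, not runnable outside a cluster: `spark` and `dbutils` are Databricks notebook globals, the endpoint, database, collection, and paths are placeholders, and the connector option names follow the azure-cosmosdb-spark library used in the workshop and may differ in newer connector versions.

```python
# Databricks notebook sketch (assumed environment): read the Cosmos DB
# change feed with Structured Streaming and append into a Delta table.
# Endpoint, secret scope, database, collection, and paths are placeholders.
cosmos_config = {
    "Endpoint": "https://<account>.documents.azure.com:443/",
    "Masterkey": dbutils.secrets.get("keyvault-scope", "cosmos-key"),
    "Database": "Woodgrove",
    "Collection": "transactions",
    "ReadChangeFeed": "true",
    "ChangeFeedStartFromTheBeginning": "true",
    "ChangeFeedCheckpointLocation": "/tmp/changefeed-checkpoint",
}

stream = (spark.readStream
          .format("com.microsoft.azure.cosmosdb.spark")
          .options(**cosmos_config)
          .load())

# The checkpointLocation on the write side is what lets a stopped job
# resume without reprocessing or losing events.
(stream.writeStream
 .format("delta")
 .option("checkpointLocation", "/delta/transactions/_checkpoints")
 .outputMode("append")
 .start("/delta/transactions"))
```

Note that the checkpoint directory must be unique per stream; reusing one across different queries corrupts progress tracking.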
18. Preferred solution – Data pipeline processing
• Store secrets such as account keys and connection strings centrally in Azure Key Vault.
• Set Key Vault as the source for secret scopes in Azure Databricks. Secrets are [REDACTED].
19. Preferred solution – Data pipeline processing
• Databricks Delta tables are Spark tables with built-in reliability and performance optimizations.
• They support batch and streaming workloads with additional features:
  • ACID transactions: multiple writers can simultaneously modify data without interfering with jobs reading the data set
  • DELETEs/UPDATEs/UPSERTs
  • Automatic file management: data access speeds up by organizing data into large files that can be read efficiently
  • Statistics and data skipping: reads are 10–100x faster when statistics are tracked about the data in each file, allowing irrelevant files to be skipped
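The UPSERT capability above can be sketched with the Delta Lake Python API. This is a hedged sketch assuming a Spark session with the delta-spark package (as on a Databricks cluster); the table paths and the `transactionId` column are illustrative, not from the workshop.

```python
# Sketch of a Delta MERGE (upsert), assuming a Spark session `spark` with
# delta-spark available; paths and column names are illustrative.
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/delta/transactions")
updates = spark.read.format("delta").load("/delta/transactions_staging")

# Matching rows are updated in place; new rows are inserted. Concurrent
# readers see either the old or the new snapshot, never a partial write
# (the ACID guarantee described above).
(target.alias("t")
 .merge(updates.alias("u"), "t.transactionId = u.transactionId")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())
```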
24. Preferred solution – Model training & deployment
• Azure Databricks supports machine learning training at scale.
• Train the model using historical payment transaction data.
27. Preferred solution – Model training & deployment
• Use Azure Machine Learning service (AML) to:
  • Register the trained model
  • Deploy it to an Azure Kubernetes Service (AKS) cluster for easy web accessibility and high availability
• For scheduled batch scoring, access the model from a notebook and write results to Cosmos DB via the Cosmos DB Spark connector.
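The register-and-deploy flow could look roughly like the following with the Azure ML Python SDK (v1, the SDK generation current at the time of this workshop). A hedged sketch only: workspace config, model artifact, entry script, environment, and compute target names are placeholder assumptions, and it requires an Azure subscription to actually run.

```python
# Sketch of AML model registration and AKS deployment (Azure ML SDK v1).
# All names and paths are placeholders, not from the workshop materials.
from azureml.core import Environment, Model, Workspace
from azureml.core.model import InferenceConfig
from azureml.core.webservice import AksWebservice

ws = Workspace.from_config()  # reads a local config.json (placeholder)

# Register the model trained in Databricks (hypothetical artifact path).
model = Model.register(workspace=ws,
                       model_path="fraud_model.mml",
                       model_name="fraud-scoring")

# score.py and the environment are hypothetical scoring assets.
inference_config = InferenceConfig(entry_script="score.py",
                                   environment=Environment(name="fraud-env"))

# Deploy to an existing AKS cluster for a highly available web endpoint.
service = Model.deploy(ws, "fraud-scoring-svc", [model],
                       inference_config=inference_config,
                       deployment_config=AksWebservice.deploy_configuration(),
                       deployment_target=ws.compute_targets["aks-cluster"])
service.wait_for_deployment(show_output=True)
```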
30. Preferred solution – Serving pre-scored data
Use Cosmos DB for storing offline-scored suspicious transaction data globally.
• Add applicable customer regions
• Estimate the RU/s needed – Cosmos DB can scale up and down to handle the workload
• Consistency: Session consistency
• Partition key: choose one that gives an even distribution of request volume and storage
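One way to vet a candidate partition key before committing to it is to hash sample key values the way a partitioned store would and inspect how requests spread. A sketch with synthetic merchant IDs and request counts (the partition count and hashing scheme here are illustrative, not Cosmos DB internals):

```python
import hashlib
from collections import Counter

# Sketch of the partition-key check: hash candidate key values and see
# whether request volume spreads evenly. Data is synthetic.
PHYSICAL_PARTITIONS = 4

def partition_of(key: str) -> int:
    """Map a key to a partition via a stable hash (illustrative scheme)."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % PHYSICAL_PARTITIONS

requests = [f"merchant-{i % 50}" for i in range(10_000)]  # 50 distinct keys
load = Counter(partition_of(k) for k in requests)

# A good key keeps the hottest partition close to the mean load (skew ~1.0).
mean = len(requests) / PHYSICAL_PARTITIONS
skew = max(load.values()) / mean
print(sorted(load.items()), round(skew, 2))
```

A key with too few distinct values, or one dominated by a handful of hot merchants, shows up immediately as a high skew number.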
33. Preferred solution – Long-term storage
• Use Azure Data Lake Storage Gen2 (ADLS Gen2) as the underlying long-term file store for Databricks Delta tables.
• Databricks Delta can compact small files into larger files up to 1 GB in size using the OPTIMIZE operator. This can improve query performance over time.
• Define file paths in ADLS for query, dimension, and summary tables, and point to those paths when saving to Delta.
• Delta tables can be accessed by Power BI through a JDBC connector.
36. Preferred solution – Dashboards & Reporting
• Connect to Databricks Delta tables from Power BI to allow analysts to build reports and dashboards.
• The connection is made using a JDBC connection string to an Azure Databricks cluster; querying the tables is similar to querying a traditional relational database.
• Data scientists and data engineers can use Azure Databricks notebooks to craft complex queries and data visualizations.
37. Preferred solution – Dashboards & Reporting
• A more cost-effective option for serving summary data to business analysts in Power BI is Azure Analysis Services.
• It eliminates the need to keep a dedicated Databricks cluster running at all times for reporting and analysis.
• Data is stored in a tabular semantic data model.
• Write to it during stream processing (using rolling aggregates), or schedule batch writes via a Databricks job or Azure Data Factory (ADF).
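The "rolling aggregates" idea can be sketched in plain Python: bucket transactions into one-minute windows and keep a running count and total per window, which is the shape of summary row a stream job could push into the tabular model. Timestamps and amounts below are made up.

```python
from collections import defaultdict
from datetime import datetime

# Sketch of rolling aggregates: bucket transactions into one-minute
# windows with a running count and total. Data is synthetic.
transactions = [
    ("2019-10-15T09:00:12", 25.0),
    ("2019-10-15T09:00:47", 40.0),
    ("2019-10-15T09:01:05", 10.0),
]

windows = defaultdict(lambda: {"count": 0, "total": 0.0})
for ts, amount in transactions:
    minute = datetime.fromisoformat(ts).strftime("%Y-%m-%dT%H:%M")
    w = windows[minute]
    w["count"] += 1
    w["total"] += amount

print(dict(windows))
# {'2019-10-15T09:00': {'count': 2, 'total': 65.0},
#  '2019-10-15T09:01': {'count': 1, 'total': 10.0}}
```

In a real stream job the same aggregation would run incrementally per micro-batch, so only small summary rows (not raw transactions) reach the reporting layer.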