WIFI SSID:Spark+AISummit | Password: UnifiedDataAnalytics
Sri Chintala, Microsoft
Cosmos DB Real-time Advanced Analytics Workshop
#UnifiedDataAnalytics #SparkAISummit
Today’s customer scenario
• Woodgrove Bank provides payment processing services for commerce.
• They want to build a PoC of an innovative online fraud detection solution.
• Goal: monitor fraud in real time across millions of transactions to prevent financial loss and detect widespread attacks.
Part 1: Customer scenario
• Woodgrove Bank’s customers – end merchants – are all around the world.
• The right solution would minimize the latency merchants experience by distributing the solution as close as possible to the regions where customers operate.
Part 1: Customer scenario
• They have decades’ worth of historical transaction data, including transactions identified as fraudulent.
• The data is in tabular format and can be exported to CSVs.
• The analysts are very interested in the recent notebook-driven approach to data science & data engineering tasks.
• They would prefer a solution that features notebooks to explore and prepare data, build models, & define the logic for scheduled processing.
Part 1: Customer needs
• Provide fraud detection services to merchant customers, using incoming payment transaction data to provide early warning of fraudulent activity.
• Schedule offline scoring of “suspicious activity” using the trained model, and make the results globally available.
• Store data from streaming sources in long-term storage without interfering with read jobs.
• Use a standard platform that supports near-term data pipeline needs and serves as a long-term standard for data science, data engineering, & development.
Common scenarios
Part 2: Design the solution (10 min)
• Design a solution and prepare to present it to the target customer audience in a chalk-talk format.
Part 3: Discuss preferred solution
Preferred solution – overall (architecture diagram)
Preferred solution – Data Ingest
• Payment transactions can be ingested in real time using Event Hubs or Azure Cosmos DB.
• Factors to consider:
  – rate of flow (how many transactions/second)
  – data source and compatibility
  – level of effort to implement
  – long-term storage needs
Preferred solution – Data Ingest
• Cosmos DB:
  – is optimized for high write throughput
  – provides streaming through its change feed
  – TTL (time to live) gives automatic document expiration, saving storage cost
• Event Hubs:
  – data streams through, and can be persisted (Capture) to Blob Storage or ADLS
• Both guarantee event ordering per partition, so how you partition your data matters with either service (see the sketch below).
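For instance, here is a hedged sketch of publishing payment events with an explicit partition key using the azure-eventhub Python SDK (v5). The connection string, hub name, and the choice of merchant id as the partition key are hypothetical:

```python
import json
from azure.eventhub import EventData, EventHubProducerClient

# Connection string and hub name are hypothetical; in practice they
# would come from Azure Key Vault.
producer = EventHubProducerClient.from_connection_string(
    conn_str="<event-hubs-connection-string>",
    eventhub_name="payments")

# Routing all of a merchant's transactions through the same partition
# key preserves per-partition ordering for that merchant.
batch = producer.create_batch(partition_key="merchant-42")
batch.add(EventData(json.dumps({"transactionId": "t-001", "amount": 12.50})))
producer.send_batch(batch)
producer.close()
```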
Preferred solution – Data Ingest
• Cosmos DB is likely easier for Woodgrove to integrate because they are already writing payment transactions to a database.
• Cosmos DB multi-master accepts writes from any region (failover automatically redirects to the next available region).
• Event Hubs requires multiple instances in different geographies (failover requires more planning).
• Recommendation: Cosmos DB – think of it as a “persistent event store”.
Preferred solution – Data pipeline processing
• Azure Databricks is a managed Spark environment that can process streaming & batch data, and it covers data science, data engineering, and development needs.
• Features it provides on top of standard Apache Spark include:
  – AAD integration and RBAC
  – collaborative features such as workspaces and Git integration
  – scheduled jobs for automatic notebook/library execution
  – integration with Azure Key Vault
  – training and evaluating machine learning models at scale
Preferred solution – Data pipeline processing
• Azure Databricks can connect to both Event Hubs and Cosmos DB using Spark connectors for each.
• Use Spark Structured Streaming to process real-time payment transactions into Databricks Delta tables.
• Be sure to set a checkpoint directory on your streams; this lets you restart stream processing if the job is stopped at any point (see the sketch below).
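A minimal PySpark sketch of the streaming write, using Spark’s built-in rate source as a stand-in for the Cosmos DB change feed or Event Hubs connector (connector formats and options vary by version, so treat the source as an assumption; the paths are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("payments-to-delta").getOrCreate()

# Stand-in source; in practice this would be the Cosmos DB change feed
# or Event Hubs connector configured for the payments stream.
payments = (spark.readStream
    .format("rate")                 # built-in test source
    .option("rowsPerSecond", 100)
    .load())

# Write the stream into a Databricks Delta table. The checkpoint
# directory records stream progress so a stopped job can resume
# exactly where it left off.
query = (payments.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/delta/checkpoints/payments")  # hypothetical path
    .start("/delta/tables/payments"))                             # hypothetical path
```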
Preferred solution – Data pipeline processing
• Store secrets such as account keys and connection strings centrally in Azure Key Vault.
• Set Key Vault as the backing store for secret scopes in Azure Databricks. Secrets are [REDACTED].
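In a notebook, secrets from a Key Vault-backed scope are read with dbutils.secrets.get; the scope and key names below are hypothetical:

```python
# Read secrets from a Key Vault-backed secret scope; scope and key
# names are hypothetical. dbutils is provided by Databricks notebooks.
cosmos_endpoint = dbutils.secrets.get(scope="keyvault-scope", key="cosmos-endpoint")
cosmos_key = dbutils.secrets.get(scope="keyvault-scope", key="cosmos-master-key")

# The values are redacted if printed in notebook output, but can be
# passed to connector configuration as normal strings.
```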
Preferred solution – Data pipeline processing
• Databricks Delta tables are Spark tables with built-in reliability and performance optimizations.
• They support batch & streaming with additional features:
  – ACID transactions: multiple writers can simultaneously modify data without interfering with jobs reading the data set.
  – DELETES/UPDATES/UPSERTS: modify or merge existing records in place (see the sketch below).
  – Automatic file management: data access speeds up by organizing data into large files that can be read efficiently.
  – Statistics and data skipping: reads are 10-100x faster when statistics are tracked about the data in each file, allowing irrelevant data to be skipped.
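As an illustration, a hedged sketch of a Delta upsert using the MERGE INTO SQL syntax supported by Databricks Delta; the table and column names are hypothetical:

```python
# Upsert a batch of newly scored transactions into a Delta table.
# Table and column names are hypothetical.
spark.sql("""
  MERGE INTO transactions AS t
  USING scored_updates AS u
  ON t.transactionId = u.transactionId
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")
```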
Preferred solution – overall (architecture diagram)
Preferred solution – Model training & deployment
• Azure Databricks supports machine learning training at scale.
• Train the model using historical payment transaction data (see the sketch below).
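A minimal sketch of training a fraud classifier on the historical transactions with Spark MLlib; the feature columns, label column, and Delta path are hypothetical stand-ins for Woodgrove’s actual schema:

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.feature import VectorAssembler

# `spark` is the ambient SparkSession in a Databricks notebook.
# Historical transactions with a label column marking known fraud;
# the path and column names are hypothetical.
history = spark.read.format("delta").load("/delta/tables/transaction_history")

assembler = VectorAssembler(
    inputCols=["amount", "merchantRiskScore", "hourOfDay"],
    outputCol="features")

gbt = GBTClassifier(labelCol="isFraud", featuresCol="features")

train, test = history.randomSplit([0.8, 0.2], seed=42)
model = Pipeline(stages=[assembler, gbt]).fit(train)

# Evaluate on the held-out split before registering the model.
predictions = model.transform(test)
```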
Preferred solution – overall (architecture diagram)
Preferred solution – Model training & deployment
• Use Azure Machine Learning service (AML) to:
  – register the trained model
  – deploy it to an Azure Kubernetes Service (AKS) cluster for easy web accessibility and high availability.
• For scheduled batch scoring, access the model from a notebook and write the results to Cosmos DB via the Cosmos DB Spark connector (see the sketch below).
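A hedged sketch of the batch-scoring write, assuming the azure-cosmosdb-spark connector is attached to the cluster; the option names follow that connector’s documentation but may differ across versions, and the database, collection, and secret names are hypothetical:

```python
# Score transactions with the trained pipeline model (from the training
# sketch above) and upsert the suspicious ones into Cosmos DB.
scored = model.transform(spark.read.format("delta").load("/delta/tables/transactions"))
suspicious = scored.filter(scored.prediction == 1.0)

write_config = {
    # Option names as documented for azure-cosmosdb-spark; verify them
    # against your connector version.
    "Endpoint": dbutils.secrets.get(scope="keyvault-scope", key="cosmos-endpoint"),
    "Masterkey": dbutils.secrets.get(scope="keyvault-scope", key="cosmos-master-key"),
    "Database": "Woodgrove",                 # hypothetical
    "Collection": "suspiciousTransactions",  # hypothetical
    "Upsert": "true",
}

(suspicious.write
    .format("com.microsoft.azure.cosmosdb.spark")
    .options(**write_config)
    .mode("append")
    .save())
```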
Preferred solution – overall (architecture diagram)
Preferred solution – Serving pre-scored data
• Use Cosmos DB to store the offline-scored suspicious transaction data globally.
• Add the applicable customer regions.
• Estimate the RU/s needed – Cosmos DB can scale up & down to handle the workload.
• Consistency: session consistency.
• Partition key: choose one that gives an even distribution of request volume & storage (see the sketch below).
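For illustration, a sketch of creating the serving container with the azure-cosmos Python SDK (v4); the account endpoint, partition key path, throughput estimate, and names are hypothetical:

```python
from azure.cosmos import CosmosClient, PartitionKey

# The endpoint and key would come from Key Vault in practice; the
# values here are hypothetical.
client = CosmosClient(
    url="https://woodgrove.documents.azure.com:443/",
    credential="<account-key>")

database = client.create_database_if_not_exists("Woodgrove")

# Partition on a high-cardinality property with evenly spread request
# volume, e.g. the merchant id, so RU/s and storage distribute evenly.
container = database.create_container_if_not_exists(
    id="suspiciousTransactions",
    partition_key=PartitionKey(path="/merchantId"),  # hypothetical choice
    offer_throughput=10000)                          # hypothetical RU/s estimate
```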
Preferred solution – overall (architecture diagram)
Preferred solution – Long-term storage
• Use Azure Data Lake Storage Gen2 (ADLS Gen2) as the underlying long-term file store for Databricks Delta tables.
• Databricks Delta can compact small files together into larger files of up to 1 GB using the OPTIMIZE command. This can improve query performance over time.
• Define file paths in ADLS for query, dimension, and summary tables, and point to those paths when saving to Delta (see the sketch below).
• Delta tables can be accessed by Power BI through a JDBC connector.
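A sketch of saving a summary table to an ADLS Gen2 path as Delta and compacting it; the abfss URI is hypothetical, the summary DataFrame is a stand-in, and OPTIMIZE assumes a Databricks runtime:

```python
# Stand-in for a summary table produced by the pipeline above.
summary_df = spark.range(1000).withColumnRenamed("id", "txnCount")

# Hypothetical ADLS Gen2 path for the summary table.
summary_path = "abfss://data@woodgrovelake.dfs.core.windows.net/delta/summary"

# Save (or append) the summary table to long-term storage as Delta.
summary_df.write.format("delta").mode("append").save(summary_path)

# Compact small files into larger ones (up to ~1 GB) to speed up reads.
spark.sql(f"OPTIMIZE delta.`{summary_path}`")
```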
Preferred solution – overall (architecture diagram)
Preferred solution – Dashboards & Reporting
• Connect Power BI to the Databricks Delta tables so analysts can build reports and dashboards.
• The connection is made using a JDBC connection string to an Azure Databricks cluster. Querying the tables is similar to querying a more traditional relational database.
• Data scientists and data engineers can use Azure Databricks notebooks to craft complex queries and data visualizations.
Preferred solution – Dashboards & Reporting
• A more cost-effective option for serving summary data to business analysts in Power BI is Azure Analysis Services.
• It eliminates the need to keep a dedicated Databricks cluster running at all times for reporting and analysis.
• Data is stored in a tabular semantic data model.
• Write to it during stream processing (using rolling aggregates, as sketched below), or schedule batch writes via a Databricks job or ADF.
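A sketch of the rolling-aggregate computation in Structured Streaming that could feed those writes; the window length and columns are hypothetical, and the push into Azure Analysis Services itself would happen through a scheduled export or custom sink:

```python
from pyspark.sql.functions import col, sum as sum_, window

# Stand-in streaming source; in practice this is the payments stream
# from the ingest step. Column names are hypothetical.
payments = (spark.readStream.format("rate").option("rowsPerSecond", 100).load()
    .withColumn("merchantId", (col("value") % 10).cast("string"))
    .withColumn("amount", (col("value") % 500).cast("double") / 10.0)
    .withColumnRenamed("timestamp", "eventTime"))

# Rolling 5-minute totals per merchant, suitable for pushing into a
# summary model on a schedule.
rolling = (payments
    .withWatermark("eventTime", "10 minutes")
    .groupBy(window(col("eventTime"), "5 minutes"), col("merchantId"))
    .agg(sum_("amount").alias("totalAmount")))
```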
Preferred solution – overall (architecture diagram)
Participant Guide
• https://aka.ms/cosmos-mcw
DON’T FORGET TO RATE AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT