Whirlpools in Structured Streaming

This deck summarizes challenges encountered with Structured Streaming in Spark and the workarounds used for them. It covers joins between streaming and batch data when join predicates are not pushed down, batch dataframes that stop refreshing once cached, the absence of a JDBC sink in streaming mode, checkpoints kept on storage that is not immediately consistent, and the restriction on aggregating an already aggregated dataframe. The workarounds include caching data outside Spark, looking up batch data inside a map/flatMap operation, writing to the database through a custom JDBC sink, using NFS for checkpoints, and implementing custom aggregations without Spark SQL.

Whirlpools in the Stream
Misadventures in Structured Streaming
Jayesh Lalwani,
Lead Software Engineer, Quantum, Capital One

Look Ma! No lookups!
• Spark allows you to join Streaming data with Batch Data
• BUT….
• Spark doesn’t support pushdown of join predicates. This means that the State is loaded into Spark memory for every microbatch
• Caching a Batch dataframe “freezes” it: later changes to the State are not picked up
[Diagram: Process A, Process B, State, Events]

Solutions
• Cache dataframe & Restart the Streaming application
• Cache dataframe & Restart the Streaming query
• Cache the data outside of Spark
• Look up batch data inside a map/flatMap operation
• Stream updates to Process B and do a stream-to-stream join (Spark 2.3+ only)
– works for limited use cases
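
The "cache outside of Spark" and "look up inside a map/flatMap" options can be combined. The following is a minimal, Spark-free sketch of that idea, not the author's code: a lookup object reloads the reference data after a time-to-live, and a flatMap-style function enriches each event with it. The names `RefreshingLookup`, `enrich`, the `account_id` field, and the loader are all hypothetical; in a real job the loader would query whatever store holds the State.

```python
import time
from typing import Callable, Dict, Iterable, Iterator, Optional


class RefreshingLookup:
    """Holds reference data outside Spark's own cache and reloads it after a TTL."""

    def __init__(self, loader: Callable[[], Dict[str, dict]], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader      # hypothetical: e.g. a query against the state store
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so the refresh logic can be tested
        self._data: Dict[str, dict] = {}
        self._loaded_at: Optional[float] = None

    def get(self, key: str) -> Optional[dict]:
        now = self._clock()
        # Reload on first use and whenever the cached copy is older than the TTL.
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            self._data = self._loader()
            self._loaded_at = now
        return self._data.get(key)


def enrich(events: Iterable[dict], lookup: RefreshingLookup) -> Iterator[dict]:
    """flatMap-style function: emit each event merged with its state, drop unmatched events."""
    for event in events:
        state = lookup.get(event["account_id"])
        if state is not None:
            yield {**event, **state}
```

Because the lookup refreshes itself, the data is never "frozen" the way a cached dataframe is, and only one copy per worker process is loaded rather than one per microbatch. The trade-off is that results can be stale by up to the TTL.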

Data store? We don’t need no stinkin’ data stores
• Outputs supported in Batch Mode
– JSON
– CSV
– Parquet
– ORC
– Text
– JDBC
• Outputs supported in Streaming Mode
– JSON
– CSV
– Parquet
– Text
WHERE IS JDBC?

My implementation of JDBC sink
• Features:
– Can write streaming data to JDBC data stores
– Uses the same code that is used by Batch JDBC sinks
– Modes: Overwrite, Append
– Supports at-least-once delivery out of the box
– Supports exactly-once delivery with some configuration
• Fork: https://github.com/GaalDornick/spark
• Implementation: https://github.com/GaalDornick/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/JdbcSink.scala
• Unit test: https://github.com/GaalDornick/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/JDBCSinkSuite.scala
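
The slide does not say how exactly-once is configured, so the following is only one common way to get there, not a description of the linked JdbcSink. The sketch records each micro-batch id in the same database transaction as the rows, so a batch replayed after a restart is detected and skipped. Python's built-in sqlite3 stands in for a JDBC data store, and the table names `events` and `committed_batches` are hypothetical. In Spark 2.4 and later, `foreachBatch` hands the writer a batch id that could play this role.

```python
import sqlite3
from typing import Iterable, Tuple


def init_schema(conn: sqlite3.Connection) -> None:
    """Create the target table and the bookkeeping table (both names are hypothetical)."""
    conn.execute("CREATE TABLE IF NOT EXISTS events (id TEXT, amount REAL)")
    conn.execute("CREATE TABLE IF NOT EXISTS committed_batches (batch_id INTEGER PRIMARY KEY)")
    conn.commit()


def write_batch(conn: sqlite3.Connection, batch_id: int,
                rows: Iterable[Tuple[str, float]]) -> bool:
    """Write one micro-batch; return False if this batch id was already committed."""
    try:
        with conn:  # one transaction: commits on success, rolls back on exception
            conn.execute("INSERT INTO committed_batches (batch_id) VALUES (?)", (batch_id,))
            conn.executemany("INSERT INTO events (id, amount) VALUES (?, ?)", rows)
        return True
    except sqlite3.IntegrityError:
        # Primary-key violation: the batch was replayed after a restart, so skip it.
        return False
```

Without the bookkeeping table the same function gives at-least-once behavior: a replayed batch is simply appended again.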

Missing the point with Checkpoints
• Checkpoint stores state
• Allows Streaming application to restart on failure
• Checkpoint should be stored in a location that is
– Resilient to failure
– Shared between executors and drivers (read-write many)
– Immediately consistent
• S3 IS NOT IMMEDIATELY CONSISTENT

Solution
• NFS
– We use EFS on AWS
• On Kubernetes, you need a PersistentVolume (PV) that supports the ReadWriteMany (RWX) access mode
– NFS
– Ceph
– GlusterFS

Checkpoints had a great fall
• https://issues.apache.org/jira/browse/SPARK-21696
• Leads to corrupt checkpoints
• Worked around by deleting the checkpoint once a day
• This means we lose data
• Fixed in Spark 2.2.1

Aggregations? What Aggregations?
• Structured streaming allows you to aggregate a dataframe
• But you cannot aggregate an aggregated dataframe
• PARTITION BY clause is supported only for time windows

Solutions
• If possible, aggregate batch data before joining with stream
• Implement aggregations without Spark SQL using groupByKey(...).flatMapGroups(...)
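
To show the shape of the second solution, here is a small in-memory model of grouping rows by key and handing each group to a function that does its own two-level aggregation. It is a sketch of the semantics only and does not use Spark; `flat_map_groups`, `peak_minute_total`, and the `(user, minute, amount)` schema are hypothetical names chosen for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, Iterator, List, Tuple

Event = Tuple[str, int, float]  # (user, minute, amount); hypothetical schema


def flat_map_groups(rows: Iterable[Event],
                    key_fn: Callable[[Event], Hashable],
                    fn: Callable[[Hashable, List[Event]], Iterable[tuple]]) -> Iterator[tuple]:
    """Tiny in-memory stand-in for grouping by key and flat-mapping over each group."""
    groups: Dict[Hashable, List[Event]] = defaultdict(list)
    for row in rows:
        groups[key_fn(row)].append(row)
    for key, members in groups.items():
        yield from fn(key, members)


def peak_minute_total(user: Hashable, events: List[Event]) -> Iterable[tuple]:
    """Aggregate of an aggregate: sum amounts per minute, then take the maximum sum."""
    per_minute: Dict[int, float] = defaultdict(float)
    for _, minute, amount in events:
        per_minute[minute] += amount
    yield (user, max(per_minute.values()))
```

The per-group function sees all rows for a key at once, so it can compute the inner sums and the outer maximum in one pass instead of chaining two SQL aggregations.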

Jayesh Lalwani
Lead Software Engineer
Team Heartbeat
Jayesh.Lalwani@capitalone.com

