LESSONS LEARNED:
USING SPARK AND MICROSERVICES
(TO EMPOWER DATA SCIENTISTS AND DATA ENGINEERS)
Alexis Seigneurin
Who I am
• Software engineer for 15+ years
• Consultant at Ippon USA, previously at Ippon France
• Favorite subjects: Spark, Machine Learning, Cassandra
• Spark trainer
• @aseigneurin
• 200 software engineers in France, the US and Australia
• In the US: offices in DC, NYC and Richmond, Virginia
• Digital, Big Data and Cloud applications
• Java & Agile expertise
• Open-source projects: JHipster, Tatami, etc.
• @ipponusa
The project
• Analyze records from customers → Give feedback to the customer on their data
• High volume of data
• 25 million records per day (average)
• Need to keep at least 60 days of history = 1.5 billion records
• Seasonal peaks...
• Need a hybrid platform
• Batch processing for some types of analysis
• Streaming for other analyses
• Hybrid team
• Data Scientists: more familiar with Python
• Software Engineers: Java
Technical Overview
Processing technology - Spark
• Mature platform
• Supports batch jobs and streaming jobs
• Support for multiple programming languages
• Python → Data Scientists
• Scala/Java → Software Engineers
Architecture - Real time platform
• New use cases are implemented by Data Scientists all the time
• Need the implementations to be independent from each other
• One Spark Streaming job per use case
• Microservice-inspired architecture
• Diamond-shaped
• Upstream jobs are written in Scala
• Core is made of multiple Python jobs, one per use case
• Downstream jobs are written in Scala
• Plumbing between the jobs → Kafka
Messaging technology - Kafka
From kafka.apache.org
• “A high-throughput distributed messaging system”
• Messaging: between 2 Spark jobs
• Distributed: fits well with Spark, can be scaled up or down
• High-throughput: so as to handle an average of 300 messages/second, with peaks at 2,000 messages/second
• “Apache Kafka is publish-subscribe messaging rethought as a distributed commit log”
• Commit log so that you can go back in time and reprocess data
• Only used as such when a job crashes, for resilience purposes
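The throughput figure above can be cross-checked against the daily volume given on the project slide. The following is a minimal sketch of that arithmetic; the class and method names are illustrative and not part of the original deck.

```java
// Back-of-the-envelope check of the volumes quoted in the slides.
// Class and method names are hypothetical, chosen for this illustration.
final class VolumeEstimate {
    static final long SECONDS_PER_DAY = 24L * 60 * 60; // 86,400

    /** Number of records kept when retaining `days` of history. */
    static long retainedRecords(long recordsPerDay, int days) {
        return recordsPerDay * days;
    }

    /** Average messages per second for a daily volume (integer division, rounds down). */
    static long averagePerSecond(long recordsPerDay) {
        return recordsPerDay / SECONDS_PER_DAY;
    }
}
```

With 25 million records per day, `retainedRecords(25_000_000L, 60)` gives 1.5 billion and `averagePerSecond(25_000_000L)` gives 289, which matches the quoted average of about 300 messages/second. The quoted peak of 2,000 messages/second is therefore roughly seven times the average.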
Storage
• Currently PostgreSQL:
• SQL databases are well known by developers and easy to work with
• PostgreSQL is available “as-a-service” on AWS
• Working on transitioning to Cassandra (more on that later)
Deployment platform
• Amazon AWS
• Company standard - Everything in the cloud
• Easy to scale up or down, ability to choose the hardware
• Some limitations
• Requirement to use company-crafted AMIs
• Cannot use some services (EMR…)
• AMIs are renewed every 2 months → need to recreate the platform continuously
Strengths of the platform
Modularity
• One Spark job per use case
• Hot deployments: can roll out new use cases (= new jobs) without stopping existing jobs
• Can roll out updated code without affecting other jobs
• Able to measure the resources consumed by a single job
• Shared services are provided by upstream and downstream jobs
A/B testing
• A/B testing of updated features
• Run 2 implementations of the code in parallel
• Let each filter process the data of all the customers
• Post-filter to let the customers receive A or B
• (Measure…)
• Can be used to slowly roll out new features
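The post-filter step above could be sketched as follows. This is only an illustration under assumptions: the deck does not say how customers are assigned to A or B, so the hash-based bucketing, the `Result` shape and all names here are hypothetical.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of an A/B post-filter; not the implementation used in the project.
final class AbPostFilter {
    /** Output of one implementation for one customer (assumed shape). */
    static final class Result {
        final String customerId;
        final String variant; // "A" or "B": which implementation produced this result
        final String payload;

        Result(String customerId, String variant, String payload) {
            this.customerId = customerId;
            this.variant = variant;
            this.payload = payload;
        }
    }

    /** Deterministically assigns a customer to "A" or "B"; percentB is between 0 and 100. */
    static String assignedVariant(String customerId, int percentB) {
        int bucket = Math.floorMod(customerId.hashCode(), 100); // stable bucket in 0..99
        return bucket < percentB ? "B" : "A";
    }

    /** Keeps only the results produced by the variant each customer is assigned to. */
    static List<Result> postFilter(List<Result> results, int percentB) {
        return results.stream()
                .filter(r -> r.variant.equals(assignedVariant(r.customerId, percentB)))
                .collect(Collectors.toList());
    }
}
```

Both implementations process every customer, and only the post-filter decides which output is delivered. Raising `percentB` step by step is one way to roll out a new feature slowly.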
Data Scientists can contribute
• Spark in Python → pySpark
• Data Scientists know Python (and don’t want to hear about Java/Scala!)
• Business logic implemented in Python
• Code is easy to write and to read
• Data Scientists are real contributors → quick iterations to production
Challenges
Data Scientist code in production
• Shipping code written by Data Scientists is not ideal
• Need production-grade code (error handling, logging…)
• Code is less tested than Scala code
• Harder to deploy than a JAR file → Python virtual environments
• blog.cloudera.com/blog/2015/09/how-to-prepare-your-apache-hadoop-cluster-for-pyspark-jobs/
Allocation of resources in Spark
• With Spark Streaming, resources (CPU & memory) are allocated per job
• Resources are allocated when the job is submitted and cannot be updated on the fly
• Have to allocate 1 core to the Driver of the job → unused resource
• Have to allocate extra resources to each job to handle variations in traffic → unused resources
• For peak periods, easy to add new Spark Workers but jobs have to be restarted
• Idea to be tested:
• Over-allocation of real resources, e.g. tell Spark it has 6 cores on a 4-core server
Micro-batches
Spark Streaming processes events in micro-batches
• Impact on the latency
• Spark Streaming micro-batches → hard to achieve sub-second latency
• See spark.apache.org/docs/latest/streaming-programming-guide.html#task-launching-overheads
• Total latency of the system = sum of the latencies of each stage
• In this use case, events are independent from each other - no need for windowing computation → a real streaming framework would be more appropriate
• Impact on memory usage
• Kafka+Spark using the direct approach = 1 RDD partition per Kafka partition
• If you start the Spark job with lots of unprocessed data in Kafka, RDD partitions can exceed the size of the memory
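To illustrate why latency adds up across the stages of the pipeline, here is a rough sketch. It assumes a simplified model that is not from the deck: in the worst case, an event waits for a full batch interval in each stage and is then processed. The names and numbers are hypothetical.

```java
// Simplified latency model for a chain of micro-batch stages (an assumption,
// not a measurement): worst case per stage = batch interval + processing time.
final class PipelineLatency {
    /** Worst-case latency of one stage: wait for the batch to close, then process it. */
    static long stageWorstCaseMs(long batchIntervalMs, long processingMs) {
        return batchIntervalMs + processingMs;
    }

    /** Total latency of the system = sum of the latencies of each stage. */
    static long totalMs(long[] stageLatenciesMs) {
        long total = 0;
        for (long ms : stageLatenciesMs) {
            total += ms;
        }
        return total;
    }
}
```

For example, three chained stages (upstream, core, downstream), each with a hypothetical 1,000 ms batch interval and 200 ms of processing, give a worst case of 3,600 ms end to end under this model.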
Resilience of Spark jobs
• Spark Streaming application = 1 Driver + 1 Application
• Application = N Executors
• If an Executor dies → restarted (seamless)
• If the Driver dies, the whole Application must be restarted
• Scala/Java jobs → “supervised” mode
• Python jobs → not supported with Spark Standalone
Resilience with Spark & Kafka
• Connecting Spark to Kafka, 2 methods:
• Receiver-based approach: not ideal for parallelism
• Direct approach: better for parallelism but have to deal with Kafka offsets
• Dealing with Kafka offsets
• Default: consumes from the end of the Kafka topic (or the beginning)
• Documentation → Use checkpoints
• Tasks have to be Serializable (not always possible: dependent libraries)
• Harder to deploy the application (classes are serialized) → run a new instance in parallel and kill the first one (harder to automate; messages consumed twice)
• Requires a shared file system (HDFS, S3) → high latency on these file systems, which forces you to increase the micro-batch interval
• Solution: deal with offsets in the Spark Streaming application
• Write the offsets to a reliable storage: ZooKeeper, Kafka…
• Write after processing the data
• Read the offsets on startup (if no offsets, start from the end)
• ippon.tech/blog/spark-kafka-achieving-zero-data-loss/
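The offset-handling steps above can be sketched in plain Java. This is a minimal illustration, not the code from the linked blog post: the `OffsetStore` interface and the in-memory implementation are hypothetical stand-ins for a reliable store such as ZooKeeper or Kafka.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical sketch of manual offset management; all names are assumptions.
final class OffsetTracking {
    /** Abstraction over a reliable store (ZooKeeper, Kafka...). */
    interface OffsetStore {
        Long read(int partition);                  // null when nothing was saved yet
        void write(int partition, long nextOffset);
    }

    /** In-memory stand-in for illustration only: it is NOT reliable storage. */
    static final class InMemoryOffsetStore implements OffsetStore {
        private final Map<Integer, Long> offsets = new HashMap<>();

        public Long read(int partition) {
            return offsets.get(partition);
        }

        public void write(int partition, long nextOffset) {
            offsets.put(partition, nextOffset);
        }
    }

    /** On startup: resume from the saved offset, or from the end of the topic if none. */
    static long startingOffset(OffsetStore store, int partition, long endOfTopic) {
        Long saved = store.read(partition);
        return saved != null ? saved : endOfTopic;
    }

    /** Processes one batch, then saves the next offset to read. */
    static void processBatch(OffsetStore store, int partition, long firstOffset,
                             List<String> messages, Consumer<String> handler) {
        for (String message : messages) {
            handler.accept(message); // if this throws, the offset is not advanced
        }
        store.write(partition, firstOffset + messages.size());
    }
}
```

Because the offset is written only after the data has been processed, a crash between the two steps causes the batch to be read again on restart. That gives at-least-once delivery, so downstream processing needs to tolerate duplicates.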
Writing to Kafka
• Spark Streaming comes with a library to read from Kafka but none to write to Kafka!
• Flink or Kafka Streams do that out-of-the-box
• Cloudera provides an open-source library:
• github.com/cloudera/spark-kafka-writer
• (It has since been removed!)
Idempotence
Spark and fault-tolerance semantics:
• Spark can provide exactly once guarantee only for the transformation of the data
• Writing the data is at least once with non-transactional systems (including Kafka in our case)
• See spark.apache.org/docs/latest/streaming-programming-guide.html#fault-tolerance-semantics
→ The overall system has to be idempotent
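One common way to make a sink idempotent is to write each record under a stable identifier, so that a replayed write overwrites the earlier one instead of adding a duplicate. The sketch below illustrates the idea with an in-memory map; it assumes each record carries such an identifier, which the deck does not state, and the class name is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical idempotent sink: an upsert keyed by a stable record ID.
final class IdempotentSink {
    private final Map<String, String> rowsById = new HashMap<>();

    /** Replaying the same (recordId, value) pair leaves the stored state unchanged. */
    void write(String recordId, String value) {
        rowsById.put(recordId, value);
    }

    /** Number of distinct records stored. */
    int size() {
        return rowsById.size();
    }

    /** Stored value for a record ID, or null if absent. */
    String get(String recordId) {
        return rowsById.get(recordId);
    }
}
```

With at-least-once delivery, the same record may be written twice after a restart; with this kind of keyed write the sink still ends up holding a single copy.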
Message format & schemas
• Spark jobs are decoupled, but each depends on the upstream job
• Message formats have to be agreed upon
• JSON
• Pros: flexible
• Cons: flexible! (missing fields)
• Avro
• Pros: enforces a structure (named fields + types)
• Cons: hard to propagate the schemas
→ Confluent’s Schema Registry (more on that later)
Potential & upcoming improvements
Confluent’s Schema Registry
docs.confluent.io/3.0.0/schema-registry/docs/index.html
• Separate (web) server to manage & enforce Avro schemas
• Stores schemas, versions them, and can perform compatibility checks (configurable: backward or forward)
• Makes life simpler:
✓ no need to share schemas (“what version of the schema is this?”)
✓ no need to share generated classes
✓ can update the producer with backward-compatible messages without affecting the consumers
• Comes with:
• A Kafka Serializer (for the producer): sends the schema of the object to the Schema Registry before sending the record to Kafka
• Message sending fails if schema compatibility fails
• A Kafka Decoder (for the consumer): retrieves the schema from the Schema Registry when a message comes in
Kafka Streams
docs.confluent.io/3.0.0/streams/index.html
• “powerful, easy-to-use library for building highly scalable, fault-tolerant, distributed stream processing applications on top of Apache Kafka”
• Perfect fit for micro-services on top of Kafka
• Natively consumes messages from Kafka
• Natively pushes produced messages to Kafka
• Processes messages one at a time → very low latency
• Pros
• API is very similar to Spark’s API
• Deploy new instances of the application to scale out
• Cons
• JVM languages only - no support for Python
• Outside of Spark - one more thing to manage
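Kafka Streams - Example (Java)
// Imports needed by this example (early Kafka Streams API, with KStreamBuilder)
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KStreamBuilder;

// Application ID, Kafka and ZooKeeper endpoints, default key/value serdes
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "xxx");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9093");
props.put(StreamsConfig.ZOOKEEPER_CONNECT_CONFIG, "localhost:2182");
props.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
props.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());

// Without committed offsets, start from the end of the topic
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");

KStreamBuilder builder = new KStreamBuilder();

// Read from the input topic, transform each value, then serialize it to Avro bytes
KStream<String, String> kafkaInput = builder.stream("INPUT-TOPIC");
KStream<String, RealtimeXXX> auths = kafkaInput.mapValues(value -> ...);
KStream<String, byte[]> serializedAuths = auths.mapValues(a -> AvroSerializer.serialize(a));

// Write the serialized records to the output topic
serializedAuths.to(Serdes.String(), Serdes.ByteArray(), "OUTPUT-TOPIC");

KafkaStreams streams = new KafkaStreams(builder, props);
streams.start();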
Database migration
• The database stores the state
• Client settings or analyzed behavior
• Historical data (up to 60 days)
• Produced outputs
• Some technologies can store a state (e.g. Samza) but hardly 60 days of data
• Initially used PostgreSQL
• Easy to start with
• Available on AWS “as-a-service”: RDS
• Cannot scale to 60 days of historical data, though
• Cassandra is a good fit
• Scales out for the storage of historical data
• Connects to Spark
• Loads Cassandra data into Spark, or saves data from Spark to Cassandra
• Can be used to reprocess existing data for denormalization purposes
Summary & Conclusion
Summary
Is the microservices architecture adequate?
• Interesting to separate the implementations of the use cases
• Overhead for the other services
Is Spark adequate?
• Supports Python (not supported by Kafka Streams)
• Micro-batches not adequate
Thank you!
@aseigneurin

 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 

Kürzlich hochgeladen (20)

Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 

Lessons Learned: Using Spark and Microservices

  • 1. LESSONS LEARNED: USING SPARK AND MICROSERVICES (TO EMPOWER DATA SCIENTISTS AND DATA ENGINEERS) Alexis Seigneurin
  • 2. Who I am
      • Software engineer for 15+ years
      • Consultant at Ippon USA, previously at Ippon France
      • Favorite subjects: Spark, Machine Learning, Cassandra
      • Spark trainer
      • @aseigneurin
  • 3.
      • 200 software engineers in France, the US and Australia
      • In the US: offices in DC, NYC and Richmond, Virginia
      • Digital, Big Data and Cloud applications
      • Java & Agile expertise
      • Open-source projects: JHipster, Tatami, etc.
      • @ipponusa
  • 4. The project
      • Analyze records from customers → Give feedback to the customer on their data
      • High volume of data
          • 25 million records per day (average)
          • Need to keep at least 60 days of history = 1.5 billion records
          • Seasonal peaks...
      • Need a hybrid platform
          • Batch processing for some types of analysis
          • Streaming for other analyses
      • Hybrid team
          • Data Scientists: more familiar with Python
          • Software Engineers: Java
  • 6. Processing technology - Spark
      • Mature platform
      • Supports batch jobs and streaming jobs
      • Support for multiple programming languages
          • Python → Data Scientists
          • Scala/Java → Software Engineers
  • 7. Architecture - Real time platform (1/2)
      • New use cases are implemented by Data Scientists all the time
      • Need the implementations to be independent from each other
      • One Spark Streaming job per use case
      • Microservice-inspired architecture
          • Diamond-shaped
          • Upstream jobs are written in Scala
          • Core is made of multiple Python jobs, one per use case
          • Downstream jobs are written in Scala
      • Plumbing between the jobs → Kafka
  • 8. Architecture - Real time platform (2/2)
  • 9. Messaging technology - Kafka
      • From kafka.apache.org:
      • “A high-throughput distributed messaging system”
          • Messaging: between 2 Spark jobs
          • Distributed: fits well with Spark, can be scaled up or down
          • High-throughput: so as to handle an average of 300 messages/second, with peaks at 2000 messages/second
      • “Apache Kafka is publish-subscribe messaging rethought as a distributed commit log”
          • Commit log so that you can go back in time and reprocess data
          • Only used as such when a job crashes, for resilience purposes
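
The "commit log" idea on the Kafka slide can be pictured with a toy in-memory log: records are appended at increasing offsets, and a consumer that crashed can re-read from any earlier offset. This is only a sketch in plain Java; `ToyCommitLog` and its methods are hypothetical names chosen for illustration, not Kafka's client API.

```java
import java.util.ArrayList;
import java.util.List;

// Toy append-only log (hypothetical; not the Kafka client API).
final class ToyCommitLog {
    private final List<String> records = new ArrayList<>();

    // Appends a record and returns the offset it was stored at.
    long append(String record) {
        records.add(record);
        return records.size() - 1;
    }

    // Returns every record from the given offset (inclusive) to the end of the log.
    // Reading does not remove anything, so the same range can be replayed later.
    List<String> readFrom(long offset) {
        if (offset < 0 || offset > records.size()) {
            throw new IllegalArgumentException("offset out of range: " + offset);
        }
        return new ArrayList<>(records.subList((int) offset, records.size()));
    }
}
```

A job that last handled offset 0 before crashing would call `readFrom(1)` on restart and see every record it missed, which is the resilience use described on the slide.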
  • 10. Storage
      • Currently PostgreSQL:
          • SQL databases are well known by developers and easy to work with
          • PostgreSQL is available “as-a-service” on AWS
      • Working on transitioning to Cassandra (more on that later)
  • 11. Deployment platform
      • Amazon AWS
          • Company standard - Everything in the cloud
          • Easy to scale up or down, ability to choose the hardware
      • Some limitations
          • Requirement to use company-crafted AMIs
          • Cannot use some services (EMR…)
          • AMIs are renewed every 2 months → need to recreate the platform continuously
  • 12. Strengths of the platform
  • 13. Modularity
      • One Spark job per use case
      • Hot deployments: can roll out new use cases (= new jobs) without stopping existing jobs
      • Can roll out updated code without affecting other jobs
      • Able to measure the resources consumed by a single job
      • Shared services are provided by upstream and downstream jobs
  • 14. A/B testing
      • A/B testing of updated features
          • Run 2 implementations of the code in parallel
          • Let each filter process the data of all the customers
          • Post-filter to let the customers receive A or B
          • (Measure…)
      • Can be used to slowly roll out new features
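
The post-filter step of the A/B setup could look roughly like the sketch below: both implementations process every record, and a filter downstream decides per customer which result is delivered. The slides do not say how customers are assigned to A or B; the stable hash bucket and the `percentB` rollout parameter are assumptions, and `AbPostFilter` is a hypothetical name.

```java
// Hypothetical post-filter: decides, per customer, whether the output of
// implementation A or B is delivered. The hashing scheme is an assumption.
final class AbPostFilter {
    private final int percentB; // share of customers (0-100) who receive B

    AbPostFilter(int percentB) {
        if (percentB < 0 || percentB > 100) {
            throw new IllegalArgumentException("percentB must be in [0, 100]");
        }
        this.percentB = percentB;
    }

    // Stable bucket in [0, 100): the same customer always lands in the same bucket.
    static int bucket(String customerId) {
        return Math.floorMod(customerId.hashCode(), 100);
    }

    // Returns "B" for customers in the rollout group, "A" otherwise.
    String variantFor(String customerId) {
        return bucket(customerId) < percentB ? "B" : "A";
    }

    // Both implementations ran on the record; keep only the selected result.
    String select(String customerId, String resultA, String resultB) {
        return variantFor(customerId).equals("B") ? resultB : resultA;
    }
}
```

Raising `percentB` step by step is one way to get the slow rollout mentioned on the slide, since a customer already in the B group stays there as the percentage grows.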
  • 15. Data Scientists can contribute
      • Spark in Python → PySpark
      • Data Scientists know Python (and don’t want to hear about Java/Scala!)
      • Business logic implemented in Python
      • Code is easy to write and to read
      • Data Scientists are real contributors → quick iterations to production
  • 17. Data Scientist code in production
      • Shipping code written by Data Scientists is not ideal
          • Need production-grade code (error handling, logging…)
          • Code is less tested than Scala code
      • Harder to deploy than a JAR file → Python Virtual Environments
          • blog.cloudera.com/blog/2015/09/how-to-prepare-your-apache-hadoop-cluster-for-pyspark-jobs/
  • 18. Allocation of resources in Spark
      • With Spark Streaming, resources (CPU & memory) are allocated per job
      • Resources are allocated when the job is submitted and cannot be updated on the fly
      • Have to allocate 1 core to the Driver of the job → unused resource
      • Have to allocate extra resources to each job to handle variations in traffic → unused resources
      • For peak periods, easy to add new Spark Workers but jobs have to be restarted
      • Idea to be tested:
          • Over-allocation of real resources, e.g. let Spark know it has 6 cores on a 4-core server
  • 19. Micro-batches
      • Spark Streaming processes events in micro-batches
      • Impact on the latency
          • Spark Streaming micro-batches → hard to achieve sub-second latency
          • See spark.apache.org/docs/latest/streaming-programming-guide.html#task-launching-overheads
          • Total latency of the system = sum of the latencies of each stage
          • In this use case, events are independent from each other - no need for windowing computation → a real streaming framework would be more appropriate
      • Impact on memory usage
          • Kafka+Spark using the direct approach = 1 RDD partition per Kafka partition
          • If you start the Spark job with lots of unprocessed data in Kafka, RDD partitions can exceed the size of the memory
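
To make "total latency = sum of the latencies of each stage" concrete, here is a rough model in Java. It assumes that, in each micro-batch stage, an event can wait up to one batch interval before its batch starts and then the batch takes some time to process. That model and all names below are assumptions for illustration, not measurements from the project.

```java
// Rough latency model (an assumption, not a measurement): in each micro-batch stage an
// event waits up to one batch interval, then the batch takes some time to process.
final class PipelineLatency {
    static long worstCaseStageMillis(long batchIntervalMillis, long processingMillis) {
        return batchIntervalMillis + processingMillis;
    }

    // Total latency = sum of the latencies of each stage.
    // Each row of stages is {batchIntervalMillis, processingMillis}.
    static long worstCaseTotalMillis(long[][] stages) {
        long total = 0;
        for (long[] stage : stages) {
            total += worstCaseStageMillis(stage[0], stage[1]);
        }
        return total;
    }
}
```

With made-up figures for three chained jobs (upstream, core, downstream), say `{500, 200}`, `{500, 300}` and `{500, 100}`, the worst case adds up to 2100 ms, which shows how quickly stacked micro-batch intervals move the system away from sub-second latency.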
  • 20. Resilience of Spark jobs
      • Spark Streaming application = 1 Driver + 1 Application
          • Application = N Executors
      • If an Executor dies → restarted (seamless)
      • If the Driver dies, the whole Application must be restarted
          • Scala/Java jobs → “supervised” mode
          • Python jobs → not supported with Spark Standalone
  • 21. Resilience with Spark & Kafka (1/2)
      • Connecting Spark to Kafka, 2 methods:
          • Receiver-based approach: not ideal for parallelism
          • Direct approach: better for parallelism but have to deal with Kafka offsets
      • Dealing with Kafka offsets
          • Default: consumes from the end of the Kafka topic (or the beginning)
          • Documentation → Use checkpoints
              • Tasks have to be Serializable (not always possible: dependent libraries)
              • Harder to deploy the application (classes are serialized) → run a new instance in parallel and kill the first one (harder to automate; messages consumed twice)
              • Requires a shared file system (HDFS, S3) → high latency on these file systems, which forces you to increase the micro-batch interval
  • 22. Resilience with Spark & Kafka (2/2)
      • Dealing with Kafka offsets
          • Solution: deal with offsets in the Spark Streaming application
              • Write the offsets to a reliable storage: ZooKeeper, Kafka…
              • Write after processing the data
              • Read the offsets on startup (if no offsets, start from the end)
          • ippon.tech/blog/spark-kafka-achieving-zero-data-loss/
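
The ordering described on the offsets slide (read the offsets on startup, process, then write the offsets) can be sketched without Spark or Kafka. In the sketch below an in-memory map stands in for the reliable store (ZooKeeper, Kafka…) and a `List` stands in for one topic partition; `OffsetStore` and `OffsetAwareJob` are hypothetical names, not part of any library.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Stand-in for a reliable offset store such as ZooKeeper (hypothetical API).
final class OffsetStore {
    private final Map<Integer, Long> nextOffsetByPartition = new HashMap<>();

    // Returns the next offset to read, or null if nothing was saved yet.
    Long load(int partition) {
        return nextOffsetByPartition.get(partition);
    }

    void save(int partition, long nextOffset) {
        nextOffsetByPartition.put(partition, nextOffset);
    }
}

final class OffsetAwareJob {
    private final OffsetStore store;

    OffsetAwareJob(OffsetStore store) {
        this.store = store;
    }

    // On startup: resume from the saved offset, or from the end of the log if none exists.
    long startingOffset(int partition, long endOfLog) {
        Long saved = store.load(partition);
        return saved != null ? saved : endOfLog;
    }

    // Processes log[from..end) and saves the offset only after processing succeeded.
    // If the processor throws, the offset is not advanced and the same records are
    // replayed after a restart (at-least-once delivery).
    long processBatch(int partition, List<String> log, long from, Consumer<String> processor) {
        for (long i = from; i < log.size(); i++) {
            processor.accept(log.get((int) i));
        }
        long next = Math.max(from, log.size());
        store.save(partition, next);
        return next;
    }
}
```

Because the offset is written after processing, a crash between the two steps leads to records being processed twice rather than lost, which is why the downstream writes need to tolerate duplicates.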
  • 23. Writing to Kafka
      • Spark Streaming comes with a library to read from Kafka but none to write to Kafka!
      • Flink or Kafka Streams do that out-of-the-box
      • Cloudera provides an open-source library:
          • github.com/cloudera/spark-kafka-writer
          • (Has been removed by now!)
  • 24. Idempotence
      • Spark and fault-tolerance semantics:
          • Spark can provide an exactly-once guarantee only for the transformation of the data
          • Writing the data is at-least-once with non-transactional systems (including Kafka in our case)
          • See spark.apache.org/docs/latest/streaming-programming-guide.html#fault-tolerance-semantics
      • → The overall system has to be idempotent
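
One common way to make an at-least-once write idempotent is to turn it into an upsert keyed by a stable record identifier, so that a redelivered record overwrites itself instead of creating a duplicate. The sketch below assumes every record carries such an identifier (the slides do not say so); `IdempotentSink` is a hypothetical in-memory stand-in for the real store.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sink that tolerates redelivery: writes are upserts keyed by record ID.
final class IdempotentSink {
    private final Map<String, String> rowsById = new HashMap<>();

    // Writing the same (id, value) twice leaves the store unchanged.
    // Returns true only when the row was new or its value changed.
    boolean write(String recordId, String value) {
        String previous = rowsById.put(recordId, value);
        return !value.equals(previous);
    }

    int size() {
        return rowsById.size();
    }

    String get(String recordId) {
        return rowsById.get(recordId);
    }
}
```

Replaying a batch after a crash then produces the same final state as processing it once, which is the property the slide asks of the overall system.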
  • 25. Message format & schemas
      • Spark jobs are decoupled, but each depends on the upstream job
      • Message formats have to be agreed upon
      • JSON
          • Pros: flexible
          • Cons: flexible! (missing fields)
      • Avro
          • Pros: enforces a structure (named fields + types)
          • Cons: hard to propagate the schemas → Confluent’s Schema Registry (more on that later)
  • 27. Confluent’s Schema Registry (1/2)
      • docs.confluent.io/3.0.0/schema-registry/docs/index.html
      • Separate (web) server to manage & enforce Avro schemas
      • Stores schemas, versions them, and can perform compatibility checks (configurable: backward or forward)
      • Makes life simpler:
          ✓ no need to share schemas (“what version of the schema is this?”)
          ✓ no need to share generated classes
          ✓ can update the producer with backward-compatible messages without affecting the consumers
  • 28. Confluent’s Schema Registry (2/2)
      • Comes with:
          • A Kafka Serializer (for the producer): sends the schema of the object to the Schema Registry before sending the record to Kafka
              • Message sending fails if schema compatibility fails
          • A Kafka Decoder (for the consumer): retrieves the schema from the Schema Registry when a message comes in
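
As a simplified picture of what a backward compatibility check decides, the sketch below treats a schema as a map of field names to a type and a "has default" flag. It is not the Schema Registry's implementation and it ignores much of Avro's resolution rules (type promotion, aliases, unions); `FieldSpec` and `SchemaCompatibility` are hypothetical names.

```java
import java.util.Map;

// Hypothetical, heavily simplified field description: a type name and whether a default exists.
final class FieldSpec {
    final String type;
    final boolean hasDefault;

    FieldSpec(String type, boolean hasDefault) {
        this.type = type;
        this.hasDefault = hasDefault;
    }
}

final class SchemaCompatibility {
    // True if a consumer using newSchema can read records written with oldSchema:
    //  - fields present in both must keep the same type (no type promotion in this sketch),
    //  - fields only in newSchema need a default, since old records do not carry them,
    //  - fields only in oldSchema are simply ignored by the new reader.
    static boolean isBackwardCompatible(Map<String, FieldSpec> oldSchema,
                                        Map<String, FieldSpec> newSchema) {
        for (Map.Entry<String, FieldSpec> entry : newSchema.entrySet()) {
            FieldSpec oldField = oldSchema.get(entry.getKey());
            FieldSpec newField = entry.getValue();
            if (oldField == null) {
                if (!newField.hasDefault) {
                    return false;
                }
            } else if (!oldField.type.equals(newField.type)) {
                return false;
            }
        }
        return true;
    }
}
```

Under this simplified rule, adding an optional field with a default passes, while adding a required field or changing a field's type is rejected, which is the kind of failure the serializer surfaces when schema compatibility fails.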
  • 29. Kafka Streams (1/2)
      • docs.confluent.io/3.0.0/streams/index.html
      • “powerful, easy-to-use library for building highly scalable, fault-tolerant, distributed stream processing applications on top of Apache Kafka”
      • Perfect fit for micro-services on top of Kafka
          • Natively consumes messages from Kafka
          • Natively pushes produced messages to Kafka
          • Processes messages one at a time → very low latency
      • Pros
          • API is very similar to Spark’s API
          • Deploy new instances of the application to scale out
      • Cons
          • JVM languages only - no support for Python
          • Outside of Spark - one more thing to manage
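  • 30. Kafka Streams (2/2) - Example (Java)

      // Uses the Kafka Streams API documented at docs.confluent.io/3.0.0;
      // KStreamBuilder and ZOOKEEPER_CONNECT_CONFIG were removed in later releases.
      import java.util.Properties;
      import org.apache.kafka.clients.consumer.ConsumerConfig;
      import org.apache.kafka.common.serialization.Serdes;
      import org.apache.kafka.streams.KafkaStreams;
      import org.apache.kafka.streams.StreamsConfig;
      import org.apache.kafka.streams.kstream.KStream;
      import org.apache.kafka.streams.kstream.KStreamBuilder;

      // Application ID, broker and ZooKeeper addresses, default String serdes
      Properties props = new Properties();
      props.put(StreamsConfig.APPLICATION_ID_CONFIG, "xxx");
      props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9093");
      props.put(StreamsConfig.ZOOKEEPER_CONNECT_CONFIG, "localhost:2182");
      props.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
      props.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());

      // With no committed offsets, start from the end of the topic
      props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");

      KStreamBuilder builder = new KStreamBuilder();

      // Read String messages, map them to domain objects, then serialize to Avro bytes
      KStream<String, String> kafkaInput = builder.stream("INPUT-TOPIC");
      KStream<String, RealtimeXXX> auths = kafkaInput.mapValues(value -> ...);
      KStream<String, byte[]> serializedAuths = auths.mapValues(a -> AvroSerializer.serialize(a));

      // Write the results to the output topic with explicit key/value serdes
      serializedAuths.to(Serdes.String(), Serdes.ByteArray(), "OUTPUT-TOPIC");

      KafkaStreams streams = new KafkaStreams(builder, props);
      streams.start();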
  • 31. Database migration
      • The database stores the state
          • Client settings or analyzed behavior
          • Historical data (up to 60 days)
          • Produced outputs
      • Some technologies can store a state (e.g. Samza) but hardly 60 days of data
      • Initially used PostgreSQL
          • Easy to start with
          • Available on AWS “as-a-service”: RDS
          • Cannot scale to 60 days of historical data, though
      • Cassandra is a good fit
          • Scales out for the storage of historical data
          • Connects to Spark
              • Load Cassandra data into Spark, or save data from Spark to Cassandra
              • Can be used to reprocess existing data for denormalization purposes
  • 33. Summary
      • Is the microservices architecture adequate?
          • Interesting to separate the implementations of the use cases
          • Overhead for the other services
      • Is Spark adequate?
          • Supports Python (not supported by Kafka Streams)
          • Micro-batches not adequate