Adoption of Apache Spark for real-time data analysis is growing, driven by its ability to handle sophisticated analytical requirements and by its common framework for streaming and batch processing. However, most organizations also want "true streaming" features such as lower latency and the ability to process out-of-order data.
Structured Streaming, a new high-level API introduced in Apache Spark 2.0, promises these and other enhancements to the Spark approach to streaming data processing.
In this webinar, Anand Venugopal (Product Head) and other technical experts from StreamAnalytix discuss the promising developments in Apache Spark 2.0 and how organizations can leverage Structured Streaming to make timely and accurate decisions and stay competitive.
The Structured Streaming upgrade to Apache Spark and how enterprises can benefit - StreamAnalytix Webinar
1. © 2017 Impetus Technologies
WEBINAR
The Structured Streaming Upgrade to Apache Spark and How Enterprises Can Benefit
Anand Venugopal, Product Head & AVP, StreamAnalytix
Amit Assudani, Sr. Technical Architect - Spark, StreamAnalytix
August 2017
Quick Webinar Notes
• Our focus: enabling the real-time enterprise and making Spark easy to use
• Sharing our experience and expertise with you
• Level of content: 20-80 split between new and experienced Spark users
• Format: a combination of panel discussion and presentation
• Some artifacts and pictures are taken from the Apache Spark website and other public sources
• Q&A and interactions are important and highly valued
• Please send us your comments and feedback using the Webex console
Webinar Outline
• About Impetus and what is StreamAnalytix? (2 minutes)
• Apache Spark: know the basics and its evolution (8 minutes)
• A deep dive into Structured Streaming (25 minutes)
  • What is it?
  • How is it different from 1.0?
  • Features and technical highlights
  • Benefits and limitations
  • Upgrades and migrations
  • Future roadmap
• Talent vs. tooling (5 minutes)
• Q&A (5+ minutes)
About Impetus
• Mission-critical technology solutions since 1996
• Fortune 500: Big Data clients
• 1700 people; US, India, global reach
• Unique mix of Big Data products and services
Apache Spark - The Beginning
• Started in 2009 as a project at UC Berkeley's AMPLab by Matei Zaharia; open sourced (BSD license) in 2010
• Framework built on a distributed resource management system (Mesos)
• Aimed to speed up ML jobs in Apache Hadoop with an in-memory approach
• 30x performance increase on Hadoop jobs
Apache Spark - Current State
• Robust, widely used technology
• Survey by Taneja Group in November 2016 highlights:
  • 54% of 7,000 enterprise participants said they were actively using Spark
  • 55% of workloads were ETL / data processing / engineering
  • Cloud deployments projected well beyond 30%
  • Popular new initiatives: data science exploration, streaming, and machine learning
[Figure: Spark workload types: high-speed batch, interactive, iterative, graph, and streaming (micro-batch); sits on Hadoop and/or cloud]
Spark Evolution
Major version: Spark 0.x
Spark 0.7-0.9 (Feb 2014)
Features:
• Becomes a top-level Apache project
• RDD concept introduced with Spark
• Scala and Java bindings
• Adds a Python API called PySpark
• Introduces Spark Streaming
• Introduces MLlib
• Includes a first version of GraphX
Remarks:
• PySpark makes it possible to use Spark from Python
• Spark Streaming adds near real-time processing capability
• Spark Streaming is now out of alpha and includes significant optimizations and simplified high-availability deployment
Spark Evolution
Major version: Spark 1.0-1.2
Spark 1.0 (May 2014)
Features:
• Adds Spark SQL
• Guarantees stability of its core API
• Full support for running seamlessly in secured Hadoop clusters
Remarks:
• Spark 1.0 was the first production-ready, backward-compatible release. It viewed Spark Streaming as faster batch processing rather than streaming
• Became the first open source Big Data framework to embrace in-memory computing
Spark 1.1 (Sep 2014)
Features:
• Migrates all customer workloads from Shark to Spark SQL
• Expansion of MLlib
• Extends libraries and sources for Spark Streaming
Remarks:
• First minor release in the 1.x series. Added significant extensions to the newly added Spark SQL and to Spark MLlib
Spark 1.2 (Dec 2014)
Features:
• A new API for external data sources
• New H/A driver support through a Write Ahead Log (WAL), which removes any single point of failure from Spark Streaming
• A higher-level API for constructing pipelines in the spark.ml package
• GraphX project provides a stable API
Remarks:
• Recognized the need for structured data and started to evolve to support it. Introduced a specialized RDD schema as a first step
• However, still lacked a direct API to read structured data from Spark
Spark Evolution
Major version: Spark 1.3-1.5
Spark 1.3 (Mar 2015)
Features:
• A new DataFrames API
• Provides a rich set of new MLlib algorithms
• Adds APIs for the direct Kafka streaming source
Remarks:
• DataFrames allow Spark to better understand the structure of data as well as the computation being performed
• First unified API to read from structured and semi-structured sources (both RDBMS and NoSQL databases)
Spark 1.4 (Jun 2015)
Features:
• Introduces SparkR
• ML pipelines API graduates from alpha with new transformers and improved Python coverage
• Adds visual debugging and monitoring utilities to evaluate running Spark applications
• A REST API
• Initial performance improvements in Project Tungsten
• A pluggable interface for write ahead logs
Remarks:
• Targets data scientists with SparkR on the new DataFrame API
• Ships the initial pieces of Project Tungsten, the first version of custom memory management
Spark 1.5 (Sep 2015)
Features:
• First major pieces of Project Tungsten
• New ML algorithms, extends the new R API
• Adds visualization of SQL and DataFrame query plans in the web UI
• Operational features for the streaming component, such as backpressure support
Remarks:
• Pushes Project Tungsten
• Focused on increasing Spark's performance through several low-level architectural optimizations
• Another major theme was data science
Spark Evolution
Major version: Spark 1.6
Spark 1.6 (Jan 2016)
Features:
• Experimental Dataset API
• New data science functionalities; ML pipeline persistence and new algorithms
• A new and efficient mapWithState API, which replaces updateStateByKey
• Speedup of 10x for streaming state management
• SQL queries on files
Remarks:
• Datasets, a typed extension of the DataFrame API, allow working with custom objects and lambda functions while keeping the benefits of Spark SQL
Spark Evolution
Major version: Spark 2.0-2.2
Spark 2.0 (Jul 2016)
Features:
• A new API, Structured Streaming
• Second-generation Tungsten engine
• Unified DataFrame and Dataset in Scala/Java
• Substantial (2-10x) performance speedup for common operators in SQL and DataFrames with a new technique called whole-stage code generation
Remarks:
• Structured Streaming launched experimentally. Aims to integrate batch and stream. Introduces the concept of continuous applications
Spark 2.1 (Dec 2016)
Features:
• Hardening of Structured Streaming (still experimental)
• Adds a number of SQL functionalities
• Focuses on advanced analytics
• SparkR becomes the most comprehensive library for distributed machine learning on R
Remarks:
• Introduced Structured Streaming as a high-level API for building continuous applications. Aims to make it easier to build end-to-end streaming applications. Introduces:
  • Event-time watermarks
  • Support for all file-based formats and all file-based features
  • Native support for Kafka 0.10
Spark 2.2 (Jul 2017)
Features:
• Production-ready Structured Streaming
• Focuses on advanced analytics and Python
• Cost-based optimizer
• Limit on the maximum number of records written per file
• Support for parsing multi-line JSON and CSV files
Remarks:
• The Structured Streaming APIs are now GA and are no longer labeled experimental
• Adds various SQL functionalities and introduces additional algorithms in MLlib and GraphX
Poll Question
What is your currently used Spark version?
- 1.6 or prior
- 2.1
- 2.2
- Planning to start soon
- No plans
A Deep Dive into Structured Streaming
Structured Streaming - What is it?
• Strongly improved framework over Spark Streaming (DStream API) of Spark 1.x
• High-level streaming API built on Spark SQL (DataFrame/Dataset API) and the Catalyst optimizer
• Express streaming computations the same way as batch computations
• Repeated query / incremental execution on an unbounded table
Structured Streaming - What is it?
• "NO REASONING ABOUT STREAMING"
• Simply define a flow:
  • source → transformation → sink → mode & trigger time → checkpoint
• Structured Streaming makes streaming ETL + analytics easier and a natural single flow
• Not restricted to hard batch duration limits (delivers lower latency)
• Exactly-once guarantee now truly end-to-end: includes the sink layer
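The flow on this slide can be sketched end to end as follows. This is a minimal sketch rather than code from the webinar: it assumes Spark 2.2 or later on the classpath and uses the built-in `rate` source and `console` sink so that no external system is needed. The object name `FlowSketch`, the helper `evenValues`, and the checkpoint path are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.streaming.{OutputMode, Trigger}

object FlowSketch {
  // Transformation: the same function works on batch and streaming DataFrames.
  def evenValues(input: DataFrame): DataFrame =
    input.filter(col("value") % 2 === 0).select(col("timestamp"), col("value"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("flow-sketch")
      .master("local[2]")
      .getOrCreate()

    // Source: the built-in rate source emits (timestamp, value) rows.
    val source = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "5")
      .load()

    val query = evenValues(source)
      .writeStream
      .format("console")                                            // sink
      .outputMode(OutputMode.Append())                              // mode
      .trigger(Trigger.ProcessingTime("5 seconds"))                 // trigger time
      .option("checkpointLocation", "/tmp/flow-sketch-checkpoint")  // checkpoint (hypothetical path)
      .start()

    query.awaitTermination()
  }
}
```

Keeping the transformation in its own function is a design choice made here so that it can be exercised on a static DataFrame before it is attached to a stream.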
Structured Streaming - Code Snippet
(Structured Streaming vs Batch)
// Assumes an existing SparkSession named `spark` and the Kafka connector on the classpath
import spark.implicits._

// Structured Streaming
val streamIn = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load()
// Keep the result of the transformation; it is what gets written out
val streamOut = streamIn
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]
streamOut.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("topic", "topic1")
  .option("checkpointLocation", "/path/to/checkpoint") // required by the Kafka sink; placeholder path
  .start()

// Batch
val batchIn = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load()
val batchOut = batchIn
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]
batchOut.write
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("topic", "topic1")
  .save()
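Streaming Code - Executed on "Trigger"
(One Time Batch)
// Assumes an existing SparkSession named `spark` and the Kafka connector on the classpath
import org.apache.spark.sql.streaming.Trigger
import spark.implicits._

// Structured Streaming
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load()
// Keep the result of the transformation; it is what gets written out
val keyValues = df
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]
// One-time trigger: process the data available now, then stop
keyValues.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("topic", "topic1")
  .option("checkpointLocation", "/path/to/checkpoint") // required by the Kafka sink; placeholder path
  .trigger(Trigger.Once())
  .start()
• No need to work out "changed data" or worry about output consistency
• Much easier stateful processing, such as deduplication
• Unified code: no separate code bases for Lambda solutions
• Cost savings from not running the cluster 24/7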
Structured Streaming - Features and Highlights
(Event Time; Window Duration and Triggers)
• Event time orientation
  • In combination with "windows" and triggers
• Aggregates maintained by Structured Streaming
  • No need to write separate code
• Incremental query and output modes
  • append / complete / update
Structured Streaming - Features and Highlights
(Late Data Handling)
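Late data handling follows from event-time windows: a record that arrives late is still counted in the window its event time belongs to, and the aggregate for that window is updated. The sketch below is an illustration under assumptions, not code from the webinar; the object name `LateDataSketch` and the column names `eventTime` and `deviceId` are hypothetical.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, window}

object LateDataSketch {
  // Counts events per device in 10-minute tumbling windows based on event time,
  // not on arrival time. Assumes columns `eventTime` (timestamp) and
  // `deviceId` (string); both names are hypothetical.
  def countsPerWindow(events: DataFrame): DataFrame =
    events
      .groupBy(window(col("eventTime"), "10 minutes"), col("deviceId"))
      .count()
}
```

On a streaming DataFrame the same function would typically be written out with the `update` or `complete` output mode, so that a late record revises the count of the window it belongs to.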
Structured Streaming - Features and Highlights
(Watermarking ("Data too late!"))
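The idea behind watermarking can be illustrated with a deliberately simplified model in plain Scala. It is not Spark's implementation: Spark advances the watermark between triggers rather than per record, and in Spark the threshold is declared with `withWatermark(eventTimeColumn, delayThreshold)`. Here the watermark is taken as the maximum event time seen so far minus an allowed lateness, and records older than the watermark are dropped as "too late". The names `WatermarkSketch`, `State`, and `process` are hypothetical.

```scala
object WatermarkSketch {
  // Running state: highest event time seen so far and the events accepted so far.
  final case class State(maxEventTimeMs: Long, accepted: Vector[Long])

  // Processes event times in arrival order and drops those behind the watermark.
  def process(eventTimesMs: Seq[Long], allowedLatenessMs: Long): State =
    eventTimesMs.foldLeft(State(Long.MinValue, Vector.empty)) { (state, t) =>
      // No watermark exists until at least one event has been seen.
      val watermark =
        if (state.maxEventTimeMs == Long.MinValue) Long.MinValue
        else state.maxEventTimeMs - allowedLatenessMs
      if (t < watermark) state // data too late: ignored
      else State(math.max(state.maxEventTimeMs, t), state.accepted :+ t)
    }
}
```

With an allowed lateness of 1000 ms, an event stamped 4500 that arrives after one stamped 5000 is still accepted, while an event stamped 2000 is dropped.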
Structured Streaming - Features and Highlights
• New data formats:
  • Native multi-line JSON support
  • Native CSV data source
• Stateful processing and time-outs beyond aggregations
  • Using mapGroupsWithState and flatMapGroupsWithState
• New built-in "rate" source that generates data for benchmarking and testing
  • x number of events, <xyz> format
• Metrics for Structured Streaming: new metrics sink
  • Connect with Graphite
  • Streaming listener (for metrics for every batch execution)
• Kafka 0.10 support; from_json, to_json, explode
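As an illustration of the `from_json` function listed on this slide, the sketch below parses a JSON payload carried in the `value` column of Kafka rows. It is a sketch under assumptions: the payload schema (`user`, `amount`) and the object name `JsonSketch` are hypothetical and not taken from the webinar.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{LongType, StringType, StructType}

object JsonSketch {
  // Hypothetical payload schema: {"user": "...", "amount": 123}
  val payloadSchema: StructType = new StructType()
    .add("user", StringType)
    .add("amount", LongType)

  // Kafka rows carry the payload in a `value` column: cast it to a string,
  // parse it against the schema, then flatten the resulting struct into columns.
  def parse(kafkaRows: DataFrame): DataFrame =
    kafkaRows
      .select(from_json(col("value").cast("string"), payloadSchema).as("payload"))
      .select("payload.*")
}
```

Because the function takes and returns a DataFrame, it applies to both a Kafka stream and a Kafka batch read.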
Structured Streaming - Features and Highlights
• New input / output features:
  • Kafka stream / batch writer (DStream didn't have a Kafka writer)
  • Kafka batch / stream source (Kafka wasn't previously available as a batch source)
  • Partitioning of output data files (example: Hive data output)
• Deduplication is a built-in function
  • Example: major bank use case
    • Without Structured Streaming: manually record each hash value and check it in an external store
    • With Structured Streaming: unbounded table with hash values
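Built-in deduplication can be sketched with `dropDuplicates`, optionally combined with a watermark so that the state holding previously seen keys does not grow without bound. This is an illustration under assumptions, not the bank's actual pipeline: the column names `txnHash` and `eventTime`, the one-hour threshold, and the object name `DedupSketch` are hypothetical.

```scala
import org.apache.spark.sql.DataFrame

object DedupSketch {
  // Assumes columns `txnHash` (string) and `eventTime` (timestamp);
  // both names are hypothetical.
  def dedupe(transactions: DataFrame): DataFrame =
    transactions
      .withWatermark("eventTime", "1 hour")    // bounds how long seen keys are kept in state
      .dropDuplicates("txnHash", "eventTime")  // the event-time column is included so old state can be dropped
}
```

On a static DataFrame the watermark has no effect, so the same function can be checked against batch data.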
Structured Streaming - Additional Features
• Improvements (not new):
  • Easier stream-to-batch joins
  • Failure recovery using checkpoints (also available in DStream)
  • Enhanced "code productivity": continuous SQL over batches and aggregations (maintained by Structured Streaming)
  • Enhanced batch interoperability
Spark Version Management Considerations
(Migration, Co-existence)
• Co-existence of 1.6 and 2.x on the same Hadoop cluster
• Forward compatibility changes
  • SparkSession is now the new entry point of Spark
    • Replaces the old (1.x) SQLContext and HiveContext
  • Dataset API and DataFrame API are unified
    • Scala: DataFrame becomes a type alias for Dataset[Row]
    • Java API users must replace DataFrame with Dataset<Row>
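The two migration points above can be sketched in a few lines of Scala. This is a minimal sketch assuming Spark 2.x on the classpath; the object and method names are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

object SessionSketch {
  // Spark 2.x: a single entry point instead of SQLContext / HiveContext.
  def session(): SparkSession =
    SparkSession.builder()
      .appName("migration-sketch")
      .master("local[1]")
      .getOrCreate()

  // In Scala, DataFrame is a type alias for Dataset[Row], so no conversion is needed.
  def asRows(df: DataFrame): Dataset[Row] = df
}
```

Code that previously relied on HiveContext would typically add `enableHiveSupport()` on the builder, which requires the Hive classes on the classpath.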
Structured Streaming - Limitations
• Machine learning support still weak (coming soon)
• Multiple (chained) aggregations not supported
• limit, take, collect, show, count, foreach don't work
• Join limitations
• Caching for multiple actions
• Aggregation queries / SQL on a single micro-batch
• No Kinesis support
• Java 8 only
Structured Streaming - Future: Mid-Long Term
• Streaming without micro-batches
  • ~1 ms latency has been promised (and without code changes)
• Berkeley Drizzle project: potential replacement of the streaming engine
• For users: will not be much different
  • No changes in code
Shortage of Talent and the Urgent Need For It
• Spark projects are increasing
• They need to get done quickly and within budget
• The big barrier
  • Talent: deep Spark / Scala skills are hard to find
  • Big gap between a Spark prototype app and production-grade scale and stability
  • Many engineers from other projects need to be made productive quickly
The Need for Tooling
• Need very good enterprise-grade, UI-driven tooling around Spark to make it easy
• Need to cover all bases:
  • Development, debugging, deployment, DevOps, monitoring
• Also need to cover the full data processing journey
  • Ingest
  • Data quality
  • Blending
  • Transformation / enrichment
  • Analytics / machine learning
  • Loading of target databases
  • Visualization
StreamAnalytix - "Visual Spark" and More...
• StreamAnalytix is one such platform that makes Spark easy
• Drag-and-drop UI to build and deploy Spark apps in minutes
• Real-time and Batch Data360 platform on Apache Spark 2.1
• Support for Spark 2.2 and Structured Streaming coming in 4Q
About StreamAnalytix
• Based on multiple open-source engines: Spark, Storm, and Flink (future)
• On-premise and cloud compatible
• Enterprise-grade, UI-driven streaming, IoT, batch analytics, and machine learning platform
Please provide your feedback on the webinar and your
interest to attend our upcoming webinars.
Meet us at Booth # 127
Strata Data Conference in New York
September 26-28, 2017