© 2017 Impetus Technologies
WEBINAR
Anand Venugopal
Product Head & AVP,
StreamAnalytix
The Structured Streaming Upgrade to Apache Spark
and How Enterprises Can Benefit
Amit Assudani
Sr. Technical Architect – Spark,
StreamAnalytix
August 2017
Quick Webinar Notes
• Our focus: enabling the real-time enterprise and making Spark easy to use
• Sharing our experience and expertise with you
• Level of content
• 20% aimed at those new to Spark, 80% at experienced users
• Format: A combination of panel discussion and presentation
• Usage of some artifacts and pictures from Apache Spark website and other public sources
• Q&A and interactions are important and highly valued
• Please send us your comments/ feedback using the Webex console
Webinar Outline
• About Impetus and what is StreamAnalytix? – 2 minutes
• Apache Spark – Know the basics and its evolution – 8 minutes
• A deep dive into Structured Streaming – 25 minutes
• What is it?
• How is it different from 1.0?
• Features and technical highlights
• Benefits and limitations
• Upgrades and migrations
• Future roadmap
• Talent vs Tooling – 5 minutes
• Q&A – 5+ minutes
About Impetus
• Mission-critical technology solutions since 1996
• Fortune 500 Big Data clients
• 1,700 people; US, India, global reach
• Unique mix of Big Data products and services
Real-time Stream Processing & Machine Learning Platform
+
Visual Spark Studio
Apache Spark – The Beginning
• Project in Berkeley AMPLab – 2009 – Matei Zaharia; open sourced (BSD) in 2010
• Framework on distributed resource management system (Mesos)
• Speed up ML jobs in Apache Hadoop with in-memory approach
• 30x performance increase on Hadoop jobs
Apache Spark – Current State
• Robust, widely used technology
• Survey by Taneja Group in November 2016 highlights:
• 54% of 7,000 enterprise participants said they were actively using Spark
• 55% of workloads were ETL / data processing / engineering
• Cloud deployments projected well beyond 30%
• Popular new initiatives – data science exploration, streaming and machine learning
[Diagram: Spark workload types – high-speed batch, interactive, iterative, graph, streaming (micro-batch); sits on Hadoop and/or cloud]
Spark Evolution
Major version: Spark 0.X
Date of release: Feb 2014
Minor version: Spark 0.7-0.9
Features:
• Becomes a top-level Apache project
• RDD concept introduced with Spark
• Scala and Java binding
• Adds a Python API called PySpark
• Introduces Spark Streaming
• Introduces MLlib
• Includes a first version of GraphX
Remarks:
• PySpark makes it possible to use Spark from Python
• Spark Streaming adds near real-time processing capability
• Spark Streaming is now out of alpha and includes significant optimizations and simplified high availability deployment
Spark Evolution
Major version: Spark 1.0-1.2

Spark 1.0 (May 2014)
Features:
• Adds Spark SQL
• Guarantees stability of its core API
• Full support for running seamlessly in secured Hadoop clusters
Remarks:
• Spark 1.0 was the first production-ready, backward-compatible release. Viewed Spark Streaming as faster batch processing rather than streaming
• Became the first open source Big Data framework to embrace in-memory computing

Spark 1.1 (Sep 2014)
Features:
• Migrates all customer workloads from Shark to Spark SQL
• Expansion of MLlib
• Extends libraries and sources for Spark Streaming
Remarks:
• First minor release in the 1.X series. Added significant extensions to the newly added Spark SQL and to Spark MLlib

Spark 1.2 (Dec 2014)
Features:
• A new API for external data sources
• New H/A driver support through a Write Ahead Log (WAL), removes any single point of failure from Spark Streaming
• A higher-level API for constructing pipelines in the spark.ml package
• GraphX project provides a stable API
Remarks:
• Recognized the need for structured data and started to evolve to support it. Introduced a specialized RDD schema as a first step.
• However, still lacked a direct API to read structured data from Spark
Spark Evolution
Major version: Spark 1.3-1.5

Spark 1.3 (Mar 2015)
Features:
• A new DataFrames API
• Provides a rich set of new MLlib algorithms
• Adds APIs for a direct Kafka streaming source
Remarks:
• DataFrames allow Spark to better understand the structure of data as well as the computation being performed.
• First unified API to read from structured and semi-structured sources (both RDBMS and NoSQL databases)

Spark 1.4 (Jun 2015)
Features:
• Introduces SparkR
• ML pipelines API graduates from alpha with new transformers and improved Python coverage
• Adds visual debugging and monitoring utilities to evaluate running of Spark applications
• A REST API
• Initial performance improvements in Project Tungsten
• A pluggable interface for write ahead logs
Remarks:
• Targets data scientists with SparkR on the new DataFrame API.
• Ships the initial pieces of Project Tungsten, becomes first version of custom memory management

Spark 1.5 (Sep 2015)
Features:
• First major pieces of Project Tungsten
• New ML algorithms, extends new R API
• Adds visualization of SQL and DataFrame query plans in the web UI
• Operational features for the streaming component, such as backpressure support
Remarks:
• Pushes Project Tungsten
• Focused on increasing Spark's performance through several low-level architectural optimizations
• Another major theme was data science
Spark Evolution
Major version: Spark 1.6

Spark 1.6 (Jan 2016)
Features:
• Experimental Dataset API
• New data science functionalities; ML pipeline persistence and new algorithms
• A new and efficient mapWithState API, replaces updateStateByKey
• Speedup of 10X for streaming state management
• SQL queries on files
Remarks:
• Datasets, a typed extension of the DataFrame API, allow working with custom objects and lambda functions with the benefits of Spark SQL
Spark Evolution
Major version: Spark 2.0-2.2

Spark 2.0 (Jul 2016)
Features:
• A new API, Structured Streaming
• Second generation Tungsten engine
• Unified DataFrame and Dataset in Scala/Java
• Substantial (2-10X) performance speedup for common operators in SQL and DataFrames with a new technique called whole-stage code generation
Remarks:
• Structured Streaming launched experimentally. Aims to integrate batch and stream. Introduces the concept of continuous applications

Spark 2.1 (Dec 2016)
Features:
• Hardening of Structured Streaming – still experimental
• Adds a number of SQL functionalities
• Focuses on advanced analytics
• SparkR becomes most comprehensive library for distributed machine learning on R
Remarks:
Introduced Structured Streaming as a high-level API for building continuous applications. Aims to make it easier to build end-to-end streaming applications. Introduces:
• Event-time watermarks
• Support for all file-based formats and all file-based features
• Adds native support for Kafka 0.10

Spark 2.2 (Jul 2017)
Features:
• Production-ready Structured Streaming
• Focuses on advanced analytics and Python
• Cost-based optimizer
• Limit the max number of records written per file
• Support for parsing multi-line JSON & CSV files
Remarks:
• The Structured Streaming APIs are now GA and no longer labeled experimental
• Adds various SQL functionalities and introduces additional algorithms in MLlib and GraphX
Poll Question
What is your currently used Spark version?
- 1.6 or prior
- 2.1
- 2.2
- Planning to start soon
- No plans
A Deep Dive into Structured Streaming
Structured Streaming – What is it?
• Strongly improved framework over Spark Streaming (DStream API) of Spark 1.x
• High level streaming API built on Spark SQL (DataFrame/Dataset API) and Catalyst Optimizer
• Express streaming computations the same way as batch computations
• Repeated query / incremental execution on unbounded table
Structured Streaming – What is it?
• “NO REASONING ABOUT STREAMING”
• Simply define a flow:
• source → transformation → sink → mode & trigger time → checkpoint
• Structured Streaming makes Streaming ETL + Analytics easier and a natural single flow
• Not restricted to hard batch duration limits (delivers lower latency)
• Exactly-once guarantee now truly end-to-end: includes sink layer
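The flow above (source, transformation, sink, mode and trigger time, checkpoint) can be sketched roughly as follows. This is a minimal illustration, not the presenters' code: it assumes Spark 2.2+ running in local mode, uses the built-in rate source and a Parquet file sink, and the object name FlowSketch, the temporary directories and the 5-second trigger are all hypothetical choices.

```scala
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.streaming.Trigger

object FlowSketch {
  // Transformation: keep even values and add a derived column.
  // The same function works on batch and streaming DataFrames.
  def transform(input: DataFrame): DataFrame =
    input.filter(col("value") % 2 === 0).withColumn("doubled", col("value") * 2)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("flow-sketch")
      .getOrCreate()

    // Hypothetical locations for this sketch only.
    val outputDir = Files.createTempDirectory("flow-sketch-out").toString
    val checkpointDir = Files.createTempDirectory("flow-sketch-chk").toString

    // Source: the built-in rate source emits (timestamp, value) rows.
    val source = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "5")
      .load()

    val query = transform(source)
      .writeStream
      .format("parquet")                             // sink
      .option("path", outputDir)
      .outputMode("append")                          // mode
      .trigger(Trigger.ProcessingTime("5 seconds"))  // trigger time
      .option("checkpointLocation", checkpointDir)   // checkpoint
      .start()

    query.awaitTermination(20000) // let it run for about 20 seconds
    query.stop()
    spark.stop()
  }
}
```

Because transform only depends on the DataFrame API, it can be unit-tested on a small batch DataFrame before being attached to a stream.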
Structured Streaming – Code Snippet
(Structured Streaming vs Batch)
// Structured Streaming
// `spark` is an existing SparkSession
import spark.implicits._ // encoders needed by .as[(String, String)]

val streamDf = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load()

// Kafka delivers key and value as binary; cast both to strings
val streamKv = streamDf
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

streamKv.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("topic", "topic1")
  .option("checkpointLocation", "/path/to/checkpoint") // placeholder path; the Kafka sink needs a checkpoint
  .start()

//Batch
val batchDf = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load()

val batchKv = batchDf
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

batchKv.write
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("topic", "topic1")
  .save()
Structured Streaming – Code Snippet
(DStreams vs Batch)
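//DStream
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val topics = Array("topicA", "topicB")
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092,anotherhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "use_a_separate_group_id_for_each_stream",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)
// `streamingContext` is an existing StreamingContext
val stream: InputDStream[ConsumerRecord[String, String]] =
  KafkaUtils.createDirectStream[String, String](
    streamingContext,
    PreferConsistent,
    Subscribe[String, String](topics, kafkaParams)
  )
val pairs = stream.map(record => (record.key, record.value))
//NO Kafka Write Support

//Batch
// `spark` is an existing SparkSession
import spark.implicits._ // encoders needed by .as[(String, String)]

val batchDf = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load()

val batchKv = batchDf
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

batchKv.write
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("topic", "topic1")
  .save()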
Streaming Code – Executed on “Trigger”
(One Time Batch)
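// Structured Streaming
// `spark` is an existing SparkSession
import org.apache.spark.sql.streaming.Trigger
import spark.implicits._ // encoders needed by .as[(String, String)]

val streamDf = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load()

val streamKv = streamDf
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

//One Time Trigger: process whatever is available, then stop
streamKv.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("topic", "topic1")
  .option("checkpointLocation", "/path/to/checkpoint") // placeholder path; records what has already been processed
  .trigger(Trigger.Once())
  .start()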
• No worry about figuring out “changed data” and output consistency
• Much easier stateful processing like deduping
• Unified code: No different code base for Lambda solutions
• Cost saving by not running the cluster 24/7
Poll Results
Structured Streaming – Features and Highlights
(Event Time; Window Duration and Triggers)
• Event time orientation
• In combination with “windows” and triggers
• Aggregates maintained by Structured Streaming
• No need to write separate code
• Incremental query and output modes
• append / complete / update
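As a rough illustration of event-time windows combined with an output mode, the sketch below counts events per 10-second window and lets Structured Streaming maintain the aggregate. It assumes Spark 2.2+ in local mode; the object name WindowedCounts, the rate source and the window size are hypothetical choices made for this example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, window}

object WindowedCounts {
  // Count events per 10-second event-time window.
  // Expects a `timestamp` column holding the event time.
  def countsPerWindow(events: DataFrame): DataFrame =
    events.groupBy(window(col("timestamp"), "10 seconds")).count()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("windowed-counts")
      .getOrCreate()

    // The rate source stamps each generated row with a `timestamp`.
    val events = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "5")
      .load()

    val query = countsPerWindow(events)
      .writeStream
      .outputMode("complete") // re-emit the full result table on every trigger;
                              // "update" would emit only the windows that changed
      .format("console")
      .start()

    query.awaitTermination(30000) // run for about 30 seconds
    query.stop()
    spark.stop()
  }
}
```

Append mode is not used here because, without a watermark, a windowed aggregate is never final and Spark rejects append for it.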
Structured Streaming – Features and Highlights
(Late Data Handling)
Structured Streaming – Features and Highlights
(Watermarking (“Data too late!”))
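A watermark tells the engine how long to keep waiting for late events before a window is finalized and its state dropped. The sketch below is one possible way to express that, assuming Spark 2.2+ in local mode; the object name LateDataSketch and the 10-second lateness threshold are hypothetical and kept short only so that output appears quickly.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, window}

object LateDataSketch {
  // Count events per 5-second window, accepting data up to 10 seconds late.
  // On a batch DataFrame the watermark is a no-op, so the counts are simply exact.
  def lateTolerantCounts(events: DataFrame): DataFrame =
    events
      .withWatermark("timestamp", "10 seconds")
      .groupBy(window(col("timestamp"), "5 seconds"))
      .count()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("late-data-sketch")
      .getOrCreate()

    val events = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "5")
      .load()

    val query = lateTolerantCounts(events)
      .writeStream
      .outputMode("append") // a window is emitted once, after the watermark passes its end
      .format("console")
      .start()

    query.awaitTermination(60000) // run for about a minute
    query.stop()
    spark.stop()
  }
}
```

Events older than the watermark ("data too late") are no longer guaranteed to be counted, which is what allows the state for old windows to be discarded.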
Structured Streaming – Features and Highlights
• New data formats:
• Native multi-line JSON support
• Native CSV data source
• Stateful processing and time-outs beyond aggregations
• Using mapGroupsWithState and flatMapGroupsWithState
• New built-in ‘rate’ source for benchmarking and testing for data generation
• x number of events, <xyz> format
• Metrics for Structured Streaming: New metrics sink
• Connect with Graphite
• Streaming listener (for metrics for every batch execution)
• Kafka 0.10 support; from_json, to_json, explode
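To make the from_json, to_json and explode functions concrete, here is a small sketch that parses a JSON payload (as it would arrive in a Kafka value column), flattens an array field, and re-serializes each row. The payload layout, the field names user and tags, and the object name JsonPayloads are assumptions made for illustration; they are not from the webinar.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode, from_json, struct, to_json}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

object JsonPayloads {
  // Hypothetical payload: {"user": "...", "tags": ["...", "..."]}
  val schema: StructType = StructType(Seq(
    StructField("user", StringType),
    StructField("tags", ArrayType(StringType))
  ))

  // Parse the `value` column as JSON and emit one row per tag.
  def parse(raw: DataFrame): DataFrame =
    raw
      .select(from_json(col("value").cast("string"), schema).as("event"))
      .select(col("event.user").as("user"), explode(col("event.tags")).as("tag"))

  // Serialize each row back into a JSON string column named `value`,
  // which is the column name the Kafka sink expects.
  def toKafkaValue(rows: DataFrame): DataFrame =
    rows.select(to_json(struct(col("user"), col("tag"))).as("value"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("json-payloads")
      .getOrCreate()
    import spark.implicits._

    // A batch DataFrame stands in for a Kafka stream here; the same
    // functions apply unchanged to a streaming DataFrame.
    val raw = Seq("""{"user":"a","tags":["x","y"]}""").toDF("value")
    toKafkaValue(parse(raw)).show(truncate = false)
    spark.stop()
  }
}
```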
Structured Streaming – Features and Highlights
• New – Input / output features:
• Kafka stream / batch writer (DStream didn't have a Kafka writer)
• Kafka batch / stream source (Kafka wasn't available as a source for batch earlier)
• Partitioning output data files (Example: Hive data output)
• Deduplication is a built-in function
• Example: Major Bank use case
• Without Structured Streaming – manual record and check for hash value in external store
• With Structured Streaming – unbounded table with hash values
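The built-in deduplication can be sketched as below. This is only an assumed shape for the bank use case: the column names txnHash and eventTime, the object name DedupSketch and the 10-minute watermark are hypothetical. The watermark bounds how long hash values are remembered, so the state does not grow forever.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object DedupSketch {
  // Drop records whose (txnHash, eventTime) pair has already been seen.
  // Including the event-time column lets Spark expire old state once
  // the watermark has moved past it.
  def dedupe(txns: DataFrame): DataFrame =
    txns
      .withWatermark("eventTime", "10 minutes")
      .dropDuplicates("txnHash", "eventTime")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("dedup-sketch")
      .getOrCreate()

    // Stand-in stream: the rate source plays the role of incoming transactions.
    val txns = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "5")
      .load()
      .select(
        col("timestamp").as("eventTime"),
        (col("value") % 5).cast("string").as("txnHash"))

    val query = dedupe(txns)
      .writeStream
      .outputMode("append")
      .format("console")
      .start()

    query.awaitTermination(20000) // run for about 20 seconds
    query.stop()
    spark.stop()
  }
}
```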
Structured Streaming – Additional Features
• Improvements (not new):
• Easier stream to batch join
• Recovering failures using checkpoint (this was there in DStream also)
• “Code Productivity” enhanced / continuous SQL over batches and aggregations (maintained by Structured Streaming)
• Enhanced batch inter-operability
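A stream to batch join is written like an ordinary DataFrame join. The sketch below enriches a stream with a small static lookup table; the customerId key, the lookup rows and the object name EnrichSketch are hypothetical, and Spark 2.2+ in local mode is assumed.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object EnrichSketch {
  // Inner join of events (streaming or batch) with a static lookup table.
  def enrich(events: DataFrame, customers: DataFrame): DataFrame =
    events.join(customers, Seq("customerId"), "inner")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("enrich-sketch")
      .getOrCreate()
    import spark.implicits._

    // Static (batch) side: a tiny in-memory lookup table.
    val customers = Seq((0L, "Ann"), (1L, "Bob"), (2L, "Cy"))
      .toDF("customerId", "name")

    // Streaming side: the rate source with a derived join key.
    val events = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "5")
      .load()
      .withColumn("customerId", col("value") % 3)

    val query = enrich(events, customers)
      .writeStream
      .outputMode("append")
      .format("console")
      .start()

    query.awaitTermination(20000) // run for about 20 seconds
    query.stop()
    spark.stop()
  }
}
```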
Spark Version Management Considerations
(Migration, Co-existence)
• Co-existence of 1.6 and 2.x – on the same Hadoop cluster
• Forward compatibility changes
• SparkSession is now the new entry point of Spark
• Replaces the old (1.x) SQLContext and HiveContext
• Dataset API and DataFrame API are unified
• Scala: DataFrame becomes a type alias for Dataset[Row]
• Java API users must replace DataFrame with Dataset<Row>
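For code being moved from 1.x, the entry-point change looks roughly like this. It is a minimal sketch assuming Spark 2.x in local mode; the object name SessionMigration and the view name numbers are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

object SessionMigration {
  // Register a tiny table and query it through the SparkSession,
  // where 1.x code would have used a SQLContext or HiveContext.
  def total(spark: SparkSession): Long = {
    val df: DataFrame = spark.range(3).toDF("n")
    // In Scala, DataFrame is a type alias for Dataset[Row],
    // so this assignment needs no conversion.
    val ds: Dataset[Row] = df
    ds.createOrReplaceTempView("numbers")
    spark.sql("SELECT sum(n) AS total FROM numbers").first().getLong(0)
  }

  def main(args: Array[String]): Unit = {
    // Single entry point in 2.x.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("session-migration")
      .getOrCreate()

    // The 1.x-style handles remain reachable for code not yet migrated.
    val sc = spark.sparkContext
    val legacySqlContext = spark.sqlContext

    println(s"Spark ${sc.version}, total = ${total(spark)}")
    spark.stop()
  }
}
```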
Structured Streaming – Limitations
• Machine learning support still weak (coming soon)
• Multiple (chained) aggregations not supported
• limit, take, collect, show, count, foreach do not work on streaming DataFrames
• Join limitations
• Caching for multiple actions
• Aggregation queries / SQL on single micro batch
• No Kinesis support
• Java 8 only
Structured Streaming – Future: Mid-Long Term
• Streaming without micro-batches
• ~1 ms latency has been promised (and without code changes)
• Berkeley Drizzle project: potential replacement of the streaming engine
• For users: will not be much different
• No changes in code
Talent vs. Tooling
Shortage of Talent and the Urgent Need For It
• Spark projects are increasing
• Need to get done quickly with budget controls
• The big barrier
• Talent - Deep Spark / Scala skills are hard to find
• Big gap between Spark prototype app vs. production grade scale, stability
• Lot of engineers on other projects need to be made productive quickly
The Need for Tooling
• Need very good enterprise grade, UI driven tooling around Spark to make it easy
• Need to cover all bases:
• Development, Debugging, Deployment, DevOps, Monitoring
• Also need to cover the full data processing journey
• Ingest
• Data Quality
• Blending
• Transformation / Enrichment
• Analytics / Machine Learning
• Loading of target databases
• Visualization
StreamAnalytix – “Visual Spark” and More…
• StreamAnalytix is one such platform which makes Spark easy
• Drag-and-drop UI to build and deploy Spark apps in minutes
• Real-time and Batch Data360 platform – on Apache Spark 2.1
• Support for Spark 2.2 and Structured Streaming coming in 4Q
About StreamAnalytix
• Based on multiple open-source engines – Spark, Storm and Flink (future)
• On-premise and cloud compatible
• Enterprise grade – UI-driven streaming, IoT and batch analytics and machine learning platform
Please provide your feedback on the webinar and your
interest to attend our upcoming webinars.
Meet us at Booth # 127
Strata Data Conference in New York
September 26-28, 2017
Thank you.
Questions?
© 2017 Impetus Technologies –
Confidential

Weitere ähnliche Inhalte

Was ist angesagt?

Cloud Experience: Data-driven Applications Made Simple and Fast
Cloud Experience: Data-driven Applications Made Simple and FastCloud Experience: Data-driven Applications Made Simple and Fast
Cloud Experience: Data-driven Applications Made Simple and FastDatabricks
 
Data Science in the Cloud with Spark, Zeppelin, and Cloudbreak
Data Science in the Cloud with Spark, Zeppelin, and CloudbreakData Science in the Cloud with Spark, Zeppelin, and Cloudbreak
Data Science in the Cloud with Spark, Zeppelin, and CloudbreakDataWorks Summit
 
A Continuously Deployed Hadoop Analytics Platform?
A Continuously Deployed Hadoop Analytics Platform?A Continuously Deployed Hadoop Analytics Platform?
A Continuously Deployed Hadoop Analytics Platform?DataWorks Summit/Hadoop Summit
 
Deploying Big Data Platforms
Deploying Big Data PlatformsDeploying Big Data Platforms
Deploying Big Data PlatformsChris Kernaghan
 
Spark and Hadoop at Production Scale-(Anil Gadre, MapR)
Spark and Hadoop at Production Scale-(Anil Gadre, MapR)Spark and Hadoop at Production Scale-(Anil Gadre, MapR)
Spark and Hadoop at Production Scale-(Anil Gadre, MapR)Spark Summit
 
Spark Summit EU talk by Pat Patterson
Spark Summit EU talk by Pat PattersonSpark Summit EU talk by Pat Patterson
Spark Summit EU talk by Pat PattersonSpark Summit
 
Lambda architecture for real time big data
Lambda architecture for real time big dataLambda architecture for real time big data
Lambda architecture for real time big dataTrieu Nguyen
 
Migrating Your Data Platform At a High Growth Startup
Migrating Your Data Platform At a High Growth StartupMigrating Your Data Platform At a High Growth Startup
Migrating Your Data Platform At a High Growth StartupDatabricks
 
Visualizing Big Data in Realtime
Visualizing Big Data in RealtimeVisualizing Big Data in Realtime
Visualizing Big Data in RealtimeDataWorks Summit
 
Big Data – A New Testing Challenge
Big Data – A New Testing ChallengeBig Data – A New Testing Challenge
Big Data – A New Testing ChallengeTEST Huddle
 
SQL Analytics Powering Telemetry Analysis at Comcast
SQL Analytics Powering Telemetry Analysis at ComcastSQL Analytics Powering Telemetry Analysis at Comcast
SQL Analytics Powering Telemetry Analysis at ComcastDatabricks
 
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...Databricks
 
Weathering the Data Storm – How SnapLogic and AWS Deliver Analytics in the Cl...
Weathering the Data Storm – How SnapLogic and AWS Deliver Analytics in the Cl...Weathering the Data Storm – How SnapLogic and AWS Deliver Analytics in the Cl...
Weathering the Data Storm – How SnapLogic and AWS Deliver Analytics in the Cl...SnapLogic
 
MLOps with a Feature Store: Filling the Gap in ML Infrastructure
MLOps with a Feature Store: Filling the Gap in ML InfrastructureMLOps with a Feature Store: Filling the Gap in ML Infrastructure
MLOps with a Feature Store: Filling the Gap in ML InfrastructureData Science Milan
 
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...Databricks
 
Apache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsDr. Mirko Kämpf
 
Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...
Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...
Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...Big Data Spain
 
PayPal datalake journey | teradata - edge of next | san diego | 2017 october ...
PayPal datalake journey | teradata - edge of next | san diego | 2017 october ...PayPal datalake journey | teradata - edge of next | san diego | 2017 october ...
PayPal datalake journey | teradata - edge of next | san diego | 2017 october ...Deepak Chandramouli
 
Operationalizing Edge Machine Learning with Apache Spark with Nisha Talagala ...
Operationalizing Edge Machine Learning with Apache Spark with Nisha Talagala ...Operationalizing Edge Machine Learning with Apache Spark with Nisha Talagala ...
Operationalizing Edge Machine Learning with Apache Spark with Nisha Talagala ...Databricks
 

Was ist angesagt? (20)

Cloud Experience: Data-driven Applications Made Simple and Fast
Cloud Experience: Data-driven Applications Made Simple and FastCloud Experience: Data-driven Applications Made Simple and Fast
Cloud Experience: Data-driven Applications Made Simple and Fast
 
Data Science in the Cloud with Spark, Zeppelin, and Cloudbreak
Data Science in the Cloud with Spark, Zeppelin, and CloudbreakData Science in the Cloud with Spark, Zeppelin, and Cloudbreak
Data Science in the Cloud with Spark, Zeppelin, and Cloudbreak
 
A Continuously Deployed Hadoop Analytics Platform?
A Continuously Deployed Hadoop Analytics Platform?A Continuously Deployed Hadoop Analytics Platform?
A Continuously Deployed Hadoop Analytics Platform?
 
Deploying Big Data Platforms
Deploying Big Data PlatformsDeploying Big Data Platforms
Deploying Big Data Platforms
 
Spark and Hadoop at Production Scale-(Anil Gadre, MapR)
Spark and Hadoop at Production Scale-(Anil Gadre, MapR)Spark and Hadoop at Production Scale-(Anil Gadre, MapR)
Spark and Hadoop at Production Scale-(Anil Gadre, MapR)
 
OpenPOWER Update
OpenPOWER UpdateOpenPOWER Update
OpenPOWER Update
 
Spark Summit EU talk by Pat Patterson
Spark Summit EU talk by Pat PattersonSpark Summit EU talk by Pat Patterson
Spark Summit EU talk by Pat Patterson
 
Lambda architecture for real time big data
Lambda architecture for real time big dataLambda architecture for real time big data
Lambda architecture for real time big data
 
Migrating Your Data Platform At a High Growth Startup
Migrating Your Data Platform At a High Growth StartupMigrating Your Data Platform At a High Growth Startup
Migrating Your Data Platform At a High Growth Startup
 
Visualizing Big Data in Realtime
Visualizing Big Data in RealtimeVisualizing Big Data in Realtime
Visualizing Big Data in Realtime
 
Big Data – A New Testing Challenge
Big Data – A New Testing ChallengeBig Data – A New Testing Challenge
Big Data – A New Testing Challenge
 
SQL Analytics Powering Telemetry Analysis at Comcast
SQL Analytics Powering Telemetry Analysis at ComcastSQL Analytics Powering Telemetry Analysis at Comcast
SQL Analytics Powering Telemetry Analysis at Comcast
 
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...
 
Weathering the Data Storm – How SnapLogic and AWS Deliver Analytics in the Cl...
Weathering the Data Storm – How SnapLogic and AWS Deliver Analytics in the Cl...Weathering the Data Storm – How SnapLogic and AWS Deliver Analytics in the Cl...
Weathering the Data Storm – How SnapLogic and AWS Deliver Analytics in the Cl...
 
MLOps with a Feature Store: Filling the Gap in ML Infrastructure
MLOps with a Feature Store: Filling the Gap in ML InfrastructureMLOps with a Feature Store: Filling the Gap in ML Infrastructure
MLOps with a Feature Store: Filling the Gap in ML Infrastructure
 
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...
 
Apache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific Applciations
 
Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...
Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...
Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...
 
PayPal datalake journey | teradata - edge of next | san diego | 2017 october ...
PayPal datalake journey | teradata - edge of next | san diego | 2017 october ...PayPal datalake journey | teradata - edge of next | san diego | 2017 october ...
PayPal datalake journey | teradata - edge of next | san diego | 2017 october ...
 
Operationalizing Edge Machine Learning with Apache Spark with Nisha Talagala ...
Operationalizing Edge Machine Learning with Apache Spark with Nisha Talagala ...Operationalizing Edge Machine Learning with Apache Spark with Nisha Talagala ...
Operationalizing Edge Machine Learning with Apache Spark with Nisha Talagala ...
 

Ähnlich wie The structured streaming upgrade to Apache Spark and how enterprises can benefit- StreamAnalytix Webinar

Lambda architecture with Spark
Lambda architecture with SparkLambda architecture with Spark
Lambda architecture with SparkVincent GALOPIN
 
Apache spark 2.4 and beyond
Apache spark 2.4 and beyondApache spark 2.4 and beyond
Apache spark 2.4 and beyondXiao Li
 
What’s new in Apache Spark 2.3
What’s new in Apache Spark 2.3What’s new in Apache Spark 2.3
What’s new in Apache Spark 2.3DataWorks Summit
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformYao Yao
 
BDTC2015 databricks-辛湜-state of spark
BDTC2015 databricks-辛湜-state of sparkBDTC2015 databricks-辛湜-state of spark
BDTC2015 databricks-辛湜-state of sparkJerry Wen
 
.NET per la Data Science e oltre
.NET per la Data Science e oltre.NET per la Data Science e oltre
.NET per la Data Science e oltreMarco Parenzan
 
Databricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupPaco Nathan
 
Dev Ops Training
Dev Ops TrainingDev Ops Training
Dev Ops TrainingSpark Summit
 
What's New in Upcoming Apache Spark 2.3
What's New in Upcoming Apache Spark 2.3What's New in Upcoming Apache Spark 2.3
What's New in Upcoming Apache Spark 2.3Databricks
 
Data streaming
Data streamingData streaming
Data streamingAlberto Paro
 
Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkSlim Baltagi
 
Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...DataWorks Summit
 
2018 02-08-what's-new-in-apache-spark-2.3
2018 02-08-what's-new-in-apache-spark-2.3 2018 02-08-what's-new-in-apache-spark-2.3
2018 02-08-what's-new-in-apache-spark-2.3 Chester Chen
 
Spark Summit EU talk by Stephan Kessler
Spark Summit EU talk by Stephan KesslerSpark Summit EU talk by Stephan Kessler
Spark Summit EU talk by Stephan KesslerSpark Summit
 
Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ...
Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ...Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ...
Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ...confluent
 
2016 Spark Summit East Keynote: Matei Zaharia
2016 Spark Summit East Keynote: Matei Zaharia2016 Spark Summit East Keynote: Matei Zaharia
2016 Spark Summit East Keynote: Matei ZahariaDatabricks
 
Fighting Fraud with Apache Spark
Fighting Fraud with Apache SparkFighting Fraud with Apache Spark
Fighting Fraud with Apache SparkMiklos Christine
 
AnilKumarT_Resume_latest
AnilKumarT_Resume_latestAnilKumarT_Resume_latest
AnilKumarT_Resume_latestanil_thyagarajan
 
XStream: stream processing platform at facebook
XStream:  stream processing platform at facebookXStream:  stream processing platform at facebook
XStream: stream processing platform at facebookAniket Mokashi
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with DseDataStax Academy
 

Ähnlich wie The structured streaming upgrade to Apache Spark and how enterprises can benefit- StreamAnalytix Webinar (20)

Lambda architecture with Spark
Lambda architecture with SparkLambda architecture with Spark
Lambda architecture with Spark
 
Apache spark 2.4 and beyond
Apache spark 2.4 and beyondApache spark 2.4 and beyond
Apache spark 2.4 and beyond
 
What’s new in Apache Spark 2.3
What’s new in Apache Spark 2.3What’s new in Apache Spark 2.3
What’s new in Apache Spark 2.3
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
 
BDTC2015 databricks-辛湜-state of spark
BDTC2015 databricks-辛湜-state of sparkBDTC2015 databricks-辛湜-state of spark
BDTC2015 databricks-辛湜-state of spark
 
.NET per la Data Science e oltre
.NET per la Data Science e oltre.NET per la Data Science e oltre
.NET per la Data Science e oltre
 
The Structured Streaming Upgrade to Apache Spark and How Enterprises Can Benefit: StreamAnalytix Webinar

  • 1. © 2017 Impetus Technologies. WEBINAR: The Structured Streaming Upgrade to Apache Spark and How Enterprises Can Benefit (August 2017)
      • Anand Venugopal, Product Head & AVP, StreamAnalytix
      • Amit Assudani, Sr. Technical Architect (Spark), StreamAnalytix
  • 2. Quick Webinar Notes
      • Our focus: enabling the real-time enterprise and making Spark easy to use
      • Sharing our experience and expertise with you
      • Level of content: 20-80 :: New-Experienced (w.r.t. Spark)
      • Format: a combination of panel discussion and presentation
      • Uses some artifacts and pictures from the Apache Spark website and other public sources
      • Q&A and interactions are important and highly valued
      • Please send us your comments and feedback using the Webex console
  • 3. Webinar Outline
      • About Impetus and what is StreamAnalytix? (2 minutes)
      • Apache Spark: the basics and its evolution (8 minutes)
      • A deep dive into Structured Streaming (25 minutes)
          • What is it?
          • How is it different from 1.0?
          • Features and technical highlights
          • Benefits and limitations
          • Upgrades and migrations
          • Future roadmap
      • Talent vs Tooling (5 minutes)
      • Q&A (5+ minutes)
  • 4. About Impetus
      • Mission critical technology solutions since 1996
      • Fortune 500: Big Data clients
      • 1700 people; US, India, global reach
      • Unique mix of Big Data products and services
  • 5. Real-time Stream Processing & Machine Learning Platform + Visual Spark Studio
  • 6. Apache Spark: The Beginning
      • Started as a project at UC Berkeley's AMPLab in 2009 by Matei Zaharia; open sourced (BSD license) in 2010
      • Framework on a distributed resource management system (Mesos)
      • Aimed to speed up ML jobs in Apache Hadoop with an in-memory approach
      • 30x performance increase on Hadoop jobs
  • 7. Apache Spark: Current State
      • Robust, widely used technology
      • Survey by Taneja Group in November 2016 highlights:
          • 54% of 7000 enterprise participants said they were actively using Spark
          • 55% of workloads were ETL / data processing / engineering
          • Cloud deployments projected well beyond 30%
          • Popular new initiatives: data science exploration, streaming and machine learning
      • Diagram labels: Micro-batch, Hi-speed Batch, Interactive, Iterative, Graph, Streaming; sits on Hadoop and/or Cloud
  • 8. Spark Evolution: Spark 0.x
      • Spark 0.7-0.9 (Feb 2014)
          • Features:
              • Becomes a top level Apache project
              • RDD concept introduced with Spark
              • Scala and Java bindings
              • Adds a Python API called PySpark
              • Introduces Spark Streaming
              • Introduces MLlib
              • Includes a first version of GraphX
          • Remarks:
              • PySpark makes it possible to use Spark from Python
              • Spark Streaming adds near real-time processing capability
              • Spark Streaming is now out of alpha and includes significant optimizations and simplified high availability deployment
  • 9. Spark Evolution: Spark 1.0-1.2
      • Spark 1.0 (May 2014)
          • Features:
              • Adds Spark SQL
              • Guarantees stability of its core API
              • Full support for running seamlessly in secured Hadoop clusters
          • Remarks:
              • Spark 1.0 was the first production ready, backward compatible release. It viewed Spark Streaming as faster batch processing rather than streaming
              • Became the first open source Big Data framework to embrace in-memory computing
      • Spark 1.1 (Sep 2014)
          • Features:
              • Migrates all customer workloads from Shark to Spark SQL
              • Expansion of MLlib
              • Extends libraries and sources for Spark Streaming
          • Remarks:
              • First minor release in the 1.x series. Added significant extensions to the newly added Spark SQL and to Spark MLlib
      • Spark 1.2 (Dec 2014)
          • Features:
              • A new API for external data sources
              • New H/A driver support through a Write Ahead Log (WAL), which removes any single point of failure from Spark Streaming
              • A higher-level API for constructing pipelines in the spark.ml package
              • GraphX project provides a stable API
          • Remarks:
              • Recognized the need for structured data and started to evolve to support it. Introduced a specialized RDD schema as a first step
              • However, still lacked a direct API to read structured data from Spark
  • 10. Spark Evolution: Spark 1.3-1.5
      • Spark 1.3 (Mar 2015)
          • Features:
              • A new DataFrames API
              • Provides a rich set of new MLlib algorithms
              • Adds APIs for a direct Kafka streaming source
          • Remarks:
              • DataFrames allow Spark to better understand the structure of data as well as the computation being performed
              • First unified API to read from structured and semi-structured sources (both RDBMS and NoSQL databases)
      • Spark 1.4 (Jun 2015)
          • Features:
              • Introduces SparkR
              • ML pipelines API graduates from alpha with new transformers and improved Python coverage
              • Adds visual debugging and monitoring utilities to evaluate running Spark applications
              • A REST API
              • Initial performance improvements in Project Tungsten
              • A pluggable interface for write ahead logs
          • Remarks:
              • Targets data scientists with SparkR on the new DataFrame API
              • Ships the initial pieces of Project Tungsten; becomes the first version with custom memory management
      • Spark 1.5 (Sep 2015)
          • Features:
              • First major pieces of Project Tungsten
              • New ML algorithms; extends the new R API
              • Adds visualization of SQL and DataFrame query plans in the web UI
              • Operational features for the streaming component, such as backpressure support
          • Remarks:
              • Pushes Project Tungsten
              • Focused on increasing Spark's performance through several low-level architectural optimizations
              • Another major theme was data science
  • 11. Spark Evolution: Spark 1.6
      • Spark 1.6 (Jan 2016)
          • Features:
              • Experimental Dataset API
              • New data science functionality: ML pipeline persistence and new algorithms
              • A new and efficient mapWithState API, which replaces updateStateByKey
              • Speedup of 10X for streaming state management
              • SQL queries on files
          • Remarks:
              • Datasets, a typed extension of the DataFrame API, allow working with custom objects and lambda functions while keeping the benefits of Spark SQL
  • 12. Spark Evolution: Spark 2.0-2.2
      • Spark 2.0 (Jul 2016)
          • Features:
              • A new API, Structured Streaming
              • Second generation Tungsten engine
              • Unified DataFrame and Dataset in Scala/Java
          • Remarks:
              • Substantial (2-10X) performance speedup for common operators in SQL and DataFrames with a new technique called whole stage code generation
              • Structured Streaming launched experimentally. Aims to integrate batch and stream. Introduces the concept of continuous applications
      • Spark 2.1 (Dec 2016)
          • Features:
              • Hardening of Structured Streaming (still experimental)
              • Adds a number of SQL functionalities
              • Focuses on advanced analytics
              • SparkR becomes the most comprehensive library for distributed machine learning on R
          • Remarks:
              • Introduced Structured Streaming as a high-level API for building continuous applications. Aims to make it easier to build end-to-end streaming applications. Introduces:
                  • Event-time watermarks
                  • Support for all file-based formats and all file-based features
                  • Native support for Kafka 0.10
      • Spark 2.2 (Jul 2017)
          • Features:
              • Production ready Structured Streaming
              • Focuses on advanced analytics and Python
              • Cost-based optimizer
              • Limit on the max number of records written per file
              • Support for parsing multi-line JSON and CSV files
          • Remarks:
              • The Structured Streaming APIs are now GA and are no longer labeled experimental
              • Adds various SQL functionalities and introduces additional algorithms in MLlib and GraphX
  • 13. Poll Question: What is your currently used Spark version?
      • 1.6 or prior
      • 2.1
      • 2.2
      • Planning to start soon
      • No plans
  • 14. A Deep Dive into Structured Streaming
  • 15. Structured Streaming: What is it?
      • Strongly improved framework over Spark Streaming (DStream API) of Spark 1.x
      • High level streaming API built on Spark SQL (DataFrame/Dataset API) and the Catalyst Optimizer
      • Express streaming computations the same way as batch computations
      • Repeated query / incremental execution on an unbounded table
  • 16. Structured Streaming: What is it?
      • "NO REASONING ABOUT STREAMING"
      • Simply define a flow:
          • source -> transformation -> sink -> mode & trigger time -> checkpoint
      • Structured Streaming makes Streaming ETL + Analytics easier and a natural single flow
      • Not restricted to hard batch duration limits (delivers lower latency)
      • Exactly-once guarantee now truly end-to-end: includes the sink layer
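The flow on this slide (source, transformation, sink, mode and trigger time, checkpoint) could be wired up roughly as follows. This is a minimal sketch rather than the presenters' code: the object name `FlowSketch`, the helper `evenValues`, the Parquet sink and both directory paths are assumptions chosen for illustration, and it assumes Spark 2.2 or later for the built-in `rate` source.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.streaming.{StreamingQuery, Trigger}

// Hypothetical names throughout; only the Spark API calls are real.
object FlowSketch {
  // Transformation: the same function works on batch and streaming DataFrames.
  def evenValues(input: DataFrame): DataFrame =
    input.filter(col("value") % 2 === 0)

  def start(spark: SparkSession, outputDir: String, checkpointDir: String): StreamingQuery = {
    // Source: the built-in "rate" source emits (timestamp, value) rows.
    val source = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "5")
      .load()

    evenValues(source).writeStream
      .format("parquet")                              // sink
      .option("path", outputDir)
      .outputMode("append")                           // mode
      .trigger(Trigger.ProcessingTime("5 seconds"))   // trigger time
      .option("checkpointLocation", checkpointDir)    // checkpoint
      .start()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("flow-sketch").master("local[2]").getOrCreate()
    // Placeholder paths, assumed for illustration only.
    start(spark, "/tmp/flow-sketch/out", "/tmp/flow-sketch/checkpoint").awaitTermination()
  }
}
```

Because `evenValues` takes a plain `DataFrame`, the same transformation can be exercised against a small batch DataFrame in a unit test before it is attached to a stream.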
  • 17. Structured Streaming: Code Snippet (Structured Streaming vs Batch)

      // Structured Streaming
      // Read from Kafka as an unbounded stream.
      val streamDf = spark
        .readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
        .option("subscribe", "topic1")
        .load()

      // Cast the binary key/value columns to strings and write that result to Kafka.
      streamDf.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
        .writeStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
        .option("topic", "topic1")
        // Placeholder path: the Kafka sink needs a checkpoint location.
        .option("checkpointLocation", "/path/to/checkpoint")
        .start()

      // Batch
      // Same source and sink options; only read/write and save() differ.
      val batchDf = spark
        .read
        .format("kafka")
        .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
        .option("subscribe", "topic1")
        .load()

      batchDf.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
        .write
        .format("kafka")
        .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
        .option("topic", "topic1")
        .save()
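  • 18. Structured Streaming: Code Snippet (DStreams vs Batch)

      // DStream
      import org.apache.kafka.clients.consumer.ConsumerRecord
      import org.apache.kafka.common.serialization.StringDeserializer
      import org.apache.spark.streaming.dstream.InputDStream
      import org.apache.spark.streaming.kafka010.KafkaUtils
      import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
      import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

      val topics = Array("topicA", "topicB")
      val kafkaParams = Map[String, Object](
        "bootstrap.servers" -> "localhost:9092,anotherhost:9092",
        // The Kafka consumer requires explicit deserializers.
        "key.deserializer" -> classOf[StringDeserializer],
        "value.deserializer" -> classOf[StringDeserializer],
        "group.id" -> "use_a_separate_group_id_for_each_stream",
        "auto.offset.reset" -> "latest",
        "enable.auto.commit" -> (false: java.lang.Boolean)
      )
      // streamingContext is an existing StreamingContext.
      val stream: InputDStream[ConsumerRecord[String, String]] =
        KafkaUtils.createDirectStream[String, String](
          streamingContext,
          PreferConsistent,
          Subscribe[String, String](topics, kafkaParams)
        )
      stream.map(record => (record.key, record.value))
      // No Kafka write support in the DStream API

      // Batch
      val batchDf = spark
        .read
        .format("kafka")
        .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
        .option("subscribe", "topic1")
        .load()

      // Cast the binary key/value columns to strings and write that result to Kafka.
      batchDf.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
        .write
        .format("kafka")
        .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
        .option("topic", "topic1")
        .save()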
  • 19. Streaming Code: Executed on "Trigger" (One Time Batch)

      // Structured Streaming
      import org.apache.spark.sql.streaming.Trigger

      val streamDf = spark
        .readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
        .option("subscribe", "topic1")
        .load()

      // One Time Trigger: process whatever is available, then stop.
      streamDf.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
        .writeStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
        .option("topic", "topic1")
        // Placeholder path: the checkpoint records which offsets were already processed.
        .option("checkpointLocation", "/path/to/checkpoint")
        .trigger(Trigger.Once())
        .start()

      • No need to work out "changed data" or output consistency by hand
      • Much easier stateful processing, such as deduplication
      • Unified code: no separate code bases for Lambda solutions
      • Cost saving by not running the cluster 24/7
  • 20. Poll Results
  • 21. Structured Streaming: Features and Highlights (Event Time; Window Duration and Triggers)
      • Event time orientation
          • In combination with "windows" and triggers
      • Aggregates maintained by Structured Streaming
          • No need to write separate code
      • Incremental query and output modes
          • append / complete / update
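An event-time window aggregation of the kind described on this slide might look like the sketch below. The column names `eventTime` and `deviceId`, the 10-minute window and the object name `WindowSketch` are assumptions, not taken from the webinar.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, window}

// Hypothetical example; expects columns eventTime (timestamp) and deviceId (string).
object WindowSketch {
  // Counts events per device in 10-minute tumbling windows keyed on event time,
  // not on arrival time. Spark keeps the running aggregates between triggers.
  def countsPerWindow(events: DataFrame): DataFrame =
    events
      .groupBy(window(col("eventTime"), "10 minutes"), col("deviceId"))
      .count()
}
```

On a streaming DataFrame, a result like this can be written with `outputMode("complete")` (the whole result table on each trigger) or `outputMode("update")` (only the rows that changed). Append mode for aggregations additionally needs a watermark.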
  • 22. Structured Streaming: Features and Highlights (Late Data Handling)
  • 23. Structured Streaming: Features and Highlights (Watermarking ("Data too late!"))
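Late data handling and watermarking can be expressed with `withWatermark`. The following is a sketch under assumed names: the `eventTime` and `deviceId` columns, the 10-minute lateness threshold and the 5-minute window are illustrative choices, not values from the slides.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, window}

// Hypothetical example; expects columns eventTime (timestamp) and deviceId (string).
object WatermarkSketch {
  // Late events still update their window as long as they arrive within the
  // assumed 10-minute threshold behind the latest event time seen so far.
  // Older state can then be dropped, and later arrivals may be ignored.
  def countsWithWatermark(events: DataFrame): DataFrame =
    events
      .withWatermark("eventTime", "10 minutes")
      .groupBy(window(col("eventTime"), "5 minutes"), col("deviceId"))
      .count()
}
```

With the watermark in place, a streaming query over this result can also use append mode, emitting a window once the watermark has passed its end. On a batch DataFrame the watermark has no effect.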
  • 24. Structured Streaming: Features and Highlights
      • New data formats:
          • Native multi-line JSON support
          • Native CSV data source
      • Stateful processing and time-outs beyond aggregations
          • Using mapGroupsWithState and flatMapGroupsWithState
      • New built-in "rate" source that generates data for benchmarking and testing
          • x number of events, <xyz> format
      • Metrics for Structured Streaming: new metrics sink
          • Connect with Graphite
          • Streaming listener (for metrics on every batch execution)
      • Kafka 0.10 support; from_json, to_json, explode
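Stateful processing with time-outs through `mapGroupsWithState` could be sketched as below. The case classes `Event` and `UserCount`, the object `StateSketch` and the 30-minute timeout are assumptions made for illustration; the sketch assumes Spark 2.2 or later.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

// Hypothetical record types, not from the webinar.
case class Event(userId: String, action: String)
case class UserCount(userId: String, events: Long, expired: Boolean)

object StateSketch {
  // Called per user with the new events of a trigger and that user's stored count.
  def updateCount(userId: String, events: Iterator[Event], state: GroupState[Long]): UserCount =
    if (state.hasTimedOut) {
      // No events for this user within the timeout: emit a final record and drop the state.
      val last = state.get
      state.remove()
      UserCount(userId, last, expired = true)
    } else {
      val total = state.getOption.getOrElse(0L) + events.size
      state.update(total)
      state.setTimeoutDuration("30 minutes") // assumed inactivity timeout
      UserCount(userId, total, expired = false)
    }

  def runningCounts(spark: SparkSession, events: Dataset[Event]): Dataset[UserCount] = {
    import spark.implicits._
    events
      .groupByKey(_.userId)
      .mapGroupsWithState[Long, UserCount](GroupStateTimeout.ProcessingTimeTimeout())(updateCount _)
  }
}
```

On a streaming Dataset, a query using `mapGroupsWithState` is written in update output mode. `flatMapGroupsWithState` follows the same pattern but may return zero or more records per group.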
  • 25. Structured Streaming – Features and Highlights
      • New input / output features:
          • Kafka stream / batch writer (DStreams had no Kafka writer)
          • Kafka batch / stream source (Kafka was not previously available as a batch source)
          • Partitioning of output data files (example: Hive data output)
      • Deduplication is a built-in function
          • Example: major bank use case
          • Without Structured Streaming: manually record and check hash values in an external store
          • With Structured Streaming: unbounded table with hash values
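
The built-in deduplication might be used roughly as follows, assuming Spark 2.2 or later. The `recordHash` column name, the object and function names, and the synthetic hashes derived from the `rate` source are hypothetical stand-ins for the bank use case, whose details the slides do not give.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object Dedup {
  // Keeps the first row seen for each value of "recordHash".
  // In a streaming query Spark stores the seen hashes as state, so no
  // external store is needed; without a watermark that state grows unbounded.
  def dropRepeats(events: DataFrame): DataFrame =
    events.dropDuplicates("recordHash")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("dedup").getOrCreate()

    // Synthetic input: only 100 distinct hashes, so later rows are dropped as repeats.
    val events = spark.readStream.format("rate").load()
      .selectExpr("CAST(value % 100 AS STRING) AS recordHash", "timestamp")

    dropRepeats(events).writeStream
      .outputMode("append")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```

To bound the state, one option is to call `withWatermark` on an event-time column and include that column in `dropDuplicates`, so hashes older than the watermark can be discarded.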
  • 26. Structured Streaming – Additional Features
      • Improvements (not new):
          • Easier stream-to-batch joins
          • Recovering from failures using checkpoints (also available in DStreams)
          • “Code productivity” enhanced: continuous SQL over batches and aggregations (maintained by Structured Streaming)
          • Enhanced batch interoperability
  • 27. Spark Version Management Considerations (Migration, Co-existence)
      • Co-existence of 1.6 and 2.x on the same Hadoop cluster
      • Forward compatibility changes
          • SparkSession is now the entry point of Spark
              • Replaces the old (1.x) SQLContext and HiveContext
          • Dataset API and DataFrame API are unified
              • Scala: DataFrame becomes a type alias for Dataset[Row]
              • Java API users must replace DataFrame with Dataset<Row>
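
The migration points above can be sketched in a few lines of Scala, assuming Spark 2.x on a local master. The object name, application name, and sample data are illustrative only.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

object SessionEntryPoint {
  // SparkSession is the single entry point in 2.x. Adding .enableHiveSupport()
  // to the builder covers what HiveContext did in 1.x (needs Hive on the classpath).
  def buildSession(): SparkSession =
    SparkSession.builder()
      .master("local[*]")
      .appName("migration-sketch")
      .getOrCreate()

  def main(args: Array[String]): Unit = {
    val spark = buildSession()
    import spark.implicits._

    val df: DataFrame = Seq(("a", 1), ("b", 2)).toDF("key", "value")

    // Compiles without conversion because DataFrame is an alias for Dataset[Row].
    val rows: Dataset[Row] = df

    // SQL runs directly on the session; no separate SQLContext is required.
    rows.createOrReplaceTempView("kv")
    spark.sql("SELECT key FROM kv WHERE value > 1").show()

    // Code still written against the 1.x API can obtain a SQLContext from the session.
    val legacySqlContext = spark.sqlContext
    println(legacySqlContext.sparkSession eq spark)

    spark.stop()
  }
}
```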
  • 28. Structured Streaming – Limitations
      • Machine learning support still weak (coming soon)
      • Multiple (chained) aggregations not supported
      • limit, take, collect, show, count, foreach do not work on streaming Datasets
      • Join limitations
      • Caching for multiple actions
      • Aggregation queries / SQL on a single micro-batch
      • No Kinesis support
      • Java 8 only
  • 29. Structured Streaming – Future: Mid-Long Term
      • Streaming without micro-batches
          • ~1 ms latency has been promised (and without code changes)
      • Berkeley Drizzle project: potential replacement of the streaming engine
      • For users: will not be much different
          • No changes in code
  • 30. Talent vs. Tooling
  • 31. Shortage of Talent and the Urgent Need For It
      • Spark projects are increasing
      • They need to get done quickly, with budget controls
      • The big barrier
          • Talent: deep Spark / Scala skills are hard to find
          • Big gap between a Spark prototype app and production-grade scale and stability
          • Many engineers on other projects need to be made productive quickly
  • 32. The Need for Tooling
      • Need very good enterprise-grade, UI-driven tooling around Spark to make it easy
      • Need to cover all bases:
          • Development, Debugging, Deployment, DevOps, Monitoring
      • Also need to cover the full data processing journey
          • Ingest
          • Data Quality
          • Blending
          • Transformation / Enrichment
          • Analytics / Machine Learning
          • Loading of target databases
          • Visualization
  • 33. StreamAnalytix – “Visual Spark” and More…
      • StreamAnalytix is one such platform which makes Spark easy
      • Drag-and-drop UI to build and deploy Spark apps in minutes
      • Real-time and Batch Data360 platform on Apache Spark 2.1
      • Support for Spark 2.2 and Structured Streaming coming in 4Q
  • 34. About StreamAnalytix
      • Based on multiple open-source engines – Spark, Storm and Flink (future)
      • On-premise and cloud compatible
      • Enterprise grade – UI driven
      • Streaming, IoT and batch analytics and machine learning platform
  • 35. Please provide your feedback on the webinar and let us know if you are interested in attending our upcoming webinars. Meet us at Booth #127 at the Strata Data Conference in New York, September 26-28, 2017.
  • 36. Thank you. Questions?