New Frontiers for
Apache Spark
Matei Zaharia
@matei_zaharia
Welcome to Spark Summit 2017
Our largest summit, following another year of community growth
[Chart: Spark Meetup Members Worldwide — 66K (2015), 225K (2016), 365K (2017)]
[Chart: Spark Version Usage in Databricks, 06/2016–06/2017 — share of workloads on versions 1.5, 1.6, 2.0, and 2.1]
Summit Highlights
Apache Spark Philosophy
Unified engine for complete data applications
High-level, user-friendly APIs
SQL, Streaming, ML, Graph, …
Coming in Spark 2.2
• Data warehousing: cost-based SQL optimizer
• Structured Streaming: marked production-ready
• Python usability: pip install pyspark
Currently in release candidate stage on dev list
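As a quick sketch of what the Python usability change means in practice (the PyPI package name is the official one; everything else is a minimal local example):

# Install Spark itself from PyPI — no separate download or cluster needed:
#   pip install pyspark

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()
spark.range(10).show()  # works immediately in a plain Python environment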
Two New Open Source Efforts from Databricks
1. Deep Learning
2. Streaming Performance
Deep Learning has Huge Potential
Unprecedented ability to work with unstructured data such as images and text
But Deep Learning is Hard to Use
Current APIs (TensorFlow, Keras, BigDL, etc.) are low-level
• Build a computation graph from scratch
• Scale-out typically requires manual parallelization
Hard to expose models in larger applications
Very similar to early big data APIs (MapReduce)
Our Goal
Enable an order of magnitude more users to build applications
using deep learning
Provide scale & production use out of the box
Deep Learning Pipelines
A new high-level API for deep learning that integrates with
Apache Spark’s ML Pipelines
• Common use cases in just a few lines of code
• Automatically scale out on Spark
• Expose models in batch/streaming apps & Spark SQL
Builds on existing engines (TensorFlow, Keras, BigDL)
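For instance, applying a pre-trained image model across a cluster is a few lines (a hedged sketch based on the library's early API; readImages and DeepImagePredictor come from the sparkdl package, and the path is illustrative):

from sparkdl import readImages, DeepImagePredictor

# Load a directory of images into a Spark DataFrame
image_df = readImages("/data/images")

# Apply a pre-trained ImageNet model: no graph construction,
# and the scoring scales out on Spark automatically
predictor = DeepImagePredictor(inputCol="image", outputCol="predicted_labels",
                               modelName="InceptionV3",
                               decodePredictions=True, topK=5)
predictions_df = predictor.transform(image_df)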
Deep Learning Pipelines Demo
Tim Hunter - @timjhunter
Spark Summit 2017
Using Apache Spark and Deep Learning
• New library: Deep Learning Pipelines
• Simple API for Deep Learning, integrated with ML pipelines
• Scales common tasks with transformers and estimators
• Embeds Deep Learning models in Spark
Example: Image classification
Example: Identify the James Bond cars
Good application for Deep Learning
• Neural networks are very good at dealing with images
• Can work with complex situations:
[Examples: invariance to rotations; incomplete data]
Transfer Learning
[Diagram: transfer learning — a pre-trained network's final SoftMax classifier (GIANT PANDA 0.9, RED PANDA 0.05, RACCOON 0.01, …) is cut off, and the remaining layers act as a fixed featurizer (DeepImageFeaturizer) feeding a new, small classifier]
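In code, transfer learning fits naturally into an ML Pipeline (a hedged sketch; DeepImageFeaturizer is the sparkdl class from the diagram, while the logistic regression head, its parameters, and train_df are illustrative):

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from sparkdl import DeepImageFeaturizer

# Reuse a pre-trained InceptionV3 as a fixed feature extractor,
# then train only a small classifier on top
featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features",
                                 modelName="InceptionV3")
lr = LogisticRegression(maxIter=20, regParam=0.05, labelCol="label")
pipeline = Pipeline(stages=[featurizer, lr])

model = pipeline.fit(train_df)  # train_df: labeled images, e.g. Bond car vs. not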
Deep Learning without Deep Pockets
• Simple API for Deep Learning, integrated with MLlib
• Scales common tasks with transformers and estimators
• Embeds Deep Learning models in MLlib and Spark SQL
• Early release of Deep Learning Pipelines
https://github.com/databricks/spark-deep-learning
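To try the early release, the library can be pulled in as a Spark package (a sketch; the version coordinate is a placeholder — check the repository for the current one):

spark-submit --packages databricks:spark-deep-learning:<version> my_app.py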
Two New Open Source Efforts from Databricks
1. Deep Learning
2. Streaming Performance
Structured Streaming:
Ready For Production
Michael Armbrust - @michaelarmbrust
Spark Summit 2017
What is Structured Streaming?
Our goal was to build the easiest streaming engine using the power of Spark SQL.
• High-Level APIs — DataFrames, Datasets and SQL. Same in streaming
and in batch.
• Event-time Processing — Native support for working with out-of-order
and late data.
• End-to-end Exactly Once — Transactional both in processing and output.
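A minimal sketch to make these three points concrete (the Kafka source options and checkpoint path are illustrative; readStream/writeStream and event-time windows are the standard API):

from pyspark.sql.functions import window, col

# Same DataFrame API as batch — just readStream instead of read
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load())

# Event-time window: out-of-order and late records land in the right bucket
counts = events.groupBy(window(col("timestamp"), "10 seconds")).count()

# Checkpointing gives end-to-end exactly-once with transactional sinks
query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .option("checkpointLocation", "/tmp/checkpoints/demo")
         .start())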
Simple APIs: Benchmark — Kafka Streams
// Filter by event type and project
KStream<String, ProjectedEvent> filteredEvents = kEvents.filter((key, value) -> {
    return value.event_type.equals("view");
}).mapValues((value) -> {
    return new ProjectedEvent(value.ad_id, value.event_time);
});

// Join with campaigns table
KTable<String, String> kCampaigns = builder.table("campaigns", "campaign-state");
KTable<String, CampaignAd> deserCampaigns = kCampaigns.mapValues((value) -> {
    Map<String, String> campMap = Json.parser.readValue(value);
    return new CampaignAd(campMap.get("ad_id"), campMap.get("campaign_id"));
});
KStream<String, String> joined =
    filteredEvents.join(deserCampaigns, (value1, value2) -> {
        return value2.campaign_id;
    },
    Serdes.String(), Serdes.serdeFrom(new ProjectedEventSerializer(),
        new ProjectedEventDeserializer()));

// Group and windowed count
KStream<String, String> keyedByCampaign = joined.selectKey((key, value) -> value);
KTable<Windowed<String>, Long> counts = keyedByCampaign.groupByKey()
    .count(TimeWindows.of(10000), "time-windows");
Simple APIs: Benchmark — DataFrames
events
  .where("event_type = 'view'")
  .join(table("campaigns"), "ad_id")
  .groupBy(
    window('event_time, "10 seconds"),
    'campaign_id)
  .count()
Simple APIs: Benchmark — SQL
SELECT COUNT(*)
FROM events
JOIN campaigns USING (ad_id)
WHERE event_type = 'view'
GROUP BY
  window(event_time, '10 seconds'),
  campaign_id
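Running the SQL version over a live stream only requires registering the streaming DataFrame as a view first (a hedged sketch; the view and table names match the query above):

# A streaming DataFrame can back a temporary view; spark.sql then
# returns another streaming DataFrame running the same query
events.createOrReplaceTempView("events")

windowed_counts = spark.sql("""
    SELECT COUNT(*)
    FROM events JOIN campaigns USING (ad_id)
    WHERE event_type = 'view'
    GROUP BY window(event_time, '10 seconds'), campaign_id
""")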
Performance: Benchmark
https://data-artisans.com/blog/extending-the-yahoo-streaming-benchmark
Throughput at ~200 ms latency: 5x lower cost
Streaming queries use the Catalyst Optimizer and the Tungsten Execution Engine.
[Chart: throughput in millions of records/second (0–70) — Kafka Streams vs. Apache Flink vs. Apache Spark]
GA in Spark 2.2
• Processed over 3 trillion rows last month in production
• More production use cases of Structured Streaming than DStreams among Databricks customers
• Spark 2.2 removes the “Experimental” tag from all APIs
What about Latency?
Demo: Streaming Find James Bond
• Stream of events containing images
• Need to join with locations table and filter using ML
• Alert quickly on any suspicious Bond sightings…
Need < 10ms to catch him!
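A hedged sketch of the shape of such a query (the camera stream, the locations table, and the fitted model from the Deep Learning Pipelines demo are all assumptions for illustration):

# Score a trained ML pipeline model over a live stream and alert on hits
sightings = (spark.readStream.format("kafka")
             .option("kafka.bootstrap.servers", "broker:9092")
             .option("subscribe", "camera-events")
             .load()
             .join(spark.table("locations"), "camera_id"))

scored = model.transform(sightings)        # model: a fitted PipelineModel
alerts = scored.where("prediction = 1.0")  # suspicious Bond sightings

alerts.writeStream.format("console").start()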
Continuous Processing
A new execution mode that allows fully pipelined execution.
• Streaming execution without microbatches
• Supports async checkpointing and ~1 ms latency
• No changes required for user code
Proposal available at https://issues.apache.org/jira/browse/SPARK-20928
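As proposed (and in the form the feature eventually took in Spark 2.3), switching a query over is a one-line trigger change — a sketch reusing the alerts stream above:

# Same query; only the trigger changes, user code is untouched
query = (alerts.writeStream
         .format("console")
         .trigger(continuous="1 second")  # checkpoint interval, not a batch interval
         .start())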
Apache Spark
Structured Streaming
The easiest streaming engine is now also the fastest!
Conclusion
We’re bringing two new workloads to Apache Spark:
• Deep learning
• Low-latency streaming
Find out more in the sessions today!
Thanks
Enjoy Spark Summit!
