Apache Spark 2.4 and Beyond
Xiao Li, Wenchen Fan
Mar 2019 @ Strata Data Conf
About Us
• Software Engineers at Databricks
• Apache Spark Committers and PMC Members
Xiao Li (GitHub: gatorsmile), Wenchen Fan (GitHub: cloud-fan)
Databricks Customers Across Industries
Financial Services, Healthcare & Pharma, Media & Entertainment, Technology, Public Sector, Retail & CPG, Consumer Services, Energy & Industrial IoT, Marketing & AdTech, Data & Analytics Services
Databricks Unified Analytics Platform
[Figure labels: DATABRICKS WORKSPACE; DATABRICKS RUNTIME; DATABRICKS CLOUD SERVICE; Databricks Delta; ML Frameworks; Notebooks; Dashboards; Jobs; Models; APIs; Reliable & Scalable; Simple & Integrated; End to end ML lifecycle]
https://insights.stackoverflow.com/survey/2018
20,000+ stars on GitHub
Top Skill in 2018: Apache Spark
https://economicgraph.linkedin.com/research/linkedin-2018-emerging-jobs-report
Release: Nov 8, 2018
Blog: https://t.co/k7kEHrNZXp
More than 1,100 tickets.
Major Features on Spark 2.4
• Higher-order Functions
• Structured Streaming
• Built-in source Improvement
• Spark on Kubernetes
• PySpark Improvement
• Native Avro Support
• Image Source
• Barrier Execution
• Beta support Scala 2.12
• Various SQL Features
Survey Key Findings: The AI Dilemma
Investment in AI is Growing Quickly
However, very few are succeeding
Major contributing factors to this AI dilemma
7 different machine learning frameworks and tools
https://databricks.com/CIO-survey-report
CIO Survey Key Findings: Unified Analytics Enables AI Success
Unifying data science & engineering
Benefits Expected from a Unified Approach to Data & AI
Increased operational efficiency
More effective decision making
Accelerated time to market
Improved security
Increased innovation
Improved customer experience
Increased competitive advantage
Increased employee satisfaction
Increased customer engagement
Product/service transformation
Topline growth
Other (specify)
Big Data vs. AI Technologies
Project Hydrogen: Spark + AI
Barrier execution adds gang scheduling to Apache Spark: a distributed DL job is embedded as a Spark stage, which simplifies the distributed training workflow. [SPARK-24374]
[Figure: Task 1, Task 2, Task 3]
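To make the gang-scheduling idea concrete, here is a minimal plain-Python analogy, not Spark code. Three worker threads stand in for Task 1 to Task 3, and a `threading.Barrier` stands in for the barrier stage, so no task begins its training step until every task is ready. The names `run_gang`, `ready`, and `train` are hypothetical and chosen only for illustration.

```python
import threading

def run_gang(num_tasks):
    """Run num_tasks workers that all wait at a barrier before 'training'."""
    barrier = threading.Barrier(num_tasks)
    events = []  # list.append is atomic in CPython, so no extra lock is used

    def task(task_id):
        events.append(("ready", task_id))  # setup finished on this task
        barrier.wait()                     # block until every task is ready
        events.append(("train", task_id))  # the coordinated step starts together

    threads = [threading.Thread(target=task, args=(i,)) for i in range(num_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events

events = run_gang(3)
phases = [phase for phase, _ in events]
print(phases)
```

Every `ready` event is recorded before any `train` event, whatever order the threads run in. In Spark 2.4 itself, the corresponding entry point is `RDD.barrier()`, with `BarrierTaskContext` providing the in-task synchronization call.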
Flexible Streaming Sink
[SPARK-24565] Exposing output rows of each microbatch as a DataFrame
foreachBatch(f: (Dataset[T], Long) => Unit)
• Scala/Java/Python APIs in DataStreamWriter.
• Reuse existing batch data sources
• Write to multiple locations
• Apply additional DataFrame operations
Reuse existing batch data sources
Write to multiple locations
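The following is a minimal PySpark-style sketch of how a `foreachBatch` callback could write each micro-batch to multiple locations with ordinary batch writers. The function name `write_to_two_sinks`, the output formats, and the `/tmp/example/...` paths are assumptions made for illustration; they do not come from the slides.

```python
def write_to_two_sinks(batch_df, batch_id):
    """Write one micro-batch to two locations using ordinary batch writers."""
    batch_df.persist()  # cache so the batch is not recomputed for each write
    # Hypothetical output locations; replace with real sinks.
    batch_df.write.format("parquet").mode("append").save("/tmp/example/parquet_out")
    batch_df.write.format("json").mode("append").save("/tmp/example/json_out")
    batch_df.unpersist()

# Usage with a streaming DataFrame `stream_df` (assumed to exist):
#   query = stream_df.writeStream.foreachBatch(write_to_two_sinks).start()
```

Because the callback receives a regular DataFrame, any existing batch data source or additional DataFrame operation can be used inside it. Persisting the batch first avoids recomputing it once per sink.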
Parquet
Update from 1.8.2 to 1.10.0 [SPARK-23972].
• PARQUET-1025 - Support new min-max statistics in parquet-mr
• PARQUET-225 - INT64 support for delta encoding
• PARQUET-1142 - Enable parquet.filter.dictionary.enabled by default
Predicate pushdown
• STRING [SPARK-23972] [20x faster]
• Decimal [SPARK-24549]
• Timestamp [SPARK-24718]
• Date [SPARK-23727]
• Byte/Short [SPARK-24706]
• StringStartsWith [SPARK-24638]
• IN [SPARK-17091]
ORC
Native vectorized ORC reader is now GA!
• Native ORC reader is on by default [SPARK-23456]
• Update ORC from 1.4.1 to 1.5.2 [SPARK-24576]
• Turn on ORC filter push-down by default [SPARK-21783]
• Use native ORC reader to read Hive serde tables by default [SPARK-22279]
• Avoid creating reader for all ORC files [SPARK-25126]
Higher-order Functions
Transformations on complex objects such as arrays, maps, and structs inside columns.
tbl_nested
|-- key: long (nullable = false)
|-- values: array (nullable = false)
| |-- element: long (containsNull = false)
UDF? Expensive data serialization
1) Check for element existence
SELECT EXISTS(values, e -> e > 30) AS v
FROM tbl_nested;
2) Transform an array
SELECT TRANSFORM(values, e -> e * e) AS v
FROM tbl_nested;
3) Filter an array
SELECT FILTER(values, e -> e > 30) AS v
FROM tbl_nested;
4) Aggregate an array
SELECT AGGREGATE(values, CAST(0 AS BIGINT), (acc, value) -> acc + value) AS sum
FROM tbl_nested;
Ref Databricks Blog: http://dbricks.co/2rUKQ1A
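As a rough plain-Python analogy of what these higher-order functions compute per row, the sketch below applies the same lambdas to in-memory lists. It only illustrates the semantics and is not how Spark evaluates them. The sample rows and the helper names (`exists`, `transform`, `filter_array`, `aggregate`) are assumptions made for illustration.

```python
from functools import reduce

# Two sample rows shaped like tbl_nested: a key and an array of longs.
rows = [
    {"key": 1, "values": [10, 20, 40]},
    {"key": 2, "values": [1, 2, 3]},
]

def exists(values, pred):
    """True if any element satisfies the predicate."""
    return any(pred(e) for e in values)

def transform(values, fn):
    """Apply fn to every element, keeping the array length."""
    return [fn(e) for e in values]

def filter_array(values, pred):
    """Keep only the elements that satisfy the predicate."""
    return [e for e in values if pred(e)]

def aggregate(values, start, merge):
    """Fold the array into one value; merge takes (accumulator, element)."""
    return reduce(merge, values, start)

results = [
    {
        "key": row["key"],
        "exists": exists(row["values"], lambda e: e > 30),
        "transform": transform(row["values"], lambda e: e * e),
        "filter": filter_array(row["values"], lambda e: e > 30),
        "aggregate": aggregate(row["values"], 0, lambda acc, e: acc + e),
    }
    for row in rows
]
for r in results:
    print(r)
```

In every case the lambda runs against the elements of a single row's array, so no explode-and-regroup step and no UDF round trip is needed.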
Built-in Functions
[SPARK-23899] New or extended built-in functions for ArrayTypes and MapTypes
• 26 functions for ArrayTypes: transform, filter, aggregate, array_distinct, array_intersect, array_union, array_except, array_join, array_max, array_min, ...
• 3 functions for MapTypes: map_from_arrays, map_from_entries, map_concat
Blog: Introducing New Built-in and Higher-Order Functions for Complex Data Types in Apache Spark 2.4. https://t.co/p1TRRtabJJ
Native Spark App in K8S
New Spark scheduler backend
• PySpark support [SPARK-23984]
• SparkR support [SPARK-24433]
• Client-mode support [SPARK-23146]
• Support for mounting K8S volumes [SPARK-23529]
Blog: What’s New for Apache Spark on Kubernetes in the Upcoming Apache Spark 2.4 Release
https://t.co/uUpdUj2Z4B
What’s Next?
Safe Harbor Statement
This presentation may contain projections or other forward-looking statements regarding the upcoming release (Apache Spark 3.0). The statements are intended to outline our general direction. They are intended for information purposes only. They are not a commitment to deliver code or functionality. The development, release and timing of any feature or functionality described for Apache Spark remain at the sole discretion of ASF and the Apache Spark PMC.
What’s Next?
• Data Source APIs
• Spark on Kubernetes
• PySpark Usability
• Scala 2.12
• Various SQL Features
• GPU-aware Scheduling
• Adaptive Execution
• Hadoop 3.x
• Spark Graph
• MLflow
Project Hydrogen: Spark + AI
GPU Aware Scheduling
• GPUs are widely used for accelerating special workloads, e.g., deep learning and signal processing
It’s Hard to Productionize ML
ML Lifecycle is Manual, Inconsistent and Disconnected
Data Prep
● Low level integrations for Data and ML
● Difficult to track data used for a model
Build Model
● Ad hoc approach to track experiments
● Very hard to reproduce experiments
Deploy Model
● Multiple tightly coupled deployment options
● Different monitoring approach for each framework
What is MLflow?
Open source platform to manage ML development
• Lightweight APIs & abstractions that work with any ML library
• Designed to be useful for 1 user or 1000+ person orgs
• Runs the same way anywhere (e.g. any cloud)
Key principle: “open interface” APIs that work with any existing ML library, app, deployment tool, etc.
MLflow Components
• Tracking: Record and query experiments: code, params, results, etc.
• Projects: Code packaging for reproducible runs on any platform
• Models: Model packaging and deployment to diverse environments
What’s Next?
MLflow Tracking
• SQL database backend for scaling the tracking server (0.9)
• UI scalability improvements (0.8, 0.9, etc.)
• X-coordinate logging for metrics & batched logging (1.0)
• Fluent API for Java and Scala (1.0)
MLflow Projects
• Docker-based project environment specification (0.9)
• X-coordinate logging for metrics & batched logging (1.0)
• Packaging projects with build steps (1.0+)
MLflow Models
• Custom model logging in Python, R and Java (0.8, 0.9, 1.0)
• Better environment isolation when loading models (1.0)
• Logging schema of models (1.0+)
Challenges in Existing Graph Libraries
GraphX
• Not DataFrame based
• Not actively maintained
GraphFrames
• Limited graph pattern matching
• Semantically weak graph data model
(:Cypher)-[:FOR]->(Apache:Spark™)
Spark Graph
Given a single Property Graph data model and a Cypher query, Spark returns a tabular result [DataFrame]
Data Source API V2
• Unified API for batch and streaming
• Flexible API for high performance
implementation
• Flexible API for metadata management
JDBC source with data source v1
df.write.format("jdbc")
.option("url", ...)
.option("dbtable", ...)
.option("driver", ...)
.save()
1. Specify the remote catalog info for each operation
JDBC source with data source v1
CREATE TABLE tab1(...) USING jdbc OPTIONS("url" ..., "dbtable" ..., ...)
CREATE TABLE tab2(...) USING jdbc OPTIONS("url" ..., "dbtable" ..., ...)
SELECT * FROM tab1 JOIN tab2
INSERT INTO tab1 SELECT ...
2. Register each table before usage
JDBC source with data source v2
New: Register the catalog before usage
spark-defaults.conf
spark.sql.catalog.jdbcCatalogName my.jdbc.v2.impl
spark.sql.catalog.jdbcCatalogName.url …
spark.sql.catalog.jdbcCatalogName.driver …
JDBC source with data source v2
• No need to register the tables.
• Access the tables using n-part name.
• DDL/DML support.
CREATE TABLE jdbcCatalogName.db1.t1(...)
ALTER TABLE jdbcCatalogName.db1.t2 CHANGE COLUMN ...
SELECT * FROM jdbcCatalogName.db2.t3
INSERT INTO jdbcCatalogName.db3.t4 SELECT ...
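To illustrate how an n-part name could be matched against catalogs registered in the configuration, here is a simplified plain-Python sketch. It is not Spark's resolver: quoted identifiers and nested catalog options are ignored, and the helper names and the placeholder URL value are assumptions made for illustration.

```python
CATALOG_PREFIX = "spark.sql.catalog."

def registered_catalogs(conf):
    """Return {catalog_name: implementation} from keys like spark.sql.catalog.<name>."""
    catalogs = {}
    for key, value in conf.items():
        if key.startswith(CATALOG_PREFIX):
            rest = key[len(CATALOG_PREFIX):]
            if "." not in rest:  # keys with a further suffix are catalog options
                catalogs[rest] = value
    return catalogs

def resolve(name, conf):
    """Split an n-part table name into (catalog, namespace, table)."""
    parts = name.split(".")
    catalogs = registered_catalogs(conf)
    if len(parts) > 1 and parts[0] in catalogs:
        return parts[0], parts[1:-1], parts[-1]
    return None, parts[:-1], parts[-1]  # no registered catalog matched

conf = {
    "spark.sql.catalog.jdbcCatalogName": "my.jdbc.v2.impl",
    "spark.sql.catalog.jdbcCatalogName.url": "jdbc:example://host/db",  # placeholder value
}
with_catalog = resolve("jdbcCatalogName.db1.t1", conf)
without_catalog = resolve("db1.t1", conf)
print(with_catalog)
print(without_catalog)
```

The point of the sketch is that the catalog is registered once, after which each table is addressed by name alone instead of repeating connection options per table or per operation.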
Adaptive Query Processing
Intel Blog: https://tinyurl.com/y3rjwcos
Based on statistics of the materialized plan nodes, re-optimize the execution plan of the remaining queries
• Self tuning the number of reducers
• Adaptive join strategy
• Automatic skew join handling
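The first item, self-tuning the number of reducers, can be pictured as merging adjacent small shuffle partitions once their actual sizes are known. The sketch below is a simplified illustration of that idea in plain Python, not Spark's implementation; the function name, the sample sizes, and the 64 MB target are assumptions.

```python
def coalesce_partitions(sizes, target_size):
    """Group adjacent shuffle partitions so each group stays near target_size.

    Returns a list of (start, end) index ranges, end exclusive.
    """
    groups = []
    start = 0
    current = 0
    for i, size in enumerate(sizes):
        # Close the current group if adding this partition would exceed the target.
        if i > start and current + size > target_size:
            groups.append((start, i))
            start, current = i, 0
        current += size
    if start < len(sizes):
        groups.append((start, len(sizes)))
    return groups

sizes_mb = [10, 20, 5, 70, 30, 30, 100, 1]  # hypothetical post-shuffle sizes
groups = coalesce_partitions(sizes_mb, target_size=64)
print(groups)
```

With these sample sizes, eight shuffle partitions collapse into five reducer inputs. A partition that is already larger than the target stays in a group of its own.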
Native Spark App in K8S
• Support for using a pod template to customize the driver and executor pods.
• Dynamic resource allocation and external shuffle service.
• Better support for local application dependencies on client machines.
• Driver resilience for Spark Streaming.
• Better scheduling support.
Other targets in Apache Spark 3.0
• Hadoop 3.x support
• Hive execution from 1.2.1 to 2.3.4
• Scala 2.12 GA
• Better ANSI SQL compliance
• PySpark usability
Please follow the announcements at Spark + AI Summit @ SF
Apache Spark 3.x
• SQL, Spark ML, Spark Streaming, Spark Graph, 3rd-party Libraries
• SparkSession / DataFrame / Dataset APIs
• Catalyst Optimization & Tungsten Execution
• Spark Core
• Data Source Connectors
TRACKS
Apache Spark™
• Use Cases
• Research
• Technical Deep Dives
AI
• Productionizing ML
• Deep Learning
• Cloud Hardware
Fields
• Data Science
• Data Engineering
• Enterprise
5000+ ATTENDEES
Practitioners: Data Scientists, Data Engineers, Analysts, Architects
Leaders: Engineering Management, VPs, Heads of Analytics & Data, CxOs
databricks.com/sparkaisummit
Nike: Enabling Data Scientists to bring their Models to Market
Facebook: Vectorized Query Execution in Apache Spark at Facebook
Tencent: Large-scale Malicious Domain Detection with Spark AI
IBM: In-memory storage Evolution in Apache Spark
Capital One: Apache Spark and Sights at Speed: Streaming, Feature management and Execution
Apple: Making Nested Columns as First Citizen in Apache Spark SQL
eBay: Managing Apache Spark workload and automatic optimizing
Google: Validating Spark ML Jobs
HP: Apache Spark for Cyber Security in big company
Microsoft: Apache Spark Serving: Unifying Batch, Streaming and RESTful Serving
ABSA Group: A Mainframe Data Source for Spark SQL and Streaming
Facebook: an efficient Facebook-scale shuffle service
IBM: Make your PySpark Data Fly with Arrow!
Facebook: Distributed Scheduling Framework for Apache Spark
Zynga: Automating Predictive Modeling at Zynga with PySpark
World Bank: Using Crowdsourced Images to Create Image Recognition Models and NLP to Augment Global Trade indicator
JD.com: Optimizing Performance and Computing Resource
Microsoft: Azure Databricks with R: Deep Dive
Airbnb: Apache Spark at Airbnb
Netflix: Migrating to Apache Spark at Netflix
Microsoft: Infrastructure for Deep Learning in Apache Spark
Intel: Game playing using AI on Apache Spark
Facebook: Scaling Apache Spark @ Facebook
Lyft: Scaling Apache Spark on K8S at Lyft
Uber: Using Spark MLlib Models in a Production Training and Serving Platform
Apple: Bridging the gap between Datasets and DataFrames
Salesforce: The Rule of 10,000 Spark Jobs
Target: Lessons in Linear Algebra at Scale with Apache Spark
Nationwide: Deploying Enterprise Scale Deep Learning in Actuarial Modeling at Nationwide
Workday: Lesson Learned Using Apache Spark
Thank you
Xiao Li (lixiao@databricks.com)
Wenchen Fan (wenchen@databricks.com)
64

Weitere ähnliche Inhalte

Was ist angesagt?

Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Databricks
 
Using Apache Spark in the Cloud—A Devops Perspective with Telmo Oliveira
Using Apache Spark in the Cloud—A Devops Perspective with Telmo OliveiraUsing Apache Spark in the Cloud—A Devops Perspective with Telmo Oliveira
Using Apache Spark in the Cloud—A Devops Perspective with Telmo OliveiraSpark Summit
 
Interoperating a Zoo of Data Processing Platforms Using with Rheem Sebastian ...
Interoperating a Zoo of Data Processing Platforms Using with Rheem Sebastian ...Interoperating a Zoo of Data Processing Platforms Using with Rheem Sebastian ...
Interoperating a Zoo of Data Processing Platforms Using with Rheem Sebastian ...Databricks
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark OverviewairisData
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinAlex Zeltov
 
Speed up UDFs with GPUs using the RAPIDS Accelerator
Speed up UDFs with GPUs using the RAPIDS AcceleratorSpeed up UDFs with GPUs using the RAPIDS Accelerator
Speed up UDFs with GPUs using the RAPIDS AcceleratorDatabricks
 
Apache Spark Best Practices Meetup Talk
Apache Spark Best Practices Meetup TalkApache Spark Best Practices Meetup Talk
Apache Spark Best Practices Meetup TalkEren Avşaroğulları
 
Spark Internals Training | Apache Spark | Spark | Anika Technologies
Spark Internals Training | Apache Spark | Spark | Anika TechnologiesSpark Internals Training | Apache Spark | Spark | Anika Technologies
Spark Internals Training | Apache Spark | Spark | Anika TechnologiesAnand Narayanan
 
Powering Custom Apps at Facebook using Spark Script Transformation
Powering Custom Apps at Facebook using Spark Script TransformationPowering Custom Apps at Facebook using Spark Script Transformation
Powering Custom Apps at Facebook using Spark Script TransformationDatabricks
 
Spark Summit EU talk by Stephan Kessler
Spark Summit EU talk by Stephan KesslerSpark Summit EU talk by Stephan Kessler
Spark Summit EU talk by Stephan KesslerSpark Summit
 
Why Kubernetes as a container orchestrator is a right choice for running spar...
Why Kubernetes as a container orchestrator is a right choice for running spar...Why Kubernetes as a container orchestrator is a right choice for running spar...
Why Kubernetes as a container orchestrator is a right choice for running spar...DataWorks Summit
 
03 2014 Apache Spark Serving: Unifying Batch, Streaming, and RESTful Serving
03 2014 Apache Spark Serving: Unifying Batch, Streaming, and RESTful Serving03 2014 Apache Spark Serving: Unifying Batch, Streaming, and RESTful Serving
03 2014 Apache Spark Serving: Unifying Batch, Streaming, and RESTful ServingDatabricks
 
Spark Meetup at Uber
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at UberDatabricks
 
Big Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and ZeppelinBig Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and Zeppelinprajods
 
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...DataWorks Summit/Hadoop Summit
 
Apache Spark Usage in the Open Source Ecosystem
Apache Spark Usage in the Open Source EcosystemApache Spark Usage in the Open Source Ecosystem
Apache Spark Usage in the Open Source EcosystemDatabricks
 
Spark Summit EU talk by Simon Whitear
Spark Summit EU talk by Simon WhitearSpark Summit EU talk by Simon Whitear
Spark Summit EU talk by Simon WhitearSpark Summit
 
Scale-Out Resource Management at Microsoft using Apache YARN
Scale-Out Resource Management at Microsoft using Apache YARNScale-Out Resource Management at Microsoft using Apache YARN
Scale-Out Resource Management at Microsoft using Apache YARNDataWorks Summit/Hadoop Summit
 
Apache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis Magda
Apache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis MagdaApache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis Magda
Apache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis MagdaDatabricks
 

Was ist angesagt? (20)

Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks
 
Using Apache Spark in the Cloud—A Devops Perspective with Telmo Oliveira
Using Apache Spark in the Cloud—A Devops Perspective with Telmo OliveiraUsing Apache Spark in the Cloud—A Devops Perspective with Telmo Oliveira
Using Apache Spark in the Cloud—A Devops Perspective with Telmo Oliveira
 
Interoperating a Zoo of Data Processing Platforms Using with Rheem Sebastian ...
Interoperating a Zoo of Data Processing Platforms Using with Rheem Sebastian ...Interoperating a Zoo of Data Processing Platforms Using with Rheem Sebastian ...
Interoperating a Zoo of Data Processing Platforms Using with Rheem Sebastian ...
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
 
Speed up UDFs with GPUs using the RAPIDS Accelerator
Speed up UDFs with GPUs using the RAPIDS AcceleratorSpeed up UDFs with GPUs using the RAPIDS Accelerator
Speed up UDFs with GPUs using the RAPIDS Accelerator
 
Apache Spark Best Practices Meetup Talk
Apache Spark Best Practices Meetup TalkApache Spark Best Practices Meetup Talk
Apache Spark Best Practices Meetup Talk
 
Spark Internals Training | Apache Spark | Spark | Anika Technologies
Spark Internals Training | Apache Spark | Spark | Anika TechnologiesSpark Internals Training | Apache Spark | Spark | Anika Technologies
Spark Internals Training | Apache Spark | Spark | Anika Technologies
 
Powering Custom Apps at Facebook using Spark Script Transformation
Powering Custom Apps at Facebook using Spark Script TransformationPowering Custom Apps at Facebook using Spark Script Transformation
Powering Custom Apps at Facebook using Spark Script Transformation
 
Spark Summit EU talk by Stephan Kessler
Spark Summit EU talk by Stephan KesslerSpark Summit EU talk by Stephan Kessler
Spark Summit EU talk by Stephan Kessler
 
Running Spark in Production
Running Spark in ProductionRunning Spark in Production
Running Spark in Production
 
Why Kubernetes as a container orchestrator is a right choice for running spar...
Why Kubernetes as a container orchestrator is a right choice for running spar...Why Kubernetes as a container orchestrator is a right choice for running spar...
Why Kubernetes as a container orchestrator is a right choice for running spar...
 
03 2014 Apache Spark Serving: Unifying Batch, Streaming, and RESTful Serving
03 2014 Apache Spark Serving: Unifying Batch, Streaming, and RESTful Serving03 2014 Apache Spark Serving: Unifying Batch, Streaming, and RESTful Serving
03 2014 Apache Spark Serving: Unifying Batch, Streaming, and RESTful Serving
 
Spark Meetup at Uber
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at Uber
 
Big Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and ZeppelinBig Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and Zeppelin
 
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
 
Apache Spark Usage in the Open Source Ecosystem
Apache Spark Usage in the Open Source EcosystemApache Spark Usage in the Open Source Ecosystem
Apache Spark Usage in the Open Source Ecosystem
 
Spark Summit EU talk by Simon Whitear
Spark Summit EU talk by Simon WhitearSpark Summit EU talk by Simon Whitear
Spark Summit EU talk by Simon Whitear
 
Scale-Out Resource Management at Microsoft using Apache YARN
Scale-Out Resource Management at Microsoft using Apache YARNScale-Out Resource Management at Microsoft using Apache YARN
Scale-Out Resource Management at Microsoft using Apache YARN
 
Apache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis Magda
Apache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis MagdaApache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis Magda
Apache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis Magda
 

Ähnlich wie Apache spark 2.4 and beyond

What’s new in Apache Spark 2.3
What’s new in Apache Spark 2.3What’s new in Apache Spark 2.3
What’s new in Apache Spark 2.3DataWorks Summit
 
What's new in Apache Spark 2.4
What's new in Apache Spark 2.4What's new in Apache Spark 2.4
What's new in Apache Spark 2.4boxu42
 
What's New in Upcoming Apache Spark 2.3
What's New in Upcoming Apache Spark 2.3What's New in Upcoming Apache Spark 2.3
What's New in Upcoming Apache Spark 2.3Databricks
 
Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Paulo Gutierrez
 
2018 02-08-what's-new-in-apache-spark-2.3
2018 02-08-what's-new-in-apache-spark-2.3 2018 02-08-what's-new-in-apache-spark-2.3
2018 02-08-what's-new-in-apache-spark-2.3 Chester Chen
 
Databricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupPaco Nathan
 
Web Scale Reasoning and the LarKC Project
Web Scale Reasoning and the LarKC ProjectWeb Scale Reasoning and the LarKC Project
Web Scale Reasoning and the LarKC ProjectSaltlux Inc.
 
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0Databricks
 
The structured streaming upgrade to Apache Spark and how enterprises can bene...
The structured streaming upgrade to Apache Spark and how enterprises can bene...The structured streaming upgrade to Apache Spark and how enterprises can bene...
The structured streaming upgrade to Apache Spark and how enterprises can bene...Impetus Technologies
 
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick WendellApache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick WendellDatabricks
 
Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei ZahariaDeep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei ZahariaDatabricks
 
Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei ZahariaDeep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei ZahariaJen Aman
 
Overview of Apache Spark 2.3: What’s New? with Sameer Agarwal
 Overview of Apache Spark 2.3: What’s New? with Sameer Agarwal Overview of Apache Spark 2.3: What’s New? with Sameer Agarwal
Overview of Apache Spark 2.3: What’s New? with Sameer AgarwalDatabricks
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformYao Yao
 
What's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareWhat's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareDatabricks
 
Introducing Kafka's Streams API
Introducing Kafka's Streams APIIntroducing Kafka's Streams API
Introducing Kafka's Streams APIconfluent
 
Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析Etu Solution
 
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Michael Rys
 
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftSF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftChester Chen
 

Ähnlich wie Apache spark 2.4 and beyond (20)

What’s new in Apache Spark 2.3
What’s new in Apache Spark 2.3What’s new in Apache Spark 2.3
What’s new in Apache Spark 2.3
 
What's new in Apache Spark 2.4
What's new in Apache Spark 2.4What's new in Apache Spark 2.4
What's new in Apache Spark 2.4
 
What's New in Upcoming Apache Spark 2.3
What's New in Upcoming Apache Spark 2.3What's New in Upcoming Apache Spark 2.3
What's New in Upcoming Apache Spark 2.3
 
Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要
 
2018 02-08-what's-new-in-apache-spark-2.3
2018 02-08-what's-new-in-apache-spark-2.3 2018 02-08-what's-new-in-apache-spark-2.3
2018 02-08-what's-new-in-apache-spark-2.3
 
Databricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User Group
 
Web Scale Reasoning and the LarKC Project
Web Scale Reasoning and the LarKC ProjectWeb Scale Reasoning and the LarKC Project
Web Scale Reasoning and the LarKC Project
 
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
 
The structured streaming upgrade to Apache Spark and how enterprises can bene...
The structured streaming upgrade to Apache Spark and how enterprises can bene...The structured streaming upgrade to Apache Spark and how enterprises can bene...
The structured streaming upgrade to Apache Spark and how enterprises can bene...
 
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick WendellApache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
 
Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Apache Spark 2.4 and Beyond

  • 1. Apache Spark 2.4 and Beyond Xiao Li, Wenchen Fan Mar 2019 @ Strata Data Conf
  • 2. About Us • Software Engineers at Databricks • Apache Spark Committers and PMC Members • Xiao Li (GitHub: gatorsmile), Wenchen Fan (GitHub: cloud-fan)
  • 3. Databricks Customers Across Industries: Financial Services, Healthcare & Pharma, Media & Entertainment, Technology, Public Sector, Retail & CPG, Consumer Services, Energy & Industrial IoT, Marketing & AdTech, Data & Analytics Services
  • 4. Databricks Unified Analytics Platform (Reliable & Scalable, Simple & Integrated)
      • DATABRICKS WORKSPACE: Notebooks, Dashboards, APIs, Jobs, Models (end-to-end ML lifecycle)
      • DATABRICKS RUNTIME: Databricks Delta, ML Frameworks
      • DATABRICKS CLOUD SERVICE
  • 8. Top Skill in 2018: Apache Spark https://economicgraph.linkedin.com/research/linkedin-2018-emerging-jobs-report
  • 10. Apache Spark 2.4 • Release: Nov 8, 2018 • Blog: https://t.co/k7kEHrNZXp • More than 1,100 tickets
  • 11. Major Features on Spark 2.4: Higher-order Functions, Structured Streaming, Built-in Source Improvement, Spark on Kubernetes, PySpark Improvement, Native Avro Support, Image Source, Barrier Execution, Scala 2.12 (beta support), Various SQL Features
  • 12. Survey Key Findings: The AI Dilemma • Investment in AI is growing quickly • However, very few are succeeding • Major contributing factors to this AI dilemma: 7 different machine learning frameworks and tools • https://databricks.com/CIO-survey-report
  • 13. CIO Survey Key Findings: Unified Analytics Enables AI Success • Unifying data science & engineering • Benefits expected from a unified approach to data & AI: increased operational efficiency, more effective decision making, accelerated time to market, improved security, increased innovation, improved customer experience, increased competitive advantage, increased employee satisfaction, increased customer engagement, product/service transformation, topline growth, other
  • 14. Big Data vs. AI Technologies
  • 15. Project Hydrogen: Spark + AI • Barrier execution: gang scheduling for Apache Spark that embeds a distributed deep learning (DL) job as a Spark stage, simplifying the distributed training workflow. [SPARK-24374] • (Figure: Task 1, Task 2 and Task 3 of one stage.)
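
The all-tasks-together behaviour described on this slide can be illustrated without Spark. The sketch below is a plain-Python analogy using `threading.Barrier`, not Spark's barrier API: the function name `run_barrier_stage` and the "setup"/"train" phase labels are hypothetical. It only shows the core idea that no task enters the training phase until every task of the stage has reached the barrier.

```python
import threading


def run_barrier_stage(num_tasks):
    """Run num_tasks threads that all wait at a barrier before continuing."""
    barrier = threading.Barrier(num_tasks)
    log = []
    lock = threading.Lock()

    def task(task_id):
        with lock:
            log.append(("setup", task_id))
        # Blocks until every task of the stage has arrived.
        barrier.wait()
        with lock:
            log.append(("train", task_id))

    threads = [threading.Thread(target=task, args=(i,)) for i in range(num_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


events = run_barrier_stage(3)
phases = [phase for phase, _ in events]
```

Every "setup" entry lands in the log before any "train" entry, which is the property a distributed training job needs before its workers start exchanging data.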
  • 17. Flexible Streaming Sink [SPARK-24565] • Exposes the output rows of each micro-batch as a DataFrame: foreachBatch(f: (Dataset[T], Long) => Unit)
      • Scala/Java/Python APIs in DataStreamWriter
      • Reuse existing batch data sources
      • Write to multiple locations
      • Apply additional DataFrame operations
  • 18. Reuse existing batch data sources
  • 19. Write to multiple locations
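
The code shown on slides 18 and 19 is not part of this transcript, so here is a toy model of the pattern in plain Python rather than the Spark API. It assumes that the user function receives the rows of one micro-batch together with a batch ID, as `foreachBatch` does; the driver `run_micro_batches`, the two sinks and the threshold of 30 are hypothetical.

```python
def run_micro_batches(batches, batch_fn):
    """Toy driver: call batch_fn(rows, batch_id) once per micro-batch."""
    for batch_id, rows in enumerate(batches):
        batch_fn(rows, batch_id)


warehouse = {}  # hypothetical sink 1, keyed by batch_id so a replayed batch overwrites itself
alerts = []     # hypothetical sink 2, receives only the rows of interest


def write_to_two_sinks(rows, batch_id):
    # The same batch output goes to more than one location.
    warehouse[batch_id] = list(rows)
    alerts.extend(r for r in rows if r["value"] > 30)


stream = [[{"value": 10}, {"value": 42}], [{"value": 31}]]
run_micro_batches(stream, write_to_two_sinks)
```

Keying the first sink by batch ID is one way to make a retried micro-batch idempotent, which is a common reason for passing the ID to the function at all.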
  • 21. Parquet • Update from 1.8.2 to 1.10.0 [SPARK-23972]
      • PARQUET-1025: Support new min-max statistics in parquet-mr
      • PARQUET-225: INT64 support for delta encoding
      • PARQUET-1142: Enable parquet.filter.dictionary.enabled by default
      • Predicate pushdown: STRING [SPARK-23972] [20x faster], Decimal [SPARK-24549], Timestamp [SPARK-24718], Date [SPARK-23727], Byte/Short [SPARK-24706], StringStartsWith [SPARK-24638], IN [SPARK-17091]
  • 22. ORC • The native vectorized ORC reader is now generally available (GA)
      • Native ORC reader is on by default [SPARK-23456]
      • Update ORC from 1.4.1 to 1.5.2 [SPARK-24576]
      • Turn on ORC filter push-down by default [SPARK-21783]
      • Use native ORC reader to read Hive serde tables by default [SPARK-22279]
      • Avoid creating reader for all ORC files [SPARK-25126]
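
To show why min-max statistics matter for predicate pushdown, the following is a minimal sketch in plain Python, not parquet-mr or Spark code. It assumes each row group carries the minimum and maximum of a column; the `RowGroup` class and `scan_greater_than` function are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    min_value: int
    max_value: int
    rows: list


def scan_greater_than(row_groups, threshold):
    """Return rows with value > threshold and the number of row groups skipped."""
    matched, skipped = [], 0
    for rg in row_groups:
        if rg.max_value <= threshold:
            # The statistics prove no row can match, so the group is never read.
            skipped += 1
            continue
        matched.extend(v for v in rg.rows if v > threshold)
    return matched, skipped


groups = [
    RowGroup(1, 9, [1, 5, 9]),
    RowGroup(10, 40, [10, 25, 40]),
    RowGroup(50, 90, [50, 90]),
]
result, skipped = scan_greater_than(groups, 30)
```

Here the first row group is skipped entirely because its maximum is below the threshold; the filter is still applied to the rows of the groups that are read.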
  • 24. Higher-order Functions • Transformation on complex objects like arrays, maps and structures inside of columns. • UDF? Expensive data serialization.
      tbl_nested
       |-- key: long (nullable = false)
       |-- values: array (nullable = false)
       |    |-- element: long (containsNull = false)
  • 25. Higher-order Functions (on tbl_nested)
      1) Check for element existence
         SELECT EXISTS(values, e -> e > 30) AS v FROM tbl_nested;
      2) Transform an array
         SELECT TRANSFORM(values, e -> e * e) AS v FROM tbl_nested;
  • 26. Higher-order Functions (on tbl_nested)
      3) Filter an array
         SELECT FILTER(values, e -> e > 30) AS v FROM tbl_nested;
      4) Aggregate an array
         SELECT REDUCE(values, 0, (acc, value) -> acc + value) AS sum FROM tbl_nested;
      Ref Databricks Blog: http://dbricks.co/2rUKQ1A
  • 27. Built-in Functions [SPARK-23899] • New or extended built-in functions for ArrayTypes and MapTypes
      • 26 functions for ArrayTypes: transform, filter, reduce, array_distinct, array_intersect, array_union, array_except, array_join, array_max, array_min, ...
      • 3 functions for MapTypes: map_from_arrays, map_from_entries, map_concat
      • Blog: Introducing New Built-in and Higher-Order Functions for Complex Data Types in Apache Spark 2.4. https://t.co/p1TRRtabJJ
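
The four queries above can be read as ordinary functions that take a lambda. The sketch below models their semantics on a single array in plain Python; it is not Spark code, and the helper names are chosen for illustration. The last helper is called `aggregate` because, as far as open-source Spark 2.4 is concerned, that is likely the name under which the REDUCE operation is exposed, with the accumulator as the first lambda parameter.

```python
from functools import reduce


def exists(values, pred):
    """True if any element satisfies pred, like EXISTS(values, e -> ...)."""
    return any(pred(e) for e in values)


def transform(values, fn):
    """Apply fn to every element, like TRANSFORM(values, e -> ...)."""
    return [fn(e) for e in values]


def filter_array(values, pred):
    """Keep the elements that satisfy pred, like FILTER(values, e -> ...)."""
    return [e for e in values if pred(e)]


def aggregate(values, zero, merge):
    """Fold the array into one value; merge receives (accumulator, element)."""
    return reduce(merge, values, zero)


values = [10, 20, 40]
has_large = exists(values, lambda e: e > 30)
squares = transform(values, lambda e: e * e)
large = filter_array(values, lambda e: e > 30)
total = aggregate(values, 0, lambda acc, value: acc + value)
```

Because the lambda runs inside the engine on the array itself, no row has to be serialized out to a user-defined function, which is the cost slide 24 points at.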
  • 29. Native Spark App in K8S • New Spark scheduler backend
      • PySpark support [SPARK-23984]
      • SparkR support [SPARK-24433]
      • Client-mode support [SPARK-23146]
      • Support for mounting K8S volumes [SPARK-23529]
      • Blog: What’s New for Apache Spark on Kubernetes in the Upcoming Apache Spark 2.4 Release https://t.co/uUpdUj2Z4B
  • 32. Safe Harbor Statement • This presentation may contain projections or other forward-looking statements regarding the upcoming release (Apache Spark 3.0). The statements are intended to outline our general direction. They are intended for information purposes only. They are not a commitment to deliver code or functionality. The development, release and timing of any feature or functionality described for Apache Spark remains at the sole discretion of ASF and the Apache Spark PMC.
  • 33. What’s Next? Data Source APIs, Spark on Kubernetes, PySpark Usability, Scala 2.12, Various SQL Features, GPU-aware Scheduling, Adaptive Execution, Hadoop 3.x, Spark Graph, mlflow
  • 35. Project Hydrogen: Spark + AI • GPU-aware Scheduling • GPUs are widely used for accelerating special workloads, e.g., deep learning and signal processing
  • 36. It’s Hard to Productionize ML
  • 37. ML Lifecycle is Manual, Inconsistent and Disconnected
      • Data Prep: low-level integrations for data and ML; difficult to track data used for a model
      • Build Model: ad hoc approach to track experiments; very hard to reproduce experiments
      • Deploy Model: multiple tightly coupled deployment options; different monitoring approach for each framework
  • 38. What is MLflow? Open source platform to manage ML development
      • Lightweight APIs & abstractions that work with any ML library
      • Designed to be useful for 1 user or 1000+ person orgs
      • Runs the same way anywhere (e.g. any cloud)
      • Key principle: “open interface” APIs that work with any existing ML library, app, deployment tool, etc.
  • 39. MLflow Components
      • Tracking: record and query experiments (code, params, results, etc.)
      • Projects: code packaging for reproducible runs on any platform
      • Models: model packaging and deployment to diverse environments
  • 40. MLflow: What’s Next?
      • MLflow Tracking: SQL database backend for scaling the tracking server (0.9); UI scalability improvements (0.8, 0.9, etc.); X-coordinate logging for metrics & batched logging (1.0); Fluent API for Java and Scala (1.0)
      • MLflow Projects: Docker-based project environment specification (0.9); X-coordinate logging for metrics & batched logging (1.0); packaging projects with build steps (1.0+)
      • MLflow Models: custom model logging in Python, R and Java (0.8, 0.9, 1.0); better environment isolation when loading models (1.0); logging schema of models (1.0+)
  • 43. Challenges in Existing Graph Libraries
      • GraphX: not DataFrame based; not actively maintained
      • GraphFrames: limited graph pattern matching; semantically weak graph data model
  • 44. (:Cypher)-[:FOR]->(Apache:Spark™)
  • 45. Spark Graph Given a single Property Graph data model and a Cypher query, Spark returns a tabular result [DataFrame]
  • 47. Data Source API V2 • Unified API for batch and streaming • Flexible API for high performance implementation • Flexible API for metadata management
  • 48. JDBC source with data source v1 • 1. Specify the info of the remote catalog for each operation
      df.write.format("jdbc")
        .option("url", ...)
        .option("dbtable", ...)
        .option("driver", ...)
        .save()
  • 49. JDBC source with data source v1 • 2. Register each table before usage
      CREATE TABLE tab1(...) USING jdbc OPTIONS("url" ..., "dbtable" ..., ...)
      CREATE TABLE tab2(...) USING jdbc OPTIONS("url" ..., "dbtable" ..., ...)
      SELECT * FROM tab1 JOIN tab2
      INSERT INTO tab1 SELECT ...
  • 50. JDBC source with data source v2 • New: Register the catalog before usage, in spark-defaults.conf
      spark.sql.catalog.jdbcCatalogName my.jdbc.v2.impl
      spark.sql.catalog.jdbcCatalogName.url …
      spark.sql.catalog.jdbcCatalogName.driver …
  • 51. JDBC source with data source v2 • No need to register the tables • Access the tables using an n-part name • DDL/DML support
      CREATE TABLE jdbcCatalogName.db1.t1(...)
      ALTER TABLE jdbcCatalogName.db1.t2 CHANGE COLUMN ...
      SELECT * FROM jdbcCatalogName.db2.t3
      INSERT INTO jdbcCatalogName.db3.t4 SELECT ...
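
How an n-part name could be matched against registered catalogs can be sketched in a few lines. The following is a simplified plain-Python model, not Spark's actual resolver: it assumes catalogs are declared with `spark.sql.catalog.<name>` keys as on slide 50, and the functions `load_catalogs` and `resolve` as well as the URL and driver values are hypothetical.

```python
def load_catalogs(conf):
    """Collect catalog implementations and their options from a config mapping."""
    prefix = "spark.sql.catalog."
    catalogs = {}
    for key, value in conf.items():
        if not key.startswith(prefix):
            continue
        name, _, option = key[len(prefix):].partition(".")
        entry = catalogs.setdefault(name, {"impl": None, "options": {}})
        if option:
            entry["options"][option] = value
        else:
            entry["impl"] = value
    return catalogs


def resolve(identifier, catalogs):
    """Split an n-part name into (catalog, namespace parts, table)."""
    parts = identifier.split(".")
    if len(parts) > 1 and parts[0] in catalogs:
        return parts[0], parts[1:-1], parts[-1]
    # No registered catalog matches the first part: fall back to the default.
    return None, parts[:-1], parts[-1]


conf = {
    "spark.sql.catalog.jdbcCatalogName": "my.jdbc.v2.impl",
    "spark.sql.catalog.jdbcCatalogName.url": "jdbc:example://host/db",  # hypothetical value
    "spark.sql.catalog.jdbcCatalogName.driver": "example.Driver",       # hypothetical value
}
catalogs = load_catalogs(conf)
with_catalog = resolve("jdbcCatalogName.db1.t1", catalogs)
without_catalog = resolve("db1.t1", catalogs)
```

The catalog is configured once, so every later statement only needs the table name, which is the contrast with the per-table registration on slide 49.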
  • 54. Adaptive Query Processing • Based on statistics of the materialized plan nodes, re-optimize the execution plan of the remaining queries
      • Self-tuning the number of reducers
      • Adaptive join strategy
      • Automatic skew join handling
      • Intel Blog: https://tinyurl.com/y3rjwcos
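
One plausible reading of "self tuning the number of reducers" is to merge small adjacent shuffle partitions once their real sizes are known. The sketch below shows that idea in plain Python under this assumption; it is not Spark's implementation, and the function name, the sizes and the 64 MB target are made up for illustration.

```python
def coalesce_partitions(sizes, target_size):
    """Group adjacent partition indices so each group stays within target_size.

    A single partition larger than target_size keeps a group of its own.
    """
    groups, current, current_size = [], [], 0
    for index, size in enumerate(sizes):
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups


sizes_mb = [10, 20, 5, 70, 30, 30, 8]  # sizes observed after the map stage
reducers = coalesce_partitions(sizes_mb, 64)
```

With these sizes, seven shuffle partitions are served by four reducers. The 70 MB partition stays on its own, which is the kind of outlier that skew handling would target separately.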
  • 57. Native Spark App in K8S
      • Support for using a pod template to customize the driver and executor pods
      • Dynamic resource allocation and external shuffle service
      • Better support for local application dependencies on client machines
      • Driver resilience for Spark Streaming
      • Better scheduling support
  • 59. The other targets in Apache Spark 3.0
      • Hadoop 3.x support
      • Hive execution upgrade from 1.2.1 to 2.3.4
      • Scala 2.12 GA
      • Better ANSI SQL compliance
      • PySpark usability
      • Please follow the announcements at Spark + AI Summit @ SF
  • 61. Apache Spark 3.x
      • SQL, Spark ML, Spark Streaming, Spark Graph, 3rd-party Libraries
      • SparkSession / DataFrame / Dataset APIs
      • Catalyst Optimization & Tungsten Execution
      • Spark Core, Data Source Connectors
  • 62. Spark + AI Summit (databricks.com/sparkaisummit)
      • TRACKS: Apache Spark™ (Use Cases, Research, Technical Deep Dives); AI (Productionizing ML, Deep Learning, Cloud Hardware); Fields (Data Science, Data Engineering, Enterprise)
      • 5000+ ATTENDEES: Practitioners (Data Scientists, Data Engineers, Analysts, Architects); Leaders (Engineering Management, VPs, Heads of Analytics & Data, CxOs)
  • 63. Spark + AI Summit talks
      • Nike: Enabling Data Scientists to bring their Models to Market
      • Facebook: Vectorized Query Execution in Apache Spark at Facebook
      • Tencent: Large-scale Malicious Domain Detection with Spark AI
      • IBM: In-memory storage Evolution in Apache Spark
      • Capital One: Apache Spark and Sights at Speed: Streaming, Feature management and Execution
      • Apple: Making Nested Columns as First Citizen in Apache Spark SQL
      • EBay: Managing Apache Spark workload and automatic optimizing
      • Google: Validating Spark ML Jobs
      • HP: Apache Spark for Cyber Security in big company
      • Microsoft: Apache Spark Serving: Unifying Batch, Streaming and RESTful Serving
      • ABSA Group: A Mainframe Data Source for Spark SQL and Streaming
      • Facebook: an efficient Facebook-scale shuffle service
      • IBM: Make your PySpark Data Fly with Arrow!
      • Facebook: Distributed Scheduling Framework for Apache Spark
      • Zynga: Automating Predictive Modeling at Zynga with PySpark
      • World Bank: Using Crowdsourced Images to Create Image Recognition Models and NLP to Augment Global Trade indicator
      • JD.com: Optimizing Performance and Computing Resource
      • Microsoft: Azure Databricks with R: Deep Dive
      • Airbnb: Apache Spark at Airbnb
      • Netflix: Migrating to Apache Spark at Netflix
      • Microsoft: Infrastructure for Deep Learning in Apache Spark
      • Intel: Game playing using AI on Apache Spark
      • Facebook: Scaling Apache Spark @ Facebook
      • Lyft: Scaling Apache Spark on K8S at Lyft
      • Uber: Using Spark Mllib Models in a Production Training and Serving Platform
      • Apple: Bridging the gap between Datasets and DataFrames
      • Salesforce: The Rule of 10,000 Spark Jobs
      • Target: Lessons in Linear Algebra at Scale with Apache Spark
      • Nationwide: Deploying Enterprise Scale Deep Learning in Actuarial Modeling at Nationwide
      • Workday: Lesson Learned Using Apache Spark
  • 64. Thank you • Xiao Li (lixiao@databricks.com) • Wenchen Fan (wenchen@databricks.com)