Koalas: Unifying Spark and pandas APIs
Xiao Li @ gatorsmile
PyBay Conf @ SF | Aug 2019
About Me
• Engineering Manager at Databricks
• Apache Spark Committer and PMC Member
• Previously, IBM Master Inventor
• Spark, Database Replication, Information Integration
• Ph.D. from the University of Florida
• Github: gatorsmile
Databricks Unified Analytics Platform
[Platform diagram. Labels: Databricks Workspace (APIs, Jobs, Models, Notebooks, Dashboards) for data engineers and data scientists; Databricks Runtime (Databricks Delta, ML Frameworks); Databricks Cloud Service; "Reliable & Scalable", "Simple & Integrated"; end-to-end ML lifecycle.]
Apache Spark
Originally created by Databricks’ founders at UC Berkeley in 2009
A de facto unified analytics engine for large-scale data processing
- Just-in-time Data Warehouse [with Delta], Streaming, ETL, ML, Graph Processing
PySpark API for Python; also API support for Scala, R and SQL
[Figure omitted. Image credit: Stack Overflow]
pandas
Authored by Wes McKinney in 2008
The standard tool for data manipulation and analysis in Python
Deeply integrated into the Python data science ecosystem, e.g. numpy, matplotlib
Handles many different situations, including:
- basic statistical analysis
- handling missing data
- time series, categorical variables, strings
Why is Spark faster on big data?
Distributed computing in Spark
Lazier execution in Spark
- Not triggered until users call an action API (collect, save, show)
- Operations are combined, optimized, and executed holistically
More efficient execution in Spark
- Tungsten execution engine: whole-stage code generation
- Catalyst optimizer: heuristics-based and cost-based query optimization, adaptive query optimization [Spark 3.0]
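The lazy-execution point can be illustrated with a small pure-Python toy. This is only a sketch of the idea (transformations are recorded, and an action runs them together); `LazyPipeline`, `square`, and `calls` are hypothetical names, not Spark's API, and nothing here models Catalyst or Tungsten.

```python
from typing import Any, Callable, Iterable, List, Optional, Tuple

Step = Tuple[str, Callable[[Any], Any]]


class LazyPipeline:
    """Toy stand-in for a lazily evaluated dataset (not Spark's real API)."""

    def __init__(self, source: Iterable[Any], steps: Optional[List[Step]] = None):
        self._source = source
        self._steps: List[Step] = list(steps or [])

    def map(self, fn: Callable[[Any], Any]) -> "LazyPipeline":
        # Transformation: only recorded; returns a new pipeline.
        return LazyPipeline(self._source, self._steps + [("map", fn)])

    def filter(self, pred: Callable[[Any], bool]) -> "LazyPipeline":
        # Transformation: only recorded; returns a new pipeline.
        return LazyPipeline(self._source, self._steps + [("filter", pred)])

    def collect(self) -> List[Any]:
        # Action: run every recorded step in a single pass over the data.
        out = []
        for row in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    row = fn(row)
                elif not fn(row):
                    keep = False
                    break
            if keep:
                out.append(row)
        return out


calls = []


def square(x):
    calls.append(x)  # record each invocation to show when work happens
    return x * x


pipeline = LazyPipeline(range(5)).map(square).filter(lambda v: v % 2 == 0)
calls_before_action = len(calls)  # no work has been done yet
result = pipeline.collect()       # the action triggers execution
```

Because the whole chain is known before anything runs, an engine is free to reorder or fuse steps; the toy simply fuses them into one loop.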
Spark-ify pandas code?
• The scale and complexity of data operations keep increasing
• pandas-based Python scripts become too slow
• But switching to Spark is time-consuming and not straightforward
Koalas
• Announced April 24, 2019
• Pure Python library
• Familiar if coming from pandas
• Aims at providing the pandas API on top of Spark
• Unifies the two ecosystems with a familiar API
• Seamless transition between small and large data
API Differences
pandas
- Born of need + batteries included: providing APIs for common tasks
- Type system from NumPy
- Be Pythonic
PySpark
- Abstraction: tasks are implemented by composing primitives
- Type system from ANSI SQL
- Consistent with Scala DataFrame APIs
Pandas DataFrame vs Spark DataFrame

                 pandas DataFrame               Spark DataFrame
Column           df['col']                      df['col']
Mutability       Mutable                        Immutable
Add a column     df['c'] = df['a'] + df['b']    df.withColumn('c', df['a'] + df['b'])
Rename columns   df.columns = ['a', 'b']        df.select(df['c1'].alias('a'), df['c2'].alias('b'))
Value count      df['col'].value_counts()       df.groupBy(df['col']).count().orderBy('count', ascending=False)
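A small pandas-only sketch of two rows from the comparison, using made-up data. It contrasts in-place mutation with the return-a-new-frame style (here `assign` plays the role `withColumn` plays on the Spark side), and shows that the value count is a group, count, and sort in descending order. The column values and variable names are illustrative assumptions.

```python
import pandas as pd

# Made-up sample data for illustration only.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30], "col": ["x", "y", "x"]})

# pandas style: mutate a DataFrame in place (done on a copy to keep df intact).
mutated = df.copy()
mutated["c"] = mutated["a"] + mutated["b"]

# Spark-like style: build a new DataFrame and leave the original untouched.
derived = df.assign(c=df["a"] + df["b"])

# Value count: the pandas one-liner ...
counts = df["col"].value_counts()
# ... and the explicit group / count / sort-descending form.
counts_explicit = df.groupby("col").size().sort_values(ascending=False)
```

The explicit form mirrors the shape of the Spark expression in the table, which is the kind of composition of primitives that Koalas hides behind the pandas call.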
A short example

pandas:

    import pandas as pd
    df = pd.read_csv("my_data.csv")
    df.columns = ['x', 'y', 'z1']
    df['x2'] = df.x * df.x

PySpark:

    # spark is an existing SparkSession
    df = (spark.read
          .option("inferSchema", "true")
          .option("comment", True)
          .csv("my_data.csv"))
    df = df.toDF('x', 'y', 'z1')
    df = df.withColumn('x2', df.x * df.x)
A short example

pandas:

    import pandas as pd
    df = pd.read_csv("my_data.csv")
    df.columns = ['x', 'y', 'z1']
    df['x2'] = df.x * df.x

Koalas:

    import databricks.koalas as ks
    df = ks.read_csv("my_data.csv")
    df.columns = ['x', 'y', 'z1']
    df['x2'] = df.x * df.x
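To try the pandas side without a file on disk, here is a self-contained variant. The CSV text is a hypothetical stand-in for my_data.csv (its contents are not given in the slides). Only the pandas version is shown because the Koalas version needs a running Spark session; per the slide, the change would be the import and the `read_csv` call.

```python
import io

import pandas as pd

# Hypothetical stand-in for my_data.csv: a header row and three columns.
csv_text = "a,b,c\n1,2,3\n4,5,6\n"

df = pd.read_csv(io.StringIO(csv_text))
df.columns = ["x", "y", "z1"]  # rename all columns at once
df["x2"] = df.x * df.x         # add a derived column in place
```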
Koalas
• Provide discoverable APIs for common data science tasks (i.e., follow pandas)
• Unify the pandas API and the Spark API, with pandas first
• Implement the pandas APIs that are appropriate for distributed datasets
• Easy conversion from/to a pandas DataFrame or numpy array
Koalas: a lean API layer
[Architecture diagram. Labels: pandas, Koalas, DataFrame APIs, SQL, Catalyst Optimization & Tungsten Execution, Core, Data Source Connectors, Spark.]
Demo
Current status
• Bi-weekly releases, very active community with daily changes
• The most common functions have been implemented:
- 60% of the DataFrame/Series API
- 50% of the DataFrameGroupBy/SeriesGroupBy API
- 15% of the Index/MultiIndex API
- to_datetime, get_dummies, …
- to_delta, to_parquet, to_spark_io, sql, cache, …
Quickly gaining traction
- 300+ patches merged since announcement
- 20 significant contributors outside of Databricks
- 6K+ daily downloads
What to expect soon?
• Performance enhancements
• Better indexing support
• Better error handling
• Better coverage of pandas APIs
• More time series related functions
• Better visualization support
Getting started
• pip install koalas
• conda install koalas
• Look for docs and updates on github.com/databricks/koalas
• Project docs are published here: https://koalas.readthedocs.io
Do you have suggestions or requests?
Submit requests to github.com/databricks/koalas/issues
Very easy to contribute
github.com/databricks/koalas/blob/master/CONTRIBUTING.md
Thank you
Xiao Li
(lixiao@databricks.com)
