In this talk, we present Koalas, a new open source project that was announced at the Spark + AI Summit in April. Koalas is a Python package that implements the pandas API on top of Apache Spark, to make the pandas API scalable to big data. Using Koalas, data scientists can make the transition from a single machine to a distributed environment without needing to learn a new framework.
2. About Me
• Engineering Manager at Databricks
• Apache Spark Committer and PMC Member
• Previously, IBM Master Inventor
• Spark, Database Replication, Information Integration
• Ph.D. from the University of Florida
• GitHub: gatorsmile
4. Apache Spark
Originally created by Databricks’ founders at UC Berkeley in 2009
A de facto unified analytics engine for large-scale data processing
- Just-in-time Data Warehouse [with Delta], Streaming, ETL, ML, Graph Processing
PySpark API for Python; also API support for Scala, R and SQL
6. pandas
Authored by Wes McKinney in 2008
The standard tool for data manipulation and analysis in Python
Deeply integrated into the Python data science ecosystem, e.g. NumPy, matplotlib
Handles a wide range of tasks, including:
- basic statistical analysis
- handling missing data
- time series, categorical variables, strings
7. Why Does Spark Perform Faster on Big Data?
Distributed computing in Spark
Lazier execution in Spark
- Execution is deferred until users call an action API (collect, save, show)
- Operations are combined, optimized, and executed holistically
More efficient execution in Spark
- Tungsten execution engine: whole-stage code generation
- Catalyst optimizer: heuristics-based and cost-based query optimization, adaptive query execution [Spark 3.0]
8. Spark-ify pandas Code???
• The increasing scale and complexity of data operations
• pandas-based Python scripts become too slow
• But the "switch to Spark" is time-consuming and not straightforward
10. Koalas
• Announced April 24, 2019
• Pure Python library
• Familiar if coming from pandas
• Aims at providing the pandas API on top of Spark
• Unifies the two ecosystems with a familiar API
• Seamless transition between small and large data
11. API Differences
pandas
- Born of need + batteries included: provides APIs for common tasks
- Type system from NumPy
- Pythonic
PySpark
- Abstraction: tasks are implemented by composing primitives
- Type system from ANSI SQL
- Consistent with the Scala DataFrame API
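One way to see the difference is a common task such as counting values: pandas ships a dedicated call, while PySpark composes primitives. A hedged sketch with made-up data (the PySpark spelling is shown as a comment):

```python
import pandas as pd

df = pd.DataFrame({"label": ["a", "b", "a", "a"]})

# pandas, "batteries included": one purpose-built API for the task.
counts = df["label"].value_counts()

# PySpark expresses the same idea by composing SQL-style primitives, e.g.:
#   sdf.groupBy("label").count().orderBy("count", ascending=False)
```

The pandas call is easier to discover; the PySpark form mirrors ANSI SQL and the Scala DataFrame API.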
14. A short example
pandas:

import pandas as pd
df = pd.read_csv("my_data.csv")
df.columns = ['x', 'y', 'z1']
df['x2'] = df.x * df.x

Koalas:

import databricks.koalas as ks
df = ks.read_csv("my_data.csv")
df.columns = ['x', 'y', 'z1']
df['x2'] = df.x * df.x
15. Koalas
• Provides discoverable APIs for common data science tasks (i.e., follows pandas)
• Unifies the pandas API and the Spark API, pandas first
• Implements the pandas APIs that are appropriate for distributed datasets
• Easy conversion from/to pandas DataFrames or NumPy arrays
18. Current status
• Bi-weekly releases, very active community with daily changes
• The most common functions have been implemented:
- 60% of the DataFrame/Series API
- 50% of the DataFrameGroupBy/SeriesGroupBy API
- 15% of the Index/MultiIndex API
- to_datetime, get_dummies, …
- to_delta, to_parquet, to_spark_io, sql, cache, …
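Because Koalas follows pandas semantics, the implemented calls behave like their pandas counterparts. A small illustration with invented data, written in plain pandas (the `ks.*` spellings are the Koalas analogs that run on Spark):

```python
import pandas as pd
# In Koalas: import databricks.koalas as ks, then ks.to_datetime / ks.get_dummies.

# Parse strings into datetimes.
dates = pd.to_datetime(pd.Series(["2019-04-24", "2019-10-16"]))

# One-hot encode a categorical column.
dummies = pd.get_dummies(pd.Series(["spark", "pandas", "spark"]))
```

Here `dates.dt.year` yields 2019 for both rows, and `dummies` gains one indicator column per category.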
20. What to expect soon?
• Performance enhancements
• Better indexing support
• Better error handling
• Better coverage of pandas APIs
• More time series related functions
• Better visualization support
21. Getting started
• pip install koalas
• conda install koalas -c conda-forge
• Look for docs and updates on github.com/databricks/koalas
• Project docs are published here: https://koalas.readthedocs.io
22. Do you have suggestions or requests?
Submit requests to github.com/databricks/koalas/issues
Very easy to contribute
github.com/databricks/koalas/blob/master/CONTRIBUTING.md