4. What’s Koalas?
▪ Announced April 24, 2019
▪ Aims at providing the pandas API on top of Apache Spark
▪ Unifies the two ecosystems with a familiar API
▪ Seamless transition between small and large data
▪ For pandas users
▪ Scale out the pandas code using Koalas
▪ Make learning PySpark much easier
▪ For PySpark users
▪ More productive with pandas-like functions
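As a rough illustration of the "familiar API" point above, the sketch below writes one function against the shared DataFrame API and runs it locally with pandas. The function name `top_spenders` and the sample data are hypothetical. Passing the Koalas module instead of pandas is assumed to work because the calls used here (`DataFrame`, `groupby`, `sum`, `sort_values`, `head`) are available in both libraries.

```python
import pandas as pd


def top_spenders(lib, n=2):
    """Build a small frame with the given library and return the top-n totals.

    `lib` is either the pandas module or the Koalas module; only calls that
    both libraries provide are used.
    """
    df = lib.DataFrame(
        {
            "customer": ["a", "b", "a", "c", "b", "a"],
            "amount": [10.0, 5.0, 7.5, 1.0, 2.5, 4.0],
        }
    )
    totals = df.groupby("customer")["amount"].sum()
    return totals.sort_values(ascending=False).head(n)


# Runs locally with pandas.
result = top_spenders(pd)
print(result)

# With Koalas and PySpark installed, the same function is expected to run on Spark:
#   import databricks.koalas as ks
#   top_spenders(ks)
```

Only the module passed in changes between the two runs; the body of the function stays the same.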
5. pandas
▪ Authored by Wes McKinney in 2008
▪ The standard tool for data manipulation and analysis in Python
▪ The current version: 1.1.0
(Figure: Stack Overflow Trends)
6. pandas
▪ Deeply integrated into Python data science ecosystem
▪ numpy
▪ matplotlib
▪ scikit-learn
▪ Can deal with a lot of different situations, including:
▪ Basic statistical analysis
▪ Handling missing data
▪ Time series, categorical variables, strings
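The situations listed above can be sketched in a few lines of plain pandas. The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "day": pd.date_range("2020-01-01", periods=6, freq="D"),
        "city": ["Tokyo", "tokyo ", "Osaka", "Osaka", "Tokyo", "Osaka"],
        "temp": [5.0, np.nan, 7.0, 9.0, np.nan, 11.0],
    }
)

# Strings: normalize inconsistent labels.
df["city"] = df["city"].str.strip().str.title()

# Categorical variables: store the repeated labels as a category dtype.
df["city"] = df["city"].astype("category")

# Missing data: fill gaps with the mean of the observed values.
df["temp"] = df["temp"].fillna(df["temp"].mean())

# Basic statistics: mean temperature per city.
mean_by_city = df.groupby("city", observed=True)["temp"].mean()

# Time series: resample the daily readings into 3-day averages.
three_day = df.set_index("day")["temp"].resample("3D").mean()

print(mean_by_city)
print(three_day)
```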
7. Apache Spark
▪ De facto unified analytics engine for large-scale data processing
▪ Streaming
▪ ETL
▪ ML
▪ Originally created at UC Berkeley by Databricks’ founders
▪ PySpark API for Python; also API support for Scala, R and SQL
▪ The latest version: 3.0.0
12. Koalas 1.0
▪ Spark 3.0 support
▪ Optimize using Spark 3.0 functions, such as mapInPandas().
▪ Python 3.8 support
▪ pandas 1.0 support (since 0.28.0)
▪ Basically Koalas will follow pandas 1.0+ behavior.
▪ Remove deprecated functions
▪ Functions removed in pandas 1.0
▪ @pandas_wraps, DataFrame.map_in_pandas()
▪ Introduce the spark accessor and move Spark-specific functions under it
13. Koalas 1.0
▪ Most common pandas functions have been implemented in Koalas:
▪ Series : 74%
▪ DataFrame : 82%
▪ Index : 68%
▪ APIs for Spark users:
▪ to_koalas(), to_spark()
▪ DataFrame.spark.to_spark_io(), ks.read_spark_io(), ...
▪ DataFrame.spark.cache(), ks.sql(), …
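A minimal sketch of how the Spark-facing APIs named above might fit together, assuming Koalas 1.x and PySpark are installed. The function name `spark_roundtrip` and the sample rows are hypothetical, and the imports sit inside the function so the block can be loaded on a machine without Spark.

```python
def spark_roundtrip():
    """Move a DataFrame from Spark to Koalas and back (needs PySpark and Koalas)."""
    # Imported lazily so this sketch can be loaded without a Spark installation.
    from pyspark.sql import SparkSession
    import databricks.koalas as ks  # importing Koalas adds to_koalas() to Spark DataFrames

    spark = SparkSession.builder.getOrCreate()
    sdf = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "label"])

    # Spark DataFrame -> Koalas DataFrame
    kdf = sdf.to_koalas()

    # Keep the underlying Spark DataFrame cached while it is reused.
    with kdf.spark.cache() as cached:
        counts = cached.groupby("label")["id"].count().to_pandas()

    # Query the Koalas DataFrame with SQL; {kdf} refers to the local variable.
    filtered = ks.sql("SELECT * FROM {kdf} WHERE id > 1")

    # Koalas DataFrame -> Spark DataFrame
    return counts, filtered.to_spark()
```

Calling `spark_roundtrip()` starts a local Spark session and returns a pandas Series of per-label counts and a Spark DataFrame with the filtered rows.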
14. Koalas 1.0
▪ Better type hint support
▪ Allow users to specify column names in the type hints (experimental).
▪ Wider support of in-place update
▪ The in-place update of DataFrame also updates Series, and vice versa.
▪ Fewer restrictions on compute.ops_on_diff_frames
▪ Make internals work and run on a simpler Spark plan.
https://github.com/databricks/koalas/releases/tag/v1.0.0
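The type hint and in-place update items above can be sketched as follows, assuming Koalas 1.0 with PySpark installed. The function names and data are hypothetical, and the slice-style hint `ks.DataFrame["b": float, "c": float]` is the experimental syntax as recalled from the 1.0.0 release notes, so treat its exact form as an assumption.

```python
def type_hint_and_inplace_demo():
    """Sketch of two Koalas 1.0 features; needs Koalas and PySpark installed."""
    # Imported lazily so this block can be loaded without a Spark installation.
    import numpy as np
    import databricks.koalas as ks

    kdf = ks.DataFrame({"A": [1, 1, 2], "B": [1.0, 2.0, 3.0], "C": [2.0, 4.0, 6.0]})

    # Experimental: name the output columns in the return type hint, so Koalas
    # can take the result schema from the hint.
    def share_of_group(pdf) -> ks.DataFrame["b": float, "c": float]:
        # pdf is a pandas DataFrame holding the rows of one group.
        return pdf[["B", "C"]] / pdf[["B", "C"]].sum()

    shares = kdf.groupby("A").apply(share_of_group)

    # In-place update: filling missing values on the Series is also visible
    # through the DataFrame it was taken from.
    kdf2 = ks.DataFrame({"x": [np.nan, 2.0, 3.0]})
    kser = kdf2["x"]
    kser.fillna(0, inplace=True)

    return shares, kdf2
```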
15. Koalas 1.1
▪ API extensions
▪ Register custom accessors on DataFrame, Series, and Index.
▪ Plotting backend
▪ Switch the plotting backend with the plotting.backend config, e.g. Plotly or pandas-bokeh
▪ Koalas accessor
▪ Provides Koalas-specific functions such as apply_batch, transform_batch, ...
https://github.com/databricks/koalas/releases/tag/v1.1.0
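A minimal sketch of the three Koalas 1.1 items above, assuming Koalas 1.1 with PySpark installed. The accessor name `stats`, the class `StatsAccessor`, and its method `column_range` are hypothetical. Setting the backend to "plotly" additionally assumes that Plotly is installed when a plot is eventually drawn.

```python
def koalas_11_demo():
    """Sketch of Koalas 1.1 additions; needs Koalas 1.1 and PySpark installed."""
    # Imported lazily so this block can be loaded without a Spark installation.
    import databricks.koalas as ks
    from databricks.koalas.extensions import register_dataframe_accessor

    # API extensions: attach a custom accessor under the (hypothetical) name "stats".
    @register_dataframe_accessor("stats")
    class StatsAccessor:
        def __init__(self, kdf):
            self._kdf = kdf

        def column_range(self, column):
            # Difference between the largest and smallest value of one column.
            return self._kdf[column].max() - self._kdf[column].min()

    kdf = ks.DataFrame({"x": [1, 2, 3, 4], "y": [10, 20, 30, 40]})
    spread = kdf.stats.column_range("y")

    # Koalas accessor: apply a pandas function to each batch of rows.
    doubled = kdf.koalas.apply_batch(lambda pdf: pdf * 2)

    # Plotting backend: later calls to kdf.plot use the configured backend.
    ks.set_option("plotting.backend", "plotly")

    return spread, doubled
```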
17. Roadmap
▪ Databricks Runtime (DBR) and ML Runtime (MLR) 7.1 pre-install Koalas 1.0
▪ DBR/MLR 7.2 will pre-install Koalas 1.1
▪ ...
▪ Improve the coverage and the behavior compatibility of APIs.
▪ ML libraries
▪ Documentation
▪ More examples
▪ Workarounds for APIs we won’t support
18. Getting started
▪ pip install koalas
▪ conda install -c conda-forge koalas
▪ Look for docs on https://koalas.readthedocs.io/en/latest/ and updates on github.com/databricks/koalas
▪ A 10-minute tutorial in a live Jupyter notebook is available from the docs.
▪ blog post: 10 Minutes from pandas to Koalas on Apache Spark
https://databricks.com/jp/blog/2020/03/31/10-minutes-from-pandas-to-koalas-on-apache-spark.html