Diese Präsentation wurde erfolgreich gemeldet.
Wir verwenden Ihre LinkedIn Profilangaben und Informationen zu Ihren Aktivitäten, um Anzeigen zu personalisieren und Ihnen relevantere Inhalte anzuzeigen. Sie können Ihre Anzeigeneinstellungen jederzeit ändern.

Koalas: Unifying Spark and pandas APIs

762 Aufrufe

Veröffentlicht am

The introduction of Koalas and the current development status.

This is a slide for Spark Meetup Tokyo #1 (Spark+AI Summit 2019)

Veröffentlicht in: Technologie
  • Als Erste(r) kommentieren

Koalas: Unifying Spark and pandas APIs

  1. 1. Koalas: Unifying Spark and pandas APIs 1 Takuya UESHIN Spark Meetup Tokyo #1, Jun 2019
  2. 2. 2 About Me - Software Engineer @databricks - Apache Spark Committer - Twitter: @ueshin - GitHub: github.com/ueshin
  3. 3. Typical journey of a data scientist Education (MOOCs, books, universities) → pandas Analyze small data sets → pandas Analyze big data sets → DataFrame in Spark 3
  4. 4. pandas Authored by Wes McKinney in 2008 The standard tool for data manipulation and analysis in Python Deeply integrated into Python data science ecosystem, e.g. numpy, matplotlib Can deal with a lot of different situations, including: - basic statistical analysis - handling missing data - time series, categorical variables, strings 4
  5. 5. Apache Spark De facto unified analytics engine for large-scale data processing (Streaming, ETL, ML) Created at UC Berkeley by Databricks’ founders PySpark API for Python; also API support for Scala, R and SQL 5
  6. 6. 6 Pandas DataFrame Spark DataFrame Column df[‘col’] df[‘col’] Mutability Mutable Immutable Add a column df[‘c’] = df[‘a’] + df[‘b’] df.withColumn(‘c’, df[‘a’] + df[‘b’]) Rename columns df.columns = [‘a’,’b’] df.select(df[‘c1’].alias(‘a’), df[‘c2’].alias(‘b’)) Value count df[‘col’].value_counts() df.groupBy(df[‘col’]).count().order By(‘count’, ascending = False) Pandas DataFrame vs Spark DataFrame
  7. 7. A short example 7 import pandas as pd df = pd.read_csv("my_data.csv") df.columns = [‘x’, ‘y’, ‘z1’] df[‘x2’] = df.x * df.x df = (spark.read .option("inferSchema", "true") .option("comment", True) .csv("my_data.csv")) df = df.toDF(‘x’, ‘y’, ‘z1’) df = df.withColumn(‘x2’, df.x*df.x)
  8. 8. Koalas Announced April 24, 2019 Pure Python library Aims at providing the pandas API on top of Apache Spark: - unifies the two ecosystems with a familiar API - seamless transition between small and large data 8
  9. 9. Quickly gaining traction 9 Weekly releases! > 100 patches merged since announcement at Spark Summit (April 24) > 10 significant contributors outside of Databricks > 2.5k daily downloads
  10. 10. A short example 10 import pandas as pd df = pd.read_csv("my_data.csv") df.columns = [‘x’, ‘y’, ‘z1’] df[‘x2’] = df.x * df.x import databricks.koalas as ks df = ks.read_csv("my_data.csv") df.columns = [‘x’, ‘y’, ‘z1’] df[‘x2’] = df.x * df.x
  11. 11. Key Differences Spark is more lazy by nature: - most operations only happen when displaying or writing a DataFrame Spark does not implicitly have ordering Performance when working at scale 11
  12. 12. Current status Weekly releases, very active community with daily changes The most common functions have been implemented: - 30% of the DataFrame API - 30% of the Series API - to_datetime, get_dummies, … Special thanks to Hyukjin Kwon and Takuya Ueshin for major contributions to the project 12
  13. 13. What to expect soon? Performance enhancements, e.g. caching Plotting Better indexing support More data manipulation (string and date manipulations) 13
  14. 14. Getting started pip install koalas conda install koalas Look for docs and updates on github.com/databricks/koalas 14
  15. 15. Do you have suggestions or requests? Submit requests to github.com/databricks/koalas/issues Very easy to contribute github.com/databricks/koalas/blob/master/CONTRIBUTING.md 15
  16. 16. 16 Thank you Takuya UESHIN (ueshin@databricks.com)