Python Data Ecosystem: Thoughts on Building for the Future
Wes McKinney
Director of Ursa Labs, Open Source Developer at Ursa Labs
Keynote from PyData Berlin 2016-05-21
1.
© Cloudera, Inc. All rights reserved.
Python Data Ecosystem: Thoughts on Building for the Future
Wes McKinney @wesmckinn
PyData Berlin 2016-05-21
2.
Me
• Data Science Tools at Cloudera, formerly DataPad CEO/founder
• Serial creator of structured data tools / user interfaces
• Wrote bestseller Python for Data Analysis (2012)
• Open source projects
  • Python {pandas, Ibis, statsmodels}
  • Apache {Arrow, Parquet, Kudu (incubating)}
• Mostly work in Python and Cython/C/C++
3.
In process: Python for Data Analysis, 2nd Edition
Coming early 2017
4.
Building open source communities
5.
"Social architecture is the conscious design of an environment that encourages a desired range of social behaviors leading towards some goal or set of goals."
(Wikipedia)
6.
Step 1: Be open and transparent
7.
Step 2: Reach out to others
8.
Step 3: Strive for consensus
9.
Step 4: Value contributions extending beyond lines of code
10.
Step 5: Make things harder for bad actors
11.
12.
Handling problems carefully
13.
http://numfocus.org
http://apache.org
14.
Python packaging
15.
Packaging is hard
• Reproducible infrastructure
• Reproducible toolchains
• Reproducible build scripts
• Integration testing
• Multiple library version builds
• Multiple Python versions
• Dependency resolution
• Hosting and distribution
• Multiple environment management
16.
Reflecting on the past
17.
18.
conda-forge
• Community-curated conda package channel (on anaconda.org)
• Reproducible build infrastructure (Docker + Circle CI + Travis CI + Appveyor)
• Automated GitHub helper tools
conda config --add channels conda-forge
19.
What’s important to me right now?
20.
Important things
• Building bridges with other data science communities (R, Julia, Scala, etc.)
• Enabling Python to talk more efficiently to other systems (e.g. Hadoop things)
• Building Python tools for new and changing varieties of data
21.
RAM as the new disk?
• SSD – DRAM performance convergence
• NVM developments (3D XPoint)
[Diagram: a memory working set shared by three consumers]
22.
Problems
• Memory (data structure) representations
• Metadata representations
• Memory ownership, life-cycle
23.
NumPy solved this problem for Python scientists
• Common memory representation
  • ndarray: strided, homogeneous buffer
• Common metadata
  • NumPy dtypes
• No well-defined memory sharing / messaging model: case-by-case basis
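A minimal sketch of what "strided, homogeneous buffer" and "dtype metadata" mean in practice, using only standard NumPy attributes. The array contents are illustrative and not taken from the talk.

```python
import numpy as np

# One contiguous, homogeneous buffer of 12 int32 values, viewed as a 3x4 array.
a = np.arange(12, dtype=np.int32).reshape(3, 4)

# The dtype is the shared metadata: every element is a 4-byte integer.
print(a.dtype, a.dtype.itemsize)

# Strides give the number of bytes to step to reach the next row / next column.
print(a.strides)

# Transposing or slicing creates a new view over the same buffer;
# only shape and strides change, the data is not copied.
t = a.T
col = a[:, 1]
print(t.strides, np.shares_memory(a, t))
print(col.strides, np.shares_memory(a, col))
```

Because every library that understands the ndarray layout can interpret the same buffer, views like `t` and `col` can be handed around without copying.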
24.
Problems NumPy doesn’t solve as well
• Nested data types (think JSON)
• Missing / NULL data
• Strings and category types
• Columnar memory representation for tables (think: analytic SQL databases)
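The limitations listed above can be observed directly. The following is a small illustration with made-up values, assuming only standard NumPy behavior; it is not taken from the talk.

```python
import numpy as np

# Missing data: integers have no NULL, so a missing value forces a float dtype (NaN).
ints = np.array([1, 2, 3])
with_missing = np.array([1, np.nan, 3])
print(ints.dtype.kind, with_missing.dtype.kind)

# Strings: a fixed-width dtype pads every element to the longest string.
names = np.array(["a", "bb", "a much longer string"])
print(names.dtype, names.itemsize)

# Nested, JSON-like records fall back to dtype=object: an array of pointers
# to Python objects rather than a homogeneous buffer.
nested = np.empty(2, dtype=object)
nested[0] = {"tags": ["x", "y"]}
nested[1] = {"tags": []}
print(nested.dtype)
```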
25.
Apache Arrow
http://arrow.apache.org
Some slides from Strata-HW talk w/ Jacques Nadeau
26.
Arrow in a Slide
• New Top-level Apache Software Foundation project
• Focused on Columnar In-Memory Analytics
  1. 10-100x speedup on many workloads
  2. Common data layer enables companies to choose best of breed systems
  3. Designed to work with any programming language
  4. Support for both relational and complex data as-is
• Developers from 13+ major open source projects involved
• A significant % of the world’s data will be processed through Arrow!
[Project logos: Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, R]
27.
Focus on CPU Efficiency
[Figure: the same four rows stored in a Traditional Memory Buffer (row by row) and in an Arrow Memory Buffer (column by column)]
        session_id  | timestamp       | source_ip
Row 1   1331246660  | 3/8/2012 2:44PM | 99.155.155.225
Row 2   1331246351  | 3/8/2012 2:38PM | 65.87.165.114
Row 3   1331244570  | 3/8/2012 2:09PM | 71.10.106.181
Row 4   1331261196  | 3/8/2012 6:46PM | 76.102.156.138
• Cache Locality
• Super-scalar & vectorized operation
• Minimal Structure Overhead
• Constant value access
  • With minimal structure overhead
• Operate directly on columnar compressed data
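The difference between the two layouts in the figure can be sketched with plain NumPy. This is an illustration only: a NumPy structured array stands in for the row-oriented buffer and a dict of contiguous arrays stands in for the columnar one; neither is Arrow's actual implementation.

```python
import numpy as np

# Row-oriented: the fields of each record are stored next to each other.
rows = np.array(
    [
        (1331246660, "3/8/2012 2:44PM", "99.155.155.225"),
        (1331246351, "3/8/2012 2:38PM", "65.87.165.114"),
        (1331244570, "3/8/2012 2:09PM", "71.10.106.181"),
        (1331261196, "3/8/2012 6:46PM", "76.102.156.138"),
    ],
    dtype=[("session_id", "i8"), ("timestamp", "U15"), ("source_ip", "U15")],
)

# Column-oriented: one contiguous buffer per column.
columns = {name: np.ascontiguousarray(rows[name]) for name in rows.dtype.names}

# Scanning session_id in the row layout steps over a whole record each time...
print(rows["session_id"].strides)
# ...while the columnar layout steps over one 8-byte value at a time.
print(columns["session_id"].strides)

# Both layouts give the same answer; the columnar scan touches far less memory.
print(rows["session_id"].max() == columns["session_id"].max())
```

The smaller stride is what the "cache locality" bullet refers to: consecutive values of one column sit in consecutive memory, so a scan pulls in only data it actually uses.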
28.
High Performance Sharing & Interchange
Today
• Each system has its own internal memory format
• 70-80% CPU wasted on serialization and deserialization
• Similar functionality implemented in multiple projects
With Arrow
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g., Parquet-to-Arrow reader)
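As a rough single-process analogy for the contrast above (not Arrow's API), the sketch below hands the same data to a "consumer" in two ways: through a serialize/deserialize round trip, and by letting the consumer interpret the producer's bytes in place. `pickle` is used here only as a stand-in for a system-specific wire format.

```python
import pickle
import numpy as np

values = np.arange(1_000_000, dtype=np.int64)

# "Today": pass data along by serializing and then deserializing it.
payload = pickle.dumps(values)
copied = pickle.loads(payload)

# "Shared format": the consumer reads the producer's buffer directly.
raw = memoryview(values)  # buffer protocol, no copy
shared = np.frombuffer(raw, dtype=np.int64)

print(np.shares_memory(values, copied))  # the round trip produced a full copy
print(np.shares_memory(values, shared))  # same memory, no conversion step
```

The second path only works because both sides agree on the byte layout in advance, which is the role a common memory format plays across systems.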
29.
Arrow in action: Feather File Format for Python and R
• Problem: fast, language-agnostic binary data frame file format
• By Wes McKinney (Python) and Hadley Wickham (R)
• Read speeds close to disk IO performance
30.
Real World Example: Feather File Format for Python and R

R
    library(feather)
    path <- "my_data.feather"
    write_feather(df, path)
    df <- read_feather(path)

Python
    import feather
    path = 'my_data.feather'
    feather.write_dataframe(df, path)
    df = feather.read_dataframe(path)
31.
More on Feather
[Diagram: a Feather File holds array 0, array 1, array 2, ..., array n - 1, followed by METADATA. The libfeather C++ library sits between the file and the language bindings: Rcpp for R data.frame, Cython for pandas DataFrame.]
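The "arrays first, metadata last" layout in the diagram can be illustrated with a toy file format. Everything below is an assumption made for illustration: the JSON metadata, the 4-byte length footer, and the function names `write_toy_file` and `read_toy_file` are hypothetical and do not describe the real Feather format or the libfeather API.

```python
import io
import json
import struct

import numpy as np


def write_toy_file(columns, sink):
    """Write each column's raw bytes, then a JSON metadata block, then its length."""
    metadata = []
    for name, arr in columns.items():
        arr = np.ascontiguousarray(arr)
        offset = sink.tell()
        sink.write(arr.tobytes())
        metadata.append(
            {"name": name, "dtype": arr.dtype.str,
             "length": int(arr.size), "offset": offset}
        )
    meta_bytes = json.dumps(metadata).encode("utf-8")
    sink.write(meta_bytes)
    sink.write(struct.pack("<I", len(meta_bytes)))  # footer: metadata size


def read_toy_file(source):
    """Read the metadata from the end of the buffer, then locate columns by offset."""
    data = source.getvalue()  # assumes an in-memory io.BytesIO
    (meta_len,) = struct.unpack("<I", data[-4:])
    metadata = json.loads(data[-4 - meta_len:-4].decode("utf-8"))
    return {
        m["name"]: np.frombuffer(
            data, dtype=np.dtype(m["dtype"]), count=m["length"], offset=m["offset"]
        )
        for m in metadata
    }


buf = io.BytesIO()
write_toy_file(
    {"x": np.arange(5, dtype=np.int64), "y": np.linspace(0.0, 1.0, 5)}, buf
)
restored = read_toy_file(buf)
print(restored["x"], restored["y"])
```

Putting the metadata at the end lets a writer stream columns out without knowing their offsets in advance, and lets a reader fetch a single column without scanning the others.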
32.
Feather: the good and not-so-good
• Good
  • Language-agnostic memory representation
  • Extremely fast
  • New storage features can be added without much difficulty
• Not-so-good
  • Data must be converted between the storage representation (Arrow) and in-memory “proprietary” data structures (R / Python data frames)
33.
Apache Parquet: Python support is coming
• Collaborating with Uwe Korn from Blue Yonder
[Diagram: pandas, Arrow (C++ / Python), Parquet (C++)]
34.
Shared needs for Python, R, Julia, ...
• If programming languages can establish a common C/C++-level memory representation for data frames, we can share algorithms and libraries much more easily
  • Example: dplyr’s in-memory backend
• Other requirements
  • Permissive licensing (Python / Julia require MIT/Apache-like)
  • Common build/test/packaging for shared C/C++ library components
35.
Real World Example: Python With Spark, Drill, Impala
[Diagram: a SQL Engine supplies input partitions 0 … n - 1; each is passed as input to a Python function running user-supplied Python code; the outputs form output partitions 0 … n - 1, which go back to the SQL Engine.]
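The data flow in the diagram can be sketched in pure Python. This is a simplified stand-in, not the API of Spark, Drill, or Impala: `apply_udf` plays the role of the engine, and `add_total` is a hypothetical user-supplied function.

```python
from typing import Callable, Iterable, List

Partition = List[dict]


def apply_udf(
    partitions: Iterable[Partition],
    func: Callable[[Partition], Partition],
) -> List[Partition]:
    """Stand-in for the engine: call the user function once per input partition."""
    return [func(part) for part in partitions]


def add_total(rows: Partition) -> Partition:
    """Hypothetical user-supplied function: derive one new column per row."""
    return [{**row, "total": row["price"] * row["qty"]} for row in rows]


in_partitions = [
    [{"price": 2.0, "qty": 3}, {"price": 1.5, "qty": 2}],
    [{"price": 4.0, "qty": 1}],
]
out_partitions = apply_udf(in_partitions, add_total)
print(out_partitions)
```

In a real engine each partition crosses a process or language boundary on the way in and out, which is where a shared memory format would remove the conversion cost.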
36.
Get Involved in Arrow
• Join the community
  • dev@arrow.apache.org
  • Slack: https://apachearrowslackin.herokuapp.com/
  • http://arrow.apache.org
  • @ApacheArrow
37.
Thank you
Wes McKinney @wesmckinn
Views are my own