Python Data Ecosystem: Thoughts on Building for the Future

Wes McKinney
Director of Ursa Labs, Open Source Developer at Ursa Labs
© Cloudera, Inc. All rights reserved.
  
Python Data Ecosystem: Thoughts on Building for the Future

Wes McKinney @wesmckinn

PyData Berlin 2016-05-21
  
  
Me

• Data Science Tools at Cloudera, formerly DataPad CEO/founder
• Serial creator of structured data tools / user interfaces
• Wrote bestseller Python for Data Analysis (2012)
• Open source projects
  • Python {pandas, Ibis, statsmodels}
  • Apache {Arrow, Parquet, Kudu (incubating)}
• Mostly work in Python and Cython/C/C++
  
	
  
  
In process:

Python for Data Analysis: 2nd Edition

Coming early 2017
  
  
Building open source communities
  
  
Social architecture is the conscious design of an environment that encourages a desired range of social behaviors leading towards some goal or set of goals.
Wikipedia
  
Step 1
Be open and transparent
  
  
Step 2
Reach out to others
  
  
Step 3
Strive for consensus
  
  
Step 4
Value contributions extending beyond lines of code
  
  
Step 5
Make things harder for bad actors
  
  
Handling problems carefully
  
http://numfocus.org
http://apache.org
  
Python packaging
  
  
Packaging is hard

• Reproducible infrastructure
• Reproducible toolchains
• Reproducible build scripts
• Integration testing
• Multiple library version builds
• Multiple Python versions
• Dependency resolution
• Hosting and distribution
• Multiple environment management
  
  
Reflecting on the past
  
  
conda-forge

• Community-curated conda package channel (on anaconda.org)
• Reproducible build infrastructure (Docker + Circle CI + Travis CI + Appveyor)
• Automated GitHub helper tools

    conda config --add channels conda-forge
  
What's important to me right now?
  
  
Important things

• Building bridges with other data science communities (R, Julia, Scala, etc.)
• Enabling Python to more efficiently talk to other systems (e.g. Hadoop things)
• Building Python tools for new and changing varieties of data
  
  
RAM as the new disk?

• SSD – DRAM performance convergence
• NVM developments (3D XPoint)

[Figure: a "Memory working set" box with three "Consumer" boxes]
  
Problems

• Memory (data structure) representations
• Metadata representations
• Memory ownership, life-cycle
  
  
NumPy solved this problem for Python scientists

• Common memory representation
  • ndarray strided, homogeneous buffer
• Common metadata
  • NumPy dtypes
• No well-defined memory sharing / messaging model: case by case basis
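A minimal sketch of what "strided, homogeneous buffer" and "common metadata" mean in practice. The array contents are made up for illustration; the attributes used (dtype, shape, strides) are standard NumPy.

```python
import numpy as np

# A 3x4 array of 64-bit integers stored in one contiguous, homogeneous buffer.
a = np.arange(12, dtype=np.int64).reshape(3, 4)

# The metadata describing the buffer: element type, shape, and strides
# (the number of bytes to step to reach the next element along each axis).
print(a.dtype)    # int64
print(a.shape)    # (3, 4)
print(a.strides)  # (32, 8)

# Taking every other column yields a view: same buffer, different strides.
b = a[:, ::2]
print(b.strides)               # (32, 16)
print(np.shares_memory(a, b))  # True

# Writing through the view is visible in the original array.
b[0, 0] = 100
print(a[0, 0])  # 100
```

Any library that understands this metadata can operate on the same memory, which is the sense in which NumPy gave Python a common representation.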
  
  
Problems NumPy doesn't solve as well

• Nested data types (think JSON)
• Missing / NULL data
• Strings and category types
• Columnar memory representation for tables (think: analytic SQL databases)
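A short illustration of the first three points, using default NumPy behavior only. The sample values are invented for the example.

```python
import numpy as np

# Integers have no spare bit pattern for "missing", so introducing NaN
# forces the whole array to floating point.
ints = np.array([1, 2, 3])
with_missing = np.array([1, np.nan, 3])
print(ints.dtype.kind)     # i
print(with_missing.dtype)  # float64

# Fixed-width strings pad every element to the longest value
# (8 characters x 4 bytes each here).
fixed = np.array(["a", "bb", "cccccccc"])
print(fixed.dtype.itemsize)  # 32

# Variable-length strings fall back to an array of references to Python
# objects, giving up the homogeneous value buffer.
objs = np.array(["a", "bb", "cccccccc"], dtype=object)
print(objs.dtype)  # object

# Nested (JSON-like) records also end up as generic Python objects.
nested = np.array([{"tags": ["x", "y"]}, {"tags": []}], dtype=object)
print(nested.dtype)  # object
```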
  
  
Apache Arrow

http://arrow.apache.org

Some slides from Strata-HW talk w/ Jacques Nadeau
  
Arrow in a Slide

• New Top-level Apache Software Foundation project
• Focused on Columnar In-Memory Analytics
  1. 10-100x speedup on many workloads
  2. Common data layer enables companies to choose best of breed systems
  3. Designed to work with any programming language
  4. Support for both relational and complex data as-is
• Developers from 13+ major open source projects involved
• A significant % of the world's data will be processed through Arrow!
  
	
  
Calcite
Cassandra
Deeplearning4j
Drill
Hadoop
HBase
Ibis
Impala
Kudu
Pandas
Parquet
Phoenix
Spark
Storm
R
  
Focus on CPU Efficiency

        session_id   timestamp         source_ip
Row 1   1331246660   3/8/2012 2:44PM   99.155.155.225
Row 2   1331246351   3/8/2012 2:38PM   65.87.165.114
Row 3   1331244570   3/8/2012 2:09PM   71.10.106.181
Row 4   1331261196   3/8/2012 6:46PM   76.102.156.138

[Figure: the same values laid out in a "Traditional Memory Buffer" (row by row) and an "Arrow Memory Buffer" (column by column)]

• Cache Locality
• Super-scalar & vectorized operation
• Minimal Structure Overhead
• Constant value access
  • With minimal structure overhead
• Operate directly on columnar compressed data
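The two layouts in the figure can be sketched with the standard library alone. This is an illustration of row versus column storage, not Arrow's actual format; the 16-byte string fields and the names record, row_buffer and session_id are assumptions made for the example.

```python
import struct
from array import array

rows = [
    (1331246660, "3/8/2012 2:44PM", "99.155.155.225"),
    (1331246351, "3/8/2012 2:38PM", "65.87.165.114"),
    (1331244570, "3/8/2012 2:09PM", "71.10.106.181"),
    (1331261196, "3/8/2012 6:46PM", "76.102.156.138"),
]

# Row-oriented buffer: each record's fields are stored next to each other.
record = struct.Struct("<q16s16s")  # int64 plus two fixed-width byte strings
row_buffer = b"".join(
    record.pack(sid, ts.encode(), ip.encode()) for sid, ts, ip in rows
)

# Column-oriented storage: all values of one field are stored contiguously.
session_id = array("q", (r[0] for r in rows))

# Scanning one column in the row layout has to step over the other fields.
from_rows = [record.unpack_from(row_buffer, i * record.size)[0]
             for i in range(len(rows))]

# In the columnar layout the same scan reads one dense buffer.
from_columns = list(session_id)

print(record.size)                # 40 bytes per row
print(session_id.itemsize)        # 8 bytes per value
print(from_rows == from_columns)  # True
print(max(session_id))            # 1331261196
```

A scan of session_id touches 8 bytes per value in the columnar layout but strides across 40-byte records in the row layout, which is where the cache-locality argument comes from.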
  
  
High Performance Sharing & Interchange

Today
• Each system has its own internal memory format
• 70-80% CPU wasted on serialization and deserialization
• Similar functionality implemented in multiple projects

With Arrow
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g., Parquet-to-Arrow reader)
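A rough single-process analogy for the two columns above, using only the standard library. pickle stands in for any serialization format and memoryview for a shared memory layout; neither is how Arrow itself is implemented.

```python
import pickle
from array import array

values = array("d", range(1000))

# "Today": hand-off through a serialization format means encoding the data
# and then decoding it into a brand-new copy on the other side.
payload = pickle.dumps(values)
copy = pickle.loads(payload)
copy[0] = -1.0
print(values[0])  # 0.0, the receiver only changed its private copy

# Shared memory format: a consumer that understands the layout can wrap the
# producer's buffer directly. memoryview exposes it without copying.
view = memoryview(values)
print(view.format, view.itemsize, view.nbytes)  # d 8 8000
view[0] = 42.0
print(values[0])  # 42.0, both sides see the same bytes
```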
  
Arrow in action: Feather File Format for Python and R

• Problem: fast, language-agnostic binary data frame file format
• By Wes McKinney (Python) and Hadley Wickham (R)
• Read speeds close to disk IO performance
  
  
Real World Example: Feather File Format for Python and R

R

    library(feather)

    path <- "my_data.feather"
    write_feather(df, path)

    df <- read_feather(path)

Python

    import feather

    path = 'my_data.feather'

    feather.write_dataframe(df, path)
    df = feather.read_dataframe(path)
  
  
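More on Feather

[Figure: a Feather File holds array 0, array 1, array 2, ..., array n - 1 and then METADATA. The libfeather C++ library reads and writes the file; Rcpp connects it to the R data.frame and Cython connects it to the pandas DataFrame.]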
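The "arrays, then metadata" layout can be sketched as a toy file format. This is a hypothetical layout for illustration only, not the real Feather format: the JSON metadata, the 4-byte length footer, and the functions write_toy_file and read_toy_column are all assumptions.

```python
import io
import json
import struct
from array import array


def write_toy_file(columns, out):
    """Write each column's raw bytes, then a JSON metadata block, then the
    metadata length as a 4-byte footer (native byte order for the values)."""
    meta = []
    for name, values in columns.items():
        offset = out.tell()
        out.write(values.tobytes())
        meta.append({"name": name, "typecode": values.typecode,
                     "offset": offset, "length": len(values)})
    blob = json.dumps(meta).encode("utf-8")
    out.write(blob)
    out.write(struct.pack("<I", len(blob)))


def read_toy_column(buf, name):
    """Locate one column through the trailing metadata and read only it."""
    (meta_len,) = struct.unpack_from("<I", buf, len(buf) - 4)
    meta = json.loads(buf[len(buf) - 4 - meta_len:len(buf) - 4])
    entry = next(m for m in meta if m["name"] == name)
    col = array(entry["typecode"])
    nbytes = entry["length"] * col.itemsize
    col.frombytes(buf[entry["offset"]:entry["offset"] + nbytes])
    return col


out = io.BytesIO()
write_toy_file({"x": array("q", [1, 2, 3]),
                "y": array("d", [0.5, 1.5, 2.5])}, out)
data = out.getvalue()
print(read_toy_column(data, "y"))  # array('d', [0.5, 1.5, 2.5])
```

Because the metadata records each array's offset and length, a reader can load one column without touching the others, and the bytes on disk match the in-memory array.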
  
Feather: the good and not-so-good

• Good
  • Language-agnostic memory representation
  • Extremely fast
  • New storage features can be added without much difficulty
• Not-so-good
  • Data must be converted between the storage representation (Arrow) and in-memory "proprietary" data structures (R / Python data frames)
  
  
Apache Parquet: Python support is coming

• Collaborating with Uwe Korn from Blue Yonder

[Figure: stack of pandas, Arrow (C++ / Python), Parquet (C++)]
  
Shared needs for Python, R, Julia, ...

• If PLs can establish a common data frame C/C++-level memory representation, we can share algorithms and libraries much more easily
  • Example: dplyr's in-memory backend
• Other requirements
  • Permissive licensing (Python / Julia require MIT/Apache-like)
  • Common build/test/packaging for shared C/C++ library components
  
  
Real World Example: Python With Spark, Drill, Impala

[Figure: a SQL Engine feeds "in partition 0" ... "in partition n - 1" as input to user-supplied Python code; each Python function produces an output, and "out partition 0" ... "out partition n - 1" flow back to the SQL Engine.]
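The data flow in the figure can be sketched in plain Python. The names run_udf and user_function are hypothetical and the "engine" is just a list comprehension; real engines such as Spark, Drill, or Impala distribute the partitions and handle the data exchange themselves.

```python
from typing import Callable, Iterable, List

Partition = List[float]


def run_udf(partitions: Iterable[Partition],
            func: Callable[[Partition], Partition]) -> List[Partition]:
    """Stand-in for the engine: call the user function once per partition."""
    return [func(part) for part in partitions]


def user_function(part: Partition) -> Partition:
    """User-supplied Python code: center the values within one partition."""
    mean = sum(part) / len(part)
    return [x - mean for x in part]


in_partitions = [[1.0, 2.0, 3.0], [10.0, 20.0]]
out_partitions = run_udf(in_partitions, user_function)
print(out_partitions)  # [[-1.0, 0.0, 1.0], [-5.0, 5.0]]
```

The cost the talk is concerned with sits at the boundary of run_udf: how each partition gets from the engine's memory into Python and back.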
  
Get Involved in Arrow

• Join the community
  • dev@arrow.apache.org
  • Slack: https://apachearrowslackin.herokuapp.com/
  • http://arrow.apache.org
  • @ApacheArrow
  
  
Thank you

Wes McKinney @wesmckinn

Views are my own
  
1 von 37

Recomendados

Next-generation Python Big Data Tools, powered by Apache Arrow von
Next-generation Python Big Data Tools, powered by Apache ArrowNext-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowWes McKinney
13K views22 Folien
An Incomplete Data Tools Landscape for Hackers in 2015 von
An Incomplete Data Tools Landscape for Hackers in 2015An Incomplete Data Tools Landscape for Hackers in 2015
An Incomplete Data Tools Landscape for Hackers in 2015Wes McKinney
8.1K views22 Folien
Ibis: Scaling the Python Data Experience von
Ibis: Scaling the Python Data ExperienceIbis: Scaling the Python Data Experience
Ibis: Scaling the Python Data ExperienceWes McKinney
3.8K views13 Folien
My Data Journey with Python (SciPy 2015 Keynote) von
My Data Journey with Python (SciPy 2015 Keynote)My Data Journey with Python (SciPy 2015 Keynote)
My Data Journey with Python (SciPy 2015 Keynote)Wes McKinney
7.4K views37 Folien
Memory Interoperability in Analytics and Machine Learning von
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningWes McKinney
5.6K views27 Folien
PyData: The Next Generation von
PyData: The Next GenerationPyData: The Next Generation
PyData: The Next GenerationWes McKinney
22.2K views31 Folien

Más contenido relacionado

Was ist angesagt?

Python Data Wrangling: Preparing for the Future von
Python Data Wrangling: Preparing for the FuturePython Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the FutureWes McKinney
12.5K views27 Folien
Improving Python and Spark (PySpark) Performance and Interoperability von
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityWes McKinney
19.8K views37 Folien
Apache Arrow (Strata-Hadoop World San Jose 2016) von
Apache Arrow (Strata-Hadoop World San Jose 2016)Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Wes McKinney
17K views28 Folien
Apache Arrow -- Cross-language development platform for in-memory data von
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataWes McKinney
2.9K views23 Folien
Improving data interoperability in Python and R von
Improving data interoperability in Python and RImproving data interoperability in Python and R
Improving data interoperability in Python and RWes McKinney
2.6K views14 Folien
High Performance Python on Apache Spark von
High Performance Python on Apache SparkHigh Performance Python on Apache Spark
High Performance Python on Apache SparkWes McKinney
16.6K views35 Folien

Was ist angesagt?(20)

Python Data Wrangling: Preparing for the Future von Wes McKinney
Python Data Wrangling: Preparing for the FuturePython Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the Future
Wes McKinney12.5K views
Improving Python and Spark (PySpark) Performance and Interoperability von Wes McKinney
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and Interoperability
Wes McKinney19.8K views
Apache Arrow (Strata-Hadoop World San Jose 2016) von Wes McKinney
Apache Arrow (Strata-Hadoop World San Jose 2016)Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)
Wes McKinney17K views
Apache Arrow -- Cross-language development platform for in-memory data von Wes McKinney
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory data
Wes McKinney2.9K views
Improving data interoperability in Python and R von Wes McKinney
Improving data interoperability in Python and RImproving data interoperability in Python and R
Improving data interoperability in Python and R
Wes McKinney2.6K views
High Performance Python on Apache Spark von Wes McKinney
High Performance Python on Apache SparkHigh Performance Python on Apache Spark
High Performance Python on Apache Spark
Wes McKinney16.6K views
Apache Arrow at DataEngConf Barcelona 2018 von Wes McKinney
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
Wes McKinney2K views
Apache Arrow and Python: The latest von Wes McKinney
Apache Arrow and Python: The latestApache Arrow and Python: The latest
Apache Arrow and Python: The latest
Wes McKinney5.8K views
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward" von Wes McKinney
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
Wes McKinney1.1K views
Apache Arrow: Cross-language Development Platform for In-memory Data von Wes McKinney
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
Wes McKinney6.6K views
Ibis: Scaling Python Analytics on Hadoop and Impala von Wes McKinney
Ibis: Scaling Python Analytics on Hadoop and ImpalaIbis: Scaling Python Analytics on Hadoop and Impala
Ibis: Scaling Python Analytics on Hadoop and Impala
Wes McKinney7.6K views
Improving Python and Spark Performance and Interoperability with Apache Arrow von Julien Le Dem
Improving Python and Spark Performance and Interoperability with Apache ArrowImproving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache Arrow
Julien Le Dem4.4K views
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P... von Wes McKinney
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Wes McKinney103.9K views
Large Scale Graph Analytics with JanusGraph von P. Taylor Goetz
Large Scale Graph Analytics with JanusGraphLarge Scale Graph Analytics with JanusGraph
Large Scale Graph Analytics with JanusGraph
P. Taylor Goetz19.1K views
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney von Hakka Labs
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinneyIbis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Hakka Labs1.1K views
PyData: The Next Generation | Data Day Texas 2015 von Cloudera, Inc.
PyData: The Next Generation | Data Day Texas 2015PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015
Cloudera, Inc.1.8K views
HBase and Drill: How loosley typed SQL is ideal for NoSQL von DataWorks Summit
HBase and Drill: How loosley typed SQL is ideal for NoSQLHBase and Drill: How loosley typed SQL is ideal for NoSQL
HBase and Drill: How loosley typed SQL is ideal for NoSQL
DataWorks Summit641 views
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes von DataWorks Summit
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
DataWorks Summit361 views

Destacado

Raising the Tides: Open Source Analytics for Data Science von
Raising the Tides: Open Source Analytics for Data ScienceRaising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceWes McKinney
3.2K views28 Folien
PyCon APAC 2016 Keynote von
PyCon APAC 2016 KeynotePyCon APAC 2016 Keynote
PyCon APAC 2016 KeynoteWes McKinney
3.6K views36 Folien
Enabling Python to be a Better Big Data Citizen von
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenWes McKinney
6K views19 Folien
pandas: Powerful data analysis tools for Python von
pandas: Powerful data analysis tools for Pythonpandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for PythonWes McKinney
9.8K views38 Folien
Productive Data Tools for Quants von
Productive Data Tools for QuantsProductive Data Tools for Quants
Productive Data Tools for QuantsWes McKinney
1.7K views21 Folien
What's new in pandas and the SciPy stack for financial users von
What's new in pandas and the SciPy stack for financial usersWhat's new in pandas and the SciPy stack for financial users
What's new in pandas and the SciPy stack for financial usersWes McKinney
11.8K views23 Folien

Destacado(14)

Raising the Tides: Open Source Analytics for Data Science von Wes McKinney
Raising the Tides: Open Source Analytics for Data ScienceRaising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data Science
Wes McKinney3.2K views
PyCon APAC 2016 Keynote von Wes McKinney
PyCon APAC 2016 KeynotePyCon APAC 2016 Keynote
PyCon APAC 2016 Keynote
Wes McKinney3.6K views
Enabling Python to be a Better Big Data Citizen von Wes McKinney
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data Citizen
Wes McKinney6K views
pandas: Powerful data analysis tools for Python von Wes McKinney
pandas: Powerful data analysis tools for Pythonpandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Python
Wes McKinney9.8K views
Productive Data Tools for Quants von Wes McKinney
Productive Data Tools for QuantsProductive Data Tools for Quants
Productive Data Tools for Quants
Wes McKinney1.7K views
What's new in pandas and the SciPy stack for financial users von Wes McKinney
What's new in pandas and the SciPy stack for financial usersWhat's new in pandas and the SciPy stack for financial users
What's new in pandas and the SciPy stack for financial users
Wes McKinney11.8K views
Data Tools and the Data Scientist Shortage von Wes McKinney
Data Tools and the Data Scientist ShortageData Tools and the Data Scientist Shortage
Data Tools and the Data Scientist Shortage
Wes McKinney3.7K views
DataFrames: The Good, Bad, and Ugly von Wes McKinney
DataFrames: The Good, Bad, and UglyDataFrames: The Good, Bad, and Ugly
DataFrames: The Good, Bad, and Ugly
Wes McKinney12.9K views
Python for Financial Data Analysis with pandas von Wes McKinney
Python for Financial Data Analysis with pandasPython for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandas
Wes McKinney61.8K views
Structured Data Challenges in Finance and Statistics von Wes McKinney
Structured Data Challenges in Finance and StatisticsStructured Data Challenges in Finance and Statistics
Structured Data Challenges in Finance and Statistics
Wes McKinney5.3K views
User Experience for Business Analysts von Carol Smith
User Experience for Business AnalystsUser Experience for Business Analysts
User Experience for Business Analysts
Carol Smith3.7K views
Python for Science and Engineering: a presentation to A*STAR and the Singapor... von pythoncharmers
Python for Science and Engineering: a presentation to A*STAR and the Singapor...Python for Science and Engineering: a presentation to A*STAR and the Singapor...
Python for Science and Engineering: a presentation to A*STAR and the Singapor...
pythoncharmers7K views
Falcon: Fault Localization in Concurrent Programs von Sangmin Park
Falcon: Fault Localization in Concurrent ProgramsFalcon: Fault Localization in Concurrent Programs
Falcon: Fault Localization in Concurrent Programs
Sangmin Park539 views
Griffin: Grouping Suspicious Memory-Access Patterns to Improve Understanding... von Sangmin Park
Griffin: Grouping Suspicious Memory-Access Patterns to Improve Understanding...Griffin: Grouping Suspicious Memory-Access Patterns to Improve Understanding...
Griffin: Grouping Suspicious Memory-Access Patterns to Improve Understanding...
Sangmin Park412 views

Similar a Python Data Ecosystem: Thoughts on Building for the Future

Improving Data Interoperability for Python and R von
Improving Data Interoperability for Python and RImproving Data Interoperability for Python and R
Improving Data Interoperability for Python and RWork-Bench
10.3K views14 Folien
High-Performance Python On Spark von
High-Performance Python On SparkHigh-Performance Python On Spark
High-Performance Python On SparkJen Aman
1.7K views35 Folien
Building a Hadoop Data Warehouse with Impala von
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impalahuguk
2K views37 Folien
Part 2: A Visual Dive into Machine Learning and Deep Learning 
 von
Part 2: A Visual Dive into Machine Learning and Deep Learning 
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Part 2: A Visual Dive into Machine Learning and Deep Learning 
Cloudera, Inc.
1.5K views32 Folien
Building a Hadoop Data Warehouse with Impala von
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaSwiss Big Data User Group
7.3K views40 Folien
Data Science and CDSW von
Data Science and CDSWData Science and CDSW
Data Science and CDSWJason Hubbard
1.3K views19 Folien

Similar a Python Data Ecosystem: Thoughts on Building for the Future(20)

Improving Data Interoperability for Python and R von Work-Bench
Improving Data Interoperability for Python and RImproving Data Interoperability for Python and R
Improving Data Interoperability for Python and R
Work-Bench10.3K views
High-Performance Python On Spark von Jen Aman
High-Performance Python On SparkHigh-Performance Python On Spark
High-Performance Python On Spark
Jen Aman1.7K views
Building a Hadoop Data Warehouse with Impala von huguk
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
huguk2K views
Part 2: A Visual Dive into Machine Learning and Deep Learning 
 von Cloudera, Inc.
Part 2: A Visual Dive into Machine Learning and Deep Learning 
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Part 2: A Visual Dive into Machine Learning and Deep Learning 

Cloudera, Inc.1.5K views
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,... von Data Con LA
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Data Con LA369 views
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud von Stefan Lipp
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
Stefan Lipp402 views
Building data pipelines with kite von Joey Echeverria
Building data pipelines with kiteBuilding data pipelines with kite
Building data pipelines with kite
Joey Echeverria5.7K views
Hadoop 3 (2017 hadoop taiwan workshop) von Wei-Chiu Chuang
Hadoop 3 (2017 hadoop taiwan workshop)Hadoop 3 (2017 hadoop taiwan workshop)
Hadoop 3 (2017 hadoop taiwan workshop)
Wei-Chiu Chuang551 views
A brave new world in mutable big data relational storage (Strata NYC 2017) von Todd Lipcon
A brave new world in mutable big data  relational storage (Strata NYC 2017)A brave new world in mutable big data  relational storage (Strata NYC 2017)
A brave new world in mutable big data relational storage (Strata NYC 2017)
Todd Lipcon7.3K views
Pandas & Cloudera: Scaling the Python Data Experience von Turi, Inc.
Pandas & Cloudera: Scaling the Python Data ExperiencePandas & Cloudera: Scaling the Python Data Experience
Pandas & Cloudera: Scaling the Python Data Experience
Turi, Inc.648 views
Analyzing Hadoop Data Using Sparklyr
 von Cloudera, Inc.
Analyzing Hadoop Data Using Sparklyr
Analyzing Hadoop Data Using Sparklyr

Analyzing Hadoop Data Using Sparklyr

Cloudera, Inc.2.4K views
Data Science and Machine Learning for the Enterprise von Cloudera, Inc.
Data Science and Machine Learning for the EnterpriseData Science and Machine Learning for the Enterprise
Data Science and Machine Learning for the Enterprise
Cloudera, Inc.1.3K views
GSJUG: Mastering Data Streaming Pipelines 09May2023 von Timothy Spann
GSJUG: Mastering Data Streaming Pipelines 09May2023GSJUG: Mastering Data Streaming Pipelines 09May2023
GSJUG: Mastering Data Streaming Pipelines 09May2023
Timothy Spann255 views
Machine Learning in the Enterprise 2019 von Timothy Spann
Machine Learning in the Enterprise 2019   Machine Learning in the Enterprise 2019
Machine Learning in the Enterprise 2019
Timothy Spann878 views
Large-Scale Data Science on Hadoop (Intel Big Data Day) von Uri Laserson
Large-Scale Data Science on Hadoop (Intel Big Data Day)Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)
Uri Laserson1.8K views
Hambug R Meetup - Intro to H2O von Sri Ambati
Hambug R Meetup - Intro to H2OHambug R Meetup - Intro to H2O
Hambug R Meetup - Intro to H2O
Sri Ambati272 views

Más de Wes McKinney

Solving Enterprise Data Challenges with Apache Arrow von
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowWes McKinney
1.1K views31 Folien
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity von
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityWes McKinney
1.1K views26 Folien
Apache Arrow: High Performance Columnar Data Framework von
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkWes McKinney
1.5K views53 Folien
New Directions for Apache Arrow von
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache ArrowWes McKinney
1.9K views27 Folien
Apache Arrow Flight: A New Gold Standard for Data Transport von
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportWes McKinney
2.2K views31 Folien
ACM TechTalks : Apache Arrow and the Future of Data Frames von
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesWes McKinney
2K views47 Folien

Más de Wes McKinney(14)

Solving Enterprise Data Challenges with Apache Arrow von Wes McKinney
Python Data Ecosystem: Thoughts on Building for the Future

  • 1. © Cloudera, Inc. All rights reserved. Python Data Ecosystem: Thoughts on Building for the Future. Wes McKinney (@wesmckinn), PyData Berlin, 2016-05-21
  • 2. Me
      • Data Science Tools at Cloudera, formerly DataPad CEO/founder
      • Serial creator of structured data tools / user interfaces
      • Wrote bestseller Python for Data Analysis (2012)
      • Open source projects
          • Python {pandas, Ibis, statsmodels}
          • Apache {Arrow, Parquet, Kudu (incubating)}
      • Mostly work in Python and Cython/C/C++
  • 3. In process: Python for Data Analysis, 2nd Edition. Coming early 2017
  • 4. Building open source communities
  • 5. "Social architecture is the conscious design of an environment that encourages a desired range of social behaviors leading towards some goal or set of goals." (Wikipedia)
  • 6. Step 1: Be open and transparent
  • 7. Step 2: Reach out to others
  • 8. Step 3: Strive for consensus
  • 9. Step 4: Value contributions extending beyond lines of code
  • 10. Step 5: Make things harder for bad actors
  • 12. Handling problems carefully
  • 13. http://numfocus.org http://apache.org
  • 14. Python packaging
  • 15. Packaging is hard
      • Reproducible infrastructure
      • Reproducible toolchains
      • Reproducible build scripts
      • Integration testing
      • Multiple library version builds
      • Multiple Python versions
      • Dependency resolution
      • Hosting and distribution
      • Multiple environment management
  • 16. Reflecting on the past
  • 18. conda-forge
      • Community-curated conda package channel (on anaconda.org)
      • Reproducible build infrastructure (Docker + Circle CI + Travis CI + Appveyor)
      • Automated GitHub helper tools
      Command: conda config --add channels conda-forge
  • 19. What's important to me right now?
  • 20. Important things
      • Building bridges with other data science communities (R, Julia, Scala, etc.)
      • Enabling Python to more efficiently talk to other systems (e.g. Hadoop things)
      • Building Python tools for new and changing varieties of data
  • 21. RAM as the new disk?
      • SSD – DRAM performance convergence
      • NVM developments (3D XPoint)
      [Diagram: a memory working set shared by three consumers]
  • 22. Problems
      • Memory (data structure) representations
      • Metadata representations
      • Memory ownership, life-cycle
  • 23. NumPy solved this problem for Python scientists
      • Common memory representation
          • ndarray: strided, homogeneous buffer
      • Common metadata
          • NumPy dtypes
      • No well-defined memory sharing / messaging model: case-by-case basis
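As a rough illustration of what a "strided, homogeneous buffer" with common metadata looks like, the sketch below uses only the standard library's `array` and `memoryview` as a stand-in for an ndarray. That substitution is an assumption made to keep the example dependency-free; NumPy exposes analogous `shape`, `strides`, and `dtype` attributes on its own arrays.

```python
from array import array

# One flat, homogeneous buffer of six C doubles (8 bytes each on common platforms).
flat = array("d", [0.0, 1.0, 2.0, 3.0, 4.0, 5.0])

# View the same bytes as a 2 x 3 table. cast() has to go through a byte
# format first; no data is copied at either step.
table = memoryview(flat).cast("B").cast("d", shape=[2, 3])

# The metadata that lets a consumer interpret the raw bytes:
print(table.format)    # element type code, playing the role of a dtype
print(table.shape)     # logical dimensions
print(table.strides)   # bytes to step along each dimension

# Element (1, 2) lives at byte offset 1 * strides[0] + 2 * strides[1].
offset = 1 * table.strides[0] + 2 * table.strides[1]
value = table[1, 2]

# A write to the original buffer is visible through the view: memory is shared.
flat[0] = 42.0
shared_value = table[0, 0]
```

The point of the sketch is that the bytes never move; only the small description of how to walk them (type, shape, strides) is handed around.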
  • 24. Problems NumPy doesn't solve as well
      • Nested data types (think JSON)
      • Missing / NULL data
      • Strings and category types
      • Columnar memory representation for tables (think: analytic SQL databases)
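The missing-data item can be made concrete with a small sketch. It assumes a standard-library typed `array` as a stand-in for a homogeneous numeric buffer, and it shows one illustrative workaround: a separate validity bitmap with one bit per value. The helper names (`pack_with_validity`, `is_valid`, `unpack`) are hypothetical and not taken from any library.

```python
from array import array

# A homogeneous integer buffer has no slot for a missing value.
try:
    array("q", [1, None, 3])
    accepted_none = True
except TypeError:
    accepted_none = False

def pack_with_validity(items):
    """Store values in a typed buffer and track validity in a bitmap."""
    values = array("q")
    bitmap = bytearray((len(items) + 7) // 8)
    for i, item in enumerate(items):
        if item is None:
            values.append(0)                 # placeholder, masked by the bitmap
        else:
            values.append(item)
            bitmap[i // 8] |= 1 << (i % 8)   # bit i marks values[i] as valid
    return values, bitmap

def is_valid(bitmap, i):
    return bool(bitmap[i // 8] & (1 << (i % 8)))

def unpack(values, bitmap):
    """Rebuild the original sequence, restoring None where the bit is unset."""
    return [v if is_valid(bitmap, i) else None for i, v in enumerate(values)]

values, bitmap = pack_with_validity([1, None, 3])
round_trip = unpack(values, bitmap)
```

Keeping validity out of band leaves the value buffer fixed-width and contiguous, at the cost of one extra bit per element.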
  • 25. Apache Arrow. http://arrow.apache.org. Some slides from Strata-HW talk w/ Jacques Nadeau
  • 26. Arrow in a Slide
      • New top-level Apache Software Foundation project
      • Focused on columnar in-memory analytics
          1. 10-100x speedup on many workloads
          2. Common data layer enables companies to choose best-of-breed systems
          3. Designed to work with any programming language
          4. Support for both relational and complex data as-is
      • Developers from 13+ major open source projects involved
      • A significant % of the world's data will be processed through Arrow!
      Projects: Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, R
  • 27. Focus on CPU Efficiency
      session_id   timestamp         source_ip
      1331246660   3/8/2012 2:44PM   99.155.155.225
      1331246351   3/8/2012 2:38PM   65.87.165.114
      1331244570   3/8/2012 2:09PM   71.10.106.181
      1331261196   3/8/2012 6:46PM   76.102.156.138
      [Figure: the same table laid out as a traditional memory buffer (row by row: Row 1, Row 2, Row 3, Row 4) and as an Arrow memory buffer (column by column: session_id, timestamp, source_ip)]
      • Cache locality
      • Super-scalar & vectorized operation
      • Minimal structure overhead
      • Constant value access
          • With minimal structure overhead
      • Operate directly on columnar compressed data
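The two layouts in the figure can be sketched with the four records from the slide. This is a minimal, standard-library illustration of row-oriented versus column-oriented storage. It is not Arrow's actual memory format, and the `columns` dictionary is a hypothetical container chosen for readability.

```python
from array import array

# The four records from the slide, stored row by row.
rows = [
    (1331246660, "3/8/2012 2:44PM", "99.155.155.225"),
    (1331246351, "3/8/2012 2:38PM", "65.87.165.114"),
    (1331244570, "3/8/2012 2:09PM", "71.10.106.181"),
    (1331261196, "3/8/2012 6:46PM", "76.102.156.138"),
]

# Columnar layout: one container per field. The numeric values sit
# contiguously in a typed buffer instead of being interleaved with strings.
columns = {
    "session_id": array("q", (r[0] for r in rows)),
    "timestamp": [r[1] for r in rows],
    "source_ip": [r[2] for r in rows],
}

# Row layout: every record is visited and unpacked to read a single field.
max_from_rows = max(r[0] for r in rows)

# Columnar layout: the scan touches one contiguous buffer only.
max_from_columns = max(columns["session_id"])
```

Both scans return the same answer; the difference is how much unrelated data has to be pulled through the cache to get it.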
  • 28. High Performance Sharing & Interchange
      Today:
      • Each system has its own internal memory format
      • 70-80% CPU wasted on serialization and deserialization
      • Similar functionality implemented in multiple projects
      With Arrow:
      • All systems utilize the same memory format
      • No overhead for cross-system communication
      • Projects can share functionality (e.g., Parquet-to-Arrow reader)
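The contrast between the two columns can be sketched inside a single Python process. This is an analogy under stated assumptions, not a model of real cross-system transport: `pickle` stands in for serialization between systems with different formats, and a `memoryview` stands in for consumers that agree on one memory format.

```python
import pickle
from array import array

data = array("d", [1.0, 2.0, 3.0])

# "Today": the consumer receives a serialized copy and pays for
# both the encode and the decode step.
payload = pickle.dumps(data)
copied = pickle.loads(payload)

# "Shared format": the consumer gets a view onto the producer's bytes.
view = memoryview(data)

# A later write by the producer shows up through the view, not the copy.
data[0] = 99.0
view_sees_update = view[0] == 99.0
copy_sees_update = copied[0] == 99.0
```

The view costs nothing proportional to the data size, which is the property the "With Arrow" column is after.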
  • 29. Arrow in action: Feather File Format for Python and R
      • Problem: fast, language-agnostic binary data frame file format
      • By Wes McKinney (Python) and Hadley Wickham (R)
      • Read speeds close to disk IO performance
  • 30. Real World Example: Feather File Format for Python and R
      R:
          library(feather)
          path <- "my_data.feather"
          write_feather(df, path)
          df <- read_feather(path)
      Python:
          import feather
          path = 'my_data.feather'
          feather.write_dataframe(df, path)
          df = feather.read_dataframe(path)
  • 31. More on Feather
      [Diagram: a Feather file holds array 0, array 1, array 2, ..., array n - 1, followed by METADATA; the libfeather C++ library connects it through Rcpp to an R data.frame and through Cython to a pandas DataFrame]
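The "arrays followed by metadata" layout in the diagram can be illustrated with a toy container. This sketch is explicitly not the Feather format: the JSON footer, the 4-byte length suffix, and the function names `write_table` and `read_column` are all assumptions invented for the example.

```python
import io
import json
import struct
from array import array

def write_table(columns):
    """Write column buffers back to back, then JSON metadata, then its length."""
    out = io.BytesIO()
    meta = []
    for name, col in columns.items():
        raw = col.tobytes()
        meta.append({"name": name, "typecode": col.typecode,
                     "offset": out.tell(), "nbytes": len(raw)})
        out.write(raw)
    footer = json.dumps(meta).encode("utf-8")
    out.write(footer)
    out.write(struct.pack("<I", len(footer)))   # last 4 bytes: metadata size
    return out.getvalue()

def read_column(blob, name):
    """Locate one column through the trailing metadata and decode only its bytes."""
    (footer_len,) = struct.unpack("<I", blob[-4:])
    meta = json.loads(blob[-4 - footer_len:-4].decode("utf-8"))
    entry = next(m for m in meta if m["name"] == name)
    col = array(entry["typecode"])
    col.frombytes(blob[entry["offset"]:entry["offset"] + entry["nbytes"]])
    return col

blob = write_table({"x": array("q", [1, 2, 3]),
                    "y": array("d", [0.5, 1.5, 2.5])})
y = read_column(blob, "y")
```

Putting the metadata last lets a writer stream the arrays first and record their offsets afterwards, while a reader can fetch a single column without scanning the others.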
  • 32. Feather: the good and not-so-good
      • Good
          • Language-agnostic memory representation
          • Extremely fast
          • New storage features can be added without much difficulty
      • Not-so-good
          • Data must be converted between the storage representation (Arrow) and in-memory "proprietary" data structures (R / Python data frames)
  • 33. Apache Parquet: Python support is coming
      • Collaborating with Uwe Korn from Blue Yonder
      [Diagram: pandas, Arrow (C++ / Python), Parquet (C++)]
  • 34. Shared needs for Python, R, Julia, ...
      • If programming languages can establish a common data frame C/C++-level memory representation, we can share algorithms and libraries much more easily
      • Example: dplyr's in-memory backend
      • Other requirements
          • Permissive licensing (Python / Julia require MIT/Apache-like)
          • Common build/test/packaging for shared C/C++ library components
  • 35. Real World Example: Python With Spark, Drill, Impala
      [Diagram: a SQL engine passes input partitions 0 … n - 1 as Python function input to user-supplied Python code; the outputs form output partitions 0 … n - 1, which return to the SQL engine]
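The partition-wise flow in the diagram can be sketched as follows. This is a minimal, single-process illustration; `run_udf` and `double_values` are hypothetical names, and no real Spark, Drill, or Impala API is used or implied.

```python
from array import array

def run_udf(partitions, func):
    """Apply the user-supplied `func` to each input partition independently."""
    return [func(part) for part in partitions]

# Hypothetical user code: it receives a whole column chunk at once,
# rather than being called once per value.
def double_values(chunk):
    return array(chunk.typecode, (2 * v for v in chunk))

# Stand-ins for the engine's input partitions 0 … n - 1.
in_partitions = [array("q", [1, 2]), array("q", [3, 4, 5])]

# One output partition per input partition, in the same order.
out_partitions = run_udf(in_partitions, double_values)
```

Handing the function a typed chunk per partition, instead of one Python object per row, is what would let an engine and the user code exchange data without per-value conversion.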
  • 36. Get Involved in Arrow
      • Join the community
      • dev@arrow.apache.org
      • Slack: https://apachearrowslackin.herokuapp.com/
      • http://arrow.apache.org
      • @ApacheArrow
  • 37. Thank you. Wes McKinney @wesmckinn. Views are my own