1 © Cloudera, Inc. All rights reserved.
Python Data Ecosystem: Thoughts on Building for the Future
Wes McKinney @wesmckinn
PyData Berlin 2016-05-21
  
2
Me
• Data Science Tools at Cloudera, formerly DataPad CEO/founder
• Serial creator of structured data tools / user interfaces
• Wrote bestseller Python for Data Analysis, 2012
• Open source projects
  • Python {pandas, Ibis, statsmodels}
  • Apache {Arrow, Parquet, Kudu (incubating)}
• Mostly work in Python and Cython/C/C++
  
3
In process:
Python for Data Analysis: 2nd Edition
Coming early 2017
  
4
Building open source communities
  
5
Social architecture is the conscious design of an environment that encourages a desired range of social behaviors leading towards some goal or set of goals.
Wikipedia

6
Step 1
Be open and transparent

7
Step 2
Reach out to others

8
Step 3
Strive for consensus

9
Step 4
Value contributions extending beyond lines of code

10
Step 5
Make things harder for bad actors
  
  
12
Handling problems carefully

13
http://numfocus.org
http://apache.org

14
Python packaging

15
Packaging is hard
• Reproducible infrastructure
• Reproducible toolchains
• Reproducible build scripts
• Integration testing
• Multiple library version builds
• Multiple Python versions
• Dependency resolution
• Hosting and distribution
• Multiple environment management
  
16
Reflecting on the past
  
  
18
conda-forge
• Community-curated conda package channel (on anaconda.org)
• Reproducible build infrastructure (Docker + Circle CI + Travis CI + Appveyor)
• Automated GitHub helper tools
conda config --add channels conda-forge

19
What’s important to me right now?

20
Important things
• Building bridges with other data science communities (R, Julia, Scala, etc.)
• Enabling Python to more efficiently talk to other systems (e.g. Hadoop things)
• Building Python tools for new and changing varieties of data
  
21
RAM as the new disk?
• SSD – DRAM performance convergence
• NVM developments (3D XPoint)
[Figure: a memory working set shared by three consumers]

22
Problems
• Memory (data structure) representations
• Metadata representations
• Memory ownership, life-cycle
  
23
NumPy solved this problem for Python scientists
• Common memory representation
  • ndarray strided, homogeneous buffer
• Common metadata
  • NumPy dtypes
• No well-defined memory sharing / messaging model: case by case basis
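
The two ideas on this slide, a strided homogeneous buffer and a type descriptor for its elements, can be seen without NumPy itself: the standard-library memoryview exposes the same kind of shape, strides, and format metadata through Python's buffer protocol. A minimal sketch using only the standard library; the variable names are illustrative and nothing here is NumPy's own API.

```python
import array

# One flat, homogeneous buffer of six C doubles (type code 'd').
flat = array.array("d", [0.0, 1.0, 2.0, 3.0, 4.0, 5.0])

# View the same bytes as a 2 x 3 table without copying anything.
# cast() only converts to or from a byte format, hence the two steps.
table = memoryview(flat).cast("B").cast("d", shape=[2, 3])

print(table.format)   # element type, the role a dtype plays
print(table.shape)    # extent of each dimension
print(table.strides)  # bytes to step per dimension
print(table.tolist())

# The view shares memory with the original buffer: a write through
# one is visible through the other.
flat[4] = 40.0
shared_value = table[1, 1]
```

Because the view and the array share one block of memory, the metadata (format, shape, strides) is all a consumer needs to interpret it.
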
  
24
Problems NumPy doesn’t solve as well
• Nested data types (think JSON)
• Missing / NULL data
• Strings and category types
• Columnar memory representation for tables (think: analytic SQL databases)
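
Two of these gaps, missing values and variable-length strings, are commonly handled in columnar formats such as Arrow with extra buffers next to the values: a validity bitmap (one bit per slot, 1 meaning present) and an offsets buffer into a single block of string bytes. The following is a standard-library sketch of that idea only; the helper names `pack_validity`, `is_valid`, and `encode_strings` are hypothetical and this is not the Arrow library API.

```python
def pack_validity(values):
    """Build a bitmap with one bit per slot; bit i is 1 when values[i] is not None."""
    bitmap = bytearray((len(values) + 7) // 8)
    for i, value in enumerate(values):
        if value is not None:
            bitmap[i // 8] |= 1 << (i % 8)  # least-significant bit first
    return bytes(bitmap)


def is_valid(bitmap, i):
    """Return True when slot i holds a value rather than NULL."""
    return bool(bitmap[i // 8] & (1 << (i % 8)))


def encode_strings(strings):
    """Store strings as one contiguous byte block plus n + 1 offsets."""
    offsets = [0]
    data = bytearray()
    for s in strings:
        if s is not None:
            data += s.encode("utf-8")
        offsets.append(len(data))  # a NULL slot has zero length
    return offsets, bytes(data)


column = ["pandas", None, "arrow"]
validity = pack_validity(column)
offsets, data = encode_strings(column)

# Read slot 2 back: check the bitmap, then slice the byte block.
third = data[offsets[2]:offsets[3]].decode("utf-8") if is_valid(validity, 2) else None
```

Keeping NULL information in a separate bitmap means the values buffer stays homogeneous, so no sentinel value has to be reserved inside the data.
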
  
25
Apache Arrow
http://arrow.apache.org
Some slides from Strata-HW talk w/ Jacques Nadeau

26
Arrow in a Slide
• New Top-level Apache Software Foundation project
• Focused on Columnar In-Memory Analytics
  1. 10-100x speedup on many workloads
  2. Common data layer enables companies to choose best of breed systems
  3. Designed to work with any programming language
  4. Support for both relational and complex data as-is
• Developers from 13+ major open source projects involved
• A significant % of the world’s data will be processed through Arrow!
Projects involved: Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, R

27
Focus on CPU Efficiency
[Figure: the same four rows stored row by row in a traditional memory buffer and column by column in an Arrow memory buffer]
        session_id  timestamp        source_ip
Row 1   1331246660  3/8/2012 2:44PM  99.155.155.225
Row 2   1331246351  3/8/2012 2:38PM  65.87.165.114
Row 3   1331244570  3/8/2012 2:09PM  71.10.106.181
Row 4   1331261196  3/8/2012 6:46PM  76.102.156.138
• Cache Locality
• Super-scalar & vectorized operation
• Minimal Structure Overhead
• Constant value access
  • With minimal structure overhead
• Operate directly on columnar compressed data
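
The difference between the two buffers in the figure can be sketched with the slide's own four rows. The row layout keeps each record together; the columnar layout keeps each field together, so scanning one field touches only that field's values. Plain Python lists stand in for memory buffers here, which is an assumption made for illustration and says nothing about Arrow's actual implementation.

```python
# Row-oriented: one tuple per record (session_id, timestamp, source_ip).
rows = [
    (1331246660, "3/8/2012 2:44PM", "99.155.155.225"),
    (1331246351, "3/8/2012 2:38PM", "65.87.165.114"),
    (1331244570, "3/8/2012 2:09PM", "71.10.106.181"),
    (1331261196, "3/8/2012 6:46PM", "76.102.156.138"),
]

# Column-oriented: one sequence per field.
names = ("session_id", "timestamp", "source_ip")
columns = {name: list(values) for name, values in zip(names, zip(*rows))}

# A scan over one field reads a single sequence of like-typed values...
newest_session = max(columns["session_id"])

# ...whereas the row layout has to step through every record to reach that field.
newest_session_from_rows = max(row[0] for row in rows)
```
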
  
28
High Performance Sharing & Interchange
Today
• Each system has its own internal memory format
• 70-80% CPU wasted on serialization and deserialization
• Similar functionality implemented in multiple projects
With Arrow
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g., Parquet-to-Arrow reader)

29
Arrow in action: Feather File Format for Python and R
• Problem: fast, language-agnostic binary data frame file format
• By Wes McKinney (Python) and Hadley Wickham (R)
• Read speeds close to disk IO performance
  
30
Real World Example: Feather File Format for Python and R

R:
library(feather)

path <- "my_data.feather"
write_feather(df, path)

df <- read_feather(path)

Python:
import feather

path = 'my_data.feather'

feather.write_dataframe(df, path)
df = feather.read_dataframe(path)
  
31
More on Feather
[Figure: a Feather file holds array 0, array 1, array 2, ..., array n - 1, followed by METADATA. The libfeather C++ library reads and writes the file; Rcpp connects it to R data.frame and Cython connects it to pandas DataFrame.]

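
The layout in the figure, column arrays written first and the metadata describing them written last, can be illustrated with a toy container. This is a sketch under stated assumptions: the JSON footer, the trailing 4-byte length, and the function names `write_columns` and `read_column` are invented for illustration and are not the real Feather format or the libfeather API.

```python
import io
import json
import struct
from array import array


def write_columns(stream, columns):
    """Write each column's raw bytes, then a JSON footer and its length."""
    footer = []
    for name, values in columns.items():
        buf = array("d", values).tobytes()
        footer.append({"name": name, "offset": stream.tell(), "nbytes": len(buf)})
        stream.write(buf)
    meta = json.dumps(footer).encode("utf-8")
    stream.write(meta)
    stream.write(struct.pack("<I", len(meta)))  # footer length goes last


def read_column(stream, name):
    """Locate one column through the footer and read only its bytes."""
    stream.seek(-4, io.SEEK_END)
    (meta_len,) = struct.unpack("<I", stream.read(4))
    stream.seek(-4 - meta_len, io.SEEK_END)
    footer = json.loads(stream.read(meta_len).decode("utf-8"))
    entry = next(item for item in footer if item["name"] == name)
    stream.seek(entry["offset"])
    values = array("d")
    values.frombytes(stream.read(entry["nbytes"]))
    return values.tolist()


# Hypothetical data; an in-memory stream stands in for a file on disk.
stream = io.BytesIO()
write_columns(stream, {"price": [1.5, 2.5], "volume": [10.0, 20.0]})
volume = read_column(stream, "volume")
```

With the metadata at the end, a writer can stream arrays out without knowing their sizes up front, and a reader can fetch a single column with two seeks.
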
32
Feather: the good and not-so-good
• Good
  • Language-agnostic memory representation
  • Extremely fast
  • New storage features can be added without much difficulty
• Not-so-good
  • Data must be converted between the storage representation (Arrow) and in-memory “proprietary” data structures (R / Python data frames)
  
33
Apache Parquet: Python support is coming
• Collaborating with Uwe Korn from Blue Yonder
[Figure: stack diagram connecting pandas, Arrow (C++ / Python), and Parquet (C++)]

34
Shared needs for Python, R, Julia, ...
• If PLs can establish a common data frame C/C++-level memory representation, we can share algorithms and libraries much more easily
  • Example: dplyr’s in-memory backend
• Other requirements
  • Permissive licensing (Python / Julia require MIT/Apache-like)
  • Common build/test/packaging for shared C/C++ library components
  
35
Real World Example: Python With Spark, Drill, Impala
[Figure: a SQL engine sends input partitions 0 through n - 1 to user-supplied Python code; each partition is the input to a Python function and comes back as output, and output partitions 0 through n - 1 return to the SQL engine]

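
The flow in the figure can be sketched as a function applied independently to each partition. In this sketch the partitions are plain Python lists and `run_udf` is a hypothetical stand-in for the engine side; in a real engine the data has to cross a process boundary on the way in and out.

```python
def run_udf(partitions, func):
    """Apply a user-supplied function to each input partition independently."""
    return [func(partition) for partition in partitions]


def double_values(partition):
    """Example user code: receives one partition, returns one partition."""
    return [value * 2 for value in partition]


# Hypothetical input: three partitions of unequal size.
in_partitions = [[1, 2, 3], [4, 5], [6]]
out_partitions = run_udf(in_partitions, double_values)
```

Each partition is handled on its own, so the number of output partitions matches the number of input partitions.
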
36
Get Involved in Arrow
• Join the community
  • dev@arrow.apache.org
  • Slack: https://apachearrowslackin.herokuapp.com/
  • http://arrow.apache.org
  • @ApacheArrow

37
Thank you
Wes McKinney @wesmckinn
Views are my own
  

Python Data Ecosystem: Thoughts on Building for the Future

  • 1. Python Data Ecosystem: Thoughts on Building for the Future. Wes McKinney @wesmckinn, PyData Berlin 2016-05-21. © Cloudera, Inc. All rights reserved.
  • 2. Me
      • Data Science Tools at Cloudera, formerly DataPad CEO/founder
      • Serial creator of structured data tools / user interfaces
      • Wrote bestseller Python for Data Analysis (2012)
      • Open source projects
          • Python {pandas, Ibis, statsmodels}
          • Apache {Arrow, Parquet, Kudu (incubating)}
      • Mostly work in Python and Cython/C/C++
  • 3. In process: Python for Data Analysis, 2nd Edition. Coming early 2017.
  • 4. Building open source communities
  • 5. "Social architecture is the conscious design of an environment that encourages a desired range of social behaviors leading towards some goal or set of goals." (Wikipedia)
  • 6. Step 1: Be open and transparent
  • 7. Step 2: Reach out to others
  • 8. Step 3: Strive for consensus
  • 9. Step 4: Value contributions extending beyond lines of code
  • 10. Step 5: Make things harder for bad actors
  • 12. Handling problems carefully
  • 13. http://numfocus.org and http://apache.org
  • 14. Python packaging
  • 15. Packaging is hard
      • Reproducible infrastructure
      • Reproducible toolchains
      • Reproducible build scripts
      • Integration testing
      • Multiple library version builds
      • Multiple Python versions
      • Dependency resolution
      • Hosting and distribution
      • Multiple environment management
  • 16. Reflecting on the past
  • 18. conda-forge
      • Community-curated conda package channel (on anaconda.org)
      • Reproducible build infrastructure (Docker + Circle CI + Travis CI + Appveyor)
      • Automated GitHub helper tools
      • To enable it: conda config --add channels conda-forge
  • 19. What's important to me right now?
  • 20. Important things
      • Building bridges with other data science communities (R, Julia, Scala, etc.)
      • Enabling Python to more efficiently talk to other systems (e.g. Hadoop things)
      • Building Python tools for new and changing varieties of data
  • 21. RAM as the new disk?
      • SSD-DRAM performance convergence
      • NVM developments (3D XPoint)
      • [Diagram labels: one memory working set, three consumers]
  • 22. Problems
      • Memory (data structure) representations
      • Metadata representations
      • Memory ownership, life-cycle
  • 23. NumPy solved this problem for Python scientists
      • Common memory representation
          • ndarray: strided, homogeneous buffer
      • Common metadata
          • NumPy dtypes
      • No well-defined memory sharing / messaging model: handled on a case-by-case basis
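
As a small illustration of the common memory representation and metadata that slide 23 describes, the sketch below inspects an ndarray's dtype and strides, then shows that a transposed view reuses the same buffer with different strides. The array contents and variable names are made up for illustration.

```python
import numpy as np

# A 3x4 array of 64-bit integers stored in one contiguous, homogeneous buffer.
a = np.arange(12, dtype=np.int64).reshape(3, 4)

# The dtype is the shared metadata: every element is an 8-byte integer.
print(a.dtype, a.itemsize)     # int64 8

# Strides give the number of bytes to step along each axis.
print(a.strides)               # (32, 8)

# A transpose is a view: same buffer, strides swapped, no data copied.
t = a.T
print(t.strides)               # (8, 32)
print(np.shares_memory(a, t))  # True
```

Any library that understands the pointer, dtype, shape, and strides can operate on the same buffer, which is what made NumPy a shared foundation.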
  • 24. Problems NumPy doesn't solve as well
      • Nested data types (think JSON)
      • Missing / NULL data
      • Strings and category types
      • Columnar memory representation for tables (think: analytic SQL databases)
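
The missing-data gap from slide 24 can be seen directly: NumPy has no NULL for integers, so a missing value forces either an upcast to float with NaN or a generic object array. One common workaround, keeping a separate validity mask next to the values, is sketched below. The values and variable names are illustrative assumptions.

```python
import numpy as np

# Integers cannot hold NaN, so a missing entry forces a float64 array.
with_nan = np.array([1, 2, np.nan])
print(with_nan.dtype)          # float64

# Using None instead falls back to an array of Python objects.
with_none = np.array([1, 2, None])
print(with_none.dtype)         # object

# Workaround: keep the integer values and track validity separately.
values = np.array([1, 2, 0], dtype=np.int64)   # 0 is only a placeholder
valid = np.array([True, True, False])
print(values[valid].sum())     # 3
```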
  • 25. Apache Arrow: http://arrow.apache.org (some slides from Strata-HW talk w/ Jacques Nadeau)
  • 26. Arrow in a Slide
      • New top-level Apache Software Foundation project
      • Focused on columnar in-memory analytics
          1. 10-100x speedup on many workloads
          2. Common data layer enables companies to choose best-of-breed systems
          3. Designed to work with any programming language
          4. Support for both relational and complex data as-is
      • Developers from 13+ major open source projects involved: Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, R
      • A significant % of the world's data will be processed through Arrow!
  • 27. Focus on CPU Efficiency
      Example table:
          session_id   timestamp         source_ip
          1331246660   3/8/2012 2:44PM   99.155.155.225
          1331246351   3/8/2012 2:38PM   65.87.165.114
          1331244570   3/8/2012 2:09PM   71.10.106.181
          1331261196   3/8/2012 6:46PM   76.102.156.138
      Traditional memory buffer: values laid out row by row (Row 1, Row 2, Row 3, Row 4).
      Arrow memory buffer: values laid out column by column (session_id, timestamp, source_ip).
      • Cache locality
      • Super-scalar & vectorized operation
      • Minimal structure overhead
      • Constant value access, with minimal structure overhead
      • Operate directly on columnar compressed data
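
The difference between the two buffers on slide 27 can be sketched with the standard library alone. Below, the same four records are stored once as packed rows and once as a contiguous array for a single column. The field widths and the use of `struct` and `array` are assumptions for illustration, not Arrow's actual format.

```python
import struct
from array import array

rows = [
    (1331246660, "3/8/2012 2:44PM", "99.155.155.225"),
    (1331246351, "3/8/2012 2:38PM", "65.87.165.114"),
    (1331244570, "3/8/2012 2:09PM", "71.10.106.181"),
    (1331261196, "3/8/2012 6:46PM", "76.102.156.138"),
]

# Row-oriented: each record's fields sit next to each other.
# Illustrative layout: int64 id, 16-byte timestamp, 16-byte IP.
row_format = struct.Struct("<q16s16s")
row_buffer = b"".join(
    row_format.pack(sid, ts.encode(), ip.encode()) for sid, ts, ip in rows
)

# Column-oriented: all values of one field sit next to each other.
session_id = array("q", (sid for sid, _, _ in rows))

# Scanning ids in the row buffer has to step over the other fields.
ids_from_rows = [rec[0] for rec in row_format.iter_unpack(row_buffer)]

print(len(row_buffer))                        # 160 bytes walked to read the ids
print(session_id.itemsize * len(session_id))  # 32 bytes walked to read the ids
```

Both layouts return the same ids, but the columnar scan touches a fifth of the bytes in this toy case, which is the cache-locality argument in miniature.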
  • 28. High Performance Sharing & Interchange
      Today:
      • Each system has its own internal memory format
      • 70-80% CPU wasted on serialization and deserialization
      • Similar functionality implemented in multiple projects
      With Arrow:
      • All systems utilize the same memory format
      • No overhead for cross-system communication
      • Projects can share functionality (e.g., Parquet-to-Arrow reader)
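
To make the contrast on slide 28 concrete, the following sketch compares handing data to a consumer through serialization (a copy out and a copy back in) with handing over a view on the same buffer. It uses Python's `pickle` and `memoryview` as stand-ins within a single process; Arrow's own mechanisms are not shown here.

```python
import pickle
from array import array

data = array("d", range(100_000))

# "Today": serialize, then deserialize. The consumer gets its own copy.
payload = pickle.dumps(data)
copy = pickle.loads(payload)
copy[0] = -1.0
print(data[0])   # 0.0, the original is untouched because the data was copied

# "Shared format": the consumer gets a view on the producer's memory.
view = memoryview(data)
view[0] = -1.0
print(data[0])   # -1.0, both sides see the same bytes
```

The second path does no encoding or decoding work at all, which is only possible because producer and consumer agree on the memory layout.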
  • 29. Arrow in action: Feather File Format for Python and R
      • Problem: fast, language-agnostic binary data frame file format
      • By Wes McKinney (Python) and Hadley Wickham (R)
      • Read speeds close to disk IO performance
  • 30. Real World Example: Feather File Format for Python and R
      R:
          library(feather)
          path <- "my_data.feather"
          write_feather(df, path)
          df <- read_feather(path)
      Python:
          import feather
          path = 'my_data.feather'
          feather.write_dataframe(df, path)
          df = feather.read_dataframe(path)
  • 31. More on Feather
      • [Diagram: a Feather file holds array 0, array 1, array 2, ..., array n - 1, followed by METADATA]
      • [Diagram: the libfeather C++ library sits under Rcpp (R data.frame) and Cython (pandas DataFrame)]
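
The "arrays first, metadata last" layout in the slide 31 diagram can be imitated with a toy file format. The sketch below is not Feather's actual on-disk specification: the JSON metadata block, the 4-byte length footer, and the function names `write_columns` and `read_column` are all assumptions made for illustration.

```python
import io
import json
import struct
from array import array

def write_columns(f, columns):
    """Write each column's raw bytes, then a JSON metadata block and its length."""
    meta = []
    for name, col in columns.items():
        offset = f.tell()
        raw = col.tobytes()
        f.write(raw)
        meta.append({"name": name, "typecode": col.typecode,
                     "offset": offset, "nbytes": len(raw)})
    meta_bytes = json.dumps(meta).encode("utf-8")
    f.write(meta_bytes)
    f.write(struct.pack("<I", len(meta_bytes)))  # footer: metadata size

def read_column(f, name):
    """Find the metadata via the footer, then read only the requested column."""
    f.seek(-4, io.SEEK_END)
    (meta_len,) = struct.unpack("<I", f.read(4))
    f.seek(-(4 + meta_len), io.SEEK_END)
    meta = json.loads(f.read(meta_len).decode("utf-8"))
    entry = next(m for m in meta if m["name"] == name)
    f.seek(entry["offset"])
    col = array(entry["typecode"])
    col.frombytes(f.read(entry["nbytes"]))
    return col

buf = io.BytesIO()
write_columns(buf, {"x": array("q", [1, 2, 3]), "y": array("d", [0.5, 1.5, 2.5])})
print(read_column(buf, "y"))   # array('d', [0.5, 1.5, 2.5])
```

Putting the metadata at the end lets a writer stream the arrays out first, and lets a reader seek straight to one column without scanning the others.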
  • 32. Feather: the good and not-so-good
      • Good
          • Language-agnostic memory representation
          • Extremely fast
          • New storage features can be added without much difficulty
      • Not-so-good
          • Data must be converted between the storage representation (Arrow) and in-memory "proprietary" data structures (R / Python data frames)
  • 33. Apache Parquet: Python support is coming
      • Collaborating with Uwe Korn from Blue Yonder
      • [Diagram labels: pandas, Arrow (C++ / Python), Parquet (C++)]
  • 34. Shared needs for Python, R, Julia, ...
      • If programming languages can establish a common C/C++-level memory representation for data frames, we can share algorithms and libraries much more easily
          • Example: dplyr's in-memory backend
      • Other requirements
          • Permissive licensing (Python / Julia require MIT/Apache-like)
          • Common build/test/packaging for shared C/C++ library components
  • 35. Real World Example: Python With Spark, Drill, Impala
      • [Diagram: SQL Engine → in partition 0 … in partition n - 1 → Python function (input → user-supplied Python code → output) → out partition 0 … out partition n - 1 → SQL Engine]
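
The data flow in the slide 35 diagram can be sketched in plain Python. This is only a single-process stand-in: the partitions are ordinary lists, and `run_udf` is a hypothetical helper name, not an API of Spark, Drill, or Impala.

```python
from typing import Callable, Iterable, List

Partition = List[int]

def run_udf(partitions: Iterable[Partition],
            func: Callable[[Partition], Partition]) -> List[Partition]:
    """Apply user-supplied Python code to each input partition independently."""
    return [func(part) for part in partitions]

# Stand-in for data handed over by the SQL engine, one list per partition.
in_partitions = [[1, 2, 3], [4, 5], [6]]

# User-supplied Python code: here, double every value.
out_partitions = run_udf(in_partitions, lambda part: [2 * v for v in part])
print(out_partitions)   # [[2, 4, 6], [8, 10], [12]]
```

In a real engine each partition would cross a process boundary on the way in and out, which is where a shared memory format avoids serialization cost.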
  • 36. Get Involved in Arrow
      • Join the community
          • dev@arrow.apache.org
          • Slack: https://apachearrowslackin.herokuapp.com/
          • http://arrow.apache.org
          • @ApacheArrow
  • 37. Thank you. Wes McKinney @wesmckinn. Views are my own.