Python Data Ecosystem: Thoughts on Building for the Future


Keynote from PyData Berlin 2016-05-21


  1. Python Data Ecosystem: Thoughts on Building for the Future
     Wes McKinney @wesmckinn
     PyData Berlin 2016-05-21
     © Cloudera, Inc. All rights reserved.
  2. Me
     • Data Science Tools at Cloudera, formerly DataPad CEO/founder
     • Serial creator of structured data tools / user interfaces
     • Wrote the bestseller Python for Data Analysis (2012)
     • Open source projects
       • Python {pandas, Ibis, statsmodels}
       • Apache {Arrow, Parquet, Kudu (incubating)}
     • Mostly work in Python and Cython/C/C++
  3. In process: Python for Data Analysis, 2nd Edition
     Coming early 2017
  4. Building open source communities
  5. "Social architecture is the conscious design of an environment that encourages a desired range of social behaviors leading towards some goal or set of goals." (Wikipedia)
  6. Step 1: Be open and transparent
  7. Step 2: Reach out to others
  8. Step 3: Strive for consensus
  9. Step 4: Value contributions extending beyond lines of code
  10. Step 5: Make things harder for bad actors
  11. (no text on slide)
  12. Handling problems carefully
  13. http://numfocus.org
      http://apache.org
  14. Python packaging
  15. Packaging is hard
      • Reproducible infrastructure
      • Reproducible toolchains
      • Reproducible build scripts
      • Integration testing
      • Multiple library version builds
      • Multiple Python versions
      • Dependency resolution
      • Hosting and distribution
      • Multiple environment management
  16. Reflecting on the past
  17. (no text on slide)
  18. conda-forge
      • Community-curated conda package channel (on anaconda.org)
      • Reproducible build infrastructure (Docker + Circle CI + Travis CI + Appveyor)
      • Automated GitHub helper tools
      conda config --add channels conda-forge
  19. What's important to me right now?
  20. Important things
      • Building bridges with other data science communities (R, Julia, Scala, etc.)
      • Enabling Python to more efficiently talk to other systems (e.g. Hadoop things)
      • Building Python tools for new and changing varieties of data
  21. RAM as the new disk?
      • SSD – DRAM performance convergence
      • NVM developments (3D XPoint)
      [Diagram: one memory working set shared by three consumers]
  22. Problems
      • Memory (data structure) representations
      • Metadata representations
      • Memory ownership, life-cycle
  23. NumPy solved this problem for Python scientists
      • Common memory representation
        • ndarray: strided, homogeneous buffer
      • Common metadata
        • NumPy dtypes
      • No well-defined memory sharing / messaging model: case by case basis
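The "strided, homogeneous buffer" and "common metadata" points can be seen directly on an ndarray. The following is a minimal sketch, assuming NumPy is installed; the array contents and variable names are illustrative only and do not come from the talk.

```python
import numpy as np

# A 3x4 array of 64-bit integers stored in one contiguous, homogeneous buffer.
a = np.arange(12, dtype=np.int64).reshape(3, 4)

print(a.dtype)    # common metadata: the element type (int64)
print(a.shape)    # (3, 4)
print(a.strides)  # bytes to step along each dimension: (32, 8)

# Transposing creates a view over the same memory; only the strides change.
t = a.T
print(t.strides)               # (8, 32)
print(np.shares_memory(a, t))  # True
```

Because any library that understands the dtype and strides can read the buffer without copying, this shared representation is what lets independent packages exchange arrays.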
  24. Problems NumPy doesn't solve as well
      • Nested data types (think JSON)
      • Missing / NULL data
      • Strings and category types
      • Columnar memory representation for tables (think: analytic SQL databases)
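Two of these gaps are easy to reproduce. The sketch below assumes NumPy is installed and uses made-up values; the values-plus-mask workaround at the end is one common convention, shown for illustration rather than as what the talk prescribes.

```python
import numpy as np

# Missing data: integers have no NULL value, so a missing entry
# forces the whole array to floating point.
with_missing = np.array([1, np.nan, 3])
print(with_missing.dtype)  # float64

# Strings: NumPy's unicode dtype is fixed-width, so every element is
# padded to the length of the longest string (20 characters here).
names = np.array(["a", "bb", "a much longer string"])
print(names.itemsize)  # 80 bytes per element (4 bytes per character)

# A common workaround for missing integers: keep the values and a
# separate boolean validity mask side by side.
values = np.array([1, 0, 3], dtype=np.int64)
valid = np.array([True, False, True])
total = values[valid].sum()
print(total)  # 4
```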
  25. Apache Arrow
      http://arrow.apache.org
      Some slides from Strata-HW talk w/ Jacques Nadeau
  26. Arrow in a Slide
      • New Top-level Apache Software Foundation project
      • Focused on Columnar In-Memory Analytics
        1. 10-100x speedup on many workloads
        2. Common data layer enables companies to choose best of breed systems
        3. Designed to work with any programming language
        4. Support for both relational and complex data as-is
      • Developers from 13+ major open source projects involved
      • A significant % of the world's data will be processed through Arrow!
      Projects: Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, R
  27. Focus on CPU Efficiency
      [Diagram: the same table stored row by row in a traditional memory buffer and column by column in an Arrow memory buffer]

             session_id   timestamp         source_ip
      Row 1  1331246660   3/8/2012 2:44PM   99.155.155.225
      Row 2  1331246351   3/8/2012 2:38PM   65.87.165.114
      Row 3  1331244570   3/8/2012 2:09PM   71.10.106.181
      Row 4  1331261196   3/8/2012 6:46PM   76.102.156.138

      • Cache locality
      • Super-scalar & vectorized operation
      • Minimal structure overhead
      • Constant value access
        • With minimal structure overhead
      • Operate directly on columnar compressed data
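The difference between the two buffers on this slide can be sketched with the standard library alone. The example below reuses the slide's four rows; the fixed-width record layout (one int64 plus two 16-byte strings) is an assumption made for illustration and is not Arrow's actual format.

```python
import struct
from array import array

rows = [
    (1331246660, "3/8/2012 2:44PM", "99.155.155.225"),
    (1331246351, "3/8/2012 2:38PM", "65.87.165.114"),
    (1331244570, "3/8/2012 2:09PM", "71.10.106.181"),
    (1331261196, "3/8/2012 6:46PM", "76.102.156.138"),
]

# Row-oriented buffer: the fields of each record are interleaved.
# Assumed layout: little-endian int64, then two 16-byte padded strings.
row_fmt = struct.Struct("<q16s16s")
row_buffer = b"".join(
    row_fmt.pack(sid, ts.encode(), ip.encode()) for sid, ts, ip in rows
)

# Reading one column means jumping over the other fields of every record.
session_ids_from_rows = [
    row_fmt.unpack_from(row_buffer, i * row_fmt.size)[0]
    for i in range(len(rows))
]

# Column-oriented buffers: each field is stored contiguously on its own.
session_id = array("q", (r[0] for r in rows))
timestamp = [r[1] for r in rows]
source_ip = [r[2] for r in rows]

# A scan over session_id now touches one dense run of 8-byte integers.
newest = max(session_id)
print(newest)
```

Scanning `session_id` in the columnar form reads only the bytes it needs, which is the cache-locality argument the slide makes.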
  28. High Performance Sharing & Interchange
      Today
      • Each system has its own internal memory format
      • 70-80% CPU wasted on serialization and deserialization
      • Similar functionality implemented in multiple projects
      With Arrow
      • All systems utilize the same memory format
      • No overhead for cross-system communication
      • Projects can share functionality (e.g., Parquet-to-Arrow reader)
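The contrast between the two columns can be illustrated inside a single Python process. This is only an analogy, using the standard library's `pickle` and `memoryview`; it does not use Arrow and does not show sharing across systems.

```python
import pickle
from array import array

data = array("d", range(1000))  # 1000 float64 values

# "Today": the consumer receives a serialized copy, which costs CPU time
# to encode and decode, plus a second allocation.
payload = pickle.dumps(data)
copy = pickle.loads(payload)

# "Shared format": the consumer reads the producer's buffer directly.
# memoryview exposes the same bytes without copying them.
view = memoryview(data)
view[0] = 42.0

print(data[0])  # 42.0, because the view and the array share memory
print(copy[0])  # 0.0, because the unpickled copy is independent
```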
  29. Arrow in action: Feather File Format for Python and R
      • Problem: fast, language-agnostic binary data frame file format
      • By Wes McKinney (Python) and Hadley Wickham (R)
      • Read speeds close to disk IO performance
  30. Real World Example: Feather File Format for Python and R
      R:
          library(feather)

          # df is an existing data.frame
          path <- "my_data.feather"
          write_feather(df, path)

          df <- read_feather(path)
      Python:
          import feather

          # df is an existing pandas DataFrame
          path = 'my_data.feather'

          feather.write_dataframe(df, path)
          df = feather.read_dataframe(path)
  31. More on Feather
      [Diagram: a Feather file holds array 0, array 1, array 2, ..., array n - 1, followed by METADATA. The libfeather C++ library reads and writes it, connecting through Rcpp to R data.frame and through Cython to pandas DataFrame.]
  32. Feather: the good and not-so-good
      • Good
        • Language-agnostic memory representation
        • Extremely fast
        • New storage features can be added without much difficulty
      • Not-so-good
        • Data must be converted between the storage representation (Arrow) and in-memory "proprietary" data structures (R / Python data frames)
  33. Apache Parquet: Python support is coming
      • Collaborating with Uwe Korn from Blue Yonder
      [Diagram: pandas, Arrow (C++ / Python), Parquet (C++)]
  34. Shared needs for Python, R, Julia, ...
      • If programming languages can establish a common data frame C/C++-level memory representation, we can share algorithms and libraries much more easily
        • Example: dplyr's in-memory backend
      • Other requirements
        • Permissive licensing (Python / Julia require MIT/Apache-like)
        • Common build/test/packaging for shared C/C++ library components
  35. Real World Example: Python With Spark, Drill, Impala
      [Diagram: a SQL engine splits its data into input partitions 0 … n - 1. Each partition is passed as Python function input to user-supplied Python code, and each output becomes one of the out partitions 0 … n - 1 returned to the SQL engine.]
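The data flow in this diagram can be sketched in plain Python. Everything below is hypothetical: `run_udf` stands in for the SQL engine's side of the exchange and `double_all` for the user-supplied code; neither name belongs to Spark, Drill, or Impala.

```python
from typing import Callable, List

Partition = List[float]


def run_udf(
    partitions: List[Partition],
    func: Callable[[Partition], Partition],
) -> List[Partition]:
    """Apply a user function to each input partition independently.

    Hypothetical stand-in for the engine: one call per partition,
    one output partition per input partition.
    """
    return [func(partition) for partition in partitions]


def double_all(partition: Partition) -> Partition:
    """Example user-supplied function: double every value."""
    return [x * 2 for x in partition]


in_partitions = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
out_partitions = run_udf(in_partitions, double_all)
print(out_partitions)  # [[2.0, 4.0], [6.0], [8.0, 10.0, 12.0]]
```

In a real engine, each partition would have to cross a process or language boundary, which is where a shared memory format avoids the conversion cost.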
  36. Get Involved in Arrow
      • Join the community
        • dev@arrow.apache.org
        • Slack: https://apachearrowslackin.herokuapp.com/
        • http://arrow.apache.org
        • @ApacheArrow
  37. Thank you
      Wes McKinney @wesmckinn
      Views are my own