11. April 2008 - Avant-garde PyData
● Socializing Python inside AQR, a quantitative hedge fund
● scipy.stats.models enabled some R -> Python workload migration
12. Dec 2009 - pandas 0.1
● First open source release after ~18 months of internal-only use
13. May 2011 - "PyData" core dev meetings
"Need a toolset that is robust, fast, and suitable for a production environment..."
14. May 2011 - "PyData" core dev meetings
"... but also good for interactive research..."
15. May 2011 - "PyData" core dev meetings
"... and easy / intuitive for non-software engineers to use"
16. May 2011 - "PyData" core dev meetings
* also, we need to fix packaging
17. July 2011 - Concerns
"... the current state of affairs has me rather anxious ... these tools [e.g. pandas] have largely not been integrated with any other tools because of the community's collective commitment anxiety"
http://wesmckinney.com/blog/a-roadmap-for-rich-scientific-data-structures-in-python/
18. July 2011 - Concerns
"Fragmentation is killing us"
http://wesmckinney.com/blog/a-roadmap-for-rich-scientific-data-structures-in-python/
20. Python for Data Analysis book - 2012
● A primer in data manipulation in Python
● Focus: NumPy, IPython/Jupyter, pandas, matplotlib
● 2 editions (2012, 2017)
● 8 translations so far
21. 2013-2014 - An Entrepreneurial Detour
DataPad: Python-powered Business Analytics
● Backend built with PyData stack + custom analytics
● Goal to contribute tech back to OSS ecosystem
23. PyData NYC 2013: 10 Things I Hate About pandas
● November 2013
● Summary: "pandas is not designed like, or intended to be used as, a database query engine"
26. Fall 2014: Python in a Big Data World
Task: Helping Python become a first-class technology for Big Data
Some Problems
● File formats
● JVM interop
● Non-array-oriented interfaces
28. Apache Arrow: Defragmenting data systems
● Language-independent open standard in-memory representation for columnar data (i.e. data frames)
● Easily reuse code targeting Arrow memory
● Efficient memory interchange
[Diagram labels: Arrow memory, JVM Data Ecosystem, Database Systems, Data Science Libraries]
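
The columnar idea on this slide can be illustrated with a small sketch. It uses only the Python standard library and is not Arrow's actual format or API. The sample rows and the names `rows`, `columns`, and `price_view` are hypothetical. It shows the same records stored row by row and column by column, and a consumer reading one column through a shared buffer instead of a copy.

```python
from array import array

# Row-oriented layout: one tuple per record (hypothetical sample data).
rows = [(1, 2.5), (2, 3.5), (3, 4.5)]

# Column-oriented layout: one contiguous, typed buffer per column.
columns = {
    "id": array("q", [r[0] for r in rows]),     # C signed long long
    "price": array("d", [r[1] for r in rows]),  # C double
}

# A consumer can look at a column's memory through the buffer protocol
# without copying it.
price_view = memoryview(columns["price"])
total = sum(price_view)

# A write by the producer is visible through the view: same memory.
columns["price"][0] = 10.0
shares_memory = price_view[0] == 10.0

print(total, shares_memory)
```

Storing each column as one typed buffer is what lets separate components hand data to each other by reference. Arrow standardizes such a layout across languages, which this single-process sketch does not attempt.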
29. Apache Arrow: Defragmenting data systems
● https://github.com/apache/arrow
● Over 200 unique contributors
● Some level of support for 11 programming languages