WHY PYTHON IS BETTER
FOR DATA SCIENCE
ÍCARO MEDEIROS
São Paulo Big Data Meetup

São Paulo - SP, 25/11/2015

Why Python is better for Data Science

Discover why Python is better for Data Science: Python covers the whole data analysis workflow. Tools for various tasks are shown, including workflow, data analysis, data visualization, integration with the Hadoop ecosystem, and communication.

Published in: Software

  1. WHY PYTHON IS BETTER FOR DATA SCIENCE. ÍCARO MEDEIROS, São Paulo Big Data Meetup, São Paulo - SP, 25/11/2015
  2. DATA SCIENTISTS SHOULD DO… http://berkeleysciencereview.com/article/first-rule-data-science/
  3. WHY PYTHON? ▸ General purpose ▸ Smooth learning curve ▸ REPL (IPython!) ▸ Programmer productivity ▸ Popular and mature ▸ Glue language (high level API, low level C/Fortran bindings) ▸ Science ecosystem (growing!)
  4. PYTHON IS POPULAR: IT MEANS WIDESPREAD KNOWLEDGE AND MANY TOOLS http://githut.info/
  5. PYTHON IS POPULAR: IT MEANS WIDESPREAD KNOWLEDGE AND MANY TOOLS pypl.github.io/PYPL.html
  6. AVOID THE TWO LANGUAGE PROBLEM
  7. PYTHON CAN BE USED IN WHOLE DATA SCIENCE WORKFLOW https://speakerdeck.com/chdoig/the-state-of-python-for-data-science-pyss-2015?slide=22
  8. ONE DAY AT FACEBOOK’S DATA SCIENCE TEAM, A MEMBER COULD… “AUTHOR A MULTISTAGE PROCESSING PIPELINE IN PYTHON, DESIGN A HYPOTHESIS TEST, PERFORM A REGRESSION ANALYSIS OVER DATA SAMPLES WITH R, DESIGN AND IMPLEMENT AN ALGORITHM FOR SOME DATA-INTENSIVE PRODUCT OR SERVICE IN HADOOP, OR COMMUNICATE THE RESULTS OF OUR ANALYSES” (Jeff Hammerbacher) http://berkeleysciencereview.com/scientific-collaborations-uc-berkeley-data-driven-cover/
  9. OPTIONS FOR PROCESSING PIPELINE: Airflow (https://github.com/airbnb/airflow), Luigi (https://github.com/spotify/luigi)
  10. AIRFLOW EXAMPLE https://github.com/airbnb/airflow
  11. REGRESSION ANALYSIS IN PYTHON: EASY http://statsmodels.sourceforge.net/devel/examples/notebooks/generated/ols.html
  12. PYTHON <3 BIG DATA ▸ mrjob: map reduce in python (https://pythonhosted.org/mrjob) ▸ snakebite: pure python HDFS client (https://github.com/spotify/snakebite) ▸ Spark: fast and general engine for large-scale data processing (http://spark.apache.org) ▸ …
  13. OH, BUT SCALA/JAVA IS FASTER. PYTHON IS 2 * FASTER: [WRITING, RUNNING]. DataFrame operations are optimized and compiled into JVM bytecode https://databricks.com/blog/2015/04/24/recent-performance-improvements-in-apache-spark-sql-python-dataframes-and-more.html
  14. RDD AVERAGE: EXAMPLE FROM ‘LEARNING SPARK'
  15. RDD AVERAGE: EXAMPLE FROM ‘LEARNING SPARK' SO CONCISE
  16. COMMUNICATE RESULTS WITH IPYTHON / JUPYTER Language agnostic :)
  17. COMMUNICATE RESULTS WITH IPYTHON / JUPYTER DEMO TIME
  18. MATPLOTLIB / SEABORN / PLOT.LY / BOKEH: SUCH VISUALIZATION!!
  19. PYTHON FITS ALL!
  20. PYTHON FITS ALL!
  21. PYTHON FOR SCIENCE IS GROWING
  22. SCIENCE IS GETTING MORE AND MORE IMPORTANT FOR PYTHON COMMUNITY. 6/25 MOST POPULAR LIBRARIES ARE FOR DATA SCIENCE https://www.python.org/dev/peps/pep-0465/#but-isn-t-matrix-multiplication-a-pretty-niche-requirement
        #   module      imports  imports/numpy
        1   sys         2437939  5.85
        2   os          2009086  4.82
        3   re          1303009  3.12
        4   numpy        416981  1.00
        5   warnings     371345  0.89
        6   subprocess   344934  0.83
        7   django       282097  0.68
        8   math         281987  0.68
        11  matplotlib   146913  0.35
        13  pylab         77817  0.19
        14  scipy         69092  0.17
        22  pandas        18928  0.05
        24  theano         5482  0.051
  23. SCIENCE IS IMPORTANT FOR PYTHON: MATRIX MULTIPLICATION https://www.python.org/dev/peps/pep-0465/#but-isn-t-matrix-multiplication-a-pretty-niche-requirement
        $S = (H\beta - r)^{T} (H V H^{T})^{-1} (H\beta - r)$

        import numpy as np
        from numpy.linalg import inv, solve

        # Using dot function:
        S = np.dot((np.dot(H, beta) - r).T,
                   np.dot(inv(np.dot(np.dot(H, V), H.T)), np.dot(H, beta) - r))

        # With the @ operator
        S = (H @ beta - r).T @ inv(H @ V @ H.T) @ (H @ beta - r)

      PEP 0465: PROPOSED FEB/14. SINCE PY 3.5 (SEP/15)
      2013: 7 INTERNATIONAL CONFERENCES ON NUMERICAL PYTHON
      AT PYCON 2014, ~20% OF THE TUTORIALS INVOLVED THE USE OF MATRICES
  24. SCIENCE STACK IS GETTING BETTER EACH DAY https://speakerdeck.com/jakevdp/the-state-of-the-stack-scipy-2015-keynote?slide=8
  25. SCIENCE STACK IS ALWAYS EVOLVING… https://speakerdeck.com/jakevdp/the-state-of-the-stack-scipy-2015-keynote?slide=29
  26. CONDA: AUTOMATING ENVIRONMENTS https://speakerdeck.com/chdoig/the-state-of-python-for-data-science-pyss-2015?slide=60
  27. THE STACK IS STILL GETTING NEW MEMBERS… http://www.tensorflow.org/
  28. TAKEAWAY MESSAGE TRY PYTHON. IT WILL BE A ONE WAY TRIP!
  29. slides icaromedeiros.com.br slideshare.net/icaromedeiros @icaromedeiros
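
The code shown on the "RDD AVERAGE" slides (14 and 15) did not survive in this transcript. The sketch below is not that code. It is a plain-Python illustration of the usual (sum, count) aggregation pattern for computing an average over partitioned data, which is how such averages are commonly written against Spark's `RDD.aggregate(zeroValue, seqOp, combOp)`. The `aggregate` helper, the `partitions` list, and the sample numbers are hypothetical stand-ins so the example runs without a Spark installation.

```python
from functools import reduce


def aggregate(partitions, zero, seq_op, comb_op):
    """Hypothetical local stand-in mimicking the shape of RDD.aggregate."""
    # Fold every partition on its own, starting from the zero value.
    partials = [reduce(seq_op, part, zero) for part in partitions]
    # Merge the per-partition results into a single accumulator.
    return reduce(comb_op, partials, zero)


# Assumed sample data, split into three "partitions".
partitions = [[1, 2, 3], [4, 5], [6]]

total, count = aggregate(
    partitions,
    (0, 0),                                   # accumulator: (sum, count)
    lambda acc, x: (acc[0] + x, acc[1] + 1),  # add one value to an accumulator
    lambda a, b: (a[0] + b[0], a[1] + b[1]),  # merge two accumulators
)
average = total / count
print(average)
```

Carrying the sum and the count together means the data is traversed only once, and the merge step stays associative, which is what lets partitions be processed independently.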

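Slide 23 contrasts nested `np.dot` calls with the `@` operator introduced by PEP 465 (Python 3.5). As a rough illustration of why the two spellings are interchangeable, the sketch below uses a toy `Mat` class (hypothetical, not NumPy) whose `__matmul__` is the same function as its `dot` method. The matrices `H` and `V` are made-up sample values, and the example only covers the chained product `H V H^T`, not the full formula with the inverse.

```python
class Mat:
    """Hypothetical toy matrix type (not NumPy), stored as a list of rows."""

    def __init__(self, rows):
        self.rows = [list(r) for r in rows]

    @property
    def T(self):
        # Transpose: columns become rows.
        return Mat(zip(*self.rows))

    def dot(self, other):
        cols = list(zip(*other.rows))
        return Mat([[sum(a * b for a, b in zip(row, col)) for col in cols]
                    for row in self.rows])

    # PEP 465: the expression `a @ b` calls a.__matmul__(b).
    __matmul__ = dot

    def __eq__(self, other):
        return isinstance(other, Mat) and self.rows == other.rows


# Assumed sample values for illustration only.
H = Mat([[1, 2], [3, 4]])
V = Mat([[2, 0], [0, 1]])

nested = H.dot(V).dot(H.T)  # method-call spelling
infix = H @ V @ H.T         # operator spelling, evaluated left to right
print(infix.rows)
```

The infix form reads in the same order as the written formula, which is the readability argument the slide makes.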