The Artful Business of Data Mining: Computational Statistics with Open Source Tools

This talk goes over concepts of data mining and data analysis with open source tools, mainly Python and R, along with interesting libraries and the tools I have used and currently use at Engine Yard.

  1. The Artful Business of Data Mining: Computational Statistics with Open Source Tools. Wednesday 20 March 2013
  2. David Coallier @davidcoallier
  3. Data Scientist At Engine Yard (.com)
  4. Find Data
  5. Clean Data
  6. Analyse Data?
  7. Analyse Data
  8. Question Data
  9. Report Findings
  10. Data Scientist
  11. Data Janitor
  12. Actual Tasks
  13. “If your model is elegant, it’s probably wrong”
  14. “The Times They Are a-Changin'” (Bob Dylan)
  15. Python & R
  16. SciPy: http://www.scipy.org
  17. scipy.stats
  18. scipy.stats: Descriptive Statistics
  19. from scipy.stats import describe
      s = [1, 2, 1, 3, 4, 5]
      print(describe(s))  # nobs, min/max, mean, variance, skewness, kurtosis
  20. scipy.stats: Probability Distributions
  21. Example: Poisson Distribution
  22. $f(k; \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}$ for $k \ge 0$
  23. from scipy.stats import poisson
      # PMF of a Poisson distribution with rate 2, evaluated at each k in the list
      p = poisson.pmf([1, 2, 3, 4, 1, 2, 3], 2)
  24. print(p.mean())
      print(p.sum())
      ...
  25. NumPy: http://www.numpy.org/
  26. NumPy: Linear Algebra
  27. $\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$
  28. import numpy as np
      x = np.array([[1, 0], [0, 1]])
      val, vec = np.linalg.eig(x)  # eigenvalues first, then eigenvectors
      np.linalg.eigvals(x)  # eigenvalues only
  29. >>> np.linalg.eig(x)
      (array([ 1.,  1.]), array([[ 1.,  0.], [ 0.,  1.]]))
  30. Matplotlib: Python Plotting
  31. statsmodels: Advanced Statistics Modeling
  32. NLTK: Natural Language Tool Kit
  33. scikit-learn: Machine Learning
  34. from sklearn import tree
      X = [[0, 0], [1, 1]]
      Y = [0, 1]
      clf = tree.DecisionTreeClassifier()
      clf = clf.fit(X, Y)
      clf.predict([[2., 2.]])  # returns array([1])
  35. PyBrain: ... Machine Learning
  36. PyMC: Bayesian Inference
  37. Pattern: Web Mining for Python
  38. NetworkX: Study Networks
  39. MILK: MOAR machine LEARNING!
  40. Pandas: easy-to-use data structures
  41. from pandas import DataFrame
      x = DataFrame([{"age": 26}, {"age": 19}, {"age": 21}, {"age": 18}])
      print(x[x['age'] > 20].count())  # number of rows with age > 20
      print(x[x['age'] > 20].mean())   # mean age among those rows
  42. R
  43. RStudio: The IDE
  44. lubridate and zoo: Dealing with Dates...
  45. yy/mm/dd; mm/dd/yy; YYYY-mm-dd HH:MM:ss TZ; yy-mm-dd; 1363784094.513425; yy/mm; different timezone
  46. reshape2: Reshape your Data
  47. ggplot2: Visualise your Data
  48. RCurl, RJSONIO: Find more Data
  49. Hmisc: Miscellaneous useful functions
  50. forecast: Can you guess?
  51. garch and rugarch
  52. quantmod: Statistical Financial Trading
  53. xts: Extensible Time Series
  54. igraph: Study Networks
  55. maptools: Read & View Maps
  56. map('state', region = c(row.names(USArrests)), col=cm.colors(16, 1)[floor(USArrests$Rape/max(USArrests$Rape)*28)], fill=T)
  57. Storage
  58. Oppose “big” Data
  59. “Learn how to sample”
  60. Experiments
  61. What Do You Want to Answer?
  62. Understand Your Audience
  63. Scientific Reporting
  64. Busy-ness: Time is money
  65. Public Visualisation
  66. Best Visualisation, Bad Data
  67. Best Forecasting models... Bad Visualisation
  70. Seanchaí
  72. Feel it
  76. “Don’t be scared of bar charts.”
  77. Mathematical Statistics, Engineering, Business, Economics, Curiosity
  78. davidcoallier.github.com @davidcoallier on Twitter
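
The Poisson example on slides 22 to 24 can be checked by hand. The following is a minimal sketch using only the Python standard library, not code from the talk; the helper name `poisson_pmf` is hypothetical, and the truncation point of 50 is an arbitrary choice for illustration. It evaluates the probability mass function at the same points and rate used on slide 23, and confirms that the probabilities add up to 1 when summed over the whole support.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability mass: lam**k * exp(-lam) / k! for integer k >= 0."""
    if k < 0:
        return 0.0
    return lam ** k * math.exp(-lam) / math.factorial(k)


# Same evaluation points and rate (lambda = 2) as on slide 23.
ks = [1, 2, 3, 4, 1, 2, 3]
p = [poisson_pmf(k, 2) for k in ks]

# Over the full support k = 0, 1, 2, ... the PMF sums to 1.
# Truncating at k = 50 (an assumption) leaves a negligible tail for lambda = 2.
total = sum(poisson_pmf(k, 2) for k in range(51))

print(p)
print(total)
```

Because the list on slide 23 repeats a handful of points instead of covering the support, `p.sum()` there is not expected to equal 1; only the sum over all `k >= 0` is.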
