Analytics Boot Camp - Slides


Slides covered during the Analytics Boot Camp conducted with the help of IBM and Venturesity. Special credits to Kumar Rishabh (Google) and Srinivas Nv Gannavarapu (IBM).



  1. ANALYTICS BOOT CAMP COURSE
  2. Analytics Boot Camp Course: Aditya Joshi, Kumar Rishabh, Srinivas Nv Gannavarapu
  3. "The science of examining raw data with the purpose of drawing conclusions about that information." - TechTarget
  4. "Data analytics refers to qualitative and quantitative techniques and processes used to enhance productivity and business gain. Data is extracted and categorized to identify and analyze behavioral data and patterns, and techniques vary according to organizational requirements." - Techopedia
  5. Brainstorm: Where else is data analytics used?
  6. Workflow • Planning, organizing and requirement gathering • Gathering data • Data cleaning • Analyzing data, predictive modelling and result generation • Result presentation
  7. Learning Process: Obtain Data → Extract Features → Training → Model
  8. Learning Process: New Data → Features → Model → Result
  9. Evaluation: How is data obtained? Train-test data; evaluation methodologies; precision, recall, F-measure; other methods of evaluation, e.g. the confusion matrix.
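
To make these metrics concrete, here is a minimal R sketch (R being the course's main tool; the label vectors are made-up toy data) that tabulates a confusion matrix and derives precision, recall, and F-measure from it:

    # Toy ground truth and predictions (made-up data, "pos"/"neg" classes)
    actual    <- factor(c("pos", "pos", "neg", "pos", "neg", "neg", "pos", "neg"))
    predicted <- factor(c("pos", "neg", "neg", "pos", "pos", "neg", "pos", "neg"))

    cm <- table(predicted, actual)   # confusion matrix: rows = predicted
    print(cm)

    tp <- cm["pos", "pos"]           # true positives
    fp <- cm["pos", "neg"]           # false positives
    fn <- cm["neg", "pos"]           # false negatives

    precision <- tp / (tp + fp)
    recall    <- tp / (tp + fn)
    f_measure <- 2 * precision * recall / (precision + recall)
    cat(sprintf("P = %.2f, R = %.2f, F = %.2f\n", precision, recall, f_measure))
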
  10. Data Attribute Types: Nominal, Categorical, Ordinal, Continuous
  11. Tools We Use
  12. R Programming Language. Pros: best suited for data-oriented problems; strong package ecosystem; graphics and charting; simple learning curve. Cons: memory management, speed, and efficiency; not a complete programming language; not for advanced programmers.
  13. Matlab. Pros: contains a lot of advanced toolboxes; good documentation; good customer support. Cons: expensive; not as much open-source code available, because Matlab requires a license; cannot integrate your code into a web service.
  14. Python. Pros: great data analysis libraries (pandas, statsmodels); code can be easily integrated into a web service; simple and easy to begin. Cons: a lot of cutting-edge, advanced academic research is still being done in R/Matlab; advanced features of the language may present a steep learning curve for newcomers.
  15. SAS. Pros: easy to learn; dedicated customer service along with the community; simple learning curve. Cons: expensive enterprise tool; limited functionality.
  16. Julia. Pros: high performance and efficient; good for writing computationally intensive programs that use multiple CPUs; decent visualization capabilities. Cons: relatively new, with a growing community; not syntactically optimized for statistical operations on data arrays.
  17. Brainstorm: What's the best language?
  18. The Big Blue
  19. Cloud Computing, Bluemix and Analytics: SaaS – PaaS – IaaS; IBM Bluemix
  20. Platform as a Service: zero infrastructure, lower risk; lower cost and improved profitability; easy and quick development, monetize quickly; reusable code and business logic; integration with other web services. PaaS offerings: Google App Engine, Heroku, IBM Bluemix, OpenShift, Cloud Foundry, ...
  21. https://www.zoho.com/creator/paas.html
  22. Bluemix Offerings: Storage, Analytics, Watson, Mobile, IoT, Containers, and much more
  23. IBM Bluemix Data & Analytics. Data storage: Cloudant NoSQL DB, Redis, IBM DashDB. Graph processing: IBM Graph. Number crunching: IBM Analytics for Apache Spark.
  24. Why Bluemix? Cloud service; ready-to-use platform; because it's IBM; open-source tools.
  25. DashDB. Data warehousing and analytics: relational data; special data types, e.g. geospatial data. Data analysis: SQL access; advanced built-in analytics; R Studio. Performance: IBM's BLU Acceleration processes terabytes of data extremely quickly.
  26. DashDB: Database vs. Data Warehouse
      • Database: a collection of data organized for storage, accessibility, and retrieval. Data warehouse: integrated copies of transaction data from multiple sources, optimized for analytical use.
      • Database: OLTP, very quick transaction processing for a single application. Data warehouse: OLTP + OLAP, data from multiple applications used to guide analysis and decision-making.
      • Database: optimized for read-write operations on single-point transactions. Data warehouse: optimized for efficiently reading/retrieving large data sets and for aggregating data.
      • Database: analysis is hard because of slow multiple joins; queries become very complex. Data warehouse: analysis queries are easy to form and execute faster.
  27. Let's Do It
  28. Getting Started: Setup and Basics
  29. Basics of R Programming
  30. Learning further with swirl
  31. Need a Break?
  32. Overview of Machine Learning
  33. What is Machine Learning?
  34. "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E." - Tom Mitchell
  38. Supervised Learning - Regression
  39. Linear Regression: modeling the relationship between a scalar dependent variable and one or more explanatory variables (or independent variables). If we have only one independent variable, the model is called simple linear regression; otherwise, multiple linear regression.
  40. [Scatter plot: Salary in the First Job vs. Grade Point Average / Average Marks.]
  41. [The same plot with the axes labeled Independent Variable and Dependent Variable.]
  42. Linear Regression. Goal: find the line such that the distance from the line to each point is minimized. We "fit" the points with a line so that an "objective function" is minimized; the line we obtain minimizes the sum of squared residuals (least squares).
  43. $\sum_i (\text{predicted}_i - \text{actual}_i)^2 = \sum_i (\text{residual}_i)^2$
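
As a concrete illustration of least squares in R, here is a minimal sketch with made-up GPA and first-job salary data mirroring the example above; lm() finds the line that minimizes the sum of squared residuals:

    # Made-up data: GPA vs. salary in the first job (in thousands)
    gpa    <- c(6.5, 7.0, 7.8, 8.2, 8.9, 9.4)
    salary <- c(30, 34, 41, 45, 52, 58)

    fit <- lm(salary ~ gpa)               # least-squares fit of a line
    summary(fit)                          # coefficients, R-squared, residuals
    sum(residuals(fit)^2)                 # the minimized sum of squared residuals
    predict(fit, data.frame(gpa = 8.0))   # predicted salary for a new student
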
  44. Logistic Regression: a regression model where the dependent variable (DV) is categorical. Logistic regression is technically a classification technique; do not be confused by the word "regression".
  45. [Slides 45-48: scatter plots of Assessment Score in Programming Skills vs. Grade Point Average / Average Marks, with points labeled "Got a job" / "Didn't get a job" and a new candidate marked "?".]
  49. Linear Regression – Animation
  50. Logistic Regression. Goal: find the parameters of the logistic function that best fit the points. Unlike linear regression, these parameters are estimated by maximizing the likelihood of the observed labels rather than by minimizing the sum of squared residuals.
  51. The logistic function: $\hat{Y}_i = \frac{e^u}{1 + e^u}$, where $\hat{Y}_i$ is the estimated probability that the $i$th case is in a category and $u$ is the regular linear regression equation: $u = A + B_1 X_1 + B_2 X_2 + \cdots + B_K X_K$
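
A minimal R sketch of the same idea with glm(); the placement data below is made up to follow the hypothetical GPA/assessment-score example in these slides, and family = binomial fits the logistic function to it:

    # Made-up data: 1 = got a job, 0 = didn't get a job
    gpa     <- c(6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5)
    score   <- c(40, 60, 50, 70, 55, 80, 65, 90)   # programming assessment
    got_job <- c(0, 0, 0, 1, 1, 0, 1, 1)

    # Fits u = A + B1*gpa + B2*score by maximum likelihood
    fit <- glm(got_job ~ gpa + score, family = binomial)
    summary(fit)

    # Estimated probability e^u / (1 + e^u) for a new candidate
    predict(fit, data.frame(gpa = 8.2, score = 72), type = "response")
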
  52. Supervised Learning - Classification
  53. Nearest Neighbor Approaches: find the k closest training examples, and poll their class values.
  54. [Scatter plot: the placement data again, with a new point classified by polling its nearest neighbors.]
  55. K Nearest Neighbors (k-NN): a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. One of the simplest machine learning algorithms.
  56. K Nearest Neighbors (k-NN). Pros: requires no training; easy to understand; easy to use in active learning processes. Cons: doesn't scale well to large training sets (requires special implementations such as k-d trees); 'k' is hard to determine; requires the full dataset in memory.
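
A minimal R sketch of k-NN with knn() from the class package (bundled with standard R installations); the tiny placement dataset is made up, and in practice the features should be put on comparable scales first:

    library(class)   # provides knn()

    # Made-up training data: two features, two classes
    train  <- data.frame(gpa   = c(6.0, 6.5, 7.0, 8.0, 8.5, 9.0),
                         score = c(40, 55, 50, 60, 80, 85))
    labels <- factor(c("no_job", "no_job", "no_job", "job", "job", "job"))

    # Classify a new candidate by polling its k = 3 nearest neighbors
    test <- data.frame(gpa = 7.8, score = 70)
    knn(train, test, cl = labels, k = 3)
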
  57. Decision Trees: find a model for the class attribute as a function of the values of the other attributes.
  58. [Slides 58-59: scatter plots of the placement data, partitioned by the axis-parallel splits a decision tree produces.]
  60. Decision Trees. Goal: build a tree; at each node, split the data on the attribute that provides the best split. Let Dt be the set of training records that reach node t: if Dt contains records that all belong to the same class yt, then t is a leaf node labeled yt; if Dt is an empty set, then t is a leaf node labeled with the default class yd; if Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset.
  61. Decision Trees • Determine how to split the records: How to specify the attribute test condition? How to determine the best split? • Determine when to stop splitting.
  62. Decision Trees. When a node p is split into k partitions (children), the quality of the split is computed as $GINI_{split} = \sum_{i=1}^{k} \frac{n_i}{n}\, GINI(i)$, where $n_i$ is the number of records at child $i$ and $n$ is the number of records at node $p$. An alternative method uses information gain and entropy.
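
The formula above is easy to compute directly; here is a small hand-rolled R sketch (hypothetical helper functions, not from a package) for the Gini index of a node and the weighted quality of a split:

    # Gini index of one node: 1 - sum of squared class proportions
    gini <- function(labels) {
      p <- table(labels) / length(labels)
      1 - sum(p^2)
    }

    # GINI_split: children's Gini indices weighted by their record counts
    gini_split <- function(children) {   # children: list of label vectors
      n <- sum(lengths(children))
      sum(sapply(children, function(ch) (length(ch) / n) * gini(ch)))
    }

    # Toy split of 6 records into two children
    gini_split(list(c("job", "job", "no_job"),
                    c("no_job", "no_job", "job")))
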
  63. Decision Trees – Travel Time to Office. [Decision tree: the root splits on "Leave At" (8 AM / 9 AM / 10 AM); the 9 AM branch splits on "Accident?" and the 10 AM branch on "Stall?"; the leaves are Short, Medium, and Long.]
  64. The same tree written as rules:
      if hour == 8am: commute time = short
      else if hour == 9am: if accident == yes: commute time = long; else: commute time = medium
      else if hour == 10am: if stall == yes: commute time = long; else: commute time = medium
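
The same toy problem can be handed to rpart, R's standard decision-tree package; the records below are made up to match the rules above, and minsplit/cp are relaxed so a tree will grow on such a tiny dataset:

    library(rpart)

    commute <- data.frame(
      hour     = c("8am", "8am", "9am", "9am", "9am", "10am", "10am", "10am"),
      accident = c("no",  "no",  "yes", "no",  "yes", "no",   "no",   "no"),
      stall    = c("no",  "no",  "no",  "no",  "no",  "yes",  "no",   "yes"),
      time     = c("short", "short", "long", "medium", "long", "long", "medium", "long"),
      stringsAsFactors = TRUE
    )

    fit <- rpart(time ~ hour + accident + stall, data = commute,
                 method = "class", control = rpart.control(minsplit = 2, cp = 0))
    print(fit)   # the learned splits mirror the if/else rules above
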
  65. Decision Trees. Pros: inexpensive to construct; extremely fast at classifying unknown records; easy to interpret for small trees. Cons: the time to build a tree may be higher than for other types of classifiers; error propagates as the number of classes increases.
  66. Random Forests: an ensemble classifier containing many decision trees; it outputs the class that is the mode of the classes output by the individual trees.
  67. Random Forests. Pros: runs efficiently on large databases; can handle thousands of input variables without variable deletion; has methods for balancing error in class-unbalanced data sets. Cons: has been observed to overfit on noisy datasets; for data including categorical variables with different numbers of levels, random forests are biased in favor of the attributes with more levels.
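
A minimal R sketch with the randomForest package (install.packages("randomForest") if needed) on the built-in iris data, with a held-out test set:

    library(randomForest)

    set.seed(42)
    idx   <- sample(nrow(iris), 100)   # random train/test split
    train <- iris[idx, ]
    test  <- iris[-idx, ]

    # 500 trees; each tree votes and the modal class wins
    fit  <- randomForest(Species ~ ., data = train, ntree = 500)
    pred <- predict(fit, test)
    table(pred, test$Species)          # confusion matrix on unseen data
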
  68. Support Vector Machines (SVM)
  69. [Slides 69-72: scatter plots of the placement data, building up from candidate separating lines to the maximum-margin line.]
  73. Support Vector Machines. Plus-plane = $\{\mathbf{x} : \mathbf{w} \cdot \mathbf{x} + b = +1\}$; minus-plane = $\{\mathbf{x} : \mathbf{w} \cdot \mathbf{x} + b = -1\}$. Classify as +1 if $\mathbf{w} \cdot \mathbf{x} + b \ge 1$, and as -1 if $\mathbf{w} \cdot \mathbf{x} + b \le -1$.
  74. What we know: $\mathbf{w} \cdot \mathbf{x}^+ + b = +1$; $\mathbf{w} \cdot \mathbf{x}^- + b = -1$; $\mathbf{x}^+ = \mathbf{x}^- + \lambda \mathbf{w}$; $|\mathbf{x}^+ - \mathbf{x}^-| = M$, the margin width. Subtracting the two plane equations gives $\mathbf{w} \cdot (\lambda \mathbf{w}) = 2$, so $\lambda = \frac{2}{\mathbf{w} \cdot \mathbf{w}}$ and $M = |\mathbf{x}^+ - \mathbf{x}^-| = \lambda |\mathbf{w}| = \frac{2}{\sqrt{\mathbf{w} \cdot \mathbf{w}}}$. (Slide by Andrew W. Moore, CMU)
  75. Maximize $\sum_{k=1}^{R} \alpha_k - \frac{1}{2} \sum_{k=1}^{R} \sum_{l=1}^{R} \alpha_k \alpha_l Q_{kl}$, where $Q_{kl} = y_k y_l (\mathbf{x}_k \cdot \mathbf{x}_l)$, subject to the constraints $0 \le \alpha_k \le C$ for all $k$ and $\sum_{k=1}^{R} \alpha_k y_k = 0$. Then define $\mathbf{w} = \sum_{k=1}^{R} \alpha_k y_k \mathbf{x}_k$ and $b = y_K (1 - \varepsilon_K) - \mathbf{x}_K \cdot \mathbf{w}$, where $K = \arg\max_k \alpha_k$. Then classify with $f(\mathbf{x}, \mathbf{w}, b) = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x} - b)$. (Slide by Andrew W. Moore, CMU)
  76. Support Vector Machines - Kernels. Let's consider data points in only one dimension for simplicity.
  77. [Slides 77-78: the 1-D example continued; the points are not linearly separable in one dimension.]
  79. Support Vector Machines - Kernels. The linear classifier relies on the dot product between vectors: $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^T \mathbf{x}_j$. If every data point is mapped into a high-dimensional space via some transformation $\Phi: \mathbf{x} \to \varphi(\mathbf{x})$, the dot product becomes $K(\mathbf{x}_i, \mathbf{x}_j) = \varphi(\mathbf{x}_i)^T \varphi(\mathbf{x}_j)$. A kernel function is a function that corresponds to an inner product in some expanded feature space. Example: for 2-dimensional vectors $\mathbf{x} = [x_1\ x_2]$, let $K(\mathbf{x}_i, \mathbf{x}_j) = (1 + \mathbf{x}_i^T \mathbf{x}_j)^2$. We need to show that $K(\mathbf{x}_i, \mathbf{x}_j) = \varphi(\mathbf{x}_i)^T \varphi(\mathbf{x}_j)$: $(1 + \mathbf{x}_i^T \mathbf{x}_j)^2 = 1 + x_{i1}^2 x_{j1}^2 + 2 x_{i1} x_{j1} x_{i2} x_{j2} + x_{i2}^2 x_{j2}^2 + 2 x_{i1} x_{j1} + 2 x_{i2} x_{j2} = [1\ x_{i1}^2\ \sqrt{2} x_{i1} x_{i2}\ x_{i2}^2\ \sqrt{2} x_{i1}\ \sqrt{2} x_{i2}]^T [1\ x_{j1}^2\ \sqrt{2} x_{j1} x_{j2}\ x_{j2}^2\ \sqrt{2} x_{j1}\ \sqrt{2} x_{j2}] = \varphi(\mathbf{x}_i)^T \varphi(\mathbf{x}_j)$, where $\varphi(\mathbf{x}) = [1\ x_1^2\ \sqrt{2} x_1 x_2\ x_2^2\ \sqrt{2} x_1\ \sqrt{2} x_2]$.
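
The algebra above can be sanity-checked numerically; a short R sketch comparing the kernel value with the explicit inner product under the feature map φ for two arbitrary 2-D points:

    # Explicit feature map for the degree-2 polynomial kernel (1 + x.y)^2
    phi <- function(x) c(1, x[1]^2, sqrt(2) * x[1] * x[2], x[2]^2,
                         sqrt(2) * x[1], sqrt(2) * x[2])

    xi <- c(1.5, -0.7)          # two arbitrary 2-D points
    xj <- c(0.3,  2.0)

    (1 + sum(xi * xj))^2        # kernel computed in the input space
    sum(phi(xi) * phi(xj))      # inner product in the expanded feature space
    # Both lines print the same value: K(xi, xj) = phi(xi)' phi(xj)
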
  80. [Figure: the data shown in 1D and after mapping into 3D.]
  81. Support Vector Machines - Animation
  82. Support Vector Machines. Pros: less prone to overfitting; needs less memory to store the predictive model; reaches the global optimum, since training is a quadratic programming problem; works well with smaller datasets. Cons: the hyperparameter search is important and complex; deep neural networks now outperform previously state-of-the-art SVM-based solutions on many tasks.
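
A minimal R sketch with svm() from the e1071 package (install.packages("e1071") if needed), including the kind of hyperparameter grid search the cons above warn about:

    library(e1071)

    # RBF-kernel SVM; cost and gamma are the key hyperparameters
    fit <- svm(Species ~ ., data = iris, kernel = "radial",
               cost = 1, gamma = 0.25)
    table(predict(fit, iris), iris$Species)   # confusion matrix

    # tune() runs a simple cross-validated grid search
    tuned <- tune(svm, Species ~ ., data = iris,
                  ranges = list(cost = c(0.1, 1, 10),
                                gamma = c(0.1, 0.25, 0.5)))
    summary(tuned)
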
  83. Naïve Bayes: apply Bayes' theorem with the "naive" assumption of independence between every pair of features.
  84. Naïve Bayes. Before the evidence is obtained: the prior probability $P(a)$ that the proposition is true, e.g. $P(\text{cavity}) = 0.1$. After the evidence is obtained: the posterior probability $P(a \mid b)$, the probability of $a$ given that all we know is $b$, e.g. $P(\text{cavity} \mid \text{toothache}) = 0.8$.
  85. Naïve Bayes. Bayes' theorem (Thomas Bayes, 1763): $P(b \mid a) = \frac{P(a \mid b)\, P(b)}{P(a)}$. Based on Bayes' theorem, we can compute the Maximum A Posteriori (MAP) hypothesis for the data: we are interested in the best hypothesis from some space $H$ given observed training data $D$.
  86. $h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} \frac{P(D \mid h)\, P(h)}{P(D)} = \arg\max_{h \in H} P(D \mid h)\, P(h)$; we can drop $P(D)$ because it is independent of the hypothesis.
  87. Naïve Bayes. Training set: instances of different classes $c_j$ described as conjunctions of attribute values. Classify a new instance $d$, based on its attribute values, into one of the classes $c_j \in C$. Key idea: assign the most probable class $c_{MAP}$ using Bayes' theorem: $c_{MAP} = \arg\max_{c_j \in C} P(c_j \mid x_1, x_2, \dots, x_n) = \arg\max_{c_j \in C} \frac{P(x_1, x_2, \dots, x_n \mid c_j)\, P(c_j)}{P(x_1, x_2, \dots, x_n)} = \arg\max_{c_j \in C} P(x_1, x_2, \dots, x_n \mid c_j)\, P(c_j)$
  88. Naïve Bayes. Pros: performs at a state-of-the-art level for some use cases; performs well on small datasets; converges quickly. Cons: not suitable for very large datasets; performs poorly if features are correlated.
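
A minimal R sketch with naiveBayes() from the e1071 package on iris; predict(..., type = "raw") returns the posterior class probabilities discussed above:

    library(e1071)

    # Estimates the class priors P(c) and per-feature likelihoods P(x_i | c)
    fit <- naiveBayes(Species ~ ., data = iris)

    # Posterior probabilities for one new flower (made-up measurements)
    new <- data.frame(Sepal.Length = 6.1, Sepal.Width = 2.9,
                      Petal.Length = 4.5, Petal.Width = 1.4)
    predict(fit, new, type = "raw")
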
  89. Brainstorm: Which classifier should I use?
  90. Questions • What is the relative importance of each predictor? • How does each variable affect the outcome? • Does a predictor make the solution better, worse, or have no effect? • Can parameters be accurately predicted? • How good is the model at classifying cases for which the outcome is known? • What is the prediction equation in the presence of covariates? • Can prediction models be tested for relative fit to the data ("goodness of fit" statistics)? • What is the strength of association between the outcome variable and a set of predictors? Often in model comparison you want non-significant differences, so strength of association is reported even for non-significant effects.
  91. Unsupervised Learning
  92. Clustering: draw inferences from datasets consisting of input data without labeled responses. Clustering is used in exploratory data analysis to find hidden patterns or groupings in data.
  93. Where is Clustering Used? Marketing: segmenting customer behaviors. Banking: fraud detection. Gene analysis: identifying genes responsible for a disease. Image processing: identifying objects in an image (e.g. face recognition). Insurance: identifying policy holders with a high average claim cost.
  94. K-Means Algorithm. 1. Fix a number k = the number of required clusters (here k = 3).
  95. 2. Select k random points in the given space as initial centroids.
  96. 3. For each point, find its distance from the k centroids, and assign it to the closest one.
  97. 4. Within each newly formed cluster, find the new centroid.
  98. 5. Repeat the process until the centroids are stable (converge). [Slides 99-100 repeat this step as the animation converges.]
  101. 6. Obtain the final clusters.
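
The loop above is what R's built-in kmeans() runs; a minimal sketch on the iris measurements with k = 3, using several random restarts since the initial centroids are chosen at random:

    set.seed(7)                       # centroid initialization is random
    km <- kmeans(iris[, 1:4], centers = 3, nstart = 25)

    km$centers                        # the final, stable centroids
    table(km$cluster, iris$Species)   # compare clusters with the true species
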
  102. Hands On
  103. What's Next?
