Machine Learning Essentials Demystified part1 | Big Data Demystified

Machine Learning Essentials Abstract:
Machine Learning (ML) is one of the hottest topics in the IT world today. But what is it really all about?
In this session we will talk about what ML actually is and in which cases it is useful.
We will talk about a few common algorithms for creating ML models and demonstrate their use with Python. We will also take a peek at Deep Learning (DL) and Artificial Neural Networks, explain how they work (without too much math), and demonstrate a DL model with Python.

The target audience is developers, data engineers, and DBAs who have no prior experience with ML and want to know how it actually works.

  1. Machine Learning Essentials Part 1: Basic algorithms. Lior King, Lior.King@gmail.com
  2. Agenda • Introduction to Machine Learning (ML) • What is Machine Learning? • The problems we can solve using ML • The learning process • Basic ML algorithms using Python and the Scikit-learn library • Linear Regression • Naïve Bayes • K-Means • Artificial Neural Networks (ANN) and Deep Learning (DL) using the TensorFlow library • Single-layered ANN (using the MNIST demo) • Deep Learning (DL) with multi-layered neural networks • DL example: Convolutional Neural Network (CNN)
  3. An astronaut lands on an alien planet
  4. The astronaut’s dilemma: Male or female?
  5. An alien lands on Earth
  6. The alien’s dilemma: Male or female?
  7. The rule-based approach: a gender recognition algorithm for the alien. If the height is > 180 cm and/or the weight is > 75 kg, or the person has a beard, or has short hair, or has a deep voice, or is bald, or… There might be exceptions…
  8. The learning approach • We show the alien 500 humans and tell him which are male and which are female. • The alien will find the characteristics that differentiate males from females by identifying repeated patterns. • The alien needs to be exposed to a lot of humans to identify repeated patterns. That is how he gains EXPERIENCE.
  9. How can a computer learn? Experience = Data
  10. What is the machine learning approach? • An alternative to the rule-based approach • Based on a lot of data • Implements a pre-determined model that uses a standard algorithm to find DATA CORRELATIONS
  11. Some ML use cases • Self-driving cars and autopilots • Cortana, Siri, Google Assistant (NLP – Natural Language Processing) • Recommendations: Netflix and Amazon know what you like • Data security • Healthcare: computer-assisted diagnosis • Spam detection • Fraud detection • Algo-trading • IoT
  12. AI vs. ML vs. DL
  13. When to use machine learning? • When it is difficult for humans to express rules • Too many variables • Relationships that are difficult to understand • When a large amount of historical data is available • When the relationships and patterns among data items are dynamic and keep changing
  14. ML is not new - why is it so hot these days? Problem 1: ML usually requires a lot of data. Solution: We are in the “big data” era. Problem 2: ML requires a lot of computation. Solutions: • CPUs have become very fast. • GPUs can be harnessed to multiply the speed. • The cloud enables you to build a computing grid quickly and cheaply. Problem 3: ML is complex and difficult. Solution: Free open-source libraries and tools are available.
  15. How to use machine learning? 1. Define the problem you wish to solve: ask the right question. 2. Prepare the data: make sure you use relevant data that is represented with meaningful numbers. 3. Choose the right algorithm. 4. Use the algorithm to train a model with training data. 5. Test the model to see if it is accurate enough. Flow: Define the problem → Represent data with numbers → Choose the algorithm → Train the model → Test the model
  16. Ongoing learning (diagram): Rules, Historical Data, New Data, Retraining, Deploying
  17. Use cases
  18. Machine learning - problem categories: Supervised, Unsupervised, Reinforcement
  19. Machine Learning Categories • Supervised machine learning: • The program is “trained” on a pre-defined set of “training examples”. • It can then reach a fairly accurate conclusion when given new data. • Unsupervised machine learning: • The program is given a bunch of data and must find patterns and relationships therein. • Reinforcement machine learning: • The program is given just an “environment” and a “reward” function for successful actions, without exact instructions on what to do. • The program finds a set of actions that will grant it the maximum total “reward”.
  20. Machine learning - problem categories: Classification, Regression (prediction), Clustering (grouping)
  21. Classification (discrete number) • A “Yes or No” choice: • Does the patient have cancer? • Is this email spam? • Is this credit card transaction a fraud? • Is the stock market going up or down? • A discrete number of choices: • Determine age group: 0-18, 18-35, 35-60, 60+ • Recognize handwritten characters: a, b, c, d … • Customer sentiment analysis: very positive/slightly positive/neutral/slightly negative/strongly negative. Classification requires training data.
  22. Regression (prediction) • Regression is used for predictions or forecasts: • What will be the value of MSFT stock tomorrow? • How much will we sell in the next quarter? • How many bugs will we need to fix? • How long will it take to commute from A to B? • Outputs a continuous value (a float) • Requires training data
  23. Clustering • Clustering means grouping data items into groups • Customer segmentation • Pattern recognition and image analysis • Bioinformatics • Training data is not required (unsupervised).
  24. The set of rules is known as a MODEL. MODEL = a quantitative representation of relationships between variables. • Can be a mathematical equation, or • A set of if-then-else statements created dynamically. Example: A spam filtering model represents the relationship between the text in an email and whether it is spam or not.
  25. Model = Function. Diagram: data attributes X1, X2, …, Xn (numbers) → Model f(X1, X2, …, Xn) → outcome (a number)
  26. The Goal: To find the best model (function) that produces the desired result on any set of inputs
  27. Supervised Learning
  28. Supervised learning – the training process. Diagram: Prepared data → splitting the data → training data and test data; training data + algorithm → training a model → model; test data → testing the model → good or bad
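The splitting step in the training process above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper name `train_test_split`, the 25% test fraction, the fixed seed, and the experience/salary sample data are all hypothetical and do not come from the slides.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)      # fixed seed for repeatability
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical prepared data: (years of experience, salary) pairs.
data = [(x, 30000 + 2000 * x) for x in range(20)]
train, test = train_test_split(data)
print(len(train), len(test))
```

The model is then trained on `train` only, and `test` is held back to judge whether the resulting model is good or bad.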
  29. Basic ML Algorithms and how to use them with Python
  30. Most common ML algorithms • Prediction: • Linear Regression • Polynomial Regression • Decision Tree Regression • Random Forest Regression • Support Vector Regression (SVR) • Classification: • Naïve Bayes • Logistic Regression • Decision Tree Classification • Random Forest Classification • Support Vector Machines (SVM) • K-Nearest Neighbors classification (K-NN) • Clustering: • K-Means • Hierarchical clustering • Artificial Neural Networks: • Convolutional Neural Network (CNN) • Recurrent Neural Network (RNN)
  31. Some Other Algorithms • Enhanced algorithms: • Variations of basic algorithms • Enhanced to perform better and/or add more functionality • Complex to understand and use properly • Ensemble algorithms: • Special algorithms that combine multiple algorithms under one interface • Used when you need to tune the model to increase performance • Can be complex and difficult to debug and troubleshoot
  32. Regression problems: Regression (prediction)
  33. Common ML algorithms for regression • Linear Regression • Polynomial Regression • Decision Trees • Random Forest
  34. Regression examples • What will the stock returns be? • What will the sales of a product be next week? • If a flight is delayed, how does this affect customer satisfaction? • If I change my investment portfolio, how would it affect my risk? • How much will I get for my house?
  35. Linear regression: Finding the relation between experience and salary, then predicting the salary for any given level of experience. Plot: historical data points, Experience (x-axis) vs. Salary (y-axis)
  36. Minimize the error. The error (or residual) is the offset between the actual value of the dependent variable (salary) and the value the line predicts for it at the same value of the independent variable (experience). The goal of any regression is to minimize the error for the training data and to FIND THE OPTIMAL LINE (or curve, in the case of polynomial regression). Plot: historical data points, Experience (independent) vs. Salary (dependent), with the error marked
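  37. Minimize the error – sum of squared differences. The error = $\sum_{i=1}^{N} (y_i - \hat{y}_i)^2$, where $y_i$ is the actual salary of data point $i$ and $\hat{y}_i$ is the value predicted by the line. Plot: historical data points, Experience vs. Salary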
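As a small worked example of the sum-of-squared-differences error, the following sketch computes it for three data points. The salary figures and predictions are made up for illustration and are not taken from the deck.

```python
def sum_squared_error(y_true, y_pred):
    """Sum of squared differences between actual and predicted values."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))

# Hypothetical salaries (in thousands) and the values a candidate line predicts.
salaries = [30.0, 35.0, 42.0]
predicted = [31.0, 36.0, 40.0]
sse = sum_squared_error(salaries, predicted)   # (-1)^2 + (-1)^2 + 2^2
print(sse)
```

A line that fits the historical points better would produce a smaller value.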
  38. Minimize the error with Stochastic Gradient Descent (SGD). Error = $\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$, where N is the number of historical data points. 1. Initialize the slope and the intercept with some values. 2. Find the current value of the error function. 3. Find the slope of the error function at the current point (its partial derivatives) and move slightly in the downward direction. 4. Repeat until you reach a minimum OR stop after a certain number of iterations. Plot: Error as a function of Slope and Intercept
  39. Minimize the error. The iterative SGD process slowly changes the slope and the intercept until the error is minimal. Plot: historical data points, Experience vs. Salary (dependent)
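The four steps above can be sketched as follows. Note the assumptions: this version uses the full data set in every step (plain batch gradient descent), whereas a stochastic variant would use a random sample per step. The function name `fit_line`, the learning rate, the iteration count, and the data are illustrative choices, not values from the slides.

```python
def fit_line(xs, ys, learning_rate=0.05, iterations=2000):
    """Fit y = slope * x + intercept by gradient descent on the mean squared error."""
    slope, intercept = 0.0, 0.0                # step 1: initial values
    n = len(xs)
    for _ in range(iterations):                # step 4: stop after a fixed number of iterations
        # step 2: current prediction errors
        errors = [(slope * x + intercept) - y for x, y in zip(xs, ys)]
        # step 3: partial derivatives of the error, then a small step downhill
        grad_slope = (2 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_intercept = (2 / n) * sum(errors)
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept

# Hypothetical noise-free data generated from y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [2 * x + 1 for x in xs]
slope, intercept = fit_line(xs, ys)
print(round(slope, 3), round(intercept, 3))
```

If the learning rate is too large the steps overshoot and the error grows instead of shrinking, so it usually has to be tuned for the data at hand.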
  40. Multiple Linear Regression • Simple linear regression: $Y = b_0 + b_1 x_1$ • Multiple linear regression: $Y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$ Important note: You need to exclude variables that will “mess up” the prediction and keep the ones that actually help predict the desired result.
  41. Polynomial Linear Regression. Simple linear regression: $Y = b_0 + b_1 x_1$. Polynomial linear regression: $Y = b_0 + b_1 x_1 + b_2 x_1^2 + \dots + b_n x_1^n$. Quadratic: degree = 2. Cubic: degree = 3.
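One way to read the polynomial formula above: the model is still linear in the coefficients $b_0 \dots b_n$, and only the single input $x_1$ is expanded into its powers. The sketch below illustrates that idea; the helper names and the coefficient values are hypothetical.

```python
def polynomial_features(x, degree):
    """Expand one input x into [x, x^2, ..., x^degree]."""
    return [x ** d for d in range(1, degree + 1)]

def predict(x, intercept, coefficients):
    """Evaluate Y = b0 + b1*x + b2*x^2 + ... for the given coefficients."""
    features = polynomial_features(x, len(coefficients))
    return intercept + sum(b * f for b, f in zip(coefficients, features))

# Hypothetical quadratic model (degree = 2): Y = 1 + 2*x + 3*x^2
y = predict(2.0, 1.0, [2.0, 3.0])
print(y)
```

Because of this, the same fitting procedure used for multiple linear regression can be applied after the features have been expanded.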
  42. Why Python? • Gentle learning curve • Combines the power of a general-purpose language with ease of use • Everything you need for ML: libraries for data loading, visualization, statistics, natural language processing, image processing, and more: • numpy, scipy • scikit-learn • matplotlib • pandas • tensorflow, pytorch, GraphLab • A lot of free IDEs and interactive tools (like Spyder, PyCharm, VS Code and more) • Allows for the creation of complex graphical user interfaces (GUIs) • Easy integration into existing systems and web services
  43. Python becomes the leader for ML. * KDnuggets is a leading news site on Business Analytics, Big Data, Data Mining, Data Science, and Machine Learning
  44. Python’s Scikit-learn library • Makes it easier to perform training and evaluation tasks: • Splitting the data into training and test sets • Preprocessing the data before we train with it • Selecting the important features • Model training • Model tuning for better performance • Provides a common interface for accessing algorithms • Based on often-used mathematical libraries such as NumPy, SciPy, Matplotlib • Supports Pandas dataframes
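A minimal sketch of the splitting, training, and evaluation tasks listed above, assuming scikit-learn is installed. The noise-free experience/salary data is invented for illustration, and this is not the demo shown in the original session.

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical data: salary = 1 + 2 * experience, without noise.
X = [[float(x)] for x in range(20)]        # one feature per row
y = [1.0 + 2.0 * x for x in range(20)]

# Split the data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Train the model on the training set only.
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate on the held-out test set (R^2 score; 1.0 is a perfect fit).
r2 = model.score(X_test, y_test)
print(model.coef_[0], model.intercept_, r2)
```

Other scikit-learn estimators expose the same `fit`, `predict`, and `score` methods, which is the common interface the slide refers to.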
  45. Regression Demo
  46. Classification problems: Classification (Yes/No or a discrete number)
  47. Common ML algorithms for classification • Naïve Bayes • Logistic Regression • Support Vector Machines • Decision Trees • K-Nearest Neighbors
  48. Classification examples • Gender detection: • Using the first name: its length, prefix/suffix, whether it ends with a vowel • Age group detection: • Using users’ selections and preferences • Sentiment analysis – positive or negative (polarity identification) • Using a large bank of tweets and posts, which are unstructured and complicated • Trading stocks/derivatives – up day or down day? • Using the weekday, month, prices on previous days, prices of related stocks
  49. Classification examples • Detecting ads – is the image an ad or not? • Using the image URL, page URL, page text, image caption and so on… • Customer churn – is the customer about to quit? • Using purchases, days since the last purchase, geolocation etc. • Fraud detection – fraud or not fraud? • Using payment type, location, history of failed attempts, frequency of use • Credit risk – will the customer default on a loan? • Using income, employment sector, education, history of defaults
  50. Sentiment Analysis Classification. The goal is to classify an unknown review as positive or negative. Movie review → Classification: “The movie was pretty good” → Positive; “It was boring. I almost fell asleep” → Negative; “We had a great evening” → Positive; “The leading actor really sucked” → Negative; … “It is the worst film ever” → Negative
  51. Naïve Bayes: $P(c \mid x) = \frac{P(x_1 \mid c) \cdot P(x_2 \mid c) \cdots P(x_n \mid c) \cdot P(c)}{P(x)}$, where $P(c \mid x)$ is the posterior probability, $P(x_i \mid c)$ the likelihood, $P(c)$ the prior probability, and $P(x)$ the marginal likelihood. “A great movie” – is it a positive review? $P(\text{Positive} \mid \text{“Great Movie”}) = \frac{P(\text{“Great”} \mid \text{Positive}) \cdot P(\text{“Movie”} \mid \text{Positive}) \cdot P(\text{Positive})}{P(\text{“Great”}) \cdot P(\text{“Movie”})}$. Prior probability: the probability of a positive review. Likelihood: the probability of finding the word X in a positive review. Marginal likelihood: the probability of the word in the entire set (positive and negative). Posterior probability: the probability that a review containing the word X is positive.
  52. Naïve Bayes: $P(\text{Positive} \mid \text{“A great movie”}) = \frac{P(\text{“Great”} \mid \text{Positive}) \cdot P(\text{“Movie”} \mid \text{Positive}) \cdot P(\text{Positive})}{P(X)}$ and $P(\text{Negative} \mid \text{“A great movie”}) = \frac{P(\text{“Great”} \mid \text{Negative}) \cdot P(\text{“Movie”} \mid \text{Negative}) \cdot P(\text{Negative})}{P(X)}$. Which is larger: $P(\text{Positive} \mid \text{“A great movie”})$ or $P(\text{Negative} \mid \text{“A great movie”})$?
  53. Naïve Bayes algorithm 1. Extract every word (get rid of words like the/is/a etc.). 2. Calculate the probability of each word in positive comments: PPos(“word”) = Sum(freq. of “word” in positive comments) / Sum(freq. of “word” in the entire set). 3. For every sentence, calculate PPos and PNeg: PPos(sentence) = PPos(word1) * PPos(word2) * … * PPos(All) and PNeg(sentence) = (1 - PPos(word1)) * (1 - PPos(word2)) * … * (1 - PPos(All)). 4. Compare PPos(sentence) and PNeg(sentence). Movie review → Classification: “The movie was pretty good” → Positive; “It was boring. I almost fell asleep” → Negative; “We had a great evening” → Positive; “The leading actor really sucked” → Negative; … “It is the worst film ever” → Negative. Word → PPos(word): Great → 95%; Boring → 10%; Movie → 50%; Worst → 10%; … For the entire set: 55% positive → PPos(All). PPos(“The movie is great”) = 0.5 * 0.95 * 0.55 = 0.261. PNeg(“The movie is great”) = (1 - 0.5) * (1 - 0.95) * (1 - 0.55) = 0.011. 0.261 > 0.011 → Positive
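The worked example on this slide can be reproduced with a few lines of Python. The sketch follows the slide's simplified scoring scheme as written (including its use of 1 - PPos for the negative score). Skipping any word that is missing from the probability table is an assumption made here so that “the” and “is” drop out; the function and variable names are hypothetical.

```python
from math import prod

# Word probabilities and overall positive share, as given on the slide.
p_pos_word = {"great": 0.95, "boring": 0.10, "movie": 0.50, "worst": 0.10}
p_pos_all = 0.55

def score_sentence(sentence):
    """Return (PPos, PNeg) for a sentence using the slide's simplified scheme."""
    # Assumption: words without a table entry (e.g. "the", "is") are ignored.
    words = [w for w in sentence.lower().split() if w in p_pos_word]
    p_pos = prod(p_pos_word[w] for w in words) * p_pos_all
    p_neg = prod(1 - p_pos_word[w] for w in words) * (1 - p_pos_all)
    return p_pos, p_neg

p_pos, p_neg = score_sentence("The movie is great")
label = "Positive" if p_pos > p_neg else "Negative"
print(round(p_pos, 3), round(p_neg, 3), label)
```

Only the comparison between the two scores matters for the classification, which is why the shared denominator P(X) can be left out.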
  54. Naïve Bayes – Continuous values (Gaussian). Features: Age, Salary. Plot (Salary vs. Age): blue circle = did not purchase (40 customers), red cross = purchased (30 customers); around X: 15 did not purchase, 10 purchased. The chance that X will purchase = the chance that a purchasing customer is around X * the chance of purchasing in general / the chance for a customer to be around X = (# of purchases around X / total purchases) * (# of purchases / total customers) / (total customers around X / total customers). $P(\text{purchase} \mid x) = \frac{P(x \mid \text{purchase}) \cdot P(\text{purchase})}{P(x)} = \frac{\frac{10}{30} \cdot \frac{30}{70}}{\frac{25}{70}} = 0.4 = 40\%$, assuming a normal distribution around X.
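The arithmetic on this slide can be checked with exact fractions. The counts come from the slide (70 customers, 30 purchases, 25 customers around X of whom 10 purchased); the variable names are illustrative.

```python
from fractions import Fraction

total_customers = 70
purchased = 30
near_x = 25                  # 15 did not purchase + 10 purchased
purchased_near_x = 10

likelihood = Fraction(purchased_near_x, purchased)   # P(x | purchase)
prior = Fraction(purchased, total_customers)         # P(purchase)
marginal = Fraction(near_x, total_customers)         # P(x)

posterior = likelihood * prior / marginal            # P(purchase | x)
print(posterior, float(posterior))
```

The result equals 10/25, the share of purchasers among the customers around X, which is what Bayes' rule should give for simple counts like these.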
  55. Naïve Bayes Algorithm • Each attribute (in our case, each word) is used independently (hence the term “naïve”). • Phrases such as “far out” are not taken into consideration. • Simple to understand • Fast training • Stable: insensitive to small changes in the training data • Can be very robust for solving many classification problems, especially when: • There is a small amount of training data • You don’t have a lot of knowledge about the problem itself
  56. Naïve Bayes Demo (Gaussian)
  57. Logistic Regression
  58. K-NN (K Nearest Neighbors)
  59. Clustering problems
  60. Common ML algorithms for clustering • K-Means • Fuzzy clustering • Hierarchical clustering • Density-based clustering • Distribution-based clustering
  61. Clustering use case examples • A cellular company needs to place antennas in a region so that its users receive optimal signal strength. • Locating police stations so that officers can quickly reach areas with a high crime rate. • Identifying important product features from customer feedback, emails etc. • Performing efficient data compression
  62. Reviews theme clustering • We need to represent each review with numeric attributes. • In this example we’ll use a technique called “Term Frequency Representation” (TFR). Sample: “With tears in my eyes”. All words: (movie, good, bad, with, boring, tears, yesterday, my, eyes) → (0, 0, 0, 1, 0, 1, 0, 1, 1). We represent each review using the frequencies of its words. Some words characterize a document more than others. These words usually occur more rarely and differentiate the review from the others.
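The term-frequency vector shown on this slide can be reproduced with a short sketch. The vocabulary and the sample review are taken from the slide; the function name and the whitespace-based tokenization are simplifying assumptions.

```python
# Vocabulary ("all words") in the order given on the slide.
vocabulary = ["movie", "good", "bad", "with", "boring",
              "tears", "yesterday", "my", "eyes"]

def term_frequencies(review, vocabulary):
    """Count how often each vocabulary word occurs in the review."""
    words = review.lower().split()
    return tuple(words.count(term) for term in vocabulary)

vector = term_frequencies("With tears in my eyes", vocabulary)
print(vector)
```

Words that are not part of the vocabulary, such as “in” here, simply do not contribute to the vector.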
  63. Reviews theme clustering. Some words characterize a document more than others; these words usually occur more rarely and differentiate the review from the others. In “With tears in my eyes”: With (common), tears (rare), in (common), my (common), eyes (rare). We now weight the word frequencies so that the rare words stand out and the common words get minimal weight. New weight = 1 / frequency of the word. This representation is called Term Frequency – Inverse Document Frequency (TF-IDF).
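The slide describes the weighting informally as 1 / frequency. A commonly used form of TF-IDF multiplies the term frequency by log(N / df), where N is the number of documents and df is the number of documents containing the word. The sketch below uses that common form, not the slide's simplified one, and the three-review corpus is invented for illustration.

```python
from math import log

def tf_idf(review, corpus):
    """Weight each word of the review by tf * log(N / df).

    Assumes the review itself is part of the corpus, so df is never zero.
    """
    documents = [doc.lower().split() for doc in corpus]
    words = review.lower().split()
    weights = {}
    for word in set(words):
        tf = words.count(word)                                # term frequency
        df = sum(1 for doc in documents if word in doc)       # document frequency
        weights[word] = tf * log(len(documents) / df)
    return weights

# Hypothetical corpus of three short reviews.
corpus = [
    "with tears in my eyes",
    "with my friends in the cinema",
    "in my opinion the movie is boring",
]
weights = tf_idf(corpus[0], corpus)
print(weights)
```

In this toy corpus “in” and “my” appear in every review and get weight 0, while “tears” and “eyes” appear in only one review and get the highest weight.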
  64. K-Means algorithm • Every review is a tuple of N numbers: (0, 3, 0, 4, 0, …) • So every review is a point in an N-dimensional space. • With the K-means algorithm you define K, the number of clusters you want the points to converge into. 1. Initialize the mean points (also called “centroids”).
  65. K-Means iteration 1: 2. Assign each review (point) to the nearest centroid. 3. Look at each cluster and find a new centroid for the cluster. 4. Repeat steps 2 and 3 until the means stop changing.
  66. K-Means iteration 2: 2. Assign each review (point) to the nearest centroid. 3. Look at each cluster and find a new centroid for the cluster. 4. Repeat steps 2 and 3 until the means stop changing.
  67. K-Means iteration 3: 2. Assign each review (point) to the nearest centroid. 3. Look at each cluster and find a new centroid for the cluster. 4. Repeat steps 2 and 3 until the means stop changing.
  68. K-Means iteration 4: 2. Assign each review (point) to the nearest centroid. 3. Look at each cluster and find a new centroid for the cluster. 4. Repeat steps 2 and 3 until the means stop changing.
  69. K-Means iteration 5: 2. Assign each review (point) to the nearest centroid. 3. Look at each cluster and find a new centroid for the cluster. 4. Repeat steps 2 and 3 until the means stop changing.
  70. K-Means iteration 6: 2. Assign each review (point) to the nearest centroid. 3. Look at each cluster and find a new centroid for the cluster. 4. Repeat steps 2 and 3 until the means stop changing.
  71. K-Means algorithm 1. Initialize the mean points (also called “centroids”). 2. Assign each review (point) to the nearest centroid. 3. Look at each cluster and find a new centroid for the cluster. 4. Repeat steps 2 and 3 until the means stop changing.
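The four K-Means steps can be sketched in plain Python as follows. Several details are assumptions, since the slides do not specify them: the initial centroids are picked at random from the data points, distance is squared Euclidean distance, an empty cluster keeps its previous centroid, and the sample points are invented.

```python
import random

def k_means(points, k, seed=0, max_iterations=100):
    """Cluster points (tuples of numbers) into k groups."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)              # step 1: initialize centroids
    clusters = []
    for _ in range(max_iterations):
        # step 2: assign each point to the nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # step 3: the new centroid of each cluster is the mean of its points
        new_centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        # step 4: repeat until the means stop changing
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 2-dimensional points forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, clusters = k_means(points, k=2)
print(sorted(centroids))
```

For reviews, each point would be the N-dimensional term-frequency or TF-IDF vector rather than a 2-dimensional coordinate; the algorithm itself is unchanged.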
  72. Reviews theme clustering • We need to represent each review with numeric attributes. • In this example we’ll use a technique called “Term Frequency Representation” (TFR). Sample: “With tears in my eyes”. All words: (movie, good, bad, with, boring, tears, yesterday, my, eyes) → (0, 0, 0, 1, 0, 1, 0, 1, 1). We represent each review using the frequencies of its words.
  73. Reviews theme clustering. Some words characterize a document more than others; these words usually occur more rarely and differentiate the review from the others. In “With tears in my eyes”: With (common), tears (rare), in (common), my (common), eyes (rare). We now weight the word frequencies so that the rare words stand out and the common words get minimal weight. New weight = 1 / frequency of the word. This representation is called Term Frequency – Inverse Document Frequency (TF-IDF).
  74. K-Means algorithm • Every review is a tuple of N numbers: (0, 3, 0, 4, 0, …) • So every review is a point in an N-dimensional space.
  75. Clustering vs. Classification • Classification: when you want to classify data into pre-defined categories. • Clustering: grouping data into a set of categories that is NOT known beforehand. • We can mix them both: • Start by clustering the data. • Then train a classification model to recognize each cluster. • Use the classification model to classify new data.
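As a rough illustration of mixing the two, the sketch below classifies a new point by assigning it to the nearest cluster centroid. This nearest-centroid rule is only a stand-in for a trained classification model, chosen here for brevity; the centroid values and the new point are hypothetical.

```python
def nearest_cluster(point, centroids):
    """Return the index of the centroid closest to the point."""
    distances = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    return distances.index(min(distances))

# Hypothetical centroids produced by an earlier clustering step.
centroids = [(0.5, 0.5), (10.0, 10.0)]

# Classify new, previously unseen data by its closest cluster.
label = nearest_cluster((9.0, 8.5), centroids)
print(label)
```

In practice the cluster index of each training point could serve as its label, and any of the classification algorithms from the earlier slides could then be trained on those labels.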
  76. Thank you! Lior.King@gmail.com
