2. What to Expect?
• What is machine learning?
• Why is it important?
• How do we use it?
• Technical Concepts
• Examples
3. What is Machine Learning?
1. Science of getting computers to learn or recognize something without being explicitly
programmed – Andrew Ng
• Branch of Artificial Intelligence, which is itself a branch of Computer Science
• Give lots of data to the computer so that it can figure it out
• One of the first examples is the computer checkers program by Arthur Samuel
* - ref: Andrew Ng Courses, Big data: A revolution
2. Distinguish big data & machine learning: Big data is the data seed for creating
machine learning forests
• Big data collects information based on our digital exhaust (crumbs we leave in
the digital world), demographics, preferences, health, etc.
• Machine learning mines this data and models behaviors, producing interactive
responses based on it
4. Why do we need this?
1. Tons of applications impacting human health, utility and future simplification
Health & Wellness / Utilitarian / Future
• DNA sampling & diagnosis
• Health reminders & prevention through AI tools
• Correlation studies
• Customizable tablets
• Real-time optimized path maps
• Search ranking
• Spam filter on email
• News aggregators
• Shopping recommendation
• Facebook face recognition
• Age recognition
• Voice recognition – Siri, Alexa
• Driverless cars
• Home decoration
5. Key Terms
Training Set
• A set of data used to predict relationships.
• E.g. A diamond's size, cut and clarity help predict the price. The training set
holds the data and the answer for each sample.
Supervised Learning
• Uses the training set to make a prediction.
• E.g. A model predicts diamond prices based on past prices.
Unsupervised Learning
• Provide data without suggesting anything so the computer can identify patterns
or groupings.
• E.g. Customer segmentation, DNA groupings.
Features / Variables / Attributes
• Each distinct measurable data value you select in the training data set.
• E.g. A diamond's size is one of the features for predicting price.
Supervised: Regression
• Using the features provided in the training set, make a prediction. Fit a curve
using the data provided.
• E.g. Price of diamond = X*Cut + Y*Clarity + Z*Size + features…
Supervised: Classification
• A defined set of categories for placing new data (observations).
• E.g. Presence or absence of cancer; types of diabetes.
Unsupervised: Clustering
• Process of assigning observations into subsets.
• E.g. Customer segment creation.
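The regression entry above writes the price as X*Cut + Y*Clarity + Z*Size. A minimal sketch of fitting those weights by least squares, assuming NumPy is available. The diamond rows and the "true" weights are invented for illustration, and `predict_price` is a hypothetical helper name, not something from the slides.

```python
import numpy as np

# Hypothetical training set: one row per diamond -> [cut, clarity, size].
# All numbers are made up for illustration.
features = np.array([
    [3.0, 2.0, 0.5],
    [4.0, 3.0, 1.0],
    [2.0, 1.0, 0.3],
    [5.0, 4.0, 1.5],
    [3.0, 3.0, 0.8],
])

# The "answers" for each sample, generated from known weights (X, Y, Z)
# so that the fitted result can be checked against them.
true_weights = np.array([100.0, 200.0, 3000.0])
prices = features @ true_weights

# Least-squares fit: find the weights that minimize the squared error
# between predicted and observed prices.
weights, *_ = np.linalg.lstsq(features, prices, rcond=None)

def predict_price(cut, clarity, size):
    """Predict a price for a new diamond from the fitted weights."""
    return float(np.array([cut, clarity, size]) @ weights)
```

Because the toy prices were generated without noise, the fit recovers the chosen weights almost exactly; real data would leave a residual error.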
6. Learning Steps
1. Collect / update user data
2. Create / update training set data
3. Create / update algorithm for training data (update algorithm, validate algorithm)
4. Create predictive model
5. A/B test & launch on production (new real-time observations)
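One common way to support the "validate algorithm" step is to hold back part of the collected data and keep it out of training. The slides do not say how validation is done, so the holdout split below is an assumption, and `split_training_validation` is a hypothetical helper name.

```python
import random

def split_training_validation(samples, validation_fraction=0.2, seed=0):
    """Shuffle the samples and hold out a fraction of them for validation."""
    shuffled = list(samples)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    # The first n_validation shuffled samples are held out; the rest train.
    return shuffled[n_validation:], shuffled[:n_validation]

# Hypothetical (input, answer) pairs standing in for collected user data.
samples = [(x, 2 * x) for x in range(10)]
training, validation = split_training_validation(samples)
```

The algorithm is then fitted on `training` only, and its predictions on `validation` indicate whether it needs another update before launch.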
7. Data Wrangling and Feature Extraction
Spam Email Detection
• Title
• Sender domain
• # of recipients
• Email content
• Country of origin
• Non-dictionary words
• Hyperlinks
• Address book
• Length of email
• Structured Data (Best)
– RDBMS, columnar data
– Strict Schema
– SQL
• Semi-Structured Data (Better)
– JSON, XML
– Enforce minimum schema
– JSON, XML Parser
• Unstructured Data
– Text, Image, Raw email
– No Schema
– Batch processing
– Regular expressions
– MapReduce
GARBAGE IN GARBAGE OUT
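Several of the spam-detection features on this slide can be pulled out of a raw (unstructured) email with regular expressions. This is a rough sketch only: the `extract_features` function, its parameters, the feature names and the tiny `DICTIONARY` are all assumptions made for illustration.

```python
import re

# Hypothetical, tiny word list; a real system would load a full dictionary.
DICTIONARY = {"click", "here", "to", "win", "money", "now"}

LINK_PATTERN = re.compile(r"https?://\S+")

def extract_features(title, sender, recipients, body):
    """Turn one raw email into a dictionary of numeric / categorical features."""
    links = LINK_PATTERN.findall(body)
    # Remove links first so URL fragments are not counted as words.
    text_without_links = LINK_PATTERN.sub(" ", body)
    words = re.findall(r"[a-z]+", text_without_links.lower())
    return {
        "title_length": len(title),
        "sender_domain": sender.rsplit("@", 1)[-1].lower(),
        "num_recipients": len(recipients),
        "num_hyperlinks": len(links),
        "num_non_dictionary_words": sum(w not in DICTIONARY for w in words),
        "body_length": len(body),
    }

# Made-up email used only to exercise the function.
example = extract_features(
    title="You won!",
    sender="promo@Example.COM",
    recipients=["a@x.org", "b@x.org", "c@x.org"],
    body="Click here to win money now http://example.com/prize xqzt",
)
```

Each email becomes one row of structured data, which is the form the later training step expects.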
8. Model Training
• Training / Testing: text documents, user activity, images, transaction history → feature extraction (feature vector) + labels → machine learning algorithm → predictive model
• New data: text documents, user activity, images, transaction history → feature extraction (feature vector) → predictive model → expected label
• Model evaluation
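Model evaluation compares the labels a model predicts with the expected labels on data it was not trained on. The slide does not name specific metrics, so accuracy, precision and recall below are an assumption, as are the `evaluate` helper and the toy labels.

```python
def evaluate(expected, predicted, positive="spam"):
    """Compare predicted labels with expected labels and return simple metrics."""
    pairs = list(zip(expected, predicted))
    true_positives = sum(e == positive and p == positive for e, p in pairs)
    false_positives = sum(e != positive and p == positive for e, p in pairs)
    false_negatives = sum(e == positive and p != positive for e, p in pairs)
    correct = sum(e == p for e, p in pairs)
    flagged = true_positives + false_positives
    actual = true_positives + false_negatives
    return {
        "accuracy": correct / len(pairs),
        # Of everything flagged as positive, how much really was positive.
        "precision": true_positives / flagged if flagged else 0.0,
        # Of everything really positive, how much was flagged.
        "recall": true_positives / actual if actual else 0.0,
    }

# Made-up labels for five test emails.
expected = ["spam", "ham", "spam", "ham", "spam"]
predicted = ["spam", "ham", "ham", "ham", "spam"]
metrics = evaluate(expected, predicted)
```

Here one spam email was missed, so recall drops while precision stays perfect; which trade-off matters depends on the application.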
10. Linear Classifiers
• Logistic regression
– P(y = 1 | x) = 1 / (1 + exp(-w·x))
– Find the weights w with minimum loss
– Solve iteratively using gradient descent
• Support vector machine (SVM)
– Maximum margin classifier
• Artificial Neural Networks
– Inspired by how neurons work
– Activation function (Sigmoid, ReLU etc.)
– Deep Learning
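The logistic regression bullets above (find w with minimum loss, solve iteratively using gradient descent) can be sketched in plain Python. The one-feature toy data, the learning rate and the step count are assumptions chosen for illustration, and the bias term is an addition not shown on the slide.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(rows, labels, learning_rate=0.1, steps=2000):
    """Fit weights and a bias by batch gradient descent on the log loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(steps):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            # Prediction error for this sample: sigmoid(w.x + b) - y.
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        # Step against the average gradient to reduce the loss.
        weights = [w - learning_rate * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(rows)
    return weights, bias

def predict(weights, bias, x):
    """Return class 1 if the modelled probability is at least 0.5, else 0."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if sigmoid(z) >= 0.5 else 0

# Made-up data: negative feature values are class 0, positive are class 1.
rows = [[-3.0], [-2.0], [-1.0], [1.0], [2.0], [3.0]]
labels = [0, 0, 0, 1, 1, 1]
weights, bias = train_logistic_regression(rows, labels)
predictions = [predict(weights, bias, x) for x in rows]
```

The learned decision boundary is the point where w·x + b = 0, which is what makes this a linear classifier.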
11. KNN / CART
• K-Nearest Neighbors
– Find K nearest training examples
– Majority vote
– Easy to implement
– Not scalable for real time predictions
• Classification and Regression Trees
– Easy to interpret for small trees
• Random Forests
– Ensemble of decision trees
– Usually performs very well
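The K-Nearest Neighbors bullets above (find the K nearest training examples, then take a majority vote) fit in a few lines. This is a minimal sketch assuming Euclidean distance and Python 3.8+ for `math.dist`; the 2-D points and labels are made up, and `knn_predict` is a hypothetical helper name.

```python
import math
from collections import Counter

def knn_predict(training_points, training_labels, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    # Distance from the query to every training example (a full scan).
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(training_points, training_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up 2-D training data forming two well-separated groups.
training_points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
training_labels = ["a", "a", "a", "b", "b", "b"]

near_first_group = knn_predict(training_points, training_labels, (2, 2))
near_second_group = knn_predict(training_points, training_labels, (6, 5))
```

Every prediction scans and sorts the whole training set, which is why the approach is easy to implement but does not scale well to real-time predictions.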