2. BigML, Inc 2
Poul Petersen
CIO, BigML, Inc.
Feature Engineering
Creating Machine Learning Ready Data
3. Feature Engineering
Machine Learning Secret
“…the largest improvements in accuracy often came from
quick experiments, feature engineering, and model tuning
rather than applying fundamentally different algorithms.”
Facebook FBLearner 2016
Feature Engineering: applying domain knowledge of
the data to create features that make machine
learning algorithms work better or at all.
4. Feature Engineering
Obstacles
• Data Structure
  • Scattered across systems
  • Wrong "shape"
  • Unlabelled data
• Data Value
  • Format: spelling, units
  • Missing values
  • Non-optimal correlation
  • Non-existent correlation
• Data Significance
  • Unwanted: PII, Non-Preferred
  • Expensive to collect
  • Insidious: Leakage, obviously correlated
Corresponding remedies: Data Transformation (data structure), Feature Engineering (data value), Feature Selection (data significance)
5. Feature Engineering
Automatic Date Transformation
DATE-TIME: 2013-09-25 10:02
…  year  month  day  hour  minute  …
…  2013  Sep    25   10    2       …
…  …     …      …    …     …       …
   NUM   CAT    NUM  NUM   NUM
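The date expansion on this slide can be sketched in plain Python. This is only an illustration of the idea, not how BigML implements it; the function name `expand_datetime` and the input format string are assumptions.

```python
from datetime import datetime

# Fixed month abbreviations so the result does not depend on the system locale.
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def expand_datetime(value, fmt="%Y-%m-%d %H:%M"):
    """Split one timestamp string into separate date-part features."""
    ts = datetime.strptime(value, fmt)
    return {
        "year": ts.year,                 # numeric
        "month": MONTHS[ts.month - 1],   # categorical
        "day": ts.day,                   # numeric
        "hour": ts.hour,                 # numeric
        "minute": ts.minute,             # numeric
    }


features = expand_datetime("2013-09-25 10:02")
print(features)
```

Each part becomes its own column, so a model can pick up patterns such as hour-of-day effects that are invisible in the raw timestamp.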
6. Feature Engineering
Automatic Categorical Transformation
…  alchemy_category  …
…  business          …
…  recreation        …
…  health            …
…  …                 …
   CAT

…  business  health  recreation  …
…  1         0       0           …
…  0         0       1           …
…  0         1       0           …
…  …         …       …           …
   NUM       NUM     NUM
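The categorical expansion above is commonly called one-hot encoding. A minimal sketch in plain Python, assuming the categories are ordered alphabetically as on the slide (the helper name `one_hot` is hypothetical):

```python
def one_hot(values):
    """Turn a list of category labels into one 0/1 column per category."""
    categories = sorted(set(values))  # alphabetical column order
    rows = [{c: int(v == c) for c in categories} for v in values]
    return categories, rows


categories, rows = one_hot(["business", "recreation", "health"])
print(categories)
for row in rows:
    print(row)
```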
7. Feature Engineering
Automatic Text Transformation
TEXT: "Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon ’em."
…  great  afraid  born  achieve  …
…  4      1       1     1        …
…  …      …       …     …        …
   NUM    NUM     NUM   NUM
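The counts above imply that "greatness" and "great" are folded into one term. A rough sketch of that bag-of-words idea follows; the `crude_stem` helper is a hypothetical stand-in that strips only the suffix "ness", whereas a real text pipeline would use a proper stemmer.

```python
import re
from collections import Counter


def crude_stem(token):
    """Toy stemmer (assumption): strip a trailing 'ness' and nothing else."""
    if token.endswith("ness") and len(token) > 4:
        return token[:-4]
    return token


def term_counts(text):
    """Lowercase, split into alphabetic tokens, stem, and count each term."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(t) for t in tokens)


quote = ("Be not afraid of greatness: some are born great, some achieve "
         "greatness, and some have greatness thrust upon 'em.")
counts = term_counts(quote)
print(counts["great"], counts["afraid"], counts["born"], counts["achieve"])
```

Each term then becomes a numeric column holding its count for that row of text.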
8. Feature Engineering
{
"url": "cbsnews",
"title": "Breaking News Headlines Business Entertainment World News",
"body": "news covering all the latest breaking national and world news headlines, including politics, sports, entertainment, business and more."
}
Fixing "non-optimal correlations"
title            body
Breaking News…   news covering…
…                …
TEXT             TEXT
Both fields merged into one: TEXT
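One way to read this slide is that the separate title and body fields are joined into a single text field before text analysis. A minimal sketch under that assumption (the function name `combine_text` and the shortened sample record are illustrative only):

```python
def combine_text(record, fields=("title", "body")):
    """Join the named text fields of a record into one string."""
    parts = [record[f].strip() for f in fields if record.get(f)]
    return " ".join(parts)


# Shortened sample record for illustration.
record = {
    "url": "cbsnews",
    "title": "Breaking News Headlines",
    "body": " news covering all the latest breaking national and world news",
}
combined = combine_text(record)
print(combined)
```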
9. Feature Engineering
Discretization
Total Spend   Spend Category
7.342,99      Top 33%
304,12        Bottom 33%
4,56          Bottom 33%
345,87        Middle 33%
8.546,32      Top 33%
NUM           CAT
With the numeric field: “Predict customer will spend $3,521 with error $1,232”
With the categorical field: “Predict customer will be Top 33% in spending”
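The discretization above can be sketched with tercile cut points. This is one possible scheme, assuming the "inclusive" quantile method from Python's `statistics` module; the slide does not say how the cut points were chosen, and the function name `spend_categories` is hypothetical.

```python
from statistics import quantiles


def spend_categories(values):
    """Label each value as Bottom, Middle, or Top 33% using tercile cuts."""
    # Two cut points split the observed range into three groups.
    low_cut, high_cut = quantiles(values, n=3, method="inclusive")
    labels = []
    for v in values:
        if v <= low_cut:
            labels.append("Bottom 33%")
        elif v <= high_cut:
            labels.append("Middle 33%")
        else:
            labels.append("Top 33%")
    return labels


total_spend = [7342.99, 304.12, 4.56, 345.87, 8546.32]
labels = spend_categories(total_spend)
print(labels)
```

Turning the numeric target into a category changes the task from regression to classification, which is the contrast the two quoted predictions illustrate.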
10. Feature Engineering
Combinations of Multiple Features
Kg      M2     BMI
101,4   3,24   31,29
85,2    2,8    30,42
56,2    2,9    19,38
136,1   3,6    37,81
95,9    4,1    23,39
NUM     NUM    NUM
BMI = Kg / M2
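A combined feature like this is a simple per-row calculation. A small sketch, assuming the M2 column already holds height squared in square metres (the key names `kg`, `m2`, and `bmi` are chosen for illustration):

```python
def add_bmi(rows):
    """Return copies of the rows with a derived 'bmi' = kg / m2 column."""
    return [{**row, "bmi": row["kg"] / row["m2"]} for row in rows]


people = [
    {"kg": 56.2, "m2": 2.9},
    {"kg": 136.1, "m2": 3.6},
    {"kg": 95.9, "m2": 4.1},
]
with_bmi = add_bmi(people)
for row in with_bmi:
    print(round(row["bmi"], 2))
```

The ratio carries a signal that neither weight nor height carries alone, which is why the derived column can help a model.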
11. Feature Engineering
Flatline
• BigML’s Domain-Specific Language (DSL) for transforming datasets
• Limited programming language structures
  • let, cond, if, maps, list operators, */+-
• Dataset fields are first-class citizens
  • (field "diabetes pedigree")
• Built-in transformations
  • statistics, strings, timestamps, windows
12. Basic Transformations
Data Labelling
The data may lack the labels needed for classification
Create specific metrics to add the labels
Un-Labelled Data
Name        Month - 3  Month - 2  Month - 1
Joe Schmo   123,23     0          0
Jane Plain  0          0          0
Mary Happy  0          55,22      243,33
Tom Thumb   12,34      8,34       14,56
Labelled Data
Name        Month - 3  Month - 2  Month - 1  Default
Joe Schmo   123,23     0          0          FALSE
Jane Plain  0          0          0          TRUE
Mary Happy  0          55,22      243,33     FALSE
Tom Thumb   12,34      8,34       14,56      FALSE
Flatline expression for the Default label:
(= 0 (+ (abs (f "Month - 3")) (abs (f "Month - 2")) (abs (f "Month - 1"))))
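For readers unfamiliar with Lisp-style syntax, the same labelling rule can be written in Python: a row is labelled TRUE when the absolute values of the three monthly amounts sum to zero. The function name `is_default` and the dictionary layout are assumptions made for this sketch.

```python
MONTHS = ("Month - 3", "Month - 2", "Month - 1")


def is_default(row, months=MONTHS):
    """True when every monthly amount is zero (sum of absolute values is 0)."""
    return sum(abs(row[m]) for m in months) == 0


customers = [
    {"Name": "Joe Schmo", "Month - 3": 123.23, "Month - 2": 0, "Month - 1": 0},
    {"Name": "Jane Plain", "Month - 3": 0, "Month - 2": 0, "Month - 1": 0},
    {"Name": "Mary Happy", "Month - 3": 0, "Month - 2": 55.22, "Month - 1": 243.33},
    {"Name": "Tom Thumb", "Month - 3": 12.34, "Month - 2": 8.34, "Month - 1": 14.56},
]
defaults = [is_default(c) for c in customers]
print(defaults)
```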
21. Feature Engineering
Evaluate & Automate
• Evaluate
  • Did you meet the goal?
  • If not, did you discover something else useful?
  • If not, start over
  • If you did…
• Automate: you don’t want to hand-code that every time, right?
  • Consider tools that are easy to automate
    • scripting interface
    • APIs
  • Maintainability is important
22. Feature Engineering
The Process
[Flowchart. Nodes: Define Goal; Data Transform; Feature Engineer & Selection; Model & Evaluate; Goal Met? (yes: Automate; no: Better Data, Better Features, Tune Algorithm, or Not Possible)]