VSSML16 L5. Basic Data Transformations

BigML, Inc
Basic Transformations
Poul Petersen
CIO, BigML, Inc
Creating Machine Learning Ready Data

Machine Learning-Ready Data
Basic Transformations
Q: How does a physicist milk a cow?
A: Well, first let us consider a spherical cow...
Q: How does a data scientist build a model?
A: Well, first let us consider perfectly formatted data...

The Dream
CSV -> Dataset -> Model -> Profit!

The Reality
CRM
Web Accounts
Transactions
ML Ready?
Is all hope lost?
How do you even start?

Holistic Approach
• Define a clear idea of the goal.
• Understand what ML tasks will achieve the goal.
• Understand the data structure needed to perform those ML tasks.
• Find out what kind of data you have and make it ML-Ready:
  • where is it, how is it stored?
  • what are the features?
  • can you access it programmatically?
• Feature Engineering: transform the data you have into the data you actually need.
• Evaluate: try it on a small scale.
• Accept that you might have to start over...
• But when it works, automate it!

Holistic Approach
Define Goal & ML Task

Understand ML Tasks
Goal                                                  ML Task
Will this customer default on a loan?                 Classification
How many customers will apply for a loan next month?  Regression
Is the consumption of this product unusual?           Anomaly Detection
Is the behavior of the customers similar?             Cluster Analysis
Are these products purchased together?                Association Discovery

Holistic Approach
Required Data Structure

Classification
Categorical: Training, Testing, Predicting

Regression
Numeric: Training, Testing, Predicting

Anomaly Detection

Cluster Analysis

Association Discovery

Holistic Approach
Make Your Data ML-Ready

ML-Ready Data
Machine Learning algorithms consume instances of the question that you want to model.
[Diagram: a table whose rows are Instances and whose columns are Fields (Features).]
Tabular Data:
• Each row is one of the instances.
• Each column is a field that describes a property of the instance that is relevant to the question being modeled.
• Fields can be:
  • already present in your data
  • derived from your data
  • generated using other fields
!! Danger Ahead !!

Cleansing
Homogenize missing values and different types in the same
feature, fix input errors, correct semantic issues, types, etc.
Original data:
Name           Date        Duration (s)  Genre    Plays
Highway star   1984-05-24  -             Rock     139
Blues alive    1990/03/01  281           Blues    239
Lonely planet  2002-11-19  5:32s         Techno   42
Dance, dance   02/23/1983  312           Disco    N/A
The wall       1943-01-20  218           Reagge   83
Offside down   1965-02-19  4 minutes     Techno   895
The alchemist  2001-11-21  418           Bluesss  178
Bring me down  18-10-98    328           Classic  21
The scarecrow  1994-10-12  269           Rock     734

Cleaned data (blank cells are missing values):
Name           Date        Duration (s)  Genre    Plays
Highway star   1984-05-24                Rock     139
Blues alive    1990-03-01  281           Blues    239
Lonely planet  2002-11-19  332           Techno   42
Dance, dance   1983-02-23  312           Disco
The wall       1943-01-20  218           Reagge   83
Offside down   1965-02-19  240           Techno   895
The alchemist  2001-11-21  418           Blues    178
Bring me down  1998-10-18  328           Classic  21
The scarecrow  1994-10-12  269           Rock     734
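
The cleanup above can be sketched in Python. This is only an illustration, not the method used on the slide: the function names, the list of accepted date formats, and the set of "missing" markers are assumptions chosen to fit the sample rows.

```python
import re
from datetime import datetime

# Assumed date layouts, tried in order; chosen to match the sample rows only.
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%m/%d/%Y", "%d-%m-%y")
# Assumed markers for missing values in the sample data.
MISSING = {"", "-", "N/A"}


def clean_date(raw):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_duration(raw):
    """Return the duration in seconds, or None for missing/unparseable input."""
    raw = raw.strip()
    if raw in MISSING:
        return None
    match = re.fullmatch(r"(\d+):(\d{2})s?", raw)      # e.g. "5:32s"
    if match:
        return int(match.group(1)) * 60 + int(match.group(2))
    match = re.fullmatch(r"(\d+)\s*minutes?", raw)     # e.g. "4 minutes"
    if match:
        return int(match.group(1)) * 60
    return int(raw) if raw.isdigit() else None         # e.g. "281"


def clean_plays(raw):
    """Return the play count as an int, or None when it is missing."""
    raw = raw.strip()
    return None if raw in MISSING else int(raw)
```

Ambiguous dates such as 03/01/1990 cannot be resolved by format guessing alone; in practice the order of the formats has to come from knowledge of the data source.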

Denormalizing
Data is usually normalized in relational databases; ML-Ready datasets need the information denormalized into a single file/dataset.
[Diagram: the users, artists, tracks, and albums tables are joined into one table of Instances (millions) by Features.]
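
A minimal sketch of such a join, using Python's built-in sqlite3 module. The schema, column names, and the sample artist and album values are hypothetical; only the table names (artists, albums, tracks) come from the slide.

```python
import sqlite3

# Hypothetical normalized schema; only the table names appear on the slide.
SCHEMA = """
CREATE TABLE artists (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE albums  (id INTEGER PRIMARY KEY, title TEXT, artist_id INTEGER);
CREATE TABLE tracks  (id INTEGER PRIMARY KEY, title TEXT, album_id INTEGER,
                      duration INTEGER);
"""


def build_example_db():
    """Create an in-memory database with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO artists VALUES (1, 'Artist A')")
    conn.execute("INSERT INTO albums VALUES (10, 'Album X', 1)")
    conn.executemany(
        "INSERT INTO tracks VALUES (?, ?, ?, ?)",
        [(100, "Highway star", 10, 190), (101, "The wall", 10, 218)],
    )
    return conn


def denormalize(conn):
    """Join the tables so that each track becomes one flat row (instance)."""
    query = """
        SELECT t.title, t.duration, al.title, ar.name
        FROM tracks t
        JOIN albums al ON al.id = t.album_id
        JOIN artists ar ON ar.id = al.artist_id
        ORDER BY t.id
    """
    return conn.execute(query).fetchall()


rows = denormalize(build_example_db())
```

Each tuple in rows is one instance with all of its features in a single record, ready to be written out as one line of a CSV file.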

Aggregating
When the entity to model is different from the provided data, an aggregation to get the entity might be needed.

Original data (list of playbacks):
Content        Genre    Duration  Play Time            User     Device
Highway star   Rock     190       2015-05-12 16:29:33  User001  TV
Blues alive    Blues    281       2015-05-13 12:31:21  User005  Tablet
Lonely planet  Techno   332       2015-05-13 14:26:04  User003  TV
Dance, dance   Disco    312       2015-05-13 18:12:45  User001  Tablet
The wall       Reagge   218       2015-05-14 09:02:55  User002  Smartphone
Offside down   Techno   240       2015-05-14 11:26:32  User005  Tablet
The alchemist  Blues    418       2015-05-14 21:44:15  User003  TV
Bring me down  Classic  328       2015-05-15 06:59:56  User001  Tablet
The scarecrow  Rock     269       2015-05-15 12:37:05  User003  Smartphone

Aggregated data (list of users):
User     Num.Playbacks  Total Time  Pref.Device
User001  3              830         Tablet
User002  1              218         Smartphone
User003  3              1019        TV
User005  2              521         Tablet

Counting playbacks per user on the command line:
tail -n+2 playlists.csv | cut -d',' -f5 | sort | uniq -c
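
The same aggregation can be sketched in Python. The tuple layout and the function name are assumptions; the values are the (User, Device, Duration) columns of the playback list.

```python
from collections import Counter, defaultdict

# (user, device, duration in seconds) taken from the list of playbacks.
PLAYBACKS = [
    ("User001", "TV", 190), ("User005", "Tablet", 281),
    ("User003", "TV", 332), ("User001", "Tablet", 312),
    ("User002", "Smartphone", 218), ("User005", "Tablet", 240),
    ("User003", "TV", 418), ("User001", "Tablet", 328),
    ("User003", "Smartphone", 269),
]


def aggregate_by_user(playbacks):
    """Collapse one row per playback into one row per user."""
    counts = defaultdict(int)
    total_time = defaultdict(int)
    devices = defaultdict(Counter)
    for user, device, duration in playbacks:
        counts[user] += 1
        total_time[user] += duration
        devices[user][device] += 1
    return {
        user: {
            "num_playbacks": counts[user],
            "total_time": total_time[user],
            # Preferred device = the one used most often by this user.
            "pref_device": devices[user].most_common(1)[0][0],
        }
        for user in sorted(counts)
    }


users = aggregate_by_user(PLAYBACKS)
```

Here the instance changes from "a playback" to "a user", which is the entity the model is meant to describe.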

Pivoting
Different values of a feature are pivoted to new columns in the
result dataset.
Original data:
Content        Genre    Duration  Play Time            User     Device
Highway star   Rock     190       2015-05-12 16:29:33  User001  TV
Blues alive    Blues    281       2015-05-13 12:31:21  User005  Tablet
Lonely planet  Techno   332       2015-05-13 14:26:04  User003  TV
Dance, dance   Disco    312       2015-05-13 18:12:45  User001  Tablet
The wall       Reagge   218       2015-05-14 09:02:55  User002  Smartphone
Offside down   Techno   240       2015-05-14 11:26:32  User005  Tablet
The alchemist  Blues    418       2015-05-14 21:44:15  User003  TV
Bring me down  Classic  328       2015-05-15 06:59:56  User001  Tablet
The scarecrow  Rock     269       2015-05-15 12:37:05  User003  Smartphone

Aggregated data with pivoted columns (NP = number of playbacks, TT = total time):
User     Num.Playbacks  Total Time  Pref.Device  NP_TV  NP_Tablet  NP_Smartphone  TT_TV  TT_Tablet  TT_Smartphone
User001  3              830         Tablet       1      2          0              190    640        0
User002  1              218         Smartphone   0      0          1              0      0          218
User003  3              1019        TV           2      0          1              750    0          269
User005  2              521         Tablet       0      2          0              0      521        0
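
A minimal Python sketch of this pivot, assuming the playbacks are available as (user, device, duration) tuples. The function name is hypothetical; the NP_ and TT_ column prefixes follow the table above.

```python
# (user, device, duration in seconds) taken from the original data.
PLAYBACKS = [
    ("User001", "TV", 190), ("User005", "Tablet", 281),
    ("User003", "TV", 332), ("User001", "Tablet", 312),
    ("User002", "Smartphone", 218), ("User005", "Tablet", 240),
    ("User003", "TV", 418), ("User001", "Tablet", 328),
    ("User003", "Smartphone", 269),
]
DEVICES = ("TV", "Tablet", "Smartphone")


def pivot_by_device(playbacks, devices=DEVICES):
    """Turn each device value into its own NP_<device> and TT_<device> column."""
    rows = {}
    for user, device, duration in playbacks:
        # Every user starts with a zero in every pivoted column.
        row = rows.setdefault(
            user,
            {f"{prefix}_{d}": 0 for prefix in ("NP", "TT") for d in devices},
        )
        row[f"NP_{device}"] += 1          # number of playbacks on this device
        row[f"TT_{device}"] += duration   # total time on this device
    return rows


pivoted = pivot_by_device(PLAYBACKS)
```

Initializing every column to zero keeps the rows rectangular, so users who never used a device still get a value in that column.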

Time Windows
Create new features using values over different periods of time
[Diagram: Instances x Features tables observed at t=1, t=2, and t=3 along a Time axis are combined into one table with millions of instances and thousands of features.]
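
One way to build such features is to count events per instance over several look-back periods. The sketch below is an assumption about how that might look: the window lengths (1 and 3 days), the feature names, and the reference time are all hypothetical; the timestamps come from the playback list.

```python
from datetime import datetime, timedelta

# (user, play time) pairs taken from the playback list.
EVENTS = [
    ("User001", "2015-05-12 16:29:33"),
    ("User003", "2015-05-13 14:26:04"),
    ("User001", "2015-05-13 18:12:45"),
    ("User003", "2015-05-14 21:44:15"),
    ("User001", "2015-05-15 06:59:56"),
    ("User003", "2015-05-15 12:37:05"),
]


def window_counts(events, now, windows_days=(1, 3)):
    """Count each user's events inside every look-back window ending at `now`."""
    features = {}
    for user, timestamp in events:
        played = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
        row = features.setdefault(
            user, {f"plays_last_{d}d": 0 for d in windows_days}
        )
        for days in windows_days:
            if now - timedelta(days=days) < played <= now:
                row[f"plays_last_{days}d"] += 1
    return features


# Hypothetical reference time for the look-back windows.
NOW = datetime(2015, 5, 15, 13, 0, 0)
features = window_counts(EVENTS, NOW)
```

Every extra window adds one column per base feature, which is how the number of features grows quickly.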

Updates
Need a current view of the data, but new data only comes in
batches of changes
[Diagram: batches of changes arriving on day 1, day 2, and day 3 are applied to an Instances x Features table.]
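
A minimal sketch of rebuilding the current view from batches of changes. The change format (a dict with "id", "fields", and an optional "deleted" flag) is hypothetical and not taken from the slides.

```python
def apply_batches(snapshot, batches):
    """Apply change batches, oldest first, and return the current view.

    `snapshot` maps an instance id to a dict of its fields. Each change is a
    dict with an "id", the changed "fields", and optionally "deleted": True
    (a made-up format for this sketch).
    """
    current = {key: dict(fields) for key, fields in snapshot.items()}
    for batch in batches:
        for change in batch:
            key = change["id"]
            if change.get("deleted"):
                current.pop(key, None)
            else:
                # Update existing instances, insert new ones.
                current.setdefault(key, {}).update(change["fields"])
    return current


day0 = {"User001": {"plays": 3, "device": "Tablet"}}
day1 = [{"id": "User001", "fields": {"plays": 4}}]
day2 = [{"id": "User002", "fields": {"plays": 1, "device": "Smartphone"}}]
day3 = [{"id": "User001", "deleted": True}]

current_view = apply_batches(day0, [day1, day2, day3])
```

Because only the changes are stored, batch order matters: replaying them out of order can produce a different current view.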

Structuring Output
After all the data transformations, a CSV ("Comma-Separated Values") file has to be generated, following the rules below:
• A CSV file uses plain text to store tabular data.
• In a CSV file, each row of the file is an instance.
• Each column in a row is usually separated by a comma (,), but other "separators" like semicolon (;), colon (:), or pipe (|) can also be used.
• Each row must contain the same number of fields
  • but they can be null
• Fields can be quoted using double quotes (").
• Fields that contain commas or line separators must be quoted.
• Quotes (") in fields must be doubled ("").
• The character encoding must be UTF-8.
• Optionally, a CSV file can use the first line as a header to provide the names of each field.
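
Python's standard csv module applies these quoting rules automatically. The sketch below is one possible way to write such a file; the function name and the sample rows are illustrative.

```python
import csv
import io


def write_csv(rows, header, stream):
    """Write a header plus rows, quoting fields only where needed."""
    writer = csv.writer(
        stream,
        delimiter=",",
        quotechar='"',
        quoting=csv.QUOTE_MINIMAL,   # quote only fields with , " or newlines
        lineterminator="\n",
    )
    writer.writerow(header)
    for row in rows:
        if len(row) != len(header):
            raise ValueError("every row must have the same number of fields")
        # Null values become empty fields.
        writer.writerow(["" if value is None else value for value in row])


buffer = io.StringIO()
write_csv(
    [["Dance, dance", 312, None], ['The "wall"', 218, 83]],
    ["Name", "Duration (s)", "Plays"],
    buffer,
)
csv_text = buffer.getvalue()
```

To write a real file with the required encoding, pass a stream opened with open(path, "w", encoding="utf-8", newline="") instead of the in-memory buffer.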

Holistic Approach
Feature Engineering

Feature Engineering
• Flatline
  • Domain Specific Language for data generation and filtering
  • Works with datasets -> datasets
  • Lots of built-in functions
    • Sliding windows
    • Date/Time parsing
  • Flatline Editor (in UI)
  • https://github.com/bigmlcom/flatline

Feature Engineering
• Feature Engineering of Numeric features:
  • Discretization (percentiles, within percentiles, groups)
  • Replacement
  • Normalization
  • Exponentiation, Logarithms, Squares, etc.
  • Shock
• Feature Engineering of Text features:
  • Misspellings
  • Length
  • Number of subordinate sentences
  • Language
  • Levenshtein distance
• Stacking:
  • Compute a field using non-linear combinations of other fields
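
A few of the transformations listed above, sketched in plain Python rather than Flatline. The function names are hypothetical, and the choices of z-score normalization and equal-frequency bins are assumptions, since the slide does not say which variants are meant.

```python
from bisect import bisect_left
from statistics import mean, pstdev


def z_scores(values):
    """Normalization (assumed variant): subtract the mean, divide by std dev.

    Raises ZeroDivisionError when all values are identical.
    """
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def discretize(values, n_bins=4):
    """Discretization into equal-frequency bins numbered 0..n_bins-1."""
    ordered = sorted(values)
    n = len(ordered)
    # The rank of each value among the sorted values decides its bin.
    return [min(n_bins - 1, bisect_left(ordered, v) * n_bins // n)
            for v in values]


def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]
```

For example, levenshtein("Bluesss", "Blues") is 2, which could serve as a feature for flagging likely misspellings of a known category.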

Holistic Approach
Test & Automate

Test & Automate
• Test - Evaluate
  • Did you meet the goal?
  • If not, did you discover something else useful?
  • If not, start over
  • If you did...
• Automate - You don't want to hand code that every time, right?
  • Consider tools that are easy to automate
    • scripting interface
    • APIs
  • Maintainability is important
Tools
• Command Line?
  • join, cut, awk, sed, sort, uniq
• Automation
  • Shell, Python, etc.
  • Talend
  • BigML: bindings, bigmler, API, WhizzML
• Relational DB
  • MySQL
• Non-Relational DB
  • MongoDB
Prosper
[Diagram: loan lifecycle with the states Submit Bids, Cancelled, Withdraw, Funded, Expired, Defaulted, Paid, Current, Late.]
Q: Which new loans make it to funded? (Classification)
Q: Which funded loans make it to paid? (Classification)
Q: If funded, what will be the rate? (Regression)

Prosper
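Data Provided in XML updates!!
[Pipeline diagram with the components fetch.sh ("curl", run daily), XML, import.py, export.sh, and bigml.sh, followed by the steps Model, Predict, and Share in gallery.]
Objective fields: Status, LoanStatus, BorrowerRate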

Prosper
Tidbits and Lessons Learned...
• XML... yuck!
• MongoDB has CSV export and is record based, so it is easy to handle a changing data structure.
• Feature Engineering
  • There are 5 different classes of "bad" loans
  • Date cleanup
  • Type casting: floats and ints
• Would be better to track over time
  • number of late payments
  • compare predictions and actuals
• XML... yuck!

Diabetes
Fix Missing Values in a “Meaningful” Way
[Workflow diagram with the steps Filter Zeros, Model insulin, Predict insulin, and Select insulin, and the datasets Original Dataset, Clean Dataset, Amended Dataset, and Fixed Dataset.]

Stock Prices
Shock: Deviations from Trend
(/ (- (f "price") (avg-window "price" -4 -1)) (standard-deviation "price"))

shock = (Current - (4-day moving avg)) / std dev

date  volume  price
1     34353   314
2     44455   315
3     22333   315
4     52322   321
5     28000   320
6     31254   319
7     56544   323
8     44331   324
9     81111   287
10    65422   294
11    59999   300
12    45556   302
13    19899   301
14    21453   302

Sliding windows of previous prices shown on the slide:
314
314 315
314 315 315
314 315 315 321
315 315 321 320
315 321 320 319
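
The same calculation can be sketched in Python to check the numbers by hand. This is not Flatline's implementation: the use of the population standard deviation and the handling of the first rows (averaging whatever previous prices exist, None for the first row) are assumptions.

```python
from statistics import mean, pstdev

# Prices for dates 1..14 from the table.
PRICES = [314, 315, 315, 321, 320, 319, 323, 324, 287, 294, 300, 302, 301, 302]


def shock(prices, window=4):
    """(current price - average of the previous `window` prices) / std dev."""
    sigma = pstdev(prices)  # assumption: population std dev of the whole field
    result = []
    for i, price in enumerate(prices):
        previous = prices[max(0, i - window):i]  # up to `window` earlier rows
        if not previous:
            result.append(None)  # the first row has no history
        else:
            result.append((price - mean(previous)) / sigma)
    return result


shocks = shock(PRICES)
```

The largest negative shock falls on date 9, where the price drops to 287 against a 4-day average of 321.5.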

Talend
https://blog.bigml.com/2013/10/30/data-preparation-for-machine-learning-using-mysql/
Denormalization Example
