BigML brings Data Transformations, a key part of any Machine Learning workflow, to the BigML platform. Data rarely arrive ready for a Machine Learning project: they can be noisy and come from many different sources in many different formats, so a preparation phase is necessary before applying Machine Learning. With this release, BigML adds new Data Transformation capabilities that greatly enhance the existing ones. Discover the ability to perform SQL-style queries, Flatline editor improvements, and more ways to do feature engineering.
2. BigML, Inc BigML Data Transformations Release Webinar
Data Transformations
Speaker: POUL PETERSEN, M.Sc., Chief Infrastructure Officer
Moderator: ATAKAN CETINSOY, VP of Predictive Applications
Questions: Please enter questions into the chat box. We will answer some via chat and others at the end of the session.
Resources: https://bigml.com/releases/summer-2018
Contact: support@bigml.com
Twitter: @bigmlcom
3. BigML, Inc BigML Data Transformations Release Webinar
Reality of a ML Application
[Diagram labels: Data Collection, Data Transformations, Feature Engineering, Evaluation & Retraining; Seen / Unseen; Self-Driving Cars?]
4. BigML, Inc BigML Data Transformations Release Webinar
Effort of a ML Application
State the problem as an ML task
Data wrangling
Feature engineering
Modeling and Evaluations
Predictions
Measure Results
Data transformations ~80% effort
~5% effort
~5% effort
This is such low effort only because of platforms like BigML.
Today’s release is the first step towards making this easy as well!
Task
~10% effort
Effort
5. BigML, Inc BigML Data Transformations Release Webinar
Problem Statement
• BigML’s SaaS https://bigml.com builds, on average, >40,000 trees/day
• That’s trees only! Not counting LR, deepnets, clusters, etc.
• And that’s bigml.com only, not counting bigml.com.au
• We need to ensure that all models are started and finished ASAP
• Started is “easy”: queue monitoring + auto-scaling + heuristics
• Finished is harder: How do we know if a model is taking too long?
• What if we could predict how long a model should take to build?
• Generate an alarm if it takes longer than, e.g., 120% of the predicted time
This sounds like a Machine Learning problem!!!
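The alarm rule above can be sketched in a few lines of Python. This is a minimal illustration, not BigML's monitoring code: the function name, its parameters, and the millisecond unit are assumptions, and the 120% threshold is the example value from the slide.

```python
def is_taking_too_long(elapsed_ms: float, predicted_ms: float,
                       tolerance: float = 1.2) -> bool:
    """Return True when a build has run longer than tolerance * predicted time.

    tolerance=1.2 corresponds to the "120% of the predicted time" example.
    """
    if predicted_ms <= 0:
        raise ValueError("predicted_ms must be positive")
    return elapsed_ms > tolerance * predicted_ms


# A model predicted to take 1000 ms that has already run for 1300 ms
# is past the 120% mark, so it would raise an alarm.
print(is_taking_too_long(elapsed_ms=1300, predicted_ms=1000))
```

The prediction itself would come from a model trained on past build times, which is the Machine Learning problem the slide refers to.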
6. BigML, Inc BigML Data Transformations Release Webinar
The Data…
• Metadata from dataset:
• Size in bytes, number of rows, number of columns, etc.
• Number of numeric, categorical, datetime, text, and items fields
• Metadata from model:
• Objective type: classification or regression
• Tree options: node_depth, missing_splits, randomization, sample
• Subcluster: relates to server size
• Objective:
• Time elapsed to build the tree
7. BigML, Inc BigML Data Transformations Release Webinar
The Data
8. BigML, Inc BigML Data Transformations Release Webinar
Problem #1
There may be identical feature rows
• A user testing a script, re-building with the same parameters
• A Machine Learning class building the same model for an assignment
• Users following an online tutorial
• A BigML employee demoing the same dataset / model process
#numeric  #text  #datetime  size     rows    …  time_ms
34        0      0          46354    1001    …  300
0         1      0          1515654  373     …  1431
4         0      1          56789    445423  …  8891
…         …      …          …        …       …  …
0         1      0          1515654  373     …  1673
The second and last rows have the same feature values but a different time_ms.
However it happens, this is not properly formatted for ML
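One way to spot the problem is to count how often each combination of feature values occurs. The sketch below uses the four rows visible on the slide (columns hidden behind “…” are left out); the helper name duplicated_features is hypothetical.

```python
from collections import Counter

# Rows from the slide: (#numeric, #text, #datetime, size, rows, time_ms).
# Columns elided with "…" on the slide are not included.
builds = [
    (34, 0, 0, 46354, 1001, 300),
    (0, 1, 0, 1515654, 373, 1431),
    (4, 0, 1, 56789, 445423, 8891),
    (0, 1, 0, 1515654, 373, 1673),
]


def duplicated_features(rows):
    """Return feature tuples (every column except the last) seen more than once."""
    counts = Counter(row[:-1] for row in rows)
    return {features: n for features, n in counts.items() if n > 1}


duplicates = duplicated_features(builds)
print(duplicates)
```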
How?
11. BigML, Inc BigML Data Transformations Release Webinar
Aggregation: Count
Content        Genre    Duration  Play Time            User     Device
Highway star   Rock     190       2015-05-12 16:29:33  User001  TV
Blues alive    Blues    281       2015-05-13 12:31:21  User005  Tablet
Lonely planet  Techno   332       2015-05-13 14:26:04  User003  TV
Dance, dance   Disco    312       2015-05-13 18:12:45  User001  Tablet
The wall       Reggae   218       2015-05-14 09:02:55  User002
Offside down   Techno   240       2015-05-14 11:26:32  User005  Tablet
The alchemist  Blues    418       2015-05-14 21:44:15  User003  TV
Bring me down  Classic  328       2015-05-15 06:59:56  User001  Tablet

Count on User

User     Count
User001  3
User005  2
User003  2
User002  1

Number of playbacks per user
12. BigML, Inc BigML Data Transformations Release Webinar
Aggregation: Count Distinct
Content        Genre    Duration  Play Time            User     Device
Highway star   Rock     190       2015-05-12 16:29:33  User001  TV
Blues alive    Blues    281       2015-05-13 12:31:21  User005  Tablet
Lonely planet  Techno   332       2015-05-13 14:26:04  User003  TV
Dance, dance   Disco    312       2015-05-13 18:12:45  User001  Tablet
The wall       Reggae   218       2015-05-14 09:02:55  User002
Offside down   Techno   240       2015-05-14 11:26:32  User005  Tablet
The alchemist  Blues    418       2015-05-14 21:44:15  User003  TV
Bring me down  Classic  328       2015-05-15 06:59:56  User001  Tablet

Count distinct Genre on User

User     Distinct Genre
User001  3
User005  2
User003  2
User002  1

Number of distinct Genre played per user
13. BigML, Inc BigML Data Transformations Release Webinar
Aggregation: Count Missing
Content        Genre    Duration  Play Time            User     Device
Highway star   Rock     190       2015-05-12 16:29:33  User001  TV
Blues alive    Blues    281       2015-05-13 12:31:21  User005  Tablet
Lonely planet  Techno   332       2015-05-13 14:26:04  User003  TV
Dance, dance   Disco    312       2015-05-13 18:12:45  User001  Tablet
The wall       Reggae   218       2015-05-14 09:02:55  User002
Offside down   Techno   240       2015-05-14 11:26:32  User005  Tablet
The alchemist  Blues    418       2015-05-14 21:44:15  User003  TV
Bring me down  Classic  328       2015-05-15 06:59:56  User001  Tablet

Count missing Device on User

User     Missing Device
User001  0
User005  0
User003  0
User002  1

Number of missing Device per user
14. BigML, Inc BigML Data Transformations Release Webinar
Aggregation: Sum
Content        Genre    Duration  Play Time            User     Device
Highway star   Rock     190       2015-05-12 16:29:33  User001  TV
Blues alive    Blues    281       2015-05-13 12:31:21  User005  Tablet
Lonely planet  Techno   332       2015-05-13 14:26:04  User003  TV
Dance, dance   Disco    312       2015-05-13 18:12:45  User001  Tablet
The wall       Reggae   218       2015-05-14 09:02:55  User002
Offside down   Techno   240       2015-05-14 11:26:32  User005  Tablet
The alchemist  Blues    418       2015-05-14 21:44:15  User003  TV
Bring me down  Classic  328       2015-05-15 06:59:56  User001  Tablet

Sum Duration on User

User     Sum Duration
User001  830
User005  521
User003  750
User002  218

Total Duration per User
15. BigML, Inc BigML Data Transformations Release Webinar
Aggregation: Average
Content        Genre    Duration  Play Time            User     Device
Highway star   Rock     190       2015-05-12 16:29:33  User001  TV
Blues alive    Blues    281       2015-05-13 12:31:21  User005  Tablet
Lonely planet  Techno   332       2015-05-13 14:26:04  User003  TV
Dance, dance   Disco    312       2015-05-13 18:12:45  User001  Tablet
The wall       Reggae   218       2015-05-14 09:02:55  User002
Offside down   Techno   240       2015-05-14 11:26:32  User005  Tablet
The alchemist  Blues    418       2015-05-14 21:44:15  User003  TV
Bring me down  Classic  328       2015-05-15 06:59:56  User001  Tablet

Average Duration on User

User     Average Duration
User001  276.67
User005  260.50
User003  375.00
User002  218.00

Average Duration per User
16. BigML, Inc BigML Data Transformations Release Webinar
Aggregation: Maximum
Content        Genre    Duration  Play Time            User     Device
Highway star   Rock     190       2015-05-12 16:29:33  User001  TV
Blues alive    Blues    281       2015-05-13 12:31:21  User005  Tablet
Lonely planet  Techno   332       2015-05-13 14:26:04  User003  TV
Dance, dance   Disco    312       2015-05-13 18:12:45  User001  Tablet
The wall       Reggae   218       2015-05-14 09:02:55  User002
Offside down   Techno   240       2015-05-14 11:26:32  User005  Tablet
The alchemist  Blues    418       2015-05-14 21:44:15  User003  TV
Bring me down  Classic  328       2015-05-15 06:59:56  User001  Tablet

Maximum Duration on User

User     Max Duration
User001  328
User005  281
User003  418
User002  218

Maximum Duration per User
17. BigML, Inc BigML Data Transformations Release Webinar
Aggregation: Minimum
Content        Genre    Duration  Play Time            User     Device
Highway star   Rock     190       2015-05-12 16:29:33  User001  TV
Blues alive    Blues    281       2015-05-13 12:31:21  User005  Tablet
Lonely planet  Techno   332       2015-05-13 14:26:04  User003  TV
Dance, dance   Disco    312       2015-05-13 18:12:45  User001  Tablet
The wall       Reggae   218       2015-05-14 09:02:55  User002
Offside down   Techno   240       2015-05-14 11:26:32  User005  Tablet
The alchemist  Blues    418       2015-05-14 21:44:15  User003  TV
Bring me down  Classic  328       2015-05-15 06:59:56  User001  Tablet

Minimum Duration on User

User     Min Duration
User001  190
User005  240
User003  332
User002  218

Minimum Duration per User
• Similar for standard deviation and variance
• Possible to combine multiple aggregations on the same field
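The aggregations from the preceding slides can be combined in a single SQL-style query. The sketch below uses Python's built-in sqlite3 module as a stand-in SQL engine, not the BigML API; the table name plays and the snake_case column names are assumptions made for the example.

```python
import sqlite3

# Playback records from the slides; None marks the missing Device value.
rows = [
    ("Highway star", "Rock", 190, "2015-05-12 16:29:33", "User001", "TV"),
    ("Blues alive", "Blues", 281, "2015-05-13 12:31:21", "User005", "Tablet"),
    ("Lonely planet", "Techno", 332, "2015-05-13 14:26:04", "User003", "TV"),
    ("Dance, dance", "Disco", 312, "2015-05-13 18:12:45", "User001", "Tablet"),
    ("The wall", "Reggae", 218, "2015-05-14 09:02:55", "User002", None),
    ("Offside down", "Techno", 240, "2015-05-14 11:26:32", "User005", "Tablet"),
    ("The alchemist", "Blues", 418, "2015-05-14 21:44:15", "User003", "TV"),
    ("Bring me down", "Classic", 328, "2015-05-15 06:59:56", "User001", "Tablet"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE plays (content TEXT, genre TEXT, duration INTEGER, "
    "play_time TEXT, user_id TEXT, device TEXT)"
)
conn.executemany("INSERT INTO plays VALUES (?, ?, ?, ?, ?, ?)", rows)

# One GROUP BY computes every aggregation shown on the slides.
# COUNT(device) skips NULLs, so COUNT(*) - COUNT(device) counts missing values.
query = """
SELECT user_id,
       COUNT(*)                 AS play_count,
       COUNT(DISTINCT genre)    AS distinct_genres,
       COUNT(*) - COUNT(device) AS missing_devices,
       SUM(duration)            AS total_duration,
       ROUND(AVG(duration), 2)  AS avg_duration,
       MAX(duration)            AS max_duration,
       MIN(duration)            AS min_duration
FROM plays
GROUP BY user_id
ORDER BY user_id
"""
per_user = {row[0]: row[1:] for row in conn.execute(query)}

for user_id, stats in per_user.items():
    print(user_id, stats)
```

Each user's tuple holds the count, distinct genres, missing devices, sum, average, maximum, and minimum, in that order, and can be checked against the result tables on the slides.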
18. BigML, Inc BigML Data Transformations Release Webinar
Aggregations
20. BigML, Inc BigML Data Transformations Release Webinar
Joins
• Datasets to join need to have a field in common
• joining sales and demographics on customer_id
• joining employee and budget details on department_id
• Datasets to join do not need to have the same dimensions
• Joins can be performed in several ways
• Left, Right, Inner, Outer…
21. BigML, Inc BigML Data Transformations Release Webinar
Left Join
• In a Left join of dataset A to B:
• Returns all records from the left dataset A, and the matched records from B
• The result is NULL from B if there is no match.

A
_id  field1
1    34
2    56
3    123
4    56
5    79

B (no “3” or “5”)
_id  field2
1    red
2    green
4    blue
6    black

A left join B =
_id  field1  field2
1    34      red
2    56      green
3    123     null
4    56      blue
5    79      null
22. BigML, Inc BigML Data Transformations Release Webinar
Right Join
• In a Right join of dataset A to B:
• Returns all records from the right dataset B, and the matched records from A
• The result is NULL from A if there is no match.

A (no “6”; “3” and “5” unused)
_id  field1
1    34
2    56
3    123
4    56
5    79

B
_id  field2
1    red
2    green
4    blue
6    black

A right join B =
_id  field2  field1
1    red     34
2    green   56
4    blue    56
6    black   null
23. BigML, Inc BigML Data Transformations Release Webinar
Inner Join
• In an Inner join of dataset A to B:
• Returns only the records from A that match records in B
• If there is no match between A and B, the record is ignored

A (“3” and “5” unused)
_id  field1
1    34
2    56
3    123
4    56
5    79

B (“6” unused)
_id  field2
1    red
2    green
4    blue
6    black

A inner join B =
_id  field1  field2
1    34      red
2    56      green
4    56      blue
24. BigML, Inc BigML Data Transformations Release Webinar
Full Outer Join
• In a Full join of dataset A to B:
• Returns all records from A and all records from B
• If a record has no match in the other dataset, the field from that dataset is null

A (no “6”)
_id  field1
1    34
2    56
3    123
4    56
5    79

B (no “3” or “5”)
_id  field2
1    red
2    green
4    blue
6    black

A full join B =
_id  field1  field2
1    34      red
2    56      green
3    123     null
4    56      blue
5    79      null
6    null    black
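The four join types can be reproduced on the A and B tables from the slides. This is a plain-Python sketch that assumes _id is unique within each dataset; the join function is hypothetical and None stands in for null.

```python
# Datasets from the slides, keyed by _id.
A = {1: 34, 2: 56, 3: 123, 4: 56, 5: 79}           # _id -> field1
B = {1: "red", 2: "green", 4: "blue", 6: "black"}  # _id -> field2


def join(a, b, how):
    """Join two {_id: value} mappings and return (_id, field1, field2) rows.

    The join type only decides which _ids are kept; a missing side is None.
    """
    if how == "left":
        ids = list(a)                        # every record of A
    elif how == "right":
        ids = list(b)                        # every record of B
    elif how == "inner":
        ids = [i for i in a if i in b]       # only _ids present in both
    elif how == "full":
        ids = sorted(set(a) | set(b))        # every _id from either side
    else:
        raise ValueError(f"unknown join type: {how}")
    return [(i, a.get(i), b.get(i)) for i in ids]


for how in ("left", "right", "inner", "full"):
    print(how, join(A, B, how))
```

The rows always come back as (_id, field1, field2), so the right join lists the same values as the slide but with field1 before field2.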
28. BigML, Inc BigML Data Transformations Release Webinar
Using the API
• The UI has a limited set of data transformations
• Aggregation (limited), Joins (limited), Remove Duplicates
• More functions will be added: concat, ordering, multiple group by
• The API supports nearly full SQL syntax for transforming datasets
• Nested queries not supported (yet) - e.g. subselects
• A better way to perform the workflow:
• SELECT 10001a, avg(000019) AS avg_status_elapsed FROM DS GROUP BY 10001a
• Can perform the entire workflow in one SQL query using multiple “group by”
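The idea of grouping by several fields can be tried locally. The sketch below runs an analogous query with Python's sqlite3 module rather than the BigML API, on the four rows shown under Problem #1; the readable column names replace BigML field IDs such as 10001a and are assumptions, as is averaging time_ms to collapse identical rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DS (n_numeric INTEGER, n_text INTEGER, n_datetime INTEGER, "
    "size INTEGER, n_rows INTEGER, time_ms INTEGER)"
)
conn.executemany(
    "INSERT INTO DS VALUES (?, ?, ?, ?, ?, ?)",
    [
        (34, 0, 0, 46354, 1001, 300),
        (0, 1, 0, 1515654, 373, 1431),
        (4, 0, 1, 56789, 445423, 8891),
        (0, 1, 0, 1515654, 373, 1673),
    ],
)

# Grouping by every feature column merges rows with identical features;
# AVG(time_ms) then gives one objective value per distinct feature row.
query = """
SELECT n_numeric, n_text, n_datetime, size, n_rows,
       AVG(time_ms) AS avg_time_ms
FROM DS
GROUP BY n_numeric, n_text, n_datetime, size, n_rows
ORDER BY size
"""
deduplicated = conn.execute(query).fetchall()

for row in deduplicated:
    print(row)
```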
29. BigML, Inc BigML Data Transformations Release Webinar
API Transformations
30. BigML, Inc BigML Data Transformations Release Webinar
Problem #4
How do we add new data and retrain?
• When adding a new batch of data
• Avoid re-uploading by using a merge
• Repeat the entire workflow on the merged dataset using Scriptify

#numeric  #text  #datetime  size     rows  …  time_ms
12        2      0          74001    200   …  1975
1         0      1          22673    373   …  1552
1056      0      1          9231411  4352  …  7675

#numeric  #text  #datetime  size     rows    …  time_ms
34        0      0          46354    1001    …  300
0         1      0          1515654  373     …  1552
4         0      1          56789    445423  …  8891