Data Transformations Release
Introducing
Data Transformations
BigML, Inc BigML Data Transformations Release Webinar
Data Transformations
Speaker: POUL PETERSEN, M.Sc. - Chief Infrastructure Officer
Moderator: ATAKAN CETINSOY - VP of Predictive Applications
Questions: Please enter questions into the chat box. We will answer some via chat and others at the end of the session.
Resources: https://bigml.com/releases/summer-2018
Contact: support@bigml.com
Twitter: @bigmlcom
Reality of a ML Application
[Diagram: Data Collection, Data Transformations, Feature Engineering, Evaluation & Retraining; labeled “Seen” and “Unseen”; “Self-Driving Cars?”]
Effort of a ML Application
Task:
• State the problem as an ML task
• Data wrangling
• Feature engineering
• Modeling and Evaluations
• Predictions
• Measure Results
Effort:
• Data transformations: ~80% effort
• Other effort labels on the slide: ~10%, ~5%, ~5%
• This effort is only so low because of platforms like BigML
• Today’s release is the first step towards making this easy as well!
Problem Statement
• BigML’s SaaS https://bigml.com builds, on average, >40,000 trees/day
• That’s trees only! Not counting LR, deepnets, clusters, etc.
• And that is bigml.com only, not counting bigml.com.au
• We need to ensure that all models are started and finished ASAP
• Started is “easy”: queue monitoring + auto-scaling + heuristics
• Finished is harder: How do we know if a model is taking too long?
• What if we could predict how long a model should take to build?
• Generate an alarm if it takes longer than, e.g., 120% of the predicted time
This sounds like a Machine Learning problem!!!
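The alarm rule above can be sketched in a few lines. This is only an illustration: the function name, the millisecond units, and the default tolerance of 1.2 (the slide's "e.g. 120%") are assumptions, not part of any BigML API.

```python
def is_taking_too_long(elapsed_ms: float, predicted_ms: float,
                       tolerance: float = 1.2) -> bool:
    """Return True when a build has run longer than tolerance * predicted time.

    `tolerance=1.2` mirrors the "120% of the predicted time" example;
    the name and signature are hypothetical.
    """
    if predicted_ms <= 0:
        raise ValueError("predicted_ms must be positive")
    return elapsed_ms > tolerance * predicted_ms


# A model predicted to take 1431 ms: the alarm threshold is 1717.2 ms.
print(is_taking_too_long(1673, 1431))  # below the threshold
print(is_taking_too_long(1800, 1431))  # above the threshold
```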
The Data…
• Metadata from dataset:
• Size in bytes, number of rows, number of columns, etc.
• Number of numeric, categorical, datetime, text, and items fields
• Metadata from model:
• Objective type: classification or regression
• Tree options: node_depth, missing_splits, randomization, sample
• Subcluster: relates to server size
• Objective:
• Time elapsed to build the tree
The Data
Problem #1
There may be identical feature rows
• A user testing a script, re-building with the same parameters
• A Machine Learning class building the same model for an assignment
• Users following an online tutorial
• A BigML employee demoing the same dataset / model process
#numeric  #text  #datetime  size     rows    …  time_ms
34        0      0          46354    1001    …  300
0         1      0          1515654  373     …  1431
4         0      1          56789    445423  …  8891
…         …      …          …        …       …  …
0         1      0          1515654  373     …  1673

The second and last rows: same features, different time_ms
However it happens, this is not properly formatted for ML
How?
How to get there… Transform & Aggregate (fear not SQL experts)

#numeric  #text  #datetime  size     rows    …  time_ms
34        0      0          46354    1001    …  300
0         1      0          1515654  373     …  1431
4         0      1          56789    445423  …  8891
…         …      …          …        …       …  …
0         1      0          1515654  373     …  1673

Collapse:

feature_key  time_ms
c3f6c8be4f   300
a243ca3c38   1431
14f9d917bc   8891
…            …
a243ca3c38   1673

Aggregate:

feature_key  avg_ms
c3f6c8be4f   300
a243ca3c38   1552
14f9d917bc   8891
…            …

All feature rows unique
Flatline
(sha1 (str (all-but "status.elapsed")))
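Outside Flatline, the same idea (hash every field except the elapsed time into one key) might look like the Python sketch below. The dictionary keys and the `|` separator are assumptions made for illustration; the separator only guards against ambiguous concatenations such as "1"+"23" versus "12"+"3".

```python
import hashlib


def feature_key(row: dict, exclude: str = "time_ms") -> str:
    """SHA-1 over the string form of every field except `exclude`.

    Field names and the "|" separator are illustrative assumptions.
    """
    parts = [str(value) for name, value in row.items() if name != exclude]
    return hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()


row_a = {"numeric": 0, "text": 1, "datetime": 0,
         "size": 1515654, "rows": 373, "time_ms": 1431}
row_b = dict(row_a, time_ms=1673)    # same features, different build time
row_c = dict(row_a, size=46354)      # different features

print(feature_key(row_a) == feature_key(row_b))  # identical features share a key
print(feature_key(row_a) == feature_key(row_c))  # different features do not
```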
Aggregation: Count
Content        Genre    Duration  Play Time            User     Device
Highway star   Rock     190       2015-05-12 16:29:33  User001  TV
Blues alive    Blues    281       2015-05-13 12:31:21  User005  Tablet
Lonely planet  Techno   332       2015-05-13 14:26:04  User003  TV
Dance, dance   Disco    312       2015-05-13 18:12:45  User001  Tablet
The wall       Reggae   218       2015-05-14 09:02:55  User002
Offside down   Techno   240       2015-05-14 11:26:32  User005  Tablet
The alchemist  Blues    418       2015-05-14 21:44:15  User003  TV
Bring me down  Classic  328       2015-05-15 06:59:56  User001  Tablet

Count on User:

User     Count
User001  3
User005  2
User003  2
User002  1

Number of playbacks per user
Aggregation: Count Distinct
Count distinct Genre on User:

User     Distinct Genre
User001  3
User005  2
User003  2
User002  1

Number of distinct Genre played per user
Aggregation: Count Missing
Count missing Device on User:

User     Missing Device
User001  0
User005  0
User003  0
User002  1

Number of missing Device per user
Aggregation: Sum
Sum Duration on User:

User     Sum Duration
User001  830
User005  521
User003  750
User002  218

Total Duration per User
Aggregation: Average
Average Duration on User:

User     Average Duration
User001  276.67
User005  260.50
User003  375.00
User002  218.00

Average Duration per User
Aggregation: Maximum
Maximum Duration on User:

User     Max Duration
User001  328
User005  281
User003  418
User002  218

Maximum Duration per User
Aggregation: Minimum
Minimum Duration on User:

User     Min Duration
User001  190
User005  240
User003  332
User002  218

Minimum Duration per User
• Similar for standard deviation and variance
• Possible to combine multiple aggregations on the same field
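The aggregations walked through above can be reproduced in one grouped query. The sketch below uses Python's built-in sqlite3 module as a stand-in engine (an assumption; it is not BigML's implementation), omits the Play Time column for brevity, and uses illustrative table and column names.

```python
import sqlite3

# Playback rows from the example table: (content, genre, duration, user, device).
PLAYS = [
    ("Highway star", "Rock", 190, "User001", "TV"),
    ("Blues alive", "Blues", 281, "User005", "Tablet"),
    ("Lonely planet", "Techno", 332, "User003", "TV"),
    ("Dance, dance", "Disco", 312, "User001", "Tablet"),
    ("The wall", "Reggae", 218, "User002", None),  # missing Device
    ("Offside down", "Techno", 240, "User005", "Tablet"),
    ("The alchemist", "Blues", 418, "User003", "TV"),
    ("Bring me down", "Classic", 328, "User001", "Tablet"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE plays (content TEXT, genre TEXT, duration INTEGER, "
    "user TEXT, device TEXT)"
)
conn.executemany("INSERT INTO plays VALUES (?, ?, ?, ?, ?)", PLAYS)

# Several aggregations combined in a single GROUP BY on user.
QUERY = """
SELECT user,
       COUNT(*)                 AS plays,
       COUNT(DISTINCT genre)    AS distinct_genres,
       COUNT(*) - COUNT(device) AS missing_devices,
       SUM(duration)            AS total_duration,
       ROUND(AVG(duration), 2)  AS avg_duration,
       MAX(duration)            AS max_duration,
       MIN(duration)            AS min_duration
FROM plays
GROUP BY user
ORDER BY user
"""
stats = {row[0]: row[1:] for row in conn.execute(QUERY)}
for user, values in stats.items():
    print(user, values)
```

`COUNT(device)` skips NULLs, so subtracting it from `COUNT(*)` gives the number of missing Device values per user.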
Aggregations
Problem #2
We have this…

Dataset 1:

#numeric  #text  #datetime  size     rows    …  time_ms
34        0      0          46354    1001    …  300
0         1      0          1515654  373     …  1431
4         0      1          56789    445423  …  8891
…         …      …          …        …       …  …
0         1      0          1515654  373     …  1673

Dataset 2:

feature_key  avg_ms
c3f6c8be4f   300
a243ca3c38   1552
14f9d917bc   8891
…            …

We want this…

Dataset:

#numeric  #text  #datetime  size     rows    …  avg_ms
34        0      0          46354    1001    …  300
0         1      0          1515654  373     …  1552
4         0      1          56789    445423  …  8891
…         …      …          …        …       …  …
Joins
• Datasets to join need to have a field in common
• joining sales and demographics on customer_id
• joining employee and budget details on department_id
• Datasets to join do not need to have the same dimensions
• Joins can be performed in several ways
• Left, Right, Inner, Outer…
Left Join
• In a Left join of dataset A to B:
• Returns all records from the left dataset A, and the matched records from B
• The result is NULL from B if there is no match

A:

_id  field1
1    34
2    56
3    123
4    56
5    79

B:

_id  field2
1    red
2    green
4    blue
6    black

A left join B =

_id  field1  field2
1    34      red
2    56      green
3    123     null
4    56      blue
5    79      null

B has no “3” or “5”
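A left join on the example tables can be sketched in plain Python. The dictionaries keyed by `_id` and the function name are illustrative assumptions, and the sketch presumes each `_id` appears at most once per dataset.

```python
# Example tables keyed by _id.
A = {1: 34, 2: 56, 3: 123, 4: 56, 5: 79}
B = {1: "red", 2: "green", 4: "blue", 6: "black"}


def left_join(left: dict, right: dict) -> list:
    """Keep every record of `left`; fill in None where `right` has no match."""
    return [(key, value, right.get(key)) for key, value in left.items()]


left_result = left_join(A, B)
for row in left_result:
    print(row)
```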
Right Join
• In a Right join of dataset A to B:
• Returns all records from the right dataset B, and the matched records from A
• The result is NULL from A if there is no match

A:

_id  field1
1    34
2    56
3    123
4    56
5    79

B:

_id  field2
1    red
2    green
4    blue
6    black

A right join B =

_id  field2  field1
1    red     34
2    green   56
4    blue    56
6    black   null

A has no “6”; “3” unused
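The right join mirrors the left join with the roles of the two datasets swapped. A minimal Python sketch follows; the dictionaries keyed by `_id` and the function name are assumptions for illustration, with unique `_id` values presumed.

```python
# Example tables keyed by _id.
A = {1: 34, 2: 56, 3: 123, 4: 56, 5: 79}
B = {1: "red", 2: "green", 4: "blue", 6: "black"}


def right_join(left: dict, right: dict) -> list:
    """Keep every record of `right`; fill in None where `left` has no match."""
    return [(key, value, left.get(key)) for key, value in right.items()]


right_result = right_join(A, B)
for row in right_result:
    print(row)
```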
Inner Join
• In an Inner join of dataset A to B:
• Returns only the records from the left dataset A that match records from B
• If there is no match between A and B, the record is ignored

A:

_id  field1
1    34
2    56
3    123
4    56
5    79

B:

_id  field2
1    red
2    green
4    blue
6    black

A inner join B =

_id  field1  field2
1    34      red
2    56      green
4    56      blue

“3” and “5” in A are unused; “6” in B is unused
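An inner join keeps only the keys present on both sides. A short Python sketch, again assuming dictionaries keyed by a unique `_id` and a hypothetical function name:

```python
# Example tables keyed by _id.
A = {1: 34, 2: 56, 3: 123, 4: 56, 5: 79}
B = {1: "red", 2: "green", 4: "blue", 6: "black"}


def inner_join(left: dict, right: dict) -> list:
    """Keep only records whose key exists in both datasets."""
    return [(key, value, right[key]) for key, value in left.items() if key in right]


inner_result = inner_join(A, B)
for row in inner_result:
    print(row)
```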
Full Outer Join
• In a Full join of dataset A to B:
• Returns all records from both A and B
• Where a record has no match in the other dataset, that dataset's field is null

A:

_id  field1
1    34
2    56
3    123
4    56
5    79

B:

_id  field2
1    red
2    green
4    blue
6    black

A full join B =

_id  field1  field2
1    34      red
2    56      green
3    123     null
4    56      blue
5    79      null
6    null    black

A has no “6”; B has no “3” or “5”
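A full outer join takes the union of the keys from both sides. The Python sketch below assumes dictionaries keyed by a unique `_id`; the function name is hypothetical, and the sorted output order is a choice made for this example.

```python
# Example tables keyed by _id.
A = {1: 34, 2: 56, 3: 123, 4: 56, 5: 79}
B = {1: "red", 2: "green", 4: "blue", 6: "black"}


def full_outer_join(left: dict, right: dict) -> list:
    """Keep every key from either dataset; missing values become None."""
    keys = sorted(set(left) | set(right))
    return [(key, left.get(key), right.get(key)) for key in keys]


full_result = full_outer_join(A, B)
for row in full_result:
    print(row)
```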
Joins
Problem #3
Left join keeps all records from the left dataset
#numeric  #text  #datetime  size     rows    …  avg_ms
34        0      0          46354    1001    …  300
0         1      0          1515654  373     …  1552
4         0      1          56789    445423  …  8891
…         …      …          …        …       …  …
0         1      0          1515654  373     …  1552

The second and last rows are the same
Remove Duplicates
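Removing duplicates after the left join amounts to keeping the first occurrence of each identical row. A minimal Python sketch, with rows as tuples and the elided "…" columns left out (an assumption made only to keep the example small):

```python
def remove_duplicates(rows: list) -> list:
    """Return rows in their original order, keeping the first copy of each."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


# (#numeric, #text, #datetime, size, rows, avg_ms)
joined = [
    (34, 0, 0, 46354, 1001, 300),
    (0, 1, 0, 1515654, 373, 1552),
    (4, 0, 1, 56789, 445423, 8891),
    (0, 1, 0, 1515654, 373, 1552),  # duplicate left behind by the left join
]

deduplicated = remove_duplicates(joined)
print(len(joined), "->", len(deduplicated))
```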
Using the API
• The UI has a limited set of data transformations
• Aggregation (limited), Joins (limited), Remove Duplicates
• More functions will be added: concat, ordering, multiple group by
• The API supports nearly full SQL syntax for transforming datasets
• Nested queries not supported (yet) - e.g. subselects
• A better way to perform the workflow:
• SELECT 10001a, avg(000019) AS avg_status_elapsed FROM DS GROUP BY 10001a
• The entire workflow can be performed in one SQL query using multiple “group by”
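To see what the GROUP BY query above computes, it can be run against a toy table with Python's built-in sqlite3 module. This is a stand-in, not the BigML API: the field IDs are double-quoted because SQLite does not accept bare identifiers that start with a digit, and treating `10001a` as the feature key and `000019` as the elapsed time is an assumption inferred from the `avg_status_elapsed` alias.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed meaning of the field IDs: "10001a" = feature key, "000019" = elapsed ms.
conn.execute('CREATE TABLE DS ("10001a" TEXT, "000019" INTEGER)')
conn.executemany(
    "INSERT INTO DS VALUES (?, ?)",
    [
        ("c3f6c8be4f", 300),
        ("a243ca3c38", 1431),
        ("14f9d917bc", 8891),
        ("a243ca3c38", 1673),
    ],
)

QUERY = (
    'SELECT "10001a", avg("000019") AS avg_status_elapsed '
    'FROM DS GROUP BY "10001a"'
)
averages = dict(conn.execute(QUERY).fetchall())
for key, avg_ms in averages.items():
    print(key, avg_ms)
```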
API Transformations
Problem #4
How do we add new data and retrain?
• When adding a new batch of data:
• Avoid re-uploading by using a merge
• Repeat the entire workflow on the merged dataset using Scriptify

#numeric  #text  #datetime  size     rows    …  time_ms
12        2      0          74001    200     …  1975
1         0      1          22673    373     …  1552
1056      0      1          9231411  4352    …  7675

#numeric  #text  #datetime  size     rows    …  time_ms
34        0      0          46354    1001    …  300
0         1      0          1515654  373     …  1552
4         0      1          56789    445423  …  8891
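Conceptually, a merge appends the rows of datasets that share the same columns. The Python sketch below illustrates that idea only; it is not the BigML merge call, the function and variable names are hypothetical, and the elided "…" columns are left out.

```python
HEADER = ("numeric", "text", "datetime", "size", "rows", "time_ms")

batch_1 = [
    (34, 0, 0, 46354, 1001, 300),
    (0, 1, 0, 1515654, 373, 1552),
    (4, 0, 1, 56789, 445423, 8891),
]
batch_2 = [
    (12, 2, 0, 74001, 200, 1975),
    (1, 0, 1, 22673, 373, 1552),
    (1056, 0, 1, 9231411, 4352, 7675),
]


def merge(header: tuple, *batches: list) -> list:
    """Concatenate row batches after checking each row matches the header width."""
    merged = []
    for batch in batches:
        for row in batch:
            if len(row) != len(header):
                raise ValueError("row does not match the header")
            merged.append(row)
    return merged


merged_rows = merge(HEADER, batch_1, batch_2)
print(len(merged_rows), "rows after merging")
```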
Merging & Scriptify
More Info
https://bigml.com/releases/summer-2018
Questions?
@bigmlcom support@bigml.com

Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptx
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 

BigML Release: Data Transformations

  • 2. BigML, Inc BigML Data Transformations Release Webinar
       Data Transformations
       Speaker: Poul Petersen, M.Sc., Chief Infrastructure Officer
       Moderator: Atakan Cetinsoy, VP of Predictive Applications
       Questions: Please enter questions into the chat box. We will answer some via chat and others at the end of the session.
       Resources: https://bigml.com/releases/summer-2018
       Contact: support@bigml.com
       Twitter: @bigmlcom
  • 3. Reality of an ML Application
       [Figure labels: Seen; Unseen; Self-Driving Cars?; Data Collection; Data Transformations; Feature Engineering; Evaluation & Retraining]
  • 4. Effort of an ML Application
       Tasks: State the problem as an ML task; Data wrangling; Feature engineering; Modeling and Evaluations; Predictions; Measure Results
       Effort labels: ~80% effort (Data transformations); ~10% effort; ~5% effort; ~5% effort
       Callouts: “This is only such low effort because of platforms like [logo not captured]”; “Today’s release is the first step towards making this easy as well!”
  • 5. Problem Statement
       • BigML’s SaaS https://bigml.com builds, on average, >40,000 trees/day
           • That’s trees only! Not counting LR, deepnets, clusters, etc.
           • And bigml.com only, not counting bigml.com.au
       • We need to ensure that all models are started and finished ASAP
           • Started is “easy”: queue monitoring + auto-scaling + heuristics
           • Finished is harder: How do we know if a model is taking too long?
       • What if we could predict how long a model should take to build?
           • Generate an alarm if it takes longer than, e.g., 120% of the predicted time
       This sounds like a Machine Learning problem!!!
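The alarm rule on this slide can be sketched in a few lines. This is only an illustration: the function name `is_overdue` and the `tolerance` parameter are hypothetical, and the 120% threshold is the example value from the slide.

```python
def is_overdue(elapsed_ms: float, predicted_ms: float, tolerance: float = 1.2) -> bool:
    """Return True when a build has run longer than tolerance * predicted time."""
    return elapsed_ms > tolerance * predicted_ms


# A model predicted to take 1431 ms raises an alarm only above about 1717 ms.
print(is_overdue(1500, 1431))  # False
print(is_overdue(1800, 1431))  # True
```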
  • 6. The Data…
       • Metadata from dataset:
           • Size in bytes, number of rows, number of columns, etc.
           • Number of numeric, categorical, datetime, text, and items fields
       • Metadata from model:
           • Objective type: classification or regression
           • Tree options: node_depth, missing_splits, randomization, sample
           • Subcluster: relates to server size
       • Objective:
           • Time elapsed to build the tree
  • 7. The Data
  • 8. Problem #1
       There may be identical feature rows. How?
       • A user testing a script, re-building with the same parameters
       • A Machine Learning class building the same model for an assignment
       • Users following an online tutorial
       • A BigML employee demoing the same dataset / model process

       #numeric  #text  #datetime  size     rows    …  time_ms
       34        0      0          46354    1001    …  300
       0         1      0          1515654  373     …  1431
       4         0      1          56789    445423  …  8891
       …         …      …          …        …       …  …
       0         1      0          1515654  373     …  1673

       Rows 2 and 5: same features, different time_ms.
       However it happens, this is not properly formatted for ML
  • 9. How to get there…
       Transform & Aggregate: collapse the feature columns into a key, then aggregate, so that all feature rows are unique (fear not SQL experts)

       #numeric  #text  #datetime  size     rows    …  time_ms
       34        0      0          46354    1001    …  300
       0         1      0          1515654  373     …  1431
       4         0      1          56789    445423  …  8891
       …         …      …          …        …       …  …
       0         1      0          1515654  373     …  1673

       Collapse:

       feature_key  time_ms
       c3f6c8be4f   300
       a243ca3c38   1431
       14f9d917bc   8891
       …            …
       a243ca3c38   1673

       Aggregate:

       feature_key  avg_ms
       c3f6c8be4f   300
       a243ca3c38   1552
       14f9d917bc   8891
       …            …
  • 10. Flatline
       (sha1 (str (all-but "status.elapsed")))
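To see what the Flatline expression is doing, here is a rough Python analogue: hash every field except the elapsed time, so that rows with identical features get an identical key. The helper name `feature_key`, the `|` separator, the sorted field order, and the shortened column names are assumptions for this sketch, so the resulting hashes will not match the ones BigML produces.

```python
import hashlib


def feature_key(row: dict, exclude: tuple = ("time_ms",)) -> str:
    """Hash every field except the excluded ones into one hex key."""
    kept = [str(row[name]) for name in sorted(row) if name not in exclude]
    return hashlib.sha1("|".join(kept).encode("utf-8")).hexdigest()


# Sample rows from the slides; the elided "…" columns are left out.
rows = [
    {"numeric": 34, "text": 0, "datetime": 0, "size": 46354, "rows": 1001, "time_ms": 300},
    {"numeric": 0, "text": 1, "datetime": 0, "size": 1515654, "rows": 373, "time_ms": 1431},
    {"numeric": 0, "text": 1, "datetime": 0, "size": 1515654, "rows": 373, "time_ms": 1673},
]
keys = [feature_key(r) for r in rows]
print(keys[1] == keys[2])  # True: same features, different time_ms
print(keys[0] == keys[1])  # False
```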
  • 11. Aggregation: Count

       Content        Genre    Duration  Play Time            User     Device
       Highway star   Rock     190       2015-05-12 16:29:33  User001  TV
       Blues alive    Blues    281       2015-05-13 12:31:21  User005  Tablet
       Lonely planet  Techno   332       2015-05-13 14:26:04  User003  TV
       Dance, dance   Disco    312       2015-05-13 18:12:45  User001  Tablet
       The wall       Reggae   218       2015-05-14 09:02:55  User002
       Offside down   Techno   240       2015-05-14 11:26:32  User005  Tablet
       The alchemist  Blues    418       2015-05-14 21:44:15  User003  TV
       Bring me down  Classic  328       2015-05-15 06:59:56  User001  Tablet

       Count on User: number of playbacks per user

       User     Count
       User001  3
       User005  2
       User003  2
       User002  1
  • 12. Aggregation: Count Distinct
       Input: the playback table from slide 11.
       Count distinct Genre on User: number of distinct Genre played per user

       User     Distinct Genre
       User001  3
       User005  2
       User003  2
       User002  1
  • 13. Aggregation: Count Missing
       Input: the playback table from slide 11.
       Count missing Device on User: number of missing Device per user

       User     Missing Device
       User001  0
       User005  0
       User003  0
       User002  1
  • 14. Aggregation: Sum
       Input: the playback table from slide 11.
       Sum Duration on User: total Duration per user

       User     Sum Duration
       User001  830
       User005  521
       User003  750
       User002  218
  • 15. Aggregation: Average
       Input: the playback table from slide 11.
       Average Duration on User: average Duration per user

       User     Average Duration
       User001  276.67
       User005  260.50
       User003  375.00
       User002  218
  • 16. Aggregation: Maximum
       Input: the playback table from slide 11.
       Maximum Duration on User: maximum Duration per user

       User     Max Duration
       User001  328
       User005  281
       User003  418
       User002  218
  • 17. Aggregation: Minimum
       Input: the playback table from slide 11.
       Minimum Duration on User: minimum Duration per user

       User     Min Duration
       User001  190
       User005  240
       User003  332
       User002  218

       • Similar for standard deviation and variance
       • Possible to combine multiple aggregations on the same field
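The aggregation slides can be reproduced in one pass with plain Python. This is a minimal sketch using the playback rows from the slides (the Play Time column is omitted and `None` marks the missing Device); the dictionary keys such as `distinct_genre` are names chosen for this example, not BigML output field names.

```python
from collections import defaultdict
from statistics import mean

# (content, genre, duration, user, device); None marks the missing Device.
plays = [
    ("Highway star", "Rock", 190, "User001", "TV"),
    ("Blues alive", "Blues", 281, "User005", "Tablet"),
    ("Lonely planet", "Techno", 332, "User003", "TV"),
    ("Dance, dance", "Disco", 312, "User001", "Tablet"),
    ("The wall", "Reggae", 218, "User002", None),
    ("Offside down", "Techno", 240, "User005", "Tablet"),
    ("The alchemist", "Blues", 418, "User003", "TV"),
    ("Bring me down", "Classic", 328, "User001", "Tablet"),
]

# Group the rows by user, then compute every aggregation per group.
groups = defaultdict(list)
for play in plays:
    groups[play[3]].append(play)

summary = {}
for user, items in groups.items():
    durations = [p[2] for p in items]
    summary[user] = {
        "count": len(items),
        "distinct_genre": len({p[1] for p in items}),
        "missing_device": sum(1 for p in items if p[4] is None),
        "sum": sum(durations),
        "avg": round(mean(durations), 2),
        "max": max(durations),
        "min": min(durations),
    }

for user, stats in summary.items():
    print(user, stats)
```

Computing all aggregations from the same grouping mirrors the slide's note that multiple aggregations can be combined on the same field.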
  • 18. Aggregations
  • 19. Problem #2
       We have this…

       Dataset 1:

       #numeric  #text  #datetime  size     rows    …  time_ms
       34        0      0          46354    1001    …  300
       0         1      0          1515654  373     …  1431
       4         0      1          56789    445423  …  8891
       …         …      …          …        …       …  …
       0         1      0          1515654  373     …  1673

       Dataset 2:

       feature_key  avg_ms
       c3f6c8be4f   300
       a243ca3c38   1552
       14f9d917bc   8891
       …            …

       We want this…

       Dataset:

       #numeric  #text  #datetime  size     rows    …  avg_ms
       34        0      0          46354    1001    …  300
       0         1      0          1515654  373     …  1552
       4         0      1          56789    445423  …  8891
       …         …      …          …        …       …  …
  • 20. Joins
       • Datasets to join need to have a field in common
           • joining sales and demographics on customer_id
           • joining employee and budget details on department_id
       • Datasets to join do not need to have the same dimensions
       • Joins can be performed in several ways
           • Left, Right, Inner, Outer…
  • 21. Left Join
       • In a Left join of dataset A to B:
           • Returns all records from the left A, and the matched records from B
           • The result is NULL from B, if there is no match.

       A:

       _id  field1
       1    34
       2    56
       3    123
       4    56
       5    79

       B (no “3” or “5”):

       _id  field2
       1    red
       2    green
       4    blue
       6    black

       A left join B =

       _id  field1  field2
       1    34      red
       2    56      green
       3    123     null
       4    56      blue
       5    79      null
  • 22. Right Join
       • In a Right join of dataset A to B:
           • Returns all records from the right B, and the matched records from A
           • The result is NULL from A, if there is no match.

       A and B: the same datasets as in slide 21. A has no “6”; “3” is unused.

       A right join B =

       _id  field2  field1
       1    red     34
       2    green   56
       4    blue    56
       6    black   null
  • 23. Inner Join
       • In an Inner join of dataset A to B:
           • Returns only records from the left A that match records from B
           • If there is no match between A and B, the record is ignored

       A and B: the same datasets as in slide 21. In A, “3” and “5” are unused; in B, “6” is unused.

       A inner join B =

       _id  field1  field2
       1    34      red
       2    56      green
       4    56      blue
  • 24. Full Outer Join
       • In a Full join of dataset A to B:
           • Returns all records from the left A and all records from the right B
           • If a record has no match in the other dataset, the fields from that dataset are null

       A and B: the same datasets as in slide 21. A has no “6”; B has no “3” or “5”.

       A full join B =

       _id  field1  field2
       1    34      red
       2    56      green
       3    123     null
       4    56      blue
       5    79      null
       6    null    black
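The four join types can be checked against the slide tables with a small sketch. It assumes each `_id` appears at most once per dataset (so a dict works), uses `None` in place of null, and always returns columns in the order `(_id, field1, field2)`, which differs from the column order shown on the Right Join slide. The function name `join` and the `how` argument are chosen for this example only.

```python
A = {1: 34, 2: 56, 3: 123, 4: 56, 5: 79}           # _id -> field1
B = {1: "red", 2: "green", 4: "blue", 6: "black"}  # _id -> field2


def join(left: dict, right: dict, how: str) -> list:
    """Return (_id, field1, field2) rows; None stands in for null."""
    if how == "left":
        ids = list(left)                      # every _id from A
    elif how == "right":
        ids = list(right)                     # every _id from B
    elif how == "inner":
        ids = [i for i in left if i in right]  # only _ids present in both
    elif how == "full":
        ids = sorted(set(left) | set(right))  # every _id from either side
    else:
        raise ValueError(f"unknown join type: {how}")
    return [(i, left.get(i), right.get(i)) for i in ids]


for how in ("left", "right", "inner", "full"):
    print(how, join(A, B, how))
```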
  • 25. Joins
  • 26. Problem #3
       Left join keeps all records from the left dataset

       #numeric  #text  #datetime  size     rows    …  avg_ms
       34        0      0          46354    1001    …  300
       0         1      0          1515654  373     …  1552
       4         0      1          56789    445423  …  8891
       …         …      …          …        …       …  …
       0         1      0          1515654  373     …  1552

       Rows 2 and 5: same rows.
  • 27. Remove Duplicates
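What removing duplicates does to the joined dataset can be sketched as follows. The rows are the joined rows from Problem #3 with the elided columns left out, and the two repeated rows are taken to be exact duplicates, as the "Same Rows" label indicates. The helper name `remove_duplicates` is hypothetical; keeping the first occurrence is a choice made for this sketch.

```python
def remove_duplicates(rows: list) -> list:
    """Keep the first occurrence of each row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


# Columns: #numeric, #text, #datetime, size, rows, avg_ms
joined = [
    (34, 0, 0, 46354, 1001, 300),
    (0, 1, 0, 1515654, 373, 1552),
    (4, 0, 1, 56789, 445423, 8891),
    (0, 1, 0, 1515654, 373, 1552),
]
deduped = remove_duplicates(joined)
print(len(joined), "->", len(deduped))  # 4 -> 3
```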
  • 28. Using the API
       • The UI has a limited set of data transformations
           • Aggregation (limited), Joins (limited), Remove Duplicates
           • More functions will be added: concat, ordering, multiple group by
       • The API supports nearly full SQL syntax for transforming datasets
           • Nested queries not supported (yet), e.g. subselects
       • Better way to perform the workflow:
           • SELECT 10001a, avg(000019) AS avg_status_elapsed FROM DS GROUP BY 10001a
           • Can perform the entire workflow in one SQL query using multiple “group by”
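The shape of the GROUP BY query on this slide can be tried locally. The sketch below runs an equivalent query in SQLite through Python's `sqlite3` module; it does not call the BigML API. The column names `feature_key` and `status_elapsed` are readable stand-ins for the field IDs `10001a` and `000019`, and that mapping is an assumption based on the `avg_status_elapsed` alias.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ds (feature_key TEXT, status_elapsed INTEGER)")
conn.executemany(
    "INSERT INTO ds VALUES (?, ?)",
    [
        ("c3f6c8be4f", 300),
        ("a243ca3c38", 1431),
        ("14f9d917bc", 8891),
        ("a243ca3c38", 1673),
    ],
)

# One row per feature_key, with the build times averaged.
result = conn.execute(
    "SELECT feature_key, avg(status_elapsed) AS avg_status_elapsed "
    "FROM ds GROUP BY feature_key ORDER BY feature_key"
).fetchall()
conn.close()

for key, avg_ms in result:
    print(key, avg_ms)
```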
  • 29. API Transformations
  • 30. Problem #4
       How do we add new data and retrain?
       • When adding a new batch of data
           • Avoid re-uploading by using a merge
           • Repeat the entire workflow on the merged dataset using Scriptify

       #numeric  #text  #datetime  size     rows  …  time_ms
       12        2      0          74001    200   …  1975
       1         0      1          22673    373   …  1552
       1056      0      1          9231411  4352  …  7675

       #numeric  #text  #datetime  size     rows    …  time_ms
       34        0      0          46354    1001    …  300
       0         1      0          1515654  373     …  1552
       4         0      1          56789    445423  …  8891
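The merge-then-repeat idea can be illustrated with plain Python lists. This sketch assumes the first table on the slide is the new batch, uses the raw rows from Problem #1 as the existing data, and groups directly on the feature tuple instead of a sha1 key. The `workflow` function is a hypothetical stand-in for the scripted workflow, not Scriptify output.

```python
from collections import defaultdict
from statistics import mean


def workflow(rows: list) -> dict:
    """Group rows by their feature columns and average time_ms per group."""
    groups = defaultdict(list)
    for *features, time_ms in rows:
        groups[tuple(features)].append(time_ms)
    return {key: mean(times) for key, times in groups.items()}


# Columns: #numeric, #text, #datetime, size, rows, time_ms
existing = [
    (34, 0, 0, 46354, 1001, 300),
    (0, 1, 0, 1515654, 373, 1431),
    (4, 0, 1, 56789, 445423, 8891),
    (0, 1, 0, 1515654, 373, 1673),
]
new_batch = [
    (12, 2, 0, 74001, 200, 1975),
    (1, 0, 1, 22673, 373, 1552),
    (1056, 0, 1, 9231411, 4352, 7675),
]

merged = existing + new_batch  # merge: append rows that share the same columns
averages = workflow(merged)    # rerun the same steps on the merged data
print(len(averages))                          # 6
print(averages[(0, 1, 0, 1515654, 373)])      # 1552
```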
  • 31. Merging & Scriptify
  • 32. More Info: https://bigml.com/releases/summer-2018