DATA QUALITY IS MORE IMPORTANT THAN YOU THINK
Amine BENDAHMANE
#DevFest Algiers 2019
Who am I?
• PhD candidate in Artificial Intelligence and Robotics (Computer Vision, Swarm Optimization, Path Planning, Deep Learning, Reinforcement Learning)
• Master's degree in Machine Learning & Pattern Recognition
• Freelance: Web developer, ML engineer
• Part-time teacher
Machine Learning for research purposes
Solve a problem
Bring up new ideas
Create new models & algorithms
Adapt existing approaches to
new problems
Improve existing solutions
Change mathematical equations
Analyze different factors
Identify correlations
Machine Learning process
In real-world projects
• No data
• Not enough data
• Bad-quality data
• Biased data
Data Engineering is harder than we think
Tips & tricks I learned the hard way
Here are the lessons learned from these 4 projects:
1. Facial Expressions Recognition
2. Image Generation
3. Vehicle License Plate Recognition
4. Robotics Path Planning
Project 1: Facial Expressions recognition
• 2016
• Nextremer Co. (Tokyo)
• AI engineering intern
• Deep Learning
Project 1: Facial Expressions recognition
AI Samurai project (Nextremer Co.)
• To be deployed on a robot that runs on a Raspberry Pi
• The Raspberry Pi also handles speech and motion (head, arms)
• No TensorFlow Lite at the time
=> need a very small model (memory, CPU load)
Project 1: Facial Expressions recognition
• Fer2013 dataset
35,000 images (48x48 px)
7 categories
Face landmarks
Code & results available at: https://github.com/amineHorseman/facial-expression-recognition-using-cnn
Project 1: Facial Expressions recognition
• Experiment 1
20,000 images
5 expressions
■ CNN
■ CNN + Face landmarks
■ CNN + Face landmarks + HOG + sliding window
Accuracies shown on the chart: 75.1%, 74.4%, 73.5%
• Couldn’t get better results, despite trying:
Dropout
Regularization
ReLU, Leaky ReLU…
Batch Normalization
Hyperparameter optimization
• Experiment 2
35,000 images
7 expressions
■ SVM: 50.50%
■ Our best model: 61.40%
■ State of the art (8 CNNs): 75.20%
• Human accuracy: ~65%
Project 1: Facial Expressions recognition
• Problems found in Fer2013:
(a) Incorrect labels
(b) Partially hidden faces
(c) Cartoon faces
(d) Black or empty images
• Human accuracy: ~65%
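Issue (d) is the easiest one to screen for automatically. The sketch below assumes each image is available as a space-separated string of grayscale pixel values (the layout of the public fer2013.csv file); the function names and the `min_std` threshold are illustrative assumptions, not the code used in the project.

```python
from statistics import pstdev


def is_blank(pixel_string, min_std=5.0):
    """Return True when the image is empty or almost uniform (e.g. all black)."""
    pixels = [int(p) for p in pixel_string.split()]
    # No pixels at all, or almost no contrast between pixels
    return len(pixels) == 0 or pstdev(pixels) < min_std


def drop_blank_images(rows, min_std=5.0):
    """Keep only the (label, pixel_string) rows whose image has some contrast."""
    return [(label, px) for label, px in rows if not is_blank(px, min_std)]


# Tiny hypothetical sample: one black image, one gradient image
rows = [
    (0, " ".join(["0"] * 16)),
    (3, " ".join(str(v) for v in range(0, 160, 10))),
]
clean = drop_blank_images(rows)
```

A contrast check like this only catches black or empty images; incorrect labels, hidden faces, and cartoon faces still need manual review or a separate detector.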
Project 1: Facial Expressions recognition
By correcting the labels, we can reach up to 88% accuracy!
Project 2: Image Generation
• 2016
• Nextremer Co. (Tokyo)
• Generating fake car images using DC-GAN
• No suitable dataset available for commercial use
• No transfer learning
=> No choice but to create our own dataset!
Project 2: Images Generation
• Write scripts to:
 Collect images: from internet using google & flikrAPIs
 Transform the data: resizing, cropping, converting format
 Reorganize the dataset: Renaming data, classify in folders, generating labels…
 Code available at: https://github.com/amineHorseman/images-web-crawler
• Collecting 20.000 car images of 31 car models (~700 per model)
• Cleaning the data manually
For 5 seconds per image it would take 27 hours!
• The whole process of dataset creating took 4 week
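The manual cleaning estimate is simple arithmetic, and it is worth running before committing to a dataset size. A minimal sketch, where the function name and the 8-hour working day are assumptions added for illustration:

```python
def manual_review_hours(n_images, seconds_per_image):
    """Total time, in hours, needed to look at every image once."""
    return n_images * seconds_per_image / 3600


hours = manual_review_hours(20_000, 5)   # figures from the slide
working_days = hours / 8                 # assuming 8 hours of review per day
print(f"{hours:.1f} hours, about {working_days:.1f} working days")
```

The estimate covers a single pass only; re-checking doubtful images adds to it.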
Project 2: Image Generation
• Training with 4,000 images
• Training with 20,000 images
• Training with 200,000 images (redundant images, non-cleaned dataset): 10x bigger, bad results
Training time: 1 week
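Redundant images are one part of the cleaning that can be automated. The sketch below removes byte-identical copies by hashing file contents; it is an illustration under that assumption, not the script used in the project, and the `deduplicate` name and in-memory `(name, bytes)` input are hypothetical.

```python
import hashlib


def deduplicate(images):
    """images: iterable of (name, raw_bytes).

    Returns the names of the first copy of each distinct file content.
    """
    seen = set()
    unique = []
    for name, data in images:
        digest = hashlib.sha256(data).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(name)
    return unique


# Hypothetical sample: b.jpg is an exact copy of a.jpg
sample = [("a.jpg", b"\x01\x02"), ("b.jpg", b"\x01\x02"), ("c.jpg", b"\x03")]
kept = deduplicate(sample)
```

Exact hashing misses resized or re-encoded copies of the same picture; catching those would need a perceptual hash or a feature-based similarity check.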
Project 3: Vehicle License Plate Recognition
• 2018
• Mostaganem
• Detect and localize plates
• Recognize the license plate number
Project 3: Vehicle License Plate Recognition
For plate detection and localization:
• Collecting a dataset of 300 images from the internet
• Using data augmentation to generate a bigger dataset
• Applying transfer learning to YOLOv3 and training
For serial number recognition:
• Creating a dataset of 2,000 numbers from vehicle license plates
• Applying transfer learning to a model pretrained on MNIST
• Segmenting the number into separate digits and predicting each one
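The data augmentation step can be sketched with two photometric transforms. This is an illustration only: the slides do not say which augmentations were used, and the function names, the brightness factors, and the nested-list image format are assumptions.

```python
def to_grayscale(image):
    """image: rows of (r, g, b) tuples in 0-255. Uses the ITU-R BT.601 luma weights."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in image]


def scale_brightness(image, factor):
    """Multiply every channel by factor, clipping the result at 255."""
    return [[tuple(min(255, round(c * factor)) for c in px) for px in row]
            for row in image]


def augment(image, factors=(0.5, 1.5)):
    """Return the original image plus one darker and one brighter variant."""
    return [image] + [scale_brightness(image, f) for f in factors]


# Hypothetical 1x2 RGB image
image = [[(100, 100, 100), (200, 50, 0)]]
variants = augment(image)
gray = to_grayscale(image)
```

Photometric changes leave the geometry untouched, so the plate bounding boxes of the original image can be reused for every variant.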
Project 3: Vehicle License Plate Recognition
When moving to production:
• The client used a surveillance camera mounted overhead at an inclined angle
• The camera switches to B&W at night (CCTV)
• In the morning the sun faces the camera, so everything goes black (backlight)
• The serial numbers come in different fonts and formats (different separators)
• The numbers dataset I created was biased (too many 2s and 7s, too few 5s and 8s)
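The digit bias in the last point can be detected before training by counting how often each digit appears. A minimal sketch, assuming the labels are available as plate strings; the function names, the uniform baseline, and the `tolerance` value are illustrative assumptions (real plate numbers are not necessarily uniform).

```python
from collections import Counter

DIGITS = "0123456789"


def digit_distribution(plate_numbers):
    """Share of each digit 0-9 across all plate strings (separators are ignored)."""
    counts = Counter(ch for number in plate_numbers for ch in number if ch.isdigit())
    total = sum(counts.values())
    if total == 0:
        return {d: 0.0 for d in DIGITS}
    return {d: counts.get(d, 0) / total for d in DIGITS}


def underrepresented(distribution, tolerance=0.5):
    """Digits whose share is below tolerance times the uniform share (10%)."""
    expected = 1 / len(distribution)
    return sorted(d for d, share in distribution.items() if share < expected * tolerance)


# Hypothetical labels with different separators
plates = ["22-77-27", "72 272", "12702"]
shares = digit_distribution(plates)
low = underrepresented(shares)
```

Digits flagged this way are candidates for collecting more samples or for targeted augmentation.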
Project 3: Vehicle License Plate Recognition
Other considerations during deployment:
• The client used a dual-core CPU! (predictions take 5x longer)
• Every time we retrain the model, we have to travel to the client’s office to deploy it (the site is in the mountains and has no internet access)
Clients don’t understand what an error rate means:
• 5% errors => 5 errors per 100 records.
• With 2,000 records a day, that is 100 errors that need to be corrected manually!
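Translating the error rate into a daily workload makes the point concrete for a client. A minimal sketch; the function name and the 30-seconds-per-correction figure are assumptions added for illustration.

```python
def expected_manual_corrections(records_per_day, error_rate):
    """Expected number of records per day that a person must fix by hand."""
    return records_per_day * error_rate


corrections = expected_manual_corrections(2000, 0.05)   # figures from the slide
minutes = corrections * 30 / 60                         # assuming 30 s per correction
print(f"{corrections:.0f} corrections per day, about {minutes:.0f} minutes of work")
```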
Project 4: Robot path planning
• Everything works well in simulation (ROS + Gazebo)
• But in the real experiments, the robots don’t behave as expected!
• It turns out that the laser and sonar sensors often return zero values (noise)
• Those noisy values affect the training
• We need to explicitly filter out those false readings before using ML models (figure out a method to automatically filter unwanted values)
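One simple way to filter such readings is to mask everything outside a plausible range before it reaches the model. This is a sketch under assumptions, not the method used in the project: the function name and the `min_valid` threshold are hypothetical and would depend on the sensor.

```python
import math


def mask_invalid_ranges(readings, min_valid=0.05, max_valid=math.inf):
    """Replace zero, NaN, infinite, or out-of-range distances with None.

    Positions are preserved because each index corresponds to a beam angle.
    """
    cleaned = []
    for value in readings:
        is_valid = math.isfinite(value) and min_valid <= value <= max_valid
        cleaned.append(value if is_valid else None)
    return cleaned


# Hypothetical scan in meters: zeros and NaN are treated as noise
scan = [0.0, 1.2, 0.0, 3.4, float("nan")]
cleaned = mask_invalid_ranges(scan)
```

Masked positions can then be dropped, interpolated from neighbors, or filled with the previous valid reading, depending on what the downstream model expects.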
Summary
• Data quality is more important than we think
• Before trying to optimize your model, check how good your data is
• In commercial projects, data is often not available
• Creating a dataset is a tedious and time-consuming task
• A clean dataset may be better than a 10x larger raw dataset
• The data we get in production may not be the same as the data used for training
• Pay extra attention to detecting bias in your data
THANK YOU
