
[DSC Europe 22] Starting deep learning projects without sufficient amount of labeled data – few practical examples - Sergei Smirnov






The ideal flow of an ML project assumes the presence of data labeled for the particular business task. But what if we need to start the project ASAP and don't yet have a relevant amount of such data, or have no annotation budget? This problem especially hurts cases where we plan to use deep learning approaches as the most promising ones. This talk discusses real projects in which we faced this problem and demonstrates a few helpful practical approaches to solving it.






  1. Starting deep learning projects without sufficient amount of labeled data – few practical examples
  2. SERGEI SMIRNOV, Chief Data Scientist II • Working at EPAM for 6+ years • 12+ years of ML/DS experience • Participated in many projects in RecSys, NLP, CV, time series, etc. • Joined EPAM Serbia in 2022 • Responsible for DS/MLE in Serbia/Montenegro/Turkiye • Responsible for developing the DS competency at EPAM globally
  3. Intro • The ideal flow of a deep learning project: • Relevant business understanding • Normal data quality • Presence of labeled data • A representative sample • Good quality of labels • A budget for data labeling/re-labeling if needed
  4. Why this talk is important • The typical set-up in the early stages of a project • Problems with labels • Pre-trained models and leveraging existing solutions • Auto-labeling / label propagation
  5. Why this question is relevant for EPAM • Sometimes you need to persuade a customer to work with you • Start-up mode • A new customer • No labeling budget • A specific case (hard to find datasets/models for it)
  6. Agenda • A few computer vision cases • A sound-based case • A typical NLP case
  7. Darts • Support for a startup • Darts scoring: • Detection of the board • Detection of the sector • Scoring • Challenges with classical CV: • Too many heuristics to use • Edge cases that could make the solution unstable
  8. Board detection using deep learning • No datasets with dartboards • An unstable solution based on circles/lines • Intuition: • Let's label a small sample • Add this data to a pre-trained model • The model can learn new boxes faster than new classes • Real solution (sketched below): • Use a model pre-trained on the COCO dataset • Use the 'clock' entity • Label the few failing cases
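The slide names the trick but not the code. Below is a minimal sketch of the idea using torchvision's COCO-pretrained Faster R-CNN, assuming the standard 91-class COCO mapping where 'clock' has index 85; the deck does not specify which detector was actually used.

```python
import torch
import torchvision

# A detector pre-trained on COCO has no "dartboard" class, but "clock"
# fires reliably on the round, marked board, so we keep only those boxes.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

CLOCK_CLASS_ID = 85  # COCO label id standing in for the dartboard

def detect_board(image, score_threshold=0.5):
    """image: float tensor (3, H, W) in [0, 1]; returns candidate board boxes."""
    with torch.no_grad():
        out = model([image])[0]
    keep = (out["labels"] == CLOCK_CLASS_ID) & (out["scores"] > score_threshold)
    return out["boxes"][keep]
```

The few frames where this fails are exactly the ones worth labeling by hand, per the slide.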
  9. What we did next • Re-trained the model • Minimal correction of results • Trained a lightweight model (SSD) + tracking between frames • 1-2 weeks of work • Result of the project: • A pipeline for board detection based on deep learning • A classical CV algorithm for segmentation
  10. CV segmentation case • A start-up calculating pig weight • No labeled data • Time/budget constraints
  11. How to do pig segmentation • Initial approach (see the sketch below): • Add segmentation and object detection models • There is no 'pig' entity, but the 'person' entity is almost OK • Improve segmentation results with a color space transformation • Re-train the segmentation model (U-Net-like)
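A hedged illustration of the "person as pig" bootstrap plus color-space refinement; the deck gives neither the exact models nor thresholds, so the choice of DeepLabV3 and the HSV rule below are hypothetical.

```python
import cv2
import numpy as np
import torch
import torchvision
from torchvision.transforms.functional import normalize

# The pre-trained segmentation head has no "pig" class, so we take the
# closest one ("person", index 15 in the VOC-style 21-class head) as a
# rough mask, then refine it with a simple HSV saturation rule.
seg_model = torchvision.models.segmentation.deeplabv3_resnet50(weights="DEFAULT").eval()
PERSON_CLASS_ID = 15

def rough_pig_mask(image_bgr):
    rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)
    x = torch.from_numpy(rgb).permute(2, 0, 1).float() / 255.0
    x = normalize(x, mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    with torch.no_grad():
        logits = seg_model(x.unsqueeze(0))["out"][0]
    mask = (logits.argmax(0) == PERSON_CLASS_ID).numpy().astype(np.uint8)
    # Hypothetical color-space refinement: keep low-saturation (pale) pixels.
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    return mask & (hsv[..., 1] < 80).astype(np.uint8)
```

Masks produced this way can then be hand-corrected and used to re-train a U-Net-like model, as the slide describes.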
  12. Results • After label correction we re-trained the segmentation networks • Relevant results (5-7% MAPE on weight) • The whole PoC solution was done in 3 weeks
  13. Audio/Voice case • A huge telecom company • A content distribution platform • Problems: • No metadata for football games, TV shows, etc. • They wanted a content segmentation system • They couldn't provide us a labeled dataset • Chosen use case: The Voice TV show
  14. Why was the project started? • We have the full Voice TV show • Extract fragments of different types: • Musical fragments • Speech fragments • Judges • Fragments with a specific person • Other • We started by solving the 1st and 2nd cases
  15. Proposed ML solution (high-level scheme)
  16. Customer's requirements • The solution should work fast • They can't provide GPU instances • So we decided to use only the audio stream for the ML algorithm • Using the segmentation results we can assemble highlights via time segmentation
  17. Which data we need for our model: for an input audio file, the ground truth is a set of typed intervals, e.g. Music: { interval_1: start = T0, end = T1; interval_2: start = T2, end = T3 }
  18. Where can we get labeled data? • The customer can't provide labeled data and has no labeling budget • We decided to find an open dataset to solve this task • We chose AudioSet for extracting different types of audio fragments to create synthetic samples
  19. Audio processing: audio file → mel spectrogram → DCNN → PCA
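A minimal sketch of the first step of this pipeline with librosa; the sample rate, mel count, and hop length here are illustrative, not the project's values, and the DCNN + PCA stages are omitted.

```python
import librosa
import numpy as np

def log_mel(path, sr=16000, n_mels=64, hop_length=512):
    """Load an audio file and return its log-mel spectrogram (n_mels, n_frames)."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                         hop_length=hop_length)
    return librosa.power_to_db(mel, ref=np.max)
```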
  20. Model – audio features: audio features → GRU → fully-connected layer → CRF layer (with mask)
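A rough PyTorch skeleton of that stack, assuming per-frame features of dimension feat_dim; the CRF layer named on the slide is omitted for brevity (a package such as pytorch-crf could supply it).

```python
import torch
import torch.nn as nn

class AudioSegmenter(nn.Module):
    """Per-frame audio features -> GRU -> fully-connected head -> frame logits."""
    def __init__(self, feat_dim=128, hidden=64):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, x):                    # x: (batch, n_frames, feat_dim)
        h, _ = self.gru(x)
        return self.head(h).squeeze(-1)      # logits: (batch, n_frames)
```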
  21. Model – target: per-frame binary labels aligned with the intervals – 1 inside the music intervals [T0, T1] and [T2, T3], 0 elsewhere
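A hypothetical helper that produces exactly this encoding from interval labels:

```python
import numpy as np

def intervals_to_frame_targets(intervals, n_frames, frame_s):
    """Map [(start_s, end_s), ...] to a per-frame 0/1 target vector."""
    target = np.zeros(n_frames, dtype=np.float32)
    for start_s, end_s in intervals:
        lo = int(start_s / frame_s)
        hi = min(n_frames, int(np.ceil(end_s / frame_s)))
        target[lo:hi] = 1.0
    return target

# e.g. music at 0-12 s and 30-45 s in a 60 s clip, with 0.5 s frames:
t = intervals_to_frame_targets([(0, 12), (30, 45)], n_frames=120, frame_s=0.5)
```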
  22. Generation of labeled data: a pool of positive samples and a pool of negative samples feed a synthetic sample generator, which outputs the final data sample together with its start/finish times
  23. More details about labeled-data generation: random crops are taken around the left and right borders of the positive segment to produce the final data sample with its start/finish times (a generator sketch follows below)
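A hedged sketch of such a generator, assuming the pools are lists of 1-D waveform arrays at a fixed sample rate; crop lengths are illustrative.

```python
import random
import numpy as np

def make_sample(pos_pool, neg_pool, sr=16000, max_crop_s=10.0):
    """Concatenate negative / positive / negative crops; return audio + interval."""
    def crop(clip):
        n = random.randint(sr, int(max_crop_s * sr))       # 1 s .. max_crop_s
        start = random.randint(0, max(0, len(clip) - n))
        return clip[start:start + n]
    left = crop(random.choice(neg_pool))
    pos = crop(random.choice(pos_pool))                    # the music segment
    right = crop(random.choice(neg_pool))
    audio = np.concatenate([left, pos, right])
    start_s = len(left) / sr                               # ground-truth interval
    end_s = (len(left) + len(pos)) / sr
    return audio, (start_s, end_s)
```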
  24. First results
  25. Generating hard negatives • Run the first model on real Voice TV show samples • Manually analyze the segmentation results • Add the hard negatives to the model's training data
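In pseudocode form, the mining loop might look like this; predict_intervals, labeled_intervals, overlaps, and extract_segment are assumed helpers, not real APIs.

```python
def mine_hard_negatives(model, real_clips, neg_pool):
    """Collect false positives from real footage as hard negatives."""
    for clip in real_clips:
        for seg in predict_intervals(model, clip):            # model says "music"
            if not any(overlaps(seg, gt) for gt in labeled_intervals(clip)):
                neg_pool.append(extract_segment(clip, seg))   # false positive
    return neg_pool
```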
  26. Results on the hold-out set: Average Precision = 0.96
  27. Outcome • We delivered a relevant segmentation model to the customer • We earned the customer's trust in our ML capabilities • After this project we launched a new innovation program with the customer
  28. Financial service entity extraction – target structure: { key_1: value_1, …, key_last: value_last, key_lineItem_1: lineItem_1, …, key_lineItem_last: lineItem_last }
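A hypothetical instance of that target structure for an invoice-like document; the field names and values are invented for illustration.

```python
# Flat header keys plus repeated line-item keys, as on the slide.
extraction = {
    "invoice_number": "INV-0042",          # key_1 : value_1
    "invoice_date": "2022-11-18",
    "total_amount": "1,250.00",            # key_last : value_last
    "lineItem_1": {"description": "Consulting", "amount": "1,000.00"},
    "lineItem_2": {"description": "Support", "amount": "250.00"},
}
```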
  29. A few words about the project • Customer: a company that develops software for the financial sector • Initial set-up: • The customer wanted to replace a commercial solution • The customer wanted to build a unified pipeline for many types of documents • This pipeline should be customizable • We had no budget for data labeling • We had to work with an engineering team on the customer's side
  30. Data flow
  31. Overview of model – high level
  32. Overview of model – feature extraction
  33. Overview of model – target
  34. Auto-labeling
  35. Iterative pipeline for model training: documents without labels → labeled dataset v1, mined by fuzzy matching with high thresholds → train the model → labeled dataset v2, mined by fuzzy matching with lower thresholds plus model results → pattern extraction model → labeled dataset v3 → clustering + heuristics → final version of the labeled dataset
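A minimal sketch of the fuzzy-matching step behind the v1/v2 mining, using the rapidfuzz package; the tokenization and thresholds are illustrative, not the project's values.

```python
from rapidfuzz import fuzz

def find_value(ocr_tokens, value, threshold=95):
    """Locate a known field value in OCRed tokens; return its token span or None."""
    n = max(1, len(value.split()))
    best_span, best_score = None, 0
    for i in range(len(ocr_tokens) - n + 1):
        candidate = " ".join(ocr_tokens[i:i + n])
        score = fuzz.ratio(candidate, value)
        if score > best_score:
            best_span, best_score = (i, i + n), score
    return best_span if best_score >= threshold else None
```

With a high threshold this yields a small but clean v1 dataset; lowering the threshold and combining matches with model predictions grows the v2 dataset, as in the pipeline above.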
  36. Modeling pipeline and inference • The initial model gave 70+% automation for the main use case • The new solution replaced the commercial one in the customer's flow • After a few iterations of improvement using the established labeling flow: • Achieved 85+% automation • More sophisticated cases were solved • More modern models were developed on top of the established labeling pipeline (both manual and auto-labeling)
  37. Labeling propagation • To save domain experts' time on manual labeling we apply label propagation (see the sketch below) • After collecting a batch of files that we inferenced incorrectly, we cluster them using HDBSCAN and pick a few individual samples from each cluster • We give these samples to experts for manual labeling of entities • After that we use rule-based approaches to find entities in the same place in the other documents of each cluster [slide shows a 3D clustering visualisation]
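A hedged sketch of the clustering step with the hdbscan package; the document featurizer producing the embeddings and the min_cluster_size value are assumptions, not from the deck.

```python
import numpy as np
import hdbscan

def pick_representatives(embeddings, min_cluster_size=5):
    """embeddings: (n_docs, d) array; returns cluster labels + one doc per cluster."""
    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit(embeddings)
    reps = {}
    for label in set(clusterer.labels_):
        if label == -1:                          # noise points: no representative
            continue
        members = np.where(clusterer.labels_ == label)[0]
        centroid = embeddings[members].mean(axis=0)
        dists = np.linalg.norm(embeddings[members] - centroid, axis=1)
        reps[label] = members[np.argmin(dists)]  # medoid-like pick for the expert
    return clusterer.labels_, reps
```

The expert labels only the representatives; rule-based matching then copies those labels to the remaining members of each cluster.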
  38. Active learning pipeline: new documents without labels are OCRed and passed through the pattern extraction model and the current model; predictions that fail automatic validation go to manual validation and become an increment for the labeled set; the updated labeled dataset feeds the training pipeline, which checks the new model against ground-truth data and updates the production model only if it works better
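Condensed into pseudocode under the stated assumptions; every callable here is a hypothetical parameter, not a real API.

```python
def active_learning_round(model, new_docs, labeled_set,
                          auto_validate, label_manually, train, score):
    """One round: label only failed predictions, retrain, promote if better."""
    preds = [(doc, model.predict(doc)) for doc in new_docs]
    failed = [doc for doc, pred in preds if not auto_validate(doc, pred)]
    labeled_set = labeled_set + [label_manually(doc) for doc in failed]
    candidate = train(labeled_set)
    # Promote the candidate only if it beats the current model on ground truth.
    better = score(candidate) > score(model)
    return (candidate if better else model), labeled_set
```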
  39. Summary • Not having a labeling budget is not the end of the world • Even if your business task is not typical, you can try to re-use a pre-trained model • Auto-labeling • Label propagation • Class similarity • A combination of classical CV + heuristics + ML iterations can bring results • Of course, for a stable system it is important to have good-quality labels
  40. Contacts • E-mail: sergei_smirnov1@epam.com • Telegram: +79215801652

Editor's notes

  • Split into 2 slides – the first with the ideal flow, the second with the problematic case
  • Create a separate slide with clock = board; ask the audience a rhetorical question
