
How to get into Kaggle? by Philipp Singer and Dmitry Gordeev


Kaggle is one of the largest online communities for data scientists, best known for its competitions in which participants solve data science challenges. It has a long history of competitions in areas such as medicine, finance, scientific research, and sports, covering data types and prediction problems that include tabular data, time series, NLP, and computer vision.



  1. How to get into Kaggle?
      Philipp Singer & Dmitry Gordeev
      Vienna Data Science Meetup, Vienna, Dec 5th 2019
  2. Who we are
      ● Philipp
          ○ Data scientist at UNIQA
          ○ PhD in CS at TU Graz
          ○ Extensive experience in ML research and applications
          ○ Kaggle competition master, currently ranked 36th
      ● Dmitry
          ○ Data scientist at UNIQA
          ○ Master’s degree in data mining
          ○ In-depth experience with ML applications in financial institutions
          ○ Kaggle competition grandmaster, currently ranked 34th
      ● Competing successfully together on Kaggle for 1 year as The Zoo
  3. What is Kaggle?
      ● “Your home for Data Science”
          ○ Online community of data scientists and machine learners
          ○ Founded in 2010
          ○ Acquired by Google in 2017
      ● Data science competitions
      ● Share notebooks, datasets, and discussions
      ● Courses and tutorials
      ● Free notebook infrastructure with CPUs and GPUs
  4. How big is Kaggle
      ● The most popular ML competition platform
      ● The largest ML community
      ● 125 000+ users
      ● 350 completed competitions
      ● Up to 10 000 users per competition
      ● Usually $20,000 - $100,000 prize fund
  5. Kaggle survey results
  6. Kaggle survey results
  7. Kaggle survey results
  8. Kaggle survey results
  9. Competitions on Kaggle
      ● Usually hosted by companies or research institutes
      ● Main goal: prediction
      ● Wide range of different types of competitions
          ○ Different types of domains (e.g., financial, medical, sports, …)
          ○ Different types of data (e.g., tabular, NLP, image, video, time series, …)
          ○ Different types of objectives (e.g., classification, regression, segmentation, …)
          ○ Different goals of competitions (featured, research, playground, in-class)
      ● Built-in progression system with medals and ranks
      ● Top spots usually receive prize money
  10. Competition medals
  11. User ranking + titles
  12. How competitions usually work (source: https://mc.ai/pseudo-labeling/)
  13. The Zoo
      ● Started competing under the team name “The Zoo” exactly one year ago
      ● Little prior experience on Kaggle
      ● Participated in 7 competitions
      ● Strategy: diversify types of competitions for learning purposes
  14. Our Journey
  15. Quora
      Develop models that identify and flag insincere questions.
      ● 1 306 122 labelled questions
      ● 6.2% insincere questions
      ● 4 037 teams
      ● 2 hours to fit and predict
  16. Quora - sincere/insincere
      ● How can I become a data scientist?
      ● How come Trump is so stupid?
      ● Is it possible for a vegan who does crossfit to go 10 minutes without telling someone about it?
      ● Every time I slap myself in the face, it hurts. How can I prevent this?
  17. Quora - solution
  18. Quora - final standings
  19. Santander
      Identify which customers will make a specific transaction in the future.
      ● 200 000 transactions
      ● 8 802 teams
      ● 2 months duration
  20. Santander - the mysterious data
  21. Santander - solution
  22. Santander - final standings
  23. LANL Earthquake Prediction
      Predict the time remaining before laboratory earthquakes occur from real-time seismic data.
      ● 629 145 480 data points
      ● 4 200 training segments
      ● 4 540 teams
      ● 30 minutes to fit and predict
  24. LANL - the physics
  25. LANL - solution
      ● Derived a handful of features from the data, capturing peaks and volatility of the acoustic signal
      ● Combination (ensemble) of two state-of-the-art modeling approaches
          ○ Gradient Boosting Regression Trees
          ○ Neural Network (Deep Learning)
      ● Novel statistical data adjustment to account for different earthquake cycles
  26. LANL - final standings
  27. APTOS Blindness Detection
      Detect diabetic retinopathy to stop blindness before it's too late!
      ● 3 662 retina images
      ● 0 - 4 retinopathy levels
      ● 2 943 teams
      ● 15 000 evaluation images
      Diabetic retinopathy is the leading cause of blindness in the working-age population of the developed world. It is estimated to affect over 93 million people.
  28. APTOS (sources: https://www.eyeops.com/contents/our-services/eye-diseases/diabetic-retinopathy; https://www.vequill.com/how-to-cure-temporary-blindness/)
  29. APTOS - solution
      ● Careful image pre-processing to remove any kind of bias (e.g., device)
      ● Combination of several of the current best deep neural networks
      ● Models are pre-trained on a large collection of image data (ImageNet + extra retina images)
  30. APTOS - final standings
  31. Quiz
      ● Did I have relevant experience to enter this competition?
      ● Data: Atomic elements (H for hydrogen, C for carbon, etc.) and their X, Y, Z Cartesian coordinates.
      ● Task: Develop an algorithm that can predict the magnetic interaction between two atoms in a molecule.
  32. Why should you start on Kaggle?
      ● Doing is the best way to learn
      ● Get in touch with data and use cases outside your main domain
      ● Keep up-to-date with state-of-the-art methods
      ● Learn from others
      ● Measure yourself and know where you stand
      ● Hardware and software are provided by Kaggle
  33. Easy start
  34. How can you start on Kaggle?
      ● Don’t be afraid! Just do it!
      ● Overcome self-handicapping behavior
      ● You gain points regardless of the result
      ● “Getting started” competitions
      ● Pick a competition that sounds exciting to you; don’t be afraid to pick one where you have no prior experience
      ● Research similar previous competitions and read solutions
      ● Follow published notebooks and discussions
  35. Learn from the community
  36. How to approach a competition?
      ● Choose a programming language (usually Python or R)
      ● Understand the problem setting, get a feeling for the data and the metric
      ● Exploratory Data Analysis (EDA)
      ● Implement a basic script / notebook from scratch doing training and prediction OR just fork someone’s model ;-)
      ● Think hard about a robust CV setup
      ● Keep up-to-date on discussions and developments in the competition
      ● Experiment a lot and iterate quickly
  37. Try more, fail fast (figure labels: Baseline model, Final model)
  38. Thanks!
      Get in touch with us! We are open to any inquiries.
      ● me@philippsinger.com, @ph_singer
      ● dott1718@gmail.com, @dott1718
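
Note on slide 25 (LANL - solution): the slide mentions a handful of features capturing peaks and volatility of the acoustic signal, and an ensemble of two models. The sketch below is only an illustration of that idea, not the authors' actual pipeline. The specific statistics (`peak_abs`, `std`, `mean_abs_change`), the function names, and the equal-weight average are assumptions; the slide does not say which features were used or how the two models were combined.

```python
import statistics


def segment_features(signal):
    """Summarise one acoustic segment with simple peak and volatility statistics.

    Hypothetical feature set chosen for illustration only.
    """
    abs_values = [abs(x) for x in signal]
    # Differences between consecutive samples, a crude measure of how jumpy the signal is.
    changes = [abs(b - a) for a, b in zip(signal, signal[1:])]
    return {
        "peak_abs": max(abs_values),                   # largest excursion (peak)
        "std": statistics.pstdev(signal),              # overall volatility
        "mean_abs_change": statistics.fmean(changes),  # short-term volatility
    }


def blend(pred_a, pred_b, weight_a=0.5):
    """Weighted average of two models' predictions (assumed blending scheme)."""
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(pred_a, pred_b)]
```

In use, each training segment would be reduced to one feature row with `segment_features`, two separate models (for example a gradient boosting model and a neural network) would be fitted on those rows, and `blend` would average their predictions of the time to failure.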

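Note on slide 36 (How to approach a competition?): the advice to think hard about a robust CV setup can be made concrete with a small k-fold cross-validation loop. This is a minimal sketch using only the Python standard library. The helper names (`k_fold_indices`, `cross_validate`, `fit_mean`, `predict_mean`, `mae`) are hypothetical, and the mean-predicting baseline merely stands in for a real model; the right split scheme (grouped, stratified, time-based) depends on the competition.

```python
import random
import statistics


def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, valid_indices) pairs for shuffled k-fold CV."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the folds reproducible
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, valid in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, valid


def cross_validate(xs, ys, fit, predict, metric, n_folds=5, seed=0):
    """Return the mean validation score and the per-fold scores."""
    scores = []
    for train, valid in k_fold_indices(len(xs), n_folds, seed):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        preds = predict(model, [xs[i] for i in valid])
        scores.append(metric([ys[i] for i in valid], preds))
    return statistics.fmean(scores), scores


# Toy baseline: always predict the mean of the training targets.
def fit_mean(xs, ys):
    return statistics.fmean(ys)


def predict_mean(model, xs):
    return [model] * len(xs)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return statistics.fmean([abs(t - p) for t, p in zip(y_true, y_pred)])
```

Calling `cross_validate(xs, ys, fit_mean, predict_mean, mae)` gives a baseline score to beat. Keeping the same folds (same `seed`) across experiments makes score changes comparable from one iteration to the next.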