Center for Data Science
Paris-Saclay
CNRS & University Paris Saclay

BALÁZS KÉGL
WHAT IS WRONG WITH DATA CHALLENGES
THE HIGGSML STORY: THE GOOD, THE BAD AND THE UGLY

Why am I so critical?
Why do I downplay our own success with the HiggsML?

Because I believe that there is enormous potential in open innovation/crowdsourcing in science.
The current data challenge format is a single point in the landscape.

INTERMEDIARIES: THE GROWING INTEREST IN « CROWDS » -> EXPLOSION OF TOOLS
(Olga Kokshagina, 2015)
• Crowdsourcing is a model that leverages novel technologies (web 2.0, mobile apps, social networks) to build content and a structured set of information by gathering contributions from large groups of individuals

CROWDSOURCING ANNOTATION
CROWDSOURCING COLLECTION AND ANNOTATION
CROWDSOURCING MATH
CROWDSOURCING ANALYTICS
OPEN SOURCE
NEW PUBLICATION MODELS
THE BOOK TO READ

OUTLINE
• Summary of our conclusions after the HiggsML challenge
• The good, the bad and the ugly
• Elaborating on some of the points
• Rapid Analytics and Model Prototyping
• an experimental format we have been developing

CIML WORKSHOP TOMORROW

THE GOOD
• Publicity, awareness
• both in physics (about the technology) and in ML (about the problem)
• Triggering open data: http://opendata.cern.ch/collection/ATLAS-Higgs-Challenge-2014
• Learning a lot from Gábor on how to win a challenge
• Gábor getting hired by Google DeepMind
• Benchmarking
• Tool dissemination (xgboost, keras)

THE BAD
• No direct access to code
• No direct access to data scientists
• No fundamentally new ideas
• No incentive to collaborate

THE UGLY
• 18 months to prepare
• legal issues, access to data
• problem formulation: intellectually way more interesting than the challenge itself, but difficult to “market” or to crowdsource
• once a problem is formalized/formatted as a challenge, the problem is solved (“learning is easy” - Gaël Varoquaux)

THE UGLY
• We asked the wrong question, on purpose!
• because the right questions are complex and don’t fit the challenge setup
• asking them would have led to way less participation
• asking them would have led to bitterness among the participants, bad (?) for marketing

PUBLICITY, AWARENESS
• The HiggsML challenge on Kaggle: https://www.kaggle.com/c/higgs-boson
[Figure: slide “CLASSIFICATION FOR DISCOVERY” from “Learning to discover”, B. Kégl / AppStat@LAL]

AWARENESS DYNAMICS
• HEPML workshop @NIPS14
• JMLR WS proceedings: http://jmlr.csail.mit.edu/proceedings/papers/v42
• CERN Open Data: http://opendata.cern.ch/collection/ATLAS-Higgs-Challenge-2014
• DataScience@LHC: http://indico.cern.ch/event/395374/
• Flavours of Physics challenge: https://www.kaggle.com/c/flavours-of-physics

LEARNING FROM THE WINNER
https://indico.lal.in2p3.fr/event/2692/contribution/1/material/slides/0.pdf
• Sophisticated cross validation, CV bagging
• Sophisticated calibration and model averaging
• The first step: pro participants check whether the effort is worthwhile (risk assessment)
• variance estimate of the score
• Don’t use the public leaderboard score for model selection
• None of Gábor’s 200 out-of-the-ordinary ideas worked

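The "CV bagging" and "variance estimate of the score" points can be illustrated with a small sketch. It is not the winner's actual pipeline: the 1-D centroid "model", the synthetic data, and all function names (`fit_centroids`, `predict_score`, `cv_bagging`, `bagged_score`) are assumptions made for illustration. The idea shown is only the general one: train one model per cross-validation fold, use the spread of the held-out scores as a variance estimate, and average the fold models' outputs for the final prediction.

```python
import random
from statistics import mean, stdev


def fit_centroids(xs, ys):
    """Toy 1-D 'model' (an assumption for illustration): one centroid per class."""
    c0 = mean([x for x, y in zip(xs, ys) if y == 0])
    c1 = mean([x for x, y in zip(xs, ys) if y == 1])
    return c0, c1


def predict_score(model, x):
    """Score in [0, 1]; above 0.5 means x is closer to the class-1 centroid."""
    c0, c1 = model
    d0, d1 = abs(x - c0), abs(x - c1)
    return d0 / (d0 + d1) if d0 + d1 > 0 else 0.5


def cv_bagging(xs, ys, k=5, seed=0):
    """Train one model per fold; return the models and their held-out accuracies."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    models, fold_acc = [], []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = fit_centroids([xs[i] for i in train], [ys[i] for i in train])
        hits = [(predict_score(model, xs[i]) > 0.5) == (ys[i] == 1) for i in fold]
        models.append(model)
        fold_acc.append(sum(hits) / len(hits))
    return models, fold_acc


def bagged_score(models, x):
    """Average the fold models' scores (the 'bagging' part)."""
    return mean([predict_score(m, x) for m in models])


# Synthetic two-class data (hypothetical): class 0 around 0, class 1 around 2.
rng = random.Random(1)
xs = [rng.gauss(0.0, 1.0) for _ in range(100)] + [rng.gauss(2.0, 1.0) for _ in range(100)]
ys = [0] * 100 + [1] * 100

models, fold_acc = cv_bagging(xs, ys)
print("CV accuracy: %.3f +/- %.3f" % (mean(fold_acc), stdev(fold_acc)))
print("bagged score at x = 3:", round(bagged_score(models, 3.0), 3))
```

The fold-to-fold standard deviation gives a rough idea of how much of a leaderboard difference is noise, which is one reason not to select models on the public leaderboard score.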
BENCHMARKING
[Figure: slide “CLASSIFICATION FOR DISCOVERY”]
But what score did we optimize? And why?

CLASSIFICATION FOR DISCOVERY
Goal: optimize the expected discovery significance
[Figure: signal and background against the selection threshold, shown as probability and as count (per year); labels: flux × time, selection]
• expected background: say, b = 100 events
• total count: say, 150 events
• excess is s = 50 events
• AMS = $s/\sqrt{b} = 50/\sqrt{100}$ = 5 sigma
[Clipped paper excerpt shown on the slide; only the equations are recoverable:]
$\mathrm{AMS}_2 = \sqrt{2\left((s + b)\ln\left(1 + \frac{s}{b}\right) - s\right)}$ (14)
Using $\ln(x + 1) = x - x^2/2 + O(x^3)$, $\mathrm{AMS}_2$ reduces at leading order in $s/b$ to
$\mathrm{AMS}_3 = \frac{s}{\sqrt{b}}$ (15)
The two are essentially indistinguishable when $b \gg s$.

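As a quick numerical check of the slide's example, here is a minimal sketch of the two significance formulas: the full approximate median significance, sqrt(2((s + b) ln(1 + s/b) - s)), and its leading-order form s/sqrt(b). The function names `ams2` and `ams3` are chosen here to mirror the equation labels and are not from any library.

```python
import math


def ams2(s, b):
    """Approximate median significance: sqrt(2((s + b) ln(1 + s/b) - s))."""
    if s < 0 or b <= 0:
        raise ValueError("need s >= 0 and b > 0")
    return math.sqrt(2.0 * ((s + b) * math.log1p(s / b) - s))


def ams3(s, b):
    """Leading-order approximation: s / sqrt(b)."""
    if b <= 0:
        raise ValueError("need b > 0")
    return s / math.sqrt(b)


# Worked example from the slide: b = 100 expected background events, s = 50 excess events.
print(ams3(50, 100))            # 5.0
print(round(ams2(50, 100), 2))  # 4.65

# The two formulas agree closely once b is much larger than s.
print(ams2(5, 10000), ams3(5, 10000))
```

With s/b = 0.5 the two values already differ noticeably (5.0 versus about 4.65), so the simple s/sqrt(b) reading is only a good surrogate in the b much larger than s regime.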
How to handle systematic (model) uncertainties?
• OK, so let’s design an objective function that can take background systematics into consideration
• Likelihood with unknown background $b \sim N(\mu_b, \sigma_b)$
$L(\mu_s, \mu_b) = P(n, b \mid \mu_s, \mu_b, \sigma_b) = \frac{(\mu_s + \mu_b)^n}{n!} e^{-(\mu_s + \mu_b)} \, \frac{1}{\sqrt{2\pi}\,\sigma_b} e^{-(b - \mu_b)^2 / (2\sigma_b^2)}$
• Profile likelihood ratio $\lambda(0) = \frac{L(0, \hat{\hat{\mu}}_b)}{L(\hat{\mu}_s, \hat{\mu}_b)}$
• The new Approximate Median Significance (by Glen Cowan)
$\mathrm{AMS} = \sqrt{2\left((s + b)\ln\frac{s + b}{b_0} - s - b + b_0\right) + \frac{(b - b_0)^2}{\sigma_b^2}}$
where
$b_0 = \frac{1}{2}\left(b - \sigma_b^2 + \sqrt{(b - \sigma_b^2)^2 + 4(s + b)\sigma_b^2}\right)$

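To see how a background uncertainty changes the objective, the following sketch evaluates the AMS with systematics, AMS = sqrt(2((s + b) ln((s + b)/b0) - s - b + b0) + (b - b0)^2 / sigma_b^2), with b0 = (b - sigma_b^2 + sqrt((b - sigma_b^2)^2 + 4(s + b) sigma_b^2)) / 2. The function name `ams_syst` and the chosen values of sigma_b are assumptions for illustration; s = 50 and b = 100 reuse the earlier slide example.

```python
import math


def ams_syst(s, b, sigma_b):
    """AMS with a Gaussian uncertainty sigma_b on the expected background b."""
    if s < 0 or b <= 0 or sigma_b < 0:
        raise ValueError("need s >= 0, b > 0, sigma_b >= 0")
    if sigma_b == 0:
        # No systematic uncertainty: b0 = b and the last term vanishes.
        return math.sqrt(2.0 * ((s + b) * math.log((s + b) / b) - s))
    var = sigma_b ** 2
    b0 = 0.5 * (b - var + math.sqrt((b - var) ** 2 + 4.0 * (s + b) * var))
    radicand = 2.0 * ((s + b) * math.log((s + b) / b0) - s - b + b0) + (b - b0) ** 2 / var
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(radicand, 0.0))


# Same s and b as the earlier example, with increasing background uncertainty.
for sigma_b in (0.0, 5.0, 10.0, 20.0):
    print(sigma_b, round(ams_syst(50, 100, sigma_b), 3))
```

The significance shrinks as sigma_b grows, and the sigma_b = 0 case falls back to the formula without systematics.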
HOW TO HANDLE SYSTEMATIC UNCERTAINTIES
Why didn’t we use it?

How to handle systematic (model) uncertainties?
• The new Approximate Median Significance
$\mathrm{AMS} = \sqrt{2\left((s + b)\ln\frac{s + b}{b_0} - s - b + b_0\right) + \frac{(b - b_0)^2}{\sigma_b^2}}$
where
$b_0 = \frac{1}{2}\left(b - \sigma_b^2 + \sqrt{(b - \sigma_b^2)^2 + 4(s + b)\sigma_b^2}\right)$
[Figure: curves labeled New AMS, ATLAS, Old AMS]

LEARNING FROM THE WINNER
• Sophisticated cross validation, CV bagging
• Sophisticated calibration and model averaging
• The first step: pro participants check whether the effort is worthwhile (risk assessment)
• variance estimate of the score
• Don’t use the public leaderboard score for model selection
• None of Gábor’s 200 out-of-the-ordinary ideas worked

THE TWO MOST COMMON DATA CHALLENGE KILLERS
• Leakage
• Variance of the test score

VARIANCE OF THE TEST SCORE
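One way to get a feel for the variance of a test score is to bootstrap the test set. The sketch below is an illustration under stated assumptions, not the challenge's evaluation code: events carry unit weights (the real data are weighted), the synthetic test set simply reuses the slide's counts (50 selected signal and 100 selected background events), and `bootstrap_ams` is a hypothetical helper name.

```python
import math
import random
from statistics import mean, stdev


def ams3(s, b):
    """Leading-order significance s / sqrt(b)."""
    return s / math.sqrt(b)


def bootstrap_ams(events, n_boot=200, seed=0):
    """Resample the test set with replacement and recompute the score each time.

    events: list of (is_signal, is_selected) pairs, unit weights (a simplification).
    Returns the mean and standard deviation of the resampled scores.
    """
    rng = random.Random(seed)
    n = len(events)
    scores = []
    for _ in range(n_boot):
        sample = rng.choices(events, k=n)
        s = sum(1 for sig, sel in sample if sel and sig)
        b = sum(1 for sig, sel in sample if sel and not sig)
        if b > 0:  # skip degenerate resamples with no selected background
            scores.append(ams3(s, b))
    return mean(scores), stdev(scores)


# Hypothetical test set: 50 selected signal, 100 selected background, 1850 rejected events.
events = [(True, True)] * 50 + [(False, True)] * 100 + [(False, False)] * 1850
score_mean, score_std = bootstrap_ams(events)
print("AMS: %.2f +/- %.2f" % (score_mean, score_std))
```

Because the score depends on only a few hundred selected events, the spread is a sizable fraction of a sigma, which is large compared with typical gaps between top leaderboard entries.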
DATA CHALLENGES
• Challenges are useful for
• generating visibility in the data science community about novel application domains
• benchmarking state-of-the-art techniques in a fair way on well-defined problems
• finding talented data scientists
• Limitations
• not necessarily adapted to solving complex and open-ended data science problems in realistic environments
• no direct access to solutions and data scientists
• no incentive to collaborate

We decided to design something better

RAPID ANALYTICS AND MODEL PROTOTYPING (RAMP)
• Direct access to code, prototyping
• Incentivizing diversity
• Incentivizing collaboration
• Training
• Networking

WHERE DOES IT COME FROM?
• Our experience with the HiggsML challenge
• Need to connect data scientists to domain scientists and problems at the Paris-Saclay Center for Data Science
• Collaboration with management scientists specializing in managing innovation
• Michael Nielsen’s book: Reinventing Discovery
• 5+ iterations so far

UNIVERSITÉ PARIS-SACLAY
+ horizontal multi-disciplinary and multi-partner initiatives to create cohesion

Center for Data Science Paris-Saclay
A multi-disciplinary initiative to define, structure, and manage the data science ecosystem at the Université Paris-Saclay
http://www.datascience-paris-saclay.fr/
250 researchers in 35 laboratories
Biology & bioinformatics: IBISC/UEvry, LRI/UPSud, Hepatinov, CESP/UPSud-UVSQ-Inserm, IGM-I2BC/UPSud, MIA/Agro, MIAj-MIG/INRA, LMAS/Centrale
Chemistry: EA4041/UPSud
Earth sciences: LATMOS/UVSQ, GEOPS/UPSud, IPSL/UVSQ, LSCE/UVSQ, LMD/Polytechnique
Economy: LM/ENSAE, RITM/UPSud, LFA/ENSAE
Neuroscience: UNICOG/Inserm, U1000/Inserm, NeuroSpin/CEA
Particle physics, astrophysics & cosmology: LPP/Polytechnique, DMPH/ONERA, CosmoStat/CEA, IAS/UPSud, AIM/CEA, LAL/UPSud
Machine learning: LRI/UPSud, LTCI/Telecom, CMLA/Cachan, LS/ENSAE, LIX/Polytechnique, MIA/Agro, CMA/Polytechnique, LSS/Supélec, CVN/Centrale, LMAS/Centrale, DTIM/ONERA, IBISC/UEvry
Visualization: INRIA, LIMSI
Signal processing: LTCI/Telecom, CMA/Polytechnique, CVN/Centrale, LSS/Supélec, CMLA/Cachan, LIMSI, DTIM/ONERA
Statistics: LMO/UPSud, LS/ENSAE, LSS/Supélec, CMA/Polytechnique, LMAS/Centrale, MIA/AgroParisTech
LIST/CEA
[Diagram labels: machine learning, information retrieval, signal processing, data visualization, databases]
Domain science: human society, life, brain, earth, universe
Tool building: software engineering, clouds/grids, high-performance computing, optimization
Domain scientist, Software engineer

THE DATA SCIENCE LANDSCAPE
Domain science: energy and physical sciences, health and life sciences, Earth and environment, economy and society, brain
Data science: statistics, machine learning, information retrieval, signal processing, data visualization, databases
Tool building: software engineering, clouds/grids, high-performance computing, optimization
Data scientist, Data trainer, Applied scientist, Domain scientist, Software engineer, Data engineer

https://medium.com/@balazskegl
TOOLS: LANDSCAPE TO ECOSYSTEM
Data scientist, Data trainer, Applied scientist, Domain expert, Software engineer, Data engineer
Data science: statistics, machine learning, information retrieval, signal processing, data visualization, databases
Tool building: software engineering, clouds/grids, high-performance computing, optimization
Data domains: energy and physical sciences, health and life sciences, Earth and environment, economy and society, brain
• interdisciplinary projects
• matchmaking tool
• design and innovation strategy workshops
• data challenges
• coding sprints
• Open Software Initiative
• code consolidator and engineering projects
• data science RAMPs and training sprints (TSs)
• IT platform for linked data
• annotation tools
• SaaS data science platform

NIELSEN’S CROWDSOURCING PRINCIPLES
• Modularizing the collaboration
• independent subtasks
• reduces barriers
• broadens the range of available expertise
• Encouraging small contributions
• Rich and well-structured information commons
• so people can build on earlier work

RAMPS
• Single-day coding sessions
• 20-40 participants
• preparation is similar to challenges
• Goals
• focusing and motivating top talents
• promoting collaboration, speed, and efficiency
• solving (prototyping) real problems

TRAINING SPRINTS
• Single-day training sessions
• 20-40 participants
• focusing on a single subject (deep learning, model tuning, functional data, etc.)
• preparing RAMPs

ANALYTICS TOOLS TO PROMOTE COLLABORATION AND CODE REUSE
ANALYTICS TOOLS TO MONITOR PROGRESS

RAPID ANALYTICS AND MODEL PROTOTYPING
• 2015 Jan 15: The HiggsML challenge
• 2015 Apr 10: Classifying variable stars (accuracy improvement: 89% to 96%)
• 2015 June 16 and Sept 26: Predicting El Niño (RMSE improvement: 0.9 °C to 0.4 °C)
• 2015 October 8: Insect classification (accuracy improvement: 30% to 70%)

CONCLUSIONS
• Explore the open innovation space
• read Nielsen’s book
• Drop me a mail (balazs.kegl@gmail.com) if you are interested in beta-testing the RAMP tool
• Come to our CIML workshop tomorrow

THANK YOU!
What is wrong with data challenges

  • 1. Center for Data Science Paris-Saclay1 CNRS & University Paris Saclay Center for Data Science BALÁZS KÉGL WHAT IS WRONG WITH DATA CHALLENGES THE HIGGSML STORY: THE GOOD, THE BAD AND THE UGLY
  • 2. 2 Why am I so critical? ! Why do I mitigate our own success with the HiggsML?
  • 3. 3 Because I believe that there is enormous potential in open innovation/crowdsourcing in science. ! The current data challenge format is a single point in the landscape.
  • 4. 4 Olga Kokshagina 2015 INTERMEDIARIES: THE GROWING INTEREST FOR « CROWDS » - > EXPLOSION OF TOOLS !  Crowdsourcing !  is a model leveraging on novel technologies (web 2.0, mobile apps, social networks) !  To build content and a structured set of information by gathering contributions from large groups of individuals 5
  • 5. Center for Data Science Paris-Saclay CROWDSOURCING ANNOTATION 5
  • 6. Center for Data Science Paris-Saclay CROWDSOURCING COLLECTION AND ANNOTATION 6
  • 7. Center for Data Science Paris-Saclay CROWDSOURCING MATH 7
  • 8. Center for Data Science Paris-Saclay CROWDSOURCING ANALYTICS 8
  • 9. Center for Data Science Paris-Saclay OPEN SOURCE 9
  • 10. Center for Data Science Paris-Saclay NEW PUBLICATION MODELS 10
  • 11. Center for Data Science Paris-Saclay THE BOOK TO READ 11
  • 12. Center for Data Science Paris-Saclay • Summary of our conclusions after the HiggsML challenge • The good, the bad and the ugly • Elaborating on some of the points • Rapid Analytics and Model Prototyping • an experimental format we have been developing 12 OUTLINE
  • 13. Center for Data Science Paris-Saclay13 CIML WORKSHOP TOMORROW
  • 14. Center for Data Science Paris-Saclay • Publicity, awareness • both in physics (about the technology) and in ML (about the problem) • Triggering open data • http://opendata.cern.ch/collection/ATLAS-Higgs-Challenge-2014 • Learning a lot from Gábor on how to win a challenge • Gábor getting hired by Google Deep Mind • Benchmarking • Tool dissemination (xgboost, keras) 14 THE GOOD
  • 15. Center for Data Science Paris-Saclay • No direct access to code • No direct access to data scientists • No fundamentally new ideas • No incentive to collaborate 15 THE BAD
  • 16. Center for Data Science Paris-Saclay • 18 months to prepare • legal issues, access to data • problem formulation: intellectually way more interesting than the challenge itself, but difficult to “market” or to crowdsource • once a problem is formalized/formatted to challenge, the problem is solved (“learning is easy” - GaelVaroquaux) 16 THE UGLY
  • 17. Center for Data Science Paris-Saclay • We asked the wrong question, on purpose! • because the right questions are complex and don’t fit the challenge setup • would have led to way less participation • would have led to bitterness among the participants, bad (?) for marketing 17 THE UGLY
  • 18. Center for Data Science Paris-Saclay • The HiggsML challenge on Kaggle • https://www.kaggle.com/c/higgs-boson 18 PUBLICITY, AWARENESS
  • 19. Center for Data Science Paris-Saclay PUBLICITY, AWARENESS 19 B. Kégl / AppStat@LAL Learning to discover CLASSIFICATION FOR DISCOVERY 14
  • 20. Center for Data Science Paris-Saclay AWARENESS DYNAMICS 20 • HEPML workshop @NIPS14 • JMLR WS proceedings: http://jmlr.csail.mit.edu/proceedings/papers/v42 • CERN Open Data • http://opendata.cern.ch/collection/ATLAS-Higgs-Challenge-2014 • DataScience@LHC • http://indico.cern.ch/event/395374/ • Flavors of physics challenge • https://www.kaggle.com/c/flavours-of-physics
  • 21. Center for Data Science Paris-Saclay LEARNING FROM THE WINNER 21 https://indico.lal.in2p3.fr/event/2692/contribution/1/material/slides/0.pdf
  • 22. Center for Data Science Paris-Saclay LEARNING FROM THE WINNER 22 • Sophisticated cross validation, CV bagging • Sophisticated calibration and model averaging • The first step: pro participants check if the effort is worthy, risk assessment • variance estimate of the score • Don’t use the public leaderboard score for model selection • None of Gábor’s 200 out-of-the-ordinary ideas worked https://indico.lal.in2p3.fr/event/2692/contribution/1/material/slides/0.pdf
  • 23. BENCHMARKING • [Figure: slide “Classification for discovery”]
  • 24. BENCHMARKING • But what score did we optimize? And why?
  • 25. CLASSIFICATION FOR DISCOVERY • Goal: optimize the expected discovery significance • [Figure: two panels, “count (per year)” and “probability”, each showing background and signal curves with a selection threshold] • flux × time • selection • expected background, say, $b = 100$ events • total count, say, 150 events • excess is $s = 50$ events • $\mathrm{AMS} = s/\sqrt{b} = 50/\sqrt{100} = 5$ sigma • [Paper excerpt] The approximate median significance is $\mathrm{AMS}_2 = \sqrt{2\left((s + b)\ln\left(1 + \frac{s}{b}\right) - s\right)}$ (14). Using $\ln(x + 1) = x - x^2/2 + O(x^3)$, $\mathrm{AMS}_2$ reduces to leading order to $\mathrm{AMS}_3 = \frac{s}{\sqrt{b}}$ (15). The two are practically indistinguishable when $b \gg s$.
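The arithmetic on this slide can be checked numerically. The following is a minimal Python sketch, assuming only the two expressions quoted in the paper excerpt (the logarithmic form and the simple s/√b form); the function names `ams2` and `ams3` are illustrative and are not taken from the challenge code.

```python
import math


def ams2(s, b):
    """Approximate median significance, logarithmic form (eq. 14 in the excerpt)."""
    return math.sqrt(2.0 * ((s + b) * math.log(1.0 + s / b) - s))


def ams3(s, b):
    """Simple approximation s / sqrt(b) (eq. 15 in the excerpt)."""
    return s / math.sqrt(b)


if __name__ == "__main__":
    # Numbers from the slide: b = 100 expected background events, s = 50 excess events.
    print(ams3(50.0, 100.0))
    print(ams2(50.0, 100.0))
    # When b is much larger than s, the two forms nearly coincide.
    print(ams2(10.0, 10000.0), ams3(10.0, 10000.0))
```

With s = 50 and b = 100 the simple form gives exactly 5 sigma, while the logarithmic form gives about 4.65, so the two agree closely only when b is much larger than s.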
  • 26. How to handle systematic (model) uncertainties? • OK, so let’s design an objective function that can take background systematics into consideration • Likelihood with unknown background $b \sim \mathcal{N}(\mu_b, \sigma_b)$: $L(\mu_s, \mu_b) = P(n, b \mid \mu_s, \mu_b, \sigma_b) = \frac{(\mu_s + \mu_b)^n}{n!} e^{-(\mu_s + \mu_b)} \, \frac{1}{\sqrt{2\pi}\,\sigma_b} e^{-(b - \mu_b)^2 / (2\sigma_b^2)}$ • Profile likelihood ratio: $\lambda(0) = \frac{L(0, \hat{\hat{\mu}}_b)}{L(\hat{\mu}_s, \hat{\mu}_b)}$ • The new Approximate Median Significance (by Glen Cowan): $\mathrm{AMS} = \sqrt{2\left((s + b)\ln\frac{s + b}{b_0} - s - b + b_0\right) + \frac{(b - b_0)^2}{\sigma_b^2}}$, where $b_0 = \frac{1}{2}\left(b - \sigma_b^2 + \sqrt{(b - \sigma_b^2)^2 + 4(s + b)\sigma_b^2}\right)$
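To see how a background uncertainty lowers the significance, the new AMS on this slide can be evaluated directly. This is a hedged Python sketch of that formula only. The function name `ams_with_systematics` is hypothetical, and the fallback for a zero uncertainty is an assumption added here to avoid a division by zero; it is not part of the slide.

```python
import math


def ams_with_systematics(s, b, sigma_b):
    """New AMS from the slide, with a Gaussian uncertainty sigma_b on the background b."""
    if sigma_b == 0.0:
        # Assumption: with no uncertainty, b0 equals b and the formula
        # reduces to the logarithmic AMS without systematics.
        return math.sqrt(2.0 * ((s + b) * math.log(1.0 + s / b) - s))
    var = sigma_b ** 2
    # b0 as defined on the slide.
    b0 = 0.5 * (b - var + math.sqrt((b - var) ** 2 + 4.0 * (s + b) * var))
    log_term = (s + b) * math.log((s + b) / b0) - s - b + b0
    return math.sqrt(2.0 * log_term + (b - b0) ** 2 / var)


if __name__ == "__main__":
    for sigma_b in (0.0, 1.0, 5.0, 10.0):
        print(sigma_b, ams_with_systematics(50.0, 100.0, sigma_b))
```

For s = 50 and b = 100, a background uncertainty of 10 events reduces the significance from about 4.65 to about 3.29.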
  • 27. HOW TO HANDLE SYSTEMATIC UNCERTAINTIES • Why didn’t we use it?
  • 28. How to handle systematic (model) uncertainties? • The new Approximate Median Significance: $\mathrm{AMS} = \sqrt{2\left((s + b)\ln\frac{s + b}{b_0} - s - b + b_0\right) + \frac{(b - b_0)^2}{\sigma_b^2}}$, where $b_0 = \frac{1}{2}\left(b - \sigma_b^2 + \sqrt{(b - \sigma_b^2)^2 + 4(s + b)\sigma_b^2}\right)$ • [Figure: plot with curves labeled “New AMS”, “ATLAS”, and “Old AMS”]
  • 29. LEARNING FROM THE WINNER • Sophisticated cross validation, CV bagging • Sophisticated calibration and model averaging • The first step: pro participants check whether the effort is worthwhile (risk assessment) • variance estimate of the score • Don’t use the public leaderboard score for model selection • None of Gábor’s 200 out-of-the-ordinary ideas worked
  • 30. THE TWO MOST COMMON DATA CHALLENGE KILLERS • Leakage • Variance of the test score
  • 31. VARIANCE OF THE TEST SCORE
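One way to make the variance of a test score concrete is to bootstrap the test set and look at the spread of the resampled scores. The sketch below is an illustration under stated assumptions, not the procedure used in the HiggsML challenge: it uses plain accuracy instead of AMS, the per-example outcomes are synthetic, and the names `simulate_correct` and `bootstrap_score_std` are hypothetical.

```python
import random
import statistics


def simulate_correct(n, accuracy=0.8, seed=1):
    """Synthetic per-example outcomes: 1 if the prediction is correct, else 0."""
    rng = random.Random(seed)
    return [1 if rng.random() < accuracy else 0 for _ in range(n)]


def bootstrap_score_std(correct, n_boot=200, seed=0):
    """Standard deviation of the accuracy over bootstrap resamples of the test set."""
    rng = random.Random(seed)
    n = len(correct)
    scores = []
    for _ in range(n_boot):
        # Resample the test set with replacement and recompute the score.
        sample = rng.choices(correct, k=n)
        scores.append(sum(sample) / n)
    return statistics.stdev(scores)


if __name__ == "__main__":
    for n in (200, 5000):
        print(n, bootstrap_score_std(simulate_correct(n)))
```

A small test set gives a visibly larger spread than a large one. If two submissions differ by less than that spread, ranking them on that score is mostly noise.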
  • 32. DATA CHALLENGES • Challenges are useful for • generating visibility in the data science community about novel application domains • benchmarking state-of-the-art techniques fairly on well-defined problems • finding talented data scientists • Limitations • not necessarily suited to solving complex and open-ended data science problems in realistic environments • no direct access to solutions and data scientists • no incentive to collaborate
  • 33. We decided to design something better
  • 34. RAPID ANALYTICS AND MODEL PROTOTYPING (RAMP) • Direct access to code, prototyping • Incentivizing diversity • Incentivizing collaboration • Training • Networking
  • 35. WHERE DOES IT COME FROM? • Our experience with the HiggsML challenge • Need to connect data scientists to domain scientists and problems at the Paris-Saclay Center for Data Science • Collaboration with management scientists specializing in managing innovation • Michael Nielsen’s book: Reinventing Discovery • 5+ iterations so far
  • 36. UNIVERSITÉ PARIS-SACLAY • + horizontal multi-disciplinary and multi-partner initiatives to create cohesion
  • 37. Center for Data Science Paris-Saclay • A multi-disciplinary initiative to define, structure, and manage the data science ecosystem at the Université Paris-Saclay • http://www.datascience-paris-saclay.fr/ • 250 researchers in 35 laboratories • Biology & bioinformatics: IBISC/UEvry, LRI/UPSud, Hepatinov, CESP/UPSud-UVSQ-Inserm, IGM-I2BC/UPSud, MIA/Agro, MIAj-MIG/INRA, LMAS/Centrale • Chemistry: EA4041/UPSud • Earth sciences: LATMOS/UVSQ, GEOPS/UPSud, IPSL/UVSQ, LSCE/UVSQ, LMD/Polytechnique • Economy: LM/ENSAE, RITM/UPSud, LFA/ENSAE • Neuroscience: UNICOG/Inserm, U1000/Inserm, NeuroSpin/CEA • Particle physics, astrophysics & cosmology: LPP/Polytechnique, DMPH/ONERA, CosmoStat/CEA, IAS/UPSud, AIM/CEA, LAL/UPSud • Machine learning: LRI/UPSud, LTCI/Telecom, CMLA/Cachan, LS/ENSAE, LIX/Polytechnique, MIA/Agro, CMA/Polytechnique, LSS/Supélec, CVN/Centrale, LMAS/Centrale, DTIM/ONERA, IBISC/UEvry • Visualization: INRIA, LIMSI • Signal processing: LTCI/Telecom, CMA/Polytechnique, CVN/Centrale, LSS/Supélec, CMLA/Cachan, LIMSI, DTIM/ONERA • Statistics: LMO/UPSud, LS/ENSAE, LSS/Supélec, CMA/Polytechnique, LMAS/Centrale, MIA/AgroParisTech • Also listed: LIST/CEA • [Diagram: machine learning, information retrieval, signal processing, data visualization, databases; Domain science: human society, life, brain, earth, universe; Tool building: software engineering, clouds/grids, high-performance computing, optimization; roles: domain scientist, software engineer]
  • 38. THE DATA SCIENCE LANDSCAPE • Domain science: energy and physical sciences, health and life sciences, Earth and environment, economy and society, brain • Data scientist, data trainer, applied scientist, domain scientist, software engineer, data engineer • Data science: statistics, machine learning, information retrieval, signal processing, data visualization, databases • Tool building: software engineering, clouds/grids, high-performance computing, optimization
  • 39. https://medium.com/@balazskegl
  • 40. TOOLS: LANDSCAPE TO ECOSYSTEM • Data scientist, data trainer, applied scientist, domain expert, software engineer, data engineer • Data science: statistics, machine learning, information retrieval, signal processing, data visualization, databases • Tool building: software engineering, clouds/grids, high-performance computing, optimization • Data domains: energy and physical sciences, health and life sciences, Earth and environment, economy and society, brain • interdisciplinary projects • matchmaking tool • design and innovation strategy workshops • data challenges • coding sprints • Open Software Initiative • code consolidator and engineering projects • data science RAMPs and TSs (training sprints) • IT platform for linked data • annotation tools • SaaS data science platform
  • 41. NIELSEN’S CROWDSOURCING PRINCIPLES • Modularizing the collaboration • independent subtasks • reduces barriers • broadens the range of available expertise • Encouraging small contributions • Rich and well-structured information commons • so people can build on earlier work
  • 42. RAMPS • Single-day coding sessions • 20-40 participants • preparation is similar to challenges • Goals • focusing and motivating top talents • promoting collaboration, speed, and efficiency • solving (prototyping) real problems
  • 43. TRAINING SPRINTS • Single-day training sessions • 20-40 participants • focusing on a single subject (deep learning, model tuning, functional data, etc.) • preparing RAMPs
  • 44. ANALYTICS TOOLS TO PROMOTE COLLABORATION AND CODE REUSE
  • 45. ANALYTICS TOOL TO PROMOTE COLLABORATION AND CODE REUSE
  • 46. ANALYTICS TOOLS TO MONITOR PROGRESS
  • 47. RAPID ANALYTICS AND MODEL PROTOTYPING • 2015 Jan 15 • The HiggsML challenge
  • 48. RAPID ANALYTICS AND MODEL PROTOTYPING • 2015 Apr 10 • Classifying variable stars
  • 49. VARIABLE STARS
  • 50. VARIABLE STARS • accuracy improvement: 89% to 96% (slide from “Learning to discover”, B. Kégl / CNRS - Saclay)
  • 51. RAPID ANALYTICS AND MODEL PROTOTYPING • 2015 June 16 and Sept 26 • Predicting El Niño
  • 52. RAPID ANALYTICS AND MODEL PROTOTYPING • RMSE improvement: 0.9 °C to 0.4 °C
  • 53. RAPID ANALYTICS AND MODEL PROTOTYPING • 2015 October 8 • Insect classification
  • 54. RAPID ANALYTICS AND MODEL PROTOTYPING • accuracy improvement: 30% to 70%
  • 55. CONCLUSIONS • Explore the open innovation space • read Nielsen’s book • Drop me a mail (balazs.kegl@gmail.com) if you are interested in beta-testing the RAMP tool • Come to our CIML WS tomorrow
  • 56. THANK YOU!