SlideShare ist ein Scribd-Unternehmen logo
1 von 19
DATA DOPING
Solution for "Flavour of Physics" challenge
Dr. Vicens Gaitan
Grupo AIA
Heavy Flavour Data mining Workshop
18-20 February 2016
University of Zurich
AGENDA
• “Flavour of Physics” Kaggle Challenge
• Why is so hard to “discover” the invariant mass?
• How to win the challenge: leasons learned
• Breaking the rules: Data Doping
• Machine Learning in HEP
• Conclusions
“FLAVOUR OF PHYSICS” KAGGLE
CHALLENGE
WHY IS SO HARD TO “DISCOVER” THE
INVARIANT MASS?
• Fact 1: The tau invariant mass can be reconstructed with high accuracy from
kinematic variables (p0,pt,eta) and/or lifetime & time of flight
ptau<-function(d){
# muon mass in MeV/c^2
mmu = 105.6583715
# calculate tau energy
tau_e = sqrt(d$p0_p ** 2 + mmu**2) + sqrt(d$p1_p ** 2 + mmu**2) +sqrt(d$p2_p ** 2 + mmu**2)
# calculate pz of tau candidate
tau_pz = d$p0_pt * sinh(d$p0_eta) + d$p1_pt * sinh(d$p1_eta)+d$p2_pt * sinh(d$p2_eta)
# calculate momentum of tau candidate
tau_p =sqrt(d$pt ** 2 + tau_pz ** 2)
# calculate eta of tau candidate
tau_eta = asinh(tau_pz / d$pt)
# calculate mass of tau candidate
tau_m2=tau_e ** 2 - tau_p ** 2
tau_m2[tau_m2<0]=0
tau_m_k=sqrt(tau_m2)
#M = tau_p*LifeTime*c/FlightDistance
#c Speed of light
c= 299.792458
tau_m_t=tau_p*d$LifeTime*c/d$FlightDistance
return(list(tau_e,tau_pz,tau_p,tau_eta,tau_m))
}
WHY IS SO HARD TO “DISCOVER” THE
INVARIANT MASS?
• Fact 2: The reconstructed tau mass separates nicely signal and background
because the background spectrum has a "hole" for decays coming from a true tau
(Real data (background) in this window can contain “signal”)
Train & Agreement Test
WHY IS SO HARD TO “DISCOVER” THE
INVARIANT MASS?
• Toy models ( not taking into account agreement & correlation):
• XGBoost
• 3-fold CV
• Par("max_depth"=5,"eta"=.1)
1. Using true mass : wAUC = 0.999999999999999(89) +/- 8.4e-16 (!!)
2. Using mass_k: wAUC = 0.997(51) +/- 0.00038
3. Using mass_t: wAUC = 0.996(81) +/- 0.00021
4. Using both wAUC = 0.999(83) +/- 0.00012
Reconstructed Mass is THE Golden Feature
WHY IS SO HARD TO “DISCOVER” THE
INVARIANT MASS?
• BUT using as variables:
c("p0_p","p1_p","p2_p","p0_pt","p1_pt","p2_pt","p0_eta","p1_eta","p2_eta","pt","LifeTime","FlightDistance")
c("tau_e","tau_p","LifeTime","FlightDistance")
wAUC = 0.85(15) +/- 0.0032 ??????
And trying to fit the mass with XGBoost : we obtain:
WHY IS SO HARD TO “DISCOVER” THE
INVARIANT MASS?
The reason: Highly correlated variables: mass is an effect of 1 over 2500
• Solution: uncorrelate variables with PCA
• wAUC = 0.947(73) +/- 0.00099
• Gradient Boosting Trees are not able build a representation of Invariant Mass
• Maybe Deep Learning can do it?
HOW TO WIN THE CHALLENGE:
LEASONS LEARNED
Recipe to win the challenge
1. Add the reconstructed tau mass ( don’t bother about mass correlation test)
2. Use all available variables ( profit from bad simulated MC variables to separate signal from real
background)
AUCw=0.9999920 CVM=0.0848 K-S=0.2226
3. Hack the Correlation and Agreement Test ;)
HOW TO WIN THE CHALLENGE:
LEASONS LEARNED
Define p’ = .99 * p ^5000 + .01 * RND and calculate the CVM
CVM= 0.85
Correlation Test : Correlation between classifier output p and mass over a rolling window:
CVM= 0.85
• CVM= 0.0012 !!
• p’ has very similar AUC as p
• p’ for agreement sample ->0 because p’ agreement <<1
• Useless for physics : Just exploiting the background mass gap
AUCw= 0.9999921 CVM=0.0014 K-S=0.0088
BREAKING THE RULES: DATA DOPING
• Recipe to build a physically sound classifier:
1. Not to use reconstructed mass, nor features allowing easy mass reconstruction
2. Try to not use variable regions for which the Monte Carlo simulation doesn't agree
with real data
In order to fullfill 2 we have to break the rules and take a look to the control channel
Goal: train a classifier able to separate A
from B, but not C from D
Max(wAUC(A,B)) with KS(C,D)<epsilon
Hypothesis: Control Channel & Analysis
channel share the same MC “defects”
MC Real Data
Control Channel
Analysis Channel
Known
Physics
Known
Physics
New
Physics
Background
A B
C D
BREAKING THE RULES: DATA DOPING
• The idea is to "dope" (in the semiconductor meaning) the training set with a small number of Monte
Carlo events from the control channel , but labeled as background.
This disallow the classifier to pick features discriminating data and Monte Carlo.
There are two parameters that regularize the learning:
• The number of "doping" events
• the complexity of the classifier (for instance number of trees)
MC Real Data
Control Channel
Analysis Channel
Known
Physics
Known
Physics
New
Physics
Background
A B
C D
BREAKING THE RULES: DATA DOPING
Free classifier Doping events: 2000 Doping events: 3000
Grid search over Classifier complexity (n_ trees) and Number (weight) of doping events
Dammit! A new hyperparameter….
BREAKING THE RULES: DATA DOPING
0
0.01
0.02
0.03
0.04
0.05
0.06
0.07
0.08
0.09
0.1
0.11
0.12
0.13
0.14
0.15
0.986 0.987 0.988 0.989 0.99 0.991 0.992 0.993 0.994
2500
3000
3100
3200
3500
3900
5000
4000
AUCw= 0.9893 CVM=0.0011 K-S=0.087
CV wAUC
Private LB
KS
Optimal Doping Events: 3960
BREAKING THE RULES: DATA DOPING
Analysis Channel Control channel Correlation Test
• Good discriminating power in analysis channel
• No separation for the control channel
• Classifier not correlated with mass
• Probably good for physics….
MACHINE LEARNING IN HEP
Today we have
• the right tools
• data availability
• the computer power
• But new physics is difficult to discover unless you know what
are you looking for…
• A complementary approach can be to use unsupervised
learning (only real data driven, we have lots of them)
MACHINE LEARNING IN HEP
Example: exploring tau decay at LEP (ALEPH 1993)
(yes, e+ e- physics is cleaner…)
Feeding an autoencoder with “elaborated” detector data
we are able to “discover” different decay modes looking at the
compressed representation without a physics model (MC)
Today is possible to do “end to end” autoencoding from raw
detector data
MACHINE LEARNING IN HEP
Example: t-sne with the challenge data: Look at the fine structure….
MC:
Control Channel
Signal
Real Data (all you see is real!)
Control Channel
Background
Test
CONCLUSIONS
• Machine learning algorithms alone can fail to discover tiny effects in the data (1777
MeV is only 1.e-4 of the energy at the center of mass)
• Use your knowledge: Try to reduce your data using fundamental simmetries, like
Lorentz invariance, detector geometry...
• Be aware of the test you are using to assure the classifier validity:
• If a test can be “hacked” an enough powerfull machine learning algorithm will find the
way
• If it is possible, try to use non supervised methods( without MC) to gain insight in
your data

Weitere ähnliche Inhalte

Was ist angesagt?

damping_constant_spring
damping_constant_springdamping_constant_spring
damping_constant_springN'Vida Yotcho
 
03 20256 ijict
03 20256 ijict03 20256 ijict
03 20256 ijictIAESIJEECS
 
Capitulo 12
Capitulo 12Capitulo 12
Capitulo 12vimacive
 
Ejercicios resueltos de rc
Ejercicios resueltos de rcEjercicios resueltos de rc
Ejercicios resueltos de rcNiels Estrada
 
Generation Optimization Project Report
Generation Optimization Project ReportGeneration Optimization Project Report
Generation Optimization Project ReportManjot Singh E.I.T
 
【潔能講堂】大型風力發電廠之發電效率分析與預測
【潔能講堂】大型風力發電廠之發電效率分析與預測【潔能講堂】大型風力發電廠之發電效率分析與預測
【潔能講堂】大型風力發電廠之發電效率分析與預測Energy-Education-Resource-Center
 
The Minimum Total Heating Lander By The Maximum Principle Pontryagin
The Minimum Total Heating Lander By The Maximum Principle PontryaginThe Minimum Total Heating Lander By The Maximum Principle Pontryagin
The Minimum Total Heating Lander By The Maximum Principle PontryaginIJERA Editor
 
05 20261 real power loss reduction
05 20261 real power loss reduction05 20261 real power loss reduction
05 20261 real power loss reductionIAESIJEECS
 
Nonlinear Control of an Active Magnetic Bearing with Output Constraint
Nonlinear Control of an Active Magnetic Bearing with Output ConstraintNonlinear Control of an Active Magnetic Bearing with Output Constraint
Nonlinear Control of an Active Magnetic Bearing with Output ConstraintIJECEIAES
 
VASP And Wannier90: A Quick Tutorial
VASP And Wannier90: A Quick TutorialVASP And Wannier90: A Quick Tutorial
VASP And Wannier90: A Quick TutorialJonathan Skelton
 
Bayesian Inference : Kalman filter 에서 Optimization 까지 - 김홍배 박사님
Bayesian Inference : Kalman filter 에서 Optimization 까지 - 김홍배 박사님Bayesian Inference : Kalman filter 에서 Optimization 까지 - 김홍배 박사님
Bayesian Inference : Kalman filter 에서 Optimization 까지 - 김홍배 박사님AI Robotics KR
 

Was ist angesagt? (16)

damping_constant_spring
damping_constant_springdamping_constant_spring
damping_constant_spring
 
03 20256 ijict
03 20256 ijict03 20256 ijict
03 20256 ijict
 
Capitulo 12
Capitulo 12Capitulo 12
Capitulo 12
 
Ejercicios resueltos de rc
Ejercicios resueltos de rcEjercicios resueltos de rc
Ejercicios resueltos de rc
 
Generation Optimization Project Report
Generation Optimization Project ReportGeneration Optimization Project Report
Generation Optimization Project Report
 
【潔能講堂】大型風力發電廠之發電效率分析與預測
【潔能講堂】大型風力發電廠之發電效率分析與預測【潔能講堂】大型風力發電廠之發電效率分析與預測
【潔能講堂】大型風力發電廠之發電效率分析與預測
 
The Minimum Total Heating Lander By The Maximum Principle Pontryagin
The Minimum Total Heating Lander By The Maximum Principle PontryaginThe Minimum Total Heating Lander By The Maximum Principle Pontryagin
The Minimum Total Heating Lander By The Maximum Principle Pontryagin
 
Final Thesis
Final ThesisFinal Thesis
Final Thesis
 
05 20261 real power loss reduction
05 20261 real power loss reduction05 20261 real power loss reduction
05 20261 real power loss reduction
 
Nonlinear Control of an Active Magnetic Bearing with Output Constraint
Nonlinear Control of an Active Magnetic Bearing with Output ConstraintNonlinear Control of an Active Magnetic Bearing with Output Constraint
Nonlinear Control of an Active Magnetic Bearing with Output Constraint
 
Poster
PosterPoster
Poster
 
VASP And Wannier90: A Quick Tutorial
VASP And Wannier90: A Quick TutorialVASP And Wannier90: A Quick Tutorial
VASP And Wannier90: A Quick Tutorial
 
Example problem
Example problemExample problem
Example problem
 
ControlsLab1
ControlsLab1ControlsLab1
ControlsLab1
 
Ballingham_Severance_Lab4
Ballingham_Severance_Lab4Ballingham_Severance_Lab4
Ballingham_Severance_Lab4
 
Bayesian Inference : Kalman filter 에서 Optimization 까지 - 김홍배 박사님
Bayesian Inference : Kalman filter 에서 Optimization 까지 - 김홍배 박사님Bayesian Inference : Kalman filter 에서 Optimization 까지 - 김홍배 박사님
Bayesian Inference : Kalman filter 에서 Optimization 까지 - 김홍배 박사님
 

Andere mochten auch

Kaggle digits analysis_final_fc
Kaggle digits analysis_final_fcKaggle digits analysis_final_fc
Kaggle digits analysis_final_fcZachary Combs
 
Predicting Delinquency-Give me some credit
Predicting Delinquency-Give me some creditPredicting Delinquency-Give me some credit
Predicting Delinquency-Give me some creditpragativbora
 
Kaggle "Give me some credit" challenge overview
Kaggle "Give me some credit" challenge overviewKaggle "Give me some credit" challenge overview
Kaggle "Give me some credit" challenge overviewAdam Pah
 
A tutorial on deep learning at icml 2013
A tutorial on deep learning at icml 2013A tutorial on deep learning at icml 2013
A tutorial on deep learning at icml 2013Philip Zheng
 
10 R Packages to Win Kaggle Competitions
10 R Packages to Win Kaggle Competitions10 R Packages to Win Kaggle Competitions
10 R Packages to Win Kaggle CompetitionsDataRobot
 
The Hitchhiker’s Guide to Kaggle
The Hitchhiker’s Guide to KaggleThe Hitchhiker’s Guide to Kaggle
The Hitchhiker’s Guide to KaggleKrishna Sankar
 
Tda presentation
Tda presentationTda presentation
Tda presentationHJ van Veen
 
Kaggle presentation
Kaggle presentationKaggle presentation
Kaggle presentationHJ van Veen
 

Andere mochten auch (8)

Kaggle digits analysis_final_fc
Kaggle digits analysis_final_fcKaggle digits analysis_final_fc
Kaggle digits analysis_final_fc
 
Predicting Delinquency-Give me some credit
Predicting Delinquency-Give me some creditPredicting Delinquency-Give me some credit
Predicting Delinquency-Give me some credit
 
Kaggle "Give me some credit" challenge overview
Kaggle "Give me some credit" challenge overviewKaggle "Give me some credit" challenge overview
Kaggle "Give me some credit" challenge overview
 
A tutorial on deep learning at icml 2013
A tutorial on deep learning at icml 2013A tutorial on deep learning at icml 2013
A tutorial on deep learning at icml 2013
 
10 R Packages to Win Kaggle Competitions
10 R Packages to Win Kaggle Competitions10 R Packages to Win Kaggle Competitions
10 R Packages to Win Kaggle Competitions
 
The Hitchhiker’s Guide to Kaggle
The Hitchhiker’s Guide to KaggleThe Hitchhiker’s Guide to Kaggle
The Hitchhiker’s Guide to Kaggle
 
Tda presentation
Tda presentationTda presentation
Tda presentation
 
Kaggle presentation
Kaggle presentationKaggle presentation
Kaggle presentation
 

Ähnlich wie Zurich machine learning_vicensgaitan

[Slides] Using deep recurrent neural network for direct beam solar irradiance...
[Slides] Using deep recurrent neural network for direct beam solar irradiance...[Slides] Using deep recurrent neural network for direct beam solar irradiance...
[Slides] Using deep recurrent neural network for direct beam solar irradiance...Maosi Chen
 
Virus, Vaccines, Genes and Quantum - 2020-06-18
Virus, Vaccines, Genes and Quantum - 2020-06-18Virus, Vaccines, Genes and Quantum - 2020-06-18
Virus, Vaccines, Genes and Quantum - 2020-06-18Aritra Sarkar
 
Noha danms13 talk_final
Noha danms13 talk_finalNoha danms13 talk_final
Noha danms13 talk_finalNoha Elprince
 
Automated Generation of High-accuracy Interatomic Potentials Using Quantum Data
Automated Generation of High-accuracy Interatomic Potentials Using Quantum DataAutomated Generation of High-accuracy Interatomic Potentials Using Quantum Data
Automated Generation of High-accuracy Interatomic Potentials Using Quantum Dataaimsnist
 
Introduction to Chainer Chemistry
Introduction to Chainer ChemistryIntroduction to Chainer Chemistry
Introduction to Chainer ChemistryPreferred Networks
 
Weight watcher Bay Area ACM Feb 28, 2022
Weight watcher Bay Area ACM Feb 28, 2022 Weight watcher Bay Area ACM Feb 28, 2022
Weight watcher Bay Area ACM Feb 28, 2022 Charles Martin
 
QX Simulator and quantum programming - 2020-04-28
QX Simulator and quantum programming - 2020-04-28QX Simulator and quantum programming - 2020-04-28
QX Simulator and quantum programming - 2020-04-28Aritra Sarkar
 
Mass spectrometry assay optimization using functional programming patterns in...
Mass spectrometry assay optimization using functional programming patterns in...Mass spectrometry assay optimization using functional programming patterns in...
Mass spectrometry assay optimization using functional programming patterns in...Bennett Kalafut
 
Iaetsd power capture safe test pattern determination
Iaetsd power capture safe test pattern determinationIaetsd power capture safe test pattern determination
Iaetsd power capture safe test pattern determinationIaetsd Iaetsd
 
RUCK 2017 MxNet과 R을 연동한 딥러닝 소개
RUCK 2017 MxNet과 R을 연동한 딥러닝 소개RUCK 2017 MxNet과 R을 연동한 딥러닝 소개
RUCK 2017 MxNet과 R을 연동한 딥러닝 소개r-kor
 
Evolving Fuzzy System Applied to Battery Charge Capacity Prediction for Faul...
 Evolving Fuzzy System Applied to Battery Charge Capacity Prediction for Faul... Evolving Fuzzy System Applied to Battery Charge Capacity Prediction for Faul...
Evolving Fuzzy System Applied to Battery Charge Capacity Prediction for Faul...Murilo Camargos
 
Compressed learning for time series classification
Compressed learning for time series classificationCompressed learning for time series classification
Compressed learning for time series classification學翰 施
 
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHMADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHMWireilla
 
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHMADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHMijfls
 
Integrating Adaptation Mechanisms Using Control Theory Centric Architecture M...
Integrating Adaptation Mechanisms Using Control Theory Centric Architecture M...Integrating Adaptation Mechanisms Using Control Theory Centric Architecture M...
Integrating Adaptation Mechanisms Using Control Theory Centric Architecture M...Filip Krikava
 
Geothermal Reserves Assessment
Geothermal Reserves AssessmentGeothermal Reserves Assessment
Geothermal Reserves AssessmentManuel Rivera
 
Java Performance Puzzlers
Java Performance PuzzlersJava Performance Puzzlers
Java Performance PuzzlersDoug Hawkins
 
Complex models in ecology: challenges and solutions
Complex models in ecology: challenges and solutionsComplex models in ecology: challenges and solutions
Complex models in ecology: challenges and solutionsPeter Solymos
 

Ähnlich wie Zurich machine learning_vicensgaitan (20)

[Slides] Using deep recurrent neural network for direct beam solar irradiance...
[Slides] Using deep recurrent neural network for direct beam solar irradiance...[Slides] Using deep recurrent neural network for direct beam solar irradiance...
[Slides] Using deep recurrent neural network for direct beam solar irradiance...
 
Virus, Vaccines, Genes and Quantum - 2020-06-18
Virus, Vaccines, Genes and Quantum - 2020-06-18Virus, Vaccines, Genes and Quantum - 2020-06-18
Virus, Vaccines, Genes and Quantum - 2020-06-18
 
Noha danms13 talk_final
Noha danms13 talk_finalNoha danms13 talk_final
Noha danms13 talk_final
 
Automated Generation of High-accuracy Interatomic Potentials Using Quantum Data
Automated Generation of High-accuracy Interatomic Potentials Using Quantum DataAutomated Generation of High-accuracy Interatomic Potentials Using Quantum Data
Automated Generation of High-accuracy Interatomic Potentials Using Quantum Data
 
Introduction to Chainer Chemistry
Introduction to Chainer ChemistryIntroduction to Chainer Chemistry
Introduction to Chainer Chemistry
 
Backpropagation - Elisa Sayrol - UPC Barcelona 2018
Backpropagation - Elisa Sayrol - UPC Barcelona 2018Backpropagation - Elisa Sayrol - UPC Barcelona 2018
Backpropagation - Elisa Sayrol - UPC Barcelona 2018
 
Weight watcher Bay Area ACM Feb 28, 2022
Weight watcher Bay Area ACM Feb 28, 2022 Weight watcher Bay Area ACM Feb 28, 2022
Weight watcher Bay Area ACM Feb 28, 2022
 
QX Simulator and quantum programming - 2020-04-28
QX Simulator and quantum programming - 2020-04-28QX Simulator and quantum programming - 2020-04-28
QX Simulator and quantum programming - 2020-04-28
 
Mass spectrometry assay optimization using functional programming patterns in...
Mass spectrometry assay optimization using functional programming patterns in...Mass spectrometry assay optimization using functional programming patterns in...
Mass spectrometry assay optimization using functional programming patterns in...
 
Iaetsd power capture safe test pattern determination
Iaetsd power capture safe test pattern determinationIaetsd power capture safe test pattern determination
Iaetsd power capture safe test pattern determination
 
RUCK 2017 MxNet과 R을 연동한 딥러닝 소개
RUCK 2017 MxNet과 R을 연동한 딥러닝 소개RUCK 2017 MxNet과 R을 연동한 딥러닝 소개
RUCK 2017 MxNet과 R을 연동한 딥러닝 소개
 
Evolving Fuzzy System Applied to Battery Charge Capacity Prediction for Faul...
 Evolving Fuzzy System Applied to Battery Charge Capacity Prediction for Faul... Evolving Fuzzy System Applied to Battery Charge Capacity Prediction for Faul...
Evolving Fuzzy System Applied to Battery Charge Capacity Prediction for Faul...
 
Ch2 2011 s
Ch2 2011 sCh2 2011 s
Ch2 2011 s
 
Compressed learning for time series classification
Compressed learning for time series classificationCompressed learning for time series classification
Compressed learning for time series classification
 
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHMADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
 
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHMADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
 
Integrating Adaptation Mechanisms Using Control Theory Centric Architecture M...
Integrating Adaptation Mechanisms Using Control Theory Centric Architecture M...Integrating Adaptation Mechanisms Using Control Theory Centric Architecture M...
Integrating Adaptation Mechanisms Using Control Theory Centric Architecture M...
 
Geothermal Reserves Assessment
Geothermal Reserves AssessmentGeothermal Reserves Assessment
Geothermal Reserves Assessment
 
Java Performance Puzzlers
Java Performance PuzzlersJava Performance Puzzlers
Java Performance Puzzlers
 
Complex models in ecology: challenges and solutions
Complex models in ecology: challenges and solutionsComplex models in ecology: challenges and solutions
Complex models in ecology: challenges and solutions
 

Kürzlich hochgeladen

Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPirithiRaju
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bSérgio Sacani
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusNazaninKarimi6
 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICEayushi9330
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPirithiRaju
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryAlex Henderson
 
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Servicenishacall1
 
chemical bonding Essentials of Physical Chemistry2.pdf
chemical bonding Essentials of Physical Chemistry2.pdfchemical bonding Essentials of Physical Chemistry2.pdf
chemical bonding Essentials of Physical Chemistry2.pdfTukamushabaBismark
 
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate ProfessorThyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate Professormuralinath2
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxseri bangash
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsSérgio Sacani
 
Dubai Call Girls Beauty Face Teen O525547819 Call Girls Dubai Young
Dubai Call Girls Beauty Face Teen O525547819 Call Girls Dubai YoungDubai Call Girls Beauty Face Teen O525547819 Call Girls Dubai Young
Dubai Call Girls Beauty Face Teen O525547819 Call Girls Dubai Youngkajalvid75
 
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLKochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLkantirani197
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000Sapana Sha
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learninglevieagacer
 
300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptxryanrooker
 
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...chandars293
 
Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Silpa
 

Kürzlich hochgeladen (20)

Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 
Clean In Place(CIP).pptx .
Clean In Place(CIP).pptx                 .Clean In Place(CIP).pptx                 .
Clean In Place(CIP).pptx .
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virus
 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
 
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
 
chemical bonding Essentials of Physical Chemistry2.pdf
chemical bonding Essentials of Physical Chemistry2.pdfchemical bonding Essentials of Physical Chemistry2.pdf
chemical bonding Essentials of Physical Chemistry2.pdf
 
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate ProfessorThyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptx
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 
Dubai Call Girls Beauty Face Teen O525547819 Call Girls Dubai Young
Dubai Call Girls Beauty Face Teen O525547819 Call Girls Dubai YoungDubai Call Girls Beauty Face Teen O525547819 Call Girls Dubai Young
Dubai Call Girls Beauty Face Teen O525547819 Call Girls Dubai Young
 
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLKochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learning
 
300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx
 
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
 
Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 

Zurich machine learning_vicensgaitan

  • 1. DATA DOPING Solution for "Flavour of Physics" challenge Dr. Vicens Gaitan Grupo AIA Heavy Flavour Data mining Workshop 18-20 February 2016 University of Zurich
  • 2. AGENDA • “Flavour of Physics” Kaggle Challenge • Why is so hard to “discover” the invariant mass? • How to win the challenge: leasons learned • Breaking the rules: Data Doping • Machine Learning in HEP • Conclusions
  • 3. “FLAVOUR OF PHYSICS” KAGGLE CHALLENGE
  • 4. WHY IS SO HARD TO “DISCOVER” THE INVARIANT MASS? • Fact 1: The tau invariant mass can be reconstructed with high accuracy from kinematic variables (p0,pt,eta) and/or lifetime & time of flight ptau<-function(d){ # muon mass in MeV/c^2 mmu = 105.6583715 # calculate tau energy tau_e = sqrt(d$p0_p ** 2 + mmu**2) + sqrt(d$p1_p ** 2 + mmu**2) +sqrt(d$p2_p ** 2 + mmu**2) # calculate pz of tau candidate tau_pz = d$p0_pt * sinh(d$p0_eta) + d$p1_pt * sinh(d$p1_eta)+d$p2_pt * sinh(d$p2_eta) # calculate momentum of tau candidate tau_p =sqrt(d$pt ** 2 + tau_pz ** 2) # calculate eta of tau candidate tau_eta = asinh(tau_pz / d$pt) # calculate mass of tau candidate tau_m2=tau_e ** 2 - tau_p ** 2 tau_m2[tau_m2<0]=0 tau_m_k=sqrt(tau_m2) #M = tau_p*LifeTime*c/FlightDistance #c Speed of light c= 299.792458 tau_m_t=tau_p*d$LifeTime*c/d$FlightDistance return(list(tau_e,tau_pz,tau_p,tau_eta,tau_m)) }
  • 5. WHY IS SO HARD TO “DISCOVER” THE INVARIANT MASS? • Fact 2: The reconstructed tau mass separates nicely signal and background because the background spectrum has a "hole" for decays coming from a true tau (Real data (background) in this window can contain “signal”) Train & Agreement Test
  • 6. WHY IS SO HARD TO “DISCOVER” THE INVARIANT MASS? • Toy models ( not taking into account agreement & correlation): • XGBoost • 3-fold CV • Par("max_depth"=5,"eta"=.1) 1. Using true mass : wAUC = 0.999999999999999(89) +/- 8.4e-16 (!!) 2. Using mass_k: wAUC = 0.997(51) +/- 0.00038 3. Using mass_t: wAUC = 0.996(81) +/- 0.00021 4. Using both wAUC = 0.999(83) +/- 0.00012 Reconstructed Mass is THE Golden Feature
  • 7. WHY IS SO HARD TO “DISCOVER” THE INVARIANT MASS? • BUT using as variables: c("p0_p","p1_p","p2_p","p0_pt","p1_pt","p2_pt","p0_eta","p1_eta","p2_eta","pt","LifeTime","FlightDistance") c("tau_e","tau_p","LifeTime","FlightDistance") wAUC = 0.85(15) +/- 0.0032 ?????? And trying to fit the mass with XGBoost : we obtain:
  • 8. WHY IS SO HARD TO “DISCOVER” THE INVARIANT MASS? The reason: Highly correlated variables: mass is an effect of 1 over 2500 • Solution: uncorrelate variables with PCA • wAUC = 0.947(73) +/- 0.00099 • Gradient Boosting Trees are not able build a representation of Invariant Mass • Maybe Deep Learning can do it?
  • 9. HOW TO WIN THE CHALLENGE: LEASONS LEARNED Recipe to win the challenge 1. Add the reconstructed tau mass ( don’t bother about mass correlation test) 2. Use all available variables ( profit from bad simulated MC variables to separate signal from real background) AUCw=0.9999920 CVM=0.0848 K-S=0.2226 3. Hack the Correlation and Agreement Test ;)
  • 10. HOW TO WIN THE CHALLENGE: LEASONS LEARNED Define p’ = .99 * p ^5000 + .01 * RND and calculate the CVM CVM= 0.85 Correlation Test : Correlation between classifier output p and mass over a rolling window: CVM= 0.85 • CVM= 0.0012 !! • p’ has very similar AUC as p • p’ for agreement sample ->0 because p’ agreement <<1 • Useless for physics : Just exploiting the background mass gap AUCw= 0.9999921 CVM=0.0014 K-S=0.0088
  • 11. BREAKING THE RULES: DATA DOPING • Recipe to build a physically sound classifier: 1. Not to use reconstructed mass, nor features allowing easy mass reconstruction 2. Try to not use variable regions for which the Monte Carlo simulation doesn't agree with real data In order to fullfill 2 we have to break the rules and take a look to the control channel Goal: train a classifier able to separate A from B, but not C from D Max(wAUC(A,B)) with KS(C,D)<epsilon Hypothesis: Control Channel & Analysis channel share the same MC “defects” MC Real Data Control Channel Analysis Channel Known Physics Known Physics New Physics Background A B C D
  • 12. BREAKING THE RULES: DATA DOPING • The idea is to "dope" (in the semiconductor meaning) the training set with a small number of Monte Carlo events from the control channel , but labeled as background. This disallow the classifier to pick features discriminating data and Monte Carlo. There are two parameters that regularize the learning: • The number of "doping" events • the complexity of the classifier (for instance number of trees) MC Real Data Control Channel Analysis Channel Known Physics Known Physics New Physics Background A B C D
  • 13. BREAKING THE RULES: DATA DOPING Free classifier Doping events: 2000 Doping events: 3000 Grid search over Classifier complexity (n_ trees) and Number (weight) of doping events Dammit! A new hyperparameter….
  • 14. BREAKING THE RULES: DATA DOPING 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 0.12 0.13 0.14 0.15 0.986 0.987 0.988 0.989 0.99 0.991 0.992 0.993 0.994 2500 3000 3100 3200 3500 3900 5000 4000 AUCw= 0.9893 CVM=0.0011 K-S=0.087 CV wAUC Private LB KS Optimal Doping Events: 3960
  • 15. BREAKING THE RULES: DATA DOPING Analysis Channel Control channel Correlation Test • Good discriminating power in analysis channel • No separation for the control channel • Classifier not correlated with mass • Probably good for physics….
  • 16. MACHINE LEARNING IN HEP Today we have • the right tools • data availability • the computer power • But new physics is difficult to discover unless you know what are you looking for… • A complementary approach can be to use unsupervised learning (only real data driven, we have lots of them)
  • 17. MACHINE LEARNING IN HEP Example: exploring tau decay at LEP (ALEPH 1993) (yes, e+ e- physics is cleaner…) Feeding an autoencoder with “elaborated” detector data we are able to “discover” different decay modes looking at the compressed representation without a physics model (MC) Today is possible to do “end to end” autoencoding from raw detector data
  • 18. MACHINE LEARNING IN HEP Example: t-sne with the challenge data: Look at the fine structure…. MC: Control Channel Signal Real Data (all you see is real!) Control Channel Background Test
  • 19. CONCLUSIONS • Machine learning algorithms alone can fail to discover tiny effects in the data (1777 MeV is only 1.e-4 of the energy at the center of mass) • Use your knowledge: Try to reduce your data using fundamental simmetries, like Lorentz invariance, detector geometry... • Be aware of the test you are using to assure the classifier validity: • If a test can be “hacked” an enough powerfull machine learning algorithm will find the way • If it is possible, try to use non supervised methods( without MC) to gain insight in your data