SlideShare a Scribd company logo
1 of 51
Download to read offline
Center for Data Science
Paris-Saclay1
CNRS & University Paris Saclay
BALÁZS KÉGL
THE SYSTEMIC CHALLENGES OF DATA
SCIENCE INITIATIVES
(AND SOME SOLUTIONS)
2
I will not talk about science
3
I will talk about 

management 

(of) (data) science
Center for Data Science
Paris-Saclay
• My eight-year of experience interfacing
between high-energy physics and data science	

• Our two-year experience of running PS-CDS	

• Extensive collaboration with management
scientist
4
WHERE DOES IT COME FROM?
Center for Data Science
Paris-Saclay
• Intro: data science, the PS-CDS	

• The data science ecosystem: challenges	

• Some tools
5
OUTLINE
Center for Data Science
Paris-Saclay
DATA SCIENCE IN THE WORLD
6
Center for Data Science
Paris-Saclay7
UNIVERSITÉ PARIS-SACLAY
19 founding partners
Center for Data Science
Paris-Saclay
UNIVERSITÉ PARIS-SACLAY
8
+ horizontal multi-disciplinary and multi-partner
initiatives to create cohesion
Center for Data Science
Paris-Saclay9
Center for Data Science
Paris-Saclay
A multi-disciplinary initiative to define, structure, and manage
the data science ecosystem at the Université Paris-Saclay
http://www.datascience-paris-saclay.fr/
Biology & bioinformatics
IBISC/UEvry
LRI/UPSud
Hepatinov
CESP/UPSud-UVSQ-Inserm
IGM-I2BC/UPSud
MIA/Agro
MIAj-MIG/INRA
LMAS/Centrale
Chemistry
EA4041/UPSud
Earth sciences
LATMOS/UVSQ
GEOPS/UPSud
IPSL/UVSQ
LSCE/UVSQ
LMD/Polytechnique
Economy
LM/ENSAE
RITM/UPSud
LFA/ENSAE
Neuroscience
UNICOG/Inserm
U1000/Inserm
NeuroSpin/CEA
Particle physics
astrophysics &
cosmology
LPP/Polytechnique
DMPH/ONERA
CosmoStat/CEA
IAS/UPSud
AIM/CEA
LAL/UPSud
250researchers in 35laboratories
Machine learning
LRI/UPSud
LTCI/Telecom
CMLA/Cachan
LS/ENSAE
LIX/Polytechnique
MIA/Agro
CMA/Polytechnique
LSS/Supélec
CVN/Centrale
LMAS/Centrale
DTIM/ONERA
IBISC/UEvry
Visualization
INRIA
LIMSI
Signal processing
LTCI/Telecom
CMA/Polytechnique
CVN/Centrale
LSS/Supélec
CMLA/Cachan
LIMSI
DTIM/ONERA
Statistics
LMO/UPSud
LS/ENSAE
LSS/Supélec
CMA/Polytechnique
LMAS/Centrale
MIA/AgroParisTech
machine learning
information retrieval
signal processing
data visualization
databases
Domain science
human society
life
brain
earth
universe
Tool building
software engineering
clouds/grids
high-performance
computing
optimization
Domain scientistSoftware engineer
datascience-paris-saclay.fr
LIST/CEA
Center for Data Science
Paris-Saclay
DATA SCIENCE
10
Design of automated methods
to analyze massive and complex data
to extract useful information
Center for Data Science
Paris-Saclay11
CENTER FOR DATA SCIENCE

=

DATA CENTER

We are focusing on inference:
data knowledge
Interfacing with HPC, cloud, storage, production,
privacy, security
12
WHAT IS NEW?
“As the flow of data increases, it is
increasingly processed, analyzed,
and acted upon by machines, not
humans.”
NYU-CDS manifesto
• We have the data	

• statistical / physical modeling is less important	

• data-driven prediction	

• We have the computational power	

• We have the algorithms
• deep learning breakthrough: image, speech, language	

• closing on AI, step by step
13
WHAT IS NEW?
Center for Data Science
Paris-Saclay14
https://medium.com/@balazskegl
Domain science
energy and physical sciences
health and life sciences
Earth and environment
economy and society
brain
15
THE DATA SCIENCE LANDSCAPE
Data scientist
Data trainer
Applied scientist
Domain scientistSoftware engineer
Data engineer
Data science
statistics

machine learning
information retrieval
signal processing
data visualization
databases
Tool building
software engineering

clouds/grids
high-performance

computing
optimization
Center for Data Science
Paris-Saclay
• (The lack of) manpower	

• especially at the interfaces	

• industrial brain-drain	

• Incentives	

• data scientists are not incentivized to work on domain science	

• scientists are not incentivized to work on tools	

• Access	

• no well-developed channels to identify the right experts for a given problem	

• Tools	

• few tools that can help domain scientists and data scientists to collaborate efficiently
16
CHALLENGES
Center for Data Science
Paris-Saclay
TOOLS
17
We are designing and learning to manage tools to
accompany data science projects with different needs
Center for Data Science
Paris-Saclay
TOOLS: LANDSCAPE TO ECOSYSTEM
18
Data scientist
Data trainer
Applied scientist
Domain expertSoftware engineer
Data engineer
Tool building Data domains
Data science
statistics

machine learning
information retrieval
signal processing
data visualization
databases
• interdisciplinary projects
• matchmaking tool
• design and innovation strategy workshops
• data challenges
• coding sprints
• Open Software Initiative
• code consolidator and engineering projects
software engineering

clouds/grids
high-performance

computing
optimization
energy and physical sciences
health and life sciences
Earth and environment
economy and society
brain
• data science RAMPs and TSs
• IT platform for linked data
• annotation tools
• SaaS data science platform
19
• Efficient exploration of the space of innovative ideas
• Communication, knowledge sharing	

• Project building
DESIGNING DATA SCIENCE PROJECTS
Center for Data Science
Paris-Saclay
DESIGNING DATA SCIENCE PROJECTS
20
Data value
Data analytics
Exploration of value
!
• design theory
• data-based prospection
• innovation workshops
Problem formulation
Problem solving
!
• specialized teams
• RAMPs / training sprints
• data challenges
Center for Data Science
Paris-Saclay
• Putting domain scientists, data scientists, and management
scientist in the same room	

• Getting them understand each other	

• Keeping them collectively creative	

• The goal: identifying and defining projects	

• low-hanging fruits	

• breakthrough projects	

• long-term vision
21
DESIGN AND INNOVATION STRATEGY WORKSHOPS
Center for Data Science
Paris-Saclay
innovative design 	

=	

interaction and joint
expansion of concepts
and knowledge
22
DESIGN AND INNOVATION STRATEGY WORKSHOPS
C/K design theory
Center for Data Science
Paris-Saclay23
DESIGN AND INNOVATION STRATEGY WORKSHOPS
[P]$Project$
building!
Ini3alisa3on$
[K]$Knowledge$
sharing$
Workshops$
[C]$IFM?Design$
Workshops$
[RUN]!
DKCP process: linearizing C-K dynamics
Center for Data Science
Paris-Saclay
THREE ANALYTICS TOOLS FOR INITIATING	

DOMAIN-DATA SCIENCE INTERACTIONS
24
RAPID ANALYTICS AND
MODEL PROTOTYPING
(RAMP)
DATA CHALLENGES
TRAINING SPRINTS (TS)
Center for Data Science
Paris-Saclay
DATA CHALLENGES
25
Center for Data Science
Paris-Saclay
DATA CHALLENGES
26
• A data challenge is a recently developed unconventional
dissemination and communication tool	

• a scientific or industrial data producer arrives with a well-defined problem
and a corresponding annotated data set	

• defines a quantitative goal	

• makes the problem and part of the data set (the training set) public on a
dedicated site	

• data science experts then take the public training data and submit solutions
(predictions) for a test set with hidden annotations	

• submissions are evaluated numerically using the quantitative measure	

• contestants are listed on a leaderboard	

• after a predefined time, typically a couple of months, the final results are
revealed and the winners are awarded
Center for Data Science
Paris-Saclay
DATA CHALLENGES
27
• The HiggsML challenge on Kaggle	

• https://www.kaggle.com/c/higgs-boson
Center for Data Science
Paris-Saclay
HUGE PUBLICITY	

28
CLASSIFICATION FOR DISCOVERY
14
Center for Data Science
Paris-Saclay
SIGNIFICANT IMPROVEMENT OVER THE BASELINE
29
CLASSIFICATION FOR DISCOVERY
15
30
HUGE PUBLICITY	

SIGNIFICANT IMPROVEMENT OVER THE BASELINE
yet partially missing the objectives
• Challenges are useful for	

• generating visibility in the data science community about novel
application domains	

• benchmarking in a fair way state-of-the-art techniques on
well-defined problems	

• finding talented data scientists	

• Limitations	

• not necessary adapted to solving complex and open-ended
data science problems in realistic environments	

• no direct access to solutions and data scientist	

• emphasizes competition
31
DATA CHALLENGES
32
We decided to design something better
33
RAPID ANALYTICS AND MODEL PROTOTYPING (RAMP)
• Prototyping
• Training	

• Collaboration building
Center for Data Science
Paris-Saclay34
RAMPS
• Single-day coding sessions
• 20-40 participants	

• preparation is similar to challenges
• Goals	

• focusing and motivating top talents	

• promoting collaboration, speed, and efficiency	

• solving (prototyping) real problems
35
TRAINING SPRINTS
• Single-day training sessions
• 20-40 participants	

• focusing on a single subject (deep learning, model tuning, functional
data, etc.)	

• preparing RAMPs
36
ANALYTICS TOOLS TO PROMOTE 	

COLLABORATION AND CODE REUSE
Center for Data Science
Paris-Saclay
SIGNIFICANT IMPROVEMENT OVER THE BASELINE
37
CLASSIFICATION FOR DISCOVERY
15
Center for Data Science
Paris-Saclay38
ANALYTICS TOOL TO PROMOTE 	

COLLABORATION AND CODE REUSE
Center for Data Science
Paris-Saclay
ANALYTICS TOOLS TO MONITOR PROGRESS
39
Center for Data Science
Paris-Saclay
RAPID ANALYTICS AND MODEL PROTOTYPING
2015 Jan 15
The HiggsML challenge
40
Center for Data Science
Paris-Saclay
RAPID ANALYTICS AND MODEL PROTOTYPING
2015 Apr 10
Classifying variable stars
41
Center for Data Science
Paris-Saclay
VARIABLE STARS
42
Learning to discoverB. Kégl / CNRS - Saclay
VARIABLE STARS
43
accuracy improvement: 89% to 96%
Center for Data Science
Paris-Saclay
RAPID ANALYTICS AND MODEL PROTOTYPING
2015 June 16 and Sept 26
Predicting El Nino
44
45
RAPID ANALYTICS AND MODEL PROTOTYPING
RMSE improvement: 0.9˚C to 0.4˚C
46
2015 October 8
Insect classification
RAPID ANALYTICS AND MODEL PROTOTYPING
47
RAPID ANALYTICS AND MODEL PROTOTYPING
accuracy improvement: 30% to 70%
48
2015 Fall
Drug identification from spectra
RAPID ANALYTICS AND MODEL PROTOTYPING
Center for Data Science
Paris-Saclay49
THANK YOU!
Center for Data Science
Paris-Saclay
• A window to open data at Paris-Saclay	

• We are not storing or handling existing large data sets	

• Rather indexing, linking, and mapping, embedding in the
worldwide linked data (RDF) ecosystem	

• Storing small data sets of small teams is possible	

• Subsets of large sets for prototyping	

• Or simply store metadata plus pointer
50
IT PLATFORM FOR LINKED DATA
http://io.datascience-paris-saclay.fr/
Center for Data Science
Paris-Saclay
IT PLATFORM FOR LINKED DATA
51

More Related Content

Similar to The systemic challenges in data science initiatives (and some solutions)

Best Selenium certification course
Best Selenium certification courseBest Selenium certification course
Best Selenium certification courseKumarNaik21
 
Which institute is best for data science?
Which institute is best for data science?Which institute is best for data science?
Which institute is best for data science?DIGITALSAI1
 
Best Selenium certification course
Best Selenium certification courseBest Selenium certification course
Best Selenium certification courseKumarNaik21
 
Data science training in hyd ppt (1)
Data science training in hyd ppt (1)Data science training in hyd ppt (1)
Data science training in hyd ppt (1)SayyedYusufali
 
Data science training institute in hyderabad
Data science training institute in hyderabadData science training institute in hyderabad
Data science training institute in hyderabadVamsiNihal
 
Data science training in Hyderabad
Data science  training in HyderabadData science  training in Hyderabad
Data science training in Hyderabadsaitejavella
 
Data science training Hyderabad
Data science training HyderabadData science training Hyderabad
Data science training HyderabadNithinsunil1
 
Data science online training in hyderabad
Data science online training in hyderabadData science online training in hyderabad
Data science online training in hyderabadVamsiNihal
 
Data science training in hyd ppt (1)
Data science training in hyd ppt (1)Data science training in hyd ppt (1)
Data science training in hyd ppt (1)SayyedYusufali
 
data science training and placement
data science training and placementdata science training and placement
data science training and placementSaiprasadVella
 
online data science training
online data science trainingonline data science training
online data science trainingDIGITALSAI1
 
Data science online training in hyderabad
Data science online training in hyderabadData science online training in hyderabad
Data science online training in hyderabadVamsiNihal
 
data science online training in hyderabad
data science online training in hyderabaddata science online training in hyderabad
data science online training in hyderabadVamsiNihal
 
Best data science training in Hyderabad
Best data science training in HyderabadBest data science training in Hyderabad
Best data science training in HyderabadKumarNaik21
 
Data science training Hyderabad
Data science training HyderabadData science training Hyderabad
Data science training HyderabadNithinsunil1
 
Data science training in hyd ppt converted (1)
Data science training in hyd ppt converted (1)Data science training in hyd ppt converted (1)
Data science training in hyd ppt converted (1)SayyedYusufali
 
Data science training in hyd pdf converted (1)
Data science training in hyd pdf converted (1)Data science training in hyd pdf converted (1)
Data science training in hyd pdf converted (1)SayyedYusufali
 
Data science training in hydpdf converted (1)
Data science training in hydpdf  converted (1)Data science training in hydpdf  converted (1)
Data science training in hydpdf converted (1)SayyedYusufali
 
1. Overview_of_data_analytics (1).pdf
1. Overview_of_data_analytics (1).pdf1. Overview_of_data_analytics (1).pdf
1. Overview_of_data_analytics (1).pdfAyele40
 
Building the Data Science Profession in Europe
Building the Data Science Profession in EuropeBuilding the Data Science Profession in Europe
Building the Data Science Profession in EuropeSteven Miller
 

Similar to The systemic challenges in data science initiatives (and some solutions) (20)

Best Selenium certification course
Best Selenium certification courseBest Selenium certification course
Best Selenium certification course
 
Which institute is best for data science?
Which institute is best for data science?Which institute is best for data science?
Which institute is best for data science?
 
Best Selenium certification course
Best Selenium certification courseBest Selenium certification course
Best Selenium certification course
 
Data science training in hyd ppt (1)
Data science training in hyd ppt (1)Data science training in hyd ppt (1)
Data science training in hyd ppt (1)
 
Data science training institute in hyderabad
Data science training institute in hyderabadData science training institute in hyderabad
Data science training institute in hyderabad
 
Data science training in Hyderabad
Data science  training in HyderabadData science  training in Hyderabad
Data science training in Hyderabad
 
Data science training Hyderabad
Data science training HyderabadData science training Hyderabad
Data science training Hyderabad
 
Data science online training in hyderabad
Data science online training in hyderabadData science online training in hyderabad
Data science online training in hyderabad
 
Data science training in hyd ppt (1)
Data science training in hyd ppt (1)Data science training in hyd ppt (1)
Data science training in hyd ppt (1)
 
data science training and placement
data science training and placementdata science training and placement
data science training and placement
 
online data science training
online data science trainingonline data science training
online data science training
 
Data science online training in hyderabad
Data science online training in hyderabadData science online training in hyderabad
Data science online training in hyderabad
 
data science online training in hyderabad
data science online training in hyderabaddata science online training in hyderabad
data science online training in hyderabad
 
Best data science training in Hyderabad
Best data science training in HyderabadBest data science training in Hyderabad
Best data science training in Hyderabad
 
Data science training Hyderabad
Data science training HyderabadData science training Hyderabad
Data science training Hyderabad
 
Data science training in hyd ppt converted (1)
Data science training in hyd ppt converted (1)Data science training in hyd ppt converted (1)
Data science training in hyd ppt converted (1)
 
Data science training in hyd pdf converted (1)
Data science training in hyd pdf converted (1)Data science training in hyd pdf converted (1)
Data science training in hyd pdf converted (1)
 
Data science training in hydpdf converted (1)
Data science training in hydpdf  converted (1)Data science training in hydpdf  converted (1)
Data science training in hydpdf converted (1)
 
1. Overview_of_data_analytics (1).pdf
1. Overview_of_data_analytics (1).pdf1. Overview_of_data_analytics (1).pdf
1. Overview_of_data_analytics (1).pdf
 
Building the Data Science Profession in Europe
Building the Data Science Profession in EuropeBuilding the Data Science Profession in Europe
Building the Data Science Profession in Europe
 

More from Balázs Kégl

Data-driven hypothesis generation using deep neural nets
Data-driven hypothesis generation using deep neural netsData-driven hypothesis generation using deep neural nets
Data-driven hypothesis generation using deep neural netsBalázs Kégl
 
Model-based reinforcement learning and self-driving engineering systems
Model-based reinforcement learning and self-driving engineering systemsModel-based reinforcement learning and self-driving engineering systems
Model-based reinforcement learning and self-driving engineering systemsBalázs Kégl
 
Managing the AI process: putting humans (back) in the loop
Managing the AI process: putting humans (back) in the loopManaging the AI process: putting humans (back) in the loop
Managing the AI process: putting humans (back) in the loopBalázs Kégl
 
DARMDN: Deep autoregressive mixture density nets for dynamical system mode...
   DARMDN: Deep autoregressive mixture density nets for dynamical system mode...   DARMDN: Deep autoregressive mixture density nets for dynamical system mode...
DARMDN: Deep autoregressive mixture density nets for dynamical system mode...Balázs Kégl
 
Machine learning in scientific workflows
Machine learning in scientific workflowsMachine learning in scientific workflows
Machine learning in scientific workflowsBalázs Kégl
 
A historical introduction to deep learning: hardware, data, and tricks
A historical introduction to deep learning: hardware, data, and tricksA historical introduction to deep learning: hardware, data, and tricks
A historical introduction to deep learning: hardware, data, and tricksBalázs Kégl
 
Build your own data challenge, or just organize team work
Build your own data challenge, or just organize team workBuild your own data challenge, or just organize team work
Build your own data challenge, or just organize team workBalázs Kégl
 
Learning do discover: machine learning in high-energy physics
Learning do discover: machine learning in high-energy physicsLearning do discover: machine learning in high-energy physics
Learning do discover: machine learning in high-energy physicsBalázs Kégl
 

More from Balázs Kégl (8)

Data-driven hypothesis generation using deep neural nets
Data-driven hypothesis generation using deep neural netsData-driven hypothesis generation using deep neural nets
Data-driven hypothesis generation using deep neural nets
 
Model-based reinforcement learning and self-driving engineering systems
Model-based reinforcement learning and self-driving engineering systemsModel-based reinforcement learning and self-driving engineering systems
Model-based reinforcement learning and self-driving engineering systems
 
Managing the AI process: putting humans (back) in the loop
Managing the AI process: putting humans (back) in the loopManaging the AI process: putting humans (back) in the loop
Managing the AI process: putting humans (back) in the loop
 
DARMDN: Deep autoregressive mixture density nets for dynamical system mode...
   DARMDN: Deep autoregressive mixture density nets for dynamical system mode...   DARMDN: Deep autoregressive mixture density nets for dynamical system mode...
DARMDN: Deep autoregressive mixture density nets for dynamical system mode...
 
Machine learning in scientific workflows
Machine learning in scientific workflowsMachine learning in scientific workflows
Machine learning in scientific workflows
 
A historical introduction to deep learning: hardware, data, and tricks
A historical introduction to deep learning: hardware, data, and tricksA historical introduction to deep learning: hardware, data, and tricks
A historical introduction to deep learning: hardware, data, and tricks
 
Build your own data challenge, or just organize team work
Build your own data challenge, or just organize team workBuild your own data challenge, or just organize team work
Build your own data challenge, or just organize team work
 
Learning do discover: machine learning in high-energy physics
Learning do discover: machine learning in high-energy physicsLearning do discover: machine learning in high-energy physics
Learning do discover: machine learning in high-energy physics
 

Recently uploaded

Pests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdfPests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdfPirithiRaju
 
《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》rnrncn29
 
Pests of Bengal gram_Identification_Dr.UPR.pdf
Pests of Bengal gram_Identification_Dr.UPR.pdfPests of Bengal gram_Identification_Dr.UPR.pdf
Pests of Bengal gram_Identification_Dr.UPR.pdfPirithiRaju
 
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptxTHE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptxNandakishor Bhaurao Deshmukh
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Pests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdfPests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdfPirithiRaju
 
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPirithiRaju
 
Four Spheres of the Earth Presentation.ppt
Four Spheres of the Earth Presentation.pptFour Spheres of the Earth Presentation.ppt
Four Spheres of the Earth Presentation.pptJoemSTuliba
 
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCRCall Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCRlizamodels9
 
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)riyaescorts54
 
Functional group interconversions(oxidation reduction)
Functional group interconversions(oxidation reduction)Functional group interconversions(oxidation reduction)
Functional group interconversions(oxidation reduction)itwameryclare
 
Environmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial BiosensorEnvironmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial Biosensorsonawaneprad
 
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxMicrophone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxpriyankatabhane
 
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)Columbia Weather Systems
 
Pests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdfPests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdfPirithiRaju
 
BUMI DAN ANTARIKSA PROJEK IPAS SMK KELAS X.pdf
BUMI DAN ANTARIKSA PROJEK IPAS SMK KELAS X.pdfBUMI DAN ANTARIKSA PROJEK IPAS SMK KELAS X.pdf
BUMI DAN ANTARIKSA PROJEK IPAS SMK KELAS X.pdfWildaNurAmalia2
 
RESPIRATORY ADAPTATIONS TO HYPOXIA IN HUMNAS.pptx
RESPIRATORY ADAPTATIONS TO HYPOXIA IN HUMNAS.pptxRESPIRATORY ADAPTATIONS TO HYPOXIA IN HUMNAS.pptx
RESPIRATORY ADAPTATIONS TO HYPOXIA IN HUMNAS.pptxFarihaAbdulRasheed
 
User Guide: Magellan MX™ Weather Station
User Guide: Magellan MX™ Weather StationUser Guide: Magellan MX™ Weather Station
User Guide: Magellan MX™ Weather StationColumbia Weather Systems
 
User Guide: Orion™ Weather Station (Columbia Weather Systems)
User Guide: Orion™ Weather Station (Columbia Weather Systems)User Guide: Orion™ Weather Station (Columbia Weather Systems)
User Guide: Orion™ Weather Station (Columbia Weather Systems)Columbia Weather Systems
 

Recently uploaded (20)

Pests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdfPests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdf
 
《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》
 
Pests of Bengal gram_Identification_Dr.UPR.pdf
Pests of Bengal gram_Identification_Dr.UPR.pdfPests of Bengal gram_Identification_Dr.UPR.pdf
Pests of Bengal gram_Identification_Dr.UPR.pdf
 
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptxTHE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
 
Pests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdfPests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdf
 
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
 
Four Spheres of the Earth Presentation.ppt
Four Spheres of the Earth Presentation.pptFour Spheres of the Earth Presentation.ppt
Four Spheres of the Earth Presentation.ppt
 
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCRCall Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
 
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
 
Functional group interconversions(oxidation reduction)
Functional group interconversions(oxidation reduction)Functional group interconversions(oxidation reduction)
Functional group interconversions(oxidation reduction)
 
Environmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial BiosensorEnvironmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial Biosensor
 
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxMicrophone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
 
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
 
Pests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdfPests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdf
 
BUMI DAN ANTARIKSA PROJEK IPAS SMK KELAS X.pdf
BUMI DAN ANTARIKSA PROJEK IPAS SMK KELAS X.pdfBUMI DAN ANTARIKSA PROJEK IPAS SMK KELAS X.pdf
BUMI DAN ANTARIKSA PROJEK IPAS SMK KELAS X.pdf
 
Volatile Oils Pharmacognosy And Phytochemistry -I
Volatile Oils Pharmacognosy And Phytochemistry -IVolatile Oils Pharmacognosy And Phytochemistry -I
Volatile Oils Pharmacognosy And Phytochemistry -I
 
RESPIRATORY ADAPTATIONS TO HYPOXIA IN HUMNAS.pptx
RESPIRATORY ADAPTATIONS TO HYPOXIA IN HUMNAS.pptxRESPIRATORY ADAPTATIONS TO HYPOXIA IN HUMNAS.pptx
RESPIRATORY ADAPTATIONS TO HYPOXIA IN HUMNAS.pptx
 
User Guide: Magellan MX™ Weather Station
User Guide: Magellan MX™ Weather StationUser Guide: Magellan MX™ Weather Station
User Guide: Magellan MX™ Weather Station
 
User Guide: Orion™ Weather Station (Columbia Weather Systems)
User Guide: Orion™ Weather Station (Columbia Weather Systems)User Guide: Orion™ Weather Station (Columbia Weather Systems)
User Guide: Orion™ Weather Station (Columbia Weather Systems)
 

The systemic challenges in data science initiatives (and some solutions)

  • 1. Center for Data Science Paris-Saclay1 CNRS & University Paris Saclay BALÁZS KÉGL THE SYSTEMIC CHALLENGES OF DATA SCIENCE INITIATIVES (AND SOME SOLUTIONS)
  • 2. 2 I will not talk about science
  • 3. 3 I will talk about 
 management 
 (of) (data) science
  • 4. Center for Data Science Paris-Saclay • My eight-year of experience interfacing between high-energy physics and data science • Our two-year experience of running PS-CDS • Extensive collaboration with management scientist 4 WHERE DOES IT COME FROM?
  • 5. Center for Data Science Paris-Saclay • Intro: data science, the PS-CDS • The data science ecosystem: challenges • Some tools 5 OUTLINE
  • 6. Center for Data Science Paris-Saclay DATA SCIENCE IN THE WORLD 6
  • 7. Center for Data Science Paris-Saclay7 UNIVERSITÉ PARIS-SACLAY 19 founding partners
  • 8. Center for Data Science Paris-Saclay UNIVERSITÉ PARIS-SACLAY 8 + horizontal multi-disciplinary and multi-partner initiatives to create cohesion
  • 9. Center for Data Science Paris-Saclay9 Center for Data Science Paris-Saclay A multi-disciplinary initiative to define, structure, and manage the data science ecosystem at the Université Paris-Saclay http://www.datascience-paris-saclay.fr/ Biology & bioinformatics IBISC/UEvry LRI/UPSud Hepatinov CESP/UPSud-UVSQ-Inserm IGM-I2BC/UPSud MIA/Agro MIAj-MIG/INRA LMAS/Centrale Chemistry EA4041/UPSud Earth sciences LATMOS/UVSQ GEOPS/UPSud IPSL/UVSQ LSCE/UVSQ LMD/Polytechnique Economy LM/ENSAE RITM/UPSud LFA/ENSAE Neuroscience UNICOG/Inserm U1000/Inserm NeuroSpin/CEA Particle physics astrophysics & cosmology LPP/Polytechnique DMPH/ONERA CosmoStat/CEA IAS/UPSud AIM/CEA LAL/UPSud 250researchers in 35laboratories Machine learning LRI/UPSud LTCI/Telecom CMLA/Cachan LS/ENSAE LIX/Polytechnique MIA/Agro CMA/Polytechnique LSS/Supélec CVN/Centrale LMAS/Centrale DTIM/ONERA IBISC/UEvry Visualization INRIA LIMSI Signal processing LTCI/Telecom CMA/Polytechnique CVN/Centrale LSS/Supélec CMLA/Cachan LIMSI DTIM/ONERA Statistics LMO/UPSud LS/ENSAE LSS/Supélec CMA/Polytechnique LMAS/Centrale MIA/AgroParisTech machine learning information retrieval signal processing data visualization databases Domain science human society life brain earth universe Tool building software engineering clouds/grids high-performance computing optimization Domain scientistSoftware engineer datascience-paris-saclay.fr LIST/CEA
  • 10. Center for Data Science Paris-Saclay DATA SCIENCE 10 Design of automated methods to analyze massive and complex data to extract useful information
  • 11. Center for Data Science Paris-Saclay11 CENTER FOR DATA SCIENCE
 =
 DATA CENTER
 We are focusing on inference: data knowledge Interfacing with HPC, cloud, storage, production, privacy, security
  • 12. 12 WHAT IS NEW? “As the flow of data increases, it is increasingly processed, analyzed, and acted upon by machines, not humans.” NYU-CDS manifesto
  • 13. • We have the data • statistical / physical modeling is less important • data-driven prediction • We have the computational power • We have the algorithms • deep learning breakthrough: image, speech, language • closing on AI, step by step 13 WHAT IS NEW?
  • 14. Center for Data Science Paris-Saclay14 https://medium.com/@balazskegl
  • 15. Domain science energy and physical sciences health and life sciences Earth and environment economy and society brain 15 THE DATA SCIENCE LANDSCAPE Data scientist Data trainer Applied scientist Domain scientistSoftware engineer Data engineer Data science statistics
 machine learning information retrieval signal processing data visualization databases Tool building software engineering
 clouds/grids high-performance
 computing optimization
  • 16. Center for Data Science Paris-Saclay • (The lack of) manpower • especially at the interfaces • industrial brain-drain • Incentives • data scientists are not incentivized to work on domain science • scientists are not incentivized to work on tools • Access • no well-developed channels to identify the right experts for a given problem • Tools • few tools that can help domain scientists and data scientists to collaborate efficiently 16 CHALLENGES
  • 17. Center for Data Science Paris-Saclay TOOLS 17 We are designing and learning to manage tools to accompany data science projects with different needs
  • 18. Center for Data Science Paris-Saclay TOOLS: LANDSCAPE TO ECOSYSTEM 18 Data scientist Data trainer Applied scientist Domain expertSoftware engineer Data engineer Tool building Data domains Data science statistics
 machine learning information retrieval signal processing data visualization databases • interdisciplinary projects • matchmaking tool • design and innovation strategy workshops • data challenges • coding sprints • Open Software Initiative • code consolidator and engineering projects software engineering
 clouds/grids high-performance
 computing optimization energy and physical sciences health and life sciences Earth and environment economy and society brain • data science RAMPs and TSs • IT platform for linked data • annotation tools • SaaS data science platform
  • 19. 19 • Efficient exploration of the space of innovative ideas • Communication, knowledge sharing • Project building DESIGNING DATA SCIENCE PROJECTS
  • 20. Center for Data Science Paris-Saclay DESIGNING DATA SCIENCE PROJECTS 20 Data value Data analytics Exploration of value ! • design theory • data-based prospection • innovation workshops Problem formulation Problem solving ! • specialized teams • RAMPs / training sprints • data challenges
  • 21. Center for Data Science Paris-Saclay • Putting domain scientists, data scientists, and management scientist in the same room • Getting them understand each other • Keeping them collectively creative • The goal: identifying and defining projects • low-hanging fruits • breakthrough projects • long-term vision 21 DESIGN AND INNOVATION STRATEGY WORKSHOPS
  • 22. Center for Data Science Paris-Saclay innovative design = interaction and joint expansion of concepts and knowledge 22 DESIGN AND INNOVATION STRATEGY WORKSHOPS C/K design theory
  • 23. Center for Data Science Paris-Saclay23 DESIGN AND INNOVATION STRATEGY WORKSHOPS [P]$Project$ building! Ini3alisa3on$ [K]$Knowledge$ sharing$ Workshops$ [C]$IFM?Design$ Workshops$ [RUN]! DKCP process: linearizing C-K dynamics
  • 24. Center for Data Science Paris-Saclay THREE ANALYTICS TOOLS FOR INITIATING DOMAIN-DATA SCIENCE INTERACTIONS 24 RAPID ANALYTICS AND MODEL PROTOTYPING (RAMP) DATA CHALLENGES TRAINING SPRINTS (TS)
  • 25. Center for Data Science Paris-Saclay DATA CHALLENGES 25
  • 26. Center for Data Science Paris-Saclay DATA CHALLENGES 26 • A data challenge is a recently developed unconventional dissemination and communication tool • a scientific or industrial data producer arrives with a well-defined problem and a corresponding annotated data set • defines a quantitative goal • makes the problem and part of the data set (the training set) public on a dedicated site • data science experts then take the public training data and submit solutions (predictions) for a test set with hidden annotations • submissions are evaluated numerically using the quantitative measure • contestants are listed on a leaderboard • after a predefined time, typically a couple of months, the final results are revealed and the winners are awarded
  • 27. Center for Data Science Paris-Saclay DATA CHALLENGES 27 • The HiggsML challenge on Kaggle • https://www.kaggle.com/c/higgs-boson
  • 28. Center for Data Science Paris-Saclay HUGE PUBLICITY 28 CLASSIFICATION FOR DISCOVERY 14
  • 29. Center for Data Science Paris-Saclay SIGNIFICANT IMPROVEMENT OVER THE BASELINE 29 CLASSIFICATION FOR DISCOVERY 15
  • 30. 30 HUGE PUBLICITY SIGNIFICANT IMPROVEMENT OVER THE BASELINE yet partially missing the objectives
  • 31. • Challenges are useful for • generating visibility in the data science community about novel application domains • benchmarking in a fair way state-of-the-art techniques on well-defined problems • finding talented data scientists • Limitations • not necessary adapted to solving complex and open-ended data science problems in realistic environments • no direct access to solutions and data scientist • emphasizes competition 31 DATA CHALLENGES
  • 32. 32 We decided to design something better
  • 33. 33 RAPID ANALYTICS AND MODEL PROTOTYPING (RAMP) • Prototyping • Training • Collaboration building
  • 34. Center for Data Science Paris-Saclay34 RAMPS • Single-day coding sessions • 20-40 participants • preparation is similar to challenges • Goals • focusing and motivating top talents • promoting collaboration, speed, and efficiency • solving (prototyping) real problems
  • 35. 35 TRAINING SPRINTS • Single-day training sessions • 20-40 participants • focusing on a single subject (deep learning, model tuning, functional data, etc.) • preparing RAMPs
  • 36. 36 ANALYTICS TOOLS TO PROMOTE COLLABORATION AND CODE REUSE
  • 37. Center for Data Science Paris-Saclay SIGNIFICANT IMPROVEMENT OVER THE BASELINE 37 CLASSIFICATION FOR DISCOVERY 15
  • 38. Center for Data Science Paris-Saclay38 ANALYTICS TOOL TO PROMOTE COLLABORATION AND CODE REUSE
  • 39. Center for Data Science Paris-Saclay ANALYTICS TOOLS TO MONITOR PROGRESS 39
  • 40. Center for Data Science Paris-Saclay RAPID ANALYTICS AND MODEL PROTOTYPING 2015 Jan 15 The HiggsML challenge 40
  • 41. Center for Data Science Paris-Saclay RAPID ANALYTICS AND MODEL PROTOTYPING 2015 Apr 10 Classifying variable stars 41
  • 42. Center for Data Science Paris-Saclay VARIABLE STARS 42
  • 43. Learning to discoverB. Kégl / CNRS - Saclay VARIABLE STARS 43 accuracy improvement: 89% to 96%
  • 44. Center for Data Science Paris-Saclay RAPID ANALYTICS AND MODEL PROTOTYPING 2015 June 16 and Sept 26 Predicting El Nino 44
  • 45. 45 RAPID ANALYTICS AND MODEL PROTOTYPING RMSE improvement: 0.9˚C to 0.4˚C
  • 46. 46 2015 October 8 Insect classification RAPID ANALYTICS AND MODEL PROTOTYPING
  • 47. 47 RAPID ANALYTICS AND MODEL PROTOTYPING accuracy improvement: 30% to 70%
  • 48. 48 2015 Fall Drug identification from spectra RAPID ANALYTICS AND MODEL PROTOTYPING
  • 49. Center for Data Science Paris-Saclay49 THANK YOU!
  • 50. Center for Data Science Paris-Saclay • A window to open data at Paris-Saclay • We are not storing or handling existing large data sets • Rather indexing, linking, and mapping, embedding in the worldwide linked data (RDF) ecosystem • Storing small data sets of small teams is possible • Subsets of large sets for prototyping • Or simply store metadata plus pointer 50 IT PLATFORM FOR LINKED DATA http://io.datascience-paris-saclay.fr/
  • 51. Center for Data Science Paris-Saclay IT PLATFORM FOR LINKED DATA 51