(Big) Data (Science) Skills
Big Data Value Association Summit in Madrid
17/06/2015
Oscar Corcho
ocorcho@fi.upm.es
@ocorcho
https://www.slideshare.com/ocorcho
License
• This work is licensed under the license
CC BY-NC-SA 4.0 International
• http://purl.org/NET/rdflicense/cc-by-nc-sa4.0
• You are free:
• to Share — to copy, distribute and transmit the work
• to Remix — to adapt the work
• Under the following conditions
• Attribution — You must attribute the work by inserting
• “[source Oscar Corcho]” at the footer of each reused slide
• a credits slide stating: “These slides are partially based on
“(Big) Data (Science) Skills” by O. Corcho”
• Non-commercial
• Share-Alike
Data Scientist: Technical and Soft Skills needed
• One of the two or three pictures expected from a talk on skills…
• I may start going through
• Each of these topics
• Discussing the specific skills needed
• However…
Sorry, looking for the reference to add here
What is Big Data?
Source: http://www.philipchircop.com/post/25783275888/seeing-the-full-elephant-its-a-tree-its-a
Big Data and the theory of ecological niches
Characteristics of an ecological niche
• A niche is defined by a spectrum of resource usage
• Species differ from each other in how efficient they are in
using resources that change continuously
• Characteristics of a niche
• Amplitude (range in which resources are used)
• Generalist species (they can use a wide range of
resources)
• Specialist species (they require a very specific
combination of resources)
• Overlap (similarity among niches in their usage of resources)
• Competitive exclusion principle (Gause, 1934)
• If two species coexist in a stable environment, they do so thanks to
the differentiation of their effective ecological niches.
Source: Javier Seoane. Ecología. Unidad Temática 21. Teoría del nicho ecológico
WHAT’S THE RELATIONSHIP
TO BIG DATA?
Well, that’s interesting, but…
Big Data Niche 1. HPC and e-Infrastructure Experts
Background: Computer Science (Systems)
System Administration
Terms used in their native language:
Blades, Infiniband, OpenMPI,
racks, HDF, TBs, Gflops
Their daily life:
Check system logs
Make sure that queues are active
Install a new rack
What’s Big Data for them?
A “commercial” term for something
that they have done for a long time
They really know how to configure
and monitor a Hadoop cluster
They would love to see those who talk
about Big Data running processes
on fluid dynamics
Big Data Niche 2. Data Storage and Access Experts
Background: Computer Science
Database administration
Terms used in their native language:
SQL, NoSQL, Column store
Transactions, Hive, TBs/PBs/…,
TPS (transactions per second)
Their daily life:
Optimise several queries
Run a new benchmark
Design an optimiser/physical operator
What’s Big Data for them?
A new opportunity to work on
optimisation algorithms
They know how to configure a database
They often laugh at those who deploy
a NoSQL solution for a problem
that can be solved with a
relational database
Big Data Niche 3. Machine Learning Experts
Background: Mathematics, Statistics,
Physics, Computer Science
Terms used in their native language:
Complexity, algorithm, p-value,
convergence, precision, recall
ROC curves, Bayesian networks, R
Their daily life:
Read about a new problem
Write down a few formulae on the
whiteboard (or even a blackboard)
Prove that the algorithm terminates
What’s Big Data for them?
The same problems applied to data of
larger size, with new challenges
Problems are not solved only with
Hadoop or a powerful NoSQL DB
Astonished by those who still mix up
correlation and causality
Big Data Niche 4. Slow-data Experts
Background: Computer Science, Statistics,
Library Sciences, Linguistics
Terms used in their native language:
Information model, vocabulary,
ontology, data quality, curation
Their daily life:
Receive a database schema
Talk to data producers and (re)users
Obtain consensus and transform data
What’s Big Data for them?
The difficulty lies in the variety of
data formats and structures
We may integrate data from varied
sources, although this is not
always possible
When you manage to integrate
heterogeneous data, you can achieve
better results
Big Data Niche 5. (Big Data) Consultants
Background: Computer Science, Economics,
…
Terms used in their native language:
Business model, business opportunity,
Big Data, Data Value Chain,
Hadoop, Spark, R, TBs, GFlops
Their daily life:
Read a Gartner Big Data report
Talk to potential customers
Transfer needs to technicians
What’s Big Data for them?
It’s the 4Vs, plus a few more
I have a PPT presentation with a
Big Data infrastructure,
architecture,
and previous projects, which I will
use to sell a project to my
customers
Are we missing any ecological niche?
• We have already seen a couple of ecological
niches…
• They all coexist
• Some of them are overlapping
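One rough way to put a number on that overlap is to compare the vocabularies listed for each niche. The sketch below is only an illustration: it treats each niche's "terms used in their native language" as a crude stand-in for resource usage and scores every pair of niches with the Jaccard index (shared terms divided by all terms). The term sets come from the slides; lower-casing them and splitting "TBs/PBs/…" into two terms are assumptions made here, and the names `NICHE_TERMS`, `jaccard` and `niche_overlaps` are hypothetical.

```python
from itertools import combinations

# Vocabulary of each niche, copied from the term lists on the slides.
# Assumptions of this sketch: terms are lower-cased, and "TBs/PBs/…" is
# split into "tbs" and "pbs" (the elided units are not filled in).
NICHE_TERMS = {
    "HPC and e-Infrastructure": {
        "blades", "infiniband", "openmpi", "racks", "hdf", "tbs", "gflops",
    },
    "Data Storage and Access": {
        "sql", "nosql", "column store", "transactions", "hive",
        "tbs", "pbs", "tps",
    },
    "Machine Learning": {
        "complexity", "algorithm", "p-value", "convergence", "precision",
        "recall", "roc curves", "bayesian networks", "r",
    },
    "Slow-data": {
        "information model", "vocabulary", "ontology", "data quality",
        "curation",
    },
    "Consultants": {
        "business model", "business opportunity", "big data",
        "data value chain", "hadoop", "spark", "r", "tbs", "gflops",
    },
}


def jaccard(a, b):
    """Jaccard index: size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def niche_overlaps(niches):
    """Return {(name_a, name_b): overlap} for every pair of niches."""
    return {
        (x, y): jaccard(niches[x], niches[y])
        for x, y in combinations(niches, 2)
    }


if __name__ == "__main__":
    # Print the pairs from most to least overlapping.
    ranked = sorted(niche_overlaps(NICHE_TERMS).items(), key=lambda kv: -kv[1])
    for (a, b), score in ranked:
        print(f"{a} / {b}: {score:.2f}")
```

With these word lists the scores are small and some pairs share no term at all, which matches the picture of niches that coexist with only partial overlap. The result depends entirely on which terms happen to be on each slide, so it should be read as an illustration rather than a measurement.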
Is there anyone who has not yet been
considered?
The evolution of a new species: the Data Scientist
Background: Computer Science+Statistics+
Mathematics+Economics+
…
Terms used in their new exotic language:
HPC, databases, algorithms,
harmonisation, integration,
Hadoop, Spark, R, TBs, GFlops
Their daily life:
Learn about a new infrastructure
Code scripts to be run on Spark
Interpret results
Install a new framework
Read a few scientific papers
Make shiny presentations
Describe in their blog the activities
that they do, so that Big Data is
better known and understood
…
© Volker Markl: “Data Scientist” – “Jack of All Trades!”
Application
Data
Science
Control Flow
Iterative Algorithms
Error Estimation
Active Sampling
Sketches
Curse of Dimensionality
Decoupling
Convergence
Monte Carlo
Mathematical Programming
Linear Algebra
Stochastic Gradient Descent
Regression
Statistics
Hashing
Parallelization
Query Optimization
Fault Tolerance
Relational Algebra / SQL
Scalability
Data Analysis Language
Compiler
Memory Management
Memory Hierarchy
Data Flow
Hardware Adaptation
Indexing
Resource Management
NF2 /XQuery
Data Warehouse/OLAP
Domain Expertise (e.g., Industry 4.0, Medicine, Physics, Engineering, Energy, Logistics)
Real-Time
Data Scientists and Pi-shaped people
• Let’s now go into
the expected
discussion
Sorry, looking for the reference to add here
Will all species survive?
• If Big Data defines an ecosystem…
• Which species will survive?
• Will Data Scientists wipe out the other species?
• Or will they be able to live in perfect symbiosis?
What is the ideal training required
for the individuals of these
species so that they can survive?
Data Science starter kits. Are they effective?
Masters in Data Science, Big Data and alike (I)
Expert in Big Data
Expert in Data Science
Masters in Data Science, Big Data and alike (II)
Masters in Data Science, Big Data and alike (III)
Year 1
• Data handling
• Data analysis
• Advanced data analysis and data
management
• Visualization
• Applications
Year 2
Are we doing it right in terms of training?
• This is probably due to the lack of maturity in the area, but
syllabi do not seem to be fully compatible…
• It is hard to believe that we can create Data
Scientists in only one year
• Should we train people to know a bit about everything?
• Or should we separate more clearly the species in our
ecosystem and specialise them better for their work?
How do we manage to keep a
healthy and stable ecosystem?
Shameless self-promotion
• Strategies for success in the
Digital-Data Revolution
• Separation of concerns
• Intellectual ramps
• Data-intensive knowledge
discovery
• Components and usage
patterns
• Data-intensive engineering
• Development vs enactment
• Data-intensive application
experiences
• In Science
• In Business
Can we learn from lessons
learned in Data-Intensive
Science?
Separation of concerns: three clear profiles
• Domain experts (WHAT)
• They know the problems they want to
solve
• They know the application domain
• They can create (scientific) workflows
• Data-intensive analysts (WHAT)
• They know a lot about (Big) data
analysis
• They may not know about the
infrastructure behind the scenes
• They do not necessarily know all the
details of the application domain
• Data-intensive engineers (HOW)
• They know a lot about distributed
computing/infrastructure/HPC/clouds/etc.
• They receive the description of an
algorithm and can make it more
efficient (parallelisation)
Separation of concerns: Differentiated tasks
[Figure: workflow diagram. Elements shown: Programmable Filter Project (inputs: inp, rules; output: distrib); four Sort elements (rule "second.fURI ASC..."); four Tuple Burst elements (structcols ["first,second"]); four De List elements; CorrFarm (input: inp).]
Rules given to Programmable Filter Project:
[<select = "1<= day(inp.first.start)<=5", project="inp">,
<select = "6<= day(inp.first.start)<=10", project="inp">,
<select = "11<= day(inp.first.start)<=15", project="inp">,
... ]
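The select rules and the sort rule on this slide can be read as "split the input by day of the month, then order each part by file URI". The sketch below shows that reading in plain Python. It is not DISPEL and not the workflow's actual implementation: the record layout (pairs of dictionaries with `start` and `fURI` keys) and the function names are hypothetical, and only the three day ranges visible on the slide are used, since the remaining rules are elided.

```python
from datetime import datetime

# Day-of-month ranges copied from the three rules visible on the slide.
# The slide elides the remaining rules, so none are invented here.
DAY_RANGES = [(1, 5), (6, 10), (11, 15)]


def partition_by_day(pairs, day_ranges=DAY_RANGES):
    """Route each (first, second) pair to the bucket whose range contains
    the day of the month of first["start"]."""
    buckets = [[] for _ in day_ranges]
    for first, second in pairs:
        day = first["start"].day
        for i, (low, high) in enumerate(day_ranges):
            if low <= day <= high:
                buckets[i].append((first, second))
                break  # a pair matching no rule is dropped, as in a filter
    return buckets


def sort_by_furi(bucket):
    """Order one bucket by second["fURI"], ascending."""
    return sorted(bucket, key=lambda pair: pair[1]["fURI"])


def run(pairs):
    """Partition the input, then sort every partition independently."""
    return [sort_by_furi(bucket) for bucket in partition_by_day(pairs)]


if __name__ == "__main__":
    # Hypothetical sample records, for illustration only.
    sample = [
        ({"start": datetime(2015, 6, 12)}, {"fURI": "file-b"}),
        ({"start": datetime(2015, 6, 3)}, {"fURI": "file-z"}),
        ({"start": datetime(2015, 6, 2)}, {"fURI": "file-a"}),
    ]
    for (low, high), bucket in zip(DAY_RANGES, run(sample)):
        print(f"days {low}-{high}:", [second["fURI"] for _, second in bucket])
```

Because every partition is sorted on its own, the branches are independent of one another. That independence is the kind of property a data-intensive engineer could exploit to run them in parallel without the domain expert having to change the description of what is computed.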
User and application diversity
System complexity
Iterative "what" process development
Mapping, optimisation, deployment and execution
Accommodating and facilitating
Several application domains
Several tool sets
Several process representations
Several working practices
DISPEL representation
Composing and providing
Many autonomous resources
One enactment mechanism
A single platform
Gateway
Tool level
Enactment level
Component library
Conclusions
• We all know that there are big opportunities in Big Data
• But we need to be more productive. For that we need to:
• Create real multidisciplinary teams with at least three roles
(application developers, data-intensive analysts and data-intensive
engineers)
• Understand that simply by using Hadoop, Spark or R we are not
necessarily doing Big Data
• Just as coding in Java does not necessarily mean
understanding object-oriented programming
• Understand that we have to interpret results adequately, from a
scientific point of view
• Understand the importance of homogenising datasets, in order to
facilitate their integration (slow-data)
• Continue working on delivering tools that can be used to develop
Big Data applications more productively
• Should we also be funding this?

More Related Content

What's hot

Drupal Day 2011 - Thinking spatially with your open data
Drupal Day 2011 - Thinking spatially with your open dataDrupal Day 2011 - Thinking spatially with your open data
Drupal Day 2011 - Thinking spatially with your open dataDrupalDay
 
FAIR Workflows: A step closer to the Scientific Paper of the Future
FAIR Workflows: A step closer to the Scientific Paper of the FutureFAIR Workflows: A step closer to the Scientific Paper of the Future
FAIR Workflows: A step closer to the Scientific Paper of the Futuredgarijo
 
Getting Started with Knowledge Graphs
Getting Started with Knowledge GraphsGetting Started with Knowledge Graphs
Getting Started with Knowledge GraphsPeter Haase
 
SSSW2015 Data Workflow Tutorial
SSSW2015 Data Workflow TutorialSSSW2015 Data Workflow Tutorial
SSSW2015 Data Workflow TutorialSSSW
 
D.3.1: State of the Art - Linked Data and Digital Preservation
D.3.1: State of the Art - Linked Data and Digital PreservationD.3.1: State of the Art - Linked Data and Digital Preservation
D.3.1: State of the Art - Linked Data and Digital PreservationPRELIDA Project
 
Towards Automating Data Narratives
Towards Automating Data NarrativesTowards Automating Data Narratives
Towards Automating Data Narrativesdgarijo
 
Introduction to the Data Web, DBpedia and the Life-cycle of Linked Data
Introduction to the Data Web, DBpedia and the Life-cycle of Linked DataIntroduction to the Data Web, DBpedia and the Life-cycle of Linked Data
Introduction to the Data Web, DBpedia and the Life-cycle of Linked DataSören Auer
 
II-PIC 2017: Artificial Intelligence, Machine Learning, And Deep Neural Netwo...
II-PIC 2017: Artificial Intelligence, Machine Learning, And Deep Neural Netwo...II-PIC 2017: Artificial Intelligence, Machine Learning, And Deep Neural Netwo...
II-PIC 2017: Artificial Intelligence, Machine Learning, And Deep Neural Netwo...Dr. Haxel Consult
 
II-SDV 2017: Towards Semantic Search at the European Patent Office
II-SDV 2017: Towards Semantic Search at the European Patent OfficeII-SDV 2017: Towards Semantic Search at the European Patent Office
II-SDV 2017: Towards Semantic Search at the European Patent OfficeDr. Haxel Consult
 
Question answering in linked data
Question answering in linked dataQuestion answering in linked data
Question answering in linked dataReza Ramezani
 
Knowledge discoverylaurahollink
Knowledge discoverylaurahollinkKnowledge discoverylaurahollink
Knowledge discoverylaurahollinkSSSW
 
Open source analytics
Open source analyticsOpen source analytics
Open source analyticsAjay Ohri
 
Towards an Open Research Knowledge Graph
Towards an Open Research Knowledge GraphTowards an Open Research Knowledge Graph
Towards an Open Research Knowledge GraphSören Auer
 
Knowledge Graph Introduction
Knowledge Graph IntroductionKnowledge Graph Introduction
Knowledge Graph IntroductionSören Auer
 
II-PIC 2017: Gain insight into technical, legal and business information thro...
II-PIC 2017: Gain insight into technical, legal and business information thro...II-PIC 2017: Gain insight into technical, legal and business information thro...
II-PIC 2017: Gain insight into technical, legal and business information thro...Dr. Haxel Consult
 
IC-SDV 2019: Down-to-earth machine learning: What you always wanted your data...
IC-SDV 2019: Down-to-earth machine learning: What you always wanted your data...IC-SDV 2019: Down-to-earth machine learning: What you always wanted your data...
IC-SDV 2019: Down-to-earth machine learning: What you always wanted your data...Dr. Haxel Consult
 
Towards digitizing scholarly communication
Towards digitizing scholarly communicationTowards digitizing scholarly communication
Towards digitizing scholarly communicationSören Auer
 

What's hot (20)

Drupal Day 2011 - Thinking spatially with your open data
Drupal Day 2011 - Thinking spatially with your open dataDrupal Day 2011 - Thinking spatially with your open data
Drupal Day 2011 - Thinking spatially with your open data
 
FAIR Workflows: A step closer to the Scientific Paper of the Future
FAIR Workflows: A step closer to the Scientific Paper of the FutureFAIR Workflows: A step closer to the Scientific Paper of the Future
FAIR Workflows: A step closer to the Scientific Paper of the Future
 
Getting Started with Knowledge Graphs
Getting Started with Knowledge GraphsGetting Started with Knowledge Graphs
Getting Started with Knowledge Graphs
 
SSSW2015 Data Workflow Tutorial
SSSW2015 Data Workflow TutorialSSSW2015 Data Workflow Tutorial
SSSW2015 Data Workflow Tutorial
 
D.3.1: State of the Art - Linked Data and Digital Preservation
D.3.1: State of the Art - Linked Data and Digital PreservationD.3.1: State of the Art - Linked Data and Digital Preservation
D.3.1: State of the Art - Linked Data and Digital Preservation
 
Towards Automating Data Narratives
Towards Automating Data NarrativesTowards Automating Data Narratives
Towards Automating Data Narratives
 
Introduction to the Data Web, DBpedia and the Life-cycle of Linked Data
Introduction to the Data Web, DBpedia and the Life-cycle of Linked DataIntroduction to the Data Web, DBpedia and the Life-cycle of Linked Data
Introduction to the Data Web, DBpedia and the Life-cycle of Linked Data
 
II-PIC 2017: Artificial Intelligence, Machine Learning, And Deep Neural Netwo...
II-PIC 2017: Artificial Intelligence, Machine Learning, And Deep Neural Netwo...II-PIC 2017: Artificial Intelligence, Machine Learning, And Deep Neural Netwo...
II-PIC 2017: Artificial Intelligence, Machine Learning, And Deep Neural Netwo...
 
II-SDV 2017: Towards Semantic Search at the European Patent Office
II-SDV 2017: Towards Semantic Search at the European Patent OfficeII-SDV 2017: Towards Semantic Search at the European Patent Office
II-SDV 2017: Towards Semantic Search at the European Patent Office
 
Question answering in linked data
Question answering in linked dataQuestion answering in linked data
Question answering in linked data
 
Knowledge discoverylaurahollink
Knowledge discoverylaurahollinkKnowledge discoverylaurahollink
Knowledge discoverylaurahollink
 
Open source analytics
Open source analyticsOpen source analytics
Open source analytics
 
Towards an Open Research Knowledge Graph
Towards an Open Research Knowledge GraphTowards an Open Research Knowledge Graph
Towards an Open Research Knowledge Graph
 
Knowledge Graph Introduction
Knowledge Graph IntroductionKnowledge Graph Introduction
Knowledge Graph Introduction
 
II-PIC 2017: Gain insight into technical, legal and business information thro...
II-PIC 2017: Gain insight into technical, legal and business information thro...II-PIC 2017: Gain insight into technical, legal and business information thro...
II-PIC 2017: Gain insight into technical, legal and business information thro...
 
CORFU-MTSR 2013
CORFU-MTSR 2013CORFU-MTSR 2013
CORFU-MTSR 2013
 
Linked Open Data and Ontotext Projects
Linked Open Data and Ontotext ProjectsLinked Open Data and Ontotext Projects
Linked Open Data and Ontotext Projects
 
The RDFIndex-MTSR 2013
The RDFIndex-MTSR 2013The RDFIndex-MTSR 2013
The RDFIndex-MTSR 2013
 
IC-SDV 2019: Down-to-earth machine learning: What you always wanted your data...
IC-SDV 2019: Down-to-earth machine learning: What you always wanted your data...IC-SDV 2019: Down-to-earth machine learning: What you always wanted your data...
IC-SDV 2019: Down-to-earth machine learning: What you always wanted your data...
 
Towards digitizing scholarly communication
Towards digitizing scholarly communicationTowards digitizing scholarly communication
Towards digitizing scholarly communication
 

Viewers also liked

Matching Workforce Skills with Employer Needs Now & into the Future
Matching Workforce Skills with Employer Needs Now & into the FutureMatching Workforce Skills with Employer Needs Now & into the Future
Matching Workforce Skills with Employer Needs Now & into the Futurenado-web
 
e-skills reshaping the future of learning
e-skills reshaping the future of learninge-skills reshaping the future of learning
e-skills reshaping the future of learning@cristobalcobo
 
Day of data: skills for the future
Day of data: skills for the futureDay of data: skills for the future
Day of data: skills for the futureSteven Miller
 
Navigating the Changing Economic and Demographic Realities of the 21st Century
Navigating the Changing Economic and Demographic Realities of the 21st Century Navigating the Changing Economic and Demographic Realities of the 21st Century
Navigating the Changing Economic and Demographic Realities of the 21st Century nado-web
 
Official Slideshare for What's the Future of Business by Brian Solis #WTF
Official Slideshare for What's the Future of Business by Brian Solis #WTFOfficial Slideshare for What's the Future of Business by Brian Solis #WTF
Official Slideshare for What's the Future of Business by Brian Solis #WTFBrian Solis
 

Viewers also liked (7)

Matching Workforce Skills with Employer Needs Now & into the Future
Matching Workforce Skills with Employer Needs Now & into the FutureMatching Workforce Skills with Employer Needs Now & into the Future
Matching Workforce Skills with Employer Needs Now & into the Future
 
e-skills reshaping the future of learning
e-skills reshaping the future of learninge-skills reshaping the future of learning
e-skills reshaping the future of learning
 
Day of data: skills for the future
Day of data: skills for the futureDay of data: skills for the future
Day of data: skills for the future
 
Navigating the Changing Economic and Demographic Realities of the 21st Century
Navigating the Changing Economic and Demographic Realities of the 21st Century Navigating the Changing Economic and Demographic Realities of the 21st Century
Navigating the Changing Economic and Demographic Realities of the 21st Century
 
How to hack into the big data team
How to hack into the big data teamHow to hack into the big data team
How to hack into the big data team
 
99 Facts on the Future of Business in the Digital Economy
99 Facts on the Future of Business in the Digital Economy99 Facts on the Future of Business in the Digital Economy
99 Facts on the Future of Business in the Digital Economy
 
Official Slideshare for What's the Future of Business by Brian Solis #WTF
Official Slideshare for What's the Future of Business by Brian Solis #WTFOfficial Slideshare for What's the Future of Business by Brian Solis #WTF
Official Slideshare for What's the Future of Business by Brian Solis #WTF
 

Similar to (Big) Data (Science) Skills

The Python ecosystem for data science - Landscape Overview
The Python ecosystem for data science - Landscape OverviewThe Python ecosystem for data science - Landscape Overview
The Python ecosystem for data science - Landscape OverviewDr. Ananth Krishnamoorthy
 
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...Mihai Criveti
 
Platform for Big Data Analytics and Visual Analytics: CSIRO use cases. Februa...
Platform for Big Data Analytics and Visual Analytics: CSIRO use cases. Februa...Platform for Big Data Analytics and Visual Analytics: CSIRO use cases. Februa...
Platform for Big Data Analytics and Visual Analytics: CSIRO use cases. Februa...Tomasz Bednarz
 
INF2190_W1_2016_public
INF2190_W1_2016_publicINF2190_W1_2016_public
INF2190_W1_2016_publicAttila Barta
 
Data science meetup - Spiros Antonatos
Data science meetup - Spiros AntonatosData science meetup - Spiros Antonatos
Data science meetup - Spiros AntonatosSpiros Antonatos
 
Tds — big science dec 2021
Tds — big science dec 2021Tds — big science dec 2021
Tds — big science dec 2021Gérard Dupont
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...Ilkay Altintas, Ph.D.
 
Big data berlin
Big data berlinBig data berlin
Big data berlinkammeyer
 
MAKING SENSE OF IOT DATA W/ BIG DATA + DATA SCIENCE - CHARLES CAI
MAKING SENSE OF IOT DATA W/ BIG DATA + DATA SCIENCE - CHARLES CAIMAKING SENSE OF IOT DATA W/ BIG DATA + DATA SCIENCE - CHARLES CAI
MAKING SENSE OF IOT DATA W/ BIG DATA + DATA SCIENCE - CHARLES CAIBig Data Week
 
Big data and you
Big data and you Big data and you
Big data and you IBM
 
The Future of Data Science
The Future of Data ScienceThe Future of Data Science
The Future of Data Sciencesarith divakar
 
Bitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FSBitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FSPhilip Filleul
 
Data Science at Scale - The DevOps Approach
Data Science at Scale - The DevOps ApproachData Science at Scale - The DevOps Approach
Data Science at Scale - The DevOps ApproachMihai Criveti
 
Presentation on Big Data Analytics
Presentation on Big Data AnalyticsPresentation on Big Data Analytics
Presentation on Big Data AnalyticsS P Sajjan
 

Similar to (Big) Data (Science) Skills (20)

Proposed Talk Outline for Pycon2017
Proposed Talk Outline for Pycon2017 Proposed Talk Outline for Pycon2017
Proposed Talk Outline for Pycon2017
 
Introduction to BigData
Introduction to BigData Introduction to BigData
Introduction to BigData
 
The Python ecosystem for data science - Landscape Overview
The Python ecosystem for data science - Landscape OverviewThe Python ecosystem for data science - Landscape Overview
The Python ecosystem for data science - Landscape Overview
 
On Big Data
On Big DataOn Big Data
On Big Data
 
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
 
Scaling the (evolving) web data –at low cost-
Scaling the (evolving) web data –at low cost-Scaling the (evolving) web data –at low cost-
Scaling the (evolving) web data –at low cost-
 
Platform for Big Data Analytics and Visual Analytics: CSIRO use cases. Februa...
Platform for Big Data Analytics and Visual Analytics: CSIRO use cases. Februa...Platform for Big Data Analytics and Visual Analytics: CSIRO use cases. Februa...
Platform for Big Data Analytics and Visual Analytics: CSIRO use cases. Februa...
 
INF2190_W1_2016_public
INF2190_W1_2016_publicINF2190_W1_2016_public
INF2190_W1_2016_public
 
Data science meetup - Spiros Antonatos
Data science meetup - Spiros AntonatosData science meetup - Spiros Antonatos
Data science meetup - Spiros Antonatos
 
Tds — big science dec 2021
Tds — big science dec 2021Tds — big science dec 2021
Tds — big science dec 2021
 
Big Data & DS Analytics for PAARL
Big Data & DS Analytics for PAARLBig Data & DS Analytics for PAARL
Big Data & DS Analytics for PAARL
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
 
Big data berlin
Big data berlinBig data berlin
Big data berlin
 
MAKING SENSE OF IOT DATA W/ BIG DATA + DATA SCIENCE - CHARLES CAI
MAKING SENSE OF IOT DATA W/ BIG DATA + DATA SCIENCE - CHARLES CAIMAKING SENSE OF IOT DATA W/ BIG DATA + DATA SCIENCE - CHARLES CAI
MAKING SENSE OF IOT DATA W/ BIG DATA + DATA SCIENCE - CHARLES CAI
 
Big data and you
Big data and you Big data and you
Big data and you
 
ODSC and iRODS
ODSC and iRODSODSC and iRODS
ODSC and iRODS
 
The Future of Data Science
The Future of Data ScienceThe Future of Data Science
The Future of Data Science
 
Bitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FSBitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FS
 
Data Science at Scale - The DevOps Approach
Data Science at Scale - The DevOps ApproachData Science at Scale - The DevOps Approach
Data Science at Scale - The DevOps Approach
 
Presentation on Big Data Analytics
Presentation on Big Data AnalyticsPresentation on Big Data Analytics
Presentation on Big Data Analytics
 

More from Oscar Corcho

Organisational Interoperability in Practice at Universidad Politécnica de Madrid
Organisational Interoperability in Practice at Universidad Politécnica de MadridOrganisational Interoperability in Practice at Universidad Politécnica de Madrid
Organisational Interoperability in Practice at Universidad Politécnica de MadridOscar Corcho
 
Introducción a los Datos Abiertos - Open Data Day 2020
Introducción a los Datos Abiertos - Open Data Day 2020Introducción a los Datos Abiertos - Open Data Day 2020
Introducción a los Datos Abiertos - Open Data Day 2020Oscar Corcho
 
Open Data (and Software, and other Research Artefacts) - A proper management
Open Data (and Software, and other Research Artefacts) -A proper managementOpen Data (and Software, and other Research Artefacts) -A proper management
Open Data (and Software, and other Research Artefacts) - A proper management Oscar Corcho
 
Adiós a los ficheros, hola a los grafos de conocimientos estadísticos
Adiós a los ficheros, hola a los grafos de conocimientos estadísticosAdiós a los ficheros, hola a los grafos de conocimientos estadísticos
Adiós a los ficheros, hola a los grafos de conocimientos estadísticosOscar Corcho
 
Ontology Engineering at Scale for Open City Data Sharing
Ontology Engineering at Scale for Open City Data SharingOntology Engineering at Scale for Open City Data Sharing
Ontology Engineering at Scale for Open City Data SharingOscar Corcho
 
Situación de las iniciativas de Open Data internacionales (y algunas recomen...
Situación de las iniciativas de Open Data internacionales (y algunas recomen...Situación de las iniciativas de Open Data internacionales (y algunas recomen...
Situación de las iniciativas de Open Data internacionales (y algunas recomen...Oscar Corcho
 
STARS4ALL - Contaminación Lumínica
STARS4ALL - Contaminación LumínicaSTARS4ALL - Contaminación Lumínica
STARS4ALL - Contaminación LumínicaOscar Corcho
 
Towards Reproducible Science: a few building blocks from my personal experience
Towards Reproducible Science: a few building blocks from my personal experienceTowards Reproducible Science: a few building blocks from my personal experience
Towards Reproducible Science: a few building blocks from my personal experienceOscar Corcho
 
Publishing Linked Statistical Data: Aragón, a case study
Publishing Linked Statistical Data: Aragón, a case studyPublishing Linked Statistical Data: Aragón, a case study
Publishing Linked Statistical Data: Aragón, a case studyOscar Corcho
 
An initial analysis of topic-based similarity among scientific documents base...
An initial analysis of topic-based similarity among scientific documents base...An initial analysis of topic-based similarity among scientific documents base...
An initial analysis of topic-based similarity among scientific documents base...Oscar Corcho
 
Linked Statistical Data 101
Linked Statistical Data 101Linked Statistical Data 101
Linked Statistical Data 101Oscar Corcho
 
 

 

 

 

(Big) Data (Science) Skills

  • 1. (Big) Data (Science) Skills Big Data Value Association Summit in Madrid 17/06/2015 Oscar Corcho ocorcho@fi.upm.es @ocorcho https://www.slideshare.com/ocorcho
  • 2. License • This work is licensed under the license CC BY-NC-SA 4.0 International • http://purl.org/NET/rdflicense/cc-by-nc-sa4.0 • You are free: • to Share — to copy, distribute and transmit the work • to Remix — to adapt the work • Under the following conditions • Attribution — You must attribute the work by inserting • “[source Oscar Corcho]” at the footer of each reused slide • a credits slide stating: “These slides are partially based on “(Big) Data (Science) Skills” by O. Corcho” • Non-commercial • Share-Alike
  • 3. Data Scientist: Technical and Soft Skills needed • One of the two or three pictures expected from a talk on skills… • I could start going through • Each of these topics • Discussing the specific skills needed • However… Sorry, looking for the reference to add here
  • 4. What is Big Data? Source: http://www.philipchircop.com/post/25783275888/seeing-the-full-elephant-its-a-tree-its-a
  • 5. Big Data and the theory of ecological niches
  • 6. Characteristics of an ecological niche • A niche is defined by a spectrum of resource usage • Species differ from each other in how efficient they are in using resources that change continuously • Characteristics of a niche • Amplitude (range in which resources are used) • Generic species (they can use a wide range of resources) • Specialist species (they require a very specific combination of resources) • Overlap (similarity among niches in their usage of resources) • Competitive exclusion principle (Gause, 1934) • If two species coexist in a stable environment, they do it as a differentiation of their effective ecological niches. Source: Javier Seoane. Ecología. Unidad Temática 21. Teoría del nicho ecológico
  • 7. WHAT’S THE RELATIONSHIP TO BIG DATA? Well, that’s interesting, but…
  • 8. Big Data Niche 1. HPC and e-Infrastructure Experts. Background: Computer Science (Systems), System Administration. Terms used in their native language: blades, InfiniBand, OpenMPI, racks, HDF, TBs, Gflops. Their daily life: check system logs; make sure that queues are active; install a new rack. What’s Big Data for them? A “commercial” term for something that they have been doing for a long time. They really know how to configure and monitor a Hadoop cluster. They would love to see those who talk about Big Data run fluid dynamics processes.
  • 9. Big Data Niche 2. Data Storage and Access Experts. Background: Computer Science, Database Administration. Terms used in their native language: SQL, NoSQL, column store, transactions, Hive, TBs/PBs/…, TPS (transactions per second). Their daily life: optimise several queries; run a new benchmark; design an optimiser/physical operator. What’s Big Data for them? A new opportunity to work on optimisation algorithms. They know how to configure a database. They often laugh at those who deploy a NoSQL solution for a problem that can be solved with a relational database.
  • 10. Big Data Niche 3. Machine Learning Experts. Background: Mathematics, Statistics, Physics, Computer Science. Terms used in their native language: complexity, algorithm, p-value, convergence, precision, recall, ROC curves, Bayesian networks, R. Their daily life: read about a new problem; write down a few formulae on the whiteboard (or even a blackboard); prove that the algorithm terminates. What’s Big Data for them? The same problems applied to larger data, with new challenges. Problems are not only solved in Hadoop or a powerful NoSQL DB. They are astonished by those who still mix up correlation and causality.
  • 11. Big Data Niche 4. Slow-data Experts. Background: Computer Science, Statistics, Library Sciences, Linguistics. Terms used in their native language: information model, vocabulary, ontology, data quality, curation. Their daily life: receive a database schema; talk to data producers and (re)users; obtain consensus and transform data. What’s Big Data for them? The difficulty lies in the variety of data formats and structures. We may integrate data from varied sources, although this is not always possible. When you manage to integrate heterogeneous data, you can achieve better results.
  • 12. Big Data Niche 5. (Big Data) Consultants. Background: Computer Science, Economics, … Terms used in their native language: business model, business opportunity, Big Data, Data Value Chain, Hadoop, Spark, R, TBs, GFlops. Their daily life: read a Gartner Big Data report; talk to potential customers; pass needs on to technicians. What’s Big Data for them? It’s the 4Vs, plus a few more. I have a PPT presentation with a Big Data infrastructure, architecture, and previous projects, which I will use to sell a project to my customers.
  • 13. Are we missing any ecological niche? • We have already seen a few ecological niches… • They all coexist • Some of them overlap Is there anyone who has not yet been considered?
  • 14. The evolution of a new species: the Data Scientist. Background: Computer Science + Statistics + Mathematics + Economics + … Terms used in their new exotic language: HPC, databases, algorithms, harmonisation, integration, Hadoop, Spark, R, TBs, GFlops. Their daily life: learn about a new infrastructure; code scripts to be run on Spark; interpret results; install a new framework; read a few scientific papers; make shiny presentations; describe their activities in their blog, so that Big Data is better known and understood; …
  • 15. © Volker Markl: “Data Scientist” – “Jack of All Trades!” Application Data Science Control Flow Iterative Algorithms Error Estimation Active Sampling Sketches Curse of Dimensionality Decoupling Convergence Monte Carlo Mathematical Programming Linear Algebra Stochastic Gradient Descent Regression Statistics Hashing Parallelization Query Optimization Fault Tolerance Relational Algebra / SQL Scalability Data Analysis Language Compiler Memory Management Memory Hierarchy Data Flow Hardware Adaptation Indexing Resource Management NF2 /XQuery Data Warehouse/OLAP Domain Expertise (e.g., Industry 4.0, Medicine, Physics, Engineering, Energy, Logistics) Real-Time
  • 16. Data Scientists and Pi-shaped people • Let’s now go into the expected discussion Sorry, looking for the reference to add here
  • 17. Will all species survive? • If Big Data defines an ecosystem… • Which species will survive? • Will Data Scientists wipe out the other species? • Or will they be able to live in perfect symbiosis? What is the ideal training required for the individuals of these species so that they can survive?
  • 18. Data Science starter kits. Are they effective?
  • 19. Masters in Data Science, Big Data and the like (I) Expert in Big Data Expert in Data Science
  • 20. Masters in Data Science, Big Data and the like (II)
  • 21. Masters in Data Science, Big Data and the like (III) Year 1 • Data handling • Data analysis • Advanced data analysis and data management • Visualization • Applications Year 2
  • 22. Are we doing it right in terms of training? • Probably it is all about lack of maturity in the area, but syllabi do not seem to be perfectly compatible… • It is not easy to believe that we can create Data Scientists in only one year • Should we train people to know a bit about everything? • Or should we separate more clearly the species in our ecosystem and specialise them better for their work? How do we manage to keep a healthy and stable ecosystem?
  • 23. Shameless self-promotion • Strategies for success in the Digital-Data Revolution • Separation of concerns • Intellectual ramps • Data-intensive knowledge discovery • Components and usage patterns • Data-intensive engineering • Development vs enactment • Data-intensive application experiences • In Science • In Business Can we learn from lessons learned in Data-Intensive Science?
  • 24. Separation of concerns: three clear profiles • Domain experts (WHAT) • They know the problems they want to solve • They know the application domain • They can create (scientific) workflows • Data-intensive analysts (WHAT) • They know a lot about (Big) data analysis • They may not know about the infrastructure behind the scenes • They do not necessarily know all the details of the application domain • Data-intensive engineers (HOW) • They know a lot about distributed computing/infrastructure/HPC/clouds/etc. • They receive the description of an algorithm and they can make it more efficient (parallelisation)
  • 25. Separation of concerns: Differentiated tasks [Figure: a DISPEL workflow fragment and an architecture diagram. Workflow elements: a Programmable Filter Project step with rules such as <select = "1<= day(inp.first.start)<=5", project="inp">, <select = "6<= day(inp.first.start)<=10", project="inp">, <select = "11<= day(inp.first.start)<=15", project="inp">, ...; four parallel Sort steps (rule "second.fURI ASC..."); four Tuple Burst steps (columns ["first,second"]); four De List steps; CorrFarm. Diagram labels: User and application diversity; System complexity; Iterative "what" process development; Mapping, optimisation, deployment and execution; Accommodating and facilitating: several application domains, several tool sets, several process representations, several working practices; DISPEL representation; Composing and providing: many autonomous resources, one enactment mechanism, a single platform; Gateway; Tool level; Enactment level; Component library.]
  • 26. Conclusions • We all know that there are big opportunities in Big Data • But we need to be more productive. For that we need to: • Create real multidisciplinary teams with at least three roles (application developers, data-intensive analysts and data-intensive engineers) • Understand that simply by using Hadoop, Spark or R we are not necessarily doing Big Data • Just as coding in Java does not necessarily mean that we understand object-oriented programming • Understand that we have to interpret results adequately, from a scientific point of view • Understand the importance of homogenising datasets, in order to facilitate their integration (slow-data) • Continue working on delivering tools that can be used to develop Big Data applications more productively • Should we also be funding this?
  • 27. (Big) Data (Science) Skills Big Data Value Association Summit in Madrid 17/06/2015 Oscar Corcho ocorcho@fi.upm.es @ocorcho https://www.slideshare.com/ocorcho