ESSnet Big Data WP8
Methodology (+ Quality, +IT)
Deliverables prepared by: WP8 members
BDES 2018 - Sofia,
14-15 May 2018
• Introduction Piet Daas
• IT Jacek Maślankowski
• Quality Magdalena Six
• Methodology Valentin Chavdarov & Piet Daas
• Literature study Jacek Maślankowski
• Discussant Faiz Alsuhail
• Questions + Discussion All
Overview of this session
• Aim of WP8 is to generalize the findings of the pilots in ESSnet Big
Data and relate them to the conditions for future use of big data
sources within the European Statistical System.
• Only active in SGA-2 (January 2017 - May 2018)
• Focus on Methodology, Quality and IT-infrastructure
Overview of WP8
• Based on real world experiences
– Work performed in WP 1-7 of ESSnet and other work relevant for
NSIs (or similar)
• Broad area: focus on most important topics
– In three areas: IT, Quality and Methodology
– Identify the most important topics (for each area) at the start of
WP8 during a workshop with experts
– To assure a sufficiently ‘blended’ view on BD
• Follow a bottom-up approach
Starting points of WP8
• Identified most important topics for
• IT
– 10 in total
• Quality
– 7 in total
• Methodology
– 11 in total
• Topics based on the BD ‘state of the art’ in January 2017
– One common topic emerged in all 3 areas
Results of the workshop
• A Process ‘view’ on Big Data
– IT: Data Processing Life Cycle
– Quality: Process Chain Control
– Methodology: Data Process Architecture
– This is important
• GSBPM (Generic Statistical Business Process Model) provides a general view on NSI processes
Common topic
Big Data process: Data driven
Different from the approach commonly used in official statistics
IT Report
in the ESSnet Big Data
Deliverable 8.3 of WP8
Prepared by: WP8 members
Jacek Maślankowski, Statistics Poland
1. Big Data processing life cycle
2. Metadata management (ontology)
3. Format of Big Data processing
4. Data-hub and Data-lake
5. Data source integration
6. Choosing the right infrastructure
7. List of secure and tested APIs
8. Shared libraries and documented standards
9. Speed of algorithms
10. Training/skills/knowledge
Information covered in the report
Conceptual Big Data platform
Data processing and data storage, by data type
• Batch (static data)
– Structured: RDBMS, DBF, ...; relational database, files; Hadoop, MySQL, ...
– Unstructured: Text, Website, ...; files, NoSQL; Hadoop, Solr, ...
– Semi-structured: CSV, JSON, XML, ...; files, NoSQL or relational databases; Hadoop, HBase, ...
• Streaming (realtime data)
– Sensors: TXT or CSV files; in-memory processing engine; Spark, Kafka, ...
– Web: websites; in-memory processing engine; Spark, Storm, ...
GitHub repositories
No. | Name | Link | Main features
1 | Awesome Official Statistics software | https://github.com/SNStatComp/awesome-official-statistics-software | List of useful statistical software with links to other GitHub repositories, by CBS NL
2 | ONS (Office for National Statistics) UK Big Data team | https://github.com/ONSBigData | Various software developed by the ONS UK Big Data team
3 | ONS (Office for National Statistics) UK Data Science Campus | https://github.com/datasciencecampus | Various software developed by the ONS Data Science Campus team
4 | ESTP Big Data course software | https://github.com/SNStatComp/ESTPBD | Various software developed for the ESTP Big Data training courses
APIs used
No. | Name of the API | Basic functionality | Restrictions | Potential domains (WP number) | Remarks
1 | Twitter API | Scrape tweets by keywords, hashtags, users; streaming scraping | 25 to 900 requests per 15 minutes; access only to public profiles | Population, Social Statistics, Tourism (WP2, WP7) | Account and API code needed
2 | Facebook Graph API | Collect information from public profiles, including very specific items such as photo metatags | Mostly current information; typically no more than dozens of requests | Population (WP7) | Account and API code needed
3 | Google Maps API | Looking for any kind of objects (e.g., hotels), verification of addresses, monitoring the traffic on specific roads | Free up to 2,500 requests per day; $0.50 USD per 1,000 additional requests, up to 100,000 daily, if billing is enabled | Tourism (WP7) | Google account and API code needed
4 | Google Custom Search API | Can be used to search through one website; with modifications it will search for keywords across the whole Internet; can be used to find the URL of a specific enterprise | JSON/Atom Custom Search API provides 100 search queries per day for free; additional requests cost $5 per 1,000 queries, up to 10,000 queries per day | Business (WP2) | Google account and API code needed
5 | Bing API | Finding the specific URL of an enterprise | 7 queries per second (QPS) per IP address | Business (WP2) | AppID needed
6 | Guardian API | Collect news articles and comments from the Guardian website | Free for non-commercial use; up to 12 calls per second; up to 5,000 calls per day; access to article text; access to over 1,900,000 pieces of content | Population, Social Statistics (WP7) | Registered account needed
7 | Copernicus Open Access Hub | Access to Sentinel-1 and Sentinel-2 repositories | Free for registered users | Agriculture (WP7) | Registered account needed
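Several of the restrictions listed for the APIs are request quotas per time window (for example 900 requests per 15 minutes, or 12 calls per second). A minimal sketch of how a client could stay within such a quota is given below. It is not taken from the WP8 report: the class name `SlidingWindowLimiter` and its parameters are hypothetical, and the quota figures are simply the ones quoted on the slide, which may no longer match what the providers enforce.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` calls within any window of `period` seconds.

    Hypothetical helper for illustration; not part of any API client library.
    """

    def __init__(self, max_calls, period, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock      # injectable so the limiter can be tested
        self._sleep = sleep
        self._calls = deque()    # timestamps of the calls still inside the window

    def wait_time(self):
        """Return the seconds to wait before the next call is allowed."""
        now = self._clock()
        # Forget calls that have already left the window.
        while self._calls and now - self._calls[0] >= self.period:
            self._calls.popleft()
        if len(self._calls) < self.max_calls:
            return 0.0
        return self.period - (now - self._calls[0])

    def acquire(self):
        """Block until a call is allowed, then record it."""
        while True:
            delay = self.wait_time()
            if delay <= 0:
                break
            self._sleep(delay)
        self._calls.append(self._clock())


if __name__ == "__main__":
    # Quota as quoted on the slide for the Guardian API: 12 calls per second.
    limiter = SlidingWindowLimiter(max_calls=12, period=1.0)
    for i in range(5):
        limiter.acquire()
        # A real client would send its request here.
        print("request", i, "allowed")
```

Calling `acquire()` before every request keeps the client inside one quota. A daily cap (such as 5,000 calls per day) could be handled by a second limiter instance with `period=24 * 3600`.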
1. There is no unified framework for Big Data metadata management.
2. All WPs share common ground on tools and data storage.
3. Data-lakes and data-hubs have not yet been explored in depth.
4. There are best practices for using different APIs.
5. NSIs share software in GitHub repositories.
6. There is no unified framework for data source integration.
7. A variety of training courses helps data scientists build the required skills.
Main findings
Report on
Quality Aspects of Big Data
in the ESSnet Big Data
Deliverable 8.2 of WP8
Prepared by: WP8 members
Magdalena Six, Statistics Austria
In relation to cause(s) of errors:
• Coverage, Accuracy and Selectivity
• Processing errors
• Linkability
• Measurement errors
• Model errors and precision
In relation to changes in the composition of the source
• Comparability over time
• Process chain control
7 Quality Aspects of Big Data
7 Quality Aspects in the Context of
UNECE’s Quality Framework for BD
• 3 Phases of the business process: Input, Throughput, Output
• 3 Hyperdimensions: Source, Data, Metadata
Structure of the Report on Quality in
the ESSnet Big Data
7 Chapters according to the 7 identified quality aspects
Same structure for each chapter:
1. Introduction: meaning of the respective quality aspect in the
context of Big Data
2. Examples and Methods: Role of the respective quality aspect
in the WP1-WP7
3. Discussion: Challenges for the quality aspect, cross connections
to other Chapters in the Quality Report, but also to IT and
Methodology Report
Examples for new (?), BD specific (?)
Error Sources
• Scrambling of the Automatic Identification System (AIS) signal of ships in WP4 -> measurement or coverage error?
• Scraping of a deceptive job vacancy ad -> measurement or coverage error?
• Non-stable access to the BD source, change in the technological process generating the BD, change in use of BD-generating devices -> comparability over time
• Multiple layers of (new) processing steps required (advanced techniques for editing, imputation, linking, text mining algorithms…), which introduce new error sources
• Deduction of information about the target variable from other variables via modelling; models based on small-sample statistical inference do not work
Quality Measures: Challenges from the
past and Challenges ahead
• Still in the experimental phase
• Often no routine and no regular access to the Big Data source
• Focus in WPs more on potential sources and potential access to
sources than on a standardized reporting of quality measures
• Experimental phase shows: Big Data sources, as well as processes
needed to work with these sources are so diverse that the
development of standardized quality measures / a quality framework
will be challenging
Report on
Methodology
in the ESSnet Big Data
Deliverable 8.4 of WP8
Prepared by: WP8 members
Valentin Chavdarov & Piet Daas
Why Big data methodology?
1. A good part of statistical methodology is built around survey data. Many conventions in statistical methodology reflect the failure of surveys to capture important economic and social phenomena.
2. Big Data is a by-product of modern society. Little is known about the data generation process and the units included.
3. Working in a data-driven way is new for NSIs. Methods and principles are needed to ensure that valid conclusions are drawn when using Big Data.
Big data methodology issues
1. Assessing accuracy
2. What should our final product look like?
3. Deal with spatial dimension
4. Changes in data sources
5. Machine learning in official statistics
6. Data linkage
7. Secure multi-party computation
8. Inference
9. Sampling
10. Data process architecture
11. Unit identification problem
Big data methodology issues (cont.)
• Methodological issues differ in scope. Assessing accuracy, for example, covers almost all stages of the statistical production process: from collecting data through processing to data analysis.
• Some issues are BD specific: data linkage, changes in data sources, the unit identification problem.
Risk of social sciences datafication
There are three ways in which Big Data can be used for official statistics:
1) Survey based: as an additional source to improve survey-based estimation (~ WP2, WP7, sentiment NL)
2) Census based: as the main/single source; the whole target population is included (WP4, road sensor NL)
3) Incomplete: as the main/single source; only part of the target population is included (WP1, WP3 ….); need to correct for that
Using Big Data
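The slides do not say how the correction for an incomplete source should be done. One common option, shown here purely as an illustration, is post-stratification: compute the mean per stratum in the Big Data source and weight each stratum by its known population size. The function name, the strata and all numbers below are made up for the example.

```python
from collections import defaultdict


def poststratified_mean(records, population_counts):
    """Estimate a population mean from an incomplete source.

    records: iterable of (stratum, value) pairs observed in the source.
    population_counts: dict mapping each stratum to its known population size.
    Assumes every stratum of the population occurs at least once in `records`.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for stratum, value in records:
        sums[stratum] += value
        counts[stratum] += 1

    missing = set(population_counts) - set(counts)
    if missing:
        raise ValueError(f"no observations for strata: {sorted(missing)}")

    total_population = sum(population_counts.values())
    weighted_sum = sum(
        population_counts[s] * sums[s] / counts[s] for s in population_counts
    )
    return weighted_sum / total_population


if __name__ == "__main__":
    # Made-up data: the source over-represents the "young" stratum.
    observed = [("young", 10.0), ("young", 10.0), ("young", 10.0), ("old", 20.0)]
    population = {"young": 500, "old": 500}

    naive = sum(v for _, v in observed) / len(observed)
    print(naive)                                      # 12.5
    print(poststratified_mean(observed, population))  # 15.0
```

The naive mean is pulled towards the over-represented stratum, while the weighted estimate restores the population shares. This only works if the stratifying variable is known both in the source and in the population, which ties in with the unit identification and linkage issues listed earlier.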
Methodology | Quality | IT
Bias & models | Coverage | Choosing right infra
Data driven way of working | Sources of error | Training/skills/knowledge
Machine Learning (2 places) | Editing data | Big Data libraries
Linking (e.g. geo-loc) | Linkability | Programming languages
Unit identification (features) | |
In these areas new methods are needed and are being developed!
More important/New to Big Data
Literature study
in the ESSnet Big Data
Deliverable 8.1 of WP8
Prepared by: WP8 members
Jacek Maślankowski, Statistics Poland
• Bibliographic data
• Link
• Short overview (strengths, weaknesses)
• Data sources
• Domains
• Keywords
• Classification (A – very relevant, B – relevant, C – less relevant)
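The fields listed above could be captured in a small record type, which makes the literature overview easy to filter by relevance class. This is only a sketch: the class and field names are hypothetical and do not reflect how the WP8 wiki actually stores its entries, and the sample entries are placeholders, not real references.

```python
from dataclasses import dataclass, field
from enum import Enum


class Relevance(Enum):
    A = "very relevant"
    B = "relevant"
    C = "less relevant"


@dataclass
class LiteratureEntry:
    bibliographic_data: str
    link: str
    overview: str                      # short overview: strengths, weaknesses
    data_sources: list = field(default_factory=list)
    domains: list = field(default_factory=list)
    keywords: list = field(default_factory=list)
    classification: Relevance = Relevance.C


def select_by_relevance(entries, relevance):
    """Return the entries that carry the given relevance class."""
    return [e for e in entries if e.classification is relevance]


if __name__ == "__main__":
    # Placeholder entries for illustration only.
    entries = [
        LiteratureEntry("Placeholder entry 1", "https://example.org/1",
                        "placeholder overview", classification=Relevance.A),
        LiteratureEntry("Placeholder entry 2", "https://example.org/2",
                        "placeholder overview", classification=Relevance.B),
    ]
    for entry in select_by_relevance(entries, Relevance.A):
        print(entry.bibliographic_data)
```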
Sharing the experience
WP8 Wiki
Reports, milestones and deliverables
Literature overview
Living document
Thank you for your attention

Weitere ähnliche Inhalte

Was ist angesagt?

Accelerating Time to Research Using CloudBank
Accelerating Time to Research Using CloudBankAccelerating Time to Research Using CloudBank
Accelerating Time to Research Using CloudBankSanjay Padhi, Ph.D
 
Data Analytics Life Cycle [EMC² - Data Science and Big data analytics]
Data Analytics Life Cycle [EMC² - Data Science and Big data analytics]Data Analytics Life Cycle [EMC² - Data Science and Big data analytics]
Data Analytics Life Cycle [EMC² - Data Science and Big data analytics]ssuser23e4f31
 
BDVe Webinar Series - Designing Big Data pipelines with Toreador (Ernesto Dam...
BDVe Webinar Series - Designing Big Data pipelines with Toreador (Ernesto Dam...BDVe Webinar Series - Designing Big Data pipelines with Toreador (Ernesto Dam...
BDVe Webinar Series - Designing Big Data pipelines with Toreador (Ernesto Dam...Big Data Value Association
 
Building Knowledge Graphs in 10 steps
Building Knowledge Graphs in 10 stepsBuilding Knowledge Graphs in 10 steps
Building Knowledge Graphs in 10 stepsOntotext
 
Introduction to Data Science - Week 4 - Tools and Technologies in Data Science
Introduction to Data Science - Week 4 - Tools and Technologies in Data ScienceIntroduction to Data Science - Week 4 - Tools and Technologies in Data Science
Introduction to Data Science - Week 4 - Tools and Technologies in Data ScienceFerdin Joe John Joseph PhD
 
Role of Data Cleaning in Data Warehouse
Role of Data Cleaning in Data WarehouseRole of Data Cleaning in Data Warehouse
Role of Data Cleaning in Data WarehouseRamakant Soni
 
Online retail a look at data consulting approach
Online retail   a look at data consulting approachOnline retail   a look at data consulting approach
Online retail a look at data consulting approachShesha R
 
Fairification experience clarifying the semantics of data matrices
Fairification experience clarifying the semantics of data matricesFairification experience clarifying the semantics of data matrices
Fairification experience clarifying the semantics of data matricesPistoia Alliance
 
Tag.bio: Self Service Data Mesh Platform
Tag.bio: Self Service Data Mesh PlatformTag.bio: Self Service Data Mesh Platform
Tag.bio: Self Service Data Mesh PlatformSanjay Padhi, Ph.D
 
Introduction to Big Data Analytics
Introduction to Big Data AnalyticsIntroduction to Big Data Analytics
Introduction to Big Data AnalyticsUtkarsh Sharma
 
Paper id 26201475
Paper id 26201475Paper id 26201475
Paper id 26201475IJRAT
 
Democratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data DiscoveryDemocratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data DiscoveryMark Grover
 
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...Geoffrey Fox
 
What is Data Commons and How Can Your Organization Build One?
What is Data Commons and How Can Your Organization Build One?What is Data Commons and How Can Your Organization Build One?
What is Data Commons and How Can Your Organization Build One?Robert Grossman
 

Was ist angesagt? (20)

Data mining
Data miningData mining
Data mining
 
Accelerating Time to Research Using CloudBank
Accelerating Time to Research Using CloudBankAccelerating Time to Research Using CloudBank
Accelerating Time to Research Using CloudBank
 
Big data road map
Big data road mapBig data road map
Big data road map
 
Data Analytics Life Cycle [EMC² - Data Science and Big data analytics]
Data Analytics Life Cycle [EMC² - Data Science and Big data analytics]Data Analytics Life Cycle [EMC² - Data Science and Big data analytics]
Data Analytics Life Cycle [EMC² - Data Science and Big data analytics]
 
BDVe Webinar Series - Designing Big Data pipelines with Toreador (Ernesto Dam...
BDVe Webinar Series - Designing Big Data pipelines with Toreador (Ernesto Dam...BDVe Webinar Series - Designing Big Data pipelines with Toreador (Ernesto Dam...
BDVe Webinar Series - Designing Big Data pipelines with Toreador (Ernesto Dam...
 
Building Knowledge Graphs in 10 steps
Building Knowledge Graphs in 10 stepsBuilding Knowledge Graphs in 10 steps
Building Knowledge Graphs in 10 steps
 
Unit 3 part 2
Unit  3 part 2Unit  3 part 2
Unit 3 part 2
 
Introduction to Data Science - Week 4 - Tools and Technologies in Data Science
Introduction to Data Science - Week 4 - Tools and Technologies in Data ScienceIntroduction to Data Science - Week 4 - Tools and Technologies in Data Science
Introduction to Data Science - Week 4 - Tools and Technologies in Data Science
 
Big data analytics
Big data analyticsBig data analytics
Big data analytics
 
Role of Data Cleaning in Data Warehouse
Role of Data Cleaning in Data WarehouseRole of Data Cleaning in Data Warehouse
Role of Data Cleaning in Data Warehouse
 
Online retail a look at data consulting approach
Online retail   a look at data consulting approachOnline retail   a look at data consulting approach
Online retail a look at data consulting approach
 
Bigowl aitech
Bigowl aitechBigowl aitech
Bigowl aitech
 
Fairification experience clarifying the semantics of data matrices
Fairification experience clarifying the semantics of data matricesFairification experience clarifying the semantics of data matrices
Fairification experience clarifying the semantics of data matrices
 
Tag.bio: Self Service Data Mesh Platform
Tag.bio: Self Service Data Mesh PlatformTag.bio: Self Service Data Mesh Platform
Tag.bio: Self Service Data Mesh Platform
 
Introduction to Big Data Analytics
Introduction to Big Data AnalyticsIntroduction to Big Data Analytics
Introduction to Big Data Analytics
 
Paper id 26201475
Paper id 26201475Paper id 26201475
Paper id 26201475
 
Democratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data DiscoveryDemocratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data Discovery
 
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
 
BAS 250 Lecture 1
BAS 250 Lecture 1BAS 250 Lecture 1
BAS 250 Lecture 1
 
What is Data Commons and How Can Your Organization Build One?
What is Data Commons and How Can Your Organization Build One?What is Data Commons and How Can Your Organization Build One?
What is Data Commons and How Can Your Organization Build One?
 

Ähnlich wie ESSnet Big Data WP8 Methodology (+ Quality, +IT)

Advanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationAdvanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationDenodo
 
Lecture1 introduction to big data
Lecture1 introduction to big dataLecture1 introduction to big data
Lecture1 introduction to big datahktripathy
 
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...Denodo
 
Eclipse day Sydney 2014 BIG data presentation
Eclipse day Sydney 2014 BIG data presentationEclipse day Sydney 2014 BIG data presentation
Eclipse day Sydney 2014 BIG data presentationSai Paravastu
 
Advanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationAdvanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationDenodo
 
KU_Big_Data_3_25_2015a
KU_Big_Data_3_25_2015aKU_Big_Data_3_25_2015a
KU_Big_Data_3_25_2015avonmcconnell
 
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...Why Your Data Science Architecture Should Include a Data Virtualization Tool ...
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...Denodo
 
SC6 Workshop 1: Big Data Europe platform requirements and draft architecture:...
SC6 Workshop 1: Big Data Europe platform requirements and draft architecture:...SC6 Workshop 1: Big Data Europe platform requirements and draft architecture:...
SC6 Workshop 1: Big Data Europe platform requirements and draft architecture:...BigData_Europe
 
Advanced Analytics and Machine Learning with Data Virtualization (India)
Advanced Analytics and Machine Learning with Data Virtualization (India)Advanced Analytics and Machine Learning with Data Virtualization (India)
Advanced Analytics and Machine Learning with Data Virtualization (India)Denodo
 
Mapping presentation THAG big data from space
Mapping presentation THAG big data from spaceMapping presentation THAG big data from space
Mapping presentation THAG big data from spaceBartosz Szkudlarek
 
Big Data Evolution
Big Data EvolutionBig Data Evolution
Big Data Evolutionitnewsafrica
 
02 a holistic approach to big data
02 a holistic approach to big data02 a holistic approach to big data
02 a holistic approach to big dataRaul Chong
 
Bigdataissueschallengestoolsngoodpractices 141130054740-conversion-gate01
Bigdataissueschallengestoolsngoodpractices 141130054740-conversion-gate01Bigdataissueschallengestoolsngoodpractices 141130054740-conversion-gate01
Bigdataissueschallengestoolsngoodpractices 141130054740-conversion-gate01Soujanya V
 
Data-centric design and the knowledge graph
Data-centric design and the knowledge graphData-centric design and the knowledge graph
Data-centric design and the knowledge graphAlan Morrison
 
Big data seminor
Big data seminorBig data seminor
Big data seminorberasrujana
 

Ähnlich wie ESSnet Big Data WP8 Methodology (+ Quality, +IT) (20)

Advanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationAdvanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data Virtualization
 
Big data analytics
Big data analyticsBig data analytics
Big data analytics
 
Lecture1 introduction to big data
Lecture1 introduction to big dataLecture1 introduction to big data
Lecture1 introduction to big data
 
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
 
Eclipse day Sydney 2014 BIG data presentation
Eclipse day Sydney 2014 BIG data presentationEclipse day Sydney 2014 BIG data presentation
Eclipse day Sydney 2014 BIG data presentation
 
Advanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationAdvanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data Virtualization
 
KU_Big_Data_3_25_2015a
KU_Big_Data_3_25_2015aKU_Big_Data_3_25_2015a
KU_Big_Data_3_25_2015a
 
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...Why Your Data Science Architecture Should Include a Data Virtualization Tool ...
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...
 
Ijdbms
IjdbmsIjdbms
Ijdbms
 
SC6 Workshop 1: Big Data Europe platform requirements and draft architecture:...
SC6 Workshop 1: Big Data Europe platform requirements and draft architecture:...SC6 Workshop 1: Big Data Europe platform requirements and draft architecture:...
SC6 Workshop 1: Big Data Europe platform requirements and draft architecture:...
 
Ijdbms
IjdbmsIjdbms
Ijdbms
 
Advanced Analytics and Machine Learning with Data Virtualization (India)
Advanced Analytics and Machine Learning with Data Virtualization (India)Advanced Analytics and Machine Learning with Data Virtualization (India)
Advanced Analytics and Machine Learning with Data Virtualization (India)
 
Mapping presentation THAG big data from space
Mapping presentation THAG big data from spaceMapping presentation THAG big data from space
Mapping presentation THAG big data from space
 
Big Data Evolution
Big Data EvolutionBig Data Evolution
Big Data Evolution
 
Ijdbms
IjdbmsIjdbms
Ijdbms
 
Ijdbms
IjdbmsIjdbms
Ijdbms
 
02 a holistic approach to big data
02 a holistic approach to big data02 a holistic approach to big data
02 a holistic approach to big data
 
Bigdataissueschallengestoolsngoodpractices 141130054740-conversion-gate01
Bigdataissueschallengestoolsngoodpractices 141130054740-conversion-gate01Bigdataissueschallengestoolsngoodpractices 141130054740-conversion-gate01
Bigdataissueschallengestoolsngoodpractices 141130054740-conversion-gate01
 
Data-centric design and the knowledge graph
Data-centric design and the knowledge graphData-centric design and the knowledge graph
Data-centric design and the knowledge graph
 
Big data seminor
Big data seminorBig data seminor
Big data seminor
 

Mehr von Piet J.H. Daas

Big Data and official statistics with examples of their use
Big Data and official statistics with examples of their useBig Data and official statistics with examples of their use
Big Data and official statistics with examples of their usePiet J.H. Daas
 
IT infrastructure for Big Data and Data Science at Statistics Netherlands
IT infrastructure for Big Data and Data Science at Statistics NetherlandsIT infrastructure for Big Data and Data Science at Statistics Netherlands
IT infrastructure for Big Data and Data Science at Statistics NetherlandsPiet J.H. Daas
 
EMOS 2018 Big Data methods and techniques
EMOS 2018 Big Data methods and techniquesEMOS 2018 Big Data methods and techniques
EMOS 2018 Big Data methods and techniquesPiet J.H. Daas
 
Use of social media for official statistics
Use of social media for official statisticsUse of social media for official statistics
Use of social media for official statisticsPiet J.H. Daas
 
Isi 2017 presentation on Big Data and bias
Isi 2017 presentation on Big Data and biasIsi 2017 presentation on Big Data and bias
Isi 2017 presentation on Big Data and biasPiet J.H. Daas
 
Responsible Data Science at Statistics Netherlands
Responsible Data Science at Statistics NetherlandsResponsible Data Science at Statistics Netherlands
Responsible Data Science at Statistics NetherlandsPiet J.H. Daas
 
CBS lecture at the opening of Data Science Campus of ONS
CBS lecture at the opening of Data Science Campus of ONSCBS lecture at the opening of Data Science Campus of ONS
CBS lecture at the opening of Data Science Campus of ONSPiet J.H. Daas
 
Ntts2017 presentation 45
Ntts2017 presentation 45Ntts2017 presentation 45
Ntts2017 presentation 45Piet J.H. Daas
 
Big Data presentation Mannheim
Big Data presentation MannheimBig Data presentation Mannheim
Big Data presentation MannheimPiet J.H. Daas
 
Extracting information from ' messy' social media data
Extracting information from ' messy' social media dataExtracting information from ' messy' social media data
Extracting information from ' messy' social media dataPiet J.H. Daas
 
Big data cbs_piet_daas
Big data cbs_piet_daasBig data cbs_piet_daas
Big data cbs_piet_daasPiet J.H. Daas
 
Gebruik van sociale media voor de officiële statistiek
Gebruik van sociale media voor de officiële statistiekGebruik van sociale media voor de officiële statistiek
Gebruik van sociale media voor de officiële statistiekPiet J.H. Daas
 
Profiling Big Data sources to assess their selectivity
Profiling Big Data sources to assess their selectivityProfiling Big Data sources to assess their selectivity
Profiling Big Data sources to assess their selectivityPiet J.H. Daas
 
Using Road Sensor Data for Official Statistics: towards a Big Data Methodology
Using Road Sensor Data for Official Statistics: towards a Big Data MethodologyUsing Road Sensor Data for Official Statistics: towards a Big Data Methodology
Using Road Sensor Data for Official Statistics: towards a Big Data MethodologyPiet J.H. Daas
 
Big Data @ CBS for Fontys students in Eindhoven
Big Data @ CBS for Fontys students in EindhovenBig Data @ CBS for Fontys students in Eindhoven
Big Data @ CBS for Fontys students in EindhovenPiet J.H. Daas
 
Big Data presentation for Statistics Canada
Big Data presentation for Statistics CanadaBig Data presentation for Statistics Canada
Big Data presentation for Statistics CanadaPiet J.H. Daas
 
Quality challenges in modernising business statistics
Quality challenges in modernising business statisticsQuality challenges in modernising business statistics
Quality challenges in modernising business statisticsPiet J.H. Daas
 
Quality Approaches to Big Data
Quality Approaches to Big DataQuality Approaches to Big Data
Quality Approaches to Big DataPiet J.H. Daas
 
Social media sentiment and consumer confidence
Social media sentiment and consumer confidenceSocial media sentiment and consumer confidence
Social media sentiment and consumer confidencePiet J.H. Daas
 

Mehr von Piet J.H. Daas (20)

Big Data and official statistics with examples of their use
Big Data and official statistics with examples of their useBig Data and official statistics with examples of their use
Big Data and official statistics with examples of their use
 
IT infrastructure for Big Data and Data Science at Statistics Netherlands
IT infrastructure for Big Data and Data Science at Statistics NetherlandsIT infrastructure for Big Data and Data Science at Statistics Netherlands
IT infrastructure for Big Data and Data Science at Statistics Netherlands
 
EMOS 2018 Big Data methods and techniques
EMOS 2018 Big Data methods and techniquesEMOS 2018 Big Data methods and techniques
EMOS 2018 Big Data methods and techniques
 
Use of social media for official statistics
Use of social media for official statisticsUse of social media for official statistics
Use of social media for official statistics
 
Isi 2017 presentation on Big Data and bias
Isi 2017 presentation on Big Data and biasIsi 2017 presentation on Big Data and bias
Isi 2017 presentation on Big Data and bias
 
Responsible Data Science at Statistics Netherlands
Responsible Data Science at Statistics NetherlandsResponsible Data Science at Statistics Netherlands
Responsible Data Science at Statistics Netherlands
 
CBS lecture at the opening of Data Science Campus of ONS
CBS lecture at the opening of Data Science Campus of ONSCBS lecture at the opening of Data Science Campus of ONS
CBS lecture at the opening of Data Science Campus of ONS
 
Ntts2017 presentation 45
Ntts2017 presentation 45Ntts2017 presentation 45
Ntts2017 presentation 45
 
Big Data presentation Mannheim
Big Data presentation MannheimBig Data presentation Mannheim
Big Data presentation Mannheim
 

ESSnet Big Data WP8 Methodology (+ Quality, +IT)

  • 1. ESSnet Big Data WP8 Methodology (+ Quality, +IT) Deliverables prepared by: WP8 members BDES 2018 - Sofia, 14-15 May 2018
  • 2. • Introduction Piet Daas • IT Jacek Maślankowski • Quality Magdalena Six • Methodology Valentin Chavdarov & Piet Daas • Literature study Jacek Maślankowski • Discussant Faiz Alsuhail • Questions + Discussion All Overview of this session
  • 3. • Aim of WP8 is to generalize the findings of the pilots in ESSnet Big Data and relate them to the conditions for future use of big data sources within the European Statistical System. • Only active in SGA-2 (January 2017 - May 2018) • Focus on Methodology, Quality and IT-infrastructure Overview of WP8
  • 4. • Based on real world experiences – Work performed in WP 1-7 of ESSnet and other work relevant for NSI’s (or similar) • Broad area: focus on most important topics – In three areas: IT, Quality and Methodology – Identify the most important topics (for each area) at the start of WP8 during a workshop with experts – To assure a sufficiently ‘blended’ view on BD • Follow a bottom-up approach Starting points of WP8
  • 5. • Identified most important topics for • IT – 10 in total • Quality – 7 in total • Methodology – 11 in total • Topics based on the BD ‘state of the art’ in January 2017 – One topic emerged in all 3 areas Results of the workshop
  • 6. • A Process ‘view’ on Big Data – IT: Data Processing Life Cycle – Quality: Process Chain Control – Methodology: Data Process Architecture – This is important • GSBPM provides a general view on NSI processes (Generic Statistical Business Process Model ) Common topic
  • 7. Big Data process: Data driven Different than the approach commonly used in official statistics
  • 8. IT Report in the ESSnet Big Data Deliverable 8.3 of WP8 Prepared by: WP8 members Jacek Maślankowski, Statistics Poland BDES 2018 - Sofia, 14-15 May 2018
  • 9. 1. Big Data processing life cycle 2. Metadata management (ontology) 3. Format of Big Data processing 4. Data-hub and Data-lake 5. Data source integration 6. Choosing the right infrastructure 7. List of secure and tested API’s 8. Shared libraries and documented standards 9. Speed of algorithms 10. Training/skills/knowledge Information covered in the report
  • 11. Data processing and data storage
      Batch (static data), by data type:
        – Structured: RDBMS, DBF, ... | Relational database, files | Hadoop, MySQL, ...
        – Unstructured: Text, Website, ... | Files, NoSQL | Hadoop, Solr, ...
        – Semi-structured: CSV, JSON, XML, ... | Files, NoSQL or relational databases | Hadoop, HBase, ...
      Streaming (real-time data), by data type:
        – Sensors: TXT or CSV files | In-memory processing engine | Spark, Kafka, ...
        – Web: Websites | In-memory processing engine | Spark, Storm, ...
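The batch/streaming distinction on this slide can be illustrated with a minimal Python sketch. It is not taken from the report: the function names and the sensor readings are hypothetical, and plain Python stands in for engines such as Hadoop or Spark. A batch job sees the whole static dataset at once, whereas a streaming job updates its result record by record without storing the full history.

```python
from typing import Iterable, Iterator, List


def batch_mean(values: List[float]) -> float:
    """Batch style: the complete (static) dataset is available up front."""
    return sum(values) / len(values)


def streaming_mean(stream: Iterable[float]) -> Iterator[float]:
    """Streaming style: update the mean incrementally, one record at a time."""
    count = 0
    mean = 0.0
    for x in stream:
        count += 1
        mean += (x - mean) / count  # incremental update, no history kept
        yield mean


# Hypothetical sensor readings, used only for illustration.
readings = [10.0, 20.0, 30.0, 40.0]
print(batch_mean(readings))
print(list(streaming_mean(readings)))
```

The last value produced by the streaming version equals the batch result, but intermediate values are available as soon as each record arrives.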
  • 12. GitHub repositories
      1. Awesome Official Statistics software
         Link: https://github.com/SNStatComp/awesome-official-statistics-software
         Main features: List of useful statistical software with links to other GitHub repositories, by CBS NL
      2. ONS (Office for National Statistics) UK Big Data team
         Link: https://github.com/ONSBigData
         Main features: Various software developed by the ONS UK Big Data Team
      3. ONS (Office for National Statistics) UK Data Science Campus
         Link: https://github.com/datasciencecampus
         Main features: Various software developed by the ONS Data Science Campus Team
      4. ESTP Big Data course software
         Link: https://github.com/SNStatComp/ESTPBD
         Main features: Various software developed for the ESTP Big Data training courses
  • 13. API’s used
      1. Twitter API
         Basic functionality: Scrape tweets by keywords, hashtags, users; streaming scraping
         Restrictions: 25 to 900 requests/15 minutes; access only to public profiles
         Potential domains (WP number): Population, Social Statistics, Tourism (WP2, WP7)
         Remarks: Account and API code needed
      2. Facebook Graph API
         Basic functionality: Collect information from public profiles, including very specific items such as photo metatags
         Restrictions: Mostly present information; typically no more than dozens of requests
         Potential domains (WP number): Population (WP7)
         Remarks: Account and API code needed
      3. Google Maps API
         Basic functionality: Looking for any kind of objects (e.g., hotels), verification of addresses, monitoring the traffic on specific roads
         Restrictions: Free up to 2.5 thousand requests per day. $0.50 USD per 1 thousand additional requests, up to 100 thousand daily, if billing is enabled.
         Potential domains (WP number): Tourism (WP7)
         Remarks: Google account and API code needed
      4. Google Custom Search API
         Basic functionality: Can be used to search through one website; with modifications it will search for keywords across the whole Internet; can be used to find the URL of a specific enterprise
         Restrictions: JSON/Atom Custom Search API provides 100 search queries per day for free. Additional requests cost $5 per 1000 queries, up to 10k queries per day.
         Potential domains (WP number): Business (WP2)
         Remarks: Google account and API code needed
      5. Bing API
         Basic functionality: Finding the specific URL of an enterprise
         Restrictions: 7 queries per second (QPS) per IP address
         Potential domains (WP number): Business (WP2)
         Remarks: AppID needed
      6. Guardian API
         Basic functionality: Collect news articles and comments from the Guardian website
         Restrictions: Free for non-commercial use. Up to 12 calls per second, up to 5,000 calls per day, access to article text, access to over 1,900,000 pieces of content.
         Potential domains (WP number): Population, Social Statistics (WP7)
         Remarks: Registered account needed
      7. Copernicus Open Access Hub
         Basic functionality: Access to Sentinel-1 and Sentinel-2 repositories
         Restrictions: Free for registered users
         Potential domains (WP number): Agriculture (WP7)
         Remarks: Registered account needed
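The restrictions listed on this slide translate directly into collection planning. The sketch below is only an illustration, using the figures quoted on the slide (which date from 2018 and may no longer apply). It makes two assumptions that the slide does not state: additional Custom Search queries are billed pro rata, and the 10k daily cap includes the free queries. The function names are hypothetical and no real API is called.

```python
# Figures quoted on the slide for the Google Custom Search API (2018).
FREE_QUERIES_PER_DAY = 100
PRICE_PER_1000_USD = 5.0
DAILY_CAP = 10_000  # assumption: the cap covers free and paid queries together


def custom_search_daily_cost(queries: int) -> float:
    """Estimated daily cost in USD, assuming pro-rata billing (an assumption)."""
    if queries < 0 or queries > DAILY_CAP:
        raise ValueError("queries must be between 0 and the daily cap")
    billable = max(0, queries - FREE_QUERIES_PER_DAY)
    return billable / 1000 * PRICE_PER_1000_USD


def min_interval_seconds(max_requests: int, window_seconds: float) -> float:
    """Smallest average pause between requests that stays within a rate limit."""
    return window_seconds / max_requests


print(custom_search_daily_cost(1100))      # 1000 billable queries
print(min_interval_seconds(900, 15 * 60))  # Twitter, upper limit per 15 minutes
print(min_interval_seconds(25, 15 * 60))   # Twitter, lower limit per 15 minutes
```

Spacing requests by at least the computed interval is a simple way to stay below a quota; a production scraper would also need to handle the error responses each API returns when a limit is exceeded.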
  • 14. 1. There is no unified framework for Big Data metadata management. 2. There is a common point in all WPs on tools and data storage. 3. Data-lakes and data-hubs are still not explored deeply. 4. There are best practices on using different API’s. 5. Software is shared by NSI’s in GitHub repositories. 6. There is no unified framework for data source integration. 7. A variety of training courses allows data scientists to build the required skills. Main findings
  • 15. Report on Quality Aspects of Big Data in the ESSnet Big Data Deliverable 8.2 of WP8 Prepared by: WP8 members Magdalena Six, Statistics Austria BDES 2018 - Sofia, 14-15 May 2018
  • 16. In relation to cause(s) of errors: • Coverage, Accuracy and Selectivity • Processing errors • Linkability • Measurement errors • Model errors and precision In relation to changes in the composition of the source • Comparability over time • Process chain control 7 Quality Aspects of Big Data
  • 17. 7 Quality Aspects in the Context of UNECE’s Quality Framework for BD • 3 Phases of the business process: Input, Throughput, Output • 3 Hyperdimensions: Source, Data, Metadata
  • 18. Structure of the Report on Quality in the ESSnet Big Data 7 Chapters according to the 7 identified quality aspects Same structure for each chapter: 1. Introduction: meaning of the respective quality aspect in the context of Big Data 2. Examples and Methods: Role of the respective quality aspect in the WP1-WP7 3. Discussion: Challenges for the quality aspect, cross connections to other Chapters in the Quality Report, but also to IT and Methodology Report
  • 19. Examples of new (?), BD-specific (?) Error Sources • Scrambling of the Automatic Identification System (AIS) signal of ships in WP4 -> measurement or coverage error? • Scraping of a deceptive job vacancy ad -> measurement or coverage error? • Non-stable access to the BD source, change in the technological process generating the BD, change in use of BD-generating devices -> comparability over time • Multiple layers of (new) processing steps required (advanced techniques for editing, imputation, linking techniques, text mining algorithms…), each introducing new error sources • Deduction of information about the target variable from other variables via modelling; models based on small-sample statistical inference don’t work
  • 20. Quality Measures: Challenges from the past and Challenges ahead • Still in the experimenting phase • Often no routine, no regular access to Big Data source • Focus in WPs more on potential sources and potential access to sources than on a standardized reporting of quality measures • Experimental phase shows: Big Data sources, as well as processes needed to work with these sources are so diverse that the development of standardized quality measures / a quality framework will be challenging
  • 21. Report on Methodology in the ESSnet Big Data Deliverable 8.4 of WP8 Prepared by: WP8 members Valentin Chavdarov & Piet Daas BDES 2018 - Sofia, 14-15 May 2018
  • 22. Why Big data methodology? 1. A good part of statistical methodology is built around survey data. There are many conventions in statistical methodology that reflect the failure of surveys to capture important economic and social phenomena. 2. Big Data is a by-product of modern society. Not a lot is known about the data generation process and the units included. 3. Working in a data-driven way is new for NSI’s. Methods and principles are needed to ensure that valid conclusions are drawn when using Big Data.
  • 23. Big data methodology issues 1. Assessing accuracy 2. What should our final product look like? 3. Deal with spatial dimension 4. Changes in data sources 5. Machine learning in official statistics 6. Data linkage 7. Secure multi-party computation 8. Inference 9. Sampling 10. Data process architecture 11. Unit identification problem
  • 24. Big data methodology issues - cont • Methodological issues differ in scope. Assessing accuracy, for example, covers almost all stages of the statistical production process: from collecting data through processing to data analysis. • Some of the issues are BD-specific: data linkage; changes in data sources; unit identification problem.
  • 25. Risk of social sciences datafication
  • 26. There are three ways in which Big Data can be used for official statistics 1) Survey based, as an additional source to improve survey based estimation (~ WP2, WP7, sentiment NL) 2) Census based, as the main/single source Whole target population is included (WP4, road sensor NL) 3) Incomplete, as the main/single source Only part of the target population is included (WP1, WP3 ….) Need to correct for that Using Big Data
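For the third case (an incomplete source that covers only part of the target population), one common correction is post-stratification: estimate within groups and reweight by known population sizes. The sketch below is an illustration under stated assumptions, not the method used in any of the WPs. It assumes group sizes are known from an external source such as a register, and the group names and numbers are made up.

```python
from statistics import mean
from typing import Dict, List


def naive_mean(observed: Dict[str, List[float]]) -> float:
    """Unweighted mean over all observed units, ignoring selectivity."""
    return mean(v for values in observed.values() for v in values)


def poststratified_mean(observed: Dict[str, List[float]],
                        population_sizes: Dict[str, int]) -> float:
    """Weight each stratum mean by its known population size."""
    if set(observed) != set(population_sizes):
        raise ValueError("every stratum needs observations and a population size")
    total = sum(population_sizes.values())
    return sum(population_sizes[h] * mean(observed[h]) for h in observed) / total


# Hypothetical data: "young" units are over-represented in the source.
observed = {"young": [1.0, 1.0, 0.0, 1.0], "old": [0.0, 0.0]}
population_sizes = {"young": 400, "old": 600}

print(naive_mean(observed))
print(poststratified_mean(observed, population_sizes))
```

The correction only helps when the stratifying variable is available for the units in the Big Data source and explains the selectivity, which ties in with the unit identification problem listed among the methodology issues.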
  • 27. More important/New to Big Data
      Methodology | Quality | IT
        – Bias & models | Coverage | Choosing right infra
        – Data driven way of working | Sources of error | Training/skills/knowledge
        – Machine Learning (2 places) | Editing data | Big Data libraries
        – Linking (e.g. geo-loc) | Linkability | Programming languages
        – Unit identification (features)
      In these areas new methods are needed and are being developed!
  • 28. Literature study in the ESSnet Big Data Deliverable 8.1 of WP8 Prepared by: WP8 members Jacek Maślankowski, Statistics Poland BDES 2018 - Sofia, 14-15 May 2018
  • 29. • Bibliographic data • Link • Short overview (strengths, weaknesses) • Data sources • Domains • Keywords • Classification (A – very relevant, B – relevant, C – less relevant) Sharing the experience WP8 Wiki → Reports, milestones and deliverables → Literature overview
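The fields recorded for each publication in the literature overview map naturally onto a small record type. The following Python sketch shows one possible shape. The class name, field names and the sample entry are hypothetical placeholders, not the structure actually used on the WP8 Wiki.

```python
from dataclasses import dataclass, field
from typing import List

# Classification codes as listed on the slide.
RELEVANCE = {"A": "very relevant", "B": "relevant", "C": "less relevant"}


@dataclass
class LiteratureEntry:
    bibliographic_data: str
    link: str
    overview: str  # short overview: strengths, weaknesses
    data_sources: List[str] = field(default_factory=list)
    domains: List[str] = field(default_factory=list)
    keywords: List[str] = field(default_factory=list)
    classification: str = "C"

    def __post_init__(self) -> None:
        if self.classification not in RELEVANCE:
            raise ValueError(f"classification must be one of {sorted(RELEVANCE)}")


def very_relevant(entries: List[LiteratureEntry]) -> List[LiteratureEntry]:
    """Select the entries classified as A (very relevant)."""
    return [e for e in entries if e.classification == "A"]


# Placeholder entries, for illustration only.
entries = [
    LiteratureEntry("Placeholder reference 1", "https://example.org/1",
                    "Placeholder overview", classification="A"),
    LiteratureEntry("Placeholder reference 2", "https://example.org/2",
                    "Placeholder overview", classification="C"),
]
print([e.bibliographic_data for e in very_relevant(entries)])
```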
  • 31. Thank you for your attention