ESSnet Big Data WP8 Methodology (+ Quality, +IT)
1. ESSnet Big Data WP8
Methodology (+ Quality, +IT)
Deliverables prepared by: WP8 members
BDES 2018 - Sofia,
14-15 May 2018
2. • Introduction Piet Daas
• IT Jacek Maślankowski
• Quality Magdalena Six
• Methodology Valentin Chavdarov & Piet Daas
• Literature study Jacek Maślankowski
• Discussant Faiz Alsuhail
• Questions + Discussion All
Overview of this session
3. • Aim of WP8 is to generalize the findings of the pilots in ESSnet Big
Data and relate them to the conditions for future use of big data
sources within the European Statistical System.
• Only active in SGA-2 (January 2017 - May 2018)
• Focus on Methodology, Quality and IT-infrastructure
Overview of WP8
4. • Based on real world experiences
– Work performed in WP 1-7 of ESSnet and other work relevant for
NSI’s (or similar)
• Broad area: focus on most important topics
– In three areas: IT, Quality and Methodology
– Identify the most important topics (for each area) at the start of
WP8 during a workshop with experts
– To assure a sufficiently ‘blended’ view on BD
• Follow a bottom-up approach
Starting points of WP8
5. • Identified most important topics for
• IT
– 10 in total
• Quality
– 7 in total
• Methodology
– 11 in total
• Topics based on the BD ‘state of the art’ in January 2017
– One common topic emerged in all 3 areas
Results of the workshop
6. • A Process ‘view’ on Big Data
– IT: Data Processing Life Cycle
– Quality: Process Chain Control
– Methodology: Data Process Architecture
– This is important
• GSBPM provides a general view on NSI processes
(Generic Statistical Business Process Model )
Common topic
7. Big Data process: Data driven
Different from the approach commonly used in official statistics
8. IT Report
in the ESSnet Big Data
Deliverable 8.3 of WP8
Prepared by: WP8 members
Jacek Maślankowski, Statistics Poland
9. 1. Big Data processing life cycle
2. Metadata management (ontology)
3. Format of Big Data processing
4. Data-hub and Data-lake
5. Data source integration
6. Choosing the right infrastructure
7. List of secure and tested API’s
8. Shared libraries and documented standards
9. Speed of algorithms
10. Training/skills/knowledge
Information covered in the report
11. Data processing and data storage
Data type
• Batch (static data)
– Structured (RDBMS, DBF, ...): stored in a relational database or files; tools: Hadoop, MySQL, ...
– Unstructured (text, website, ...): stored in files or NoSQL; tools: Hadoop, Solr, ...
– Semi-structured (CSV, JSON, XML, ...): stored in files, NoSQL or relational databases; tools: Hadoop, HBase, ...
• Streaming (real-time data)
– Sensors (TXT or CSV files): in-memory processing engine; tools: Spark, Kafka, ...
– Web (websites): in-memory processing engine; tools: Spark, Storm, ...
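The classification on this slide can be written down as a small lookup table, which makes it easy to ask which storage and tools the slide associates with a given kind of source. This is only an illustrative sketch: the dictionary name, the key names and the `suggest` helper are hypothetical, and the entries simply restate the slide rather than recommend a setup.

```python
# Hypothetical lookup that restates the slide; keys are (processing mode, data kind).
PROCESSING_OPTIONS = {
    ("batch", "structured"): {
        "formats": ["RDBMS", "DBF"],
        "storage": "relational database, files",
        "tools": ["Hadoop", "MySQL"],
    },
    ("batch", "unstructured"): {
        "formats": ["text", "website"],
        "storage": "files, NoSQL",
        "tools": ["Hadoop", "Solr"],
    },
    ("batch", "semi-structured"): {
        "formats": ["CSV", "JSON", "XML"],
        "storage": "files, NoSQL or relational databases",
        "tools": ["Hadoop", "HBase"],
    },
    ("streaming", "sensors"): {
        "formats": ["TXT", "CSV"],
        "storage": "in-memory processing engine",
        "tools": ["Spark", "Kafka"],
    },
    ("streaming", "web"): {
        "formats": ["websites"],
        "storage": "in-memory processing engine",
        "tools": ["Spark", "Storm"],
    },
}


def suggest(mode, kind):
    """Return the slide's entry for a processing mode and data kind."""
    try:
        return PROCESSING_OPTIONS[(mode, kind)]
    except KeyError:
        raise ValueError(f"combination not covered by the slide: {mode}/{kind}")


print(suggest("streaming", "sensors")["tools"])
```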
12. Github repositories
No. | Name | Link | Main features
1 | Awesome Official Statistics software | https://github.com/SNStatComp/awesome-official-statistics-software | The list of useful statistical software with links to other GitHub repositories, by CBS NL
2 | ONS (Office for National Statistics) UK Big Data team | https://github.com/ONSBigData | Various software developed by ONS UK Big Data Team
3 | ONS (Office for National Statistics) UK Data Science Campus | https://github.com/datasciencecampus | Various software developed by ONS Data Science Campus Team
4 | ESTP Big Data course software | https://github.com/SNStatComp/ESTPBD | Various software developed for the ESTP Big Data training courses
13. API’s used
No. | Name of the API | Basic functionality | Restrictions | Potential domains (WP number) | Remarks
1 | Twitter API | Scrape tweets by keywords, hashtags, users; streaming scraping | 25 to 900 requests per 15 minutes; access only to public profiles | Population, Social Statistics, Tourism (WP2, WP7) | Account and API code needed
2 | Facebook Graph API | Collect information from public profiles, including very specific items such as photo metatags | Mostly current information; typically no more than dozens of requests | Population (WP7) | Account and API code needed
3 | Google Maps API | Looking for any kind of objects (e.g., hotels), verification of addresses, monitoring the traffic on specific roads | Free up to 2,500 requests per day; USD 0.50 per 1,000 additional requests, up to 100,000 daily, if billing is enabled | Tourism (WP7) | Google account and API code needed
4 | Google Custom Search API | Can be used to search through one website; with modifications it will search for keywords across the whole Internet; can be used to find the URL of a specific enterprise | JSON/Atom Custom Search API provides 100 search queries per day for free; additional requests cost USD 5 per 1,000 queries, up to 10,000 queries per day | Business (WP2) | Google account and API code needed
5 | Bing API | Finding the specific URL of an enterprise | 7 queries per second (QPS) per IP address | Business (WP2) | AppID needed
6 | Guardian API | Collect news articles and comments from the Guardian website | Free for non-commercial use; up to 12 calls per second and up to 5,000 calls per day; access to article text; access to over 1,900,000 pieces of content | Population, Social Statistics (WP7) | Registered account needed
7 | Copernicus Open Access Hub | Access to Sentinel-1 and Sentinel-2 repositories | Free for registered users | Agriculture (WP7) | Registered account needed
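The restrictions in this table translate into two practical questions for a collection script: how long to wait before the next request, and what a day of requests would cost. The sketch below is one possible way to handle both, not the approach used in the work packages. The class and function names are hypothetical, the figures are the ones quoted in the table (they may have changed since), and the 100,000 cap is assumed to apply to the total number of daily requests.

```python
from collections import deque


class SlidingWindowLimiter:
    """Allow at most max_calls within any window of window_seconds."""

    def __init__(self, max_calls, window_seconds):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self._calls = deque()  # timestamps of recent calls, oldest first

    def wait_time(self, now):
        """Seconds to wait at time `now` before another call is allowed."""
        # Forget calls that have left the window.
        while self._calls and now - self._calls[0] >= self.window_seconds:
            self._calls.popleft()
        if len(self._calls) < self.max_calls:
            return 0.0
        # The oldest remaining call must age out of the window first.
        return self.window_seconds - (now - self._calls[0])

    def record(self, now):
        """Register that a call was made at time `now`."""
        self._calls.append(now)


def google_maps_daily_cost(requests, free_quota=2500,
                           usd_per_thousand=0.50, daily_cap=100_000):
    """Daily cost in USD under the pricing quoted in the table (assumed cap on total requests)."""
    if requests > daily_cap:
        raise ValueError("request count exceeds the assumed daily cap")
    billable = max(0, requests - free_quota)
    return billable / 1000 * usd_per_thousand


# Upper Twitter limit from the table: 900 requests per 15 minutes.
twitter_limiter = SlidingWindowLimiter(max_calls=900, window_seconds=15 * 60)
print(twitter_limiter.wait_time(now=0.0))
print(google_maps_daily_cost(12_500))
```

In a real script `now` would come from `time.monotonic()` and the caller would sleep for the returned wait time; passing the time in explicitly keeps the sketch easy to test.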
14. 1. There is no unified framework for Big Data metadata
management.
2. All WPs share common ground on tools and data storage.
3. Data-lakes and data-hubs have not yet been explored in depth.
4. There are best practices on using different API’s.
5. Software is shared by NSI’s on Github repositories.
6. There is no unified framework for data source integration.
7. A variety of training courses is available to build the skills
data scientists need.
Main findings
15. Report on
Quality Aspects of Big Data
in the ESSnet Big Data
Deliverable 8.2 of WP8
Prepared by: WP8 members
Magdalena Six, Statistics Austria
16. In relation to cause(s) of errors:
• Coverage, Accuracy and Selectivity
• Processing errors
• Linkability
• Measurement errors
• Model errors and precision
In relation to changes in the composition of the source
• Comparability over time
• Process chain control
7 Quality Aspects of Big Data
17. 7 Quality Aspects in the Context of
UNECE’s Quality Framework for BD
• 3 Phases of the business process: Input, Throughput, Output
• 3 Hyperdimensions: Source, Data, Metadata
18. Structure of the Report on Quality in
the ESSnet Big Data
7 Chapters according to the 7 identified quality aspects
Same structure for each chapter:
1. Introduction: meaning of the respective quality aspect in the
context of Big Data
2. Examples and Methods: Role of the respective quality aspect
in the WP1-WP7
3. Discussion: Challenges for the quality aspect, cross connections
to other Chapters in the Quality Report, but also to IT and
Methodology Report
19. Examples for new (?), BD specific (?)
Error Sources
• Scrambling of the Automated Identification Signal (AIS) of ships in WP4 ->
measurement or coverage error?
• Scraping of a deceptive Job vacancy ad -> measurement or coverage error?
• Non-stable access to the BD source, change in technological process
generating the BD, change in use of BD-generating devices -> comparability
over time
• Multiple layers of (new) processing steps required (advanced techniques for
editing, imputation, linking techniques, text mining algorithms…) including
new error sources
• Deduction of information about target variable from other variables via
modelling, models based on small-sample statistical inference don’t work
20. Quality Measures: Challenges from the
past and Challenges ahead
• Still in the experimenting phase
• Often no routine, no regular access to Big Data source
• Focus in WPs more on potential sources and potential access to
sources than on a standardized reporting of quality measures
• Experimental phase shows: Big Data sources, as well as processes
needed to work with these sources are so diverse that the
development of standardized quality measures / a quality framework
will be challenging
21. Report on
Methodology
in the ESSnet Big Data
Deliverable 8.4 of WP8
Prepared by: WP8 members
Valentin Chavdarov & Piet Daas
22. Why Big data methodology?
1. A good part of statistical methodology is built
around survey data. Many conventions in statistical
methodology reflect the failure of surveys to
capture important economic and social phenomena.
2. Big Data is a by-product of modern society. Little
is known about the data generation process and
about the units included.
3. Working in a data-driven way is new for NSI’s.
Methods and principles are needed to assure
valid conclusions are drawn when using Big Data.
23. Big data methodology issues
1. Assessing accuracy
2. What should our final product look like?
3. Deal with spatial dimension
4. Changes in data sources
5. Machine learning in official statistics
6. Data linkage
7. Secure multi-party computation
8. Inference
9. Sampling
10. Data process architecture
11. Unit identification problem
24. Big data methodology issues (cont.)
• Methodological issues differ in scope. Assessing accuracy, for
example, covers almost all stages of the statistical production
process: from collecting data through processing to data analysis.
• Some of the issues are BD-specific: data linkage, changes in data
sources, and the unit identification problem.
26. There are three ways in which Big Data can be used for official statistics
1) Survey based, as an additional source
to improve survey based estimation (~ WP2, WP7,
sentiment NL)
2) Census based, as the main/single source
Whole target population is included (WP4, road sensor NL)
3) Incomplete, as the main/single source
Only part of the target population is included (WP1, WP3 ….)
Need to correct for that
Using Big Data
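For the third case, where only part of the target population is in the source, one common textbook correction is post-stratification: estimate the mean within each stratum from the observed units and weight the strata by their known population sizes. The sketch below illustrates that idea only; it is not presented as the method used in WP1 or WP3. The function name and the example numbers are made up, and it assumes that population counts per stratum are known and that covered and uncovered units within a stratum are similar.

```python
def poststratified_mean(observed, population_counts):
    """Weight per-stratum means from the source by known population shares.

    observed: dict mapping stratum -> list of values seen in the big data source
    population_counts: dict mapping stratum -> known population size
    """
    total_population = sum(population_counts.values())
    estimate = 0.0
    for stratum, size in population_counts.items():
        values = observed.get(stratum)
        if not values:
            # No correction is possible for a stratum the source never covers.
            raise ValueError(f"no observations for stratum {stratum!r}")
        stratum_mean = sum(values) / len(values)
        estimate += (size / total_population) * stratum_mean
    return estimate


# Made-up example: the source over-represents the "young" stratum.
observed = {"young": [10, 12, 14, 10], "old": [20]}
population = {"young": 500, "old": 500}

all_values = [v for values in observed.values() for v in values]
print(sum(all_values) / len(all_values))          # uncorrected mean of the source
print(poststratified_mean(observed, population))  # mean after reweighting
```

The uncorrected mean is pulled towards the over-represented stratum; the weighted estimate restores each stratum to its population share, which is only as good as the similarity assumption above.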
27. Methodology | Quality | IT
• Bias & models | Coverage | Choosing right infra
• Data driven way of working | Sources of error | Training/skills/knowledge
• Machine Learning (2 places) | Editing data | Big Data libraries
• Linking (e.g. geo-loc) | Linkability | Programming languages
• Unit identification (features)
In these areas new methods are needed and are being developed!
More important/New to Big Data
28. Literature study
in the ESSnet Big Data
Deliverable 8.1 of WP8
Prepared by: WP8 members
Jacek Maślankowski, Statistics Poland
29. • Bibliographic data
• Link
• Short overview (strengths, weaknesses)
• Data sources
• Domains
• Keywords
• Classification (A – very relevant, B – relevant, C – less relevant)
Sharing the experience
WP8 Wiki
Reports, milestones and
deliverables
Literature overview
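The fields recorded for each item in the literature overview map naturally onto a small record type. The sketch below is a hypothetical representation, not the structure of the WP8 Wiki: the class name, field names and the sorting helper are assumptions, and the example entries are placeholders rather than real references.

```python
from dataclasses import dataclass, field

# Classification codes as listed on the slide.
RELEVANCE = {"A": "very relevant", "B": "relevant", "C": "less relevant"}


@dataclass
class LiteratureEntry:
    bibliographic_data: str
    link: str
    overview: str  # short overview: strengths, weaknesses
    data_sources: list = field(default_factory=list)
    domains: list = field(default_factory=list)
    keywords: list = field(default_factory=list)
    classification: str = "C"

    def __post_init__(self):
        if self.classification not in RELEVANCE:
            raise ValueError(f"classification must be one of {sorted(RELEVANCE)}")


def by_relevance(entries):
    """Return entries ordered from very relevant (A) to less relevant (C)."""
    return sorted(entries, key=lambda entry: entry.classification)


# Placeholder entries only; no real references are implied.
entries = [
    LiteratureEntry("placeholder reference 1", "placeholder link", "placeholder overview",
                    classification="B"),
    LiteratureEntry("placeholder reference 2", "placeholder link", "placeholder overview",
                    keywords=["placeholder keyword"], classification="A"),
]
for entry in by_relevance(entries):
    print(entry.classification, RELEVANCE[entry.classification], entry.bibliographic_data)
```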