Responsible Data Science at Statistics Netherlands
1. Responsible Data Science at
Statistics Netherlands:
implications for big data research
Piet Daas
Senior Methodologist, Theme coordinator Big Data research
& lead Data Scientist at the Center for Big Data Statistics
2. Responsible Data Science @ CBS
– CBS
‐ About CBS and the CBS law
‐ A relevant example
‐ Responsible Statistics
– Big Data
‐ Center for Big Data Statistics
‐ Responsible Data science
‐ Implications (challenges) for Big Data
‐ Some examples of the things we do
4. Statistics Netherlands mission
Publish reliable and coherent statistical information that
responds to the needs of Dutch society.
– Independent organization
– When Statistics Netherlands requests information:
• companies and institutes are legally obliged to cooperate
• persons and households provide this on a voluntary basis
– To reduce the response burden SN has access to registers (admin
data) of governmental and semi-governmental organisations.
– We present facts with a ‘short story’ (but no ‘cherry picking’)
– Information is made available -at the same time- to everyone for
free.
5. About Statistics Netherlands
Statistics Netherlands was founded in 1899
2 rooms on ‘het binnenhof’ with 5 employees
We currently have ~2000 employees
(max in 1982: 3600)
We produce >500 statistics per year
80% based on EU-regulations
There is a solid legal base to enable access to all kinds of
data and to process personal data:
- the CBS-law
- CBS data collection law
6. The Statistics Netherlands law
It is our intention to ‘burden’ people and companies as little as possible
with requests for data
- Re-use as much data as possible, such as data collected by others
- Increasing use of admin data and hence our interest in Big Data
8. Responsible Statistics
– Social Statistical Database (SSD)
– Predominantly administrative data on persons,
combined with a number of surveys
– The SSD is used for a whole range of social statistics, for
social research and for the virtual census
9. What’s in the SSD?
The data is combined at the individual level and covers a
long period (start date 1999)
12. SSD ‘under the hood’
– All data is processed in our most secure internal
environment
– Personally identifiable data (such as CSN, addresses and
names) are removed ASAP from data files
– CSN is converted to a so-called RIN-number
(non-identifiable unique number)
– Researchers only get access to the variables they need
(nothing more; even for SN-colleagues)
– Output is rigorously checked for disclosure (if there is a
risk, part of the data is disturbed)
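The steps on this slide (replace the identifier, remove PIDs, expose only the needed variables) can be sketched as follows. This is a minimal illustration, not the way Statistics Netherlands derives RIN-numbers: the keyed hash (HMAC-SHA256), the key, the field names and the example record are all assumptions made for the sketch.

```python
import hashlib
import hmac

# Hypothetical secret key; in a real setting it would live only in the
# secure environment and never travel with the data.
SECRET_KEY = b"replace-with-a-secret-key"


def pseudonymise(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Map an identifier to a stable, meaningless pseudonym (keyed hash)."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def strip_identifiers(record: dict, id_field: str, pid_fields: set, needed: set) -> dict:
    """Replace the identifier by a pseudonym, drop PIDs, keep only needed variables."""
    out = {"pseudonym": pseudonymise(record[id_field])}
    for name, value in record.items():
        if name == id_field or name in pid_fields:
            continue  # identifying fields never reach the researcher
        if name in needed:
            out[name] = value
    return out


# Hypothetical input record, for illustration only.
record = {
    "csn": "123456782",
    "name": "J. Jansen",
    "address": "Example street 1",
    "age": 42,
    "income": 31000,
    "municipality": "Utrecht",
}
clean = strip_identifiers(record, "csn", {"name", "address"}, {"age", "income"})
print(sorted(clean))
```

Because the same identifier always maps to the same pseudonym, records from different sources can still be linked at the individual level without the identifier itself being present.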
13. Responsible Statistics (2)
– Fairness
‐ PIDs are removed as early in the process as possible
‐ Data minimisation principle is applied
‐ Data is re-used as much as possible (reduce response burden).
– Accuracy
‐ Only well-established methods are used. Should be part of the
Methodology series of Statistics Netherlands (or published in
journals)
‐ Quality checks and confidence intervals should be available
– Confidentiality
‐ Statistics are produced on non-(de)identifiable (aggregated) units
‐ All output is checked for disclosure and (locally) disturbed if needed.
– Transparency
‐ The way data is processed, combined and the estimation models used
should be clearly described and internally available (no ‘personal’
produced statistics).
16. Goals CBDS
- New, more detailed, real time statistics
- Reduce the data collection footprint
- Deepen knowledge in Big Data methodology
- Privacy aspects of the use of Big Data in official statistics
- Offer an ecosystem to exchange knowledge and resources
17. Admin Data Sources
• Tax Data
• Population Register
• Insurance Register
• ...
Surveys
• Labor Force Survey
• Safety Monitor
• International Trade
• ...
Big Data
• Sensor Data
• Social Media
• Internet Data
• ...
Data integration
CBDS
Statistical Output
• Safety
• Mobility
• Income
• Tourism
• Environmental
• Labor force
• Census
• Health
• Economy
• Energy
18. Responsible Data Science
– Fairness
‐ PIDs are removed as early in the process as possible
‐ Data minimisation principle is applied
‐ Data is re-used as much as possible (low response burden).
– Accuracy
‐ Only well-established methods are used. Should be part of the
Methodology series of Statistics Netherlands (or published in
journals)
‐ Quality checks and confidence intervals should be available (bias?)
– Confidentiality
‐ Statistics are produced on non-(de)identifiable (aggregated) units
‐ All output is checked for disclosure control and (locally) disturbed if
needed. Effect of adding Big Data sources?
– Transparency
‐ The way data is processed, combined and the estimation models
used must be clearly described and available for everyone involved
(no ‘personal’ produced statistics). What about new processes?
19. Responsible Data Science: Fairness
– Fairness
‐ PIDs are removed as early in the process as possible
• Are there PIDs available?
• Identifying units in Big Data is sometimes very hard
• Some data, such as the text of a tweet, is also a PID.
• This is publicly available data, ‘consciously’ put on the
internet by a user
‐ Data minimisation principle is applied
• Our way of working is to use as much data as is
available, because of the low information content of Big
Data (the data is used for a purpose other than intended)
20. Responsible Data Science: Accuracy
– Accuracy
‐ Only well-established methods are used. Should be part
of the Methodology series of Statistics Netherlands (or
published in journals)
• There are hardly any well-established Big Data methods
at the moment (those used for satellite data?)
‐ Quality checks and confidence intervals should be
available (and bias?)
• Fully automated quality checks are needed
• What about (reasonable) confidence intervals of ML-
based methods?
• Isn’t bias more important (selectivity of data in the
source)?
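One generic way to attach an interval to an ML-based estimate is the percentile bootstrap. The sketch below is an assumption-laden illustration, not a method from the slides: the 0/1 predictions are hypothetical, and the interval reflects resampling variability only. It says nothing about the selectivity of the source or about classification errors, which is exactly the bias question raised above.

```python
import random


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    means = []
    for _ in range(n_resamples):
        # Draw n items with replacement and record the resample mean.
        resample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical classifier output: 1 = unit assigned to the class of interest.
predictions = [1] * 30 + [0] * 70
point_estimate = sum(predictions) / len(predictions)
low, high = bootstrap_ci(predictions)
print(point_estimate, low, high)
```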
21. Responsible Data Science
– Confidentiality
‐ Statistics are produced on non-(de)identifiable
(aggregated) units
‐ All output is checked for disclosure control and (locally)
disturbed if needed.
‐ Effect of adding Big Data sources?
• Can we guarantee de-identification when more and more
Big Data is added to a statistical output?
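A very simple form of output checking is to suppress table cells that are based on too few units. The sketch below illustrates only that one idea: the threshold of 10, the field name and the records are hypothetical, and real disclosure control involves more than this (for example secondary suppression, so that a suppressed cell cannot be recomputed from totals).

```python
from collections import Counter


def safe_counts(records, key, threshold=10):
    """Count records per group and suppress (None) cells below the threshold."""
    counts = Counter(r[key] for r in records)
    return {group: (n if n >= threshold else None) for group, n in counts.items()}


# Hypothetical microdata: 25 units in region A, 3 in region B.
records = [{"region": "A"}] * 25 + [{"region": "B"}] * 3
table = safe_counts(records, "region")
print(table)
```

Each additional source that is linked in can split cells into smaller groups, so a check like this would have to be rerun on every new combination of variables.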
22. Responsible Data Science: Transparency
– Transparency
‐ The way data is processed, combined and the estimation
models used must be clearly described and available for
everyone involved (no ‘personal’ produced statistics).
‐ Estimation models
• No black boxes. How transparent are ML-based
models?
‐ What about new processes?
• New kinds of processes emerge, e.g. starting to process
the data at the location of the data holder (not at SN)
• Example: Traffic index based on road sensor data
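Processing at the location of the data holder can be pictured as a two-step split: the holder reduces raw records to aggregates, and only those aggregates are passed on. The sketch below is a hypothetical illustration of that split, not the actual traffic index method; the function names, field names and counts are all assumptions.

```python
def aggregate_at_holder(sensor_rows):
    """Runs at the data holder: reduce raw sensor rows to daily vehicle totals."""
    totals = {}
    for row in sensor_rows:
        totals[row["day"]] = totals.get(row["day"], 0) + row["vehicles"]
    return totals


def traffic_index(daily_totals, base_day):
    """Runs at the statistical office: index each day against a base day (= 100)."""
    base = daily_totals[base_day]
    return {day: 100.0 * total / base for day, total in daily_totals.items()}


# Hypothetical raw sensor rows; these never leave the data holder.
rows = [
    {"day": "2016-01-04", "sensor": "s1", "vehicles": 400},
    {"day": "2016-01-04", "sensor": "s2", "vehicles": 600},
    {"day": "2016-01-05", "sensor": "s1", "vehicles": 500},
    {"day": "2016-01-05", "sensor": "s2", "vehicles": 600},
]
daily = aggregate_at_holder(rows)            # only this dictionary is shared
index = traffic_index(daily, "2016-01-04")
print(index)
```

For transparency, both steps would need to be described, including the part that runs outside the statistical office.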
24. What’s next?
– Statistics Netherlands is currently changing from
Responsible Statistics to Responsible Data Science
– Clearly additional work is needed to fully enable this
– This is important as it is an essential step in unleashing
the (full) potential of Big Data.
– For Statistics Netherlands the latter leads to:
‐ New products
‐ More readily available statistics
‐ Improved quality of existing products
‐ Assurance that the work of SN remains relevant for the
Netherlands
25. Work at the Center for Big Data Statistics
- At the Center for Big Data Statistics
1. Case studies/beta products
2. Methodological & exploratory research
- Examples of our work
‐ Income data: Visualisation 2D/3D
‐ Road sensor data: Traffic intensity and GDP
‐ Scanner data: ‘Ginger bread’ index
‐ Twitter: Social tension indicator
‐ Webpages: Identifying web-only shops
‐ Webpages: Identifying innovative companies
26. Heat map: Age vs. ‘Income’
[Figure: heat maps of income (euro) vs. age (gender)]
27. A 3D heat map: Age vs. Income vs. Amount
[Figure: 3D heat map of income vs. age; third axis: amount]
28. Road sensor data
More on: https://www.cbs.nl/en-gb/our-services/innovation
Traffic intensity vs GDP
29. Scanner data
More on: https://www.cbs.nl/nl-nl/onze-diensten/innovatie
Turnover of ‘ginger bread’ specific to the Saint Nicholas festivities
(2015 and 2016: weekly)
31. Web pages
– From Common Crawl archive ‘2016-07’
– Found:
‐ +/- 60 million websites
‐ +/- 50,000 Dutch web shops
‐ 12,670 web shops in scope
Web-only shops in the Netherlands