Big Data as a source for Official Statistics

Edwin de Jonge and Piet Daas
November 12, London
Overview

• Big Data
• Research ‘theme’ at Stat. Netherlands
• Data driven approach
• Visualization as a tool
• Why?
• Examples in our office

• Issues & challenges
• From an official statistical perspective
• Focus on methodological and legal ones
Why Visualization?

October 1st 2013, Statistics Netherlands
Effective Display!
(see Tor Nørretranders, “Bandwidth of our senses”)
Anscombe’s quartet…

      DS1            DS2            DS3            DS4
   x      y       x      y       x      y       x      y
  10    8.04     10    9.14     10    7.46      8    6.58
   8    6.95      8    8.14      8    6.77      8    5.76
  13    7.58     13    8.74     13   12.74      8    7.71
   9    8.81      9    8.77      9    7.11      8    8.84
  11    8.33     11    9.26     11    7.81      8    8.47
  14    9.96     14    8.1      14    8.84      8    7.04
   6    7.24      6    6.13      6    6.08      8    5.25
   4    4.26      4    3.1       4    5.39     19   12.5
  12   10.84     12    9.13     12    8.15      8    5.56
   7    4.82      7    7.26      7    6.42      8    7.91
   5    5.68      5    4.74      5    5.73      8    6.89
Anscombe’s quartet

Property                                     Value
Mean of x1, x2, x3, x4                       All equal: 9
Variance of x1, x2, x3, x4                   All equal: 11
Mean of y1, y2, y3, y4                       All equal: 7.50
Variance of y1, y2, y3, y4                   All equal: 4.1
Correlation for ds1, ds2, ds3, ds4           All equal: 0.816
Linear regression for ds1, ds2, ds3, ds4     All equal: y = 3.00 + 0.500x
Looks the same, right?
Let’s plot!
Assumptions…
Why visualization?
Tool for data analysis
–Effective display of information
–Summary of data
–Show outliers / patterns
–Helps exploring data
–Helps checking assumptions
Often Maps
Many visualizations are maps
–Positive:
‐ Is familiar
‐ Attractive
But: only makes sense:
‐ When data are geographically distributed
‐ When locality is meaningful
‐ When data are correctly normalized
Huh, normalized?
Many maps just population maps!
A better map:
‐ Takes population size into account (e.g. by making figures relative)
‐ May plot difference w.r.t. an expected value.
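
Both suggestions for a better map can be illustrated with a small calculation. This is a sketch under assumptions: the region names and figures are made up, the function name `normalise` is hypothetical, and the "expected value" is taken to be the count a region would have if the overall rate applied everywhere.

```python
def normalise(regions):
    """regions maps a region name to (count, population).

    Returns, per region, the rate per 1,000 inhabitants and the difference
    between the observed count and the count expected under the overall rate.
    """
    total_count = sum(count for count, _ in regions.values())
    total_population = sum(population for _, population in regions.values())
    result = {}
    for name, (count, population) in regions.items():
        expected = population * total_count / total_population
        result[name] = {
            "rate_per_1000": 1000 * count / population,
            "difference_from_expected": count - expected,
        }
    return result


# Made-up example: region B has more events in absolute terms,
# but region A has the higher rate once population is accounted for.
example = normalise({"A": (50, 1000), "B": (150, 9000)})
print(example)
```

A map coloured by `rate_per_1000` or `difference_from_expected` shows where something unusual happens, rather than simply where most people live.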
Visualization is not easy
– Creating good visualizations is hard
– “Easy Reading” is not “Easy Writing”
Visualization must be:
– Faithful
– Objective
Thus it must not introduce perceptual bias
Visualization
– Use appropriate chart
– Use appropriate scales
‐ x,y, color, time
– Use appropriate granularity
Research: What works for which data?
Example: Census
Example Virtual Census
‐ Every 10 years a Census needs to be conducted
‐ No longer with surveys in the Netherlands
• Last traditional census was in 1971

‐ Now by (re‐)using existing information
• Linking administrative sources and available sample survey data at a large scale
• Check result
• How?
• With a visualisation method: the Tableplot
Making the Tableplot
1. Load file
   • 17 million records
2. Sort records according to key variable
   • Age in this example
3. Combine records
   • 100 groups (170,000 records each)
   • Numeric variables
     • Calculate average (avg. age)
   • Categorical variables
     • Ratio between categories present (male vs. female)
4. Plot figure
   • Of a select number of variables (up to 12)
   • Colours used are important
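
Steps 2 and 3 above can be sketched in a few lines of Python. This is an illustration only, not the implementation used at Statistics Netherlands: the function name `tableplot_bins`, the record layout (a list of dictionaries) and the ascending sort order are assumptions.

```python
from collections import Counter


def tableplot_bins(records, sort_key, numeric, categorical, n_bins=100):
    """Sort records on sort_key, cut them into n_bins groups of (nearly)
    equal size and summarise each group: the mean for numeric variables,
    the share of each category for categorical variables."""
    ordered = sorted(records, key=lambda record: record[sort_key])
    n = len(ordered)
    bins = []
    for i in range(n_bins):
        chunk = ordered[i * n // n_bins:(i + 1) * n // n_bins]
        if not chunk:  # fewer records than bins
            continue
        summary = {"size": len(chunk)}
        for var in numeric:
            summary[var] = sum(record[var] for record in chunk) / len(chunk)
        for var in categorical:
            counts = Counter(record[var] for record in chunk)
            summary[var] = {cat: c / len(chunk) for cat, c in counts.items()}
        bins.append(summary)
    return bins


# Tiny made-up file: 200 records instead of 17 million.
records = [{"age": a, "sex": "male" if a % 2 else "female"} for a in range(200)]
bins = tableplot_bins(records, "age", numeric=["age"], categorical=["sex"])
print(bins[0])
```

Each summarised group becomes one row of the plot: a bar for every numeric variable and a stacked, coloured bar for every categorical variable.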
Tableplot of the census test file
Tableplot: Monitor data quality
– All data in the office pass through stages:
‐ Raw data (collected)
‐ Preprocessed (technically correct)
‐ Edited (completed data)
‐ Final (removal of outliers etc.)
Processing of data
Raw (unedited) data

Edited data

Final data
Example 2: Social Security Register
– Contains all financial data on jobs, benefits and pensions in the Netherlands
‐ Collected by the Dutch Tax office
‐ A total of 20 million records each month
‐ How to obtain insight into so much data?
• With a visualisation method: a heat map
Heat map: Age vs. ‘Income’
[Figure: heat map with age on the x-axis and income (euro) on the y-axis]
After ‘data reduction’
[Figure: two heat maps with age on the x-axis and amount on the y-axis]
Visualization helps with volume of data
– Summarize by “binning”
– Tableplot
– Histogram
– Heatmap (2D histogram)
– Smoothing?
– Detect unexpected patterns

We use it as a tool to check, explore and communicate data
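
Binning for a heat map amounts to counting records per rectangular cell, which turns millions of records into a small grid. The following is a minimal sketch; the function name `heatmap_counts`, the bin widths and the sample values are assumptions for illustration.

```python
from collections import Counter


def heatmap_counts(pairs, x_width, y_width):
    """Count (x, y) pairs per rectangular bin.

    Keys are the lower bounds of each bin, e.g. (20, 1000) for
    20 <= x < 25 and 1000 <= y < 2000 when x_width=5 and y_width=1000."""
    counts = Counter()
    for x, y in pairs:
        bin_x = (x // x_width) * x_width
        bin_y = (y // y_width) * y_width
        counts[(bin_x, bin_y)] += 1
    return counts


# Made-up (age, amount) records.
pairs = [(23, 1500), (24, 1900), (67, 900), (25, 2100)]
print(heatmap_counts(pairs, x_width=5, y_width=1000))
```

The number of cells depends only on the bin widths, not on the number of records, so the same plot works for 20 thousand or 20 million records.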
Big Data: Issues and challenges
Big Data: issues & challenges
During our exploratory studies we identified a number of issues & challenges.
Focussing on the methodological and legal ones, we found that there is a need to:
1) deal with noisy and dirty data
2) deal with selectivity
3) go beyond correlation
4) cope with privacy and security issues
We have only solved some of them (partially)
1) Deal with noisy and dirty data
– Big Data is often
‐ noisy
‐ dirty
‐ redundant
‐ unstructured
• e.g. texts, images
– How to extract information from Big Data?
‐ In the best/most efficient way
Noisy and dirty data

Social media sentiment

Traffic loop data

Aggregate, apply filters (Poisson/Kalman), try to exclude noisy records, models (capture structure), ‘Google approach’ (80/20 rule)
Preferably do NOT use samples!
Noise reduction
Social media: daily sentiment in Dutch messages
Social media: daily & weekly sentiment in Dutch messages
Social media: daily, weekly & monthly sentiment in Dutch messages
Social media: monthly sentiment in Dutch messages
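
The step from daily to weekly and monthly series is noise reduction by aggregation. A minimal sketch follows, assuming the daily sentiment is available as one number per date and that a plain average per period is used; the slides do not specify how the sentiment score was computed or aggregated, and the function name `aggregate` is illustrative.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def aggregate(daily, period):
    """Average daily values per ISO week or per calendar month.

    daily maps datetime.date to a numeric sentiment value."""
    groups = defaultdict(list)
    for day, value in daily.items():
        if period == "week":
            iso = day.isocalendar()
            key = (iso[0], iso[1])  # ISO year, ISO week number
        elif period == "month":
            key = (day.year, day.month)
        else:
            raise ValueError("period must be 'week' or 'month'")
        groups[key].append(value)
    return {key: mean(values) for key, values in sorted(groups.items())}


# Made-up daily values.
daily = {date(2013, 1, 1): 10.0, date(2013, 1, 2): 20.0, date(2013, 2, 1): 30.0}
print(aggregate(daily, "week"))
print(aggregate(daily, "month"))
```

Longer periods average out more day-to-day noise, at the cost of fewer data points and a slower signal.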
Social media sentiment & Consumer confidence
Social media: monthly sentiment in Dutch messages & Consumer confidence
Corr: 0.88
Dirty data
Total number of vehicles detected by traffic loops during the day
[Figure: x-axis: time (hour)]
Loop activity varies during the day (first 10 min)
Correct for dirty data
Use data from same location from previous/next minute (5 min. window)
Before: Total = ~295 million vehicles
After: Total = ~330 million vehicles (+12%)
2) Deal with selectivity
– Big data sources are selective (they do NOT cover the entire population considered)
‐ Some probably more than others
– AND: all Big Data sources studied so far contain events!
‐ E.g. social media messages created, calls made and vehicles detected
‐ Events are probably the reason why these sources are so Big
– When there is a need to correct for selectivity
1) Convert events to units
How to identify units?
2) Correct for selectivity of units included
How to cope with units that are truly absent and part of the population under study?
Units / events
– Big Data contains events
‐ Social media messages are generated by usernames
‐ Traffic loops count vehicles (Dutch roads are units)
‐ Call detail records of mobile phone IDs

‐ Convert events to units
• By profiling
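
At its simplest, converting events to units means grouping the events by the identifier that generated them and deriving a few characteristics per identifier. The sketch below is only an illustration of that idea: the event layout `(unit_id, day)`, the chosen characteristics and the function name `profile_units` are assumptions, and real profiling would use much richer features.

```python
from collections import defaultdict


def profile_units(events):
    """Turn a list of (unit_id, day) events into one profile per unit:
    the number of events and the number of distinct active days."""
    days_per_unit = defaultdict(list)
    for unit_id, day in events:
        days_per_unit[unit_id].append(day)
    return {
        unit_id: {"n_events": len(days), "active_days": len(set(days))}
        for unit_id, days in days_per_unit.items()
    }


# Made-up events: two usernames posting messages on numbered days.
events = [("user_a", 1), ("user_a", 1), ("user_a", 3), ("user_b", 2)]
print(profile_units(events))
```

The resulting table has one row per unit, which is the form needed before selectivity of the included units can be assessed.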
Profiling of Big data
Travel behaviour of active mobile phones

Mobility of very active mobile phone users
- during a 14-day period

Based on:
- Call and text activity multiple times a day
- Location based on phone masts

Clearly selective:
- North and South-west of the country hardly included
3) Go beyond correlation
– You will very likely use correlation to check Big Data findings with those in other (survey) data
– When correlation is high:
1) Try falsifying it first (is it coincidental/spurious?): correlation ≠ causation
2) If this fails, you may have found something interesting!
3) Perform additional analysis (look for causality): cointegration, structural time‐series approach

Use common sense!
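
A simple first attempt at falsifying a high correlation between two time series is to check whether it survives once the shared trend is removed. The sketch below compares the correlation of the levels with the correlation of the period-to-period changes. First differencing is a technique chosen here for illustration, not one named on the slide, and it is no substitute for the cointegration or structural time-series analysis mentioned above; the series are made up.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def first_differences(series):
    """Change from each observation to the next."""
    return [b - a for a, b in zip(series, series[1:])]


# Two made-up series that both trend upwards but move in opposite
# directions from one period to the next.
a = [0, 2, 1, 3, 2, 4, 3, 5]
b = [0, 0, 2, 2, 4, 4, 6, 6]

print("levels:     ", round(pearson(a, b), 2))
print("differences:", round(pearson(first_differences(a), first_differences(b)), 2))
```

A correlation that is high in levels but vanishes or flips sign in differences is driven mainly by the common trend and deserves suspicion.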
An illustrative example
Official unemployment percentage
Number of social media messages including the word “unemployment”

X

Corr: 0.90?
4) Privacy and security issues
– The Dutch privacy and security law allows the study of privacy-sensitive data for scientific and statistical research
– Still, appropriate measures need to be taken
• Prior to new research studies, check privacy sensitivity of data
• In case of privacy-sensitive data:
• Try to anonymize micro data or use aggregates
• Use a secure environment

– Legal issues that enable the use of Big Data for official statistics production are currently being looked at
‐ No problems for Big Data that can be considered ‘Administrative data’, i.e. Big Data that is managed by a (semi‐)governmentally funded organisation
Conclusions
– Big data is a very interesting data source
‐ Also for official statistics
– Visualisation is a great way of getting/creating insight
‐ Not only for data exploration
– A number of fundamental issues need to be resolved
‐ Methodological
‐ Legal
‐ Technical (not discussed here)
– We expect great things in the near future!
The future of statistics?

CBS lecture at the opening of Data Science Campus of ONS
 
Ntts2017 presentation 45
Ntts2017 presentation 45Ntts2017 presentation 45
Ntts2017 presentation 45
 
Big Data presentation Mannheim
Big Data presentation MannheimBig Data presentation Mannheim
Big Data presentation Mannheim
 
Extracting information from ' messy' social media data
Extracting information from ' messy' social media dataExtracting information from ' messy' social media data
Extracting information from ' messy' social media data
 
Big data cbs_piet_daas
Big data cbs_piet_daasBig data cbs_piet_daas
Big data cbs_piet_daas
 
Gebruik van sociale media voor de officiële statistiek
Gebruik van sociale media voor de officiële statistiekGebruik van sociale media voor de officiële statistiek
Gebruik van sociale media voor de officiële statistiek
 
Profiling Big Data sources to assess their selectivity
Profiling Big Data sources to assess their selectivityProfiling Big Data sources to assess their selectivity
Profiling Big Data sources to assess their selectivity
 
Quality challenges in modernising business statistics
Quality challenges in modernising business statisticsQuality challenges in modernising business statistics
Quality challenges in modernising business statistics
 
Social media sentiment and consumer confidence
Social media sentiment and consumer confidenceSocial media sentiment and consumer confidence
Social media sentiment and consumer confidence
 
Big data @ CBS
Big data @ CBSBig data @ CBS
Big data @ CBS
 
Bi dutch meeting data science
Bi dutch meeting data scienceBi dutch meeting data science
Bi dutch meeting data science
 
Piet daas big_data_official_statistics_target_groningen
Piet daas big_data_official_statistics_target_groningenPiet daas big_data_official_statistics_target_groningen
Piet daas big_data_official_statistics_target_groningen
 
Big data en officiële statistiek
Big data en officiële statistiekBig data en officiële statistiek
Big data en officiële statistiek
 
New Data Sources for Statistics, Social media: Twitter.
New Data Sources for Statistics, Social media: Twitter.New Data Sources for Statistics, Social media: Twitter.
New Data Sources for Statistics, Social media: Twitter.
 

Kürzlich hochgeladen

Music 9 - 4th quarter - Vocal Music of the Romantic Period.pptx
Music 9 - 4th quarter - Vocal Music of the Romantic Period.pptxMusic 9 - 4th quarter - Vocal Music of the Romantic Period.pptx
Music 9 - 4th quarter - Vocal Music of the Romantic Period.pptxleah joy valeriano
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONHumphrey A Beña
 
ICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfVanessa Camilleri
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPCeline George
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...JhezDiaz1
 
Karra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxAshokKarra1
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Celine George
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxHumphrey A Beña
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptxmary850239
 
Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4JOYLYNSAMANIEGO
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management systemChristalin Nelson
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxiammrhaywood
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Celine George
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxCarlos105
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Celine George
 
Integumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.pptIntegumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.pptshraddhaparab530
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17Celine George
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatYousafMalik24
 

Kürzlich hochgeladen (20)

Music 9 - 4th quarter - Vocal Music of the Romantic Period.pptx
Music 9 - 4th quarter - Vocal Music of the Romantic Period.pptxMusic 9 - 4th quarter - Vocal Music of the Romantic Period.pptx
Music 9 - 4th quarter - Vocal Music of the Romantic Period.pptx
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
 
ICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdf
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERP
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
 
Karra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptx
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
 
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptxFINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx
 
Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management system
 
Raw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptxRaw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptx
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17
 
Integumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.pptIntegumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.ppt
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
 

Strata Big data presentation

  • 1. Big Data as a source for Official Statistics Edwin de Jonge and Piet Daas November 12, London
  • 2. Overview • Big Data • Research ‘theme’ at Stat. Netherlands • Data driven approach • Visualization as a tool • Why? • Examples in our office • Issues & challenges • From an official statistical perspective • Focus on methodological and legal ones
  • 3. Why Visualization? October 1st 2013, Statistics Netherlands
  • 4. Effective Display! (see Tor Nørretranders, “Bandwidth of our senses”)
  • 5. Anscombe’s quartet…
        DS1            DS2            DS3            DS4
        x     y        x     y        x     y        x     y
        10    8.04     10    9.14     10    7.46     8     6.58
        8     6.95     8     8.14     8     6.77     8     5.76
        13    7.58     13    8.74     13    12.74    8     7.71
        9     8.81     9     8.77     9     7.11     8     8.84
        11    8.33     11    9.26     11    7.81     8     8.47
        14    9.96     14    8.1      14    8.84     8     7.04
        6     7.24     6     6.13     6     6.08     8     5.25
        4     4.26     4     3.1      4     5.39     19    12.5
        12    10.84    12    9.13     12    8.15     8     5.56
        7     4.82     7     7.26     7     6.42     8     7.91
        5     5.68     5     4.74     5     5.73     8     6.89
  • 6. Anscombe’s quartet
        Property                                       Value
        Mean of x1, x2, x3, x4                         All equal: 9
        Variance of x1, x2, x3, x4                     All equal: 11
        Mean of y1, y2, y3, y4                         All equal: 7.50
        Variance of y1, y2, y3, y4                     All equal: 4.1
        Correlation for ds1, ds2, ds3, ds4             All equal: 0.816
        Linear regression for ds1, ds2, ds3, ds4       All equal: y = 3.00 + 0.500x
      Looks the same, right?
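
The summary table on slide 6 can be checked directly from the values on slide 5. The following is a minimal sketch using only the Python standard library; the function name `summarize` and the dictionary layout are illustrative choices, not part of the presentation.

```python
from statistics import mean, variance

# x values shared by DS1, DS2 and DS3 (taken from slide 5)
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]

QUARTET = {
    "DS1": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "DS2": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "DS3": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "DS4": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return the summary statistics listed on slide 6 for one data set."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx                  # least-squares slope
    intercept = my - slope * mx        # least-squares intercept
    return {
        "mean_x": mx,
        "var_x": variance(x),          # sample variance (n - 1 denominator)
        "mean_y": my,
        "var_y": variance(y),
        "corr": sxy / (sxx * syy) ** 0.5,
        "slope": slope,
        "intercept": intercept,
    }


SUMMARY = {name: summarize(x, y) for name, (x, y) in QUARTET.items()}

for name, stats in SUMMARY.items():
    print(name, {k: round(v, 3) for k, v in stats.items()})
```

The four printed rows agree to the precision shown on the slide, which is exactly why the data sets only become distinguishable once they are plotted.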
  • 9. Why visualization? Tool for data analysis – Effective display of information – Summary of data – Show outliers / patterns – Helps exploring data – Helps checking assumptions
  • 10. Often Maps Many visualizations are maps – Positive: ‐ Is familiar ‐ Attractive But a map only makes sense: ‐ When data is geographically distributed ‐ When locality is meaningful ‐ When data is correctly normalized
  • 12.
  • 13. Many maps are just population maps! A better map: ‐ Takes population size into account (e.g. by making figures relative) ‐ May plot the difference w.r.t. an expected value.
  • 14. Visualization is not easy – Creating good visualizations is hard – “Easy Reading” is not “Easy Writing” Visualization must be: – Faithful – Objective and thus not introduce perceptual bias
  • 15. Visualization – Use appropriate chart – Use appropriate scales ‐ x, y, color, time – Use appropriate granularity Research: What works for which data?
  • 17. Example: Virtual Census ‐ Every 10 years a Census needs to be conducted ‐ No longer with surveys in the Netherlands • Last traditional census was in 1971 ‐ Now by (re‐)using existing information • Linking administrative sources and available sample survey data at a large scale • Check result • How? • With a visualisation method: the Tableplot
  • 18. Making the Tableplot
      1. Load file (17 million records)
      2. Sort records according to key variable (17 million records)
         • Age in this example
      3. Combine records into 100 groups (170,000 records each)
         • Numeric variables: calculate average (avg. age)
         • Categorical variables: ratio between categories present (male vs. female)
      4. Plot figure of a select number of variables (up to 12)
         • Colours used are important
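
The sort-and-combine steps of slide 18 can be sketched in a few lines. This is an illustration on a small synthetic file, not the software used at Statistics Netherlands; the function `tableplot_bins`, the field names `age` and `sex`, and the record count are assumptions made for the example.

```python
import random
from collections import Counter


def tableplot_bins(records, key, numeric, categorical, n_bins=100):
    """Sort records on `key`, split them into `n_bins` groups and summarise each group."""
    ordered = sorted(records, key=lambda r: r[key])        # step 2: sort on key variable
    size = len(ordered) / n_bins
    bins = []
    for i in range(n_bins):                                 # step 3: combine records
        chunk = ordered[round(i * size):round((i + 1) * size)]
        if not chunk:
            continue
        summary = {"n": len(chunk)}
        for var in numeric:                                 # numeric: average per group
            summary[var] = sum(r[var] for r in chunk) / len(chunk)
        for var in categorical:                             # categorical: share per category
            counts = Counter(r[var] for r in chunk)
            summary[var] = {cat: k / len(chunk) for cat, k in counts.items()}
        bins.append(summary)
    return bins


# Synthetic stand-in for the census file (10,000 instead of 17 million records).
random.seed(0)
records = [
    {"age": random.randint(0, 99), "sex": random.choice(["male", "female"])}
    for _ in range(10_000)
]

bins = tableplot_bins(records, key="age", numeric=["age"], categorical=["sex"])
print(bins[0])
print(bins[-1])
```

Each entry of `bins` corresponds to one row of the tableplot; step 4 then draws a bar per numeric variable and a stacked bar per categorical variable for every row.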
  • 19.
  • 20. Tableplot of the census test file
  • 21. Tableplot: Monitor data quality – All data in the office passes through stages: ‐ Raw data (collected) ‐ Preprocessed (technically correct) ‐ Edited (completed data) ‐ Final (removal of outliers etc.)
  • 22. Processing of data (panels: raw (unedited) data, edited data, final data)
  • 23. Example 2: Social Security Register
  • 24. – Contains all financial data on jobs, benefits and pensions in the Netherlands ‐ Collected by the Dutch Tax office ‐ A total of 20 million records each month ‐ How to obtain insight into so much data? • With a visualisation method: a heat map
  • 25. Heat map: Age vs. ‘Income’ (axes: age, income in euro)
  • 26. After ‘data reduction’ (axes: age, amount)
  • 27. Visualization helps with volume of data – Summarize by “binning” – Tableplot – Histogram – Heatmap (2D histogram) – Smoothing? – Detect unexpected patterns We use it as a tool to check, explore and communicate data
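
Summarizing by binning, as in the heat map (2D histogram) mentioned on slide 27, reduces millions of records to a small grid of counts. The sketch below assumes synthetic (age, income) pairs and hypothetical bin widths of 5 years and 500 euro; none of these figures come from the presentation.

```python
import random
from collections import Counter


def bin_2d(points, x_width, y_width):
    """Count points per grid cell; a cell is identified by its lower-left corner."""
    counts = Counter()
    for x, y in points:
        cell = (int(x // x_width) * x_width, int(y // y_width) * y_width)
        counts[cell] += 1
    return counts


# Synthetic stand-in for (age, monthly income) pairs.
random.seed(1)
points = [(random.randint(15, 90), random.uniform(0, 5000)) for _ in range(100_000)]

cells = bin_2d(points, x_width=5, y_width=500)
print(len(points), "records reduced to", len(cells), "cells")
```

Plotting the cell counts as colours gives the heat map; the plot then depends on the number of cells rather than on the number of records.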
  • 28. Big Data: Issues and challenges
  • 29. Big Data: issues & challenges During our exploratory studies we identified a number of issues & challenges. Focussing on the methodological and legal ones, we found that there is a need to: 1) deal with noisy and dirty data 2) deal with selectivity 3) go beyond correlation 4) cope with privacy and security issues We have only solved some of them (partially)
  • 30. 1) Deal with noisy and dirty data – Big Data is often ‐ noisy ‐ dirty ‐ redundant ‐ unstructured • e.g. texts, images – How to extract information from Big Data? ‐ In the best/most efficient way
  • 31. Noisy and dirty data – Examples: social media sentiment, traffic loop data – Aggregate, apply filters (Poisson/Kalman), try to exclude noisy records, use models (capture structure), ‘Google approach’ (80/20 rule) – Preferably do NOT use samples!
  • 32. Noise reduction Social media: daily sentiment in Dutch messages
  • 33. Noise reduction Social media: daily & weekly sentiment in Dutch messages
  • 34. Noise reduction Social media: daily, weekly & monthly sentiment in Dutch messages
  • 35. Noise reduction Social media: monthly sentiment in Dutch messages
  • 36. Social media sentiment & Consumer confidence Social media: monthly sentiment in Dutch messages & Consumer confidence (Corr: 0.88)
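
Slides 32 to 35 reduce noise by aggregating the daily sentiment series to weekly and monthly values. A minimal sketch of that idea follows; the daily series is synthetic, and fixed blocks of 7 and 30 days stand in for calendar weeks and months, which is a simplifying assumption.

```python
import math
import random
from statistics import mean, stdev


def aggregate(series, window):
    """Average consecutive, non-overlapping blocks of `window` values."""
    n_blocks = len(series) // window      # an incomplete last block is dropped
    return [mean(series[i * window:(i + 1) * window]) for i in range(n_blocks)]


# Synthetic daily sentiment: a slow seasonal signal plus strong day-to-day noise.
random.seed(42)
daily = [5 * math.sin(2 * math.pi * d / 360) + random.gauss(0, 10) for d in range(360)]

weekly = aggregate(daily, 7)
monthly = aggregate(daily, 30)

for name, series in [("daily", daily), ("weekly", weekly), ("monthly", monthly)]:
    print(f"{name:8s} n={len(series):3d} stdev={stdev(series):.2f}")
```

Averaging over longer blocks suppresses the day-to-day noise while keeping the slow signal, which is what makes the monthly series comparable with a monthly indicator such as consumer confidence.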
  • 37. Dirty data: total number of vehicles detected by traffic loops during the day (x-axis: time in hours)
  • 38. Loop activity varies during the day (first 10 min)
  • 39. Correct for dirty data: use data from the same location from the previous/next minute (5 min. window) Before: total = ~295 million vehicles After: total = ~330 million vehicles (+12%)
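
The correction on slide 39 borrows data from neighbouring minutes at the same location. The slide does not state the exact imputation rule, so the sketch below assumes the simplest one: a missing minute gets the average of the available counts within a 5-minute window centred on it. The function `fill_gaps` and the sample counts are hypothetical.

```python
def fill_gaps(counts, window=5):
    """Fill missing per-minute counts (None) from neighbours at the same loop.

    Assumed rule: average of the observed values inside a window of `window`
    minutes centred on the gap; a gap with no observed neighbour stays None.
    """
    half = window // 2
    filled = list(counts)
    for i, value in enumerate(counts):
        if value is None:
            lo, hi = max(0, i - half), min(len(counts), i + half + 1)
            neighbours = [counts[j] for j in range(lo, hi) if counts[j] is not None]
            if neighbours:
                filled[i] = sum(neighbours) / len(neighbours)
    return filled


# Vehicles per minute for one loop; None marks minutes without a reading.
minute_counts = [12, 14, None, 13, 15, None, None, None, None, None, 11]
filled = fill_gaps(minute_counts)

before = sum(v for v in minute_counts if v is not None)
after = sum(v for v in filled if v is not None)
print(filled)
print("total before:", before, "total after:", after)
```

Only original readings are used as donors, so a long outage is not filled from values that were themselves imputed.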
  • 40. 2) Deal with selectivity
      – Big Data sources are selective (they do NOT cover the entire population considered)
        ‐ Some probably more than others
      – AND: all Big Data sources studied so far contain events!
        ‐ E.g. social media messages created, calls made and vehicles detected
        ‐ Events are probably the reason why these sources are so Big
      – When there is a need to correct for selectivity:
        1) Convert events to units. How to identify units?
        2) Correct for selectivity of units included. How to cope with units that are truly absent and part of the population under study?
  • 41. Units / events – Big Data contains events ‐ Social media messages are generated by usernames ‐ Traffic loops count vehicles (Dutch roads are units) ‐ Call detail records of mobile phone IDs ‐ Convert events to units • By profiling
  • 42. Profiling of Big Data
  • 43. Travel behaviour of active mobile phones Mobility of very active mobile phone users - during a 14-day period Based on: - Call- and text-activity multiple times a day - Location based on phone masts Clearly selective: - North and South-west of the country hardly included
  • 44. 3) Go beyond correlation – You will very likely use correlation to check Big Data findings against those in other (survey) data – When correlation is high: 1) try falsifying it first (is it coincidental/spurious? correlation ≠ causation) 2) if this fails, you may have found something interesting! 3) perform additional analysis (look for causality): cointegration, structural time‐series approach – Use common sense!
  • 45. An illustrative example: official unemployment percentage vs. number of social media messages including the word “unemployment” (Corr: 0.90?)
  • 46. 4) Privacy and security issues – The Dutch privacy and security law allows the study of privacy sensitive data for scientific and statistical research – Still, appropriate measures need to be taken • Prior to new research studies, check privacy sensitivity of data • In case of privacy sensitive data: • Try to anonymize micro data or use aggregates • Use a secure environment – Legal issues around enabling the use of Big Data for official statistics production are currently being looked at ‐ No problems for Big Data that can be considered ‘Administrative data’, i.e. Big Data that is managed by a (semi‐)governmentally funded organisation
  • 47. Conclusions – Big Data is a very interesting data source ‐ Also for official statistics – Visualisation is a great way of getting/creating insight ‐ Not only for data exploration – A number of fundamental issues need to be resolved ‐ Methodological ‐ Legal ‐ Technical (not discussed here) – We expect great things in the near future!
  • 48. The future of statistics?