2. Overview
• Big Data
• Definition?
• DGINS: Scheveningen Memorandum
• Experiences at Statistics Netherlands
• From ‘New data sources’ to ‘Big Data’
• Data driven approach (learning by doing)
• Opportunities & challenges
• Methodological & technical challenges
• Skills, legal and other issues
• With examples!
4. What is Big Data?
Defining Big Data is not easy:
An attempt: “Data that are difficult to collect, store or process within the conventional
systems of statistical organizations. Either, their volume, velocity, structure
or variety requires the adoption of new statistical software processing
techniques and/or IT infrastructure to enable cost-effective insights to be
made.” (Virtual sprint paper)
More technical: “Big Data usually includes data sets with sizes beyond the ability of
commonly used software tools to capture, curate, manage, and process the
data within a tolerable elapsed time.” (Wikipedia)
A user: “Data sources that are awkward to work with.”
TIP: Big Data sources are NOT surveys and NOT administrative data
5. DGINS: Scheveningen Memorandum
1. Big Data represent new opportunities and challenges for Official Statistics.
2. Develop an 'Official Statistics Big Data strategy' at national and EU level.
3. Recognize the implications of Big Data for legislation, especially with regard to data protection and personal rights.
4. Several NSIs are currently initiating or considering different uses of Big Data; there is momentum to share experiences and to collaborate.
5. Recognize the capabilities and skills needed to effectively explore Big Data.
6. Acknowledge that the multidisciplinary character of Big Data requires synergies and partnerships.
7. The use of Big Data in the context of official statistics requires new developments in methodology, quality assessment and IT-related issues.
8. Agree on adopting an ESS action plan and roadmap by mid-2014.
6. Experiences at Statistics Netherlands
– Started as ‘New data sources for statistics’ in 2009
– Several initiatives over the years:
‐ Internet as a data source
• Collecting price data with web robots
• Study the use of web job vacancies data
• ‘Marktplaats’ data (Dutch eBay clone)
‐ Alternative means of collecting primary data
• Use of smartphones
‐ Big Data (really large amounts of data)
• Traffic loop detection data (road sensors)
• Mobile phone data (location data)
• Social media data (content and sentiment)
8. What have we learned (so far)?
I’ll discuss the most important ones:
1) Types of ‘data’ in Big Data
2) How to access and analyse large amounts of data
3) How to deal with noisy and unstructured data
4) How to deal with selectivity (and our own bias)
5) How to go beyond correlation
6) The need for people with the right skills and mind‐set
7) Need to solve/deal with privacy and security issues
8) Data management & costs
We are slowly starting to get a grip on some of these topics
10. 1) Types of data & events
There are many different Big Data sources.
An attempt to classify them (Virtual sprint paper):
A) Human-sourced information (‘Social Networks’)
Social media messages, blogs, web searches
B) Process-mediated data (‘Traditional Business Systems and Websites’)
Credit card, bank or on-line transactions, CDR (call detail records), product prices, page views
C) Machine-generated data (‘Automated Systems’)
Road or climate sensors, satellite images, GPS, AIS.
Essentially, most of the data are event-based; some events can be directly related to a unit in the target population.
11. 2) How to access and analyse large amounts of data
– If you want to analyse Big Data
– you need a lot of computer power!!
– or you need a lot of time!
High Performance Computing expertise is essential!
12. – We have:
- Workstations with lots of memory (32-64 GB), fast disk drives (512 GB SSD) and a large hard drive (>= 1 TB)
- A secure environment in which to access the data with those workstations
- A Big Data lab
- The knowledge to load and analyse all the data in R or Python
- Followed a High Performance Computing training course
- Realized that learning by doing is key!
AND a Big Data source with no privacy and security issues, so we can test all kinds of analyses, software and hardware (anyplace, anytime, anywhere):
• Traffic loop data (road sensors)
(Photo: our current equipment and more)
13. An example:
– Processing of traffic loop data of 1 day
- A total of ~100 million records (25 GB)
The I/O limitation can be solved by:
1) Input part: by using a cluster (distributed computing)
2) Output part: by implementing a C++ write routine in R (20% faster)

Processing in R                        Time needed   Speed-up
First R-script                         6 hours       -
Improved code                          30 min        12x
Faster hardware (Java code)            10 min        36x
Faster hardware + preprocessed data    2 min         180x
(The fastest variant is limited by I/O.)
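As an illustration, a minimal Python sketch of chunk-wise processing of such a day of records; the file name and the columns loop_id and vehicle_count are hypothetical stand-ins for the real feed:

```python
import pandas as pd

# Read one day of loop records (~100 million rows) in bounded-memory
# chunks; throughput is then limited by disk I/O rather than by CPU.
totals = None
for chunk in pd.read_csv("loops_one_day.csv",
                         usecols=["loop_id", "vehicle_count"],
                         chunksize=5_000_000):
    part = chunk.groupby("loop_id")["vehicle_count"].sum()
    totals = part if totals is None else totals.add(part, fill_value=0)

print(totals.sort_values(ascending=False).head())  # busiest loops
```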
15. 3) How to deal with noisy and unstructured data
– Big Data is often
‐ noisy, dirty
‐ redundant
‐ unstructured
• e.g. texts, images
– How can information be extracted from Big Data in the best/most efficient way?
16. Example of noisy data: Road sensors
Traffic loop data
‐ Each minute (24/7) the number of passing vehicles is counted in around 20,000 ‘loops’ in the Netherlands
• In total and in different length classes
‐ Nice data source for transport and traffic statistics (and more)
• A lot of data: around 100 million records a day
(Map: loop locations)
18. Correct for missing data: macro level
A sliding window of 5 minutes is used to impute missing data.
Before: total ≈ 295 million vehicles. After: total ≈ 330 million vehicles (+12%).
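A minimal pandas sketch of this kind of correction, assuming one loop's counts per minute with NaN for missing reports (toy data, not the real feed):

```python
import numpy as np
import pandas as pd

# One loop's vehicle counts per minute; NaN marks a missing report.
idx = pd.date_range("2013-05-07 08:00", periods=10, freq="min")
counts = pd.Series([12, 15, np.nan, 14, np.nan, np.nan, 16, 13, 11, 12],
                   index=idx)

# Centred 5-minute window: fill each gap with the mean of the
# observed counts around it.
window_mean = counts.rolling(window=5, center=True, min_periods=1).mean()
filled = counts.fillna(window_mean)
```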
19. Correct for missing data: micro level
(Figure: number of vehicles detected over time, in minutes)
Recursive Bayesian estimator (<1 sec on GPGPU)
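A one-dimensional Kalman filter is the textbook example of a recursive Bayesian estimator; the sketch below is an illustrative stand-in for the (more elaborate) filter applied to the loop counts, with assumed noise variances q and r:

```python
def kalman_1d(observations, q=1.0, r=25.0):
    """Recursively estimate a count series; None marks a missing minute."""
    est, var = observations[0], r        # initialise on the first count
    smoothed = [est]
    for y in observations[1:]:
        var += q                         # predict: uncertainty grows
        if y is not None:                # update only when a count arrived
            gain = var / (var + r)
            est += gain * (y - est)
            var *= (1 - gain)
        smoothed.append(est)
    return smoothed

print(kalman_1d([12, 15, None, 14, None, None, 16, 13]))
```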
20. 4) How to deal with selectivity
– Big Data sources may be selective when
- Only part of the population contributes to the data set
• For example: mobile phone owners
- The measurement mechanism is selective (e.g. non-random times or places)
• For example: the placing of road sensors on Dutch highways is not random
– Many Big Data sources contain events
- Population units may generate widely varying numbers of events
- Attempt to associate events with units
– Correcting for selectivity
- Background characteristics – or features – are needed (linking with registers; profiling)
- Use predictive modelling / machine learning to produce population estimates (see the sketch below)
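A minimal sketch of the model-based idea (the variable names and the linear model are illustrative assumptions): fit on the selective big-data units that could be linked to a register feature, then predict over the full population frame:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Units observed in the big-data source, linked to a register feature
# (say, age); the source over-represents part of the population.
age_linked = rng.uniform(20, 60, size=500).reshape(-1, 1)
y_linked = 2.0 * age_linked.ravel() + rng.normal(0, 5, size=500)

model = LinearRegression().fit(age_linked, y_linked)

# Predict for every unit in the full population frame and aggregate.
age_frame = rng.uniform(18, 90, size=10_000).reshape(-1, 1)
population_estimate = model.predict(age_frame).mean()
```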
22. Selectivity illustrated
Selectivity of Big Data could potentially be less problematic than high non-response rates of surveys:
- There is just more data for your model!
The black line shows the relationship between the target and auxiliary variable in the target population. The red lines show the estimated relationship according to each of the three sources (with 95% confidence intervals).
Here we assume units with auxiliary variables are available!
23. 5) How to go beyond correlation
– You will very likely use correlation to compare Big Data findings with those in other (survey) data
– When correlation is high:
1) try falsifying it first (is it coincidental?)
correlation ≠ causation
2) If this fails, you may have found something interesting!
3) Perform additional analysis (look for causality)
cointegration, Granger causality, time‐series approach,
etc.
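A sketch of such an additional check with statsmodels, on toy series (in the study below, consumer confidence and social media sentiment play these roles):

```python
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(0)
x = rng.normal(size=60).cumsum()                         # e.g. consumer confidence
y = np.concatenate([[0.0], x[:-1]]) + rng.normal(scale=0.5, size=60)  # follows x one step later

# Tests whether the second column (x) helps predict the first (y).
data = np.column_stack([y, x])
results = grangercausalitytests(data, maxlag=2)
```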
Table 1. Social media messages properties for various platforms and their correlation with consumer confidence

Social media platform    Number of social media messages¹   % of total   Correlation of monthly sentiment index and consumer confidence (r)²
All platforms combined   3,153,002,327    100     0.75    0.78
Facebook                 334,854,088      10.6    0.81*   0.85*
Twitter                  2,526,481,479    80.1    0.68    0.70
Hyves                    45,182,025       1.4     0.50    0.58
News sites               56,027,686       1.8     0.37    0.26
Blogs                    48,600,987       1.5     0.25    0.22
Google+                  644,039          0.02    -0.04   -0.09
Linkedin                 565,811          0.02    -0.23   -0.25
Youtube                  5,661,274        0.2     -0.37   -0.41
Forums                   134,984,938      4.3     -0.45   -0.49

¹ Period covered: June 2010 until November 2013.
² Confirmed by visually inspecting scatterplots and additional checks (see text).
* Cointegrated.
Platform specific results
Granger causality reveals that Consumer Confidence precedes
Facebook sentiment! (p-value < 0.001)
27. A schematic view
(Schematic: consumer confidence for a month is published around the 20th of that month; social media sentiment is available per week, for days 1-7, 8-14, 15-21 and 22-28 of both the previous and the current month.)
28. Platform specific results (2)
More detailed studies revealed a one-week delay between the two: consumer confidence comes first, social media sentiment follows.
29. 6) People and skills needed
For Big data studies you need:
– People with an open mind-set who do not see all problems a priori in terms of sampling theory
– People with programming skills and IT affinity
– People with a data-driven, pragmatic attitude (data explorers, ‘practitioners’)
‐ You need data scientists!
30. Data science skills ‘landscape’
Sexy Skills of Data Geeks
1) Statistics - traditional analysis you're used to thinking about
2) Data ‘munging’ - parsing, scraping, and formatting data
3) Visualization - graphs, tools, etc.
4) High Performance Computing knowledge
32. 7) Privacy and security issues
– The Dutch privacy and security law allows the study of privacy-sensitive data for scientific and statistical research
– Of course, appropriate measures always need to be taken:
• Prior to new research studies, check the privacy sensitivity of the data
• In the case of privacy-sensitive data:
• Try to anonymize micro data or use aggregates
• Use a secure environment: the workstations in the Big Data lab
– Legal issues that would enable the use of Big Data for official statistics production are currently being looked at
- Some Big Data can be considered ‘administrative data’: i.e. Big Data that is managed by a (semi-)governmentally funded organisation
33. Example: Mobile phones
Mobile phone activity as a data source
– Nearly every person in the Netherlands has a mobile phone
- Usually carried along and almost always switched on!
- Many people are very active during the day
– Can mobile phone data be used for statistics?
- Travel behaviour (of active phones)
- ‘Daytime population’ (of active phones)
- Tourism (new phones that register to the network)
– Data of a single mobile phone company was used
- Hourly aggregates per area (only when > 15 events), as sketched below
- Especially important for roaming data (foreign visitors)
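A minimal sketch of the aggregate-and-suppress rule (column names are hypothetical): count events per area and hour, and keep only cells with more than 15 events:

```python
import pandas as pd

# Hypothetical network event records: one row per call/text/handover.
events = pd.DataFrame({
    "area": ["A"] * 20 + ["B"] * 8,
    "hour": ["2013-05-07 09"] * 28,
})

# Hourly aggregates per area; cells with <= 15 events are suppressed
# before anything leaves the secure environment.
agg = (events.groupby(["area", "hour"]).size()
             .rename("n_events").reset_index())
publishable = agg[agg["n_events"] > 15]
print(publishable)
```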
34. ‘Day time population’
– Hourly changes of mobile
phone activity
– 7 & 8 May 2013
– Distinguished per area
– Only data for areas with
> 15 events per hour
36. 8) Costs and data management
– Costs
‐ In the Netherlands we don’t pay for administrative data.
‐ How about Big Data?
• We currently pay for social media (access) and mobile phone
data (extra processing efforts)
– Data management
‐ Who owns the data? How stable are the delivery and the source?
‐ Coping with the huge volume:
• Run queries in the database of the data source holder
• Collect and process the data as a stream
• Bulk processing