Data Science

   ‘and the future of statistics’

       Piet Daas (and many colleagues)*

       Statistics Netherlands / Centraal Bureau voor de Statistiek

*Martijn Tennekes, Edwin de Jonge, Alex Priem, Bart Buelens, Merijn van Pelt, Paul van den Hurk

                                                              Data Science NL, 8 Nov. Utrecht
Layout

• Introduction
• What is Data Science?
  • You need data, to be one
• Data Scientist skills
  • A sexy job with a paradigm shift
• Link with Statistics Netherlands work
  • Examples of recent developments


Introduction




  “Statistics Netherlands will produce
  about 5000 official publications and
  tables in 2012”
            For this we need DATA




Two types of data




Primary data
  • Our own surveys

Secondary data
  • Data from ‘others’
    - Administrative sources
    - ‘New’ data sources
• Data, data everywhere!




Statistics & Data science

1) Is the study of ‘the use of secondary data
   for statistics’ data science?

2) What is data science?




What is Data Science?

 • First used in 1974 by Danish computer
   scientist Peter Naur in the book “Concise
   Survey of Computer Methods”
 • Defined as:
    • “The science of dealing with data, once
      they have been established”
Established data is data that has been created. If that
was done by someone else, then it is secondary data!
Data scientist / statistician is “the sexiest job of the
21st Century”




        People able to derive knowledge from large amounts of data!

Data science skills ‘landscape’

Programming skills

Sexy Skills of Data Geeks
1) Statistics - traditional analysis you're used to thinking about
2) Data ‘munging’ - parsing, scraping, and formatting data
3) Visualization - graphs, tools, etc.



Are things changing at the office?




Statistics Netherlands law

• “Statistics Netherlands aims to reduce the
  administrative burden for companies and the
  public as much as possible”
  • By (re-)using existing administrative registrations of both
    government and government-funded organizations.
  • And by studying potential new sources of information




Statistics Netherlands and Data
• Data is generated in increasing amounts and at increasing frequencies:
  • From ‘data scarcity’ (sample surveys) to ‘data abundance’ (administrative data and Big Data)
    • Ever-increasing amounts of data need to be checked, processed and analysed
    • More sources of information become available
    • Opportunities to produce statistics faster (‘real-time statistics’)
  • Need for new methods and tools
    1. Methods to quickly uncover information from the massive amounts of data available, such as visualisation methods, data-, text- and stream-mining techniques (‘making Big Data small’) and high-performance computing
    2. Methods capable of integrating the information into the statistical process, e.g. linking at massive scale, macro/meso-integration and estimation methods suited for large datasets


Examples of new developments

1) New approaches to official statistical inference
     a. Algorithmic inference

2) Visualisation methods to quickly obtain insight into
    large datasets
     b. Virtual Census           (17 million records)
     c. Social Security Register (20 million records)

3) Research findings on the use of ‘new’ data sources
     d. Traffic loop data              (80 million records)
     e. Mobile phone data              (~500 million records)
     f. Social media                   (12 million - 1 billion records)

Example a. Statistical inference

 • Inference is traditionally motivated from a
   design-based sample perspective
 • The model-based approach is being
   gradually adopted in specific circumstances
   (e.g. administrative data).
 • Next step: algorithmic inference methods
    • Machine learning, data mining approaches
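
To make the design-based starting point concrete, here is a minimal sketch of one standard design-based estimator, the Horvitz-Thompson estimator of a population total. It is an illustration only and not necessarily the estimator used in the simulation that follows; the synthetic population, the sample size and the function name are assumptions.

```python
import numpy as np

def horvitz_thompson_total(y, inclusion_prob):
    """Design-based estimate of a population total: sum of y_i / pi_i."""
    return float(np.sum(y / inclusion_prob))

# Synthetic population (assumption): 100,000 units with a skewed variable.
rng = np.random.default_rng(3)
population = rng.gamma(2.0, 10.0, size=100_000)

# Simple random sample without replacement: every unit has the same
# inclusion probability n / N.
n = 2_000
sample_idx = rng.choice(population.size, size=n, replace=False)
pi = np.full(n, n / population.size)

estimate = horvitz_thompson_total(population[sample_idx], pi)
true_total = float(population.sum())
print(f"estimate / true total = {estimate / true_total:.3f}")
```

The estimator only needs the inclusion probabilities, which are known for a designed sample. For an administrative or ‘new’ source they are usually unknown, which is one reason model-based and algorithmic approaches come into view.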


Simulation results (1000x)
[Figure: panels labelled Design, Model, Neural. and DisTree]
Shifting paradigms
Example b. Virtual Census

     • Every 10 years a Census needs to be conducted
     • No longer with surveys in the Netherlands
          • Last traditional census was in 1971
     • Now by (re-)using existing information
          • Linking administrative sources and available sample
            survey data at a large scale
          • Check result
          • How?
              • With a visualisation method: the Tableplot



Making the Tableplot
1. Load file (17 million records)
2. Sort records according to key variable (17 million records)
   • Age in this example
3. Combine records into 100 groups (170,000 records each)
   • Numeric variables
     • Calculate average (avg. age)
   • Categorical variables
     • Ratio between categories present (male vs. female)
4. Plot figure of a select number of variables (up to 12)
   • Colours used are important
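
Steps 2 and 3 above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used at Statistics Netherlands: the function name is hypothetical and the 10,000 synthetic records stand in for the 17 million of the census file. Step 4, the plotting, is left out.

```python
import numpy as np

def tableplot_bins(sort_key, numeric, categorical, n_bins=100):
    """Sort records by sort_key, split them into n_bins near-equal groups
    and summarise each group (mean for numeric, shares for categorical)."""
    order = np.argsort(sort_key, kind="stable")   # step 2: sort on the key variable
    groups = np.array_split(order, n_bins)        # step 3: combine records into groups
    categories = np.unique(categorical)
    means = np.array([numeric[g].mean() for g in groups])
    shares = np.array(
        [[np.mean(categorical[g] == c) for c in categories] for g in groups]
    )
    return means, categories, shares

# Synthetic stand-in data (assumption): age as key variable, sex as category.
rng = np.random.default_rng(0)
age = rng.integers(0, 100, size=10_000)
sex = rng.choice(np.array(["female", "male"]), size=10_000)

means, categories, shares = tableplot_bins(age, age.astype(float), sex)
print(means[:3], categories, shares[0])
```

Each of the 100 rows then becomes one horizontal slice of the figure: a bar for the numeric mean and a stacked bar for the category shares.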




[Figure: tableplot of the census test file]
Processing of data

Raw (unedited) data → Edited data → Final data
Example c: Social Security Register

• Contains all financial data on jobs, benefits
  and pensions in the Netherlands
     • Collected by the Dutch Tax office
     • A total of 20 million records each month

     • How to obtain insight into so much data?
          • With a visualisation method: a heat map
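
A heat map of this kind starts from a two-dimensional count table. The sketch below shows that binning step on synthetic data. The age range, the income distribution, the top-coding at 10,000 euro and the bin widths are all assumptions made for the example, not properties of the register.

```python
import numpy as np

def heatmap_counts(age, income, age_edges, income_edges):
    """Count records per (age, income) cell; the count table feeds the heat map."""
    counts, _, _ = np.histogram2d(age, income, bins=[age_edges, income_edges])
    return counts

# Synthetic stand-in data (assumption), far smaller than 20 million records.
rng = np.random.default_rng(1)
age = rng.integers(15, 100, size=50_000)
income = rng.gamma(shape=2.0, scale=1500.0, size=50_000)
income = np.clip(income, 0, 10_000)          # top-code so every record lands in a cell

age_edges = np.arange(15, 101)               # one-year age bins: 85 cells
income_edges = np.linspace(0, 10_000, 51)    # 200-euro income bins: 50 cells

counts = heatmap_counts(age, income, age_edges, income_edges)
log_counts = np.log1p(counts)                # log scale keeps sparse cells visible
print(counts.shape, int(counts.sum()))
```

Once the records are reduced to this 85 x 50 table, drawing the figure no longer depends on the number of underlying records.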




Heat map: Age vs. ‘Income’
[Figure: heat map with axes Age and Income (euro)]
A 3D heat map: Age vs. Income vs. Amount
[Figure: two 3D panels with axes age and amount; the second panel is labelled After ‘data reduction’]
Example d: Traffic loop detection data

• Traffic ‘loops’
   • Every minute (24/7) the number of passing
     vehicles is counted by >10,000 road sensors
     & cameras in the Netherlands
      • Total vehicles and in different length classes

   • Interesting source to produce traffic and
     transport statistics (and more)
       • Huge amounts of data, about 80 million
         records a day
                                                         Locations
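
At one count per sensor per minute, 10,000 sensors already give 10,000 x 1,440 = 14.4 million counts a day, before length classes are taken into account. A first reduction step is to sum the per-minute counts into daily totals per sensor. The sketch below does this on synthetic data; the record layout (a sensor id and a count per minute), the 50 sensors and the Poisson counts are assumptions.

```python
import numpy as np

MINUTES_PER_DAY = 24 * 60

def daily_totals(sensor_ids, counts, n_sensors):
    """Sum per-minute vehicle counts into one daily total per sensor."""
    return np.bincount(sensor_ids, weights=counts, minlength=n_sensors)

# Synthetic stand-in data (assumption): 50 sensors reporting every minute.
rng = np.random.default_rng(2)
n_sensors = 50
sensor_ids = np.repeat(np.arange(n_sensors), MINUTES_PER_DAY)
counts = rng.poisson(lam=20, size=sensor_ids.size)

per_sensor = daily_totals(sensor_ids, counts, n_sensors)
print(per_sensor[:5], int(per_sensor.sum()))
```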


Number of detected vehicles on a single day




                                       Total = ~295 million
Traffic loop detection activity (only first 10 min.)




Number of detected vehicles on a single day




                                       12% added

Total vehicles during the day (snapshots)




Small, medium & large vehicles




Volatile behaviour at the micro-level




Docks in Rotterdam




                                        51.941,4.02836



Example e: Mobile phone data
• Nearly every person in the Netherlands has a mobile phone
     • It is on them and almost always switched on!
          • An increasing number of people have a smartphone

• Ideal source of information:
     • Using the data held by mobile phone companies on:
          • Travel behaviour (‘day-time’ population)
          • Tourism (new phones that register to the network)
          • Crowd information (for example during events)

     • But also as a data collection instrument:
          •   Questionnaires (with app, text messaging or browser)
          •   Taking pictures of products, cash receipts and barcodes
          •   Determining the exact GPS location
          •   Etc.



Travel behaviour of mobile phones

                                       Mobility of very active mobile phone users
                                          - during a 14-day period
                                          - data of a single mobile phone company

                                       Based on:
                                          - Call and text activity multiple times a day
                                          - Location based on phone masts

                                       Clearly selective:
                                          - Includes major cities
                                          - But the North and South-east
                                            of the country much less


Example f: Social media

• The Dutch are very active on social media platforms
     • Almost always carried along and nearly always switched on
          • More and more people have a smartphone!

• Possible source of information for:
     • Which topics are current:
          • Number of messages and the sentiment about them


     • Usable as a measuring instrument for:
          • .
                                                     Map by Eric Fischer (via Fast Company)



Social media: Dutch messages
• The Dutch are very active on social media platforms
  • Potential information source for:
          • Topics discussed and the sentiment about these topics (quickly
            available!), and probably more
          • Investigate it to determine its potential use

  Collected Dutch Twitter messages for the study: a ‘selection’ of 12 million

Social media: Dutch Twitter topics

[Pie chart: topic shares in the collected messages; visible labels are 46%, 10%, 7%, 7%, 5%, 3%, 3% and 3%, topic names not recoverable]

12 million messages
Final remarks: Future of statistics
 • Preparing large data sources for statistics is a lot of work
    • The exploration phase takes a lot of time
    • Reduction of information is needed (‘making big data small’)
    • Risk: ‘garbage in’ → ‘garbage statistics out’
 • The traditional approach does not suffice
    • Large data sources are definitely not ‘large’ sample surveys
    • Often a selective but large part of the population is included
    • Sometimes it is just too much detailed data
    • With traditional statistical analysis everything will be significant!
 • More need for:
    • Visualisation methods (to rapidly gain insight)
    • Methods specific to large datasets (speedy and ‘robust’) and non-linear estimation methods (data-mining-like)
    • ‘Computational statistics’ (& dedicated hardware)
    • Privacy demands will increase!
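
The remark that ‘everything will be significant’ can be checked with a small calculation. The sketch below computes the two-sided p-value of a two-sample z-test for the same negligible difference (0.01 standard deviations) at two sample sizes. The effect size, the group sizes and the function name are assumptions chosen for illustration.

```python
import math

def two_sample_z_pvalue(mean_diff, sd, n_per_group):
    """Two-sided p-value for a difference in means between two equal-sized
    groups with a known common standard deviation."""
    se = sd * math.sqrt(2.0 / n_per_group)
    z = mean_diff / se
    return math.erfc(abs(z) / math.sqrt(2.0))

# Same tiny difference, survey-sized versus register-sized groups.
small = two_sample_z_pvalue(0.01, 1.0, 1_000)
large = two_sample_z_pvalue(0.01, 1.0, 10_000_000)
print(f"n = 1,000 per group:      p = {small:.3f}")
print(f"n = 10,000,000 per group: p = {large:.3g}")
```

With a thousand records per group the difference is nowhere near significant; with ten million it is overwhelmingly so, although it is just as small in practical terms. Effect sizes, not p-values, carry the information at this scale.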

The future of Statistics Netherlands?

Weitere ähnliche Inhalte

Was ist angesagt?

INF2190_W1_2016_public
INF2190_W1_2016_publicINF2190_W1_2016_public
INF2190_W1_2016_public
Attila Barta
 
The impact of Big Data on next generation of smart cities
The impact of Big Data on next generation of smart citiesThe impact of Big Data on next generation of smart cities
The impact of Big Data on next generation of smart cities
PayamBarnaghi
 

Was ist angesagt? (20)

INF2190_W1_2016_public
INF2190_W1_2016_publicINF2190_W1_2016_public
INF2190_W1_2016_public
 
Big Data Analytics - Introduction
Big Data Analytics - IntroductionBig Data Analytics - Introduction
Big Data Analytics - Introduction
 
A Brief History Of Data
A Brief History Of DataA Brief History Of Data
A Brief History Of Data
 
Computational intelligence for big data analytics bda 2013
Computational intelligence for big data analytics   bda 2013Computational intelligence for big data analytics   bda 2013
Computational intelligence for big data analytics bda 2013
 
Intelligent Data Processing for the Internet of Things
Intelligent Data Processing for the Internet of Things Intelligent Data Processing for the Internet of Things
Intelligent Data Processing for the Internet of Things
 
The impact of Big Data on next generation of smart cities
The impact of Big Data on next generation of smart citiesThe impact of Big Data on next generation of smart cities
The impact of Big Data on next generation of smart cities
 
BIMCV, Banco de Imagen Medica de la Comunidad Valenciana. María de la Iglesia
BIMCV, Banco de Imagen Medica de la Comunidad Valenciana. María de la IglesiaBIMCV, Banco de Imagen Medica de la Comunidad Valenciana. María de la Iglesia
BIMCV, Banco de Imagen Medica de la Comunidad Valenciana. María de la Iglesia
 
Introduction to Big Data
Introduction to Big Data Introduction to Big Data
Introduction to Big Data
 
A brief history of "big data"
A brief history of "big data"A brief history of "big data"
A brief history of "big data"
 
Big Data 101
Big Data 101Big Data 101
Big Data 101
 
A comprehensive survey on data mining
A comprehensive survey on data miningA comprehensive survey on data mining
A comprehensive survey on data mining
 
Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)
Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)
Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)
 
01 intro
01 intro01 intro
01 intro
 
Data stories
Data storiesData stories
Data stories
 
Are you ready for BIG DATA?
Are you ready for BIG DATA?Are you ready for BIG DATA?
Are you ready for BIG DATA?
 
Geographical aspects of Big Data
Geographical aspects of Big DataGeographical aspects of Big Data
Geographical aspects of Big Data
 
Piet daas big_data_official_statistics_target_groningen
Piet daas big_data_official_statistics_target_groningenPiet daas big_data_official_statistics_target_groningen
Piet daas big_data_official_statistics_target_groningen
 
Real World Application of Big Data In Data Mining Tools
Real World Application of Big Data In Data Mining ToolsReal World Application of Big Data In Data Mining Tools
Real World Application of Big Data In Data Mining Tools
 
Big dataorig
Big dataorigBig dataorig
Big dataorig
 
data mining
data miningdata mining
data mining
 

Andere mochten auch

Big Data in Manufacturing Final PPT
Big Data in Manufacturing Final PPTBig Data in Manufacturing Final PPT
Big Data in Manufacturing Final PPT
Nikhil Atkuri
 

Andere mochten auch (10)

Statistiek en grote databestanden
Statistiek en grote databestandenStatistiek en grote databestanden
Statistiek en grote databestanden
 
Bi dutch meeting data science
Bi dutch meeting data scienceBi dutch meeting data science
Bi dutch meeting data science
 
Big Data presentation for Statistics Canada
Big Data presentation for Statistics CanadaBig Data presentation for Statistics Canada
Big Data presentation for Statistics Canada
 
Big Data @ CBS
Big Data @ CBSBig Data @ CBS
Big Data @ CBS
 
Big Data Analytics in Government
Big Data Analytics in GovernmentBig Data Analytics in Government
Big Data Analytics in Government
 
Big Data in Manufacturing Final PPT
Big Data in Manufacturing Final PPTBig Data in Manufacturing Final PPT
Big Data in Manufacturing Final PPT
 
태블로 소프트웨어(Tableau Software) 소개
태블로 소프트웨어(Tableau Software) 소개태블로 소프트웨어(Tableau Software) 소개
태블로 소프트웨어(Tableau Software) 소개
 
온라인 서비스 개선을 데이터 활용법 - 김진영 (How We Use Data)
온라인 서비스 개선을 데이터 활용법  - 김진영 (How We Use Data)온라인 서비스 개선을 데이터 활용법  - 김진영 (How We Use Data)
온라인 서비스 개선을 데이터 활용법 - 김진영 (How We Use Data)
 
The Efficient Big data Platform - IDC 360, Copenhagen
The Efficient Big data Platform - IDC 360, CopenhagenThe Efficient Big data Platform - IDC 360, Copenhagen
The Efficient Big data Platform - IDC 360, Copenhagen
 
IoT architecture
IoT architectureIoT architecture
IoT architecture
 

Ähnlich wie Data science and the future of statistics

Introduction to Big Data and Data Science
Introduction to Big Data and Data ScienceIntroduction to Big Data and Data Science
Introduction to Big Data and Data Science
Feyzi R. Bagirov
 

Ähnlich wie Data science and the future of statistics (20)

Big Data Visualization
Big Data VisualizationBig Data Visualization
Big Data Visualization
 
Data Science definition
Data Science definitionData Science definition
Data Science definition
 
Let's talk about Data Science
Let's talk about Data ScienceLet's talk about Data Science
Let's talk about Data Science
 
Introduction to Data Mining and technologies .ppt
Introduction to Data Mining and technologies .pptIntroduction to Data Mining and technologies .ppt
Introduction to Data Mining and technologies .ppt
 
Introduction to Big Data and Data Science
Introduction to Big Data and Data ScienceIntroduction to Big Data and Data Science
Introduction to Big Data and Data Science
 
Data; Data manipulation, sorting, grouping, rearranging. Plotting the data. D...
Data; Data manipulation, sorting, grouping, rearranging. Plotting the data. D...Data; Data manipulation, sorting, grouping, rearranging. Plotting the data. D...
Data; Data manipulation, sorting, grouping, rearranging. Plotting the data. D...
 
Big Data as the Fuel and Visual Analytics as the Engine Mount of the Digital ...
Big Data as the Fuel and Visual Analytics as the Engine Mount of the Digital ...Big Data as the Fuel and Visual Analytics as the Engine Mount of the Digital ...
Big Data as the Fuel and Visual Analytics as the Engine Mount of the Digital ...
 
Moving forward data centric sciences weaving AI, Big Data & HPC
Moving forward data centric sciences  weaving AI, Big Data & HPCMoving forward data centric sciences  weaving AI, Big Data & HPC
Moving forward data centric sciences weaving AI, Big Data & HPC
 
Unit 1
Unit 1Unit 1
Unit 1
 
data science and its role in big data analytics.pptx
data science and its role in big data analytics.pptxdata science and its role in big data analytics.pptx
data science and its role in big data analytics.pptx
 
EurnewsLDN_Toine_Pieters
EurnewsLDN_Toine_PietersEurnewsLDN_Toine_Pieters
EurnewsLDN_Toine_Pieters
 
Cs501 dm intro
Cs501 dm introCs501 dm intro
Cs501 dm intro
 
EO in Society: Open Science and Innovation
EO in Society: Open Science and InnovationEO in Society: Open Science and Innovation
EO in Society: Open Science and Innovation
 
Taming the Big Data Beast - Together
Taming the Big Data Beast - TogetherTaming the Big Data Beast - Together
Taming the Big Data Beast - Together
 
A Statistician's Introductory View on Big Data and Data Science (Version 7)
A Statistician's Introductory View on Big Data and Data Science (Version 7)A Statistician's Introductory View on Big Data and Data Science (Version 7)
A Statistician's Introductory View on Big Data and Data Science (Version 7)
 
Sensors1(1)
Sensors1(1)Sensors1(1)
Sensors1(1)
 
Dwdm
DwdmDwdm
Dwdm
 
Demystifying Big Data, Data Science and Statistics, along with Machine Intell...
Demystifying Big Data, Data Science and Statistics, along with Machine Intell...Demystifying Big Data, Data Science and Statistics, along with Machine Intell...
Demystifying Big Data, Data Science and Statistics, along with Machine Intell...
 
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & ImpactData Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
 
Doing data in the social sciences and humanities: links to and from published...
Doing data in the social sciences and humanities: links to and from published...Doing data in the social sciences and humanities: links to and from published...
Doing data in the social sciences and humanities: links to and from published...
 

Mehr von Piet J.H. Daas

Mehr von Piet J.H. Daas (20)

Big Data and official statistics with examples of their use
Big Data and official statistics with examples of their useBig Data and official statistics with examples of their use
Big Data and official statistics with examples of their use
 
IT infrastructure for Big Data and Data Science at Statistics Netherlands
IT infrastructure for Big Data and Data Science at Statistics NetherlandsIT infrastructure for Big Data and Data Science at Statistics Netherlands
IT infrastructure for Big Data and Data Science at Statistics Netherlands
 
ESSnet Big Data WP8 Methodology (+ Quality, +IT)
ESSnet Big Data WP8 Methodology (+ Quality, +IT)ESSnet Big Data WP8 Methodology (+ Quality, +IT)
ESSnet Big Data WP8 Methodology (+ Quality, +IT)
 
EMOS 2018 Big Data methods and techniques
EMOS 2018 Big Data methods and techniquesEMOS 2018 Big Data methods and techniques
EMOS 2018 Big Data methods and techniques
 
Use of social media for official statistics
Use of social media for official statisticsUse of social media for official statistics
Use of social media for official statistics
 
Isi 2017 presentation on Big Data and bias
Isi 2017 presentation on Big Data and biasIsi 2017 presentation on Big Data and bias
Isi 2017 presentation on Big Data and bias
 
Responsible Data Science at Statistics Netherlands
Responsible Data Science at Statistics NetherlandsResponsible Data Science at Statistics Netherlands
Responsible Data Science at Statistics Netherlands
 
CBS lecture at the opening of Data Science Campus of ONS
CBS lecture at the opening of Data Science Campus of ONSCBS lecture at the opening of Data Science Campus of ONS
CBS lecture at the opening of Data Science Campus of ONS
 
Ntts2017 presentation 45
Ntts2017 presentation 45Ntts2017 presentation 45
Ntts2017 presentation 45
 
Big Data presentation Mannheim
Big Data presentation MannheimBig Data presentation Mannheim
Big Data presentation Mannheim
 
Extracting information from ' messy' social media data
Extracting information from ' messy' social media dataExtracting information from ' messy' social media data
Extracting information from ' messy' social media data
 
Big data cbs_piet_daas
Big data cbs_piet_daasBig data cbs_piet_daas
Big data cbs_piet_daas
 
Gebruik van sociale media voor de officiële statistiek
Gebruik van sociale media voor de officiële statistiekGebruik van sociale media voor de officiële statistiek
Gebruik van sociale media voor de officiële statistiek
 
Profiling Big Data sources to assess their selectivity
Profiling Big Data sources to assess their selectivityProfiling Big Data sources to assess their selectivity
Profiling Big Data sources to assess their selectivity
 
Big Data @ CBS for Fontys students in Eindhoven
Big Data @ CBS for Fontys students in EindhovenBig Data @ CBS for Fontys students in Eindhoven
Big Data @ CBS for Fontys students in Eindhoven
 
Quality challenges in modernising business statistics
Quality challenges in modernising business statisticsQuality challenges in modernising business statistics
Quality challenges in modernising business statistics
 
Quality Approaches to Big Data
Quality Approaches to Big DataQuality Approaches to Big Data
Quality Approaches to Big Data
 
Social media sentiment and consumer confidence
Social media sentiment and consumer confidenceSocial media sentiment and consumer confidence
Social media sentiment and consumer confidence
 
Big data @ CBS
Big data @ CBSBig data @ CBS
Big data @ CBS
 
Big data Big impact?
Big data Big impact?Big data Big impact?
Big data Big impact?
 

Kürzlich hochgeladen

Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
PECB
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
ciinovamais
 
Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.
MateoGardella
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
QucHHunhnh
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
negromaestrong
 
Gardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch LetterGardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch Letter
MateoGardella
 

Kürzlich hochgeladen (20)

Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdf
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...

Data science and the future of statistics

  • 1. Data Science ‘and the future of statistics’ Piet Daas (and many colleagues)* Statistics Netherlands / Centraal Bureau voor de Statistiek *Martijn Tennekes, Edwin de Jonge, Alex Priem, Bart Buelens, Merijn van Pelt, Paul van den Hurk Data Science NL, 8 Nov. Utrecht
  • 2. Layout
      • Introduction
      • What is Data Science?
          • You need data, to be one
      • Data Scientist skills
          • A sexy job with a paradigm shift
      • Link with Statistics Netherlands work
          • Examples of recent developments
  • 3. Introduction: “Statistics Netherlands will produce about 5000 official publications and tables in 2012”. For this we need DATA.
  • 4. Two types of data
      • Primary data: our own surveys
      • Secondary data: data from ‘others’
          • Administrative sources
          • ‘New’ data sources
  • 5. Data, data everywhere!
  • 6. Statistics & Data science
      1) Is the study of ‘the use of secondary data for statistics’ data science?
      2) What is data science?
  • 7. What is Data Science?
      • First used in 1974 by Danish computer scientist Peter Naur in his book “Concise Survey of Computer Methods”
      • Defined as: “The science of dealing with data, once they have been established”
      • Established data is data that has been created. If that was done by someone else, then it is secondary data!
  • 8. Data scientist/statistician is “the sexiest job of the 21st Century”: people able to derive knowledge from large amounts of data!
  • 9. Data science skills ‘landscape’
      • Programming skills
      • Sexy Skills of Data Geeks
          1) Statistics - traditional analysis you're used to thinking about
          2) Data ‘munging’ - parsing, scraping, and formatting data
          3) Visualization - graphs, tools, etc.
  • 11. Data Science NL, 8 November, Utrecht 9
  • 12. Are things changing at the office?
  • 13. Statistics Netherlands law
      • “Statistics Netherlands aims to reduce the administrative burden for companies and the public as much as possible”
      • By (re-)using existing administrative registrations of both government and government-funded organizations
      • And by studying potential new sources of information
  • 14. Statistics Netherlands and Data
      • Data is generated in increasing amounts and at increasing frequencies:
          • From ‘data scarcity’ (sample survey) to ‘data abundance’ (administrative & Big)
          • Ever increasing amounts of data need to be checked, processed and analyzed
          • More sources of information become available
          • Opportunities to produce statistics faster (‘real-time statistics’)
      • Need for new methods and tools:
          1. Methods to quickly uncover information from the massive amounts of data available, such as visualisation methods and data-, text- and stream-mining techniques (‘making Big Data small’), and High Performance Computing
          2. Methods capable of integrating the information in the statistical process, e.g. linking at massive scale, macro/meso-integration, and estimation methods suited for large datasets
  • 15. Examples of new developments
      1) New approaches to official statistical inference
          a. Algorithmic inference
      2) Visualisation methods to quickly obtain insight into large datasets
          b. Virtual Census (17 million records)
          c. Social Security Register (20 million records)
      3) Research findings on the use of ‘new’ data sources
          d. Traffic loop data (80 million records)
          e. Mobile phone data (~500 million records)
          f. Social media (12 million - 1 billion records)
  • 16. Example a. Statistical inference
      • Inference is traditionally motivated from a design-based sample perspective
      • The model-based approach is gradually being adopted in specific circumstances (e.g. administrative data)
      • Next step: algorithmic inference methods
          • Machine learning, data mining approaches
  • 17. Shifting paradigms: simulation results (1000x). Panels: Design, Model, Neural., DisTree.
  • 18. Example b. Virtual Census
      • Every 10 years a Census needs to be conducted
      • No longer with surveys in the Netherlands
          • Last traditional census was in 1971
      • Now by (re-)using existing information
          • Linking administrative sources and available sample survey data at a large scale
      • Check the result
          • How? With a visualisation method: the Tableplot
  • 19. Making the Tableplot
      1. Load file (17 million records)
      2. Sort records according to a key variable (17 million records)
          • Age in this example
      3. Combine records (100 groups of 170,000 records each)
          • Numeric variables: calculate the average (avg. age)
          • Categorical variables: ratio between the categories present (male vs. female)
      4. Plot a figure of a selected number of variables (up to 12)
          • Colours used are important
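The sort, group and summarise steps of the tableplot can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used at Statistics Netherlands: the function name `tableplot_bins`, the record layout and the small synthetic dataset are all hypothetical, and the final plotting step is left out.

```python
import random
from collections import Counter


def tableplot_bins(records, sort_key, n_bins=100):
    """Sort records on sort_key, cut them into n_bins near-equal groups
    and summarise every variable per group (hypothetical helper)."""
    ordered = sorted(records, key=lambda r: r[sort_key])
    n = len(ordered)
    summaries = []
    for b in range(n_bins):
        # Integer arithmetic keeps the group sizes within one record of each other.
        group = ordered[b * n // n_bins:(b + 1) * n // n_bins]
        if not group:
            continue
        summary = {"bin": b, "size": len(group)}
        for name in group[0]:
            values = [r[name] for r in group]
            if isinstance(values[0], (int, float)):
                # Numeric variable: average within the group.
                summary[name] = sum(values) / len(values)
            else:
                # Categorical variable: share of each category in the group.
                counts = Counter(values)
                summary[name] = {k: c / len(values) for k, c in counts.items()}
        summaries.append(summary)
    return summaries


# Synthetic stand-in for the census file (assumption: 10,000 records, not 17 million).
random.seed(0)
records = [
    {"age": random.randint(0, 99), "sex": random.choice(["male", "female"])}
    for _ in range(10_000)
]
bins = tableplot_bins(records, sort_key="age", n_bins=100)
print(bins[0])
print(bins[-1])
```

Each summary row corresponds to one horizontal band of the tableplot: the numeric average would be drawn as a bar and the category shares as a stacked, coloured bar.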
  • 20. Tableplot of the census test file
  • 21. Processing of data: raw (unedited) data, edited data, final data
  • 22. Example c: Social Security Register
      • Contains all financial data on jobs, benefits and pensions in the Netherlands
          • Collected by the Dutch Tax office
          • A total of 20 million records each month
      • How to obtain insight into so much data?
          • With a visualisation method: a heat map
  • 23. Heat map: Age vs. ‘Income’ (axes: age, income in euro)
  • 24. A 3D heat map: Age vs. Income vs. Amount, after ‘data reduction’ (axes: age, amount)
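The ‘data reduction’ behind such a heat map can be illustrated as two-dimensional binning: every record is assigned to an (age class, income class) cell and only the count per cell is kept for plotting. The sketch below is a minimal example under assumptions; the helper name `heatmap_counts`, the bin widths and the synthetic records are hypothetical and do not come from the Social Security Register.

```python
import random
from collections import Counter


def heatmap_counts(pairs, x_width, y_width):
    """Count records per (x class, y class) cell; the cell counts are
    what a heat map colours (hypothetical helper)."""
    counts = Counter()
    for x, y in pairs:
        counts[(int(x // x_width), int(y // y_width))] += 1
    return counts


# Synthetic (age, income) pairs standing in for register records (assumption).
random.seed(1)
people = [(random.randint(15, 90), random.uniform(0, 80_000)) for _ in range(100_000)]

# 5-year age classes and 5,000 euro income classes (illustrative choices).
grid = heatmap_counts(people, x_width=5, y_width=5_000)
print(len(people), "records reduced to", len(grid), "cells")
```

However many records go in, the result never has more cells than the grid allows, which is why the same picture can be drawn for 20 million records a month.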
  • 25. Example d: Traffic loop detection data
      • Traffic ‘loops’
          • Every minute (24/7) the number of passing vehicles is counted by >10,000 road sensors & cameras in the Netherlands
          • Total vehicles and in different length classes
      • Interesting source to produce traffic and transport statistics (and more)
          • Huge amounts of data, about 80 million records a day
      • Map: locations
  • 26. Number of detected vehicles on a single day: total = ~295 million
  • 27. Traffic loop detection activity (only first 10 min.)
  • 28. Number of detected vehicles on a single day: 12% added
  • 29. Total vehicles during the day (snapshots)
  • 30. Small, medium & large vehicles
  • 31. Volatile behaviour at the micro-level
  • 32. Docks in Rotterdam (51.941, 4.02836)
  • 33. Example e: Mobile phone data
      • Nearly every person in the Netherlands has a mobile phone
          • On them and almost always switched on!
          • An increasing number of people have a smartphone
      • Ideal source of information:
          • Use mobile phone data of mobile phone companies:
              • Travel behaviour (‘day time’ population)
              • Tourism (new phones that register to the network)
              • Crowd info (for example during events)
          • But also as a data collection instrument:
              • Questionnaires (with app, text messaging or browser)
              • Taking pictures of products, cash receipts and barcodes
              • Determine exact GPS location
              • Etc.
  • 34. Travel behaviour of mobile phones
      • Mobility of very active mobile phone users
          • During a 14-day period
          • Data of a single mobile phone company
      • Based on:
          • Call and text activity multiple times a day
          • Location based on phone masts
      • Clearly selective:
          • Includes major cities
          • But the North and South-east of the country much less
  • 35. Example f: Social media
      • Dutch are very active on social media platforms
          • Almost always carried and nearly always switched on
          • More and more people have a smartphone!
      • Possible source of information for:
          • Which topics are current:
              • Number of messages and the sentiment about them
          • Can be used as a measuring instrument for:
      • Map by Eric Fischer (via Fast Company)
  • 36. Social media: Dutch messages
      • Dutch are very active on social media platforms
      • Potential information source for:
          • Topics discussed and sentiment over these topics (quickly available!), and probably more?
          • Investigate it to obtain an answer on potential use
      • Collected Dutch Twitter messages for study: ‘selection’ of 12 million
  • 37. Social media: Dutch Twitter topics, 12 million messages (chart of topic shares: 3%, 7%, 3%, 10%, 7%, 3%, 5%, 46%)
  • 38. Final remarks: Future of statistics
      • Preparing large data sources for statistics is a lot of work
          • Exploration phase takes a lot of time
          • Reduction of information is needed (‘making big data small’)
          • Risk: ‘garbage in’, ‘garbage statistics out’
      • Traditional approach does not suffice
          • Large data sources are definitely not ‘large’ sample surveys
          • Often a selective but large part of the population is included
          • Sometimes it is just too much detailed data
          • With traditional statistical analysis everything will be significant!
      • More need for:
          • Visualisation methods (to rapidly gain insight)
          • Methods specific for large datasets (speedy and ‘robust’) and non-linear estimation methods (data mining like)
          • ‘Computational statistics’ (& dedicated hardware)
      • Privacy demands will increase!
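The remark that ‘everything will be significant’ can be made concrete with a small worked calculation. The sketch below assumes a plain two-sample z-test with a known, common standard deviation; the helper `two_sided_p` and the chosen effect size are hypothetical illustrations, not figures from the talk. A fixed, practically negligible difference of 0.01 standard deviations is tested at growing group sizes.

```python
import math


def two_sided_p(diff, sd, n):
    """Two-sided p-value of a z-test for a difference `diff` between two
    group means, each based on n records with common standard deviation sd."""
    standard_error = sd * math.sqrt(2.0 / n)
    z = diff / standard_error
    # For a standard normal Z: P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


# Same tiny difference, ever larger groups.
for n in (100, 10_000, 1_000_000, 100_000_000):
    print(f"n = {n:>11,}  p = {two_sided_p(diff=0.01, sd=1.0, n=n):.3g}")
```

The standard error shrinks with the square root of n, so the p-value of any non-zero difference eventually drops below every conventional threshold. With sources of millions of records, significance says little about relevance, and nothing about the selectivity of the source.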
  • 39. The future of Stat Neth?