BRINGING SURVEY SAMPLING
TECHNIQUES INTO ‘BIG DATA’
ANTOINE REBECQ
UBISOFT MONTRÉAL
NOVEMBER 7, 2018
About me
• Formerly: survey sampling methodologist at INSEE, France
• “Type A” data scientist turned “Type B”
Key takeaway
The future of ‘big data’ is a statistician
Summary
I. What is a data science team? How can a (survey) statistician fit into
it?
II. Examples of awesome ‘big data’ challenges that could use
statisticians
I. Data science and data scientists
Data scientists = combination of computer science, statistics, applied
mathematics and domain expertise
Type A data scientist = Focused on analyses, decision science
Type B data scientist = Focused on production data applications
(typically ML, recommendations, etc.)
What does our type B data science team do?
Machine Learning in games! Example: Recommendations (from Netflix:
Basilico, 2015)
[Diagram: send data, send content, compute ML models]
At its core: a programming team
- Production code:
- Distributed computation
- Optimized algorithms
- Code history and reviews
Tech stack: [logos on the original slide, not reproduced]
Modern data science teams
The (in)famous data science Venn diagram (Conway, 2013)
Some truths:
- Blur the line between all jobs (opportunities, not requirements)
- Unicorns are rare but they do exist
- Let them have fun!
- Pay them accordingly!
More generally: Create opportunities for everyone to learn from every
domain
What can statisticians get from CS culture?
- Quality control for statisticians (hint: it’s the same!):
- Distributed computation
- Optimized algorithms
- Code history and reviews
The R community has had a very positive influence in introducing CS quality
processes to statistics and data science (see, for example, Wickham,
2015, on git).
II. Examples of ‘big data’ challenges that
could use statisticians
II. Examples of challenges
1. A/B testing
2. Sampled events (understanding data sources)
3. Improving ML algorithms (quality)
4. Improving ML algorithms (speed)
5. Understanding user feedback
1. A/B testing
A/B testing = ‘big data’ term for Randomized Controlled Trial (RCT)
Very useful for:
- Product shipping
- Business decisions
For example, Microsoft has a dedicated team doing extensive work on
A/B testing (see Deng, 2018).
Need for carefully crafted sampling designs (Image from Miller).
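One ingredient of such a design is deciding how many users each arm needs. The following is a minimal Python sketch using the standard normal-approximation formula for comparing two proportions. The function name, the default significance level (0.05) and power (0.8), and the example rates are illustrative assumptions, not values from the talk, and the result need not match Miller's calculator exactly.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_arm(p_base, min_effect, alpha=0.05, power=0.8):
    """Approximate users needed per arm for a two-sided two-proportion z-test.

    p_base:     conversion rate of the control arm (assumed known)
    min_effect: smallest absolute difference worth detecting
    """
    p_alt = p_base + min_effect
    p_bar = (p_base + p_alt) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p_base * (1 - p_base) + p_alt * (1 - p_alt))
    ) ** 2
    return ceil(numerator / min_effect ** 2)


# Hypothetical example: 20% baseline rate, detect a 5-point absolute change.
n_per_arm = sample_size_per_arm(p_base=0.20, min_effect=0.05)
print(n_per_arm)
```

Halving the detectable effect roughly quadruples the required sample, which is why the design has to be fixed before the test starts.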
2. Sampled tracking events
Event = a single piece of information sent to the server when something happens
Some events are sampled to reduce load (CPU, network, storage)
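Analyses built on sampled events need the sampling rate to scale results back up. The sketch below assumes each event is kept independently with a fixed, known rate (Bernoulli sampling), which the slides do not specify. Every kept event then carries a weight of 1 / rate, and totals are estimated with a Horvitz-Thompson sum. The event values and the 5% rate are made up for illustration.

```python
import random


def sample_events(events, rate, rng):
    """Keep each event independently with probability `rate` and attach
    the design weight 1 / rate to every kept event."""
    return [(value, 1.0 / rate) for value in events if rng.random() < rate]


def estimate_total(sampled):
    """Horvitz-Thompson estimate of the total of `value` over all events."""
    return sum(value * weight for value, weight in sampled)


rng = random.Random(42)
# Hypothetical event payloads, e.g. damage dealt per hit.
events = [rng.randint(1, 10) for _ in range(100_000)]
sampled = sample_events(events, rate=0.05, rng=rng)

true_total = sum(events)
estimated_total = estimate_total(sampled)
print(len(sampled), true_total, round(estimated_total))
```

Ignoring the weights would understate the total by a factor of about 1 / rate. With unequal rates per event type, unweighted comparisons between event types would also be biased.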
Example: analysis of balancing in a fighting game
An event is sent by a sample of players when they use a new weapon.
Question: is sword A better than sword B?
-> Analysis of matches where these weapons are used
This is an indirect sampling design (Lavallée, 2009): players are sampled, but the analysis targets matches.
(Unequal probabilities because of players' preferences, game rules, etc.)
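One way to weight matches under such a design is the generalized weight share method from the indirect sampling literature (Lavallée, 2009). The sketch below is an illustration, not the solution used in the talk. It assumes that each match is linked to every player who took part, that the full player list of an observed match is known, and that players are sampled independently with probability 0.5. All names and data are hypothetical. A match's weight is the sum of its sampled players' weights divided by its total number of players.

```python
from itertools import combinations


def match_weights(matches, player_weights):
    """Weight share: a match receives the summed design weights of its
    sampled players, divided by the number of players linked to it.
    Matches with no sampled player are not observed and get no weight."""
    weights = {}
    for match_id, players in matches.items():
        shared = sum(player_weights.get(player, 0.0) for player in players)
        if shared > 0:
            weights[match_id] = shared / len(players)
    return weights


players = ["p1", "p2", "p3", "p4"]
matches = {"m1": ["p1", "p2"], "m2": ["p2", "p3", "p4"], "m3": ["p3", "p4"]}

# With inclusion probability 0.5, every subset of players is equally likely,
# so averaging over all subsets gives the expected estimate.
estimates = []
for size in range(len(players) + 1):
    for sample in combinations(players, size):
        weights = match_weights(matches, {player: 2.0 for player in sample})
        estimates.append(sum(weights.values()))

average_estimate = sum(estimates) / len(estimates)
print(len(matches), average_estimate)
```

Averaged over all possible samples, the estimated number of matches agrees with the true count, which is the unbiasedness property that motivates sharing weights across links.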
Our ‘quick-and-dirty’ solution: calibration and R package Icarus
(Rebecq, 2016)
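Calibration adjusts sample weights so that weighted sample totals match known population totals. As a rough illustration, here is a plain Python sketch of raking ratio (iterative proportional fitting), a classical special case of calibration on categorical margins. It does not use or reproduce the Icarus API, and the sample, the initial weights and the population margins are all hypothetical.

```python
def rake(weights, categories, margins, n_iter=100):
    """Raking ratio: rescale weights one variable at a time until the
    weighted category totals match the known margins."""
    w = list(weights)
    for _ in range(n_iter):
        for variable, targets in margins.items():
            totals = {category: 0.0 for category in targets}
            for wi, category in zip(w, categories[variable]):
                totals[category] += wi
            w = [wi * targets[category] / totals[category]
                 for wi, category in zip(w, categories[variable])]
    return w


def weighted_totals(weights, labels):
    """Sum the weights within each category label."""
    totals = {}
    for wi, label in zip(weights, labels):
        totals[label] = totals.get(label, 0.0) + wi
    return totals


# Hypothetical sample of six players with equal initial weights.
categories = {
    "weapon": ["A", "A", "A", "B", "B", "A"],
    "tier": ["low", "high", "low", "low", "high", "low"],
}
# Hypothetical known population totals (both margins sum to 1000 players).
margins = {
    "weapon": {"A": 600.0, "B": 400.0},
    "tier": {"low": 700.0, "high": 300.0},
}

calibrated = rake([10.0] * 6, categories, margins)
print(weighted_totals(calibrated, categories["weapon"]))
print(weighted_totals(calibrated, categories["tier"]))
```

Raking keeps all weights positive, which is one reason it is often preferred over the linear calibration method when weights feed into downstream models.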
3. Better probabilities for ML algorithms using sampling calibration
Using sampling calibration (Deville, 1992) to craft better probabilities
from ML algorithms
1. Example with balancing of sample data:
http://nc233.com/2018/07/weighting-tricks-for-machine-learning-with-icarus-part-1/
2. Directly calibrate output probabilities (WIP)
- Better simulations
- Better recommendations
4. Speed up big data tasks
Example: Sampling to speed up network analyses (Leskovec, 2016 and Rebecq,
2017)
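The idea can be illustrated on a toy problem: estimating the number of edges of a graph from the degrees of a simple random sample of nodes instead of scanning the whole graph. This is a minimal sketch in plain Python and does not use the SNAP library. The random graph, its size and the sample size are assumptions made for the example.

```python
import random


def random_graph(n_nodes, n_edges, rng):
    """Build a toy undirected graph as an adjacency dict (illustration only)."""
    adjacency = {node: set() for node in range(n_nodes)}
    added = 0
    while added < n_edges:
        u, v = rng.randrange(n_nodes), rng.randrange(n_nodes)
        if u != v and v not in adjacency[u]:
            adjacency[u].add(v)
            adjacency[v].add(u)
            added += 1
    return adjacency


def estimate_edge_count(adjacency, sample_size, rng):
    """Estimate the number of edges from a simple random sample of nodes:
    (number of nodes) * (mean sampled degree) / 2."""
    nodes = rng.sample(sorted(adjacency), sample_size)
    mean_degree = sum(len(adjacency[node]) for node in nodes) / sample_size
    return len(adjacency) * mean_degree / 2


rng = random.Random(2018)
graph = random_graph(n_nodes=2000, n_edges=10000, rng=rng)
estimate = estimate_edge_count(graph, sample_size=500, rng=rng)
print(round(estimate))
```

Only a quarter of the nodes are visited here. Heavier-tailed degree distributions would call for larger samples or unequal-probability designs.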
5. Understand user feedback
Sentiment analysis (Pang, 2002)
Direct feedback from community
Vs.
Sampling and carefully crafted questionnaire
Conclusion
- A lot of interesting topics in survey sampling literature can be super
useful for ‘big data’ problems (research and practice)
- Hire a statistician for your type A data science team!
- Hire a statistician for your type B data science team!
- If you’re a statistician, look into ‘big data’ jobs for interesting
challenges!
Thanks!
Antoine Rebecq
Blog post: nc233.com/symposium2018
References
[Basilico, 2015] BASILICO, Justin. Recommendations for building Machine Learning systems. https://www.slideshare.net/SessionsEvents/justin-basilico-research-engineering-manager-at-netflix-at-mlconf-sf-111315
[Conway, 2013] CONWAY, Drew. The data science Venn diagram. http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
[Deville, 1992] DEVILLE, Jean-Claude and SÄRNDAL, Carl-Erik. Calibration estimators in survey sampling. Journal of the American Statistical Association, 1992, vol. 87, no 418, p. 376-382.
[Deng, 2018] DENG, Alex, KNOBLICH, Ulf, and LU, Jiannan. Applying the Delta method in metric analytics: A practical guide with novel ideas. arXiv preprint arXiv:1803.06336, 2018.
[Lavallée, 2009] LAVALLÉE, Pierre. Indirect sampling. Springer Science & Business Media, 2009.
[Leskovec, 2016] LESKOVEC, Jure and SOSIČ, Rok. Snap: A general-purpose network analysis and graph-mining library. ACM Transactions on Intelligent Systems and Technology (TIST), 2016, vol. 8, no 1, p. 1.
[Miller] MILLER, Evan. Evan Miller’s sample size calculator. https://www.evanmiller.org/ab-testing/sample-size.html
[Pang, 2002] PANG, Bo, LEE, Lillian, and VAITHYANATHAN, Shivakumar. Thumbs up?: sentiment classification using machine learning techniques. In: Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, Volume 10. Association for Computational Linguistics, 2002. p. 79-86.
[Rebecq, 2017] REBECQ, Antoine. Sampling graphs. https://nc233.com/2017/03/sampling-graphs-mad-stat-seminar-at-toulouse-school-of-economics/
[Rebecq, 2016] REBECQ, Antoine. Icarus: un package R pour le calage sur marges et ses variantes. In: 9e colloque francophone sur les sondages, Gatineau (Canada). 2016.
[Wickham, 2015] WICKHAM, Hadley. R packages: organize, test, document, and share your code. O'Reilly Media, Inc., 2015 (page on git available at http://r-pkgs.had.co.nz/git.html)
