This talk is part of the MIT Program on Information Science brown bag series (http://informatics.mit.edu).
This talk reviews emerging big data sources for social scientific analysis and explores the challenges these present. Many of these sources pose distinct challenges for acquisition, processing, analysis, inference, sharing, and preservation.
Dr. Micah Altman is Director of Research and Head/Scientist, Program on Information Science for the MIT Libraries, at the Massachusetts Institute of Technology. Dr. Altman is also a Non-Resident Senior Fellow at The Brookings Institution. Prior to arriving at MIT, Dr. Altman served at Harvard University for fifteen years as the Associate Director of the Harvard-MIT Data Center, Archival Director of the Henry A. Murray Archive, and Senior Research Scientist in the Institute for Quantitative Social Sciences.
Dr. Altman conducts research in social science, information science and research methods -- focusing on the intersections of information, technology, privacy, and politics; and on the dissemination, preservation, reliability and governance of scientific knowledge.
BROWN BAG TALK WITH MICAH ALTMAN, SOURCES OF BIG DATA FOR SOCIAL SCIENCES
1. Sources of Big Data for the Social
Sciences
Micah Altman
Director of Research
MIT Libraries
Prepared for
Program on Information Science Brown Bag Series
MIT
August 2015
2. Roadmap
Sources of Big Data for the Social Sciences
What the @#%&! Is
“big data”?
Two examples of big
data in social &
health sciences
Open questions
Potential roles for
libraries
Big Data
Challenges
Acquisition
Retention
Analysis
Access
3. Sources of Big Data for the Social Sciences
Credits
&
Disclaimers
4. DISCLAIMER
These opinions are my own; they are not the
opinions of MIT, Brookings, any of the project
funders, nor (with the exception of co-authored
previously published work) my collaborators.
Secondary disclaimer:
“It’s tough to make predictions, especially about
the future!”
-- Attributed to Woody Allen, Yogi Berra, Niels Bohr, Vint Cerf, Winston
Churchill, Confucius, Disreali [sic], Freeman Dyson, Cecil B. Demille, Albert
Einstein, Enrico Fermi, Edgar R. Fiedler, Bob Fourer, Sam Goldwyn, Allan
Lamport, Groucho Marx, Dan Quayle, George Bernard Shaw, Casey Stengel,
Will Rogers, M. Taub, Mark Twain, Kerr L. White, etc.
5. Collaborators & Co-Conspirators
Workshop Series Co-Organizers
– U.S. Census Bureau
Cavan Capps
Ron Prevost
Research Support
Supported by the U.S. Census Bureau
6. Related Work
Main Project:
Census-MIT Big Data Workshop Series
projects.informatics.mit.edu/bigdataworkshops
Related publications:
(Reprints available from: informatics.mit.edu )
Altman, M., D. O’Brien, S. Vadhan, A. Wood. 2014. “Big Data Study:
Request for Information.”
Altman M, Wood A, O'Brien D, Gasser M., Vadhan, S. Towards a
Modern Approach to Privacy-Aware Government Data Releases. Berkeley
Journal of Technology Law. Forthcoming.
Altman M, McDonald MP. 2014. Public Participation GIS: The Case of
Redistricting. Proceedings of the 47th Annual Hawaii International
Conference on System Sciences.
7. Workshops Series: Big Data and Official Statistics
Acquisition Challenges: Using New Forms of Information for Official Economic Statistics [August 3-4]
Privacy Challenges: Location Confidentiality and Official Surveys [October 5-6]
Inference Challenges: Transparency and Inference [December 7-8]
Expected outcomes:
Workshop reports (September, October, December)
Integrated white paper (February)
Identifying new opportunities for statistical agencies
Inform the Census Big Data Research Program.
projects.informatics.mit.edu/bigdataworkshops
8. Sources of Big Data for the Social Sciences
What the
@#%&!
is Big Data?
9. Small, Big, Massive & Ginormous
Data Characteristics: the k “V’s” of big data
Volume
Velocity
Variety
+ Veracity
+ Variability
+ …
10. “Big” is in the use, not just the data
When do the challenges of “big” exceed the limits of well-selected
traditional methods and practices?
Data Management – Workflow & Governance Challenges
Implementation – Performance Challenges
Analysis methods – Inferential Challenges
11. Sources of Big Data for the Social Sciences
Why pay attention
now?
12. Trends and Challenges
Trends
Increasingly data-driven economy
Individuals are increasingly mobile
Technology changes data uses
Stakeholder expectations are changing
Agency budgets and staffing remain flat.
The next generation of official statistics
Utilize broad sources of information
Increase granularity, detail, and timeliness
Reduce cost & burden
Maintain confidentiality and security
Multi-disciplinary challenges:
Computation, Statistics, Informatics, Social Science, Policy
13. Sources of Big Data for the Social Sciences
Two examples
(Good Cop, Bad Cop?)
14. Strategies
(and U.S. Debate Strategies)
More Information
• Grimmer, Justin, and Gary King. "General purpose computer-
assisted clustering and conceptualization." Proceedings of the
National Academy of Sciences 108.7 (2011): 2643-2650.
• King, Gary, Jennifer Pan, and Margaret E. Roberts. 2013.
“How Censorship in China Allows Government Criticism but
Silences Collective Expression.” American Political Science
Review 107 (2, May): 1-18. Copy at http://j.mp/LdVXqN
“Posts with negative, even vitriolic, criticism of
the state, its leaders, and its policies are not
more likely to be censored… the censorship
program is aimed at curtailing collective
action by silencing comments that represent,
reinforce, or spur social mobilization, regardless
of content.”
Data Source: Social media messages
Data Structure: Network, unstructured text, structured metadata
Unit of Observation: Individuals; interactions
Collection Design: Pure observational
Desired Inferences:
- Causal inference: what censorship strategies cause observed reaction
- Inference to population frame
Performance Challenges:
- High volume
- Complex network structure
- Scaling bespoke algorithms
- Sparsity
- Systematic and sparse metadata
Management Challenges:
- License
- Replication
- Revision control
Inferential Challenges:
- Measurement error: extracting topics from text
15. Using Google Searches to Forecast Disease Outbreaks
More Information
• Ginsberg, Jeremy, et al. "Detecting influenza epidemics using
search engine query data." Nature 457.7232 (2009): 1012-
1014.
• Lazer, David, et al. "The parable of Google Flu: traps in big
data analysis." Science 343.6176 (2014): 1203-1205.
“Big data hubris” is the often implicit
assumption that big data are a substitute
for, rather than a supplement to,
traditional data collection and analysis.
Data Source: Google search queries
Data Structure: Quasi-tabular, structured metadata and unstructured text
Unit of Observation: Interactions with a system
Collection Design: Pure observational
Desired Inferences:
- Predictive inference: where will flu clusters appear next; short-term (nearcasting); small-area (fine spatial granularity)
- Inference to general population
Performance Challenges:
- Streaming algorithms
Management Challenges:
- Replication
- Transparency
- Variability
Inferential Challenges:
- External validity
- Measurement error: extracting topics from text
- Overfitting
- Sampling
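The overfitting challenge listed above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the method used by Google Flu Trends or by Lazer et al.: the "search terms" and the "flu" series are pure random noise, and the function names (`pearson`, `best_term_correlations`, `average_over_replications`) are hypothetical. It shows that when many candidate terms are screened against a short series, the best in-sample correlation looks strong even though it carries no predictive value on held-out periods.

```python
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def best_term_correlations(rng, n_train=20, n_test=10, n_terms=2000):
    """Pick the noise 'term' that best matches a noise 'target' in training.

    Returns (in-sample correlation, held-out correlation) for that term.
    Every series is independent noise, so any fit is spurious by design.
    """
    total = n_train + n_test
    target = [rng.gauss(0, 1) for _ in range(total)]
    best_in, best_out = -1.0, 0.0
    for _ in range(n_terms):
        term = [rng.gauss(0, 1) for _ in range(total)]
        r_in = pearson(term[:n_train], target[:n_train])
        if r_in > best_in:
            best_in = r_in
            best_out = pearson(term[n_train:], target[n_train:])
    return best_in, best_out


def average_over_replications(replications=10, seed=2):
    """Average the in-sample and held-out correlations over several runs."""
    rng = random.Random(seed)
    results = [best_term_correlations(rng) for _ in range(replications)]
    avg_in = sum(r[0] for r in results) / replications
    avg_out = sum(r[1] for r in results) / replications
    return avg_in, avg_out
```

Calling `average_over_replications()` returns the two averages; the gap between them is the point of the exercise, since the selected term was chosen for its fit rather than for any real relationship.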
16. Comparing Cases
Cases compared: Chinese Censorship vs. Flu Prediction
Data Source
  Chinese Censorship: Social media messages
  Flu Prediction: Google search queries
Data Structure
  Chinese Censorship: Network, unstructured text, structured metadata
  Flu Prediction: Quasi-tabular, structured metadata and unstructured text
Unit of Observation
  Chinese Censorship: Individuals; interactions
  Flu Prediction: Interactions with a system
Collection Design
  Chinese Censorship: Pure observational
  Flu Prediction: Pure observational
Desired Inferences
  Chinese Censorship:
  - Causal inference: what censorship strategies cause observed reaction
  - Inference to population frame
  Flu Prediction:
  - Predictive inference: where will flu clusters appear next; short-term (nearcasting); small-area (fine spatial granularity)
  - Inference to general population
Performance Challenges
  Chinese Censorship:
  - High volume
  - Complex network structure
  - Scaling bespoke algorithms
  - Sparsity
  - Systematic and sparse metadata
  Flu Prediction:
  - Streaming algorithms
Management Challenges
  Chinese Censorship:
  - License
  - Replication
  - Revision control
  Flu Prediction:
  - Replication
  - Transparency
  - Variability
Inferential Challenges
  Chinese Censorship:
  - Measurement error: extracting topics from text
  Flu Prediction:
  - External validity
  - Measurement error: extracting topics from text
  - Overfitting
  - Sampling
17. Sources of Big Data for the Social Sciences
Why is dealing with
big data hard?
19. Challenges of Big Data
Acquisition
Challenges:
Quality, Provenance,
Sources
20. Some Sources of Economic Information
Smartphone sensors – GPS +
Vehicle systems
IoT – smart thermostats, fire alarms
Transactions – online, internal
Search behavior – search engine queries
Social media – Twitter, Facebook, LinkedIn
Imagery – satellite, thermal, video
…
21. Source Characteristics
Unit of Observation
Location, virtual service, communication network,
individual
Context
Behavior, transaction, environment, statement
Measure characteristics
Measure scale
Measure structure
Accuracy, precision
Frame & Sample characteristics
22. Challenges of Big Data
Analysis Challenges:
Bias, Computation,
Causation, Integration
23. Some Potential Sources of Analysis Error
[Diagram labels: Laws (structures); Parameters λ, β; (generates); Super Population; Target Population; Frame; Selection]
• Selection bias
• Frame uncertainty
• Measurement error
• Unknown measurement semantics
• Non-independence of measures
• Non-independence of samples
• Model uncertainty
• Unknown causal structure
• Shift in measurements, samples, frames
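Selection bias, the first item in the list above, can be made concrete with a short simulation. This is a sketch under assumed numbers: the population distribution, the inclusion rule, and the function name `simulate_selection_bias` are all hypothetical and chosen only for illustration. It compares a "found data" sample, where the chance of being observed rises with the outcome being measured, against a designed simple random sample of the same size.

```python
import random


def simulate_selection_bias(n=100_000, seed=0):
    """Return (population mean, found-data mean, designed-sample mean)."""
    rng = random.Random(seed)

    # Hypothetical population: an outcome y drawn from Normal(50, 10).
    population = [rng.gauss(50, 10) for _ in range(n)]

    # Hypothetical "found data" mechanism: the probability that a unit
    # leaves a trace grows with y, clipped to the interval [0, 1].
    found = [
        y for y in population
        if rng.random() < min(1.0, max(0.0, (y - 20) / 60))
    ]

    # Designed comparison: a simple random sample of the same size.
    designed = rng.sample(population, len(found))

    def mean(values):
        return sum(values) / len(values)

    return mean(population), mean(found), mean(designed)
```

Under these assumptions the found-data mean sits a few points above the population mean no matter how many records are collected, while the designed sample tracks the population closely. Volume alone does not remove the bias.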
24. Challenges of Big Data
Access Challenge:
Data
Repeatability,
Transparency,
Preservation
25. Many Initiatives to Improve Scientific Reliability
Retraction monitoring
Data citation
Clinical trial preregistration
Registered replication
Open data
Badges
26. Some Types of Reproducibility Issues
• Fraud
• Misconduct
• Negligence
• Bit Rot
• Versioning problem
• Replication
• Reproduction
• Extension
• Result Validation
• Fact Checking
• Calibration, Extension, Reuse
• Underreporting
• Data Dredging
• Multiple Comparisons, P-Hacking
• Sensitivity, Robustness
• Reliability
• Generalizability
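The multiple-comparisons item in the list above has a simple arithmetic core, sketched here. The sketch assumes independent tests with no true effects; real analyses often violate independence, so the figures are illustrative only. Both function names are hypothetical.

```python
import random


def family_wise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive across independent null tests."""
    return 1.0 - (1.0 - alpha) ** num_tests


def simulate_false_positive_rate(num_tests, alpha=0.05, trials=20_000, seed=1):
    """Monte Carlo check of the formula above.

    Under the null hypothesis each p-value is uniform on [0, 1], so a
    uniform draw below alpha stands in for a 'significant' result.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if any(rng.random() < alpha for _ in range(num_tests)):
            hits += 1
    return hits / trials
```

With 20 tests at the 0.05 level, `family_wise_error_rate(20)` is roughly 0.64, so a "significant" finding is more likely than not even when nothing is there. Dividing alpha by the number of tests (a Bonferroni correction) brings the rate back below 0.05.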
27. Ensuring Repeatability & Transparency
Theory (Rules, Entities, Concepts)
Algorithms (Protocol, Operationalization)
Implementations (Software, Coding Rules, Instrumentation Design)
Executions (Deployment, House Survey Style, Equipment Setting, Operating System, Hardware, Starting Values, PRNG seeds)
Structure
Formats
Versions/Revisions
Selections
Integrations
Instantiations (copies)
Execution Context (weather, compiler, operating system, system load)
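The execution-level details named above (operating system, hardware, starting values, PRNG seeds) can be captured alongside a result. The following is a minimal sketch, not a provenance standard: the record fields and the names `run_with_provenance`, `toy_analysis`, and `to_json` are hypothetical, and a real workflow would likely record far more (library versions, input data versions, and so on).

```python
import json
import platform
import random


def run_with_provenance(analysis, seed):
    """Run analysis(rng) with a seeded generator and record the context."""
    rng = random.Random(seed)  # explicit seed, so the run can be repeated
    result = analysis(rng)
    return {
        "seed": seed,
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "operating_system": platform.system(),
        "machine": platform.machine(),
        "result": result,
    }


def toy_analysis(rng):
    """Stand-in for a stochastic analysis: the mean of 1000 normal draws."""
    draws = [rng.gauss(0, 1) for _ in range(1000)]
    return sum(draws) / len(draws)


def to_json(record):
    """Serialize the record so it can be stored next to the output."""
    return json.dumps(record, sort_keys=True, indent=2)
```

Two calls with the same seed in the same environment return the same result, which is the repeatability property; the recorded context helps explain any difference when the environment changes.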
28. Challenges of Big Data
Access Challenge:
Data Confidentiality,
Security
29. Durable, Long-Term Access
• Why durable access?
• The rule of law requires maintaining authentic public records
• Scientific advances rely on a cumulative, traceable evidence base
• Art, history, culture require durable access to national heritage
information
• Our nation needs durable access to a strategic information reserve
• Humanity needs durable long-term access to information in order to
communicate to future generations
• Big data challenges to durability
• Velocity – information is updated, sometimes overwritten
• Many sources are commercial/private
– not routinely archived, preserved
• Modeling future value of information
• Maintaining privacy and confidentiality
30. Big data challenges…
Anonymization can completely destroy utility
The “Netflix Problem”: large, sparse datasets that overlap
can be probabilistically linked [Narayanan and Shmatikov 2008]
Observable Behavior Leaves Unique “Fingerprints”
The “GIS” problem: fine geo-spatial-temporal data are impossible
to mask when correlated with external data [Zimmerman 2008]
Big Data can be Rich, Messy & Surprising
The “Facebook Problem”: possible to identify masked
network data if only a few nodes are controlled [Backstrom et al. 2007]
The “Blog Problem”: pseudonymous communication
can be linked through textual analysis [Novak et al. 2004]
Source: [Calabrese 2008; Real Time Rome Project 2007]
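The linkage risk described for the “Netflix Problem” can be sketched with a toy matcher. This is a deliberately simplified illustration and not the algorithm from the cited paper: it counts exact (item, rating) matches between a pseudonymous dataset and a small public one, and links a record only when one candidate clearly wins. All names and data are hypothetical.

```python
def link_records(anonymized, public, min_overlap=2):
    """Link pseudonyms to names by overlap of (item, rating) pairs.

    anonymized: {pseudonym: {item: rating}}
    public:     {name: {item: rating}}
    Returns {pseudonym: name} for unambiguous matches only.
    """
    links = {}
    for pseudonym, ratings in anonymized.items():
        # Score each public profile by how many of its ratings match.
        scores = {
            name: sum(1 for item, r in known.items() if ratings.get(item) == r)
            for name, known in public.items()
        }
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        if not ranked:
            continue
        best_name, best_score = ranked[0]
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        # Require enough overlap and a strict winner to avoid ties.
        if best_score >= min_overlap and best_score > runner_up:
            links[pseudonym] = best_name
    return links


# Hypothetical data: a "masked" ratings release and a few public reviews.
anonymized = {
    "user_17": {"film_a": 5, "film_b": 2, "film_c": 4, "film_d": 1},
    "user_42": {"film_a": 5, "film_e": 3, "film_f": 4},
}
public = {
    "alice": {"film_b": 2, "film_d": 1},
    "bob": {"film_e": 3, "film_f": 4},
}
```

Here `link_records(anonymized, public)` links both pseudonyms from just two matching ratings each, because in a sparse dataset even a small overlap is rarely shared by anyone else.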
31. Little Data in a Big World
The “Favorite Ice Cream” problem
-- public information that is not risky
can help us learn information that is
risky
The “Doesn’t Stay in Vegas”
problem
-- information shared locally can be
found anywhere
The “Unintended Algorithmic
Discrimination” problem
-- algorithms are often not
transparent, and can amplify
human biases
32. Categorizing Challenges
Implementation – Performance
Challenges
Systems challenges
Exceed capacity of locally
managed storage
Location and migration of data
becomes critical for performance
Standard backup, recovery and
data integrity mechanisms
ineffective
Communication bandwidth
Algorithmic Challenges
“in core” vs. “out-of-core”
implementations
O(n^2) vs. O(log n) complexity
Static vs. streaming algorithms
Serial vs. massively parallel
Distributed – shared-nothing
algorithms
Analysis methods – Inferential
Challenges
Sources: Designed vs. “found”
data
Model-based vs. data-based
Causal inference vs.
Descriptive/ predictive
(forecasting) inference
Data Management & Workflow
Provenance
Data quality
Change management
Continuous integration
Accommodating variety –
semantics, quality
Transparency and reproducibility
Privacy
Security
Data Governance and Policy
Standards
Incentives
Certifications
Regulation
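The “static vs. streaming algorithms” contrast under Algorithmic Challenges above can be illustrated with a running mean and variance. This sketch uses Welford's online update, which is one well-known streaming technique and is offered here only as an example; the class name `RunningStats` is hypothetical. A static implementation would hold all values in memory, whereas this one sees each value once and keeps three numbers.

```python
class RunningStats:
    """Single-pass mean and sample variance (Welford's method)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the running summary."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance; 0.0 until at least two values are seen."""
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0


def summarize(stream):
    """Consume any iterable of numbers without storing it."""
    stats = RunningStats()
    for value in stream:
        stats.update(value)
    return stats
```

Because `summarize` accepts any iterable, it works equally on a list, a file read line by line, or a generator over a feed that never fits in memory.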
33. Sources of Big Data for the Social Sciences
Some Open
Questions About
Data Sources
34. Preliminary Observations from First Workshop
Topic:
Sources of Economic Big Data
Use Case:
Commodity Flow Survey
Observations:
Different classes of decisions require different sources of data:
E.g. much designed survey data contributes baseline data for
decisions about infrastructure and strategic planning
Transaction based big data could contribute frequency and granularity of
estimates
In big data, data sources are stakeholders
Businesses need to react quickly and predict the future – and need frequently
updated detailed data
Critical to provide a value proposition to business
Critical to develop a trust relationship
Some Potential sources
ERP and DRP operations data
EDI
Mobile Phone
Traffic Data
35. Some Non-Technical Questions About Sources
● Who are the key stakeholders in a big data source,
and what are the key stakeholder incentives?
○ What key decisions does this information support for
stakeholders? What are the gaps in data from the
stakeholder perspective?
○ What are barriers associated with new sources
of information?
○ Legal barriers
○ Economic barriers
○ Social/trust barriers
36. Sources of Big Data for the Social Sciences
Potential Roles for
Libraries
37. Potential Roles -- Infrastructure
Dissemination
Catalog the range of new statistics/indicators and sources
Selection based on quality
Guide proper use
Durability
Ensure long-term accessibility of big-data
Manage provenance, versioning
Provide transparency of new indicators/statistics
Security & Confidentiality
Libraries could be a trusted and accountable 3rd party
Store and integrate data from multiple sources
Could develop expert implementation of privacy
best practices
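One small building block behind the durability role above (long-term accessibility, provenance, versioning) is a fixity check: record a cryptographic digest when data is deposited and recompute it later to detect silent change. The sketch below is illustrative only and does not describe any particular library's or repository's practice; the function names and the sample bytes are hypothetical.

```python
import hashlib
import io


def sha256_of_stream(stream, chunk_size=1 << 20):
    """Hash a binary stream in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(stream, recorded_digest):
    """True if the stream's current content matches the recorded digest."""
    return sha256_of_stream(stream) == recorded_digest


# In-memory bytes stand in for an archived file in this example.
original = io.BytesIO(b"year,flow\n2015,42\n")
recorded = sha256_of_stream(original)
```

For a file on disk, the same functions can be called on a handle opened with `open(path, "rb")`. Storing the digest with a version identifier gives a simple, checkable link between a published statistic and the exact bytes it was computed from.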
38. Potential Roles - Leadership
Advocacy
Advocate for quality, transparency,
replication, durable access.
Standardization
Develop new methods for big data
management
Identify “best practices” for replication,
transparency, long-term access
Standardize licenses for reuse,
preservation
39. Additional References
● Einav, Liran, and Jonathan Levin. "Economics in the age
of big data." Science 346.6210 (2014): 1243089.
http://www.sciencemag.org/content/346/6210/1243089.short
● Varian, Hal R. "Big data: New tricks for econometrics."
The Journal of Economic Perspectives 28.2 (2014): 3-27.
http://people.ischool.berkeley.edu/~hal/Papers/2013/ml.pdf
● Reimsbach-Kounatze, C. (2015), “The Proliferation of
“Big Data” and Implications for Official Statistics and
Statistical Agencies: A Preliminary Analysis”, OECD
Digital Economy Papers, No. 245, OECD Publishing.
http://dx.doi.org/10.1787/5js7t9wqzvg8-en
● Kriger, David S., et al. Freight Transportation Surveys.
Vol. 410. Transportation Research Board, 2011.
http://www.nap.edu/catalog/13627/nchrp-synthesis-410-freight-transportation-surveys
41. Creative Commons License
This work, Managing Confidential
Information in Research, by Micah Altman
(http://redistricting.info), is licensed under
the Creative Commons Attribution-Share
Alike 3.0 United States License. To view a
copy of this license, visit
http://creativecommons.org/licenses/by-sa/3.0/us/
or send a letter to Creative
Commons, 171 Second Street, Suite 300,
San Francisco, California, 94105, USA.