Presented during the WKWSCI Symposium 2014
21 March 2014
Marina Bay Sands Expo and Convention Centre
Organized by the Wee Kim Wee School of Communication and Information at Nanyang Technological University
Words and More Words: Challenges of Big Data by Prof. Edie Rasmussen
1. Words and More Words:
Challenges of Big (Text) Data
Edie Rasmussen
Visiting Professor, Nanyang Technological University
Professor, University of British Columbia
WKWSCI SYMPOSIUM 2014
Big Data, Big Ideas for Smarter Communities
2. Outline
• The Rise of Big Text Data
• Challenges for Text Data
• Research Opportunities
– Counting and Culturomics
– Extracting Meaning from Text
3. The Rise of Big Text Data
• Before there was Big Data, there were large
bibliographic databases:
– Dialog: ~180 scholarly databases
– Lexis/Nexis: 5 billion documents (business/law/news)
– Citation Indexes: > 40 million records
• IR techniques designed for rapid access to very
large (text) databases
• Swanson: “Undiscovered public knowledge”
(1987)
4. Current Text Sources
• Digitized Legacy Materials
– Google Books, Hathi Trust (11 million volumes, 500 TB)
• The Web
• Search Logs (over 2 million queries per minute)
• Wikipedia (~4.5 million English articles)
• Blogs (The Blogosphere)
• Twitter (The Twitterverse)
• Test Collections
– Smaller
– Experimentally more robust
5. Challenges of Text
• Legacy Text/Digitization Costs
• Quality (OCR Errors; Metadata Errors)
• Availability (Access, Copyright, Privacy)
• Reliability
– Algorithmic dependencies
– Creator trustworthiness
• Authorship Issues (Identification, Authority)
• Lack of Structure
• Lack of Context
• Ambiguity of human language
• Breadth vs. Depth
6. Processing Text
• Tokenizing, stopping, stemming
• Statistics of text: term values (tf*idf)
• “Bag of Words” approach
• Other evidence: network structures
• Similarity calculations
• Creating ranked lists
• Note: Probabilistic rather than Deterministic
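The processing steps on this slide can be sketched end to end in a few lines. The following is a minimal illustration, not the speaker's implementation: the stop list, the crude suffix-stripping `stem` function, the `log(N/df)` form of idf, and the choice of cosine similarity are all assumptions made for the example. Production systems would use a fuller stop list and a real stemmer such as Porter's.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop list (assumption for this sketch).
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}


def stem(token):
    """Crude suffix stripping for illustration only (not the Porter algorithm)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stopwords, then stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOPWORDS]


def tfidf_vectors(docs):
    """Build one bag-of-words {term: tf*idf} dict per document."""
    bags = [Counter(preprocess(d)) for d in docs]
    n_docs = len(bags)
    doc_freq = Counter(term for bag in bags for term in bag)
    idf = {t: math.log(n_docs / doc_freq[t]) for t in doc_freq}
    vectors = [{t: tf * idf[t] for t, tf in bag.items()} for bag in bags]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, docs):
    """Return (score, doc_index) pairs, best match first."""
    vectors, idf = tfidf_vectors(docs)
    q_bag = Counter(preprocess(query))
    q_vec = {t: tf * idf.get(t, 0.0) for t, tf in q_bag.items()}
    scores = [(cosine(q_vec, v), i) for i, v in enumerate(vectors)]
    return sorted(scores, key=lambda s: (-s[0], s[1]))
```

Calling `rank("stock market", docs)` on a small list of strings returns every document with a score, which reflects the slide's closing note: the output is a ranked list of likely matches rather than a yes/no answer.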
7. Counting and the Rise of Culturomics
• “Culturomics is the application of high-
throughput data collection and analysis to the
study of human culture”
• Database of >5 million digitized books (~4% of all books ever published)
• Michel et al. (Science, 2011): “Quantitative
analysis of culture using millions of digitized
books”
• Google’s N-Gram Viewer
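The counting behind an n-gram frequency plot can be sketched as follows. This is an illustrative assumption about the general approach (the share of a year's n-grams that match a phrase), not a description of Google's actual pipeline. The function names and the `(year, text)` corpus format are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token sequences as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def relative_frequency_by_year(corpus, phrase):
    """corpus: iterable of (year, text) pairs (hypothetical format).

    Returns {year: fraction of that year's n-grams equal to phrase},
    where n is the number of words in the phrase.
    """
    target = phrase.lower()
    n = len(target.split())
    hits = Counter()
    totals = Counter()
    for year, text in corpus:
        grams = ngrams(text.lower().split(), n)
        totals[year] += len(grams)
        hits[year] += grams.count(target)
    # Years with no n-grams of this length are omitted to avoid dividing by zero.
    return {y: hits[y] / totals[y] for y in sorted(totals) if totals[y]}
```

Normalizing by the yearly total matters because the number of books published grows over time; raw counts alone would make almost every word appear to rise.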
8. Using the N-Gram Viewer
[Figure: N-Gram Viewer plot of the terms typhoid, gout, HIV, and cholera, 1800 to 2000]
9. How Far Will Counting Take Us?
• Many limitations (e.g. incomplete data set)
• Some surprisingly sophisticated analyses:
– Size of English lexicon
– Change in word usage (irregular verbs) over time
– Cultural turnover (inventions)
– The nature (duration) of fame
– Patterns of censorship (“suppression index”)
10. Critiques of Culturomics
• “The death of theory”
• “…second-rate scholars will use the Google
Books corpus to churn out gigabytes of
uninformative graphs and insignificant
conclusions.” (Nunberg, 2011)
• Books as a representation of human history
• A “time sink”
11. Social Media as Big Data
• ‘Internet Minute’
– 320+ new Twitter accounts
– 100,000 new Tweets
– 2+ million search queries
– 6 new Wikipedia articles
– 30 hours of video uploaded
(Source: Intel
http://www.intel.com/content/www/us/en/communications/internet-minute-infographic.html)
12. TM: Topic Detection and Tracking
• Tracking a story line over time
• News wire input, identify new story, find
subsequent instances
• Story segmentation, First story detection,
Clustering of like stories
• Interesting to news, business, security analysts
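First story detection and clustering of like stories can be illustrated with a simple online scheme: compare each incoming story with the stories seen so far, and start a new cluster when nothing is similar enough. The sketch below assumes Jaccard similarity over word sets and a hypothetical threshold of 0.3. Both are choices made for the example, and real TDT systems typically use weighted term vectors and tuned thresholds.

```python
import re


def terms(text):
    """Lowercased set of alphabetic tokens in a story."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def track_stories(stream, threshold=0.3):
    """Process stories in arrival order.

    Returns one (cluster_id, is_first_story) pair per story. The
    threshold value is an assumption for this sketch.
    """
    clusters = []  # each cluster is a list of term sets
    labels = []
    for text in stream:
        current = terms(text)
        best_id, best_sim = None, 0.0
        for cluster_id, members in enumerate(clusters):
            sim = max(jaccard(current, m) for m in members)
            if sim > best_sim:
                best_id, best_sim = cluster_id, sim
        if best_id is None or best_sim < threshold:
            # Nothing similar enough: this is the first story on a new topic.
            clusters.append([current])
            labels.append((len(clusters) - 1, True))
        else:
            clusters[best_id].append(current)
            labels.append((best_id, False))
    return labels
```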
13. TM: Sentiment Analysis/Opinion Mining
• Rich data from Blogs and Tweets
• Basically a classification problem (SVM, Naïve
Bayes, etc.) -> positive, negative, neutral
• Involves Entity Extraction, NLP, sentiment
vocabularies
• Of interest to government and businesses
• See Stanford SA of movie reviews:
http://nlp.stanford.edu:8080/sentiment/rntnDemo.html
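To make the "classification problem" framing concrete, here is a minimal multinomial Naïve Bayes classifier with add-one smoothing, written with the standard library only. It is a sketch of the general technique and not the Stanford system linked above. The class name, the tokenizer, and any training sentences passed to it are assumptions for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercased word tokens; apostrophes are kept inside words."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus log likelihood of each known word.
            score = math.log(n_docs / total_docs)
            for word in tokenize(text):
                if word in self.vocab:  # words never seen in training are skipped
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

A model is trained with `NaiveBayes().fit(texts, labels)`, where `labels` holds values such as "positive", "negative", and "neutral", and then applied with `predict`. Working in log space avoids numeric underflow when many small probabilities are multiplied.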
14. TM: Trends and Predictions
• Can Tweets and Search Logs be used to
predict the future?
• Google Flu Trends, Google Dengue Trends
– Correlated with Search Terms
• Network analysis on Tweets on Arab Spring
• Assessing tone of global news data to predict
national stability, location of terrorists, etc.
(Leetaru)
• Predicting opinions (recommender systems)
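The phrase "correlated with search terms" can be illustrated with a Pearson correlation between a weekly search-volume series and a weekly case-count series. This is only a sketch of the underlying statistic. It is not how Google Flu Trends was built, and any series passed to it in an example should be treated as made-up numbers. The function name is hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (spread_x * spread_y)
```

A value near 1 means the two series rise and fall together. It does not by itself show that searches predict cases, which is one reason the slide poses prediction as a question.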
15. TM: Question Answering
• Combines multiple sources of evidence:
– Question type identification
– Information retrieval of candidate text
– Natural language processing
– Entity extraction
– Hypothesis generation and scoring (confidence)
– Ranking hypotheses
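The stages listed above can be wired together in a toy pipeline. The sketch below is a heavily simplified assumption of how such evidence might be combined: question type comes from the first word, entity extraction is two regular expressions, and the confidence score is plain word overlap between question and passage. All function names are hypothetical, and real question answering systems use far richer components at every stage.

```python
import re


def question_type(question):
    """Map the question word to an expected answer type (toy rule set)."""
    q = question.lower()
    if q.startswith("when"):
        return "DATE"
    if q.startswith("who"):
        return "PERSON"
    return "OTHER"


def extract_candidates(passage, qtype):
    """Very rough entity extraction: four-digit years or capitalized name pairs."""
    if qtype == "DATE":
        return re.findall(r"\b(?:1[0-9]{3}|20[0-9]{2})\b", passage)
    if qtype == "PERSON":
        return re.findall(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b", passage)
    return []


def overlap(question, passage):
    """Fraction of question words that also appear in the passage."""
    q = set(re.findall(r"[a-z]+", question.lower()))
    p = set(re.findall(r"[a-z]+", passage.lower()))
    return len(q & p) / len(q) if q else 0.0


def answer(question, passages):
    """Generate candidate answers, score them, and return them ranked."""
    qtype = question_type(question)
    scores = {}
    for passage in passages:
        support = overlap(question, passage)
        for candidate in extract_candidates(passage, qtype):
            # Evidence from several passages accumulates for the same candidate.
            scores[candidate] = scores.get(candidate, 0.0) + support
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```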
17. Structuring Research:
“Digging Into Data” Program
• Addresses: “how "big data" changes the research
landscape for the humanities and social sciences”
• 3 rounds of international research funding
• Canada, US, UK, plus Netherlands
• Team approach: scholars, scientists, information
professionals
• Requires international teams; funding from at
least two countries
• Wide range of datasets made available
• http://www.diggingintodata.org/