Words and More Words: Challenges of Big Data by Prof. Edie Rasmussen

Words and More Words:
Challenges of Big (Text) Data
Edie Rasmussen
Visiting Professor, Nanyang Technological University
Professor, University of British Columbia
WKWSCI Symposium 2014
Big Data, Big Ideas for Smarter Communities

Outline
• The Rise of Big Text Data
• Challenges for Text Data
• Research Opportunities
– Counting and Culturomics
– Extracting Meaning from Text

The Rise of Big Text Data
• Before there was Big Data, there were large
bibliographic databases:
– Dialog: ~180 scholarly databases
– Lexis/Nexis: 5 billion documents (business/law/news)
– Citation Indexes: > 40 million records
• IR techniques designed for rapid access to very
large (text) databases
• Swanson: “Undiscovered public knowledge”
(1987)

Current Text Sources
• Digitized Legacy Materials
– Google Books, Hathi Trust (11 million volumes, 500 TB)
• The Web
• Search Logs (over 2 million queries per minute)
• Wikipedia (~4.5 million English articles)
• Blogs (The Blogosphere)
• Twitter (The Twitterverse)
• Test Collections
– Smaller
– Experimentally more robust

Challenges of Text
• Legacy Text/Digitization Costs
• Quality (OCR Errors; Metadata Errors)
• Availability (Access, Copyright, Privacy)
• Reliability
– Algorithmic dependencies
– Creator trustworthiness
• Authorship Issues (Identification, Authority)
• Lack of Structure
• Lack of Context
• Ambiguity of human language
• Breadth vs. Depth

Processing Text
• Tokenizing, stopping, stemming
• Statistics of text: term values (tf*idf)
• “Bag of Words” approach
• Other evidence: network structures
• Similarity calculations
• Creating ranked lists
• Note: Probabilistic rather than Deterministic
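The processing steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the stop list, the smoothed idf formula, the function names, and the three toy documents are all assumptions made for the example, and stemming is left out for brevity.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for"}

def tokenize(text):
    """Lowercase, split on non-letters, and drop stop words (no stemming)."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def tfidf_vectors(docs):
    """Return one {term: tf*idf} bag-of-words vector per document, plus idf."""
    bags = [Counter(tokenize(d)) for d in docs]
    n = len(bags)
    # Document frequency: number of documents containing each term.
    df = Counter(term for bag in bags for term in bag)
    # One common smoothed idf variant; other formulas are equally valid.
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    return [{t: tf * idf[t] for t, tf in bag.items()} for bag in bags], idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0

def rank(query, docs):
    """Return (score, document index) pairs, best match first."""
    vectors, idf = tfidf_vectors(docs)
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokenize(query)).items()}
    return sorted(((cosine(q, v), i) for i, v in enumerate(vectors)), reverse=True)

# Toy documents invented for this example.
DOCS = [
    "Cholera outbreaks in the nineteenth century",
    "Digitized books and the study of culture",
    "Search logs and the prediction of flu trends",
]
ranking = rank("digitized books", DOCS)
```

The output is a ranked list of scores rather than a yes/no match, which is the sense in which retrieval is probabilistic rather than deterministic.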

Counting and the Rise of Culturomics
• “Culturomics is the application of high-
throughput data collection and analysis to the
study of human culture”
• Database of >5 million digitized books (~4% of all books ever published)
• Michel et al. (Science, 2011): “Quantitative
analysis of culture using millions of digitized
books”
• Google’s N-Gram Viewer
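The counting behind an n-gram trend line can be sketched as follows. This is an illustrative toy, not the method used by the N-Gram Viewer itself: the function names, the whitespace tokenization, and the three-sentence corpus are assumptions made for the example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-word sequences in a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def yearly_frequency(corpus, phrase):
    """corpus: iterable of (year, text) pairs.

    Returns {year: occurrences of phrase / total n-grams in that year}.
    """
    n = len(phrase.split())
    hits = Counter()
    totals = Counter()
    for year, text in corpus:
        grams = ngrams(text.lower().split(), n)
        totals[year] += len(grams)
        hits[year] += grams.count(phrase.lower())
    return {y: hits[y] / totals[y] for y in sorted(totals) if totals[y]}

# Made-up toy corpus; real data would be millions of dated volumes.
CORPUS = [
    (1850, "cholera spread through the city"),
    (1850, "the cholera epidemic"),
    (1990, "research on hiv continues"),
]
freq = yearly_frequency(CORPUS, "cholera")
```

Dividing by the total number of n-grams per year matters because the number of books per year varies widely; raw counts alone would mostly reflect corpus size.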

Using the N-Gram Viewer
[Figure: N-Gram Viewer plot of the terms typhoid, gout, HIV, and cholera, 1800 to 2000]

How Far Will Counting Take Us?
• Many limitations (e.g. incomplete data set)
• Some surprisingly sophisticated analyses:
– Size of English lexicon
– Change in word usage (irregular verbs) over time
– Cultural turnover (inventions)
– The nature (duration) of fame
– Patterns of censorship (“suppression index”)

Critiques of Culturomics
• “The death of theory”
• “…second-rate scholars will use the Google
Books corpus to churn out gigabytes of
uninformative graphs and insignificant
conclusions.” (Nunberg, 2011)
• Books as a representation of human history
• A “time sink”

Social Media as Big Data
• ‘Internet Minute’
– 320+ new Twitter accounts
– 100,000 new Tweets
– 2+ million search queries
– 6 new Wikipedia articles
– 30 hours of video uploaded
(Source: Intel, http://www.intel.com/content/www/us/en/communications/internet-minute-infographic.html)

TM: Topic Detection and Tracking
• Tracking a story line over time
• News wire input, identify new story, find
subsequent instances
• Story segmentation, First story detection,
Clustering of like stories
• Interesting to news, business, security analysts
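First story detection and clustering of like stories can be illustrated with a very small sketch. The Jaccard word-overlap measure, the 0.3 threshold, the function names, and the sample headlines are all assumptions chosen for the example; real TDT systems use richer features and tuned thresholds.

```python
def jaccard(a, b):
    """Word-set overlap between two stories (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def track(stories, threshold=0.3):
    """Assign each incoming story to an existing story line or start a new one.

    Returns one story-line index per input story, in arrival order.
    """
    lines = []   # each story line is a list of token sets
    labels = []
    for text in stories:
        tokens = set(text.lower().split())
        best, best_sim = None, 0.0
        for idx, members in enumerate(lines):
            sim = max(jaccard(tokens, m) for m in members)
            if sim > best_sim:
                best, best_sim = idx, sim
        if best is None or best_sim < threshold:
            # Nothing similar enough seen before: this is a "first story".
            lines.append([tokens])
            labels.append(len(lines) - 1)
        else:
            lines[best].append(tokens)
            labels.append(best)
    return labels

# Invented headlines, in arrival order.
STORIES = [
    "earthquake strikes coastal city",
    "election results announced today",
    "coastal city earthquake death toll rises",
]
labels = track(STORIES)
```

Here the third headline joins the story line opened by the first, while the second starts a line of its own.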

TM: Sentiment Analysis/Opinion Mining
• Rich data from Blogs and Tweets
• Basically a classification problem (SVM, Naïve
Bayes, etc.) -> positive, negative, neutral
• Involves Entity Extraction, NLP, sentiment
vocabularies
• Of interest to government and businesses
• See Stanford sentiment analysis (SA) of movie reviews:
http://nlp.stanford.edu:8080/sentiment/rntnDemo.html
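To make the "classification problem" framing concrete, here is a minimal multinomial Naïve Bayes sketch with add-one smoothing. It is not the Stanford model linked above; the function names and the five training sentences are invented for illustration, and a usable classifier would need far more data plus the entity extraction and sentiment vocabularies mentioned in the bullets.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (text, label) pairs. Returns the counts NB needs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab

def classify(model, text):
    """Return the label with the highest log posterior for the text."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.lower().split():
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training data for the three classes named above.
TRAIN = [
    ("a wonderful moving film", "positive"),
    ("great acting and a great story", "positive"),
    ("dull plot and terrible acting", "negative"),
    ("a boring terrible film", "negative"),
    ("the film opens in paris", "neutral"),
]
model = train(TRAIN)
```

For example, `classify(model, "great story")` selects the positive class and `classify(model, "terrible and dull")` the negative one on this toy data.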

TM: Trends and Predictions
• Can Tweets and Search Logs be used to
predict the future?
• Google Flu Trends, Google Dengue Trends
– Correlated with Search Terms
• Network analysis on Tweets on Arab Spring
• Assessing tone of global news data to predict
national stability, location of terrorists, etc.
(Leetaru)
• Predicting opinions (recommender systems)
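The "correlated with search terms" idea can be illustrated with a plain Pearson correlation between a query-volume series and a case-count series. The weekly numbers below are invented for the example and the variable names are hypothetical; this is not how Google Flu Trends was actually built, and a high correlation on past data is not by itself evidence of predictive power.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up weekly figures: searches for a symptom term vs. reported cases.
query_volume = [120, 150, 310, 480, 400, 220]
reported_cases = [10, 14, 30, 51, 44, 20]
r = pearson(query_volume, reported_cases)
```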

TM: Question Answering
• Combines multiple sources of evidence:
– Question type identification
– Information retrieval of candidate text
– Natural language processing
– Entity extraction
– Hypothesis generation and scoring (confidence)
– Ranking hypotheses
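The last two steps, scoring hypotheses with a confidence and ranking them, might look like the sketch below. Everything here is hypothetical: the evidence names, the weights, the bias, and the candidate answers are placeholders, and a real system would learn the weights from training data rather than hard-code them.

```python
import math

# Hypothetical evidence weights; a real system would learn these from data.
WEIGHTS = {"type_match": 2.0, "retrieval_score": 1.5, "entity_support": 1.0}
BIAS = -2.0

def confidence(evidence):
    """Combine evidence scores (each in 0..1) into a 0..1 confidence
    using a logistic function over a weighted sum."""
    z = BIAS + sum(WEIGHTS[k] * evidence.get(k, 0.0) for k in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))

def rank_hypotheses(candidates):
    """candidates: {answer: evidence dict}. Returns (confidence, answer)
    pairs, most confident first."""
    return sorted(
        ((confidence(ev), answer) for answer, ev in candidates.items()),
        reverse=True,
    )

# Placeholder candidate answers with invented evidence scores.
CANDIDATES = {
    "candidate_a": {"type_match": 0.0, "retrieval_score": 0.6, "entity_support": 0.4},
    "candidate_b": {"type_match": 1.0, "retrieval_score": 0.8, "entity_support": 0.9},
}
ranked = rank_hypotheses(CANDIDATES)
```

A system built this way can decline to answer when even the top-ranked hypothesis falls below a chosen confidence cutoff.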

[Images: Hans Peter Luhn, 1952; Watson, 2011]

Structuring Research:
“Digging Into Data” Program
• Addresses: “how "big data" changes the research
landscape for the humanities and social sciences”
• 3 rounds of international research funding
• Canada, US, UK, plus Netherlands
• Team approach: scholars, scientists, information
professionals
• Requires international teams; funding from at
least two countries
• Wide range of datasets made available
• http://www.diggingintodata.org/


Thank you!
