Words and More Words:
Challenges of Big (Text) Data
Edie Rasmussen
Visiting Professor, Nanyang Technological University
Professor, University of British Columbia
WKWSCI SYMPOSIUM 2014
Big Data, Big Ideas for Smarter Communities
Outline
• The Rise of Big Text Data
• Challenges for Text Data
• Research Opportunities
– Counting and Culturomics
– Extracting Meaning from Text
The Rise of Big Text Data
• Before there was Big Data, there were large
bibliographic databases:
– Dialog: ~180 scholarly databases
– Lexis/Nexis: 5 billion documents (business/law/news)
– Citation Indexes: > 40 million records
• IR techniques designed for rapid access to very
large (text) databases
• Swanson: “Undiscovered public knowledge”
(1987)
Current Text Sources
• Digitized Legacy Materials
– Google Books, HathiTrust (11 million volumes, 500 TB)
• The Web
• Search Logs (over 2 million queries per minute)
• Wikipedia (~4.5 million English articles)
• Blogs (The Blogosphere)
• Twitter (The Twitterverse)
• Test Collections
– Smaller
– Experimentally more robust
Challenges of Text
• Legacy Text/Digitization Costs
• Quality (OCR Errors; Metadata Errors)
• Availability (Access, Copyright, Privacy)
• Reliability
– Algorithmic dependencies
– Creator trustworthiness
• Authorship Issues (Identification, Authority)
• Lack of Structure
• Lack of Context
• Ambiguity of human language
• Breadth vs. Depth
Processing Text
• Tokenizing, stopping, stemming
• Statistics of text: term values (tf*idf)
• “Bag of Words” approach
• Other evidence: network structures
• Similarity calculations
• Creating ranked lists
• Note: Probabilistic rather than Deterministic
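The steps on this slide can be sketched end to end in a few lines. This is a minimal illustration, not a production pipeline: the stop list, the suffix-stripping "stemmer", and the smoothed idf formula (log(N/df) + 1) are all assumptions chosen for brevity, and the function names are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical stop list and suffix rules, far cruder than a real stemmer.
STOP_WORDS = {"a", "an", "and", "of", "the", "to", "in", "for", "is"}
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase, split on non-letters, drop stop words, strip one suffix."""
    tokens = []
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in STOP_WORDS:
            continue
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                word = word[: -len(suffix)]
                break
        tokens.append(word)
    return tokens


def tfidf_vectors(docs):
    """Return one {term: tf*idf} dict per document, plus the idf table."""
    bags = [Counter(tokenize(d)) for d in docs]  # bag of words: order is lost
    n = len(bags)
    df = Counter(term for bag in bags for term in bag)
    idf = {term: math.log(n / count) + 1.0 for term, count in df.items()}
    return [{t: tf * idf[t] for t, tf in bag.items()} for bag in bags], idf


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, docs):
    """Return (score, doc_index) pairs, best match first."""
    vectors, idf = tfidf_vectors(docs)
    q_bag = Counter(tokenize(query))
    q_vec = {t: tf * idf.get(t, 0.0) for t, tf in q_bag.items()}
    scored = [(cosine(q_vec, vec), i) for i, vec in enumerate(vectors)]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))
```

Calling rank("digitizing books", docs) returns every document with a score rather than a yes/no match, which is the sense in which the result is a ranked list and not a deterministic answer.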
Counting and the Rise of Culturomics
• “Culturomics is the application of high-
throughput data collection and analysis to the
study of human culture”
• Database of >5 million digitized books (~4% of all books ever printed)
• Michel et al. (Science, 2011): “Quantitative
analysis of culture using millions of digitized
books”
• Google’s N-Gram Viewer
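The counting behind an n-gram trend line can be sketched as follows. This assumes a toy corpus given as (year, text) pairs and whitespace tokenization; neither reflects the actual layout or preprocessing of the Google Books data, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Yield each run of n consecutive tokens as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))


def yearly_frequency(corpus, phrase):
    """Relative frequency of `phrase` per year.

    `corpus` is an iterable of (year, text) pairs; this layout is an
    assumption for illustration, not the format of the Google Books data.
    """
    target = tuple(phrase.lower().split())
    n = len(target)
    hits = defaultdict(int)
    totals = defaultdict(int)
    for year, text in corpus:
        tokens = text.lower().split()
        counts = Counter(ngrams(tokens, n))
        hits[year] += counts[target]           # occurrences of the phrase
        totals[year] += sum(counts.values())   # all n-grams seen that year
    return {y: hits[y] / totals[y] for y in sorted(totals) if totals[y]}
```

Dividing by the total number of n-grams per year matters because the number of books printed differs greatly between years; raw counts alone would mostly track the size of the corpus.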
Using the N-Gram Viewer
[Figure: N-Gram Viewer plot of the terms typhoid, gout, cholera, and HIV; x-axis 1800 to 2000]
How Far Will Counting Take Us?
• Many limitations (e.g. incomplete data set)
• Some surprisingly sophisticated analyses:
– Size of English lexicon
– Change in word usage (irregular verbs) over time
– Cultural turnover (inventions)
– The nature (duration) of fame
– Patterns of censorship (“suppression index”)
Critiques of Culturomics
• “The death of theory”
• “…second-rate scholars will use the Google
Books corpus to churn out gigabytes of
uninformative graphs and insignificant
conclusions.” (Nunberg, 2011)
• Books as a representation of human history
• A “time sink”
Social Media as Big Data
• ‘Internet Minute’
– 320+ new Twitter accounts
– 100,000 new Tweets
– 2+ million search queries
– 6 new Wikipedia articles
– 30 hours of video uploaded
(Source: Intel
http://www.intel.com/content/www/us/en/communications/internet-minute-infographic.html)
TM: Topic Detection and Tracking
• Tracking a story line over time
• News wire input, identify new story, find
subsequent instances
• Story segmentation, First story detection,
Clustering of like stories
• Interesting to news, business, security analysts
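First story detection and clustering of like stories can be sketched as a single pass over the incoming stream. This is one simple approach under stated assumptions: raw term counts as the story representation, cosine similarity as the comparison, and an arbitrary threshold of 0.3. The function names are hypothetical, and real systems would add story segmentation and term weighting.

```python
import math
from collections import Counter


def bag(text):
    """Bag-of-words representation using whitespace tokens."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity of two Counters."""
    dot = sum(c * v[t] for t, c in u.items())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


def track_stories(stream, threshold=0.3):
    """Assign each incoming story to a cluster id, in arrival order.

    A story whose best similarity to every existing cluster is below
    `threshold` is treated as a first story and opens a new cluster.
    The threshold value is an arbitrary assumption.
    """
    clusters = []      # one aggregated bag of words per cluster
    assignments = []
    for text in stream:
        vec = bag(text)
        scores = [cosine(vec, c) for c in clusters]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        if best is None or scores[best] < threshold:
            clusters.append(Counter(vec))      # first story of a new topic
            assignments.append(len(clusters) - 1)
        else:
            clusters[best].update(vec)         # follow-up to a known topic
            assignments.append(best)
    return assignments
```

Stories that share a cluster id form one story line over time; the first member of each cluster is the detected first story.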
TM: Sentiment Analysis/Opinion Mining
• Rich data from Blogs and Tweets
• Basically a classification problem (SVM, Naïve
Bayes, etc.) -> positive, negative, neutral
• Involves Entity Extraction, NLP, sentiment
vocabularies
• Of interest to government and businesses
• See Stanford SA of movie reviews:
http://nlp.stanford.edu:8080/sentiment/rntnDemo.html
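To make the "classification problem" framing concrete, here is a minimal multinomial Naive Bayes classifier with add-one smoothing. It is a sketch only: the class name and whitespace tokenization are assumptions, there is no entity extraction or sentiment vocabulary, and it is unrelated to the Stanford demo linked above.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)   # label -> word frequencies
        self.class_counts = Counter(labels)       # label -> number of documents
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in text.lower().split():
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Trained on labeled examples for each of the three classes, predict returns whichever label has the highest posterior score for the words in a new blog post or Tweet.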
TM: Trends and Predictions
• Can Tweets and Search Logs be used to
predict the future?
• Google Flu Trends, Google Dengue Trends
– Correlated with Search Terms
• Network analysis on Tweets on Arab Spring
• Assessing tone of global news data to predict
national stability, location of terrorists, etc.
(Leetaru)
• Predicting opinions (recommender systems)
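The "correlated with search terms" idea can be illustrated with a Pearson correlation between weekly query volume and reported cases. The numbers below are made up for illustration, and this is not a description of how Google Flu Trends was actually computed.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up weekly counts, for illustration only.
flu_queries = [120, 150, 310, 480, 460, 220]
reported_cases = [10, 14, 30, 52, 47, 19]
r = pearson(flu_queries, reported_cases)
```

A value of r close to 1 says only that the two series moved together in the past; it does not by itself show that query volume will predict future cases.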
TM: Question Answering
• Combines multiple sources of evidence:
– Question type identification
– Information retrieval of candidate text
– Natural language processing
– Entity extraction
– Hypothesis generation and scoring (confidence)
– Ranking hypotheses
[Images: Hans Peter Luhn, 1952; Watson, 2011]
Structuring Research:
“Digging Into Data” Program
• Addresses: “how "big data" changes the research
landscape for the humanities and social sciences”
• 3 rounds of international research funding
• Canada, US, UK, plus Netherlands
• Team approach: scholars, scientists, information
professionals
• Requires international teams; funding from at
least two countries
• Wide range of datasets made available
• http://www.diggingintodata.org/
Thank you!
