February 6, 2018
Ron Snyder and Sharon Garewal
Building an LDA topic
model using Wikipedia
ITHAKA is a not-for-profit organization that helps the academic
community use digital technologies to preserve the scholarly record
and to advance research and teaching in sustainable ways.
JSTOR is a not-for-profit
digital library of academic
journals, books, and
primary sources.
Ithaka S+R is a not-for-profit
research and consulting
service that helps academic,
cultural, and publishing
communities thrive in the
digital environment.
Portico is a not-for-profit
preservation service for
digital publications, including
electronic journals, books,
and historical collections.
Artstor provides 2+ million
high-quality images and
digital asset management
software to enhance
scholarship and teaching.
JSTOR Labs works with partner publishers,
libraries and labs to create tools for
researchers, teachers and students that are
immediately useful – and a little bit magical.
Presentation Outline
Text Analyzer – What it is and how it works, including
a short demo
LDA topic models – What they are and how we’re
creating/using them
Topic data curation – Our process, lessons learned,
and future work
Multilingual Text Analysis – An experimental
approach leveraging LDA topic models and Wikipedia
relationships (with short demo)
Text Analyzer
Text Analyzer - beta
Text Analyzer
Analyzes arbitrary text to
find related content in
JSTOR archive
Drag-n-drop
File select
Text Analyzer
• Text Analyzer extracts topics and
named entities from submitted
text to find related/similar
documents in the JSTOR archive
• Topics are based on the terms in
the JSTOR Thesaurus
Text Analyzer
Text is submitted via:
• Direct input
• Copy/paste
• Local file
• Drag and drop from local
computer filesystem or a web URL
• Photo of text, via phone camera
A variety of document types
are supported:
• PDF
• MS-Word
• HTML
• RTF
• Plain text, PowerPoint, and Excel
• Images (on-the-fly OCR is
performed)
Another example
My bookshelf at work
Topics inferred
Using a smartphone photo as input…
Image analysis
My bookshelf at work
Topics inferred
Text Analyzer recommendations
Recommendations are based on a “best fit” of all prioritized
terms (topics and entities) and weights
• The selection of documents in results represents an ‘OR’ of documents
containing one or more terms
• Results ordering is based on a score representing the number of terms
matched, the importance of the term(s) to the document (based on LDA
weight) and the user-specified importance
• A user is able to quickly refine the terms and weights to tailor the results
to a specific need
Values used in relevancy and ranking calculations are available
for inspection
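The ranking described above can be sketched roughly as follows. This is a minimal illustration, not the Text Analyzer formula: the way the three signals are combined (match count times the sum of LDA weight times user weight), the `Term` class, and the sample documents are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Term:
    name: str           # topic or entity label
    user_weight: float  # user-specified importance (hypothetical 0..1 scale)


def score_document(doc_weights, terms):
    """Combine match count, per-document term weight, and user weight.

    doc_weights maps term name -> weight of that term in the document
    (for topics, this would be the LDA weight). The formula is assumed.
    """
    matched = [t for t in terms if t.name in doc_weights]
    if not matched:
        return 0.0
    weighted = sum(doc_weights[t.name] * t.user_weight for t in matched)
    return len(matched) * weighted


def rank(docs, terms):
    """'OR' selection: keep any document matching at least one term, best first."""
    scored = [(doc_id, score_document(w, terms)) for doc_id, w in docs.items()]
    kept = [(doc_id, s) for doc_id, s in scored if s > 0]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Made-up documents and weights, for illustration only.
docs = {
    "doc-a": {"Climate change": 0.6, "Viticulture": 0.3},
    "doc-b": {"Viticulture": 0.8},
    "doc-c": {"Musicology": 0.9},
}
terms = [Term("Climate change", 1.0), Term("Viticulture", 0.5)]
ranking = rank(docs, terms)
```

Here `doc-a` ranks first because it matches both terms, `doc-b` follows with one match, and `doc-c` is dropped because it matches none. Changing a `user_weight` and re-running `rank` mirrors the refine-and-retry loop described above.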
Latent Dirichlet allocation (LDA)
• LDA is one of the most common algorithms for topic modeling
The Latent part of LDA comes into play because in statistics, a variable we
have to infer rather than directly observing is called a "latent variable". We're
only directly observing the words and not the topics, so the topics
themselves are latent variables (along with the distributions themselves).*
• LDA is based on the concept that:
• Every document is a mixture of topics
• Every topic is a mixture of words
• LDA is a mathematical method for estimating both of these at the same
time:
• Finding the mixture of words that is associated with each topic,
• while also determining the mixture of topics that describes each
document
* https://www.quora.com/What-is-a-good-explanation-of-Latent-Dirichlet-Allocation
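The two mixtures above can be made concrete by running the LDA story forwards: draw a topic mixture for a document, then draw each word by first picking a topic and then picking a word from that topic. The sketch below is a toy illustration under assumed values; the vocabulary, the two topics, and the `alpha` prior are invented for the example. Training an LDA model amounts to reversing this process from observed words.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalising Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, vocab, alpha, length, rng):
    """Generate `length` words: choose a topic per word, then a word from it."""
    theta = sample_dirichlet(alpha, rng)  # this document's mixture of topics
    words = []
    for _ in range(length):
        k = rng.choices(range(len(theta)), weights=theta)[0]   # latent topic
        words.append(rng.choices(vocab, weights=topic_word[k])[0])
    return theta, words


rng = random.Random(0)
vocab = ["climate", "warming", "ice", "grape", "wine", "vineyard"]
topic_word = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # toy "climate" topic: a mixture of words
    [0.0, 0.0, 0.0, 0.3, 0.4, 0.3],  # toy "viticulture" topic
]
theta, words = generate_document(topic_word, vocab, alpha=[0.5, 0.5], length=12, rng=rng)
```

Only `words` would be observed in practice; `theta` and the per-word topic choices are the latent variables the algorithm has to infer.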
Latent Dirichlet allocation (LDA)
• Model training
• Can be supervised or unsupervised
• When training is unsupervised, a predefined number of topics is
identified, each represented by word probabilities
• Supervised training uses a tagged corpus in which the tags
serve as topic labels
• For topics we use a subset of the JSTOR Thesaurus
• For the model training documents we’re now using Wikipedia
articles associated with each topic
• Topic inferencing
• Using a trained topic model, ‘latent’ topics in a document can be
inferred using the words in the text
• Topics are expressed as a probability distribution
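Topic inferencing with an already trained model can be sketched as below. This is an assumption-laden illustration rather than the Text Analyzer implementation: it keeps the topic-word table fixed and estimates the document's topic mixture with simple EM "fold-in" updates, whereas LDA toolkits typically use Gibbs sampling or variational inference. The two-topic table is invented.

```python
def infer_topics(tokens, topic_word, iterations=50):
    """Estimate a document's topic mixture with topic-word probabilities fixed.

    topic_word maps topic label -> {word: probability}. Returns a dict of
    topic label -> probability, or {} if no token is known to the model.
    """
    topics = list(topic_word)
    known = [t for t in tokens if any(t in topic_word[k] for k in topics)]
    if not known:
        return {}
    theta = {k: 1.0 / len(topics) for k in topics}  # start from a uniform mixture
    for _ in range(iterations):
        counts = {k: 0.0 for k in topics}
        for w in known:
            # E-step: share of this word attributed to each topic
            joint = {k: theta[k] * topic_word[k].get(w, 0.0) for k in topics}
            norm = sum(joint.values())
            for k in topics:
                counts[k] += joint[k] / norm
        # M-step: expected counts renormalised into a distribution
        theta = {k: counts[k] / len(known) for k in topics}
    return theta


# Toy model: labels and probabilities are made up for the example.
topic_word = {
    "Climatology": {"climate": 0.5, "ice": 0.3, "temperature": 0.2},
    "Viticulture": {"wine": 0.5, "grape": 0.3, "climate": 0.2},
}
theta = infer_topics("climate ice temperature wine".split(), topic_word)
```

The result sums to 1 across topics, which is the "probability distribution" form mentioned above; in this toy case "Climatology" receives the larger share.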
What is an LDA topic?
It’s simple…
OK, the math isn’t so simple but conceptually
a topic is just a set of “word” relationships
climate temperature earth ice warming global weather atmosphere climate_change ocean cycle oscillation carbon greenhouse model scientist
water age ice_age atmospheric event tropical gas pacific heat dioxide carbon_dioxide pattern average region wave surface extreme variation air
classification cold precipitation latitude cooling global_warming land radiation solar science greenhouse_gas rainfall determine condition enso
report emission current dry theory variability force summer polar infrared annual future north range mass climatic feedback atlantic northern
record sea rise scientific natural evidence scale factor planet winter cold_wave glacial air_mass warm climate_model regional cfc
extreme_weather climatology niño interglacial oceanic assessment pollution phase absorption location published america ancient result energy
arrhenius surface_temperature climate_oscillation vegetation seasonal trend moon south shift activity sun assessment_report climate_pattern
humid hurricane fluctuation anomaly methane conclude decadal maritime tree arctic concentration month short infrared_radiation glacier
future_climate monsoon forecasting global_temperature continental milankovitch orbital water_vapor vapor james estimate normal observe
maximum variable pdo heat_wave pacific_ocean climatologist tree_ring arid convince forcings holocene ice_sheet cloud fourier
climate_sensitivity icehouse weather_event wmo southern_oscillation climate_climate solar_variation climate_cycle north_america croll
human_emission icehouse_climate agassiz enso_event climate_index global_climate climate_variability mid_latitude paleoclimatology
thornthwaite köppen current_climate indian_ocean niño_southern_oscillation niño_southern interglacial_period climate_classification
electromagnetic_radiation term_climate ocean_atmosphere wind_shear fossil_fuel ramanathan wetherald manabe keeling absorbing_infrared
james_croll charpentier scientific_opinion buckland level_pressure sea_level_pressure inter_decadal tropical_pacific decadal_oscillation mjo
climate_science ozone_depletion nao extreme_weather_event change_climate climate_proxy east_pacific annual_basis ice_cap
subarctic_climate oceanic_climate humid_subtropical modern_climate climate_zone bergeron polar_ice regular_cycle scientific_literature
lake_bed current_interglacial temperature_fluctuation warm_period shorter_term classification_include climate_force milankovitch_cycle
projected_increase excessive_heat heatwave bioclimatology cfc_focused dioxide_molecule carbon_dioxide_molecule
absorbing_infrared_radiation lovelock_speculate james_lovelock_speculate scientist_james_lovelock_speculate scientist_james_lovelock
british_scientist_james_lovelock british_scientist_james core_drilled ice_core_drilled particulate_pollution aerosol_pollution sea_core
deep_sea_core david_keeling charles_david_keeling charles_david callendar varves högbom infrared_absorption measure_infrared
cycle_lasting venetz perraudin change_climate_change climate_change_climate_change climate_change_climate change_science
climate_change_science century_scientist background_climate bake_crust british_scientist langley james_lovelock cfc_molecule
chlorofluorocarbon_cfc tyndall john_tyndall extreme_event scientist_james hothouse energy_budget teleconnections sst_anomaly
For example, the top “words” associated with the topic Climatology
LDA topics
Climate change Viticulture
An LDA topic can then be thought of as the density of
associated terms in an analyzed text
For example, in this article on climate change and wine
from a recent edition of the JSTOR Daily we see the top
words for 2 topics highlighted
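The "density" idea can be illustrated with a few lines of code. This is a simplification under stated assumptions: it counts single-word matches only (multiword terms such as climate_change are ignored), treats every top word as equally important, and uses short made-up word lists rather than the real topic vocabularies.

```python
import re


def topic_density(text, top_words):
    """Fraction of tokens in `text` that appear among a topic's top words."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in top_words)
    return hits / len(tokens)


# Abbreviated, illustrative word lists.
climate_words = {"climate", "warming", "temperature"}
viticulture_words = {"grape", "wine", "vineyard", "harvest"}

text = "Warming climate shifts the grape harvest and changes wine"
climate_density = topic_density(text, climate_words)          # 2 of 9 tokens
viticulture_density = topic_density(text, viticulture_words)  # 3 of 9 tokens
```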
Named Entity Recognition (NER)
• Entities in a submitted text are identified and available for
document selection
• Persons
• Locations
• Organizations
• Results from multiple entity recognition engines are merged
during analysis
• IBM Alchemy
• OpenCalais (Thomson Reuters)
• OpenNLP (Apache)
• Stanford NER
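Merging results from several engines might look like the sketch below. The slides do not describe the merge rule, so the approach here (normalise the surface text, union the results, and order by how many engines agree) is an assumption, and the engine names and entities are placeholders rather than real API output.

```python
from collections import defaultdict


def merge_entities(engine_results):
    """Merge NER output from several engines.

    engine_results maps engine name -> list of (surface_text, entity_type).
    Entities are keyed on whitespace-normalised, case-folded text plus type.
    """
    votes = defaultdict(set)
    for engine, entities in engine_results.items():
        for text, etype in entities:
            key = (" ".join(text.split()).casefold(), etype)
            votes[key].add(engine)
    merged = [
        {"text": text, "type": etype, "engines": sorted(engines)}
        for (text, etype), engines in votes.items()
    ]
    # Entities found by more engines come first, then alphabetical order.
    merged.sort(key=lambda e: (-len(e["engines"]), e["text"]))
    return merged


# Placeholder engine names and results, for illustration only.
results = {
    "engine_a": [("John Tyndall", "PERSON"), ("Arctic", "LOCATION")],
    "engine_b": [("john  tyndall", "PERSON"), ("WMO", "ORGANIZATION")],
}
merged = merge_entities(results)
```

An entity reported by both engines ends up as a single record listing both sources, which gives a simple agreement signal for prioritising terms.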
Using Wikipedia for LDA
training data
Early versions of LDA topic models were trained with JSTOR
documents using MAIstro indexing terms
This worked fairly well but had some significant
limitations/challenges inhibiting further improvement of the
models and inferencing
• Many tagged articles were only semi-related to the topic
• Documents often contained too many topics
• The JSTOR document text was often too “noisy”
1. OCR errors
2. Running headers/footers
3. Citations and references
Using Wikipedia for LDA
training data
Early experimentation with Wikipedia articles for training data
in mid-summer proved promising
• Performed comparison tests of models built from JSTOR-only,
Wikipedia-only, and hybrid training datasets
Converted to the use of Wikipedia training data for Text
Analyzer in September
• Initially hoped for 100% automated mapping from topic to training docs
• Eventually concluded that some level of manual curation would be
needed
• Training data curation performed in Q4 with JSTOR and Access
Innovations staff using an internal tool (more on that in a bit)
Wikipedia and Wikidata
Wikidata provides rich machine-readable (semantic) data for augmenting
and linking Wikipedia training data
For example: https://www.wikidata.org/wiki/Q52139
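As a rough sketch of how that machine-readable data can link training articles, the function below pulls the per-language Wikipedia article titles out of a Wikidata entity document. Wikidata serves such JSON at the `Special:EntityData` URL shown in the code; the sample record, its `Q_EXAMPLE` identifier, and its titles are made up to mirror the shape of a real response, and no network call is made.

```python
ENTITY_URL = "https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"

# Sitelink keys that end in "wiki" but are not language editions of Wikipedia.
NON_WIKIPEDIA = {"commonswiki", "specieswiki", "metawiki", "wikidatawiki",
                 "mediawikiwiki", "sourceswiki"}


def wikipedia_titles(entity_json, qid):
    """Return {language_code: article_title} from a Wikidata entity document."""
    sitelinks = entity_json["entities"][qid].get("sitelinks", {})
    titles = {}
    for site, link in sitelinks.items():
        # Wikipedia sitelinks are keyed like "enwiki" or "frwiki".
        if site.endswith("wiki") and site not in NON_WIKIPEDIA:
            titles[site[: -len("wiki")]] = link["title"]
    return titles


# Made-up record shaped like a trimmed Special:EntityData response.
sample = {
    "entities": {
        "Q_EXAMPLE": {
            "labels": {"en": {"language": "en", "value": "Example topic"}},
            "sitelinks": {
                "enwiki": {"site": "enwiki", "title": "Example topic"},
                "frwiki": {"site": "frwiki", "title": "Sujet exemple"},
                "commonswiki": {"site": "commonswiki", "title": "Category:Example"},
            },
        }
    }
}
titles = wikipedia_titles(sample, "Q_EXAMPLE")
```

To use real data, fetch `ENTITY_URL.format(qid="Q52139")`, parse the JSON, and pass it in with the same identifier.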
Downloadable Wikipedia data dumps
provide clean text for model training
Uncluttered and error-free
• No HTML markup,
hyperlinks, etc.
• Ideal for text processing
• As a bonus, summary
snippets are easily extracted
• These snippets are not
currently exposed in the
interface but could be used in a
number of ways in the future
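Extracting a summary snippet from clean article text could be as simple as the sketch below. It assumes the article is already plain text with blank lines between paragraphs and that the opening paragraph works as the summary; the function name and the 300-character limit are illustrative choices, not details from the slides.

```python
def summary_snippet(article_text, max_chars=300):
    """Return the first non-empty paragraph, shortened at a word boundary."""
    for para in article_text.split("\n\n"):
        para = " ".join(para.split())  # collapse line breaks and extra spaces
        if para:
            break
    else:
        return ""
    if len(para) <= max_chars:
        return para
    cut = para.rfind(" ", 0, max_chars)
    return para[: cut if cut > 0 else max_chars].rstrip() + "…"


# Made-up article text for the example.
article = (
    "Climatology is the scientific study of climate.\n\n"
    "It is a branch of the atmospheric sciences."
)
snippet = summary_snippet(article)
```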
Compare that with some OCR text
from a typical JSTOR article
Goal: Train a new topic model
Produce a “super set” of terms
Find training articles using
Wikipedia
New model will catch nuances
& more subtle language
Project phases
Spreadsheets, curation tool, thesaurus and Wikipedia
1. Mapping thesaurus terms to Wikipedia categories
2. Identifying Wikipedia training articles for thesaurus terms
3. Whitelisting terms
4. Working in the curation tool
5. Spreadsheet validation
Mapping terms to Wikipedia categories
Research Category page
Article level page
JSTOR Thesaurus
| Wikipedia category | Thesaurus term | Wikipedia category link | Notes |
|---|---|---|---|
| Musical instruments | Musical instruments | https://en.wikipedia.org/wiki/Category:Musical_instruments | |
| Musical notation | Music notation | https://en.wikipedia.org/wiki/Category:Musical_notation | |
| Musical scales | Musical scales | https://en.wikipedia.org/wiki/Category:Musical_scales | |
| Musical theatre | Musical theater | https://en.wikipedia.org/wiki/Category:Musical_theatre | |
| Musical tuning | Musical tuning | https://en.wikipedia.org/wiki/Category:Musical_tuning | |
| Musicians | Musicians | https://en.wikipedia.org/wiki/Category:Musicians | |
| Musicologists | Musicology | https://en.wikipedia.org/wiki/Category:Musicologists | |
| Musicology | Musicology | https://en.wikipedia.org/wiki/Category:Musicology | |
| Mustaali | ENTITY | https://en.wikipedia.org/wiki/Category:Mustaali | named entity |
| Mustard (condiment) | MATCH | https://en.wikipedia.org/wiki/Category:Mustard_(condiment) | Didn't match due to parens; "Mustards" in jthes |
| Muswell Hill | ENTITY | https://en.wikipedia.org/wiki/Category:Muswell_Hill | named entity |
| Mutilation | NO MATCH | https://en.wikipedia.org/wiki/Category:Mutilation | Don't have this term but variations of term |
| Mutineers | Mutiny | https://en.wikipedia.org/wiki/Category:Mutineers | |
| Mutinies | Mutiny | https://en.wikipedia.org/wiki/Category:Mutinies | |
| Mutualism (biology) | MATCH | https://en.wikipedia.org/wiki/Category:Mutualism_(biology) | Didn't match due to parens; "Mutualism" in jthes |
| Mutualism (movement) | NO MATCH | https://en.wikipedia.org/wiki/Category:Mutualism_(movement) | Didn't match due to parens; "Mutualism" within Ecology |
| Mycology | Mycology | https://en.wikipedia.org/wiki/Category:Mycology | |
| Myeloid neoplasia | NO MATCH | https://en.wikipedia.org/wiki/Category:Myeloid_neoplasia | no match in jthes |
| Myoneural junction and neuromuscular diseases | NO MATCH | https://en.wikipedia.org/wiki/Category:Myoneural_junction_and_neuromuscular_diseases | comprised too many concepts |
| Myrmecophagous mammals | NO MATCH | https://en.wikipedia.org/wiki/Category:Myrmecophagous_mammals | |
| MySQL | SQL | https://en.wikipedia.org/wiki/Category:MySQL | |
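Several notes in the spreadsheet say a category "didn't match due to parens". A first automated pass that tolerates those qualifiers could look like the sketch below. It is an illustration only: the normalisation rule (drop a trailing parenthetical, ignore case) is an assumption, it would not resolve plural variants such as "Mustards", and the thesaurus list is a three-term sample. Cases like "Mutualism (movement)", where the bare term belongs to a different subject, would still need the manual review the spreadsheet records.

```python
import re


def normalize(label):
    """Case-fold a label and drop a trailing parenthetical qualifier."""
    label = re.sub(r"\s*\([^)]*\)\s*$", "", label)
    return label.casefold().strip()


def map_categories(categories, thesaurus_terms):
    """Map each Wikipedia category name to a thesaurus term or 'NO MATCH'."""
    index = {normalize(t): t for t in thesaurus_terms}
    return {cat: index.get(normalize(cat), "NO MATCH") for cat in categories}


# Small sample; the real thesaurus is far larger.
categories = ["Musical scales", "Mutualism (biology)", "Myeloid neoplasia"]
thesaurus = ["Musical scales", "Mutualism", "Mycology"]
mapping = map_categories(categories, thesaurus)
```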
Choose 10+ articles
1 week of Labs time
Try to cover first four
levels of hierarchy
Identifying Wikipedia Training articles
The Whitelist
Cut down the list of thesaurus
terms
Used high/low count to help
with assessment
Chose 18k of original 48k
Updated curation tool
Levels 1-4 – Full coverage
Back in the curation tool we decided, for efficiency, to cover all top-level
branches down to the fourth level so that all subjects were covered to the
same depth.
We learned from the Labs week training documents that a better target is 1-5
training docs per term, and that being selective is better than including
documents that may be only tangentially related.
Some terms have only one or two strong documents.
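Selecting every term in levels 1-4 of a hierarchy amounts to a depth-limited walk from the top-level branches. The sketch below shows one way to do that; the `children` mapping and the sample branch are hypothetical and do not reflect the actual thesaurus structure or the curation tool's code.

```python
from collections import deque


def terms_to_depth(children, roots, max_depth=4):
    """Breadth-first walk returning {term: level} for levels 1..max_depth."""
    levels = {}
    queue = deque((root, 1) for root in roots)
    while queue:
        term, level = queue.popleft()
        if term in levels or level > max_depth:
            continue  # already visited, or deeper than the coverage target
        levels[term] = level
        for child in children.get(term, []):
            queue.append((child, level + 1))
    return levels


# Hypothetical branch: parent term -> narrower terms.
children = {
    "Arts": ["Music"],
    "Music": ["Music theory"],
    "Music theory": ["Musical scales"],
    "Musical scales": ["Pentatonic scales"],
}
levels = terms_to_depth(children, roots=["Arts"])
```

With `max_depth=4` the fifth-level term "Pentatonic scales" is left out, so every branch is covered to the same depth.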
Spreadsheet validation
Lessons learned
Challenges
• Size of the thesaurus
• Lack of knowledge of some subject
areas
• Wikipedia-only articles
• Time/Staffing constraints
• Tool glitches
The future
• Coverage of all thesaurus terms
• Other articles outside of Wikipedia
• Integrated as part of our weekly
workflow
• Working with Subject Matter Experts
to choose training documents
Multilingual topic inferencing
LLDA Topic Model
JSTOR Thesaurus
Training docs
16,000 topics
+ 30,000 Wikipedia articles
Topic model training
LLDA Topic Model
JSTOR Thesaurus
Training docs
+ 30,000 Wikipedia articles
English
Arabic (74%)
LLDA Topic Model
Turkish (55%)
Chinese (82%)
Dutch (78%)
French (86%)
German (86%)
Hebrew (63%)
Italian (76%)
Japanese (82%)
Korean (66%)
Polish (74%)
Portuguese (75%)
Russian (81%)
Spanish (84%)
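The slides do not say how the per-language percentages were computed. One plausible reading, assumed here purely for illustration, is the share of English training articles that have a linked article in the other language's Wikipedia. Under that assumption the figure could be derived as in the sketch below; the article names and language sets are invented.

```python
def language_coverage(article_langs, languages):
    """Percentage of articles that have a linked article in each language.

    article_langs maps an English article title -> set of language codes
    for which a linked Wikipedia article exists.
    """
    total = len(article_langs)
    coverage = {}
    for lang in languages:
        linked = sum(1 for langs in article_langs.values() if lang in langs)
        coverage[lang] = round(100 * linked / total) if total else 0
    return coverage


# Invented data: four training articles and their interlanguage links.
article_langs = {
    "Article A": {"fr", "de", "es"},
    "Article B": {"fr", "de"},
    "Article C": {"fr"},
    "Article D": set(),
}
coverage = language_coverage(article_langs, ["fr", "de", "es", "tr"])
# coverage == {"fr": 75, "de": 50, "es": 25, "tr": 0}
```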
Multilingual topic inferencing
Multilingual topic inferencing
Demo…
Multilingual topic inferencing
Multilingual topic inferencing
Thank You

Building an LDA topic model using Wikipedia

  • 1. February 6, 2018 Ron Snyder and Sharon Garewal Building an LDA topic model using Wikipedia
  • 2. ITHAKA is a not-for-profit organization that helps the academic community use digital technologies to preserve the scholarly record and to advance research and teaching in sustainable ways. JSTOR is a not-for-profit digital library of academic journals, books, and primary sources. Ithaka S+R is a not-for-profit research and consulting service that helps academic, cultural, and publishing communities thrive in the digital environment. Portico is a not-for-profit preservation service for digital publications, including electronic journals, books, and historical collections. Artstor provides 2+ million high-quality images and digital asset management software to enhance scholarship and teaching.
  • 3. JSTOR Labs works with partner publishers, libraries and labs to create tools for researchers, teachers and students that are immediately useful – and a little bit magical.
  • 4. Presentation Outline Text Analyzer – What it is and how it works, including a short demo LDA topic models – What they are and how we’re creating/using them Topic data curation – Our process, lessons-learned, and future work Multilingual Text Analysis – An experimental approach leveraging LDA topic models and Wikipedia relationships (with short demo)
  • 7. Text Analyzer Analyzes arbitrary text to find related content in JSTOR archive Drag-n-drop File select
  • 8. Text Analyzer • Text Analyzer extracts topics and named entities from submitted text to find related/similar documents in the JSTOR archive
  • 9. Text Analyzer • Text Analyzer extracts topics and named entities from submitted text to find related/similar document in the JSTOR archive • Topics are based on the terms in the JSTOR Thesaurus
  • 10. Text Analyzer Text is submitted via: • Direct input • Copy/paste • Local file • Drag and drop from local computer filesystem or a web URL • Photo of text, via phone camera A variety of document types are supported: • PDF • MS-Word • HTML • RTF • Plain text, Powerpoint, and Excel • Images (on-the-fly OCR is performed)
  • 11. Another example My bookshelf at work Topics inferred Using a smartphone photo as input…
  • 12. Image analysis My bookshelf at work Topics inferred
  • 13. Text Analyzer recommendations Recommendations are based on a ”best fit” of all prioritized terms (topics and entities) and weights • The selection of documents in results represents an ‘OR’ of documents containing one or more terms • Results ordering is based on a score representing the number of terms matched, the importance of the term(s) to the document (based on LDA weight) and the user-specified importance • A user is able to quickly refine the terms and weights to tailor the results to a specific need Values used in relevancy and ranking calculations are available for inspection
  • 14. Latent dirichlet allocation (LDA) • LDA is one of the most common algorithms for topic modeling The Latent part of LDA comes into play because in statistics, a variable we have to infer rather than directly observing is called a "latent variable". We're only directly observing the words and not the topics, so the topics themselves are latent variables (along with the distributions themselves).* • LDA is based on the concept that: • Every document is a mixture of topics • Every topic is a mixture of words • LDA is a mathematical method for estimating both of these at the same time: • Finding the mixture of words that is associated with each topic, • while also determining the mixture of topics that describes each document * https://www.quora.com/What-is-a-good-explanation-of-Latent-Dirichlet-Allocation
  • 15. Latent dirichlet allocation (LDA) • Model training • Can be supervised or unsupervised • When performed unsupervised a predefined number of topics are identified and are represented by word probabilities • Supervised training involves the use of a tagged corpus where the tags will be used as topic labels • For topics we use a subset of the JSTOR Thesaurus • For the model training documents we’re now using Wikipedia articles associated with each topic • Topic inferencing • Using a trained topic model, ‘latent’ topics in a document can be inferred using the words in the text • Topics are expressed as probability distribution
  • 16. What is an LDA topic? It’s simple…
  • 17. OK, the math isn’t so simple but conceptually a topic is just a set of “word” relationships climate temperature earth ice warming global weather atmosphere climate_change ocean cycle oscillation carbon greenhouse model scientist water age ice_age atmospheric event tropical gas pacific heat dioxide carbon_dioxide pattern average region wave surface extreme variation air classification cold precipitation latitude cooling global_warming land radiation solar science greenhouse_gas rainfall determine condition enso report emission current dry theory variability force summer polar infrared annual future north range mass climatic feedback atlantic northern record sea rise scientific natural evidence scale factor planet winter cold_wave glacial air_mass warm climate_model regional cfc extreme_weather climatology niño interglacial oceanic assessment pollution phase absorption location published america ancient result energy arrhenius surface_temperature climate_oscillation vegetation seasonal trend moon south shift activity sun assessment_report climate_pattern humid hurricane fluctuation anomaly methane conclude decadal maritime tree arctic concentration month short infrared_radiation glacier future_climate monsoon forecasting global_temperature continental milankovitch orbital water_vapor vapor james estimate normal observe maximum variable pdo heat_wave pacific_ocean climatologist tree_ring arid convince forcings holocene ice_sheet cloud fourier climate_sensitivity icehouse weather_event wmo southern_oscillation climate_climate solar_variation climate_cycle north_america croll human_emission icehouse_climate agassiz enso_event climate_index global_climate climate_variability mid_latitude paleoclimatology thornthwaite köppen current_climate indian_ocean niño_southern_oscillation niño_southern interglacial_period climate_classification electromagnetic_radiation term_climate ocean_atmosphere wind_shear fossil_fuel ramanathan wetherald manabe keeling 
absorbing_infrared james_croll charpentier scientific_opinion buckland level_pressure sea_level_pressure inter_decadal tropical_pacific decadal_oscillation mjo climate_science ozone_depletion nao extreme_weather_event change_climate climate_proxy east_pacific annual_basis ice_cap subarctic_climate oceanic_climate humid_subtropical modern_climate climate_zone bergeron polar_ice regular_cycle scientific_literature lake_bed current_interglacial temperature_fluctuation warm_period shorter_term classification_include climate_force milankovitch_cycle projected_increase excessive_heat heatwave bioclimatology cfc_focused dioxide_molecule carbon_dioxide_molecule absorbing_infrared_radiation lovelock_speculate james_lovelock_speculate scientist_james_lovelock_speculate scientist_james_lovelock british_scientist_james_lovelock british_scientist_james core_drilled ice_core_drilled particulate_pollution aerosol_pollution sea_core deep_sea_core david_keeling charles_david_keeling charles_david callendar varves högbom infrared_absorption measure_infrared cycle_lasting venetz perraudin change_climate_change climate_change_climate_change climate_change_climate change_science climate_change_science century_scientist background_climate bake_crust british_scientist langley james_lovelock cfc_molecule chlorofluorocarbon_cfc tyndall john_tyndall extreme_event scientist_james hothouse energy_budget teleconnections sst_anomaly For example, the top “words” associated with the topic Climatology
  • 18. LDA topics: An LDA topic can then be thought of as the density of associated terms in an analyzed text. For example, in this article on climate change and wine from a recent edition of JSTOR Daily, we see the top words for 2 topics (Climate change and Viticulture) highlighted.
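The “density of associated terms” view can be approximated with a simple word-share count. The sketch below is a rough proxy, not LDA inference and not necessarily how Text Analyzer scores topics; the word sets, the sample sentence, and the `topic_density` helper are all assumptions made for illustration.

```python
import re


def topic_density(text, topic_words):
    """Share of tokens in `text` that belong to `topic_words` (a crude proxy)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in topic_words)
    return hits / len(tokens)


# Tiny hypothetical word sets standing in for two topics.
climate_change = {"climate", "warming", "temperature"}
viticulture = {"wine", "grape", "vineyard"}

text = "Warming temperature trends are changing where wine grape vines thrive"
for name, words in [("Climate change", climate_change), ("Viticulture", viticulture)]:
    print(name, topic_density(text, words))
```

A real model would weight each term by its probability under the topic rather than counting every match equally.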
  • 19. Named Entity Recognition (NER) • Entities in a submitted text are identified and made available for document selection: • Persons • Locations • Organizations • Results from multiple entity recognition engines are merged during analysis: • IBM Alchemy • OpenCalais (Thomson Reuters) • OpenNLP (Apache) • Stanford NER
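The slides do not say how the engine results are merged. One plausible approach, shown here purely as an assumption, is to union the entities and record which engines agree on each one. The engine names and sample entities are placeholders, and no real NER API is called.

```python
from collections import defaultdict


def merge_entities(engine_results):
    """Merge (text, type) pairs from several engines.

    Returns a dict mapping a normalized (text, type) key to the set of
    engine names that reported it, so agreement can be used for ranking.
    """
    merged = defaultdict(set)
    for engine, entities in engine_results.items():
        for text, entity_type in entities:
            key = (text.strip().lower(), entity_type)
            merged[key].add(engine)
    return dict(merged)


# Placeholder output from two hypothetical engines.
results = {
    "engine_a": [("JSTOR", "Organization"), ("Ann Arbor", "Location")],
    "engine_b": [("jstor", "Organization")],
}
merged = merge_entities(results)
for (text, entity_type), engines in merged.items():
    print(text, entity_type, sorted(engines))
```

Keeping the set of reporting engines makes it easy to prefer entities found by more than one engine.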
  • 20. Using Wikipedia for LDA training data: Early versions of the LDA topic models were trained with JSTOR documents using MAIstro indexing terms. This worked fairly well but had some significant limitations that inhibited further improvement of the models and inferencing: • Many tagged articles were only semi-related to the topic • Documents often contained too many topics • The JSTOR document text was often too “noisy”: 1. OCR errors 2. Running headers/footers 3. Citations and references
  • 21. Using Wikipedia for LDA training data: Early experimentation with Wikipedia articles as training data in mid-summer proved promising. • Performed comparison tests of models built from JSTOR-only, Wikipedia-only, and hybrid training datasets. Converted to the use of Wikipedia training data for Text Analyzer in September. • Initially hoped for 100% automated mapping from topic to training docs • Eventually concluded that some level of manual curation would be needed • Training data curation was performed in Q4 by JSTOR and Access Innovations staff using an internal tool (more on that in a bit)
  • 22. Wikipedia and Wikidata: Wikidata provides rich machine-readable (semantic) data for augmenting and linking Wikipedia training data. For example: https://www.wikidata.org/wiki/Q52139
  • 23. Downloadable Wikipedia data dumps provide clean text for model training. Uncluttered and error-free: • No HTML markup, hyperlinks, etc. • Ideal for text processing • As a bonus, summary snippets are easily extracted • These snippets are not currently exposed in the interface but could be used in a number of ways in the future
  • 24. Compare that with some OCR text from a typical JSTOR article
  • 25. Goal: Train a new topic model • Produce a “super set” of terms • Find training articles using Wikipedia • New model will catch nuances and more subtle language
  • 26. Project phases (spreadsheets, curation tool, thesaurus and Wikipedia): 1. Mapping thesaurus terms to Wikipedia categories 2. Identifying Wikipedia training articles for thesaurus terms 3. Whitelisting terms 4. Working in the curation tool 5. Spreadsheet validation
  • 27. Mapping terms to Wikipedia categories
  • 28. Research Category page Article level page JSTOR Thesaurus
  • 29. Wikipedia category | Thesaurus term | Wikipedia category link | Notes
    Musical instruments | Musical instruments | https://en.wikipedia.org/wiki/Category:Musical_instruments |
    Musical notation | Music notation | https://en.wikipedia.org/wiki/Category:Musical_notation |
    Musical scales | Musical scales | https://en.wikipedia.org/wiki/Category:Musical_scales |
    Musical theatre | Musical theater | https://en.wikipedia.org/wiki/Category:Musical_theatre |
    Musical tuning | Musical tuning | https://en.wikipedia.org/wiki/Category:Musical_tuning |
    Musicians | Musicians | https://en.wikipedia.org/wiki/Category:Musicians |
    Musicologists | Musicology | https://en.wikipedia.org/wiki/Category:Musicologists |
    Musicology | Musicology | https://en.wikipedia.org/wiki/Category:Musicology |
    Mustaali | ENTITY | https://en.wikipedia.org/wiki/Category:Mustaali | named entity
    Mustard (condiment) | MATCH | https://en.wikipedia.org/wiki/Category:Mustard_(condiment) | Didn't match due to parens; "Mustards" in jthes
    Muswell Hill | ENTITY | https://en.wikipedia.org/wiki/Category:Muswell_Hill | named entity
    Mutilation | NO MATCH | https://en.wikipedia.org/wiki/Category:Mutilation | Don't have this term but variations of term
    Mutineers | Mutiny | https://en.wikipedia.org/wiki/Category:Mutineers |
    Mutinies | Mutiny | https://en.wikipedia.org/wiki/Category:Mutinies |
    Mutualism (biology) | MATCH | https://en.wikipedia.org/wiki/Category:Mutualism_(biology) | Didn't match due to parens; "Mutualism" in jthes
    Mutualism (movement) | NO MATCH | https://en.wikipedia.org/wiki/Category:Mutualism_(movement) | Didn't match due to parens; "Mutualism" within Ecology
    Mycology | Mycology | https://en.wikipedia.org/wiki/Category:Mycology |
    Myeloid neoplasia | NO MATCH | https://en.wikipedia.org/wiki/Category:Myeloid_neoplasia | no match in jthes
    Myoneural junction and neuromuscular diseases | NO MATCH | https://en.wikipedia.org/wiki/Category:Myoneural_junction_and_neuromuscular_diseases | comprised too many concepts
    Myrmecophagous mammals | NO MATCH | https://en.wikipedia.org/wiki/Category:Myrmecophagous_mammals |
    MySQL | SQL | https://en.wikipedia.org/wiki/Category:MySQL |
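Several rows in the mapping spreadsheet note that a category “didn't match due to parens”. A minimal sketch of name matching that tolerates a trailing parenthetical and a simple plural difference follows. The normalization rules and the `match_category` helper are assumptions for illustration, not the process the team actually used.

```python
import re


def normalize(name):
    """Lowercase a name and drop a trailing parenthetical qualifier."""
    return re.sub(r"\s*\([^)]*\)\s*$", "", name).strip().lower()


def match_category(category, thesaurus_terms):
    """Return the thesaurus term matching a Wikipedia category name, or None."""
    index = {normalize(term): term for term in thesaurus_terms}
    key = normalize(category)
    if key in index:
        return index[key]
    # Naive singular/plural fallback, e.g. "mustard" vs. "mustards".
    if key + "s" in index:
        return index[key + "s"]
    if key.endswith("s") and key[:-1] in index:
        return index[key[:-1]]
    return None


# A few terms taken from the spreadsheet excerpt.
thesaurus = ["Musical instruments", "Mustards", "Mutualism", "Mycology"]
for category in ["Mustard (condiment)", "Mutualism (biology)", "Myeloid neoplasia"]:
    print(category, "->", match_category(category, thesaurus))
```

Stripping the qualifier also makes "Mutualism (movement)" match "Mutualism", which the spreadsheet marks as NO MATCH because the thesaurus term sits within Ecology. Automatic matches of this kind still need the manual review described in the slides.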
  • 30. Identifying Wikipedia training articles • Choose 10+ articles • 1 week of Labs time • Try to cover the first four levels of the hierarchy
  • 31. The whitelist • Cut down the list of thesaurus terms • Used high/low count to help with assessment • Chose 18k of the original 48k
  • 32. Updated curation tool: Levels 1-4, full coverage. Back in the curation tool we decided, for efficiency, to do all top-level branches down to the 4th level so that all subjects were covered to the same depth. We learned from the Labs-week training documents that a better target is 1-5 training docs per term, and that being selective is better than including documents that may be only tangentially related. Some terms have only one or two strong documents.
  • 34. Lessons learned. Challenges: • Size of the thesaurus • Lack of knowledge of some subject areas • Wikipedia-only articles • Time/staffing constraints • Tool glitches. The future: • Coverage of all thesaurus terms • Other articles outside of Wikipedia • Integrated as part of our weekly workflow • Working with subject matter experts to choose training documents
  • 36. Topic model training: JSTOR Thesaurus (16,000 topics) + training docs (30,000 Wikipedia articles) → LLDA topic model
  • 37. Multilingual topic inferencing: JSTOR Thesaurus + training docs (30,000 Wikipedia articles) → LLDA topic model (English) and LLDA topic model for: Arabic (74%), Chinese (82%), Dutch (78%), French (86%), German (86%), Hebrew (63%), Italian (76%), Japanese (82%), Korean (66%), Polish (74%), Portuguese (75%), Russian (81%), Spanish (84%), Turkish (55%)

Editor's notes

  1. The spreadsheet is 3 levels deep (44k categories); running it against the PT/NPTs (preferred/non-preferred terms) gave 5k matches.
  2. Remove all named entities; check all acronyms.
  3. In the end we matched over 360 additional Wikipedia categories to jthes (JSTOR Thesaurus) terms; we also ended up adding over 200 PT/NPTs.
  4. Added additional contracted help from AI (Access Innovations) for the month of December. The Term column is linked to the curation tool.