Presentation given at the Data Harmony User Group 2018 meeting about creating training data for JSTOR's new Text Analyzer, a tool that lets users upload a document, have it automatically analyzed, and find relevant content in JSTOR. Using JSTOR Thesaurus terms, the team identified and reviewed Wikipedia articles to be used as training data for a topic model.
1. February 6, 2018
Ron Snyder and Sharon Garewal
Building an LDA topic
model using Wikipedia
2. ITHAKA is a not-for-profit organization that helps the academic
community use digital technologies to preserve the scholarly record
and to advance research and teaching in sustainable ways.
JSTOR is a not-for-profit
digital library of academic
journals, books, and
primary sources.
Ithaka S+R is a not-for-profit
research and consulting
service that helps academic,
cultural, and publishing
communities thrive in the
digital environment.
Portico is a not-for-profit
preservation service for
digital publications, including
electronic journals, books,
and historical collections.
Artstor provides 2+ million
high-quality images and
digital asset management
software to enhance
scholarship and teaching.
3. JSTOR Labs works with partner publishers,
libraries and labs to create tools for
researchers, teachers and students that are
immediately useful – and a little bit magical.
4. Presentation Outline
Text Analyzer – What it is and how it works, including
a short demo
LDA topic models – What they are and how we’re
creating/using them
Topic data curation – Our process, lessons learned,
and future work
Multilingual Text Analysis – An experimental
approach leveraging LDA topic models and Wikipedia
relationships (with short demo)
9. Text Analyzer • Text Analyzer extracts topics and
named entities from submitted
text to find related/similar
documents in the JSTOR archive
• Topics are based on the terms in
the JSTOR Thesaurus
10. Text Analyzer
Text is submitted via:
• Direct input
• Copy/paste
• Local file
• Drag and drop from local
computer filesystem or a web URL
• Photo of text, via phone camera
A variety of document types
are supported:
• PDF
• MS-Word
• HTML
• RTF
• Plain text, PowerPoint, and Excel
• Images (on-the-fly OCR is
performed)
13. Text Analyzer recommendations
Recommendations are based on a “best fit” of all prioritized
terms (topics and entities) and weights
• The selection of documents in results represents an ‘OR’ of documents
containing one or more terms
• Results ordering is based on a score representing the number of terms
matched, the importance of the term(s) to the document (based on LDA
weight) and the user-specified importance
• A user is able to quickly refine the terms and weights to tailor the results
to a specific need
Values used in relevancy and ranking calculations are available
for inspection
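The selection and ordering rules above can be sketched in a few lines of Python. This is only an illustration, not the Text Analyzer implementation: the `Term` class, the function names, and the way the three factors are combined (sum of LDA weight times user weight, multiplied by the number of matched terms) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Term:
    name: str
    user_weight: float  # user-specified importance (hypothetical 0..1 scale)


def matched_terms(doc_weights, terms):
    """Return the prioritized terms that occur in the document."""
    return [t for t in terms if doc_weights.get(t.name, 0.0) > 0.0]


def score(doc_weights, terms):
    """Combine term count, per-document weight, and user weight (assumed formula)."""
    matched = matched_terms(doc_weights, terms)
    weighted = sum(doc_weights[t.name] * t.user_weight for t in matched)
    return weighted * len(matched)


def recommend(docs, terms):
    """'OR' selection: keep documents matching at least one term, best score first."""
    hits = [
        (doc_id, score(weights, terms))
        for doc_id, weights in docs.items()
        if matched_terms(weights, terms)
    ]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

Changing a `user_weight` and calling `recommend` again mirrors the refinement step described above: the same documents are selected, but their order shifts.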
14. Latent Dirichlet allocation (LDA)
• LDA is one of the most common algorithms for topic modeling
The Latent part of LDA comes into play because in statistics, a variable we
have to infer rather than directly observing is called a "latent variable". We're
only directly observing the words and not the topics, so the topics
themselves are latent variables (along with the distributions themselves).*
• LDA is based on the concept that:
• Every document is a mixture of topics
• Every topic is a mixture of words
• LDA is a mathematical method for estimating both of these at the same
time:
• Finding the mixture of words that is associated with each topic,
• while also determining the mixture of topics that describes each
document
* https://www.quora.com/What-is-a-good-explanation-of-Latent-Dirichlet-Allocation
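The two mixtures can be written compactly. The notation below is the conventional one for LDA and does not appear on the slides: $\theta_d$ is the topic mixture of document $d$, $\phi_k$ is the word mixture of topic $k$, $K$ is the number of topics, and $\alpha$, $\beta$ are the Dirichlet prior parameters.

```latex
\theta_d \sim \mathrm{Dirichlet}(\alpha), \qquad \phi_k \sim \mathrm{Dirichlet}(\beta)

p(w \mid d) = \sum_{k=1}^{K} \theta_{d,k}\, \phi_{k,w}
```

Training estimates $\theta$ and $\phi$ together from the observed words; neither is observed directly, which is why both are latent.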
15. Latent Dirichlet allocation (LDA)
• Model training
• Can be supervised or unsupervised
• When performed unsupervised, a predefined number of topics is
identified, each represented by word probabilities
• Supervised training uses a tagged corpus whose tags serve as
topic labels
• For topics we use a subset of the JSTOR Thesaurus
• For the model training documents we’re now using Wikipedia
articles associated with each topic
• Topic inferencing
• Using a trained topic model, ‘latent’ topics in a document can be
inferred using the words in the text
• Topics are expressed as a probability distribution
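As a rough illustration of topic inferencing, the sketch below estimates a document's topic mixture while holding the topic-word probabilities fixed. It uses a simple expectation-maximization "fold-in" without a Dirichlet prior, which is a simplification of LDA inference and not necessarily what Text Analyzer does. The function name, the `topic_word` structure, and the probabilities in the usage example are invented for illustration.

```python
from collections import Counter


def infer_topics(tokens, topic_word, iterations=50):
    """Estimate a topic mixture for one document.

    topic_word maps each topic label to a dict of word -> probability
    (taken as given, as if produced by an already trained model).
    """
    topics = list(topic_word)
    known = set().union(*(topic_word[k].keys() for k in topics))
    counts = Counter(t for t in tokens if t in known)
    theta = {k: 1.0 / len(topics) for k in topics}  # start from a uniform mixture
    if not counts:
        return theta
    for _ in range(iterations):
        new_theta = dict.fromkeys(topics, 0.0)
        for word, n in counts.items():
            # E-step: share each word occurrence among topics.
            joint = {k: theta[k] * topic_word[k].get(word, 0.0) for k in topics}
            norm = sum(joint.values())
            if norm == 0.0:
                continue
            for k in topics:
                new_theta[k] += n * joint[k] / norm
        total = sum(new_theta.values())
        if total == 0.0:
            break
        # M-step: renormalize so the mixture sums to 1.
        theta = {k: v / total for k, v in new_theta.items()}
    return theta
```

For example, with two toy topics that share no words, a document with three climate words and one wine word comes out as three quarters Climatology and one quarter Viticulture:

```python
topic_word = {
    "Climatology": {"climate": 0.6, "warming": 0.4},
    "Viticulture": {"wine": 0.7, "grape": 0.3},
}
theta = infer_topics("climate warming climate wine".split(), topic_word)
```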
17. OK, the math isn’t so simple but conceptually
a topic is just a set of “word” relationships
climate temperature earth ice warming global weather atmosphere climate_change ocean cycle oscillation carbon greenhouse model scientist
water age ice_age atmospheric event tropical gas pacific heat dioxide carbon_dioxide pattern average region wave surface extreme variation air
classification cold precipitation latitude cooling global_warming land radiation solar science greenhouse_gas rainfall determine condition enso
report emission current dry theory variability force summer polar infrared annual future north range mass climatic feedback atlantic northern
record sea rise scientific natural evidence scale factor planet winter cold_wave glacial air_mass warm climate_model regional cfc
extreme_weather climatology niño interglacial oceanic assessment pollution phase absorption location published america ancient result energy
arrhenius surface_temperature climate_oscillation vegetation seasonal trend moon south shift activity sun assessment_report climate_pattern
humid hurricane fluctuation anomaly methane conclude decadal maritime tree arctic concentration month short infrared_radiation glacier
future_climate monsoon forecasting global_temperature continental milankovitch orbital water_vapor vapor james estimate normal observe
maximum variable pdo heat_wave pacific_ocean climatologist tree_ring arid convince forcings holocene ice_sheet cloud fourier
climate_sensitivity icehouse weather_event wmo southern_oscillation climate_climate solar_variation climate_cycle north_america croll
human_emission icehouse_climate agassiz enso_event climate_index global_climate climate_variability mid_latitude paleoclimatology
thornthwaite köppen current_climate indian_ocean niño_southern_oscillation niño_southern interglacial_period climate_classification
electromagnetic_radiation term_climate ocean_atmosphere wind_shear fossil_fuel ramanathan wetherald manabe keeling absorbing_infrared
james_croll charpentier scientific_opinion buckland level_pressure sea_level_pressure inter_decadal tropical_pacific decadal_oscillation mjo
climate_science ozone_depletion nao extreme_weather_event change_climate climate_proxy east_pacific annual_basis ice_cap
subarctic_climate oceanic_climate humid_subtropical modern_climate climate_zone bergeron polar_ice regular_cycle scientific_literature
lake_bed current_interglacial temperature_fluctuation warm_period shorter_term classification_include climate_force milankovitch_cycle
projected_increase excessive_heat heatwave bioclimatology cfc_focused dioxide_molecule carbon_dioxide_molecule
absorbing_infrared_radiation lovelock_speculate james_lovelock_speculate scientist_james_lovelock_speculate scientist_james_lovelock
british_scientist_james_lovelock british_scientist_james core_drilled ice_core_drilled particulate_pollution aerosol_pollution sea_core
deep_sea_core david_keeling charles_david_keeling charles_david callendar varves högbom infrared_absorption measure_infrared
cycle_lasting venetz perraudin change_climate_change climate_change_climate_change climate_change_climate change_science
climate_change_science century_scientist background_climate bake_crust british_scientist langley james_lovelock cfc_molecule
chlorofluorocarbon_cfc tyndall john_tyndall extreme_event scientist_james hothouse energy_budget teleconnections sst_anomaly
For example, the top “words” associated with the topic Climatology
18. LDA topics
Climate change Viticulture
An LDA topic can then be thought of as the density of
associated terms in an analyzed text
For example, in this article on climate change and wine
from a recent edition of the JSTOR Daily we see the top
words for 2 topics highlighted
19. Named Entity Recognition (NER)
• Entities in a submitted text are identified and available for
document selection
• Persons
• Locations
• Organizations
• Results from multiple entity recognition engines are merged
during analysis
• IBM Alchemy
• OpenCalais (Thomson Reuters)
• OpenNLP (Apache)
• Stanford NER
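One plausible way to merge results from several engines is to normalize each entity's text, group by text and type, and record which engines agreed. The sketch below assumes each engine's output has already been reduced to `(text, type)` pairs. The function name, the type labels, and the ranking by number of agreeing engines are assumptions for illustration and do not describe the engines' real APIs or the actual merge logic.

```python
from collections import defaultdict


def merge_entities(engine_results):
    """Merge {engine_name: [(text, type), ...]} into one de-duplicated list."""
    engines_by_key = defaultdict(set)
    display = {}
    for engine, entities in engine_results.items():
        for text, entity_type in entities:
            clean = " ".join(text.split())          # collapse stray whitespace
            key = (clean.casefold(), entity_type)   # case-insensitive grouping
            engines_by_key[key].add(engine)
            display.setdefault(key, clean)          # keep the first spelling seen
    rows = [
        {"text": display[key], "type": key[1], "engines": sorted(found_by)}
        for key, found_by in engines_by_key.items()
    ]
    # Entities reported by more engines come first, then alphabetical order.
    return sorted(rows, key=lambda row: (-len(row["engines"]), row["text"]))
```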
20. Using Wikipedia for LDA
training data
Early versions of LDA topic models were trained with JSTOR
documents using MAIstro indexing terms
This worked fairly well but had some significant
limitations/challenges inhibiting further improvement of the
models and inferencing
• Many tagged articles were only semi-related to the topic
• Documents often contained too many topics
• The JSTOR document text was often too “noisy”
1. OCR errors
2. Running headers/footers
3. Citations and references
21. Using Wikipedia for LDA
training data
Early experimentation with Wikipedia articles for training data
in mid-summer proved promising
• Performed comparison tests of models built from JSTOR-only,
Wikipedia-only, and hybrid training datasets
Converted to the use of Wikipedia training data for Text
Analyzer in September
• Initially hoped for 100% automated mapping from topic to training docs
• Eventually concluded that some level of manual curation would be
needed
• Training data curation performed in Q4 by JSTOR and Access
Innovations staff using an internal tool (more on that in a bit)
22. Wikipedia and Wikidata
Wikidata provides rich machine-readable (semantic) data for augmenting
and linking Wikipedia training data
For example: https://www.wikidata.org/wiki/Q52139
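A small sketch of how such an entity record can be used for linking: Wikidata serves each entity as JSON through its `Special:EntityData` endpoint, and the record's `sitelinks` section lists the matching Wikipedia article title per language edition. The helper names below are hypothetical, and the sample record in the usage example is a made-up stand-in, not the real content of Q52139.

```python
def entity_data_url(qid):
    """URL of the machine-readable JSON record for a Wikidata entity."""
    return f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"


def wikipedia_titles(entity_data, qid, languages):
    """Pick article titles for the requested language editions from the sitelinks."""
    sitelinks = entity_data["entities"][qid].get("sitelinks", {})
    return {
        lang: sitelinks[f"{lang}wiki"]["title"]
        for lang in languages
        if f"{lang}wiki" in sitelinks
    }
```

Usage with a stand-in record (fetching the real one requires a network call, which is left out here):

```python
sample = {"entities": {"Q52139": {"sitelinks": {
    "enwiki": {"site": "enwiki", "title": "Example article"},
    "frwiki": {"site": "frwiki", "title": "Article d'exemple"},
}}}}
titles = wikipedia_titles(sample, "Q52139", ["en", "fr", "de"])
```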
23. Downloadable Wikipedia data dumps
provide clean text for model training
Uncluttered and error-free
• No HTML markup,
hyperlinks, etc
• Ideal for text processing
• As a bonus, summary
snippets are easily extracted
• These snippets are not
currently exposed in the
interface but could be used in a
number of ways in the future
25. Goal: Train a new topic model
Produce a “super set” of terms
Find training articles using
Wikipedia
New model will catch nuances
& more subtle language
26. Project phases
Spreadsheets, curation tool, thesaurus and Wikipedia
1. Mapping thesaurus terms to Wikipedia categories
2. Identifying Wikipedia training articles for thesaurus terms
3. Whitelisting terms
4. Working in the curation tool
5. Spreadsheet validation
29. Wikipedia category | Thesaurus term | Wikipedia category link | Notes
Musical instruments | Musical instruments | https://en.wikipedia.org/wiki/Category:Musical_instruments |
Musical notation | Music notation | https://en.wikipedia.org/wiki/Category:Musical_notation |
Musical scales | Musical scales | https://en.wikipedia.org/wiki/Category:Musical_scales |
Musical theatre | Musical theater | https://en.wikipedia.org/wiki/Category:Musical_theatre |
Musical tuning | Musical tuning | https://en.wikipedia.org/wiki/Category:Musical_tuning |
Musicians | Musicians | https://en.wikipedia.org/wiki/Category:Musicians |
Musicologists | Musicology | https://en.wikipedia.org/wiki/Category:Musicologists |
Musicology | Musicology | https://en.wikipedia.org/wiki/Category:Musicology |
Mustaali | ENTITY | https://en.wikipedia.org/wiki/Category:Mustaali | named entity
Mustard (condiment) | MATCH | https://en.wikipedia.org/wiki/Category:Mustard_(condiment) | Didn't match due to parens; "Mustards" in jthes
Muswell Hill | ENTITY | https://en.wikipedia.org/wiki/Category:Muswell_Hill | named entity
Mutilation | NO MATCH | https://en.wikipedia.org/wiki/Category:Mutilation | Don't have this term but variations of term
Mutineers | Mutiny | https://en.wikipedia.org/wiki/Category:Mutineers |
Mutinies | Mutiny | https://en.wikipedia.org/wiki/Category:Mutinies |
Mutualism (biology) | MATCH | https://en.wikipedia.org/wiki/Category:Mutualism_(biology) | Didn't match due to parens; "Mutualism" in jthes
Mutualism (movement) | NO MATCH | https://en.wikipedia.org/wiki/Category:Mutualism_(movement) | Didn't match due to parens; "Mutualism" within Ecology
Mycology | Mycology | https://en.wikipedia.org/wiki/Category:Mycology |
Myeloid neoplasia | NO MATCH | https://en.wikipedia.org/wiki/Category:Myeloid_neoplasia | no match in jthes
Myoneural junction and neuromuscular diseases | NO MATCH | https://en.wikipedia.org/wiki/Category:Myoneural_junction_and_neuromuscular_diseases | comprised too many concepts
Myrmecophagous mammals | NO MATCH | https://en.wikipedia.org/wiki/Category:Myrmecophagous_mammals |
MySQL | SQL | https://en.wikipedia.org/wiki/Category:MySQL |
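The "didn't match due to parens" notes suggest a first automated pass that strips a trailing parenthetical qualifier before comparing a category name with the thesaurus. The sketch below is one way to do that, not the process actually used. The function name and output fields are hypothetical. Matches found only after stripping are flagged for review because, as the two Mutualism rows show, the same base name can be a true match in one sense and not in another.

```python
import re

# Trailing qualifier such as " (biology)" or " (condiment)".
QUALIFIER = re.compile(r"\s*\([^)]*\)\s*$")


def candidate_matches(categories, thesaurus_terms):
    """Propose a thesaurus term for each Wikipedia category name."""
    index = {term.casefold(): term for term in thesaurus_terms}
    rows = []
    for category in categories:
        base = QUALIFIER.sub("", category)
        term = index.get(base.casefold())
        rows.append({
            "category": category,
            "term": term,  # None means no candidate was found
            # A match that needed the qualifier removed should be checked by a person.
            "needs_review": term is not None and base != category,
        })
    return rows
```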
30. Identifying Wikipedia training articles
Choose 10+ articles
1 week of Labs time
Try to cover first four
levels of hierarchy
31. The Whitelist
Cut down the list of thesaurus
terms
Used high/low count to help
with assessment
Chose 18k of original 48k
32. Updated curation tool
Levels 1-4 – Full coverage
Back in the curation tool we decided, for efficiency, to cover all top-level
branches down to the fourth level so that all subjects were covered to the same
depth.
From the Labs week training documents we learned that a better target is 1-5
training docs per term, and that being selective is better than including
articles that may be only tangentially related.
Some terms only have one or two strong documents.
34. Lessons learned
Challenges
• Size of the thesaurus
• Lack of knowledge of some subject
areas
• Wikipedia-only articles
• Time/Staffing constraints
• Tool glitches
The future
• Coverage of all thesaurus terms
• Other articles outside of Wikipedia
• Integrated as part of our weekly
workflow
• Working with Subject Matter Experts
to choose training documents
36. LLDA Topic Model
JSTOR Thesaurus
Training docs
16,000 topics
+ 30,000 Wikipedia articles
Topic model training
37. LLDA Topic Model
JSTOR Thesaurus
Training docs
+ 30,000 Wikipedia articles
English
Arabic (74%)
LLDA Topic Model
Turkish (55%)
Chinese (82%)
Dutch (78%)
French (86%)
German (86%)
Hebrew (63%)
Italian (76%)
Japanese (82%)
Korean (66%)
Polish (74%)
Portuguese (75%)
Russian (81%)
Spanish (84%)
Multilingual topic inferencing