This document summarizes research on using semantic vectors to map journals in a large digital article library onto a semantic vector space. Key points:
1) The researchers used semantic vectors to map over 2,000 journals from a corpus of over 5 million articles onto a 2D semantic space, generating a visualization of journal relationships.
2) They were able to scale this approach to the large corpus size, with the semantic vector index building in under an hour.
3) The resulting visualization grouped journals by discipline reasonably well and identified some cataloging errors, showing potential for interactive discovery tools.
4) Future work proposed exploring precision/recall, non-metric multidimensional scaling, and projecting articles onto the
Semantic Journal Mapping for Search Visualization
1. Semantic Journal Mapping for Search Visualization
in a Large Scale Article Digital Library
Glen Newton (1,2), Alison Callahan (1), Michel Dumontier (2)
(1) National Research Council Canada, (2) Carleton University
Second Workshop on Very Large Digital Libraries (VLDL) 2009
at ECDL 2009
Oct 2 2009 Corfu, Greece
2. Outline
• Maps of Science
• Background
• Research Interests
• Research Goals
• Process
• Scalability issues
• Environment
• Results
• Conclusions
• Future Work
6. Background
• Canada Institute for Scientific and Technical Information (CISTI):
Canada's national science library
• ~3000 active researchers at NRC
• Large collection of ~8.4m articles (full text + metadata) in
science, technology, medicine (STM)
• 4100 journal titles
• ~1995 to 2009
7. Research Interests
• Domain-specific discovery
• Improved discovery in STM domains through results visualization
and contextualization, browse/explore/refine
• Results set visualization: “mapping”
8. Research Goals
• Find a way to extract a journal (& article) semantic vector space
• Latent Semantic Analysis (LSA) works for small/medium sized
corpora, but does not scale to large numbers of items and/or terms
• New alternative: Semantic Vectors (SV): uses random vectors &
avoids expensive singular value decomposition (SVD)
• Can SV scale & generate sensible semantic vector space of
journals on corpus of this size?
• Can the visualization produced be useful for results query
visualization, refinement, discovery?
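To make the "random vectors instead of SVD" idea concrete, here is a minimal random-indexing sketch in Python with NumPy. It is a simplified illustration of the general technique, not the Semantic Vectors package's implementation: the function names, the sparse +1/-1 vector scheme, the default of 10 non-zero entries, and the toy count matrix are all assumptions made for the example. Only the 512 dimensions come from the slides.

```python
import numpy as np

def random_index_vectors(n_terms, dim=512, n_nonzero=10, seed=0):
    """Give every term a sparse random vector: a few +1/-1 entries, zeros elsewhere."""
    rng = np.random.default_rng(seed)
    vectors = np.zeros((n_terms, dim))
    for t in range(n_terms):
        positions = rng.choice(dim, size=n_nonzero, replace=False)
        vectors[t, positions] = rng.choice([-1.0, 1.0], size=n_nonzero)
    return vectors

def document_vectors(term_doc_counts, term_vectors):
    """Each document vector is the count-weighted sum of its terms' random vectors."""
    doc_vecs = term_doc_counts.T @ term_vectors  # shape: (n_docs, dim)
    norms = np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return doc_vecs / np.where(norms == 0, 1.0, norms)  # unit length, safe for empty docs

# Toy term-by-document count matrix (4 terms x 3 documents), invented for illustration.
# Documents 0 and 1 use the same terms in the same proportions; document 2 shares none.
counts = np.array([[2, 4, 0],
                   [1, 2, 0],
                   [0, 0, 3],
                   [0, 0, 1]], dtype=float)
docs = document_vectors(counts, random_index_vectors(n_terms=4))
similarity = docs @ docs.T  # cosine similarity, since rows are unit length
```

No matrix decomposition is involved: the cost is one pass over the term counts, which is why this family of methods can be cheaper than LSA on large corpora.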
9. Corpus
• Licensed journal articles from STM publishers: Elsevier, Springer,
etc.
• ~4100 journal titles, classified into 23 categories (by librarians)
• ~8.4m journal articles
• Selection of articles/journals:
– Only those with authors, abstract (no notices, obituaries, etc)
– Only English language articles
– Only journals with >50 articles in corpus
– Resulting corpus: 5,733,721 articles from 2231 journals
– Categories overlapping: 1.53 categories per journal
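The selection rules above could be expressed roughly as follows. This is a sketch only: the record layout (dicts with `authors`, `abstract`, `language`, `journal` keys) and the function name are hypothetical, and it assumes the ">50 articles" threshold is applied after the other filters, which the slides do not specify.

```python
from collections import Counter

def select_corpus(articles, min_articles_per_journal=50):
    """Keep English articles with authors and an abstract, from journals above the threshold."""
    eligible = [
        a for a in articles
        if a.get("authors") and a.get("abstract") and a.get("language") == "en"
    ]
    # Count eligible articles per journal, then drop journals at or below the threshold.
    per_journal = Counter(a["journal"] for a in eligible)
    return [a for a in eligible if per_journal[a["journal"]] > min_articles_per_journal]
```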
10. Journals per category
Category                                          # Journals
Agriculture & Biological Sciences                 358
Arts and Humanities                               70
Biochemistry, Genetics and Molecular Biology      240
Business, Management and Accounting               106
Chemical Engineering                              126
Chemistry                                         226
Civil Engineering                                 64
Computer Science                                  218
Decision Science                                  50
Earth and Planetary Science                       146
Economics, Econometrics and Finance               112
11. Journals per category
Category                                          # Journals
Energy and Power                                  73
Engineering and Technology                        328
Environmental Science                             138
Immunology and Microbiology                       104
Materials Science                                 160
Mathematics                                       205
Medicine                                          671
Neuroscience                                      103
Pharmacology, Toxicology and Pharmaceutics        73
Physics and Astronomy                             210
Psychology                                        126
Social Science                                    222
12. Process
• Index full text (only) with Lucene 2.4 using the LuSql tool, with an
aggressive stopword list and Porter stemming
• Build Semantic Vectors (v1.18, parallelized) index from Lucene
index, with 512 semantic dimensions
• Find item x item distance matrix from SV index of 512-dimensional
vectors
• Using R, use multidimensional scaling (MDS) to reduce from 512-D
to 2-D
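The last two steps (distance matrix, then MDS down to 2-D) can be sketched in Python with NumPy as a stand-in for the R step. Two assumptions are made here that the slides do not confirm: that the distance is cosine distance (1 minus cosine similarity), and that the MDS variant is classical (metric) MDS via eigendecomposition. The function names are illustrative.

```python
import numpy as np

def cosine_distance_matrix(vectors):
    """Item x item matrix of 1 - cosine similarity between row vectors."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.where(norms == 0, 1.0, norms)
    similarity = np.clip(unit @ unit.T, -1.0, 1.0)  # guard against rounding drift
    return 1.0 - similarity

def classical_mds(dist, n_components=2):
    """Classical MDS: double-center the squared distances, keep the top eigenvectors."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)            # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]   # indices of the largest ones
    top_vals = np.clip(eigvals[order], 0.0, None)      # negative values can occur for non-Euclidean input
    return eigvecs[:, order] * np.sqrt(top_vals)       # shape: (n_items, n_components)
```

For example, `classical_mds(cosine_distance_matrix(journal_vectors))` would return one 2-D point per journal, where `journal_vectors` is a hypothetical array holding one 512-dimensional vector per journal.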
13. Scalability Issues
• #items, #unique terms
– #unique terms: SV handles very well
– #items: SV handles fairly well
– #items: impacts size of distance matrix (#items x #items)
– R cannot handle huge article distance matrix in MDS (i.e.
millions of articles vs. thousands of journals)
• Instead of using articles for items, use journals for items
• Make single large full-text document from concatenation of all
articles of particular journal & index these
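A quick back-of-the-envelope calculation shows why the switch from articles to journals matters. The sketch below assumes a dense matrix of 8-byte floating-point values (R's default numeric type). Under that assumption, an article-level matrix for 5,733,721 items needs roughly 263 TB, while a journal-level matrix for 2231 items needs about 40 MB. The helper names and the `journal`/`full_text` record keys are hypothetical.

```python
from collections import defaultdict

def dense_matrix_bytes(n_items, bytes_per_value=8):
    """Memory for a dense n_items x n_items distance matrix."""
    return n_items * n_items * bytes_per_value

def build_journal_documents(articles):
    """Concatenate the full text of each journal's articles into one document per journal."""
    parts = defaultdict(list)
    for article in articles:
        parts[article["journal"]].append(article["full_text"])
    return {journal: "\n".join(texts) for journal, texts in parts.items()}

article_level = dense_matrix_bytes(5_733_721)  # articles as items
journal_level = dense_matrix_bytes(2_231)      # journals as items
print(f"articles: {article_level / 1e12:.0f} TB, journals: {journal_level / 1e6:.1f} MB")
```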
14. Environment
• Dell PowerEdge 1955 Blade server, 2 x dual-core Xeon 5050
processors with 2x2MB cache, 3.0 GHz 64bit, 32GB RAM,
attached to a Dell EMC AX150 storage array via SilkWorm
200E Series 16-Port Capable 4Gb Fabric Switch.
• Operating system: Linux openSUSE 10.2 (64-bit X86-64), kernel
2.6.18.8-0.10-default #1 SMP
• Java version 1.6.0_07 (build 1.6.0_07-b06) Java HotSpot 64-Bit
Server VM (build 10.0-b23, mixed mode).
• Processing 1.0 (processing.org)
49. Medicine
[Figure: journal map detail. Labels: Medicine; French language Medical & Psychology Journals]
50. [Figure: journal map detail. Labels: Mathematics; Bulletin of Mathematical Biology; Journal of Medical Ultrasonics]
51. Conclusions
• Reasonable mapping results
• Full-text only (no citations, metadata) gives good results
• Scalable to significant size
52. Future Work
• Proper precision and recall evaluation using same corpus
• Validate with NetNews-20 collection for P & R
• Evaluate non-metric MDS
• Project articles onto semantic journal space & build interactive
discovery interface & evaluate
– Index journal 'documents' and journal articles
– SV on all
– Distance matrix only on journals
– Do MDS
– Use eigenvectors to transform N-d article vector to 2-D
• Explore 3-D interface (MDS N-d → 3D)
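One possible reading of "use eigenvectors to transform N-d article vector to 2-D" is the standard out-of-sample extension for classical MDS: fit the map on journals only, then place each article using its distances to the journals. The sketch below illustrates that reading in Python with NumPy. It is an assumption about the proposed approach, not the authors' method, and the function names are illustrative. It also assumes the retained eigenvalues are strictly positive.

```python
import numpy as np

def fit_classical_mds(dist, n_components=2):
    """Fit classical MDS on reference items (journals); keep what projection needs."""
    sq = dist ** 2
    n = sq.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ sq @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    vals = eigvals[order]
    if np.any(vals <= 0):
        raise ValueError("retained eigenvalues must be positive")
    coords = eigvecs[:, order] * np.sqrt(vals)
    return coords, vals, sq.mean(axis=1)  # coordinates, eigenvalues, row means of squared distances

def project_new_items(sq_dist_to_refs, coords, eigvals, ref_row_means):
    """Place new items (articles) from their squared distances to the reference items.

    sq_dist_to_refs has shape (n_new, n_refs). The constant terms of the usual
    double-centering drop out because the reference coordinates are centered.
    """
    centered = sq_dist_to_refs - ref_row_means
    return -0.5 * (centered @ coords) / eigvals
```

With this scheme the expensive eigendecomposition runs once on the journal x journal matrix, and each article costs only one row of article-to-journal distances, which matches the "distance matrix only on journals" step above.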