Semantic Journal Mapping for Search Visualization
in a Large Scale Article Digital Library

Glen Newton (1,2), Alison Callahan (1), Michel Dumontier (2)
(1) National Research Council Canada, (2) Carleton University
Second Workshop on Very Large Digital Libraries (VLDL) 2009
at ECDL 2009
Oct 2 2009 Corfu, Greece

Outline

• Maps of Science
• Background
• Research Interests
• Research Goals
• Process
• Scalability issues
• Environment
• Results
• Conclusions
• Future Work

[Figures: maps of science, from Leydesdorff & Rafols 2006]

Background

• Canada Institute for Scientific and Technical Information (CISTI):
Canada's national science library
• ~3000 active researchers at NRC
• Large collection of ~8.4 million articles (full text plus metadata) in
science, technology and medicine (STM)
• ~4100 journal titles
• Coverage: ~1995 to 2009

Research Interests

• Domain-specific discovery
• Improved discovery in STM domains through results visualization
and contextualization, browse/explore/refine
• Results set visualization: “mapping”

Research Goals

• Find a way to extract a semantic vector space of journals (&
articles)
• Latent Semantic Analysis (LSA) works for small and medium-sized
corpora but does not scale to large numbers of items and/or terms
• Newer alternative: Semantic Vectors (SV), which uses random vectors
and avoids the expensive singular value decomposition (SVD)
• Can SV scale and generate a sensible semantic vector space of
journals on a corpus of this size?
• Can the resulting visualization be useful for query results
visualization, refinement and discovery?
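
The random-vector idea can be illustrated with a toy random-indexing sketch. This is not the Semantic Vectors package's own code, and its training procedure differs in detail. The assumptions here are that each term gets a sparse random vector of +1/-1 entries and that an item's vector is the sum of the vectors of its tokens. The names `NONZERO`, `random_vector`, `item_vectors` and the toy texts are hypothetical.

```python
import math
import random

DIM = 512     # number of semantic dimensions, as in the slides
NONZERO = 8   # non-zero entries per random vector (assumed value)

def random_vector(rng):
    """Sparse ternary vector: half of the non-zero entries +1, half -1."""
    vec = [0.0] * DIM
    for i, pos in enumerate(rng.sample(range(DIM), NONZERO)):
        vec[pos] = 1.0 if i % 2 == 0 else -1.0
    return vec

def item_vectors(docs, seed=42):
    """Build one vector per item by summing the random vectors of its tokens."""
    rng = random.Random(seed)
    term_vecs = {}   # term -> its fixed random vector
    result = {}
    for name, text in docs.items():
        total = [0.0] * DIM
        for token in text.split():
            if token not in term_vecs:
                term_vecs[token] = random_vector(rng)
            for i, value in enumerate(term_vecs[token]):
                total[i] += value
        result[name] = total
    return result

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Toy stand-ins for concatenated journal text (hypothetical data).
toy_journals = {
    "chem_a": "polymer synthesis catalyst reaction polymer",
    "chem_b": "catalyst reaction synthesis solvent",
    "econ": "market inflation trade policy market",
}
vecs = item_vectors(toy_journals)
sim_related = cosine(vecs["chem_a"], vecs["chem_b"])
sim_unrelated = cosine(vecs["chem_a"], vecs["econ"])
print(sim_related, sim_unrelated)
```

No matrix decomposition is needed: items that share vocabulary end up with similar vectors because they sum the same random term vectors, while unrelated items are nearly orthogonal in 512 dimensions.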

Corpus

• Licensed journal articles from STM publishers: Elsevier, Springer,
etc
• ~4100 journal titles, classified into 23 categories (by librarians)
• ~8.4m journal articles
• Selection of articles/journals:
– Only those with authors, abstract (no notices, obituaries, etc)
– Only English language articles
– Only journals with >50 articles in corpus
– Resulting corpus: 5,733,721 articles from 2231 journals
– Categories overlap: on average 1.53 categories per journal
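
The selection rules above can be expressed as a small filter. This is a minimal sketch: the `Article` record and its field names are hypothetical, and it assumes the ">50 articles" threshold is applied after the per-article filters, which the slides do not state.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Article:
    # Hypothetical record layout; the real corpus schema is not given in the slides.
    journal: str
    abstract: str
    language: str
    authors: list = field(default_factory=list)

def select_articles(articles, min_per_journal=50):
    """Keep English articles with authors and an abstract, from journals with > min_per_journal such articles."""
    kept = [a for a in articles
            if a.authors and a.abstract and a.language == "en"]
    counts = Counter(a.journal for a in kept)
    return [a for a in kept if counts[a.journal] > min_per_journal]
```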

Category                                        # Journals
Agriculture & Biological Sciences               358
Arts and Humanities                             70
Biochemistry, Genetics and Molecular Biology    240
Business, Management and Accounting             106
Chemical Engineering                            126
Chemistry                                       226
Civil Engineering                               64
Computer Science                                218
Decision Science                                50
Earth and Planetary Science                     146
Economics, Econometrics and Finance             112
Energy and Power                                73
Engineering and Technology                      328
Environmental Science                           138
Immunology and Microbiology                     104
Materials Science                               160
Mathematics                                     205
Medicine                                        671
Neuroscience                                    103
Pharmacology, Toxicology and Pharmaceutics      73
Physics and Astronomy                           210
Psychology                                      126
Social Science                                  222

Process

• Index the full text only (no metadata) with Lucene 2.4 via the
LuSql tool, using an aggressive stopword list and Porter stemming
• Build a Semantic Vectors (v1.18, parallelized) index from the Lucene
index, with 512 semantic dimensions
• Compute the item x item distance matrix from the SV index of
512-dimensional vectors
• In R, use multidimensional scaling (MDS) to reduce from 512-D
to 2-D
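
The last two steps can be sketched in Python with NumPy. The slides do not say which distance measure or which R function was used, so this sketch assumes cosine distance and classical (Torgerson) MDS, the method R's `cmdscale` implements. The random input vectors are only stand-ins for real SV journal vectors.

```python
import numpy as np

def cosine_distance_matrix(vectors):
    """Pairwise cosine distances (1 - cosine similarity) between rows."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms
    sim = np.clip(unit @ unit.T, -1.0, 1.0)
    return 1.0 - sim

def classical_mds(dist, n_components=2):
    """Classical MDS: eigendecompose the double-centred squared distances."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)           # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:n_components]    # indices of the largest ones
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None)) # guard against tiny negatives
    return eigvecs[:, top] * scale

rng = np.random.default_rng(0)
journal_vectors = rng.normal(size=(20, 512))  # stand-in for 512-D journal vectors
dist = cosine_distance_matrix(journal_vectors)
coords = classical_mds(dist)                  # one (x, y) row per journal
print(coords.shape)
```

Each row of `coords` is a 2-D position that can be plotted directly; the full eigendecomposition used here is adequate for a few thousand journals but not for millions of articles.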

Scalability Issues

• Two size factors: #items and #unique terms
– #unique terms: SV handles these easily
– #items: SV handles these fairly well
– #items also determines the size of the distance matrix (#items x #items)
– R cannot handle a huge article-level distance matrix in MDS
(millions of articles vs. thousands of journals)
• So use journals as items instead of articles
• Build one large full-text document per journal by concatenating all
of its articles, and index these documents
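
A back-of-envelope calculation shows why the article-level matrix is out of reach. It assumes a dense matrix of 8-byte doubles; storing only one triangle would halve the figures. The helper names and the tiny article list are hypothetical.

```python
from collections import defaultdict

def dense_matrix_bytes(n_items, bytes_per_value=8):
    """Memory for a full n x n matrix of floating-point distances."""
    return n_items * n_items * bytes_per_value

def journal_documents(articles):
    """Concatenate article texts into one document per journal."""
    parts = defaultdict(list)
    for journal, text in articles:
        parts[journal].append(text)
    return {journal: " ".join(texts) for journal, texts in parts.items()}

article_bytes = dense_matrix_bytes(5_733_721)  # articles in the filtered corpus
journal_bytes = dense_matrix_bytes(2_231)      # journals in the filtered corpus
print(f"articles: {article_bytes / 1e12:.0f} TB, journals: {journal_bytes / 1e6:.1f} MB")

docs = journal_documents([
    ("J. Toy Chem.", "polymer synthesis"),
    ("J. Toy Chem.", "catalyst reaction"),
    ("Toy Econ. Rev.", "market policy"),
])
```

Under these assumptions the article matrix needs roughly 263 TB, against about 40 MB for the journal matrix.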

Environment

• Dell PowerEdge 1955 blade server, 2 x dual-core Xeon 5050
processors with 2x2MB cache, 3.0 GHz, 64-bit, 32GB RAM,
attached to a Dell EMC AX150 storage array via a SilkWorm
200E Series 16-Port Capable 4Gb Fabric Switch
• Operating system: Linux openSUSE 10.2 (64-bit x86-64), kernel
2.6.18.8-0.10-default #1 SMP
• Java version 1.6.0_07 (build 1.6.0_07-b06), Java HotSpot 64-Bit
Server VM (build 10.0-b23, mixed mode)
• Processing 1.0 (processing.org)

Results: Scalability

• Corpus: ~600GB full-text
• Lucene index: 43GB
– LuSql: 13 hours 51 minutes to produce
• SV index: 58 minutes, 885 MB, 21.6m terms
– Distance matrix: 6 minutes

Results: Visualization

• Using Processing environment, built simple
validation/visualization tool

[Figure: Harder sciences and engineering categories]

[Figure: Agriculture and biomedical categories. Labels: Agriculture and Biological Sciences; Biochemistry, Genetics and Molecular Biology]

[Figure: Interdisciplinary and non-science categories. Labels: Economics, Econometrics and Finance; Business, Management and Accounting]

Examination of outliers, extrema and cataloging errors

[Figure: Environmental Science. Labels: Ecotoxicology and Environmental Safety; Organic Geochemistry; Corporate Environmental Strategy]

[Figure: Medicine. Labels: Journal of Biomolecular NMR; Journal of X-Ray Science and Technology]

[Figure: Medicine. Labels: Colloidal and Polymer Science; Annales Henri Poincare]

[Figure: Medicine. Labels: French language Medical & Psychology Journals]

[Figure: Mathematics. Labels: Bulletin of Mathematical Biology; Journal of Medical Ultrasonics]

Conclusions

• Reasonable mapping results
• Full-text only (no citations, metadata) gives good results
• Scalable to significant size

Future Work

• Proper precision and recall evaluation using same corpus
• Validate with NetNews-20 collection for P & R
• Evaluate non-metric MDS
• Project articles onto semantic journal space & build interactive
discovery interface & evaluate
– Index journal 'documents' and journal articles
– SV on all
– Distance matrix only on journals
– Do MDS
– Use eigenvectors to transform N-d article vector to 2-D
• Explore 3-D interface (MDS N-d → 3D)
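
One possible reading of the "use eigenvectors" step is the standard out-of-sample extension of classical MDS: fit the map on journals only, then place each article using its distances to the journals. The sketch below assumes that reading. The function names and the planar toy data are hypothetical, chosen so that the projection can be checked exactly.

```python
import numpy as np

def fit_classical_mds(dist, n_components=2):
    """Classical MDS on journals; also returns what is needed to place new points."""
    sq = dist ** 2
    n = sq.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ sq @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:n_components]
    eigvals, eigvecs = eigvals[top], eigvecs[:, top]  # must be positive eigenvalues
    coords = eigvecs * np.sqrt(eigvals)
    projector = (eigvecs / np.sqrt(eigvals)).T        # rows: v_i^T / sqrt(lambda_i)
    return coords, projector, sq.mean(axis=0)

def project(new_dists, projector, mean_sq):
    """Place a new item from its distances to every fitted item."""
    return 0.5 * projector @ (mean_sq - new_dists ** 2)

rng = np.random.default_rng(1)
journal_points = rng.normal(size=(30, 2))             # toy "journals" on a plane
journal_dist = np.linalg.norm(
    journal_points[:, None, :] - journal_points[None, :, :], axis=2)
journal_coords, projector, mean_sq = fit_classical_mds(journal_dist)

article_point = np.array([0.3, -0.7])                 # toy "article"
article_dist = np.linalg.norm(journal_points - article_point, axis=1)
article_xy = project(article_dist, projector, mean_sq)
```

This keeps the expensive step at journal scale: the eigendecomposition runs on a journals x journals matrix, and each article then costs only one vector of distances to the journals and one small matrix-vector product.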

Acknowledgements

• Greg Kresko, Andre Vellino, Jeff Demaine @ NRC-CISTI

Demo

• Link to project demo page

License

Creative Commons Attribution-Noncommercial-No Derivative Works 2.5
