A detailed look at the graphic representation of text and term mining data. Originally presented by Marjorie M.K. Hlava and Dr. Jay Ven Eman at the 2012 International Information Conference on Search, Data Mining and Visualization in Nice, France.
Text Mining, Term Mining, and Visualization - Improving the Impact of Scholarly Publishing
1. TEXT MINING, TERM MINING, AND VISUALIZATION
IMPROVING THE IMPACT OF SCHOLARLY PUBLISHING
MONDAY 16 APRIL 2012
NICE, FRANCE
Marjorie M.K. Hlava, President
Jay Ven Eman, CEO
Access Innovations, Inc.
mhlava@accessinn.com
J_ven_eman@accessinn.com
2. What we will cover today
• Term and Text Mining
• The basics of visualization
• Case studies
• Using subject terms as metrics
• Applications
• Visualizing the results
3. Definitions
• Term Mining - using controlled vocabulary tags in text to find patterns and directions
• Text Mining - a systematic, algorithmic comparison-processing method to find patterns in text
• Term & text mining
  ▪ Many similarities
  ▪ Can be complementary; not mutually exclusive
4. Term mining
• Precise
▪ Meaningful semantic relationships; contextual
▪ Replicable; repeatable; consistent
▪ Vetted; controlled
▪ Based on a controlled vocabulary
▪ Trends; gaps; relationship analysis; visualizations
▪ Less data processing load
5. Text mining
▪ Algorithmic; formulaic
▪ Neural nets, statistical, latent semantic, co-occurrence
▪ Serendipitous relationships
▪ Sentiment; hot topics; trends
▪ False drops; noise
▪ Misleading semantic relationships
▪ Heavy processing load
6. Why take a visual look?
• Humans can process information 17 times faster in visual presentations
• Now data can be analyzed, manipulated and presented as visual displays
• To see the trends effectively we need to put the data into rich, graphable formats
7. Visualization of data
• Needs
  - Measurement
  - Metrics
  - Numbers
• Shows
  - Adjacency
  - Relationships
  - Trends
  - Co-occurrence
  - Conceptual distance
• Is richer with
  - Linking
  - Semantic enrichment
  - Classification
• Supports
  - Forecasting
  - Trend analysis
  - Segmentation
  - Distribution
8. Well Formed Data • Semantic Enrichment • Taxonomies • Access Innovations • Data Harmony
Man's attention to visual display to convey knowledge is ancient
9. The art in maps is a longstanding tradition
10. Superimposing data is now common
A mashup example
Traffic Injury Map: UK Data Archive; US National Highway Safety Administration; Google Maps base
Accident categories include children, automobile, bicycle, etc.
Data: time, place, type
Source: JISC TechWatch: Data Mash-ups, September 2010
11. Mashup of bird flight migrations and weather patterns
http://www.youtube.com/watch?v=uPff1t4pXiI&feature=youtu.be
12. Video
http://www.youtube.com/watch?v=nokQBjk1s_8&feature=player_embedded
13. How does it work?
▪ Develop controlled vocabulary
  » Prefer one with hierarchy
▪ Apply to full text
  » Or to the "heads"
▪ Decide on data points to convey information
▪ Divide the XML into graphable sections
14. Start with data - like this XML file
15. Index or tag using subject terms from thesaurus or taxonomy
▪ date, category, taxonomy term, frequency
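The tagging step can be illustrated with a minimal sketch. It assumes a tiny, made-up thesaurus and plain word matching; the rule-based indexing described in this deck is considerably richer, so `THESAURUS`, `tag_document`, and the sample text are hypothetical stand-ins that only show the shape of the output rows (date, category, taxonomy term, frequency).

```python
from collections import Counter
import re

# Hypothetical mini-thesaurus: preferred term -> (category, synonyms).
THESAURUS = {
    "Lasers": ("Photonics", ["laser", "lasers"]),
    "Antennas": ("Electromagnetics", ["antenna", "antennas", "aerial"]),
}

def tag_document(date, text):
    """Return (date, category, term, frequency) rows for one document."""
    # Lowercase the text and count each word once up front.
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    rows = []
    for term, (category, synonyms) in THESAURUS.items():
        # A term's frequency is the total count of all its synonyms.
        freq = sum(counts[s] for s in synonyms)
        if freq:
            rows.append((date, category, term, freq))
    return rows

rows = tag_document("2012-04-16",
                    "A laser drives the antenna; two lasers are tuned.")
```

Each row is one graphable data point; a real pipeline would read the date and text from the XML record rather than from literals.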
16. Many views of one set of data
17. Load to a visualization program
Like Prefuse
18. Or Pajek
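As a rough sketch of what loading data into Pajek can involve: Pajek reads plain-text network files (.net) that list numbered vertices followed by weighted edges. The helper name `to_pajek` and the sample weights below are assumptions for illustration, not part of the original workflow.

```python
def to_pajek(edges):
    """Build Pajek .net text from {(term_a, term_b): weight}."""
    # Pajek vertices are numbered from 1; sort labels for a stable order.
    labels = sorted({term for pair in edges for term in pair})
    index = {label: i for i, label in enumerate(labels, start=1)}
    lines = [f"*Vertices {len(labels)}"]
    lines += [f'{i} "{label}"' for label, i in index.items()]
    lines.append("*Edges")
    for (a, b), weight in sorted(edges.items()):
        lines.append(f"{index[a]} {index[b]} {weight}")
    return "\n".join(lines) + "\n"

# Made-up co-occurrence weights between three subject terms.
net_text = to_pajek({
    ("Lasers", "Optics"): 3,
    ("Optics", "Semiconductors"): 1,
})
```

Writing `net_text` to a file with a `.net` extension should give something Pajek can open; the edge weights would normally come from term co-occurrence counts.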
20. National Information Center for Educational Media
Albuquerque's own
» Sandia developed VxInsight
» Access Innovations = NICEM
Same data - several views
Primary and Secondary Education in US
Shows the US Valley of Science
Little Science taught in elementary years
23. Using visualization to show
▪ From a society / publisher perspective
  » Identify Core, Boundary and Cross Border
  » Provides Indicators
    ▪ Activity
    ▪ Growth
    ▪ Relatedness
    ▪ Centrality
  » Locates Journal domains
▪ From a thesaurus perspective
  » Identifies terms that are too broadly defined
  » Potential Improvements in thesaurus structure using topic structures
24. Case Study: Mapping IEEE thesaurus space
▪ We are interested in an expanded map that includes adjacencies to the IEEE data
  » Expanded term set shows adjacent white space; opportunities for expansion
▪ Overlaps and edges of the science
  » We need comparison data
▪ Learn the directions in the field
  » Low occurrence rate in IEEE documents?
  » Linkage to terms in IEEE documents?
▪ Where do we find these terms? How can we add them?
25. The process
▪ Built a rule base to auto index IEEE content
  » "90% accuracy out of the box on journal data"*
  » "80% out of the box on proceedings data"*
▪ The overlapping data sets
  » Auto indexed 1.2 million Xplore records
  » Auto indexed 10 years of US Patent data
  » Auto indexed 10 years of Medline
▪ Term sets used
  » IEEE thesaurus terms rule base
  » Medical Subject Headings (MeSH) (and simple rule base)
  » Defense Technical Information Center (DTIC) Thesaurus (and simple rule base)
  » Similar level of detail to current IEEE thesaurus terms
26. Defining expanded term space
Step 1. The data - Select related corpus
[Diagram: IEEE (2k terms, 1.2M documents); DTIC (14k terms, 475k patents); MeSH (24k terms, PubMed, 525k docs)]
27. Defining expanded term space
Step 2. Identify related terms
Use the IEEE Thesaurus to index the three collections
[Diagram: IEEE (2k terms, 1.2M documents)]
28. Defining expanded term space
Step 2. Identify related terms
Use MeSH and DTIC to also index the three collections
[Diagram: IEEE (2k terms, 1.2M documents)]
29. Defining expanded term space
Step 3. Resulting term set
The co-indexed items from the three collections
[Diagram: IEEE (2k terms, 1.2M documents)]
30. Defining expanded term space
Step 4. Term:Term Matrix
Where do the articles and their indexing intersect?
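One way to picture a term:term matrix is as a count of how many documents carry both terms of a pair. The sketch below is an illustration under that assumption, using invented documents and terms; the function name `cooccurrence` is hypothetical and the deck does not specify how its matrix was actually computed.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(doc_terms):
    """Count, for each pair of terms, the documents indexed with both.

    doc_terms: iterable of term collections, one per document.
    Returns a Counter keyed by alphabetically sorted (term_a, term_b).
    """
    matrix = Counter()
    for terms in doc_terms:
        # Deduplicate and sort so each pair has one canonical key.
        for pair in combinations(sorted(set(terms)), 2):
            matrix[pair] += 1
    return matrix

# Three invented documents with their assigned subject terms.
docs = [
    {"Lasers", "Optics"},
    {"Lasers", "Optics", "Semiconductors"},
    {"Semiconductors"},
]
matrix = cooccurrence(docs)
```

Only the upper triangle is stored because co-occurrence is symmetric; pairs that never appear together simply have a count of zero.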
31. Visualization Strategies
Matrix
Visualization
Software
32. All data up-posted to the top level
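Up-posting can be read as rolling each term's count up to its top-level ancestor in the hierarchy. The following is a minimal sketch under the assumption that every term has at most one broader term; the `BROADER` mapping and the counts are invented for illustration.

```python
from collections import Counter

# Hypothetical hierarchy fragment: narrower term -> broader term.
BROADER = {
    "Semiconductor lasers": "Lasers",
    "Lasers": "Optics",
    "Microstrip antennas": "Antennas",
}

def top_term(term):
    """Follow broader-term links until a term with no parent is reached."""
    seen = set()
    # The 'seen' set guards against accidental cycles in the mapping.
    while term in BROADER and term not in seen:
        seen.add(term)
        term = BROADER[term]
    return term

def up_post(term_counts):
    """Roll the count for every term up to its top-level ancestor."""
    rolled = Counter()
    for term, n in term_counts.items():
        rolled[top_term(term)] += n
    return rolled

rolled = up_post({
    "Semiconductor lasers": 4,
    "Lasers": 1,
    "Microstrip antennas": 2,
    "Optics": 3,
})
```

A thesaurus that allows several broader terms per entry would need a rule for splitting or duplicating counts, which this sketch does not attempt.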
33. Many map options
IEEE Experience / Previous Experience
34. IEEE Portfolio
[Figure: map of IEEE societies and councils, labelled by abbreviated name (for example Computer Society, Signal Proc Soc, Sensors Council)]
35. Radial Visualization
36. Publication Strategy
JASIST reference
37. Conference Strategy
38. Use a Thesaurus to Label Maps
[Figure: map of fine-grained topic labels (for example Turbines, Circuits, Lasers, Semiconductors, Medical Devices) with broader sector labels: Agriculture, Food, Consumer Products, Construction, Automotive + Defense, Industrial Products, Leisure, Energy, Telecom, Computer HW/SW, Electronics, Chemicals, Pharma, Metals, Health Care]
39. Questions Answered
▪ Is there a way, using our own information, to forecast our direction?
▪ Where is the industry headed? What about by technology sector?
▪ Does our coverage match our mission and vision?
▪ Can we become smarter about our data and potential markets using our collection in new ways?
▪ Are the societies publishing and talking about what their charter indicates they cover?
▪ What are the trends - are topics emerging/cooling?
▪ Can we use technology and our own data to explore these questions while enhancing our data?
40. The research team
▪ Access Innovations / Data Harmony
  » Founded in 1978
  » Data enrichment and normalization
  » Suite of Semantic Enrichment tools
▪ SciTechStrategies
  » Understanding data through visualization
▪ IEEE Indexing & Abstracting Group
41. We looked at visualization of data
▪ Finding the Metrics
  » Measurement
  » Numbers
  » Terms as indicators
▪ Ways to show
  » Adjacency
  » Relationships
  » Trends
  » Co-occurrence
  » Conceptual distance
▪ How to enrich with
  » Linking
  » Semantic enrichment
  » Classification
▪ Maps supporting
  » Forecasting
  » Trend analysis
  » Segmentation
  » Distribution
42. Effective maps require
▪ Contextual data
▪ Detailed data
▪ Classification methods
▪ At least two directions in the matrix
▪ A little art for fun
43. It just takes a little imagination
Thank you
Marjorie M.K. Hlava, President
mhlava@accessinn.com
Jay Ven Eman, CEO
J_ven_eman@accessinn.com
Access Innovations
505-998-0800
Editor's note
Access Innovations and its software brand Data Harmony are known for high-caliber data: clean, well formed, and accurately semantically enriched. They updated the IEEE thesaurus in 2005, building a rule base for use in indexing at the same time. At the launch of the project, the application of the terms to the IEEE content was 90% accurate (that is, 90% of the terms suggested are what well-trained indexers would use from a controlled vocabulary) and 80% accurate on the more difficult proceedings data. Since then the rule base has improved, and the IEEE production team only needs to spot check about 10% of the documents to ensure that a high standard of indexing is maintained. This has allowed IEEE to process many more documents with the same team and has made the process more fun. Because the Data Harmony software removes many of the clerical aspects of indexing, the indexers have time to think about the content, the thesaurus terms, what should be added, and what other information can be collected to continue to enrich the files.
The accuracy is high enough that we simply indexed the entire contents of the Xplore database, back to the earliest records, in a single overnight process. Then, to explore the edges of science, we also indexed the 1.2 million records using the Medical Subject Headings and the Defense Technical Information Center thesauri, with similar accuracy results.
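The accuracy figure in the note above is described as the share of suggested terms that a trained indexer would also use. A small sketch of that calculation follows; the function name `suggestion_accuracy` and the sample term lists are hypothetical, and the note does not say how the original measurement was carried out.

```python
def suggestion_accuracy(suggested, accepted):
    """Share of machine-suggested terms that the indexer also chose."""
    suggested = set(suggested)
    if not suggested:
        # No suggestions means nothing to score; report zero.
        return 0.0
    return len(suggested & set(accepted)) / len(suggested)

# Invented example: ten suggested terms, nine confirmed by an indexer.
suggested = ["t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8", "t9", "t10"]
accepted = ["t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8", "t9", "other"]
score = suggestion_accuracy(suggested, accepted)
```

Measured this way the figure is a precision-style metric: it says how many suggestions were right, not how many of the indexer's terms the software found.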