1. Taxonomy Assessments -
Part Two
February 9, 2012
Access Innovations, Inc.
Leveraging Your Content Semantically
Jay Ven Eman, Ph.D., CEO
j_ven_eman@accessinn.com
www.accessinn.com
www.dataharmony.com
+1.505.998.0800
Albuquerque, NM
© 2012. Access Innovations, Inc. All rights reserved.
2. Indexing
Subject term assignment
Permanent metadata attached to the indexed object
Used for retrieval and evaluation
Processes
• Manual
• Publisher
• 3rd party aggregators
• Authors
• Automated methods
3. Integration / workflow
[Diagram. Labels: APIs, client/server, web services, HTTP-TCP/IP; author submission system; books; conference proceedings; etc.; content repository “A” or intermediate processes; content repository “B”, etc.; web sites; Thesaurus Master; M.A.I.; Data Harmony MAIstro Server; classification system]
4. Select the document collection
CMS
Please select the database and the document directory to load
7. Run the documents through a metadata extraction
process to create well-formed, rich XML
• Automatic (per doc template)
• E.g. Dublin Core Metadata
• Bibliographic citation
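As a rough illustration of this step, the sketch below serializes a few extracted fields as a small Dublin Core style XML record using Python's standard library. The `record` wrapper element, the `to_dublin_core` helper, and the sample values are assumptions made for this example; the slides do not describe the actual extraction templates.

```python
import xml.etree.ElementTree as ET

# Dublin Core element set namespace.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def to_dublin_core(fields: dict[str, list[str]]) -> str:
    """Serialize extracted fields as a small Dublin Core style XML record."""
    record = ET.Element("record")  # hypothetical wrapper element
    for name, values in fields.items():
        for value in values:
            element = ET.SubElement(record, f"{{{DC_NS}}}{name}")
            element.text = value
    return ET.tostring(record, encoding="unicode")


# Sample values only; a real pipeline would pull these from each document.
xml_record = to_dublin_core({
    "title": ["Solving the Challenge"],
    "creator": ["Ven Eman, Jay"],
    "subject": ["Indexing", "Thesauri"],
})
print(xml_record)
```

Repeating an element, as with `subject` here, is how multi-valued fields are usually carried in such a record.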
8. Automatically add the taxonomy
terms
Entity extraction: People,
Places, Things
Conceptual indexing: using the
taxonomy
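A minimal sketch of conceptual indexing against a taxonomy follows. It is a plain phrase-matching stand-in, not the M.A.I. rule base, and the mini-taxonomy, trigger phrases, and sample sentence are all hypothetical.

```python
import re

# Hypothetical mini-taxonomy: preferred term -> phrases that trigger it.
TAXONOMY = {
    "Indexing": ["indexing", "subject term assignment"],
    "Thesauri": ["thesaurus", "thesauri"],
    "Classification": ["classification", "classifying"],
}


def assign_terms(text: str, taxonomy: dict[str, list[str]]) -> list[str]:
    """Return the preferred terms whose trigger phrases occur in the text."""
    lowered = text.lower()
    assigned = []
    for preferred, triggers in taxonomy.items():
        # Word boundaries keep a trigger from matching inside a longer word.
        if any(re.search(rf"\b{re.escape(t)}\b", lowered) for t in triggers):
            assigned.append(preferred)
    return assigned


sample = "The process of indexing a content object relies on a thesaurus."
terms = assign_terms(sample, TAXONOMY)
print(terms)  # ['Indexing', 'Thesauri']
```

Simple matching like this cannot tell senses of a word apart, which is one reason production systems layer conditional rules on top of it.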
9. Classification Process or Assigned Indexing
Unstructured
09-14-11
“Solving the Challenge”
By Jay Ven Eman
The process of indexing a content object begins with…
Structured
<Anchor><Date>09-14-11</Date>
<TI>“Solving the Challenge”</TI>
<BLH>By</BLH>
<Author>
<AU_FN>Jay</AU_FN>
<AU_MI></AU_MI>
<AU_LN>Ven Eman</AU_LN>
</Author>
<Body>The process of indexing a content
object begins with…</Body>
<Subject>Indexing</Subject>
<Subject>Thesauri</Subject>
<Subject>Standards</Subject>
<Subject>Classification</Subject>
</Anchor>
[Diagram. Labels: Thesaurus Master; M.A.I.; Data Harmony MAIstro Server; classification system; content repository, e.g. database]
10. Indexing
Indexing measures
• Indexing experts
• Subject matter experts (SME)
• Hits, misses, & noise
• 85% hits
In conjunction with taxonomy measures
• Over & under used terms
• Over & under indexed content
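One way to surface over-used and under-used terms is to count how many documents each taxonomy term is assigned to. The sketch below assumes a simple mapping of document IDs to assigned terms; the thresholds (`low`, `high`) and the sample data are illustrative choices, not values from the slides.

```python
from collections import Counter


def term_usage(assignments: dict[str, list[str]], taxonomy_terms: set[str]) -> Counter:
    """Count the number of documents each taxonomy term is assigned to."""
    counts = Counter({term: 0 for term in taxonomy_terms})  # keep unused terms at 0
    for terms in assignments.values():
        counts.update(t for t in set(terms) if t in taxonomy_terms)
    return counts


def flag_terms(counts: Counter, n_docs: int, low: float = 0.0, high: float = 0.5):
    """Return (over_used, under_used) terms by share of documents indexed."""
    if n_docs == 0:
        return [], []
    over = sorted(t for t, c in counts.items() if c / n_docs > high)
    under = sorted(t for t, c in counts.items() if c / n_docs <= low)
    return over, under


# Hypothetical sample data.
taxonomy_terms = {"Indexing", "Thesauri", "Standards", "Classification"}
assignments = {
    "doc1": ["Indexing", "Thesauri"],
    "doc2": ["Indexing"],
    "doc3": ["Indexing", "Standards"],
}
counts = term_usage(assignments, taxonomy_terms)
over, under = flag_terms(counts, n_docs=len(assignments))
print(over, under)  # ['Indexing'] ['Classification']
```

A term attached to nearly every document adds little for retrieval, while a term that is never used may be missing rules or may not belong in the taxonomy.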
11. Indexing & Search Metrics
Hit, Miss, Noise
Subjective
• Relevance
• Aboutness
Statistical
• Precision
• Recall
• Level of effort
12. Hit, Miss, Noise
Hit – exactly what a human indexer would use
Miss – human indexer would use, but system
did not assign
Noise – system assigned, but human did not
• Relevant noise – could have been assigned
• Irrelevant noise – just plain wrong
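The definitions above map directly onto set operations. The sketch below treats the human indexer's terms as the reference set and also derives precision and recall under the usual definitions (hits over system-assigned terms, and hits over human-assigned terms). The function names and sample term sets are assumptions for illustration.

```python
def hit_miss_noise(human: set[str], system: set[str]) -> dict[str, set[str]]:
    """Split term assignments into hits, misses, and noise."""
    return {
        "hits": human & system,    # both human and system assigned
        "misses": human - system,  # human would use, system did not assign
        "noise": system - human,   # system assigned, human did not
    }


def precision_recall(human: set[str], system: set[str]) -> tuple[float, float]:
    """Precision = hits / system terms; recall = hits / human terms."""
    hits = len(human & system)
    precision = hits / len(system) if system else 0.0
    recall = hits / len(human) if human else 0.0
    return precision, recall


# Hypothetical assignments for one document.
human = {"Indexing", "Thesauri", "Standards", "Classification"}
system = {"Indexing", "Thesauri", "Classification", "Taxonomies"}

result = hit_miss_noise(human, system)
precision, recall = precision_recall(human, system)
print(result["misses"], result["noise"])  # {'Standards'} {'Taxonomies'}
print(precision, recall)                  # 0.75 0.75
```

Telling relevant noise from irrelevant noise still needs a human judgment, so it is left out of this sketch.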
13. Subjective
Relevance
• Reflects how closely the result matches the user's request
“Aboutness”
• Reflects the topical match between the document
content and the term
• How well the topic describes what the document is
about
Varies with level of conceptual terms vs. factual
terms in the thesaurus
14. Indexing
All content types & sources
• Inventory control
• Everything in, everything out
Document types
• Articles
• Proceedings
• Corporate
15. Link to Community Resources
(Source: Helen Atkins, AACR)
Diagram of resources linked through Topic A:
• CME activity on Topic A
• Upcoming conference on Topic A
• Other journal articles on Topic A
• Job posting for expert on Topic A
• Journal article on Topic A
• Grant available for researchers working on Topic A
• Podcast interview with researcher working on Topic A
• Author networks
• Social networking
• SME – Topic A
16. Indexing with Data Harmony® M.A.I.™
Rule base development
• 80/20 rule
• Indexing objectives
GUI
Time-to-market
• Level of effort to build
• Level of effort to maintain
• Less than all other alternatives when
indexing for high precision & recall
17. Updating Rule Base
Automatic for matching rules when using
Data Harmony MAIstro™
80/20 rule
Re-index when 5% to 10% of the taxonomy has changed; the ranges
below are arbitrary:
• Monthly with small databases – 5k to 20k
• Quarterly with medium – 20k to 1 million
• Annual with large – greater than 1 million
Depends on search software, too
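The 5% to 10% guideline can be checked mechanically once "change" is defined. The slide does not say how change is measured, so the sketch below assumes it means terms added or removed relative to the size of the previous taxonomy; the function names and the default threshold are illustrative.

```python
def changed_fraction(old_terms: set[str], new_terms: set[str]) -> float:
    """Share of terms added or removed, relative to the old taxonomy size."""
    if not old_terms:
        return 1.0 if new_terms else 0.0
    changed = old_terms ^ new_terms  # symmetric difference: added or removed
    return len(changed) / len(old_terms)


def should_reindex(old_terms: set[str], new_terms: set[str], threshold: float = 0.05) -> bool:
    """True when the taxonomy has changed by at least the threshold."""
    return changed_fraction(old_terms, new_terms) >= threshold


# Hypothetical taxonomy of 100 terms with 3 removed and 3 added.
old = {f"term{i}" for i in range(100)}
new = (old - {"term0", "term1", "term2"}) | {"new1", "new2", "new3"}

print(changed_fraction(old, new))  # 0.06
print(should_reindex(old, new))    # True
```

Renamed terms or moved branches would need their own accounting, since a symmetric difference sees a rename as one removal plus one addition.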
19. What’s in a name?
Juliet:
"What's in a name? That which
we call a rose
By any other name would smell as
sweet."
Romeo and Juliet (II, ii, 1-2)
21. Magnitude of the Problem:
Facebook - 700 Million Users Projected for 2011 (Open-First)
700 Million Names
How will your boss, peers,
anyone ever find you?
22. What’s in a name?
My name
Jay Ven Eman
Ven Eman, Jay
<First_Name>Jay</First_Name>
<Last_Name>Ven Eman</Last_Name>
Name variants
• Jay Von Eman
• Jay Van Eman
• Jay van Eman
• Jay ven Eman
• Jay Veneman
• Jay Venema
Aliases
• William Henry McCarty
• Henry Antrim
• William H. Bonney
• Billy the Kid
National & Cultural Conventions
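Spelling variants of a name can often be caught with normalization plus a string-similarity score, as in the sketch below, which uses Python's standard `difflib`. The `normalize` and `similarity` helpers are hypothetical, and any cutoff applied to the scores would have to be tuned on real data.

```python
import difflib
import unicodedata


def normalize(name: str) -> str:
    """Lowercase, strip accents, and drop spaces and punctuation."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if ch.isalnum()).lower()


def similarity(a: str, b: str) -> float:
    """Similarity ratio between two normalized names, from 0.0 to 1.0."""
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()


reference = "Jay Ven Eman"
for candidate in ["Jay Veneman", "Jay Von Eman", "Jay Venema", "Billy the Kid"]:
    print(f"{candidate}: {similarity(reference, candidate):.2f}")
```

The spelling variants score close to the reference, while an unrelated name such as "Billy the Kid" scores low. True aliases (William H. Bonney and Billy the Kid) share no spelling at all, so they need a curated alias table or an identifier registry rather than string matching.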
23. Names
Computationally & editorially intense
Author submissions
Membership records & the like
Industry initiatives – ORCID, VIVO
Subject term disambiguation
Inventory control basics apply here, too
Difficulty level is high
Constant maintenance needed
24. Taxonomy Assessments -
Part Two
February 9, 2012
Thank you! Questions?
Access Innovations, Inc.
Leveraging Your Content Semantically
Jay Ven Eman, Ph.D., CEO
j_ven_eman@accessinn.com
www.accessinn.com
www.dataharmony.com
+1.505.998.0800
Albuquerque, NM
Editor's notes
PDF
Post processing
“Labels” content item
But also classifies author
Thanks to Helen Atkins of AACR for this illustration.
The real power of this is that the links can all go in all directions, so we take advantage of having the user’s attention regardless of how they step into our “web”
Continuing Medical Education (CME)
Johnny Carson