Taxonomies for Human vs. Auto-Indexing
Taxonomy Boot Camp, September 25, 2008, San Jose, CA

Heather Hedden
Hedden Information Management
heather@hedden.net
Background
Heather Hedden's taxonomy-creation experience
 For human indexing
    Developed controlled vocabularies for periodical article index
    databases (Gale)
 For auto-indexing
    Developed taxonomies for integration within an enterprise
    search software product for corporate content and web page
    searching (Viziant)
    Matched controlled vocabulary to keywords for consumer online
    products/services directories (various "yellow pages" clients)
 For either
    Created enterprise taxonomies for corporate web sites and
    intranets for site navigation (Earley & Associates)



                      © 2008 Hedden Information Management
Outline
 Taxonomies & Indexing Background
 Choosing Human vs. Auto-indexing
 Taxonomies & Human Indexing
 Taxonomies & Auto-indexing
 Taxonomy Creation Comparison
   Differences in taxonomy terms
   Differences in term relationships
   Differences in definitions & notes
   Differences in synonyms/variants
 Additional Work for the Taxonomist
 Resources

Taxonomies & Indexing
Types of Taxonomies
1.   Organization, classification, navigation support
     – more emphasis on hierarchies
2.   Search and retrieval support
     – more emphasis on synonyms

     For indexing, use #2 above:
     Search & retrieval support taxonomies



Taxonomies & Indexing
Search & retrieval taxonomies:
  Connect users to desired content by means of a
  common nomenclature/terminology/vocabulary
  Matching between:
    1.   the vocabulary of the users
    2.   the vocabulary of the content
  Taxonomies interface with
    1.   the users
    2.   the content


  Indexing/tagging/categorization deals with
  #2 in each case above: the connection of taxonomy to
  content
Taxonomies & Indexing
Indexing/tagging/categorization:
 Indexing
 – done by (trained) indexers
 – creating a (browsable) index
 Tagging
 – done by any person
 – applying labels, metatag descriptors to documents to
 be picked up by database or search software
 – may not require a taxonomy/controlled vocabulary
 Categorization
  – done more systematically/automatically
  – putting documents into (pre-defined) categories
  – often within facets

Choosing Human vs. Auto-indexing:
The Content
Human indexing
 Manageable number of documents
 Includes non-text files
 Varied and undifferentiated document types/formats
 Varied subject areas
Auto-indexing
 Very large number of documents
 Text files only
 Common document types/formats (or pre-tagged types)
 Focused subject areas (legal, medical, etc.)



Choosing Human vs. Auto-indexing:
The Culture
Human indexing
 Higher accuracy in indexing
 Invest in people
 Low-tech: can build your own indexing UI or buy
 Internal control, or outsourcing vendor relationship
Auto-indexing
 Greater volume indexed
 Greater speed in indexing
 Invest in technology
 High-tech: must purchase auto-indexing software
 Software vendor relationship



Taxonomies & Human Indexing
Who are indexers?
 Specialists or not
   the taxonomist and/or other information
   specialists, librarians
   dedicated hired indexers (with or without prior
   indexing experience)
   supplemental work for other staff (editors,
   writers, administrators)
 One person or multiple people
 Usually in-house but could be contracted out

Taxonomies & Human Indexing
Indexing software/module
 Indexing user interface optimized for ease,
 speed, and accuracy in indexing
 Method for indexers to nominate new taxonomy
 terms
Training & documentation for indexers
 Indexing policy guidelines
 Method to communicate new and changed
 taxonomy terms to indexers
 Method for checking and quality control

Taxonomies & Auto-Indexing
Technologies
 Entity extraction
 Text mining and text analytics
 Auto-categorization or auto-classification
 utilizing taxonomies:
  1. Machine-learning and training documents
  2. Rules-based categorization




Taxonomies & Auto-Indexing
Machine-learning auto-categorization:
 Complex mathematical algorithms are created
 Taxonomist must then provide several (at least
 5-10) representative sample documents for each
 taxonomy term to “train” the automated indexing
 system.
 If using only 5-10 documents, then profile/overview
 or encyclopedic articles are best.
 If pre-indexed records exist (i.e. converting from
 human to automated indexing), then hundreds of
 varied documents can be used for each term.
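
The training step described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not any vendor's algorithm: it assumes a simple word-count profile per taxonomy term compared by cosine similarity, and the sample documents and function names (`train`, `categorize`) are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(samples):
    """Build one word-count profile per taxonomy term.

    samples maps each taxonomy term to its list of training documents.
    """
    profiles = {}
    for term, documents in samples.items():
        profile = Counter()
        for document in documents:
            profile.update(tokenize(document))
        profiles[term] = profile
    return profiles


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def categorize(text, profiles):
    """Return the taxonomy term whose profile is most similar to the text."""
    vector = Counter(tokenize(text))
    return max(profiles, key=lambda term: cosine(vector, profiles[term]))


# Hypothetical training documents; a real system needs at least 5-10 per term.
samples = {
    "Banking": [
        "The bank raised interest rates on deposits and loans.",
        "Regulators reviewed the lending practices of several banks.",
    ],
    "Shrubs": [
        "Prune the shrub in early spring before new leaves appear.",
        "Evergreen shrubs need well-drained soil and partial shade.",
    ],
}
profiles = train(samples)
result = categorize("The central bank cut interest rates again.", profiles)
```

With so few samples per term the profiles are dominated by incidental words, which is one reason overview or encyclopedic articles are preferred when only 5-10 documents are available.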
Taxonomies & Auto-Indexing
[Figure: Machine-learning auto-categorization]
Taxonomies & Auto-Indexing
Rules-based auto-categorization:
  Taxonomist must write rules for each taxonomy
  term
  Similar to advanced Boolean searching

bush
IF (INITIAL CAPS AND (MENTIONS "president*" OR WITH
    "administration*" OR AROUND "white house" OR NEAR
    "george"))

USE U.S. president
ELSE USE Shrubs
ENDIF
(Source: Data Harmony)
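
For readers more familiar with general-purpose code, the rule above can be approximated in Python. This is only a sketch under stated assumptions: the proximity operators MENTIONS, WITH, AROUND, and NEAR have their own meanings in the vendor's rule language, which are not reproduced here; all four are treated as "within `window` words of the match", and the window size and the function name `index_bush` are invented for illustration.

```python
import re

# Clue patterns taken from the rule above; a trailing * becomes a prefix match.
CLUES = [r"\bpresident\w*", r"\badministration\w*", r"\bwhite house\b", r"\bgeorge"]


def index_bush(text, window=10):
    """Pick a taxonomy term for a document that contains the word "bush".

    Assumption: every proximity operator in the rule is approximated by
    "within `window` words of the matched word". The caller is expected
    to invoke this only for documents that contain "bush".
    """
    words = re.findall(r"[A-Za-z]+", text)
    for position, word in enumerate(words):
        # INITIAL CAPS: the matched word must start with an uppercase letter.
        if word.lower() != "bush" or not word[0].isupper():
            continue
        start = max(0, position - window)
        context = " ".join(words[start:position + window + 1]).lower()
        if any(re.search(clue, context) for clue in CLUES):
            return "U.S. president"
    # ELSE branch of the rule.
    return "Shrubs"
```

A capitalized "Bush" at the start of a sentence about gardening still falls through to Shrubs, because none of the clue words appear nearby.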


Taxonomy Creation Comparison
 Differences in taxonomy terms
 Differences in term relationships
 Differences in term notes, definitions
 Differences in synonyms/variants




Differences in Taxonomy Terms
 For human indexing
  Create terms as specific (granular) as the
   content will support and users will expect.
 For auto-indexing
   Cannot have subtle differences between
   preferred terms:
   International relations; Foreign policy
   Avoid creating both action and topic terms:
   Investing; Investments

Differences in Term Relationships
  Hierarchical (broader/narrower) links
  Associative (related terms) links
  For human indexing
  Highly useful to the indexer, as to the end-user, in finding the
     best term. Consider indexer behavior.
  For auto-indexing
  Not needed, but could be utilized in search results:
     Broader terms recursively include narrower term
     results
     Related terms display as suggestions
     Consider search results.
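
Recursive inclusion of narrower-term results can be sketched as below. The hierarchy, the document IDs, and the function names are hypothetical, and the sketch assumes the hierarchy contains no cycles.

```python
# Hypothetical hierarchy: each term maps to its direct narrower terms.
NARROWER = {
    "Investments": ["Stocks", "Bonds"],
    "Bonds": ["Municipal bonds"],
}

# Hypothetical index: each term maps to the IDs of documents tagged with it.
INDEXED = {
    "Investments": {1},
    "Stocks": {2},
    "Bonds": {3},
    "Municipal bonds": {4},
}


def expand(term, narrower=NARROWER):
    """Return the term plus all of its narrower terms, at every level."""
    terms = [term]
    for child in narrower.get(term, []):
        terms.extend(expand(child, narrower))
    return terms


def search(term, indexed=INDEXED, narrower=NARROWER):
    """Collect documents tagged with the term or any of its narrower terms."""
    results = set()
    for candidate in expand(term, narrower):
        results |= indexed.get(candidate, set())
    return results
```

Here a search on a broad term returns everything indexed beneath it, even though the auto-indexer itself never consulted the hierarchy.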
Differences in Term Relationships
  Facets
Certain facets may work better with human
 indexing than with auto-indexing.
 Automated indexing may not distinguish
 between different facet meanings of a term.
Examples:
  Mergers - Action/Event or Business Topic?
  Churches – Place or Organization type?



Differences in Term Notes
Concise explanatory notes (not a dictionary
   definition) on some terms, as needed:
1.   To restrict or expand the application of a term
2.   To distinguish between terms of overlapping meaning
     (may have reciprocal notes)
3.   To provide advice on term usage

For the end-user, optional aid
For indexing:
     often needed for some terms for human indexing
     never needed for auto-indexing
May have notes for indexers that are not for end-users.

Differences in Term Notes
Scope Notes examples
ProQuest Controlled Vocabulary:

Occupational health
SN: Employer activities designed to protect and promote the health and
     safety of employees on the job
Inequality
SN: Socioeconomic disparity stemming from racial, cultural, or social
     bias

Medical Subject Headings (MeSH):

Nonverbal Communication
Annotation: human only; for animals use ANIMAL COMMUNICATION
    or VOCALIZATION, ANIMAL

Differences in Synonyms/Variants
Non-preferred terms. Types include:
  synonyms: Cars USE Automobiles
  near-synonyms: Junior high USE Middle school
  variant spellings: Defence USE Defense
  lexical variants: Hair loss USE Baldness
  foreign language terms: Luftwaffe USE German Air Force
  acronyms/spelled out forms: UN USE United Nations
  scientific/technical names: Neoplasms USE Cancer
  antonyms: Misbehavior USE Behavior
  narrower terms and instances that are not preferred
  terms: Power hand drills USE Power hand tools

  Each preferred term may have multiple non-preferred
  terms.
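
The USE references above amount to a many-to-one lookup from non-preferred to preferred terms. A minimal sketch, assuming a plain in-memory dictionary (the function names `resolve` and `variants_of` are hypothetical):

```python
# USE references from the examples above: non-preferred term -> preferred term.
USE = {
    "Cars": "Automobiles",
    "Junior high": "Middle school",
    "Defence": "Defense",
    "Hair loss": "Baldness",
    "Luftwaffe": "German Air Force",
    "UN": "United Nations",
    "Neoplasms": "Cancer",
    "Misbehavior": "Behavior",
    "Power hand drills": "Power hand tools",
}

# Case-insensitive lookup table built once from the references.
_LOOKUP = {entry.lower(): preferred for entry, preferred in USE.items()}


def resolve(term):
    """Return the preferred term for an entry; unknown entries pass through."""
    cleaned = term.strip()
    return _LOOKUP.get(cleaned.lower(), cleaned)


def variants_of(preferred):
    """List every non-preferred term that points to a preferred term."""
    return sorted(entry for entry, target in USE.items() if target == preferred)
```

Because the mapping is many-to-one, adding another variant for an existing preferred term is a one-line change that leaves the preferred term untouched.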
Differences in Synonyms/Variants
For human indexing
 “Shortcuts”: unique abbreviations (2-3 letters) within each
 facet for commonly entered terms
    For countries, states; industry codes
    For within a facet of limited size – memorizable
  Examples:
    mna – Mergers & acquisitions
    bnk – Banking
    fr – France

 Phrase inversions for alphabetical browsing
 Example: Photography, digital
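
Both aids for human indexers are simple to model. The sketch below is illustrative only: the facet names "Subject" and "Country" are assumptions, the shortcut codes are the ones listed above, and `invert` naively files a phrase under its last word and lowercases the rest, which would not suit proper nouns.

```python
# Shortcut codes are unique only within a facet, so they are keyed by facet.
# The facet names are hypothetical; the codes come from the examples above.
SHORTCUTS = {
    "Subject": {"mna": "Mergers & acquisitions", "bnk": "Banking"},
    "Country": {"fr": "France"},
}


def expand_shortcut(facet, entry):
    """Expand a shortcut typed by an indexer; other entries pass through."""
    return SHORTCUTS.get(facet, {}).get(entry.lower(), entry)


def invert(phrase):
    """Invert a phrase so that it files under its last word."""
    modifier, _, head = phrase.rpartition(" ")
    if not modifier:
        return phrase  # single word: nothing to invert
    return f"{head.capitalize()}, {modifier.lower()}"
```

Keying shortcuts by facet is what allows the same short code to be reused in another facet without ambiguity.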

Differences in Synonyms/Variants
For Auto-indexing
If machine-learning auto-categorization:
   Need greater number of non-preferred terms
   Can include non-noun phrases
Example:
For human indexing:
   Presidential candidates
   Candidates, presidential
For auto-indexing:
   Presidential candidate
   Presidential candidacy
   Candidate for president
   Candidacy for president
   Presidential hopeful
   Running for president
   Campaigning for president
   Presidential nominee
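
To show why the longer variant list matters, here is a small sketch that tags text by literal phrase matching. This is a deliberate simplification, not how a machine-learning categorizer works internally, and it assumes that "Presidential candidates" is the preferred term for all the variants listed; the function name `auto_index` is hypothetical.

```python
import re

# Variants from the list above, assumed to point to one preferred term.
VARIANTS = {
    "Presidential candidates": [
        "presidential candidate",
        "presidential candidacy",
        "candidate for president",
        "candidacy for president",
        "presidential hopeful",
        "running for president",
        "campaigning for president",
        "presidential nominee",
    ],
}


def auto_index(text, variants=VARIANTS):
    """Return the preferred terms whose name or variants occur in the text."""
    lowered = text.lower()
    matched = []
    for preferred, phrases in variants.items():
        candidates = [preferred.lower()] + phrases
        # Whole-phrase match on word boundaries, case-insensitive.
        if any(re.search(r"\b" + re.escape(p) + r"\b", lowered) for p in candidates):
            matched.append(preferred)
    return matched
```

A human indexer would recognize "running for president" as the same concept without help; the software only does so because the verb phrase was added as a variant.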
Taxonomy Creation Summary
Human indexing
  Rich relationships between terms
  Term notes for clarification
  Common-use shortcuts
  Phrase inversions as term variants
  Also:
    Browsable (A-Z) display
    Multiple ways to search (beginning of term, word
    within term, etc.)
Auto-indexing
  Cannot have subtle differences between terms
  Avoid creating action-type terms
  Be careful with facets
  Need more, varied non-preferred terms, including
  non-noun phrases


Additional Work for the Taxonomist
Human indexing
 Inform indexers of newly added terms
 Adjustments based on review of indexers’ work:
    If terms are overlooked (not used):
    - Create more non-preferred terms
    - Create more related-term links
    If terms are misused:
    - Re-word terms
    - Add scope notes
Auto-indexing
 Continual update work, for each new term:
    Add new training documents, or
    Write new rules
 Adjustments based on inappropriate results:
    Add, delete, edit training documents
    Tweak existing rules
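
Spotting overlooked terms when reviewing indexers' work can be partly automated by counting term usage. A minimal sketch, assuming indexing records are available as one list of assigned terms per document (the data and the function name `overlooked_terms` are hypothetical):

```python
from collections import Counter


def overlooked_terms(taxonomy_terms, indexing_records, max_uses=0):
    """Return taxonomy terms used no more than max_uses times by indexers.

    indexing_records is a list of term lists, one list per indexed document.
    """
    usage = Counter(term for record in indexing_records for term in record)
    return sorted(t for t in taxonomy_terms if usage[t] <= max_uses)


# Hypothetical taxonomy and indexing records.
taxonomy = ["Banking", "Foreign policy", "International relations", "Investments"]
records = [
    ["Banking", "Investments"],
    ["International relations"],
    ["Banking"],
]
unused = overlooked_terms(taxonomy, records)
```

A term that shows up in this list is a candidate for more non-preferred terms or related-term links rather than for immediate deletion.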
Resources
 American Society for Indexing
 www.asindexing.org
 Taxonomies & Controlled Vocabularies SIG of
 the American Society for Indexing
 www.taxonomies-sig.org
 "Taxonomies and Controlled Vocabularies"
 Simmons College Graduate School of Library and
 Information Science Continuing Education Program
    onsite workshop (October 25, 2008, Boston)
    online workshop (February 2009)
 www.simmons.edu/gslis/continuinged/workshops

Contact
Heather Hedden
Hedden Information Management
98 East Riding Dr.
Carlisle, MA 01741

978-371-0822
978-467-5195 (mobile)
Heather@hedden.net
www.hedden-information.com





Weitere ähnliche Inhalte

Was ist angesagt?

Managing Taxonomy Tagging
Managing Taxonomy TaggingManaging Taxonomy Tagging
Managing Taxonomy TaggingHeather Hedden
 
Mapping Taxonomies, Thesauri, and Ontologies
Mapping Taxonomies, Thesauri, and OntologiesMapping Taxonomies, Thesauri, and Ontologies
Mapping Taxonomies, Thesauri, and OntologiesHeather Hedden
 
Selecting Software for Taxonomy, Thesaurus and Ontology Management
Selecting Software for Taxonomy, Thesaurus and Ontology ManagementSelecting Software for Taxonomy, Thesaurus and Ontology Management
Selecting Software for Taxonomy, Thesaurus and Ontology ManagementHeather Hedden
 
Benefits of Taxonomies
Benefits of TaxonomiesBenefits of Taxonomies
Benefits of TaxonomiesHeather Hedden
 
Taxonomies in Support of Search
Taxonomies in Support of SearchTaxonomies in Support of Search
Taxonomies in Support of SearchHeather Hedden
 

Was ist angesagt? (9)

Managing Taxonomy Tagging
Managing Taxonomy TaggingManaging Taxonomy Tagging
Managing Taxonomy Tagging
 
Taxonomy Fundamentals Workshop
Taxonomy Fundamentals WorkshopTaxonomy Fundamentals Workshop
Taxonomy Fundamentals Workshop
 
Taxonomy 101
Taxonomy 101Taxonomy 101
Taxonomy 101
 
Taxonomy and seo sla 05-06-10(jc)
Taxonomy and seo   sla 05-06-10(jc)Taxonomy and seo   sla 05-06-10(jc)
Taxonomy and seo sla 05-06-10(jc)
 
Mapping Taxonomies, Thesauri, and Ontologies
Mapping Taxonomies, Thesauri, and OntologiesMapping Taxonomies, Thesauri, and Ontologies
Mapping Taxonomies, Thesauri, and Ontologies
 
Ontology and Other Semantic Options
Ontology and Other Semantic OptionsOntology and Other Semantic Options
Ontology and Other Semantic Options
 
Selecting Software for Taxonomy, Thesaurus and Ontology Management
Selecting Software for Taxonomy, Thesaurus and Ontology ManagementSelecting Software for Taxonomy, Thesaurus and Ontology Management
Selecting Software for Taxonomy, Thesaurus and Ontology Management
 
Benefits of Taxonomies
Benefits of TaxonomiesBenefits of Taxonomies
Benefits of Taxonomies
 
Taxonomies in Support of Search
Taxonomies in Support of SearchTaxonomies in Support of Search
Taxonomies in Support of Search
 

Ähnlich wie Taxonomies for Human vs Auto-Indexing

Mapping, Merging, and Multilingual Taxonomies
Mapping, Merging, and Multilingual TaxonomiesMapping, Merging, and Multilingual Taxonomies
Mapping, Merging, and Multilingual TaxonomiesHeather Hedden
 
Content Management, Metadata and Semantic Web
Content Management, Metadata and Semantic WebContent Management, Metadata and Semantic Web
Content Management, Metadata and Semantic WebAmit Sheth
 
Content Management, Metadata and Semantic Web
Content Management, Metadata and Semantic WebContent Management, Metadata and Semantic Web
Content Management, Metadata and Semantic WebAmit Sheth
 
Taxonomy 101: Classifying DITA Tasks
Taxonomy 101: Classifying DITA TasksTaxonomy 101: Classifying DITA Tasks
Taxonomy 101: Classifying DITA TaskseasyDITA
 
Identifying Security Risks Using Auto-Tagging and Text Analytics
Identifying Security Risks Using Auto-Tagging and Text AnalyticsIdentifying Security Risks Using Auto-Tagging and Text Analytics
Identifying Security Risks Using Auto-Tagging and Text AnalyticsEnterprise Knowledge
 
IWMW 2002: The Value of Metadata and How to Realise It
IWMW 2002: The Value of Metadata and How to Realise ItIWMW 2002: The Value of Metadata and How to Realise It
IWMW 2002: The Value of Metadata and How to Realise ItIWMW
 
Dita webinar 20th march
Dita webinar 20th marchDita webinar 20th march
Dita webinar 20th marchMetapercept
 
Successful Content Management Through Taxonomy And Metadata Design
Successful Content Management Through Taxonomy And Metadata DesignSuccessful Content Management Through Taxonomy And Metadata Design
Successful Content Management Through Taxonomy And Metadata Designsarakirsten
 
3 25 11 Term Store Best Practices
3 25 11 Term Store Best Practices3 25 11 Term Store Best Practices
3 25 11 Term Store Best Practicespuckmiller3
 
Simplified Technical English: How Standardizing Content Saves Translation Cos...
Simplified Technical English: How Standardizing Content Saves Translation Cos...Simplified Technical English: How Standardizing Content Saves Translation Cos...
Simplified Technical English: How Standardizing Content Saves Translation Cos...Scott Abel
 
XXIX Charleston 2009 Silverchair Kerner
XXIX Charleston 2009 Silverchair KernerXXIX Charleston 2009 Silverchair Kerner
XXIX Charleston 2009 Silverchair KernerDarrell W. Gunter
 
TEXT MINING-TAPPING HIDDEN KERNELS OF WISDOM
TEXT MINING-TAPPING HIDDEN KERNELS OF WISDOMTEXT MINING-TAPPING HIDDEN KERNELS OF WISDOM
TEXT MINING-TAPPING HIDDEN KERNELS OF WISDOMITC Infotech
 
Julie glanville embase sunrise seminar may 2016
Julie glanville embase sunrise seminar may 2016Julie glanville embase sunrise seminar may 2016
Julie glanville embase sunrise seminar may 2016Ann-Marie Roche
 
SWT Lecture Session 7 - Advanced uses of RDFS
SWT Lecture Session 7 - Advanced uses of RDFSSWT Lecture Session 7 - Advanced uses of RDFS
SWT Lecture Session 7 - Advanced uses of RDFSMariano Rodriguez-Muro
 
FEDSPUG Meeting: Intelligent Metadata and Auto-classification in Records Mana...
FEDSPUG Meeting: Intelligent Metadata and Auto-classification in Records Mana...FEDSPUG Meeting: Intelligent Metadata and Auto-classification in Records Mana...
FEDSPUG Meeting: Intelligent Metadata and Auto-classification in Records Mana...Concept Searching, Inc
 
Using metadata repositories with search
Using metadata repositories with searchUsing metadata repositories with search
Using metadata repositories with searchJean Graef
 
Marlabs - Navigation vs Search Final
Marlabs - Navigation vs Search FinalMarlabs - Navigation vs Search Final
Marlabs - Navigation vs Search FinalMarlabs
 

Ähnlich wie Taxonomies for Human vs Auto-Indexing (20)

Mapping, Merging, and Multilingual Taxonomies
Mapping, Merging, and Multilingual TaxonomiesMapping, Merging, and Multilingual Taxonomies
Mapping, Merging, and Multilingual Taxonomies
 
User-Driven Taxonomies
User-Driven TaxonomiesUser-Driven Taxonomies
User-Driven Taxonomies
 
Content Management, Metadata and Semantic Web
Content Management, Metadata and Semantic WebContent Management, Metadata and Semantic Web
Content Management, Metadata and Semantic Web
 
Content Management, Metadata and Semantic Web
Content Management, Metadata and Semantic WebContent Management, Metadata and Semantic Web
Content Management, Metadata and Semantic Web
 
Taxonomy 101: Classifying DITA Tasks
Taxonomy 101: Classifying DITA TasksTaxonomy 101: Classifying DITA Tasks
Taxonomy 101: Classifying DITA Tasks
 
Identifying Security Risks Using Auto-Tagging and Text Analytics
Identifying Security Risks Using Auto-Tagging and Text AnalyticsIdentifying Security Risks Using Auto-Tagging and Text Analytics
Identifying Security Risks Using Auto-Tagging and Text Analytics
 
IWMW 2002: The Value of Metadata and How to Realise It
IWMW 2002: The Value of Metadata and How to Realise ItIWMW 2002: The Value of Metadata and How to Realise It
IWMW 2002: The Value of Metadata and How to Realise It
 
Dita webinar 20th march
Dita webinar 20th marchDita webinar 20th march
Dita webinar 20th march
 
Testing Taxonomies
Testing TaxonomiesTesting Taxonomies
Testing Taxonomies
 
Successful Content Management Through Taxonomy And Metadata Design
Successful Content Management Through Taxonomy And Metadata DesignSuccessful Content Management Through Taxonomy And Metadata Design
Successful Content Management Through Taxonomy And Metadata Design
 
3 25 11 Term Store Best Practices
3 25 11 Term Store Best Practices3 25 11 Term Store Best Practices
3 25 11 Term Store Best Practices
 
Simplified Technical English: How Standardizing Content Saves Translation Cos...
Simplified Technical English: How Standardizing Content Saves Translation Cos...Simplified Technical English: How Standardizing Content Saves Translation Cos...
Simplified Technical English: How Standardizing Content Saves Translation Cos...
 
Meta data
Meta dataMeta data
Meta data
 
XXIX Charleston 2009 Silverchair Kerner
XXIX Charleston 2009 Silverchair KernerXXIX Charleston 2009 Silverchair Kerner
XXIX Charleston 2009 Silverchair Kerner
 
TEXT MINING-TAPPING HIDDEN KERNELS OF WISDOM
TEXT MINING-TAPPING HIDDEN KERNELS OF WISDOMTEXT MINING-TAPPING HIDDEN KERNELS OF WISDOM
TEXT MINING-TAPPING HIDDEN KERNELS OF WISDOM
 
Julie glanville embase sunrise seminar may 2016
Julie glanville embase sunrise seminar may 2016Julie glanville embase sunrise seminar may 2016
Julie glanville embase sunrise seminar may 2016
 
SWT Lecture Session 7 - Advanced uses of RDFS
SWT Lecture Session 7 - Advanced uses of RDFSSWT Lecture Session 7 - Advanced uses of RDFS
SWT Lecture Session 7 - Advanced uses of RDFS
 
FEDSPUG Meeting: Intelligent Metadata and Auto-classification in Records Mana...
FEDSPUG Meeting: Intelligent Metadata and Auto-classification in Records Mana...FEDSPUG Meeting: Intelligent Metadata and Auto-classification in Records Mana...
FEDSPUG Meeting: Intelligent Metadata and Auto-classification in Records Mana...
 
Using metadata repositories with search
Using metadata repositories with searchUsing metadata repositories with search
Using metadata repositories with search
 
Marlabs - Navigation vs Search Final
Marlabs - Navigation vs Search FinalMarlabs - Navigation vs Search Final
Marlabs - Navigation vs Search Final
 

Mehr von Heather Hedden

Introduction to Knowledge Graphs for Information Architects.pdf
Introduction to Knowledge Graphs for Information Architects.pdfIntroduction to Knowledge Graphs for Information Architects.pdf
Introduction to Knowledge Graphs for Information Architects.pdfHeather Hedden
 
Thesauri for Indexing Support / Thesauri zur Unterstützung der Registererstel...
Thesauri for Indexing Support / Thesauri zur Unterstützung der Registererstel...Thesauri for Indexing Support / Thesauri zur Unterstützung der Registererstel...
Thesauri for Indexing Support / Thesauri zur Unterstützung der Registererstel...Heather Hedden
 
A Brief Introduction to SKOS
A Brief Introduction to SKOSA Brief Introduction to SKOS
A Brief Introduction to SKOSHeather Hedden
 
A Brief Introduction to Knowledge Graphs
A Brief Introduction to Knowledge GraphsA Brief Introduction to Knowledge Graphs
A Brief Introduction to Knowledge GraphsHeather Hedden
 
Taxonomy Design for SharePoint
Taxonomy Design for SharePointTaxonomy Design for SharePoint
Taxonomy Design for SharePointHeather Hedden
 
Taxonomies, Categories, and Tags in WordPress
Taxonomies, Categories, and Tags in WordPressTaxonomies, Categories, and Tags in WordPress
Taxonomies, Categories, and Tags in WordPressHeather Hedden
 
Customer-Focused Thesauri
Customer-Focused ThesauriCustomer-Focused Thesauri
Customer-Focused ThesauriHeather Hedden
 
Synonyms, Alternative Labels, and Nonpreferred Terms
Synonyms, Alternative Labels, and Nonpreferred TermsSynonyms, Alternative Labels, and Nonpreferred Terms
Synonyms, Alternative Labels, and Nonpreferred TermsHeather Hedden
 
Managing Mature Taxonomies: Resolving Orphan Terms
Managing Mature Taxonomies: Resolving Orphan TermsManaging Mature Taxonomies: Resolving Orphan Terms
Managing Mature Taxonomies: Resolving Orphan TermsHeather Hedden
 
Taxonomies for E-commerce
Taxonomies for E-commerceTaxonomies for E-commerce
Taxonomies for E-commerceHeather Hedden
 
Making Decisions in Creating Taxonomies
Making Decisions in Creating TaxonomiesMaking Decisions in Creating Taxonomies
Making Decisions in Creating TaxonomiesHeather Hedden
 

Mehr von Heather Hedden (12)

Introduction to Knowledge Graphs for Information Architects.pdf
Introduction to Knowledge Graphs for Information Architects.pdfIntroduction to Knowledge Graphs for Information Architects.pdf
Introduction to Knowledge Graphs for Information Architects.pdf
 
Thesauri for Indexing Support / Thesauri zur Unterstützung der Registererstel...
Thesauri for Indexing Support / Thesauri zur Unterstützung der Registererstel...Thesauri for Indexing Support / Thesauri zur Unterstützung der Registererstel...
Thesauri for Indexing Support / Thesauri zur Unterstützung der Registererstel...
 
A Brief Introduction to SKOS
A Brief Introduction to SKOSA Brief Introduction to SKOS
A Brief Introduction to SKOS
 
A Brief Introduction to Knowledge Graphs
A Brief Introduction to Knowledge GraphsA Brief Introduction to Knowledge Graphs
A Brief Introduction to Knowledge Graphs
 
Taxonomies for Users
Taxonomies for UsersTaxonomies for Users
Taxonomies for Users
 
Taxonomy Design for SharePoint
Taxonomy Design for SharePointTaxonomy Design for SharePoint
Taxonomy Design for SharePoint
 
Taxonomies, Categories, and Tags in WordPress
Taxonomies, Categories, and Tags in WordPressTaxonomies, Categories, and Tags in WordPress
Taxonomies, Categories, and Tags in WordPress
 
Customer-Focused Thesauri
Customer-Focused ThesauriCustomer-Focused Thesauri
Customer-Focused Thesauri
 
Synonyms, Alternative Labels, and Nonpreferred Terms
Synonyms, Alternative Labels, and Nonpreferred TermsSynonyms, Alternative Labels, and Nonpreferred Terms
Synonyms, Alternative Labels, and Nonpreferred Terms
 
Managing Mature Taxonomies: Resolving Orphan Terms
Managing Mature Taxonomies: Resolving Orphan TermsManaging Mature Taxonomies: Resolving Orphan Terms
Managing Mature Taxonomies: Resolving Orphan Terms
 
Taxonomies for E-commerce
Taxonomies for E-commerceTaxonomies for E-commerce
Taxonomies for E-commerce
 
Making Decisions in Creating Taxonomies
Making Decisions in Creating TaxonomiesMaking Decisions in Creating Taxonomies
Making Decisions in Creating Taxonomies
 

Kürzlich hochgeladen

Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 

Taxonomies for Human vs Auto-Indexing

  • 1. Taxonomies for Human vs. Auto-Indexing Taxonomy Boot Camp, September 25, 2008, San Jose, CA Heather Hedden Hedden Information Management heather@hedden.net
  • 2. Background
      Heather Hedden's taxonomy-creation experience
        For human indexing
          Developed controlled vocabularies for periodical article index databases (Gale)
        For auto-indexing
          Developed taxonomies for integration within an enterprise search software product for corporate content and web page searching (Viziant)
          Matched controlled vocabulary to keywords for consumer online products/services directories (various "yellow pages" clients)
        For either
          Created enterprise taxonomies for corporate web sites and intranets for site navigation (Earley & Associates)
      © 2008 Hedden Information Management
  • 3. Outline
      Taxonomies & Indexing Background
      Choosing Human vs. Auto-indexing
      Taxonomies & Human Indexing
      Taxonomies & Auto-indexing
      Taxonomy Creation Comparison
        Differences in taxonomy terms
        Differences in term relationships
        Differences in definitions & notes
        Differences in synonyms/variants
      Additional Work for the Taxonomist
      Resources
  • 4. Taxonomies & Indexing
      Types of Taxonomies
        1. Organization, classification, navigation support
           – more emphasis on hierarchies
        2. Search and retrieval support
           – more emphasis on synonyms
      For indexing, use #2 above: search & retrieval support taxonomies
  • 5. Taxonomies & Indexing
      Search & retrieval taxonomies:
        Connect users to desired content by means of a common nomenclature/terminology/vocabulary
        Matching between:
          1. the vocabulary of the users
          2. the vocabulary of the content
        Taxonomies interface with:
          1. the users
          2. the content
      Indexing/tagging/categorization deals with #2 in each case above: the connection of taxonomy to content
  • 6. Taxonomies & Indexing
      Indexing/tagging/categorization:
        Indexing
          – done by (trained) indexers
          – creating a (browsable) index
        Tagging
          – done by any person
          – applying labels, metatag descriptors to documents to be picked up by database or search software
          – may not require a taxonomy/controlled vocabulary
        Categorization
          – done more systematically/automatically
          – putting documents into (pre-defined) categories
          – often within facets
  • 7. Choosing Human vs. Auto-indexing: The Content
      Human indexing                                        Auto-indexing
      Manageable number of documents                        Very large number of documents
      Includes non-text files                               Text files only
      Varied and undifferentiated document types/formats    Common document types/formats (or pre-tagged types)
      Varied subject areas                                  Focused subject areas (legal, medical, etc.)
  • 8. Choosing Human vs. Auto-indexing: The Culture
      Human indexing                                        Auto-indexing
      Higher accuracy in indexing                           Greater volume indexed
                                                            Greater speed in indexing
      Invest in people                                      Invest in technology
      Low-tech: can build your own indexing UI or buy       High-tech: must purchase auto-indexing software
      Internal control, or outsourcing vendor relationship  Software vendor relationship
  • 9. Taxonomies & Human Indexing
      Who are indexers?
        Specialists or not:
          the taxonomist and/or other information specialists, librarians
          dedicated hired indexers (with or without prior indexing experience)
          supplemental work for other staff (editors, writers, administrators)
        One person or multiple people
        Usually in-house but could be contracted out
  • 10. Taxonomies & Human Indexing
      Indexing software/module
      Indexing user interface optimized for ease, speed, and accuracy in indexing
      Method for indexers to nominate new taxonomy terms
      Training & documentation for indexers
      Indexing policy guidelines
      Method to communicate new and changed taxonomy terms to indexers
      Method for checking and quality control
  • 13. Taxonomies & Auto-Indexing
      Technologies
        Entity extraction
        Text mining and text analytics
        Auto-categorization or auto-classification utilizing taxonomies:
          1. Machine-learning and training documents
          2. Rules-based categorization
  • 14. Taxonomies & Auto-Indexing
      Machine-learning auto-categorization:
        Complex mathematical algorithms are created.
        The taxonomist must then provide several (at least 5-10) representative sample documents for each taxonomy term to “train” the automated indexing system.
        If using only 5-10 documents, then profile/overview, encyclopedic articles are best.
        If pre-indexed records exist (i.e. converting from human to automated indexing), then hundreds of varied documents can be used for each term.
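The training step on slide 14 can be illustrated with a toy example. The sketch below is not how any commercial auto-categorization product works; it assumes a simple bag-of-words profile per taxonomy term and cosine similarity for matching, standing in for the "complex mathematical algorithms" the slide mentions. The sample documents, the stopword list, and the function names (`train`, `categorize`) are all invented for illustration, and real systems would use the 5-10 or more documents per term that the slide recommends.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list; real systems use much larger ones.
STOPWORDS = {"the", "a", "an", "in", "at", "his", "to", "or", "each", "of", "and"}


def tokenize(text):
    """Lowercase the text and keep alphabetic words that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def train(samples):
    """Build one word-count profile per taxonomy term from its sample documents."""
    profiles = {}
    for term, docs in samples.items():
        counts = Counter()
        for doc in docs:
            counts.update(tokenize(doc))
        profiles[term] = counts
    return profiles


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def categorize(text, profiles):
    """Return the taxonomy term whose profile is most similar to the text."""
    vector = Counter(tokenize(text))
    return max(profiles, key=lambda term: cosine(vector, profiles[term]))


# Invented training documents (two per term only to keep the sketch short).
TRAINING = {
    "U.S. president": [
        "The president met his administration at the White House.",
        "President Bush signed the bill in Washington.",
    ],
    "Shrubs": [
        "Prune each shrub or bush in early spring to encourage growth.",
        "A flowering bush needs well drained garden soil.",
    ],
}

profiles = train(TRAINING)
print(categorize("The White House said the president will veto the bill", profiles))
print(categorize("Prune the flowering bush in spring", profiles))
```

With so few training documents the profiles are fragile, which is why the slide advises broad overview or encyclopedic articles when only a handful of samples is available.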
  • 15. Taxonomies & Auto-Indexing
      Machine-learning auto-categorization
  • 16. Taxonomies & Auto-Indexing
      Rules-based auto-categorization:
        Taxonomist must write rules for each taxonomy term
        Similar to advanced Boolean searching
      Example rule for "bush" (Data Harmony):
        IF (INITIAL CAPS AND (MENTIONS "president*" OR WITH "administration*" OR AROUND "white house" OR NEAR "george"))
          USE U.S. president
        ELSE
          USE Shrubs
        ENDIF
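To make the logic of the "bush" rule on slide 16 concrete, here is a rough Python approximation. It is not the Data Harmony rule engine, and the exact semantics of the MENTIONS, WITH, AROUND, and NEAR operators are not given in the slides. This sketch assumes that MENTIONS means "anywhere in the text" and treats the other three as "within a few words of the trigger word"; the function name `categorize_bush` and the `window` parameter are hypothetical.

```python
import re


def categorize_bush(text, window=5):
    """Return one taxonomy term per occurrence of the word 'bush' in the text.

    Assumption: MENTIONS = anywhere in the text; WITH/AROUND/NEAR = within
    `window` words of the trigger word. The real operators may differ.
    """
    words = re.findall(r"[A-Za-z']+", text)
    lowered = [w.lower() for w in words]
    results = []
    for i, word in enumerate(words):
        if word.lower() != "bush":
            continue
        nearby = lowered[max(0, i - window): i + window + 1]
        context = " ".join(nearby)
        has_cue = (
            any(w.startswith("president") for w in lowered)       # MENTIONS "president*"
            or any(w.startswith("administration") for w in nearby)  # WITH "administration*"
            or "white house" in context                              # AROUND "white house"
            or "george" in nearby                                    # NEAR "george"
        )
        if word[0].isupper() and has_cue:  # INITIAL CAPS AND (...)
            results.append("U.S. president")
        else:
            results.append("Shrubs")
    return results


print(categorize_bush("George Bush spoke at the White House."))
print(categorize_bush("Trim the bush every spring."))
```

A sentence-initial "Bush" with no supporting cue still falls through to Shrubs, which shows why the rule combines capitalization with context conditions rather than relying on either alone.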
  • 17. Taxonomy Creation Comparison
      Differences in taxonomy terms
      Differences in term relationships
      Differences in term notes, definitions
      Differences in synonyms/variants
  • 18. Differences in Taxonomy Terms
      For human indexing
        Create terms as specific (granular) as the content will support and users will expect.
      For auto-indexing
        Cannot have subtle differences between preferred terms: International relations; Foreign policy
        Avoid creating both action and topic terms: Investing; Investments
  • 19. Differences in Term Relationships
      Hierarchical (broader/narrower) links
      Associative (related terms) links
      For human indexing
        Highly useful to the indexer, as to the end-user, in finding the best term.
        Consider indexer behavior.
      For auto-indexing
        Not needed, but could be utilized in search results:
          Broader terms recursively include narrower term results
          Related terms display as suggestions
        Consider search results.
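The two search-result uses of relationships on slide 19 can be sketched as follows. This is a minimal illustration under assumed data: the hierarchy, the related-term links, and the document IDs are invented, and `search` and `suggestions` are hypothetical helper names rather than functions of any particular search product.

```python
# Hypothetical broader -> narrower links.
NARROWER = {
    "Vehicles": ["Automobiles", "Motorcycles"],
    "Automobiles": ["Electric cars"],
}

# Hypothetical associative (related-term) links.
RELATED = {"Automobiles": ["Automobile industry"]}

# Hypothetical index: documents tagged directly with each term.
INDEX = {
    "Vehicles": {"doc1"},
    "Automobiles": {"doc2", "doc3"},
    "Electric cars": {"doc4"},
    "Motorcycles": {"doc5"},
}


def search(term, seen=None):
    """Return documents indexed with the term or any of its narrower terms."""
    seen = set() if seen is None else seen
    if term in seen:  # guard against accidental cycles in the hierarchy
        return set()
    seen.add(term)
    docs = set(INDEX.get(term, ()))
    for child in NARROWER.get(term, ()):
        docs |= search(child, seen)
    return docs


def suggestions(term):
    """Return related terms to display alongside the results."""
    return RELATED.get(term, [])


print(sorted(search("Vehicles")))
print(sorted(search("Automobiles")))
print(suggestions("Automobiles"))
```

A search on the broader term "Vehicles" returns everything tagged with its descendants, while "Automobiles" returns only its own branch plus a related-term suggestion.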
  • 20. Differences in Term Relationships
      Facets
        Certain facets may work better with human indexing than with auto-indexing.
        Automated indexing may not distinguish between different facet meanings of a term.
        Examples:
          Mergers: Action/Event or Business Topic?
          Churches: Place or Organization type?
  • 21. Differences in Term Notes
      Concise explanatory notes (not a dictionary definition) on some terms, as needed:
        1. To restrict or expand the application of a term
        2. To distinguish between terms of overlapping meaning (may have reciprocal notes)
        3. To provide advice on term usage
      For the end-user: optional aid
      For indexing:
        often needed for some terms for human indexing
        never needed for auto-indexing
      May have notes for indexers that are not for end-users.
  • 22. Differences in Term Notes
      Scope Notes examples
      ProQuest Controlled Vocabulary:
        Occupational health
          SN: Employer activities designed to protect and promote the health and safety of employees on the job
        Inequality
          SN: Socioeconomic disparity stemming from racial, cultural, or social bias
      Medical Subject Headings (MeSH):
        Nonverbal Communication
          Annotation: human only; for animals use ANIMAL COMMUNICATION or VOCALIZATION, ANIMAL
  • 23. Differences in Synonyms/Variants
      Non-preferred terms. Types include:
        synonyms: Cars USE Automobiles
        near-synonyms: Junior high USE Middle school
        variant spellings: Defence USE Defense
        lexical variants: Hair loss USE Baldness
        foreign language terms: Luftwaffe USE German Air Force
        acronyms/spelled out forms: UN USE United Nations
        scientific/technical names: Neoplasms USE Cancer
        antonyms: Misbehavior USE Behavior
        narrower terms and instances that are not preferred terms: Power hand drills USE Power hand tools
      Each preferred term may have multiple non-preferred terms.
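The USE references on slide 23 amount to a lookup from non-preferred terms to preferred terms. The sketch below shows one way to store and resolve them, using the slide's own examples as data; the case-insensitive matching and the names `USE`, `PREFERRED`, and `resolve` are illustrative assumptions, not part of any specific taxonomy tool.

```python
# Non-preferred term -> preferred term, taken from the examples on slide 23.
USE = {
    "Cars": "Automobiles",
    "Junior high": "Middle school",
    "Defence": "Defense",
    "Hair loss": "Baldness",
    "Luftwaffe": "German Air Force",
    "UN": "United Nations",
    "Neoplasms": "Cancer",
    "Misbehavior": "Behavior",
    "Power hand drills": "Power hand tools",
}

# Case-insensitive lookup tables (an assumption made for this sketch).
_NON_PREFERRED = {key.casefold(): value for key, value in USE.items()}
PREFERRED = {value.casefold(): value for value in USE.values()}


def resolve(entry):
    """Map a user or indexer entry to its preferred term, or None if unknown."""
    key = entry.strip().casefold()
    if key in _NON_PREFERRED:
        return _NON_PREFERRED[key]
    return PREFERRED.get(key)


print(resolve("cars"))
print(resolve("UN"))
print(resolve("Baldness"))
print(resolve("Trucks"))
```

Because each preferred term may have several non-preferred terms, the mapping is many-to-one: adding more variants only adds keys, never changes the set of preferred terms.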
  • 24. Differences in Synonyms/Variants
      For human indexing
        “Shortcuts”: unique abbreviations within each facet (2-3 letters) for commonly entered terms
          For countries, states; industry codes
          For within a facet of limited size (memorizable)
          Examples:
            mna – Mergers & acquisitions
            bnk – Banking
            fr – France
        Phrase inversions for alphabetical browsing
          Example: Photography, digital
  • 25. Differences in Synonyms/Variants
      For auto-indexing
        If machine-learning auto-categorization:
          Need greater number of non-preferred terms
          Can include non-noun phrases
      Example variants for auto-indexing:
        Presidential candidate
        Presidential candidacy
        Candidate for president
        Candidacy for president
        Presidential hopeful
        Running for president
        Campaigning for president
        Presidential nominee
      Example variants for human indexing:
        Presidential candidates
        Candidates, presidential
  • 26. Taxonomy Creation Summary
      Human indexing
        Rich relationships between terms
        Term notes for clarification
        Common-use shortcuts
        Phrase inversions as term variants
        Also:
          Browsable (A-Z) display
          Multiple ways to search (beginning of term, word within term, etc.)
      Auto-indexing
        Cannot have subtle differences between terms
        Avoid creating action-type terms
        Be careful with facets
        Need more, varied non-preferred terms, including non-noun phrases
  • 27. Additional Work for the Taxonomist
      Human indexing
        Inform indexers of newly added terms
        Adjustments based on review of indexers’ work:
          If terms are overlooked (not used):
            - Create more non-preferred terms
            - Create more related-term links
          If terms are misused:
            - Re-word terms
            - Add scope notes
      Auto-indexing
        Continual update work, for each new term:
          Add new training documents, or
          Write new rules
        Adjustments based on inappropriate results:
          Add, delete, edit training documents
          Tweak existing rules
  • 28. Resources
      American Society for Indexing
        www.asindexing.org
      Taxonomies & Controlled Vocabularies SIG of the American Society for Indexing
        www.taxonomies-sig.org
      "Taxonomies and Controlled Vocabularies", Simmons College Graduate School of Library and Information Science Continuing Education Program
        onsite workshop (October 25, 2008, Boston)
        online workshop (February 2009)
        www.simmons.edu/gslis/continuinged/workshops
  • 29. Contact
      Heather Hedden
      Hedden Information Management
      98 East Riding Dr.
      Carlisle, MA 01741
      978-371-0822
      978-467-5195 (mobile)
      Heather@hedden.net
      www.hedden-information.com