Slides from the London Taxonomy Bootcamp 2016. Discusses top down and bottom up approaches for defining taxonomies, and demonstrates using natural language processing to automate the discovery of metadata in SharePoint documents.
5. TOP DOWN - APPROACH
• Defines top level containers and works downwards.
• Usually broad (3-10 wide) and shallow (3-4 deep)
• Simple, high level classification (functional)
6. TOP DOWN – TERMS
• Manually defined or replicated from existing structures
• Imported from other systems
• Industry standards / purchased taxonomies
7. TOP DOWN – SUMMARY
• People / committee driven approach
• Some guesswork about what the terms should be
• Simple, high level classification (functional)
– Way better than folders!
8. BOTTOM UP - APPROACH
• Terms driven by the words and phrases within your content
• More complex taxonomies
• Detailed, accurate terms that are subject or facet level
9. BOTTOM UP - TERMS
• Manual analysis of the documents
• Statistical analysis of terms and phrases
• Natural language processing
10. BOTTOM UP - SUMMARY
• Technology driven approach (or a very tough people process)
• Produces detailed taxonomies that reflect the actual content
• Extra granularity of tagging
11. AND THE WINNER IS…
• Combining top down and bottom up is the best approach
• Top down classifies the type of documents
• Bottom up classifies the subject of the document
• New technology allows bottom up to be realistic
12. TermSet adds accurate, consistent metadata without placing any burden on end users or your IT team.
Builds taxonomies (bottom up) using NLP
Applies tags
Metadata as a service™
16. MANUAL TAGGING
• Adoption problem
• Asbestos problem / GIGO
• Challenging to do retrospectively (migration tools can help)
17. MANUAL TAGGING
• Infer as many terms as possible from: Document types, Location, Function
• Mandate as few tags as possible
• Stay shallow or flat with hierarchies
18. MACHINE TAGGING
• Simple machine tagging can use search to match taxonomy terms to the content of documents
• More advanced taggers allow rules or weights to be assigned to each tag (tags are not context aware)
• New technologies (NLP) provide a new approach to creating taxonomies
19. TERMSET TAGGING
• TermSet recommends the right taxonomies for each library (context aware tagging)
• TermSet automates building the underlying IA in SharePoint
• Extra cool NLP tags can be added (Summaries, Sentiment and Language)
• Monitors for new documents and terms arriving into your world
21. WRAP UP
• TermSet automates a bottom up approach to create and use taxonomies for SharePoint
• Visit www.termset.com or e-mail brendan@termset.com for a free licence
• If you need assistance with top down taxonomies or you use a different DMS e-mail me to join the beta program for www.taxononica.com
Editor's Notes
A top down approach defines containers for terms, usually starting with some global taxonomies such as locations, departments or products (used throughout the business).
Lots of level 1 and 2 term sets define the function of the document, for example Departments -> HR.
Level 3 may begin to define the content itself, for example Departments -> HR -> Policy Documents.
Works well to classify content into the right areas. This is functional classification.
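As a rough illustration of the structure described above, a top down taxonomy can be sketched as nested dictionaries. Only the Departments -> HR -> Policy Documents path comes from these notes; every other term below is invented, and the helper names are hypothetical.

```python
# Hypothetical top down taxonomy expressed as nested dictionaries.
# Only "Departments -> HR -> Policy Documents" comes from the notes;
# the other terms are invented for illustration.
taxonomy = {
    "Departments": {
        "HR": {"Policy Documents": {}, "Contracts": {}},
        "Finance": {"Invoices": {}},
    },
    "Locations": {"London": {}, "Manchester": {}},
    "Products": {},
}


def depth(tree):
    """Number of levels below this node (an empty tree has depth 0)."""
    if not tree:
        return 0
    return 1 + max(depth(child) for child in tree.values())


def paths(tree, prefix=()):
    """Yield every term as a tuple of its ancestors plus itself."""
    for term, children in tree.items():
        path = prefix + (term,)
        yield path
        yield from paths(children, path)


print("width:", len(taxonomy))  # number of top level containers
print("depth:", depth(taxonomy))
for path in paths(taxonomy):
    print(" -> ".join(path))
```

Checking width and depth like this is one way to see whether a draft stays within the "broad and shallow" shape suggested on the slide.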
Terms are often defined by committees, which involve specialist groups to help choose them.
Line of business systems or databases may contain data that can be imported (http://www.termset.com/blog/2016/8/25/loading-metadata-terms-into-sharepoint-using-powershell)
SKOS is an interesting option for advanced taxonomies (https://www.w3.org/2001/sw/wiki/SKOS/Datasets), and WAND taxonomies are available off the shelf (http://www.wandinc.com/wand-taxonomy-library-portal.aspx).
The challenge with deciding terms without looking at your documents is that it will be guesswork to know what would be effective.
That said, a simple top down taxonomy is 10x better than a folder structure. No duplication as documents can be tagged within multiple areas.
Bottom up means looking at the information you have in your content (usually documents and e-mails) and building taxonomies that are based on how you actually describe information.
Bottom up results in a taxonomy that can describe the subject or facet of the document.
How long does it take for people to read and process documents: http://www.termset.com/calc/
Getting a working team of people to actually read documents is time consuming and expensive, but sometimes if the information is valuable it may be worth it.
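To get a feel for why manual reading is expensive, the effort can be estimated with simple arithmetic. This is a minimal sketch and not the calculator linked above: the function name, its parameters and all the figures are assumptions chosen for illustration.

```python
def manual_review_days(documents, pages_per_document, minutes_per_page,
                       reviewers=1, hours_per_day=7.5):
    """Estimate working days for a team to read every document.

    All inputs are assumptions supplied by the caller; none of the
    figures used below come from the slides.
    """
    total_minutes = documents * pages_per_document * minutes_per_page
    minutes_per_reviewer_day = hours_per_day * 60
    return total_minutes / (minutes_per_reviewer_day * reviewers)


# Illustrative figures only: 10,000 documents of 5 pages each,
# 2 minutes per page, shared between 4 reviewers.
days = manual_review_days(10_000, 5, 2, reviewers=4)
print(f"{days:.1f} working days")
```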
There are tools that can analyse the frequency of words or phrases in your documents. They can be highly effective but need a lot of consultancy to make sense of the results.
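A minimal sketch of this kind of frequency analysis, using only the Python standard library. The stop word list and the sample documents are invented, and a real tool would do far more (stemming, longer phrases, weighting by rarity).

```python
import re
from collections import Counter

# A small, hypothetical stop word list; a real project would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "is", "on", "with"}


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def term_frequencies(documents, top=5):
    """Count single words and two-word phrases across all documents."""
    words, phrases = Counter(), Counter()
    for text in documents:
        # Stop words are dropped before pairing, so a phrase may span one.
        tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
        words.update(tokens)
        phrases.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return words.most_common(top), phrases.most_common(top)


# Invented sample text standing in for real document content.
docs = [
    "Clinical trial results for the asthma inhaler",
    "Asthma inhaler dosage guidance",
    "Clinical trial protocol for diabetes treatment",
]
top_words, top_phrases = term_frequencies(docs)
print(top_words)
print(top_phrases)
```

The most frequent words and phrases are only candidate terms; someone still has to decide which of them belong in a taxonomy, which is where the consultancy effort goes.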
NLP is the future of text analysis (more later).
A bottom up approach can be used to describe the contents of the documents (not just the area)
TermSet has a different approach. It manages every step of adding metadata to your SharePoint content. Projects can be completed in days or weeks instead of months or years.
The application uses machine learning that can build over 400 taxonomies that relate to your data. You can also easily train it to apply tags that are important to you.
A full list of features is available at http://www.termset.com/platform/
Natural language processing is at the core of TermSet. We have an engine trained to recognise entities within documents.
(First Click) This is a BBC news article. When our engine reads the text it identifies entities such as people, locations and organisations.
(Second Click) In fact, we identify a vast array of information inside the documents including concepts, sentiment and relationships.
A document library with medical / pharma documents. There is no structure to the documents in this library.
We create a discovery job to process (read) the documents.
We select the location of the documents and can feed in existing taxonomies and define patterns to look for.
TermSet can also suggest new taxonomies that are created from the terms inside your documents.
TermSet can also assess the sentiment, the language and write a summary of any document.
Click to create a brand new taxonomy built from your documents
Select the taxonomy
Verify the terms created from the content
TermSet then creates columns in your libraries
Every time you add a field that needs to be completed in order to save a document, you are impeding adoption of a new DMS.
If you do mandate fields, many users will pick the first item in the list or just pick anything at random in order to save the document.
What do you do with the 1 million documents that came from a file share (or any other source without metadata)?
Manually tagging new content can work well. Always use default values to answer as many questions as possible before the user is involved (infer the metadata wherever possible).
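One way to picture inferred defaults is a function that derives tags from where a file lives and what kind of file it is. This is a sketch under assumptions: the mappings, tag names and path layout are hypothetical, and it does not use any SharePoint API.

```python
from pathlib import PurePosixPath

# Hypothetical mappings; every name here is invented for illustration.
FUNCTION_BY_LIBRARY = {"hr": "HR", "finance": "Finance"}
TYPE_BY_EXTENSION = {".docx": "Document", ".xlsx": "Spreadsheet", ".pptx": "Presentation"}


def infer_defaults(path):
    """Derive default tags from a file's location and extension."""
    p = PurePosixPath(path)
    tags = {}
    # The first folder under the root is treated as the library,
    # which in this sketch also implies the business function.
    parts = [part for part in p.parts if part != "/"]
    if len(parts) > 1:
        library = parts[0].lower()
        if library in FUNCTION_BY_LIBRARY:
            tags["Function"] = FUNCTION_BY_LIBRARY[library]
        tags["Location"] = parts[0]
    doc_type = TYPE_BY_EXTENSION.get(p.suffix.lower())
    if doc_type:
        tags["Document type"] = doc_type
    return tags


print(infer_defaults("/HR/Policies/leave-policy.docx"))
```

Anything filled in this way is a field the user no longer has to complete by hand when saving.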
Keeping it simple is a good plan. Single lookup columns may be better than deep hierarchies.
There are a number of taggers for SharePoint that will look at your documents and apply tags from a taxonomy that you have defined
Some taggers ask for rules to be defined for each term (can work well, takes forever to get right).
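A minimal sketch of a rule and weight based tagger, assuming a hypothetical rule format in which each tag lists trigger phrases with a weight and a tag is applied once the summed weights reach a threshold. It does not represent any particular SharePoint tagger.

```python
import re

# Hypothetical rules: each tag lists trigger phrases with a weight.
RULES = {
    "Asthma": {"asthma": 2.0, "inhaler": 1.0},
    "Diabetes": {"diabetes": 2.0, "insulin": 1.0},
}


def score_tags(text, rules=RULES, threshold=2.0):
    """Return tags whose summed phrase weights reach the threshold."""
    lowered = text.lower()
    scores = {}
    for tag, phrases in rules.items():
        score = 0.0
        for phrase, weight in phrases.items():
            # Whole-word match; each occurrence adds the phrase weight.
            hits = re.findall(r"\b" + re.escape(phrase) + r"\b", lowered)
            score += weight * len(hits)
        if score >= threshold:
            scores[tag] = score
    return scores


print(score_tags("Inhaler technique for asthma patients; insulin is not covered."))
```

Note that "insulin is not covered" still contributes to the Diabetes score: matching like this is not context aware, and tuning weights and thresholds per term is what makes the approach slow to get right.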
Creates site collection columns.
Tags the documents asynchronously.
Before TermSet.
Two new columns added (Drug and Health condition) and the documents are tagged.
New documents will be tagged as they arrive (new terms will need to be approved).
A one sentence summary of each document is created.
Search is super-charged, with metadata available for refinement.
Metadata allows us to understand the information inside a document library.
Visit www.termset.com or e-mail brendan@termset.com for a free licence
If you need assistance with top down taxonomies or you use a different DMS please e-mail me to join the beta program for www.taxononica.com