Structured vocabularies, thesauri and lexicons are key ingredients for many information management tasks. Creating them, however, often requires a significant amount of work, and maintaining and extending them means repeating manual tasks on a regular basis so that the resources do not become outdated, irrelevant or incomplete. AI has much support to offer here. By wrapping the respective approaches into applications that terminologists and domain experts can operate without being programmers or data scientists themselves, the benefits can be made available to a wide range of users.
AI-SDV 2021: Stefan Geissler - AI support for creating and maintaining vocabularies
1. Kairntech & vocabularies:
AI support for creating and
maintaining vocabularies
AI SDV
Oct 4+5, 2021
Stefan Geißler
www.kairntech.com
2. Introducing Kairntech
• Software & Service company with a focus on
NLP & AI for industry use cases
• Focus on making powerful ML approaches
accessible for domain experts (not just
programmers and data scientists)
• Founded in Dec 2018, HQ in Grenoble, France
• Team with 20+ years of experience in the
field (Xerox, IBM, TEMIS, …)
• We’ve been attending the SDV for many
years; it is a pleasure to be ‘here’ again ☺
Europe’s highest mountain, the Mont
Blanc, is visible from many places in the
surroundings of Grenoble (~100km)
3. Kairntech: Different Approaches to Content Analysis
Create NLP models by importing
annotated data or adding manual
annotation
• Entities, Categories, Relations
• Users adding their domain
expertise
• “Active Learning”: Reduces
required manual efforts
• Immediate feedback
• Annotation as Teamwork: Have
people cooperate on projects
Import of existing vocabularies and
thesauri. System will learn the
relevant concepts.
• Integrating your knowledge
sources (company- or domain-
specific)
• Quick creation of the respective
annotation models
• New similar terms? variants?
• Import from many different
formats
Benefitting from public world
knowledge: more than 90 million
concepts, multilingual,
disambiguated, linked.
• Based on Wikidata
• Regularly updated knowledge
source
• “Tesla” - inventor or electric
car? Kairntech resolves this and countless
other ambiguous cases.
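
To make the “Tesla” ambiguity concrete, here is a minimal sketch of disambiguation by context overlap. It is a toy illustration only and not Kairntech's actual method: the two senses, their cue words and the function name `disambiguate` are assumptions made up for this example, whereas a real system would draw its candidates from a knowledge base such as Wikidata.

```python
# Toy entity disambiguation by context overlap (illustrative only).
# SENSES and its cue words are hypothetical, hand-written for this example.
SENSES = {
    "Tesla (inventor)": {"inventor", "engineer", "nikola", "alternating", "current", "patent"},
    "Tesla (car maker)": {"car", "electric", "vehicle", "model", "factory", "battery"},
}


def disambiguate(sentence: str) -> str:
    """Return the sense whose cue words overlap most with the sentence."""
    # Lowercase the tokens and strip surrounding punctuation before comparing.
    tokens = {tok.strip(".,;:!?\"'").lower() for tok in sentence.split()}
    return max(SENSES, key=lambda sense: len(SENSES[sense] & tokens))


print(disambiguate("Tesla opened a new factory for its electric car line."))
print(disambiguate("Tesla was an inventor who worked on alternating current."))
```

The first sentence shares three cue words with the car maker sense and none with the inventor sense; the second sentence is the reverse.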
4. Use case today: AI support in vocabulary management
• Thesaurus: Structured
vocabulary of terms
• Often domain-specific
• Important in information
retrieval and content analysis
• Non-trivial thesauri are often
very large (>>10000 terms)
• … require considerable effort
to build
• … and to keep up-to-date as a
field evolves
• This can be a challenge,
especially when working on
different subjects at once
5. Case study: Kairntech client TecIntelli
• https://www.tecintelli.de
• Technology and Innovation
Intelligence
• Based in Stuttgart, Germany
• Analysis of large volumes of text
content: Web sources, technical
documents, scientific literature
• Technology scouting, technology
monitoring, coaching and consulting
• Which technologies exist, which are
on the rise, what solutions exist for a
given problem? What markets for a
given solution?
6. Example: Technology scouting for tech SME
• Client specializes in building switches / actuators
• Realizes their switches are quite fast, in fact faster
than competitors’ products
• “What else can be done with these? Who else
needs faster switches than what is typically sold?”
• SME → (often no large research department)
• Technology watch project: What are technological
fields that need our fast switches?
• Literature/market review identified markets and
potential clients
• An important part of this literature analysis is the
identification of key concepts and actors
7. Kairntech: AI support for vocabulary maintenance
Raw
documents
Automatic
Annotation
enriching
documents with
imported terms
Train ML model
Broad range of ML
algorithms
Model
application
Automatically created
suggestions of new
terms
Wrap AI/NLP/ML into easy-to-use GUI:
Domain-
specific seed
vocabulary
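
The “Automatic Annotation” step, enriching raw documents with imported terms, could be sketched as a simple dictionary lookup. This is a minimal sketch under stated assumptions, not the product's implementation: the function name `annotate`, the sample seed terms and the sample sentence are all hypothetical.

```python
import re


def annotate(text, vocabulary):
    """Return (start, end, matched_text) spans for every vocabulary term in text."""
    # Try longer terms first so a multi-word term wins over a shorter one it contains.
    terms = sorted(vocabulary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]


# Hypothetical seed vocabulary and document.
seed = {"lithium-ion battery", "redox flow battery", "lead-acid battery"}
doc = "Unlike a lead-acid battery, a redox flow battery stores energy in liquid electrolytes."

spans = annotate(doc, seed)
for start, end, term in spans:
    print(start, end, term)
```

Spans produced this way can serve as training annotations for the subsequent model-training step.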
8. • Powerful approaches supporting this use case
exist (Deep Learning-based entity recognition)
• Productive use requires coding and data
science expertise
• Make ML model creation, optimization and
application available to domain users without
coding experience
Point and click AI
9. Sample domain: battery technology
• Technology field with fast-growing economic
potential
• Projected yearly worldwide growth of > 12% to
reach 279 bn US-$ by 2027 (source:
researchandmarkets.com)
• Key component in e-mobility, home batteries,
portable devices and other applications
• Area of intense research and industrial
innovation
11. A vocabulary maintenance workflow
Seed vocabulary of domain-
specific terms
Apply vocabulary on
document corpus
System suggesting new candidates (here
new, as yet “unknown” types of batteries)
Configure Machine
Learning experiment
12. Searching for new terms
• Deep Learning based models take into account various types of clues
• Internal structure of candidate term
• “… ion …”, “ … Li …”, “ … cell … “, “ … redox …”
• Context
• “… electrodes of XYZ batteries are often built from …”
• Model architecture allows both types of clues to be taken into consideration
• Available ML approaches range from fast and relatively simple Conditional Random Fields (CRF)
to powerful and computation-intensive Deep Learning
• No manual rule-writing process required
• Large scale pre-trained embeddings and transformer models (such as BERT) are key ingredients
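
The two kinds of clues named above, the internal structure of a candidate term and its context, can be illustrated with a hand-written feature extractor of the sort often fed to CRF-style sequence taggers. This is a minimal sketch, not the feature set used in the talk: the function name `token_features`, the feature names and the window size are assumptions. Deep Learning models learn comparable signals from embeddings rather than from explicit features.

```python
def token_features(tokens, i, window=2):
    """Build a feature dict for tokens[i] from internal and contextual clues."""
    word = tokens[i]
    lower = word.lower()
    # Internal structure of the candidate token.
    features = {
        "lower": lower,
        "is_capitalized": word[:1].isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        "has_hyphen": "-" in word,
        "prefix3": lower[:3],
        "suffix3": lower[-3:],
    }
    # Context: neighbouring tokens within the window, padded at sentence edges.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        inside = 0 <= j < len(tokens)
        features[f"ctx[{offset:+d}]"] = tokens[j].lower() if inside else "<PAD>"
    return features


sentence = "The electrodes of Li-ion batteries are often built from graphite".split()
feats = token_features(sentence, 3)
for name, value in feats.items():
    print(name, "=", value)
```

For the token “Li-ion”, the internal features capture the hyphen and the “ion” suffix, while the context features capture the neighbouring words “of” and “batteries”.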
13. Full workflow still requires (or benefits from) expert input
• Import of seed vocabulary and definition and application of annotator on
document content
• Fully automatic
• Review of annotation results and, where needed, curation of seed vocabulary and
annotations (ambiguities)
• Expert input, manual
• Consistency is king: “Alzheimer’s” or “Alzheimer’s Disease”? Be consistent in
your vocabulary and in your annotations
• Definition and application of Machine Learning model training
• Fully automatic
• Review of newly found terms
• Expert decision, manual
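
The consistency point can be supported by a small automatic check that flags vocabulary entries where one term is a whole-word extension of another, such as “Alzheimer’s” and “Alzheimer’s Disease”, so that an expert can decide which form to keep. This is a minimal sketch; the function name `find_variant_pairs` and the sample vocabulary are hypothetical.

```python
def find_variant_pairs(vocabulary):
    """Return (shorter, longer) pairs where one term extends another by whole words."""
    # Normalize case, apostrophe style and whitespace before comparing.
    normalized = sorted(
        {" ".join(term.lower().replace("’", "'").split()) for term in vocabulary}
    )
    pairs = []
    for shorter in normalized:
        for longer in normalized:
            if longer != shorter and longer.startswith(shorter + " "):
                pairs.append((shorter, longer))
    return pairs


# Hypothetical vocabulary with one inconsistent pair.
vocab = ["Alzheimer’s", "Alzheimer's Disease", "Parkinson's disease", "dementia"]
print(find_variant_pairs(vocab))
```

The check only surfaces candidates; whether both forms should be kept remains an expert decision.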
14. Outcome
• Application returned “MVI2 flow battery”
and “lithium organosulfur battery” as
potential new battery technologies
• Both are in fact relatively new approaches
in the field (not contained in the seed
vocabulary)
• Setup allows regular, large-scale scanning
of domain-specific content
• Time and effort for thesaurus maintenance
reduced
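
The final step, presenting only genuinely new terms for expert review, might look like the following minimal sketch. It assumes the trained model has already produced a list of predicted mentions. The function name `new_term_candidates`, the sample seed vocabulary and the mention counts are made up for illustration; only the two battery names come from the slide.

```python
from collections import Counter


def new_term_candidates(predicted_mentions, seed_vocabulary, min_count=1):
    """Return (term, count) pairs for predicted terms absent from the seed vocabulary."""
    known = {term.lower() for term in seed_vocabulary}
    counts = Counter(mention.lower() for mention in predicted_mentions)
    # Most frequent candidates first, so reviewers see the strongest suggestions on top.
    return [
        (term, n)
        for term, n in counts.most_common()
        if term not in known and n >= min_count
    ]


# Hypothetical seed vocabulary and model output.
seed = {"lithium-ion battery", "redox flow battery"}
predicted = [
    "Lithium-ion battery",
    "MVI2 flow battery",
    "lithium organosulfur battery",
    "MVI2 flow battery",
    "redox flow battery",
]

candidates = new_term_candidates(predicted, seed)
for term, n in candidates:
    print(n, term)
```

Raising `min_count` would trade recall for a shorter review list when scanning large volumes of content.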
15. Conclusion
• Kairntech: AI/NLP solutions also for
non-programmers
• Wide range of use cases and
languages
• Consulting, on-premise packaged
software or cloud-based
• We love to hear about your use cases
Thank you!
info@kairntech.com
www.kairntech.com