AI-SDV 2022: Possibilities and limitations of AI-boosted multi-categorization for patents, scientific literature, and web Harald Jenny (CENTREDOC, CH)

Possibilities and limitations of AI-boosted multi-categorization for patents, scientific literature, and web

7 Years of Business Intelligence Development (2015 to 2021)
Methodology, Optimisation, Automation, Analysis & Synthesis, AI

Climbing on the Matterhorn
The everyday use of AI-driven algorithms for data search, analysis and synthesis comes with important time savings but also reveals the need to understand and accept the limitations of the technology.
A workshop report

HUMAN INTELLIGENCE + ARTIFICIAL INTELLIGENCE = AUGMENTED INTELLIGENCE
Image: geniusgadget.com

Prepare the case studies by exposing the possibilities and limits of the AI-assisted automatic categorization process.
Discuss the challenges faced in setting up this process:
• Definition of the training set (type of data to be processed: patent, NPL or both)
• Development of classifiers (single vs. multi, selected fields, margin of error to be defined)
• Volume handled: > 300,000
Process advantages:
• Collaboration with experts in the field
• Multi-categorization
• Ability to select the fields to analyze
• Combination of the AI classification tool with a collaborative monitoring tool: the best of both worlds
Delivery of results in various forms, with further developments possible on demand

Monitor
o Different types of data to process (patent, NPL, web, internal documents)
o Increasing volume of information to monitor
o Multiple data sources to consult
o Limited time and resources
How to
o Process this ever-increasing flow of data without devoting too much time and resources?
o Boost customer efficiency and bring customer expertise where it is most valuable?
Automate
o Automate the monitoring process from end to end
o Optimize the data classification process by integrating AI

Automate
o Provide data selection and classification accuracy close to that of an expert, with higher consistency than humans
o Save time and resources
o Process large volumes of data quickly and efficiently on a regular basis

SmartCat
Import → AI classification → Result
Input: Patent, NPL, Web, internal documents
Output: RAPID, export, synchronisation
Free yourself from doing repetitive tasks
Focus on what matters most: the result

SmartCat
Powered by
• Averbis
Integrated in
• RAPID
Designed to
• Process all types of data
• Handle large volumes of data
Empower you to
• Detect relevant documents
• Apply single or multi-label classifications

1. Provide a training set
2. Set the AI classifier
3. Run the learning process
4. Validate the prediction model
5. Run the classification process
6. Validate the AI classification

Expert contribution
Key during the definition and validation steps
Training set
• Balanced set
• Unambiguous classification
• Distinctive categories
Classifier
• Field selection
• Classification mode: single vs. multi
Prediction model
• Metrics validation
o Precision
o Recall
o F1 score
Classification
• Classification assessment
• Relevance labels assigned
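One way to check the "balanced set" criterion before training is to count how often each category appears in a multi-label training set. The sketch below is only an illustration: the document ids, the category names and the `min_share` threshold are hypothetical and do not come from the talk.

```python
from collections import Counter


def label_counts(training_set):
    """Count how many training documents carry each category label."""
    counts = Counter()
    for _doc_id, labels in training_set:
        counts.update(labels)
    return counts


def underrepresented(counts, min_share=0.5):
    """Return categories whose count is below min_share of the largest one.

    min_share is an arbitrary illustrative threshold, not a recommended value.
    """
    if not counts:
        return []
    largest = max(counts.values())
    return sorted(c for c, n in counts.items() if n < min_share * largest)


# Hypothetical multi-label training set: (document id, set of categories).
training_set = [
    ("doc-1", {"battery", "sensor"}),
    ("doc-2", {"battery"}),
    ("doc-3", {"battery"}),
    ("doc-4", {"sensor"}),
    ("doc-5", {"battery", "coating"}),
]

counts = label_counts(training_set)
print(dict(counts))
print(underrepresented(counts))  # -> ['coating']
```

A category flagged this way is a candidate for more expert-labelled examples before the learning process is run.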

Precision Recall F1-Score
1 1 1.00
0.5 0.5 0.50
0.9 0.5 0.64
0.9 0.9 0.90
0.8 0.8 0.80
0.7 0.9 0.79
0.1 0.9 0.18
0.2 0.9 0.33
0.3 0.8 0.44
0.4 0.8 0.53
0.5 0.8 0.62
0.6 0.9 0.72
0.7 0.9 0.79
0.8 0.9 0.85
0.9 0.9 0.90
1 1 1.00
1 1 1.00
1 1 1.00
1 1 1.00
1 1 1.00
Chart: Precision, Recall and F1-Score for the 20 examples listed above (vertical axis from 0 to 1.2)
Precision of a classifier: share of the documents assigned to a category that really belong to it
Recall of a classifier: share of the documents belonging to a category that the classifier actually finds
F1-Score of a classifier: harmonic mean of Precision and Recall
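The F1 values in the table follow from the standard formula F1 = 2 · P · R / (P + R). A minimal sketch, with function names chosen here for illustration, shows how precision and recall are obtained from counts and combined:

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Compute precision and recall for one category from raw counts."""
    assigned = true_positives + false_positives   # documents put in the category
    belonging = true_positives + false_negatives  # documents that should be in it
    precision = true_positives / assigned if assigned else 0.0
    recall = true_positives / belonging if belonging else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative counts: 45 correct assignments, 5 wrong ones, 45 missed documents.
p, r = precision_recall(true_positives=45, false_positives=5, false_negatives=45)
print(f"{f1_score(p, r):.2f}")  # -> 0.64, the table row with precision 0.9 and recall 0.5
```

Because the harmonic mean is pulled towards the smaller value, a classifier with precision 0.1 and recall 0.9 scores only 0.18, as in the table.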

Depends on
o Topic
o Data quality
o Classification uncertainties and complexity
Contributing factors
o Subject matter expert(s)
o Unambiguous and distinctive classification
o Delimited search scope

What we intended to do (and sometimes managed to do)
Raw data → One classifier → Final result

What we finally did
Raw data → Binary classifier → Good / Bad
Good → Classifier #1, Classifier #2, Classifier #3, … → Result #1, Result #2, Result #3, … → Final result
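The two-stage layout can be sketched as a cascade: a binary relevance filter first, then one classifier per category applied only to the documents that pass. The keyword rules below are hypothetical stand-ins for trained models, and the category names and example texts are invented for illustration. This is not the SmartCat implementation.

```python
def keyword_rule(*keywords):
    """Build a toy classifier that accepts a text containing any keyword."""
    def accepts(text):
        lowered = text.lower()
        return any(keyword in lowered for keyword in keywords)
    return accepts


def classify_document(text, is_relevant, category_classifiers):
    """Stage 1 filters; stage 2 runs every category classifier on kept documents.

    Returns None for documents rejected by the binary classifier ("Bad"),
    otherwise the sorted list of categories that accepted the document.
    """
    if not is_relevant(text):
        return None
    return sorted(name for name, accepts in category_classifiers.items()
                  if accepts(text))


# Hypothetical binary filter and per-category classifiers.
is_relevant = keyword_rule("watch", "movement")
category_classifiers = {
    "materials": keyword_rule("alloy", "ceramic"),
    "mechanics": keyword_rule("escapement", "spring"),
    "electronics": keyword_rule("sensor", "battery"),
}

docs = [
    "Ceramic watch case with integrated sensor",
    "Method for brewing coffee",
    "Watch movement with improved escapement spring",
]
for doc in docs:
    print(doc, "->", classify_document(doc, is_relevant, category_classifiers))
```

Running the category classifiers independently is what makes the result multi-label: the first example lands in two categories, while the second never reaches stage 2.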

Relevance rate estimated for each of the 3 monitoring processes implemented: >80%
Number of iterations done before reaching a suitable relevance rate: ~3
Time to multi-classify 1000 documents: 4 min

We did it!
Fully automated process hosted in one place
Experts focus on the result
SmartCat
Patent, NPL, Web, internal documents → Import → Classification → Restitution

Case Study No 1: «enough time, no focus»
Automated data upload
SmartCat AI classification
Classification result
Expert reviews
Expert evaluation
AI training based on expert feedback
Weekly updates
User communication

Case Study No 1: «enough time, no focus»
Major hurdles → Overcome by
Implement a flexible and easy-to-use process → Developing RAPID in collaboration with renowned experts in the field
Ambiguities or uncertainties when defining the classification and the training set → Providing reliable definition and selection
Assess the classification quality → Involving motivated experts
Shift noted from the initial request → Redefining the classification in agreement with the experts involved
Synchronise data between RAPID and PS → Setting an automated workflow compatible with RAPID and PS
Reliability control → Real-time monitoring of every step of the automated process

Case Study No 2: «no time, no monitoring»
Set-up
o Choose a sufficiently broad monitoring strategy for the alert
(Criterion: find all the existing documents under observation or with oppositions)
o Train a classifier with all observation and opposition cases and the same number of clearly non-relevant documents
o Take two months of monitoring data → 4,600 newly published documents
o Configure and run SmartCat → 5 certainly relevant documents, 6 probably relevant documents and 62 potentially relevant documents
o Check the 11 certainly or probably relevant documents with Central IP → Yes, they are relevant.
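The training step in this set-up pairs every known relevant case with an equal number of clearly non-relevant documents. A minimal sketch of that balancing, assuming the negatives are drawn at random from a larger pool (the talk does not say how they were chosen), with hypothetical document ids:

```python
import random


def build_balanced_training_set(relevant_docs, non_relevant_pool, seed=0):
    """Pair every relevant document with an equal number of sampled negatives.

    Returns a shuffled list of (document, label) pairs, label 1 = relevant.
    Random sampling of negatives is an assumption made for this sketch.
    """
    if len(non_relevant_pool) < len(relevant_docs):
        raise ValueError("not enough non-relevant documents to balance the set")
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    negatives = rng.sample(non_relevant_pool, len(relevant_docs))
    examples = [(doc, 1) for doc in relevant_docs]
    examples += [(doc, 0) for doc in negatives]
    rng.shuffle(examples)
    return examples


# Hypothetical ids for observation/opposition cases and unrelated documents.
relevant = ["observation-1", "observation-2", "opposition-1"]
pool = [f"other-{i}" for i in range(10)]

training = build_balanced_training_set(relevant, pool)
print(len(training), sum(label for _doc, label in training))  # -> 6 3
```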

Case Study No 2: «no time, no monitoring»
Set-up
Chart: number of documents per relevance label (horizontal axis from 0 to 3000)
Relevant – very sure: 5
Relevant – sure: 6
Relevant – not sure: 62
Non relevant – not sure: 601
Non relevant – sure: 909
Non relevant – very sure: 2823
Effect of additional training cycles
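The time saving behind this chart can be made explicit with a little arithmetic on the six bar values. In the sketch below the counts are transcribed from the chart, and the total is computed from them rather than taken from the set-up text. Assigning 601, 909 and 2823 to the three "non relevant" labels follows the order of the bars and is an assumption, though it does not affect the totals.

```python
# Document counts per relevance label, as read from the chart.
counts = {
    ("relevant", "very sure"): 5,
    ("relevant", "sure"): 6,
    ("relevant", "not sure"): 62,
    ("non relevant", "not sure"): 601,
    ("non relevant", "sure"): 909,
    ("non relevant", "very sure"): 2823,
}

total = sum(counts.values())
flagged_relevant = sum(n for (label, _), n in counts.items() if label == "relevant")
confirmed = counts[("relevant", "very sure")] + counts[("relevant", "sure")]

print(total)                              # -> 4406
print(flagged_relevant)                   # -> 73
print(f"{flagged_relevant / total:.1%}")  # -> 1.7%
print(confirmed)                          # -> 11 documents checked with Central IP
```

Under these figures an expert would read fewer than 2% of the monitored documents, starting with the 11 that the classifier is most confident about.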

Climbing on the Matterhorn
1. Establish a good training set
2. Configure the classifier system carefully
3. Don’t despair when your first attempt(s) fail(s)
4. Take a good guide
5. Study the AI-System carefully, identify the gradients of convergence
6. Repeat steps 1-5 in cycles until you…
7. Reach the summit
8. Enjoy the view!
9. Be aware that every mountain is different

From the data lake to the key document
The Project Team
Jean-Baptiste Porier, Senior Data Analyst
David Borel, Head of Foresight Team
Harald Jenny, CEO

The time for AI implementation is now.
JACQUET DROZ 1
2002 NEUCHÂTEL
WWW.CENTREDOC.SWISS
INFO@CENTREDOC.CH
+41 32 720 51 31
