Multilayer Collection Selection and Search of
Topically Organized Patents
Michail Salampasis
Vienna University of Technology
Anastasia Giahanou
University of Macedonia
Giorgos Paltoglou
University of Wolverhampton
Contents
Overview:
• Aim and Objectives of this work
• Distributed Information Retrieval / Federated Search
• Topically Organised Patents
• Integration of DIR in patent search: Multilayer Source Selection
• Experiment Setup
• Results
• Conclusions
Aim of this work
To explore the thematic organization of patent documents
using the subdivision of patent data by International
Patent Classification (IPC) codes, and to examine
whether this organization can be used to build search tools
that improve patent search effectiveness using DIR
methods.
Which search tools should be integrated, and how?
It is a mistake to think that the search tools which should be
integrated into patent search systems depend only on
existing IR or text processing technologies.
It probably has more to do with the attitude with which a patent
search is conducted.
Furthermore, it is very important to understand the search
process deeply, and how a specific tool can
attain a specific objective of this process and therefore
increase its efficiency.
If these parameters are not carefully considered
• Professional searchers will be skeptical and take a very
conservative attitude towards adopting search methods,
tools and technologies beyond the ones which have
dominated their domain.
• A typical example is patent search, where professional
search experts typically use Boolean search syntax
and quite complex intellectual classification schemes.
Understanding Patent Search processes *
* Taken from Mihai Lupu and Allan Hanbury, Review Patent Retrieval
Objectives
• The improvement offered by our method relates to a
fundamental step in professional patent search (step 3 in the
use case presented by Lupu and Hanbury), which is
"defining a text query, potentially by Boolean operators and
specific field filters".
• In prior art search, probably the most important filter is
based on the IPC (now CPC) classification.
Objectives
• The method and tool which we present in this paper can
support this step by automatically selecting IPCs for a given
query and then running a filtered search based on the query and the
automatically selected IPCs.
• The tool can also be used for classification search, as a
starting point to identify and examine more closely the
technical concepts expressed in IPCs to
which a patent could be related.
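The two-step idea above (select IPCs for a query, then search only within them) can be sketched roughly as follows. This is a toy illustration, not the tool presented in the slides: the patents, the `term_overlap` scorer and the function names are all hypothetical stand-ins for a real index and a real source selection algorithm.

```python
from collections import Counter

# Hypothetical toy patents: free text plus one or more IPC codes.
PATENTS = [
    {"id": "P1", "text": "wireless packet routing network", "ipcs": {"H04L", "H04W"}},
    {"id": "P2", "text": "network switch packet buffer", "ipcs": {"H04L"}},
    {"id": "P3", "text": "engine fuel injection valve", "ipcs": {"F02M"}},
    {"id": "P4", "text": "fuel pump for engine", "ipcs": {"F02M", "F04B"}},
]

def term_overlap(query, text):
    """Stand-in relevance score: number of query-term occurrences in the text."""
    counts = Counter(text.split())
    return sum(counts[t] for t in query.split())

def select_ipcs(query, patents, k):
    """Rank IPC codes by the summed score of their member patents; keep the top k."""
    scores = Counter()
    for p in patents:
        s = term_overlap(query, p["text"])
        for ipc in p["ipcs"]:
            scores[ipc] += s
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [ipc for ipc, s in ranked[:k] if s > 0]

def filtered_search(query, patents, k=1):
    """Search only the patents that carry at least one automatically selected IPC."""
    selected = set(select_ipcs(query, patents, k))
    hits = [(term_overlap(query, p["text"]), p["id"])
            for p in patents if p["ipcs"] & selected]
    hits = [(s, pid) for s, pid in hits if s > 0]
    return selected, [pid for s, pid in sorted(hits, key=lambda h: (-h[0], h[1]))]

selected, hits = filtered_search("packet network", PATENTS, k=1)
print(selected, hits)
```

In a real system the selector would be a DIR source selection method and the filtered search would run against the indexed sub-collections; only the control flow is meant to carry over.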
Distributed IR
Elements composing a Distributed Information Retrieval System
[Figure: Collection 1, Collection 2, Collection 3, Collection 4, ..., Collection N; (1) Source Representation; (2) Source Selection; (3) Results Merging; User]
Topically Organised Patents based on IPC
taxonomy
IPC is a standard taxonomy for classifying patents. It currently has
about 71,000 nodes, organized into a five-level hierarchical
system which is further extended to finer levels of granularity.
Patent documents produced worldwide carry manually assigned
classification codes, which in our experiments are used to topically
organize, distribute and index patents across hundreds or
thousands of sub-collections.
Topically Organised Patents
Patents have three IPC codes on average. In the experiments we
report here, we allocated a patent to each sub-collection specified by
at least one of its IPC codes, i.e. a sub-collection might overlap with
others in terms of the patents it contains.
IPC codes are assigned by humans in a very detailed and purposeful
assignment process, which is very different from the creation
of sub-collections using automated clustering algorithms or the naive
division by chronological or source order, a division method
which has been used extensively in past DIR research.
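The allocation rule described above can be sketched as a small function. This is a minimal illustration under two assumptions that the slides do not spell out: IPC symbols are written like "H04L 12/56", and levels 3, 4 and 5 correspond to subclass, main group and subgroup. The patent IDs and codes are made up.

```python
from collections import defaultdict

def ipc_prefix(code, level):
    """Truncate a full IPC symbol such as 'H04L 12/56' to a hierarchy level.

    Assumed mapping: 1 = section, 2 = class, 3 = subclass,
    4 = main group, 5 = subgroup.
    """
    subclass, _, group = code.strip().partition(" ")
    main_group = group.split("/")[0]
    if level == 1:
        return subclass[:1]
    if level == 2:
        return subclass[:3]
    if level == 3:
        return subclass[:4]
    if level == 4:
        return f"{subclass[:4]} {main_group}"
    if level == 5:
        return f"{subclass[:4]} {group}"
    raise ValueError("level must be between 1 and 5")

def allocate(patents, level):
    """Place each patent in every sub-collection named by one of its IPC codes."""
    sub_collections = defaultdict(set)
    for patent_id, codes in patents.items():
        for code in codes:
            sub_collections[ipc_prefix(code, level)].add(patent_id)
    return dict(sub_collections)

# Hypothetical patents with their IPC codes.
patents = {
    "P1": ["H04L 12/56", "H04W 40/02"],
    "P2": ["H04L 12/56", "H04L 29/06"],
}
print(allocate(patents, 3))
print(allocate(patents, 4))
```

Because P1 carries codes from two subclasses, it lands in two sub-collections at level 3, which is exactly the overlap the slide mentions.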
Analysis of IPC distribution of topics and
their relevant documents
(a) = # of relevant docs per topic
(b) = # of IPC classes of each topic
(c) = # of IPC classes of relevant docs
common = # of common IPC classes between (b) and (c)

Set       IPC level  # of topics  (a)    (b)    (c)     common
Training  Split 3    300          8.22   2.08   4.8     1.76
Training  Split 4    300          8.22   3.1    8.76    2.34
Training  Split 5    300          8.22   5.82   19.84   3.63
Testing   Split 3    300          8.57   2.09   5.15    1.75
Testing   Split 4    300          8.57   2.95   9.02    2.21
Testing   Split 5    300          8.57   5.58   20.56   3.73
Experiment Setup
We indexed the collection with the Lemur toolkit.
The indexed fields are: title, abstract,
description (first 500 words), claims, inventor, applicant
and IPC class information.
Patent documents were pre-processed to produce
a single (virtual) document representing each patent.
Our pre-processing also involves stop-word removal
and stemming using the Porter stemmer. In the
experiments reported here we use the Inquery
algorithm implementation of Lemur.
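As a rough illustration of the pre-processing described above, the sketch below builds one virtual document per patent from the listed fields, keeps only the first 500 words of the description, and removes stop words. It does not use Lemur; the field names, the tiny stop-word list and the sample patent are assumptions, and stemming is left out to keep the example dependency-free (a Porter stemmer, for instance NLTK's `PorterStemmer`, could be applied to each token).

```python
import re

# Tiny illustrative stop-word list; a real run would use a full list.
STOP_WORDS = {"a", "an", "and", "the", "of", "for", "in", "is", "to", "with"}
TEXT_FIELDS = ["title", "abstract", "description", "claims"]
META_FIELDS = ["inventor", "applicant", "ipc"]

def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_virtual_document(patent, max_description_words=500):
    """Merge the indexed fields of one patent into a single virtual document."""
    doc = {}
    for field in TEXT_FIELDS:
        text = patent.get(field, "")
        if field == "description":
            # Only the first 500 words of the description are indexed.
            text = " ".join(text.split()[:max_description_words])
        doc[field] = [t for t in tokenize(text) if t not in STOP_WORDS]
    for field in META_FIELDS:
        # Metadata fields are kept verbatim rather than tokenized.
        doc[field] = patent.get(field, "")
    return doc

# Hypothetical patent record.
patent = {
    "title": "A Method for Routing",
    "abstract": "Routing of packets in a network",
    "description": "word " * 600,
    "claims": "The method of claim 1",
    "inventor": "J. Doe",
    "applicant": "Example Corp",
    "ipc": "H04L 12/56",
}
doc = build_virtual_document(patent)
print(doc["title"], len(doc["description"]))
```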
Two different types of
Source Selection Algorithms were used
Hyper-document approach (CORI)
o The main characteristic of CORI, which is probably
the most widely used and tested source selection
method, is that it creates a hyper-document
representing all the member documents of a
sub-collection.
Source Selection as Voting
o This is a shift of focus from estimating the relevance of
each remote collection to explicitly estimating the number
of relevant documents in each.
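The two families above can be contrasted in a short sketch. `cori_score` follows the commonly published CORI belief formula (default belief 0.4 and constants 50 and 150), treating each sub-collection as one hyper-document described by its document frequencies and word count. `borda_votes` is a simplified Borda-style count in which documents retrieved from a sample index vote for their source collection; it is an assumption about what a voting method can look like, not the exact algorithm used in the experiments. All collection statistics and document IDs are hypothetical.

```python
import math

def cori_score(query_terms, coll, all_colls, b=0.4):
    """Average CORI belief of a collection over the query terms.

    coll["df"][t] = number of documents in the collection containing term t
    coll["cw"]    = number of words in the collection
    """
    n_colls = len(all_colls)
    avg_cw = sum(c["cw"] for c in all_colls) / n_colls
    beliefs = []
    for t in query_terms:
        df = coll["df"].get(t, 0)
        cf = sum(1 for c in all_colls if c["df"].get(t, 0) > 0)
        if cf == 0:
            beliefs.append(b)  # term unseen everywhere: fall back to default belief
            continue
        t_part = df / (df + 50 + 150 * coll["cw"] / avg_cw)
        i_part = math.log((n_colls + 0.5) / cf) / math.log(n_colls + 1.0)
        beliefs.append(b + (1 - b) * t_part * i_part)
    return sum(beliefs) / len(beliefs)

def borda_votes(ranked_sample_docs, doc_to_coll):
    """Each ranked sample document gives (n - rank) votes to its source collection."""
    n = len(ranked_sample_docs)
    votes = {}
    for rank, doc in enumerate(ranked_sample_docs):
        coll = doc_to_coll[doc]
        votes[coll] = votes.get(coll, 0) + (n - rank)
    return sorted(votes, key=lambda c: (-votes[c], c))

# Hypothetical sub-collection statistics.
colls = [
    {"name": "A", "cw": 1000, "df": {"packet": 40}},
    {"name": "B", "cw": 1000, "df": {"packet": 5}},
    {"name": "C", "cw": 1000, "df": {}},
]
scores = {c["name"]: cori_score(["packet"], c, colls) for c in colls}
print(sorted(scores, key=scores.get, reverse=True))

# Hypothetical ranking of sampled documents and their source collections.
ranking = ["d1", "d2", "d3", "d4"]
sources = {"d1": "A", "d2": "B", "d3": "A", "d4": "C"}
print(borda_votes(ranking, sources))
```

The first function scores the collection as a whole; the second never looks at collection-level statistics and instead lets individual retrieved documents decide, which is the shift of focus the slide describes.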
Source Selection Results (level 3)
Source Selection Results (level 4)
Source Selection Results (level 5)
Discussion
• CORI was clearly the best-performing source selection
method in these experiments.
• The best runs are those requesting fewer sub-collections (10
or 20) and more documents from each selected
sub-collection.
• This is probably the result of the small number of
relevant documents which exist for each topic.
Retrieval Results

SPLIT4
                10 Collections Selected   20 Collections Selected
                Pres@100   MAP@100        Pres@100   MAP@100
Optimal         0.313      0.128          0.313      0.128
Centralised     0.257      0.105          0.257      0.105
CORI-CORI       0.203      0.081          0.213      0.086
CORI-SSL        0.221      0.091          0.231      0.097
BordaFuse-SSL   0.077      0.035          0.087      0.039
Multilayer      0.256      0.105          0.261      0.105

SPLIT5
                10 Collections Selected   20 Collections Selected
                Pres@100   MAP@100        Pres@100   MAP@100
Optimal         0.346      0.146          0.351      0.148
Centralised     0.257      0.105          0.257      0.105
CORI-CORI       0.267      0.107          0.259      0.105
CORI-SSL        0.27       0.11           0.263      0.107
BordaFuse-SSL   0.03       0.02           0.04       0.028
Multilayer      0.269      0.106          0.267      0.102
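For readers unfamiliar with the MAP@100 columns, the sketch below shows one common way to compute mean average precision with a rank cutoff. It assumes that average precision is normalized by the total number of relevant documents for the topic; evaluation tools differ on this detail, and the slides do not state which variant was used. The runs and relevance judgements are made up.

```python
def average_precision_at_k(ranked_ids, relevant_ids, k=100):
    """Average precision over the top-k results of one topic."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for i, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / i  # precision at the rank of each relevant hit
    return total / len(relevant)

def mean_average_precision(runs, qrels, k=100):
    """Mean of the per-topic average precision values."""
    per_topic = [average_precision_at_k(runs.get(q, []), qrels[q], k) for q in qrels]
    return sum(per_topic) / len(per_topic)

# Hypothetical ranked results and relevance judgements for two topics.
runs = {"t1": ["a", "x", "b"], "t2": ["x", "a"]}
qrels = {"t1": {"a", "b"}, "t2": {"a", "c"}}
print(mean_average_precision(runs, qrels, k=100))
```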
Conclusions
DIR approaches managed to perform better than the
centralized index approaches, with 9 DIR combinations
scoring better than the best centralized approach.
Much more work is required:
o We plan to pursue this line of work further by
exploring modifications to state-of-the-art DIR
methods which didn't perform well enough in this set
of experiments.
o We would also like to experiment with larger
distribution levels based on IPC (subgroup level). We
plan to report the runs using split-5 in a future paper.
Thank you

