We present a federated patent search system that explores three issues: (a) topical organization of patents based on their IPC codes, (b) collection selection over topically organised patent collections and (c) integration of collection selection tools into patent search systems.
Multilayer Collection Selection and Search of Topically Organized Patents
1. Multilayer Collection Selection and Search of
Topically Organized Patents
Michail Salampasis
Vienna University of Technology
Anastasia Giahanou
University of Macedonia
Giorgos Paltoglou
University of Wolverhampton
2. Contents
Overview:
• Aim and Objectives of this work
• Distributed Information Retrieval / Federated Search
• Topically Organised Patents
• Integration of DIR in patent search: Multilayer Source Selection
• Experiment Setup
• Results
• Conclusions
3. Aim of this work
• To explore the thematic organization of patent documents
using the subdivision of patent data by International
Patent Classification (IPC) codes, and
• whether this organization can be used to build search tools
that could improve patent search effectiveness using
Distributed Information Retrieval (DIR) methods
4. Which search tools should be integrated, and how?
• It is a mistake to think that the search tools which should be
integrated into patent search systems depend only on
existing IR or text processing technologies.
• The choice probably has more to do with the attitude with
which a patent search is conducted.
• Furthermore, it is very important to understand the search
process deeply, and how a specific tool can attain a specific
objective of this process and therefore increase its efficiency.
5. If these parameters are not carefully considered
• Professional searchers will be skeptical and very
conservative about adopting search methods, tools and
technologies beyond the ones which dominate their domain.
• A typical example is patent search, where professional
search experts typically use Boolean search syntax
and quite complex intellectual classification schemes.
7. Objectives
• The improvement offered by our method relates to a very
fundamental step in professional patent search (step 3 in the
use case presented by Lupu and Hanbury), which is
"defining a text query, potentially by Boolean operators and
specific field filters".
• In prior art search, probably the most important filter is
based on the IPC (now CPC) classification.
8. Objectives
• The method and tool which we present in this paper can
support this step by automatically selecting IPCs for a given
query and then running a filtered search based on the query
and the automatically selected IPCs.
• The tool can also be used for classification search, as a
starting point for identifying and examining more closely the
technical concepts expressed in IPCs to which a patent could
be related.
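The two steps named in this slide (select IPCs for a query, then filter the search by them) can be sketched as follows. This is a minimal illustration and not the tool's actual implementation: the function names select_ipcs and filtered_search, the parameter k, and all scores and patent identifiers are hypothetical, and the per-IPC scores are assumed to come from some source selection algorithm.

```python
from typing import Dict, List, Set, Tuple


def select_ipcs(ipc_scores: Dict[str, float], k: int) -> List[str]:
    """Return the k highest-scoring IPC codes (ties broken alphabetically)."""
    ranked = sorted(ipc_scores.items(), key=lambda item: (-item[1], item[0]))
    return [ipc for ipc, _score in ranked[:k]]


def filtered_search(
    ranked_patents: List[Tuple[str, float]],
    patent_ipcs: Dict[str, Set[str]],
    selected_ipcs: List[str],
) -> List[Tuple[str, float]]:
    """Keep patents that carry at least one selected IPC, preserving rank order."""
    allowed = set(selected_ipcs)
    return [
        (patent_id, score)
        for patent_id, score in ranked_patents
        if patent_ipcs.get(patent_id, set()) & allowed
    ]


# Toy data, invented for illustration only.
ipc_scores = {"H04L": 0.46, "G06F": 0.41, "A61K": 0.40}
patent_ipcs = {"EP1": {"H04L", "G06F"}, "EP2": {"A61K"}, "EP3": {"G06F"}}
ranked_patents = [("EP2", 9.1), ("EP1", 8.7), ("EP3", 5.2)]

selected = select_ipcs(ipc_scores, k=2)
results = filtered_search(ranked_patents, patent_ipcs, selected)
print(selected, results)
```

In this sketch the searcher can still inspect the selected IPCs before the filter is applied, which is what makes the same step usable as a starting point for classification search.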
9. Distributed IR
Elements composing a Distributed Information Retrieval System:
(1) Source Representation
(2) Source Selection
(3) Results Merging
[Figure: User; Collection 1, Collection 2, Collection 3, Collection 4, ..., Collection N]
10. Topically Organised Patents based on IPC taxonomy
• IPC is a standard taxonomy for classifying patents. It currently has
about 71,000 nodes, organized into a five-level hierarchical
system which is further extended to finer levels of granularity.
• Patent documents produced worldwide carry manually assigned
classification codes, which in our experiments are used to topically
organize, distribute and index patents across hundreds or
thousands of sub-collections.
12. Topically Organised Patents
• On average, a patent has three IPC codes. In the experiments we
report here, we allocated a patent to every sub-collection specified by
at least one of its IPC codes, i.e. a sub-collection may overlap with
others in terms of the patents it contains.
• IPC codes are assigned by humans in a very detailed and purposeful
assignment process, which is very different from creating
sub-collections with automated clustering algorithms or with the naive
division by chronological or source order that has been used
extensively in past DIR research.
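The allocation rule described above (a patent goes into every sub-collection named by one of its IPC codes, so sub-collections overlap) can be sketched as follows. The helper names ipc_prefix and allocate, the level names and the toy patents are assumptions made for illustration; the truncation follows the standard layout of an IPC symbol such as "H04L 12/56" (section H, class H04, subclass H04L, main group H04L 12). How these levels map onto the splits used in the experiments is not assumed here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def ipc_prefix(code: str, level: str) -> str:
    """Truncate an IPC symbol such as 'H04L 12/56' to the requested level."""
    code = code.strip()
    if level == "section":
        return code[:1]
    if level == "class":
        return code[:3]
    if level == "subclass":
        return code[:4]
    if level == "main_group":
        return code.split("/")[0].strip()
    raise ValueError(f"unknown IPC level: {level}")


def allocate(patents: Dict[str, Iterable[str]], level: str) -> Dict[str, Set[str]]:
    """Place each patent in every sub-collection named by one of its IPC codes."""
    subcollections: Dict[str, Set[str]] = defaultdict(set)
    for patent_id, codes in patents.items():
        for code in codes:
            subcollections[ipc_prefix(code, level)].add(patent_id)
    return dict(subcollections)


# Toy patents, invented for illustration only.
patents = {
    "EP1": ["H04L 12/56", "G06F 17/30"],
    "EP2": ["G06F 17/30", "G06F 3/048"],
    "EP3": ["H04L 29/06"],
}

by_subclass = allocate(patents, "subclass")
by_main_group = allocate(patents, "main_group")
print(by_subclass)
print(sorted(by_main_group))
```

In the toy data EP1 lands in both the H04L and the G06F sub-collection, which is the overlap the slide refers to; a deeper level yields more, smaller sub-collections.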
14. Analysis of IPC distribution of topics and their relevant documents
Set       IPC level  # of topics  (a) # relevant   (b) # of IPC classes  (c) # of IPC classes  # of common IPC classes
                                  docs per topic   of each topic         of relevant docs      between (b) and (c)
Training  Split 3    300          8.22             2.08                  4.8                   1.76
Training  Split 4    300          8.22             3.1                   8.76                  2.34
Training  Split 5    300          8.22             5.82                  19.84                 3.63
Testing   Split 3    300          8.57             2.09                  5.15                  1.75
Testing   Split 4    300          8.57             2.95                  9.02                  2.21
Testing   Split 5    300          8.57             5.58                  20.56                 3.73
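One way to read the training rows of this table is to relate the number of common IPC classes to the other two class counts. The sketch below does that arithmetic; the function name coverage and the variable names are hypothetical, the figures are copied from the table, and because they are per-topic averages the quotients are ratios of averages, not averages of per-topic ratios.

```python
from typing import Dict, Tuple

# (IPC classes of each topic, IPC classes of relevant docs, common IPC classes),
# copied from the training rows of the table.
TRAINING: Dict[str, Tuple[float, float, float]] = {
    "Split 3": (2.08, 4.8, 1.76),
    "Split 4": (3.1, 8.76, 2.34),
    "Split 5": (5.82, 19.84, 3.63),
}


def coverage(topic_classes: float, relevant_classes: float, common: float) -> Tuple[float, float]:
    """Return (share of topic classes that are common, share of relevant-doc classes that are common)."""
    return common / topic_classes, common / relevant_classes


ratios = {split: coverage(*row) for split, row in TRAINING.items()}
for split, (topic_share, relevant_share) in ratios.items():
    print(f"{split}: {topic_share:.2f} of topic classes, {relevant_share:.2f} of relevant-doc classes")
```

On these figures the share of relevant-document classes that also appear among the topic's own classes falls from about 0.37 at split 3 to about 0.18 at split 5, which suggests that the IPC codes of a topic alone become a weaker guide to where its relevant documents sit as the split gets finer.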
15. Experiment Setup
• We indexed the collection with the Lemur toolkit.
• The indexed fields are: title, abstract, description (first
500 words), claims, inventor, applicant and IPC class
information.
• Patent documents were pre-processed to produce
a single (virtual) document representing each patent.
• Our pre-processing also involves stop-word removal
and stemming with the Porter stemmer.
• In the experiments reported here we use Lemur's
implementation of the Inquery algorithm.
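The construction of the single virtual document can be sketched roughly as below. This is an assumption-laden illustration, not the Lemur pipeline: the function build_virtual_document, the field keys and the tiny stop-word list are hypothetical, and Porter stemming is deliberately left out because it needs a stemmer implementation that the Python standard library does not provide.

```python
from typing import Dict, List

# Tiny illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "for", "in", "is", "of", "the", "to"}
TEXT_FIELDS = ["title", "abstract", "description", "claims"]
METADATA_FIELDS = ["inventor", "applicant", "ipc"]
DESCRIPTION_WORD_LIMIT = 500  # only the first 500 words of the description are indexed


def build_virtual_document(patent: Dict[str, str]) -> List[str]:
    """Flatten the indexed fields of one patent into a single token list."""
    tokens: List[str] = []
    for field in TEXT_FIELDS:
        words = patent.get(field, "").split()
        if field == "description":
            words = words[:DESCRIPTION_WORD_LIMIT]
        # Lower-case text fields and drop stop words (stemming omitted in this sketch).
        tokens.extend(w.lower() for w in words if w.lower() not in STOP_WORDS)
    for field in METADATA_FIELDS:
        # Metadata tokens are kept as they are.
        tokens.extend(patent.get(field, "").split())
    return tokens


# Toy patent, invented for illustration only.
patent = {
    "title": "A Method for Routing",
    "description": " ".join(["packet"] * 600),
    "ipc": "H04L 12/56",
}
document = build_virtual_document(patent)
print(len(document))
```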
16. Two different types of Source Selection Algorithms were used
• Hyper-document approach (CORI)
o The main characteristic of CORI, probably the most
widely used and tested source selection method, is that
it creates a hyper-document representing all the
member documents of a sub-collection.
• Source Selection as Voting
o This shifts the focus from estimating the relevance of
each remote collection to explicitly estimating the number
of relevant documents in each.
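The two families of algorithms can be contrasted with a short sketch. The CORI part uses the commonly published belief formula (default belief 0.4, constants 50 and 150); the voting part is a deliberately simplified count-based variant in which each sampled document retrieved for the query casts one vote for its sub-collection, whereas the voting methods actually used may weight votes by rank or score. The function names cori_scores and vote_scores and all statistics are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List

DEFAULT_BELIEF = 0.4  # default belief b in the published CORI formula


def cori_scores(
    query_terms: List[str],
    doc_freq: Dict[str, Dict[str, int]],
    coll_words: Dict[str, int],
) -> Dict[str, float]:
    """Score each sub-collection from hyper-document statistics.

    doc_freq[c][t] is the number of documents in sub-collection c containing
    term t; coll_words[c] is the number of words in sub-collection c.
    """
    num_colls = len(coll_words)
    avg_words = sum(coll_words.values()) / num_colls
    scores: Dict[str, float] = {}
    for coll, words in coll_words.items():
        beliefs = []
        for term in query_terms:
            df = doc_freq.get(coll, {}).get(term, 0)
            # cf: number of sub-collections that contain the term at all.
            cf = sum(1 for stats in doc_freq.values() if stats.get(term, 0) > 0)
            if cf == 0:
                beliefs.append(DEFAULT_BELIEF)
                continue
            t = df / (df + 50 + 150 * words / avg_words)
            i = math.log((num_colls + 0.5) / cf) / math.log(num_colls + 1.0)
            beliefs.append(DEFAULT_BELIEF + (1 - DEFAULT_BELIEF) * t * i)
        scores[coll] = sum(beliefs) / len(beliefs)
    return scores


def vote_scores(sample_ranking: List[str], doc_to_coll: Dict[str, str]) -> Dict[str, int]:
    """Simplified voting: one vote per retrieved sample document."""
    return dict(Counter(doc_to_coll[doc] for doc in sample_ranking))


# Toy statistics, invented for illustration only.
doc_freq = {"H04L": {"packet": 40, "routing": 30}, "G06F": {"packet": 5}, "A61K": {}}
coll_words = {"H04L": 10000, "G06F": 12000, "A61K": 8000}
cori = cori_scores(["packet", "routing"], doc_freq, coll_words)
cori_ranking = sorted(cori, key=cori.get, reverse=True)

votes = vote_scores(["d1", "d2", "d3", "d4"],
                    {"d1": "H04L", "d2": "H04L", "d3": "G06F", "d4": "H04L"})
print(cori_ranking, votes)
```

The difference in focus is visible in the inputs: CORI only needs term statistics of the hyper-document of each sub-collection, while the voting sketch needs a ranking of individual sampled documents and their sub-collection membership.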
20. Discussion
• CORI was clearly the superior source selection method in
these experiments.
• The best runs are those requesting fewer sub-collections (10
or 20) and more documents from each selected sub-collection.
• This is probably the result of the small number of
relevant documents which exist for each topic.
22. Conclusions
• DIR approaches performed better than the centralized
index approaches, with 9 DIR combinations
scoring better than the best centralized approach.
• Much more work is required:
o We plan to explore this line of work further by
modifying state-of-the-art DIR methods which did not
perform well enough in this set of experiments.
o We would also like to experiment with larger
distribution levels based on IPC (subgroup level). We
plan to report the runs using split-5 in a future paper.