Patent Search: An important new test bed for IR
presented at the 9th Dutch-Belgian Information Retrieval Workshop (DIR 2009)
Enschede, The Netherlands
http://dir2009.cs.utwente.nl/
1. Patent Search: An important new test bed for IR
J. Tait, M. Lupu (1)
H. Berger, G. Roda, M. Dittenbach, A. Pesenhofer (2)
E. Graf, K. van Rijsbergen (3)
(1) Information Retrieval Facility, Vienna, Austria
(2) Matrixware, Vienna, Austria
(3) University of Glasgow, Dept. of Computing Science, Glasgow, UK
DIR 2009 / Feb. 2-3, 2009
2. Patent Search.
Patent search is a highly specialized form of information search.
It is characterized by its
target data
type of information needs
legal and economic implications
3. Target data
Data for patent retrieval comes mainly from:
patent databases from patent authorities (EPO, USPTO,
JPO, SIPO, WIPO, etc.)
scientific publications
prior art databases (IP.com)
A new acronym
SIPO: State Intellectual Property Office of the People's Republic of
China
4. Target data
Characteristics of patent documents
multilingual and ’legalese’
non-uniform formats
some are OCR’d
figures, images, chemical formulas, DNA sequences
include references to patent and non-patent literature
A new acronym
NPL: Non-Patent Literature
5. Information Needs.
K.H. Atkinson, Towards a more rational patent search paradigm:
depending on what group is doing the asking, the types of patent
search requested may include simple patentability, clearance to
market a product, validity, opposition to a patent being sought by
another, infringement watch, creating IP landscapes for business
development or R&D, infringement defense, litigation, prosecution
support, and creation of portfolios for assignments, investments,
mergers and acquisitions [ . . . ]
6. Legal and economic implications.
patents are legal documents
patent portfolios are assets for enterprises
a single patent search can be worth several days of work
High recall searches
Missing even a single relevant document can have severe financial
and economic impact. For example, a granted patent can be
invalidated because of a document that was missed at application
time.
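As a minimal sketch of what "high recall" means in practice, the snippet below computes recall for a single search. The document identifiers are hypothetical and chosen only for illustration; missing one of four relevant documents already lowers recall to 0.75.

```python
def recall(retrieved, relevant):
    """Fraction of the relevant documents that the search actually returned."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("recall is undefined without relevant documents")
    return len(relevant & set(retrieved)) / len(relevant)


# Hypothetical identifiers, for illustration only.
relevant_docs = {"D1", "D2", "D3", "D4"}
search_result = ["D1", "D7", "D2", "D4"]

# Relevant documents the search failed to return.
missed = relevant_docs - set(search_result)

print(recall(search_result, relevant_docs))  # 0.75
print(sorted(missed))  # ['D3']
```

In a patent setting the list of missed documents matters as much as the score, since any one of them could be the document that later invalidates a patent.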
7. Introduction
Patent Search
A modern IR test bed
Promoting take up of research
Conclusion
We have characterized the patent search problem by describing its
target data, types of information needs, legal and economic
implications.
Next:
evaluating IR techniques in the patent domain
previous initiatives in the area of patent retrieval
the CLEF-IP and TREC-Chem initiatives
promoting take-up of research
Tait et al. Patent Search: An important new test bed for IR
8. Test collections
Test collections in Information Retrieval play a pivotal role in the
evaluation of retrieval models.
Domain-specific test collections already exist for:
Web pages
news stories
legal documents
blogs
genomics
patents
9. Pioneering work in patent retrieval.
Patent retrieval task at the NTCIR Workshop [1] since 2001.
produced test collections primarily targeting Japanese patents
retrieval tasks
ad-hoc (goal: find patents on a given topic)
invalidity search (goal: find patents invalidating a given claim)
patent classification according to the F-term system
Two new acronyms
F-term (abbreviation of File-forming term) is the classification
system used in Japan as a complement to IPC (International
Patent Classification)
[1] http://research.nii.ac.jp/ntcir
10. Evaluation tracks.
The IRF has engaged in two pilot evaluation tracks on patent
retrieval
CLEF-IP
www.ir-facility.org/the_irf/clef-ip09-track
TREC-Chem
www.ir-facility.org/the_irf/trec_chem.htm
11. CLEF-Intellectual Property Initiative.
CLEF-IP
coordinated by the IRF
part of the Cross-Language Evaluation Forum [2]
will focus on the task of prior art search
European patents as target data
automatic extraction of relevance assessments
Prior art search
Prior art search consists in identifying all information (including
NPL) that might be relevant to a patent’s claim of novelty.
[2] http://www.clef-campaign.org
12. Prior art search.
The most common type of patent search. Performed at various
stages of the patent life-cycle and with different intentions:
before filing an application (novelty search or patentability
search) to determine whether the invention fulfills the
requirements of
novelty
inventive step
before grant: results go into a search report attached to
the patent
invalidity search: post-grant search used to unveil prior art
that invalidates a patent’s claims of originality
13. Target data.
The CLEF-IP evaluation track will restrict target data to patents.
Target data:
comprising 16 years (filing date between 1985 and 2000) of
EPO patents
1.9 million patent documents corresponding to 1 million
patents
75 GB, in XML format
documents are in English, German, and French
14. Automatic extraction of relevance assessments.
The data resulting from prior art searches is saved in the EPO or
USPTO databases as:
citations in patent applications
citations in search reports
citations in the legal files of opposition procedures
The CLEF-IP track is going to extract this information (as much
as possible) automatically in order to form a large set of topics.
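The idea of turning stored citations into relevance assessments can be sketched as follows. This is only an illustration under stated assumptions: the record layout, the patent identifiers, and the source names ("application", "search_report", "opposition") are hypothetical and do not reflect an actual EPO or USPTO data format or the track's real extraction procedure.

```python
from collections import defaultdict

# Hypothetical citation records: (citing patent, cited document, source).
# Identifiers and source names are illustrative assumptions.
citations = [
    ("EP-A", "EP-X", "application"),
    ("EP-A", "EP-Y", "search_report"),
    ("EP-A", "EP-Y", "opposition"),
    ("EP-B", "EP-Z", "search_report"),
]


def build_qrels(records):
    """Map each topic (citing patent) to its cited documents and citing sources."""
    qrels = defaultdict(lambda: defaultdict(set))
    for topic, cited, source in records:
        qrels[topic][cited].add(source)
    # Convert to plain dicts with sorted source lists for stable output.
    return {
        topic: {doc: sorted(sources) for doc, sources in docs.items()}
        for topic, docs in qrels.items()
    }


qrels = build_qrels(citations)
for topic in sorted(qrels):
    print(topic, qrels[topic])
```

Each citing patent acts as a topic, and every document it cites is treated as relevant to that topic. Keeping the citation source makes it possible to weight, for example, opposition citations differently later on.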
15. Prior art from opposition procedures.
Under European patent law, a granted patent may
be opposed.
The opponent often provides new prior art that
invalidates the invention's claim of originality.
Patents cited in opposition procedures are very relevant prior
art documents.
They are the results of a very thorough invalidity search.
16. Crowdsourcing extraction of relevance assessments.
Need to extract citations from documents arising from
opposition procedures
These documents are only available as scanned images [3]
Will be using crowdsourcing for extracting these citations.
A new word from business jargon
Crowdsourcing.
[3] at http://www.epoline.org
17. Relevance and evaluation measures.
Labels used in search reports:
label   means that the cited document is
X       relevant when taken alone
Y       relevant in combination with other documents
A       relevant but not prejudicial to novelty or inventive step
How to use these labels for defining new evaluation measures?
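One possible direction, sketched here under assumptions and not proposed in the source, is to treat the X, Y and A labels as graded relevance levels and plug them into a standard graded measure such as nDCG. The gain values below (X=3, Y=2, A=1) are arbitrary illustrative choices, and the document identifiers are hypothetical.

```python
import math

# Assumed gain per search-report label; the source does not define any values.
GAIN = {"X": 3, "Y": 2, "A": 1}


def dcg(gains):
    """Discounted cumulative gain with a log2 rank discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(ranking, labels, k=10):
    """nDCG@k for one topic; documents without a label contribute gain 0."""
    gains = [GAIN.get(labels.get(doc), 0) for doc in ranking[:k]]
    ideal = sorted((GAIN[label] for label in labels.values()), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical labelled citations for one topic patent.
labels = {"D1": "X", "D2": "Y", "D3": "A"}
ranking_good = ["D1", "D2", "D3"]  # labelled documents in ideal order
ranking_bad = ["D3", "D9", "D1"]   # X document ranked low, Y document missing

print(ndcg(ranking_good, labels))  # 1.0
print(ndcg(ranking_bad, labels))
```

A measure of this kind rewards systems that rank X citations above A citations. Whether fixed gains are adequate for Y citations, which are relevant only in combination with other documents, is left open here.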
18. Challenges.
As a result of the CLEF-IP track we expect to obtain new insights
into:
how to represent the information need expressed by a patent
query reformulation
evaluation metrics for patent retrieval
using machine translation for improving retrieval effectiveness
19. TREC Chemistry track.
Ad-hoc search
Target data:
academic papers (Royal Society of Chemistry)
chemical patent documents (class C in the IPC)
Will use automatic extraction of citations for relevance
assessments
Challenges:
chemical names and structures
chemical interactions, relations, transformations, properties
20. A modern IR test bed
The IRF is contributing to the creation of new patent test
collections by organizing two tracks within the CLEF and
TREC evaluation campaigns.
In addition to the TREC and CLEF contributions, the IRF,
together with Matrixware, is promoting several initiatives
aimed at facilitating and improving the patent retrieval
process.
21. Promoting take up of research
Next:
presentation of the IRF and Matrixware
promoting take up of research
the IRF symposium
the PaIR workshop
providing the tools
funding research in the area of patent retrieval
22. IRF: the Information Retrieval Facility.
New international not-for-profit
foundation, based in Vienna
Its mission:
to bridge the gap between the needs of
industry and academic know-how
to promote and facilitate research in
large-scale information retrieval
to maintain a facility that enables large-scale
information retrieval and in-depth
data processing
23. Matrixware.
Founded 2005 in Vienna
80 Employees
> 15 Academic Partners Worldwide
Implements solutions for access to patent
information
24. Promoting research.
Matrixware and the IRF have engaged in several initiatives aimed
at promoting research and raising awareness in the area of patent
retrieval.
the Information Retrieval Facility Symposium
an annual symposium held in Vienna to foster knowledge
exchange between IR experts and IP professionals
the PaIR workshop
a workshop on Patent Information Retrieval hosted by the
CIKM conference
25. Providing the tools.
Successful IR research conventionally depends on three elements:
(1) the availability of test collections
(2) access to suitable software systems on which to run
experiments
(3) access to sufficiently powerful hardware
The IRF, supported by Matrixware, is providing all three of these.
26. Current University Projects.
Accessibility of Information (Glasgow)
Large Scale Logical Retrieval (Glasgow)
Semantic Analysis of Patent Data (Sheffield and Nijmegen)
Language Modeling for Patent Retrieval (UMass Amherst)
OCR for patents (UMass Amherst)
27. Concluding remarks
Patent retrieval is an interesting and important open
challenge for IR researchers.
The IRF and Matrixware have engaged in several projects
aimed at promoting research in this area.
28. Invitation.
You are invited to:
join one of the evaluation tracks
CLEF-IP
TREC-Chem
participate in the PaIR workshop
participate in the Information Retrieval Facility Symposium