Several methods can be applied to create a set of validated terms from existing documents. In this paper we describe an automatic bilingual term candidate extraction method and the validation process used to create a hierarchical patent terminology. The process was used in a project commissioned by the Swedish Patent Office to extract terms from patent texts for use in machine translation. For this purpose, information on the correct linguistic inflection patterns and a hierarchical partitioning of terms based on their use are of utmost importance.
The process contains six phases: 1) Analysis of the source material and system configuration; 2) Term candidate extraction; 3) Term candidate filtering and initial linguistic validation; 4) Manual validation by domain experts; 5) Final linguistic validation; and 6) Publishing the validated terms.
Input to the extraction process consisted of more than 91,000 patent document pairs in English and Swedish, comprising 565 million words in English and 450 million words in Swedish. The English documents were supplied in EBD SGML format and the Swedish documents as OCR-processed scans of patent documents. After grammatical and statistical analysis, the documents were word-aligned. Using the word-aligned material, candidate terms were extracted based on linguistic patterns. In total, 750,000 term candidates were extracted and stored in a relational database. The term candidates were processed over 8 months, resulting in 181,000 unique validated term pairs, which were then exported into several hierarchically organized OLIF files.
Automatic extraction and manual validation of a hierarchical English-Swedish terminology
1. Automatic extraction and manual validation of a hierarchical English-Swedish terminology
NORDTERM 2009
Magnus Merkel*, Jody Foo*, Mikael Andersson**, Lars Edholm**, Mikaela Gidlund**, Sanna Åsberg**
Presented by Jody Foo
* Department of Computer and Information Science, Linköping University
** Fodina Language Technology AB
2. Overview
- Background
- Term extraction and validation process
- Results
- Conclusions and future work
Merkel, Foo et al, NORDTERM 2009
3. Some history
2000: Patent Abstracts of Japan (PAJ) launches online machine translation initiative
2004: NLPLAB, Linköping University spin-off: Fodina Language Technology
2004: First attempts at MT @ EPO
2006: Patent Information Conference: results from initial machine translation projects
2006: EPO launches patent MT service
2008 – 2009: PRV term extraction and validation
4. Machine translation
- Two main approaches
  - Rule-based machine translation (RBMT), e.g. Babelfish
  - Statistical machine translation (SMT), e.g. Google Translate
- MT @ EPO
  - Rule-based MT engine: Systran
  - RBMT requires domain-specific dictionaries – patent terms
9. Overview of the term extraction and validation process
Input: SGML & OCR
1) Source data analysis and system configuration
2) Term candidate extraction
3) Term candidate filtering and initial linguistic validation
4) Manual validation by domain experts
5) Final linguistic validation
6) Publishing of validated terms
Output: OLIF
10. Perform necessary steps before term extraction is possible
11. Analysis of source material and system configuration
N N PREP DET ING N
EN: Shell thickness at the testing location
SE: Skalets tjocklek vid provstället
N N PREP N
12. Extract list of term candidates to be validated
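A minimal sketch of what pattern-based extraction from word-aligned text might look like. Everything here is an assumption made for illustration: the tagged sentence pair, the alignment links, the "runs of N tags" pattern, and the function names `noun_spans` and `extract_candidates` are hypothetical and are not taken from the system described in the talk.

```python
# Illustrative data only: a POS-tagged sentence pair and word alignment links.
src = [("Shell", "N"), ("thickness", "N"), ("at", "PREP"),
       ("the", "DET"), ("testing", "ING"), ("location", "N")]
trg = [("Skalets", "N"), ("tjocklek", "N"), ("vid", "PREP"), ("provstället", "N")]
# Each link is (source token index, target token index).
links = {(0, 0), (1, 1), (2, 2), (4, 3), (5, 3)}


def noun_spans(tagged, allowed=("N",)):
    """Return (start, end) spans of maximal runs of tokens with an allowed tag."""
    spans, start = [], None
    for i, (_, tag) in enumerate(tagged):
        if tag in allowed:
            if start is None:
                start = i
        elif start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(tagged)))
    return spans


def extract_candidates(src, trg, links):
    """Pair each source noun run with the target tokens it is linked to."""
    candidates = []
    for start, end in noun_spans(src):
        t_idx = sorted({t for (s, t) in links if start <= s < end})
        # Skip unaligned spans and spans whose target side is not contiguous.
        if not t_idx or t_idx != list(range(t_idx[0], t_idx[-1] + 1)):
            continue
        source_term = " ".join(word for word, _ in src[start:end])
        target_term = " ".join(trg[t][0] for t in t_idx)
        candidates.append((source_term, target_term))
    return candidates


pairs = extract_candidates(src, trg, links)
```

With this toy input, `pairs` holds two candidates, one for each run of noun-tagged source tokens. A real pattern set would cover more than plain noun runs.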
16. Reduce the number of term candidates to be processed by the domain experts
17. Term filtering and initial linguistic validation
- Filtering criteria
  - General language filtering
  - Q-value (~alignment confidence)
  - Link errors
  - Source OR target frequency > 4
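The four criteria could be combined into a single filter along the following lines. This is only a sketch under stated assumptions: the `Candidate` fields, the contents of `GENERAL_LANGUAGE`, the value of `Q_THRESHOLD`, and the reading that a candidate is kept when its source or target frequency exceeds 4 are all hypothetical; the slides do not give the actual thresholds or data model.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    source: str
    target: str
    src_freq: int
    trg_freq: int
    q_value: float          # alignment confidence (scale assumed to be 0..1)
    link_error: bool = False


GENERAL_LANGUAGE = {"method", "example"}  # hypothetical general-language list
Q_THRESHOLD = 0.5                         # hypothetical cut-off


def keep(c: Candidate) -> bool:
    """Return True if the candidate passes all four filtering criteria."""
    if c.source.lower() in GENERAL_LANGUAGE:      # general language filtering
        return False
    if c.link_error:                              # link errors
        return False
    if not (c.src_freq > 4 or c.trg_freq > 4):    # source OR target frequency > 4
        return False
    return c.q_value >= Q_THRESHOLD               # Q-value filtering


candidates = [
    Candidate("glass fibre", "glasfiber", 12, 9, 0.9),
    Candidate("method", "metod", 500, 480, 0.95),
    Candidate("melting furnace", "smältugn", 2, 3, 0.8),
    Candidate("kiln", "ugn", 7, 1, 0.2),
    Candidate("slag", "slagg", 6, 6, 0.7, link_error=True),
]
kept = [c.source for c in candidates if keep(c)]
```

Only the first toy candidate survives; each of the others fails exactly one criterion, which makes it easy to see what each check removes.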
18. Term filtering and initial linguistic validation
- Example: C04B
  Total number of term candidates: 143,341
  General language entries: 18,764
  Link errors: 653
  Freq >4 src|trg: 9,064
  Q-value filtering: keep 4,076
  Total after filtering: 3,179
21. Final linguistic validation
- To be validated
  - Part-of-speech, Inflection pattern, Gender, Number
- Recycle as much information as possible from previously validated terms
- Process terms by recycling status
  - Very reliable information
  - Less reliable information
  - No information available
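One way to picture the three recycling statuses is sketched below. The rules used to assign a status are assumptions made for illustration, not the criteria used in the project: an exact match against previously validated terms is treated as very reliable, and a match on the final component of a compound as less reliable. The store contents, the pattern labels, and the names `RecyclingStatus` and `recycle` are hypothetical.

```python
from enum import Enum


class RecyclingStatus(Enum):
    VERY_RELIABLE = 1
    LESS_RELIABLE = 2
    NO_INFORMATION = 3


# Hypothetical store of previously validated Swedish terms.
validated = {
    "ugn": {"pos": "N", "gender": "common", "pattern": "pattern-A"},
    "glas": {"pos": "N", "gender": "neuter", "pattern": "pattern-B"},
}


def recycle(term, validated):
    """Return (status, recycled linguistic information or None) for a term."""
    if term in validated:
        return RecyclingStatus.VERY_RELIABLE, validated[term]
    # Assumed heuristic: a compound often inflects like its final component,
    # so reuse the longest validated term that the new term ends with.
    heads = [v for v in validated if len(v) >= 3 and term.endswith(v)]
    if heads:
        return RecyclingStatus.LESS_RELIABLE, validated[max(heads, key=len)]
    return RecyclingStatus.NO_INFORMATION, None


terms = ["fiber", "smältugn", "glas"]
# Process terms grouped by status, most reliable first.
ordered = sorted(terms, key=lambda t: recycle(t, validated)[0].value)
```

Sorting by status lets a validator confirm the recycled suggestions quickly before spending time on the terms that have no information at all.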
22. Publishing of validated terms
Top
  A
    A61
  C
    C03
      C03B
      C03C
    C11
    C21
      C21B
      C21C
      C21D
  E
  F
    F42
  H
    H05
      H05B
      H05C
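The hierarchy follows from the patent classification codes themselves: the first character is the section, the first three characters the class, and the four-character code the subclass. A small sketch of deriving the tree from a list of subclass codes follows; the function names `ipc_path` and `build_tree` and the nested-dictionary representation are assumptions for illustration, and nothing here reflects how the OLIF files were actually generated.

```python
def ipc_path(subclass):
    """Split a subclass code into its section, class and subclass levels."""
    return [subclass[:1], subclass[:3], subclass]


def build_tree(subclasses):
    """Build a nested dict: section -> class -> subclass -> {}."""
    tree = {}
    for code in subclasses:
        node = tree
        for level in ipc_path(code):
            node = node.setdefault(level, {})
    return tree


tree = build_tree(["C03B", "C03C", "C21B", "C21C", "C21D", "H05B", "H05C"])
```

Terms validated for a subclass can then be attached to the matching leaf, while terms shared by sibling subclasses could be stored at the class or section node.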
23. Final numbers
- Processed 91,000 document pairs in 8 months.
- Validated term pairs: 181,260
- Expert validation: 4,000 to 6,000 term candidate pairs/working day
- Linguistic validation: 2,000 to 3,000 term pairs/working day
Section   Share of total      Accumulated amount    Accumulated amount   Accumulated amount
          number of           of total number of    of term pairs        of UNIQUE term
          documents (in %)    documents (in %)                           pairs
D         2.8                 2.8                   17,288               9,697
E         2.1                 4.9                   32,045               16,304
F         7.1                 12.0                  78,301               32,512
G         10.2                22.2                  133,912              53,731
H         10.3                32.5                  187,429              72,721
A         20.7                53.2                  289,850              110,642
B         18.1                71.3                  419,185              146,665
C         28.7                100.0                 545,143              181,260
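The accumulated columns can be re-derived from the per-section figures, which also shows how many new unique term pairs each section contributed. The sketch below uses only the numbers from the table; the variable names are illustrative.

```python
from itertools import accumulate

sections = ["D", "E", "F", "G", "H", "A", "B", "C"]
share = [2.8, 2.1, 7.1, 10.2, 10.3, 20.7, 18.1, 28.7]   # % of documents per section
acc_pairs = [17288, 32045, 78301, 133912, 187429, 289850, 419185, 545143]
acc_unique = [9697, 16304, 32512, 53731, 72721, 110642, 146665, 181260]

# Running total of the document share, rounded to one decimal as in the table.
acc_share = [round(x, 1) for x in accumulate(share)]

# New unique term pairs contributed by each section, in processing order.
new_unique = [b - a for a, b in zip([0] + acc_unique, acc_unique)]

# Fraction of all validated term pairs that were unique at the end.
unique_ratio = acc_unique[-1] / acc_pairs[-1]

for name, pct, new in zip(sections, acc_share, new_unique):
    print(f"{name}: {pct:5.1f} % of documents processed, {new} new unique pairs")
```

Roughly one third of the validated term pairs are unique, which reflects how often later sections repeat pairs that had already been validated for earlier ones.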
24. Growth of validated terms
[Figure: accumulated amount of validated term pairs and accumulated amount of validated UNIQUE term pairs. Y-axis: number of validated term pairs (0 to 600,000). X-axis: amount of total number of documents in % (0 to 100).]
A blue diamond marks the right edge of a section, left to right: D - E - F - G - H - A - B - C.
25. Conclusions and future work
- Key concepts
  - using previously validated term pairs to avoid doing the same work twice
  - using students as domain experts
  - using an efficient validation tool
- Future work
  - Improving automated filtering and reduction of term candidates
  - Automating termness detection