Automatic extraction and manual validation of a hierarchical English-Swedish terminology


NORDTERM 2009

Magnus Merkel*, Jody Foo*, Mikael Andersson**, Lars Edholm**, Mikaela
Gidlund**, Sanna Åsberg**

Presented by Jody Foo

* Department of Computer and Information Science, Linköping University
** Fodina Language Technology AB

Overview

!! Background
!! Term extraction and validation process
!! Results
!! Conclusions and future work

Merkel, Foo et al, NORDTERM 2009

Some history

!! 2000: Patent Abstracts of Japan (PAJ) launches online machine translation initiative
!! 2004: First attempts at MT @ EPO
!! 2004: Spin-off from NLPLAB, Linköping University: Fodina Language Technology
!! 2006: EPO launches patent MT service
!! 2006: Patent Information Conference, results from initial machine translation projects
!! 2008 – 2009: PRV term extraction and validation


Machine translation

!! Two main approaches
!! Rule-based machine translation (RBMT), e.g. Babel Fish
!! Statistical machine translation (SMT), e.g. Google Translate

!! MT @ EPO
!! Rule-based MT engine: Systran
!! RBMT requires domain-specific dictionaries – patent terms


Diallo 2006


Input data

[Figure: three bar charts showing the input data per IPC subclass (A01B to H04K), per IPC class (A01 to H04) and per IPC section (A to H).]
Overview of the term extraction and validation process

!! SGML & OCR
!! Source data analysis and system configuration
!! Term candidate extraction
!! Term candidate filtering and initial linguistic validation
!! Manual validation by domain experts
!! Final linguistic validation
!! Publishing of validated terms (OLIF)


Perform necessary steps before term
extraction is possible




Analysis of source material and system configuration

Example of a tagged segment pair:

EN: Shell thickness at the testing location
    N N PREP DET ... N
SE: Skalets tjocklek vid provstället
    N N PREP N


Extract list of term candidates to be
validated




Term candidate extraction


Client-server infrastructure


Reduce the number of term candidates to be
processed by the domain experts




Term filtering and initial linguistic validation

!! Filtering criteria
!! General language filtering
!! Q-value (~alignment confidence)
!! Link errors
!! Source OR target frequency > 4
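The filtering criteria above could be combined roughly as in the following sketch. It is only an illustration: the `TermCandidate` fields, the general-language word list and the Q-value threshold of 0.5 are assumptions made for the example, and the slides do not state how the criteria were actually combined or which threshold was used. The sketch assumes that a candidate must pass all four checks.

```python
from dataclasses import dataclass


@dataclass
class TermCandidate:
    """Hypothetical record for one aligned English-Swedish term candidate."""
    source: str          # English term candidate
    target: str          # Swedish term candidate
    source_freq: int     # occurrences of the source term
    target_freq: int     # occurrences of the target term
    q_value: float       # alignment confidence (assumed: higher is better)
    link_error: bool = False


def keep(candidate, general_language=frozenset(), q_threshold=0.5):
    """Return True if the candidate passes all four filters (assumed logic)."""
    # General language filtering: drop entries found in a general-language list.
    if candidate.source.lower() in general_language:
        return False
    # Link errors: drop candidates flagged as faulty alignments.
    if candidate.link_error:
        return False
    # Frequency: source OR target frequency must be greater than 4.
    if not (candidate.source_freq > 4 or candidate.target_freq > 4):
        return False
    # Q-value: keep only sufficiently confident alignments (threshold is made up).
    return candidate.q_value >= q_threshold


def filter_candidates(candidates, general_language=frozenset(), q_threshold=0.5):
    """Return the candidates that would be passed on to the domain experts."""
    return [c for c in candidates if keep(c, general_language, q_threshold)]


if __name__ == "__main__":
    # Made-up sample data, not taken from the project.
    sample = [
        TermCandidate("shell thickness", "skaltjocklek", 7, 2, 0.9),
        TermCandidate("method", "metod", 50, 40, 0.9),
        TermCandidate("rare term", "ovanlig term", 1, 1, 0.9),
    ]
    for c in filter_candidates(sample, general_language={"method"}):
        print(c.source, "->", c.target)
```

Applying the cheap automatic checks first keeps the list that reaches the domain experts as short as possible, which is the stated purpose of this step.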


Term filtering and initial linguistic validation

!! Example: C04B

Total number of term candidates: 143,341
General language entries: 18,764
Link errors: 653
Freq >4 src|trg: 9,064
Q-value filtering: keep 4,076

Total after filtering: 3,179


Manual validation by domain experts


Final linguistic validation

!! To be validated
!! Part-of-speech, inflection pattern, gender, number

!! Recycle as much information as possible from previously
validated terms

!! Process terms by recycling status
!! Very reliable information
!! Less reliable information
!! No information available


Publishing of validated terms

Top
  A
    A61
  C
    C03
      C03B
      C03C
    C11
    C21
      C21B
      C21C
      C21D
  E
  F
    F42
  H
    H05
      H05B
      H05C
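The hierarchy follows the structure of the IPC codes themselves: the first character is the section, the first three characters the class, and the first four the subclass. A minimal sketch of how such a tree could be built from a flat list of codes is shown below; the function names are made up for the example and the slides do not describe how the published hierarchy was actually generated.

```python
def ipc_path(code):
    """Return the chain of IPC nodes from section down to the given code.

    For example, a subclass code yields [section, class, subclass].
    """
    code = code.strip().upper()
    # Prefix lengths: 1 = section, 3 = class, 4 = subclass.
    return [code[:n] for n in (1, 3, 4) if len(code) >= n]


def build_tree(codes):
    """Build a nested dict {section: {class: {subclass: {}}}} from IPC codes."""
    tree = {}
    for code in codes:
        node = tree
        for part in ipc_path(code):
            node = node.setdefault(part, {})
    return tree


def print_tree(tree, indent=0):
    """Print the hierarchy with two spaces of indentation per level."""
    for name in sorted(tree):
        print(" " * indent + name)
        print_tree(tree[name], indent + 2)


if __name__ == "__main__":
    # Codes taken from the hierarchy shown on the slide.
    codes = ["A61", "C03B", "C03C", "C11", "C21B", "C21C", "C21D",
             "E", "F42", "H05B", "H05C"]
    print("Top")
    print_tree(build_tree(codes), indent=2)
```

Because each level is a prefix of the level below it, a term validated for a subclass can be located from its class or section without storing any extra parent links.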


Final numbers
!! Processed 91,000 document pairs in 8 months.
!! Validated unique term pairs: 181,260
!! Expert validation: 4,000 – 6,000 term candidate pairs/working day
!! Linguistic validation: 2,000 – 3,000 term pairs/working day
Section   Share of total    Accumulated share    Accumulated      Accumulated number
          number of         of total number of   number of        of UNIQUE
          documents (in %)  documents (in %)     term pairs       term pairs
D          2.8                2.8                 17,288             9,697
E          2.1                4.9                 32,045            16,304
F          7.1               12.0                 78,301            32,512
G         10.2               22.2                133,912            53,731
H         10.3               32.5                187,429            72,721
A         20.7               53.2                289,850           110,642
B         18.1               71.3                419,185           146,665
C         28.7              100.0                545,143           181,260
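The per-section contribution can be derived from the accumulated figures by taking differences between consecutive rows. The sketch below does this under the assumption that the accumulated columns are running totals in the processing order D, E, F, G, H, A, B, C; the variable and function names are chosen for the example only.

```python
# (section, share of documents in %, accumulated term pairs, accumulated unique term pairs)
ROWS = [
    ("D", 2.8, 17288, 9697),
    ("E", 2.1, 32045, 16304),
    ("F", 7.1, 78301, 32512),
    ("G", 10.2, 133912, 53731),
    ("H", 10.3, 187429, 72721),
    ("A", 20.7, 289850, 110642),
    ("B", 18.1, 419185, 146665),
    ("C", 28.7, 545143, 181260),
]


def per_section(rows):
    """Return (section, new pairs, new unique pairs, unique ratio) per section."""
    result = []
    prev_pairs = prev_unique = 0
    for section, _share, acc_pairs, acc_unique in rows:
        new_pairs = acc_pairs - prev_pairs        # pairs validated in this section
        new_unique = acc_unique - prev_unique     # pairs not seen in earlier sections
        result.append((section, new_pairs, new_unique, new_unique / new_pairs))
        prev_pairs, prev_unique = acc_pairs, acc_unique
    return result


if __name__ == "__main__":
    for section, new_pairs, new_unique, ratio in per_section(ROWS):
        print(f"{section}: {new_pairs:>7} term pairs, "
              f"{new_unique:>6} new unique ({ratio:.1%})")
```

Read this way, about 56% of the term pairs in the first section (D) were new unique pairs, against about 27% in the last section (C), which is consistent with a growing share of term pairs that had already been validated in earlier sections.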


Growth of validated terms

[Figure: accumulated number of validated term pairs and accumulated number of validated UNIQUE term pairs (y-axis: number of validated term pairs, 0 to 600,000) plotted against the share of the total number of documents (x-axis, in %). A blue diamond marks the right edge of a section, left to right: D - E - F - G - H - A - B - C.]


Conclusions and future work

!! Key concepts
!! using previously validated term pairs to avoid doing the same
work twice
!! using students as domain experts
!! using an efficient validation tool

!! Future work
!! Improving automated filtering and reduction of term candidates
!! Automating termness detection

