Automatic extraction and manual validation
of a hierarchical English-Swedish
terminology

NORDTERM 2009

  Magnus Merkel*, Jody Foo*, Mikael Andersson**, Lars Edholm**, Mikaela
                       Gidlund**, Sanna Åsberg**

                                 Presented by Jody Foo

           * Department of Computer and Information Science, Linköping University
                            ** Fodina Language Technology AB
Overview

!! Background
!! Term extraction and validation process
!! Results
!! Conclusions and future work




                        Merkel, Foo et al, NORDTERM 2009
Some history

2000: Patent Abstracts of Japan (PAJ) launches online machine translation initiative
2004: NLPLAB, Linköping University spin-off: Fodina Language Technology
2004: First attempts at MT @ EPO
2006: Patent Information Conference: results from initial machine translation projects
2006: EPO launches patent MT service
2008 – 2009: PRV term extraction and validation
Machine translation

!! Two main approaches
  !! Rule-based machine translation (RBMT), e.g. Babelfish
  !! Statistical machine translation (SMT), e.g. Google Translate


!! MT @ EPO
  !! Rule-based MT engine: Systran
  !! RBMT requires domain-specific dictionaries – patent terms
[Two figure slides, credited to Diallo 2006; images not recoverable]
Input data

[Figure: three bar charts of the input data, broken down by section (A to H), by class (A01 to H04) and by subclass (A01B to H04K); axis units not recoverable]
Overview of the term extraction and validation process

Input: SGML & OCR

Process steps, in order:
!! Source data analysis and system configuration
!! Term candidate extraction
!! Term candidate filtering and initial linguistic validation
!! Manual validation by domain experts
!! Final linguistic validation
!! Publishing of validated terms

Output: OLIF
Perform necessary steps before term extraction is possible
Analysis of source material and system configuration

  English: Shell thickness at the testing location
  Swedish: Skalets tjocklek vid provstället

  [Segment identifiers and per-word annotation labels are not recoverable from the source]
Extract list of term candidates to be validated
Term candidate extraction




Client-server infrastructure




Reduce the number of term candidates to be processed by the domain experts
Term filtering and initial linguistic validation

!! Filtering criteria
   !! General language filtering
   !! Q-value (~alignment confidence)
   !! Link errors
   !! Source OR target frequency > 4
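
The four criteria can be read as successive filters over aligned term candidate pairs. The following is a minimal sketch of that reading, not the project's actual implementation. The `Candidate` fields, the `GENERAL_LANGUAGE` word list, the sample data and the Q-value threshold of 0.5 are all illustrative assumptions; only the frequency condition (source or target frequency above 4, read here as a keep condition) comes from the slide.

```python
from dataclasses import dataclass

# Hypothetical stand-in for a general-language word list (assumption).
GENERAL_LANGUAGE = {"method", "device", "use"}


@dataclass
class Candidate:
    source: str        # English term candidate
    target: str        # Swedish term candidate
    source_freq: int   # occurrences in the English documents
    target_freq: int   # occurrences in the Swedish documents
    q_value: float     # alignment confidence; the 0..1 scale is an assumption
    link_error: bool   # True if the alignment link was flagged as wrong


def keep(c: Candidate, q_threshold: float = 0.5) -> bool:
    """Return True if the candidate passes all four filters."""
    if c.source.lower() in GENERAL_LANGUAGE:   # general language filtering
        return False
    if c.link_error:                           # link errors
        return False
    if not (c.source_freq > 4 or c.target_freq > 4):  # frequency criterion
        return False
    return c.q_value >= q_threshold            # Q-value criterion


# Invented sample data, for illustration only.
candidates = [
    Candidate("shell thickness", "skaltjocklek", 12, 9, 0.9, False),
    Candidate("method", "metod", 500, 480, 0.95, False),
    Candidate("test location", "provställe", 2, 3, 0.8, False),
    Candidate("binder", "tjocklek", 7, 7, 0.7, True),
]
kept = [c for c in candidates if keep(c)]
print([c.source for c in kept])
```

Running the cheap checks (word list, link flag, frequency) before the Q-value comparison is a design choice of this sketch; the slides do not state the order in which the filters were applied.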




Term filtering and initial linguistic validation

!! Example: C04B

  Total number of term candidates: 143,341
  General language entries: 18,764
  Link errors: 653
  Frequency > 4 (source or target): 9,064
  Q-value filtering: keep 4,076 [rest of line garbled in source]

  Total after filtering:   3,179
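
To put the C04B example in proportion, the overall reduction can be computed from the first and last figures above. This is a small worked calculation, not part of the original slides; the variable names are illustrative.

```python
total_candidates = 143_341  # total number of term candidates (from the slide)
after_filtering = 3_179     # total after filtering (from the slide)

# Share of the extracted candidates that reaches the domain experts.
retained_pct = 100 * after_filtering / total_candidates
print(f"Retained after filtering: {retained_pct:.1f}% of all candidates")
```

For this subclass, roughly 2% of the extracted candidates are passed on for manual validation.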


Manual validation by domain experts




Final linguistic validation

!! To be validated
  !! Part-of-speech, Inflection pattern, Gender, Number

!! Recycle as much information as possible from previously
   validated terms

!! Process terms by recycling status
  !! Very reliable information
  !! Less reliable information
  !! No information available
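
One way to picture "process terms by recycling status" is to sort the validation queue into three classes, depending on how much linguistic information can be carried over from earlier work. The sketch below is hypothetical: the slides do not say how reliability was determined. It assumes that an exact match against a previously validated term is "very reliable", and that a compound whose final part is a validated term is "less reliable" (Swedish compounds normally inherit gender and inflection from their last element). The `VALIDATED` store and the sample terms are invented for illustration.

```python
from enum import Enum


class Recycling(Enum):
    VERY_RELIABLE = 1
    LESS_RELIABLE = 2
    NO_INFORMATION = 3


# Hypothetical store: previously validated Swedish terms and their properties.
VALIDATED = {
    "tjocklek": {"pos": "noun", "gender": "common", "number": "singular"},
}


def recycling_status(term: str) -> Recycling:
    """Classify a term by how much earlier validation work can be reused."""
    if term in VALIDATED:
        # Exact match: reuse the stored information directly.
        return Recycling.VERY_RELIABLE
    if any(term.endswith(known) for known in VALIDATED):
        # Assumed heuristic: compound ending in a validated term.
        return Recycling.LESS_RELIABLE
    return Recycling.NO_INFORMATION


terms = ["provställe", "skaltjocklek", "tjocklek"]
# Process the most reliable group first, the unknown terms last.
queue = sorted(terms, key=lambda t: recycling_status(t).value)
print(queue)
```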




Publishing of validated terms


Top
  A
    A61
  C
    C03
      C03B
      C03C
    C11
    C21
      C21B
      C21C
      C21D
  E
  F
    F42
  H
    H05
      H05B
      H05C
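
The levels of this hierarchy can be derived from the codes themselves: the first character gives the top-level node, the first three characters the middle level, and the full four-character code the leaf. The sketch below rebuilds the tree from a list of leaf codes and lists the ancestors of a code. It illustrates the structure only; the function names are hypothetical, and how the published terminology actually stores or looks up terms across levels is not described in the slides.

```python
def build_hierarchy(subclasses):
    """Group four-character codes under their section and class prefixes."""
    tree = {}
    for code in subclasses:
        section, cls = code[0], code[:3]
        tree.setdefault(section, {}).setdefault(cls, []).append(code)
    return tree


def ancestors(code):
    """Return the path from a code up to the root node 'Top'."""
    path = [code]
    if len(code) == 4:
        path.append(code[:3])   # class level, e.g. C21
    if len(code) >= 3:
        path.append(code[0])    # section level, e.g. C
    path.append("Top")
    return path


tree = build_hierarchy(["C03B", "C03C", "C21B", "C21C", "C21D", "H05B", "H05C"])
print(tree["C"])
print(ancestors("C21B"))
```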
Final numbers
!! Processed 91,000 document pairs in 8 months.
!! Validated term pairs: 181,260
!! Expert validation: 4,000 – 6,000 term candidate pairs/working day
!! Linguistic validation: 2,000 – 3,000 term pairs/working day

Section   Share of total   Accumulated share    Accumulated     Accumulated
          documents (%)    of total             term pairs      UNIQUE term
                           documents (%)                        pairs
   D           2.8               2.8               17,288           9,697
   E           2.1               4.9               32,045          16,304
   F           7.1              12.0               78,301          32,512
   G          10.2              22.2              133,912          53,731
   H          10.3              32.5              187,429          72,721
   A          20.7              53.2              289,850         110,642
   B          18.1              71.3              419,185         146,665
   C          28.7             100.0              545,143         181,260
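
The accumulated columns can be checked, and the per-section growth read off, with a few lines of arithmetic. The sketch below uses only the figures from the table. Treating the gap between the two accumulated counts as term pairs already seen in an earlier section is an assumption of this sketch, not a statement from the slides.

```python
from itertools import accumulate

sections = ["D", "E", "F", "G", "H", "A", "B", "C"]
share = [2.8, 2.1, 7.1, 10.2, 10.3, 20.7, 18.1, 28.7]  # % of documents
acc_pairs = [17288, 32045, 78301, 133912, 187429, 289850, 419185, 545143]
acc_unique = [9697, 16304, 32512, 53731, 72721, 110642, 146665, 181260]

# Running total of the per-section document shares.
acc_share = [round(x, 1) for x in accumulate(share)]

# Per-section increments: difference between consecutive accumulated values.
new_pairs = [b - a for a, b in zip([0] + acc_pairs, acc_pairs)]
new_unique = [b - a for a, b in zip([0] + acc_unique, acc_unique)]

for name, s, p, u in zip(sections, acc_share, new_pairs, new_unique):
    print(f"{name}: {s:5.1f}% of documents, {p:6d} new pairs, "
          f"{u:5d} new unique ({100 * u / p:.0f}%)")
```

Under that assumption, the share of genuinely new pairs falls as more sections are processed, which fits the emphasis on reusing previously validated term pairs.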



Growth of validated terms

[Figure: line chart. x-axis: amount of total number of documents (in %), 0 to 100. y-axis: number of validated term pairs, 0 to 600,000. Series: accumulated amount of validated term pairs; accumulated amount of validated UNIQUE term pairs. A blue diamond marks the right edge of a section, left to right: D - E - F - G - H - A - B - C.]
Conclusions and future work

!! Key concepts
  !! using previously validated term pairs to avoid doing the same
     work twice
  !! using students as domain experts
  !! using an efficient validation tool

!! Future work
  !! Improving automated filtering and reduction of term candidates
  !! Automating termness detection



