"Bilingual Terminology Extraction from TMX. A state-of-the-art overview." Presentation at Translating Europe Forum 2016. Focus on translation technology.
INDEX
Main points of this presentation
1st point: Key words
- Overview of terms involved in the process
2nd point: Terminology and extractors
- Terminology management
- Its timeline
- BATE (approaches, state of the art)
3rd point: Evaluation
- BATE under evaluation
- Measures for accuracy
- Quality in use model and tasks
4th point: Results
- Precision & Recall
- Parameters & Questionnaire
KEY WORDS
Terms involved in the process
- Parallel corpus: TMX
- Alignment levels: paragraph, sentence and word level
- ATE & BATE: monolingual and bilingual automatic term extractors
- Precision/Recall: getting only terms and all terms
- Gold standard: exhaustive, manually created bilingual glossary
- Validation: term validation facility. Which term candidates (TCs) are real terms?
- Usability: software used to achieve user’s objectives with effectiveness, efficiency, and satisfaction
- Quality in use model: ISO standard
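As a rough illustration of how precision ("getting only terms") and recall ("getting all terms") relate a list of extracted candidates to a gold standard, here is a minimal sketch. The function name `precision_recall` and the sample term pairs are hypothetical and are not taken from the evaluation reported in this presentation.

```python
def precision_recall(candidates, gold):
    """Return (precision, recall) of candidate term pairs against a gold standard."""
    candidates, gold = set(candidates), set(gold)
    true_positives = len(candidates & gold)
    # Precision: share of extracted candidates that are real terms.
    precision = true_positives / len(candidates) if candidates else 0.0
    # Recall: share of gold-standard terms that were extracted.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical English-Spanish data, for illustration only.
gold_standard = {
    ("term base", "base terminológica"),
    ("parallel corpus", "corpus paralelo"),
    ("gold standard", "patrón de referencia"),
    ("word alignment", "alineación de palabras"),
}
extracted = [
    ("term base", "base terminológica"),
    ("parallel corpus", "corpus paralelo"),
    ("the process", "el proceso"),  # noise: not in the gold standard
]

precision, recall = precision_recall(extracted, gold_standard)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Exact matching of pairs is the simplest possible criterion; a real evaluation may also need to decide how to treat partial matches and spelling variants.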
IMPORTANCE OF TERMINOLOGY
Translators were the first professionals to be aware of term-related issues
- IDENTIFY: Identify and interpret the terminology in the source text adequately
- FIND: Find and use proper documentation and information resources
- RETRIEVE: Retrieve and store terminological data
TERMINOLOGY MANAGEMENT
+40%: time spent solving terminological problems in specialized translation (Arntz 1993, Walker 1993).
Managing terminology (extracting, validating, importing, adding, editing, deleting,
revising, updating, exporting, publishing) is a time-consuming process.
Terminology work happens “backstage”, and customers or
employers may not be fully aware of its benefits for QA.
90%: Return on Investment (ROI) on terminology management
reported by some corporate studies (Childress, 2007;
Popiolek, 2015)
TERMINOLOGY MANAGEMENT
General model for project terminology creation (Popiolek, 2015: 351)
Extraction
• List of terms extracted from ST
• List of terms to validate (accept or reject)
Translation
• List is added to a termbase
• List is translated and additional data added
Approval
• List approved by a person in charge of terminology
• When the client has requested it, there is an additional step for client approval
Monolingual extraction & validation
Importing & looking for equivalents
TIMELINE in Terminology Management with bilingual extraction
Preparation: TMX import
- Preparing the files and importing them into the BATE
Bilingual extraction
- List of candidate term pairs extracted from TMX
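To make the TMX import step more concrete, the following minimal sketch reads source/target segment pairs from TMX content with Python's standard library. It is not how any of the extractors discussed here import files; the function name `read_tmx_pairs` and the sample translation unit are hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(tmx_text, src_lang, tgt_lang):
    """Return a list of (source, target) segment pairs found in TMX content."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            # TMX 1.4 uses xml:lang; older files may carry a plain "lang" attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            lang = lang.split("-")[0]  # treat "en-GB" and "en" alike
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline tags.
                segments[lang] = "".join(seg.itertext()).strip()
        if src_lang in segments and tgt_lang in segments:
            pairs.append((segments[src_lang], segments[tgt_lang]))
    return pairs


# Hypothetical one-unit TMX document, for illustration only.
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>The term base was updated.</seg></tuv>
      <tuv xml:lang="es"><seg>Se actualizó la base terminológica.</seg></tuv>
    </tu>
  </body>
</tmx>"""

segment_pairs = read_tmx_pairs(SAMPLE_TMX, "en", "es")
print(segment_pairs)
```

The resulting sentence pairs are the raw material from which a bilingual extractor proposes candidate term pairs.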
Validation (& data entry)
- List of pairs of terms to validate (accept or reject terms and suggested equivalents)
- Terms, one by one, and additional data are added to a term base (SynchroTerm)
Export/Import
- Export bilingual terms and additional data in an available file format (.xls, .txt, .TBX, …)
- Import output file to a TDB system (to be integrated into an MT system)
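As a small illustration of the export step, the sketch below writes validated term pairs and one extra data field to tab-delimited text, a format most term base systems can import. The column names and the helper `export_terms` are assumptions made for this example; real tools define their own export layouts.

```python
import csv
import io


def export_terms(term_entries, stream):
    """Write (source term, target term, note) entries as tab-delimited text."""
    writer = csv.writer(stream, delimiter="\t", lineterminator="\n")
    writer.writerow(["source_term", "target_term", "note"])  # hypothetical header
    for source_term, target_term, note in term_entries:
        writer.writerow([source_term, target_term, note])


# Hypothetical validated entries, for illustration only.
validated = [
    ("term base", "base terminológica", "validated"),
    ("parallel corpus", "corpus paralelo", "validated"),
]

buffer = io.StringIO()  # a real export would open a .txt file instead
export_terms(validated, buffer)
exported_text = buffer.getvalue()
print(exported_text)
```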
Bilingual Automatic Term Extractors
Two approaches (Foo, 2012)
EXTRACT-ALIGN
1st step: monolingual terminology extraction in both languages.
2nd step: cross-linguistic matching using word-alignment or co-occurrence statistics to find equivalents.
Commercial systems follow this approach.
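The second step of EXTRACT-ALIGN can be sketched with a simple co-occurrence statistic. The example below assumes that the monolingual candidate lists already exist and pairs them with the Dice coefficient over aligned segments. It is an illustrative simplification, not the method of any particular commercial system; `match_terms`, the threshold value and the sample data are hypothetical.

```python
from collections import Counter


def match_terms(segment_pairs, src_terms, tgt_terms, threshold=0.6):
    """Pair source and target candidates by Dice co-occurrence over aligned segments.

    Terms are expected in lower case; matching is naive substring search.
    """
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src_seg, tgt_seg in segment_pairs:
        src_seg, tgt_seg = src_seg.lower(), tgt_seg.lower()
        found_src = [t for t in src_terms if t in src_seg]
        found_tgt = [t for t in tgt_terms if t in tgt_seg]
        src_count.update(found_src)
        tgt_count.update(found_tgt)
        # Every source/target candidate seen in the same segment pair co-occurs once.
        joint.update((s, t) for s in found_src for t in found_tgt)
    scored = []
    for (s, t), together in joint.items():
        dice = 2 * together / (src_count[s] + tgt_count[t])
        if dice >= threshold:
            scored.append((s, t, dice))
    return sorted(scored, key=lambda item: -item[2])


# Hypothetical aligned segments and monolingual candidates, for illustration only.
segments = [
    ("The term base is empty", "La base terminológica está vacía"),
    ("Import the parallel corpus", "Importar el corpus paralelo"),
    ("The parallel corpus feeds the term base",
     "El corpus paralelo alimenta la base terminológica"),
]
matches = match_terms(
    segments,
    src_terms=["term base", "parallel corpus"],
    tgt_terms=["base terminológica", "corpus paralelo"],
)
for src, tgt, score in matches:
    print(src, "->", tgt, round(score, 2))
```

With so little data the statistic is fragile; the threshold only shows where a filter would sit in the pipeline.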
ALIGN-FILTER
1st step: word-alignment on the parallel texts.
2nd step: rank the aligned units to finally select the most likely pairs of candidates (statistics).
TExSIS (Macken et al., 2013)
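For comparison, the ALIGN-FILTER idea can be sketched as follows, assuming the word alignment links have already been produced by an external aligner. The sketch counts aligned word pairs, filters out stop words and keeps the frequent pairs. It works on single words only and is not the chunk-based procedure used by TExSIS; `rank_aligned_units`, the stop-word list and the sample data are hypothetical.

```python
from collections import Counter

# Tiny illustrative stop-word list; a real filter would be far larger.
STOP_WORDS = {"the", "a", "of", "el", "la", "de", "una"}


def rank_aligned_units(aligned_sentences, min_count=2):
    """Count aligned word pairs and return the frequent, non-stop-word ones.

    Each item is (source tokens, target tokens, links), where links is a list
    of (source index, target index) pairs from a word aligner.
    """
    counts = Counter()
    for src_tokens, tgt_tokens, links in aligned_sentences:
        for i, j in links:
            src, tgt = src_tokens[i].lower(), tgt_tokens[j].lower()
            if src in STOP_WORDS or tgt in STOP_WORDS:
                continue  # filter step: function words are not term candidates
            counts[(src, tgt)] += 1
    # Rank by frequency and drop pairs seen fewer than min_count times.
    return [(pair, n) for pair, n in counts.most_common() if n >= min_count]


# Hypothetical pre-aligned sentences, for illustration only.
aligned = [
    (["the", "glossary"], ["el", "glosario"], [(0, 0), (1, 1)]),
    (["a", "glossary", "entry"], ["una", "entrada", "de", "glosario"], [(1, 3), (2, 1)]),
]
ranked = rank_aligned_units(aligned)
print(ranked)
```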
Bilingual Automatic Term Extractors
Academic / In-house
90s
- English-French, TERMIGHT (Dagan & Church, 1994)
- English-French (Kupiec, 1993)
- English-Dutch (van der Eijk, 1993)
- English-French (Gaussier, 1995)
- English and Swedish (Ahrenberg et al., 1998)
2000-2009
- French-German (Blank, 2000)
- Japanese-English, MNH (Nakagawa & Mori, 2003)
- Spanish-Basque, Elexbi (Hernaiz et al., 2006), from a TMX
- Spanish-German, Autoterm (Haller, 2008)
- English-Spanish, Mutual Bilingual Term Extractor (Ha et al., 2008)
- French-English, French-Italian and French-Dutch (Lefever et al., 2009)
2010-2016
- French-Japanese (Morin et al., 2010, from ACABIT, Daille, 2003): not bilingual, but multilingual
- Slovene and English, LUIZ (Vintar, 2010)
- English and Swedish, ITools suite (Foo & Merkel, 2010)
- English and German (Gojun et al., 2012)
- English, French, German, Spanish: TTC TermSuite (Daille, 2012)
- English-Spanish, TBXTools (Oliver & Vázquez, 2015) (under development)
- Chinese, Czech, Dutch, English, French, German, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish: Sketch Engine (Baisa et al., 2015; Kovář et al., 2016)
Bilingual Automatic Term Extractors
Other BATE (free / commercial)
Timeline: 90s to 2010-2016
MONOLINGUAL ATE
- TermExtractor (Shimohata et al 2001)
- MemoQ's built-in term extractor
- Déjà Vu - Lexicon
- TermoStat Web: http://termostat.ling.umontreal.ca/
- Yate (IULA)
- Okapi
- TerMine: http://www.nactem.ac.uk/software/termine/
- TerminologyExtractor: https://goo.gl/yA2Cuf
- PRoMT
- FiveFilters (web-based): http://fivefilters.org/term-extraction/
- Concordance programs: WordSmith Tools, AntConc (free), …
BILINGUAL
- Xerox Terminology Suite (2001)
- SDL Multiterm Extract
- SynchroTerm
- CrossMining (Across)
- MultiTrans Term Extractor
- Similis™ (by Lingua et Machina™)
- Anchovy (by Swordfish)
- Araya Term Extractor
- Analysis software: Sketch Engine (terminology extraction from TMX)
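MAIN FEATURES
Tools compared: Multiterm Extract, SynchroTerm, Similis, SkE (Sketch Engine), Araya
Features compared: Import TMX; Extraction config.; Extraction scores; Validation facility; Term base indexation; Export to TBX (xls, txt…)
[Comparison table: the per-tool cell values are not preserved in this transcript, apart from the notes "Trados TMX" and "Others".]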
QUALITY IN USE MODEL
Characteristics (ISO/IEC 25010:2011)
- Effectiveness: accuracy and completeness with which the user achieves objectives
- Efficiency: resources expended in relation to the accuracy and completeness
- Satisfaction: degree to which user needs are satisfied when software is used in a specified context of use
- Freedom from risk: no risk for the security of users, software, context or the environment
- Context coverage: degree to which the product understands the complete context of its usage. Flexibility
6 TASKS TO EVALUATE when performing bilingual extraction
- CONFIGURATION: Setting up the extraction project
- TMX IMPORT: Importing the source file
- EXTRACTION: Performing the extraction to get a bilingual list
- VALIDATION: Selecting the real terms
- RECORD CREATION: Creating and managing term entries
- EXPORTATION: Exporting the final result for later use in CAT systems
PARAMETERS
Characteristics and sub-characteristics to be measured, with their metrics
EFFECTIVENESS = (EFE1+EFE2+EFE3)/3. Value between 0 (minimum) and 5 (maximum)
- EFE1. Degree of accuracy, precision of tasks & results: (P1+P7+P13+P19+P25+P31)/6
- EFE2. Degree of completeness (tasks are accomplished and results are not missing): (P2+P8+P14+P20+P26+P32)/6
- EFE3. Frequency of errors: (P3+P9+P15+P21+P27+P33)/6
EFFICIENCY = (EFI2+EFI3+EFI4)/3. Value between 0 (minimum) and 5 (maximum)
- EFI1. Time spent in the accomplishment of the task: (TM1+TM2+TM3+TM4+TM5+TM6)
- EFI2. Need to use additional sources (material, software, etc.) for the task: (P4+P10+P16+P22+P28+P34)/6
- EFI3. Productivity, effort exerted by the user to carry out the task: (P5+P11+P17+P23+P29+P35)/6
- EFI4. Need to consult the software Help to perform the task: (P6+P12+P18+P24+P30+P36)/6
SATISFACTION = (P37+P38+P39)/3. Value between 0 (minimum) and 5 (maximum)
- SAT1. Usefulness
- SAT2. Trust
- SAT3. Pleasure
CONTEXT COVERAGE = (P40+P41+P42)/3. Value between 0 (minimum) and 5 (maximum)
- COB1. Context of use
- COB2. Flexibility
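The metrics above can be computed mechanically from the questionnaire answers. The sketch below assumes the answers P1 to P42 are stored in a dictionary keyed by question number; `quality_in_use` and the sample answers are hypothetical. Treating the total QIU as the plain sum of the four characteristics is also an assumption, although it is consistent with the totals shown in the final results of this presentation. EFI1 (time) is left out because it does not enter the efficiency average.

```python
def mean(values):
    return sum(values) / len(values)


def quality_in_use(answers):
    """Compute the QIU characteristics from answers {question number: score 0-5}."""

    def sub_characteristic(first_question):
        # One question per task: six tasks, six questions per task (P1-P36).
        return mean([answers[first_question + 6 * task] for task in range(6)])

    efe = [sub_characteristic(q) for q in (1, 2, 3)]  # EFE1, EFE2, EFE3
    efi = [sub_characteristic(q) for q in (4, 5, 6)]  # EFI2, EFI3, EFI4
    scores = {
        "effectiveness": mean(efe),
        "efficiency": mean(efi),
        "satisfaction": mean([answers[q] for q in (37, 38, 39)]),
        "context_coverage": mean([answers[q] for q in (40, 41, 42)]),
    }
    # Assumption: the total is the plain sum of the four characteristics.
    scores["total"] = sum(scores.values())
    return scores


# Hypothetical answers: 3 for every question, 4 for the satisfaction items.
sample_answers = {question: 3 for question in range(1, 43)}
for question in (37, 38, 39):
    sample_answers[question] = 4

qiu_scores = quality_in_use(sample_answers)
print(qiu_scores)
```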
RESULTS FOR EXTRACTION & VALIDATION
[Bar chart comparing Sketch, MTE, Synchr and Similis on EXTRACTION and VALIDATION; vertical axis from 0 to 30. Data labels as they appear in the chart: 16, 13, 14, 25, 20, 26, 21, 24.]
FINAL RESULTS FOR QUALITY IN USE
           EFFECTIVENESS  EFFICIENCY  SATISFACTION  CONTEXT COVERAGE  TOTAL QIU
Sketch     3.33           3.00        4.00          3.50              13.83
MTE        4.06           4.44        3.00          1.50              13.00
Synchr     4.11           4.22        4.33          3.00              15.67
Similis    3.72           3.11        3.00          3.00              12.83
CONCLUSIONS
• Managing terminology still takes a lot of time and effort, even in
this increasingly computerized profession.
• Research on automatic terminology extraction has been
around for more than 20 years and significant enhancements
concerning bilingual extraction and bilingual corpora
exploitation have been introduced.
• I briefly described the BATE under evaluation and illustrated
some results obtained for accuracy and with the QIU model.
• Results make it clear that much more work has to be done for
BATE to be considered of real help to translators and
terminologists, mainly due to poor accuracy results.
Some references
• Baisa, Vit, Barbora Ulipová, and Michal Cukr. 2015. “Bilingual Terminology Extraction in Sketch Engine.” In 9th
Workshop on Recent Advances in Slavonic Natural Language Processing (RASLAN 2015), 61–67.
• Childress, Mark D. 2007. “Terminology Work Saves More Time than It Costs.” Multilingual, no. April/May: 43–46.
• Foo, Jody. 2012. Computational Terminology: Exploring Bilingual and Monolingual Term Extraction.
• Foo, Jody, and Magnus Merkel. 2010. “Computer Aided Term Bank Creation and Standardization: Building Standardized
Term Banks through Automated Term Extraction and Advanced Editing Tools.” In Terminology in Everyday Life,
edited by Marcel Thelen and Frieda Steurs, 163–80. John Benjamins Publishing Company.
doi:10.1075/tlrp.13.12foo.
• Kovář, Vojtěch, Vít Baisa, and Miloš Jakubíček. 2016. “Sketch Engine for Bilingual Lexicography.” International
Journal of Lexicography 29 (3): 339–52. doi:10.1093/ijl/ecw029.
• Macken, Lieve, Els Lefever, and Veronique Hoste. 2013. “TExSIS: Bilingual Terminology Extraction from Parallel
Corpora Using Chunk-Based Alignment.” Terminology 19 (2013): 1–30. doi:10.1075/term.19.1.01mac.
• Oliver, Antoni, and M. Vazquez. 2015. “TBXTools: A Free, Fast and Flexible Tool for Automatic Terminology
Extraction.” In Proceedings of Recent Advances in Natural Language Processing, 473–79.
• Popiolek, Monika. 2015. “Terminology Management within a Translation Quality Assurance Process.” In Handbook
of Terminology (Volume 1), edited by Hendrik J Kockaert and Frieda Steurs, 341–59. John Benjamins Publishing
Company. doi:10.1075/hot.1.ter6.
• Sauron, Véronique. 2002. “Tearing out the Terms: Evaluating Terms Extractors.” In Translating and the Computer
24: Proceedings from the Aslib Conference, 21-22 November 2002.
• Vintar, Špela. 2010. “Bilingual Term Recognition Revisited: The Bag-of-Equivalents Term Alignment Approach
and Its Evaluation.” Terminology 16 (2010): 141–58. doi:10.1075/term.16.2.01vin.
University of Alicante
IULMA
Campus de San Vicente
Apdo. 99
03080 Alicante
Phone & Fax
Direct Line: +34 965903438
Fax: +34 965903800
chelo.vargas@ua.es
Social Media
@chelovargas
Many thanks for your attention
Chelo Vargas-Sierra