Presenters: Ben Gottesman and Michael Klemme (Acrolinx)
This presentation is part of the TaaS project, funded by the European Union Seventh Framework Programme (FP7/2007-2013), grant agreement no. 296312
2. Definitions
term extraction: automatically identifying potential terms in a document (corpus)
multilingual term extraction: automatically identifying potential terms and their translations in a document and its translation (parallel corpus / translation memory)
The wizard begins creating the bootable image.
Der Assistent beginnt mit der Erstellung des bootfähigen Image.
(… or, if the source-language terminology already exists, just identify translations)
3. Synonyms
Identify same-language synonyms via translations in common
German: Die Spannungsversorgung für die Elektronik wird vom Speisegerät G526 sichergestellt.
English: The voltage supply for the electronics is maintained by the power supply unit G526.
German: Spannungsversorgung für interne Speisung (X3e)
English: Power supply for internal supply (X3e)
German: Unterspannung in der Stromversorgung
English: Undervoltage in the power supply
German terms: Spannungsversorgung, Stromversorgung
English terms: voltage supply, power supply
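The idea of linking synonyms through shared translations can be sketched in a few lines of Python. This is a minimal illustration, not the Acrolinx implementation: the term pairs are read off the aligned segments above, and the function name `synonym_candidates` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# (German term, English translation) pairs read off the example segments.
term_pairs = [
    ("Spannungsversorgung", "voltage supply"),
    ("Spannungsversorgung", "power supply"),
    ("Stromversorgung", "power supply"),
]

def synonym_candidates(pairs):
    """Map each pair of source terms to the translations they share."""
    sources_by_translation = defaultdict(set)
    for source, target in pairs:
        sources_by_translation[target].add(source)

    shared = defaultdict(set)
    for target, sources in sources_by_translation.items():
        # Any two source terms with the same translation are linked.
        for a, b in combinations(sorted(sources), 2):
            shared[(a, b)].add(target)
    return dict(shared)

# German synonyms are found via English translations ...
german_synonyms = synonym_candidates(term_pairs)
# ... and English synonyms via German ones, by swapping the pair order.
english_synonyms = synonym_candidates([(en, de) for de, en in term_pairs])

print(german_synonyms)
print(english_synonyms)
```

The same function works in both directions, which matches the slide: "power supply" links the two German terms, and "Spannungsversorgung" links the two English ones.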
4. Outline
• What is multilingual term extraction?
• What is the workflow from the customer's perspective?
– customer use case examples
– show extraction results, demonstrate human validation
• How does the extraction work?
– how we identify candidates
  • source-language candidates
  • translation candidates
– how we filter translation candidates
– how we identify source-language synonyms
• What is Acrolinx and how does MTE fit in?
6. Workflow: Customer perspective
1. Customer provides translated documents
2. Acrolinx provides extracted multilingual term candidates to customer
3. Customer validates candidates
4. Validated results become (or are added to) customer’s term bank
7. Customer use cases, past examples
Use case 1
– de-<en,fr,es,it,pt> (mostly de-en)
– ~142,000 bilingual segments; ~2,685,000 tokens (total)
Use case 2
– de-<en,fr> (all data trilingual)
– ~132,000 bilingual segments; ~1,259,000 tokens
– data document-aligned, not segment-aligned, so extra step required
Use case 3
– en-de
– ~942,000 bilingual segments; ~25,000,000 tokens
– extract translations of a given list of keywords
– determine which keywords don’t occur in data
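The last step of use case 3, finding the keywords that never occur in the data, could look roughly like the Python sketch below. It assumes plain token-sequence matching on lowercased text and does not handle inflection or compounding; the function name `missing_keywords` and the sample segments are illustrative only.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def missing_keywords(keywords, segments):
    """Return the keywords that occur in none of the segments."""
    tokenized_segments = [tokenize(segment) for segment in segments]

    def occurs(keyword):
        wanted = tokenize(keyword)
        n = len(wanted)
        # Slide a window of the keyword's length over every segment.
        return any(
            tokens[i:i + n] == wanted
            for tokens in tokenized_segments
            for i in range(len(tokens) - n + 1)
        )

    return [keyword for keyword in keywords if not occurs(keyword)]

segments = [
    "The wizard begins creating the bootable image.",
    "Undervoltage in the power supply",
]
keywords = ["bootable image", "power supply", "voltage supply"]
not_found = missing_keywords(keywords, segments)
print(not_found)
```

Matching whole tokens rather than substrings keeps "voltage supply" from being found inside "Undervoltage".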
8. Results
• human validation in Excel
“Baugruppe” has been translated inconsistently into English in the past.
Mark respective translations as preferred/deprecated to guide translators in the future.
9. Results
“Stromversorgung” and “Einspeisung” have translations in common.
→ automatically identified as possible synonyms, so same Cluster ID
To validate synonym link, edit Subcluster IDs to be the same.
Mark respective variants as preferred/deprecated to guide authors.
11. How does the extraction work?
• Extract source-language term candidates from source-language text (unless source-language terminology exists)
The wizard begins creating the bootable image.
– linguistics-based
• especially part-of-speech patterns
– same functionality built into the core Acrolinx product
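To make the idea of part-of-speech patterns concrete, here is a minimal Python sketch. It assumes a single pattern (optional adjectives followed by one or more nouns) and uses hand-assigned tags for the example sentence; the actual Acrolinx patterns and tag set are not given in the slides, so both are assumptions.

```python
# Hand-tagged tokens for the example sentence. A real system would get
# these tags from a part-of-speech tagger; the tag names are illustrative.
tagged_sentence = [
    ("The", "DET"), ("wizard", "NOUN"), ("begins", "VERB"),
    ("creating", "VERB"), ("the", "DET"), ("bootable", "ADJ"),
    ("image", "NOUN"), (".", "PUNCT"),
]

def extract_candidates(tagged_tokens):
    """Collect maximal ADJ* NOUN+ sequences as term candidates."""
    candidates = []
    current = []      # words of the sequence being built
    has_noun = False  # True once the sequence contains a noun
    # The END sentinel flushes a sequence that runs to the last token.
    for word, tag in tagged_tokens + [("", "END")]:
        if tag == "NOUN":
            current.append(word)
            has_noun = True
        elif tag == "ADJ" and not has_noun:
            current.append(word)
        else:
            if has_noun:
                candidates.append(" ".join(current))
            current, has_noun = [], False
            if tag == "ADJ":
                # An adjective after a noun starts a new sequence.
                current.append(word)
    return candidates

term_candidates = extract_candidates(tagged_sentence)
print(term_candidates)
```

Sequences of adjectives that are never followed by a noun are discarded, so only noun-headed phrases are proposed.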
12. How does the extraction work?
• Extract translation candidates of each source-language term candidate from target-language text
The wizard begins creating the bootable image.
Der Assistent beginnt mit der Erstellung des bootfähigen Image.
– use statistical phrase-alignment technology
– the same technology used in statistical machine translation
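The Python sketch below shows only the final step of such an approach: reading a translation candidate off a word alignment. The alignment here is written by hand for the example sentence pair; in a statistical system it would be estimated from the whole parallel corpus. The function name `aligned_phrase` and the consistency check are assumptions modelled on common phrase-extraction practice, not a description of the Acrolinx system.

```python
source = "The wizard begins creating the bootable image .".split()
target = ("Der Assistent beginnt mit der Erstellung "
          "des bootfähigen Image .").split()

# Hypothetical word alignment as (source index, target index) pairs.
alignment = [
    (0, 0), (1, 1), (2, 2), (3, 3), (3, 4), (3, 5),
    (4, 6), (5, 7), (6, 8), (7, 9),
]

def aligned_phrase(source_span, alignment, target_tokens):
    """Return the target phrase aligned to an inclusive source span.

    Returns None if nothing is aligned, or if the target span also
    contains words aligned to source words outside the span.
    """
    start, end = source_span
    target_indices = [t for s, t in alignment if start <= s <= end]
    if not target_indices:
        return None
    low, high = min(target_indices), max(target_indices)
    for s, t in alignment:
        if low <= t <= high and not (start <= s <= end):
            return None  # inconsistent with the alignment
    return " ".join(target_tokens[low:high + 1])

# "bootable image" covers source tokens 5 to 6.
translation = aligned_phrase((5, 6), alignment, target)
print(translation)
```

Aggregating such aligned phrases over all segments yields the list of translation candidates for each source-language term candidate.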
13. How does the extraction work?
• Filter translation candidates
translation candidates for “Eingangsspannung” (pink = filtered out)
… based on:
– confidence score calculated from translation probabilities
  • can adjust threshold to favour precision or recall
– surface characteristics (closed-class words, punctuation)
– term-candidacy of translation (if possible for language)
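The first two filter criteria can be sketched in Python as follows. Everything specific here is an assumption for illustration: the confidence formula (geometric mean of the two translation probabilities), the closed-class word list, and the candidate list with its probabilities are all made up, and the third criterion (term-candidacy of the translation) is left out.

```python
import string

# Illustrative English closed-class words; a real list would be longer.
CLOSED_CLASS = {"the", "a", "an", "of", "for", "in", "and", "to"}
# Hyphens are allowed because they occur inside legitimate terms.
PUNCTUATION = set(string.punctuation) - {"-"}

def confidence(p_target_given_source, p_source_given_target):
    """Assumed score: geometric mean of both translation probabilities."""
    return (p_target_given_source * p_source_given_target) ** 0.5

def passes_surface_checks(candidate):
    """Reject candidates with closed-class edges or punctuation."""
    tokens = candidate.lower().split()
    if not tokens:
        return False
    if tokens[0] in CLOSED_CLASS or tokens[-1] in CLOSED_CLASS:
        return False
    return not any(ch in PUNCTUATION for ch in candidate)

def filter_candidates(candidates, threshold):
    """Keep candidates that meet the threshold and the surface checks."""
    return [
        text
        for text, p_ts, p_st in candidates
        if confidence(p_ts, p_st) >= threshold and passes_surface_checks(text)
    ]

# Made-up candidates for "Eingangsspannung": (text, p(t|s), p(s|t)).
candidates = [
    ("input voltage", 0.62, 0.58),
    ("the input voltage", 0.20, 0.30),
    ("voltage", 0.10, 0.02),
    ("input voltage ,", 0.05, 0.40),
    ("supply voltage", 0.09, 0.25),
]
recall_oriented = filter_candidates(candidates, threshold=0.1)
precision_oriented = filter_candidates(candidates, threshold=0.2)
print(recall_oriented, precision_oriented)
```

Raising the threshold from 0.1 to 0.2 drops the weaker candidate "supply voltage", which is the precision/recall trade-off mentioned on the slide.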
14. How does the extraction work?
• Identify synonyms (‘cluster’ candidates)
cluster around “Stromwandler” (minimum link confidence threshold = 0.01)
– link confidence based on the degree to which translations are shared
– can adjust threshold to favour precision or recall of links
15. How does the extraction work?
• Identify synonyms (‘cluster’ candidates)
cluster around “Stromwandler” (minimum link confidence threshold = 0.03)
– link confidence based on the degree to which translations are shared
– can adjust threshold to favour precision or recall of links
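The effect of the link confidence threshold on cluster size can be illustrated with a small Python sketch. The slides do not give the link confidence formula, so the one used here (summed overlap of translation probabilities) is an assumption, and the terms other than "Stromwandler" and all probabilities are invented purely for illustration.

```python
from itertools import combinations

# Made-up translation probabilities p(translation | term).
translations = {
    "Stromwandler": {
        "current transformer": 0.90,
        "current converter": 0.08,
        "transformer": 0.02,
    },
    "Stromtransformator": {"current transformer": 0.95, "transformer": 0.05},
    "Umrichter": {"converter": 0.98, "current converter": 0.02},
}

def link_confidence(a, b):
    """Assumed measure: summed probability overlap on shared translations."""
    return sum(min(a[t], b[t]) for t in a.keys() & b.keys())

def clusters(translations, threshold):
    """Group terms into connected components of sufficiently strong links."""
    neighbours = {term: set() for term in translations}
    for x, y in combinations(translations, 2):
        if link_confidence(translations[x], translations[y]) >= threshold:
            neighbours[x].add(y)
            neighbours[y].add(x)

    seen, result = set(), []
    for term in translations:
        if term in seen:
            continue
        stack, component = [term], set()
        while stack:  # depth-first walk over the link graph
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(neighbours[node] - component)
        seen |= component
        result.append(sorted(component))
    return result

loose = clusters(translations, threshold=0.01)
strict = clusters(translations, threshold=0.03)
print(loose)
print(strict)
```

With the lower threshold the weak link through "current converter" pulls a third term into the cluster; the higher threshold cuts that link, giving a smaller and more precise cluster.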
17. What is Acrolinx?
Acrolinx is Content Optimization Software. It helps
authors make their text
– more correct,
– more consistent,
– and more readable.
18. What is Acrolinx?
Acrolinx is Content Optimization Software. It helps
authors make their text
– more correct,
– more consistent,
– and more readable.
Consistent use of terminology is an important factor in
the readability of text. Acrolinx provides:
– term extraction (monolingual, aka term harvesting)
– terminology management
– term checking
Multilingual Term Extraction as a Service is a natural
complement to the prior terminology functions.