II-SDV 2017: Localizing International Content for Search, Data Mining and Analytics Applications

Localizing International Content for Search,
Data Mining and Analytics Applications
Andrew Rufener
E: andrew.rufener@omniscien.com
Copyright © 2017 Omniscien Technologies. All Rights Reserved.

Agenda
• Who we are and what we do
• Setting the scene – an architecture for our discussion and the key challenges
• The localization workflow and why content localization and search are
intertwined
• Illustrating using a practical example
• Summary & Recommendations

COMPANY OVERVIEW
• Founded in 2007 as Asia Online, changed company name in 2016 to
Omniscien Technologies
• Award winning, leading global supplier of specialized and highly scalable
language processing, machine translation and machine learning solutions
offering in excess of 540 global language pairs
• HQ in Singapore, European operation in The Hague, The Netherlands, Asian
operation in Bangkok, Thailand
• Global customer base in North America, Europe and Asia Pacific

MARKETS AND SOLUTIONS
• eCommerce and Online Travel
Automated, high-volume localization of complex product catalogue information as
well as user generated content and reviews
• Online Research System and Digital Publishing
Automated, high-volume tagging, language processing, translation and transliteration
of legal, intellectual property, scientific, financial and business information content as
well as generation of relevant meta data
• Government & Intelligence
Automated, high-volume language identification, entity and entity relationship recognition,
sentiment analysis, linking and translation and transliteration of various information sources
• Technology & Enterprise
Complex language processing, tagging, enriching and localization
• Localization Industry
Support of complex and high-volume localization
• Media and Subtitling
Subtitle extraction and manufacturing from different sources, support of re-writing source
for subtitling, localization and post-editing, automated placement in frames and improvement
• eDiscovery
Automated, high-volume content tagging, localization and discovery for
litigation data gathering, analysis and support

Setting the scene and why content localization
and search are intertwined
• 31 March 2017

SIMPLIFIED REFERENCE ARCHITECTURE FOR OUR DISCUSSION
[Diagram: Unstructured Data, Structured Data, Search “Engine”]

HOW DO I KNOW WHAT TO “ASK” FOR?
[Diagram: Unstructured Data, Structured Data, Search “Engine”]
• How do I construct the right
query / search?
• How do I know what
keywords to use?
• Semantic or Concept Search
• Keyword lists
• Domain classifications
• Keyword based domain
classification (AI)
• …
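
As a rough illustration of the keyword-list approach to domain classification, the sketch below scores a text against per-domain keyword sets and returns the best match. The domain names, the keyword sets and the function `classify_domain` are hypothetical, and plain keyword counting stands in for the AI-based classification mentioned on the slide.

```python
import re

# Hypothetical keyword lists per domain (illustrative only).
DOMAIN_KEYWORDS = {
    "medical": {"myeloma", "tumor", "stethoscope"},
    "electronics": {"byte", "backsheet", "metallized"},
}

def classify_domain(text):
    """Return the domain whose keyword list overlaps most with the text, or None."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    scores = {domain: len(tokens & keywords)
              for domain, keywords in DOMAIN_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

A text with no keyword hits returns `None`, which lets the caller fall back to a generic search rather than guess a domain.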

HOW DO WE DEAL WITH MULTI-LINGUAL CONTENT?
[Diagram: Unstructured Data, Structured Data, Search “Engine”]
Option 1: Normalize to a single language
Option 2: Cross-lingual search
• What domain, how do we maintain quality, what is quality, what language do we normalize to?
• What kind of data, is normalization or transliteration needed, how do we deal with variants?
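
One way to picture cross-lingual search (Option 2) is query expansion: the query term is extended with known variants in other languages before matching. The sketch below is a minimal illustration under that assumption. The `TERM_TABLE` contents, `expand_query` and `search` are hypothetical names, and substring matching stands in for a real search engine.

```python
# Hypothetical bilingual term table; in practice such data could come from
# term extraction performed on existing translations.
TERM_TABLE = {
    "high byte": {"fr": ["octet de poids fort", "octet haut"]},
    "dental spray": {"fr": ["jet dentaire", "pulvérisation dentaire"]},
}

def expand_query(query, languages):
    """Return the query plus its known variants in the requested languages."""
    terms = [query]
    entry = TERM_TABLE.get(query.lower(), {})
    for lang in languages:
        for variant in entry.get(lang, []):
            if variant not in terms:
                terms.append(variant)
    return terms

def search(documents, terms):
    """Return ids of documents that contain any of the terms (case-insensitive)."""
    hits = []
    for doc_id, text in documents.items():
        lowered = text.lower()
        if any(term.lower() in lowered for term in terms):
            hits.append(doc_id)
    return hits
```

With Option 1 the documents would instead be translated into one language up front and the query left unexpanded; the trade-off is translation quality at indexing time versus term coverage at query time.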

THE GENERIC LOCALIZATION WORKFLOW
1. Extraction: Extract from source format to text or XML
2. Enrichment: Identifying entities, entity relationships, adding meta data, sentiment analysis, etc.
3. Translation: Translation and/or transliteration, normalizing terminology, maintaining meta-data
4. Enrichment: Post-translation corrections, additional enrichment and classification, etc.
5. Delivery: Delivery to user / application with or without enrichments
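
The five steps can be sketched as a chain of functions that pass a document and its meta data along, so that nothing produced in an earlier step is lost before delivery. This is a toy illustration, not the actual product workflow: the `Document` class, all function names and the glossary are hypothetical, and glossary substitution stands in for machine translation.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    text: str
    language: str
    metadata: dict = field(default_factory=dict)

def extract(raw, language):
    # Step 1: stand-in for converting a source format to plain text.
    return Document(text=raw.strip(), language=language)

def enrich_source(doc, glossary):
    # Step 2: record which known source terms occur in the text.
    doc.metadata["terms"] = [term for term in glossary if term in doc.text]
    return doc

def translate(doc, glossary, target):
    # Step 3: glossary substitution standing in for machine translation.
    # The source text is kept in the meta data rather than discarded.
    doc.metadata["source_text"] = doc.text
    for source_term, target_term in glossary.items():
        doc.text = doc.text.replace(source_term, target_term)
    doc.language = target
    return doc

def enrich_target(doc):
    # Step 4: toy post-translation classification.
    doc.metadata["domain"] = "dental" if "dental" in doc.text else "general"
    return doc

def deliver(doc, with_enrichments=True):
    # Step 5: hand over the text, with or without the enrichments.
    result = {"text": doc.text, "language": doc.language}
    if with_enrichments:
        result["metadata"] = doc.metadata
    return result

def run_workflow(raw, glossary, source="fr", target="en", with_enrichments=True):
    doc = extract(raw, source)
    doc = enrich_source(doc, glossary)
    doc = translate(doc, glossary, target)
    doc = enrich_target(doc)
    return deliver(doc, with_enrichments)
```

For example, `run_workflow("  jet dentaire  ", {"jet dentaire": "dental spray"})` delivers the translated text together with the source text, the matched terms and the assigned domain.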

- Translation naturally provides the translated source – using either Statistical or Neural
Machine Translation
- However, by-products and translation capabilities that are interesting in this context
are:
- Ability to normalize terminology
- Pre-processing and enriching content prior to translation (tagging, conversion..)
- Using the term analysis generated during the engine build
Extrémne problémy | extrémne problémy | extrémne problémy | extrémnej problémy
refraktérnym mnohopočetným myelómom | refraktérnym mnohopočetným myelómom | refraktérnym mnohopočetným myelómom | žiaruvzdorné myelómom je mladších
veľkosti nádoru | veľkosť nádoru | veľkosti nádoru | veľkosti nádoru
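
Terminology normalization can be pictured as mapping every known variant of a term to one preferred form before indexing or search. The following is a minimal sketch under that assumption. The `PREFERRED` table and `normalize_terms` are hypothetical, and the variant pairs are illustrative English examples rather than output of any engine.

```python
import re

# Hypothetical variant -> preferred term table (illustrative only).
PREFERRED = {
    "light conductor": "light guide",
    "unidirectional": "one-way",
}

def normalize_terms(text, preferred=PREFERRED):
    """Replace each known variant with its preferred term (case-insensitive)."""
    # Longest variants first so multi-word terms win over shorter overlaps.
    variants = sorted(preferred, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(v) for v in variants) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda match: preferred[match.group(0).lower()], text)
```

Applied consistently to both content and queries, a table like this means a search for the preferred term also finds documents that originally used a variant.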

JA-EN Sample Patent Translations; one is machine, one human
• The coagulation time was determined as described above.
• The setting time was determined as described above.
• The lighting device also typically includes a light source disposed at the end of the light conductor.
• The light device typically also includes a light source arranged at an end of the light guide.
• Such communication between components is but one example of a unidirectional communication system.
• Such communication between components is only one example of a one-way communication system.
• The use of a hearing aid by a healthcare provider is routine.
• The use of a stethoscope by health care providers is routine.
• This can further enhance the electrical and long-term performance of the backsheet.
• This may further increase the electrical properties and long-term performance of the backsheets.
• Initial Binding measurements were performed as described above for Plaque Initial Binding measurements.
• Initial bonding measurements were carried out as described above for Plaque Initial Bonding Measurements.
• The subtractive color mixture selected may depend on the metalized surface area and the resistance material used.
• The subtractive process selected can depend upon the metallized structured surface region and the resist material
utilized.

A REAL-LIFE EXAMPLE APPLICATION
• Example term (n-gram) extraction; extracted from actual human translations. The n-gram variants show the
(green) suggested n-gram based on frequency but also the other candidates that were found. “Distance” is
an available parameter.
• This process provides term variants, distance but also term relationships
• The results can be used for different purposes, amongst others:
• Term normalization
• Term suggestions for search
• In conjunction with other meta data, domain identification
• …
actual swirl speed | Vitesse de rotation réelle | la vitesse de turbulence réelle | vitesse réelle tourbillon | vitesse réelle de remous
high byte | octet haut | octet de poids fort | octet haut | byte élevé
non-freezing fluid | fluide antigel | fluide incongelable | sans gel fluide | fluide de non-congélation
dental spray | jet dentaire | pulvérisation dentaire | jet dentaire | jet dentaire
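
The frequency-based suggestion can be sketched as follows: count how often each candidate translation of a source term was observed, suggest the most frequent one, and report a distance from every other candidate to that suggestion. This is an assumption-laden illustration. The function `rank_candidates` is hypothetical, and the distance used here (one minus the `difflib.SequenceMatcher` similarity ratio) is a stand-in, since the slide does not define how “Distance” is computed.

```python
from collections import Counter
from difflib import SequenceMatcher

def rank_candidates(observed):
    """Return (suggested term, [(candidate, frequency, distance to suggestion)])."""
    counts = Counter(observed)
    suggested, _ = counts.most_common(1)[0]
    ranked = []
    for term, frequency in counts.most_common():
        # Assumed distance measure: 0.0 means identical to the suggestion.
        distance = 1.0 - SequenceMatcher(None, suggested, term).ratio()
        ranked.append((term, frequency, round(distance, 3)))
    return suggested, ranked

# Candidate translations observed for "dental spray" in the example above.
observed = ["jet dentaire", "pulvérisation dentaire", "jet dentaire", "jet dentaire"]
suggested, ranked = rank_candidates(observed)
```

The ranked list keeps the less frequent candidates instead of dropping them, which is what makes the result usable for term suggestions in search as well as for normalization.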

A REAL-LIFE EXAMPLE APPLICATION (2)
• WIPO Patentscope (Patent Research) uses this
data extensively
• WIPO Pearl is an example application
• Many other examples exist in
• eCommerce (Products, Brands, etc.)
• Business Information (Names,
Locations, etc.)
• Scientific Research Platforms (Medical
Terms, Chemical Compounds, Domain
Identification, etc.)
• ..
Source: http://www.wipo.int/wipopearl/search/linguisticSearch.html

A FEW KEY RECOMMENDATIONS
1. Take a holistic view of your workflow end to end
2. Work from the desired application result backwards
3. Review the data production and localization process, both the
engine build and the production workflow. Ensure valuable meta
data is not discarded. The localization team will have a very different view
on the “value” of certain data elements than the team handling search or
even the application
4. Keep in mind the enrichment capabilities of the localization workflow,
ranging from entities and sentiment to the ability to manipulate data on
the fly, call external data sources and subsequently “lock” the data
in for localization

SUMMARY
• The Machine Translation and associated Language Processing workflow
provides a wealth of information that can support search
• Understanding the interaction between the content localization and search
is critical to good search results and allows balancing precision and recall
• With Machine Learning entering translation with Neural Machine
Translation, a number of Machine learning applications are enabled
• Use the localization workflow to your advantage in a multi-lingual
environment

Q & A
