Advances in text mining, analytics and machine learning are enabling ever more powerful applications, yet most applications and platforms are designed to handle content in a single (normalized) language. As these applications and platforms are increasingly required to ingest international content, the challenge is to normalize that content to a single language without compromising quality. This raises two further questions: how do we define quality in this context, and what by-products, if any, can a localization effort produce that enhance the usefulness of the application?
Using patent searching as an example use case, this talk reviews the challenges of handling localization effectively and possible approaches to solving them. It shows what currently emerging technology offers, what to expect and what not to expect, and provides an introductory practical guide to handling localization in the context of data mining and analytics.
2. Agenda
• Who we are and what we do
• Setting the scene – an architecture for our discussion and the key challenges
• The localization workflow and why content localization and search are intertwined
• Illustration using a practical example
• Summary & Recommendations
4. MARKETS AND SOLUTIONS
• eCommerce and Online Travel
Automated, high-volume localization of complex product catalogue information as well as user-generated content and reviews
• Online Research Systems and Digital Publishing
Automated, high-volume tagging, language processing, translation and transliteration of legal, intellectual property, scientific, financial and business information content as well as generation of relevant metadata
• Government & Intelligence
Automated, high-volume language identification, entity and entity relationship recognition, sentiment analysis, linking, translation and transliteration of various information sources
• Technology & Enterprise
Complex language processing, tagging, enriching and localization
• Localization Industry
Support of complex and high-volume localization
• Media and Subtitling
Subtitle extraction and manufacturing from different sources, support of re-writing source for subtitling, localization and post-editing, automated placement in frames and improvement
• eDiscovery
Automated, high-volume content tagging, localization and discovery for litigation data gathering, analysis and support
7. HOW DO I KNOW WHAT TO “ASK” FOR?
[Diagram labels: Unstructured Data, Structured Data, Search “Engine”]
• How do I construct the right query / search?
• How do I know what keywords to use?
• Semantic or Concept Search
• Keyword lists
• Domain classifications
• Keyword-based domain classification (AI)
• …
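To make the "keyword-based domain classification" bullet concrete, here is a minimal sketch. It assumes hand-made keyword lists per domain and simply counts matches; the `DOMAIN_KEYWORDS` table and the `classify_domain` helper are hypothetical and only illustrate the idea. A production system would more likely use a trained classifier.

```python
import re
from collections import Counter

# Hypothetical keyword lists per domain (illustrative only).
DOMAIN_KEYWORDS = {
    "oncology": {"myeloma", "tumor", "refractory"},
    "batteries": {"anode", "cathode", "electrolyte"},
}

def classify_domain(text):
    """Return the domain whose keyword list matches the most tokens, or None."""
    tokens = re.findall(r"[a-z]+", text.lower())
    scores = Counter()
    for domain, keywords in DOMAIN_KEYWORDS.items():
        scores[domain] = sum(1 for token in tokens if token in keywords)
    best, hits = scores.most_common(1)[0]
    # No keyword hit at all means we cannot suggest a domain.
    return best if hits > 0 else None
```

Once a domain is suggested, its keyword list can be offered to the user as candidate query terms, which is one way to address the "what keywords do I use" question.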
8. HOW DO WE DEAL WITH MULTI-LINGUAL CONTENT?
[Diagram labels: Unstructured Data, Structured Data, Search “Engine”]
Option 1: Normalize to a single language
Option 2: Cross-lingual search
• What domain? How do we maintain quality? What is quality? What language do we normalize to?
• What kind of data? Is normalization or transliteration needed? How do we deal with variants?
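The two options can be contrasted with a toy sketch. Everything here is an assumption for illustration: the two-word `LEXICON` stands in for a real machine translation engine, `search` is a naive all-terms match, and query translation is only one possible way to realize cross-lingual search.

```python
# Toy Slovak -> English lexicon; a stand-in for a real MT engine (assumption).
LEXICON = {"veľkosť": "size", "nádoru": "tumor"}
REVERSE = {en: sk for sk, en in LEXICON.items()}

def translate(text, lexicon):
    """Word-by-word lookup; unknown words pass through unchanged."""
    return " ".join(lexicon.get(word, word) for word in text.lower().split())

def search(index, query):
    """Return ids of documents that contain every query term."""
    terms = set(query.lower().split())
    return [doc_id for doc_id, text in index.items()
            if terms <= set(text.lower().split())]

docs = {"sk-1": "veľkosť nádoru", "en-1": "tumor size"}

# Option 1: normalize every document to English once, then search in English.
normalized = {doc_id: translate(text, LEXICON) for doc_id, text in docs.items()}
option1_hits = search(normalized, "tumor size")

# Option 2: keep documents in their source language and translate the query.
def cross_lingual_search(index, query):
    hits = set(search(index, query)) | set(search(index, translate(query, REVERSE)))
    return sorted(hits)

option2_hits = cross_lingual_search(docs, "tumor size")
```

Both options find both documents in this toy case. The sketch breaks as soon as an inflected variant such as "veľkosti" appears, which is exactly the "how do we deal with variants" question on the slide.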
9. THE GENERIC LOCALIZATION WORKFLOW
1. Extraction: Extract from source format to text or XML
2. Enrichment: Identifying entities, entity relationships, adding metadata, sentiment analysis, etc.
3. Translation: Translation and/or transliteration, normalizing terminology, maintaining metadata
4. Enrichment: Post-translation corrections, additional enrichment and classification, etc.
5. Delivery: Delivery to user / application with or without enrichments
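The five stages can be sketched as a chain of small functions. This is a minimal illustration under stated assumptions, not the speaker's implementation: the `Document` type, all function names, the regex-based "entity" detection and the word-by-word lexicon lookup are hypothetical placeholders for real extraction, NLP and machine translation components.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Document:
    text: str
    metadata: dict = field(default_factory=dict)

def extract(raw):
    """Stage 1: strip markup from the source format to plain text."""
    return Document(text=re.sub(r"<[^>]+>", " ", raw).strip())

def enrich(doc):
    """Stage 2: add metadata; capitalized tokens stand in for entity recognition."""
    doc.metadata["entities"] = re.findall(r"\b[A-Z][a-z]+\b", doc.text)
    return doc

def translate(doc, lexicon):
    """Stage 3: toy word-by-word translation that keeps the metadata intact."""
    doc.metadata["source_text"] = doc.text
    doc.text = " ".join(lexicon.get(word, word) for word in doc.text.split())
    return doc

def post_enrich(doc):
    """Stage 4: post-translation clean-up plus one extra piece of metadata."""
    doc.text = re.sub(r"\s+", " ", doc.text).strip()
    doc.metadata["length"] = len(doc.text.split())
    return doc

def deliver(doc, with_enrichments=True):
    """Stage 5: hand over the text, with or without the enrichments."""
    result = {"text": doc.text}
    if with_enrichments:
        result["metadata"] = doc.metadata
    return result

def run_pipeline(raw, lexicon, with_enrichments=True):
    doc = post_enrich(translate(enrich(extract(raw)), lexicon))
    return deliver(doc, with_enrichments)
```

Keeping metadata on the document object through stage 3 mirrors the "maintaining metadata" point: enrichments made before translation remain available at delivery.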
11. THE GENERIC LOCALIZATION WORKFLOW
- Translation naturally provides the translated source, using either Statistical or Neural Machine Translation
- However, by-products and translation capabilities that are interesting in this context are:
  - Ability to normalize terminology
  - Pre-processing and enriching content prior to translation (tagging, conversion..)
  - Using the term analysis generated during the engine build
Slovak term variants (English gloss in parentheses):
Extrémne problémy | extrémne problémy | extrémne problémy | extrémnej problémy (extreme problems)
refraktérnym mnohopočetným myelómom | refraktérnym mnohopočetným myelómom | refraktérnym mnohopočetným myelómom | žiaruvzdorné myelómom je mladších (refractory multiple myeloma)
veľkosti nádoru | veľkosť nádoru | veľkosti nádoru | veľkosti nádoru (tumor size)
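The "ability to normalize terminology" by-product can be illustrated with a small sketch. It assumes a hypothetical term base (`TERM_BASE`) that maps surface variants to one canonical form; which variant counts as canonical is an arbitrary choice made here for illustration, and `normalize_terms` is not part of any real MT toolkit.

```python
import re

# Hypothetical term base: surface variant -> canonical form (illustrative only).
TERM_BASE = {
    "veľkosti nádoru": "veľkosť nádoru",
    "extrémnej problémy": "extrémne problémy",
}

def normalize_terms(text, term_base=TERM_BASE):
    """Replace every known variant with its canonical form."""
    # Longest variants first, so a short entry cannot pre-empt a longer one.
    for variant in sorted(term_base, key=len, reverse=True):
        pattern = re.compile(re.escape(variant), flags=re.IGNORECASE)
        text = pattern.sub(term_base[variant], text)
    return text
```

Normalizing this way gives the search index a single key per concept. It deliberately ignores grammatical agreement, so it suits indexing rather than text meant for human readers.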