Localizing International Content for Search,
Data Mining and Analytics Applications
Andrew Rufener
E: andrew.rufener@omniscien.com
Copyright © 2017 Omniscien Technologies. All Rights Reserved.
Agenda
• Who we are and what we do
• Setting the scene – an architecture for our discussion and the key challenges
• The localization workflow and why content localization and search are
intertwined
• Illustrating using a practical example
• Summary & Recommendations
COMPANY OVERVIEW
• Founded in 2007 as Asia Online, changed company name in 2016 to
Omniscien Technologies
• Award winning, leading global supplier of specialized and highly scalable
language processing, machine translation and machine learning solutions
offering in excess of 540 global language pairs
• HQ in Singapore, European operation in The Hague, The Netherlands, Asian
operation in Bangkok, Thailand
• Global customer base in North America, Europe and Asia Pacific
MARKETS AND SOLUTIONS
• eCommerce and Online Travel
Automated, high-volume localization of complex product catalogue information as
well as user generated content and reviews
• Online Research System and Digital Publishing
Automated, high-volume tagging, language processing, translation and transliteration
of legal, intellectual property, scientific, financial and business information content as
well as generation of relevant meta data
• Government & Intelligence
Automated, high-volume language identification, entity and entity relationship recognition,
sentiment analysis, linking and translation and transliteration of various information sources
• Technology & Enterprise
Complex language processing, tagging, enriching and localization
• Localization Industry
Support of complex and high-volume localization
• Media and Subtitling
Subtitle extraction and manufacturing from different sources, support of re-writing source
for subtitling, localization and post-editing, automated placement in frames and improvement
• eDiscovery
Automated, high-volume content tagging, localization and discovery for
litigation data gathering, analysis and support
Setting the scene and why content localization
and search are intertwined
31 March 2017
SIMPLIFIED REFERENCE ARCHITECTURE FOR OUR DISCUSSION
[Diagram: Unstructured Data, Structured Data, Search “Engine”]
HOW DO I KNOW WHAT TO “ASK” FOR?
• How do I construct the right query / search?
• How do I know what keywords to use?
• Semantic or Concept Search
• Keyword lists
• Domain classifications
• Keyword-based domain classification (AI)
• …
HOW DO WE DEAL WITH MULTI-LINGUAL CONTENT?
Option 1: Normalize to a single language
• What domain, how do we maintain quality, what is quality, what language do we normalize to?
Option 2: Cross-lingual search
• What kind of data, is normalization or transliteration needed, how do we deal with variants?
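The two options can be contrasted in a minimal Python sketch. Everything here is an illustrative assumption: the term table, the two toy documents and the function names are hypothetical, and the "translation" is a plain string substitution standing in for a real machine translation engine. The French variants are borrowed from the term examples later in this deck.

```python
# Hypothetical toy data, for illustration only.
TERM_TABLE = {  # English term -> known French variants
    "high byte": ["octet de poids fort", "octet haut"],
}

DOCS = {
    "doc-en": ("en", "the high byte is transmitted first"),
    "doc-fr": ("fr", "l'octet de poids fort est transmis en premier"),
}

def to_english(text, lang):
    """Stand-in for translation: swap known French variants for the English term."""
    if lang == "en":
        return text
    for en_term, variants in TERM_TABLE.items():
        for variant in variants:
            text = text.replace(variant, en_term)
    return text

def search_normalized(query):
    """Option 1: normalize every document to English, then search in English only."""
    return sorted(
        doc_id
        for doc_id, (lang, text) in DOCS.items()
        if query in to_english(text, lang)
    )

def search_cross_lingual(query):
    """Option 2: leave documents as they are and expand the query with its variants."""
    terms = [query] + TERM_TABLE.get(query, [])
    return sorted(
        doc_id
        for doc_id, (_, text) in DOCS.items()
        if any(term in text for term in terms)
    )

print(search_normalized("high byte"))
print(search_cross_lingual("high byte"))
```

In this toy case both calls find both documents. The difference lies in where the effort goes: Option 1 pays the translation cost once per document, Option 2 pays a variant lookup on every query.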
THE GENERIC LOCALIZATION WORKFLOW
1. Extraction: extract from source format to text or XML
2. Enrichment: identifying entities and entity relationships, adding meta data, sentiment analysis, etc.
3. Translation: translation and/or transliteration, normalizing terminology, maintaining meta-data
4. Enrichment: post-translation corrections, additional enrichment and classification, etc.
5. Delivery: delivery to user / application with or without enrichments
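The five stages of the workflow above can be sketched as a chain of functions that pass one document along and keep its meta data attached. This is only a sketch under stated assumptions: the `Document` class, the lookup tables and all function names are hypothetical, and each stage is a trivial stand-in for what would be a trained model or a full service.

```python
import re
from dataclasses import dataclass, field

# Hypothetical lookup tables; a real system would use trained models instead.
KNOWN_ENTITIES = ["WIPO"]
GLOSSARY = {"fluide antigel": "non-freezing fluid"}

@dataclass
class Document:
    text: str
    lang: str
    metadata: dict = field(default_factory=dict)

def extract(raw: str) -> Document:
    """Stage 1: strip markup to plain text (source language assumed to be known)."""
    return Document(text=re.sub(r"<[^>]+>", "", raw).strip(), lang="fr")

def enrich(doc: Document) -> Document:
    """Stage 2: record which known entities occur in the text."""
    doc.metadata["entities"] = [e for e in KNOWN_ENTITIES if e in doc.text]
    return doc

def translate(doc: Document) -> Document:
    """Stage 3: glossary-based stand-in for translation; meta data is carried along."""
    for source_term, target_term in GLOSSARY.items():
        doc.text = doc.text.replace(source_term, target_term)
    doc.metadata["source_lang"] = doc.lang
    doc.lang = "en"
    return doc

def post_enrich(doc: Document) -> Document:
    """Stage 4: naive keyword-based domain classification on the translated text."""
    doc.metadata["domain"] = "engineering" if "fluid" in doc.text else "general"
    return doc

def deliver(doc: Document) -> dict:
    """Stage 5: hand the text over together with its enrichments."""
    return {"text": doc.text, "lang": doc.lang, **doc.metadata}

def run_pipeline(raw: str) -> dict:
    doc = extract(raw)
    for stage in (enrich, translate, post_enrich):
        doc = stage(doc)
    return deliver(doc)

print(run_pipeline("<p>WIPO: fluide antigel</p>"))
```

The entity found before translation and the source language are still present in the delivered record, so the meta data produced in each stage stays available to the search side.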
- Translation naturally provides the translated content, using either Statistical or Neural Machine Translation
- However, the by-products and translation capabilities that are interesting in this context are:
  - Ability to normalize terminology
  - Pre-processing and enriching content prior to translation (tagging, conversion, ...)
  - Using the term analysis generated during the engine build
Extrémne problémy | extrémne problémy | extrémne problémy | extrémnej problémy [“extreme problems”]
refraktérnym mnohopočetným myelómom | refraktérnym mnohopočetným myelómom | refraktérnym mnohopočetným myelómom | žiaruvzdorné myelómom je mladších [“refractory multiple myeloma”]
veľkosti nádoru | veľkosť nádoru | veľkosti nádoru | veľkosti nádoru [“tumor size”]
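Term normalization, the first by-product listed above, can be sketched as a lookup that rewrites known variants to one preferred form before the text is indexed. The variant map and the choice of preferred forms are assumptions made for illustration, using variants from the Slovak examples above; the function name is hypothetical. The output is meant as an index key, not as grammatical display text.

```python
import re

# Hypothetical variant map: variant (lowercase) -> preferred form.
# Which form counts as "preferred" is an assumption for this sketch.
VARIANTS = {
    "veľkosti nádoru": "veľkosť nádoru",
    "extrémnej problémy": "extrémne problémy",
}

def normalize_terms(text, variants=VARIANTS):
    """Replace every known variant in `text` with its preferred form."""
    # Longest variants first so a short variant never pre-empts a longer one.
    ordered = sorted(variants, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(v) for v in ordered), re.IGNORECASE)
    return pattern.sub(lambda m: variants[m.group(0).lower()], text)

print(normalize_terms("zmena veľkosti nádoru"))
```

With all variants collapsed to one form, a single query term matches every document that used any of the variants.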
JA-EN Sample Patent Translations; one is machine, one human
• The coagulation time was determined as described above.
• The setting time was determined as described above.
• The lighting device also typically includes a light source disposed at the end of the light conductor.
• The light device typically also includes a light source arranged at an end of the light guide.
• Such communication between components is but one example of a unidirectional communication system.
• Such communication between components is only one example of a one-way communication system.
• The use of a hearing aid by a healthcare provider is routine.
• The use of a stethoscope by health care providers is routine.
• This can further enhance the electrical and long-term performance of the backsheet.
• This may further increase the electrical properties and long-term performance of the backsheets.
• Initial Binding measurements were performed as described above for Plaque Initial Binding measurements.
• Initial bonding measurements were carried out as described above for Plaque Initial Bonding Measurements.
• The subtractive color mixture selected may depend on the metalized surface area and the resistance material used.
• The subtractive process selected can depend upon the metallized structured surface region and the resist material utilized.
A REAL-LIFE EXAMPLE APPLICATION
• Example term (n-gram) extraction; extracted from actual human translations. The n-gram variants show the suggested n-gram (green), chosen by frequency, along with the other candidates that were found. “Distance” is an available parameter.
• This process provides term variants and distance, as well as term relationships
• The results can be used for different purposes, among others:
• Term normalization
• Term suggestions for search
• In conjunction with other meta data, domain identification
• …
actual swirl speed | Vitesse de rotation réelle | la vitesse de turbulence réelle | vitesse réelle tourbillon | vitesse réelle de remous
high byte | octet haut | octet de poids fort | octet haut | byte élevé
non-freezing fluid | fluide antigel | fluide incongelable | sans gel fluide | fluide de non-congélation
dental spray | jet dentaire | pulvérisation dentaire | jet dentaire | jet dentaire
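The frequency-based suggestion described above can be sketched with a simple counter. This is a sketch under assumptions, not the extraction process used in the deck: the function name is hypothetical, and treating the four candidates of a row as four observed translations is an assumption made only to have input data.

```python
from collections import Counter

def suggest_term(observed_translations):
    """Return the most frequent candidate plus the remaining candidates by rank."""
    ranked = Counter(observed_translations).most_common()
    suggested = ranked[0][0]
    alternatives = [term for term, _ in ranked[1:]]
    return suggested, alternatives

# Candidates for "dental spray", read off the example row above.
observed = ["jet dentaire", "pulvérisation dentaire", "jet dentaire", "jet dentaire"]
suggested, alternatives = suggest_term(observed)
print(suggested, alternatives)
```

The suggested term can feed term normalization, while the alternatives remain useful as term suggestions for search, since documents may contain any of them.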
A REAL-LIFE EXAMPLE APPLICATION (2)
• WIPO Patentscope (Patent Research) uses this
data extensively
• WIPO Pearl is an example application
• Many other examples exist in
• eCommerce (Products, Brands, etc.)
• Business Information (Names,
Locations, etc.)
• Scientific Research Platforms (Medical
Terms, Chemical Compounds, Domain
Identification, etc.)
• ..
Source: http://www.wipo.int/wipopearl/search/linguisticSearch.html
A FEW KEY RECOMMENDATIONS
1. Take a holistic view of your workflow end to end
2. Work from the desired application result backwards
3. Ensure you review the data production and localization process, both the
engine build and the production workflow. Ensure valuable meta
data is not discarded. The localization team will have a very different view
of the “value” of certain data elements than the team handling search or
even the application
4. Keep in mind the enrichment capabilities of the localization workflow,
ranging from entities and sentiment right through to the ability to manipulate
data on the fly, call external data sources and subsequently “lock” the data
in for localization
SUMMARY
• The Machine Translation and associated Language Processing workflow
provides a wealth of information that can support search
• Understanding the interaction between content localization and search
is critical to good search results and allows balancing precision and recall
• With Machine Learning entering translation through Neural Machine
Translation, a number of Machine Learning applications are enabled
• Use the localization workflow to your advantage in a multi-lingual
environment
Q & A

Call Now ☎ 8264348440 !! Call Girls in Green Park Escort Service Delhi N.C.R.
 
VIP Model Call Girls NIBM ( Pune ) Call ON 8005736733 Starting From 5K to 25K...
VIP Model Call Girls NIBM ( Pune ) Call ON 8005736733 Starting From 5K to 25K...VIP Model Call Girls NIBM ( Pune ) Call ON 8005736733 Starting From 5K to 25K...
VIP Model Call Girls NIBM ( Pune ) Call ON 8005736733 Starting From 5K to 25K...
 
Trump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts SweatshirtTrump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts Sweatshirt
 

II-SDV 2017: Localizing International Content for Search, Data Mining and Analytics Applications

  • 1. Localizing International Content for Search, Data Mining and Analytics Applications Andrew Rufener E: andrew.rufener@omniscien.com Copyright © 2017 Omniscien Technologies. All Rights Reserved.
  • 2. Agenda • Who we are and what we do • Setting the scene – an architecture for our discussion and the key challenges • The localization workflow and why content localization and search are intertwined • Illustrating using a practical example • Summary & Recommendations
  • 3. COMPANY OVERVIEW • Founded in 2007 as Asia Online, changed company name in 2016 to Omniscien Technologies • Award winning, leading global supplier of specialized and highly scalable language processing, machine translation and machine learning solutions offering in excess of 540 global language pairs • HQ in Singapore, European operation in The Hague, The Netherlands, Asian operation in Bangkok, Thailand • Global customer base in North America, Europe and Asia Pacific
  • 4. MARKETS AND SOLUTIONS • eCommerce and Online Travel Automated, high-volume localization of complex product catalogue information as well as user-generated content and reviews • Online Research System and Digital Publishing Automated, high-volume tagging, language processing, translation and transliteration of legal, intellectual property, scientific, financial and business information content as well as generation of relevant meta data • Government & Intelligence Automated, high-volume language identification, entity and entity relationship recognition, sentiment analysis, linking and translation and transliteration of various information sources • Technology & Enterprise Complex language processing, tagging, enriching and localization • Localization Industry Support of complex and high-volume localization • Media and Subtitling Subtitle extraction and manufacturing from different sources, support of re-writing source for subtitling, localization and post-editing, automated placement in frames and improvement • eDiscovery Automated, high-volume content tagging, localization and discovery for litigation data gathering, analysis and support
  • 5. Setting the scene and why content localization and search are intertwined • 31, MARCH 2017
  • 6. SIMPLIFIED REFERENCE ARCHITECTURE FOR OUR DISCUSSION Unstructured Data Structured Data Search “Engine”
  • 7. HOW DO I KNOW WHAT TO “ASK” FOR? Unstructured Data Structured Data Search “Engine” • How do I construct the right query / search? • How do I know what keywords to use? • Semantic or Concept Search • Keyword lists • Domain classifications • Keyword based domain classification (AI) • …
  • 8. HOW DO WE DEAL WITH MULTI-LINGUAL CONTENT? Unstructured Data Structured Data Search “Engine” Option 1: Normalize to a single language Option 2: Cross-lingual search What domain, how do we maintain quality, what is quality, what language do we normalize to? What kind of data, is normalization or transliteration needed, how do we deal with variants?
  • 9. THE GENERIC LOCALIZATION WORKFLOW Extraction Enrichment Translation Enrichment Delivery 1 2 3 4 5 Extract from source format to text or XML Identifying entities, entity relationships, adding meta data, sentiment analysis, etc. Translation and/or transliteration, normalizing terminology, maintaining meta-data Post-translation corrections, additional enrichment and classification, etc. Delivery to user / application with or without enrichments
  • 10. THE GENERIC LOCALIZATION WORKFLOW Extraction Enrichment Translation Enrichment Delivery 1 2 3 4 5 Extract from source format to text or XML Identifying entities, entity relationships, adding meta data, sentiment analysis, etc. Translation and/or transliteration, normalizing terminology, maintaining meta-data Post-translation corrections, additional enrichment and classification, etc. Delivery to user / application with or without enrichments
  • 11. THE GENERIC LOCALIZATION WORKFLOW Extraction Enrichment Translation Enrichment Delivery 1 2 3 4 5 Extract from source format to text or XML Identifying entities, entity relationships, adding meta data, sentiment analysis, etc. Translation and/or transliteration, normalizing terminology, maintaining meta-data Post-translation corrections, additional enrichment and classification, etc. Delivery to user / application with or without enrichments - Translation naturally provides the translated source, using either Statistical or Neural Machine Translation - However, by-products and translation capabilities that are interesting in this context are: - Ability to normalize terminology - Pre-processing and enriching content prior to translation (tagging, conversion..) - Using the term analysis generated during the engine build • Extrémne problémy | extrémne problémy | extrémne problémy | extrémnej problémy • refraktérnym mnohopočetným myelómom | refraktérnym mnohopočetným myelómom | refraktérnym mnohopočetným myelómom | žiaruvzdorné myelómom je mladších • veľkosti nádoru | veľkosť nádoru | veľkosti nádoru | veľkosti nádoru
  • 12. JA-EN Sample Patent Translations; one is machine, one human • The coagulation time was determined as described above. • The setting time was determined as described above. • The lighting device also typically includes a light source disposed at the end of the light conductor. • The light device typically also includes a light source arranged at an end of the light guide. • Such communication between components is but one example of a unidirectional communication system. • Such communication between components is only one example of a one-way communication system. • The use of a hearing aid by a healthcare provider is routine. • The use of a stethoscope by health care providers is routine. • This can further enhance the electrical and long-term performance of the backsheet. • This may further increase the electrical properties and long-term performance of the backsheets. • Initial Binding measurements were performed as described above for Plaque Initial Binding measurements. • Initial bonding measurements were carried out as described above for Plaque Initial Bonding Measurements. • The subtractive color mixture selected may depend on the metalized surface area and the resistance material used. • The subtractive process selected can depend upon the metallized structured surface region and the resist material utilized.
  • 13. THE GENERIC LOCALIZATION WORKFLOW Extraction Enrichment Translation Enrichment Delivery 1 2 3 4 5 Extract from source format to text or XML Identifying entities, entity relationships, adding meta data, sentiment analysis, etc. Translation and/or transliteration, normalizing terminology, maintaining meta-data Post-translation corrections, additional enrichment and classification, etc. Delivery to user / application with or without enrichments
  • 14. A REAL-LIFE EXAMPLE APPLICATION • Example term (n-gram) extraction; extracted from actual human translations. The n-gram variants show the (green) suggested n-gram based on frequency but also the other candidates that were found. “Distance” is an available parameter. • This process provides term variants and distance, but also term relationships • The results can be used for different purposes, amongst others • Term normalization • Term suggestions for search • In conjunction with other meta data, domain identification • … • actual swirl speed: Vitesse de rotation réelle | la vitesse de turbulence réelle | vitesse réelle tourbillon | vitesse réelle de remous • high byte: octet haut | octet de poids fort | octet haut | byte élevé • non-freezing fluid: fluide antigel | fluide incongelable | sans gel fluide | fluide de non-congélation • dental spray: jet dentaire | pulvérisation dentaire | jet dentaire | jet dentaire
  • 15. A REAL-LIFE EXAMPLE APPLICATION (2) • WIPO Patentscope (Patent Research) uses this data extensively • WIPO Pearl is an example application • Many other examples exist in • eCommerce (Products, Brands, etc.) • Business Information (Names, Locations, etc.) • Scientific Research Platforms (Medical Terms, Chemical Compounds, Domain Identification, etc.) • .. Source: http://www.wipo.int/wipopearl/search/linguisticSearch.html
  • 16. A FEW KEY RECOMMENDATIONS 1. Take a holistic view of your workflow end to end 2. Work from the desired application result backwards 3. Ensure you review the data production and localization process, both the engine build and the production workflow. Ensure valuable meta data is not discarded. The localization team will have a very different view on the “value” of certain data elements than the team handling search or even the application 4. Keep in mind the enrichment capabilities of the localization workflow, ranging from entities and sentiment right up to the ability to manipulate data on the fly, call external data sources and subsequently “lock” the data in for localization
  • 17. SUMMARY • The Machine Translation and associated Language Processing workflow provides a wealth of information that can support search • Understanding the interaction between content localization and search is critical to good search results and allows balancing precision and recall • With Machine Learning entering translation through Neural Machine Translation, a number of Machine Learning applications are enabled • Use the localization workflow to your advantage in a multi-lingual environment
  • 18. Q & A
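
The term-variant extraction described on slide 14 (a suggested n-gram chosen by frequency, with the other candidates kept for term normalization and search suggestions) can be illustrated with a minimal Python sketch. This is not Omniscien's implementation. The function names are hypothetical, the frequency counts are invented for illustration (the slides list the variants but give no counts), and the `distance` helper is only a stand-in based on string similarity, since the slides do not define how "Distance" is computed.

```python
from difflib import SequenceMatcher

# Variants are taken from the slide 14 examples; the counts are HYPOTHETICAL.
TERM_COUNTS = {
    "high byte": {"octet haut": 12, "octet de poids fort": 7, "byte élevé": 1},
    "dental spray": {"jet dentaire": 9, "pulvérisation dentaire": 2},
}


def preferred_variant(counts):
    """Pick the most frequent variant; ties are broken alphabetically."""
    return max(sorted(counts), key=counts.get)


def distance(a, b):
    """Stand-in distance: 0.0 for identical strings, up to 1.0 for unrelated ones."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def build_normalization_map(term_counts):
    """Map every observed variant to the preferred variant of its source term."""
    mapping = {}
    for counts in term_counts.values():
        best = preferred_variant(counts)
        for variant in counts:
            mapping[variant] = best
    return mapping


def normalize(text, mapping):
    """Naive term normalization by substring replacement, longest variant first."""
    for variant in sorted(mapping, key=len, reverse=True):
        text = text.replace(variant, mapping[variant])
    return text


def expand_query(source_term, term_counts):
    """Suggest target-language search terms, most frequent first."""
    counts = term_counts.get(source_term, {})
    return sorted(counts, key=lambda v: (-counts[v], v))


if __name__ == "__main__":
    mapping = build_normalization_map(TERM_COUNTS)
    print(preferred_variant(TERM_COUNTS["high byte"]))
    print(normalize("octet de poids fort et byte élevé", mapping))
    print(expand_query("dental spray", TERM_COUNTS))
```

The same variant table serves both options from slide 8: normalizing content to one preferred term before indexing, or leaving content untouched and expanding the query with all known variants at search time. Plain substring replacement ignores inflection and agreement, so a production workflow would need language-aware matching.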