Past, Present, and Future: Machine Translation & Natural Language Processing for Patent Information


This was a presentation given at the European Patent Office's annual Patent Information Conference in Madrid, Spain on November 10th, 2016.

In it, we give an overview of how machine translation works, latest advances in neural MT, and how this can be applied to patents and intellectual property content, not only for translations but also information extraction and other NLP applications.


  1. ‘Past, Present, and Future’: Machine Translation & Natural Language Processing for Patent Information
     Dr. John Tinsley, CEO, Iconic Translation Machines Ltd.
     EPOPIC, Madrid, 10th November 2016
  2. Why listen to me?
     • BSc in Computational Linguistics
     • PhD in Machine Translation
     • Language Technology consultant
     • Founder of Iconic Translation Machines
     Machine Translation is what I do!
     The world’s first and only patent specific machine translation platform
  3. Machine Translation: The Basics
     Machine Translation = automatic translation
     • The use of computers to translate from one language into another
     • The use of computers to automate some, or all, of the translation process
     Statistical Machine Translation (SMT)
     • An approach to Machine Translation, where translations for an input are estimated based on previously seen translation examples and associated (inferred) probabilities
     • e.g. IPTranslator, Google Translate
     Other approaches
     • Rule-based (or transfer-based): based on linguistic rules, e.g. Systran; Altavista’s Babelfish
     • Example-based: based on translation examples and inferred linguistic patterns
     SMT is now by far the predominant approach*
  4. Bilingual Corpora
     A corpus (pl. corpora) is a collection of texts, in electronic format, in a single language
     • document(s)
     • book(s)
     A bilingual corpus is a collection of corresponding texts, in multiple languages
     • a document & its translation
     • a book in multiple languages
     • European Parliament proceedings
     Note
     • source language = original language or language we’re translating from
     • target language = language we’re translating into
  5. Aligned Bilingual Corpora
     A document-aligned bilingual corpus corresponds on a document level
     For translation, we require sentence-aligned bilingual corpora
     • The sentence on line 1 in the source language text corresponds to (i.e. is a translation of) the sentence on line 1 in the target language text, etc.
     • Often referred to as parallel aligned corpora
     Sentence-aligned bilingual parallel corpora are essential for statistical machine translation
  6. Learning from Previous Translations
     Suppose we already know (from a sentence-aligned bilingual corpus) that:
     • “dog” is translated as “perro”
     • “I have a cat” is translated as “Tengo un gato”
     We can theoretically translate:
     • “I have a dog” -> “Tengo un perro”
     • Even though we have never seen “I have a dog” before
     Statistical machine translation induces information about unseen input, based on previously known translations:
     • Primarily co-occurrence statistics
     • Takes contextual information into account
  7. Statistical Machine Translation
     • Example of a small sentence-aligned bilingual corpus for English-French
  8. Statistical Machine Translation
     • We take some new sentence to translate
  9. Statistical Machine Translation
     • From the corpus we can infer possible target (French) translations for various source (English) words
     • We can then select the most probable translations based on simple frequencies (co-occurrence statistics)
  10. Statistical Machine Translation
      Given a previously unseen input sentence, and our collated statistics, we can estimate a translation
  11. Advanced MT
      Previous example is very simplistic
      • In reality SMT systems calculate much more complex statistical models over millions of sentence pairs for a pair of languages
      • Upwards of 2M sentence pairs on average for large-scale systems
      Other statistics calculated include
      • Word-to-word translation probabilities
      • Phrase-to-phrase translation probabilities
      • Word order probabilities
      • Linguistic information (are the words nouns, verbs?)
      • Fluency of the final output
      All modern approaches are based on building translations for complete sentences by putting together smaller pieces of translation
  12. Data is Key
      For SMT, data is key
      • Information (word/phrase correspondences and associated statistics) is only based on what we have seen before in the training data
      Important that data used to train SMT systems is:
      • Of sufficient size: avoid sparseness/skewed statistics
      • Representative and relevant: contains the right type of language
      • High-quality: absence of misspellings, incorrect alignments etc.
      • Proofed by human translators
  13. Why is MT Difficult?
      A word or a phrase can have more than one meaning (ambiguity: lexical or structural)
      • e.g. “bank”, “dive”, “I saw the man with the telescope”
      People use language creatively
      • New words are cropping up all the time
      Linguistic differences between languages
      • e.g. structure of Irish sentences vs. structure of English sentences:
        “Tá (Is) ocras (hunger) orm (on me)” <-> “I am hungry”
      There can be more than one way to express the same meaning
      • “New York”, “The Big Apple”, “NYC”
  14. Why is MT Difficult?
      • Israeli officials are responsible for airport security.
      • Israel is in charge of the security at this airport.
      • The security work for this airport is the responsibility of the Israel government.
      • Israeli side was in charge of the security of this airport.
      • Israel is responsible for the airport’s security.
      • Israel is responsible for safety work at this airport.
      • Israel presides over the security of the airport.
      • Israel took charge of the airport security.
      • The safety of this airport is taken charge of by Israel.
      • This airport’s security is the responsibility of the Israeli security officials.
  15. No single solution for all languages
      English - Spanish
      English - French
      • Number agreement: the house / the houses vs. la maison / les maisons
      • Gender agreement: the house / the cheese vs. la maison / le fromage
  16. No single solution for all languages
      English - German
      English - Chinese
      • 种水果的农民: The farmer who grows fruit [Lit: “grow fruit (particle) farmer”]
  17. Not all languages are created equal
      French, German, Turkish, Finnish, Spanish, Chinese, Korean, Hungarian, Portuguese, Japanese, Thai, Basque
  18. The Challenge of Patents
      Long sentences
      • Largest single document: 249,322 words
      • Longest sentence: 1,417 words
      Technical constructions
      • L is an organic group selected from -CH2- (OCH2CH2)n-, -CO-NR'-, with R'=H or C1-C4 alkyl group; n=0-8; Y=F, CF3
      • … maximum stress of 1.2 to 3.5 N/mm<2> and a maximum elongation of 700 to 1,300% at 0[deg.] C.
  19. The Challenge of Patents
      • Very long sentences as standard
      • Grammatically incomplete using nominal and telegraphic style (!)
      • Passive forms are frequent
      • Frequent use of subordinate clauses, participles, implicit constructs
      • Inconsistent and incorrect spelling
      • High use of neologisms
      • Instances of synonymy and polysemy
      • Spurious use of punctuation
      Authoring guide for “to be translated” text: patents break almost all of the rules!
  20. Evaluating Machine Translation Quality
      Automatic Evaluation
      • Judge the quality of an MT system by comparing its output against a human-produced “reference” translation
      • Pros: Quick, cheap, consistent
      • Cons: Inflexible, cannot be used on ‘new’ input
      Human Evaluation
      • Pros: Reliable, flexible, multi-faceted (fluency, error analyses, benchmarking)
      • Cons: Slow, expensive, subjective
      • Fluency vs. Adequacy
      Task-Based Evaluation
  21. Evaluating Machine Translation Quality: Task-Based Evaluation
      • Standalone evaluation of MT systems is necessary to get a sense of the overall quality of a system
      • To determine the ultimate usability of an MT system, intrinsic task-based evaluation is required
      • Why? Fluency vs. Adequacy
      Fluency: how fluent and grammatically correct the translation output is
      Adequacy: how accurately the translation conveys the meaning of the source
      Source: La gran casa roja
      Output 1: The big blue house
      Output 2: The big house red
  22. Practical uses of Machine Translation
      Understand its limitations and you’ll understand its capabilities!
      No
      • Translate a patent for filing
      • Translate literature for publication
      • Translate marketing materials
      • Anything mission critical without review
      Yes
      • Productivity tool for professional translation
      • Understand foreign patents
      • Localisation processes and “controlled” content
      • High volume, e.g. eDiscovery
  23. Use cases in practice
      • Product descriptions to open new markets
      • MT for post-editing productivity across industries
      • Developer, and user for web content
      • Tens of thousands of people using online tools daily
  24. Current Hot Topics
      Neural Networks
      • Using artificial intelligence and deep learning to develop a completely new way of doing machine translation!
      Quality Estimation
      • Functionality through which machine translation can “self-assess” the quality of the translations it produces.
      Online Adaptive Translation
      • Machine translations that can automatically learn and improve based on feedback, particularly from revisions.
      Use-case specific MT
      • Just like patent MT, but for countless other areas.
  25. About Iconic
      We are a Machine Translation and Natural Language Processing software and services provider, delivering expert solutions with Subject Matter Expertise
  26. Iconic Ensemble Architecture…
  27. …enhanced with Neural MT
  28. Speed, Cost, and Quality
      What is the difference between machine translation and manual translation when translating a 10 page patent document from Chinese into English?
      Machine Translation is not designed to replace professional translation but there are many cases where costly and time-consuming manual translation is simply not necessary.
  29. • Data confidentiality
      • File formats
      • Potential for customisation, enhancements, and improvement for specific domains
  30. More than just translation
      • Data processing, e.g. optical character recognition, digitisation
      • Information extraction, e.g. citation analysis, cross-lingual search
      • Data understanding, e.g. summarisation, concept & key term identification
      • Database building, e.g. combining the above, with translation, for export
  31. Record Extraction
      Extraction algorithms work on cleaned OCR output, using patterns, keywords, and formatting information.
  32. Citation Analysis
      • Assessment of record and reference patterns
      • Application for record extraction
      • Tracking variations across years
      • Application for bibliographic data fielding
  33. Reference extraction + fielding
  34. .com Visit and use the promo code epo2016 to get 20 free pages of translation
  35. Thank You!
      john@iptranslator.com
      @IconicTrans
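
The idea on slides 6 to 10, inferring word translations from co-occurrence statistics in a sentence-aligned corpus, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four-sentence English-Spanish corpus is invented for this example (it is not the corpus shown on the slides), the function names are hypothetical, and the Dice coefficient is swapped in as one simple co-occurrence score; real SMT systems estimate far richer models.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus (not from the slides): sentence-aligned English-Spanish pairs.
CORPUS = [
    ("the dog", "el perro"),
    ("the cat", "el gato"),
    ("i have a cat", "tengo un gato"),
    ("a dog", "un perro"),
]


def train(corpus):
    """Count, per sentence pair, how often words occur and co-occur."""
    src_count, tgt_count = Counter(), Counter()
    cooc = defaultdict(Counter)
    for src, tgt in corpus:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        for s in src_words:
            cooc[s].update(tgt_words)
    return src_count, tgt_count, cooc


def best_translation(word, model):
    """Return the target word with the highest Dice score, or None if the word is unseen."""
    src_count, tgt_count, cooc = model
    if word not in cooc:
        return None

    def dice(t):
        # Dice coefficient: rewards pairs that appear together and rarely apart.
        return 2 * cooc[word][t] / (src_count[word] + tgt_count[t])

    return max(cooc[word], key=dice)


def translate(sentence, model):
    """Naive word-by-word translation; unseen words are passed through unchanged."""
    return [best_translation(w, model) or w for w in sentence.lower().split()]


MODEL = train(CORPUS)
print(best_translation("dog", MODEL))   # perro
print(translate("I have a dog", MODEL))  # ['tengo', 'tengo', 'un', 'perro']
```

On this toy data the sentence “I have a dog” never appears, yet each word still gets a plausible translation. The word-by-word output repeats “tengo” because both “I” and “have” align to it, which is the kind of gap that phrase-level and word-order statistics (slide 11) are meant to close.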

Editor's notes

  • Second point is important. It has different uses and usability. The concept of FAHQMT (fully automatic high-quality MT) is no more. Focus is now on HAMT (human-aided MT) and PEMT (post-edited MT).

    The problem with rule-based systems is that they didn’t scale
    You need bilingual experts for each language pair

    SMT is the predominant approach
  • Starting point for all systems is data.
    The most important aspect is the quality of the data…
  • They are essential and the quality is crucial.

    The translations must be accurate and the alignment must be correct, otherwise we infer the wrong things. Introduce “noise” into our systems.
  • How do we use these corpora? It’s all about learning and remembering things we’ve seen before, the same way you might go about translating something
  • Ok, so the translation isn’t exactly right here. It should be “Je parle à la fille” but we haven’t seen enough examples (don’t have enough data) for reliable estimates, we’re just going on the counts of the words
  • How likely a word is to translate to another word – as you have seen
    How likely the different phrases are to translate as one another
    What’s the likelihood a certain word will have a different position in the target sentence
    Sometimes we take into account linguistic information about the words: is it a verb, then it should go here, articles should precede nouns, etc.
    Look at models of the target language and see if what we have produced makes sense (can these words go together in this order?)
  • Google Translate aims to be a general system, but what happens when you’re translating a sports website? Quality issues can be caused by the fact that there’s a lot of other data in their models than sports news.

    Similarly, if I have a translation system for car manuals, it won’t be any good at translating sports websites.

    This is reflected in our systems at IPTranslator too where all of our models are built using patents which have been filed in multiple languages to ensure we get the style correct
    (patents are a bigger fish than this though)
  • The simple answer is that language is complex! Which is what makes it difficult to learn but also so interesting at the same time!

    Who has the telescope, him or I?

    New words, especially in patents. And new usage of words. The verb “to tweet” didn’t exist so long ago…
  • The last piece in the puzzle is understanding the languages you’re developing MT systems for. And that’s not understanding them in isolation – that’s understanding, for each language pair, what the differences are between them, e.g. many of the things we need to look out for when developing English-Spanish translation engines we don’t need to do for French-Spanish translation
  • With certain language pairs, things get more complex. The processes that we need to develop are harder to develop, less studied, require smarter people!

    Chinese, need to identify these DE constructions so we know to move the head noun
    No tense, going into English, how do we know what tense?
    There’s no article! We have to generate it!
    DE particle has many translations, which one!

    FIRST THINGS FIRST, which ones are the words!? We need to segment the Chinese!

    ONLY WITH THESE SKILLS CAN YOU EXPLOIT THE TECHNOLOGY TO ITS FULLEST – AND WHAT DO WE GET IN DOING THIS? MT WITH SUBJECT MATTER EXPERTISE
  • **EFFECT ON FEASIBILITY**

    Basically, some languages are easier for MT than others.
    As a general rule, the closer two languages are to one another in terms of word order and grammatical structure, the easier.
    Here are some rules of thumb (with English)
  • But of course it’s not just that easy.
    Patents for example have a range of highly complex linguistic characteristics that make this challenging, both for PROFESSIONAL translators as well as for Translation Software.
    Let’s look for example at this patent: what’s highlighted in blue is a SINGLE sentence (which is an individual legal claim).
    Additionally, we have to deal with complex technical constructions such as chemical formulae, alphanumeric sequences, even genomic and amino acid sequences.

    And then we have patents which introduce a whole new level of complexity on top of the language issues…

    Patents are hard to read, never mind translate, never mind try to teach a computer how to translate them!
  • Sometimes it’s hard to tell whether the translation is bad or that’s simply how the original patent was written
  • Commercial machine translation is plagued with misleading marketing with unrealistic claims and promises - Need to manage expectations

    When I say NO, I mean no in a fully-automatic manner with no human intervention

    Filing – not when meaning is CRUCIAL
    Publication – no, there will be errors
    Marketing – no, not with subtleties, idioms, etc.
  • MT solutions and services provider, specializing in providing customised solutions with subject matter expertise for specific technical sectors, such as Patents/IP, life sciences, and financial.

    We are the MT partner of choice for some of the world’s largest translation companies, information providers, and government and enterprise organisations.

    For Translation Companies: We help translation companies to translate more content, more accurately for faster project turnaround, resulting in significant cost savings and increased revenue.

    For Enterprise Clients: We help enterprises to translate more content in less time, resulting in faster products to market and enhanced global reach.

    For Information Providers: We help information providers to translate knowledge, literature and documentary information faster and more accurately, resulting in broader knowledge offerings and faster time to market.
  • THERE’S VALUE TO BE ADDED, HOW CAN WE HARNESS?

    We literally already have the perfect environment to allow NMT to be another string in the bow and let us use the most appropriate MT for the job

    WHETHER IT BE NEURAL FOR KOREAN, FOR CHAT TEXT, OR WHATEVER THE CASE MAY BE
  • It’s not a one size fits all solution and who knows when it will be, but we have developed a framework that allows us to leverage its strengths on a case by case basis to deliver the best possible translation for a given task.

    Over time we fully expect the “brain to grow” and become the best MT on offer for various language pairs and content types, and when it is, WE’RE PERFECTLY POSITIONED FROM A TECHNOLOGY AND EXPERTISE PERSPECTIVE to capitalise on this wave.
  • We’ve launched a new product this year which is essentially repurposing the technology that we have and focusing on very particular use cases…

    Firstly, let’s just look at the stark motivation for using MT for patent information in the first place…
  • The “standard” solution to the problem of foreign language documents is translations.
    But translation is costly, not that quick, and often it is complete overkill for what is required!!

    This is where MT comes in as a much more cost-effective, rapid solution that allows you to make a QUICK determination as to whether something is relevant or not before you invest in a professional translation.

    And, while we all know that MT isn’t perfect, the reality of the situation is that the quality is often “good enough” or fit for the purpose of making this determination.

    SO IT’S A NO-BRAINER
  • So going back to IPTranslator, the elephant in the room for us for a long time has been Google Translate. The first question we get asked always is “is it better than Google Translate?”

    The answer is yes, the majority of the time for most of the languages that we cover. However, is that increase in quality enough to justify the cost of our service over Google, which is a free service? It’s hard to beat free! The reality now is that the “fit for purpose / good enough” quality level is something that Google can often achieve, especially since it started working with the EPO.

    So where does IPTranslator fit?

    Confidential Data
    File formats incl. pdf
    Potential for customisation, enhancements, and improvement for specific domains

  • Not just for patents, but for journals and other non-patent literature
  • Why was it challenging?

    Exceptions to patterns
    OCR errors
    Lack of formatting information
  • The record extraction example is from Pattern B
    The bib data example is from Pattern 5
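
The note on Chinese above calls segmentation the first step (“which ones are the words!?”). A minimal sketch of one classic baseline, forward maximum matching against a dictionary, applied to the slide 16 example 种水果的农民. The four-entry lexicon and the function name are assumptions made for this illustration; nothing here describes how Iconic's systems actually segment text, and production segmenters typically rely on large lexicons or statistical models.

```python
# Hypothetical mini-lexicon covering only the slide 16 example.
LEXICON = {"种", "水果", "的", "农民"}


def segment(text, lexicon=LEXICON, max_len=4):
    """Greedy forward maximum matching: at each position take the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary word matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # A single character is always accepted so the loop cannot stall.
            if length == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += length
                break
    return tokens


print(segment("种水果的农民"))  # ['种', '水果', '的', '农民']
```

The result matches the literal gloss on the slide: “grow / fruit / (particle) / farmer”. Once 的 is isolated as its own token, later steps can detect the DE construction and move the head noun when producing the English.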
