Babies and Bathwater
Keeping linguistics alongside machine
learning in patent search
David Woolls – CFL Software Limited, UK
Matter
• Therefore, we cannot think that matter is made of points
without extension, because no matter how many of these we
manage to put together, we never obtain something with an
extended dimension.
Carlo Rovelli, Reality Is Not What It Seems (2016, p. 12)
• Quindi non si può pensare che la materia sia fatta di punti
senza estensione, perché, per quanti ne mettessimo
insieme, non otterremmo mai qualcosa con una dimensione
estesa.
• What is the matter with this sentence? Does this matter? As
a matter of fact it does. That’s another matter.
• What does ‘matter’ mean on this page?
Imagined Readers – Text differences
"It was a dark and stormy night, the rain
came down in torrents, there were brigands on
the mountains, and wolves, and the chief of the
brigands said to Antonio, 'I'm bored - tell us a
story!'"
Janet and Allan Ahlberg
From “Paul Clifford”
LSTM and linguistics
• But there are also cases where we need more
context.
• Consider trying to predict the last word in the text “I
grew up in France… I speak fluent French.”
Humans usually provide linguistic assistance in the form of function words (grammar):
I grew up in France so I speak fluent … Definitely French
I grew up in France and I speak fluent … Possibly French, but maybe another language
I grew up in France but I speak fluent … Definitely not French
I grew up in France but I also speak fluent … Very definitely not French
I grew up in France but I don’t speak fluent … Definitely French
I grew up in France so I don’t speak fluent … Definitely not French
Babies, bathwater,
stems, lemmas and function words
Sentence → Becomes
I think Christoph is brilliant → Think Christoph brilli
I thought Christoph was brilliant → Think Christoph brilli
I thought Christoph was brilliant but now I’m not so sure. → Think Christoph brilli sure
Hearing Christoph’s brilliance I asked him to speak. → Hear Christoph brilli ask speak
I wouldn’t do that if I were you! → !
This is called telegraphic language and is spoken by children between 18
months and three years old during language acquisition. Perhaps not ideal for
computers and comprehension.
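The reduction shown above can be approximated in a few lines. This is a minimal sketch, assuming a tiny hand-picked function-word list, one irregular form and a crude suffix stripper; it is not the stemmer or lemmatiser behind the slide, and `FUNCTION_WORDS`, `IRREGULAR`, `crude_stem` and `telegraphic` are hypothetical names chosen for illustration.

```python
import re

# Hypothetical, deliberately small function-word list (an assumption for illustration).
FUNCTION_WORDS = {
    "i", "is", "was", "but", "now", "not", "so", "him", "to", "do",
    "that", "if", "were", "you", "m", "s", "t", "wouldn",
}

# Hypothetical lemma table for irregular forms that suffix stripping cannot reach.
IRREGULAR = {"thought": "think"}


def crude_stem(word: str) -> str:
    """Map irregular forms, then strip a few common suffixes (toy stemmer)."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in ("ance", "ant", "ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def telegraphic(sentence: str) -> list[str]:
    """Lower-case the sentence, drop function words and stem what is left."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [crude_stem(t) for t in tokens if t not in FUNCTION_WORDS]


present = telegraphic("I think Christoph is brilliant")
past = telegraphic("I thought Christoph was brilliant")
warning = telegraphic("I wouldn't do that if I were you!")
```

In this sketch `present` and `past` come out identical, so the tense distinction is lost, and `warning` is empty because every word in it is a function word.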
Linguistic LSTM with real sentences.
• It is a truth universally acknowledged, [6]
• that a single man [4]
• in possession of a good fortune, [6]
• must be in want of a wife. [7]
• 23 words / 4 chunks ≈ 6 words per chunk
The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for
Processing Information
by George A. Miller
originally published in The Psychological Review, 1956, vol. 63, pp. 81-97
http://www.musanim.com/miller1956/
It is a truth universally acknowledged, that a single man in possession
of a good fortune, must be in want of a wife.
LSTM
• However little known the feelings or views of such a man may be
on his first entering a neighbourhood, this truth is so well fixed in
the minds of the surrounding families, that he is considered the
rightful property of some one or other of their daughters.
• However little known the feelings or views [7]
• of such a man may be [6]
• on his first entering a neighbourhood, [6]
• this truth is so well fixed [6]
• in the minds of the surrounding families, [7]
• that he is considered the rightful property [7]
• of some one or other of their daughters. [8]
• 47 words / 7 chunks ≈ 7 words per chunk
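The per-chunk word counts can be checked mechanically. The following is a minimal sketch, assuming words are whitespace-delimited and that the chunk boundaries are placed by hand as on the slide; `words_per_chunk` is a hypothetical helper name.

```python
def words_per_chunk(chunks):
    """Return the word count of each chunk and the mean rounded to the nearest integer."""
    counts = [len(chunk.split()) for chunk in chunks]
    return counts, round(sum(counts) / len(counts))


chunks = [
    "However little known the feelings or views",
    "of such a man may be",
    "on his first entering a neighbourhood,",
    "this truth is so well fixed",
    "in the minds of the surrounding families,",
    "that he is considered the rightful property",
    "of some one or other of their daughters.",
]

counts, mean = words_per_chunk(chunks)
print(counts, sum(counts), mean)  # the total is 47 words over 7 chunks
```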
Why linguistics?
• Patents are communicative documents, written in many languages.
• Communication is achieved by context which can be close or distant.
• Boolean searching gives results by document; range searching needs to be done
by claim.
• There are distractor numbers in a claim (e.g. claim numbers, temperatures, lengths).
• There are potential data quality or format problems introduced by OCR, machine translation or extraction from a database.
• All these and others need to be taken into account to find only relevant
material.
ICIC 2017 8
Why linguistics for ranges?
• Range information is in the unstructured text
– The location and referent of ranges are signalled by linguistic structures and forms:
• Order: range then element, element then range, or both, e.g. 0,80 < Si < 1,20
• Elements by symbol (Si) or in full (Silicon or silicon)
• Implicit or explicit marking: 1-5 or between 1 and 5
• Symbolic or lexical marking: <2.5 or less than 2.5; ≥ .76 or greater than or equal to 0.76
• Variation in proximity of additional markings: 0.5%, 0.5wt%, 0.5 wt %
– There can be mixtures of these forms in a single claim.
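The surface forms listed above can be reduced to one internal representation before any comparison is made. The sketch below is one possible approach under stated assumptions, not the program described in the talk: the patterns, the small `ELEMENTS` table and the names `Range`, `parse_range` and `find_element` are all hypothetical, and inclusive versus exclusive bounds are deliberately not distinguished.

```python
import re
from typing import NamedTuple, Optional


class Range(NamedTuple):
    low: Optional[float]   # None: no lower bound stated
    high: Optional[float]  # None: no upper bound stated


NUM = r"(\d*[.,]?\d+)"  # accepts 0.76, .76 and the comma decimal 0,80

# Lexical markers are rewritten to symbols, longest phrase first.
LEXICAL = [
    ("greater than or equal to", "≥"),
    ("less than or equal to", "≤"),
    ("greater than", ">"),
    ("less than", "<"),
]

# Hypothetical lookup: symbol or full name (lower-cased) to symbol.
ELEMENTS = {"si": "Si", "silicon": "Si", "mn": "Mn", "manganese": "Mn"}


def to_float(text: str) -> float:
    return float(text.replace(",", "."))


def find_element(text: str) -> Optional[str]:
    """Return the symbol of the first known element named in the text."""
    for token in re.findall(r"[A-Za-z]+", text):
        symbol = ELEMENTS.get(token.lower())
        if symbol:
            return symbol
    return None


def parse_range(text: str) -> Optional[Range]:
    """Parse one range expression; returns None when no range is recognised."""
    t = text.lower()
    for phrase, symbol in LEXICAL:
        t = t.replace(phrase, symbol)
    t = re.sub(r"\s*(wt)?\s*%", "", t)  # drop %, wt% and "wt %" markings

    # Two-sided comparison around the referent: 0,80 < si < 1,20
    m = re.search(rf"{NUM}\s*[<≤]\s*[a-z]+\s*[<≤]\s*{NUM}", t)
    if m is None:
        # Explicit lexical marking: between 1 and 5
        m = re.search(rf"between\s+{NUM}\s+and\s+{NUM}", t)
    if m is None:
        # Implicit marking: 1-5, 1 ~ 5, 1 to 5
        m = re.search(rf"{NUM}\s*(?:-|–|~|to)\s*{NUM}", t)
    if m:
        return Range(to_float(m.group(1)), to_float(m.group(2)))

    # One-sided bound: <2.5 or ≥ .76
    m = re.search(rf"([<≤>≥])\s*{NUM}", t)
    if m:
        value = to_float(m.group(2))
        return Range(None, value) if m.group(1) in "<≤" else Range(value, None)
    return None
```

With this sketch, `parse_range("less than 2.5")` and `parse_range("<2.5")` return the same `Range`, as do `parse_range("0.1-0.5wt%")` and `parse_range("0.1 - 0.5 wt %")`. The order of the patterns matters: the two-sided comparison has to be tried before the one-sided one.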
Reading
The program is a linear text reader because we need to:
1. Identify claims
2. Identify pairs of elements and ranges in each claim.
So each line in the file is read word by word just once in the
same sequence as a human reader.
Reading
• Items are identified as numbers, range indicators or elements in sequence.
• As each element/range pair is identified, the relationship with the specification is
calculated.
• Following calculation, the element and the range are colour-coded and the claim is
built for potential display.
• At the conclusion of each claim the total found is compared with the total
specification.
• If the claim meets the overall specification requirement it is added to the list for
display.
• At the conclusion of the reading process, all the results are ranked and
displayed.
• The program can process the full claims of around 300 patents per second.
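The reading steps above can be sketched as a single left-to-right pass. This is an illustration under stated assumptions, not SpanMatch itself: the tiny `ELEMENTS` and `INDICATORS` sets, the "overlap" test used as the relationship with the specification, and the names `classify`, `read_claim` and `search` are all hypothetical. Colour-coding and ranking are omitted because the slides do not state how they are done.

```python
import re

ELEMENTS = {"C", "Si", "Mn", "Ni", "Cr"}  # hypothetical subset of symbols
INDICATORS = {"-", "~", "to"}             # hypothetical range indicators
TOKEN = re.compile(r"\d+(?:\.\d+)?|[A-Za-z]+|[-~]")


def classify(token):
    """Label a token as element, indicator, number or other."""
    if token in ELEMENTS:
        return "element"
    if token in INDICATORS:
        return "indicator"
    if token[0].isdigit():
        return "number"
    return "other"


def read_claim(text):
    """Read once, in order, collecting 'element low indicator high' sequences."""
    found, element, low, awaiting_high = {}, None, None, False
    for token in TOKEN.findall(text):
        kind = classify(token)
        if kind == "element":
            element, low, awaiting_high = token, None, False
        elif kind == "number" and element:
            if awaiting_high:
                found[element] = (low, float(token))
                element, low, awaiting_high = None, None, False
            else:
                low = float(token)
        elif kind == "indicator" and low is not None:
            awaiting_high = True
    return found


def overlaps(a, b):
    """True when two closed intervals share at least one value."""
    return a[0] <= b[1] and b[0] <= a[1]


def search(claims, spec):
    """Keep claims in which every specified element has an overlapping range."""
    hits = []
    for claim_id, text in claims.items():
        found = read_claim(text)
        matched = [e for e, rng in spec.items() if e in found and overlaps(found[e], rng)]
        if len(matched) == len(spec):
            hits.append((claim_id, found))
    return hits


claims = {
    "CN-1": "1. A non-magnetic alloy comprising C: 0.14-0.30%, Si: 0.15-0.80%, Mn: 20.00-27.00%",
    "XX-2": "2. An alloy comprising Si: 2.0-3.0%, Mn: 20-27%",
}
spec = {"Si": (0.1, 1.0), "Mn": (15.0, 30.0)}
hits = search(claims, spec)
```

Numbers that do not follow an element, such as the leading claim number, are ignored, which is one simple way of skipping distractors. In this sketch only the made-up claim "CN-1" satisfies the made-up specification.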
Native languages v Machine Translation
Here is the problem from the PatBase collection.
<Claims><![CDATA[<CLA_MT><XXC1> <p> CN 1. A non-magnetic alloy of high strength and toughness,
characterized in that the chemical composition in weight percent of: C:.. 0 14 ~0 30 percent, Si:.. 0 15 ~0 80
percent,.. Mn: 20 00 ~27 00 percent; Ni:.. 0 60 ~2 00 percent; Cr:.. 12 50 ~19 00 percent;
</CLA_MT><CLA_CN><XXC1><p>CN 1. 一种高强度韧性无磁合金,其特征在于,化学成分重量百分数为: C
:0. 14 〜0. 30%, Si :0. 15 〜0. 80%, Mn :20. 00 〜27. 00% ; Ni :0. 60 〜2. 00% ; Cr :12. 50 〜19. 00%
;
You can see that the machine translation (MT) into English is appalling!
You can also see that the original claim will be understandable by the program because the presentation is clear.
Detailed example (continued)
It is not practicable to write a program that takes account of all the things that might go wrong without also
introducing potential errors into data that is actually OK. But it is possible for SpanMatch to recognise the original
as correct, as you see here.
So, given clean data or cleaning the data up as best we can, we can do this in all the languages. Once you have an
indication of potential interest you can use a good MT program to translate just the claims of interest.
This is Google Translate translating the claim, and you can see that it is struggling, but is better than the PatBase
one.
CN is a high strength toughness nonmagnetic alloy characterized in that the chemical composition is in a weight
percentage of C: 0.14 to 0. 30 Si: 0.015 to 80 Mn: 20 to 0000. Ni: 0.60 ~ 2.00; Cr: 12. 50 ~ 1900; Mo or W
elements of one or two: 0. 60 ~ 2.50 ;; 0.8 ~ [0. LXMn (% - 0.5); 0 20 to 0.50; Ca, rare earth elements of one or
two: 0. 003 ~ 0.05;: 彡 0.03:: 彡 0.03; Fe: balance.
Use of CN, JP, KR originals - rationale
• Machine translation is often hard to understand and sometimes incomprehensible
• Using native language patents ensures data quality
• Limited inbuilt knowledge required for numerical searching
– Searching for elements requires only that a program has the CJK equivalents
for full element names; international symbols are identical.
– Searching for ranges requires knowledge of potential CJK equivalent codes
for digits.
– Searching for range indicators requires language-specific identification of hyphens, <,
> and words.
• Accurate identification of the search specification with display of the claims means only
those claims of interest need translation by machine or human
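The limited inbuilt knowledge listed above can be sketched as a small normalisation step applied before numeric matching. This is a minimal illustration under stated assumptions: `CJK_ELEMENT_NAMES` is a hypothetical four-entry lookup, `normalise` is a hypothetical name, and the repair of a space after a decimal point (as in "0. 14") is a heuristic that could also damage clean data.

```python
import re
import unicodedata

# Hypothetical lookup of full element names; international symbols need no mapping.
CJK_ELEMENT_NAMES = {"硅": "Si", "ケイ素": "Si", "규소": "Si", "锰": "Mn"}

# The wave dash (U+301C) is not changed by NFKC, so range marks are mapped explicitly.
RANGE_MARKS = str.maketrans({"〜": "~", "～": "~"})


def normalise(text: str) -> str:
    """Map CJK range marks, full-width digits and element names to ASCII forms."""
    text = text.translate(RANGE_MARKS)
    # NFKC folds full-width digits and punctuation to their ASCII equivalents.
    text = unicodedata.normalize("NFKC", text)
    # Heuristic (assumption): rejoin a decimal split by a space, e.g. "0. 14".
    text = re.sub(r"(\d)\.\s+(\d)", r"\1.\2", text)
    for name, symbol in CJK_ELEMENT_NAMES.items():
        text = text.replace(name, symbol)
    return text


print(normalise("Si :0. 15 〜0. 80%"))
print(normalise("硅：０．１５～０．８０％"))
```

After this step the same range patterns can be applied whatever the source language, which is why only the claims that match need to be translated.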
Thank you
Contact: d.woolls@cflsoftware.com
Website: www.cflsoftware.com

ICIC 2017: Babies and bathwater: Keeping linguistics alongside machine learning in patent search
