David Woolls (CFL Software, UK)
The growth in computing power has made it possible to gain insights into very large quantities of text using both statistical and neural methods. Linguistics, the study of how languages work for humans, is not a major part of that process. However, decisions on freedom to operate (FTO), invalidity and competition are still made by humans, which means reading the patents identified by the machines. Because humans are endlessly creative, even in the apparently constrained world of patent writing, and because different human languages express similar concepts in different ways, identifying ranges in alloys, compounds, formulations and the like is a complex challenge for computer programs. This paper explores how the same hardware advances that have enabled machine learning can be exploited to produce overall solutions to the problems that natural languages present to humans and computers alike. It identifies the areas in which computer programs can outmatch human capability in identifying and assessing complex interactions of molecules, elements and the like and comparing them with potential or actual specifications, a capability that leaves humans much more time to focus on interpretation rather than finding. It also illustrates the application of such programs to the main European languages and to Chinese, Japanese and Korean.
ICIC 2017: Babies and bathwater: Keeping linguistics alongside machine learning in patent search
1. Babies and Bathwater
Keeping linguistics alongside machine learning in patent search
David Woolls – CFL Software Limited, UK
2. Matter
• Therefore, we cannot think that matter is made of points
without extension, because no matter how many of these we
manage to put together, we never obtain something with an
extended dimension.
Carlo Rovelli, Reality is not what it seems (2016, p. 12)
• Quindi non si può pensare che la materia sia fatta di punti
senza estensione, perché, per quanti ne mettessimo
insieme, non otterremmo mai qualcosa con una dimensione
estesa.
• What is the matter with this sentence? Does this matter? As
a matter of fact it does. That’s another matter.
• What does ‘matter’ mean on this page?
3. Imagined Readers – Text differences
"It was a dark and stormy night, the rain
came down in torrents, there were brigands on
the mountains, and wolves, and the chief of the
brigands said to Antonio, 'I'm bored - tell us a
story!’”
Janet and Allan Ahlberg
From “Paul Clifford”
4. LSTM and linguistics
• But there are also cases where we need more
context.
• Consider trying to predict the last word in the text “I
grew up in France… I speak fluent French.”
Humans usually provide linguistic assistance in the form of function words (grammar):
I grew up in France so I speak fluent … → Definitely French
I grew up in France and I speak fluent … → Possibly French but maybe another
I grew up in France but I speak fluent … → Definitely not French
I grew up in France but I also speak fluent … → Very definitely not French
I grew up in France but I don’t speak fluent … → Definitely French
I grew up in France so I don’t speak fluent … → Definitely not French
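The six examples above can be read as a small lookup: the connective, the presence of "also" and the presence of a negation together fix the expectation for the missing word. A minimal Python sketch of that reading follows. The function name `expectation`, the word sets and the fallback label are illustrative assumptions; the table simply restates the slide.

```python
# Expectations copied from the slide, keyed by (connective, has_also, negated).
EXPECTATION = {
    ("so", False, False): "Definitely French",
    ("and", False, False): "Possibly French but maybe another",
    ("but", False, False): "Definitely not French",
    ("but", True, False): "Very definitely not French",
    ("but", False, True): "Definitely French",
    ("so", False, True): "Definitely not French",
}

CONNECTIVES = {"so", "and", "but"}   # assumed: only the three on the slide
NEGATIONS = {"don't", "not"}         # assumed: a minimal negation list


def expectation(sentence: str) -> str:
    """Return the slide's expectation for the missing final word."""
    # Normalise the typographic apostrophe so "don’t" and "don't" match.
    words = sentence.lower().replace("\u2019", "'").split()
    connective = next((w for w in words if w in CONNECTIVES), None)
    has_also = "also" in words
    negated = any(w in NEGATIONS for w in words)
    return EXPECTATION.get((connective, has_also, negated), "No expectation")
```

Only two or three function words are inspected, yet they decide the outcome; the content words are identical in every sentence.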
5. Babies, bathwater, stems, lemmas and function words
Sentence → Becomes
I think Christoph is brilliant → Think Christoph brilli
I thought Christoph was brilliant → Think Christoph brilli
I thought Christoph was brilliant but now I’m not so sure. → Think Christoph brilli sure
Hearing Christoph’s brilliance I asked him to speak. → Hear Christoph brilli ask speak
I wouldn’t do that if I were you! → !
This is called telegraphic language and is spoken by children between 18
months and three years old during language acquisition. Perhaps not ideal for
computers and comprehension.
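The reduction shown on this slide can be reproduced with a toy pipeline of stop-word removal, lemma lookup and suffix stripping. The sketch below is a Python illustration under stated assumptions: the stop list, the single irregular lemma and the suffix list are hypothetical and chosen only to cover the example sentences. It is not the stemmer of any particular search system. Output is lower-cased and punctuation is dropped, so the last example yields an empty list rather than "!".

```python
import re
from typing import List

# Illustrative assumptions: a tiny stop list, one irregular lemma and a few
# suffixes, sized only for the example sentences on the slide.
STOP_WORDS = {"i", "is", "was", "but", "now", "i'm", "not", "so", "to",
              "him", "wouldn't", "do", "that", "if", "were", "you"}
LEMMAS = {"thought": "think"}
SUFFIXES = ("'s", "ance", "ant", "ing", "ed")


def stem(word: str) -> str:
    """Map irregular forms to a lemma, then strip one known suffix."""
    word = LEMMAS.get(word, word)
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def telegraphic(sentence: str) -> List[str]:
    """Drop function words and stem what is left."""
    text = sentence.lower().replace("\u2019", "'")
    words = re.findall(r"[a-z']+", text)
    return [stem(w) for w in words if w not in STOP_WORDS]


print(telegraphic("I think Christoph is brilliant"))
print(telegraphic("I thought Christoph was brilliant but now I'm not so sure."))
```

The first sentence and its near-opposite both reduce to almost the same tokens, which is the loss of meaning the slide is pointing at.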
6. Linguistic LSTM with real sentences.
• It is a truth universally acknowledged, [6]
• that a single man [4]
• in possession of a good fortune, [6]
• must be in want of a wife. [7]
• 23/4 ≈ 6
The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for
Processing Information
by George A. Miller
originally published in The Psychological Review, 1956, vol. 63, pp. 81-97
http://www.musanim.com/miller1956/
It is a truth universally acknowledged, that a single man in possession
of a good fortune, must be in want of a wife.
7. LSTM
• However little known the feelings or views of such a man may be
on his first entering a neighbourhood, this truth is so well fixed in
the minds of the surrounding families, that he is considered the
rightful property of some one or other of their daughters.
• However little known the feelings or views [7]
• of such a man may be [6]
• on his first entering a neighbourhood, [6]
• this truth is so well fixed [6]
• in the minds of the surrounding families, [7]
• that he is considered the rightful property [7]
• of some one or other of their daughters. [8]
• 47/7 ≈ 7
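The chunk lengths for both sentences can be checked mechanically. The Python sketch below counts whitespace-separated words per chunk, so "some one" counts as two words, and then averages them. The names `SLIDE_6`, `SLIDE_7`, `chunk_lengths` and `mean_length` are illustrative.

```python
from typing import List

SLIDE_6 = [
    "It is a truth universally acknowledged,",
    "that a single man",
    "in possession of a good fortune,",
    "must be in want of a wife.",
]

SLIDE_7 = [
    "However little known the feelings or views",
    "of such a man may be",
    "on his first entering a neighbourhood,",
    "this truth is so well fixed",
    "in the minds of the surrounding families,",
    "that he is considered the rightful property",
    "of some one or other of their daughters.",
]


def chunk_lengths(chunks: List[str]) -> List[int]:
    """Number of whitespace-separated words in each chunk."""
    return [len(chunk.split()) for chunk in chunks]


def mean_length(chunks: List[str]) -> float:
    """Average number of words per chunk."""
    lengths = chunk_lengths(chunks)
    return sum(lengths) / len(lengths)


for name, chunks in (("slide 6", SLIDE_6), ("slide 7", SLIDE_7)):
    print(name, chunk_lengths(chunks), round(mean_length(chunks), 2))
```

No chunk in either sentence exceeds nine words, the upper end of Miller's seven plus or minus two.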
8. Why linguistics?
• Patents are communicative documents, written in many languages.
• Communication is achieved by context which can be close or distant.
• Boolean searching gives results by document; range searching needs to be done
by claim.
• There are distractor numbers in a claim (e.g. claim numbers, temperatures,
lengths).
• There are potential data quality or format problems introduced by OCR, machine
translation or extraction from a database.
• All these and others need to be taken into account to find only relevant
material.
9. Why linguistics for ranges?
• Range information is in the unstructured text
– The location and referent of ranges are signalled by linguistic structures and forms:
• Range then element, element then range, or both: 0,80 < Si < 1,20
• Elements by symbol (Si) or in full (Silicon or silicon)
• Implicit or explicit marking: 1-5 or between 1 and 5
• Symbolic or lexical marking: <2.5 or less than 2.5; ≥ .76 or greater than or equal to 0.76
• Variation in proximity of additional markings: 0.5%, 0.5wt%, 0.5 wt %
– There can be mixtures of these forms in a single claim.
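The surface forms listed above can each be reduced to a pair of bounds. The Python sketch below uses a handful of regular expressions to do so. It illustrates the kind of variation involved and is not the method used by the program described in this talk. `parse_range`, `parse_percent`, `NUM` and `UNIT` are hypothetical names. Strict and non-strict comparisons are not distinguished, and an open bound is returned as `None`.

```python
import re
from typing import Optional, Tuple

Bounds = Tuple[Optional[float], Optional[float]]

NUM = r"(\d*[.,]?\d+)"            # 0,80   .76   2.5   5
UNIT = r"(?:\s*(?:wt)?\s*%)?"     # optional %, wt% or "wt %"


def to_float(text: str) -> float:
    """Accept a decimal comma as well as a decimal point."""
    return float(text.replace(",", "."))


def parse_range(text: str) -> Optional[Bounds]:
    """Return (low, high) for one range expression; None marks an open bound."""
    # Range around an element: 0,80 < Si < 1,20
    m = re.search(NUM + r"\s*[<≤]\s*[A-Za-z]+\s*[<≤]\s*" + NUM, text)
    if m:
        return to_float(m.group(1)), to_float(m.group(2))
    # Explicit lexical marking: between 1 and 5
    m = re.search(r"between\s+" + NUM + UNIT + r"\s+and\s+" + NUM, text, re.IGNORECASE)
    if m:
        return to_float(m.group(1)), to_float(m.group(2))
    # Implicit marking: 1-5, with an optional unit after the first number
    m = re.search(NUM + UNIT + r"\s*[-–~]\s*" + NUM, text)
    if m:
        return to_float(m.group(1)), to_float(m.group(2))
    # Upper bound only: <2.5 or less than 2.5
    m = re.search(r"(?:<|less than)\s*" + NUM, text, re.IGNORECASE)
    if m:
        return None, to_float(m.group(1))
    # Lower bound only: ≥ .76 or greater than or equal to 0.76
    m = re.search(r"(?:≥|greater than or equal to)\s*" + NUM, text, re.IGNORECASE)
    if m:
        return to_float(m.group(1)), None
    return None


def parse_percent(text: str) -> Optional[float]:
    """Read a single value written as 0.5%, 0.5wt% or 0.5 wt %."""
    m = re.fullmatch(NUM + r"\s*(?:wt)?\s*%", text.strip())
    return to_float(m.group(1)) if m else None


print(parse_range("0,80 < Si < 1,20"))
print(parse_range("greater than or equal to 0.76"))
print(parse_percent("0.5 wt %"))
```

Even this short list needs five patterns in a fixed order, and it still ignores negative numbers and hyphens used as ordinary punctuation.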
10. Reading
The program is a linear text reader because we need to:
1. Identify claims
2. Identify pairs of elements and ranges in each claim.
So each line in the file is read word by word just once in the
same sequence as a human reader.
11. Reading
• Items are identified as numbers, range indicators or elements in sequence.
• As each element/range pair is identified, the relationship with the specification is
calculated.
• Following calculation the element and the range are colour-coded and the claim
built for potential display.
• At the conclusion of each claim the total found is compared with the total
specification.
• If the claim meets the overall specification requirement it is added to the list for
display.
• At the conclusion of the reading process, all the results are ranked and
displayed.
• The program can process the full claims of around 300 patents per second.
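The reading steps above can be sketched as a single left-to-right pass over each claim. The Python below is a simplified illustration under stated assumptions, not the program described in this talk. The element list, the range markers, the overlap test and the names `read_claim`, `score` and `search` are all hypothetical, and colour-coding for display is omitted. Numbers that do not follow an element, such as claim numbers or temperatures, are skipped as distractors.

```python
import re
from typing import Dict, List, Tuple

Range = Tuple[float, float]

ELEMENTS = {"C", "Si", "Mn", "Ni", "Cr"}   # assumed: a small symbol list
RANGE_MARKS = {"-", "~", "to"}             # assumed: a few range indicators
TOKEN = re.compile(r"[A-Za-z]+|\d+(?:\.\d+)?|[-~]")


def read_claim(claim: str) -> Dict[str, Range]:
    """Read a claim once, left to right, collecting element/range pairs."""
    found: Dict[str, Range] = {}
    element, low, expecting_high = None, None, False
    for token in TOKEN.findall(claim):
        if token in ELEMENTS:
            element, low, expecting_high = token, None, False
        elif token in RANGE_MARKS:
            expecting_high = element is not None and low is not None
        elif token[0].isdigit():
            if element is None:
                continue  # distractor: claim number, temperature, length
            if low is None:
                low = float(token)
            elif expecting_high:
                found[element] = (low, float(token))
                element, low, expecting_high = None, None, False
        else:
            expecting_high = False  # an ordinary word breaks a pending range
    return found


def overlaps(a: Range, b: Range) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def score(found: Dict[str, Range], spec: Dict[str, Range]) -> int:
    """Count specification elements whose range overlaps the claim's range."""
    return sum(1 for el, rng in spec.items()
               if el in found and overlaps(found[el], rng))


def search(claims: List[str], spec: Dict[str, Range],
           required: int) -> List[Tuple[int, str]]:
    """Keep claims meeting the requirement, ranked by number of matches."""
    hits = [(score(read_claim(c), spec), c) for c in claims]
    return sorted((h for h in hits if h[0] >= required),
                  key=lambda h: h[0], reverse=True)
```

Because each token is inspected once and the state is only three variables, the cost per claim is linear in its length.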
12. Native languages v Machine Translation
Here is the problem from the PatBase collection.
<Claims><![CDATA[<CLA_MT><XXC1> <p> CN 1. A non-magnetic alloy of high strength and toughness,
characterized in that the chemical composition in weight percent of: C:.. 0 14 ~0 30 percent, Si:.. 0 15 ~0 80
percent,.. Mn: 20 00 ~27 00 percent; Ni:.. 0 60 ~2 00 percent; Cr:.. 12 50 ~19 00 percent;
</CLA_MT><CLA_CN><XXC1><p>CN 1. 一种高强度韧性无磁合金,其特征在于,化学成分重量百分数为: C
:0. 14 〜0. 30%, Si :0. 15 〜0. 80%, Mn :20. 00 〜27. 00% ; Ni :0. 60 〜2. 00% ; Cr :12. 50 〜19. 00%
;
You can see that the MT version into English is appalling.
You can also see that the original claim will be understandable by the program because the presentation is clear.
13. Detailed example (continued)
It is not practicable to write a program that takes account of everything that might go wrong without also
introducing potential errors into data that is actually OK. But SpanMatch can recognise the original
as correct, as you see here.
So, given clean data, or data cleaned up as best we can, we can do this in all the languages. Once you have an
indication of potential interest, you can use a good MT program to translate just the claims of interest.
This is Google Translate translating the claim; you can see that it is struggling, but it is better than the PatBase
one.
CN is a high strength toughness nonmagnetic alloy characterized in that the chemical composition is in a weight
percentage of C: 0.14 to 0. 30 Si: 0.015 to 80 Mn: 20 to 0000. Ni: 0.60 ~ 2.00; Cr: 12. 50 ~ 1900; Mo or W
elements of one or two: 0. 60 ~ 2.50 ;; 0.8 ~ [0. LXMn (% - 0.5); 0 20 to 0.50; Ca, rare earth elements of one or
two: 0. 003 ~ 0.05;: 彡 0.03:: 彡 0.03; Fe: balance.
14. Use of CN, JP, KR originals - rationale
• Machine translation is often hard to understand and sometimes incomprehensible
• Using native language patents ensures data quality
• Limited inbuilt knowledge required for numerical searching
– Searching for elements requires only that a program has the CJK equivalents
for full element names; international symbols are identical.
– Searching for ranges requires knowledge of potential CJK equivalent codes
for digits
– Searching for range indicators requires language specific identification of hyphen, <,
> and words.
• Accurate identification of the search specification with display of the claims means only
those claims of interest need translation by machine or human
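The limited inbuilt knowledge listed above can be sketched in a few lines of Python. The example maps full-width digits and punctuation to ASCII with Unicode NFKC normalisation, treats the wave dash as a range marker, and replaces a few simplified Chinese element names with their symbols. The name table, the range-mark list and the function names are illustrative assumptions, and Japanese and Korean names are not covered. The step that closes the gap in "0. 14" is the kind of clean-up that can damage good data elsewhere, so it is shown only for the claim format quoted earlier.

```python
import re
import unicodedata
from typing import List, Tuple

# Assumed, partial table of simplified Chinese element names.
CJK_ELEMENTS = {"碳": "C", "硅": "Si", "锰": "Mn", "镍": "Ni", "铬": "Cr"}

PAIR = re.compile(
    r"([A-Z][a-z]?)\s*:\s*(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*%")


def normalise(text: str) -> str:
    """Bring CJK digits, punctuation, range marks and names to ASCII forms."""
    # NFKC maps full-width digits, colon, percent and tilde to ASCII.
    text = unicodedata.normalize("NFKC", text)
    # The wave dash has no compatibility mapping, so handle it explicitly.
    text = re.sub(r"[〜~–]", "-", text)
    for name, symbol in CJK_ELEMENTS.items():
        text = text.replace(name, symbol)
    # Close the gap after a decimal point: "0. 14" becomes "0.14".
    return re.sub(r"(\d)\.\s+(\d)", r"\1.\2", text)


def extract_ranges(text: str) -> List[Tuple[str, float, float]]:
    """Return (symbol, low, high) for each 'symbol: low-high%' pair."""
    return [(el, float(lo), float(hi))
            for el, lo, hi in PAIR.findall(normalise(text))]


print(extract_ranges("C :0. 14 〜0. 30%, Si :0. 15 〜0. 80%"))
print(extract_ranges("硅：０．１５～０．８０％"))
```

Only the name table is language specific; the digit and punctuation handling is shared across Chinese, Japanese and Korean text.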