8. @FrantaPolach 8
Why patents
● The system is broken
● Messy, slow & costly process
● USPTO data freely available
● Data structured, mostly consistent
● A chance to learn
9. @FrantaPolach 9
Data kung fu
Kung fu or Gung fu (/ˌkʌŋˈfuː/ or /ˌkʊŋˈfuː/; 功夫, Pinyin: gōngfu)
– a Chinese term referring to any study, learning, or
practice that requires patience, energy, and time to
complete
10. @FrantaPolach 10
USPTO Data
● XML, SGML key-value store
● 1975 – present
● eight different formats
● > 70GB (compressed)
● patent grants
● patent applications
● How to parse?
● Parsed data available?
– Harvard Dataverse Network
– Coleman Fung Institute for Engineering Leadership, UC Berkeley
– PATENT SEARCH TOOL by Fung Institute
– http://funginstitute.berkeley.edu/tools-and-data
11. @FrantaPolach 11
Coleman Fung Institute for Engineering Leadership, UC Berkeley
patent data process flow
The code is written in Python 2 and hosted on GitHub.
15. @FrantaPolach 15
Topic modelling
● Goal: build a topic space of the patent
documents
● i.e. compute semantic similarity
● Tools: nltk, gensim
● Data: patent abstracts, claims, descriptions
● Usage: given an invention description, find semantically similar patents
16. @FrantaPolach 16
Text preprocessing
● Have: parsed data in a relational database
● Want: data ready for semantic analysis
● Do:
– lemmatization, stemming
– collocations, Named Entity Recognition
17. @FrantaPolach 17
Text preprocessing
Lemmatization, stemming
print(gensim.utils.lemmatize("Changing the way scientists, engineers, and analysts perceive big data"))
['change/VB', 'way/NN', 'scientist/NN', 'engineer/NN', 'analyst/NN', 'perceive/VB', 'big/JJ', 'datum/NN']
i.e. group together different inflected forms of a word so they can be analysed as a single item
Collocations, Named Entity Recognition
detect a sequence of words that co-occur more often than would be expected by chance
import nltk
from nltk.collocations import TrigramCollocationFinder
from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures
e.g. entity such as "General Electric" stays a single token
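A minimal, dependency-free sketch of the collocation idea, assuming pointwise mutual information (PMI) as the association measure. The helper names `bigram_pmi` and `merge_collocations` and the sample sentence are hypothetical; this is not nltk's implementation, only an illustration of scoring word pairs that co-occur more often than chance and then merging them into single tokens.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

def merge_collocations(tokens, pairs):
    """Join each detected pair into one token, e.g. general_electric."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in pairs:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Hypothetical sample text, already lowercased and whitespace-tokenized.
text = ("general electric makes turbines and general electric sells turbines "
        "while the general idea of electric power is old")
tokens = text.split()
ranked = bigram_pmi(tokens)
top_pairs = {pair for pair, _ in ranked[:1]}
merged = merge_collocations(tokens, top_pairs)
print(merged)
```

Only the pair that repeats is merged; the standalone "general" and "electric" later in the sentence stay separate tokens.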
Stopwords
generic words, such as "six", "then", "be", "do", ...
from gensim.parsing.preprocessing import STOPWORDS
18. @FrantaPolach 18
Data streaming
Why? The data is too large to fit into RAM
itertools is your friend
import itertools
import gensim

class PatCorpus(object):
    def __init__(self, fname):
        self.fname = fname

    def __iter__(self):
        # stream the file one patent (line) at a time
        with open(self.fname) as f:
            for line in f:
                patent = line.lower().split('\t')  # tab-separated columns
                tokens = gensim.utils.tokenize(patent[5], lower=True)
                title = patent[6]
                yield title, list(tokens)

corpus_tokenized = PatCorpus('in.tsv')
print(list(itertools.islice(corpus_tokenized, 2)))
[('easy wagon/easy cart/bicycle wheel mounting brackets system', [u'a',
u'specific', u'wheel', u'mounting', u'bracket', u'and', u'a', u'versatile',
u'method', u'of', u'using', u'these', u'brackets', u'or', u'similar', u'items',
u'to', u'attach', u'bicycle', u'wheels', u'to', u'various', u'vehicle',
u'frames', u'primarily', u'made', u'of', u'wood', u'and', u'a', u'general',
u'vehicle', u'structure', u'or', u'frame', u'design', u'using', u'the',
u'brackets', u'the', u'brackets', u'are', u'flat', …
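A self-contained sketch of the same streaming pattern using only the standard library, so it can be run without the patent data. The class name `TsvCorpus`, the two-column layout, and the sample rows are assumptions made for illustration; the point is that the file is reopened on each pass and only one line is held in memory at a time.

```python
import itertools
import os
import tempfile

class TsvCorpus:
    """Stream (title, tokens) pairs from a tab-separated file, one row at a time."""

    def __init__(self, fname, text_col, title_col):
        self.fname = fname
        self.text_col = text_col
        self.title_col = title_col

    def __iter__(self):
        # The file is reopened on every pass, so the corpus can be iterated
        # repeatedly while only one line is held in memory at a time.
        with open(self.fname, encoding="utf-8") as handle:
            for line in handle:
                fields = line.rstrip("\n").split("\t")
                yield fields[self.title_col], fields[self.text_col].lower().split()

# Hypothetical two-column sample file: abstract first, then title.
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("A wheel mounting bracket\tBracket system\n")
    tmp.write("A method of configuring a link\tLink configuration\n")
    tmp.write("A third abstract\tThird title\n")

corpus = TsvCorpus(tmp.name, text_col=0, title_col=1)
first_two = list(itertools.islice(corpus, 2))  # peek without reading everything
second_pass = sum(1 for _ in corpus)           # a fresh pass sees all rows again
os.remove(tmp.name)
print(first_two)
```

Because `__iter__` returns a new generator each time, the corpus can be passed to several consumers in turn (dictionary building, vectorization, training) without materializing it as a list.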
19. @FrantaPolach 19
Vectorization
● First we create a dictionary, i.e. index text tokens by integers
id2word = gensim.corpora.Dictionary(tokens for _, tokens in corpus_tokenized)
● Create bag-of-words vectors using a streamed corpus and a dictionary
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

def tokenize(text):
    # lowercase, tokenize, and drop stopwords
    return [t for t in simple_preprocess(text) if t not in STOPWORDS]

text = "A community for developers and users of Python data tools."
bow = id2word.doc2bow(tokenize(text))
print(bow)
[(12832, 1), (28124, 1), (28301, 1), (32835, 1)]
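To make the two vectorization steps concrete, here is a small stand-in written with the standard library only. `TinyDictionary` is a hypothetical class that mimics the idea of a token-to-integer mapping and a `doc2bow` method; it is not gensim's `Dictionary` and omits features such as frequency filtering.

```python
from collections import Counter

class TinyDictionary:
    """Map each distinct token to an integer id (a stand-in, not gensim's class)."""

    def __init__(self, documents):
        self.token2id = {}
        for tokens in documents:
            for token in tokens:
                # Assign the next free id the first time a token is seen.
                self.token2id.setdefault(token, len(self.token2id))

    def doc2bow(self, tokens):
        # Count known tokens only; words never seen in training are dropped.
        counts = Counter(self.token2id[t] for t in tokens if t in self.token2id)
        return sorted(counts.items())

docs = [["wheel", "bracket", "wheel"], ["network", "protocol", "data"]]
dictionary = TinyDictionary(docs)
bow = dictionary.doc2bow(["wheel", "data", "wheel", "unseen"])
print(bow)  # [(0, 2), (4, 1)]
```

The vector is sparse: only ids that occur in the document are listed, which is why a short text maps to just a handful of `(id, count)` pairs.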
21. @FrantaPolach 21
Transforming unseen documents
text = "A method of configuring the link maximum transmission unit (MTU) in a user equipment."
1) transform text into the bag-of-words space
bow_vector = id2word.doc2bow(tokenize(text))
print([(id2word[token_id], count) for token_id, count in bow_vector])
[(u'method', 1), (u'configuring', 1), (u'link', 1), (u'maximum', 1),
(u'transmission', 1), (u'unit', 1), (u'user', 1), (u'equipment', 1)]
2) transform text into our LDA space
vector = model[bow_vector]
[(0, 0.024384265946835323), (1, 0.78941547921042373),...
3) find the document's most significant LDA topic
model.print_topic(max(vector, key=lambda item: item[1])[0])
0.022*network + 0.021*performance + 0.018*protocol + 0.015*data + 0.009*system +
0.008*internet + ...
22. @FrantaPolach 22
Evaluation
● Topic modelling is an unsupervised task, so evaluation is tricky
● Evaluate by the improvement on the intended task
● Our goal is to retrieve semantically similar documents, so we tag a set of similar documents and compare it with the results of a given semantic model
● "word intrusion" method: for each trained topic, take its first ten words,
substitute one of them with a randomly chosen word (intruder!) and let
a human detect the intruder
● Method without human intervention: split each document into two parts and check that the topics of the first half are similar to the topics of the second, while halves of different documents are dissimilar
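The split-half check can be sketched as follows. The function `split_half_scores` and the keyword-counting `toy_topics` model are hypothetical stand-ins: a real run would pass a function that maps tokens to the trained model's topic vector. Cosine similarity is assumed as the comparison measure.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def split_half_scores(documents, to_topics):
    """Return (mean intra-document similarity, mean inter-document similarity)."""
    halves = []
    for tokens in documents:
        mid = len(tokens) // 2
        halves.append((to_topics(tokens[:mid]), to_topics(tokens[mid:])))
    # Same document: first half vs. second half.
    intra = [cosine(first, second) for first, second in halves]
    # Different documents: first half of one vs. second half of another.
    inter = [cosine(halves[i][0], halves[j][1])
             for i in range(len(halves)) for j in range(len(halves)) if i != j]
    return sum(intra) / len(intra), sum(inter) / len(inter)

# Hypothetical stand-in for a trained model: two keyword-count "topics".
TOPIC_WORDS = [{"network", "protocol", "packet"}, {"wheel", "bracket", "frame"}]

def toy_topics(tokens):
    return [sum(t in words for t in tokens) for words in TOPIC_WORDS]

docs = [
    ["network", "protocol", "packet", "network"],
    ["wheel", "bracket", "frame", "wheel"],
]
intra, inter = split_half_scores(docs, toy_topics)
print(intra, inter)
```

A useful model should give a clearly higher intra-document score than inter-document score; the gap between the two can be tracked while tuning the number of topics or the preprocessing.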
23. @FrantaPolach 23
The topic space
● a topic is a distribution over a fixed vocabulary
of terms
● the idea behind Latent Dirichlet Allocation is to
statistically model documents as containing
multiple hidden semantic topics
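For reference, the standard LDA generative process can be written as below. The symbols are the customary ones and do not appear in the slides: K topics, D documents, Dirichlet hyperparameters α and β, per-topic word distributions φ, per-document topic mixtures θ, and a topic assignment z for each word w.

```latex
\begin{align*}
\varphi_k &\sim \mathrm{Dirichlet}(\beta),  & k &= 1, \dots, K \\
\theta_d  &\sim \mathrm{Dirichlet}(\alpha), & d &= 1, \dots, D \\
z_{d,n}   &\sim \mathrm{Categorical}(\theta_d) \\
w_{d,n}   &\sim \mathrm{Categorical}(\varphi_{z_{d,n}})
\end{align*}
```

Each φ_k is a topic in the sense of the first bullet (a distribution over the vocabulary), and θ_d is the document's position in the topic space.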
26. @FrantaPolach 26
Semantic distance in topic space
● Semantic distance queries
from scipy.spatial import distance
pairwise = distance.squareform(distance.pdist(matrix))
>> MemoryError
● Document indexing
from gensim.similarities import Similarity
index = Similarity('tmp/index', corpus,
num_features=corpus.num_terms)
The Similarity class splits the index into several smaller sub-indexes, so it scales well
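A rough sketch of why sharding avoids the MemoryError: instead of building the full pairwise matrix, the documents are scanned shard by shard and only the current best matches are kept. The names `iter_shards` and `top_n` and the sample vectors are hypothetical, and this in-memory version only illustrates the idea; it is not how gensim stores its sub-indexes on disk.

```python
import heapq
import math

def unit(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)

def iter_shards(vectors, shard_size):
    """Yield (offset, shard) pairs so only one shard is processed at a time."""
    shard, offset = [], 0
    for vector in vectors:
        shard.append(unit(vector))
        if len(shard) == shard_size:
            yield offset, shard
            offset += len(shard)
            shard = []
    if shard:
        yield offset, shard

def top_n(query, vectors, n=3, shard_size=2):
    """Return the n (doc_id, cosine) pairs most similar to the query."""
    q = unit(query)
    best = []  # min-heap of (similarity, doc_id), never longer than n
    for offset, shard in iter_shards(vectors, shard_size):
        for i, doc in enumerate(shard):
            item = (sum(a * b for a, b in zip(q, doc)), offset + i)
            if len(best) < n:
                heapq.heappush(best, item)
            else:
                heapq.heappushpop(best, item)  # drop the current weakest match
    return [(doc_id, sim) for sim, doc_id in sorted(best, reverse=True)]

# Hypothetical two-topic document vectors.
topic_vectors = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2], [0.5, 0.5], [0.0, 1.0]]
result = top_n([1.0, 0.0], topic_vectors, n=2)
print(result)
```

Memory use is bounded by the shard size plus the n best results, rather than growing with the square of the number of documents.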
27. @FrantaPolach 27
Semantic distance queries
query = "A method of configuring the link maximum transmission unit (MTU) in a user equipment."
1) vectorize the text into bag-of-words space
bow_vector = id2word.doc2bow(tokenize(query))
2) transform the text into our LDA space
query_lda = model[bow_vector]
3) query the LDA index, get the top 3 most similar documents
index.num_best = 3
print(index[query_lda])
[(2026, 0.91495784099521484), (32384, 0.8226358470916238), (11525,
0.80638835174553156)]
28. @FrantaPolach 28
Future
● Graph of USPTO data (Neo4j)
● Elasticsearch search and analytics
● Recommendation engine (for applications)
● Drawings analysis
● Blockchain-based smart contracts
● Artificial patent lawyer