Exploring patent space with Python
Franta Polach
@FrantaPolach
IPberry.com
PyData 2014
Outline
● Why patents
● Data kung fu
● Topic modelling
● Future
Why patents
● The system is broken
● Messy, slow & costly process
● USPTO data freely available
● Data structured, mostly consistent
● A chance to learn
Data kung fu
Kung fu or Gung fu (/ˌkʌŋˈfuː/ or /ˌkʊŋˈfuː/; 功夫, Pinyin: gōngfu)
– a Chinese term referring to any study, learning, or practice that requires patience, energy, and time to complete
USPTO Data
● XML, SGML, key-value store
● 1975 – present
● eight different formats
● > 70GB (compressed)
● patent grants
● patent applications
● How to parse?
● Parsed data available?
– Harvard Dataverse Network
– Coleman Fung Institute for Engineering Leadership, UC Berkeley
– PATENT SEARCH TOOL by Fung Institute
– http://funginstitute.berkeley.edu/tools-and-data
Coleman Fung Institute for Engineering Leadership, UC Berkeley
patent data process flow
The code is written in Python 2 and is available on GitHub.
Fung Institute SQL database schema
Entity-relationship diagram
Patents with citations, claims, applications and classes
Descriptive statistics
Topic modelling
● Goal: build a topic space of the patent
documents
● i.e. compute semantic similarity
● Tools: nltk, gensim
● Data: patent abstracts, claims, descriptions
● Usage: have invention description, find
semantically similar patents
Text preprocessing
● Have: parsed data in a relational database
● Want: data ready for semantic analysis
● Do:
– lemmatization, stemming
– collocations, Named Entity Recognition
Text preprocessing
Lemmatization, stemming
# gensim.utils.lemmatize needs the `pattern` package and was removed in gensim 4.0
print(gensim.utils.lemmatize("Changing the way scientists, engineers, and "
                             "analysts perceive big data"))
['change/VB', 'way/NN', 'scientist/NN', 'engineer/NN', 'analyst/NN', 'perceive/VB', 'big/JJ', 'datum/NN']
i.e. group together different inflected forms of a word so they can be analysed as a single item
Collocations, Named Entity Recognition
detect a sequence of words that co-occur more often than would be expected by chance
import nltk
from nltk.collocations import TrigramCollocationFinder
from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures
e.g. an entity such as "General Electric" stays a single token
Stopwords
generic words, such as "six", "then", "be", "do", ...
from gensim.parsing.preprocessing import STOPWORDS
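The collocation step above can be sketched without nltk. This is a minimal illustration, assuming pointwise mutual information (PMI) as the association measure; the function names `bigram_pmi` and `merge_collocations` and the sample sentence are hypothetical, not taken from the talk.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        # PMI > 0 means the pair co-occurs more often than chance predicts
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

def merge_collocations(tokens, collocations):
    """Join each detected pair into one token, e.g. 'general_electric'."""
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in collocations:
            merged.append("_".join(pair))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Hypothetical sample text
tokens = ("general electric builds turbines and general electric sells engines "
          "while the general public makes noise").split()
scores = bigram_pmi(tokens, min_count=2)
best = max(scores, key=scores.get)
merged = merge_collocations(tokens, {best})
print(best, merged)
```

Only the pair that repeats is merged; the standalone "general" in "general public" stays a separate token.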
Data streaming
Why? data is too large to fit into RAM
Itertools are your friend
import itertools
import gensim

class PatCorpus(object):
    """Stream (title, tokens) pairs from a tab-separated patent file."""
    def __init__(self, fname):
        self.fname = fname

    def __iter__(self):
        with open(self.fname) as lines:
            for line in lines:
                patent = line.lower().split('\t')  # fields are tab-separated
                tokens = gensim.utils.tokenize(patent[5], lower=True)
                title = patent[6]
                yield title, list(tokens)

corpus_tokenized = PatCorpus('in.tsv')
print(list(itertools.islice(corpus_tokenized, 2)))
[('easy wagon/easy cart/bicycle wheel mounting brackets system', [u'a',
u'specific', u'wheel', u'mounting', u'bracket', u'and', u'a', u'versatile',
u'method', u'of', u'using', u'these', u'brackets', u'or', u'similar', u'items',
u'to', u'attach', u'bicycle', u'wheels', u'to', u'various', u'vehicle',
u'frames', u'primarily', u'made', u'of', u'wood', u'and', u'a', u'general',
u'vehicle', u'structure', u'or', u'frame', u'design', u'using', u'the',
u'brackets', u'the', u'brackets', u'are', u'flat', …
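One reason to stream through a class with `__iter__` rather than a plain generator is that training usually needs several passes over the data, and a generator is exhausted after the first. A minimal stdlib-only sketch of that property follows; the class name `TsvColumnStream`, the two-row file and the column index are hypothetical stand-ins for the real patent dump.

```python
import itertools
import os
import tempfile

class TsvColumnStream:
    """Stream one column of a tab-separated file, one row at a time."""
    def __init__(self, path, column):
        self.path = path
        self.column = column

    def __iter__(self):
        # A fresh file handle per pass, so the stream can be iterated repeatedly
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                fields = line.rstrip("\n").split("\t")
                if len(fields) > self.column:
                    yield fields[self.column].lower().split()

# Hypothetical two-row file standing in for the parsed patent data
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("1\tWheel mounting bracket\n2\tMemory cell array\n")
    path = tmp.name

stream = TsvColumnStream(path, column=1)
first_pass = list(itertools.islice(stream, 1))  # peek at the first row only
second_pass = list(stream)                      # a full pass still sees every row
os.remove(path)
print(first_pass, second_pass)
```

Only one row is held in memory at a time, which is what makes the approach work for data larger than RAM.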
Vectorization
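● First we create a dictionary, i.e. index text tokens by integers
# PatCorpus yields (title, tokens) pairs; the dictionary needs only the tokens
id2word = gensim.corpora.Dictionary(tokens for _, tokens in corpus_tokenized)
● Create bag-of-words vectors using a streamed corpus and a dictionary
from gensim.parsing.preprocessing import STOPWORDS
from gensim.utils import simple_preprocess

def tokenize(text):
    # lowercase, tokenize and drop stopwords
    return [t for t in simple_preprocess(text) if t not in STOPWORDS]

text = "A community for developers and users of Python data tools."
bow = id2word.doc2bow(tokenize(text))
print(bow)
[(12832, 1), (28124, 1), (28301, 1), (32835, 1)]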
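To make the two vectorization steps concrete, here is a small stdlib-only sketch of what a token dictionary and a bag-of-words conversion do. `TinyDictionary` is a hypothetical illustration, not gensim's implementation; like gensim's `doc2bow`, it skips tokens that are not in the dictionary.

```python
from collections import Counter

class TinyDictionary:
    """Map each distinct token to an integer id, in order of first appearance."""
    def __init__(self, documents):
        self.token2id = {}
        for tokens in documents:
            for token in tokens:
                self.token2id.setdefault(token, len(self.token2id))
        self.id2token = {i: t for t, i in self.token2id.items()}

    def doc2bow(self, tokens):
        """Return sorted (token_id, count) pairs; unknown tokens are ignored."""
        counts = Counter(self.token2id[t] for t in tokens if t in self.token2id)
        return sorted(counts.items())

# Hypothetical toy corpus
docs = [["memory", "cell", "array"], ["signal", "output", "signal"]]
dictionary = TinyDictionary(docs)
bow = dictionary.doc2bow(["signal", "memory", "signal", "unknown"])
print(bow)  # [(0, 1), (3, 2)]
```

The vector is sparse: only ids that occur in the document are stored, which keeps memory use low for large vocabularies.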
Semantic transformations
● A transformation takes a corpus and outputs another corpus
● Choice: Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA), Random
Projections (RP), etc.
model = gensim.models.LdaModel(corpus, num_topics=100,
id2word=id2word, passes = 4, alpha=None)
_ = model.print_topics(-1)
INFO:gensim.models.ldamodel:topic #0 (0.010): 0.116*memory + 0.090*cell +
0.063*plurality + 0.054*array + 0.052*each + 0.044*bit + 0.039*cells + 0.032*address +
0.022*logic + 0.017*row
INFO:gensim.models.ldamodel:topic #1 (0.010): 0.101*speed + 0.092*lines +
0.060*performance + 0.045*characteristic + 0.036*skin + 0.028*characteristics +
0.025*suspension + 0.024*enclosure + 0.023*transducer + 0.022*loss
INFO:gensim.models.ldamodel:topic #2 (0.010): 0.141*portion + 0.049*housing +
0.031*portions + 0.028*end + 0.024*edge + 0.020*mounting + 0.018*has + 0.017*each +
0.016*formed + 0.016*arm
INFO:gensim.models.ldamodel:topic #3 (0.010): 0.224*signal + 0.099*output + 0.075*input
+ 0.057*signals + 0.043*frequency + 0.034*phase + 0.024*clock + 0.020*circuit +
0.016*amplifier + 0.014*reference
Transforming unseen documents
text = ("A method of configuring the link maximum transmission unit (MTU) in a "
        "user equipment.")
1) transform text into the bag-of-words space
bow_vector = id2word.doc2bow(tokenize(text))
print([(id2word[id], count) for id, count in bow_vector])
[(u'method', 1), (u'configuring', 1), (u'link', 1), (u'maximum', 1),
(u'transmission', 1), (u'unit', 1), (u'user', 1), (u'equipment', 1)]
2) transform text into our LDA space
vector = model[bow_vector]
[(0, 0.024384265946835323), (1, 0.78941547921042373),...
3) find the document's most significant LDA topic
model.print_topic(max(vector, key=lambda item: item[1])[0])
0.022*network + 0.021*performance + 0.018*protocol + 0.015*data + 0.009*system +
0.008*internet + ...
Evaluation
● Topic modelling is an unsupervised task ->> evaluation tricky
● Need to evaluate the improvement of the intended task
● Our goal is to retrieve semantically similar documents, thus we tag a
set of similar documents and compare them with the results of a given
semantic model
● "word intrusion" method: for each trained topic, take its first ten words,
substitute one of them with a randomly chosen word (intruder!) and let
a human detect the intruder
● Method without human intervention: split each document into two parts,
and check that topics of the first half are similar to topics of the second,
while halves of different documents are dissimilar
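The split-half check can be sketched as follows. This is an illustration under stated assumptions: `toy_topic_vector` and `TOY_TOPICS` are hypothetical stand-ins for a trained model (in practice the vector would come from something like `model[bow_vector]`), and cosine similarity is assumed as the comparison measure.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def split_half_scores(documents, to_topic_vector):
    """Return (mean same-document similarity, mean cross-document similarity).

    Needs at least two documents, otherwise there are no cross-document pairs.
    """
    halves = []
    for tokens in documents:
        middle = len(tokens) // 2
        halves.append((to_topic_vector(tokens[:middle]),
                       to_topic_vector(tokens[middle:])))
    same = [cosine(first, second) for first, second in halves]
    different = [cosine(halves[i][0], halves[j][1])
                 for i in range(len(halves))
                 for j in range(len(halves)) if i != j]
    return sum(same) / len(same), sum(different) / len(different)

# Hypothetical two-topic "model": each known word belongs to exactly one topic
TOY_TOPICS = {"memory": 0, "cell": 0, "array": 0, "bit": 0,
              "signal": 1, "output": 1, "input": 1, "clock": 1}

def toy_topic_vector(tokens):
    vector = [0.0, 0.0]
    for token in tokens:
        if token in TOY_TOPICS:
            vector[TOY_TOPICS[token]] += 1.0
    total = sum(vector)
    return [x / total for x in vector] if total else vector

documents = [
    ["memory", "cell", "array", "bit", "memory", "signal"],
    ["signal", "output", "input", "clock", "signal", "cell"],
]
same_doc, cross_doc = split_half_scores(documents, toy_topic_vector)
print(same_doc, cross_doc)
```

A model passes the check when the first number is clearly larger than the second; how large the gap should be is a judgment call that depends on the corpus.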
The topic space
● a topic is a distribution over a fixed vocabulary
of terms
● the idea behind Latent Dirichlet Allocation is to
statistically model documents as containing
multiple hidden semantic topics
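The two statements above can be read as a recipe for generating a document: pick a topic according to the document's topic mixture, then pick a word according to that topic's word distribution. The sketch below is a simplified illustration with fixed, made-up probabilities (full LDA also draws these distributions from Dirichlet priors); words with zero probability in a topic are simply left out of its dictionary.

```python
import random

def sample_document(topic_word_probs, doc_topic_probs, length, rng):
    """Generate `length` words from a mixture of topics."""
    topic_ids = list(range(len(topic_word_probs)))
    words = []
    for _ in range(length):
        # 1) choose a hidden topic for this word position
        topic = rng.choices(topic_ids, weights=doc_topic_probs)[0]
        # 2) choose a word from that topic's distribution over the vocabulary
        vocabulary = list(topic_word_probs[topic])
        weights = list(topic_word_probs[topic].values())
        words.append(rng.choices(vocabulary, weights=weights)[0])
    return words

# Hypothetical topics; the probabilities are invented for illustration
topics = [
    {"memory": 0.5, "cell": 0.3, "array": 0.2},
    {"signal": 0.6, "output": 0.25, "input": 0.15},
]
rng = random.Random(0)
document = sample_document(topics, doc_topic_probs=[0.8, 0.2], length=12, rng=rng)
print(document)
```

Training a topic model runs this story in reverse: given only the words, it infers the hidden topics and each document's mixture.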
Exploring topic space
Top words of four topics, with the numbers shown on the slide:
● memory: 188, cell: 146, plurality: 102, array: 86, bit: 71, address: 51
● speed: 178, line: 163, performance: 107, characteristic: 79, skin: 63, suspension: 45
● signal: 324, output: 142, input: 108, frequency: 62, phase: 49, clock: 35
● portion: 310, housing: 109, end: 62, edge: 53, mounting: 43, form: 35
Topics distribution
many topics in total, but each document contains just a few of them
->> sparse model
Semantic distance in topic space
● Semantic distance queries
from scipy.spatial import distance
pairwise = distance.squareform(distance.pdist(matrix))
>> MemoryError
● Document indexing
from gensim.similarities import Similarity
index = Similarity('tmp/index', corpus,
num_features=corpus.num_terms)
The Similarity class splits the index into several smaller sub-indexes
->> scales well
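A rough way to see both points: a dense pairwise matrix grows quadratically with the number of documents, while a top-k query only ever needs one shard in memory at a time. The sketch below is a stdlib-only illustration of that idea, not gensim's implementation; the function names, the shard layout and the one-million-document figure are assumptions.

```python
import heapq
import math

def dense_pairwise_bytes(n_docs, bytes_per_value=8):
    """Memory for a full n x n matrix of float64 distances."""
    return n_docs * n_docs * bytes_per_value

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_sharded(query, shards, k):
    """Scan shards one at a time, keeping only the k best (doc_id, similarity)."""
    best = []    # running top-k across all shards seen so far
    offset = 0   # global id of the first document in the current shard
    for shard in shards:
        scored = [(offset + i, cosine(query, vector))
                  for i, vector in enumerate(shard)]
        best = heapq.nlargest(k, best + scored, key=lambda item: item[1])
        offset += len(shard)
    return best

# A hypothetical one million documents would need about 8 TB as a dense matrix
print(dense_pairwise_bytes(1_000_000))

# Hypothetical topic vectors, split into two shards
shards = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [0.5, 0.5]],
]
result = top_k_sharded([1.0, 0.0], shards, k=2)
print(result)
```

Because each shard is scored independently, shards could also live on disk or be processed in parallel, with only the small top-k list merged at the end.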
Semantic distance queries
query = ("A method of configuring the link maximum transmission unit (MTU) in a "
         "user equipment.")
1) vectorize the text into bag-of-words space
bow_vector = id2word.doc2bow(tokenize(query))
2) transform the text into our LDA space
query_lda = model[bow_vector]
3) query the LDA index, get the top 3 most similar documents
index.num_best = 3
print(index[query_lda])
[(2026, 0.91495784099521484), (32384, 0.8226358470916238), (11525,
0.80638835174553156)]
Future
● Graph of USPTO data (Neo4j)
● Elasticsearch search and analytics
● Recommendation engine (for applications)
● Drawings analysis
● Blockchain based smart contracts
● Artificial patent lawyer


Kürzlich hochgeladen (20)

Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 

Franta Polach - Exploring Patent Data with Python

  • 1. Exploring patent space with python Franta Polach @FrantaPolach IPberry.com PyData 2014
  • 6. Outline
      ● Why patents
      ● Data kung fu
      ● Topic modelling
      ● Future
  • 8. Why patents
      ● The system is broken
      ● Messy, slow & costly process
      ● USPTO data freely available
      ● Data structured, mostly consistent
      ● A chance to learn
  • 9. Data kung fu
      Kung fu or Gung fu (/ˌkʌŋˈfuː/ or /ˌkʊŋˈfuː/; 功夫, Pinyin: gōngfu): a Chinese term referring to any study, learning, or practice that requires patience, energy, and time to complete
  • 10. USPTO Data
      ● XML, SGML key-value store
      ● 1975 – present
      ● eight different formats
      ● > 70 GB (compressed)
      ● patent grants
      ● patent applications
      ● How to parse?
      ● Parsed data available?
          – Harvard Dataverse Network
          – Coleman Fung Institute for Engineering Leadership, UC Berkeley
          – PATENT SEARCH TOOL by Fung Institute
          – http://funginstitute.berkeley.edu/tools-and-data
  • 11. Coleman Fung Institute for Engineering Leadership, UC Berkeley: patent data process flow
      The code is in Python 2 on GitHub.
  • 12. Fung Institute SQL database schema
  • 13. Entity-relationship diagram
      Patents with citations, claims, applications and classes
  • 15. Topic modelling
      ● Goal: build a topic space of the patent documents
      ● i.e. compute semantic similarity
      ● Tools: nltk, gensim
      ● Data: patent abstracts, claims, descriptions
      ● Usage: have invention description, find semantically similar patents
  • 16. Text preprocessing
      ● Have: parsed data in a relational database
      ● Want: data ready for semantic analysis
      ● Do:
          – lemmatization, stemming
          – collocations, Named Entity Recognition
  • 17. Text preprocessing
      Lemmatization, stemming
          print(gensim.utils.lemmatize("Changing the way scientists, engineers, and analysts perceive big data"))
          ['change/VB', 'way/NN', 'scientist/NN', 'engineer/NN', 'analyst/NN', 'perceive/VB', 'big/JJ', 'datum/NN']
      i.e. group together different inflected forms of a word so they can be analysed as a single item
      Collocations, Named Entity Recognition
      detect a sequence of words that co-occur more often than would be expected by chance
          import nltk
          from nltk.collocations import TrigramCollocationFinder
          from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures
      e.g. an entity such as "General Electric" stays a single token
      Stopwords
      generic words, such as "six", "then", "be", "do", ...
          from gensim.parsing.preprocessing import STOPWORDS
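The collocation idea above (word sequences that co-occur more often than chance) can be sketched without nltk by scoring adjacent word pairs with pointwise mutual information (PMI). This is only an illustration: the toy sentences, the tiny stopword set `STOP`, and the function name `bigram_pmi` are assumptions and do not come from the talk, which uses nltk's collocation finders and gensim's `STOPWORDS`.

```python
import math
from collections import Counter

# Hypothetical tiny stopword list; the talk uses gensim's STOPWORDS instead.
STOP = {"the", "a", "of", "and", "is", "by", "in"}

def bigram_pmi(sentences, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI)."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = [t for t in sentence.lower().split() if t not in STOP]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_pair = count / n_bi
        p_independent = (unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)
        scores[(w1, w2)] = math.log2(p_pair / p_independent)
    return scores

sentences = [
    "general electric is a maker of the turbine",
    "the turbine is made by general electric",
    "a general method of making the electric motor",
    "general electric and the motor",
]
scores = bigram_pmi(sentences)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 3))
```

A positive score means the pair appears together more often than its individual word frequencies would predict; pairs above a chosen threshold could then be merged into a single token before building the dictionary.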
  • 18. Data streaming
      Why? data is too large to fit into RAM
      Itertools are your friend
          import itertools
          import gensim

          class PatCorpus(object):
              def __init__(self, fname):
                  self.fname = fname

              def __iter__(self):
                  # read the file line by line so it never has to fit into RAM
                  for line in open(self.fname):
                      patent = line.lower().split('\t')  # tab-separated columns
                      tokens = gensim.utils.tokenize(patent[5], lower=True)
                      title = patent[6]
                      yield title, list(tokens)

          corpus_tokenized = PatCorpus('in.tsv')
          print(list(itertools.islice(corpus_tokenized, 2)))
          [('easy wagon/easy cart/bicycle wheel mounting brackets system', [u'a', u'specific', u'wheel', u'mounting', u'bracket', u'and', u'a', u'versatile', u'method', u'of', u'using', u'these', u'brackets', u'or', u'similar', u'items', u'to', u'attach', u'bicycle', u'wheels', u'to', u'various', u'vehicle', u'frames', u'primarily', u'made', u'of', u'wood', u'and', u'a', u'general', u'vehicle', u'structure', u'or', u'frame', u'design', u'using', u'the', u'brackets', u'the', u'brackets', u'are', u'flat', …
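The streaming pattern above can be tried without gensim or a real patent dump. The sketch below is a simplified stand-in, not the author's class: the name `StreamedCorpus`, the two-column in-memory TSV, and whitespace tokenization are all assumptions made for the demo.

```python
import io
import itertools

class StreamedCorpus:
    """Yield one tokenized document at a time instead of loading everything."""

    def __init__(self, open_stream, text_column=1):
        self.open_stream = open_stream  # callable returning a fresh file-like object
        self.text_column = text_column

    def __iter__(self):
        with self.open_stream() as stream:
            for line in stream:
                fields = line.rstrip("\n").split("\t")
                yield fields[self.text_column].lower().split()

# Hypothetical two-column TSV (id, abstract) held in memory for the demo.
DATA = "1\tA wheel mounting bracket\n2\tA memory cell array\n3\tA clock signal circuit\n"
corpus = StreamedCorpus(lambda: io.StringIO(DATA))

first_two = list(itertools.islice(corpus, 2))  # only two lines are consumed
second_pass = sum(1 for _ in corpus)           # the corpus can be iterated again
print(first_two, second_pass)
```

Because `__iter__` opens a fresh stream on every call, the same object can be passed to several consumers in turn (dictionary building, vectorization, model training) while holding only one line in memory at a time.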
  • 19. Vectorization
      ● First we create a dictionary, i.e. index text tokens by integers
          # PatCorpus yields (title, tokens) pairs; the dictionary needs only the tokens
          id2word = gensim.corpora.Dictionary(tokens for _, tokens in corpus_tokenized)
      ● Create bag-of-words vectors using a streamed corpus and a dictionary
          from gensim.parsing.preprocessing import STOPWORDS
          from gensim.utils import simple_preprocess

          def tokenize(text):
              return [t for t in simple_preprocess(text) if t not in STOPWORDS]

          text = "A community for developers and users of Python data tools."
          bow = id2word.doc2bow(tokenize(text))
          print(bow)
          [(12832, 1), (28124, 1), (28301, 1), (32835, 1)]
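To make the dictionary and bag-of-words steps concrete, here is a pure-Python sketch of the same idea. `TinyDictionary` is a hypothetical stand-in and not gensim's implementation; in particular, ids are assigned here in order of first appearance, which need not match the ids gensim would assign.

```python
from collections import Counter

class TinyDictionary:
    """Map each distinct token to an integer id, in order of first appearance."""

    def __init__(self, documents):
        self.token2id = {}
        for tokens in documents:
            for token in tokens:
                self.token2id.setdefault(token, len(self.token2id))

    def doc2bow(self, tokens):
        """Return sorted (token_id, count) pairs; unknown tokens are dropped."""
        counts = Counter(self.token2id[t] for t in tokens if t in self.token2id)
        return sorted(counts.items())

docs = [["memory", "cell", "array"], ["signal", "clock", "signal"]]
dictionary = TinyDictionary(docs)
bow = dictionary.doc2bow(["signal", "memory", "signal", "python"])
print(bow)
```

The word "python" never occurred in the training documents, so it has no id and is silently left out of the vector, which is also why the slide's example sentence maps to only four entries.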
  • 20. Semantic transformations
      ● A transformation takes a corpus and outputs another corpus
      ● Choice: Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA), Random Projections (RP), etc.
          model = gensim.models.LdaModel(corpus, num_topics=100, id2word=id2word, passes=4, alpha=None)
          _ = model.print_topics(-1)
          INFO:gensim.models.ldamodel:topic #0 (0.010): 0.116*memory + 0.090*cell + 0.063*plurality + 0.054*array + 0.052*each + 0.044*bit + 0.039*cells + 0.032*address + 0.022*logic + 0.017*row
          INFO:gensim.models.ldamodel:topic #1 (0.010): 0.101*speed + 0.092*lines + 0.060*performance + 0.045*characteristic + 0.036*skin + 0.028*characteristics + 0.025*suspension + 0.024*enclosure + 0.023*transducer + 0.022*loss
          INFO:gensim.models.ldamodel:topic #2 (0.010): 0.141*portion + 0.049*housing + 0.031*portions + 0.028*end + 0.024*edge + 0.020*mounting + 0.018*has + 0.017*each + 0.016*formed + 0.016*arm
          INFO:gensim.models.ldamodel:topic #3 (0.010): 0.224*signal + 0.099*output + 0.075*input + 0.057*signals + 0.043*frequency + 0.034*phase + 0.024*clock + 0.020*circuit + 0.016*amplifier + 0.014*reference
  • 21. Transforming unseen documents
          text = "A method of configuring the link maximum transmission unit (MTU) in a user equipment."
      1) transform text into the bag-of-words space
          bow_vector = id2word.doc2bow(tokenize(text))
          print([(id2word[token_id], count) for token_id, count in bow_vector])
          [(u'method', 1), (u'configuring', 1), (u'link', 1), (u'maximum', 1), (u'transmission', 1), (u'unit', 1), (u'user', 1), (u'equipment', 1)]
      2) transform text into our LDA space
          vector = model[bow_vector]
          [(0, 0.024384265946835323), (1, 0.78941547921042373),...
      3) find the document's most significant LDA topic
          model.print_topic(max(vector, key=lambda item: item[1])[0])
          0.022*network + 0.021*performance + 0.018*protocol + 0.015*data + 0.009*system + 0.008*internet + ...
  • 22. Evaluation
      ● Topic modelling is an unsupervised task, so evaluation is tricky
      ● Need to evaluate the improvement on the intended task
      ● Our goal is to retrieve semantically similar documents, so we tag a set of similar documents and compare it with the results of a given semantic model
      ● "Word intrusion" method: for each trained topic, take its first ten words, substitute one of them with a randomly chosen word (the intruder) and let a human detect the intruder
      ● Method without human intervention: split each document into two parts and check that the topics of the first half are similar to the topics of the second half, while halves of different documents are dissimilar
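The "word intrusion" test described above could be prepared roughly as follows. This is a sketch under stated assumptions: the helper name `make_intrusion_task` is hypothetical, only six top words per topic are used for brevity, and drawing the intruder from another topic's top words is a choice made here (the slide only says "a randomly chosen word").

```python
import random

def make_intrusion_task(topic_words, other_words, rng):
    """Replace one top word of a topic with an intruder and shuffle the list."""
    candidates = [w for w in other_words if w not in topic_words]
    intruder = rng.choice(candidates)
    shown = list(topic_words)
    shown[rng.randrange(len(shown))] = intruder  # overwrite one original word
    rng.shuffle(shown)                           # hide the intruder's position
    return shown, intruder

# Top words taken from two of the topics printed earlier in the talk.
memory_topic = ["memory", "cell", "plurality", "array", "each", "bit"]
signal_topic = ["signal", "output", "input", "signals", "frequency", "phase"]

rng = random.Random(0)  # fixed seed so the demo is repeatable
shown, intruder = make_intrusion_task(memory_topic, signal_topic, rng)
print(shown, "-> intruder:", intruder)
```

If human raters reliably pick out the intruder, the topic's remaining words form a coherent group; if their guesses are close to random, the topic is probably not interpretable.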
  • 23. The topic space
      ● a topic is a distribution over a fixed vocabulary of terms
      ● the idea behind Latent Dirichlet Allocation is to statistically model documents as containing multiple hidden semantic topics
  • 24. Exploring topic space
      memory: 188, cell: 146, plurality: 102, array: 86, bit: 71, address: 51
      speed: 178, line: 163, performance: 107, characteristic: 79, skin: 63, suspension: 45
      signal: 324, output: 142, input: 108, frequency: 62, phase: 49, clock: 35
      portion: 310, housing: 109, end: 62, edge: 53, mounting: 43, form: 35
  • 25. Topics distribution
      many topics in total, but each document contains just a few of them, hence a sparse model
  • 26. Semantic distance in topic space
      ● Semantic distance queries
          from scipy.spatial import distance
          pairwise = distance.squareform(distance.pdist(matrix))
          >> MemoryError
      ● Document indexing
          from gensim.similarities import Similarity
          index = Similarity('tmp/index', corpus, num_features=corpus.num_terms)
      The Similarity class splits the index into several smaller sub-indexes, so it scales well
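A back-of-the-envelope calculation shows why the all-pairs approach runs out of memory. `pdist` returns one float64 distance per unordered pair of rows, and `squareform` expands that into a full n × n matrix. The helper name `pairwise_bytes` and the document count of 100,000 are assumptions for illustration; the talk does not state the corpus size.

```python
def pairwise_bytes(n_docs, bytes_per_value=8):
    """Memory needed for all pairwise distances stored as float64."""
    condensed = n_docs * (n_docs - 1) // 2 * bytes_per_value  # pdist output
    square = n_docs * n_docs * bytes_per_value                # squareform output
    return condensed, square

# Hypothetical corpus size, chosen only to show the order of magnitude.
condensed, square = pairwise_bytes(100_000)
print(f"condensed: {condensed / 1e9:.1f} GB, square: {square / 1e9:.1f} GB")
```

Memory grows quadratically with the number of documents, which is the motivation for querying an index one document at a time instead of materializing every pairwise distance.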
  • 27. Semantic distance queries
          query = "A method of configuring the link maximum transmission unit (MTU) in a user equipment."
      1) vectorize the text into bag-of-words space
          bow_vector = id2word.doc2bow(tokenize(query))
      2) transform the text into our LDA space
          query_lda = model[bow_vector]
      3) query the LDA index, get the top 3 most similar documents
          index.num_best = 3
          print(index[query_lda])
          [(2026, 0.91495784099521484), (32384, 0.8226358470916238), (11525, 0.80638835174553156)]
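What a top-N similarity query computes can be sketched in plain Python with cosine similarity over sparse topic vectors. This is an illustration only, not gensim's `Similarity` internals: the function names `cosine` and `top_matches`, the dict-based vectors, and the four toy documents are all assumptions.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {topic_id: weight}."""
    dot = sum(weight * v.get(topic, 0.0) for topic, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def top_matches(query, documents, k=3):
    """Return the k (doc_id, similarity) pairs with the highest similarity."""
    scored = ((doc_id, cosine(query, vec)) for doc_id, vec in enumerate(documents))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Hypothetical topic vectors; each document touches only a few topics.
documents = [
    {0: 0.9, 3: 0.1},
    {1: 0.8, 2: 0.2},
    {1: 0.7, 3: 0.3},
    {2: 1.0},
]
query = {1: 0.79, 0: 0.02}
matches = top_matches(query, documents, k=2)
print(matches)
```

Only one query-versus-document score exists at a time and `heapq.nlargest` keeps just the best k, so memory stays proportional to k rather than to the square of the corpus size.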
  • 28. Future
      ● Graph of USPTO data (Neo4j)
      ● Elasticsearch search and analytics
      ● Recommendation engine (for applications)
      ● Drawings analysis
      ● Blockchain based smart contracts
      ● Artificial patent lawyer