Exploring patent space
with python
Franta Polach
@FrantaPolach
IPberry.com
PyData 2014
Outline
● Why patents
● Data kung fu
● Topic modelling
● Future
Why patents
● The system is broken
● Messy, slow & costly process
● USPTO data freely available
● Data structured, mostly consistent
● A chance to learn
Data kung fu
Kung fu or Gung fu (/ˌkʌŋˈfuː/ or /ˌkʊŋˈfuː/; 功夫, Pinyin: gōngfu)
– a Chinese term referring to any study, learning, or practice that
requires patience, energy, and time to complete
USPTO Data
● XML, SGML key-value store
● 1975 – present
● eight different formats
● > 70GB (compressed)
● patent grants
● patent applications
● How to parse?
● Parsed data available?
– Harvard Dataverse Network
– Coleman Fung Institute for Engineering Leadership, UC Berkeley
– PATENT SEARCH TOOL by Fung Institute
– http://funginstitute.berkeley.edu/tools-and-data
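A streaming parse avoids loading a multi-gigabyte file into memory. A rough sketch with the standard library, using made-up element names rather than any actual USPTO DTD (the real files come in several schemas depending on the year):

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical, simplified grant records; element names are illustrative only.
xml_data = """<patents>
  <patent><doc-number>1001</doc-number><abstract>A bicycle wheel bracket.</abstract></patent>
  <patent><doc-number>1002</doc-number><abstract>A memory cell array.</abstract></patent>
</patents>"""

def iter_patents(source):
    """Stream (doc_number, abstract) pairs without building the whole tree."""
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "patent":
            yield (elem.findtext("doc-number"), elem.findtext("abstract"))
            elem.clear()  # free each record's subtree as we go

records = list(iter_patents(io.StringIO(xml_data)))
print(records[0])  # ('1001', 'A bicycle wheel bracket.')
```

The same `iterparse` pattern scales to the real archives by passing an open file (or a decompressing wrapper) instead of the inline string.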
Coleman Fung Institute for Engineering Leadership, UC Berkeley
patent data process flow
The code is in Python 2 on GitHub.
Fung Institute SQL database schema
Entity-relationship diagram
Patents with citations, claims, applications and classes
Descriptive statistics
Topic modelling
● Goal: build a topic space of the patent
documents
● i.e. compute semantic similarity
● Tools: nltk, gensim
● Data: patent abstracts, claims, descriptions
● Usage: have invention description, find
semantically similar patents
Text preprocessing
● Have: parsed data in a relational database
● Want: data ready for semantic analysis
● Do:
– lemmatization, stemming
– collocations, Named Entity Recognition
Text preprocessing
Lemmatization, stemming
print(gensim.utils.lemmatize("Changing the way scientists, engineers, and analysts perceive big data"))
['change/VB', 'way/NN', 'scientist/NN', 'engineer/NN', 'analyst/NN', 'perceive/VB', 'big/JJ', 'datum/NN']
i.e. group together different inflected forms of a word so they can be analysed as a single item
Collocations, Named Entity Recognition
detect a sequence of words that co-occur more often than would be expected by chance
import nltk
from nltk.collocations import TrigramCollocationFinder
from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures
e.g. entity such as "General Electric" stays a single token
Stopwords
generic words, such as "six", "then", "be", "do"....
from gensim.parsing.preprocessing import STOPWORDS
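As a toy illustration of the collocation idea (plain standard library, not the nltk finders imported above): count bigrams and promote those that co-occur more often than once in a tiny corpus. The documents and threshold here are made up for the example.

```python
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "to", "in"}  # tiny stand-in list

def preprocess(text):
    """Lowercase, tokenize on whitespace, drop stopwords."""
    return [t for t in text.lower().split() if t not in STOPWORDS]

docs = [
    "General Electric filed a patent application",
    "the General Electric turbine patent",
    "a turbine blade design",
]
tokens = [preprocess(d) for d in docs]

# Naive collocation detection: bigrams seen more than once become candidates
# for merging into a single token (as "General Electric" should be).
bigrams = Counter((a, b) for ts in tokens for a, b in zip(ts, ts[1:]))
collocations = {bg for bg, n in bigrams.items() if n > 1}
print(collocations)  # {('general', 'electric')}
```

The real nltk finders use statistical association measures (PMI, chi-squared) instead of a raw count threshold, but the pipeline shape is the same.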
Data streaming
Why? data is too large to fit into RAM
Itertools are your friend
class PatCorpus(object):
    def __init__(self, fname):
        self.fname = fname

    def __iter__(self):
        for line in open(self.fname):
            patent = line.lower().split('\t')  # tab-separated fields
            tokens = gensim.utils.tokenize(patent[5], lower=True)
            title = patent[6]
            yield title, list(tokens)
corpus_tokenized = PatCorpus('in.tsv')
print(list(itertools.islice(corpus_tokenized, 2)))
[('easy wagon/easy cart/bicycle wheel mounting brackets system', [u'a',
u'specific', u'wheel', u'mounting', u'bracket', u'and', u'a', u'versatile',
u'method', u'of', u'using', u'these', u'brackets', u'or', u'similar', u'items',
u'to', u'attach', u'bicycle', u'wheels', u'to', u'various', u'vehicle',
u'frames', u'primarily', u'made', u'of', u'wood', u'and', u'a', u'general',
u'vehicle', u'structure', u'or', u'frame', u'design', u'using', u'the',
u'brackets', u'the', u'brackets', u'are', u'flat', …
Vectorization
● First we create a dictionary, i.e. index text tokens by integers
id2word = gensim.corpora.Dictionary(corpus_tokenized)
● Create bag-of-words vectors using a streamed corpus and a
dictionary
text = "A community for developers and users of Python data tools."
bow = id2word.doc2bow(tokenize(text))
print(bow)
[(12832, 1), (28124, 1), (28301, 1), (32835, 1)]
def tokenize(text):
    return [t for t in simple_preprocess(text) if t not in STOPWORDS]
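To make the dictionary/bag-of-words mapping concrete, here is a minimal stand-in for gensim's Dictionary (toy code for illustration, not the real API or its integer ids):

```python
# Minimal sketch of what Dictionary/doc2bow does: map each token to an
# integer id, then represent a document as sparse (id, count) pairs.
class TinyDictionary:
    def __init__(self, documents):
        self.token2id = {}
        for doc in documents:
            for token in doc:
                self.token2id.setdefault(token, len(self.token2id))

    def doc2bow(self, tokens):
        """Return sparse (token_id, count) pairs, ignoring unknown tokens."""
        counts = {}
        for t in tokens:
            if t in self.token2id:
                tid = self.token2id[t]
                counts[tid] = counts.get(tid, 0) + 1
        return sorted(counts.items())

corpus = [["wheel", "bracket", "bicycle"], ["memory", "cell", "memory"]]
d = TinyDictionary(corpus)
print(d.doc2bow(["memory", "memory", "wheel", "unseen"]))  # [(0, 1), (3, 2)]
```

Note that tokens never seen during dictionary construction are silently dropped, which is also gensim's behaviour.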
Semantic transformations
● A transformation takes a corpus and outputs another corpus
● Choice: Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA), Random
Projections (RP), etc.
model = gensim.models.LdaModel(corpus, num_topics=100,
id2word=id2word, passes=4, alpha=None)
_ = model.print_topics(-1)
INFO:gensim.models.ldamodel:topic #0 (0.010): 0.116*memory + 0.090*cell +
0.063*plurality + 0.054*array + 0.052*each + 0.044*bit + 0.039*cells + 0.032*address +
0.022*logic + 0.017*row
INFO:gensim.models.ldamodel:topic #1 (0.010): 0.101*speed + 0.092*lines +
0.060*performance + 0.045*characteristic + 0.036*skin + 0.028*characteristics +
0.025*suspension + 0.024*enclosure + 0.023*transducer + 0.022*loss
INFO:gensim.models.ldamodel:topic #2 (0.010): 0.141*portion + 0.049*housing +
0.031*portions + 0.028*end + 0.024*edge + 0.020*mounting + 0.018*has + 0.017*each +
0.016*formed + 0.016*arm
INFO:gensim.models.ldamodel:topic #3 (0.010): 0.224*signal + 0.099*output + 0.075*input
+ 0.057*signals + 0.043*frequency + 0.034*phase + 0.024*clock + 0.020*circuit +
0.016*amplifier + 0.014*reference
Transforming unseen documents
text = "A method of configuring the link maximum transmission unit (MTU) in a
user equipment."
1) transform text into the bag-of-words space
bow_vector = id2word.doc2bow(tokenize(text))
print([(id2word[id], count) for id, count in bow_vector])
[(u'method', 1), (u'configuring', 1), (u'link', 1), (u'maximum', 1),
(u'transmission', 1), (u'unit', 1), (u'user', 1), (u'equipment', 1)]
2) transform text into our LDA space
vector = model[bow_vector]
[(0, 0.024384265946835323), (1, 0.78941547921042373),...
3) find the document's most significant LDA topic
model.print_topic(max(vector, key=lambda item: item[1])[0])
0.022*network + 0.021*performance + 0.018*protocol + 0.015*data + 0.009*system +
0.008*internet + ...
Evaluation
● Topic modelling is an unsupervised task, so evaluation is tricky
● We need to evaluate the improvement on the intended task
● Our goal is to retrieve semantically similar documents, so we tag a set
of similar documents and compare them against the results of a given
semantic model
● "Word intrusion" method: for each trained topic, take its first ten words,
substitute one of them with a randomly chosen word (the intruder!) and let
a human detect the intruder
● A method without human intervention: split each document into two halves
and check that the topics of the first half are similar to the topics of
the second, while halves of different documents are dissimilar
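The split-document check can be sketched without any human in the loop. Here cosine similarity over raw term counts stands in for the topic vectors, purely for illustration; the documents are made up:

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) \
         * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def halves(text):
    """Split a document into two halves, each as a bag of term counts."""
    toks = text.lower().split()
    mid = len(toks) // 2
    return Counter(toks[:mid]), Counter(toks[mid:])

doc1 = "memory cell array stores a bit in each memory cell of the array"
doc2 = "the suspension transducer measures skin characteristics and suspension loss"
a1, b1 = halves(doc1)
a2, b2 = halves(doc2)

same = cosine(a1, b1)   # halves of the same document
cross = cosine(a1, b2)  # halves of different documents
print(same > cross)     # True for these two documents
```

With a trained model one would compare LDA topic vectors of the halves instead of raw counts, but the pass/fail logic is the same.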
The topic space
● a topic is a distribution over a fixed vocabulary
of terms
● the idea behind Latent Dirichlet Allocation is to
statistically model documents as containing
multiple hidden semantic topics
Exploring topic space
Top terms per topic (term: frequency):
● memory: 188, cell: 146, plurality: 102, array: 86, bit: 71, address: 51
● speed: 178, line: 163, performance: 107, characteristic: 79, skin: 63, suspension: 45
● signal: 324, output: 142, input: 108, frequency: 62, phase: 49, clock: 35
● portion: 310, housing: 109, end: 62, edge: 53, mounting: 43, form: 35
Topics distribution
Many topics in total, but each document contains just a few of them,
i.e. a sparse model
Semantic distance in topic space
● Semantic distance queries
from scipy.spatial import distance
pairwise = distance.squareform(distance.pdist(matrix))
>> MemoryError
● Document indexing
from gensim.similarities import Similarity
index = Similarity('tmp/index', corpus,
num_features=corpus.num_terms)
The Similarity class splits the index into several smaller sub-indexes,
so it scales well
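Why the index scales where pdist does not: a full pairwise matrix needs N² entries, while a streamed top-k query touches one document vector at a time. A minimal sketch of that idea (plain Python with dense toy vectors, not gensim's actual implementation):

```python
import heapq
import math

def cos(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv) if nu and nv else 0.0

def top_k(query, corpus_vectors, k=3):
    """Stream over the corpus, keeping only the k best (score, doc_id) pairs."""
    return heapq.nlargest(k, ((cos(query, vec), i)
                              for i, vec in enumerate(corpus_vectors)))

corpus = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
results = top_k([1.0, 0.0], corpus, k=2)
print(results)  # doc 0 first, then doc 1
```

Memory use is O(k) per query regardless of corpus size, which is why the streamed index survives where the N×N `squareform` matrix raises MemoryError.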
Semantic distance queries
query = "A method of configuring the link maximum transmission unit (MTU) in a
user equipment."
1) vectorize the text into bag-of-words space
bow_vector = id2word.doc2bow(tokenize(query))
2) transform the text into our LDA space
query_lda = model[bow_vector]
3) query the LDA index, get the top 3 most similar documents
index.num_best = 3
print(index[query_lda])
[(2026, 0.91495784099521484), (32384, 0.8226358470916238), (11525,
0.80638835174553156)]
Future
● Graph of USPTO data (Neo4j)
● Elasticsearch search and analytics
● Recommendation engine (for applications)
● Drawings analysis
● Blockchain-based smart contracts
● Artificial patent lawyer

Weitere ähnliche Inhalte

Was ist angesagt?

Grape generative fuzzing
Grape generative fuzzingGrape generative fuzzing
Grape generative fuzzing
FFRI, Inc.
 
Python and HDF5: Overview
Python and HDF5: OverviewPython and HDF5: Overview
Python and HDF5: Overview
andrewcollette
 

Was ist angesagt? (18)

Python for Dummies
Python for DummiesPython for Dummies
Python for Dummies
 
Functional concepts in C#
Functional concepts in C#Functional concepts in C#
Functional concepts in C#
 
Grape generative fuzzing
Grape generative fuzzingGrape generative fuzzing
Grape generative fuzzing
 
Compact ordered dict__k_lab_meeting_
Compact ordered dict__k_lab_meeting_Compact ordered dict__k_lab_meeting_
Compact ordered dict__k_lab_meeting_
 
Python for Linux System Administration
Python for Linux System AdministrationPython for Linux System Administration
Python for Linux System Administration
 
Learn How to Master Solr1 4
Learn How to Master Solr1 4Learn How to Master Solr1 4
Learn How to Master Solr1 4
 
All'ombra del Leviatano: Filesystem in Userspace
All'ombra del Leviatano: Filesystem in UserspaceAll'ombra del Leviatano: Filesystem in Userspace
All'ombra del Leviatano: Filesystem in Userspace
 
Python and HDF5: Overview
Python and HDF5: OverviewPython and HDF5: Overview
Python and HDF5: Overview
 
The Ring programming language version 1.8 book - Part 116 of 202
The Ring programming language version 1.8 book - Part 116 of 202The Ring programming language version 1.8 book - Part 116 of 202
The Ring programming language version 1.8 book - Part 116 of 202
 
Schizophrenic files
Schizophrenic filesSchizophrenic files
Schizophrenic files
 
Rust vs C++
Rust vs C++Rust vs C++
Rust vs C++
 
Lz77 by ayush
Lz77 by ayushLz77 by ayush
Lz77 by ayush
 
Seminar Hacking & Security Analysis
Seminar Hacking & Security AnalysisSeminar Hacking & Security Analysis
Seminar Hacking & Security Analysis
 
Дмитрий Нестерук, Паттерны проектирования в XXI веке
Дмитрий Нестерук, Паттерны проектирования в XXI векеДмитрий Нестерук, Паттерны проектирования в XXI веке
Дмитрий Нестерук, Паттерны проектирования в XXI веке
 
Python cheat-sheet
Python cheat-sheetPython cheat-sheet
Python cheat-sheet
 
Programming Under Linux In Python
Programming Under Linux In PythonProgramming Under Linux In Python
Programming Under Linux In Python
 
C++17 std::filesystem - Overview
C++17 std::filesystem - OverviewC++17 std::filesystem - Overview
C++17 std::filesystem - Overview
 
Slide cipher based encryption
Slide cipher based encryptionSlide cipher based encryption
Slide cipher based encryption
 

Ähnlich wie Franta Polach - Exploring Patent Data with Python

Php Extensions for Dummies
Php Extensions for DummiesPhp Extensions for Dummies
Php Extensions for Dummies
Elizabeth Smith
 
Rocky Nevin's presentation at eComm 2008
Rocky Nevin's presentation at eComm 2008Rocky Nevin's presentation at eComm 2008
Rocky Nevin's presentation at eComm 2008
eComm2008
 

Ähnlich wie Franta Polach - Exploring Patent Data with Python (20)

Industry - Program analysis and verification - Type-preserving Heap Profiler ...
Industry - Program analysis and verification - Type-preserving Heap Profiler ...Industry - Program analysis and verification - Type-preserving Heap Profiler ...
Industry - Program analysis and verification - Type-preserving Heap Profiler ...
 
Python 3.6 Features 20161207
Python 3.6 Features 20161207Python 3.6 Features 20161207
Python 3.6 Features 20161207
 
These questions will be a bit advanced level 2
These questions will be a bit advanced level 2These questions will be a bit advanced level 2
These questions will be a bit advanced level 2
 
Daniel Krasner - High Performance Text Processing with Rosetta
Daniel Krasner - High Performance Text Processing with Rosetta Daniel Krasner - High Performance Text Processing with Rosetta
Daniel Krasner - High Performance Text Processing with Rosetta
 
CrateDB 101: Sensor data
CrateDB 101: Sensor dataCrateDB 101: Sensor data
CrateDB 101: Sensor data
 
Get Started with CrateDB: Sensor Data
Get Started with CrateDB: Sensor DataGet Started with CrateDB: Sensor Data
Get Started with CrateDB: Sensor Data
 
posix.pdf
posix.pdfposix.pdf
posix.pdf
 
Gpu programming with java
Gpu programming with javaGpu programming with java
Gpu programming with java
 
Project Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare MetalProject Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare Metal
 
Php Extensions for Dummies
Php Extensions for DummiesPhp Extensions for Dummies
Php Extensions for Dummies
 
EuroPython 2020 - Speak python with devices
EuroPython 2020 - Speak python with devicesEuroPython 2020 - Speak python with devices
EuroPython 2020 - Speak python with devices
 
Provenance for Data Munging Environments
Provenance for Data Munging EnvironmentsProvenance for Data Munging Environments
Provenance for Data Munging Environments
 
Rocky Nevin's presentation at eComm 2008
Rocky Nevin's presentation at eComm 2008Rocky Nevin's presentation at eComm 2008
Rocky Nevin's presentation at eComm 2008
 
Threads and multi threading
Threads and multi threadingThreads and multi threading
Threads and multi threading
 
Exploitation of counter overflows in the Linux kernel
Exploitation of counter overflows in the Linux kernelExploitation of counter overflows in the Linux kernel
Exploitation of counter overflows in the Linux kernel
 
GE3151_PSPP_UNIT_5_Notes
GE3151_PSPP_UNIT_5_NotesGE3151_PSPP_UNIT_5_Notes
GE3151_PSPP_UNIT_5_Notes
 
Adaptive Thread Scheduling Techniques for Improving Scalability of Software T...
Adaptive Thread Scheduling Techniques for Improving Scalability of Software T...Adaptive Thread Scheduling Techniques for Improving Scalability of Software T...
Adaptive Thread Scheduling Techniques for Improving Scalability of Software T...
 
Getting started with Clojure
Getting started with ClojureGetting started with Clojure
Getting started with Clojure
 
Dynamic Python
Dynamic PythonDynamic Python
Dynamic Python
 
Intro to Python
Intro to PythonIntro to Python
Intro to Python
 

Mehr von PyData

Mehr von PyData (20)

Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
 
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshUnit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
 
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiThe TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
 
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
 
Deploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerDeploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne Bauer
 
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaGraph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
 
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
 
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroRESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
 
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
 
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven LottAvoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
 
Words in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroWords in Space - Rebecca Bilbro
Words in Space - Rebecca Bilbro
 
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
 
Pydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica PuertoPydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica Puerto
 
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
 
Extending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will AydExtending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will Ayd
 
Measuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen HooverMeasuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen Hoover
 
What's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper SeaboldWhat's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper Seabold
 
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
 
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardSolving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
 
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
 

Kürzlich hochgeladen

Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
nirzagarg
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
chadhar227
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
HyderabadDolls
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
wsppdmt
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
gajnagarg
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Bertram Ludäscher
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
gajnagarg
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
nirzagarg
 
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
HyderabadDolls
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
gajnagarg
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
ranjankumarbehera14
 

Kürzlich hochgeladen (20)

Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
 
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 
20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf
 
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 

Franta Polach - Exploring Patent Data with Python

  • 1. Exploring patent space with python Franta Polach @FrantaPolach IPberry.com PyData 2014
  • 6. @FrantaPolach 6 Outline ● Why patents ● Data kung fu ● Topic modelling ● Future
  • 8. @FrantaPolach 8 Why patents ● The system is broken ● Messy, slow & costly process ● USPTO data freely available ● Data structured, mostly consistent ● A chance to learn
  • 9. @FrantaPolach 9 Data kung fu Kung fu or Gung fu (/ˌkʌŋˈfuː/ or /ˌkʊŋˈfuː/; 功夫 , Pinyin: gōngfu) – a Chinese term referring to any study, learning, or practice that requires patience, energy, and time to complete
  • 10. @FrantaPolach 10 USPTO Data ● xml, SGML key-value store ● 1975 – present ● eight different formats ● > 70GB (compressed) ● patent grants ● patent applications ● How to parse? ● Parsed data available? – Harvard Dataverse Network – Coleman Fung Institute for Engineering Leadership, UC Berkeley – PATENT SEARCH TOOL by Fung Institute – http://funginstitute.berkeley.edu/tools-and-data
  • 11. @FrantaPolach 11 Coleman Fung Institute for Engineering Leadership, UC Berkeley patent data process flow The code is in Python 2 on Github.
  • 12. @FrantaPolach 12 Fung Institute SQL database schema
  • 13. @FrantaPolach 13 Entity-relationship diagram Patents with citations, claims, applications and classes
  • 15. @FrantaPolach 15 Topic modelling ● Goal: build a topic space of the patent documents ● i.e. compute semantic similarity ● Tools: nltk, gensim ● Data: patent abstracts, claims, descriptions ● Usage: have invention description, find semantically similar patents
  • 16. @FrantaPolach 16 Text preprocessing ● Have: parsed data in a relational database ● Want: data ready for semantic analysis ● Do: – lemmatization, stemming – collocations, Named Entity Recognition
  • 17. @FrantaPolach 17 Text preprocessing Lemmatization, stemming print(gensim.utils.lemmatize("Changing the way scientists, engineers, and analysts perceive big data")) ['change/VB', 'way/NN', 'scientist/NN', 'engineer/NN', 'analyst/NN', 'perceive/VB', 'big/JJ', 'datum/NN'] i.e. group together different inflected forms of a word so they can be analysed as a single item Collocations, Named Entity Recognition detect a sequence of words that co-occur more often than would be expected by chance import nltk from nltk.collocations import TrigramCollocationFinder from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures e.g. entity such as "General Electric" stays a single token Stopwords generic words, such as "six", "then", "be", "do".... from gensim.parsing.preprocessing import STOPWORDS
  • 18. @FrantaPolach 18 Data streaming
  Why? The data is too large to fit into RAM. Itertools are your friend.
  class PatCorpus(object):
      def __init__(self, fname):
          self.fname = fname
      def __iter__(self):
          for line in open(self.fname):
              patent = line.lower().split('\t')
              tokens = gensim.utils.tokenize(patent[5], lower=True)
              title = patent[6]
              yield title, list(tokens)
  corpus_tokenized = PatCorpus('in.tsv')
  print(list(itertools.islice(corpus_tokenized, 2)))
  [('easy wagon/easy cart/bicycle wheel mounting brackets system', [u'a', u'specific', u'wheel', u'mounting', u'bracket', u'and', u'a', u'versatile', u'method', u'of', u'using', u'these', u'brackets', u'or', u'similar', u'items', u'to', u'attach', u'bicycle', u'wheels', u'to', u'various', u'vehicle', u'frames', u'primarily', u'made', u'of', u'wood', u'and', u'a', u'general', u'vehicle', u'structure', u'or', u'frame', u'design', u'using', u'the', u'brackets', u'the', u'brackets', u'are', u'flat', …
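The same streaming idea in a self-contained sketch — the two-column TSV layout below is made up for illustration and much simpler than the real Fung Institute export:

```python
import itertools
import os
import tempfile

class TsvCorpus(object):
    """Stream (title, tokens) pairs from a TSV file, one document per line."""
    def __init__(self, fname):
        self.fname = fname

    def __iter__(self):
        with open(self.fname) as f:
            for line in f:
                # Hypothetical layout: title TAB abstract (one document per line).
                title, abstract = line.rstrip("\n").split("\t")
                yield title, abstract.lower().split()

# Build a tiny two-document corpus on disk.
path = os.path.join(tempfile.mkdtemp(), "patents.tsv")
with open(path, "w") as f:
    f.write("Wheel bracket\tA wheel mounting bracket for bicycles\n")
    f.write("MTU config\tConfiguring the link maximum transmission unit\n")

corpus = TsvCorpus(path)
print(list(itertools.islice(corpus, 1)))
```

Because tokenization happens inside `__iter__`, only one document is ever held in memory, and the corpus can be iterated over as many times as the downstream model needs.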
  • 19. @FrantaPolach 19 Vectorization
  ● First we create a dictionary, i.e. index the text tokens by integers
  id2word = gensim.corpora.Dictionary(corpus_tokenized)
  ● Create bag-of-words vectors using a streamed corpus and a dictionary
  text = "A community for developers and users of Python data tools."
  bow = id2word.doc2bow(tokenize(text))
  print(bow)
  [(12832, 1), (28124, 1), (28301, 1), (32835, 1)]
  def tokenize(text):
      return [t for t in simple_preprocess(text) if t not in STOPWORDS]
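What `Dictionary` and `doc2bow` do under the hood can be sketched in plain Python — a simplified stand-in, with ids assigned in first-seen order rather than gensim's internal ordering:

```python
from collections import Counter

class TinyDictionary:
    """Minimal stand-in for gensim.corpora.Dictionary: token -> integer id."""
    def __init__(self, documents):
        self.token2id = {}
        for doc in documents:
            for token in doc:
                if token not in self.token2id:
                    self.token2id[token] = len(self.token2id)

    def doc2bow(self, doc):
        """Return sparse (token_id, count) pairs; unknown tokens are dropped."""
        counts = Counter(t for t in doc if t in self.token2id)
        return sorted((self.token2id[t], n) for t, n in counts.items())

docs = [["python", "data", "tools"], ["python", "community"]]
d = TinyDictionary(docs)
print(d.doc2bow(["python", "python", "data", "unknown"]))
# → [(0, 2), (1, 1)]
```

The sparse (id, count) pairs are exactly the bag-of-words representation the LDA model consumes; note that "unknown" silently disappears, just as out-of-vocabulary tokens do in gensim.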
  • 20. @FrantaPolach 20 Semantic transformations
  ● A transformation takes a corpus and outputs another corpus
  ● Choice: Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA), Random Projections (RP), etc.
  model = gensim.models.LdaModel(corpus, num_topics=100, id2word=id2word, passes=4, alpha=None)
  _ = model.print_topics(-1)
  INFO:gensim.models.ldamodel:topic #0 (0.010): 0.116*memory + 0.090*cell + 0.063*plurality + 0.054*array + 0.052*each + 0.044*bit + 0.039*cells + 0.032*address + 0.022*logic + 0.017*row
  INFO:gensim.models.ldamodel:topic #1 (0.010): 0.101*speed + 0.092*lines + 0.060*performance + 0.045*characteristic + 0.036*skin + 0.028*characteristics + 0.025*suspension + 0.024*enclosure + 0.023*transducer + 0.022*loss
  INFO:gensim.models.ldamodel:topic #2 (0.010): 0.141*portion + 0.049*housing + 0.031*portions + 0.028*end + 0.024*edge + 0.020*mounting + 0.018*has + 0.017*each + 0.016*formed + 0.016*arm
  INFO:gensim.models.ldamodel:topic #3 (0.010): 0.224*signal + 0.099*output + 0.075*input + 0.057*signals + 0.043*frequency + 0.034*phase + 0.024*clock + 0.020*circuit + 0.016*amplifier + 0.014*reference
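What these trained topics mean for a document can be illustrated with toy data: a topic is just a probability distribution over words, and a crude unnormalized document-topic score is the sum of those probabilities over the document's tokens. The numbers below are hypothetical (loosely echoing topics #0 and #3 above), and real LDA inference is Bayesian rather than this simple lookup:

```python
# Toy topics: word -> probability (hypothetical values, not trained output).
topics = {
    "memory":  {"memory": 0.116, "cell": 0.090, "array": 0.054},
    "signals": {"signal": 0.224, "output": 0.099, "input": 0.075},
}

def score(bow_tokens, topic):
    """Unnormalized affinity: sum of the topic's probabilities for the document's tokens."""
    return sum(topic.get(t, 0.0) for t in bow_tokens)

doc = ["signal", "output", "memory"]
best = max(topics, key=lambda name: score(doc, topics[name]))
print(best)
# → signals
```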
  • 21. @FrantaPolach 21 Transforming unseen documents
  text = "A method of configuring the link maximum transmission unit (MTU) in a user equipment."
  1) transform the text into the bag-of-words space
  bow_vector = id2word.doc2bow(tokenize(text))
  print([(id2word[id], count) for id, count in bow_vector])
  [(u'method', 1), (u'configuring', 1), (u'link', 1), (u'maximum', 1), (u'transmission', 1), (u'unit', 1), (u'user', 1), (u'equipment', 1)]
  2) transform the text into our LDA space
  vector = model[bow_vector]
  [(0, 0.024384265946835323), (1, 0.78941547921042373), ...
  3) find the document's most significant LDA topic
  model.print_topic(max(vector, key=lambda item: item[1])[0])
  0.022*network + 0.021*performance + 0.018*protocol + 0.015*data + 0.009*system + 0.008*internet + ...
  • 22. @FrantaPolach 22 Evaluation
  ● Topic modelling is an unsupervised task ->> evaluation is tricky
  ● Need to evaluate the improvement on the intended task
  ● Our goal is to retrieve semantically similar documents, so we tag a set of similar documents and compare them with the results of a given semantic model
  ● "word intrusion" method: for each trained topic, take its first ten words, substitute one of them with a randomly chosen word (the intruder!) and let a human detect the intruder
  ● Method without human intervention: split each document into two parts, and check that topics of the first half are similar to topics of the second; halves of different documents should be dissimilar
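The split-document check can be sketched with plain bag-of-words cosine similarity standing in for topic similarity (a real evaluation would compare the LDA topic vectors of the halves):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two sparse token-count dicts."""
    common = set(a) & set(b)
    num = sum(a[t] * b[t] for t in common)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

doc = "memory cell array memory address bit cell memory row logic".split()
other = "signal output input frequency phase clock circuit".split()

half1, half2 = Counter(doc[:5]), Counter(doc[5:])
# Halves of the same document should agree more than halves of different documents.
print(cosine(half1, half2) > cosine(half1, Counter(other)))
# → True
```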
  • 23. @FrantaPolach 23 The topic space ● a topic is a distribution over a fixed vocabulary of terms ● the idea behind Latent Dirichlet Allocation is to statistically model documents as containing multiple hidden semantic topics
  • 24. @FrantaPolach 24 memory: 188 cell: 146 plurality: 102 array: 86 bit: 71 address: 51 Exploring topic space speed: 178 line: 163 performance: 107 characteristic: 79 skin: 63 suspension: 45 signal: 324 output: 142 input: 108 frequency: 62 phase: 49 clock: 35 portion: 310 housing: 109 end: 62 edge: 53 mounting: 43 form: 35
  • 25. @FrantaPolach 25 Topics distribution many topics in total, but each document contains just a few of them ->> sparse model
  • 26. @FrantaPolach 26 Semantic distance in topic space
  ● Semantic distance queries
  from scipy.spatial import distance
  pairwise = distance.squareform(distance.pdist(matrix))
  >> MemoryError
  ● Document indexing
  from gensim.similarities import Similarity
  index = Similarity('tmp/index', corpus, num_features=corpus.num_terms)
  The Similarity class splits the index into several smaller sub-indexes ->> scales well
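The MemoryError is no accident: a dense pairwise matrix grows quadratically with the number of documents. A quick back-of-the-envelope helper (the 4M document count is illustrative, roughly the order of magnitude of the patent corpus):

```python
def pairwise_gib(n_docs, bytes_per_value=8):
    """Memory needed for a dense n x n distance matrix of 8-byte floats, in GiB."""
    return n_docs * n_docs * bytes_per_value / 2**30

print(pairwise_gib(4_000_000))  # well over 100,000 GiB -- hopeless in RAM
```

This is why gensim's `Similarity` streams queries against smaller on-disk sub-indexes instead of materializing the full matrix.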
  • 27. @FrantaPolach 27 Semantic distance queries
  query = "A method of configuring the link maximum transmission unit (MTU) in a user equipment."
  1) vectorize the text into the bag-of-words space
  bow_vector = id2word.doc2bow(tokenize(query))
  2) transform the text into our LDA space
  query_lda = model[bow_vector]
  3) query the LDA index, get the top 3 most similar documents
  index.num_best = 3
  print(index[query_lda])
  [(2026, 0.91495784099521484), (32384, 0.8226358470916238), (11525, 0.80638835174553156)]
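The `num_best` behaviour — keep only the top-k hits — can be sketched with `heapq` over (doc_id, similarity) pairs; the scores below are invented:

```python
import heapq

def top_k(similarities, k=3):
    """Return the k highest-similarity (doc_id, score) pairs, best first."""
    return heapq.nlargest(k, similarities, key=lambda pair: pair[1])

# Hypothetical similarity scores for five documents.
sims = [(2026, 0.915), (17, 0.12), (32384, 0.823), (11525, 0.806), (99, 0.4)]
print(top_k(sims))
# → [(2026, 0.915), (32384, 0.823), (11525, 0.806)]
```

`heapq.nlargest` keeps only k candidates in memory, so the selection itself streams, matching the scalability theme of the previous slide.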
  • 28. @FrantaPolach 28 Future ● Graph of USPTO data (Neo4j) ● Elasticsearch search and analytics ● Recommendation engine (for applications) ● Drawings analysis ● Blockchain based smart contracts ● Artificial patent lawyer