Text mining and Visualizations

Patent Data Mining and
Visualization Functionalities
A foray into the worlds of R & KNIME

Overview
Data Mining
What is Text Mining?
Text Mining Process
Text Transformation
Feature Selection – tf-idf
Feature Selection – Term-Document Matrix
Feature Selection – Term-Term Matrix
Word Clouds and Clustering Examples
R and KNIME
Live Example - R Shiny
Visualizations SVG and D3
The ‘Big Data’, R and KNIME
KNIME Versus R
Conclusions

Data Mining
• Data Mining = Building Models
• Model (e.g. regression, decision trees, neural networks) = a set of rules connecting
a collection of inputs to a particular target outcome
• A model can explain outcomes of particular interest as predicted by the
available facts
• Data Mining Tasks
• Classification
• Estimation
• Prediction
• Affinity grouping
• Clustering
Directed – find a particular target variable
Undirected – discover structure in the data without
any target variable in mind

Why this Study?
Apply data mining techniques
to understand the fine structure of
published patent documents.
Features of Patent Documents
• Structured Component
• Patent Number, Filing Dates,
Assignees, Regional Coverage
• Unstructured Components
• Title, Claims, Abstract, Descriptions
Data Mining Visualizations
Outcome
• Augment manual interpretation of the results
• Address visualization limitations
• Provide collapsible layouts, interactive graphs, etc.

What Is Text Mining?
“The objective of Text Mining is to exploit information contained in textual
documents in various ways, including … discovery of patterns and trends in
data, associations among entities, predictive rules, etc.” (Grobelnik et al.,
2001)
“Another way to view text data mining is as a process of exploratory data
analysis that leads to heretofore unknown information, or to answers for
questions for which the answer is not currently known.” (Hearst, 1999)
References
M. Hearst, “Untangling Text Data Mining,” in the Proceedings of the 37th Annual Meeting of the Association
for Computational Linguistics, 1999.
M. Grobelnik, D. Mladenic, and N. Milic-Frayling, “Text Mining as Integration of Several Related Research
Areas: Report on KDD’2000 Workshop on Text Mining,” 2000.

Text Mining Process
Preprocessing
• Data Import
• Text preprocessing
Text Transformation
• Stop word Removal
• Stemming
• Parts of Speech Tagging
• N-gram Generation
• Synonym Generalization
Feature Selection And Data Mining
• Term Document Matrix
• Term-Term Matrix
• Clustering or Classification

Text Transformation
• Input text: "Gulf Applied Technologies Inc said it sold its subsidiaries engaged in"
• Stop Word Removal (stop words: "and", "for", "in", "is", "it", "not", "the", "to", "its"):
"Gulf Applied Technologies Inc said sold subsidiaries engaged"
• Stemming: "Gulf Appli Technolog Inc said it sold it subsidiari engag in pipelin"
• Parts of Speech Tagging: "Gulf/NNP Applied/NNP Technologies/NNPS Inc/NNP said/VBD it/PRP sold/VBD"
(e.g., NNP stands for proper noun, singular; VBD stands for verb, past tense)
• N-grams: "Gulf Appli"
• Synonyms (WordNet): synonyms("company") returns "caller", "companionship", "company", "fellowship", …

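The stop word removal and n-gram steps above can be sketched in a few lines. This is a minimal illustration in Python using only the standard library, not the R code behind the slides; the helper names (tokenize, remove_stop_words, ngrams) are hypothetical, and lowercasing the text is an added assumption. Stemming and POS tagging are left out because they need an external library.

```python
import re

# Stop-word list copied from the slide; a real run would use a fuller list.
STOP_WORDS = {"and", "for", "in", "is", "it", "not", "the", "to", "its"}

def tokenize(text):
    """Split text into lowercase alphabetic tokens (lowercasing is an assumption)."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Drop every token that appears in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def ngrams(tokens, n=2):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "Gulf Applied Technologies Inc said it sold its subsidiaries engaged in"
tokens = remove_stop_words(tokenize(sentence))
print(tokens)
print(ngrams(tokens))
```

Removing stop words before building bigrams, as done here, means a bigram can join two words that were not adjacent in the original sentence; generating n-grams first is an equally valid ordering.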
Text Transformation – Regular
Expressions (regex)
A regular expression (abbreviated regex or regexp) is a sequence of
characters that forms a search pattern, mainly for use in pattern
matching with strings, i.e. "find and replace"-like operations. It is a
standard feature of Unix text processing utilities like “grep” and is
now supported by almost all software.
• A simple regexp, ^[ \t]+|[ \t]+$, matches excess whitespace at the beginning and
end of a line.
• A more advanced regexp, ^[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?$, matches any numeral.
One More Example
[cC]ollimat.*
DAP.*
[gG]uid.*[fF]ield
[fF]ield.*[gG]uid
[lL]ight.*[bB]eam
[lL]aser.*[bB]eam
[bB]eam.*[lL]ight
[bB]eam.*[lL]aser
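As a rough illustration of how such patterns are applied, the sketch below uses Python's standard re module rather than R. The two general-purpose patterns are written with their backslashes (\t, \d, \.), and the keyword patterns use [cC] instead of [c|C], since a vertical bar inside a character class is matched literally. The function names and the sample sentence are hypothetical.

```python
import re

# Excess whitespace at the beginning or end of a line.
EDGE_WHITESPACE = re.compile(r"^[ \t]+|[ \t]+$")

# Any numeral: optional sign, digits with optional fraction, optional exponent.
NUMERAL = re.compile(r"^[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?$")

# A few keyword patterns in the style of the slide's last example.
KEYWORD_PATTERNS = [r"[cC]ollimat", r"[lL]aser.*[bB]eam", r"[bB]eam.*[lL]aser"]

def strip_edges(line):
    """Remove leading and trailing spaces and tabs."""
    return EDGE_WHITESPACE.sub("", line)

def is_numeral(text):
    """True if the whole string is a numeral."""
    return NUMERAL.match(text) is not None

def matching_patterns(text):
    """Return the keyword patterns that occur anywhere in the text."""
    return [p for p in KEYWORD_PATTERNS if re.search(p, text)]

print(repr(strip_edges("  \tlight beam \t")))
print(is_numeral("-1.5e10"), is_numeral("beam"))
print(matching_patterns("A collimator narrows the laser beam"))
```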

Feature Selection – Term Frequency
Inverse Document Frequency (tf-idf)
tf–idf is a numerical statistic that reflects how important
a word is to a document in a collection or corpus.
It is often used as a weighting factor in information
retrieval and text mining.
The tf-idf value increases proportionally to the number
of times a word appears in the document, but is offset
by the frequency of the word in the corpus, which helps
to control for the fact that some words are generally
more common than others
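A small worked example may make the weighting concrete. The sketch below assumes one common variant, raw term count multiplied by log(N / document frequency); the slides do not say which variant was used, and the three toy documents are invented for illustration.

```python
import math
from collections import Counter

# Toy corpus (hypothetical): each document is already tokenized.
docs = {
    "d1": "laser beam guide laser".split(),
    "d2": "light beam field".split(),
    "d3": "ultrasound probe field".split(),
}

def tf_idf(docs):
    """Weight each term in each document by count * log(N / document frequency)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each term once per document
    weights = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        weights[name] = {
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return weights

weights = tf_idf(docs)
for term, weight in sorted(weights["d1"].items()):
    print(term, round(weight, 3))
```

In d1, "laser" occurs twice and appears in no other document, so it gets the highest weight, while "beam" also appears in d2 and is weighted lower. A term present in every document would get a weight of zero under this variant.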

Feature Selection – Term-Document
Matrix
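A term-document matrix has one row per term and one column per document, and each cell holds how often the term occurs in that document. The following is a minimal sketch in Python with an invented toy corpus; it stands in for, and is not, the tm-based R code used in the study.

```python
from collections import Counter

# Toy corpus (hypothetical): each document is already tokenized.
docs = {
    "d1": "laser beam guide laser".split(),
    "d2": "light beam field".split(),
    "d3": "ultrasound probe field".split(),
}

def term_document_matrix(docs):
    """Return (terms, doc_names, matrix); matrix[i][j] counts term i in document j."""
    doc_names = list(docs)
    counts = [Counter(docs[name]) for name in doc_names]
    terms = sorted(set().union(*docs.values()))
    matrix = [[c[term] for c in counts] for term in terms]
    return terms, doc_names, matrix

terms, doc_names, matrix = term_document_matrix(docs)
print("term".ljust(12), *doc_names)
for term, row in zip(terms, matrix):
    print(term.ljust(12), *row)
```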

Feature Extraction – Term-Term
Matrix
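A term-term matrix records how often two terms occur together. The sketch below counts, for each pair of terms, the number of documents that contain both. That definition is an assumption, since the slides do not state which co-occurrence measure was used, and the toy corpus is invented.

```python
from collections import Counter
from itertools import combinations

# Toy corpus (hypothetical): each document is already tokenized.
docs = {
    "d1": "laser beam guide".split(),
    "d2": "laser beam field".split(),
    "d3": "ultrasound probe field".split(),
}

def term_term_counts(docs):
    """For each alphabetically ordered term pair, count the documents containing both."""
    pairs = Counter()
    for tokens in docs.values():
        for a, b in combinations(sorted(set(tokens)), 2):
            pairs[(a, b)] += 1
    return pairs

pairs = term_term_counts(docs)
for (a, b), count in pairs.most_common(3):
    print(a, b, count)
```

Pairs that co-occur in many documents are natural candidates for grouping in the word clouds and hierarchical clustering that follow.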

Word Clouds and Hierarchical Clustering
Using Term-Term Matrix

Clustering (Kmeans) and Contour
Plots
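K-means partitions points into k clusters by alternating two steps: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. The sketch below is a bare-bones Python version for illustration only, not the R kmeans call used for the plots. It takes the first k points as initial centroids, which is a simplifying assumption; real implementations use random or smarter starts, and the result depends on that choice. The sample points are invented.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(p, q))

def kmeans(points, k, iterations=20):
    """Lloyd's algorithm; returns (centroids, labels)."""
    centroids = [tuple(p) for p in points[:k]]  # naive start: first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for every point.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels

points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.8)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```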

R and KNIME
R
• A software package especially suitable for data analysis and data (text)
mining, with rich visualization functionality
• Scripting interface
• Graphical user interface development via the “shiny” package
KNIME
• Supports modular node-based workflows
• Core functionality required for data and text mining is
implemented via these nodes
• Node functionality can be extended via R and Java code
snippets in the nodes
R Example

Live Example - R Shiny Package
Web Applications Using (Only) R
No Need for HTML or JavaScript
Great for Communication and Visualization
http://www.rstudio.com/shiny/showcase/
http://rstudio.github.io/shiny/tutorial/
ui.R
Put all UI-related
code here
server.R
Put all server-side
code here
Socket
R Shiny Example

The ‘Big Data’, R and KNIME
pbdR is an academic initiative; it requires special
permission to access a cluster of computers
called Tara
All Revolution R Enterprise 7 editions are distributed
with Open Source R (version 3.0.2), are 100%
compatible with R scripts, functions and CRAN
packages, and include phone and online technical
support.
ParAccel Hadoop Analytics

KNIME Versus R
• KNIME: Visual programming interface; intuitive, but some familiarity is required
  R: Scripting interface; steep learning curve
• KNIME: Workflows can be tailor-made
  R: Workflows can be tailor-made, with an R Shiny user interface
• KNIME: All text mining and data analytics tools are available from a single user interface.
  Classification problems (supervised learning) can be handled better here, as all the required
  libraries are present in one place and one can view intermediate results at the node output ports
  R: Most of the libraries for text mining and data analytics are available, but they must be
  loaded before use
• KNIME: The desktop version of KNIME is available for free, but the server version has
  special requirements
  R: Server as well as desktop versions are available
• KNIME: Requires a reasonably modern PC running Linux, Windows (XP and later), or Mac OS X.
  A multi-core system is a plus
  R: Memory limitations can be overcome using packages like “ff” and “ffbase”
• KNIME: Graphics output can be sent to SVG etc.
  R: Graphics can be sent to SVG etc. One can also send graphics to DHTML using R Shiny
• KNIME: R and Java code can be placed at nodes to create proprietary analyses and visualizations
• KNIME: Robust big data extensions are available for distributed frameworks such as Hadoop
  R: Programming with Big Data in R (pbdR) and distributed frameworks such as Hadoop

Conclusions
Starting with the reasons for doing this project, tools like R and KNIME were examined for their suitability for
text data mining and automatic classification.
Due to the availability of several built-in libraries, R and KNIME are amenable to text data mining.
R and KNIME could be used in a “Big Data” setting, though this may require additional hardware and
the use of proprietary software.
KNIME scores over R in terms of ease of use due to its node-based visual programming interface.
This study is exploratory in nature, and no serious attempt is made to solve problems related to
automatic document classification. Some of the text mining libraries that were explored are:
− tm library in R for generating the Term-Document Matrix and for removing stop words
and punctuation marks in text
− tm library is also used for N-gram tokenization (taking two words at a time)
− OpenNLP library for parts of speech tagging
− Snowball and Porter stemmers for stemming text
− Graphing capabilities of R and KNIME were explored for visual depiction of text in the form of word
clouds

Text mining With R Regular Expressions
Parts of Speech Tagging (POS)
Tag   Meaning              Examples
ADJ   adjective            new, good, high, special, big, local
ADV   adverb               really, already, still, early, now
CNJ   conjunction          and, or, but, if, while, although
DET   determiner           the, a, some, most, every, no
EX    existential          there, there's
FW    foreign word         dolce, ersatz, esprit, quo, maitre
MOD   modal verb           will, can, would, may, must, should
N     noun                 year, home, costs, time, education
NP    proper noun          Alison, Africa, April, Washington
NUM   number               twenty-four, fourth, 1991, 14:24
PRO   pronoun              he, their, her, its, my, I, us
P     preposition          on, of, at, with, by, into, under
TO    the word to          to
UH    interjection         ah, bang, ha, whee, hmpf, oops
V     verb                 is, has, get, do, make, see, run
VD    past tense           said, took, told, made, asked
VG    present participle   making, going, playing, working
VN    past participle      given, taken, begun, sung
WH    wh determiner        who, which, when, what, where, how

Invocation of Shiny
runApp takes the name of the application directory; in this example it is
Test_Shiny01. This directory contains Test.csv as the data source
and two R files called “ui.R” and “server.R”. ui.R defines the
user interface, in this case an HTML page with tabs and a sidebar
panel (with user controls). server.R does all the event
handling after the user selects the “Test.csv” file. The present
implementation works only with the Test.csv file.

Choosing the data source
Click the Browse button
and choose the file
“Test.csv”
Click “Update now”

Different Tab Views
Histogram of Value Scores
Value Score

Box Plots based on Value Score for
Top Five Players
Companies

Word Cloud Based on IPC Codes
Bigram Cloud (a bigram contains two
words)
Word Cloud
R – Patent Informatics
Word Clouds and Cluster Dendrograms
Cluster Dendrogram – different
technical aspects of ultrasound
that are associated with the ultrasound
probe

Each individual patent is treated
as a file; these files are
generated using R code. For this
text mining example, the title,
abstract and claims data are used.
Workflows In KNIME
Java Code
Snippet
R Code
Snippet

Principal Components Analysis
Principal Component Analysis
Appendix – II
Partition Clustering in R (Kmeans)
