Patent Data Mining and
Visualization Functionalities
A foray into the worlds of R & KNIME
Overview
Data Mining
What is Text Mining?
Text Mining Process
Text Transformation
Feature Selection - tf-idf
Feature Selection -Term Document Matrix
Feature Selection –Term Term Matrix
Word Clouds and Clustering Examples
R and KNIME
Live Example - R Shiny
Visualizations SVG and D3
The ‘Big Data’, R and KNIME
KNIME Versus R
Conclusions
Document
Vectorization
Data Mining
• Data Mining = Building Models
• Model (regression, decision trees, neural networks) = set of rules connecting a
collection of inputs to a particular target outcome
• A model can explain outcomes of particular interest in terms of the
available facts
• Data Mining Tasks
• Classification
• Estimation
• Prediction
• Affinity grouping
• Clustering
Directed – finding a particular target variable
Undirected – discovering structure in data without
any target variable in mind
Why this Study?
Apply Data Mining Techniques
to understand fine structure of
published Patent Documents.
Features of Patent Documents
• Structured Component
• Patent Number, Filing Dates,
Assignees, Regional Coverage
• Unstructured Components
• Title, Claims, Abstract, Descriptions
Data Mining Visualizations
Outcome
• Augment Manual interpretation of the results
• Address Visualization limitations
• Providing Collapsible lay-outs, Interactive Graphs etc
What Is Text Mining?
“The objective of Text Mining is to exploit information contained in textual
documents in various ways, including …discovery of patterns and trends in
data, associations among entities, predictive rules, etc.” (Grobelnik et al.,
2001)
“Another way to view text data mining is as a process of exploratory data
analysis that leads to heretofore unknown information, or to answers for
questions for which the answer is not currently known.” (Hearst, 1999)
References
M. Hearst, “Untangling Text Data Mining,” in the Proceedings of the 37th Annual Meeting of the Association
for Computational Linguistics, 1999.
M. Grobelnik, D. Mladenic, and N. Milic-Frayling, “Text Mining as Integration of Several Related Research
Areas: Report on KDD’2000 Workshop on Text Mining,” 2000.
Text Mining Process
Preprocessing
• Data Import
• Text preprocessing
Text Transformation
• Stop word Removal
• Stemming
• Parts of Speech Tagging
• Ngrams Generation
• Synonym Generalization
Feature Selection And Data Mining
• Term Document Matrix
• Term-Term Matrix
• Clustering or Classification
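Two of the text transformation steps listed above, stop word removal and n-gram generation, can be sketched in a few lines of Python. This is a minimal illustration, not the deck's R code: the function names are hypothetical, the tokenizer is a simple assumption, and the stop list is the short one used in the deck's own example.

```python
import re

# Short stop list taken from the deck's example; real pipelines use a fuller list.
STOP_WORDS = {"and", "for", "in", "is", "it", "not", "the", "to", "its"}

def tokenize(text):
    """Split text into word tokens (runs of letters or digits). Assumed tokenizer."""
    return re.findall(r"[A-Za-z0-9]+", text)

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is in the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]

def ngrams(tokens, n=2):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "Gulf Applied Technologies Inc said it sold its subsidiaries engaged in"
tokens = tokenize(sentence)
kept = remove_stop_words(tokens)
bigrams = ngrams(kept, 2)
print(kept)
print(bigrams)
```

Stemming and parts-of-speech tagging are left out because they need a trained model or rule set from a dedicated library.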
Text Transformation
Input: "Gulf Applied Technologies Inc said it sold its subsidiaries engaged in"
Stop Word Removal ("and", "for", "in", "is", "it", "not", "the", "to", "its")
Output: "Gulf Applied Technologies Inc said sold subsidiaries engaged"
Input: "Gulf Applied Technologies Inc said it sold its subsidiaries engaged in"
Stemming
Output: "Gulf Appli Technolog Inc said it sold it subsidiari engag in pipelin"
Input: "Gulf Applied Technologies Inc said it sold its subsidiaries engaged in"
Parts of Speech Tagging
Output: "Gulf/NNP Applied/NNP Technologies/NNPS Inc/NNP said/VBD it/PRP sold/VBD"
(e.g., NNP stands for proper noun, singular; VBD stands for verb, past tense)
Input: "Gulf Appli"
Ngrams
Output: "Gulf Appli"
Input: "Company"
Synonyms (wordnet): synonyms("company")
Output: "caller" "companionship" "company" "fellowship" …
Text Transformation – Regular Expressions (regex)
A regular expression (abbreviated regex or regexp) is a sequence of
characters that forms a search pattern, mainly for use in pattern
matching with strings, or string matching, i.e. "find and replace"-like
operations. It is a standard feature of Unix text-processing utilities such as “grep”
and is now supported by almost all software.
• A simple regexp, ^[ \t]+|[ \t]+$, matches excess whitespace at the beginning and
end of a line.
• An advanced regexp used to match any numeral is
^[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?$
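The two patterns above can be tried out with Python's standard `re` module. This is a minimal sketch with the patterns written out with their backslash escapes (`\t`, `\d`, `\.`); the helper names are hypothetical.

```python
import re

# Leading or trailing run of spaces/tabs on a line.
EXCESS_WS = re.compile(r"^[ \t]+|[ \t]+$")

# Optional sign, digits with an optional fraction (or a bare fraction),
# then an optional exponent.
NUMERAL = re.compile(r"^[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?$")

def strip_excess_whitespace(line):
    """Remove whitespace at the beginning and end of a single line."""
    return EXCESS_WS.sub("", line)

def is_numeral(s):
    """True if the whole string is a numeral in the sense of NUMERAL."""
    return NUMERAL.match(s) is not None

print(repr(strip_excess_whitespace("  \tclaim 1 \t")))
print(is_numeral("-3.14e+10"), is_numeral("abc"))
```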
One More Example
[cC]ollimat.*
DAP.*
[gG]uid.*[fF]ield
[fF]ield.*[gG]uid
[lL]ight.*[bB]eam
[lL]aser.*[bB]eam
[bB]eam.*[lL]ight
[bB]eam.*[lL]aser
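Patterns like these can be used to tag a piece of patent text with the technical concepts it mentions. The sketch below is an assumption about how such a list might be applied, not the deck's own code: the concept labels and the `tag_concepts` helper are hypothetical, and the character classes are written as `[cC]` (a `|` inside square brackets would be matched literally).

```python
import re

# Hypothetical concept labels mapped to patterns modelled on the list above.
PATTERNS = {
    "collimation": r"[cC]ollimat",
    "guide field": r"[gG]uid.*[fF]ield|[fF]ield.*[gG]uid",
    "light/laser beam": (
        r"([lL]ight|[lL]aser).*[bB]eam|[bB]eam.*([lL]ight|[lL]aser)"
    ),
}

def tag_concepts(text):
    """Return the labels whose pattern occurs anywhere in the text."""
    return [label for label, pattern in PATTERNS.items()
            if re.search(pattern, text)]

print(tag_concepts("A collimator shapes the laser beam"))
print(tag_concepts("The beam is guided by a magnetic field"))
```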
Feature Selection – Term Frequency
Inverse Document Frequency (tf-idf)
tf–idf is a numerical statistic that reflects how important
a word is to a document in a collection or corpus.
It is often used as a weighting factor in information
retrieval and text mining.
The tf-idf value increases proportionally to the number
of times a word appears in the document, but is offset
by the frequency of the word in the corpus, which helps
to control for the fact that some words are generally
more common than others
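As a worked illustration of the description above, the following sketch computes tf-idf weights for a toy corpus. It assumes one common weighting, tf = count / document length and idf = log(N / df); other variants exist, and the deck does not say which one it used. The documents and the `tf_idf` function name are made up for the example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document.

    Assumed weighting: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Toy corpus (hypothetical): each document is already tokenized.
docs = [
    ["ultrasound", "probe", "probe", "signal"],
    ["ultrasound", "image", "signal"],
    ["ultrasound", "laser", "beam"],
]
w = tf_idf(docs)
print(w[0])
```

A term that occurs in every document ("ultrasound" here) gets weight zero, while a term repeated within one document and absent elsewhere ("probe") gets the highest weight in that document.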
Feature Selection – Term-Document
Matrix
Feature Extraction – Term-Term
Matrix
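A small sketch of both matrices, under stated assumptions: the term-document matrix holds raw counts (rows are terms, columns are documents), and the term-term matrix is taken to be the term-document matrix multiplied by its transpose, which is one common construction and not necessarily the one used in the deck. The toy documents and function names are hypothetical.

```python
from collections import Counter

def term_document_matrix(docs):
    """Return (terms, matrix) where matrix[i][j] counts term i in document j."""
    terms = sorted({t for doc in docs for t in doc})
    counts = [Counter(doc) for doc in docs]
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix

def term_term_matrix(tdm):
    """Multiply the term-document matrix by its transpose.

    Entry [i][k] sums, over documents, count(term i) * count(term k).
    """
    return [[sum(a * b for a, b in zip(row_i, row_k)) for row_k in tdm]
            for row_i in tdm]

# Toy corpus (hypothetical), already tokenized.
docs = [
    ["laser", "beam", "beam"],
    ["laser", "probe"],
    ["probe", "beam"],
]
terms, tdm = term_document_matrix(docs)
ttm = term_term_matrix(tdm)
print(terms)
print(tdm)
print(ttm)
```

The term-term matrix is symmetric, so it can serve directly as a similarity input for clustering terms.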
Word Clouds and Hierarchical Clustering
Using Term-Term Matrix
Clustering (Kmeans) and Contour
Plots
R and KNIME
R
• A software package especially suitable for data analysis and data (text)
mining, with rich visualization functionality
• Scripting interface
• Graphical user interface development via the “shiny” package
KNIME
• Supports modular, node-based workflows
• Core functionality required for data and text mining is
implemented via these nodes
• Functionality of the nodes can be extended via R and Java code
snippets in the nodes
R Example
Workflow in KNIME
Live Example - R Shiny Package
Web Applications Using (Only) R
No Need for HTML or Javascript
Great for Communication and Visualization
http://www.rstudio.com/shiny/showcase/
http://rstudio.github.io/shiny/tutorial/
ui.R
Put all UI-related
code here
server.R
Put all server-related
code here
Socket
R Shiny Example
SVG and D3
The ‘Big Data’, R and KNIME
pbdR is an academic initiative; it requires special
permission to access a cluster of computers
called Tara
All Revolution R Enterprise 7 editions are distributed
with Open Source R (version 3.0.2), are 100%
compatible with R scripts, functions and CRAN
packages, and include phone and online technical
support.
ParAccel Hadoop Analytics
KNIME Versus R
KNIME: Visual programming interface; intuitive, but some familiarity is required
R: Scripting interface; steep learning curve

KNIME: Workflows can be tailor-made
R: Workflows can be tailor-made; R Shiny user interface

KNIME: All text mining and data analytics tools are available from a single user interface. Classification problems (supervised learning) can be handled better here, as all the required libraries are in one place and intermediate results can be viewed at the node output ports
R: Most of the libraries for text mining and data analytics are available, but they must be loaded before use

KNIME: The desktop version of KNIME is available for free, but the server version has special requirements
R: Server as well as desktop versions are available

KNIME: Requires a reasonably modern PC running Linux, Windows (XP and later), or Mac OS X. A multi-core system is a plus
R: Memory limitations can be overcome using packages such as “ff” and “ffbase”

KNIME: Graphics output can be sent to SVG etc.
R: Graphics can be sent to SVG etc. One can also send graphics to DHTML using R Shiny

KNIME: R and Java code can be used at nodes for creating proprietary analysis and visualizations

KNIME: Robust big data extensions are available for distributed frameworks such as Hadoop
R: Programming with Big Data in R (pbdR) and distributed frameworks such as Hadoop
Conclusions
Starting with the reasons for doing this project, tools like R and KNIME were examined for their suitability for
text data mining and automatic classification.
Due to the availability of several built-in libraries, R and KNIME are more amenable to text data mining.
R and KNIME can be used in a “Big Data” setting, though this may require additional hardware and
the use of proprietary software.
KNIME scores over R in terms of ease of use due to its node-based visual programming interface.
This study is exploratory in nature, and no serious attempt is made to solve problems related to
automatic document classification. Some of the text mining libraries that were explored are:
− tm library in R for generating the Term-Document Matrix and for removing stop words
and punctuation marks in text
− tm library is also used for N-gram tokenization (taking two words at a time)
− OpenNLP library for parts-of-speech tagging
− Snowball and Porter stemmers for stemming text
− Graphing capabilities of R and KNIME were explored for visual depiction of text in the form of word
clouds
Thank You
Backup Slides
Text mining With R Regular Expressions
Tag   Meaning              Examples
ADJ   adjective            new, good, high, special, big, local
ADV   adverb               really, already, still, early, now
CNJ   conjunction          and, or, but, if, while, although
DET   determiner           the, a, some, most, every, no
EX    existential          there, there's
FW    foreign word         dolce, ersatz, esprit, quo, maitre
MOD   modal verb           will, can, would, may, must, should
N     noun                 year, home, costs, time, education
NP    proper noun          Alison, Africa, April, Washington
NUM   number               twenty-four, fourth, 1991, 14:24
PRO   pronoun              he, their, her, its, my, I, us
P     preposition          on, of, at, with, by, into, under
TO    the word to          to
UH    interjection         ah, bang, ha, whee, hmpf, oops
V     verb                 is, has, get, do, make, see, run
VD    past tense           said, took, told, made, asked
VG    present participle   making, going, playing, working
VN    past participle      given, taken, begun, sung
WH    wh determiner        who, which, when, what, where, how
Parts of Speech Tagging
(POS)
Invocation of Shiny
runApp takes the name of the application directory; in this example it is
Test_Shiny01. This directory contains Test.csv as the data source
and two R files called “ui.R” and “server.R”. ui.R defines the
user interface, in this case an HTML page with tabs and a sidebar
panel (with user controls). server.R does all the event
handling after the user selects the “Test.csv” file. The present
implementation works only with the Test.csv file.
Choosing the data source
Click on the Browse button
and choose the file
“Test.csv”
Click “Update now”
Different Tab Views
Histogram of Value Scores
Value Score
IPC Word Cloud
Box Plots based on Value Score for
Top Five Players
Companies
Word Cloud Based on IPC Codes
Bigram Cloud based (Bi-gram contains two
words)
Word Cloud
R – Patent Informatics
Word Clouds and Cluster Dendrograms
Cluster Dendrogram – Different
technical aspects of ultrasound
that are associated with the ultrasound
probe
Each individual patent is treated
as a file- these files are
generated using R Code. For this
Text Mining example Title,
Abstract and claims data is used
Workflows In KNIME
Java Code
Snippet
R Code
Snippet
Appendix – III
Word Cloud
Principal Components Analysis
Principal Component Analysis
Appendix – II
Partition Clustering in R (Kmeans)

NEW LAUNCH! How to build graph applications with SPARQL and Gremlin using Ama...
 
FIWARE Global Summit - IDS Implementation with FIWARE Software Components
FIWARE Global Summit - IDS Implementation with FIWARE Software ComponentsFIWARE Global Summit - IDS Implementation with FIWARE Software Components
FIWARE Global Summit - IDS Implementation with FIWARE Software Components
 
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
 
Building and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache AirflowBuilding and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache Airflow
 
Poorna Hadoop
Poorna HadoopPoorna Hadoop
Poorna Hadoop
 

Text mining and Visualizations

  • 1. Patent Data Mining and Visualization Functionalities: A foray into the worlds of R & KNIME
  • 2. Overview: Data Mining; What is Text Mining?; Text Mining Process; Text Transformation; Document Vectorization; Feature Selection: tf-idf; Feature Selection: Term-Document Matrix; Feature Selection: Term-Term Matrix; Word Clouds and Clustering Examples; R and KNIME; Live Example: R Shiny; Visualizations: SVG and D3; The ‘Big Data’, R and KNIME; KNIME Versus R; Conclusions
  • 3. Data Mining
      • Data mining = building models.
      • A model (regression, decision trees, neural networks) is a set of rules connecting a collection of inputs to a particular target outcome.
      • A model can explain outcomes of particular interest in terms of the available facts.
      • Data mining tasks: classification, estimation, prediction, affinity grouping, clustering.
      • Directed data mining looks for a particular target variable; undirected data mining discovers structure in the data without any target variable in mind.
  • 4. Why this Study?
      Apply data mining techniques to understand the fine structure of published patent documents.
      Features of patent documents:
      • Structured components: patent number, filing dates, assignees, regional coverage
      • Unstructured components: title, claims, abstract, descriptions
      Outcome of data mining visualizations:
      • Augment manual interpretation of the results
      • Address visualization limitations by providing collapsible layouts, interactive graphs, etc.
  • 5. What Is Text Mining?
      “The objective of Text Mining is to exploit information contained in textual documents in various ways, including …discovery of patterns and trends in data, associations among entities, predictive rules, etc.” (Grobelnik et al., 2001)
      “Another way to view text data mining is as a process of exploratory data analysis that leads to heretofore unknown information, or to answers for questions for which the answer is not currently known.” (Hearst, 1999)
      References
      • M. Hearst, “Untangling Text Data Mining,” in Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, 1999.
      • M. Grobelnik, D. Mladenic, and N. Milic-Frayling, “Text Mining as Integration of Several Related Research Areas: Report on KDD’2000 Workshop on Text Mining,” 2000.
  • 6. Text Mining Process
      • Preprocessing: data import, text preprocessing
      • Text transformation: stop-word removal, stemming, part-of-speech tagging, n-gram generation, synonym generalization
      • Feature selection and data mining: term-document matrix, term-term matrix, clustering or classification
  • 7. Text Transformation
      Input: "Gulf Applied Technologies Inc said it sold its subsidiaries engaged in"
      • Stop-word removal ("and", "for", "in", "is", "it", "not", "the", "to", "its"): "Gulf Applied Technologies Inc said sold subsidiaries engaged"
      • Stemming: "Gulf Appli Technolog Inc said it sold it subsidiari engag in pipelin"
      • Part-of-speech tagging: "Gulf/NNP Applied/NNP Technologies/NNPS Inc/NNP said/VBD its/PRP sold/VBD" (NNP stands for proper noun, singular; VBD stands for verb, past tense)
      • N-grams: "Gulf Appli" yields the bigram “Gulf Appli”
      • Synonyms (WordNet): synonyms("company") returns "caller", "companionship", "company", "fellowship", …
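The stop-word and n-gram steps on this slide can be sketched in a few lines. This is a minimal illustration in Python rather than the R tm package used in the deck; the helper names `remove_stop_words` and `ngrams` are hypothetical, and the stop-word list is the one shown on the slide. Stemming and part-of-speech tagging are left out because they need an external library.

```python
# Stop-word list copied from the slide; the helper names are hypothetical.
STOP_WORDS = {"and", "for", "in", "is", "it", "not", "the", "to", "its"}


def remove_stop_words(tokens):
    """Drop every token whose lowercase form is in the stop-word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def ngrams(tokens, n=2):
    """Return all runs of n consecutive tokens, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "Gulf Applied Technologies Inc said it sold its subsidiaries engaged in"
tokens = sentence.split()               # naive whitespace tokenization
filtered = remove_stop_words(tokens)    # "it", "its" and "in" are dropped
bigrams = ngrams(filtered, 2)           # two words at a time

print(filtered)
print(bigrams)
```

Whitespace splitting is the simplest possible tokenizer; real patent text would also need punctuation handling before this step.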
  • 8. Text Transformation – Regular Expressions (regex)
      A regular expression (abbreviated regex or regexp) is a sequence of characters that forms a search pattern, mainly for use in pattern matching with strings, i.e. "find and replace"-like operations. Regular expressions are a standard feature of Unix text-processing utilities such as “grep” and are now supported by almost all software.
      • A simple regexp, ^[ \t]+|[ \t]+$, matches excess whitespace at the beginning and end of a line.
      • An advanced regexp used to match any numeral is ^[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?$
      One more example (keyword patterns):
        [cC]ollimat*
        DAP.*
        [gG]uid.*[fF]ield
        [fF]ield.*[gG]uid
        [Ll]ight.*[bB]eam
        [Ll]aser.*[bB]eam
        [bB]eam.*[Ll]ight
        [bB]eam.*[Ll]aser
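To make the patterns on this slide concrete, here is a small sketch using Python's standard `re` module (the deck itself works in R, so this is only an illustration). The constant and function names are hypothetical, and the patterns are written with their backslashes in place (`\t`, `\d`, `\.`).

```python
import re

# Leading or trailing spaces and tabs on a line.
TRIM = re.compile(r"^[ \t]+|[ \t]+$")

# Optional sign, digits with optional fraction (or a bare fraction), optional exponent.
NUMERAL = re.compile(r"^[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?$")

# One keyword pattern from the slide: "guid..." followed later by "field".
GUIDE_FIELD = re.compile(r"[gG]uid.*[fF]ield")


def trim(line):
    """Remove excess whitespace at both ends of a line."""
    return TRIM.sub("", line)


def is_numeral(text):
    """Return True if the whole string is a number in the accepted notation."""
    return NUMERAL.match(text) is not None


print(trim("  \tlaser beam \t"))
print(is_numeral("-3.14e+10"), is_numeral("abc"))
print(bool(GUIDE_FIELD.search("a light guiding field plate")))
```

Note that a character class such as `[gG]` needs no `|` between the alternatives; inside square brackets a `|` would be matched literally.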
  • 9. Feature Selection – Term Frequency Inverse Document Frequency (tf-idf)
      tf-idf is a numerical statistic that reflects how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining. The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to control for the fact that some words are generally more common than others.
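A minimal sketch of the weighting described on this slide, in Python. It assumes one common variant, raw term count multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term; the slide does not say which variant the authors used, and the function name `tf_idf` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumed variant: weight = tf * log(N / df), with tf the raw count of the
    term in the document and df the number of documents containing it.
    """
    n_docs = len(docs)
    counts = [Counter(doc.lower().split()) for doc in docs]
    df = Counter()
    for c in counts:
        df.update(c.keys())          # each document counts once per term
    return [
        {term: tf * math.log(n_docs / df[term]) for term, tf in c.items()}
        for c in counts
    ]


docs = [
    "ultrasound probe with ultrasound transducer",
    "laser beam guide",
    "probe with light guide",
]
weights = tf_idf(docs)
print(weights[0])
```

A term that occurs in every document gets weight log(1) = 0 under this variant, which is the "offset by the frequency of the word in the corpus" behaviour the slide describes.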
  • 10. Feature Selection – Term-Document Matrix
  • 11. Feature Extraction – Term-Term Matrix
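Since the figures for the two matrix slides did not survive, the following Python sketch shows what the two structures typically look like. It assumes raw counts, with terms as rows and documents as columns, and takes the term-term matrix to be the term-document matrix multiplied by its own transpose; the deck builds these with the R tm package, and the function names and sample documents here are hypothetical.

```python
def term_document_matrix(docs):
    """Return (vocabulary, matrix) with one row per term and one column per document."""
    vocab = sorted({t for d in docs for t in d.lower().split()})
    row_of = {t: i for i, t in enumerate(vocab)}
    matrix = [[0] * len(docs) for _ in vocab]
    for j, doc in enumerate(docs):
        for t in doc.lower().split():
            matrix[row_of[t]][j] += 1
    return vocab, matrix


def term_term_matrix(tdm):
    """Multiply the term-document matrix by its transpose.

    Entry [i][k] sums, over all documents, the product of the counts of
    term i and term k, so it is non-zero only if the terms co-occur.
    """
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in tdm] for r1 in tdm]


docs = ["laser beam", "light beam", "laser light guide"]
vocab, tdm = term_document_matrix(docs)
ttm = term_term_matrix(tdm)

print(vocab)
for term, row in zip(vocab, tdm):
    print(term, row)
```

The rows of the term-term matrix can then feed hierarchical clustering or a word cloud, as on the following slides.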
  • 12. Word Clouds and Hierarchical Clustering Using Term-Term Matrix
  • 13. Clustering (Kmeans) and Contour Plots
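For readers unfamiliar with k-means, a bare-bones sketch of the standard Lloyd iteration in Python follows. It is not the deck's R code: the function names are hypothetical, the sample points are made up, and the initial centroids are passed in explicitly (an assumption made here so the run is deterministic; library implementations usually pick them at random).

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(old)  # keep an empty cluster's centroid in place
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(points, centroids=[(0, 0), (10, 10)])
print(labels)
print(centers)
```

In a text mining setting the points would be document vectors (for example tf-idf rows) rather than 2-D coordinates.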
  • 14. R and KNIME
      R:
      • A software package especially suitable for data analysis and data (text) mining, with rich visualization functionality
      • Scripting interface
      • Graphical user interface development via the “shiny” package
      KNIME:
      • Supports modular, node-based workflows
      • Core functionality required for data and text mining is implemented via these nodes
      • Node functionality can be extended via R and Java code snippets in the nodes
      R Example
  • 16. Live Example - R Shiny Package
      • Web applications using (only) R
      • No need for HTML or JavaScript
      • Great for communication and visualization
      • http://www.rstudio.com/shiny/showcase/
      • http://rstudio.github.io/shiny/tutorial/
      • ui.R: put all UI-related code here
      • server.R: put all server-related (event-handling) code here
      • Socket
      R Shiny Example
  • 18. The ‘Big Data’, R and KNIME
      • pbdR is an academic initiative; it requires special permission to access a cluster of computers called Tara.
      • All Revolution R Enterprise 7 editions are distributed with open-source R (version 3.0.2), are 100% compatible with R scripts, functions and CRAN packages, and include phone and online technical support.
      • ParAccel Hadoop Analytics
  • 19. KNIME Versus R
      • Interface. KNIME: visual programming interface; intuitive, but some familiarity is required. R: scripting interface; steep learning curve.
      • Workflows. KNIME: workflows can be tailor-made. R: workflows can be tailor-made; R Shiny user interface.
      • Tool availability. KNIME: all text mining and data analytics tools are available from a single user interface. Classification problems (supervised learning) can be handled better here, as all the required libraries are in one place and intermediate results can be viewed at the node output ports. R: most of the libraries for text mining and data analytics are available, but they must be loaded before use.
      • Versions. KNIME: the desktop version is available for free, but the server version has special requirements. R: server as well as desktop versions are available.
      • Hardware and memory. KNIME: requires a reasonably modern PC running Linux, Windows (XP and later), or Mac OS X; a multi-core system is a plus. R: memory limitations can be overcome using packages such as “ff” and “ffbase”.
      • Graphics. KNIME: graphics output can be sent to SVG etc. R: graphics can be sent to SVG etc.; graphics can also be sent to DHTML using R Shiny.
      • Custom code. KNIME: R and Java code can be placed at nodes to create proprietary analyses and visualizations. R: (none listed)
      • Big data. KNIME: robust big data extensions are available for distributed frameworks such as Hadoop. R: Programming with Big Data in R (pbdR) and distributed frameworks such as Hadoop.
  • 20. Conclusions
      • Starting with the reasons for doing this project, tools like R and KNIME were examined for their suitability for text data mining and automatic classification.
      • Thanks to the availability of several built-in libraries, R and KNIME are well suited to text data mining.
      • R and KNIME can be used in a “Big Data” setting, though this may require additional hardware and the use of proprietary software.
      • KNIME scores over R in ease of use because of its node-based visual programming interface.
      • This study is exploratory in nature, and no serious attempt is made to solve problems related to automatic document classification.
      • Some of the text mining libraries that were explored:
          - tm library in R for generating the term-document matrix and for removing stop words and punctuation marks from text
          - tm library also for n-gram tokenization (taking two words at a time)
          - OpenNLP library for part-of-speech tagging
          - Snowball and Porter stemmers for stemming text
          - Graphing capabilities of R and KNIME for visual depiction of text in the form of word clouds
  • 23. Text Mining with R: Regular Expressions. Parts of Speech (POS) Tagging
      Tag    Meaning               Examples
      ADJ    adjective             new, good, high, special, big, local
      ADV    adverb                really, already, still, early, now
      CNJ    conjunction           and, or, but, if, while, although
      DET    determiner            the, a, some, most, every, no
      EX     existential           there, there's
      FW     foreign word          dolce, ersatz, esprit, quo, maitre
      MOD    modal verb            will, can, would, may, must, should
      N      noun                  year, home, costs, time, education
      NP     proper noun           Alison, Africa, April, Washington
      NUM    number                twenty-four, fourth, 1991, 14:24
      PRO    pronoun               he, their, her, its, my, I, us
      P      preposition           on, of, at, with, by, into, under
      TO     the word "to"         to
      UH     interjection          ah, bang, ha, whee, hmpf, oops
      V      verb                  is, has, get, do, make, see, run
      VD     past tense            said, took, told, made, asked
      VG     present participle    making, going, playing, working
      VN     past participle       given, taken, begun, sung
      WH     wh determiner         who, which, when, what, where, how
  • 24. Invocation of Shiny
      runApp takes the name of the application directory; in this example it is Test_Shiny01. This directory contains Test.csv as the data source and two R files, “ui.R” and “server.R”. ui.R defines the user interface, in this case an HTML page with tabs and a sidebar panel (with user controls). server.R does all the event handling after the user selects the “Test.csv” file. The present implementation works only with the Test.csv file.
  • 25. Choosing the data source
      Click the Browse button and choose the file “Test.csv”, then click “Update now”.
  • 26. Different Tab Views: Histogram of Value Scores (axis: Value Score)
  • 28. Box Plots Based on Value Score for Top Five Players (axis: Companies)
  • 29. R – Patent Informatics: Word Clouds and Cluster Dendrograms
      • Word cloud based on IPC codes
      • Bigram cloud (a bigram contains two words)
      • Word cloud
      • Cluster dendrogram: different technical aspects related to ultrasound that are associated with the ultrasound probe
  • 30. Each individual patent is treated as a file; these files are generated using R code. For this text mining example, title, abstract and claims data are used.
      Workflows in KNIME: Java Code Snippet, R Code Snippet
  • 32. Principal Components Analysis
      Appendix II: Partition Clustering in R (k-means)

Editor's Notes

  2. Affinity Group – Shared Common Interest