This is an introduction to an algorithm and methodology for extracting semantics from one or several documents using Natural Language Processing and Machine Learning techniques. The presentation describes the different components of the semantic analyzer, using Wikipedia and DBpedia as data sets.
Semantic Analysis using Wikipedia Taxonomy
1. Creating a taxonomy for Wikipedia
Patrick Nicolas
Feb 11, 2012
http://patricknicolas.blogspot.com
http://www.slideshare.net/pnicolas
https://github.com/prnicolas
2. Introduction
The goal of the study is to build a taxonomy graph for the 3+ million Wikipedia entries by leveraging the WordNet hyponyms as a training set.
This model can be used in a wide variety of commercial applications, from context extraction and automated Wiki classification to text summarization.
Notes:
• Definitions and notations are listed in the appendices
• The presentation assumes the reader has basic knowledge of information retrieval, Natural Language Processing and Machine Learning.
3. Process
The computation flow for generating the Wikipedia taxonomy is summarized in the following 5 steps; a sketch of the pipeline follows the list.
1. Extract abstracts & categories from the Wikipedia datasets
2. Generate the hypernym lineages for Wikipedia entries that overlap with WordNet synsets
3. Extract, reduce and order N-Grams and their tags (NNP, NN, …) from each Wikipedia abstract
4. Create a training set of weighted graphs for each Wikipedia abstract that has a corresponding hypernym hierarchy
5. Optimize and apply the model to generate taxonomy lineages for each Wikipedia entry
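A minimal Python sketch of this flow, with each step as a hypothetical stub (none of these function names come from the presentation):

```python
# Skeleton of the 5-step flow; every helper is a hypothetical stub,
# not the author's actual implementation.

def extract_entries(dump_path):
    """Step 1: yield (label, abstract, categories) from the Wikipedia dataset."""
    raise NotImplementedError

def wordnet_lineage(label):
    """Step 2: hypernym lineage if the label overlaps a WordNet synset, else None."""
    raise NotImplementedError

def tagged_ngrams(abstract):
    """Step 3: reduced, ordered N-Grams with their tags (NNP, NN, ...)."""
    raise NotImplementedError

def train_model(training_set):
    """Step 5a: optimize the taxonomy-generation model."""
    raise NotImplementedError

def generate_taxonomy(dump_path):
    entries = list(extract_entries(dump_path))
    # Step 4: weighted graphs only for abstracts with a hypernym hierarchy.
    training_set = [(tagged_ngrams(abstract), wordnet_lineage(label))
                    for label, abstract, _ in entries
                    if wordnet_lineage(label) is not None]
    model = train_model(training_set)
    # Step 5b: a taxonomy lineage for every Wikipedia entry.
    return {label: model.lineage(tagged_ngrams(abstract))
            for label, abstract, _ in entries}
```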
4. Semantic Data Sources
Term Frequency Corpora
The Reuters corpus and the Google N-Grams frequencies are used to compute the inverse document frequency (idf) values.
WordNet Hypernyms
The WordNet database of synsets is used to generate the hierarchy of hypernyms, e.g.
entity/physical entity/object/location/region/district/country/European country/Italy
Wikipedia Datasets
The entry (label), long abstract and categories are extracted from the Wikipedia reference database.
5. N-Grams Extraction Model
The relevancy (or weight ω) of an N-Gram to the context of a document depends on syntactic, semantic and probabilistic features.
Fig. 1 Features of the N-Gram extraction model: frequency of the N-Gram in the document (fD), frequency of the N-Gram in the universe/corpus (idf), frequency of the N-Gram in the category abstracts (ρ), similarity of the N-Gram with the categories (β), the N-Gram tag (α), the frequencies of the constituent terms (Term 1 … Term n), whether the N-Gram is contained in the 1st sentence (φ), and whether it has a semantic definition.
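As a concrete illustration, the features of Fig. 1 can be collected into a record and combined into the weight ω. The combination below is purely an assumption for illustration; the deck does not give the actual scoring function:

```python
from dataclasses import dataclass

@dataclass
class NGramFeatures:
    """Features of an N-Gram in a document (Fig. 1); names mirror the figure."""
    f_doc: float            # frequency of the N-Gram in the document (fD)
    idf: float              # inverse document frequency over the corpus
    f_categories: float     # frequency in the category abstracts (rho)
    cat_similarity: float   # similarity with the categories (beta)
    tag_prior: float        # prior for the N-Gram tag, e.g. NNP vs NN (alpha)
    in_first_sentence: bool # contained in the 1st sentence? (phi)
    has_definition: bool    # has a semantic definition?

def relevancy(f: NGramFeatures, coefs=(1.0, 1.0, 1.0, 1.0, 0.5)) -> float:
    """Hypothetical combination of the features into a weight omega."""
    a, b, c, d, e = coefs
    score = a * f.f_doc * f.idf + b * f.f_categories + c * f.cat_similarity
    score += d * f.tag_prior
    # Boolean features act as boosts (assumed multipliers, not from the source).
    if f.in_first_sentence:
        score *= 1.0 + e
    if f.has_definition:
        score *= 1.0 + e
    return score
```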
6. Computation Flow
The computation flow is broken down into ‘plug & play’ processing units to enable design of experiments and auditing.
Fig. 2 Typical computation flow for the generation of taxonomy. The processing units consume the Wikipedia datasets (labels, abstracts, categories), the N-Grams corpus frequencies (idf) and the WordNet synsets; they produce weighted N-Grams, N-Gram tags, semantic matches, normalized N-Gram weights, hypernym lineages, the labeled lineage, the taxonomy graph and the trained model.
7. N-Grams Frequency Analysis
Let’s define an N-Gram w(n) (e.g., w(3) for a 3-Gram). The frequency of the N-Gram within the corpus C is expressed below, as is the inverse document frequency (idf).
Let w(n) be an N-Gram with frequency count(w(n)), composed of terms wj, j = 1…n, each with frequency count(wj) within a document D. The frequency of the N-Gram within the document is computed as follows.
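The three formulas are rendered as images in the original deck. A plausible reconstruction, assuming standard tf-idf conventions (the normalization by the maximum term count in the last formula is an assumption), is:

```latex
% Hedged reconstruction, assuming standard tf-idf definitions; the deck's
% own formulas are images and may differ.
\[
f_C\bigl(w^{(n)}\bigr) = \frac{\mathrm{count}_C\bigl(w^{(n)}\bigr)}{\sum_{v \in C}\mathrm{count}_C(v)}
\qquad\text{(frequency within the corpus } C\text{)}
\]
\[
\mathrm{idf}\bigl(w^{(n)}\bigr) = \log\frac{|C|}{\bigl|\{\,D \in C : w^{(n)} \in D\,\}\bigr|}
\qquad\text{(inverse document frequency)}
\]
\[
f_D\bigl(w^{(n)}\bigr) = \frac{\mathrm{count}_D\bigl(w^{(n)}\bigr)}{\max_{j}\,\mathrm{count}_D(w_j)}
\qquad\text{(frequency within a document } D\text{)}
\]
```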
8. Weighting N-Grams
Most Wikipedia concepts are well described in the first sentence of their abstract, so we can attribute a greater weight to N-Grams contained in that first sentence. The frequency f1D of an N-Gram in the 1st sentence of a document D is defined below; a simple regression analysis showed that a square root function provides a more accurate contribution (weight) of an N-Gram in a document D.
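Both formulas are images in the deck; a hedged reconstruction consistent with the prose (the exact definition of f1D and the way it mixes with fD are assumptions) is:

```latex
% First-sentence frequency of the N-Gram (assumed definition; s1 = 1st sentence of D):
\[
f^{1}_{D}\bigl(w^{(n)}\bigr) = \frac{\mathrm{count}_{s_1}\bigl(w^{(n)}\bigr)}{\mathrm{count}_{D}\bigl(w^{(n)}\bigr)}
\]
% Square-root contribution suggested by the regression analysis
% (lambda: hypothetical boost for the first sentence):
\[
\omega_D\bigl(w^{(n)}\bigr) \propto \sqrt{\,f_D\bigl(w^{(n)}\bigr) + \lambda\, f^{1}_{D}\bigl(w^{(n)}\bigr)\,}
\]
```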
9. Tagging N-Grams
Although Conditional Random Fields are the predominant discriminative classifiers for predicting sentence boundaries and token tags, we found that Maximum Entropy with binary features was more appropriate to classify the first term of a sentence (NNP or NN).
The model’s feature functions ft(w) → {0,1} are extracted by maximizing the entropy H(p) of the probability that a word w has a specific tag t, subject to the constraints below.
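The slide renders the program as an image; the standard formulation from Berger et al. (cited in the references), which this step appears to follow, is:

```latex
% Maximize the conditional entropy of the tag distribution:
\[
H(p) = -\sum_{w,\,t}\tilde{p}(w)\,p(t \mid w)\,\log p(t \mid w)
\]
% subject to each feature's expectation matching its empirical expectation,
% and to p being a proper conditional distribution:
\[
\sum_{w,\,t}\tilde{p}(w)\,p(t \mid w)\,f_i(w,t) = \sum_{w,\,t}\tilde{p}(w,t)\,f_i(w,t),
\qquad
\sum_{t} p(t \mid w) = 1 .
\]
```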
10. Wikipedia Tags Distribution
We extract the tags of the Wikipedia entries (1- to 4-Grams) in the context of their abstracts. The distribution of tag frequencies shows that proper nouns (tagged NNP) are the predominant tags.
The frequency distribution is used as the prior probability of finding a Wikipedia entry with a specific tag.
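A minimal sketch of turning the observed tag counts into priors; the counts below are placeholders, not the measured Wikipedia distribution:

```python
from collections import Counter

# Placeholder tag counts; the real values come from tagging the 1- to 4-Gram
# Wikipedia entries in the context of their abstracts.
tag_counts = Counter({"NNP": 120_000, "NN": 45_000, "JJ": 8_000, "CD": 2_000})

total = sum(tag_counts.values())
# Prior probability that a Wikipedia entry carries a given tag.
tag_prior = {tag: n / total for tag, n in tag_counts.items()}

print(tag_prior["NNP"])  # ~0.686 with the placeholder counts
```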
11. Tag Predictive Model
We use a multinomial Naïve Bayes classifier to predict the tag of any given Wikipedia entry.
Let’s define the set of classes Ck = { w(n) | tg(w(n)) = k } of Wikipedia entries with a specific tag (e.g., CNNP, CNN) and p(t | Ck) as the prior probability that a tag t belongs to a class.
The likelihood that a given Wikipedia entry has tag k is given below.
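The likelihood appears as an image in the deck; a standard multinomial Naïve Bayes form, assuming the entry is scored from the tags t(wj) of its constituent terms, would be:

```latex
\[
p\bigl(C_k \mid w^{(n)}\bigr) \;\propto\; p(C_k)\,\prod_{j=1}^{n} p\bigl(t(w_j) \mid C_k\bigr)
\]
```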
12. Taxonomy Weighted Graph
Let’s define:
• a taxonomy class (or taxon) as a graph node representing a hypernym (e.g., class = ‘person’)
• a taxonomy instance as an entity name (e.g., instance = ‘Peter’, i.e. Peter IS-A person)
• a taxonomy lineage as the list of ancestors (hypernyms) of an instance
Fig. Example of taxonomy lineage
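A minimal Python sketch of these three definitions; the field and function names are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class TaxonomyClass:
    """A graph node representing a hypernym, e.g. 'person'."""
    name: str
    parent: "TaxonomyClass | None" = None  # next hypernym up the lineage
    weight: float = 0.0                    # omega, filled in by propagation

@dataclass
class TaxonomyInstance:
    """An entity name attached to a class: 'Peter' IS-A 'person'."""
    name: str
    cls: TaxonomyClass
    weight: float = 0.0  # normalized N-Gram weight

def lineage(instance: TaxonomyInstance) -> list:
    """The list of ancestors (hypernyms) of an instance, root last."""
    out, node = [], instance.cls
    while node is not None:
        out.append(node.name)
        node = node.parent
    return out

# Example: entity <- person <- 'Peter'
entity = TaxonomyClass("entity")
person = TaxonomyClass("person", parent=entity)
peter = TaxonomyInstance("Peter", cls=person)
print(lineage(peter))  # ['person', 'entity']
```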
13. Document taxonomy
Any document can be represented as a weighted
graph of taxonomy classes and instances.
Fig. Example of taxonomy graph
14. Propagation Rule for Taxonomy Weights
The flow model is applied to the taxonomy weighted graph to compute the weight of each taxonomy class from the normalized weights of the semantic N-Grams. The weights of the taxonomy classes are normalized so that the root ‘entity’ has ω = 1. The taxonomy instances (N-Grams) are ordered and normalized by their respective weights ω(wk(n)).
Fig. Weight propagation in Taxonomy Graph
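A minimal Python sketch of one plausible propagation rule, assuming each class accumulates the weights of the instances below it and the result is normalized so the root ‘entity’ gets ω = 1 (the deck does not spell out the exact flow model):

```python
# Taxonomy as child -> parent edges; leaves carry normalized N-Gram weights.
parents = {"Peter": "person", "person": "entity",
           "Rome": "city", "city": "entity"}
leaf_weight = {"Peter": 0.7, "Rome": 0.3}  # ordered & normalized instances

def propagate(parents, leaf_weight):
    """Sum the leaf weights up every lineage, then normalize so entity == 1."""
    weight = dict(leaf_weight)
    for leaf, w in leaf_weight.items():
        node = parents.get(leaf)
        while node is not None:
            weight[node] = weight.get(node, 0.0) + w
            node = parents.get(node)
    root = weight["entity"]
    return {node: w / root for node, w in weight.items()}

print(propagate(parents, leaf_weight))
# {'Peter': 0.7, 'Rome': 0.3, 'person': 0.7, 'city': 0.3, 'entity': 1.0}
```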
15. Normalized Taxonomy Weight in Wikipedia
We analyze the distribution of weights along the taxonomy lineage for all Wikipedia entries.
16. Lineage Weights Estimator
The training using the initial set of WordNet hypernyms shows that the distribution of normalized weights ωk along the taxonomy lineage, for a specific similarity class C, can be approximated with a polynomial function (spline).
This estimator is used in the classification of the taxonomy lineages of a Wikipedia abstract.
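A minimal NumPy sketch of fitting such a polynomial estimator; the depth/weight samples are placeholders, not the actual Wikipedia distribution:

```python
import numpy as np

# Placeholder: normalized weights omega_k observed at each depth k of the
# lineage (root 'entity' at k = 0 has weight 1 by construction).
depth = np.array([0, 1, 2, 3, 4, 5])
omega = np.array([1.0, 0.82, 0.61, 0.44, 0.35, 0.31])

# Low-degree polynomial approximation of omega(k) for one similarity class.
coeffs = np.polyfit(depth, omega, deg=3)
estimator = np.poly1d(coeffs)

# The estimator predicts the expected weight at a given lineage depth.
print(estimator(2.5))  # interpolated weight between depths 2 and 3
```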
17. Similarity Metrics
In order to train a model using the labeled WordNet hypernyms, a similarity (or distance) metric needs to be defined. Let’s consider 2 taxonomy lineages Vj and Vk of respective lengths n(j) and n(k).
Cosine Distance
Shortest Path Distance
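Both formulas are images in the deck. Standard forms of these two metrics, treating each lineage as a weighted vector and as a path through the taxonomy respectively, are (a hedged reconstruction, not the slide’s exact notation):

```latex
% Cosine similarity between the two weighted lineage vectors:
\[
\mathrm{sim}\bigl(V_j, V_k\bigr) = \frac{V_j \cdot V_k}{\lVert V_j\rVert\,\lVert V_k\rVert}
\]
% Shortest-path distance through the deepest common hypernym
% (m = length of the shared prefix of the two lineages):
\[
d\bigl(V_j, V_k\bigr) = n(j) + n(k) - 2m
\]
```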
18. Taxonomy Generation Model
Let’s consider m classes of taxonomy lineage similarity and a labeled lineage VH. A class Ci is defined below.
A taxonomy lineage Vj is classified using Naïve Bayes.
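The two definitions appear as images; a speculative reconstruction, assuming the classes bin lineages by their similarity to VH and that a lineage is scored from its normalized weights ωk, is:

```latex
% Class C_i: lineages whose similarity to V_H falls into the i-th band
\[
C_i = \bigl\{\,V \;\big|\; \mathrm{sim}(V, V_H) \in [s_i,\, s_{i+1})\,\bigr\}
\]
% Naive Bayes assignment of a lineage V_j from its weights omega_k:
\[
\hat{c} = \arg\max_i \; p(C_i)\,\prod_k p\bigl(\omega_k \mid C_i\bigr)
\]
```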
20. Appendix: References
• “Introduction to Information Retrieval”, C. Manning, P. Raghavan, H. Schütze, Cambridge University Press
• “The Elements of Statistical Learning”, T. Hastie, R. Tibshirani, J. Friedman, Springer
• “Semantic Taxonomy Induction from Heterogeneous Evidence”, R. Snow, D. Jurafsky, A. Ng
• “A Study on Linking Wikipedia Categories to WordNet Synsets using Text Similarity”, A. Toral, O. Fernandez, E. Agirre, R. Muñoz
• “Regularization Predicts While Discovering Taxonomy”, Y. Mroueh, T. Poggio, L. Rosasco
• “Natural Language Semantics Term Project”, M. Tao
• “A Maximum Entropy Approach to Natural Language Processing”, A. Berger, V. Della Pietra, S. Della Pietra