SlideShare ist ein Scribd-Unternehmen logo
1 von 12
Downloaden Sie, um offline zu lesen
International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015
DOI:10.5121/ijcsa.2015.5404 39
Empirical Evaluation of Web Based Personal
Name Disambiguation Techniques
Dr.A.Suruliandi1
and Selvaperumal.P2
1
Professor, Department of Computer science and Engineering,
ManonmaniamSundaranarUniversity, Tirunelveli, India
2
Research scholar, Department of Computer science and Engineering,
ManonmaniamSundaranar University, Tirunelveli, India
Abstract
Personal name ambiguity in the web arises when different people share the same name in the web.
Resolving the name ambiguity in the web is useful in a number of applications like Information retrieval,
Information extraction and Question and answering system etc. A general name disambiguation process
involves clustering the web pages such that each cluster represents an ambiguous person. In this article,
five important name disambiguation techniques that make use of Hierarchical agglomerative clustering are
empirically compared. Experiments were conducted on the benchmark dataset and their performances are
evaluated in terms of purity, Inverse purity and f-score. Results show the method that uses features like
Lexical, linguistic and personal information hierarchical agglomerative clustering performs better than
disambiguation using other techniques.
Keywords
Personal name disambiguation, Entity name resolution, web page clustering.
1.Introduction
In the web, entities names are highly ambiguous. There are two types of ambiguities as far as
entity names are concerned. A single entity having multiple names and multiple entities sharing a
single name. Solving the former problem is known as alias extraction [1], and the latter is known
as personal name disambiguation.Name ambiguity refers to the problem of two or more entities
sharing the same name. This is in contrast to alias names, where several name refers to the same
entity. Name disambiguation involves correctly predicting the number of persons sharing the
ambiguous name and clustering the web pages such that each person’s document needs to be in a
single cluster.
Name disambiguation is a special case of entity name disambiguation.Sub fields of entity
disambiguation includes name disambiguation, place name disambiguation [2], organizational
name disambiguation [3] etc. In general personal name disambiguation arises when two or more
persons are referred by the same name. Name ambiguity arises when the same name refers to
multiple entities. Retrieving all the entities that the name refers to decreases the precision of an
Information retrieval system. Similarly In sentimental analysis the presence of ambiguous name
decreases the accuracy of the system. In Question and answering system, the presence of name
ambiguity makes it difficult to retrieve relevant answers to the questions from the web. Similarly
in data integration, the presence of ambiguity challenges the integrity of the data.
International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015
40
People search engines like’411.com’, ’PeopleFinder’, ’people.yahoo.com’ are increasingly
becoming popular because of offering a capability to identifying an individual over the vast web
space. People search, [4]involves name disambiguation is an inherent step. Name disambiguation
has been a challenging problem in many application areas like Information retrieval, Social
network extraction, Ontology integration, Information extraction, Question and answering system
etc. Disambiguating a name will in turn improve the efficiency the aforementioned systems. A
closely related area is social network extraction in which an inherent step is personal name
disambiguation. In social network extraction, [5] name disambiguation is an integral part. A large
number of social networking sites are offering people search facility to enable users to identify a
specific individual who is sharing the name with several other persons. Name ambiguity also
arises in the biomedical text documents. The ambiguity arises because of genes and proteins
sharing the same name [6].
Most of the existing method uses bag of words model and named entities for web based name
disambiguation. Methods employed may be local, which involves using words, named entities
and keywords as features [7][8]. Some methods employ global features like usage of knowledge
base and external corpora for the purpose [9]. Since there is no certainty that the all the words are
unique to web pages of a particular individual, it is difficult to use just words as discriminative
features. One alternative of this is limiting the scope of the words taken for consideration
[10].Named entities are considered to be the efficient feature for name disambiguation task as it
can uniquely identify as person. The important step in all of the name disambiguation process is
finding the discriminative features for the individual sharing the ambiguous name.Features
extracted can be direct features and indirect features. Direct features extraction includes
extracting n-grams, named entities, hyperlinks from the web pages of relevant persons and
indirect features includes extracting features from the external knowledge bases.
Web People search workshops (WePS) is an evaluation campaign focuses on personal name
disambiguation task. First WePS was conducted on 2007 followed by second on 2009 and third
on 2010. WePS workshops involves two tracks namely clustering documents to solve the
personal name ambiguity and extracting various information like dob, designation, Birthplace,
occupation, phone etc for each person sharing the ambiguous name. Most of the systems in WePS
uses named entities as main feature for name disambiguation tasks.
1.1.Relative work
Most of the previous works related to name disambiguation involves unsupervised clustering [11]
[12] [13].RonBekkerman,, et al [14] used link structure and in another method they used
Agglomerative/conglomerative double clustering for name disambiguation process. Their focus is
on disambiguating name in social network and is quite different from disambiguating peoples’
name in the web. Masaki et al, [8] proposed a two stage personal name disambiguation technique
involving hard clustering and soft clustering. The extracted features like Nouns, Compound
keywords and Urlsfrom the data and first applied Hierarchical agglomerative clustering to cluster
documents of same person. Compound keywords were then extracted from each of these clusters
and are used as features for second stage soft clustering. There have been attempts to
disambiguate names using knowledge of Wikipedia [15]. They used knowledge from Wikipedia
to disambiguate names. They mapped ambiguous names to Wikipedia entry, making it difficult to
apply for all the people names because of small coverage of personal names by Wikipedia.Other
commonly used methods include usage of biographic features like name, occupation, location etc
for disambiguation process. The problem with using biographic information from name
disambiguation is, it is often impossible to extract the required biographic information for each
person from the web. If a person is less visible in the web then it is not possible to extract all the
biographic information required for disambiguation process. Zhengzhong Liu et al, [16] proposed
a clustering algorithm for name disambiguation using topic information.
International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015
41
Their method extracts web page title, URL, meta data, words in the web page etc as feature vector
followed by merging of clusters sharing similar topics. Yiming Lu et al, [17] used web corpora to
increase the accuracy of author name disambiguation system. They considered if two papers
appear in the same page, both of it belong to the same author. Using web connection as well as a
series of constraints they disambiguated author names. V.Kalashnikov [18]exploits a variety of
semantic information like named entities, hyperlinks etc to construct a Entity-Relationship graph
to disambiguate names in web pages. They proposed a method for web people search that makes
use of name disambiguation to facilitate better people search. kazunari Sugiyama et al,[19] used
Semi supervised approach for name disambiguation in the web. He uses seed pages to guide the
clustering process. Zhao Lu et al, [20]proposed ontology based name disambiguation model.
They first constructed an ontology with large set of instances followed by creation of temporary
instance and extraction of features from the web. The similarity between temporary instance and
the instance in the ontology to find the appropriate personal name. Xianpei Han et al, [21] used
reference entity tables mined from the web for name disambiguation. They used web querying to
construct reference table then personal names are disambiguated by linking them to the entries in
the reference table. The exhaustive literature survey conducted shows that resolving name
disambiguation problem is far from over because of the following challenges.
1. The number of persons sharing the ambiguous names keeps increasing as the time go by.
2. Challenge of finding the number of persons sharing the ambiguous name (Which is
required before performing the clustering).
3. The unstructured nature of the web content and difficulty in extracting features for
clustering.
4. The cost and resource required to construct knowledge base if external resources are used
for resolving name ambiguity.
1.2.Motivation and Justification
Personal name queries accounts for 5% of search queries in search engine[22].Over the years, a
range of techniques like usage of ontologies, reference table and extraction of different features
etc have been proposed for web based name disambiguation problem. To the best of authors’
knowledge, there are not many papersevaluating the performance of existing name
disambiguation techniques. Motivated by this, in this paper five significant work of name
disambiguation is empirically evaluated using benchmark dataset.
Hierarchical agglomerative clustering is the commonly used clustering algorithms for name
disambiguation process. Most of the method uses internal features like tokens, nouns, and
biographic features etc as discriminative features for disambiguation process. Justified by this, in
this paper five important works of name disambiguation process is empirically evaluated.
2.Method
2.1.Term-Entity based name disambiguation
Term-entities are set of terms or named entities that uniquely represents a person. Bollegala et
al,[7] proposed an unsupervised method based on term-entity model for automatically
disambiguating people name. They used nouns, nouns phrases and keywords extracting from the
web for the process. They represented each person sharing the ambiguous name in term-entity
model and used group average agglomerative clustering to cluster documents that belong to the
same person. Fig 1 shows various steps involved in the process.
International Journal on Computational Science
Fig 1: Term
The name disambiguation method
Step 1: query the search engine with “personal name” and get top ‘n’
Step 2: Extract terms and named entities
ambiguous personal name.
Step 3: Group average agglomerative clustering is performed
cluster (i.e hard clustering). Terminate the clustering process when the
a predefined value.
Care is taken such that the number of clusters formed is equal to the number of
the ambiguous name in the web page collection.
cluster quality, which is the optimum number of clusters to be
with the help of experimental dataset.
Experiments have shown that the use of noun and noun phrases alone forms good feature for
clustering process. Since the method relies mainly on nouns and terms, the method is simpl
robust for name disambiguation process.
expensive and using all these terms may not always yield better clustering results.
value computed using experiment dataset will not always suitable for all the dataset.
Disambiguation
Where, A(i,j) is the number of person i, predicted as j, and A(i,i)
correctly predicted as i.
ournal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015
Term-Entity based name disambiguation process
name disambiguation method involves the following steps.
query the search engine with “personal name” and get top ‘n’ web pages.
Extract terms and named entities in the web pages using POS tagger that contains the
Group average agglomerative clustering is performed to assign each document in a
(i.e hard clustering). Terminate the clustering process when the clustering quality
Care is taken such that the number of clusters formed is equal to the number of persons sharing
the ambiguous name in the web page collection. The method uses a predefined threshold
optimum number of clusters to be formed. The threshold is calculated
dataset.
Experiments have shown that the use of noun and noun phrases alone forms good feature for
Since the method relies mainly on nouns and terms, the method is simpl
robust for name disambiguation process. Extracting a large number of terms is computationally
using all these terms may not always yield better clustering results. The
using experiment dataset will not always suitable for all the dataset.
Disambiguation	Accuracy ൌ	
∑ ‫ܣ‬ሺ݅, ݅ሻ.
௜
∑ ‫ܣ‬ሺ݅, ݆ሻ.
௜,௝
person i, predicted as j, and A(i,i) is the number of persons
CSA) Vol.5, No.4, August 2015
42
that contains the
to assign each document in a
clustering quality reaches
persons sharing
The method uses a predefined threshold value of
formed. The threshold is calculated
Experiments have shown that the use of noun and noun phrases alone forms good feature for
Since the method relies mainly on nouns and terms, the method is simple and
Extracting a large number of terms is computationally
The predefined
persons i
International Journal on Computational Science
2.2.Two stage clustering
Masaki Ikeda et al, [8] proposed two stage clusteri
stage clustering. The process involved is represented in fig
from the whole document gives better results than fixing a limited window size for feature
extraction.
Fig 2: Two stage clustering based name disambiguation process
The process involved is explained in the following p
Step 1: In the web page collection,
(compound nouns), link features(
Step2: Extracted features are then used for hierarchical agglomerative
clustering) and use group average
Step 3: Extract compound keywords again from the resulting cluster,
Step4: Perform clustering process
of clusters are generated.
Since Entity names, compound keywords, urls are important features in name disambiguation
process, this method exploits these potent features for the process. The use of two stage clustering
hard followed by soft clustering
clustering approach does not always suits well for name disambiguation because
of the web pages belongs to a single person sharing the ambiguous name
containing names of multiple persons
ournal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015
] proposed two stage clustering name disambiguation process involving two
The process involved is represented in fig 2. He found that extracting features
from the whole document gives better results than fixing a limited window size for feature
Two stage clustering based name disambiguation process
The process involved is explained in the following procedure.
In the web page collection, extract named entity features, compound keywords
(compound nouns), link features(URL’s)
features are then used for hierarchical agglomerative clustering (
clustering) and use group average method for finding distance between clusters.
Extract compound keywords again from the resulting cluster,
erform clustering process with the extracted compound keywords until required number
Since Entity names, compound keywords, urls are important features in name disambiguation
process, this method exploits these potent features for the process. The use of two stage clustering
hard followed by soft clustering provides robust results than simple hard clustering.
clustering approach does not always suits well for name disambiguation because in the web
of the web pages belongs to a single person sharing the ambiguous name and a single web page
containing names of multiple persons sharing the name.
CSA) Vol.5, No.4, August 2015
43
ng name disambiguation process involving two
He found that extracting features
from the whole document gives better results than fixing a limited window size for feature
extract named entity features, compound keywords
clustering (Hard
until required number
Since Entity names, compound keywords, urls are important features in name disambiguation
process, this method exploits these potent features for the process. The use of two stage clustering
ple hard clustering. Soft
in the web, most
and a single web page
International Journal on Computational Science
2.3.Disambiguation with rich features
ErginElmaciogluet al, [10]method
problem. It is based on extracting variety of features from the web
features to the hierarchical agglomerative clustering algorithm.
in name disambiguation using rich set of features.
Fig 3: Name disambiguation using rich features set
The process involves the following steps
Step 1: Extract Tokens, Named entities,
collection.
Step 2: Perform Hierarchical agglomerative clustering with the feature vectors
Step 3: Iterate the clustering process until
The use of features in the web pages like Tokens, Named entities produce robust results in
clustering. The method is simple and easy to implement. Name disambiguation cannot always be
successful using the features in the web pages alone. The use of external feat
from link structure can be better exploited for resolving name ambiguity.
2.4.Unsupervised name disambiguation
S. Mann et al, [23]used rich set of biographic information as features for clustering process. The
method uses language independent bootstrapping process for extracting these features.
the use of different features ranging from simple words in the web pages to
relevant words (tf-idf), biographical f
in name disambiguation process with unsupervised learning.
ournal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015
Disambiguation with rich features
method views name disambiguation clustering as hard clustering
It is based on extracting variety of features from the web and are input as vector of
the hierarchical agglomerative clustering algorithm. Fig 3 shows the process involved
name disambiguation using rich set of features.
Name disambiguation using rich features set
The process involves the following steps
amed entities, Hostnames and domains, Page URL’s from the web page
Perform Hierarchical agglomerative clustering with the feature vectors
Iterate the clustering process until desirable number of clusters are generated.
e of features in the web pages like Tokens, Named entities produce robust results in
clustering. The method is simple and easy to implement. Name disambiguation cannot always be
successful using the features in the web pages alone. The use of external features and features
from link structure can be better exploited for resolving name ambiguity.
Unsupervised name disambiguation
used rich set of biographic information as features for clustering process. The
method uses language independent bootstrapping process for extracting these features.
the use of different features ranging from simple words in the web pages to im
idf), biographical features and so on. Fig 4 shows the actual process involved
in name disambiguation process with unsupervised learning.
CSA) Vol.5, No.4, August 2015
44
as hard clustering
and are input as vector of
Fig 3 shows the process involved
from the web page
desirable number of clusters are generated.
e of features in the web pages like Tokens, Named entities produce robust results in
clustering. The method is simple and easy to implement. Name disambiguation cannot always be
ures and features
used rich set of biographic information as features for clustering process. The
method uses language independent bootstrapping process for extracting these features. It explores
important and
Fig 4 shows the actual process involved
International Journal on Computational Science
Fig 4:
Procedure involved in the process is as
Step 1: Extract Nouns, basic biographic information (dob, occupation, designation etc)
idf(term-inverse document frequency) from the document collection
Step 2: Perform Hierarchical agglomerative clustering with the feature vectors
Step 3: Terminate the clustering process when the desirable number of clusters are generated.
The use of nouns, biographic information works well for name disambiguation process. The use
of external knowledge base like Wikipedia may increase the accuracy of the
2.5.Disambiguation using Lexical, linguistic and personal
Xiaojun Wan et al, [13]proposed a method called Web hawk for name disambiguation that
leverages on Lexical linguistic and personal information features for clustering.Lexical features
includes title, words in the web pages, linguistic
personal information features includes
number. Fig 5 shows the process involved in disambiguation process.
process.
Step 1: Pre-process the document collection that includes removing
multiple persons sharing the ambiguous names, removing junk
human names like ford, Bloomberg, Disney etc)
Step 2: Extract lexical, linguistic and personal information features from the document collection
ournal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015
Fig 4: Unsupervised name disambiguation
Procedure involved in the process is as follows
Nouns, basic biographic information (dob, occupation, designation etc)
inverse document frequency) from the document collection.
Perform Hierarchical agglomerative clustering with the feature vectors
minate the clustering process when the desirable number of clusters are generated.
The use of nouns, biographic information works well for name disambiguation process. The use
of external knowledge base like Wikipedia may increase the accuracy of the system.
Disambiguation using Lexical, linguistic and personal features
proposed a method called Web hawk for name disambiguation that
and personal information features for clustering.Lexical features
words in the web pages, linguistic features includes Nouns, Noun phrases and
personal information features includes person name, title, organization, email address and phone
Fig 5 shows the process involved in disambiguation process. It involves the following
process the document collection that includes removing web pages that refers to
multiple persons sharing the ambiguous names, removing junk pages (pages referring to non
human names like ford, Bloomberg, Disney etc).
c and personal information features from the document collection
CSA) Vol.5, No.4, August 2015
45
Nouns, basic biographic information (dob, occupation, designation etc), tf-
minate the clustering process when the desirable number of clusters are generated.
The use of nouns, biographic information works well for name disambiguation process. The use
proposed a method called Web hawk for name disambiguation that
and personal information features for clustering.Lexical features
features includes Nouns, Noun phrases and
person name, title, organization, email address and phone
It involves the following
web pages that refers to
pages (pages referring to non-
c and personal information features from the document collection.
International Journal on Computational Science
Step 3: Perform hierarchical agglomerative clustering
similarity between two clusters as follows
ܵ݅݉
Where pi and pj are the web pages and c1,c2 are clusters.
Fig 5: Disambiguation using lexical,linguistic and personal
2.6.Hierarchical agglomerative clustering
HAC,[24]is a bottom up clustering method
singleton clusterand successively similar pairs are
until a single cluster containing all the documents is formed.
Step1: Assign each document to individual cluster
Step 2: Construct distance matrix and look for the pair of matrix that are shortest from each other.
Step 3: Merge the clusters that are closest to each other.
Step 4: Find the distance between the new
distance matrix.
Step 5: Repeat the above steps until single cluster is formed or required number of clusters
formed.
ournal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015
Perform hierarchical agglomerative clustering and average-link method to compute the
similarity between two clusters as follows
ܵ݅݉ሺܿ1, ܿ1ሻ ൌ	
∑ ∑ ‫݉݅ݏ‬ሺ‫݌‬௜, ‫݌‬௝ሻ௡
௝ୀଵ
௠
௜ୀଵ
݉ ൈ ݊
are the web pages and c1,c2 are clusters.
isambiguation using lexical,linguistic and personal features
Hierarchical agglomerative clustering
is a bottom up clustering method in which initially each document is treated as
and successively similar pairs are merged (agglomerate) to form larger cluster
cluster containing all the documents is formed. It involves the following steps
Assign each document to individual cluster
Construct distance matrix and look for the pair of matrix that are shortest from each other.
Merge the clusters that are closest to each other.
Find the distance between the new cluster with all the other clusters and update the
Repeat the above steps until single cluster is formed or required number of clusters
CSA) Vol.5, No.4, August 2015
46
link method to compute the
in which initially each document is treated as
agglomerate) to form larger cluster
It involves the following steps.
Construct distance matrix and look for the pair of matrix that are shortest from each other.
with all the other clusters and update the
Repeat the above steps until single cluster is formed or required number of clusters are
International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015
47
Since the required number of clusters is not known a priori, when to terminate the clustering
process is a significant factor in performance of clustering process. The use of different distance
metrics for measuring distance between clusters may produce different results. Thus selection of
distance parameter forms a vital part in clustering process.Inthe following experiments, group
average agglomerative clustering is used to find the average distance between clusters. In group
average clustering, the distance between two clusters is calculated average distance between data
points in both the clusters. In practical, it is impossible to known what the ‘n’ value beforehand
and different techniques have been devised to handle this problem [7].
3.Experimental results and Discussion
3.1.Dataset
In this experiment, two datasets are used. Danushka et al, [7] used names Jim Clark and Michael
Jackson as dataset and in this paper, this dataset is referred as dataset I. In that dataset, the number
of persons sharing the name and the number of contexts has been downsized for our convenience.
In order to use benchmark dataset for the evaluation purpose, Pedersen and Kulkarni’s data set is
used. Pedersen and Kulkarni’s data set, [25] contains data for five ambiguous names collected
from the web. For these five names, they have collected the data, data formatted and cleaned.
This dataset is hereafter referred as dataset – II. Table 1 and 2 shows the summary of datasetI and
II.
Table 1: Summary of dataset – I.
Table 2: Summary of dataset – II.
3.2 Performance metrics
The performance of clustering can be measured by Purity index, Inverse purity Index and
harmonic mean of purity and inverse purity index, [26].Purity index, [27] is the number of
correctly assigned documents (majority) by number of documents.
Name No. of persons sharing
the name
Number of
contexts
Jim Clark 6 230
Michael Jackson 2 90
Name No. of persons sharing
the name
Number of
contexts
Richard Alston 2 247
Sarah Connor 2 150
George Miller 3 286
Ted Pedersen 4 333
Michael Collins 4 359
International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015
48
ܲ‫ݕݐ݅ݎݑ‬ ൌ	෍
|‫ܥ‬௜|
ܰ
ூ
݉ܽ‫ݔ‬௝
|‫ܥ‬௜ ∩ ‫ܮ‬௝|
|‫ܥ‬௜|
Where C the set of clusters,L is the list of classes and N is the total number of documents
clustered. Purity penalizes noise in the cluster.Inverse purity index, [28] rewards grouping items
together.
‫ݕݐ݅ݎݑܲ݁ݏݎ݁ݒ݊ܫ‬ ൌ	෍
|‫ܮ‬௜|
ܰ
ூ
݉ܽ‫ݔ‬௝
|‫ܮ‬௜ ∩ ‫ܥ‬௝|
|‫ܮ‬௜|
F-score is the harmonic mean of purity and inverse purity value.
‫ܨ‬ − ܵܿ‫݁ݎ݋‬ ൌ	
2 ∗ ܲ‫ݕݐ݅ݎݑ‬ ∗ ‫ݕݐ݅ݎݑ݌݁ݏݎ݁ݒ݊ܫ‬
ܲ‫ݕݐ݅ݎݑ‬ + ‫ݕݐ݅ݎݑ݌݁ݏݎ݁ݒ݊ܫ‬
All of the above metrics ranges between 0 to 1, where 1 is the optimal value.
3.3.Results and Discussion
3.3.1 Experiments –I
The hierarchical agglomerative clustering is performed to produce ‘n’ number of clusters where
‘n’ is the number of persons sharing the ambiguous name. For the experiment, a part of web page
in each of top 100 web pages returned by search engine is used. The performance of various
methods for dataset-I can be seen in the table 3.
Table 3: Performance of various techniques on dataset - I.
From the table 3 it is evident that the usage of lexical, linguistic and personal information
discriminative features works well for name disambiguation task. Term-entity model that uses
nouns and noun phrases as discriminative features also performs better in the disambiguation
task.
3.3.2 Experiments – II
The same experiment is repeated with dataset – II using a portion of top 100 web pages returned
by search engine for the name query. The performance of various methods for dataset-I can be
seen in the table 4.
Method Purity Index Inverse purity
index
F-Score
Term-entity based
model
0.69 0.83 0.75
Two stage clustering 0.65 0.78 0.70
Disambiguation with
rich features
0.53 0.73 0.61
Unsupervised name
disambiguation
0.63 0.81 0.70
using Lexical, linguistic
and personal features
0.79 0.82 0.80
International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015
49
Table 4: Performance of various techniques on dataset - II.
The performance of method that uses lexical, linguistic and personal info is evident from the table
4. This method outperforms the other methods both in terms of purity and inverse purity. Similar
to experiment I, term entity based method performs well in the experiment II also. It can be
inferred from the experiments that the usage of lexical, linguistic features along with nouns and
noun phrases serves better for name discrimination tasks.
4.Conclusion
Considering the wider application of name disambiguation process, in this paper five important
techniques of name disambiguation process is empirically evaluated. Experiment is conducted
using benchmark dataset and the results are evaluated using standard metrics like purity, inverse
purity and f-score values. Results show that method that uses features like Lexical, linguistic and
personal information hierarchical agglomerative clustering performs better than other name
disambiguation techniques taken for evaluation.
The work is prequel for proposing a novel web based name disambiguation technique.
Disambiguating place names, product names etc are interesting topic to work on. The usage of
name disambiguation process in a large number of applications like information retrieval,
information extraction, document merging, question and answering etc are compelling areas to
work on.
Reference
1. Bollegala, Danushka, Yutaka Matsuo, and Mitsuru Ishizuka. "Automatic discovery of personal name
aliases from the web." IEEE Transactions on Knowledge and Data Engineering 23, no. 6 (2011): 831-
844.
2. Overell, Simon, Joao Magalhaes, and Stefan Rüger. "Place disambiguation with co-occurrence
models." In CLEF 2006 Workshop, Working notes. 2006.
3. Zhang, Shu, Jianwei Wu, Dequan Zheng, Yao Meng, and Hao Yu. "An adaptive method for
organization name disambiguation with feature reinforcing." In Proceedings of the 26th Pacific Asia
Conference on Language, Information, and Computation, pp. 237-245. 2012.
4. Kalashnikov, Dmitri V., Zhaoqi Chen, SharadMehrotra, and RabiaNuray-Turan. "Web people search
via connection analysis." Knowledge and Data Engineering, IEEE Transactions on 20, no. 11 (2008):
1550-1565.
5. Matsuo, Yutaka, Junichiro Mori, Masahiro Hamasaki, Takuichi Nishimura, Hideaki Takeda,
KoitiHasida, and Mitsuru Ishizuka. "POLYPHONET: an advanced social network extraction system
Method Purity Index Inverse purity
index
F-Score
Term-entity based
model
0.73 0.81 0.76
Two stage clustering 0.59 0.71 0.64
Disambiguation with
rich features
0.68 0.79 0.73
Unsupervised name
disambiguation
0.71 0.73 0.71
using Lexical,
linguistic and personal
features
0.77 0.83 0.79
International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015
50
from the web." Web Semantics: Science, Services and Agents on the World Wide Web 5, no. 4
(2007): 262-278.
6. Al-Mubaid, Hisham, and Ping Chen. "Biomedical term disambiguation: An application to gene-
protein name disambiguation." In Information Technology: New Generations, 2006. ITNG 2006.
Third International Conference on, pp. 606-612. IEEE, 2006.
7. Bollegala, Danushka, Yutaka Matsuo, and Mitsuru Ishizuka. "Automatic annotation of ambiguous
personal names on the web." Computational Intelligence 28, no. 3 (2012): 398-425.
8. Ikeda, Masaki, Shingo Ono, Issei Sato, Minoru Yoshida, and Hiroshi Nakagawa. "Person name
disambiguation on the web by two-stage clustering." In 2nd Web People Search Evaluation Workshop
(WePS 2009), 18th WWW Conference. 2009.
9. Nguyen, Hien T., and Tru H. Cao. "A knowledge-based approach to named entity disambiguation in
news articles." In AI 2007: Advances in Artificial Intelligence, pp. 619-624. Springer Berlin
Heidelberg, 2007.
10. Elmacioglu, Ergin, Yee Fan Tan, Su Yan, Min-Yen Kan, and Dongwon Lee. "Psnus: Web people
name disambiguation by simple clustering with rich features." In Proceedings of the 4th International
Workshop on Semantic Evaluations, pp. 268-271. Association for Computational Linguistics, 2007.
11. Chen, Ying, and James Martin. "Towards Robust Unsupervised Personal Name Disambiguation." In
EMNLP-CoNLL, pp. 190-198. 2007.
12. http://www.d.umn.edu/~tpederse/namedata.html
13. Wan, Xiaojun, JianfengGao, Mu Li, and Binggong Ding. "Person resolution in person search results:
Webhawk." In Proceedings of the 14th ACM international conference on Information and knowledge
management, pp. 163-170. ACM, 2005.
14. Bekkerman, Ron, and Andrew McCallum. "Disambiguating web appearances of people in a social
network." In Proceedings of the 14th international conference on World Wide Web, pp. 463-470.
ACM, 2005.
15. Bunescu, Razvan C., and Marius Pasca. "Using Encyclopedic Knowledge for Named entity
Disambiguation." In EACL, vol. 6, pp. 9-16. 2006.
16. Liu, Zhengzhong, Qin Lu, and JianXu. "High performance clustering for web person name
disambiguation using topic capturing." Ratio (2011).
17. Lu, Yiming, ZaiqingNie, Taoyuan Cheng, Ying Gao, and Ji-Rong Wen. "Name disambiguation using
Web connection." In Proceedings of AAAI 2007 Workshop on Information Integration on the Web.
2007.
18. Bunescu, Razvan C., and Marius Pasca. "Using Encyclopedic Knowledge for Named entity
Disambiguation." In EACL, vol. 6, pp. 9-16. 2006.
19. Sugiyama, Kazunari, and Manabu Okumura. "Personal name disambiguation in web search results
based on a semi-supervised clustering approach." In Asian Digital Libraries. Looking Back 10 Years
and Forging New Frontiers, pp. 250-256. Springer Berlin Heidelberg, 2007.
20. Lu, Zhao, Zhixian Yan, and Liang He. "OnPerDis: Ontology-Based Personal Name Disambiguation
on the Web." In Web Intelligence (WI) and Intelligent Agent Technologies (IAT), 2013
IEEE/WIC/ACM International Joint Conferences on, vol. 1, pp. 185-192. IEEE, 2013.
21. Han, Xianpei, and Jun Zhao. "Web personal name disambiguation based on reference entity tables
mined from the web." In Proceedings of the eleventh international workshop on Web information and
data management, pp. 75-82. ACM, 2009.
22. Guha and Garg. Disambiguating people in search.WWW’04.
23. Mann, Gideon S., and David Yarowsky. "Unsupervised personal name disambiguation." In
Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003-Volume
4, pp. 33-40. Association for Computational Linguistics, 2003.
24. Jain, Anil K., M. NarasimhaMurty, and Patrick J. Flynn. "Data clustering: a review." ACM computing
surveys (CSUR) 31, no. 3 (1999): 264-323.
25. http://www.d.umn.edu/~tpederse/namedata.html
26. Manning, C.D., Raghavan, P., Schutze, H.: Introduction to Information Retrieval. Cambridge
University Press (2008)
27. Amigó, Enrique, Julio Gonzalo, Javier Artiles, and FelisaVerdejo. "A comparison of extrinsic
clustering evaluation metrics based on formal constraints." Information retrieval 12, no. 4 (2009):
461-486.

Weitere ähnliche Inhalte

Was ist angesagt?

Research Inventy : International Journal of Engineering and Science
Research Inventy : International Journal of Engineering and ScienceResearch Inventy : International Journal of Engineering and Science
Research Inventy : International Journal of Engineering and Scienceresearchinventy
 
Making the Web Searchable - Keynote ICWE 2015
Making the Web Searchable - Keynote ICWE 2015Making the Web Searchable - Keynote ICWE 2015
Making the Web Searchable - Keynote ICWE 2015Peter Mika
 
Rule based messege filtering and blacklist management for online social network
Rule based messege filtering and blacklist management for online social networkRule based messege filtering and blacklist management for online social network
Rule based messege filtering and blacklist management for online social networkeSAT Publishing House
 
Deep neural networks for matching online social networking profiles
Deep neural networks for matching online social networking profilesDeep neural networks for matching online social networking profiles
Deep neural networks for matching online social networking profilesTraian Rebedea
 
The Statement of Conjunctive and Disjunctive Queries in Object Oriented Datab...
The Statement of Conjunctive and Disjunctive Queries in Object Oriented Datab...The Statement of Conjunctive and Disjunctive Queries in Object Oriented Datab...
The Statement of Conjunctive and Disjunctive Queries in Object Oriented Datab...Editor IJCATR
 
Semantic Search keynote at CORIA 2015
Semantic Search keynote at CORIA 2015Semantic Search keynote at CORIA 2015
Semantic Search keynote at CORIA 2015Peter Mika
 
Presentation at School of Information and Library Science, UNC, USA
Presentation at School of Information and Library Science, UNC, USAPresentation at School of Information and Library Science, UNC, USA
Presentation at School of Information and Library Science, UNC, USACharles Seger
 
Record matching over query results
Record matching over query resultsRecord matching over query results
Record matching over query resultsambitlick
 
Identification of inference attacks on private Information from Social Networks
Identification of inference attacks on private Information from Social NetworksIdentification of inference attacks on private Information from Social Networks
Identification of inference attacks on private Information from Social Networkseditorjournal
 
Sinks Method Paper Presentation @ Duke Political Networks Conference 2010
Sinks Method Paper Presentation @ Duke Political Networks Conference 2010Sinks Method Paper Presentation @ Duke Political Networks Conference 2010
Sinks Method Paper Presentation @ Duke Political Networks Conference 2010Daniel Katz
 
Webometrics 1.0 - from AltaVista to Small Worlds and Genre Drift
Webometrics 1.0 - from AltaVista to Small Worlds and Genre DriftWebometrics 1.0 - from AltaVista to Small Worlds and Genre Drift
Webometrics 1.0 - from AltaVista to Small Worlds and Genre Driftguest5ec99a
 
Survey on personality predication methods using AI
Survey on personality predication methods using AISurvey on personality predication methods using AI
Survey on personality predication methods using AIIJAEMSJORNAL
 
Harnessing the Social Web: The Science of Identity Disambiguation
Harnessing the Social Web: The Science of Identity DisambiguationHarnessing the Social Web: The Science of Identity Disambiguation
Harnessing the Social Web: The Science of Identity DisambiguationMatthew Rowe
 
Social Network Analysis based on MOOC's (Massive Open Online Classes)
Social Network Analysis based on MOOC's (Massive Open Online Classes)Social Network Analysis based on MOOC's (Massive Open Online Classes)
Social Network Analysis based on MOOC's (Massive Open Online Classes)ShankarPrasaadRajama
 
Benchmarking the Privacy-­Preserving People Search
Benchmarking the Privacy-­Preserving People SearchBenchmarking the Privacy-­Preserving People Search
Benchmarking the Privacy-­Preserving People SearchDaqing He
 
Information Access on Social Web
Information Access on Social WebInformation Access on Social Web
Information Access on Social WebDaqing He
 

Was ist angesagt? (19)

[IJET-V2I3P19] Authors: Priyanka Sharma
[IJET-V2I3P19] Authors: Priyanka Sharma[IJET-V2I3P19] Authors: Priyanka Sharma
[IJET-V2I3P19] Authors: Priyanka Sharma
 
Research Inventy : International Journal of Engineering and Science
Research Inventy : International Journal of Engineering and ScienceResearch Inventy : International Journal of Engineering and Science
Research Inventy : International Journal of Engineering and Science
 
Making the Web Searchable - Keynote ICWE 2015
Making the Web Searchable - Keynote ICWE 2015Making the Web Searchable - Keynote ICWE 2015
Making the Web Searchable - Keynote ICWE 2015
 
Rule based messege filtering and blacklist management for online social network
Rule based messege filtering and blacklist management for online social networkRule based messege filtering and blacklist management for online social network
Rule based messege filtering and blacklist management for online social network
 
Deep neural networks for matching online social networking profiles
Deep neural networks for matching online social networking profilesDeep neural networks for matching online social networking profiles
Deep neural networks for matching online social networking profiles
 
The Statement of Conjunctive and Disjunctive Queries in Object Oriented Datab...
The Statement of Conjunctive and Disjunctive Queries in Object Oriented Datab...The Statement of Conjunctive and Disjunctive Queries in Object Oriented Datab...
The Statement of Conjunctive and Disjunctive Queries in Object Oriented Datab...
 
Semantic Search keynote at CORIA 2015
Semantic Search keynote at CORIA 2015Semantic Search keynote at CORIA 2015
Semantic Search keynote at CORIA 2015
 
Presentation at School of Information and Library Science, UNC, USA
Presentation at School of Information and Library Science, UNC, USAPresentation at School of Information and Library Science, UNC, USA
Presentation at School of Information and Library Science, UNC, USA
 
Record matching over query results
Record matching over query resultsRecord matching over query results
Record matching over query results
 
B1803020915
B1803020915B1803020915
B1803020915
 
Identification of inference attacks on private Information from Social Networks
Identification of inference attacks on private Information from Social NetworksIdentification of inference attacks on private Information from Social Networks
Identification of inference attacks on private Information from Social Networks
 
Sinks Method Paper Presentation @ Duke Political Networks Conference 2010
Sinks Method Paper Presentation @ Duke Political Networks Conference 2010Sinks Method Paper Presentation @ Duke Political Networks Conference 2010
Sinks Method Paper Presentation @ Duke Political Networks Conference 2010
 
Webometrics 1.0 - from AltaVista to Small Worlds and Genre Drift
Webometrics 1.0 - from AltaVista to Small Worlds and Genre DriftWebometrics 1.0 - from AltaVista to Small Worlds and Genre Drift
Webometrics 1.0 - from AltaVista to Small Worlds and Genre Drift
 
Survey on personality predication methods using AI
Survey on personality predication methods using AISurvey on personality predication methods using AI
Survey on personality predication methods using AI
 
Research Paper On Correlation
Research Paper On CorrelationResearch Paper On Correlation
Research Paper On Correlation
 
Harnessing the Social Web: The Science of Identity Disambiguation
Harnessing the Social Web: The Science of Identity DisambiguationHarnessing the Social Web: The Science of Identity Disambiguation
Harnessing the Social Web: The Science of Identity Disambiguation
 
Social Network Analysis based on MOOC's (Massive Open Online Classes)
Social Network Analysis based on MOOC's (Massive Open Online Classes)Social Network Analysis based on MOOC's (Massive Open Online Classes)
Social Network Analysis based on MOOC's (Massive Open Online Classes)
 
Benchmarking the Privacy-­Preserving People Search
Benchmarking the Privacy-­Preserving People SearchBenchmarking the Privacy-­Preserving People Search
Benchmarking the Privacy-­Preserving People Search
 
Information Access on Social Web
Information Access on Social WebInformation Access on Social Web
Information Access on Social Web
 

Andere mochten auch

An approach to decrease dimensions of drift
An approach to decrease dimensions of driftAn approach to decrease dimensions of drift
An approach to decrease dimensions of driftijcsa
 
An efficient recovery mechanism
An efficient recovery mechanismAn efficient recovery mechanism
An efficient recovery mechanismijcsa
 
ANGLE ROUTING:A FULLY ADAPTIVE PACKET ROUTING FOR NOC
ANGLE ROUTING:A FULLY ADAPTIVE PACKET ROUTING FOR NOCANGLE ROUTING:A FULLY ADAPTIVE PACKET ROUTING FOR NOC
ANGLE ROUTING:A FULLY ADAPTIVE PACKET ROUTING FOR NOCijcsa
 
Computational science guided soft
Computational science guided softComputational science guided soft
Computational science guided softijcsa
 
A SERIAL COMPUTING MODEL OF AGENT ENABLED MINING OF GLOBALLY STRONG ASSOCIATI...
A SERIAL COMPUTING MODEL OF AGENT ENABLED MINING OF GLOBALLY STRONG ASSOCIATI...A SERIAL COMPUTING MODEL OF AGENT ENABLED MINING OF GLOBALLY STRONG ASSOCIATI...
A SERIAL COMPUTING MODEL OF AGENT ENABLED MINING OF GLOBALLY STRONG ASSOCIATI...ijcsa
 
Recognition of optical images based on the
Recognition of optical images based on theRecognition of optical images based on the
Recognition of optical images based on theijcsa
 
ON APPROACH OF OPTIMIZATION OF FORMATION OF INHOMOGENOUS DISTRIBUTIONS OF DOP...
ON APPROACH OF OPTIMIZATION OF FORMATION OF INHOMOGENOUS DISTRIBUTIONS OF DOP...ON APPROACH OF OPTIMIZATION OF FORMATION OF INHOMOGENOUS DISTRIBUTIONS OF DOP...
ON APPROACH OF OPTIMIZATION OF FORMATION OF INHOMOGENOUS DISTRIBUTIONS OF DOP...ijcsa
 
Techniques of lattice based
Techniques of lattice basedTechniques of lattice based
Techniques of lattice basedijcsa
 
EMPIRICAL APPLICATION OF SIMULATED ANNEALING USING OBJECT-ORIENTED METRICS TO...
EMPIRICAL APPLICATION OF SIMULATED ANNEALING USING OBJECT-ORIENTED METRICS TO...EMPIRICAL APPLICATION OF SIMULATED ANNEALING USING OBJECT-ORIENTED METRICS TO...
EMPIRICAL APPLICATION OF SIMULATED ANNEALING USING OBJECT-ORIENTED METRICS TO...ijcsa
 
A NOVEL BINNING AND INDEXING APPROACH USING HAND GEOMETRY AND PALM PRINT TO E...
A NOVEL BINNING AND INDEXING APPROACH USING HAND GEOMETRY AND PALM PRINT TO E...A NOVEL BINNING AND INDEXING APPROACH USING HAND GEOMETRY AND PALM PRINT TO E...
A NOVEL BINNING AND INDEXING APPROACH USING HAND GEOMETRY AND PALM PRINT TO E...ijcsa
 
STABILIZATION AT UPRIGHT EQUILIBRIUM POSITION OF A DOUBLE INVERTED PENDULUM W...
STABILIZATION AT UPRIGHT EQUILIBRIUM POSITION OF A DOUBLE INVERTED PENDULUM W...STABILIZATION AT UPRIGHT EQUILIBRIUM POSITION OF A DOUBLE INVERTED PENDULUM W...
STABILIZATION AT UPRIGHT EQUILIBRIUM POSITION OF A DOUBLE INVERTED PENDULUM W...ijcsa
 
A SURVEY OF MACHINE LEARNING TECHNIQUES FOR SENTIMENT CLASSIFICATION
A SURVEY OF MACHINE LEARNING TECHNIQUES FOR SENTIMENT CLASSIFICATIONA SURVEY OF MACHINE LEARNING TECHNIQUES FOR SENTIMENT CLASSIFICATION
A SURVEY OF MACHINE LEARNING TECHNIQUES FOR SENTIMENT CLASSIFICATIONijcsa
 
COVERAGE OPTIMIZED AND TIME EFFICIENT LOCAL SEARCH BETWEENNESS ROUTING FOR HE...
COVERAGE OPTIMIZED AND TIME EFFICIENT LOCAL SEARCH BETWEENNESS ROUTING FOR HE...COVERAGE OPTIMIZED AND TIME EFFICIENT LOCAL SEARCH BETWEENNESS ROUTING FOR HE...
COVERAGE OPTIMIZED AND TIME EFFICIENT LOCAL SEARCH BETWEENNESS ROUTING FOR HE...ijcsa
 
SBVRLDNACOMP:AN EFFECTIVE DNA SEQUENCE COMPRESSION ALGORITHM
 SBVRLDNACOMP:AN EFFECTIVE DNA SEQUENCE COMPRESSION ALGORITHM SBVRLDNACOMP:AN EFFECTIVE DNA SEQUENCE COMPRESSION ALGORITHM
SBVRLDNACOMP:AN EFFECTIVE DNA SEQUENCE COMPRESSION ALGORITHMijcsa
 
An approach to decrease dimentions of logical
An approach to decrease dimentions of logicalAn approach to decrease dimentions of logical
An approach to decrease dimentions of logicalijcsa
 
SUCCESSIVE LINEARIZATION SOLUTION OF A BOUNDARY LAYER CONVECTIVE HEAT TRANSFE...
SUCCESSIVE LINEARIZATION SOLUTION OF A BOUNDARY LAYER CONVECTIVE HEAT TRANSFE...SUCCESSIVE LINEARIZATION SOLUTION OF A BOUNDARY LAYER CONVECTIVE HEAT TRANSFE...
SUCCESSIVE LINEARIZATION SOLUTION OF A BOUNDARY LAYER CONVECTIVE HEAT TRANSFE...ijcsa
 
Effects of missing observations on
Effects of missing observations onEffects of missing observations on
Effects of missing observations onijcsa
 

Andere mochten auch (20)

An approach to decrease dimensions of drift
An approach to decrease dimensions of driftAn approach to decrease dimensions of drift
An approach to decrease dimensions of drift
 
An efficient recovery mechanism
An efficient recovery mechanismAn efficient recovery mechanism
An efficient recovery mechanism
 
ANGLE ROUTING:A FULLY ADAPTIVE PACKET ROUTING FOR NOC
ANGLE ROUTING:A FULLY ADAPTIVE PACKET ROUTING FOR NOCANGLE ROUTING:A FULLY ADAPTIVE PACKET ROUTING FOR NOC
ANGLE ROUTING:A FULLY ADAPTIVE PACKET ROUTING FOR NOC
 
Computational science guided soft
Computational science guided softComputational science guided soft
Computational science guided soft
 
A SERIAL COMPUTING MODEL OF AGENT ENABLED MINING OF GLOBALLY STRONG ASSOCIATI...
A SERIAL COMPUTING MODEL OF AGENT ENABLED MINING OF GLOBALLY STRONG ASSOCIATI...A SERIAL COMPUTING MODEL OF AGENT ENABLED MINING OF GLOBALLY STRONG ASSOCIATI...
A SERIAL COMPUTING MODEL OF AGENT ENABLED MINING OF GLOBALLY STRONG ASSOCIATI...
 
Recognition of optical images based on the
Recognition of optical images based on theRecognition of optical images based on the
Recognition of optical images based on the
 
ON APPROACH OF OPTIMIZATION OF FORMATION OF INHOMOGENOUS DISTRIBUTIONS OF DOP...
ON APPROACH OF OPTIMIZATION OF FORMATION OF INHOMOGENOUS DISTRIBUTIONS OF DOP...ON APPROACH OF OPTIMIZATION OF FORMATION OF INHOMOGENOUS DISTRIBUTIONS OF DOP...
ON APPROACH OF OPTIMIZATION OF FORMATION OF INHOMOGENOUS DISTRIBUTIONS OF DOP...
 
Techniques of lattice based
Techniques of lattice basedTechniques of lattice based
Techniques of lattice based
 
EMPIRICAL APPLICATION OF SIMULATED ANNEALING USING OBJECT-ORIENTED METRICS TO...
EMPIRICAL APPLICATION OF SIMULATED ANNEALING USING OBJECT-ORIENTED METRICS TO...EMPIRICAL APPLICATION OF SIMULATED ANNEALING USING OBJECT-ORIENTED METRICS TO...
EMPIRICAL APPLICATION OF SIMULATED ANNEALING USING OBJECT-ORIENTED METRICS TO...
 
A NOVEL BINNING AND INDEXING APPROACH USING HAND GEOMETRY AND PALM PRINT TO E...
A NOVEL BINNING AND INDEXING APPROACH USING HAND GEOMETRY AND PALM PRINT TO E...A NOVEL BINNING AND INDEXING APPROACH USING HAND GEOMETRY AND PALM PRINT TO E...
A NOVEL BINNING AND INDEXING APPROACH USING HAND GEOMETRY AND PALM PRINT TO E...
 
STABILIZATION AT UPRIGHT EQUILIBRIUM POSITION OF A DOUBLE INVERTED PENDULUM W...
STABILIZATION AT UPRIGHT EQUILIBRIUM POSITION OF A DOUBLE INVERTED PENDULUM W...STABILIZATION AT UPRIGHT EQUILIBRIUM POSITION OF A DOUBLE INVERTED PENDULUM W...
STABILIZATION AT UPRIGHT EQUILIBRIUM POSITION OF A DOUBLE INVERTED PENDULUM W...
 
A SURVEY OF MACHINE LEARNING TECHNIQUES FOR SENTIMENT CLASSIFICATION
A SURVEY OF MACHINE LEARNING TECHNIQUES FOR SENTIMENT CLASSIFICATIONA SURVEY OF MACHINE LEARNING TECHNIQUES FOR SENTIMENT CLASSIFICATION
A SURVEY OF MACHINE LEARNING TECHNIQUES FOR SENTIMENT CLASSIFICATION
 
COVERAGE OPTIMIZED AND TIME EFFICIENT LOCAL SEARCH BETWEENNESS ROUTING FOR HE...
COVERAGE OPTIMIZED AND TIME EFFICIENT LOCAL SEARCH BETWEENNESS ROUTING FOR HE...COVERAGE OPTIMIZED AND TIME EFFICIENT LOCAL SEARCH BETWEENNESS ROUTING FOR HE...
COVERAGE OPTIMIZED AND TIME EFFICIENT LOCAL SEARCH BETWEENNESS ROUTING FOR HE...
 
SBVRLDNACOMP:AN EFFECTIVE DNA SEQUENCE COMPRESSION ALGORITHM
 SBVRLDNACOMP:AN EFFECTIVE DNA SEQUENCE COMPRESSION ALGORITHM SBVRLDNACOMP:AN EFFECTIVE DNA SEQUENCE COMPRESSION ALGORITHM
SBVRLDNACOMP:AN EFFECTIVE DNA SEQUENCE COMPRESSION ALGORITHM
 
An approach to decrease dimentions of logical
An approach to decrease dimentions of logicalAn approach to decrease dimentions of logical
An approach to decrease dimentions of logical
 
SUCCESSIVE LINEARIZATION SOLUTION OF A BOUNDARY LAYER CONVECTIVE HEAT TRANSFE...
SUCCESSIVE LINEARIZATION SOLUTION OF A BOUNDARY LAYER CONVECTIVE HEAT TRANSFE...SUCCESSIVE LINEARIZATION SOLUTION OF A BOUNDARY LAYER CONVECTIVE HEAT TRANSFE...
SUCCESSIVE LINEARIZATION SOLUTION OF A BOUNDARY LAYER CONVECTIVE HEAT TRANSFE...
 
Effects of missing observations on
Effects of missing observations onEffects of missing observations on
Effects of missing observations on
 
Residence 305
Residence 305Residence 305
Residence 305
 
A S T A53 O P E N E S P
A S T  A53 O P E N  E S PA S T  A53 O P E N  E S P
A S T A53 O P E N E S P
 
A S T A S43 E S P
A S T  A S43 E S PA S T  A S43 E S P
A S T A S43 E S P
 

Ähnlich wie Empirical evaluation of web based personal

IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...ijceronline
 
Challenging Issues and Similarity Measures for Web Document Clustering
Challenging Issues and Similarity Measures for Web Document ClusteringChallenging Issues and Similarity Measures for Web Document Clustering
Challenging Issues and Similarity Measures for Web Document ClusteringIOSR Journals
 
Lexical Pattern- Based Approach for Extracting Name Aliases
Lexical Pattern- Based Approach for Extracting Name AliasesLexical Pattern- Based Approach for Extracting Name Aliases
Lexical Pattern- Based Approach for Extracting Name AliasesIJMER
 
Mining and Analyzing Academic Social Networks
Mining and Analyzing Academic Social NetworksMining and Analyzing Academic Social Networks
Mining and Analyzing Academic Social NetworksEditor IJCATR
 
RAPID INDUCTION OF MULTIPLE TAXONOMIES FOR ENHANCED FACETED TEXT BROWSING
RAPID INDUCTION OF MULTIPLE TAXONOMIES FOR ENHANCED FACETED TEXT BROWSINGRAPID INDUCTION OF MULTIPLE TAXONOMIES FOR ENHANCED FACETED TEXT BROWSING
RAPID INDUCTION OF MULTIPLE TAXONOMIES FOR ENHANCED FACETED TEXT BROWSINGijaia
 
Cluster Based Web Search Using Support Vector Machine
Cluster Based Web Search Using Support Vector MachineCluster Based Web Search Using Support Vector Machine
Cluster Based Web Search Using Support Vector MachineCSCJournals
 
Survey on text mining networks
Survey on text mining networksSurvey on text mining networks
Survey on text mining networksTeerapat Saeting
 
Design issues in collecting the data on Ego-Centered Social Networks on the W...
Design issues in collecting the data on Ego-Centered Social Networks on the W...Design issues in collecting the data on Ego-Centered Social Networks on the W...
Design issues in collecting the data on Ego-Centered Social Networks on the W...Gašper Koren
 
MDS 2011 Paper: An Unsupervised Approach to Discovering and Disambiguating So...
MDS 2011 Paper: An Unsupervised Approach to Discovering and Disambiguating So...MDS 2011 Paper: An Unsupervised Approach to Discovering and Disambiguating So...
MDS 2011 Paper: An Unsupervised Approach to Discovering and Disambiguating So...Carlton Northern
 
TOWARDS A MULTI-FEATURE ENABLED APPROACH FOR OPTIMIZED EXPERT SEEKING
TOWARDS A MULTI-FEATURE ENABLED APPROACH FOR OPTIMIZED EXPERT SEEKINGTOWARDS A MULTI-FEATURE ENABLED APPROACH FOR OPTIMIZED EXPERT SEEKING
TOWARDS A MULTI-FEATURE ENABLED APPROACH FOR OPTIMIZED EXPERT SEEKINGcsandit
 
Entity Resolution in Large Graphs
Entity Resolution in Large GraphsEntity Resolution in Large Graphs
Entity Resolution in Large GraphsApurva Kumar
 
The Revolution Of Cloud Computing
The Revolution Of Cloud ComputingThe Revolution Of Cloud Computing
The Revolution Of Cloud ComputingCarmen Sanborn
 
Document Retrieval System, a Case Study
Document Retrieval System, a Case StudyDocument Retrieval System, a Case Study
Document Retrieval System, a Case StudyIJERA Editor
 
EXTRACTING ARABIC RELATIONS FROM THE WEB
EXTRACTING ARABIC RELATIONS FROM THE WEBEXTRACTING ARABIC RELATIONS FROM THE WEB
EXTRACTING ARABIC RELATIONS FROM THE WEBijcsit
 
Correlation Coefficient Based Average Textual Similarity Model for Informatio...
Correlation Coefficient Based Average Textual Similarity Model for Informatio...Correlation Coefficient Based Average Textual Similarity Model for Informatio...
Correlation Coefficient Based Average Textual Similarity Model for Informatio...IOSR Journals
 

Ähnlich wie Empirical evaluation of web based personal (20)

IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
 
J017145559
J017145559J017145559
J017145559
 
Challenging Issues and Similarity Measures for Web Document Clustering
Challenging Issues and Similarity Measures for Web Document ClusteringChallenging Issues and Similarity Measures for Web Document Clustering
Challenging Issues and Similarity Measures for Web Document Clustering
 
Lexical Pattern- Based Approach for Extracting Name Aliases
Lexical Pattern- Based Approach for Extracting Name AliasesLexical Pattern- Based Approach for Extracting Name Aliases
Lexical Pattern- Based Approach for Extracting Name Aliases
 
Mining and Analyzing Academic Social Networks
Mining and Analyzing Academic Social NetworksMining and Analyzing Academic Social Networks
Mining and Analyzing Academic Social Networks
 
RAPID INDUCTION OF MULTIPLE TAXONOMIES FOR ENHANCED FACETED TEXT BROWSING
RAPID INDUCTION OF MULTIPLE TAXONOMIES FOR ENHANCED FACETED TEXT BROWSINGRAPID INDUCTION OF MULTIPLE TAXONOMIES FOR ENHANCED FACETED TEXT BROWSING
RAPID INDUCTION OF MULTIPLE TAXONOMIES FOR ENHANCED FACETED TEXT BROWSING
 
Cluster Based Web Search Using Support Vector Machine
Cluster Based Web Search Using Support Vector MachineCluster Based Web Search Using Support Vector Machine
Cluster Based Web Search Using Support Vector Machine
 
05567101
0556710105567101
05567101
 
Survey on text mining networks
Survey on text mining networksSurvey on text mining networks
Survey on text mining networks
 
Design issues in collecting the data on Ego-Centered Social Networks on the W...
Design issues in collecting the data on Ego-Centered Social Networks on the W...Design issues in collecting the data on Ego-Centered Social Networks on the W...
Design issues in collecting the data on Ego-Centered Social Networks on the W...
 
MDS 2011 Paper: An Unsupervised Approach to Discovering and Disambiguating So...
MDS 2011 Paper: An Unsupervised Approach to Discovering and Disambiguating So...MDS 2011 Paper: An Unsupervised Approach to Discovering and Disambiguating So...
MDS 2011 Paper: An Unsupervised Approach to Discovering and Disambiguating So...
 
TOWARDS A MULTI-FEATURE ENABLED APPROACH FOR OPTIMIZED EXPERT SEEKING
TOWARDS A MULTI-FEATURE ENABLED APPROACH FOR OPTIMIZED EXPERT SEEKINGTOWARDS A MULTI-FEATURE ENABLED APPROACH FOR OPTIMIZED EXPERT SEEKING
TOWARDS A MULTI-FEATURE ENABLED APPROACH FOR OPTIMIZED EXPERT SEEKING
 
Entity Resolution in Large Graphs
Entity Resolution in Large GraphsEntity Resolution in Large Graphs
Entity Resolution in Large Graphs
 
The Revolution Of Cloud Computing
The Revolution Of Cloud ComputingThe Revolution Of Cloud Computing
The Revolution Of Cloud Computing
 
Document Retrieval System, a Case Study
Document Retrieval System, a Case StudyDocument Retrieval System, a Case Study
Document Retrieval System, a Case Study
 
Js3616841689
Js3616841689Js3616841689
Js3616841689
 
50320140501002
5032014050100250320140501002
50320140501002
 
EXTRACTING ARABIC RELATIONS FROM THE WEB
EXTRACTING ARABIC RELATIONS FROM THE WEBEXTRACTING ARABIC RELATIONS FROM THE WEB
EXTRACTING ARABIC RELATIONS FROM THE WEB
 
Correlation Coefficient Based Average Textual Similarity Model for Informatio...
Correlation Coefficient Based Average Textual Similarity Model for Informatio...Correlation Coefficient Based Average Textual Similarity Model for Informatio...
Correlation Coefficient Based Average Textual Similarity Model for Informatio...
 
C017161925
C017161925C017161925
C017161925
 

Kürzlich hochgeladen

Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 

Kürzlich hochgeladen (20)

Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 

Empirical evaluation of web based personal

  • 1. International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015 DOI:10.5121/ijcsa.2015.5404 39 Empirical Evaluation of Web Based Personal Name Disambiguation Techniques Dr.A.Suruliandi1 and Selvaperumal.P2 1 Professor, Department of Computer science and Engineering, ManonmaniamSundaranarUniversity, Tirunelveli, India 2 Research scholar, Department of Computer science and Engineering, ManonmaniamSundaranar University, Tirunelveli, India Abstract Personal name ambiguity in the web arises when different people share the same name in the web. Resolving the name ambiguity in the web is useful in a number of applications like Information retrieval, Information extraction and Question and answering system etc. A general name disambiguation process involves clustering the web pages such that each cluster represents an ambiguous person. In this article, five important name disambiguation techniques that make use of Hierarchical agglomerative clustering are empirically compared. Experiments were conducted on the benchmark dataset and their performances are evaluated in terms of purity, Inverse purity and f-score. Results show the method that uses features like Lexical, linguistic and personal information hierarchical agglomerative clustering performs better than disambiguation using other techniques. Keywords Personal name disambiguation, Entity name resolution, web page clustering. 1.Introduction In the web, entities names are highly ambiguous. There are two types of ambiguities as far as entity names are concerned. A single entity having multiple names and multiple entities sharing a single name. Solving the former problem is known as alias extraction [1], and the latter is known as personal name disambiguation.Name ambiguity refers to the problem of two or more entities sharing the same name. This is in contrast to alias names, where several name refers to the same entity. Name disambiguation involves correctly predicting the number of persons sharing the ambiguous name and clustering the web pages such that each person’s document needs to be in a single cluster. Name disambiguation is a special case of entity name disambiguation.Sub fields of entity disambiguation includes name disambiguation, place name disambiguation [2], organizational name disambiguation [3] etc. In general personal name disambiguation arises when two or more persons are referred by the same name. Name ambiguity arises when the same name refers to multiple entities. Retrieving all the entities that the name refers to decreases the precision of an Information retrieval system. Similarly In sentimental analysis the presence of ambiguous name decreases the accuracy of the system. In Question and answering system, the presence of name ambiguity makes it difficult to retrieve relevant answers to the questions from the web. Similarly in data integration, the presence of ambiguity challenges the integrity of the data.
  • 2. International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015 40 People search engines like’411.com’, ’PeopleFinder’, ’people.yahoo.com’ are increasingly becoming popular because of offering a capability to identifying an individual over the vast web space. People search, [4]involves name disambiguation is an inherent step. Name disambiguation has been a challenging problem in many application areas like Information retrieval, Social network extraction, Ontology integration, Information extraction, Question and answering system etc. Disambiguating a name will in turn improve the efficiency the aforementioned systems. A closely related area is social network extraction in which an inherent step is personal name disambiguation. In social network extraction, [5] name disambiguation is an integral part. A large number of social networking sites are offering people search facility to enable users to identify a specific individual who is sharing the name with several other persons. Name ambiguity also arises in the biomedical text documents. The ambiguity arises because of genes and proteins sharing the same name [6]. Most of the existing method uses bag of words model and named entities for web based name disambiguation. Methods employed may be local, which involves using words, named entities and keywords as features [7][8]. Some methods employ global features like usage of knowledge base and external corpora for the purpose [9]. Since there is no certainty that the all the words are unique to web pages of a particular individual, it is difficult to use just words as discriminative features. One alternative of this is limiting the scope of the words taken for consideration [10].Named entities are considered to be the efficient feature for name disambiguation task as it can uniquely identify as person. The important step in all of the name disambiguation process is finding the discriminative features for the individual sharing the ambiguous name.Features extracted can be direct features and indirect features. Direct features extraction includes extracting n-grams, named entities, hyperlinks from the web pages of relevant persons and indirect features includes extracting features from the external knowledge bases. Web People search workshops (WePS) is an evaluation campaign focuses on personal name disambiguation task. First WePS was conducted on 2007 followed by second on 2009 and third on 2010. WePS workshops involves two tracks namely clustering documents to solve the personal name ambiguity and extracting various information like dob, designation, Birthplace, occupation, phone etc for each person sharing the ambiguous name. Most of the systems in WePS uses named entities as main feature for name disambiguation tasks. 1.1.Relative work Most of the previous works related to name disambiguation involves unsupervised clustering [11] [12] [13].RonBekkerman,, et al [14] used link structure and in another method they used Agglomerative/conglomerative double clustering for name disambiguation process. Their focus is on disambiguating name in social network and is quite different from disambiguating peoples’ name in the web. Masaki et al, [8] proposed a two stage personal name disambiguation technique involving hard clustering and soft clustering. The extracted features like Nouns, Compound keywords and Urlsfrom the data and first applied Hierarchical agglomerative clustering to cluster documents of same person. Compound keywords were then extracted from each of these clusters and are used as features for second stage soft clustering. There have been attempts to disambiguate names using knowledge of Wikipedia [15]. They used knowledge from Wikipedia to disambiguate names. They mapped ambiguous names to Wikipedia entry, making it difficult to apply for all the people names because of small coverage of personal names by Wikipedia.Other commonly used methods include usage of biographic features like name, occupation, location etc for disambiguation process. The problem with using biographic information from name disambiguation is, it is often impossible to extract the required biographic information for each person from the web. If a person is less visible in the web then it is not possible to extract all the biographic information required for disambiguation process. Zhengzhong Liu et al, [16] proposed a clustering algorithm for name disambiguation using topic information.
  • 3. International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015 41 Their method extracts web page title, URL, meta data, words in the web page etc as feature vector followed by merging of clusters sharing similar topics. Yiming Lu et al, [17] used web corpora to increase the accuracy of author name disambiguation system. They considered if two papers appear in the same page, both of it belong to the same author. Using web connection as well as a series of constraints they disambiguated author names. V.Kalashnikov [18]exploits a variety of semantic information like named entities, hyperlinks etc to construct a Entity-Relationship graph to disambiguate names in web pages. They proposed a method for web people search that makes use of name disambiguation to facilitate better people search. kazunari Sugiyama et al,[19] used Semi supervised approach for name disambiguation in the web. He uses seed pages to guide the clustering process. Zhao Lu et al, [20]proposed ontology based name disambiguation model. They first constructed an ontology with large set of instances followed by creation of temporary instance and extraction of features from the web. The similarity between temporary instance and the instance in the ontology to find the appropriate personal name. Xianpei Han et al, [21] used reference entity tables mined from the web for name disambiguation. They used web querying to construct reference table then personal names are disambiguated by linking them to the entries in the reference table. The exhaustive literature survey conducted shows that resolving name disambiguation problem is far from over because of the following challenges. 1. The number of persons sharing the ambiguous names keeps increasing as the time go by. 2. Challenge of finding the number of persons sharing the ambiguous name (Which is required before performing the clustering). 3. The unstructured nature of the web content and difficulty in extracting features for clustering. 4. The cost and resource required to construct knowledge base if external resources are used for resolving name ambiguity. 1.2.Motivation and Justification Personal name queries accounts for 5% of search queries in search engine[22].Over the years, a range of techniques like usage of ontologies, reference table and extraction of different features etc have been proposed for web based name disambiguation problem. To the best of authors’ knowledge, there are not many papersevaluating the performance of existing name disambiguation techniques. Motivated by this, in this paper five significant work of name disambiguation is empirically evaluated using benchmark dataset. Hierarchical agglomerative clustering is the commonly used clustering algorithms for name disambiguation process. Most of the method uses internal features like tokens, nouns, and biographic features etc as discriminative features for disambiguation process. Justified by this, in this paper five important works of name disambiguation process is empirically evaluated. 2.Method 2.1.Term-Entity based name disambiguation Term-entities are set of terms or named entities that uniquely represents a person. Bollegala et al,[7] proposed an unsupervised method based on term-entity model for automatically disambiguating people name. They used nouns, nouns phrases and keywords extracting from the web for the process. They represented each person sharing the ambiguous name in term-entity model and used group average agglomerative clustering to cluster documents that belong to the same person. Fig 1 shows various steps involved in the process.
  • 4. International Journal on Computational Science Fig 1: Term The name disambiguation method Step 1: query the search engine with “personal name” and get top ‘n’ Step 2: Extract terms and named entities ambiguous personal name. Step 3: Group average agglomerative clustering is performed cluster (i.e hard clustering). Terminate the clustering process when the a predefined value. Care is taken such that the number of clusters formed is equal to the number of the ambiguous name in the web page collection. cluster quality, which is the optimum number of clusters to be with the help of experimental dataset. Experiments have shown that the use of noun and noun phrases alone forms good feature for clustering process. Since the method relies mainly on nouns and terms, the method is simpl robust for name disambiguation process. expensive and using all these terms may not always yield better clustering results. value computed using experiment dataset will not always suitable for all the dataset. Disambiguation Where, A(i,j) is the number of person i, predicted as j, and A(i,i) correctly predicted as i. ournal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015 Term-Entity based name disambiguation process name disambiguation method involves the following steps. query the search engine with “personal name” and get top ‘n’ web pages. Extract terms and named entities in the web pages using POS tagger that contains the Group average agglomerative clustering is performed to assign each document in a (i.e hard clustering). Terminate the clustering process when the clustering quality Care is taken such that the number of clusters formed is equal to the number of persons sharing the ambiguous name in the web page collection. The method uses a predefined threshold optimum number of clusters to be formed. The threshold is calculated dataset. Experiments have shown that the use of noun and noun phrases alone forms good feature for Since the method relies mainly on nouns and terms, the method is simpl robust for name disambiguation process. Extracting a large number of terms is computationally using all these terms may not always yield better clustering results. The using experiment dataset will not always suitable for all the dataset. Disambiguation Accuracy ൌ ∑ ‫ܣ‬ሺ݅, ݅ሻ. ௜ ∑ ‫ܣ‬ሺ݅, ݆ሻ. ௜,௝ person i, predicted as j, and A(i,i) is the number of persons CSA) Vol.5, No.4, August 2015 42 that contains the to assign each document in a clustering quality reaches persons sharing The method uses a predefined threshold value of formed. The threshold is calculated Experiments have shown that the use of noun and noun phrases alone forms good feature for Since the method relies mainly on nouns and terms, the method is simple and Extracting a large number of terms is computationally The predefined persons i
  • 5. International Journal on Computational Science 2.2.Two stage clustering Masaki Ikeda et al, [8] proposed two stage clusteri stage clustering. The process involved is represented in fig from the whole document gives better results than fixing a limited window size for feature extraction. Fig 2: Two stage clustering based name disambiguation process The process involved is explained in the following p Step 1: In the web page collection, (compound nouns), link features( Step2: Extracted features are then used for hierarchical agglomerative clustering) and use group average Step 3: Extract compound keywords again from the resulting cluster, Step4: Perform clustering process of clusters are generated. Since Entity names, compound keywords, urls are important features in name disambiguation process, this method exploits these potent features for the process. The use of two stage clustering hard followed by soft clustering clustering approach does not always suits well for name disambiguation because of the web pages belongs to a single person sharing the ambiguous name containing names of multiple persons ournal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015 ] proposed two stage clustering name disambiguation process involving two The process involved is represented in fig 2. He found that extracting features from the whole document gives better results than fixing a limited window size for feature Two stage clustering based name disambiguation process The process involved is explained in the following procedure. In the web page collection, extract named entity features, compound keywords (compound nouns), link features(URL’s) features are then used for hierarchical agglomerative clustering ( clustering) and use group average method for finding distance between clusters. Extract compound keywords again from the resulting cluster, erform clustering process with the extracted compound keywords until required number Since Entity names, compound keywords, urls are important features in name disambiguation process, this method exploits these potent features for the process. The use of two stage clustering hard followed by soft clustering provides robust results than simple hard clustering. clustering approach does not always suits well for name disambiguation because in the web of the web pages belongs to a single person sharing the ambiguous name and a single web page containing names of multiple persons sharing the name. CSA) Vol.5, No.4, August 2015 43 ng name disambiguation process involving two He found that extracting features from the whole document gives better results than fixing a limited window size for feature extract named entity features, compound keywords clustering (Hard until required number Since Entity names, compound keywords, urls are important features in name disambiguation process, this method exploits these potent features for the process. The use of two stage clustering ple hard clustering. Soft in the web, most and a single web page
  • 6. International Journal on Computational Science 2.3.Disambiguation with rich features ErginElmaciogluet al, [10]method problem. It is based on extracting variety of features from the web features to the hierarchical agglomerative clustering algorithm. in name disambiguation using rich set of features. Fig 3: Name disambiguation using rich features set The process involves the following steps Step 1: Extract Tokens, Named entities, collection. Step 2: Perform Hierarchical agglomerative clustering with the feature vectors Step 3: Iterate the clustering process until The use of features in the web pages like Tokens, Named entities produce robust results in clustering. The method is simple and easy to implement. Name disambiguation cannot always be successful using the features in the web pages alone. The use of external feat from link structure can be better exploited for resolving name ambiguity. 2.4.Unsupervised name disambiguation S. Mann et al, [23]used rich set of biographic information as features for clustering process. The method uses language independent bootstrapping process for extracting these features. the use of different features ranging from simple words in the web pages to relevant words (tf-idf), biographical f in name disambiguation process with unsupervised learning. ournal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015 Disambiguation with rich features method views name disambiguation clustering as hard clustering It is based on extracting variety of features from the web and are input as vector of the hierarchical agglomerative clustering algorithm. Fig 3 shows the process involved name disambiguation using rich set of features. Name disambiguation using rich features set The process involves the following steps amed entities, Hostnames and domains, Page URL’s from the web page Perform Hierarchical agglomerative clustering with the feature vectors Iterate the clustering process until desirable number of clusters are generated. e of features in the web pages like Tokens, Named entities produce robust results in clustering. The method is simple and easy to implement. Name disambiguation cannot always be successful using the features in the web pages alone. The use of external features and features from link structure can be better exploited for resolving name ambiguity. Unsupervised name disambiguation used rich set of biographic information as features for clustering process. The method uses language independent bootstrapping process for extracting these features. the use of different features ranging from simple words in the web pages to im idf), biographical features and so on. Fig 4 shows the actual process involved in name disambiguation process with unsupervised learning. CSA) Vol.5, No.4, August 2015 44 as hard clustering and are input as vector of Fig 3 shows the process involved from the web page desirable number of clusters are generated. e of features in the web pages like Tokens, Named entities produce robust results in clustering. The method is simple and easy to implement. Name disambiguation cannot always be ures and features used rich set of biographic information as features for clustering process. The method uses language independent bootstrapping process for extracting these features. It explores important and Fig 4 shows the actual process involved
  • 7. International Journal on Computational Science Fig 4: Procedure involved in the process is as Step 1: Extract Nouns, basic biographic information (dob, occupation, designation etc) idf(term-inverse document frequency) from the document collection Step 2: Perform Hierarchical agglomerative clustering with the feature vectors Step 3: Terminate the clustering process when the desirable number of clusters are generated. The use of nouns, biographic information works well for name disambiguation process. The use of external knowledge base like Wikipedia may increase the accuracy of the 2.5.Disambiguation using Lexical, linguistic and personal Xiaojun Wan et al, [13]proposed a method called Web hawk for name disambiguation that leverages on Lexical linguistic and personal information features for clustering.Lexical features includes title, words in the web pages, linguistic personal information features includes number. Fig 5 shows the process involved in disambiguation process. process. Step 1: Pre-process the document collection that includes removing multiple persons sharing the ambiguous names, removing junk human names like ford, Bloomberg, Disney etc) Step 2: Extract lexical, linguistic and personal information features from the document collection ournal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015 Fig 4: Unsupervised name disambiguation Procedure involved in the process is as follows Nouns, basic biographic information (dob, occupation, designation etc) inverse document frequency) from the document collection. Perform Hierarchical agglomerative clustering with the feature vectors minate the clustering process when the desirable number of clusters are generated. The use of nouns, biographic information works well for name disambiguation process. The use of external knowledge base like Wikipedia may increase the accuracy of the system. Disambiguation using Lexical, linguistic and personal features proposed a method called Web hawk for name disambiguation that and personal information features for clustering.Lexical features words in the web pages, linguistic features includes Nouns, Noun phrases and personal information features includes person name, title, organization, email address and phone Fig 5 shows the process involved in disambiguation process. It involves the following process the document collection that includes removing web pages that refers to multiple persons sharing the ambiguous names, removing junk pages (pages referring to non human names like ford, Bloomberg, Disney etc). c and personal information features from the document collection CSA) Vol.5, No.4, August 2015 45 Nouns, basic biographic information (dob, occupation, designation etc), tf- minate the clustering process when the desirable number of clusters are generated. The use of nouns, biographic information works well for name disambiguation process. The use proposed a method called Web hawk for name disambiguation that and personal information features for clustering.Lexical features features includes Nouns, Noun phrases and person name, title, organization, email address and phone It involves the following web pages that refers to pages (pages referring to non- c and personal information features from the document collection.
  • 8. International Journal on Computational Science Step 3: Perform hierarchical agglomerative clustering similarity between two clusters as follows ܵ݅݉ Where pi and pj are the web pages and c1,c2 are clusters. Fig 5: Disambiguation using lexical,linguistic and personal 2.6.Hierarchical agglomerative clustering HAC,[24]is a bottom up clustering method singleton clusterand successively similar pairs are until a single cluster containing all the documents is formed. Step1: Assign each document to individual cluster Step 2: Construct distance matrix and look for the pair of matrix that are shortest from each other. Step 3: Merge the clusters that are closest to each other. Step 4: Find the distance between the new distance matrix. Step 5: Repeat the above steps until single cluster is formed or required number of clusters formed. ournal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015 Perform hierarchical agglomerative clustering and average-link method to compute the similarity between two clusters as follows ܵ݅݉ሺܿ1, ܿ1ሻ ൌ ∑ ∑ ‫݉݅ݏ‬ሺ‫݌‬௜, ‫݌‬௝ሻ௡ ௝ୀଵ ௠ ௜ୀଵ ݉ ൈ ݊ are the web pages and c1,c2 are clusters. isambiguation using lexical,linguistic and personal features Hierarchical agglomerative clustering is a bottom up clustering method in which initially each document is treated as and successively similar pairs are merged (agglomerate) to form larger cluster cluster containing all the documents is formed. It involves the following steps Assign each document to individual cluster Construct distance matrix and look for the pair of matrix that are shortest from each other. Merge the clusters that are closest to each other. Find the distance between the new cluster with all the other clusters and update the Repeat the above steps until single cluster is formed or required number of clusters CSA) Vol.5, No.4, August 2015 46 link method to compute the in which initially each document is treated as agglomerate) to form larger cluster It involves the following steps. Construct distance matrix and look for the pair of matrix that are shortest from each other. with all the other clusters and update the Repeat the above steps until single cluster is formed or required number of clusters are
  • 9. International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015 47 Since the required number of clusters is not known a priori, when to terminate the clustering process is a significant factor in performance of clustering process. The use of different distance metrics for measuring distance between clusters may produce different results. Thus selection of distance parameter forms a vital part in clustering process.Inthe following experiments, group average agglomerative clustering is used to find the average distance between clusters. In group average clustering, the distance between two clusters is calculated average distance between data points in both the clusters. In practical, it is impossible to known what the ‘n’ value beforehand and different techniques have been devised to handle this problem [7]. 3.Experimental results and Discussion 3.1.Dataset In this experiment, two datasets are used. Danushka et al, [7] used names Jim Clark and Michael Jackson as dataset and in this paper, this dataset is referred as dataset I. In that dataset, the number of persons sharing the name and the number of contexts has been downsized for our convenience. In order to use benchmark dataset for the evaluation purpose, Pedersen and Kulkarni’s data set is used. Pedersen and Kulkarni’s data set, [25] contains data for five ambiguous names collected from the web. For these five names, they have collected the data, data formatted and cleaned. This dataset is hereafter referred as dataset – II. Table 1 and 2 shows the summary of datasetI and II. Table 1: Summary of dataset – I. Table 2: Summary of dataset – II. 3.2 Performance metrics The performance of clustering can be measured by Purity index, Inverse purity Index and harmonic mean of purity and inverse purity index, [26].Purity index, [27] is the number of correctly assigned documents (majority) by number of documents. Name No. of persons sharing the name Number of contexts Jim Clark 6 230 Michael Jackson 2 90 Name No. of persons sharing the name Number of contexts Richard Alston 2 247 Sarah Connor 2 150 George Miller 3 286 Ted Pedersen 4 333 Michael Collins 4 359
  • 10. International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015 48 ܲ‫ݕݐ݅ݎݑ‬ ൌ ෍ |‫ܥ‬௜| ܰ ூ ݉ܽ‫ݔ‬௝ |‫ܥ‬௜ ∩ ‫ܮ‬௝| |‫ܥ‬௜| Where C the set of clusters,L is the list of classes and N is the total number of documents clustered. Purity penalizes noise in the cluster.Inverse purity index, [28] rewards grouping items together. ‫ݕݐ݅ݎݑܲ݁ݏݎ݁ݒ݊ܫ‬ ൌ ෍ |‫ܮ‬௜| ܰ ூ ݉ܽ‫ݔ‬௝ |‫ܮ‬௜ ∩ ‫ܥ‬௝| |‫ܮ‬௜| F-score is the harmonic mean of purity and inverse purity value. ‫ܨ‬ − ܵܿ‫݁ݎ݋‬ ൌ 2 ∗ ܲ‫ݕݐ݅ݎݑ‬ ∗ ‫ݕݐ݅ݎݑ݌݁ݏݎ݁ݒ݊ܫ‬ ܲ‫ݕݐ݅ݎݑ‬ + ‫ݕݐ݅ݎݑ݌݁ݏݎ݁ݒ݊ܫ‬ All of the above metrics ranges between 0 to 1, where 1 is the optimal value. 3.3.Results and Discussion 3.3.1 Experiments –I The hierarchical agglomerative clustering is performed to produce ‘n’ number of clusters where ‘n’ is the number of persons sharing the ambiguous name. For the experiment, a part of web page in each of top 100 web pages returned by search engine is used. The performance of various methods for dataset-I can be seen in the table 3. Table 3: Performance of various techniques on dataset - I. From the table 3 it is evident that the usage of lexical, linguistic and personal information discriminative features works well for name disambiguation task. Term-entity model that uses nouns and noun phrases as discriminative features also performs better in the disambiguation task. 3.3.2 Experiments – II The same experiment is repeated with dataset – II using a portion of top 100 web pages returned by search engine for the name query. The performance of various methods for dataset-I can be seen in the table 4. Method Purity Index Inverse purity index F-Score Term-entity based model 0.69 0.83 0.75 Two stage clustering 0.65 0.78 0.70 Disambiguation with rich features 0.53 0.73 0.61 Unsupervised name disambiguation 0.63 0.81 0.70 using Lexical, linguistic and personal features 0.79 0.82 0.80
  • 11. International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015 49 Table 4: Performance of various techniques on dataset - II. The performance of method that uses lexical, linguistic and personal info is evident from the table 4. This method outperforms the other methods both in terms of purity and inverse purity. Similar to experiment I, term entity based method performs well in the experiment II also. It can be inferred from the experiments that the usage of lexical, linguistic features along with nouns and noun phrases serves better for name discrimination tasks. 4.Conclusion Considering the wider application of name disambiguation process, in this paper five important techniques of name disambiguation process is empirically evaluated. Experiment is conducted using benchmark dataset and the results are evaluated using standard metrics like purity, inverse purity and f-score values. Results show that method that uses features like Lexical, linguistic and personal information hierarchical agglomerative clustering performs better than other name disambiguation techniques taken for evaluation. The work is prequel for proposing a novel web based name disambiguation technique. Disambiguating place names, product names etc are interesting topic to work on. The usage of name disambiguation process in a large number of applications like information retrieval, information extraction, document merging, question and answering etc are compelling areas to work on. Reference 1. Bollegala, Danushka, Yutaka Matsuo, and Mitsuru Ishizuka. "Automatic discovery of personal name aliases from the web." IEEE Transactions on Knowledge and Data Engineering 23, no. 6 (2011): 831- 844. 2. Overell, Simon, Joao Magalhaes, and Stefan Rüger. "Place disambiguation with co-occurrence models." In CLEF 2006 Workshop, Working notes. 2006. 3. Zhang, Shu, Jianwei Wu, Dequan Zheng, Yao Meng, and Hao Yu. "An adaptive method for organization name disambiguation with feature reinforcing." In Proceedings of the 26th Pacific Asia Conference on Language, Information, and Computation, pp. 237-245. 2012. 4. Kalashnikov, Dmitri V., Zhaoqi Chen, SharadMehrotra, and RabiaNuray-Turan. "Web people search via connection analysis." Knowledge and Data Engineering, IEEE Transactions on 20, no. 11 (2008): 1550-1565. 5. Matsuo, Yutaka, Junichiro Mori, Masahiro Hamasaki, Takuichi Nishimura, Hideaki Takeda, KoitiHasida, and Mitsuru Ishizuka. "POLYPHONET: an advanced social network extraction system Method Purity Index Inverse purity index F-Score Term-entity based model 0.73 0.81 0.76 Two stage clustering 0.59 0.71 0.64 Disambiguation with rich features 0.68 0.79 0.73 Unsupervised name disambiguation 0.71 0.73 0.71 using Lexical, linguistic and personal features 0.77 0.83 0.79
  • 12. International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015 50 from the web." Web Semantics: Science, Services and Agents on the World Wide Web 5, no. 4 (2007): 262-278. 6. Al-Mubaid, Hisham, and Ping Chen. "Biomedical term disambiguation: An application to gene- protein name disambiguation." In Information Technology: New Generations, 2006. ITNG 2006. Third International Conference on, pp. 606-612. IEEE, 2006. 7. Bollegala, Danushka, Yutaka Matsuo, and Mitsuru Ishizuka. "Automatic annotation of ambiguous personal names on the web." Computational Intelligence 28, no. 3 (2012): 398-425. 8. Ikeda, Masaki, Shingo Ono, Issei Sato, Minoru Yoshida, and Hiroshi Nakagawa. "Person name disambiguation on the web by two-stage clustering." In 2nd Web People Search Evaluation Workshop (WePS 2009), 18th WWW Conference. 2009. 9. Nguyen, Hien T., and Tru H. Cao. "A knowledge-based approach to named entity disambiguation in news articles." In AI 2007: Advances in Artificial Intelligence, pp. 619-624. Springer Berlin Heidelberg, 2007. 10. Elmacioglu, Ergin, Yee Fan Tan, Su Yan, Min-Yen Kan, and Dongwon Lee. "Psnus: Web people name disambiguation by simple clustering with rich features." In Proceedings of the 4th International Workshop on Semantic Evaluations, pp. 268-271. Association for Computational Linguistics, 2007. 11. Chen, Ying, and James Martin. "Towards Robust Unsupervised Personal Name Disambiguation." In EMNLP-CoNLL, pp. 190-198. 2007. 12. http://www.d.umn.edu/~tpederse/namedata.html 13. Wan, Xiaojun, JianfengGao, Mu Li, and Binggong Ding. "Person resolution in person search results: Webhawk." In Proceedings of the 14th ACM international conference on Information and knowledge management, pp. 163-170. ACM, 2005. 14. Bekkerman, Ron, and Andrew McCallum. "Disambiguating web appearances of people in a social network." In Proceedings of the 14th international conference on World Wide Web, pp. 463-470. ACM, 2005. 15. Bunescu, Razvan C., and Marius Pasca. "Using Encyclopedic Knowledge for Named entity Disambiguation." In EACL, vol. 6, pp. 9-16. 2006. 16. Liu, Zhengzhong, Qin Lu, and JianXu. "High performance clustering for web person name disambiguation using topic capturing." Ratio (2011). 17. Lu, Yiming, ZaiqingNie, Taoyuan Cheng, Ying Gao, and Ji-Rong Wen. "Name disambiguation using Web connection." In Proceedings of AAAI 2007 Workshop on Information Integration on the Web. 2007. 18. Bunescu, Razvan C., and Marius Pasca. "Using Encyclopedic Knowledge for Named entity Disambiguation." In EACL, vol. 6, pp. 9-16. 2006. 19. Sugiyama, Kazunari, and Manabu Okumura. "Personal name disambiguation in web search results based on a semi-supervised clustering approach." In Asian Digital Libraries. Looking Back 10 Years and Forging New Frontiers, pp. 250-256. Springer Berlin Heidelberg, 2007. 20. Lu, Zhao, Zhixian Yan, and Liang He. "OnPerDis: Ontology-Based Personal Name Disambiguation on the Web." In Web Intelligence (WI) and Intelligent Agent Technologies (IAT), 2013 IEEE/WIC/ACM International Joint Conferences on, vol. 1, pp. 185-192. IEEE, 2013. 21. Han, Xianpei, and Jun Zhao. "Web personal name disambiguation based on reference entity tables mined from the web." In Proceedings of the eleventh international workshop on Web information and data management, pp. 75-82. ACM, 2009. 22. Guha and Garg. Disambiguating people in search.WWW’04. 23. Mann, Gideon S., and David Yarowsky. "Unsupervised personal name disambiguation." In Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003-Volume 4, pp. 33-40. Association for Computational Linguistics, 2003. 24. Jain, Anil K., M. NarasimhaMurty, and Patrick J. Flynn. "Data clustering: a review." ACM computing surveys (CSUR) 31, no. 3 (1999): 264-323. 25. http://www.d.umn.edu/~tpederse/namedata.html 26. Manning, C.D., Raghavan, P., Schutze, H.: Introduction to Information Retrieval. Cambridge University Press (2008) 27. Amigó, Enrique, Julio Gonzalo, Javier Artiles, and FelisaVerdejo. "A comparison of extrinsic clustering evaluation metrics based on formal constraints." Information retrieval 12, no. 4 (2009): 461-486.