An Enhancement Over Multi-Level Link Structure Analysis to Overcome False Positives

International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print),
ISSN 0976 - 6375(Online), Volume 5, Issue 4, April (2014), pp. 157-164 © IAEME
AN ENHANCEMENT OVER MULTI-LEVEL LINK STRUCTURE ANALYSIS
TO OVERCOME FALSE POSITIVE
Pratikkumar B Chauhan1, Kamlesh M Patel2
1, 2 (Computer Engineering, R.K. University, Kasturbadham, Near Tramba, Rajkot, Gujarat, India)
ABSTRACT
Search engines have become the primary source of knowledge and information on the Web. When a
user executes a query, the search engine returns a ranked list of URLs based on the keywords. Some
of these URLs have a high page rank that may have been boosted by spam links. It is therefore
necessary to identify spam links in search engine results so that users are not misled by them,
but identifying such links is a very difficult task. Spammers create spam pages to earn profit or
for marketing purposes, and also to obtain a higher page-ranking score in search engine results.
This paper presents a detailed study of MLSA, which identifies spam links based on a threshold
value. Unlike other algorithms, MLSA operates on multiple levels, so it can also report links that
lie several levels deep. The paper also introduces a new mechanism that overcomes an issue observed
in the experimental results, namely false positives, and reports results for links that were
falsely detected as spam. Finally, it discusses the new approach together with an open issue that
may help researchers further improve the efficiency of link-based algorithms.
Keywords: Web Spam, Link Spam, Link Farm, Spam Rank, Spamming, Web Mining,
Web Structure Mining, Link Analysis.
1. INTRODUCTION
The number of web users is growing rapidly, and most people rely on search engines to find
information on the Web. Sometimes, however, a search does not return the information the user is
actually looking for. One reason is spam links: many of them appear in search engine results, yet
they are very difficult to identify. Web spam is a well-known problem that is not easy to solve,
largely because the sheer size of the Web makes many algorithms infeasible in practice [2][3]. Web
spam is an attempt to increase the rank of inappropriate web pages. Among the many spamming
techniques, link spam aims to raise a page's rank by creating artificial popularity, that is, by
inflating the number of in-links to the page. Users need to know whether specific web pages are
spam target pages. Our method can be applied to such individual pages: only the page farms of
those pages need to be extracted and analyzed, which requires searching only the neighbors of
those pages and is therefore highly feasible [4]. Spamming degrades the quality of search results
and erodes users' trust in search engines. Spam sites serve various kinds of harmful content, such
as malware, adult content and phishing attacks [3][4]. Spamming is any action whose purpose is to
boost a web page's position in search engine results without providing additional value. Website
owners and businesses want to grow, so they often promote their web pages and boost their ranking
by attracting links from other websites. The difference between a normal page and a target page
lies in whether those links are justifiable [3][8]. All misleading actions that try to increase
the ranking of a page in search engines are generally referred to as Web spam or spamdexing
(spamming + indexing) [2]. For example, when 100 million web pages were ranked using the PageRank
algorithm, 11 of the top 20 results were pornographic websites that had achieved a high ranking
through content and web link manipulation [3][4][5]. Spamming is a technique for earning maximum
profit for the attacker. A link farm is a densely connected set of pages created explicitly to
mislead a link-based ranking algorithm [2]. Pages in link farms are called boosting pages; they
are created for the sole purpose of boosting the rank score of other pages, called target pages
[1][3]. A spam page is a page that is used for spamming or that receives a substantial amount of
its score from other spam pages. Another definition of spam is "any attempt to mislead a search
engine's relevancy algorithm" [3][6].
1.1 WEB SPAM
Web spamming appeared soon after the advent of search engines and has not been easily solved. The
neighborhood of a spam page looks different from that of an honest one: the neighborhood of a link
spam page consists of a large number of artificially generated links, which are likely to come
from similar objects [9]. Web spam refers to attempts to increase the ranking of a web page by
manipulating the content of the page and the link structure around it. There are numerous
approaches to detecting web spam, which may be based on web page content, link structure, or a
combination of the two; this paper focuses on the link structure of a website [10].
2. LINKING BASED ALGORITHM
This paper focuses on a link-based algorithm, Multi-level Link Structure Analysis (MLSA), a purely
link-based algorithm for detecting spam links that is a modification of the seed and parental
penalty algorithms. Search engines use link-based techniques to rank websites in their results,
and web spammers who create spam links take advantage of the vulnerabilities of these link-based
algorithms. They create many artificial references or links in order to acquire
higher-than-deserved rankings in search engine results, which generates more traffic to their
websites [3][7][11].
Chakrabarti introduced a fine-grained approach that integrates document structure using
Document Object Model (DOM) into earlier hyperlink-based topic.

2.1 LIMITATION IN SEED AND PARENTAL PENALTY ALGORITHM
The seed and parental penalty algorithms work as follows. First, when a query is executed, the
search engine returns a set of URLs. Then, using the same search engine, the set of incoming links
is found for each of these URLs. Given the incoming and outgoing link sets of a URL, the algorithm
computes their intersection, that is, the links that point to each other. If the size of this
intersection is greater than a predefined threshold, the page is marked as bad; otherwise it is
marked as good. The seed and parental penalty algorithms are thus used to detect spam pages that
directly swap links with their neighboring sites.
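The threshold test described above can be sketched in a few lines. This is only an illustration: the function name `is_bad_page` and the `*.example` link sets are hypothetical and do not come from the original algorithms.

```python
def is_bad_page(incoming, outgoing, threshold):
    """Mark a page as bad when it swaps links with too many neighbours."""
    # A link present in both sets points in both directions (reciprocal link).
    reciprocal = set(incoming) & set(outgoing)
    return len(reciprocal) > threshold


# Hypothetical link sets for one search-result URL.
incoming = {"b.example", "c.example", "d.example", "e.example"}
outgoing = {"b.example", "c.example", "d.example", "x.example"}

print(is_bad_page(incoming, outgoing, threshold=2))  # True: 3 reciprocal links > 2
print(is_bad_page(incoming, outgoing, threshold=3))  # False: 3 is not greater than 3
```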
Fig. 1 illustrates link exchanges among three domains A, B and C. A domain here refers to a website
with a unique domain name. The three domains A, B and C have web pages A1, A2, B1, B2, C1 and C2.
Domain A links to domain B from page A2 to page B2, but the reciprocal link points from page B1
to page A1. A similar method is used for the link exchange between domains C and A, where the
reciprocal link points from A2 to C2 [7].
Figure 1: Link Exchange in seed parental penalty algorithm [7]
If no two links point to each other, that is, if the intersection is empty, a spam link may bypass
the algorithm's calculation of page importance and lead it in the wrong direction. In our example,
A1 bypasses the algorithm. Thus the Seed and Parental Penalty algorithms can produce false
negatives, where a possible spam page avoids detection [7]. MLSA was introduced to overcome this
false negative.
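A small sketch makes the false negative concrete, using the page names from Fig. 1. The link C1 -> A1 is an assumption, since the text only states that the reciprocal link points from A2 to C2; the helper names are hypothetical.

```python
# Page-level links from Fig. 1. The C1 -> A1 link is assumed for illustration.
links = [("A2", "B2"), ("B1", "A1"), ("C1", "A1"), ("A2", "C2")]


def domain(page):
    """Domain letter of a page name, e.g. "A1" -> "A"."""
    return page[0]


def page_level_intersection(page):
    """Pages that both link to `page` and are linked from it."""
    incoming = {src for src, dst in links if dst == page}
    outgoing = {dst for src, dst in links if src == page}
    return incoming & outgoing


def domain_level_intersection(dom):
    """External domains that both link into `dom` and are linked from it."""
    incoming = {domain(s) for s, d in links if domain(d) == dom and domain(s) != dom}
    outgoing = {domain(d) for s, d in links if domain(s) == dom and domain(d) != dom}
    return incoming & outgoing


print(page_level_intersection("A1"))           # set(): A1 escapes a page-level check
print(sorted(domain_level_intersection("A")))  # ['B', 'C']: the exchange is visible per domain
```

Comparing domains rather than individual pages is what lets a multi-level analysis see an exchange that is routed through different pages of the same site.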
2.2 MLSA ALGORITHM
The MLSA algorithm is used to detect spam pages. In most link exchanges and link farms, participating
pages have at least one outgoing link from one of the web pages within the same domain to a
neighboring domain.

Figure 2: Link Exchange in MLSA[7]
Fig. 2 illustrates the link parsing sequence of MLSA. A1, A2, A3 and A4 denote four web pages
within a domain, and A1 represents the candidate web page being analyzed. MLSA first parses the
outgoing links in page A1 that point externally to other domains. Outgoing links from pages A2 and
A3 are collected at level 1. The algorithm then continues to the next level by following the
outgoing links of the internal pages that A1 links to. This process repeats up to the predefined
number of levels. The set of incoming links is collected as well, and from these two sets the
algorithm calculates the number of intersections between outgoing and incoming links. If the
number of intersections is greater than the predetermined threshold, the link is treated as a spam
link; otherwise it is treated as a genuine link.
Let p denote the URL of a candidate web page and d[p] the domain name of p. IN(d[p]) denotes the
set of incoming links to the domain (root) of p. OUT(ntmp) represents the outgoing links from a
temporary node ntmp. Two parameters must be initialized: the parsing depth Dmlsa, which is the
number of levels the algorithm parses, and the threshold Tmlsa.
MLSA algorithm:
(1) For each URL i in IN(d[p]), if d[i] != d[p] and d[i] is not in InDomainList(p), then add d[i]
to InDomainList(p).
(2) Set p as ntmp and set the current level L to 0.
(3) If L <= Dmlsa, then for each URL k in OUT(ntmp), execute 3.1 and 3.2.
(3.1) If d[k] != d[p] and d[k] is not in OutDomainList(p), then add d[k] to
OutDomainList(p).
(3.2) Else if d[k] == d[p], then increment L, set k as ntmp and repeat step 3.
(4) Calculate the intersection of InDomainList(p) and OutDomainList(p). If the number of
intersections is more than the threshold value Tmlsa, then the page is marked as a bad page.
(5) Repeat all the above steps for every search result URL p.
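The steps above can be sketched as follows. This is a minimal illustration over an in-memory link graph, not the original implementation: the dictionaries `in_links` and `out_links` stand in for real crawling, the `*.example` URLs are hypothetical, and the `visited` set is an added guard against internal link cycles that the steps do not mention.

```python
from urllib.parse import urlparse


def d(url):
    """Domain name of a URL (here simply the host part)."""
    return urlparse(url).netloc


def mlsa(p, in_links, out_links, depth, threshold):
    """Return (number of intersections, is_bad) for candidate URL p.

    in_links:  dict mapping a domain to the URLs linking into it, IN(d[p])
    out_links: dict mapping a URL to the URLs it links to, OUT(n)
    """
    # Step 1: external domains that link into the domain of p.
    in_domains = {d(i) for i in in_links.get(d(p), []) if d(i) != d(p)}

    # Steps 2-3: collect external domains reachable through internal pages.
    out_domains = set()
    visited = {p}            # assumption: avoid re-parsing internal pages
    frontier = [(p, 0)]      # (node, level), starting at level 0
    while frontier:
        node, level = frontier.pop()
        for k in out_links.get(node, []):
            if d(k) != d(p):
                out_domains.add(d(k))                 # step 3.1
            elif level < depth and k not in visited:  # step 3.2
                visited.add(k)
                frontier.append((k, level + 1))

    # Step 4: compare the overlap with the threshold.
    common = in_domains & out_domains
    return len(common), len(common) > threshold


out_links = {
    "http://a.example/1": ["http://a.example/2", "http://a.example/3"],
    "http://a.example/2": ["http://b.example/x"],
    "http://a.example/3": ["http://c.example/y", "http://a.example/1"],
}
in_links = {
    "a.example": ["http://b.example/z", "http://c.example/w", "http://a.example/2"],
}

print(mlsa("http://a.example/1", in_links, out_links, depth=1, threshold=1))  # (2, True)
print(mlsa("http://a.example/1", in_links, out_links, depth=0, threshold=1))  # (0, False)
```

With a depth of 0 only the candidate page itself is parsed, so the external links on its internal pages are missed; one extra level exposes both reciprocal domains.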
Running the MLSA algorithm reveals that it produces false positives, meaning that genuine sites
are also detected as spam. For example, dmoz.org and navjivannaturecure.com are genuine sites, yet
MLSA detects them as spam pages. Table 1 shows the analysis of several links with 2 levels and a
threshold value of 5; it shows that false positives exist in the system.

Table 1: Experimental Result analysis of MLSA algorithm
URL                                            Intersection   Spam?
http://internethomeloans.com.au/               14             Yes
http://www.quantumfinancesolutions.com.au/     59             Yes
http://www.lifeinsurance.net.au/               103            Yes
http://a1carlease.com.au/                      95             Yes
http://www.domz.org/docs/en/about.html         6              Yes
http://pratikchauhan.co.in/single.html         12             Yes
http://www.navjivannaturecure.com/index.php    9              Yes
3. PROPOSED WORK
As shown above, false positives exist in the existing system, and this paper focuses on that issue.
The proposed method uses duplicate detection: a link that has been detected once is stored in an
array, and if the same link is encountered again it is not scanned again. Each extracted link is
also compared with the master URL; if it is the same as the master URL, the crosslink counter is
incremented, subject to the main-domain check described next. The main domains of the master URL
and of the currently extracted URL are calculated as follows.
For example, the URL http://pratikchauhan.co.in/single.html has the main domain
http://www.pratikchauhan.co.in.
If both main domains are the same, the crosslink counter is not incremented; if they belong to
different domains, the counter is incremented.
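A minimal sketch of the main-domain comparison, assuming the main domain is simply the scheme plus host with a leading "www." added when absent. This matches the example above but is a naive rule; the function names are hypothetical.

```python
from urllib.parse import urlparse


def main_domain(url):
    """Scheme plus host, with a leading "www." added if it is absent (naive rule)."""
    parts = urlparse(url)
    host = parts.netloc.lower()
    if not host.startswith("www."):
        host = "www." + host
    return f"{parts.scheme}://{host}"


def counts_as_crosslink(master_url, page_url):
    """A back link counts only when it comes from a different main domain."""
    return main_domain(master_url) != main_domain(page_url)


print(main_domain("http://pratikchauhan.co.in/single.html"))
# http://www.pratikchauhan.co.in
```

A production version would need a proper registered-domain lookup, since hosts such as subdomains are not handled by this prefix rule.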
Let M denote the master URL, S[] the array of scanned URLs, cross_link the crosslink counter, TH
the threshold value, Li the level limit, and L the current level, which is initially set to 1.
K represents a temporary URL.
3.1 IMPROVED MLSA
Input: M, TH, Li
Output: whether M is spam or not.
Initialize S[] as empty and cross_link = 0.
Find main domain d1 of M.
Fetch_recursive(P, Li, L)        // P is the page being parsed; the first call uses P = M, L = 1
{
    if (L > Li)
        return
    extract all outlinks of P using a DOM parser and save them in HT
    find main domain d2 of P     // the page on which the links in HT were found
    foreach K in HT
    {
        if (K ∈ S[])
            continue             // duplicate link: do not scan it again
        if (K = M)
        {
            if (d1 = d2)
                display "crosslink occurs within the same domain; cross_link not incremented"
            else
            {
                display "crosslink occurs from another domain; cross_link incremented"
                cross_link++
            }
            continue
        }
        add K to S[]
        Fetch_recursive(K, Li, L + 1)
    }
}
Fetch_recursive(M, Li, 1)
if (cross_link >= TH)
    M is detected as spam
else
    M is not spam.
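The Improved MLSA procedure can be sketched in Python as follows. This is an interpretation rather than the authors' code: the dictionary `out_links` is a hypothetical stand-in for the DOM parser, the `*.example` URLs are invented, and d2 is assumed to be the main domain of the page on which the back link to the master URL was found.

```python
from urllib.parse import urlparse


def main_domain(url):
    """Scheme plus host, with a leading "www." added if absent (naive rule)."""
    parts = urlparse(url)
    host = parts.netloc.lower()
    if not host.startswith("www."):
        host = "www." + host
    return f"{parts.scheme}://{host}"


def improved_mlsa(master, out_links, limit, threshold):
    """Count back links to `master` that come from other main domains.

    out_links maps a URL to the list of URLs found on that page.
    Returns (cross_link, is_spam).
    """
    scanned = set()      # S[]: URLs that have already been parsed
    cross_link = 0
    d1 = main_domain(master)

    def fetch(page, level):
        nonlocal cross_link
        if level > limit:
            return
        d2 = main_domain(page)
        for k in out_links.get(page, []):
            if k == master:
                if d1 != d2:      # back link from another domain
                    cross_link += 1
                continue
            if k in scanned:      # duplicate: do not scan again
                continue
            scanned.add(k)
            fetch(k, level + 1)

    fetch(master, 1)
    return cross_link, cross_link >= threshold


out_links = {
    "http://m.example/": ["http://m.example/about", "http://x.example/", "http://y.example/"],
    "http://m.example/about": ["http://m.example/"],                  # same domain: not counted
    "http://x.example/": ["http://m.example/"],                       # cross link
    "http://y.example/": ["http://m.example/", "http://x.example/"],  # cross link; x already scanned
}

print(improved_mlsa("http://m.example/", out_links, limit=2, threshold=2))  # (2, True)
```

Because internal back links are ignored and each URL is scanned once, the count reflects only distinct external pages that point back to the master URL.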
With the Improved MLSA algorithm, the false positives produced by the existing algorithm are
resolved for the links tested. Table 2 shows the results and analysis for the same links with 2
levels and a threshold value of 5.
Table 2: Experimental result analysis of Improved MLSA algorithm
URL                                            Intersection   Spam?
http://internethomeloans.com.au/               6              Yes
http://www.quantumfinancesolutions.com.au/     3              No
http://www.lifeinsurance.net.au/               21             Yes
http://a1carlease.com.au/                      5              Yes
http://www.domz.org/docs/en/about.html         0              No
http://pratikchauhan.co.in/single.html         0              No
http://www.navjivannaturecure.com/index.php    0              No

As the graph shows, both algorithms compute the number of intersections, or cross links, that
exist for a particular URL. The graph is built from Tables 1 and 2 for 2 levels and a threshold
value of 5. The MLSA algorithm produces false positives, meaning that genuine sites are detected
as spam links; this paper has focused on that issue, and the Improved MLSA algorithm resolves it.
A URL whose number of intersections exceeds the threshold (or, in Improved MLSA, reaches it, since
the test is cross_link >= TH) is detected as a spam link, and the others are not. By reducing the
number of intersections counted, the proposed work removes 90-100% of the false positives.
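The comparison can be reproduced directly from the values in Tables 1 and 2. The sketch below applies the two threshold tests as stated in the text (more than the threshold for MLSA, cross_link >= TH for Improved MLSA); the variable names are hypothetical.

```python
THRESHOLD = 5

# (URL, MLSA intersection from Table 1, Improved MLSA cross links from Table 2)
rows = [
    ("http://internethomeloans.com.au/", 14, 6),
    ("http://www.quantumfinancesolutions.com.au/", 59, 3),
    ("http://www.lifeinsurance.net.au/", 103, 21),
    ("http://a1carlease.com.au/", 95, 5),
    ("http://www.domz.org/docs/en/about.html", 6, 0),
    ("http://pratikchauhan.co.in/single.html", 12, 0),
    ("http://www.navjivannaturecure.com/index.php", 9, 0),
]

for url, mlsa_count, improved_count in rows:
    mlsa_spam = mlsa_count > THRESHOLD            # MLSA step 4: "more than" the threshold
    improved_spam = improved_count >= THRESHOLD   # Improved MLSA: cross_link >= TH
    print(f"{url:45} {mlsa_spam!s:6} {improved_spam!s}")

# URLs flagged by MLSA that Improved MLSA no longer flags.
cleared = [url for url, m, i in rows if m > THRESHOLD and i < THRESHOLD]
print(len(cleared))  # 4
```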
4. CONCLUSION
This study concludes that link-based algorithms can be used to detect link spam. The paper focused
on Multi-level Link Structure Analysis (MLSA), which relies purely on the link structure of web
pages and is an enhancement of the seed and parental penalty algorithms. In the proposed
algorithm, a back link from a page to its home node, or master URL, is counted and termed a cross
link. Using the same comparison threshold together with the crosslink count, the proposed
algorithm addresses the false positives that exist in MLSA, with a reported accuracy of 90-100%.
Websites that are genuine but detected as spam by the MLSA algorithm are thus detected as genuine
by the Improved MLSA algorithm. Each node in a website has at least one outgoing link to another
domain or to the same domain. However, further improvement is necessary because the algorithm
takes too much time to extract links for the predefined number of levels. Its efficiency may be
improved by integrating it with content-spam and cloaking detection algorithms.

REFERENCES
[1] Athasit Surarerks, Arnon Rungsawang, Chakrit Likitkhajorn, An Approach of Two-Way
Spam Detection Based on Boosting Pages Analysis, IEEE, 978-1-4673-2025-2/12, 2012.
[2] Carlos Castillo, Debora Donato, Luca Becchetti, Using Rank Propagation and Probabilistic
Counting for Link Based Spam Detection, WEBKDD’06, August 20, 2006, Philadelphia,
Pennsylvania, USA.
[3] Chauhan Pratikkumar Bharatbhai, Kamlesh M Patel, Analysis of Spam Link Detection
Algorithm based on Hyperlinks, IFRSA International Journal of Data Warehousing &
Mining, Vol 4, issue1, Feb. 2014, 67-72.
[4] Jiawei Han, Nikita Spirin, Survey on Web Spam Detection: Principles and Algorithms,
SIGKDD Explorations Volume 13, Issue 2, 50-64.
[5] K.K. Arthi, Dr. V. Thiagarasu, A Study on Web Spam Classification and Algorithms,
International Journal of Computer Trends and Technology, Volume 4, Issue 9, Sep 2013,
ISSN: 2231-2803, 3126-3131.
[6] Mr. R. BalaKumar, Mr. P. Rajendran, Mrs. R. Mynavathi, Survey on Spam Detection Techniques
in Data Mining, International Journal of Advanced Research in Data Mining and Cloud
Computing, Vol. 1, Issue 1, July 2013, ISSN 2321-8754, 8-17.
[7] Tan Su Tung, Nor Adnan Yahaya, S.M.F.D Syed Mustapha, Multi-level Link Structure
Analysis Technique for Detecting Link Farm Spam Pages, IEEE/WIC/ACM International
Conference on Web Intelligence and Intelligent Agent Technology, IEEE Computer
Society, 0-7695-2749-3/06, 2006.
[8] Zhou, B. and Pei, J., Link Spam Target Detection Using Page Farms, ACM Transactions on
Knowledge Discovery from Data, Vol. 3, No. 3, Article 13, 1556-4681/2009/07, July 2009,
USA.
[9] Andras A. Benczur, Karoly Csalogany, Tamas Sarlos, Mate Uher, SpamRank – Fully
Automatic Link Spam Detection, Computer and Automation Research Institute, Hungarian
Academy of Sciences (MTA SZTAKI), 11 Lagymanyosi u., H–1111 Budapest, Hungary,
Eotvos University, Budapest.
[10] Reid Andersen, Christian Borgs, Jennifer Chayes, John Hopcroft, Kamal Jain, Vahab
Mirrokni, Shanghua Teng, Robust PageRank and Locally Computable Spam Detection
Features, AIRWeb ’08, April 22, 2008 Beijing, China.
[11] Shekoofeh Ghiam and Alireza Nemaney Pour, A Survey On Web Spam Detection Methods:
Taxonomy, International Journal of Network Security & Its Applications (IJNSA), Vol.4,
No.5, September 2012, IRAN.
[12] Jyoti Pruthi and Dr. Ela Kumar, “Data Set Selection in Anti-Spamming Algorithm - Large or
Small”, International Journal of Computer Engineering & Technology (IJCET), Volume 3,
Issue 2, 2012, pp. 206 - 212, ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.
[13] Goverdhan Reddy Jidiga and Dr. P Sammulal, “Machine Learning Approach to Anomaly
Detection in Cyber Security with a Case Study of Spamming Attack”, International Journal of
Computer Engineering & Technology (IJCET), Volume 4, Issue 3, 2013, pp. 113 - 122,
ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.
[14] Prajakta Ozarkar and Dr. Manasi Patwardhan, “Efficient Spam Classification by Appropriate
Feature Selection”, International Journal of Computer Engineering & Technology (IJCET),
Volume 4, Issue 3, 2013, pp. 123 - 139, ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.
