E mine by V.DINESH KUMAR KSRCT

E-MINE: A NOVEL WEB MINING
APPROACH
Submitted By,
V.DINESH KUMAR,
II-MCA.

ABSTRACT
 In recent years, government agencies and
industrial enterprises have been using the web as a
medium of publication.
 It has become increasingly difficult to identify
relevant pieces of information, since pages
are cluttered with irrelevant content
(advertisements, copyright notices…)
surrounding the main content.
 Thus we propose a technique that mines the
relevant data regions from a web page.

INTRODUCTION
 Several attempts have been made to extract
the regularly structured data from the web
page.
 The main disadvantage of the existing
approaches is that they assume the relevant
information of a data record is contained in the
HTML code, which is not always true.
 So, we propose a more effective method to
mine the data region in the web page.

RELATED WORK
 MDR (Mining Data Records) is a technique
mainly used in the area of data mining.
 It exploits the regularities in the HTML tag
structure directly.
 The MDR algorithm uses the HTML
tag tree of the web page to extract data
records from the page.

 The algorithm is based on two observations:
(a) A group of data records is always
presented in a contiguous region of the web
page and is formatted using similar HTML
tags. Such a region is called a Data Region.
(b) The nested structure of the HTML tags in
a web page usually forms a tag tree, and a set
of similar data records is formed by some
child sub-trees of the same parent node.

PROPOSED TECHNIQUE
 The proposed technique can help the system in three
ways:
a) It enables the system to identify gaps that separate
records, which helps to segment data records
correctly.
b) The visual information also contains information
about the hierarchical structure of the tags.
c) By observing a web page, it can be seen that
the relevant data region occupies the major central

SYSTEM MODEL OF THE E-MINE
TECHNIQUE
HTML source of a web page
-> Largest Rectangle Identifier
-> Container Identifier
-> Filter
-> Relevant Data Region

 The system model consists of three main
components:
 Largest Rectangle Identifier,
 Container Identifier and
 Filter.
The output of each component is the input of the next
component.

 The e-mine technique is based on three
observations:
 A group of data records is typically presented in
a contiguous region of a page.
 The area covered by the rectangle that bounds the
data region is larger than the area covered by the
rectangles bounding other regions, e.g.
advertisements and links.
 The height of an irrelevant data record within a
collection of data records is less than the average
height of relevant data records within that region.

ALGORITHM e-Mine
INPUT: HTML source of a web page.
STEP 1: Determine the height and width of all the bounding
rectangles in the HTML document.
STEP 2: Calculate the areas of all the bounding
rectangles.
STEP 3: Identify the Maximum Rectangle among all the
bounding rectangles.
STEP 4: Identify the container within the Maximum
Rectangle obtained from step 3.
STEP 5: Identify the Data Region in the container
obtained from step 4.
STEP 6: Filter the Data Region obtained from step 5 to
remove the remaining irrelevant data.

HOW THE ALGORITHM WORKS
 Determining the height and width of all
bounding rectangles.
 Identification of the largest rectangle.
 Identification of the container within the largest
rectangle.
 Identification of the data region containing data
records within the container.

DETERMINING HEIGHT AND
WIDTH OF ALL BOUNDING
RECTANGLES
 In the first step of the proposed technique,
we determine the dimensions of all the
bounding rectangles in the web page.
 Where these dimensions are not specified, the
MSHTML parsing and rendering engine of Microsoft
Internet Explorer 6.0 can be used to obtain them.
 The parsing and rendering engine of the web
browser gives us the coordinates of each
bounding rectangle.
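
As a minimal sketch of this step, the snippet below derives width, height and area from the four corner coordinates of a bounding rectangle. The `BoundingRect` type and the sample coordinates are hypothetical; the slides only state that the rendering engine reports coordinates, not their exact form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingRect:
    """Axis-aligned rectangle in page pixels (hypothetical helper type)."""
    left: float
    top: float
    right: float
    bottom: float

    @property
    def width(self) -> float:
        return self.right - self.left

    @property
    def height(self) -> float:
        return self.bottom - self.top

    @property
    def area(self) -> float:
        return self.width * self.height


# Made-up coordinates standing in for what a rendering engine would report.
rect = BoundingRect(left=10, top=20, right=310, bottom=220)
print(rect.width, rect.height, rect.area)  # 300 200 60000
```

Keeping the raw coordinates and deriving the dimensions on demand means the same object can serve both the area comparison and the later height-based filter.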

IDENTIFICATION OF THE
LARGEST RECTANGLE
 Based on the height and width of the bounding
rectangles obtained in the previous step, we
compute the area of each bounding rectangle.
 Among these rectangles, we determine the
largest rectangle.
 The reason for doing so is that the largest
bounding rectangle is expected to contain the
most relevant data in the web page.

PROCEDURE FOR IDENTIFICATION
OF LARGEST RECTANGLE
Procedure getMaxRect
Input: <body> of the HTML source
Maximum Rectangle = none (area 0)
for each child of <body> tag
begin
    Find the coordinates of the bounding rectangle
    of the child
    if the area of the bounding rectangle >
    area of Maximum Rectangle
    then Maximum Rectangle = child
    endif
end
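
The procedure above could be sketched in Python as follows. The `Node` class is a hypothetical stand-in for a rendered tag whose width and height are already known; the tag names and sizes are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical rendered tag with precomputed dimensions."""
    tag: str
    width: float
    height: float
    children: List["Node"] = field(default_factory=list)

    @property
    def area(self) -> float:
        return self.width * self.height


def get_max_rect(body: Node) -> Optional[Node]:
    """Return the direct child of <body> with the largest bounding area."""
    max_node: Optional[Node] = None
    for child in body.children:
        if max_node is None or child.area > max_node.area:
            max_node = child
    return max_node  # None when <body> has no children


# Invented page layout: banner, main content column, sidebar.
body = Node("body", 1000, 1200, [
    Node("div#banner", 1000, 100),
    Node("div#main", 760, 1000),
    Node("div#sidebar", 240, 1000),
])
largest = get_max_rect(body)
print(largest.tag)  # div#main
```

Only the direct children of `<body>` are compared, as in the procedure; nested tags are examined in the container step.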

IDENTIFICATION OF THE
CONTAINER WITHIN THE LARGEST
RECTANGLE
 Once we have obtained the largest
rectangle, we form a set of all the bounding
rectangles within it whose area is more than
half the area of the largest rectangle.
 The reason is that the most important data of
a web page must occupy a significant portion of
the web page.
 Determine the bounding rectangle having the
smallest area in the set, because the
smallest rectangle will contain only the data
records.

PROCEDURE FOR IDENTIFICATION
OF CONTAINER WITHIN THE
LARGEST RECTANGLE
Procedure getContainer
Input: The Largest Rectangle out of all Bounding Rectangles.
List_of_Children = depth-first listing of all the
    children of the tag associated with Maximum Rectangle
for each tag in List_of_Children
begin
    if area of bounding rectangle of tag > half the area of
    Maximum Rectangle
    then container = tag
    endif
end
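
A possible Python rendering of getContainer is sketched below. It follows the procedure literally: it walks the descendants in depth-first order and keeps the last tag whose area exceeds half the area of the Maximum Rectangle, which for nested tags is the deepest one. The `Node` class, the `depth_first` helper and the sample layout are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    """Hypothetical rendered tag with precomputed dimensions."""
    tag: str
    width: float
    height: float
    children: List["Node"] = field(default_factory=list)

    @property
    def area(self) -> float:
        return self.width * self.height


def depth_first(node: Node) -> Iterator[Node]:
    """Yield all descendants of node in depth-first (pre-order) order."""
    for child in node.children:
        yield child
        yield from depth_first(child)


def get_container(max_rect: Node) -> Optional[Node]:
    """Return the last descendant covering more than half of max_rect."""
    threshold = max_rect.area / 2
    container: Optional[Node] = None
    for node in depth_first(max_rect):
        if node.area > threshold:
            container = node
    return container  # None when no descendant is large enough


# Invented layout: a toolbar and a results table holding six rows.
results = Node("table#results", 760, 900,
               [Node("tr", 760, 150) for _ in range(6)])
main = Node("div#main", 760, 1000, [Node("div#toolbar", 760, 60), results])
container = get_container(main)
print(container.tag)  # table#results
```

The threshold is always measured against the Maximum Rectangle, not against the parent of the tag being tested, so individual rows never qualify here.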

IDENTIFICATION OF DATA REGION
CONTAINING DATA RECORDS WITHIN
THE CONTAINER
 To remove the irrelevant data from the
container we use a filter.
 The filter determines the average height of
the data records within the container.
 Records whose height is less than the
average height are identified as irrelevant
and discarded.

PROCEDURE FOR FILTER
Procedure Filter
Input: The container obtained from the previous step.
totalHeight = 0
for each child tag within container
    totalHeight += height of the bounding rectangle of child
end for
averageHeight = totalHeight / number of children of container
for each child within container
    if height of child's bounding rectangle < averageHeight
    then Discard child from container
    endif
end for
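
The filter can be illustrated with a short Python sketch. The `Child` type, the function name `filter_children` and the sample heights are hypothetical. The guard for an empty container is an addition that the procedure does not mention.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Child:
    """Hypothetical child tag of the container with a known height."""
    tag: str
    height: float


def filter_children(children: List[Child]) -> List[Child]:
    """Keep the children whose height is at least the average height."""
    if not children:  # assumption: an empty container yields no records
        return []
    average_height = sum(c.height for c in children) / len(children)
    # Children strictly below the average are discarded as irrelevant.
    return [c for c in children if c.height >= average_height]


# Invented heights: three records and one thin separator row.
children = [
    Child("tr.record-1", 150),
    Child("tr.record-2", 150),
    Child("tr.separator", 20),
    Child("tr.record-3", 150),
]
kept = filter_children(children)
print([c.tag for c in kept])  # ['tr.record-1', 'tr.record-2', 'tr.record-3']
```

Here the average height is 117.5, so only the 20-pixel separator is dropped. Because the comparison is strict, a container whose children all have the same height keeps every child.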

MDR VS E-MINE
 Here the proposed technique is evaluated and
compared with MDR (Mining Data
Records). This evaluation covers three
aspects:
 Data Region Extraction,
 Data Record Extraction,
 Overall Time Complexity.

DATA REGION EXTRACTION
 MDR depends on certain tags, such as
<table> and <tbody>, for identifying the data
region.
 A data region can be contained in tags
like <table>, <tbody>, <p>, <li>, <form>, etc.
 In the proposed e-mine system, data
region identification is independent of
specific tags and forms.

DATA RECORD EXTRACTION
 MDR identifies data records based on
keyword search, e.g. "$".
 MDR not only identifies the relevant data
region containing the search result records
but also extracts records from all other
sections of the page.

OVERALL TIME COMPLEXITY
 The existing algorithm MDR has a
complexity of the order O(nk), where
 n is the total number of nodes,
 k is the maximum number of tags.
CONCLUSION
 In this paper we proposed a new approach to
extract structured data from web pages.
 Although several techniques exist, e-mine is
a purely visual-structure-oriented method that can
correctly identify the data regions.
 Most current algorithms fail to correctly
determine the data region when the data region
consists of only one data record.
 Thus e-mine overcomes the drawbacks of
existing methods and performs significantly
better than them.
