Online news publishers publish their news in the form of articles. Most online news websites let users comment on news articles, and many people do. A news web page therefore holds a large amount of data, both article content and comments, and has good potential as a resource for many information retrieval systems and data mining applications. Extracting the main content (the article content) from a web page has always been challenging because the page also contains other information, such as advertisements and hyperlinks, that is unrelated to the article text. In this survey, we review techniques that researchers have proposed for extracting article content from a news website. We also review techniques that monitor and analyze comments for applications such as predicting article popularity and identifying discussion threads in the comments data.
Survey on article extraction and comment monitoring techniques
1. Submitted by
Ankur Kumar Agrawal
M.Tech(CS)-II Year
13535009
Under the guidance of
Dr. Dhaval Patel
2. Introduction
Why Article Extraction and Comments Monitoring?
Challenges in Article Extraction and Comments Monitoring
Article Extraction Techniques
Learning Based Techniques
Heuristic Techniques
Visual Based Approach
Comments Monitoring
News Article Popularity Prediction
Extracting Discussion Structure
Conclusion
3. What is an Article on a news web page?
Online news sources publish their news in the form of articles.
An article describes a particular event that has happened.
The main content on a news web page is the article content.
Other content on the page, such as hyperlinks, images, and side banners, is considered noise content.
What are Comments?
Comments are the reactions of citizens to an article published by the news media.
5. Article Extraction can be used in
Information Retrieval Systems.
Search engines such as Google and Yahoo (indexing article content to give the best search results).
News Aggregator Systems like Google News.
Comments monitoring can be used for
News Article Popularity Prediction.
Advertisement Agencies
News Agencies
Debate Identification
Sentiment Analysis and Opinion Mining
6. [Figure: annotated news web page. Region 1: Article Text. Region 2: Noise Content (menus, advertisements, side banners, hyperlinks).]
8. Public comments are not always available for every news source.
Some websites provide their comments data
It is difficult to apply standard NLP techniques to comments, since comments may not be syntactically correct.
10. [Figure: heuristic extraction pipeline. Heuristics are applied to the parsed news web page, and the output is the article text content.]
11. The web page is processed as a DOM tree.
The DOM tree represents each tag as a node object in a tree.
Two important factors in heuristic techniques are Text Count and Link Count.
Text Count: the number of words in the text of a node.
Link Count: the number of links in the subtree rooted at a node.
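The two counts above can be sketched with Python's standard `html.parser`. This is a minimal illustration, not the surveyed system's implementation: the `Node` and `CountingParser` classes are hypothetical, the parser assumes reasonably well-formed HTML, and a link is counted as a single word of text, an assumption made so that an `A` node gets the (1,1) counts used in the example tree.

```python
from html.parser import HTMLParser


class Node:
    """One DOM element with subtree-wide Text Count and Link Count."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text_count = 0  # words in the subtree rooted at this node
        self.link_count = 0  # links in the subtree rooted at this node

    def add_counts(self, text=0, links=0):
        # Propagate the counts to this node and all of its ancestors.
        node = self
        while node is not None:
            node.text_count += text
            node.link_count += links
            node = node.parent


class CountingParser(HTMLParser):
    VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag in self.VOID_TAGS:
            return  # void elements never contain children
        self.current = node
        if tag == "a":
            # Assumption: a link counts as one link and one word of text.
            node.add_counts(text=1, links=1)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag name, if any.
        node = self.current
        while node.parent is not None and node.tag != tag:
            node = node.parent
        if node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        # Text inside a link was already counted as one word.
        node = self.current
        while node is not None:
            if node.tag == "a":
                return
            node = node.parent
        self.current.add_counts(text=len(data.split()))


def build_tree(html):
    parser = CountingParser()
    parser.feed(html)
    parser.close()
    return parser.root


PAGE = (
    "<html><head><title>T</title></head><body>"
    "<div><p>This is Article</p><a href='#'>More detail</a><p>Text</p></div>"
    "<div><p>Noise</p></div>"
    "</body></html>"
)
html_node = build_tree(PAGE).children[0]
body_node = html_node.children[1]
article_div = body_node.children[0]
print(html_node.tag, html_node.text_count, html_node.link_count)
print(article_div.tag, article_div.text_count, article_div.link_count)
```

Propagating each count to every ancestor at parse time keeps the subtree totals available without a second traversal.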
12. Node structure: Node Name (Text Count, Link Count)
Html (7,1)
  Head (1,0)
  Body (6,1)
    DIV (5,1)
      P (3,0)
        This is (2,0)
        Article (1,0)
      A (1,1)
        More detail (1,1)
      P (1,0)
        Text (1,0)
    DIV (1,0)
      P (1,0)
        Noise (1,0)
13. For each node of the DOM tree, a Basic Score is calculated using the following formula.
$$\text{Basic Score} = \frac{\text{Text Count} - \text{Link Count}}{\text{Text Count}}$$
The node with the maximum Basic Score is selected as the probable node containing the article text.
If multiple nodes share the same maximum score, the one that is higher in level (closer to the root) is selected.
Drawback
It favors nodes that have little text and no links.
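The Basic Score and its tie-breaking rule can be sketched as follows. The node list is hypothetical data written as (name, Text Count, Link Count, depth) tuples modeled on the example tree, and returning 0 for a node with no text is an assumption, since the slides do not cover that case.

```python
def basic_score(text_count, link_count):
    """(Text Count - Link Count) / Text Count; 0.0 for empty nodes (assumption)."""
    if text_count == 0:
        return 0.0
    return (text_count - link_count) / text_count


def select_article_node(nodes):
    """Pick the node with the highest score; ties go to the smaller depth."""
    return max(nodes, key=lambda n: (basic_score(n[1], n[2]), -n[3]))


# (name, text_count, link_count, depth); depth 0 would be the root.
NODES = [
    ("Body", 6, 1, 1),
    ("DIV (article)", 5, 1, 2),
    ("P", 3, 0, 3),
    ("A", 1, 1, 3),
    ("DIV (noise)", 1, 0, 2),
]

selected = select_article_node(NODES)
print(selected[0], basic_score(selected[1], selected[2]))
```

On this data the noise DIV wins, because it has no links and sits higher in the tree than the other nodes with a score of 1. This is the drawback noted above.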
14. [Figure: DOM tree after applying the Basic Score function, (Text Count - Link Count) / Text Count. The real article node, DIV (5,1), scores 0.8, while the node DIV (1,0) scores 1 and is selected as the article text node because it is higher in level.]
15. Here one extra factor is added to the basic scoring function.
$$\text{Score} = \text{Weight}_{ratio} \times \frac{\text{Text Count} - \text{Link Count}}{\text{Text Count}} + \text{Weight}_{text} \times \frac{\text{Text Count}}{\text{Page Text}}$$
The extra factor is the fraction of the page's total text that the node contains.
Optimal weights are assigned to both factors.
This extra factor removes the drawback of using only the basic scoring function.
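A minimal sketch of the weighted scoring function follows. The weight values are illustrative placeholders, not the optimal weights referred to above, and the zero-text guard is an assumption.

```python
def weighted_score(text_count, link_count, page_text,
                   weight_ratio=0.5, weight_text=0.5):
    """Weighted sum of the link-free text ratio and the share of page text.

    weight_ratio and weight_text are hypothetical defaults for illustration.
    """
    if text_count == 0 or page_text == 0:
        return 0.0  # assumption: empty nodes score zero
    ratio = (text_count - link_count) / text_count
    share = text_count / page_text
    return weight_ratio * ratio + weight_text * share


PAGE_TEXT = 6  # total words in the example page body
article_div = weighted_score(5, 1, PAGE_TEXT)
noise_div = weighted_score(1, 0, PAGE_TEXT)
print(article_div > noise_div)
```

With these placeholder weights, the article DIV (5,1) outranks the noise DIV (1,0), which the Basic Score alone ranked higher.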
16. [Figure: DOM tree annotated with the values of the weighted scoring function. The real article text node, DIV, contains the maximum score (0.9953).]
17. The experiment was performed on 1620 news articles from 27 different news sources.
Using the Basic Score:
Precision is around 0.85
Recall is 0.02 (very poor)
Using the modified weighted score function:
Precision is around 0.9562 (improved)
Recall is 0.9088 (great improvement)
Source: Jyotika Prasad et al., "CoreEx: Content Extraction from Online News Articles"
19. This approach works in two steps.
STEP 1
First, learning is performed on a set of news web pages, and a model is built that identifies the location of article content and noise content.
STEP 2
A new web page is given as input to the model, and the article text is obtained.
20. [Figure: learning-based technique. From a training dataset, the model learns common features of web pages to distinguish between noise and main article text content. Given a target web page, the model outputs the article text.]
21. The technique focuses on removing noise content from a news web page.
Learning uses web pages from a single news source.
The model builds a Style Tree after learning the common layout of all the web pages.
The model (Style Tree) is applied to a target web page from the same news source to classify noise nodes and content nodes.
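22. [Figure: DOM trees of two pages, d1 (Html, Body, DIV DIV, P IMG P) and d2 (Html, Body, DIV DIV, a P BR P), merged into a style tree. The shared nodes Html, Body, and DIV DIV have a page count of 2; the style nodes P IMG P and a P BR P each have a page count of 1.]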
23. Noise nodes and content nodes are identified based on the information gain (entropy) of each node.
A node that appears in the same few presentation styles across pages is likely to be a noise node.
A node whose presentation styles and actual content are more diverse across pages is a probable content node.
24. If E is an element node and the number of pages that contain E is m, then
$$NodeImp(E) = \begin{cases} -\sum_{i=1}^{l} p_i \log p_i, & \text{if } m > 1 \\ 1, & \text{if } m = 1 \end{cases}$$
where l denotes the number of child style nodes of E and p_i is the probability that a web page uses the i-th of those l style nodes.
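The node importance formula can be sketched as below. Two things are assumptions rather than statements from the slides: p_i is estimated as the number of pages using the i-th style node divided by m, and the natural logarithm is used because the formula does not show a base.

```python
import math


def node_importance(style_page_counts, m):
    """Entropy-style importance of an element node E.

    style_page_counts[i] is the number of pages using the i-th child
    style node of E; m is the number of pages that contain E.
    """
    if m == 1:
        return 1.0
    total = 0.0
    for count in style_page_counts:
        p = count / m  # assumed estimate of p_i
        if p > 0:
            total -= p * math.log(p)
    return total


# E appears on 2 pages that share one style: no diversity.
same_style = node_importance([2], 2)
# E appears on 2 pages, each with its own style: more diversity.
two_styles = node_importance([1, 1], 2)
print(same_style, two_styles)
```

A node rendered identically on every page gets importance 0 under this sketch, while a node whose child styles vary from page to page gets a positive value.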
25. [Figure: example style tree rooted at root and body, with child nodes such as Table, Tr, IMG, P, A, and Text, each annotated with a page count (for example 100, 35, 25, 15).]
26. Advantage
The algorithm is fast once learning is over.
Disadvantages
The Style Tree can take a large amount of memory.
It requires several web pages from a single domain to learn from.
28. The technique learns visual features of a web page and identifies the boundary of the article text content.
A simple visual-based technique uses the following two steps:
Step 1: Identify the different text segments using break node identification based on CSS.
Step 2: Use the global optimization method MSS (Maximum Scoring Subsequence) to identify the article text body.
29. <Br> and <Hr> tags are always break nodes.
For other element nodes, the CSS display property is checked.
If the CSS display property is "block", the element has a line break before or after it.
Text segments are then formed using the nearest line break node of every text node.
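A rough sketch of Step 1 using Python's standard `html.parser` follows. Because a plain parser cannot read computed CSS, the `BLOCK_TAGS` set below is a hypothetical stand-in for "elements whose display property is block"; a real implementation would check the rendered style instead.

```python
from html.parser import HTMLParser

# Assumption: tags treated as break nodes. <br> and <hr> always are;
# the rest approximate elements rendered with display: block.
BLOCK_TAGS = {"br", "hr", "p", "div", "h1", "h2", "h3", "li", "table"}


class SegmentParser(HTMLParser):
    """Groups consecutive text nodes that are not separated by a break node."""

    def __init__(self):
        super().__init__()
        self.segments = []
        self._current = []

    def flush(self):
        text = " ".join(self._current)
        if text:
            self.segments.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()  # a line break before the element

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()  # a line break after the element

    def handle_data(self, data):
        if data.strip():
            self._current.append(data.strip())


def text_segments(html):
    parser = SegmentParser()
    parser.feed(html)
    parser.close()
    parser.flush()
    return parser.segments


SAMPLE = "<body><p><a>t1</a> <i>t2</i><br>t3</p><div><b>t4</b> <u>t5</u></div></body>"
print(text_segments(SAMPLE))
```

Inline elements such as `a`, `i`, `b`, and `u` do not end a segment, so their text is merged with neighbouring text until the next break node.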
30. [Figure: DOM tree under Body containing element nodes (P, DIV, A, I, B, U, em), break nodes (Br), and text nodes t1 to t8. Consecutive text nodes are grouped into text segments based on the nearest line break node.]
31. Given the set of text segments from Step 1, we have to group the segments that can be part of the article text.
The algorithm gives each segment a score of either +1 or -1 in the following way.
$$F(S) = \begin{cases} +1, & P_{size} > c_1,\ P_{colour} > c_2,\ P_{link} < c_3 \\ -1, & \text{otherwise} \end{cases}$$
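The scoring rule and the Maximum Scoring Subsequence step can be sketched as follows. The slides do not define the features or thresholds, so the numeric arguments of `segment_score` are hypothetical, and the search is implemented as a standard maximum-sum contiguous run (Kadane's algorithm), which is one common way to compute a maximum scoring subsequence.

```python
def segment_score(p_size, p_colour, p_link, c1, c2, c3):
    """+1 if all three feature tests pass, otherwise -1."""
    if p_size > c1 and p_colour > c2 and p_link < c3:
        return 1
    return -1


def maximum_scoring_subsequence(scores):
    """Return (start, end, total) of the contiguous run with the largest sum.

    end is exclusive; an empty run (0, 0, 0) is returned if no score is positive.
    """
    best = (0, 0, 0)
    run_start, run_total = 0, 0
    for i, score in enumerate(scores):
        if run_total <= 0:
            run_start, run_total = i, 0  # restart the run here
        run_total += score
        if run_total > best[2]:
            best = (run_start, i + 1, run_total)
    return best


# Hypothetical per-segment scores for eight text segments.
SCORES = [-1, 1, 1, -1, 1, 1, 1, -1]
start, end, total = maximum_scoring_subsequence(SCORES)
print(start, end, total)
```

The run may include a few segments scored -1 when the segments around them score +1, which lets an isolated low-scoring segment inside the article body stay part of the extracted text.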
32. Learning-based techniques are fast.
Heuristic techniques can be applied to any web page.
Heuristic techniques rely on threshold values, which may not always be accurate.
Heuristic techniques are slow.
Learning-based techniques require a sufficient number of web pages to learn from.
33. News Comments monitoring can be used to predict the
popularity of an article prior to its publication.
Comments also describe the mindset of the citizens about a
particular event.
Comments can also be used to identify discussions/debates
going on about a news story.
34. The technique uses the number of comments as a key factor in predicting the popularity of an article.
The method also considers the publication hour of the article and the category it belongs to.
The method is based on linear regression:
Y = a + bX
where X = the number of comments an article received over a time period
Y = the predicted volume of comments
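The regression Y = a + bX can be fitted with ordinary least squares, sketched below in plain Python. The training numbers are made up for illustration, and treating X as an early comment count and Y as the final comment volume is an assumption about how the model is trained.

```python
def fit_line(xs, ys):
    """Least-squares estimates (a, b) for Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predict(a, b, x):
    """Predicted volume of comments for an observed count x."""
    return a + b * x


# Hypothetical data: comments observed early vs. final comment volume.
early_counts = [2, 5, 8, 12, 20]
final_counts = [10, 22, 30, 50, 78]
a, b = fit_line(early_counts, final_counts)
print(predict(a, b, 15))
```

Separate (a, b) pairs could be fitted per publication hour or per category by calling `fit_line` on the matching subset of articles.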
35. [Figure: how the proposed technique works. Published articles and the comments repository are used to build different regression models (Y = a + bX) based on publication hour, category, and year. The best regression is selected and applied to the article for popularity prediction, and the output is the predicted volume of comments.]
36. The experiment was performed on four years of article data (from February 2006 to June 2010).
Based on per-year data: it was concluded that the articles published during 2008-10 are good for prediction.
Based on the publication time of an article: articles published between 6 and 11 AM are best suited for prediction.
37. When people comment on the comments of other people, a discussion structure is created.
The proposed method is used to identify that discussion structure in Dutch news media.
The technique answers the following two questions:
1. How to extract the comments?
2. How to identify the discussion thread?
38. [Figure: comment extraction architecture. Components: RSS feeds from Dutch news sources like Torus and AD, an article scraper, a comments scraper that works from the comment URL and HTML page, and a comments and articles repository.]
39. The technique identifies a commenter's name in the comment text.
“Yes Tom you are right”
Posted by: Bob
It also assumes that the @ character can be used to refer to someone.
“@Bob this is not a good political view.”
Posted by: Jimmy
Issue: An author name may also occur as an ordinary part of the comment text; for example, the name Boy may appear in “good boy”.
40. The following machine learning based methods are proposed:
Word Boundary Based: Tokenize the comment and the commenter name, and check for the commenter name in the comment.
POS Tagging and Loose Match: Only words that are nouns are matched, using the following similarity measure.
$$similarity(m_1, m_2) = \frac{2 \cdot match(m_1, m_2)}{length(m_1) + length(m_2)}$$
An optimal threshold value of 0.85 was obtained experimentally.
@ Trigger and Loose Match: The @ character is used as a trigger to look up previous comments. Loose match is then used to get all references in the comment text.
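The word boundary check and the loose match can be sketched with the Python standard library. The slides do not define match(m1, m2); this sketch assumes it counts matching characters, in which case the similarity formula coincides with `difflib.SequenceMatcher.ratio()`. POS tagging is omitted, so every token is compared rather than only nouns, and the function names are hypothetical.

```python
import re
from difflib import SequenceMatcher


def word_boundary_match(commenter, comment):
    """True if the commenter name occurs as a whole word in the comment."""
    pattern = r"\b" + re.escape(commenter) + r"\b"
    return re.search(pattern, comment, flags=re.IGNORECASE) is not None


def similarity(m1, m2):
    """2 * match / (length(m1) + length(m2)), with match = matching characters."""
    return SequenceMatcher(None, m1.lower(), m2.lower()).ratio()


def loose_match(commenter, comment, threshold=0.85):
    """True if any token of the comment is similar enough to the name."""
    tokens = re.findall(r"\w+", comment)
    return any(similarity(commenter, token) >= threshold for token in tokens)


print(word_boundary_match("Tom", "Yes Tom you are right"))
print(word_boundary_match("Tom", "Tomorrow is fine"))
print(loose_match("Jimmy", "I agree with Jimy here"))
```

The loose match tolerates small misspellings of a name, which the exact word boundary check would miss.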
41. We have learned the importance of article text and comments.
Articles can be extracted using heuristic, learning-based, and visual-based techniques.
Comments can be monitored for popularity prediction and for identifying discussion structure or debate.