5. Crawl
• Most of the Macedonian news sites don’t have RSS feeds.
• One-level crawl from a set of hubs:
(Macedonia) http://www.a1.com.mk/vesti/kategorija.aspx?KatID=1
(Economy) http://www.a1.com.mk/vesti/kategorija.aspx?KatID=2
(Sport) http://www.forum.com.mk/index.php?option=com_content&task=blogsection&id=9&Itemid=100
(Culture) http://novamakedonija.com.mk/DesktopDefault.aspx?tabindex=1&tabid=2&CategID=26
(None) http://www.kirilica.com.mk/
• Many hubs per source.
Some sources have fixed hub addresses (A1, Makfax, etc.); for others
we need to determine the hub addresses at runtime (Dnevnik, Utrinski, etc.)
• Hubs annotated with section name (topic):
Macedonia, Balkan, World, Economy, Culture, Fun, Sport, Chronicle, None
• At the moment, hubs of the sources are provided manually.
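A minimal sketch of the one-level crawl, assuming the hub page HTML has already been downloaded. The names `LinkCollector` and `crawl_hub` are hypothetical and not taken from TIME.mk; every link found on a hub becomes a candidate article tagged with the hub's section.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects (absolute_url, link_text) pairs from one hub page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # href of the <a> element currently open, if any
        self._text = []     # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the hub address.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def crawl_hub(hub_url, section, html):
    """One-level crawl: follow no links, just list what the hub points to."""
    parser = LinkCollector(hub_url)
    parser.feed(html)
    # Assumption: links without visible text are not article links.
    return [
        {"url": url, "link_text": text, "section": section}
        for url, text in parser.links
        if text
    ]
```

The link text is kept because the extraction heuristics on the next slide match it against the article title.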
TIME.mk Proprietary
6. Article Extraction
Segment into title / body / image
Heuristics:
• Title matches the link text and/or the HTML title, and is above the body
• Body is a big run of unformatted Cyrillic text, below the title
• Image is extracted from the hub page and has an attached link with exactly the same address as the article
The same procedure is used for extracting all articles from all sources!
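The title and body heuristics can be sketched as follows. This is an illustration under assumptions, not the TIME.mk implementation: the page is assumed to be already split into text blocks in document order, and the function names and the 0.8 Cyrillic threshold are hypothetical.

```python
import re

CYRILLIC = re.compile(r"[\u0400-\u04FF]")


def cyrillic_ratio(text):
    """Fraction of alphabetic characters that are Cyrillic."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(1 for ch in letters if CYRILLIC.match(ch)) / len(letters)


def normalize(text):
    """Lowercase and collapse whitespace so strings can be compared."""
    return " ".join(text.split()).lower()


def extract_article(blocks, link_text, html_title, min_ratio=0.8):
    """blocks: text blocks of the article page, in document order."""
    # Body: the longest block that is mostly Cyrillic text.
    body_idx = None
    for i, block in enumerate(blocks):
        if cyrillic_ratio(block) >= min_ratio:
            if body_idx is None or len(block) > len(blocks[body_idx]):
                body_idx = i
    if body_idx is None:
        return None

    # Title: the closest block above the body that equals the link text
    # or is contained in the HTML title.
    link_norm = normalize(link_text)
    page_title_norm = normalize(html_title)
    title = None
    for block in reversed(blocks[:body_idx]):
        candidate = normalize(block)
        if candidate and (candidate == link_norm or candidate in page_title_norm):
            title = block.strip()
            break

    return {"title": title, "body": blocks[body_idx].strip()}
```

Because nothing in the sketch depends on a particular site's markup, the same function can be applied to pages from every source.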
8. Clustering
Partition news articles into disjoint subsets (clusters),
such that:
Articles within a cluster are very similar
Articles in different clusters are very different
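The slides do not name the clustering algorithm. One simple way to obtain such a partition is a greedy single-pass scheme over weighted term vectors, sketched below; the function names, the cosine measure, and the 0.5 threshold are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(vectors, threshold=0.5):
    """Assign each article to the most similar cluster, or open a new one.

    Returns a list of clusters, each a list of article indices.
    """
    clusters = []  # each: {"members": [...], "centroid": {term: summed weight}}
    for i, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for c in clusters:
            sim = cosine(vec, c["centroid"])
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append({"members": [i], "centroid": dict(vec)})
        else:
            best["members"].append(i)
            # The summed vector serves as centroid: cosine ignores scale.
            for t, w in vec.items():
                best["centroid"][t] = best["centroid"].get(t, 0.0) + w
    return [c["members"] for c in clusters]
```

Every article lands in exactly one cluster, so the result is a disjoint partition; the threshold controls how similar articles must be to share a cluster.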
9. Word weights
Weight is function of word frequency within a document and across all documents
TF(w) = frequency of word w in a news article
• Intuition: a word appearing more frequently in a text is more likely to be related to
its “meaning”
IDF(w) = log [N/nw] + 1
• where N = #news articles, nw is #news articles containing w
• Intuition: words appearing in many news articles are generally not very informative
(e.g., “и” (and), “е” (is), “на” (on), “не” (not), “со” (with), “во” (in), “нели” (isn't it), “затоа” (therefore), etc.)
TFIDF: weight of a word in a news article is product of these quantities:
• TFIDF(w) = TF(w) x IDF(w)
Example: A1, 17:15h, MK
Кривична пријава против Андреј Петров (Criminal charges against Andrej Petrov)
петров(28.699) пивара(25.589) андреј(23.382) комиси(23.15) мвр(20.603) кривич(16.449)
дистри(16.036) сдсм(15.714) незако(14.921) жалела(14.482)…
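The weighting above can be sketched in a few lines. The function name `tfidf` is hypothetical, the input is assumed to be already tokenized and stemmed, and the natural logarithm is an assumption because the slide does not state the base.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of token lists. Returns one {word: weight} dict per article."""
    n_docs = len(docs)

    # n_w: number of articles that contain word w at least once.
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in docs:
        tf = Counter(tokens)  # TF(w): raw frequency of w in this article
        weights.append(
            {w: tf[w] * (math.log(n_docs / doc_freq[w]) + 1) for w in tf}
        )
    return weights
```

A word that occurs in every article gets IDF = log(1) + 1 = 1, so its weight reduces to its raw frequency, while a word found in a single article is boosted by log(N) + 1.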
13. Cluster Scoring Logic
Cluster Score = quality-of-sources * freshness-of-news
Quality of a source: how useful is this source?
- Fraction of non-duplicate articles
- Participation in large stories
- First publisher of a top story
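The slide gives only the product form of the score. The sketch below fills in the two factors with assumptions that are not from the slides: source quality as a plain average of the three signals (each scaled to 0..1), freshness as exponential decay with a hypothetical 6-hour half-life, and the cluster's quality as the sum over its articles.

```python
def source_quality(non_dup_fraction, large_story_rate, first_publisher_rate):
    """Assumption: unweighted average of three signals, each in [0, 1]."""
    return (non_dup_fraction + large_story_rate + first_publisher_rate) / 3.0


def freshness(age_hours, half_life_hours=6.0):
    """Assumption: the value of a story halves every half_life_hours."""
    return 0.5 ** (age_hours / half_life_hours)


def cluster_score(articles):
    """articles: list of (source_quality, age_hours) pairs for one cluster.

    Cluster Score = quality-of-sources * freshness-of-news
    """
    # Assumption: summing quality rewards stories covered by many good sources.
    quality = sum(q for q, _ in articles)
    # Assumption: the newest article determines how fresh the story is.
    fresh = max(freshness(age) for _, age in articles)
    return quality * fresh
```

With these choices, a story covered by several good sources outranks a single-source story of the same age, and any story sinks as it ages.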
14. Article Scoring Logic
Article Score
– Used for ranking within a cluster
– Function of:
• Age
• Quality of source
• Title overlap with cluster centroid
• Article size
• …
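As an illustration of how the listed signals might be combined, the sketch below computes the title overlap with the cluster centroid and mixes it with the other three named signals. The equal weights, the age and size normalizations, and the function names are all assumptions; the slide does not specify the actual function or its remaining inputs.

```python
def title_overlap(title_tokens, centroid, top_k=10):
    """Fraction of title tokens found among the centroid's top_k terms.

    centroid: {term: weight} for the cluster.
    """
    if not title_tokens:
        return 0.0
    top_terms = set(sorted(centroid, key=centroid.get, reverse=True)[:top_k])
    return sum(1 for t in title_tokens if t in top_terms) / len(title_tokens)


def article_score(age_hours, source_quality, overlap, size_chars,
                  weights=(0.25, 0.25, 0.25, 0.25)):
    """Hypothetical linear mix of the four signals named on the slide."""
    w_age, w_quality, w_overlap, w_size = weights
    age_term = 1.0 / (1.0 + age_hours)          # newer articles score higher
    size_term = min(size_chars / 2000.0, 1.0)   # assumption: cap at 2000 chars
    return (w_age * age_term + w_quality * source_quality
            + w_overlap * overlap + w_size * size_term)
```

Sorting the articles of a cluster by this score in descending order gives the ranking within the cluster.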
16. News sources activity
[Bar chart: activity per news source, in a period of 2 working days; vertical axis from 0 to 200. Source labels are not recoverable.]
17. Visitors
[Chart: #visitors, Jul to Dec; vertical axis from 0 to 7000]
Data labels, in increasing order: 100, 180, 700, 1500, 4500, 6500
Annotated events:
• Launch of TIME.mk, 1 July 2008
• Discussions on MK forums
• Article about TIME.mk in Нова Македонија
• ON.net started to present TIME.mk news
• 365.com.mk started to present TIME.mk news
19. Next to come …
• Search of the archive
• RSS feeds
• Click metrics & personalization
cluster ranking adjustable to user preferences
• News alerts
emails with links to news that contains the provided keywords
• Weekly and Monthly news threads
• New topics: Technology, Health, etc.
• Inclusion of other news sources (currently only 26)
• Automatic Hub discovery
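• Improvements in the clustering algorithms (more sophisticated NLP)
Recognize equivalent terms: СДСМ = СДС, премиер = груевски (prime minister = Gruevski), нафта = бензин (oil = petrol), АМС = агенција за млади и спорт (Agency for Youth and Sport), струја = електрична енергија (electricity = electrical energy), etc.
Go beyond duplicate detection by measuring how many new facts an article introduces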
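One way to make equivalent terms count as the same word is to rewrite tokens to a canonical form before weighting. The sketch below is an assumption about how this could look, not a description of planned TIME.mk code: the table `CANONICAL`, the choice of which form is canonical, and the function name are hypothetical, and entries such as премиер = груевски would need updating over time.

```python
# Hypothetical table: variant (possibly multi-word) -> canonical token.
CANONICAL = {
    "сдс": "сдсм",
    "груевски": "премиер",
    "бензин": "нафта",
    "електрична енергија": "струја",
    "агенција за млади и спорт": "амс",
}


def normalize_terms(tokens, canonical=CANONICAL):
    """Replace known variants with their canonical token, longest match first."""
    phrases = {tuple(k.split()): v for k, v in canonical.items()}
    max_len = max(len(p) for p in phrases)
    out, i = [], 0
    while i < len(tokens):
        # Try the longest phrase that fits at position i, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in phrases:
                out.append(phrases[key])
                i += n
                break
        else:
            # No variant starts here: keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out
```

After this step, two articles that use different names for the same entity share a term and therefore look more similar to the clustering step.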
20. Acknowledgments
- Pajo & Biba for registering TIME.mk in MARNET
- Karolina for offering DNS services and HTML/CSS tricks
- Igor (Zuljo) for implementing the new design
- Nikola and Daniel for implementing text extraction for TIMES.si
- Many, many users for suggesting improvements and sending tons of emails about bugs on TIME.mk pages