Web Crawling,
Analysis and Archiving
PHD DEFENSE
VANGELIS BANOS
DEPARTMENT OF INFORMATICS, ARISTOTLE UNIVERSITY OF THESSALONIKI
OCTOBER 2015
COMMITTEE MEMBERS
Yannis Manolopoulos, Apostolos Papadopoulos, Dimitrios Katsaros,
Athena Vakali, Anastasios Gounaris, Georgios Evangelidis, Sarantos Kapidakis.
Problem definition: The web is disappearing
Web Archiving
• Web archiving is the process of collecting portions of
the Web to ensure the information is preserved in
an archive for researchers, historians, and the public.
• Many important organisations have been working on web archiving
since 1996.
Our Contributions
We focus on Web Crawling, Analysis and Archiving.
1. New metrics and systems to assess how well websites can be
archived,
2. New algorithms and systems to improve web crawling
efficiency and performance,
3. New approaches and systems to archive weblogs,
4. New algorithms focused on weblog data extraction.
◦Publications:
• 4 articles in scientific journals (1 still under review),
• 7 international conference proceedings,
• 1 book chapter.
Presentation Structure
1. An Innovative Method to Evaluate Website
Archivability,
2. Near-duplicate and Cycle Detection in Webgraphs
towards Optimised Web Crawling,
3. The BlogForever Platform: An Integrated Approach
to Preserve Weblogs,
4. A Scalable Approach to Harvest Modern Weblogs,
5. Conclusions and Future Work.
1. An Innovative Method to Evaluate Website Archivability
Problem description
• Not all websites can be archived correctly.
• Web bots face difficulties in harvesting websites (technical problems, low
performance, invalid code, crawler blocking).
• After web harvesting, archive administrators manually review the content.
• Web crawling is automated, while Quality Assurance (QA) is manual.
Our contributions
1. The Credible Live Evaluation of Archive Readiness Plus (CLEAR+) Method to
evaluate Website Archivability.
2. The ArchiveReady.com system which is the reference implementation of the
method.
3. Evaluation of, and observations regarding, the Archivability of 12 prominent
Web Content Management Systems (WCMS).
CLEAR+: A Credible Live Method to Evaluate Website Archivability
• Website Archivability (WA) captures the core aspects of a website
crucial in diagnosing whether it has the potential to be archived
with completeness and accuracy.
o Not to be confused with website reliability, availability, security, etc.
• CLEAR+: A method to produce a credible on-the-fly measurement
of Website Archivability by:
o Imitating web bots to crawl a website.
o Evaluating captured information such as file encoding and errors.
o Evaluating compliance with standards, formats and metadata.
o Calculating a WA Score (0 – 100%).
CLEAR+ Archivability Facets and Website Attributes
• FA: Accessibility
• FC: Cohesion
• FM: Metadata
• FST: Standards Compliance
CLEAR+ Method Summary
1. Perform specific evaluations on Website Attributes
2. Each evaluation has the following attributes:
1. Belongs to one or more WA Facets.
2. Has low, medium, or high Significance (different weight).
3. Has a score range from 0 – 100%.
3. The score of each Facet is the weighted average of all
evaluations’ scores.
4. The final Website Archivability score is the average of all Facets'
scores.
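To make the arithmetic concrete, here is a minimal sketch of this scoring scheme in Python. The significance weights (low=1, medium=2, high=3) and the sample evaluations are illustrative assumptions, not the exact values the method defines:

```python
# Sketch of CLEAR+ scoring. Weights and sample evaluations below are
# illustrative assumptions, not the thesis' exact numbers.
WEIGHTS = {"low": 1, "medium": 2, "high": 3}

def facet_score(evaluations):
    """Weighted average of (score, significance) pairs for one facet."""
    total = sum(WEIGHTS[sig] for _, sig in evaluations)
    return sum(score * WEIGHTS[sig] for score, sig in evaluations) / total

def website_archivability(facet_scores):
    """Final WA score: plain average of all facet scores."""
    return sum(facet_scores) / len(facet_scores)

accessibility = facet_score([(0, "high"), (95, "high"), (100, "medium")])
cohesion = facet_score([(100, "medium"), (100, "medium")])
print(f"WA = {website_archivability([accessibility, cohesion]):.1f}%")
```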
Accessibility Facet
ADBIS 2015 website evaluation, 1st Sept 2015:

Evaluation (Facet FA: Accessibility)                     Rating  Significance
No sitemap.xml                                           0%      High
21 valid and 1 invalid link                              95%     High
2 inline JavaScript files                                0%      High
HTTP caching headers                                     100%    Medium
Average response time 30 ms (very fast)                  100%    High
Not using proprietary formats (e.g. Flash or QuickTime)  100%    High
Facet total: 63%
Cohesion Facet
• If the files constituting a single website are dispersed across different web
locations, acquisition and ingest are likely to suffer if one or more of those
locations fail.
• 3rd-party resources increase website volatility.

Evaluation (Facet FC: Cohesion)   Rating  Significance
6 local and no external scripts   100%    Medium
9 local and no external images    100%    Medium
2 local and no external CSS       100%    Medium
Facet total: 100%
(ADBIS 2015 website evaluation, 1st Sept 2015)
Metadata Facet
• Adequate metadata are a big concern for digital curation.
• The lack of metadata impairs the archive’s ability to manage,
organise, retrieve and interact with content effectively.
Evaluation (Facet FM: Metadata)  Rating  Significance
HTTP Content type                100%    Medium
HTTP Caching headers             100%    Medium
Facet total: 100%
(ADBIS 2015 website evaluation, 1st Sept 2015)
Standards Compliance Facet
Evaluation (Facet FST: Standards Compliance)             Rating  Significance
2 invalid CSS files                                      0%      Medium
Invalid HTML file                                        0%      Medium
No HTTP Content transfer encoding                        50%     Medium
HTTP Content type found                                  100%    Medium
HTTP Caching headers found                               100%    Medium
9 images found and validated with JHOVE                  100%    Medium
Not using proprietary formats (e.g. Flash or QuickTime)  100%    High
Facet total: 74%
(ADBIS 2015 website evaluation, 1st Sept 2015)
ADBIS’2015 Website Archivability Evaluation
• Web application implementing CLEAR+
• Web interface and REST API
• Developed using Python, MySQL, Redis,
PhantomJS, Nginx, Linux.
Experimentation with Assorted Datasets
• D1: National libraries, D2: Top 200 universities,
• D3: Government organizations, D4: Random spam websites from Alexa.
Evaluation by experts
• Experts evaluate how well a website is archived in the Internet
Archive and assign a score.
• We evaluate the WA Score using ArchiveReady.com.
• We compute Pearson's correlation coefficient between the WA score (and
each WA facet) and the experts' scores.
• Correlation: 0.516
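For reference, a tiny sketch of this computation, with hypothetical score pairs (the study's actual data came from the expert evaluation):

```python
# Pearson correlation between hypothetical expert scores and WA scores.
from statistics import mean

def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

experts = [70, 85, 60, 90, 75]  # hypothetical expert scores
wa      = [68, 74, 59, 88, 81]  # hypothetical ArchiveReady WA scores
print(round(pearson(experts, wa), 3))
```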
WA Variance in the Same Website
Web Content Management Systems Archivability
• Aim: Identify strengths and weaknesses in different web
CMS regarding their WA.
• Corpus: 5,821 random WCMS samples from the Alexa
top 1M websites. Systems:
o Blogger, DataLife Engine, DotNetNuke, Drupal,
Joomla, Mediawiki, MovableType, Plone, PrestaShop,
Typo3, vBulletin, Wordpress.
• Evaluation using the ArchiveReady.com API
• Results saved in MySQL and analysed.
WCMS Accessibility Variations
WCMS Standards Compliance Variations
WCMS Metadata Results
WCMS Archivability Results Summary
Website Archivability Impact
• The Deutsches Literaturarchiv Marbach has been using the ArchiveReady API in its
web archiving workflow since early 2014.
• Stanford University Libraries' Web Archiving Resources recommends using
the CLEAR method and ArchiveReady.
• The University of South Australia uses ArchiveReady in its Digital
Preservation course (INFS 5082).
• Invited to present at the Library of Congress, National Digital Information
Infrastructure & Preservation, Web Archiving, 2015, and the Internet
Archive Web Archiving meeting (University of Innsbruck, 2013).
• Many contacts and users from: University of Newcastle, University of
Manchester, Columbia University, Stanford University, University of
Michigan Bentley Historical Library, Old Dominion University.
• 120 unique daily visitors, 80,000+ evaluations at http://archiveready.com/.
Presentation Structure
1. An Innovative Method to Evaluate Website Archivability,
2. Near-duplicate and Cycle Detection in Webgraphs
towards Optimised Web Crawling,
3. The BlogForever Platform: An Integrated Approach to
Preserve Weblogs,
4. A Scalable Approach to Harvest Modern Weblogs,
5. Conclusions and Future Work.
2. Near-duplicate and Cycle Detection in Webgraphs
towards Optimised Web Crawling
Problem description
• Web bots capture a lot of duplicate and near-duplicate data.
o There are methods to detect and remove duplicate data after crawling.
o There are few methods to remove near-duplicate data in web archives.
• Web bots fall into web spider traps: webpages that cause infinite loops. There is
no automated solution to detect them.
Our Contributions
1. A set of methods to detect duplicate and near-duplicate webpages in real
time during web crawling.
2. A set of methods to detect web spider traps using webgraphs in real time
during web crawling.
3. The WebGraph-It.com system, a web platform which implements the
proposed methods.
Key Concepts
• Unique Webpage Identifier?
• Webpage similarity metric?
• Web crawling modeled as a graph?
Key Concepts: Unique Webpage Identifier
• URI is not always optimal as a unique webpage identifier.
o http://edition.cnn.com/videos - http://edition.cnn.com/videos#some-point
o http://edition.cnn.com/videos?v1=1&v2=2
o http://edition.cnn.com/videos?v2=2&v1=1
• Sort-friendly URI Reordering Transform (SURT) URI Conversion.
o URI: scheme://user@domain.tld:port/path?query#fragment
o SURT: scheme://(tld,domain,:port@user)/path?query
o URI: http://edition.cnn.com/tech -> SURT: com,cnn,edition/tech
• SURT encoding is lossy. SURT is not always reversible to URI.
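A simplified SURT conversion sketch for the plain http case above. Production implementations (e.g. the `surt` Python package used by web archives) apply many more canonicalisation rules; the query-sorting step here is one way to make the two query orderings above collide, stated as an assumption:

```python
from urllib.parse import urlsplit

def to_surt(uri):
    """Simplified SURT: reverse the host labels, keep the path, sort the
    query parameters, and drop the fragment (hence lossy)."""
    p = urlsplit(uri)
    surt = ",".join(reversed(p.hostname.split("."))) + p.path
    if p.query:
        surt += "?" + "&".join(sorted(p.query.split("&")))
    return surt

assert to_surt("http://edition.cnn.com/tech") == "com,cnn,edition/tech"
assert (to_surt("http://edition.cnn.com/videos?v1=1&v2=2")
        == to_surt("http://edition.cnn.com/videos?v2=2&v1=1"))
```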
Key Concepts: Unique Webpage Identifier Similarity
• Near-duplicate URIs/SURTs may have duplicate content.
o http://vbanos.gr/page?show-greater=10 - http://vbanos.gr/page?show-greater=11
o http://vbanos.gr/blog/tag/cakephp/ - http://vbanos.gr/blog/tag/php/
• We use the Sorensen-Dice coefficient similarity to search for
near-duplicate webpage identifiers with a 95% similarity
threshold.
o Low sensitivity to word ordering,
o Low sensitivity to length variations,
o Runs in linear time.
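A minimal sketch of the Sorensen-Dice coefficient over character bigrams, which is how near-duplicate identifiers like the pair above can be flagged against the 95% threshold:

```python
from collections import Counter

def dice(a, b):
    """Sorensen-Dice coefficient over character bigrams; linear time."""
    grams = lambda s: Counter(s[i:i + 2] for i in range(len(s) - 1))
    ca, cb = grams(a), grams(b)
    return 2 * sum((ca & cb).values()) / (sum(ca.values()) + sum(cb.values()))

u1 = "gr,vbanos/page?show-greater=10"
u2 = "gr,vbanos/page?show-greater=11"
print(dice(u1, u2) >= 0.95)  # True: the two SURTs are near-duplicates
```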
Key Concepts: Webpage content similarity
• Content similarity:
• Exact duplicate webpages
• Near-duplicate webpages (ads, dates, counters may change)
• We use the simhash algorithm (Charikar) to compute a bit
signature for each webpage.
• 96-bit webpage signature.
• Near-duplicate webpages differ in very few bits.
• Fast to compare the similarity of two webpages.
• Efficient storage (save only the signature, keep it in memory).
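A minimal simhash sketch, assuming whitespace tokenisation and an MD5-derived 96-bit token hash; the thesis' exact tokeniser and hash function may differ:

```python
import hashlib

def simhash(text, bits=96):
    """Charikar-style simhash: sum per-bit votes over token hashes."""
    votes = [0] * bits
    for token in text.split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16) & ((1 << bits) - 1)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")

p1 = "breaking news story published on the site today with visit counter 1041"
p2 = "breaking news story published on the site today with visit counter 1042"
# Far fewer differing bits than for unrelated pages (~48 of 96 on average).
print(hamming(simhash(p1), simhash(p2)))
```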
Key concepts: Webgraph cycle detection
• Step 1: A new node F is added to the webgraph.
• Step 2: Get nearby nodes (dist = 3) and check for duplicates / near-duplicates.
• Step 3: Cycle detection using DFS (dist = 3).
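A sketch of this bounded check, assuming the webgraph is an adjacency dict keyed by SURT and that a near-duplicate predicate (e.g. a simhash comparison) is supplied; the names are illustrative, not the system's actual API:

```python
def has_near_cycle(graph, new_node, near_dup, dist=3):
    """Depth-bounded DFS from new_node: report a cycle if a path of at
    most `dist` edges reaches new_node or a near-duplicate of it."""
    def dfs(node, depth, visited):
        if depth > dist:
            return False
        for nxt in graph.get(node, ()):
            if nxt == new_node or near_dup(nxt, new_node):
                return True
            if nxt not in visited and dfs(nxt, depth + 1, visited | {nxt}):
                return True
        return False
    return dfs(new_node, 1, {new_node})

g = {"A": ["B"], "B": ["C"], "C": ["A"]}
print(has_near_cycle(g, "A", lambda u, v: False))  # True: A -> B -> C -> A
```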
Web Crawling Algorithms
WebGraph-It.com System
• Web application implementing all presented algorithms. API Available.
• Built using Python, PhantomJS, Redis, MariaDB, Linux.
• Easy to expand and create new web crawling algorithms as plugins.
Evaluation
1. Dataset: 100 random websites from Alexa top 1M.
2. Crawl with all 8 algorithms (C1-C8) using the WebGraph-it system.
3. Record metrics for each web crawl.
4. Analyse the results and compare with the base web crawl.
Indicative results for a single website
Results
Evaluation conclusions
• Best method is C8: Cycle detection with content similarity
• 17.1% faster than the base crawl.
• 60% of base crawl webpages captured.
• 98.3% results completeness.
• Always use SURT instead of URL as a unique webpage
identifier.
• Use URL/SURT similarity AND content similarity together.
• Using URL/SURT similarity alone leads to incomplete results.
Presentation Structure
1. An Innovative Method to Evaluate Website Archivability,
2. Near-duplicate and Cycle Detection in Webgraphs
towards Optimised Web Crawling,
3. The BlogForever Platform: An Integrated Approach to
Preserve Weblogs,
4. A Scalable Approach to Harvest Modern Weblogs,
5. Conclusions and Future Work.
3. The BlogForever Platform: An Integrated Approach to
Preserve Weblogs
Problem description
Current web archiving tools have issues with weblog archiving.
• Scheduling (fixed time intervals vs. archiving when new content becomes available),
• Content selection (archiving everything instead of only the updated content),
• Ignoring weblog features (rich set of information entities, structured content, RSS, tags,
etc.).
Our contributions
1. A survey of the technical characteristics of weblogs.
2. Methods to improve weblog harvesting, archiving and management.
3. Methods to integrate weblog archives with existing archive technologies.
4. The BlogForever platform: A system to support harvesting, ingestion, management and
reuse of weblogs.
Technical survey of the blogosphere
• Dataset: 259,930 blogs
• Evaluate the use of:
o Blog platforms,
o Web standards (HTTP Headers, HTML markup etc),
o XML feeds,
o Image formats,
o JavaScript frameworks,
o Semantic markup (Microformats, XFN, OpenGraph, etc)
Indicative survey results: Blog platforms
Indicative survey results: Image and feed types
[Figure: BlogForever conceptual data model, version 0.6. Core entities: Blog,
Entry (with Post and Page subtypes), Comment, Content, Author, Feed, Layout,
SnapshotView. Thematic groups include Categorised Content, Community, Web Feed,
External Widgets, Network and Linked Data, Blog Context, Semantics, Spam
Detection, Ranking, Category and Similarity, Crawling Info, and Standard and
Ontology Mapping.]
The BlogForever platform
[Architecture overview]
Harvesting (blog crawlers):
• Real-time monitoring
• HTML data extraction engine
• Spam filtering
• Web services extraction engine
Inputs: unstructured information, web services, blog APIs. Outputs: original
data and XML metadata, passed to the repository.
Preserving, managing and reusing (blog digital repository):
• Digital preservation and QA
• Collections curation
• Public access APIs
• Web interface to browse, search, export
• Personalised services
The BlogForever platform
Evaluation using external testers
Presentation Structure
1. An Innovative Method to Evaluate Website Archivability,
2. Near-duplicate and Cycle Detection in Webgraphs
towards Optimised Web Crawling,
3. The BlogForever Platform: An Integrated Approach to
Preserve Weblogs,
4. A Scalable Approach to Harvest Modern Weblogs,
5. Conclusions and Future Work.
4. A scalable approach to harvest modern weblogs
Problem description
• Inefficient weblog harvesting with generic solutions.
• Unpredictable publishing rate of weblogs.
Our contributions
1. A new algorithm to build extraction rules from blog web feeds with
linear time complexity,
2. Applications of the algorithm to extract authors, publication dates and
comments,
3. A new web crawler architecture and system capable of extracting blog
articles, authors, publication dates and comments.
Motivation & Method Overview
• Extracting metadata and content from HTML is hard because web
standards adoption is low: 95% of websites do not pass HTML validation.
• Focusing on blogs, we observed that:
1. Blogs provide XML feeds: standardized views of their latest ~10 posts.
2. We have to access more posts than the ones referenced in web feeds.
3. Posts of the same blog share a similar HTML structure.
• Content Extraction Method Overview
1. Use blog XML feeds and referenced HTML pages as training data to
build extraction rules.
2. For each XML element (Title, Author, Description, Publication date,
etc) create the relevant HTML extraction rule.
3. Use the defined extraction rules to process all blog pages.
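A minimal sketch of the training step, assuming lxml and exact text matching to locate a feed value in the post HTML; the helper below is illustrative, not the crawler's actual API:

```python
from lxml import html

def candidate_xpaths(page_html, feed_value):
    """Yield the XPath of every node whose text equals the feed value.
    The most specific (deepest) candidate is usually the desired rule."""
    tree = html.fromstring(page_html)
    root = tree.getroottree()
    for node in tree.iter():
        if node.text_content().strip() == feed_value.strip():
            yield root.getpath(node)

page = ("<html><body><div id='page'><header>"
        "<h1>volumelaser.eim.gr</h1></header></div></body></html>")
print(list(candidate_xpaths(page, "volumelaser.eim.gr")))
```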
Locate in HTML page all RSS referenced elements
Generic procedure to build extraction rules
Extraction rules and string similarity
• Rules are XPath queries.
• For each rule, we compute a score based on string similarity.
• The choice of ScoreFunction greatly influences the running time
and precision of the extraction process.
• Why we chose Sorensen–Dice coefficient similarity:
1. Low sensitivity to word ordering and length variations,
2. Runs in linear time.
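A sketch of choosing the best rule with a Dice-based ScoreFunction, reusing the `dice` function defined earlier (assumed in scope) and lxml; the function names are illustrative:

```python
from lxml import html

def extract(page_html, rule):
    """Apply an XPath rule and return the first matched node's text."""
    nodes = html.fromstring(page_html).xpath(rule)
    return nodes[0].text_content() if nodes else ""

def best_rule(training, rules, score):
    """training: list of (page_html, feed_value) pairs. Return the rule
    whose extracted text best matches the feed values on average."""
    return max(rules, key=lambda r: sum(
        score(extract(p, r), v) for p, v in training) / len(training))
```

On the example of the next slide, the rule /body/div[@id="page"]/header/h1 would win with a similarity score of 100%.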
Example: blog post title best extraction rule
• Find the RSS blog post title "volumelaser.eim.gr" in the HTML page:
http://vbanos.gr/blog/2014/03/09/volumelaser-eim-gr-2/
• The best extraction rule for the blog post title is:
/body/div[@id="page"]/header/h1

XPath                                               HTML Element Value                    Similarity Score
/body/div[@id="page"]/header/h1                     volumelaser.eim.gr                    100%
/body/div[@id="page"]/div[@class="entry-code"]/p/a  http://volumelaser.eim.gr/            80%
/head/title                                         volumelaser.eim.gr | Βαγγέλης Μπάνος  66%
...                                                 ...                                   ...
Variations for authors, dates, comments
• Authors, dates and comments are special cases as they appear
many times throughout a post.
• To resolve this issue, we implement an extra component in the
Score function:
o For authors: an HTML tree distance between the evaluated node and
the post content node.
o For dates: we check the alternative formats of each date in addition
to the HTML tree distance between the evaluated node and the post
content node.
o Example: “1970-01-01” == “January 1 1970”
o For comments: we use the special comment RSS feed.
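A sketch of the alternative-format date comparison, assuming a small hand-picked format list; the crawler's actual list is presumably larger:

```python
from datetime import datetime

FORMATS = ["%Y-%m-%d", "%B %d %Y", "%d/%m/%Y", "%b %d, %Y"]

def same_date(a, b):
    """True if both strings parse to the same date under any known format."""
    def parse(s):
        for fmt in FORMATS:
            try:
                return datetime.strptime(s.strip(), fmt).date()
            except ValueError:
                continue
        return None
    da, db = parse(a), parse(b)
    return da is not None and da == db

print(same_date("1970-01-01", "January 1 1970"))  # True
```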
System
Pipeline of operations:
1. Render HTML and JavaScript,
2. Extract content,
3. Extract comments,
4. Download multimedia files,
5. Propagate resulting records to
the back-end.
Interesting areas:
◦ Blog post page identification,
◦ Handle blogs with a large number of pages,
◦ JavaScript rendering,
◦ Scalability.
Evaluation
• Extract articles and titles from web pages and compare
extraction success rate and running time.
• Comparison against three open-source projects:
o Readability (JavaScript), Boilerpipe (Java), Goose (Scala).
• Dataset: 2,300 blog posts from 230 blogs from Spinn3r.
5. Conclusions
• We proposed tangible ways to improve web crawling, web
archiving and blog archiving with new algorithms and
systems.
• The Credible Live Evaluation of Archive Readiness Plus
(CLEAR+) method to evaluate Website Archivability.
• Methods to improve web crawling via detecting duplicates,
near-duplicates and web spider traps on the fly.
• A new approach to harvest, manage, preserve and reuse
weblogs.
• A new scalable algorithm to harvest modern weblogs.
Publications
Publications in scientific journals:
1. Banos V., Manolopoulos Y.: “Near-duplicate and Cycle Detection in Webgraphs towards
Optimised Web Crawling”, ACM Transactions on the Web Journal, submitted, 2015.
2. Banos V., Manolopoulos Y.: “A Quantitative Approach to Evaluate Website Archivability Using
the CLEAR+ Method”, International Journal on Digital Libraries, 2015.
3. Banos V., Blanvillain O., Kasioumis N., Manolopoulos Y.: “A Scalable Approach to Harvest
Modern Weblogs”, International Journal of AI Tools, Vol.24, No.2, 2015.
4. Kasioumis N., Banos V., Kalb H.: “Towards Building a Blog Preservation Platform”, World Wide
Web Journal, Special Issue on Social Media Preservation and Applications, Springer, 2013.
Publications in international conference proceedings:
1. Banos V., Manolopoulos Y.: “Web Content Management Systems Archivability”, Proceedings
19th East-European Conference on Advances in Databases & Information Systems (ADBIS),
Springer Verlag, LNCS Vol.9282, Poitiers, France, 2015.
2. Blanvillain O., Banos V., Kasioumis N.: “BlogForever Crawler: Techniques and Algorithms to
Harvest Modern Weblogs”, Proceedings 4th International Conference on Web Intelligence,
Mining & Semantics (WIMS), ACM Press, Thessaloniki, Greece, 2014.
Publications
3. Banos V., Kim Y., Ross S., Manolopoulos Y.: “CLEAR: a Credible Method to Evaluate Website
Archivability”, Proceedings 10th International Conference on Preservation of Digital Objects
(iPRES), Lisbon, Portugal, 2013.
4. Kalb H., Lazaridou P., Banos V., Kasioumis N., Trier M.: “BlogForever: From Web Archiving to Blog
Archiving”, Proceedings ‘Informatik Angepast an Mensch, Organisation und Umwelt‘
(INFORMATIK), Koblenz, Germany, 2013.
5. Stepanyan K., Gkotsis G., Banos V., Cristea A., Joy M.: “A Hybrid Approach for Spotting,
Disambiguating and Annotating Places in User-Generated Text”, Proceedings 22nd International
Conference on World Wide Web (WWW), Rio de Janeiro, Brazil, 2013.
6. Banos V., Baltas N., Manolopoulos Y.: “Trends in Blog Preservation”, Proceedings 14th
International Conference on Enterprise Information Systems (ICEIS), Vol.1, pp.13-22, Wroclaw,
Poland, 2012.
7. Banos V., Stepanyan K., Manolopoulos Y., Joy M., Cristea A.: “Technological Foundations of the
Current Blogosphere”, Proceedings 2nd International Conference on Web Intelligence, Mining &
Semantics (WIMS), ACM Press, Craiova, Romania, 2012.
Book chapters:
1. Banos V., Baltas N., Manolopoulos Y.: “Blog Preservation: Current Challenges and a New
Paradigm”, chapter 3 in book Enterprise Information Systems XIII, by Cordeiro J., Maciaszek L. and
Filipe J. (eds.), Springer LNBIP Vol.141, pp.29–51, 2013.
Future Work
1. Website Archivability
1. Augment the CLEAR+ method with new metrics.
2. Disseminate the method to wider audiences (e.g. web developers).
3. Integrate with web archiving systems.
4. Improve http://archiveready.com/
2. Web crawling duplicate and near-duplicate detection
1. Develop new algorithm variants.
2. Integrate into open source web crawlers.
3. Provide support services to web crawling operations.
4. Improve http://webgraph-it.com/
3. BlogForever platform
1. Automate content curation processes.
2. Improve entity detection in archived content.
3. Support more types of weblogs.
4. http://webternity.eu/
Web Crawling,
Analysis and Archiving
PHD DEFENSE
VANGELIS BANOS
DEPARTMENT OF INFORMATICS, ARISTOTLE UNIVERSITY OF THESSALONIKI
OCTOBER 2015
THANK YOU!

More Related Content

Viewers also liked

Viewers also liked (7)

Web crawler
Web crawlerWeb crawler
Web crawler
 
Parsing XML Data
Parsing XML DataParsing XML Data
Parsing XML Data
 
Web crawler
Web crawlerWeb crawler
Web crawler
 
WebCrawler
WebCrawlerWebCrawler
WebCrawler
 
Current challenges in web crawling
Current challenges in web crawlingCurrent challenges in web crawling
Current challenges in web crawling
 
Crawleando a web feito gente grande com o scrapy
Crawleando a web feito gente grande com o scrapyCrawleando a web feito gente grande com o scrapy
Crawleando a web feito gente grande com o scrapy
 
Web Crawler
Web CrawlerWeb Crawler
Web Crawler
 

More from Vangelis Banos

Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03
Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03
Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03Vangelis Banos
 
Αποθηκεύεται το διαδίκτυο; Web Archiving!
Αποθηκεύεται το διαδίκτυο; Web Archiving!Αποθηκεύεται το διαδίκτυο; Web Archiving!
Αποθηκεύεται το διαδίκτυο; Web Archiving!Vangelis Banos
 
The theory and practice of Website Archivability
The theory and practice of Website ArchivabilityThe theory and practice of Website Archivability
The theory and practice of Website ArchivabilityVangelis Banos
 
ΥπερΔιαύγεια
ΥπερΔιαύγειαΥπερΔιαύγεια
ΥπερΔιαύγειαVangelis Banos
 
The Hellenic Aggregator - Overview, procedures & the cooperation with Europeana
The Hellenic Aggregator - Overview, procedures & the cooperation with EuropeanaThe Hellenic Aggregator - Overview, procedures & the cooperation with Europeana
The Hellenic Aggregator - Overview, procedures & the cooperation with EuropeanaVangelis Banos
 
Η Ιστορία της Μετρολογίας
Η Ιστορία της ΜετρολογίαςΗ Ιστορία της Μετρολογίας
Η Ιστορία της ΜετρολογίαςVangelis Banos
 
Ο κόσμος των μικρών & των μεγάλων μέσα από το βλέμμα της κας Μετρολογίας
Ο κόσμος των μικρών & των μεγάλων μέσα από το βλέμμα της κας ΜετρολογίαςΟ κόσμος των μικρών & των μεγάλων μέσα από το βλέμμα της κας Μετρολογίας
Ο κόσμος των μικρών & των μεγάλων μέσα από το βλέμμα της κας ΜετρολογίαςVangelis Banos
 
Heterogeneity in european digital libraries, the europeana challenge
Heterogeneity in european digital libraries, the europeana challengeHeterogeneity in european digital libraries, the europeana challenge
Heterogeneity in european digital libraries, the europeana challengeVangelis Banos
 
Επιτυχημένα παραδείγματα διαλειτουργικότητας σε ελληνικά αποθετήρια και σχε...
Επιτυχημένα παραδείγματα διαλειτουργικότητας  σε ελληνικά αποθετήρια  και σχε...Επιτυχημένα παραδείγματα διαλειτουργικότητας  σε ελληνικά αποθετήρια  και σχε...
Επιτυχημένα παραδείγματα διαλειτουργικότητας σε ελληνικά αποθετήρια και σχε...Vangelis Banos
 
Η τεχνική υποδομή του εθνικού συσσωρευτή
Η τεχνική υποδομή του εθνικού συσσωρευτήΗ τεχνική υποδομή του εθνικού συσσωρευτή
Η τεχνική υποδομή του εθνικού συσσωρευτήVangelis Banos
 

More from Vangelis Banos (10)

Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03
Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03
Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03
 
Αποθηκεύεται το διαδίκτυο; Web Archiving!
Αποθηκεύεται το διαδίκτυο; Web Archiving!Αποθηκεύεται το διαδίκτυο; Web Archiving!
Αποθηκεύεται το διαδίκτυο; Web Archiving!
 
The theory and practice of Website Archivability
The theory and practice of Website ArchivabilityThe theory and practice of Website Archivability
The theory and practice of Website Archivability
 
ΥπερΔιαύγεια
ΥπερΔιαύγειαΥπερΔιαύγεια
ΥπερΔιαύγεια
 
The Hellenic Aggregator - Overview, procedures & the cooperation with Europeana
The Hellenic Aggregator - Overview, procedures & the cooperation with EuropeanaThe Hellenic Aggregator - Overview, procedures & the cooperation with Europeana
The Hellenic Aggregator - Overview, procedures & the cooperation with Europeana
 
Η Ιστορία της Μετρολογίας
Η Ιστορία της ΜετρολογίαςΗ Ιστορία της Μετρολογίας
Η Ιστορία της Μετρολογίας
 
Ο κόσμος των μικρών & των μεγάλων μέσα από το βλέμμα της κας Μετρολογίας
Ο κόσμος των μικρών & των μεγάλων μέσα από το βλέμμα της κας ΜετρολογίαςΟ κόσμος των μικρών & των μεγάλων μέσα από το βλέμμα της κας Μετρολογίας
Ο κόσμος των μικρών & των μεγάλων μέσα από το βλέμμα της κας Μετρολογίας
 
Heterogeneity in european digital libraries, the europeana challenge
Heterogeneity in european digital libraries, the europeana challengeHeterogeneity in european digital libraries, the europeana challenge
Heterogeneity in european digital libraries, the europeana challenge
 
Επιτυχημένα παραδείγματα διαλειτουργικότητας σε ελληνικά αποθετήρια και σχε...
Επιτυχημένα παραδείγματα διαλειτουργικότητας  σε ελληνικά αποθετήρια  και σχε...Επιτυχημένα παραδείγματα διαλειτουργικότητας  σε ελληνικά αποθετήρια  και σχε...
Επιτυχημένα παραδείγματα διαλειτουργικότητας σε ελληνικά αποθετήρια και σχε...
 
Η τεχνική υποδομή του εθνικού συσσωρευτή
Η τεχνική υποδομή του εθνικού συσσωρευτήΗ τεχνική υποδομή του εθνικού συσσωρευτή
Η τεχνική υποδομή του εθνικού συσσωρευτή
 

Recently uploaded

₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...Diya Sharma
 
Call Now ☎ 8264348440 !! Call Girls in Green Park Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Green Park Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Green Park Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Green Park Escort Service Delhi N.C.R.soniya singh
 
Chennai Call Girls Porur Phone 🍆 8250192130 👅 celebrity escorts service
Chennai Call Girls Porur Phone 🍆 8250192130 👅 celebrity escorts serviceChennai Call Girls Porur Phone 🍆 8250192130 👅 celebrity escorts service
Chennai Call Girls Porur Phone 🍆 8250192130 👅 celebrity escorts servicesonalikaur4
 
On Starlink, presented by Geoff Huston at NZNOG 2024
On Starlink, presented by Geoff Huston at NZNOG 2024On Starlink, presented by Geoff Huston at NZNOG 2024
On Starlink, presented by Geoff Huston at NZNOG 2024APNIC
 
SEO Growth Program-Digital optimization Specialist
SEO Growth Program-Digital optimization SpecialistSEO Growth Program-Digital optimization Specialist
SEO Growth Program-Digital optimization SpecialistKHM Anwar
 
Russian Call girls in Dubai +971563133746 Dubai Call girls
Russian  Call girls in Dubai +971563133746 Dubai  Call girlsRussian  Call girls in Dubai +971563133746 Dubai  Call girls
Russian Call girls in Dubai +971563133746 Dubai Call girlsstephieert
 
'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...
'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...
'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...APNIC
 
Call Girls In Saket Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Saket Delhi 💯Call Us 🔝8264348440🔝Call Girls In Saket Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Saket Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...
Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...
Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...SofiyaSharma5
 
✂️ 👅 Independent Andheri Escorts With Room Vashi Call Girls 💃 9004004663
✂️ 👅 Independent Andheri Escorts With Room Vashi Call Girls 💃 9004004663✂️ 👅 Independent Andheri Escorts With Room Vashi Call Girls 💃 9004004663
✂️ 👅 Independent Andheri Escorts With Room Vashi Call Girls 💃 9004004663Call Girls Mumbai
 
How is AI changing journalism? (v. April 2024)
How is AI changing journalism? (v. April 2024)How is AI changing journalism? (v. April 2024)
How is AI changing journalism? (v. April 2024)Damian Radcliffe
 
Top Rated Pune Call Girls Daund ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...
Top Rated  Pune Call Girls Daund ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...Top Rated  Pune Call Girls Daund ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...
Top Rated Pune Call Girls Daund ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...Call Girls in Nagpur High Profile
 
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024APNIC
 
AWS Community DAY Albertini-Ellan Cloud Security (1).pptx
AWS Community DAY Albertini-Ellan Cloud Security (1).pptxAWS Community DAY Albertini-Ellan Cloud Security (1).pptx
AWS Community DAY Albertini-Ellan Cloud Security (1).pptxellan12
 
Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...Sheetaleventcompany
 
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.soniya singh
 
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...Neha Pandey
 
FULL ENJOY Call Girls In Mayur Vihar Delhi Contact Us 8377087607
FULL ENJOY Call Girls In Mayur Vihar Delhi Contact Us 8377087607FULL ENJOY Call Girls In Mayur Vihar Delhi Contact Us 8377087607
FULL ENJOY Call Girls In Mayur Vihar Delhi Contact Us 8377087607dollysharma2066
 

Recently uploaded (20)

₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
 
Call Now ☎ 8264348440 !! Call Girls in Green Park Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Green Park Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Green Park Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Green Park Escort Service Delhi N.C.R.
 
Chennai Call Girls Porur Phone 🍆 8250192130 👅 celebrity escorts service
Chennai Call Girls Porur Phone 🍆 8250192130 👅 celebrity escorts serviceChennai Call Girls Porur Phone 🍆 8250192130 👅 celebrity escorts service
Chennai Call Girls Porur Phone 🍆 8250192130 👅 celebrity escorts service
 
On Starlink, presented by Geoff Huston at NZNOG 2024
On Starlink, presented by Geoff Huston at NZNOG 2024On Starlink, presented by Geoff Huston at NZNOG 2024
On Starlink, presented by Geoff Huston at NZNOG 2024
 
SEO Growth Program-Digital optimization Specialist
SEO Growth Program-Digital optimization SpecialistSEO Growth Program-Digital optimization Specialist
SEO Growth Program-Digital optimization Specialist
 
Russian Call girls in Dubai +971563133746 Dubai Call girls
Russian  Call girls in Dubai +971563133746 Dubai  Call girlsRussian  Call girls in Dubai +971563133746 Dubai  Call girls
Russian Call girls in Dubai +971563133746 Dubai Call girls
 
'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...
'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...
'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...
 
Call Girls In Saket Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Saket Delhi 💯Call Us 🔝8264348440🔝Call Girls In Saket Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Saket Delhi 💯Call Us 🔝8264348440🔝
 
Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...
Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...
Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...
 
✂️ 👅 Independent Andheri Escorts With Room Vashi Call Girls 💃 9004004663
✂️ 👅 Independent Andheri Escorts With Room Vashi Call Girls 💃 9004004663✂️ 👅 Independent Andheri Escorts With Room Vashi Call Girls 💃 9004004663
✂️ 👅 Independent Andheri Escorts With Room Vashi Call Girls 💃 9004004663
 
How is AI changing journalism? (v. April 2024)
How is AI changing journalism? (v. April 2024)How is AI changing journalism? (v. April 2024)
How is AI changing journalism? (v. April 2024)
 
Top Rated Pune Call Girls Daund ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...
Top Rated  Pune Call Girls Daund ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...Top Rated  Pune Call Girls Daund ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...
Top Rated Pune Call Girls Daund ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...
 
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
 
AWS Community DAY Albertini-Ellan Cloud Security (1).pptx
AWS Community DAY Albertini-Ellan Cloud Security (1).pptxAWS Community DAY Albertini-Ellan Cloud Security (1).pptx
AWS Community DAY Albertini-Ellan Cloud Security (1).pptx
 
Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝
 
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
 
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
 
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
 
FULL ENJOY Call Girls In Mayur Vihar Delhi Contact Us 8377087607
FULL ENJOY Call Girls In Mayur Vihar Delhi Contact Us 8377087607FULL ENJOY Call Girls In Mayur Vihar Delhi Contact Us 8377087607
FULL ENJOY Call Girls In Mayur Vihar Delhi Contact Us 8377087607
 

Web Crawling, Analysis and Archiving. PhD Presentation

  • 1. Web Crawling, Analysis and Archiving PHD DEFENSE VANGELIS BANOS DEPARTMENT OF INFORMATICS, ARISTOTLE UNIVERSITY OF THESSALONIKI OCTOBER 2015 COMMITTEE MEMBERS Yannis Manolopoulos, Apostolos Papadopoulos, Dimitrios Katsaros, Athena Vakali, Anastasios Gounaris, Georgios Evangelidis, Sarantos Kapidakis.
  • 2. of 63 Problem definition: The web is disappearing 2WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 3. of 63 Web Archiving 3WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE • Web archiving is the process of collecting portions of the Web to ensure the information is preserved in an archive for researchers, historians, and the public. • Many important organisations work on web archiving since 1996.
  • 4. of 63 Our Contributions We focus on Web Crawling, Analysis and Archiving. 1. New metrics and systems to appreciate the possibilities of archiving websites, 2. New algorithms and systems to improve web crawling efficiency and performance, 3. New approaches and systems to archive weblogs, 4. New algorithms focused on weblog data extraction. ◦Publications: • 4 scientific journals (1 still under review), • 7 international conference proceedings, • 1 book chapter. 4WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 5. of 63 Presentation Structure 1. An Innovative Method to Evaluate Website Archivability, 2. Near-duplicate and Cycle Detection in Webgraphs towards Optimised Web Crawling, 3. The BlogForever Platform: An Integrated Approach to Preserve Weblogs, 4. A Scalable Approach to Harvest Modern Weblogs, 5. Conclusions and Future Work. 5WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 6. of 63 1. An Innovative Method to Evaluate Website Archivability Problem description • Not all websites can be archived correctly. • Web bots face difficulties in harvesting websites (Technical problems, low performance, invalid code, blocking web crawlers). • After web harvesting, archive administrators review manually the content. • Web crawing is automated while Quality Assurance (QA) is manual. Our contributions 1. The Credible Live Evaluation of Archive Readiness Plus (CLEAR+) Method to evaluate Website Archivability. 2. The ArchiveReady.com system which is the reference implementation of the method. 3. Evaluation and observation regarding 12 prominent Web Content Management Systems’ (CMS) Archivability. 6WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 7. of 63 CLEAR+: A Credible Live Method to Evaluate Website Archivability • Website Archivability (WA) captures the core aspects of a website crucial in diagnosing whether it has the potentiality to be archived with completeness and accuracy. o Not to be confused with website reliability, availability, security, etc. • CLEAR+: A method to produce a credible on-the-fly measurement of Website Archivability by: o Imitating web bots to crawl a website. o Evaluating captured information such as file encoding and errors. o Evaluating compliance with standards, formats and metadata. o Calculating a WA Score (0 – 100%). 7WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 8. of 63 CLEAR+ Archivability Facets and Website Attributes FA Accessibility Fc Cohesion FM Metadata FST Standards Compliance 8WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 9. of 63 CLEAR+ Method Summary 1. Perform specific evaluations on Website Attributes 2. Each evaluation has the following attributes: 1. Belongs to one or more WA Facets. 2. Has low, medium, or high Significance (different weight). 3. Has a score range from 0 – 100%. 3. The score of each Facet is the weighted average of all evaluations’ scores. 4. The final Website Archivability is the average of all Facets’ scores. 9WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 10. of 63 Accessibility Facet Facet Evaluation Rating Significance Total FA Accessibility No sitemap.xml 0% High 63% 21 valid and 1 invalid link 95% High 2 inline JavaScript files 0% High HTTP Caching Headers 100% Medium Average response time 30ms, very fast 100% High Not using proprietary formats (e.g. Flash or QuickTime) 100% High ADBIS 2015 Website Accessibility Evaluation 1st Sept 2015 10WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 11. of 63 Cohesion Facet • If files constituting a single website are dispersed across different web locations, the acquisition and ingest is likely to suffer if one or more web locations fail. • 3rd party resources increase website volatility. Facet Evaluation Rating Significance Total FC Cohesion 6 local and no external scripts 100% Medium 100%9 local and no external images 100% Medium 2 local and no external CSS 100% Medium ADBIS 2015 Website Accessibility Evaluation 1st Sept 2015 11WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 12. of 63 Metadata Facet • Adequate metadata are a big concern for digital curation. • The lack of metadata impairs the archive’s ability to manage, organise, retrieve and interact with content effectively. Facet Evaluation Rating Significance Total FM Metadata HTTP Content type 100% Medium 100% HTTP Caching headers 100% Medium ADBIS 2015 Website Accessibility Evaluation 1st Sept 2015 12WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 13. of 63 Standards Compliance Facet Facet Evaluation Rating Significance Total FST Standards Compliance 2 Invalid CSS files 0% Medium 74% Invalid HTML file 0% Medium No HTTP Content transfer encoding 50% Medium HTTP Content type found 100% Medium HTTP Caching headers found 100% Medium 9 images found and validated with JHOVE 100% Medium Not using proprietary formats (e.g. Flash or QuickTime) 100% High ADBIS 2015 Website Accessibility Evaluation 1st Sept 2015 13WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 14. of 63 ADBIS’2015 Website Archivability Evaluation • Web application implementing CLEAR+ • Web interface and REST API • Developed using Python, MySQL, Redis, PhantomJS, Nginx, Linux. 14WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 15. of 63 Experimentation with Assorted Datasets • D1: National libraries, D2: Top 200 universities, • D3: Government organizations, D4: Random spam websites from Alexa. 15WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 16. of 63 Evaluation by experts • Experts evaluate how well a website is archived in the Internet Archive and assign a score. • We evaluate the WA Score using ArchiveReady.com. • Pearson’s Correlation Coefficient for WA, WA Facets and experts’ score. • Correlation: 0.516 16WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 17. of 63 WA Variance in the Same Website 17WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 18. of 63 Web Content Management Systems Archivability • Aim: Identify strengths and weaknesses in different web CMS regarding their WA. • Corpus: 5.821 random WCMS Samples from the Alexa top 1m websites. Systems: o Blogger, DataLife Engine, DotNetNuke, Drupal, Joomla, Mediawiki, MovableType, Plone, PrestaShop, Typo3, vBulletin, Wordpress. • Evaluation using the ArchiveReady.com API • Results saved in MySQL and analysed. 18WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 19. of 63 WCMS Accessibility Variations 19WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 20. of 63 WCMS Standards Compliance Variations 20WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 21. of 63 WCMS Metadata Results 21WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 22. of 63 WCMS Archivability Results Summary 22WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 23. of 63 Website Archivability Impact • Deutches Literatur Archiv, Marbach, is using the ArchiveReady API in its web archiving workflow since early 2014. • Stanford University Libraries Web Archiving Resources recommends using the CLEAR method and ArchiveReady. • The University of South Australia is using ArchiveReady in their Digital Preservation Course (INFS 5082). • Invited to present at the Library of Congress, National Digital Information Infrastructure & Preservation, Web Archiving, 2015, and the Internet Archive Web Archiving meeting (University of Innsbruck, 2013). • Many contacts and users from: University of Newcastle, University of Manchester, Columbia University, Stanford University, University of Michigan Bentley Historical Library, Old Dominion University. • 120 unique daily visitors, 80.000+ evaluations at http://archiveready.com/. 23WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 24. of 63 Presentation Structure 1. An Innovative Method to Evaluate Website Archivability, 2. Near-duplicate and Cycle Detection in Webgraphs towards Optimised Web Crawling, 3. The BlogForever Platform: An Integrated Approach to Preserve Weblogs, 4. A Scalable Approach to Harvest Modern Weblogs, 5. Conclusions and Future Work. 24WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 25. of 63 2. Near-duplicate and Cycle Detection in Webgraphs towards Optimised Web Crawling Problem description • Web bots capture a lot of duplicate and near-duplicate data. o There are methods to detect and remove duplicate data after crawling. o There are few methods to remove near-duplicate data in web archives. • Web bots fall into web spider traps, webpages that cause infinite loops. No automated solution to detect them. Our Contributions 1. a set of methods to detect duplicate and near-duplicate webpages in real time during web crawling. 2. a set of methods to detect web spider traps using webgraphs in real time during web crawling. 3. The WebGraph-It.com system, a web platform which implements the proposed methods. 25WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 26. of 63 Key Concepts • Unique Webpage Identifier? • Webpage similarity metric? • Web crawling modeled as a graph? 26WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 27. of 63 Key Concepts: Unique Webpage Identifier • URI is not always optimal as a unique webpage identifier. o http://edition.cnn.com/videos - http://edition.cnn.com/videos#some-point o http://edition.cnn.com/videos?v1=1&v2=2 o http://edition.cnn.com/videos?v2=2&v1=1 • Sort-friendly URI Reordering Transform (SURT) URI Conversion. o URI: scheme://user@domain.tld:port/path?query#fragment o SURT: scheme://(tld,domain,:port@user)/path?query o URI: http://edition.cnn.com/tech -> SURT: com,cnn,edition/tech • SURT encoding is lossy. SURT is not always reversible to URI. 27WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 28. of 63 Key Concepts: Unique Webpage Identifier Similarity • Dear duplicate URIs/SURTs may have duplicate content. o http://vbanos.gr/page?show-greater=10 - http://vbanos.gr/page?show-greater=11 o http://vbanos.gr/blog/tag/cakephp/ - http://vbanos.gr/blog/tag/php/ • We use the Sorensen-Dice coefficient similarity to search for near-duplicate webpage identifiers with a 95% similarity threshold. o Low sensitivity to word ordering, o Low sensitivity to length variations, o Runs in linear time. 28WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 29. of 6329WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE Key Concepts: Unique Webpage Identifier Similarity
  • 30. of 63 Key Concepts: Webpage content similarity • Content similarity: • Exact duplicate webpages • Near-duplicate webpages (ads, dates, counters may change) • We use the simhash algorithm (Charikar) to calculate bit signatures from each webpage. • 96 bit webpage signature. • Near duplicate webpages have very few different bits. • Fast to compare the similarity of two webpages. • Efficient storage (save only the signature, keep it in memory). 30WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 31. of 6331WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE Key Concepts: Webpage content similarity
  • 32. of 63 Key concepts: Webgraph cycle detection Step 1 Step 2 Step 3 New Node F Get Nearby Nodes (dist=3) and Cycle Detection using DFS (dist=3) check for duplicate / near duplicate WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE 32 of 67
  • 33. of 63 Web Crawling Algorithms 33WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 34. of 63 WebGraph-It.com System • Web application implementing all presented algorithms. API Available. • Built using Python, PhantomJS, Redis, MariaDB, Linux. • Easy to expand and create new web crawling algorithms as plugins. 34WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 35. of 63 Evaluation 1. Dataset: 100 random websites from Alexa top 1M. 2. Crawl with all 8 algorithms (C1-C8) using the WebGraph-it system. 3. Record metrics for each web crawl. 4. Analyse the results and compare with the base web crawl. 35WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 36. of 63 Indicative results for a single website 36WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 37. of 63 Results 37WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 38. of 63 Evaluation conclusions • Best method is D8: Cycle detection with content similarity • 17.1% faster than the base crawl. • 60% of base crawl webpages captured. • 98.3% results completeness. • Always use SURT instead of URL as a unique webpage identifier. • Use URL/SURT similarity AND content similarity together. • Using URL/SURL similarity alone results in incomplete results. 38WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 39. of 63 Presentation Structure 1. An Innovative Method to Evaluate Website Archivability, 2. Near-duplicate and Cycle Detection in Webgraphs towards Optimised Web Crawling, 3. The BlogForever Platform: An Integrated Approach to Preserve Weblogs, 4. A Scalable Approach to Harvest Modern Weblogs, 5. Conclusions and Future Work. 39WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 40. of 63 3. The BlogForever Platform: An Integrated Approach to Preserve Weblogs Problem description Current web archiving tools have issues with weblog archiving. • Scheduling (timely intervals vs archive when new content is available. • Content selection (archive everything instead of archiving the updated content only), • Ignoring weblog features (rich set of information entities, structured content, RSS, tags, etc.) Our contributions 1. A survey of the technical characteristics of weblogs. 2. Methods to improve weblog harvesting, archiving and management. 3. Methods to integrate weblog archives with existing archive technologies. 4. The BlogForever platform: A system to support harvesting, ingestion, management and reuse of weblogs. 40WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 41. of 63 Technical survey of the blogosphere • Dataset: 259.930 blogs • Evaluate the use of: o Blog platforms, o Web standards (HTTP Headers, HTML markup etc), o XML feeds, o Image formats, o JavaScript frameworks, o Semantic markup (Microformats, XFN, OpenGraph, etc) 41WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 42. of 63 Indicative survey results: Blog platforms 42WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 43. of 63 Indicative survey results: Image and feed types 43WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 44. of 63 standard_descr content date Blog has Entry is a PostPage has Comment Content has Authorhas has Categorised ContentCategorised Content CommunityCommunity Web FeedWeb Feed External WidgetsExternal Widgets Network and Linked DataNetwork and Linked DataBlog ContextBlog Context SemanticsSemantics BlogForever: Conceptual Data Model Version 0.6 Spam DetectionSpam Detection embeds WidgetType crawler Aouth Widget Feed id format last_updated generator last_build_date related_feed Layout theme css images SnapshotView date format src hashas Expression_ Meta description def_keywords Spam date flag contains SpamCategory Keyword Sentiment Content_Simila rity score flag score src contains contains username URI UserProfile ExternalProfile ProfileType URI Association Triple subject predicate object Association Type Multimedia Text Link Tag src alt caption/descr GEO src description type value format tags copyright embedding thumbnail language Ranking, Category and SimilarityRanking, Category and Similarity value date Ranking given Similarity Crawling InfoCrawling Info Crawl captured Category similarity_score algorithm AffiliationTypeAffiliation Event date location name URL Topic avatar creator service_uri hasFeed_Type value Structured_ Meta name property has Standard and Ontology MappingStandard and Ontology Mapping OntologyMapp ing OntClass OntProperty SpamAlgorithm ImageAudio VideoDocument LinkType isa BlogEntity 44WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 45. of 63 The BlogForever platform (architecture diagram). Harvesting: blog crawlers (real-time monitoring, HTML data extraction engine, spam filtering, web services extraction engine) consume unstructured information, web services and blog APIs, and emit original data plus XML metadata. Preserving, managing and reusing: a blog digital repository provides digital preservation and QA, collections curation, public access APIs, a web interface to browse, search and export, and personalised services. 45WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 46. of 63 46WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 47. of 63 The BlogForever platform 47WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 48. of 63 Evaluation using external testers 48WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 49. of 63 Presentation Structure 1. An Innovative Method to Evaluate Website Archivability, 2. Near-duplicate and Cycle Detection in Webgraphs towards Optimised Web Crawling, 3. The BlogForever Platform: An Integrated Approach to Preserve Weblogs, 4. A Scalable Approach to Harvest Modern Weblogs, 5. Conclusions and Future Work. 49WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 50. of 63 4. A scalable approach to harvest modern weblogs 50WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE Problem description • Inefficient weblog harvesting with generic solutions. • Unpredictable publishing rate of weblogs. Our contributions 1. A new algorithm to build extraction rules from blog web feeds with linear time complexity, 2. Applications of the algorithm to extract authors, publication dates and comments, 3. A new web crawler architecture and system capable of extracting blog articles, authors, publication dates and comments.
  • 51. of 63 Motivation & Method Overview • Extracting metadata and content from HTML is hard because web standards adoption is low: 95% of websites do not pass HTML validation. • Focusing on blogs, we observed that: 1. Blogs provide XML feeds: standardized views of their latest ~10 posts. 2. We have to access more posts than the ones referenced in web feeds. 3. Posts of the same blog share a similar HTML structure. • Content Extraction Method Overview 1. Use blog XML feeds and the referenced HTML pages as training data to build extraction rules. 2. For each XML element (Title, Author, Description, Publication date, etc.) create the relevant HTML extraction rule. 3. Use the defined extraction rules to process all blog pages. 51WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 52. of 63 Locate all RSS-referenced elements in the HTML page 52WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 53. of 63 Generic procedure to build extraction rules 53WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
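In outline, the procedure pairs each feed value with the nodes of the corresponding HTML page and generalises their paths into XPath rules. A simplified sketch using lxml; the actual algorithm generalises via id/class attributes (as in /body/div[@id="page"]/header/h1) rather than by merely dropping positional indices:

```python
import re
from lxml import html

def candidate_rules(page_html, feed_value):
    """Return generalised XPath candidates for nodes whose text contains
    a value taken from the blog's XML feed (e.g. the post title).
    Ancestors match too; the string-similarity score (next slide)
    is what singles out the tightest node."""
    tree = html.fromstring(page_html)
    root = tree.getroottree()
    rules = set()
    for node in tree.iter():
        if not isinstance(node.tag, str):   # skip comments and PIs
            continue
        if feed_value in node.text_content():
            raw_path = root.getpath(node)   # e.g. /html/body/div[2]/h1
            rules.add(re.sub(r"\[\d+\]", "", raw_path))
    return sorted(rules)

page = "<html><body><div><h1>My first post</h1></div></body></html>"
print(candidate_rules(page, "My first post"))
# ['/html', '/html/body', '/html/body/div', '/html/body/div/h1']
```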
  • 54. of 63 Extraction rules and string similarity • Rules are XPath queries. • For each rule, we compute a score based on string similarity. • The choice of ScoreFunction greatly influences the running time and precision of the extraction process. • Why we chose the Sørensen–Dice coefficient: 1. Low sensitivity to word ordering and length variations, 2. Runs in linear time. 54WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
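For reference, the coefficient over character bigrams, which runs in time linear in the combined string length. The exact tokenisation used in the thesis is not shown on the slide, so treat this as one standard formulation:

```python
def dice_similarity(a, b):
    """Sørensen–Dice coefficient over character bigrams:
    2 * |bigrams(a) ∩ bigrams(b)| / (|bigrams(a)| + |bigrams(b)|)."""
    if len(a) < 2 or len(b) < 2:
        return 1.0 if a == b else 0.0
    counts = {}
    for i in range(len(a) - 1):                 # O(len(a))
        counts[a[i:i+2]] = counts.get(a[i:i+2], 0) + 1
    overlap = 0
    for i in range(len(b) - 1):                 # O(len(b))
        if counts.get(b[i:i+2], 0) > 0:
            counts[b[i:i+2]] -= 1
            overlap += 1
    return 2.0 * overlap / (len(a) + len(b) - 2)

print(dice_similarity("night", "nacht"))        # 0.25
```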
  • 55. of 63 Example: blog post title best extraction rule • Find the RSS blog post title “volumelaser.eim.gr” in the HTML page http://vbanos.gr/blog/2014/03/09/volumelaser-eim-gr-2/ • The best extraction rule for the blog post title is /body/div[@id=“page”]/header/h1. • Candidate rules and similarity scores: /body/div[@id=“page”]/header/h1 yields “volumelaser.eim.gr” (100%); /body/div[@id=“page”]/div[@class=“entry-code”]/p/a yields “http://volumelaser.eim.gr/” (80%); /head/title yields “volumelaser.eim.gr | Βαγγέλης Μπάνος” (66%); and so on. 55WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 56. of 63 Variations for authors, dates, comments • Authors, dates and comments are special cases as they appear many times throughout a post. • To resolve this issue, we implement an extra component in the Score function: o For authors: an HTML tree distance between the evaluated node and the post content node. o For dates: we check the alternative formats of each date in addition to the HTML tree distance between the evaluated node and the post content node. o Example: “1970-01-01” == “January 1 1970” o For comments: we use the special comment RSS feed. 56WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
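A minimal sketch of the date-equivalence check; the format list is a hypothetical stand-in for whatever alternatives the crawler actually tries, and the real Score component also weighs the HTML tree distance described above:

```python
from datetime import datetime

# Hypothetical format list; the actual set used is not shown on the slide.
DATE_FORMATS = ("%Y-%m-%d", "%B %d %Y", "%d/%m/%Y", "%b %d, %Y")

def same_date(feed_value, html_value):
    """True if two strings denote the same calendar date under any of
    the alternative formats ('1970-01-01' vs 'January 1 1970')."""
    def parse(text):
        for fmt in DATE_FORMATS:
            try:
                return datetime.strptime(text.strip(), fmt).date()
            except ValueError:
                pass
        return None
    a, b = parse(feed_value), parse(html_value)
    return a is not None and a == b

print(same_date("1970-01-01", "January 1 1970"))  # True
```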
  • 57. of 63 System Pipeline of operations: 1. Render HTML and JavaScript, 2. Extract content, 3. Extract comments, 4. Download multimedia files, 5. Propagate resulting records to the back-end. Interesting areas: ◦ Blog post page identification, ◦ Handle blogs with a large number of pages, ◦ JavaScript rendering, ◦ Scalability. 57WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 58. of 63 Evaluation 58WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE • Extract articles and titles from web pages, comparing extraction success rate and running time. • Comparison against three open-source projects: o Readability (JavaScript), Boilerpipe (Java), Goose (Scala). • Dataset: 2,300 blog posts from 230 blogs, obtained from Spinn3r.
  • 59. of 63 5. Conclusions • We proposed tangible ways to improve web crawling, web archiving and blog archiving with new algorithms and systems. • The Credible Live Evaluation of Archive Readiness Plus (CLEAR+) method to evaluate Website Archivability. • Methods to improve web crawling via detecting duplicates, near-duplicates and web spider traps on the fly. • A new approach to harvest, manage, preserve and reuse weblogs. • A new scalable algorithm to harvest modern weblogs. 59WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 60. of 63 Publications Publications in scientific journals: 1. Banos V., Manolopoulos Y.: “Near-duplicate and Cycle Detection in Webgraphs towards Optimised Web Crawling”, ACM Transactions on the Web, submitted, 2015. 2. Banos V., Manolopoulos Y.: “A Quantitative Approach to Evaluate Website Archivability Using the CLEAR+ Method”, International Journal on Digital Libraries, 2015. 3. Banos V., Blanvillain O., Kasioumis N., Manolopoulos Y.: “A Scalable Approach to Harvest Modern Weblogs”, International Journal on Artificial Intelligence Tools, Vol.24, No.2, 2015. 4. Kasioumis N., Banos V., Kalb H.: “Towards Building a Blog Preservation Platform”, World Wide Web Journal, Special Issue on Social Media Preservation and Applications, Springer, 2013. Publications in international conference proceedings: 1. Banos V., Manolopoulos Y.: “Web Content Management Systems Archivability”, Proceedings 19th East-European Conference on Advances in Databases & Information Systems (ADBIS), Springer Verlag, LNCS Vol.9282, Poitiers, France, 2015. 2. Blanvillain O., Banos V., Kasioumis N.: “BlogForever Crawler: Techniques and Algorithms to Harvest Modern Weblogs”, Proceedings 4th International Conference on Web Intelligence, Mining & Semantics (WIMS), ACM Press, Thessaloniki, Greece, 2014. 60WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 61. of 63 Publications 3. Banos V., Kim Y., Ross S., Manolopoulos Y.: “CLEAR: a Credible Method to Evaluate Website Archivability”, Proceedings 10th International Conference on Preservation of Digital Objects (iPRES), Lisbon, Portugal, 2013. 4. Kalb H., Lazaridou P., Banos V., Kasioumis N., Trier M.: “BlogForever: From Web Archiving to Blog Archiving”, Proceedings ‘Informatik Angepast an Mensch, Organisation und Umwelt‘ (INFORMATIK), Koblenz, Germany, 2013. 5. Stepanyan K., Gkotsis G., Banos V., Cristea A., Joy M.: “A Hybrid Approach for Spotting, Disambiguating and Annotating Places in User-Generated Text”, Proceedings 22nd International Conference on World Wide Web (WWW), Rio de Janeiro, Brazil, 2013. 6. Banos V., Baltas N., Manolopoulos Y.: “Trends in Blog Preservation”, Proceedings 14th International Conference on Enterprise Information Systems (ICEIS), Vol.1, pp.13-22, Wroclaw, Poland, 2012. 7. Banos V., Stepanyan K., Manolopoulos Y., Joy M., Cristea A.: “Technological Foundations of the Current Blogosphere”, Proceedings 2nd International Conference on Web Intelligence, Mining & Semantics (WIMS), ACM Press, Craiova, Romania, 2012. Book chapters: 1. Banos V., Baltas N., Manolopoulos Y.: “Blog Preservation: Current Challenges and a New Paradigm”, chapter 3 in book Enterprise Information Systems XIII, by Cordeiro J., Maciaszek L. and Filipe J. (eds.), Springer LNBIP Vol.141, pp.29–51, 2013. 61WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 62. of 63 Future Work 1. Website Archivability 1. Augment the CLEAR+ method with new metrics. 2. Disseminate to wider audiences (e.g. web developers) 3. Integrate with web archiving systems. 4. Improve http://archiveready.com/ 2. Web crawling duplicate and near-duplicate detection 1. Develop new algorithm variants. 2. Integrate into open source web crawlers. 3. Provide support services to web crawling operations. 4. Improve http://webgraph-it.com/ 3. BlogForever platform 1. Automate content curation processes. 2. Improve entity detection in archived content. 3. Support more types of weblogs. 4. http://webternity.eu/ 62WEB CRAWLING, ANALYSIS AND ARCHIVING - PHD DEFENSE
  • 63. Web Crawling, Analysis and Archiving PHD DEFENSE VANGELIS BANOS DEPARTMENT OF INFORMATICS, ARISTOTLE UNIVERSITY OF THESSALONIKI OCTOBER 2015 THANK YOU!

Editor's Notes

  1. Competitors are generic: they do not use XML feeds and they do not exploit the structural similarities of webpages. Our approach spends the majority of its total running time between initialisation and the processing of the first post.