BlogForever eChallenges 2012
1. The Need for Long Term Preservation of Weblogs: the
BlogForever Project
Ilias Trochidis
Aristotle University of Thessaloniki
Greece
Workshop 7c, 18 October 2012 eChallenges e-2012 Copyright 2012 BlogForever
2. State of the Blogosphere
• Blogs have become fairly established as an online
communication and web publishing tool.
• Hundreds of millions of blogs are published about every
conceivable subject
[Figure: Number of new blogs created on a weekly basis in WordPress.com]
3. The problem of Blog Preservation
• Despite the fast growth of the blogosphere, there is still no
effective solution for ubiquitous semantic weblog
archiving, digital preservation, management and
dissemination:
– Current web preservation initiatives are geared towards aggregating
and preserving html pages and not information entities (posts,
comments, authors, metadata, dates, pingbacks, etc)
– Current web archiving efforts disregard the preservation of Social
Networks and interrelations between the archived content (meme-
effect)
– Current web archives cannot identify topics, subjects or events
(monolithic). There is no generic web archiving solution capable of
implementing arbitrary subjects and topic hierarchies.
5. Blog archiving evaluation
• Example: The “Blogs of War: Weblogs as News” paper
documented 29 blogs on the Iraq war.
• Of those 29 blogs,
– 13 (45%) no longer existed on the Internet as of June 2012,
– only 9 blogs (31%) still contained information on the Iraq
war,
– 12 out of the 20 (60%) blogs that don’t exist were
preserved by the Internet Archive (problems with missing
photos, comments not archived etc.)
• Blogs on major events have already been lost.
8. Impact
• Output: a simple weblog archiving solution that any user,
user group or institution could use to preserve their
collections of weblogs ensuring:
– authenticity, integrity, completeness, usability, long term
accessibility
• Parties that will benefit: Bloggers, Universities, Libraries &
Information Centres, Museums, Education, Research,
Business
• Examples:
– CERN will create a repository with all physics blogs
– National Documentation Centre of Greece will create a repository
with academic blogs
– a National Library of Medicine would like to preserve a collection of
health and medicine blogs
9. Business Model
• BlogForever as a service (single installation that can be
used as a service by users and institutions)
• BlogForever as software (open source distribution)
• Universities, Research Institutes, Archives, Governments,
Blog Communities will be able to easily preserve their
collections of weblogs
• BlogForever will assure the preservation, the aggregation,
the management and the dissemination of these collections
• Do you need to preserve some blogs? We can set up a
BlogForever archive for you.
10. Future Work
• Analyse blog archives in order to gain a better
understanding of the content and provide new services:
– Use Linked Open Data to link archived blog content with other web
content
– Apply Semantic Extension of Tags to understand them better and
reuse them for multiple purposes.
• In any case, use Ontologies to interpret and reason with
information.
• Data mining in order to extract information from the
archives and transform it into an understandable structure
for further use.
• Brand reputation management and market sector repute
analysis
11. Thank you!
Any Questions?
Visit: http://blogforever.eu to learn more.
http://twitter.com/blogforever
http://facebook.com/BlogForever
The research leading to these results has received funding from the European Commission Framework Programme 7
(FP7), BlogForever project, grant agreement No.269963.
Editor's Notes
Blogger is the largest of these sites, with more than 46 million unique U.S. visitors during October 2011, making it second only to Facebook in the social networking category.
tumblr.com counts 38,884,272 total blogs with 53,399,798 posts on the 29th of December, while in July 2009 the number of posts per day was 650,000.
Facebook or microblogging sites such as Twitter have supported the growth of blogs by delivering traffic to content which originated in blogs.
1. Current web preservation initiatives are geared towards aggregating and preserving files and not information entities. For instance, the Internet Archive aggregates web pages and stores them in WARC files (ISO 28500:2009), compressed files similar to zip which are assigned a unique identification number and stored in a distributed file system. Additionally, WARC supports some metadata such as provenance and HTTP protocol metadata. Implicit page elements, such as:
– page title, headers, content, author information,
– metadata such as Dublin Core elements,
– RSS feeds and other Semantic Web technologies such as Microformats (Khare R.) and Microdata (Ronallo J.)
are completely ignored. This greatly affects the way stored information is managed, reducing the utility of the archive and hindering the creation of added-value services.
2. Current web archiving efforts disregard the preservation of Social Networks and of interrelations between the archived content. However, weblog interdependencies, demonstrated by the identification of central actors and peripheral weblogs as well as by the meme-effect that applies to them, need to be preserved to provide meaningful features to the weblog repository.
3. Current web archive scope is limited to monolithic regions, subjects or events. There is no generic web archiving solution capable of implementing arbitrary subjects and topic hierarchies. For instance, the National Library of Catalonia has initiated a web crawling and access project aiming to collect, process and provide permanent access to the entire cultural, scientific and general output of Catalonia in digital format (PADICAT). Alternatively, the Library of Congress has developed online collections for isolated historical events such as September 11, 2001 (Library of Congress).
There is an ongoing debate about the benefits and disadvantages of each long-term preservation methodology. Many papers have been written and many conferences dedicated to this issue have appeared. It is surprising, however, how little has been done at a practical level.
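To illustrate the difference between preserving a file and preserving information entities, the following minimal sketch pulls the page title, Dublin Core metadata and RSS feed links out of a blog page. It is not BlogForever code: the class name EntityExtractor, the function extract_entities and the sample page are hypothetical, and it uses only Python's standard html.parser module.

```python
from html.parser import HTMLParser


class EntityExtractor(HTMLParser):
    """Hypothetical helper: collects title, Dublin Core <meta> tags and RSS links."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.dublin_core = {}
        self.feeds = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower().startswith("dc."):
            # Dublin Core elements are conventionally embedded as <meta name="DC.xxx">
            self.dublin_core[attrs["name"]] = attrs.get("content") or ""
        elif tag == "link" and attrs.get("type") == "application/rss+xml":
            self.feeds.append(attrs.get("href"))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Made-up sample page, for illustration only.
SAMPLE_PAGE = """<html><head>
<title>Example post</title>
<meta name="DC.creator" content="A. Blogger">
<meta name="DC.date" content="2012-06-01">
<link rel="alternate" type="application/rss+xml" href="/feed.xml">
</head><body><p>Post body.</p></body></html>"""


def extract_entities(html):
    """Return the information entities found in one HTML page as a dict."""
    parser = EntityExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "dublin_core": parser.dublin_core,
        "feeds": parser.feeds,
    }


entities = extract_entities(SAMPLE_PAGE)
```

A file-oriented archive would store SAMPLE_PAGE as an opaque record. An entity-oriented archive could index the extracted author, date and feed separately, which is the kind of added-value service the notes above refer to.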