Common Crawl: An Open Repository of Web Data
1. London HUG
Common Crawl: An Open Repository of Web Data
Lisa Green
10 October 2012
2. Photo license: Public Domain Origin: http://en.wikipedia.org/wiki/File:Floppy_disk_2009_G1.jpg
5. Still Nascent
• Even cheaper storage
• Even cheaper compute
• Education
• Open Data
Image license: CC-BY Credit: NASA, ESA, and the Hubble Heritage Team (STScI/AURA)
10. Common Crawl Data
• ~8 Billion web pages
• ~120 TB
• 2008-2012
• ARC files, JSON metadata, text files
• Available to anyone
11. ARC Files - Raw Content
Metadata
• Status information
• HTTP response code
• File names & offsets of ARC files
• HTML title
• HTML meta tags
• RSS/Atom information
• All anchors/hyperlinks
Text Files - Text Only
http://commoncrawl.org/get-started
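As a rough illustration of what "raw content" in an ARC file looks like, each record starts with a one-line header followed by the fetched bytes. The sketch below assumes the ARC version 1 layout published by the Internet Archive (URL, IP address, archive date, content type, length, separated by spaces). The `ArcHeader` and `parse_arc_header` names and the sample line are made up for illustration and are not part of any Common Crawl tooling.

```python
from typing import NamedTuple


class ArcHeader(NamedTuple):
    """Fields of an ARC v1 record header line (assumed layout)."""
    url: str
    ip_address: str
    archive_date: str   # typically YYYYMMDDhhmmss
    content_type: str
    length: int         # number of content bytes that follow the header


def parse_arc_header(line: str) -> ArcHeader:
    # Split from the right so that the URL is whatever remains on the left.
    parts = line.strip().rsplit(" ", 4)
    if len(parts) != 5:
        raise ValueError(f"expected 5 space-separated fields, got {len(parts)}")
    url, ip_address, archive_date, content_type, length = parts
    return ArcHeader(url, ip_address, archive_date, content_type, int(length))


# Hypothetical header line, not taken from a real crawl file.
sample = "http://example.com/ 192.0.2.1 20120101000000 text/html 1234"
header = parse_arc_header(sample)
print(header.url, header.content_type, header.length)
```

A reader would then consume `header.length` bytes of content before looking for the next header line. Real files may contain headers that this simple split does not handle.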
12.
13. Change between 2010 and 2012
• URLs with embedded data +6%
• Microdata +14%
• RDFa +26%
http://webdatacommons.org
14. • 22% of Web pages contain Facebook URLs
• 8% of Web pages implement Open Graph tags
15. http://wikientities.appspot.com
A corpus of anchortext-WikipediaConcept-Count from the CommonCrawl dataset, to benefit research on WSD, NLP and IR.
Given a sentence, it can help identify entities (person, location, organization) in the sentence and map them onto Wikipedia concepts.
Explicit Topic Modeling: Given a concept (represented as a Wikipedia page), it can tell what are the most common terms people use to describe the concept.
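One way to picture the two uses of an anchortext-concept-count corpus is as a pair of lookups over (anchor text, Wikipedia concept, count) rows. The sketch below is only an illustration: the rows are invented toy data, and the function names are hypothetical rather than the project's actual API.

```python
from collections import defaultdict

# Invented toy rows: (anchor text, Wikipedia concept, count). Not real corpus data.
rows = [
    ("big apple", "New_York_City", 120),
    ("nyc", "New_York_City", 300),
    ("new york", "New_York_City", 500),
    ("new york", "New_York_(state)", 200),
]


def build_indexes(rows):
    """Index the counts both by concept and by anchor text."""
    by_concept = defaultdict(lambda: defaultdict(int))
    by_anchor = defaultdict(lambda: defaultdict(int))
    for anchor, concept, count in rows:
        by_concept[concept][anchor] += count
        by_anchor[anchor][concept] += count
    return by_concept, by_anchor


def top_terms(by_concept, concept, k=3):
    """Most common anchor texts used to describe a concept."""
    terms = by_concept.get(concept, {})
    return sorted(terms, key=lambda t: (-terms[t], t))[:k]


def most_likely_concept(by_anchor, anchor):
    """Concept most often linked from a given anchor text, or None."""
    concepts = by_anchor.get(anchor.lower(), {})
    if not concepts:
        return None
    return max(concepts, key=concepts.get)


by_concept, by_anchor = build_indexes(rows)
print(top_terms(by_concept, "New_York_City"))
print(most_likely_concept(by_anchor, "New York"))
```

Picking the highest-count concept is the simplest possible disambiguation rule. A real entity linker would also weigh the surrounding sentence.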
17. Other Use Examples
• Apache Giraph Testing
• Maplight
• Tineye
• Factual
• Sentiment Analysis Projects
18. In Development
• N-gram and Link Graph Extracts
• Pig Reader
• More Frequent Full Crawls
• Focused Subset Crawls at High Frequency
• Open Educational Resources
19. Thank You
Lisa Green
lisa@commoncrawl.org
www.commoncrawl.org
@commoncrawl
@boudicca