London HUG
Common Crawl: An Open Repository of Web Data
What Does the Data World Mean to Society?
Lisa Green
1 October 2012 / 10 October 2012
Photo license: Public Domain Origin: http://en.wikipedia.org/wiki/File:Floppy_disk_2009_G1.jpg
Photo license: CC-BY-SA Origin: http://en.wikipedia.org/wiki/File:Wikimedia_Foundation_Servers-8055_08.jpg
Image license: CC-BY Origin: http://en.wikipedia.org/wiki/File:Internet_map_1024.jpg
Still Nascent
• Even cheaper storage
• Even cheaper compute
• Education
• Open Data

Image license: CC-BY Credit: NASA, ESA, and the Hubble Heritage Team (STScI/AURA)
[Diagram labels: Gratis, Proprietary, Libre, Commercial]
[Diagram labels, top to bottom: Progress, Insight, Analysis, Data]
Gil Elbaz
Common Crawl Data
• ~8 Billion web pages
• ~120 TB
• 2008-2012
• ARC files, JSON metadata, text files
• Available to anyone
ARC Files - Raw Content
Metadata
•   Status information
•   HTTP response code
•   File names & offsets of ARC files
•   HTML title
•   HTML meta tags
•   RSS/Atom information
•   All anchors/hyperlinks
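
The "file names & offsets" entry above suggests that a single page can be fetched from an archive file without reading the whole file. The sketch below illustrates that idea only; it assumes each record is stored as its own gzip member, which the slides do not state, and it uses a tiny in-memory stand-in rather than a real ARC file. The helper name `read_record_at` is hypothetical.

```python
import gzip
import io
import zlib


def read_record_at(stream, offset):
    """Decompress the single gzip member that starts at `offset` (hypothetical helper)."""
    stream.seek(offset)
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    chunks = []
    while not decompressor.eof:
        block = stream.read(4096)
        if not block:
            break  # ran out of input before the member ended
        chunks.append(decompressor.decompress(block))
    return b"".join(chunks)


# Stand-in archive (assumption): two records, each written as a separate gzip member.
records = [b"record one: <html>first</html>", b"record two: <html>second</html>"]
archive = io.BytesIO()
offsets = []
for body in records:
    offsets.append(archive.tell())  # this is what a metadata entry would store
    archive.write(gzip.compress(body))

# Jump straight to the second record using only its stored offset.
second = read_record_at(archive, offsets[1])
```

Decompression stops at the end of the member that begins at the given offset, so the records that follow it are never decoded.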

Text Files - Text Only

           http://commoncrawl.org/get-started
Change between 2010 and 2012
• URLs with embedded data +6%
• Microdata +14%
• RDFa +26%

      http://webdatacommons.org
• 22% of Web pages contain Facebook URLs
• 8% of Web pages implement Open Graph tags
http://wikientities.appspot.com

A corpus of anchortext-WikipediaConcept-Count from the CommonCrawl dataset, to benefit research on WSD, NLP and IR.

Given a sentence, it can help identify entities (person, location, organization) in the sentence and map them onto Wikipedia concepts.

Explicit Topic Modeling: Given a concept (represented as a Wikipedia page), it can tell what are the most common terms people use to describe the concept.
Mapping French websites related to Open Data
Other Use Examples
•   Apache Giraph Testing
•   Maplight
•   Tineye
•   Factual
•   Sentiment Analysis Projects
In Development
•   N-gram and Link Graph Extracts
•   Pig Reader
•   More Frequent Full Crawls
•   Focused Subset Crawls at High Frequency
•   Open Educational Resources
Thank You
London HUG
What Does the Data World Mean to Society?
Lisa Green
lisa@commoncrawl.org
www.commoncrawl.org
@commoncrawl
@boudicca
1 October 2012

Weitere ähnliche Inhalte

Was ist angesagt?

Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018Amazon Web Services
 
Introduction to AWS Glue: Data Analytics Week at the SF Loft
Introduction to AWS Glue: Data Analytics Week at the SF LoftIntroduction to AWS Glue: Data Analytics Week at the SF Loft
Introduction to AWS Glue: Data Analytics Week at the SF LoftAmazon Web Services
 
WHAT ARE METADATA STANDARDS? EXPLAIN DUBLIN CORE IN DETAIL.
WHAT ARE METADATA STANDARDS? EXPLAIN DUBLIN CORE IN DETAIL.WHAT ARE METADATA STANDARDS? EXPLAIN DUBLIN CORE IN DETAIL.
WHAT ARE METADATA STANDARDS? EXPLAIN DUBLIN CORE IN DETAIL.`Shweta Bhavsar
 
Deep Dive Into Elasticsearch
Deep Dive Into ElasticsearchDeep Dive Into Elasticsearch
Deep Dive Into ElasticsearchKnoldus Inc.
 
stackconf 2022: Introduction to Vector Search with Weaviate
stackconf 2022: Introduction to Vector Search with Weaviatestackconf 2022: Introduction to Vector Search with Weaviate
stackconf 2022: Introduction to Vector Search with WeaviateNETWAYS
 
An Intro to Elasticsearch and Kibana
An Intro to Elasticsearch and KibanaAn Intro to Elasticsearch and Kibana
An Intro to Elasticsearch and KibanaObjectRocket
 
Chapter 1 semantic web
Chapter 1 semantic webChapter 1 semantic web
Chapter 1 semantic webR A Akerkar
 
How Kafka Powers the World's Most Popular Vector Database System with Charles...
How Kafka Powers the World's Most Popular Vector Database System with Charles...How Kafka Powers the World's Most Popular Vector Database System with Charles...
How Kafka Powers the World's Most Popular Vector Database System with Charles...HostedbyConfluent
 
Designing Data-Intensive Applications_ The Big Ideas Behind Reliable, Scalabl...
Designing Data-Intensive Applications_ The Big Ideas Behind Reliable, Scalabl...Designing Data-Intensive Applications_ The Big Ideas Behind Reliable, Scalabl...
Designing Data-Intensive Applications_ The Big Ideas Behind Reliable, Scalabl...SindhuVasireddy1
 
How to Build a Recommendation Engine with Neo4j
How to Build a Recommendation Engine with Neo4jHow to Build a Recommendation Engine with Neo4j
How to Build a Recommendation Engine with Neo4jNeo4j
 
Unified MLOps: Feature Stores & Model Deployment
Unified MLOps: Feature Stores & Model DeploymentUnified MLOps: Feature Stores & Model Deployment
Unified MLOps: Feature Stores & Model DeploymentDatabricks
 
Elastic search overview
Elastic search overviewElastic search overview
Elastic search overviewABC Talks
 
Feature store: Solving anti-patterns in ML-systems
Feature store: Solving anti-patterns in ML-systemsFeature store: Solving anti-patterns in ML-systems
Feature store: Solving anti-patterns in ML-systemsAndrzej Michałowski
 

Was ist angesagt? (20)

Digital Libray
Digital LibrayDigital Libray
Digital Libray
 
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
 
Introduction to AWS Glue: Data Analytics Week at the SF Loft
Introduction to AWS Glue: Data Analytics Week at the SF LoftIntroduction to AWS Glue: Data Analytics Week at the SF Loft
Introduction to AWS Glue: Data Analytics Week at the SF Loft
 
Introduction to DSpace
Introduction to DSpaceIntroduction to DSpace
Introduction to DSpace
 
WHAT ARE METADATA STANDARDS? EXPLAIN DUBLIN CORE IN DETAIL.
WHAT ARE METADATA STANDARDS? EXPLAIN DUBLIN CORE IN DETAIL.WHAT ARE METADATA STANDARDS? EXPLAIN DUBLIN CORE IN DETAIL.
WHAT ARE METADATA STANDARDS? EXPLAIN DUBLIN CORE IN DETAIL.
 
Deep Dive Into Elasticsearch
Deep Dive Into ElasticsearchDeep Dive Into Elasticsearch
Deep Dive Into Elasticsearch
 
Dspace software
Dspace softwareDspace software
Dspace software
 
stackconf 2022: Introduction to Vector Search with Weaviate
stackconf 2022: Introduction to Vector Search with Weaviatestackconf 2022: Introduction to Vector Search with Weaviate
stackconf 2022: Introduction to Vector Search with Weaviate
 
Elasticsearch
ElasticsearchElasticsearch
Elasticsearch
 
An Intro to Elasticsearch and Kibana
An Intro to Elasticsearch and KibanaAn Intro to Elasticsearch and Kibana
An Intro to Elasticsearch and Kibana
 
Chapter 1 semantic web
Chapter 1 semantic webChapter 1 semantic web
Chapter 1 semantic web
 
How Kafka Powers the World's Most Popular Vector Database System with Charles...
How Kafka Powers the World's Most Popular Vector Database System with Charles...How Kafka Powers the World's Most Popular Vector Database System with Charles...
How Kafka Powers the World's Most Popular Vector Database System with Charles...
 
Dspace
DspaceDspace
Dspace
 
Designing Data-Intensive Applications_ The Big Ideas Behind Reliable, Scalabl...
Designing Data-Intensive Applications_ The Big Ideas Behind Reliable, Scalabl...Designing Data-Intensive Applications_ The Big Ideas Behind Reliable, Scalabl...
Designing Data-Intensive Applications_ The Big Ideas Behind Reliable, Scalabl...
 
How to Build a Recommendation Engine with Neo4j
How to Build a Recommendation Engine with Neo4jHow to Build a Recommendation Engine with Neo4j
How to Build a Recommendation Engine with Neo4j
 
Unified MLOps: Feature Stores & Model Deployment
Unified MLOps: Feature Stores & Model DeploymentUnified MLOps: Feature Stores & Model Deployment
Unified MLOps: Feature Stores & Model Deployment
 
Elastic search overview
Elastic search overviewElastic search overview
Elastic search overview
 
Newgenlib
NewgenlibNewgenlib
Newgenlib
 
Feature store: Solving anti-patterns in ML-systems
Feature store: Solving anti-patterns in ML-systemsFeature store: Solving anti-patterns in ML-systems
Feature store: Solving anti-patterns in ML-systems
 
Extreme Analytics @ eBay
Extreme Analytics @ eBayExtreme Analytics @ eBay
Extreme Analytics @ eBay
 

Andere mochten auch

Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopHadoop User Group
 
Using the whole web as your dataset
Using the whole web as your datasetUsing the whole web as your dataset
Using the whole web as your datasetTuri, Inc.
 
Is Crawling Legal? Web Crawling legal Policies
Is Crawling Legal? Web Crawling legal PoliciesIs Crawling Legal? Web Crawling legal Policies
Is Crawling Legal? Web Crawling legal PoliciesPromptCloud
 
Insight Data Engineering project
Insight Data Engineering projectInsight Data Engineering project
Insight Data Engineering projectHoa Nguyen
 
The Switchabalizer - our journey from spell checker to homophone corrrecter
The Switchabalizer - our journey from spell checker to homophone corrrecterThe Switchabalizer - our journey from spell checker to homophone corrrecter
The Switchabalizer - our journey from spell checker to homophone corrrecterCommonCrawl
 
Gephi Consortium Presentation
Gephi Consortium PresentationGephi Consortium Presentation
Gephi Consortium PresentationGephi Consortium
 
Enterprise Data World 2016 and CDO Vision Mural Summary
Enterprise Data World 2016 and CDO Vision Mural SummaryEnterprise Data World 2016 and CDO Vision Mural Summary
Enterprise Data World 2016 and CDO Vision Mural SummaryDATAVERSITY
 
BDT204 Awesome Applications of Open Data - AWS re: Invent 2012
BDT204 Awesome Applications of Open Data - AWS re: Invent 2012BDT204 Awesome Applications of Open Data - AWS re: Invent 2012
BDT204 Awesome Applications of Open Data - AWS re: Invent 2012Amazon Web Services
 

Andere mochten auch (9)

Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with Hadoop
 
Using the whole web as your dataset
Using the whole web as your datasetUsing the whole web as your dataset
Using the whole web as your dataset
 
Is Crawling Legal? Web Crawling legal Policies
Is Crawling Legal? Web Crawling legal PoliciesIs Crawling Legal? Web Crawling legal Policies
Is Crawling Legal? Web Crawling legal Policies
 
Insight Data Engineering project
Insight Data Engineering projectInsight Data Engineering project
Insight Data Engineering project
 
The Switchabalizer - our journey from spell checker to homophone corrrecter
The Switchabalizer - our journey from spell checker to homophone corrrecterThe Switchabalizer - our journey from spell checker to homophone corrrecter
The Switchabalizer - our journey from spell checker to homophone corrrecter
 
Gephi Consortium Presentation
Gephi Consortium PresentationGephi Consortium Presentation
Gephi Consortium Presentation
 
Enterprise Data World 2016 and CDO Vision Mural Summary
Enterprise Data World 2016 and CDO Vision Mural SummaryEnterprise Data World 2016 and CDO Vision Mural Summary
Enterprise Data World 2016 and CDO Vision Mural Summary
 
Gephi Quick Start
Gephi Quick StartGephi Quick Start
Gephi Quick Start
 
BDT204 Awesome Applications of Open Data - AWS re: Invent 2012
BDT204 Awesome Applications of Open Data - AWS re: Invent 2012BDT204 Awesome Applications of Open Data - AWS re: Invent 2012
BDT204 Awesome Applications of Open Data - AWS re: Invent 2012
 

Ähnlich wie Common Crawl: An Open Repository of Web Data

OpenGLAM in museums: Linked Open Data and Wikipedia
OpenGLAM in museums: Linked Open Data and WikipediaOpenGLAM in museums: Linked Open Data and Wikipedia
OpenGLAM in museums: Linked Open Data and WikipediaGeorgina Goodlander
 
Linked Open Data in Libraries, Archives & Museums
Linked Open Data in Libraries, Archives & MuseumsLinked Open Data in Libraries, Archives & Museums
Linked Open Data in Libraries, Archives & MuseumsJon Voss
 
Intro to Linked Open Data in Libraries Archives & Museums.
Intro to Linked Open Data in Libraries Archives & Museums.Intro to Linked Open Data in Libraries Archives & Museums.
Intro to Linked Open Data in Libraries Archives & Museums.Jon Voss
 
Global lodlam_communities and open cultural data
Global lodlam_communities and open cultural dataGlobal lodlam_communities and open cultural data
Global lodlam_communities and open cultural dataMinerva Lin
 
Intro to Linked Open Data in Libraries, Archives & Museums
Intro to Linked Open Data in Libraries, Archives & MuseumsIntro to Linked Open Data in Libraries, Archives & Museums
Intro to Linked Open Data in Libraries, Archives & MuseumsJon Voss
 
Linked Data Now & Next
Linked Data Now & NextLinked Data Now & Next
Linked Data Now & NextRichard Wallis
 
Linked Open Data in Libraries Archives & Museums
Linked Open Data in Libraries Archives & MuseumsLinked Open Data in Libraries Archives & Museums
Linked Open Data in Libraries Archives & MuseumsJon Voss
 
What is New in W3C land?
What is New in W3C land?What is New in W3C land?
What is New in W3C land?Ivan Herman
 
The Cultural Linked Data Backbone
The Cultural Linked Data BackboneThe Cultural Linked Data Backbone
The Cultural Linked Data BackboneRichard Wallis
 
IASSIT Kansa Presentation
IASSIT Kansa PresentationIASSIT Kansa Presentation
IASSIT Kansa Presentationekansa
 
Stuart Kenny; Kathryn Cassidy - Experience with Ingestion of Large Collection...
Stuart Kenny; Kathryn Cassidy - Experience with Ingestion of Large Collection...Stuart Kenny; Kathryn Cassidy - Experience with Ingestion of Large Collection...
Stuart Kenny; Kathryn Cassidy - Experience with Ingestion of Large Collection...dri_ireland
 
OCLC Linked Data Progress
OCLC Linked Data ProgressOCLC Linked Data Progress
OCLC Linked Data ProgressRichard Wallis
 
BHL hardware architecture - storage and clusters
BHL hardware architecture - storage and clustersBHL hardware architecture - storage and clusters
BHL hardware architecture - storage and clustersPhil Cryer
 
Linked Data (1st Linked Data Meetup Malmö)
Linked Data (1st Linked Data Meetup Malmö)Linked Data (1st Linked Data Meetup Malmö)
Linked Data (1st Linked Data Meetup Malmö)Anja Jentzsch
 
Scaling up Linked Data
Scaling up Linked DataScaling up Linked Data
Scaling up Linked DataMarin Dimitrov
 
Big Data & Hadoop Introduction
Big Data & Hadoop IntroductionBig Data & Hadoop Introduction
Big Data & Hadoop IntroductionJayant Mukherjee
 
Open Data Masterclass - Europeana and LOD
Open Data Masterclass - Europeana and LODOpen Data Masterclass - Europeana and LOD
Open Data Masterclass - Europeana and LODAntoine Isaac
 
From Open Access to Open Standards, (Linked) Data and Collaborations
From Open Access to Open Standards, (Linked) Data and CollaborationsFrom Open Access to Open Standards, (Linked) Data and Collaborations
From Open Access to Open Standards, (Linked) Data and CollaborationsSimeon Warner
 

Ähnlich wie Common Crawl: An Open Repository of Web Data (20)

OpenGLAM in museums: Linked Open Data and Wikipedia
OpenGLAM in museums: Linked Open Data and WikipediaOpenGLAM in museums: Linked Open Data and Wikipedia
OpenGLAM in museums: Linked Open Data and Wikipedia
 
Linked Open Data in Libraries, Archives & Museums
Linked Open Data in Libraries, Archives & MuseumsLinked Open Data in Libraries, Archives & Museums
Linked Open Data in Libraries, Archives & Museums
 
Intro to Linked Open Data in Libraries Archives & Museums.
Intro to Linked Open Data in Libraries Archives & Museums.Intro to Linked Open Data in Libraries Archives & Museums.
Intro to Linked Open Data in Libraries Archives & Museums.
 
Linked Data and OCLC
Linked Data and OCLCLinked Data and OCLC
Linked Data and OCLC
 
Global lodlam_communities and open cultural data
Global lodlam_communities and open cultural dataGlobal lodlam_communities and open cultural data
Global lodlam_communities and open cultural data
 
Intro to Linked Open Data in Libraries, Archives & Museums
Intro to Linked Open Data in Libraries, Archives & MuseumsIntro to Linked Open Data in Libraries, Archives & Museums
Intro to Linked Open Data in Libraries, Archives & Museums
 
Linked Data Now & Next
Linked Data Now & NextLinked Data Now & Next
Linked Data Now & Next
 
Linked Open Data in Libraries Archives & Museums
Linked Open Data in Libraries Archives & MuseumsLinked Open Data in Libraries Archives & Museums
Linked Open Data in Libraries Archives & Museums
 
What is New in W3C land?
What is New in W3C land?What is New in W3C land?
What is New in W3C land?
 
Linked Data
Linked DataLinked Data
Linked Data
 
The Cultural Linked Data Backbone
The Cultural Linked Data BackboneThe Cultural Linked Data Backbone
The Cultural Linked Data Backbone
 
IASSIT Kansa Presentation
IASSIT Kansa PresentationIASSIT Kansa Presentation
IASSIT Kansa Presentation
 
Stuart Kenny; Kathryn Cassidy - Experience with Ingestion of Large Collection...
Stuart Kenny; Kathryn Cassidy - Experience with Ingestion of Large Collection...Stuart Kenny; Kathryn Cassidy - Experience with Ingestion of Large Collection...
Stuart Kenny; Kathryn Cassidy - Experience with Ingestion of Large Collection...
 
OCLC Linked Data Progress
OCLC Linked Data ProgressOCLC Linked Data Progress
OCLC Linked Data Progress
 
BHL hardware architecture - storage and clusters
BHL hardware architecture - storage and clustersBHL hardware architecture - storage and clusters
BHL hardware architecture - storage and clusters
 
Linked Data (1st Linked Data Meetup Malmö)
Linked Data (1st Linked Data Meetup Malmö)Linked Data (1st Linked Data Meetup Malmö)
Linked Data (1st Linked Data Meetup Malmö)
 
Scaling up Linked Data
Scaling up Linked DataScaling up Linked Data
Scaling up Linked Data
 
Big Data & Hadoop Introduction
Big Data & Hadoop IntroductionBig Data & Hadoop Introduction
Big Data & Hadoop Introduction
 
Open Data Masterclass - Europeana and LOD
Open Data Masterclass - Europeana and LODOpen Data Masterclass - Europeana and LOD
Open Data Masterclass - Europeana and LOD
 
From Open Access to Open Standards, (Linked) Data and Collaborations
From Open Access to Open Standards, (Linked) Data and CollaborationsFrom Open Access to Open Standards, (Linked) Data and Collaborations
From Open Access to Open Standards, (Linked) Data and Collaborations
 

Mehr von huguk

Data Wrangling on Hadoop - Olivier De Garrigues, Trifacta
Data Wrangling on Hadoop - Olivier De Garrigues, TrifactaData Wrangling on Hadoop - Olivier De Garrigues, Trifacta
Data Wrangling on Hadoop - Olivier De Garrigues, Trifactahuguk
 
ether.camp - Hackathon & ether.camp intro
ether.camp - Hackathon & ether.camp introether.camp - Hackathon & ether.camp intro
ether.camp - Hackathon & ether.camp introhuguk
 
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and HadoopGoogle Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoophuguk
 
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...huguk
 
Extracting maximum value from data while protecting consumer privacy. Jason ...
Extracting maximum value from data while protecting consumer privacy.  Jason ...Extracting maximum value from data while protecting consumer privacy.  Jason ...
Extracting maximum value from data while protecting consumer privacy. Jason ...huguk
 
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watson
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM WatsonIntelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watson
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watsonhuguk
 
Streaming Dataflow with Apache Flink
Streaming Dataflow with Apache Flink Streaming Dataflow with Apache Flink
Streaming Dataflow with Apache Flink huguk
 
Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLLambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLhuguk
 
Today’s reality Hadoop with Spark- How to select the best Data Science approa...
Today’s reality Hadoop with Spark- How to select the best Data Science approa...Today’s reality Hadoop with Spark- How to select the best Data Science approa...
Today’s reality Hadoop with Spark- How to select the best Data Science approa...huguk
 
Jonathon Southam: Venture Capital, Funding & Pitching
Jonathon Southam: Venture Capital, Funding & PitchingJonathon Southam: Venture Capital, Funding & Pitching
Jonathon Southam: Venture Capital, Funding & Pitchinghuguk
 
Signal Media: Real-Time Media & News Monitoring
Signal Media: Real-Time Media & News MonitoringSignal Media: Real-Time Media & News Monitoring
Signal Media: Real-Time Media & News Monitoringhuguk
 
Dean Bryen: Scaling The Platform For Your Startup
Dean Bryen: Scaling The Platform For Your StartupDean Bryen: Scaling The Platform For Your Startup
Dean Bryen: Scaling The Platform For Your Startuphuguk
 
Peter Karney: Intro to the Digital catapult
Peter Karney: Intro to the Digital catapultPeter Karney: Intro to the Digital catapult
Peter Karney: Intro to the Digital catapulthuguk
 
Cytora: Real-Time Political Risk Analysis
Cytora:  Real-Time Political Risk AnalysisCytora:  Real-Time Political Risk Analysis
Cytora: Real-Time Political Risk Analysishuguk
 
Cubitic: Predictive Analytics
Cubitic: Predictive AnalyticsCubitic: Predictive Analytics
Cubitic: Predictive Analyticshuguk
 
Bird.i: Earth Observation Data Made Social
Bird.i: Earth Observation Data Made SocialBird.i: Earth Observation Data Made Social
Bird.i: Earth Observation Data Made Socialhuguk
 
Aiseedo: Real Time Machine Intelligence
Aiseedo: Real Time Machine IntelligenceAiseedo: Real Time Machine Intelligence
Aiseedo: Real Time Machine Intelligencehuguk
 
Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive huguk
 
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...huguk
 
Hadoop - Looking to the Future By Arun Murthy
Hadoop - Looking to the Future By Arun MurthyHadoop - Looking to the Future By Arun Murthy
Hadoop - Looking to the Future By Arun Murthyhuguk
 

Mehr von huguk (20)

Data Wrangling on Hadoop - Olivier De Garrigues, Trifacta
Data Wrangling on Hadoop - Olivier De Garrigues, TrifactaData Wrangling on Hadoop - Olivier De Garrigues, Trifacta
Data Wrangling on Hadoop - Olivier De Garrigues, Trifacta
 
ether.camp - Hackathon & ether.camp intro
ether.camp - Hackathon & ether.camp introether.camp - Hackathon & ether.camp intro
ether.camp - Hackathon & ether.camp intro
 
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and HadoopGoogle Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
 
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
 
Extracting maximum value from data while protecting consumer privacy. Jason ...
Extracting maximum value from data while protecting consumer privacy.  Jason ...Extracting maximum value from data while protecting consumer privacy.  Jason ...
Extracting maximum value from data while protecting consumer privacy. Jason ...
 
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watson
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM WatsonIntelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watson
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watson
 
Streaming Dataflow with Apache Flink
Streaming Dataflow with Apache Flink Streaming Dataflow with Apache Flink
Streaming Dataflow with Apache Flink
 
Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLLambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale ML
 
Today’s reality Hadoop with Spark- How to select the best Data Science approa...
Today’s reality Hadoop with Spark- How to select the best Data Science approa...Today’s reality Hadoop with Spark- How to select the best Data Science approa...
Today’s reality Hadoop with Spark- How to select the best Data Science approa...
 
Jonathon Southam: Venture Capital, Funding & Pitching
Jonathon Southam: Venture Capital, Funding & PitchingJonathon Southam: Venture Capital, Funding & Pitching
Jonathon Southam: Venture Capital, Funding & Pitching
 
Signal Media: Real-Time Media & News Monitoring
Signal Media: Real-Time Media & News MonitoringSignal Media: Real-Time Media & News Monitoring
Signal Media: Real-Time Media & News Monitoring
 
Dean Bryen: Scaling The Platform For Your Startup
Dean Bryen: Scaling The Platform For Your StartupDean Bryen: Scaling The Platform For Your Startup
Dean Bryen: Scaling The Platform For Your Startup
 
Peter Karney: Intro to the Digital catapult
Peter Karney: Intro to the Digital catapultPeter Karney: Intro to the Digital catapult
Peter Karney: Intro to the Digital catapult
 
Cytora: Real-Time Political Risk Analysis
Cytora:  Real-Time Political Risk AnalysisCytora:  Real-Time Political Risk Analysis
Cytora: Real-Time Political Risk Analysis
 
Cubitic: Predictive Analytics
Cubitic: Predictive AnalyticsCubitic: Predictive Analytics
Cubitic: Predictive Analytics
 
Bird.i: Earth Observation Data Made Social
Bird.i: Earth Observation Data Made SocialBird.i: Earth Observation Data Made Social
Bird.i: Earth Observation Data Made Social
 
Aiseedo: Real Time Machine Intelligence
Aiseedo: Real Time Machine IntelligenceAiseedo: Real Time Machine Intelligence
Aiseedo: Real Time Machine Intelligence
 
Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive
 
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...
 
Hadoop - Looking to the Future By Arun Murthy
Hadoop - Looking to the Future By Arun MurthyHadoop - Looking to the Future By Arun Murthy
Hadoop - Looking to the Future By Arun Murthy
 

Kürzlich hochgeladen

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 

Common Crawl: An Open Repository of Web Data

  • 1. London HUG. Common Crawl: An Open Repository of Web Data. What Does The Data World Mean to Society? Lisa Green. 1 October 2012 / 10 October 2012
  • 2. Photo license: Public Domain Origin: http://en.wikipedia.org/wiki/File:Floppy_disk_2009_G1.jpg
  • 3. Photo license: CC-BY-SA Origin: http://en.wikipedia.org/wiki/File:Wikimedia_Foundation_Servers-8055_08.jpg
  • 4. Image license: CC-BY Origin: http://en.wikipedia.org/wiki/File:Internet_map_1024.jpg
  • 5. Still Nascent • Even cheaper storage • Even cheaper compute • Education • Open Data Image license: CC-BY Credit: NASA, ESA, and the Hubble Heritage Team (STScI/AURA)
  • 6. Gratis Proprietary Libre Commercial
  • 9.
  • 10. Common Crawl Data • ~8 Billion web pages • ~120 TB • 2008-2012 • ARC files, JSON metadata, text files • Available to anyone
  • 11. ARC Files - Raw Content. Metadata: • Status information • HTTP response code • File names & offsets of ARC files • HTML title • HTML meta tags • RSS/Atom information • All anchors/hyperlinks. Text Files - Text Only. http://commoncrawl.org/get-started
  • 12.
  • 13. Change between 2010 and 2012 • URLs with embedded data +6% • Microdata +14% • RDFa +26% http://webdatacommons.org
  • 14. • 22% of Web pages contain Facebook URLs • 8% of Web pages implement Open Graph tags
  • 15. http://wikientities.appspot.com A corpus of anchortext-WikipediaConcept-Count from the CommonCrawl dataset, to benefit research on WSD, NLP and IR. Given a sentence, it can help identify entities (person, location, organization) in the sentence and map them onto Wikipedia concepts. Explicit Topic Modeling: Given a concept (represented as a Wikipedia page), it can tell what are the most common terms people use to describe the concept.
  • 16. Mapping French websites related to Open Data
  • 17. Other Use Examples • Apache Giraph Testing • Maplight • Tineye • Factual • Sentiment Analysis Projects
  • 18. In Development • N-gram and Link Graph Extracts • Pig Reader • More Frequent Full Crawls • Focused Subset Crawls at High Frequency • Open Educational Resources
  • 19. Thank You London HUG. What Does The Data World Mean to Society? Lisa Green, lisa@commoncrawl.org, www.commoncrawl.org, @commoncrawl, @boudicca. 1 October 2012
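
As a rough illustration of the anchortext-WikipediaConcept-Count idea on slide 15, the sketch below tallies how often a given anchor text points at a given Wikipedia page. It is not the wikientities implementation: the function names, the lowercasing of anchor text, and the sample links are all assumptions made for the example, and the input is assumed to be (anchor text, href) pairs already extracted from the metadata's anchors/hyperlinks.

```python
from collections import Counter
from urllib.parse import urlparse, unquote


def wikipedia_concept(href):
    """Return the article title for a wikipedia.org /wiki/ URL, else None.

    Hypothetical helper for this sketch, not part of any Common Crawl tool.
    """
    parsed = urlparse(href)
    host = parsed.netloc.lower()
    if host != "wikipedia.org" and not host.endswith(".wikipedia.org"):
        return None
    prefix = "/wiki/"
    if not parsed.path.startswith(prefix):
        return None
    title = unquote(parsed.path[len(prefix):])
    return title or None


def count_anchor_concepts(links):
    """Count (anchor text, Wikipedia concept) pairs over (text, href) tuples."""
    counts = Counter()
    for anchor_text, href in links:
        concept = wikipedia_concept(href)
        text = anchor_text.strip().lower()  # assumed normalization
        if concept and text:
            counts[(text, concept)] += 1
    return counts


# Made-up sample data standing in for hyperlinks taken from crawl metadata.
SAMPLE_LINKS = [
    ("Big Apple", "http://en.wikipedia.org/wiki/New_York_City"),
    ("NYC", "http://en.wikipedia.org/wiki/New_York_City"),
    ("big apple", "http://en.wikipedia.org/wiki/New_York_City"),
    ("home", "http://example.com/index.html"),
]

counts = count_anchor_concepts(SAMPLE_LINKS)
for (text, concept), n in counts.most_common():
    print(text, concept, n)
```

Grouping the resulting counts by concept gives the terms people most often use for a page; grouping by anchor text gives candidate concepts for a phrase, which is the direction entity linking would use.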