DATA LIBERATION
Opening Up Data by Hook
or by Crook - Data
Scraping, Linkage and the
Value of a Good Identifier
                               Tony Hirst
                      Department of Communication
                             and Systems
                          The Open University
data NOT
information
              by Vick
[Disruptive
Innovation?]
“First” generation:
 data catalogues
Breathing life
 into data…
=importData("CSV_URL")
the spreadsheet becomes

A DATABASE
“Second” generation:
 data management
      systems
There’s lots more
data that’s locked
up in web pages…
Scraping…
“grabbing web content
in a machine readable
   format and then
 processing it for your
    own purposes”
Original HTML web page | Extract information -> data | Accessible web page
Recreating the
database that was used
     to populate a
   (templated) page
…quick’n’dirty
Scrapers
    Scraper      SQLite database

Views
   SQLite database      Scraper
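
As a minimal sketch of the scraper-to-SQLite pattern on these slides: the HTML fragment, the table name and the column names below are all made up for illustration, and only the Python standard library is used. The idea is to recover the rows that a templated page was generated from and put them back into a database.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical page fragment standing in for a templated web page.
HTML = """
<table>
  <tr><th>Council</th><th>Spend</th></tr>
  <tr><td>Anytown</td><td>1200</td></tr>
  <tr><td>Otherville</td><td>850</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_to_sqlite(html, conn):
    """Parse the table, skip the header row, and store the records."""
    scraper = TableScraper()
    scraper.feed(html)
    _header, *records = scraper.rows
    conn.execute("CREATE TABLE spend (council TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO spend VALUES (?, ?)",
        [(name, int(amount)) for name, amount in records],
    )
    return len(records)


conn = sqlite3.connect(":memory:")
n = scrape_to_sqlite(HTML, conn)
rows = conn.execute(
    "SELECT council, amount FROM spend ORDER BY amount DESC"
).fetchall()
```

Once the rows are in SQLite, any number of "views" can be built as queries over the same table.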
Sometimes the
 data is spread
across different
     files…
Row based
aggregation
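
A rough sketch of row-based aggregation, assuming the files share the same columns. The file names and contents are invented for illustration; each row is tagged with the file it came from so the source is not lost when the rows are stacked.

```python
import csv
import io

# Hypothetical file contents; in practice these would be separate CSV files.
FILES = {
    "council_a.csv": "supplier,amount\nAcme Ltd,100\nBolt plc,250\n",
    "council_b.csv": "supplier,amount\nAcme Ltd,75\n",
}


def aggregate_rows(files):
    """Stack rows from several same-shaped CSV files, tagging each with its source."""
    combined = []
    for name, text in files.items():
        for row in csv.DictReader(io.StringIO(text)):
            row["source"] = name
            combined.append(row)
    return combined


rows = aggregate_rows(FILES)
```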
Sometimes the
 data is spread
across different
  websites…
…   Normalisation…
Data
Enrichment
Column
Additions/Annotations
Sometimes the
  data is split
across different
     files…
Column
based merge
-> Data
cleansing
Clustering…
http://mashe.hawksey.info/2012/11/mining-and-openrefineing-jiscmail-a-look-at-oer-discuss/

/via Martin Hawksey/@mhawksey
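
One way to picture clustering is a key-collision approach, loosely modelled on the "fingerprint" keying found in tools such as OpenRefine. This is a simplified sketch rather than that tool's actual implementation, and the sample names are invented.

```python
import re
from collections import defaultdict


def fingerprint(value):
    """Lowercase, strip punctuation, then sort and de-duplicate the tokens."""
    cleaned = re.sub(r"[^\w\s]", "", value.strip().lower())
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group values whose fingerprints collide; keep groups with 2+ members."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(group) > 1]


names = ["Open University", "open university.", "University, Open", "JISC"]
clusters = cluster(names)
```

Variant spellings that collapse to the same key end up in one cluster, which can then be reviewed and merged to a single preferred form.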
“Finessing” a
  common
  identifier
Common identifiers
 (common KEYS) make
it MUCH easier to JOIN
   datasets by column
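
A small sketch of why a shared key helps: with a common identifier column, joining two datasets is little more than a dictionary lookup. The datasets and the `org_id` column are hypothetical.

```python
# Hypothetical datasets that share an "org_id" column.
grants = [
    {"org_id": "ORG1", "grant": 5000},
    {"org_id": "ORG2", "grant": 1200},
    {"org_id": "ORG1", "grant": 300},
]
orgs = [
    {"org_id": "ORG1", "name": "Example University"},
    {"org_id": "ORG3", "name": "Other Institute"},
]


def join_on(left, right, key):
    """Inner join: keep left rows whose key also appears in the right dataset."""
    index = {row[key]: row for row in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]


joined = join_on(grants, orgs, "org_id")
```

Rows without a matching key (ORG2 here) drop out, which is itself a useful check on how well the identifiers line up.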
Book Title
-> ISBN
I am “psychemedia”
on Twitter, delicious,
slideshare, flickr, etc
         etc
Reconciliation…
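
Reconciliation can be sketched as fuzzy matching of free-text names against a reference list that already carries identifiers. The reference table and IDs below are invented, and `difflib` from the Python standard library stands in for whatever matching service would be used in practice.

```python
from difflib import get_close_matches

# Hypothetical reference list: canonical name -> identifier.
REFERENCE = {
    "open university": "ID-001",
    "university of southampton": "ID-002",
}


def reconcile(name, reference=REFERENCE, cutoff=0.6):
    """Return the identifier of the closest reference name, or None if no match."""
    matches = get_close_matches(name.strip().lower(), list(reference), n=1, cutoff=cutoff)
    return reference[matches[0]] if matches else None


results = {name: reconcile(name) for name in ["The Open Univeristy", "Acme Widgets"]}
```

The cutoff is a judgement call: too low and unrelated names are matched, too high and simple typos are missed.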
Linked
Data™
So who speaks SPARQL?




     Diners - Journal Canteen
     by avlxyz
You DON’T have to….
Just think about how one piece of
 data might be related to another
   through a common means of
        addressing them…
http://ouseful.info

 @psychemedia

Onlineinfo2012 - Scraping

Editor's notes

  1. Tony Hirst
     Twitter: @psychemedia
     Blog: http://blog.ouseful.info
     Presentation prepared for: Online Info, 12/11/2012

     DATA LIBERATION: OPENING UP DATA BY HOOK OR BY CROOK - DATA SCRAPING, LINKAGE AND THE VALUE OF A GOOD IDENTIFIER

     The 1/9/90 rule is often used to characterise the way in which a small number of creators generate content that a larger number (but still small percentage in the greater scheme of things) comment on or amplify, whilst the majority just passively consume. In this presentation, I will explore the extent to which a similar view applies to the world of "data liberation". After reviewing the idea of data scraping, and some of the techniques surrounding it, I will describe how online tools such as Scraperwiki provide a platform for concentrating data scraping activity and expertise, as well as supporting the publication of data /as data/ in a variety of formats, in addition to 'end user' views in the form of graphical charts and interactive visualisations.

     One of the major motivations for data scraping is the aggregation of data from a variety of data sources into a larger, integrated whole. For example, the aggregation of research council funding data from separate research councils allows us to view a large proportion of the publicly funded research grants received by a single institution; or the collection of local council spending data across all UK councils allows us to see how councils spend money with each other across a range of transaction areas. But how do we actually create such aggregations when the data is sourced from different areas? In order to do this, we need to know when different datasets are actually talking about the same thing, which is where common identifiers come in. For it is surely the case that when we have common identifiers, we can have linkage, and as a result start to realise some of the benefits of Linked Data (as well as developing a wider appreciation of what those benefits might actually be...)

     (As an aside, I'll describe how we might go about deriving such identifiers when they are missing from a data set that might otherwise, or more conveniently, be expected to publish them.)

     Throughout the presentation, I will draw on practical examples of how aggregated "liberated" data has been used as the basis of wider interest, and even status quo disrupting, services, as well as reflecting on what other sources of data we might see the data liberators turning their attention to next...

     Key learning points:
     1 - What is "data scraping", how can I do it and is my website at risk of it?
     2 - Why the secret to understanding "Linked Data" is the very idea of it, not just (or not even) the technology.
     3 - How has data scraping been used to "open up" data in actual practice?
  2. The focus of this presentation is not the release of “information”, but the release of data in raw form so that it can be interpreted and presented in informative ways by other parties.
  3. The London Datastore is an early example of a council-centric open data website. Early signs suggest it is natural to locate data websites at addresses of the form data.COUNCILNAME.gov.uk or www.COUNCILNAME.gov.uk/data
  4. Google Spreadsheets provides another example of how CSV can be used to help data flow. The =importData formula allows a user to specify a source data URL, and pull the CSV data found at that location into the spreadsheet. Unlike Many Eyes Wikified, if the source data at the URL is updated, the update will (eventually) be pulled into the spreadsheet automatically.
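For readers who want to see the same idea outside a spreadsheet, here is a minimal Python sketch of what an importData-style step amounts to: fetch the text at a URL and parse it as CSV. The function names `parse_csv` and `import_data` and the sample data are made up for illustration, and the UTF-8 encoding is an assumption about the publisher; this is not how Google Spreadsheets itself is implemented.

```python
import csv
import io
import urllib.request


def parse_csv(text):
    """Turn CSV text into a list of rows, each row a list of cell strings."""
    return list(csv.reader(io.StringIO(text)))


def import_data(csv_url):
    """Fetch the CSV published at csv_url and return its rows."""
    with urllib.request.urlopen(csv_url) as response:
        # Assumption: the publisher serves UTF-8 (utf-8-sig also drops a BOM).
        text = response.read().decode("utf-8-sig")
    return parse_csv(text)


# Offline demonstration with made-up data; no network access is needed.
sample = "council,spend\nAnytown,1200\nOtherville,950\n"
rows = parse_csv(sample)
print(rows[0])   # header row
print(rows[1:])  # data rows
```

Calling `import_data` again later picks up whatever the publisher has changed in the meantime, which is the "data flow" property the formula gives you for free.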
  5. One of the really good reasons for getting data into a data processing environment such as a spreadsheet is that you can start to work with it. In the case of Google Spreadsheets, the spreadsheet environment can also be used as a database environment. That is, we can treat one or more data-containing sheets in a spreadsheet as a database, and generate new views over the data, as well as running queries over that data.
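The "sheet as database table" idea can be sketched with Python's built-in sqlite3 module standing in for the spreadsheet. The table name, column names and figures below are all made up for illustration; the point is only that once rows sit in a table, a query gives a new view over them.

```python
import sqlite3

# Made-up rows standing in for one sheet of a spreadsheet.
rows = [
    ("Anytown", "Acme Ltd", 1200.0),
    ("Anytown", "Widgets plc", 300.0),
    ("Otherville", "Acme Ltd", 950.0),
]

# An in-memory database plays the role of the spreadsheet.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spending (council TEXT, supplier TEXT, amount REAL)")
conn.executemany("INSERT INTO spending VALUES (?, ?, ?)", rows)

# A new "view" over the data: total spend per supplier, largest first.
totals = conn.execute(
    "SELECT supplier, SUM(amount) FROM spending "
    "GROUP BY supplier ORDER BY SUM(amount) DESC"
).fetchall()
print(totals)
```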
  6. Another way of using a Google Spreadsheet as a database is via the Google Spreadsheets API. The Google Visualisation API (?) provides a way of passing queries written using the Google ???viz query language from an arbitrary web page or web application, and receiving the resulting data in a standard JSON-based format, which also happens to play nicely with the Google Visualisation API??? The Guardian Datastore explorer is a crude demonstration from 2009(??) showing how data from the Guardian datastore, data that is stored across a range of Google spreadsheets, can be explored, queried and visualised via these APIs. Users can select a dataset from a drop-down menu, fed from a delicious account to which various datastore spreadsheets have been bookmarked using a particular set of tags, or by pasting in the URL of an arbitrary (public) Google spreadsheet. The first row/headings of the data can then be previewed (a simple spreadsheet is assumed, in which column headings appear in the first row of the spreadsheet).
  7. A series of list boxes are then populated with the column labels and their names, and provide a certain amount of help for the creation of a query over the spreadsheet data. A range of output formats can also be selected, from simple HTML data tables, to a range of charts. URLs are also generated for HTML and CSV representations of the data returned from the query.
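As a rough sketch of the kind of query URL such an explorer generates, the Python below builds a query against a public Google spreadsheet and unpacks a response. Several things here are assumptions rather than anything taken from the explorer itself: the `gviz/tq` URL pattern (the endpoint's address has changed over the years), the `SPREADSHEET_KEY` placeholder, the helper names, and the made-up response body, which imitates the function-call wrapper the JSON usually arrives in.

```python
import json
from urllib.parse import urlencode


def build_query_url(spreadsheet_key, query, out="json"):
    """Build a query URL for a public Google spreadsheet.

    Assumption: the endpoint follows the docs.google.com gviz/tq pattern.
    """
    params = urlencode({"tq": query, "tqx": "out:" + out})
    return f"https://docs.google.com/spreadsheets/d/{spreadsheet_key}/gviz/tq?{params}"


def unwrap_response(body):
    """Extract the JSON object from a response wrapped in a function call."""
    start = body.index("(") + 1
    end = body.rindex(")")
    return json.loads(body[start:end])


# The query refers to spreadsheet columns by letter.
url = build_query_url("SPREADSHEET_KEY", "select A, B where B > 1000")
print(url)

# Made-up response body, for illustration only.
sample_body = (
    'google.visualization.Query.setResponse({"status": "ok", "table": '
    '{"cols": [{"label": "Council"}, {"label": "Spend"}], '
    '"rows": [{"c": [{"v": "Anytown"}, {"v": 1200}]}]}});'
)
table = unwrap_response(sample_body)["table"]
labels = [col["label"] for col in table["cols"]]
first_row = [cell["v"] for cell in table["rows"][0]["c"]]
print(labels, first_row)
```

Swapping `out="json"` for `out="csv"` or `out="html"` is how a single query can be offered in several representations, assuming the endpoint supports those output types.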
  8. One of the nice things about the data table widget (a standard Google Visualisation API component in this case, though similar examples exist for YUI, the Yahoo User Interface Libraries, or frameworks such as jQuery), is that it supports things like row sorting by column (for free – no programming required!), allowing even further manipulation of the data, albeit at a simplistic level. (It’s probably worth pointing out here that it may be worth providing a preview of the column headings and first few rows (or a sample of random rows) of data when datasets are published, just so that users can see what sort of data is on offer without having to download the whole data set?)
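The preview suggestion is cheap to implement on the publisher's side. Here is a minimal Python sketch, with a made-up `preview` helper and made-up data, that returns the column headings plus either the first few rows or a random sample of rows.

```python
import csv
import io
import random


def preview(csv_text, n=3, random_rows=False, seed=None):
    """Return (column headings, up to n preview rows) for a CSV document."""
    reader = csv.reader(io.StringIO(csv_text))
    headings = next(reader)  # assumes headings are in the first row
    rows = list(reader)
    if random_rows:
        # A random sample gives a feel for the data beyond the top of the file.
        rows = random.Random(seed).sample(rows, min(n, len(rows)))
    else:
        rows = rows[:n]
    return headings, rows


# Made-up data for illustration.
sample_csv = "council,spend\nAnytown,1200\nOtherville,950\nThirdborough,400\n"
headings, first_rows = preview(sample_csv, n=2)
print(headings)
print(first_rows)
```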
  9. If you’re in the business of selling information as data, you are under threat where that information is published in an openly licensed way.
  10. Linked Data (the TM is something of a joke, and refers to the particular style of publishing data according to a set of principles first outlined by the inventor of the World Wide Web, Sir Tim Berners-Lee) is one of the data formats that the Government’s data task force favours for the publication of data.
  11. There is a problem though: at the moment, there are barriers to entry to the Linked Data world from both the query side (not many people speak SPARQL, or know how to construct a SPARQL query to an endpoint) and the results side (data is returned as RDF).
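To make both barriers concrete, here is a minimal Python sketch of the two halves: sending a SPARQL query to an endpoint as a URL parameter, and flattening a result set into plain rows. The endpoint address, the query, the helper names and the sample response are all hypothetical. The sketch also assumes the endpoint can return SELECT results in the standard SPARQL JSON results format rather than RDF, which many endpoints can but not all.

```python
import json
from urllib.parse import urlencode

ENDPOINT = "https://example.org/sparql"  # hypothetical endpoint

# Hypothetical query: things that have an rdfs:label, with that label.
QUERY = """
SELECT ?council ?name WHERE {
  ?council <http://www.w3.org/2000/01/rdf-schema#label> ?name .
} LIMIT 10
"""


def build_request_url(endpoint, query):
    """A SPARQL query can be sent as the 'query' parameter of a GET request."""
    return endpoint + "?" + urlencode({"query": query})


def bindings_to_rows(results_json):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    doc = json.loads(results_json)
    variables = doc["head"]["vars"]
    return [
        {var: binding[var]["value"] for var in variables if var in binding}
        for binding in doc["results"]["bindings"]
    ]


# Made-up response, for illustration only.
sample_results = json.dumps({
    "head": {"vars": ["council", "name"]},
    "results": {"bindings": [{
        "council": {"type": "uri", "value": "http://example.org/id/anytown"},
        "name": {"type": "literal", "value": "Anytown"},
    }]},
})
request_url = build_request_url(ENDPOINT, QUERY)
rows = bindings_to_rows(sample_results)
print(rows)
```

Once the results are flattened like this, they look like the rows of any other table, which is one way of lowering the results-side barrier.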
  12. So – do you speak SPARQL?