How many times have you wanted to find some information on a website, only to be disappointed by the filtering and discovery options available? Learn how to get data from a site and zero in on the data you really care about.
Mining the web, no experience required
1. Mining the web, no experience required.
Ruairí Fahy, 25th October 2015
2. Scrapinghub - Who are we?
● Provider of cloud-based web-crawling solutions
● Builder of spiders and crawling solutions
● Creator of open source projects like Scrapy, Portia and Splash
● Find out more at scrapinghub.com
Mining the web, no experience required. - Ruairi Fahy, 25 October 2015 - Scrapinghub ⓒ 2015
3. The Project
Obtain and compare house types and prices across the country
● Build a spider for daft.ie using Portia
● Crawl daft.ie to obtain housing data
● Process the data using Pandas
● Visualise the data using CartoDB
4. The Basics
Web Scraping - The process of extracting data from the web
Spider - A piece of software designed to extract links and items from webpages
Crawl - Visit all pages of interest on a site using your spider
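The definitions above can be made concrete with a toy example. This is not how Scrapy or Portia are implemented; it is a minimal sketch, using only the Python standard library, of the one thing every spider does: pull the links out of a page so the crawl knows where to go next. The sample HTML and paths are invented for illustration.

```python
from html.parser import HTMLParser

class LinkSpider(HTMLParser):
    """Toy spider: collect every href it sees in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Anchor tags carry the links a crawler would queue up to visit next.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hypothetical listing page; a real crawl would fetch this over HTTP.
page = '<a href="/sale/dublin">Dublin</a> <a href="/sale/cork">Cork</a>'
spider = LinkSpider()
spider.feed(page)
print(spider.links)  # → ['/sale/dublin', '/sale/cork']
```

A real crawl simply repeats this loop: fetch a page, extract its items and links, then fetch the links it has not yet visited.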
5. Build a spider using Portia
● Portia is a tool for building spiders without having to write any code.
● It has a simple UI for loading pages that you want to extract data from.
● Create samples by highlighting the data that you want on a page.
● Use these samples to train the extraction algorithm.
https://github.com/scrapinghub/portia
6. Run our spider
● Scrapy Cloud - Hosted crawling at scrapinghub.com
● Scrapyd - Run your own server for crawling
● Portiacrawl - Run the spider locally using scrapy
7. Process our data with Pandas
● The spider has extracted the house type, price, BER, number of bedrooms and address for all houses for sale on daft.ie.
● Clean and normalise data
● Add a geopoint column so the houses can be placed on a map.
● Process fields to prepare them for plotting
Notebook: https://gist.github.com/ruairif/80102746320d0229a0ce
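The cleaning steps above can be sketched in a few lines of Pandas. The column names and sample values here are hypothetical, not the actual daft.ie output (see the notebook for the real processing): the idea is to strip currency formatting, coerce unparseable prices like "POA" to missing values, and build a geopoint column from coordinates.

```python
import pandas as pd

# Hypothetical sample of scraped rows; the real field names may differ.
df = pd.DataFrame({
    "price": ["€250,000", "€180,000", "POA"],
    "lat": [53.35, 51.90, 53.27],
    "lng": [-6.26, -8.47, -9.05],
})

# Clean and normalise: drop the currency symbol and thousands separators,
# then coerce anything non-numeric (e.g. "POA") to NaN.
df["price"] = pd.to_numeric(
    df["price"].str.replace("€", "", regex=False).str.replace(",", "", regex=False),
    errors="coerce",
)

# Add a geopoint column so each house can be placed on a map.
df["geopoint"] = df.apply(lambda r: (r["lat"], r["lng"]), axis=1)
```

With prices numeric and a geopoint per row, the fields are ready for plotting.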
8. Visualise the data using CartoDB
● Create a dataset from our CSV file
● Plot our data on a map
● Compare prices across the country
● Compare property type
● Compare BER
● http://cdb.io/1POBIU8
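The CSV handed to CartoDB just needs recognisable coordinate columns for it to georeference the rows on import. A minimal sketch of writing such a file, with invented rows standing in for the cleaned daft.ie data:

```python
import csv

# Hypothetical cleaned rows; "latitude"/"longitude" columns let CartoDB
# place each record on the map automatically on import.
rows = [
    {"address": "Dublin 4", "price": 450000, "latitude": 53.33, "longitude": -6.23},
    {"address": "Cork City", "price": 260000, "latitude": 51.90, "longitude": -8.47},
]

with open("houses.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["address", "price", "latitude", "longitude"])
    writer.writeheader()
    writer.writerows(rows)
```

From there the dataset can be uploaded through the CartoDB UI and styled by price, property type or BER.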