The document discusses using the Scrapy framework in Python for web scraping. It begins with an introduction to web scraping and why Python suits it, then gives an overview of Scrapy, including the problems it solves and how to get started. A worked example scrapes sushi images from Flickr using Scrapy spiders, items, pipelines, and settings: the spider builds an image URL from each Flickr API record and yields requests for the images, which a pipeline saves into an images folder.
2. Who am I?
• Kota Kato
• @orangain
• Software Engineer
• Interested in automation tools such as Jenkins, Chef, and Docker.
3. Definition: Web Scraping
• Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites.
Web scraping - Wikipedia, the free encyclopedia
https://en.wikipedia.org/wiki/Web_scraping
5. QB Meter
• Visualizes the crowdedness of QB HOUSE, a 10-minute barbershop chain.
• Retrieves crowdedness data from QB HOUSE's web site every 5 minutes.
http://qbmeter.capybala.com/
7. Pokedos
• Web app that finds the nearest bus stops and shows bus arrival information.
• Retrieves the locations of all bus stops in Kyoto City.
http://bus.capybala.com/
8. Why Web Scraping?
• For Web Developer:
• Develop mash-up applications.
• For Data Analyst:
• Retrieve data to analyze.
• For Everybody:
• Automate operations on web sites.
9. Why Use Python?
• Easy to use
• Powerful libraries, especially Scrapy
• Seamless transition between data processing and application development
10. Web Scraping in Python
• Combination of lightweight libraries (sketch below):
• Retrieving: Requests
• Scraping: lxml, Beautiful Soup
• Full stack framework:
• Scrapy (today's topic)
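As a rough illustration of the lightweight combination above, here is a minimal sketch that retrieves a page with Requests and scrapes it with lxml; the URL and XPath are placeholders, not from the talk:

import requests
import lxml.html

# Retrieving: fetch the page with Requests.
response = requests.get('http://blog.scrapinghub.com')  # placeholder URL
response.raise_for_status()

# Scraping: parse the HTML with lxml and extract links with XPath.
root = lxml.html.fromstring(response.content)
for href in root.xpath('//a/@href'):  # placeholder expression
    print(href)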
12. Scrapy
• Fast, simple and extensible Web scraping
framework in Python
• Currently compatible only with Python 2.7
• Python 3 support is in progress
• Maintained by Scrapinghub
• BSD License
http://scrapy.org/
13. Why Use Scrapy?
• Scrapy handles the annoying parts of crawling and scraping (a settings sketch follows this list):
• Extracting links
• Throttling and concurrency control
• Respecting robots.txt and <meta> tags
• Parsing XML sitemaps
• Filtering duplicated URLs
• Retrying on error
• Job control
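Most of these behaviors are controlled through settings. A minimal sketch using real Scrapy setting names; the values are illustrative, not from the talk:

# Respect robots.txt and throttle politely.
ROBOTSTXT_OBEY = True
CONCURRENT_REQUESTS_PER_DOMAIN = 4
DOWNLOAD_DELAY = 1

# Retry failed requests a couple of times.
RETRY_ENABLED = True
RETRY_TIMES = 2

# Persist crawl state so a job can be paused and resumed.
JOBDIR = 'crawls/example-job'

Duplicate URL filtering is enabled by default.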
14. Getting Started with Scrapy
$ pip install scrapy
$ cat > myspider.py <<EOF
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['http://blog.scrapinghub.com']

    def parse(self, response):
        # Follow links to monthly archive pages (URLs ending in /YYYY/MM/).
        for url in response.css('ul li a::attr("href")').re(r'.*/\d\d\d\d/\d\d/$'):
            yield scrapy.Request(response.urljoin(url), self.parse_titles)

    def parse_titles(self, response):
        # Yield one item per post title on the archive page.
        for post_title in response.css('div.entries > ul > li a::text').extract():
            yield {'title': post_title}
EOF
$ scrapy runspider myspider.py

http://scrapy.org/
Requirements: Python 2.7, libxml2 and libxslt
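Scraped items go to the log by default; Scrapy's built-in feed exports can also write them to a file with the -o option (the filename here is arbitrary):

$ scrapy runspider myspider.py -o titles.json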
23. pipelines.py
# -*- coding: utf-8 -*-
import os

class SaveImagePipeline(object):
    def process_item(self, item, spider):
        # Create the output directory on first use.
        output_dir = 'images'
        if not os.path.exists(output_dir):
            os.makedirs(output_dir)
        # Name the file after the last path segment of the image URL.
        filename = item['url'].split('/')[-1]
        with open(os.path.join(output_dir, filename), 'wb') as f:
            f.write(item['body'])
        return item
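The pipeline reads item['url'] and item['body'], so the project's items.py presumably defines matching fields. A minimal sketch assuming those field names; the class name ImageItem is hypothetical, not from the talk:

# -*- coding: utf-8 -*-
import scrapy

class ImageItem(scrapy.Item):
    # Hypothetical item with the two fields SaveImagePipeline expects.
    url = scrapy.Field()   # URL the image was downloaded from
    body = scrapy.Field()  # raw bytes of the image response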
24. settings.py
• Appended settings:

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'sushibot (+orangain@gmail.com)'

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'sushibot.pipelines.SaveImagePipeline': 300,
}
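The value 300 is the pipeline's order within ITEM_PIPELINES (0–1000; lower values run first). With these settings in place, the crawl is started from the project directory; the spider name sushibot below is an assumption based on the project name, not confirmed by the slides:

$ scrapy crawl sushibot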