The document discusses using the Scrapy framework in Python for web scraping. It begins with an introduction to web scraping and why Python suits it, then gives an overview of Scrapy, including the problems it solves and how to get started. A worked example scrapes sushi images from Flickr using Scrapy spiders, items, pipelines, and settings: the spider builds an image URL from each Flickr API record and yields requests for the images, which a pipeline saves into an images folder.
2. Who am I?
• Kota Kato
• @orangain
• Software Engineer
• Interested in automation tools such as Jenkins, Chef, and Docker.
3. Definition: Web Scraping
• Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites.
Web scraping - Wikipedia, the free encyclopedia
https://en.wikipedia.org/wiki/Web_scraping
5. QB Meter
• Visualizes the crowdedness of QB HOUSE, a 10-minute barbershop chain.
• Retrieves crowdedness data from QB HOUSE's web site every 5 minutes.
http://qbmeter.capybala.com/
7. Pokedos
• Web app that finds the nearest bus stops and shows bus arrival information.
• Retrieves the locations of all bus stops in Kyoto City.
http://bus.capybala.com/
8. Why Web Scraping?
• For Web Developer:
• Develop mash-up applications.
• For Data Analyst:
• Retrieve data to analyze.
• For Everybody:
• Automate operations on web sites.
9. Why Use Python?
• Easy to use
• Powerful libraries, especially Scrapy
• Seamless transition between data processing and application development
10. Web Scraping in Python
• Combination of lightweight libraries (sketch below):
• Retrieving: Requests
• Scraping: lxml, Beautiful Soup
• Full stack framework:
• Scrapy (today's topic)
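As a rough illustration of the lightweight combination above, here is a minimal sketch that retrieves a page with Requests and scrapes it with lxml; the URL and XPath are placeholders, not from the talk:

import requests
import lxml.html

# Retrieving: fetch the page with Requests.
response = requests.get('http://blog.scrapinghub.com')  # placeholder URL
response.raise_for_status()

# Scraping: parse the HTML with lxml and extract links with XPath.
root = lxml.html.fromstring(response.content)
for href in root.xpath('//a/@href'):  # placeholder expression
    print(href)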
12. Scrapy
• Fast, simple and extensible Web scraping
framework in Python
• Currently compatible only with Python 2.7
• Python 3 support is in progress
• Maintained by Scrapinghub
• BSD License
http://scrapy.org/
13. Why Use Scrapy?
• Scrapy handles the annoying parts of crawling and scraping (a settings sketch follows this list):
• Extracting links
• Throttling and concurrency control
• Respecting robots.txt and <meta> tags
• Parsing XML sitemaps
• Filtering duplicated URLs
• Retrying on error
• Job control
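Most of these behaviors are controlled through settings. A minimal sketch using real Scrapy setting names; the values are illustrative, not from the talk:

# Respect robots.txt and throttle politely.
ROBOTSTXT_OBEY = True
CONCURRENT_REQUESTS_PER_DOMAIN = 4
DOWNLOAD_DELAY = 1

# Retry failed requests a couple of times.
RETRY_ENABLED = True
RETRY_TIMES = 2

# Persist crawl state so a job can be paused and resumed.
JOBDIR = 'crawls/example-job'

Duplicate URL filtering is enabled by default.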
14. Getting Started with Scrapy
$ pip install scrapy
$ cat > myspider.py <<EOF
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['http://blog.scrapinghub.com']

    def parse(self, response):
        # Follow links to monthly archive pages (URLs ending in /YYYY/MM/).
        for url in response.css('ul li a::attr("href")').re(r'.*/\d\d\d\d/\d\d/$'):
            yield scrapy.Request(response.urljoin(url), self.parse_titles)

    def parse_titles(self, response):
        # Yield one item per post title on the archive page.
        for post_title in response.css('div.entries > ul > li a::text').extract():
            yield {'title': post_title}
EOF
$ scrapy runspider myspider.py

http://scrapy.org/
Requirements: Python 2.7, libxml2 and libxslt
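Scraped items go to the log by default; Scrapy's built-in feed exports can also write them to a file with the -o option (the filename here is arbitrary):

$ scrapy runspider myspider.py -o titles.json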
23. pipelines.py
# -*- coding: utf-8 -*-
import os

class SaveImagePipeline(object):
    def process_item(self, item, spider):
        # Create the output directory on first use.
        output_dir = 'images'
        if not os.path.exists(output_dir):
            os.makedirs(output_dir)
        # Name the file after the last path segment of the image URL.
        filename = item['url'].split('/')[-1]
        with open(os.path.join(output_dir, filename), 'wb') as f:
            f.write(item['body'])
        return item
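The pipeline reads item['url'] and item['body'], so the project's items.py presumably defines matching fields. A minimal sketch assuming those field names; the class name ImageItem is hypothetical, not from the talk:

# -*- coding: utf-8 -*-
import scrapy

class ImageItem(scrapy.Item):
    # Hypothetical item with the two fields SaveImagePipeline expects.
    url = scrapy.Field()   # URL the image was downloaded from
    body = scrapy.Field()  # raw bytes of the image response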
24. settings.py
• Appended settings:

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'sushibot (+orangain@gmail.com)'

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'sushibot.pipelines.SaveImagePipeline': 300,
}
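The value 300 is the pipeline's order within ITEM_PIPELINES (0–1000; lower values run first). With these settings in place, the crawl is started from the project directory; the spider name sushibot below is an assumption based on the project name, not confirmed by the slides:

$ scrapy crawl sushibot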