Web Scraping in
Python with Scrapy
Kota Kato
@orangain
2015-09-08, 鮨会 (Sushi meetup)
Who am I?
• Kota Kato
• @orangain
• Software Engineer
• Interested in automation tools such as
Jenkins, Chef, and Docker.
Definition: Web Scraping
• Web scraping (web harvesting or web data
extraction) is a computer software technique
of extracting information from websites.
Web scraping - Wikipedia, the free encyclopedia
https://en.wikipedia.org/wiki/Web_scraping
eBook-1
• Cross-store search engine for ebooks.
• Retrieve ebook data from 9 ebook stores.
http://ebook-1.com/
QB Meter
• Visualize crowdedness of QB HOUSE, a
10-minute barbershop.
• Retrieve crowdedness from QB HOUSE's
website every 5 minutes.
http://qbmeter.capybala.com/
Prototype of
Glance
• Prototype of a simple newspaper-like
news app.
• Retrieve news from NHK NEWS WEB
4 times per day.
Pokedos
• Web app to find the nearest bus stops
and see bus arrival information.
• Retrieve the locations of all bus stops
in Kyoto City.
http://bus.capybala.com/
Why Web Scraping?
• For Web Developers:
• Develop mash-up applications.
• For Data Analysts:
• Retrieve data to analyze.
• For Everybody:
• Automate operations on websites.
Why Use Python?
• Easy to use
• Powerful libraries, especially Scrapy
• Seamlessness between data processing and
application development
Web Scraping in Python
• Combination of lightweight libraries (sketched below):
• Retrieving: Requests
• Scraping: lxml, Beautiful Soup
• Full-stack framework:
• Scrapy ← today's topic
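
To make the contrast concrete, here is a minimal sketch of the lightweight-library combination, assuming a hypothetical target page at example.com (the URL is a placeholder, not part of the talk):

import requests
import lxml.html

# Retrieve the page with Requests (hypothetical URL).
response = requests.get('http://example.com/')
response.raise_for_status()

# Scrape it with lxml: parse the HTML and list every link.
root = lxml.html.fromstring(response.content)
for anchor in root.xpath('//a'):
    print('%s -> %s' % (anchor.get('href'), anchor.text_content().strip()))

With this approach you wire retrieving, parsing, throttling, and error handling together yourself; Scrapy bundles those concerns into one framework.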
Scrapy
• Fast, simple and extensible Web scraping
framework in Python
• Currently compatible only with Python 2.7
• Python 3 support is in progress
• Maintained by Scrapinghub
• BSD License
http://scrapy.org/
Why Use Scrapy?
• The annoying parts of crawling and scraping are
handled by Scrapy (see the settings sketch below):
• Extracting links
• Throttling concurrency
• robots.txt and <meta> tags
• XML sitemaps
• Filtering duplicated URLs
• Retry on error
• Job control
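
Most of these chores are switched on through settings rather than code. A hedged sketch of the relevant knobs (setting names as documented for Scrapy 1.0; values are illustrative):

# settings.py fragment
ROBOTSTXT_OBEY = True                # honor robots.txt before crawling
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # cap concurrency per domain
DOWNLOAD_DELAY = 1                   # throttle: wait between requests
RETRY_ENABLED = True                 # retry failed requests...
RETRY_TIMES = 2                      # ...up to 2 extra attempts
# Duplicate URLs are filtered by the built-in dupefilter by default.
# Job control: run `scrapy crawl <spider> -s JOBDIR=crawls/run-1` to
# persist crawl state and resume after an interruption.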
Getting Started with Scrapy
$ pip install scrapy
$ cat > myspider.py <<EOF
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['http://blog.scrapinghub.com']

    def parse(self, response):
        for url in response.css('ul li a::attr("href")').re(r'.*/\d\d\d\d/\d\d/$'):
            yield scrapy.Request(response.urljoin(url), self.parse_titles)

    def parse_titles(self, response):
        for post_title in response.css('div.entries > ul > li a::text').extract():
            yield {'title': post_title}
EOF
$ scrapy runspider myspider.py
Requirements: Python 2.7, libxml2 and libxslt
http://scrapy.org/
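
The yielded dicts can be exported without any extra code through Scrapy's feed exports, e.g.:

$ scrapy runspider myspider.py -o titles.json

The -o flag collects every yielded item into titles.json, with the format inferred from the file extension.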
Let's Collect Sushi Images
Create a Scrapy Project
$ scrapy startproject sushibot
$ tree sushibot/
sushibot/
├── scrapy.cfg
└── sushibot
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py

2 directories, 6 files
Generate a Spider
$ cd sushibot
$ scrapy genspider sushi api.flickr.com
$ cat sushibot/spiders/sushi.py
# -*- coding: utf-8 -*-
import scrapy
class SushiSpider(scrapy.Spider):
    name = "sushi"
    allowed_domains = ["api.flickr.com"]
    start_urls = (
        'http://www.api.flickr.com/',
    )

    def parse(self, response):
        pass
Flickr API to Search Photos
$ curl 'https://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=******&text=sushi&sort=relevance' > photos.xml
$ cat photos.xml
<?xml version="1.0" encoding="utf-8" ?>
<rsp stat="ok">
<photos page="1" pages="871" perpage="100" total="87088">
<photo id="4794344495" owner="38553162@N00" secret="d907790937"
server="4093" farm="5" title="Sushi!" ispublic="1" isfriend="0"
isfamily="0" />
<photo id="8486536177" owner="78779574@N00" secret="f77b824ebb"
server="8382" farm="9" title="Best Salmon Sushi" ispublic="1"
isfriend="0" isfamily="0" />
...
https://www.flickr.com/services/api/flickr.photos.search.html
Construct Photo's URL
Photo element:

<photo id="4794344495" owner="38553162@N00" secret="d907790937"
       server="4093" farm="5" title="Sushi!" ispublic="1" isfriend="0"
       isfamily="0" />

Photo's URL template:

https://farm{farm-id}.staticflickr.com/{server-id}/{id}_{secret}_[mstzb].jpg

Result:

https://farm5.staticflickr.com/4093/4794344495_d907790937_b.jpg

https://www.flickr.com/services/api/misc.urls.html
spiders/sushi.py (Modified)
# -*- coding: utf-8 -*-
import os
import scrapy
from sushibot.items import SushibotItem
class SushiSpider(scrapy.Spider):
    name = "sushi"
    allowed_domains = ["api.flickr.com", "staticflickr.com"]
    start_urls = (
        'https://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=' +
        os.environ['FLICKR_KEY'] + '&text=sushi&sort=relevance',
    )

    def parse(self, response):
        for photo in response.css('photo'):
            yield scrapy.Request(photo_url(photo), self.handle_image)

    def handle_image(self, response):
        return SushibotItem(url=response.url, body=response.body)


def photo_url(photo):
    return 'https://farm{farm}.staticflickr.com/{server}/{id}_{secret}_{size}.jpg'.format(
        farm=photo.xpath('@farm').extract_first(),
        server=photo.xpath('@server').extract_first(),
        id=photo.xpath('@id').extract_first(),
        secret=photo.xpath('@secret').extract_first(),
        size='b',
    )
Scrapy's Architecture
http://doc.scrapy.org/en/1.0/topics/architecture.html
items.py
# -*- coding: utf-8 -*-
from pprint import pformat
import scrapy
class SushibotItem(scrapy.Item):
    url = scrapy.Field()
    body = scrapy.Field()

    def __str__(self):
        return pformat({
            'url': self['url'],
            'body': self['body'][:10] + '...',
        })
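
Illustrative usage (hypothetical values, not from the talk): a scrapy.Item behaves like a dict with a fixed set of fields, and the custom __str__ keeps logs readable when body holds large binary data.

# Hypothetical item; the body string stands in for raw image bytes.
item = SushibotItem(url='https://example.com/sushi.jpg',
                    body='\xff\xd8... (image bytes)')
print(item)  # prints the URL and only the first 10 bytes of the body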
pipelines.py
# -*- coding: utf-8 -*-
import os
class SaveImagePipeline(object):
    def process_item(self, item, spider):
        output_dir = 'images'
        if not os.path.exists(output_dir):
            os.makedirs(output_dir)

        filename = item['url'].split('/')[-1]
        with open(os.path.join(output_dir, filename), 'wb') as f:
            f.write(item['body'])

        return item
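
Since process_item only touches item['url'] and item['body'], the pipeline can be exercised outside Scrapy with a plain dict; a minimal sketch, assuming hypothetical values:

pipeline = SaveImagePipeline()
fake_item = {'url': 'https://example.com/sushi.jpg', 'body': 'image bytes'}
pipeline.process_item(fake_item, spider=None)  # writes images/sushi.jpg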
settings.py
• Appended settings:
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'sushibot (+orangain@gmail.com)'

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
'sushibot.pipelines.SaveImagePipeline': 300,
}
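
As the comment above hints, the fixed DOWNLOAD_DELAY can also be replaced by the AutoThrottle extension, which adapts the delay to server load; a hedged sketch (setting name from the Scrapy docs):

AUTOTHROTTLE_ENABLED = True  # adjust the download delay dynamically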
Run Spider
$ FLICKR_KEY=********** scrapy crawl sushi
NOTE: Provide Flickr's API key via the FLICKR_KEY environment variable.
Thank you!
• Web scraping has the power to propose improvements.
• Source code is available at https://github.com/orangain/sushibot
@orangain