Description
If you want to get data from the web and no API is available, you need web scraping. Scrapy is a popular open-source Python framework for web scraping, used in areas such as data science, journalism, business intelligence, and web development.
Abstract
This workshop will provide an overview of Scrapy, starting from the fundamentals and working through each new topic with hands-on examples.
Participants will come away with a good understanding of Scrapy, the principles behind its design, and how to apply the best practices encouraged by Scrapy to any scraping task.
Goals:
Set up a Python environment.
Learn basic concepts of the Scrapy framework.
2. Karthik Ananth
Who am I?
Leading professional services @ Scrapinghub
I have a vision to synergise data generation and analytics
Open source promoter
10. HTML parsers
lxml: Pythonic binding for the C libraries libxml2 and libxslt
BeautifulSoup: works with the html.parser, lxml, or html5lib parsers
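The html.parser backend listed for BeautifulSoup is the parser that ships with Python's standard library, and it can also be driven directly. The sketch below is a minimal illustration of that event-driven style, not a recommendation over lxml or BeautifulSoup; `LinkCollector` and `SAMPLE_HTML` are hypothetical names made up for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag (hypothetical helper for illustration)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Made-up HTML snippet, not taken from any real page
SAMPLE_HTML = """
<ul>
  <li><a href="/talks/1">Talk one</a></li>
  <li><a href="/talks/2">Talk two</a></li>
  <li><a name="no-link">Not a link</a></li>
</ul>
"""

collector = LinkCollector()
collector.feed(SAMPLE_HTML)
print(collector.links)
```

The parser calls `handle_starttag` once per opening tag, so any state you want to keep has to be tracked by hand. Tree-based libraries such as lxml and BeautifulSoup hide that bookkeeping behind queries.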
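11. Show me the code!
import requests
import lxml.html

# Fetch the schedule page and parse it into an element tree
req = requests.get('http://nyc2015.pydata.org/schedule/')
tree = lxml.html.fromstring(req.text)

# Print the text and link target of every element marked as a speaker
for speaker in tree.xpath('//span[@class="speaker"]'):
    name = speaker.xpath('text()')
    url = speaker.xpath('@href')  # empty list if the element has no href
    print(name)
    print(url)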
12. “Those who don't understand xpath
are cursed to reinvent it, poorly.”
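To make the quote concrete, here is a small sketch that runs the same kind of query twice: once as a single path expression and once as a hand-written traversal. It uses the standard library's `xml.etree.ElementTree`, which supports only a limited XPath subset and parses well-formed XML rather than real-world HTML; that is a swap-in so the example has no dependencies, not the library used in the slides. The markup and the `speakers_by_hand` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed snippet; ElementTree parses XML, not arbitrary HTML
SAMPLE = """
<div>
  <span class="speaker"><a href="/speakers/ada">Ada</a></span>
  <span class="room">Room 1</span>
  <span class="speaker"><a href="/speakers/grace">Grace</a></span>
</div>
"""

root = ET.fromstring(SAMPLE)

# One declarative expression: every <a> directly inside a span whose
# class attribute is "speaker"
speakers = [
    (a.text, a.get("href"))
    for a in root.findall(".//span[@class='speaker']/a")
]


def speakers_by_hand(node):
    """The same query reinvented as nested loops (hypothetical helper)."""
    found = []
    for span in node.iter("span"):
        if span.get("class") == "speaker":
            for a in span.findall("a"):
                found.append((a.text, a.get("href")))
    return found


print(speakers)
print(speakers_by_hand(root) == speakers)
```

Both versions return the same pairs, but the path expression states the intent in one line, while the loops have to be rewritten for every new query.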
20. What is a Spider?
import scrapy


class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/',
    ]

    def parse(self, response):
        # Called with the downloaded response for each start URL
        msg = 'A response from %s just arrived!' % response.url
        self.logger.info(msg)
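21. What is a Spider? 1.0
import scrapy


class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/',
    ]

    def parse(self, response):
        # Yield one item (a plain dict) per <h3> heading
        for h3 in response.xpath('//h3/text()').extract():
            yield {'title': h3}

        # Follow every link and parse the target page with this same callback
        for url in response.xpath('//a/@href').extract():
            # urljoin() turns relative hrefs into absolute URLs
            yield scrapy.Request(response.urljoin(url), callback=self.parse)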
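The spider above yields two kinds of things from `parse`: items and further requests. The toy loop below sketches, in plain Python, roughly what a crawling engine does with them: collect the items, queue the new URLs, skip duplicates, and drop off-site links. It is a simplified illustration and not Scrapy's actual engine. `FAKE_PAGES`, `ALLOWED_DOMAINS`, `parse`, and `crawl` are hypothetical names, the "web" is an in-memory dict, and hosts are matched exactly, whereas Scrapy's `allowed_domains` also accepts subdomains.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

# Hypothetical in-memory "web": URL -> (titles, links found on that page)
FAKE_PAGES = {
    "http://www.example.com/": (["Home"], ["/about", "http://other.test/x"]),
    "http://www.example.com/about": (["About us"], ["/"]),
}
ALLOWED_DOMAINS = {"www.example.com"}


def parse(url):
    """Mimic a spider callback: yield dict items and follow-up URLs."""
    titles, links = FAKE_PAGES[url]
    for title in titles:
        yield {"title": title}
    for link in links:
        # Resolve relative links against the URL of the current page
        yield urljoin(url, link)


def crawl(start_urls):
    """Toy engine loop: queue URLs, skip duplicates and off-site links."""
    queue, seen, items = deque(start_urls), set(start_urls), []
    while queue:
        url = queue.popleft()
        for result in parse(url):
            if isinstance(result, dict):
                items.append(result)
            elif urlparse(result).hostname in ALLOWED_DOMAINS and result not in seen:
                seen.add(result)
                queue.append(result)
    return items


items = crawl(["http://www.example.com/"])
print(items)
```

Because `parse` is a generator, the loop can handle each item or URL as soon as it is produced, which is the same reason Scrapy callbacks use `yield` instead of returning a finished list.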