Getting started guide for web scraping using the Scrapy framework.
GitHub Link : https://github.com/zekelabs/Python---ML---DL---PySpark-Training/tree/master/Scrapy%20Projects
2. Web Scraping using Scrapy
3. Introduction to Web Scraping
A technique to extract large amounts of data from websites.
The data is extracted and saved in file systems or in a database.
Python Libraries :
1. BeautifulSoup
2. Scrapy
4. Ethics for Scraping
● Respect the robots.txt file
● Check if a public API is available
● Identify yourself by providing a User Agent (a minimal settings sketch follows this list)
● Scrape data to create new value, not to duplicate it
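A minimal sketch of Scrapy settings that put these points into practice; the bot name, contact URL and throttle values are placeholders, not part of the slides:

# settings.py (excerpt) -- placeholder values for illustration
ROBOTSTXT_OBEY = True                      # respect robots.txt rules of the target site
USER_AGENT = 'my-scraper/1.0 (+https://example.com/bot-contact)'  # identify yourself
DOWNLOAD_DELAY = 1.0                       # be polite: wait between requests
AUTOTHROTTLE_ENABLED = True                # adapt the request rate to server load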
5. What is robots.txt ?
Robots.txt is a text file that instructs web robots (typically search engine robots) how to crawl
pages on a website.
Basic format:
User-agent: [user-agent name]
Disallow: [URL string not to be crawled]
● Blocking all web crawlers from all content
User-agent: *
Disallow: /
● Allowing all web crawlers access to all content
User-agent: *
Disallow:
● Blocking a specific web crawler from a specific folder
User-agent: Googlebot
Disallow: /example-subfolder/
6. Getting Started with BeautifulSoup
● Beautiful Soup is a library for pulling data out of HTML and XML files.
● Installing Beautiful Soup4: pip install beautifulsoup4
● Useful Functions (a short usage sketch follows this list):
■ find()
■ find_all()
■ find_parent()
■ find_parents()
■ find_next_sibling()
■ find_next_siblings()
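A minimal sketch of the functions above, assuming requests is also installed; the URL is a placeholder:

import requests
from bs4 import BeautifulSoup

# Fetch a page (placeholder URL) and parse it
html = requests.get('http://www.example.com').text
soup = BeautifulSoup(html, 'html.parser')

title = soup.find('title')                    # first matching tag
links = soup.find_all('a')                    # list of all matching tags
first_link = soup.find('a')
if first_link is not None:
    parent = first_link.find_parent('body')       # enclosing tag
    next_item = first_link.find_next_sibling()    # following tag at the same level
print(title.get_text() if title else None, len(links))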
7. Introduction to Scrapy
● Open source framework
● Extracts, processes & stores unstructured data in a fast, simple, yet extensible way
9. Scrapy Components - Items
To define a common output data format, Scrapy provides the Item class. Item objects are simple
containers used to collect the scraped data.
Syntax :
● Define Items:
import scrapy

class MobileItem(scrapy.Item):
    model_name = scrapy.Field()
    model_details = scrapy.Field()
    model_price = scrapy.Field()
● Using Items to store the data
model = MobileItem()
model['model_name'] = name
model['model_details'] = details
model['model_price'] = price
yield model
10. Scrapy Components - Spider
Spiders are classes which define how to perform the crawl and how to extract structured
data from the pages (i.e. scraping items).
Scraping Cycle :
● Generate the initial Requests to crawl the first URLs, and specify a callback function to be called with
the response downloaded from those requests.
● In the callback function, parse the response using Selectors (web page) and return either dicts with
extracted data, Item objects, Request objects, or an iterable of these objects.
● The items returned from the spider are then stored in a database or in files.
Scrapy Default Spiders:
● Scrapy Spider
● Crawl Spider
● XMLFeed Spider
● CSVFeed Spider
● Sitemap Spider
11. scrapy.Spider
Attributes and Methods:
● name : Name for this spider. Required attribute.
● start_urls : A list of URLs where the spider will begin to crawl from.
● parse : The default callback used by Scrapy to process downloaded responses.
import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
    ]

    def parse(self, response):
        print(response.url)

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        yield scrapy.Request("http://www.example.com",
                             callback=self.parse_link)

    def parse_link(self, response):
        pass
12. Crawl Spider
Attributes and Methods:
● name : Name for this spider. Required attribute.
● start_urls : A list of URLs where the spider will begin to crawl from.
● rules : To define the rules for crawling.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = "myCrawlSpider"
    allowed_domains = ['pexels.com']
    start_urls = ['https://www.pexels.com/collections/feeling-happy-hzn4cx4/']
    rules = (
        Rule(LinkExtractor(allow=(r'/feeling-happy.*',)),
             callback='parse_item'),
    )

    def parse_item(self, response):
        # Callback for links matched by the rule
        pass
13. More about Link Extractors
● allow : a single regular expression (or list of regular expressions) that the (absolute) URLs must match in order to be extracted. If not given (or empty), it will match all links.
● deny : a single regular expression (or list of regular expressions) that the (absolute) URLs must match in order to be excluded (i.e. not extracted). It has precedence over the allow parameter.
● allow_domains : a single value or a list of strings containing domains which will be considered for extracting the links.
● deny_domains : a single value or a list of strings containing domains which won't be considered for extracting the links.
● restrict_xpaths : an XPath (or list of XPaths) which defines regions inside the response where links should be extracted from. If given, only the text selected by those XPaths will be scanned for links (see the sketch after this list).
● restrict_css : a CSS selector (or list of selectors) which defines regions inside the response where links should be extracted from.
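A hedged sketch of a LinkExtractor combining several of these parameters; the patterns, domain and CSS region are placeholders for illustration only:

from scrapy.linkextractors import LinkExtractor

# Placeholder patterns and domains, not taken from the slides
link_extractor = LinkExtractor(
    allow=(r'/products/.*',),                 # only URLs matching this pattern
    deny=(r'/products/out-of-stock.*',),      # excluded even if they match allow
    allow_domains=('example.com',),           # stay on this domain
    restrict_css=('div.pagination',),         # only look for links inside this region
)

# Inside a spider callback, extract the matching links from a response:
# links = link_extractor.extract_links(response)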
14. Scrapy Selectors
Scrapy Selector to extract Data :
● Xpath (response.xpath('//div[@id="images"]/a/text()').extract_first())
● CSS ( response.css('title::text').extract())
1. XPath :
XPath data model : XPath's data model is a tree of nodes representing a document. Nodes can be
either:
● element nodes (<p>This is a paragraph</p>): /html/head/title
● attribute nodes (href="page.html" inside an <a> tag): //a/@href
● text nodes ("I have something to say"): /html/body/div/div[1]/text()
● comment nodes (<!-- a comment -->): //comment()
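A small sketch showing both selector styles against an inline HTML snippet; the markup is made up for illustration:

from scrapy.selector import Selector

# Made-up HTML for illustration
html = '<div id="images"><a href="page.html">I have something to say</a></div>'
sel = Selector(text=html)

print(sel.xpath('//div[@id="images"]/a/text()').extract_first())  # text node
print(sel.xpath('//a/@href').extract_first())                     # attribute node
print(sel.css('div#images a::text').extract())                    # same text via a CSS selector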
16. CSS Selectors
● .class (.intro) : Selects all elements with class="intro"
● #id (#firstname) : Selects the element with id="firstname"
● [attribute=value] ([target=_blank]) : Selects all elements with target="_blank"
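A short sketch applying these selectors inside a Scrapy callback; the class and id names are the placeholders from the list above:

# Inside a spider's parse(self, response) callback:
response.css('.intro::text').extract()                  # text of all elements with class="intro"
response.css('#firstname::text').extract_first()        # text of the element with id="firstname"
response.css('[target=_blank]::attr(href)').extract()   # href of all elements with target="_blank"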
18. Item Exporter
• Scrapy provides a collection of Item Exporters for different output formats, such as XML, CSV or JSON
1. Call start_exporting() to signal the beginning of the exporting process
2. Call export_item() for each item you want to export
3. Call finish_exporting() to signal the end of the exporting process
• CSV Item Exporter : (scrapy.exporters.CsvItemExporter)
from scrapy.exporters import CsvItemExporter

class CsvPipeline(object):
    def __init__(self):
        self.file = open("booksdata.csv", 'wb')
        self.exporter = CsvItemExporter(self.file)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
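For the pipeline to run it must be registered in the project settings; a minimal sketch, assuming the class lives in a hypothetical myproject/pipelines.py:

# settings.py (excerpt) -- 'myproject' is a placeholder project name
ITEM_PIPELINES = {
    'myproject.pipelines.CsvPipeline': 300,
}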
19. Image Pipeline and File Pipeline
Enabling Image and File Pipeline
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
Set the FILES_STORE and IMAGES_STORE settings:
FILES_STORE = '/path/to/valid/dir'
IMAGES_STORE = '/path/to/valid/dir'
Using File Pipeline
1. In a Spider, you scrape an item and put the URLs of the desired files into a file_urls field.
2. The item is returned from the spider and goes to the item pipeline.
3. When the item reaches the FilesPipeline, the URLs in the file_urls field are scheduled for download using the
standard Scrapy scheduler and downloader
4. When the files are downloaded, another field (files) will be populated with the results. This field will contain a list of
dicts with information about the downloaded files
The Images Pipeline works the same way: it uses image_urls for the image URLs of an item and it will populate an images field with information about the
downloaded images.
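A minimal item sketch with the fields both pipelines expect; the item name and URL are placeholders:

import scrapy

class MediaItem(scrapy.Item):       # hypothetical item name
    file_urls = scrapy.Field()      # URLs scheduled for download by FilesPipeline
    files = scrapy.Field()          # populated with results of the downloads
    image_urls = scrapy.Field()     # URLs scheduled for download by ImagesPipeline
    images = scrapy.Field()         # populated with results of the downloads

# In a spider callback (placeholder URL):
# yield MediaItem(image_urls=['https://www.example.com/photo.jpg'])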
21. Visit : www.zekeLabs.com for more details
THANK YOU
Let us know how we can help your organization upskill its
employees to stay updated in the ever-evolving IT industry.
Get in touch:
www.zekeLabs.com | +91-8095465880 | info@zekeLabs.com