2. Agenda
● What is web scraping and why it's fun
● My experiments with web scraping
● Getting started with Scrapy
● How Scrapy works and a quick demo
● Why Scrapy
● Questions
3. What is Web Scraping?
● Extracting information from websites
● Problem:
○ Static websites
○ No access to APIs to extract the data you need
○ Need to extract data periodically
● Manual solution: go to the website and copy the required data
● Smarter solution: Web Scraping
5. Web Scraping in Python
● Download the webpage with urllib2 or requests
● Parse the page with BeautifulSoup/lxml
● Select with XPath or CSS selectors
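The three steps above can be sketched with the standard library alone. This is a minimal illustration, not the speaker's code: `urllib.request` (the Python 3 successor to urllib2) stands in for the download step, and `xml.etree.ElementTree` stands in for BeautifulSoup/lxml so the example runs without extra installs. The sample markup, the `download` and `extract_books` helpers, and the `book` class name are all assumptions made for the example. ElementTree only accepts well-formed markup and a subset of XPath, which is one reason real pages are usually parsed with BeautifulSoup or lxml.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical sample page, inlined so the example runs offline.
SAMPLE_HTML = """
<html>
  <body>
    <ul id="books">
      <li class="book"><a href="/b/1">Book One</a></li>
      <li class="book"><a href="/b/2">Book Two</a></li>
    </ul>
  </body>
</html>
"""


def download(url):
    """Step 1: fetch the raw page (not called below, to avoid network access)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def extract_books(html):
    """Steps 2 and 3: parse the markup, then select nodes with an XPath expression."""
    root = ET.fromstring(html.strip())
    # ElementTree supports only a subset of XPath; lxml accepts the full syntax.
    links = root.findall(".//li[@class='book']/a")
    return [{"title": a.text, "url": a.get("href")} for a in links]


books = extract_books(SAMPLE_HTML)
print(books)
```

For a live site, the string returned by `download(url)` would be handed to a more forgiving parser than ElementTree.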
6. Scrapy - a fast, high-level screen scraping and web crawling framework
● Pick a website
● Define the data you want to scrape
● Write the spider to extract the data
● Run the spider
● Store the Data
9. Why Scrapy
● Simplicity
● Fast
● Productive and extensible
● Portable
● Well documented, with a healthy community
● Commercial Support
10. Advanced Features (built in)
● Interactive shell for trying XPaths (useful for debugging)
● Selecting and extracting data from HTML sources
● Cleaning and sanitizing the scraped data
● Generating feed exports (JSON, CSV)
● Media pipeline for downloading files and images
● Middlewares for cookies, HTTP compression, caching, user-agent spoofing, etc.
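Several of these built-in features are switched on through a project's settings module rather than through code. The fragment below is an illustrative sketch only: the setting names are Scrapy's own, but every value (output paths, storage directory, cache lifetime, user-agent string) is an assumption, and the `FEEDS` setting requires Scrapy 2.1 or later.

```python
# settings.py (sketch): every value below is an illustrative assumption.

# Feed exports: write scraped items to JSON and CSV files.
FEEDS = {
    "output/items.json": {"format": "json"},
    "output/items.csv": {"format": "csv"},
}

# Media pipeline: download the files listed in each item's "file_urls" field.
ITEM_PIPELINES = {"scrapy.pipelines.files.FilesPipeline": 1}
FILES_STORE = "downloads"

# Built-in downloader middlewares are enabled or tuned through settings.
COOKIES_ENABLED = True            # cookie handling
COMPRESSION_ENABLED = True        # HTTP compression
HTTPCACHE_ENABLED = True          # HTTP cache
HTTPCACHE_EXPIRATION_SECS = 3600  # cached responses expire after one hour
USER_AGENT = "my-crawler/0.1 (+http://example.com/contact)"  # sent with each request
```

The interactive shell needs no configuration: `scrapy shell "http://example.com"` opens a prompt with a `response` object for trying selectors.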