Scrapy workshop

Description

If you want to get data from the web and no API is available, you need web scraping. Scrapy is a popular and effective framework for it, used in areas such as data science, journalism, business intelligence, and web development.

This workshop gives an overview of Scrapy, starting from the fundamentals and working through each new topic with hands-on examples. Participants will come away with a good understanding of Scrapy, the principles behind its design, and how to apply the best practices it encourages to any scraping task.

Goals: set up a Python environment and learn the basic concepts of the Scrapy framework.

SCRAPY WORKSHOP
Karthik Ananth
karthik@scrapinghub.com

Karthik Ananth
Who am I?
• Leading professional services @ Scrapinghub
• I have a vision to synergise data generation and analytics
• Open source promoter

What is Web Scraping
The main goal in scraping is to extract structured data from unstructured sources, typically web pages.

What for
• Monitor prices
• Lead generation
• Aggregate information
• Your imagination is the limit

Do you speak HTTP?
• Methods: GET, POST, PUT, HEAD…
• Status codes: 2XX, 3XX, 4XX, 418, 5XX, 999
• Headers, query string: Accept-Language, UA*…
• Persistence: cookies
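The HTTP vocabulary on this slide can be tried out with nothing but the standard library. The sketch below is only an illustration: the URL, the header values, and the `status_class` helper are assumptions made up for this example, and no request is actually sent.

```python
from http import HTTPStatus
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import Request

# Query string: encode parameters into a URL (example.com is a placeholder).
url = "http://example.com/search?" + urlencode({"q": "scrapy", "page": 2})

# Method and headers: build a request object without sending it.
request = Request(
    url,
    headers={"Accept-Language": "en", "User-Agent": "workshop-demo/0.1"},
    method="GET",
)


def status_class(code):
    """Return the status family of an HTTP status code, e.g. 404 -> '4XX'."""
    return "%dXX" % (code // 100)


print(request.get_method(), request.full_url)
print(parse_qs(urlsplit(url).query))
print(HTTPStatus.NOT_FOUND.phrase, status_class(HTTPStatus.NOT_FOUND))
```

Persistence (cookies) is left out here on purpose; it needs a real exchange with a server to be meaningful.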

Let’s perform a request
• urllib2: standard library (urllib.request in Python 3)
• python-requests: HTTP for humans

Show me the code!
import requests
req = requests.get('http://scrapinghub.com/about/')
What now?

HTML parsers
• lxml: Pythonic binding for the C libraries libxml2 and libxslt
• beautifulsoup: html.parser, lxml, html5lib

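Show me the code!
import requests
import lxml.html

# Download the schedule page and parse it into an element tree.
req = requests.get('http://nyc2015.pydata.org/schedule/')
tree = lxml.html.fromstring(req.text)

# Each xpath() call returns a list of matches relative to the current node.
for tr in tree.xpath('//span[@class="speaker"]'):
    name = tr.xpath('text()')
    url = tr.xpath('@href')
    print(name)
    print(url)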
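The same kind of XPath extraction can be tried offline. The sketch below swaps in the standard library's `xml.etree.ElementTree`, which supports only a small XPath subset and needs well-formed markup, so it is not a replacement for lxml on real pages. The markup and speaker names are invented for illustration; in this made-up page the `href` sits on a nested `<a>`, so the path steps into it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a schedule page.
PAGE = """
<html><body>
  <div class="slot">
    <span class="speaker"><a href="/speaker/1/">Ada Example</a></span>
    <span class="room">Room A</span>
  </div>
  <div class="slot">
    <span class="speaker"><a href="/speaker/2/">Bob Sample</a></span>
    <span class="room">Room B</span>
  </div>
</body></html>
"""

tree = ET.fromstring(PAGE.strip())

speakers = []
# Select only the links that sit inside a span with class "speaker".
for link in tree.findall(".//span[@class='speaker']/a"):
    speakers.append((link.text, link.get("href")))

print(speakers)
```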

“Those who don't understand xpath
are cursed to reinvent it, poorly.”

“An open source and collaborative framework for
extracting the data you need from websites. In a
fast, simple, yet extensible way.”

$ scrapy shell <url>
An interactive shell console
Invaluable tool for developing and debugging your spiders

An interactive shell console
>>> response.url
'http://example.com'
>>> response.xpath('//h1/text()')
[<Selector xpath='//h1/text()' data=u'Example Domain'>]
>>> view(response) # open in browser
>>> fetch('http://www.google.com') # fetch other URL

Starting a project
$ scrapy startproject <name>
pydata
├── pydata
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg
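To give a feel for what goes into `pipelines.py`, here is a minimal sketch of an item pipeline. Scrapy pipelines are plain classes with a `process_item(item, spider)` method. The class name `CleanTitlePipeline` and the `title` field are assumptions made up for this example.

```python
class CleanTitlePipeline:
    """Hypothetical pipeline that tidies the 'title' field of each item."""

    def process_item(self, item, spider):
        # Scrapy calls process_item() once for every item a spider yields.
        title = item.get("title")
        if title is not None:
            # Collapse runs of whitespace and trim both ends.
            item["title"] = " ".join(title.split())
        return item


# Stand-alone check with a plain dict item; no Scrapy process is needed here.
cleaned = CleanTitlePipeline().process_item({"title": "  Example \n Domain "}, spider=None)
print(cleaned)
```

In a real project a pipeline only runs once it is listed in the `ITEM_PIPELINES` setting in `settings.py`.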

What is a Spider? 1.0
import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        # Yield one item per <h3> heading on the page.
        for h3 in response.xpath('//h3/text()').extract():
            yield {'title': h3}
        # Follow every link; urljoin() resolves relative hrefs.
        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(response.urljoin(url), callback=self.parse)
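The link-following half of the spider relies on work Scrapy does behind the scenes: links must be absolute, off-site links are filtered against `allowed_domains`, and duplicate requests are dropped. The sketch below imitates those three steps with the standard library only. It is a simplified illustration, not Scrapy's implementation, and `is_allowed` and `links_to_follow` are hypothetical helper names.

```python
from urllib.parse import urljoin, urlsplit

ALLOWED_DOMAINS = ["example.com"]


def is_allowed(url, allowed_domains=ALLOWED_DOMAINS):
    """True if the URL's host is an allowed domain or one of its subdomains."""
    host = urlsplit(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


def links_to_follow(page_url, hrefs):
    """Resolve hrefs against the page URL and keep only allowed, unseen ones."""
    seen = set()
    for href in hrefs:
        absolute = urljoin(page_url, href)
        if is_allowed(absolute) and absolute not in seen:
            seen.add(absolute)
            yield absolute


hrefs = ["/about/", "news.html", "http://other.org/", "/about/"]
followed = list(links_to_follow("http://www.example.com/blog/", hrefs))
print(followed)
```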

Batteries included
• Logging
• Stats collection
• Testing: contracts
• Telnet console: inspect a Scrapy process

Avoid getting banned
• Rotate your User Agent
• Disable cookies
• Randomized download delays
• Use a pool of rotating IPs
• Crawlera
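Some of these points map onto built-in Scrapy settings. The fragment below is a sketch of what a project's `settings.py` might contain; the values, including the User-Agent string, are illustrative assumptions rather than recommendations. Rotating User Agents or IPs is not covered by these settings and would need extra middleware or an external service.

```python
# settings.py (fragment); the values are illustrative, not recommendations.

# Send an explicit User-Agent string (hypothetical value).
USER_AGENT = "pydata-workshop-bot (+http://www.example.com/contact)"

# Disable cookies so requests are harder to link into one session.
COOKIES_ENABLED = False

# Wait about 2 seconds between requests to the same site...
DOWNLOAD_DELAY = 2
# ...and randomize each wait between 0.5x and 1.5x of DOWNLOAD_DELAY.
RANDOMIZE_DOWNLOAD_DELAY = True
```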

Deployment 1.0
scrapyd: a service daemon to run Scrapy spiders
$ scrapyd-deploy

About us
• Tons of open source
• Fully remote distributed team

Mandatory Sales Slide
try.scrapinghub.com/pydatanyc
Crawl the web, at scale
• cloud-based platform
• smart proxy rotator
Get data, hassle-free
• off-the-shelf datasets
• turn-key web scraping

3. Why Web Scraping: APIs, Semantic web

9. HTML is not a regular language

13. Scrapy-ify early on

15. $ conda install -c scrapinghub scrapy

19. What is a spider

20. What is a Spider?
import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/',
    ]

    def parse(self, response):
        msg = 'A response from %s just arrived!' % response.url
        self.logger.info(msg)

25. Scrapy Cloud
$ shub deploy

28. We’re hiring!

29. Thanks
