zekeLabs
Learning made Simpler !
www.zekeLabs.com
Web Scraping using Scrapy
support@zekeLabs.com | www.zekeLabs.com | +91 8095465880
Introduction to Web Scraping
Web scraping is a technique to extract large amounts of data from websites.
The extracted data is saved to file systems or to a database.
Python Libraries:
1. BeautifulSoup
2. Scrapy
Ethics of Scraping
● Respect the robots.txt file
● Check if a public API is available
● Identify yourself by providing a User-Agent
● Scrape data to create value, not to duplicate it
What is robots.txt?
Robots.txt is a text file that instructs web robots (typically search engine crawlers) how to crawl
pages on a website. A programmatic check is sketched after the examples below.
Basic format:
User-agent: [user-agent name]
Disallow: [URL string not to be crawled]
● Blocking all web crawlers from all content
User-agent: *
Disallow: /
● Allowing all web crawlers access to all content
User-agent: *
Disallow:
● Blocking a specific web crawler from a specific folder
User-agent: Googlebot
Disallow: /example-subfolder/
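Before crawling, a site's robots.txt can be checked programmatically with Python's built-in urllib.robotparser; a minimal sketch (the URLs and user-agent name are placeholders):
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")   # placeholder site
rp.read()                                         # download and parse robots.txt
print(rp.can_fetch("MyCrawler", "http://www.example.com/page.html"))  # True if crawling is allowed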
Getting Started with BeautifulSoup
● Beautiful Soup is a library for pulling data out of HTML and XML files.
● Installing Beautiful Soup 4: pip install beautifulsoup4
● Useful functions (see the example after this list):
■ find()
■ find_all()
■ find_parent()
■ find_parents()
■ find_next_sibling()
■ find_next_siblings()
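A small Beautiful Soup sketch using these functions (the HTML snippet is made up):
from bs4 import BeautifulSoup

html = "<html><body><p class='intro'>Hello</p><p>World</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.find("p"))                        # first <p> tag
print(soup.find_all("p"))                    # all <p> tags
print(soup.find("p", class_="intro").text)   # text of the <p> with class="intro"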
Introduction to Scrapy
● Open-source framework
● Extracts, processes & stores unstructured data
● Fast, simple, yet extensible
Installing Scrapy
● conda install -c conda-forge scrapy
● pip install scrapy
Scrapy Components - Items
To define a common output data format, Scrapy provides the Item class. Item objects are simple
containers used to collect the scraped data.
Syntax:
● Define Items:
import scrapy

class MobileItem(scrapy.Item):
    model_name = scrapy.Field()
    model_details = scrapy.Field()
    model_price = scrapy.Field()
● Using Items to store the data:
model = MobileItem()
model['model_name'] = name
model['model_details'] = details
model['model_price'] = price
yield model
Scrapy Components - Spider
Spiders are classes which define how to perform the crawl and how to extract structured
data from the crawled pages (i.e. scraping items).
Scraping Cycle:
● Generate the initial Requests to crawl the first URLs, and specify a callback function to be called with
the response downloaded from those requests.
● In the callback function, parse the response (the web page) using Selectors and return either dicts with
extracted data, Item objects, Request objects, or an iterable of these objects.
● The items returned from the spider are typically stored in a database or in files.
Scrapy default spiders:
● scrapy.Spider
● CrawlSpider
● XMLFeedSpider
● CSVFeedSpider
● SitemapSpider
scrapy.Spider
Attribute / Method: Description
name: Name for this spider. Required attribute.
start_urls: A list of URLs where the spider will begin to crawl from.
parse: The default callback used by Scrapy to process downloaded responses.
import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
    ]

    def parse(self, response):
        print(response.url)
class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        yield scrapy.Request("http://www.example.com",
                             callback=self.parse_link)

    def parse_link(self, response):
        pass
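A spider like the ones above can be run from the command line; the file and output names below are illustrative:
scrapy runspider my_spider.py -o items.json      # run a standalone spider file
scrapy crawl example.com -o items.json           # run a spider by its name attribute, inside a Scrapy project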
CrawlSpider
Attribute / Method: Description
name: Name for this spider. Required attribute.
start_urls: A list of URLs where the spider will begin to crawl from.
rules: Defines the rules for crawling.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = "myCrawlSpider"
    allowed_domains = ['pexels.com']
    start_urls = ['https://www.pexels.com/collections/feeling-happy-hzn4cx4/']

    rules = (
        Rule(LinkExtractor(allow=(r'/feeling-happy.*',)),
             callback='parse_item'),
    )
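The rule above names a parse_item callback that is not shown on the slide; a minimal sketch of how it could look (the extracted fields are illustrative):
    def parse_item(self, response):
        # called for every link matched by the rule above
        yield {
            'url': response.url,
            'title': response.xpath('//title/text()').extract_first(),
        }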
More about Link Extractors
allow: a single regular expression (or list of regular expressions) that
the (absolute) URLs must match in order to be extracted. If not
given (or empty), it matches all links.
deny: a single regular expression (or list of regular expressions) that
the (absolute) URLs must match in order to be excluded (i.e. not
extracted). It takes precedence over the allow parameter.
allow_domains: a single value or a list of strings containing domains which will
be considered for extracting the links.
deny_domains: a single value or a list of strings containing domains which
won't be considered for extracting the links.
restrict_xpaths: an XPath (or list of XPaths) which defines regions inside the
response where links should be extracted from. If given, only
the text selected by those XPaths will be scanned for links. See
the example below.
restrict_css: a CSS selector (or list of selectors) which defines regions inside
the response where links should be extracted from.
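A hedged example combining several of these parameters (the patterns, domain, and XPath are placeholders):
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

rules = (
    Rule(LinkExtractor(allow=(r'/products/.*',),
                       deny=(r'/products/out-of-stock/.*',),
                       allow_domains=('example.com',),
                       restrict_xpaths=('//div[@id="content"]',)),
         callback='parse_item',
         follow=True),
)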
Scrapy Selectors
Scrapy offers two selector types for extracting data:
● XPath: response.xpath('//div[@id="images"]/a/text()').extract_first()
● CSS: response.css('title::text').extract()
1. XPath:
XPath data model: XPath's data model is a tree of nodes representing a document. Nodes can be
either:
● element nodes (<p>This is a paragraph</p>): /html/head/title
● attribute nodes (href="page.html" inside an <a> tag): //meta/@*
● text nodes ("I have something to say"): /html/body/div/div[1]/text()
● comment nodes (<!-- a comment -->): //comment()
More on XPath Selectors (see the example after this list):
1. Basic Syntax: Xpath=//tagname[@attribute='value']
Xpath=//input[@id='id1']
Xpath=//input[@class='class1']
2. Xpath=//*[contains(@type,'sub')]
Xpath=//*[contains(text(),'Click here')]
3. Xpath=//label[starts-with(@id,'message')]
4. Xpath=//td[text()='UserID']
5. Xpath=//*[@type='text']//following::input
6. Xpath=//*[@id='id1']/child::li
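These patterns can be tried outside a spider with Scrapy's Selector class; a small sketch over inline HTML (the markup is made up):
from scrapy.selector import Selector

html = '<div id="images"><a href="img1.html">Image 1</a><a href="img2.html">Image 2</a></div>'
sel = Selector(text=html)

print(sel.xpath('//div[@id="images"]/a/text()').extract_first())    # 'Image 1'
print(sel.xpath('//a[contains(text(), "Image")]/@href').extract())  # ['img1.html', 'img2.html']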
CSS Selectors (usage example below)
1. .class: .intro selects all elements with class="intro"
2. #id: #firstname selects the element with id="firstname"
3. [attribute=value]: [target=_blank] selects all elements with target="_blank"
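The same selectors work with response.css inside a spider callback; for example (the element names are assumptions about the page being scraped):
# inside a parse() callback
response.css('.intro::text').extract()                   # text of all elements with class="intro"
response.css('#firstname::text').extract_first()         # text of the element with id="firstname"
response.css('a[target=_blank]::attr(href)').extract()   # href of every link that opens in a new tab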
Item Pipeline
Enabling the pipeline in settings.py:
ITEM_PIPELINES = {
    'myproject.pipelines.JsonWriterPipeline': 800,
}

import json

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
Item Exporter
• Scrapy provides a collection of Item Exporters for different output formats, such as XML, CSV or JSON
1. Call start_exporting() to signal the beginning of the exporting process
2. Call export_item() for each item you want to export
3. Call finish_exporting() to signal the end of the exporting process
• CSV Item Exporter: scrapy.exporters.CsvItemExporter
from scrapy.exporters import CsvItemExporter

class CsvPipeline(object):

    def __init__(self):
        self.file = open("booksdata.csv", 'wb')
        self.exporter = CsvItemExporter(self.file)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
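Like any pipeline, the exporter above only runs once it is registered in settings.py; the project path and priority below are illustrative:
ITEM_PIPELINES = {
    'myproject.pipelines.CsvPipeline': 300,
}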
Image Pipeline and File Pipeline
Enabling the Image and File pipelines
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
Set the FILES_STORE and IMAGES_STORE settings:
FILES_STORE = '/path/to/valid/dir'
IMAGES_STORE = '/path/to/valid/dir'
Using the File Pipeline
1. In a spider, you scrape an item and put the URLs of the desired files into a file_urls field.
2. The item is returned from the spider and goes to the item pipeline.
3. When the item reaches the FilesPipeline, the URLs in the file_urls field are scheduled for download using the
standard Scrapy scheduler and downloader.
4. When the files are downloaded, another field (files) is populated with the results. This field contains a list of
dicts with information about the downloaded files.
The ImagesPipeline works the same way: it uses an image_urls field for the image URLs of an item and populates an
images field with information about the downloaded images.
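A minimal item sketch carrying the special fields the pipelines look for (file_urls/files and image_urls/images are Scrapy's default field names; the item class name is illustrative):
import scrapy

class DocumentItem(scrapy.Item):
    file_urls = scrapy.Field()   # filled by the spider with URLs to download
    files = scrapy.Field()       # filled by FilesPipeline with download results
    image_urls = scrapy.Field()  # filled by the spider with image URLs
    images = scrapy.Field()      # filled by ImagesPipeline with download results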
Settings
• ROBOTSTXT_OBEY = True
• USER_AGENT = 'MyCompany-MyCrawler (bot@mycompany.com)'
• DOWNLOAD_DELAY = 5.0
• CONCURRENT_REQUESTS_PER_DOMAIN = 16
• CONCURRENT_REQUESTS_PER_IP = 16
• HTTPCACHE_ENABLED = True
• DOWNLOAD_TIMEOUT = 15
• REDIRECT_ENABLED = False
• DEPTH_LIMIT = 3
• CLOSESPIDER_ITEMCOUNT=10
• CLOSESPIDER_PAGECOUNT=10
Visit : www.zekeLabs.com for more details
THANK YOU
Let us know how we can help your organization upskill its employees to stay
updated in the ever-evolving IT industry.
Get in touch:
www.zekeLabs.com | +91-8095465880 | info@zekeLabs.com