DRAFT VERSION v0.1 
First steps with Scrapy 
@Francisco Sousa
WHAT IS SCRAPY?
Scrapy is an open source and collaborative 
framework for extracting the data you 
need from websites. 
It’s made in Python!
Who is it for?
Scrapy is for everyone who wants to collect 
data from one or many websites.
“The advantage of scraping is that you can 
do it with virtually any web site - from 
weather forecasts to government 
spending, even if that site does not have 
an API for raw data access” 
Friedrich Lindenberg
Alternatives?
There are many alternatives, such as: 
• Lxml 
• Beautiful Soup 
• Mechanize 
• Newspaper
Advantages of Scrapy?
• It’s free 
• It’s cross platform (Windows, 
Linux, Mac OS and BSD) 
• Fast and powerful
Disadvantages of 
Scrapy?
• It’s only for Python 2.7+ 
• It has a steeper learning curve than 
some other alternatives 
• Installation differs according to 
the operating system
Let’s start!
First of all you will have to install it, so run: 
pip install scrapy 
or 
sudo pip install scrapy 
Note: this command installs Scrapy 
and its dependencies. 
On Windows you will also have to install pywin32
Create our first project
Before we start scraping information, 
we will create a Scrapy project, so go to the 
directory where you want to create the 
project and run the following command: 
scrapy startproject demo
The previous command creates the 
skeleton of your project, as you can see 
below:
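demo/
    scrapy.cfg
    demo/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py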
The files created are the core of our 
project, so it’s important that you 
understand the basics: 
• scrapy.cfg: the project configuration file 
• demo/: the project’s Python module; you’ll later import 
your code from here. 
• demo/items.py: the project’s items file. 
• demo/pipelines.py: the project’s pipelines file. 
• demo/settings.py: the project’s settings file. 
• demo/spiders/: a directory where you’ll later put your 
spiders.
Choose a website to 
scrape
Now that we have the skeleton of the project, 
the next logical step is to choose, among all 
the websites in the world, the one we 
want to get information from.
For this example, I chose to scrape 
information from The Verge, 
an important technology 
news website.
Because The Verge is a giant website, I 
decided to get information only from its 
latest reviews. 
So we have to follow these steps: 
1 Find the URL for reviews 
2 Define how many pages of reviews we want 
3 Define what information to scrape 
4 Create a spider
Find the URL for reviews: 
http://www.theverge.com/reviews
Define how many pages of reviews we 
want. For simplicity we will scrape 
only the first 5 pages of The Verge: 
• http://www.theverge.com/reviews/1 
• http://www.theverge.com/reviews/2 
• http://www.theverge.com/reviews/3 
• http://www.theverge.com/reviews/4 
• http://www.theverge.com/reviews/5
Define what information 
you want to scrape:
1 Title of the article 
2 Number of comments 
3 Author of the article
Create the fields in Python for the 
information that you want to scrape:
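A minimal sketch of demo/items.py for the three fields above; the class name ReviewItem is an assumption: 

# demo/items.py: a minimal sketch, the class name is illustrative
import scrapy

class ReviewItem(scrapy.Item):
    title = scrapy.Field()     # title of the article
    comments = scrapy.Field()  # number of comments
    author = scrapy.Field()    # author of the article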
Create a spider
name: identifies the Spider. It must be 
unique! 
start_urls: a list of URLs where the 
Spider will begin to crawl from. 
parse: a method of the spider, which will 
be called with the downloaded Response 
object of each start URL.
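A minimal sketch of such a spider under these definitions; the CSS selectors (div.article, .comments-count, .author) are placeholders, not The Verge’s real markup: 

# the_verge.py: a minimal sketch; the selectors are illustrative
import scrapy

class TheVergeSpider(scrapy.Spider):
    name = "the_verge"  # must be unique within the project
    start_urls = [
        "http://www.theverge.com/reviews/%d" % page
        for page in range(1, 6)  # the first 5 review pages
    ]

    def parse(self, response):
        # Called with the downloaded Response of each start URL.
        # A ReviewItem from items.py could be yielded instead of a dict.
        for article in response.css("div.article"):
            yield {
                "title": article.css("h2 a::text").get(),
                "comments": article.css(".comments-count::text").get(),
                "author": article.css(".author::text").get(),
            }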
How to run my spider?
This is the easy part: to run our spider we 
simply have to run the following command: 
scrapy runspider <spider_file.py> 
E.g.: scrapy runspider the_verge.py
How to store the information 
from my spider 
in a file?
To store the information from our spider we 
have to execute the following command: 
scrapy runspider the_verge.py -o items.json
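Scrapy appends each scraped item to that file; with the fields defined earlier, a record would take roughly this shape (the values are placeholders): 

[{"title": "...", "comments": "...", "author": "..."}]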
Other formats, like CSV and XML, are also 
available; Scrapy infers the format from the 
file extension: 
CSV: 
scrapy runspider the_verge.py -o items.csv 
XML: 
scrapy runspider the_verge.py -o items.xml
Conclusion
In this presentation you learned the key 
concepts of Scrapy and how to create a simple 
spider. Now it’s time to put your hands to work 
and experiment with other things :D
Thanks!
Appendix
Bibliography 
http://datajournalismhandbook.org/1.0/en/getting_data_3.html 
https://pypi.python.org/pypi/Scrapy 
http://scrapy.org/ 
http://doc.scrapy.org/
Code available in: 
https://github.com/FranciscoSousaDeveloper/demo 
Contact: 
pt.linkedin.com/pub/francisco-sousa/4a/921/6a3/ 
@Francisco Sousa
