SlideShare a Scribd company logo
1 of 21
Scrapingtotherescue
(Webscrapingusingpython)
By : Satwik Kansal and Pradhvan Bisht
Whatiswebscraping ?
Web scraping is a technique to extract large amounts of
data from websites whereby the data is extracted and
saved to a local file in your computer.
The data can be used for several purposes like displaying on
your own website and application, performing data analysis
or for any other reason.
whyshouldyouscrape
- API may not provide what you need
- No rate limit
- Take what you really want!
- Reduces manual effort
- Swag!
Thingsthatmightcomehandy
-HTML
-CSS
-XPATH
-Regular Expressions
Howit’sdone?
Broadly a Three Step Process
1. Getting the content (in most cases HTML)
2. Parsing the response.
3. Optimizing/Improving the performance and preserving the data
GETTINGTHECONTENT
● Using modules like urllib, urllib2, requests, mechanize and selenium.
● Involves GET/POST request to the server.
● The response contains the information to be extracted.
● Sometimes not as easy as it may seem.
ExtractingTheData
1. Using Regular Expression and Basic python
Tricky, complex and kind of fragile.
2. Using Parsing Libraries
❏ Two different approaches possible -- Simple Parsing and Search Tree
parsing.
❏ Some popular libraries are BeautifulSoup, Lxml, and html5lib.
❏ Each modules has its own techniques and thus its own pros and trade-
offs
ComparingParsers
BEAUTIFUL SOUP
LXML
SCRAPY
HTML5LIB
PreservingTheData
1. Writing to a file.
2. Exporting as csv or excel file.
3. Storing in a database.
Examples
Example 1 : Scraping Tweets from Twitter using BeautifulSoup
and python’s Requests module
Code
Example 2 : Scraping top Stackoverflow posts using Scrapy
Code
Example 3 : Using Selenium to Log in and fetch library
details from a university library site which uses Dynamic
HTML.
WHATTOUSEWHERE
1. Handling dynamically generated html
Solutions: Selenium or Spidermonkey
2. Cookie based Authentication
Solution : Requests module.
3. Simple scraping
Solutions: BeautifulSoup+Requests, Scrapy, Selenium
Scrapinghacks
1. Overcoming captchas
Lookup tables, One time manual entry , Death By Captchas (paid service)
2. Per IP address query limit
Using tsocks, ssh_D and socks monkey.
3. Improving performance
Multiprocessing , gevent and requests.async() method.
Example3
Automating My College Library
Problems :
1. Authentication
2. Dynamically Generated <iframe> tag
Solution
Selenium with headless Browser like PhantomJS
Alternative: Mechanize
Code
EthicsOfScraping
Exceeding authorized use of the site
Means doing anything that is prohibited in the Terms of Use
(See CFAA, breach of contract, unjust enrichment, trespass
to chattels, and various state laws similar to CFAA)
Copyright Issues
If the material you are scraping is not factual, but
something that required some amount of creativity to create,
you have copyright to worry about.
QuickTip -- Conform to the the robots.txt file.
● The brute-force way to get the information required.
● Absolutely Legal
● Not always that easy.

More Related Content

What's hot

Web Scraping and Data Extraction Service
Web Scraping and Data Extraction ServiceWeb Scraping and Data Extraction Service
Web Scraping and Data Extraction ServicePromptCloud
 
What is Web-scraping?
What is Web-scraping?What is Web-scraping?
What is Web-scraping?Yu-Chang Ho
 
An Introduction to Semantic Web Technology
An Introduction to Semantic Web TechnologyAn Introduction to Semantic Web Technology
An Introduction to Semantic Web TechnologyAnkur Biswas
 
Scraping data from the web and documents
Scraping data from the web and documentsScraping data from the web and documents
Scraping data from the web and documentsTommy Tavenner
 
Web Development with HTML5, CSS3 & JavaScript
Web Development with HTML5, CSS3 & JavaScriptWeb Development with HTML5, CSS3 & JavaScript
Web Development with HTML5, CSS3 & JavaScriptEdureka!
 
Introduction to Python for Data Science
Introduction to Python for Data ScienceIntroduction to Python for Data Science
Introduction to Python for Data ScienceArc & Codementor
 
Data Science With Python | Python For Data Science | Python Data Science Cour...
Data Science With Python | Python For Data Science | Python Data Science Cour...Data Science With Python | Python For Data Science | Python Data Science Cour...
Data Science With Python | Python For Data Science | Python Data Science Cour...Simplilearn
 
Search engine and web crawler
Search engine and web crawlerSearch engine and web crawler
Search engine and web crawlerishmecse13
 
Web Development Ppt
Web Development PptWeb Development Ppt
Web Development PptBruce Tucker
 
Web development | Derin Dolen
Web development | Derin Dolen Web development | Derin Dolen
Web development | Derin Dolen Derin Dolen
 
Semantic web
Semantic webSemantic web
Semantic webRehithaP
 

What's hot (20)

Web scraping
Web scrapingWeb scraping
Web scraping
 
Web Scraping and Data Extraction Service
Web Scraping and Data Extraction ServiceWeb Scraping and Data Extraction Service
Web Scraping and Data Extraction Service
 
Web Scraping Basics
Web Scraping BasicsWeb Scraping Basics
Web Scraping Basics
 
What is Web-scraping?
What is Web-scraping?What is Web-scraping?
What is Web-scraping?
 
An Introduction to Semantic Web Technology
An Introduction to Semantic Web TechnologyAn Introduction to Semantic Web Technology
An Introduction to Semantic Web Technology
 
What is web scraping?
What is web scraping?What is web scraping?
What is web scraping?
 
Web Scraping
Web ScrapingWeb Scraping
Web Scraping
 
Scraping data from the web and documents
Scraping data from the web and documentsScraping data from the web and documents
Scraping data from the web and documents
 
Web Crawler
Web CrawlerWeb Crawler
Web Crawler
 
Web Development with HTML5, CSS3 & JavaScript
Web Development with HTML5, CSS3 & JavaScriptWeb Development with HTML5, CSS3 & JavaScript
Web Development with HTML5, CSS3 & JavaScript
 
Portfolio website
Portfolio websitePortfolio website
Portfolio website
 
SEO Robots txt FILE
SEO Robots txt FILESEO Robots txt FILE
SEO Robots txt FILE
 
Introduction to Python for Data Science
Introduction to Python for Data ScienceIntroduction to Python for Data Science
Introduction to Python for Data Science
 
Semantic web
Semantic webSemantic web
Semantic web
 
Data Science With Python | Python For Data Science | Python Data Science Cour...
Data Science With Python | Python For Data Science | Python Data Science Cour...Data Science With Python | Python For Data Science | Python Data Science Cour...
Data Science With Python | Python For Data Science | Python Data Science Cour...
 
Search engine and web crawler
Search engine and web crawlerSearch engine and web crawler
Search engine and web crawler
 
Web Development Ppt
Web Development PptWeb Development Ppt
Web Development Ppt
 
Web development | Derin Dolen
Web development | Derin Dolen Web development | Derin Dolen
Web development | Derin Dolen
 
Semantic web
Semantic webSemantic web
Semantic web
 
Ajax and PHP
Ajax and PHPAjax and PHP
Ajax and PHP
 

Similar to Getting started with Web Scraping in Python

REST Api Tips and Tricks
REST Api Tips and TricksREST Api Tips and Tricks
REST Api Tips and TricksMaksym Bruner
 
Making it fast: Zotonic & Performance
Making it fast: Zotonic & PerformanceMaking it fast: Zotonic & Performance
Making it fast: Zotonic & PerformanceArjan
 
Frontend performance
Frontend performanceFrontend performance
Frontend performancesacred 8
 
BackboneJS Training - Giving Backbone to your applications
BackboneJS Training - Giving Backbone to your applicationsBackboneJS Training - Giving Backbone to your applications
BackboneJS Training - Giving Backbone to your applicationsJoseph Khan
 
Web Component Development Using Servlet & JSP Technologies (EE6) - Chapter 4...
 Web Component Development Using Servlet & JSP Technologies (EE6) - Chapter 4... Web Component Development Using Servlet & JSP Technologies (EE6) - Chapter 4...
Web Component Development Using Servlet & JSP Technologies (EE6) - Chapter 4...WebStackAcademy
 
Building Faster Websites
Building Faster WebsitesBuilding Faster Websites
Building Faster WebsitesCraig Walker
 
Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013Jimmy Lai
 
Web scraping with BeautifulSoup, LXML, RegEx and Scrapy
Web scraping with BeautifulSoup, LXML, RegEx and ScrapyWeb scraping with BeautifulSoup, LXML, RegEx and Scrapy
Web scraping with BeautifulSoup, LXML, RegEx and ScrapyLITTINRAJAN
 
Scaling asp.net websites to millions of users
Scaling asp.net websites to millions of usersScaling asp.net websites to millions of users
Scaling asp.net websites to millions of usersoazabir
 
Advanced Web Scraping or How To Make Internet Your Database #seoplus2018
Advanced Web Scraping or How To Make Internet Your Database #seoplus2018Advanced Web Scraping or How To Make Internet Your Database #seoplus2018
Advanced Web Scraping or How To Make Internet Your Database #seoplus2018Esteve Castells
 
Sherlock Homepage - A detective story about running large web services - WebN...
Sherlock Homepage - A detective story about running large web services - WebN...Sherlock Homepage - A detective story about running large web services - WebN...
Sherlock Homepage - A detective story about running large web services - WebN...Maarten Balliauw
 
Ml based detection of users anomaly activities (20th OWASP Night Tokyo, English)
Ml based detection of users anomaly activities (20th OWASP Night Tokyo, English)Ml based detection of users anomaly activities (20th OWASP Night Tokyo, English)
Ml based detection of users anomaly activities (20th OWASP Night Tokyo, English)Yury Leonychev
 

Similar to Getting started with Web Scraping in Python (20)

REST Api Tips and Tricks
REST Api Tips and TricksREST Api Tips and Tricks
REST Api Tips and Tricks
 
Making it fast: Zotonic & Performance
Making it fast: Zotonic & PerformanceMaking it fast: Zotonic & Performance
Making it fast: Zotonic & Performance
 
contentDM
contentDMcontentDM
contentDM
 
Large-Scale Web Scraping: An Ultimate Guide
Large-Scale Web Scraping: An Ultimate GuideLarge-Scale Web Scraping: An Ultimate Guide
Large-Scale Web Scraping: An Ultimate Guide
 
Frontend performance
Frontend performanceFrontend performance
Frontend performance
 
BackboneJS Training - Giving Backbone to your applications
BackboneJS Training - Giving Backbone to your applicationsBackboneJS Training - Giving Backbone to your applications
BackboneJS Training - Giving Backbone to your applications
 
Web Component Development Using Servlet & JSP Technologies (EE6) - Chapter 4...
 Web Component Development Using Servlet & JSP Technologies (EE6) - Chapter 4... Web Component Development Using Servlet & JSP Technologies (EE6) - Chapter 4...
Web Component Development Using Servlet & JSP Technologies (EE6) - Chapter 4...
 
Cache is King
Cache is KingCache is King
Cache is King
 
Building Faster Websites
Building Faster WebsitesBuilding Faster Websites
Building Faster Websites
 
Ajax workshop
Ajax workshopAjax workshop
Ajax workshop
 
526_topic08.ppt
526_topic08.ppt526_topic08.ppt
526_topic08.ppt
 
Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013
 
Scrapy workshop
Scrapy workshopScrapy workshop
Scrapy workshop
 
L017418893
L017418893L017418893
L017418893
 
Web scraping with BeautifulSoup, LXML, RegEx and Scrapy
Web scraping with BeautifulSoup, LXML, RegEx and ScrapyWeb scraping with BeautifulSoup, LXML, RegEx and Scrapy
Web scraping with BeautifulSoup, LXML, RegEx and Scrapy
 
Net core performance
Net core performanceNet core performance
Net core performance
 
Scaling asp.net websites to millions of users
Scaling asp.net websites to millions of usersScaling asp.net websites to millions of users
Scaling asp.net websites to millions of users
 
Advanced Web Scraping or How To Make Internet Your Database #seoplus2018
Advanced Web Scraping or How To Make Internet Your Database #seoplus2018Advanced Web Scraping or How To Make Internet Your Database #seoplus2018
Advanced Web Scraping or How To Make Internet Your Database #seoplus2018
 
Sherlock Homepage - A detective story about running large web services - WebN...
Sherlock Homepage - A detective story about running large web services - WebN...Sherlock Homepage - A detective story about running large web services - WebN...
Sherlock Homepage - A detective story about running large web services - WebN...
 
Ml based detection of users anomaly activities (20th OWASP Night Tokyo, English)
Ml based detection of users anomaly activities (20th OWASP Night Tokyo, English)Ml based detection of users anomaly activities (20th OWASP Night Tokyo, English)
Ml based detection of users anomaly activities (20th OWASP Night Tokyo, English)
 

Recently uploaded

Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusZilliz
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbuapidays
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 

Recently uploaded (20)

Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 

Getting started with Web Scraping in Python

  • 2. Whatiswebscraping ? Web scraping is a technique to extract large amounts of data from websites whereby the data is extracted and saved to a local file in your computer. The data can be used for several purposes like displaying on your own website and application, performing data analysis or for any other reason.
  • 3.
  • 4. whyshouldyouscrape - API may not provide what you need - No rate limit - Take what you really want! - Reduces manual effort - Swag!
  • 6. Howit’sdone? Broadly a Three Step Process 1. Getting the content (in most cases HTML) 2. Parsing the response. 3. Optimizing/Improving the performance and preserving the data
  • 7. GETTINGTHECONTENT ● Using modules like urllib, urllib2, requests, mechanize and selenium. ● Involves GET/POST request to the server. ● The response contains the information to be extracted. ● Sometimes not as easy as it may seem.
  • 8. ExtractingTheData 1. Using Regular Expression and Basic python Tricky, complex and kind of fragile. 2. Using Parsing Libraries ❏ Two different approaches possible -- Simple Parsing and Search Tree parsing. ❏ Some popular libraries are BeautifulSoup, Lxml, and html5lib. ❏ Each modules has its own techniques and thus its own pros and trade- offs
  • 9.
  • 11. PreservingTheData 1. Writing to a file. 2. Exporting as csv or excel file. 3. Storing in a database.
  • 12. Examples Example 1 : Scraping Tweets from Twitter using BeautifulSoup and python’s Requests module Code Example 2 : Scraping top Stackoverflow posts using Scrapy Code Example 3 : Using Selenium to Log in and fetch library details from a university library site which uses Dynamic HTML.
  • 13.
  • 14. WHATTOUSEWHERE 1. Handling dynamically generated html Solutions: Selenium or Spidermonkey 2. Cookie based Authentication Solution : Requests module. 3. Simple scraping Solutions: BeautifulSoup+Requests, Scrapy, Selenium
  • 15.
  • 16. Scrapinghacks 1. Overcoming captchas Lookup tables, One time manual entry , Death By Captchas (paid service) 2. Per IP address query limit Using tsocks, ssh_D and socks monkey. 3. Improving performance Multiprocessing , gevent and requests.async() method.
  • 17. Example3 Automating My College Library Problems : 1. Authentication 2. Dynamically Generated <iframe> tag Solution Selenium with headless Browser like PhantomJS Alternative: Mechanize Code
  • 18.
  • 19. EthicsOfScraping Exceeding authorized use of the site Means doing anything that is prohibited in the Terms of Use (See CFAA, breach of contract, unjust enrichment, trespass to chattels, and various state laws similar to CFAA) Copyright Issues If the material you are scraping is not factual, but something that required some amount of creativity to create, you have copyright to worry about. QuickTip -- Conform to the the robots.txt file.
  • 20.
  • 21. ● The brute-force way to get the information required. ● Absolutely Legal ● Not always that easy.