SlideShare ist ein Scribd-Unternehmen logo
1 von 10
Downloaden Sie, um offline zu lesen
Starting Soon…
Web Scraping Workshop
Daniyal Bokhari
Using Beautiful Soup Package
for Python
What is Web Scraping?
The textbook definition
• Web scraping is the act of extracting data from
website
• Typically performed using automated tools to reduce
time spent on collecting data from websites
• Once data is extracted from websites, it can be parsed
to access specific pieces of information relevant to
your search
Some common uses
What is it used for?
• Industry statistics/insights
• By consumers to compare prices across different
stores
• Analyze stock or cryptocurrency prices
• Academic research
The legality of it
When should it not be used?
• Web scraping is generally considered to be legal when
it’s used to find publicly available information
• Should not be used to collect personal information,
data that the poster did not intend to be publicly
available
• Should not be used to access data that requires you
to create an account or login
A Python library for web scraping
• Beautiful Soup is a Python library that helps in web
scraping by providing useful methods to parse
through the HTML code of websites
• Creates python objects out of elements of the HTML
code allowing you to easily search and traverse
through the contents of websites
Which is better?
• Selenium refers to multiple different open-source projects, the Selenium
API can be used to control browsers
• Much more complex than Beautiful Soup, can interact with websites by
pressing buttons and filling out fields
• Used for automation and testing
• Can also be used to perform web scraping
• Beautiful Soup is useful for simple projects, can extract information faster
since it doesn’t load up all of the web page content and JavaScript
Beautiful Soup vs Selenium
• In order to use Beautiful Soup, you must specify a parser
• Parsers take the HTML code we receive from a website and construct an
object which we can use in Python to navigate and search through this
code
• Python has a built-in HTML parser called html.parser
• We will be using lxml, an external parser, as it is faster at parsing HTML
than html.parser and works for more versions of Python
Choice of Parser
Which one should you use?
Live Demo
• You will need to install Beautiful Soup, lxml, and the requests library
The best way to learn is to practice
Run these in a terminal:
pip install beautifulsoup4
pip install lxml
pip install requests
Alternate commands:
pip3 install <module>
Python3 -m pip install <module>
Python3 -m pip3 install <module>
• You can use your favourite IDE/text editor (Pycharm, VS Code, Vim, etc) to
follow along
Where can I learn more about this?
• The documentation is a great place to look at all available functions
• Online tutorials, YouTube videos are also a great place to learn more about
web scraping
Beautiful Soup 4 documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Many resources

Weitere ähnliche Inhalte

Ähnlich wie Web Scraping Workshop

DotNetNuke Urls - Best practice for administrators, editors and developers
DotNetNuke Urls - Best practice for administrators, editors and developersDotNetNuke Urls - Best practice for administrators, editors and developers
DotNetNuke Urls - Best practice for administrators, editors and developers
brchapman
 

Ähnlich wie Web Scraping Workshop (20)

Web Technology Part 1
Web Technology Part 1Web Technology Part 1
Web Technology Part 1
 
Best practices with development of enterprise-scale SharePoint solutions - Pa...
Best practices with development of enterprise-scale SharePoint solutions - Pa...Best practices with development of enterprise-scale SharePoint solutions - Pa...
Best practices with development of enterprise-scale SharePoint solutions - Pa...
 
Find maximum bugs in limited time
Find maximum bugs in limited timeFind maximum bugs in limited time
Find maximum bugs in limited time
 
Getting started developing for share point
Getting started developing for share pointGetting started developing for share point
Getting started developing for share point
 
Automated Acceptance Testing from Scratch
Automated Acceptance Testing from ScratchAutomated Acceptance Testing from Scratch
Automated Acceptance Testing from Scratch
 
BSides Lisbon 2013 - All your sites belong to Burp
BSides Lisbon 2013 - All your sites belong to BurpBSides Lisbon 2013 - All your sites belong to Burp
BSides Lisbon 2013 - All your sites belong to Burp
 
DNN Summit: Robots.txt & Multi-Site DNN Instances
DNN Summit: Robots.txt & Multi-Site DNN InstancesDNN Summit: Robots.txt & Multi-Site DNN Instances
DNN Summit: Robots.txt & Multi-Site DNN Instances
 
WTF: Where To Focus when you take over a Drupal project
WTF: Where To Focus when you take over a Drupal projectWTF: Where To Focus when you take over a Drupal project
WTF: Where To Focus when you take over a Drupal project
 
SharePoint 2013 Search Operations
SharePoint 2013 Search OperationsSharePoint 2013 Search Operations
SharePoint 2013 Search Operations
 
Practical automation for beginners
Practical automation for beginnersPractical automation for beginners
Practical automation for beginners
 
DCI - Free Seo Tools
DCI - Free Seo ToolsDCI - Free Seo Tools
DCI - Free Seo Tools
 
Exploring Content API Options - March 23rd 2016
Exploring Content API Options - March 23rd 2016Exploring Content API Options - March 23rd 2016
Exploring Content API Options - March 23rd 2016
 
If you want to automate, you learn to code
If you want to automate, you learn to codeIf you want to automate, you learn to code
If you want to automate, you learn to code
 
Search engine and web crawler
Search engine and web crawlerSearch engine and web crawler
Search engine and web crawler
 
From marketplace to WordPress - WordCamp Belfast
From marketplace to WordPress - WordCamp BelfastFrom marketplace to WordPress - WordCamp Belfast
From marketplace to WordPress - WordCamp Belfast
 
Wp 3hr-course
Wp 3hr-courseWp 3hr-course
Wp 3hr-course
 
Untangling spring week11
Untangling spring week11Untangling spring week11
Untangling spring week11
 
Web Scrapping Using Python
Web Scrapping Using PythonWeb Scrapping Using Python
Web Scrapping Using Python
 
Software design with Domain-driven design
Software design with Domain-driven design Software design with Domain-driven design
Software design with Domain-driven design
 
DotNetNuke Urls - Best practice for administrators, editors and developers
DotNetNuke Urls - Best practice for administrators, editors and developersDotNetNuke Urls - Best practice for administrators, editors and developers
DotNetNuke Urls - Best practice for administrators, editors and developers
 

Mehr von GDSC UofT Mississauga

Mehr von GDSC UofT Mississauga (20)

CSSC ML Workshop
CSSC ML WorkshopCSSC ML Workshop
CSSC ML Workshop
 
ICCIT Council × GDSC: UX / UI and Figma
ICCIT Council × GDSC: UX / UI and FigmaICCIT Council × GDSC: UX / UI and Figma
ICCIT Council × GDSC: UX / UI and Figma
 
Community Projects Info Session Fall 2023
Community Projects Info Session Fall 2023Community Projects Info Session Fall 2023
Community Projects Info Session Fall 2023
 
GDSC x Deerhacks - Origami Workshop
GDSC x Deerhacks - Origami WorkshopGDSC x Deerhacks - Origami Workshop
GDSC x Deerhacks - Origami Workshop
 
Reverse Engineering 101
Reverse Engineering 101Reverse Engineering 101
Reverse Engineering 101
 
Michael's OWASP Juice Shop Workshop
Michael's OWASP Juice Shop WorkshopMichael's OWASP Juice Shop Workshop
Michael's OWASP Juice Shop Workshop
 
MCSS × GDSC: Intro to Cybersecurity Workshop
MCSS × GDSC: Intro to Cybersecurity WorkshopMCSS × GDSC: Intro to Cybersecurity Workshop
MCSS × GDSC: Intro to Cybersecurity Workshop
 
Basics of C
Basics of CBasics of C
Basics of C
 
Discord Bot Workshop Slides
Discord Bot Workshop SlidesDiscord Bot Workshop Slides
Discord Bot Workshop Slides
 
Devops Workshop
Devops WorkshopDevops Workshop
Devops Workshop
 
Express
ExpressExpress
Express
 
HTML_CSS_JS Workshop
HTML_CSS_JS WorkshopHTML_CSS_JS Workshop
HTML_CSS_JS Workshop
 
DevOps Workshop Part 1
DevOps Workshop Part 1DevOps Workshop Part 1
DevOps Workshop Part 1
 
Docker workshop GDSC_CSSC
Docker workshop GDSC_CSSCDocker workshop GDSC_CSSC
Docker workshop GDSC_CSSC
 
Back-end (Flask_AWS)
Back-end (Flask_AWS)Back-end (Flask_AWS)
Back-end (Flask_AWS)
 
Full Stack React Workshop [CSSC x GDSC]
Full Stack React Workshop [CSSC x GDSC]Full Stack React Workshop [CSSC x GDSC]
Full Stack React Workshop [CSSC x GDSC]
 
Git Init (Introduction to Git)
Git Init (Introduction to Git)Git Init (Introduction to Git)
Git Init (Introduction to Git)
 
Database Workshop Slides
Database Workshop SlidesDatabase Workshop Slides
Database Workshop Slides
 
ChatGPT General Meeting
ChatGPT General MeetingChatGPT General Meeting
ChatGPT General Meeting
 
Elon & Twitter General Meeting
Elon & Twitter General MeetingElon & Twitter General Meeting
Elon & Twitter General Meeting
 

Kürzlich hochgeladen

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Kürzlich hochgeladen (20)

presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 

Web Scraping Workshop

  • 2. Web Scraping Workshop Daniyal Bokhari Using Beautiful Soup Package for Python
  • 3. What is Web Scraping? The textbook definition • Web scraping is the act of extracting data from website • Typically performed using automated tools to reduce time spent on collecting data from websites • Once data is extracted from websites, it can be parsed to access specific pieces of information relevant to your search
  • 4. Some common uses What is it used for? • Industry statistics/insights • By consumers to compare prices across different stores • Analyze stock or cryptocurrency prices • Academic research
  • 5. The legality of it When should it not be used? • Web scraping is generally considered to be legal when it’s used to find publicly available information • Should not be used to collect personal information, data that the poster did not intend to be publicly available • Should not be used to access data that requires you to create an account or login
  • 6. A Python library for web scraping • Beautiful Soup is a Python library that helps in web scraping by providing useful methods to parse through the HTML code of websites • Creates python objects out of elements of the HTML code allowing you to easily search and traverse through the contents of websites
  • 7. Which is better? • Selenium refers to multiple different open-source projects, the Selenium API can be used to control browsers • Much more complex than Beautiful Soup, can interact with websites by pressing buttons and filling out fields • Used for automation and testing • Can also be used to perform web scraping • Beautiful Soup is useful for simple projects, can extract information faster since it doesn’t load up all of the web page content and JavaScript Beautiful Soup vs Selenium
  • 8. • In order to use Beautiful Soup, you must specify a parser • Parsers take the HTML code we receive from a website and construct an object which we can use in Python to navigate and search through this code • Python has a built-in HTML parser called html.parser • We will be using lxml, an external parser, as it is faster at parsing HTML than html.parser and works for more versions of Python Choice of Parser Which one should you use?
  • 9. Live Demo • You will need to install Beautiful Soup, lxml, and the requests library The best way to learn is to practice Run these in a terminal: pip install beautifulsoup4 pip install lxml pip install requests Alternate commands: pip3 install <module> Python3 -m pip install <module> Python3 -m pip3 install <module> • You can use your favourite IDE/text editor (Pycharm, VS Code, Vim, etc) to follow along
  • 10. Where can I learn more about this? • The documentation is a great place to look at all available functions • Online tutorials, YouTube videos are also a great place to learn more about web scraping Beautiful Soup 4 documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/ Many resources