Scraping the Web with
Laravel Dusk, Docker, and PHP
By: Paul Redmond
@paulredmond paulredmond
What You’ll Learn
● Different types of scraping and when to use them
● Use Laravel Dusk for rapid browser automation
● Different Ways to Run Browser Automation
● Run Browser Automation in a Server Environment
What is Web Scraping?
It’s a dirty job
Gathering data from HTML and
other media for the purposes
of testing, data enrichment,
and collection.
https://flic.kr/p/8EZMNk
Hundreds of Billions
Google “Scrapes” Hundreds of Billions (Or More)
of Pages and other media on the web.
https://www.google.com/search/howsearchworks/crawling-indexing/
Why Do We Need Scraping?
● Market analysis
● Gain a competitive advantage
● Increase learning and understanding
● Monitor trends
● Combine multiple offers into one portal (e.g., shopping
comparisons)
● Analytics
Other Types of Data Scraping
● Competitor Scanning
● Military Intelligence
● Surveillance
● Metering
Is Web Scraping Legitimate?
● Yes, it can be.
● Scraping can have a negative/bad connotation, so...
○ Don’t do bad / illegal stuff
○ Be nice
○ Be careful
○ Be respectful
Keeping Web Scraping Legitimate
● Speed. Go slow (watch requests/second)
● Caution. Code mistakes could create unintended load!
● Intent. Even if your intentions are pure, always question them.
● Empathy. Put yourself in the shoes of website owners
● Honesty. Don’t steal stuff (PII, copyrights, etc.)
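One way to keep the "go slow" rule honest is to compute the pause before each request instead of guessing. This is a minimal sketch, assuming the budget is expressed as requests per second; the function name `politeDelayMicros`, the paths, and the rate of 5 requests/second are illustrative and not from the talk.

```php
<?php
/**
 * Returns how many microseconds to wait before the next request so that
 * the request rate never exceeds $maxPerSecond. Times are Unix timestamps
 * in seconds (floats), as returned by microtime(true).
 */
function politeDelayMicros(float $lastRequestAt, float $now, float $maxPerSecond): int
{
    $minInterval = 1.0 / $maxPerSecond;                  // seconds between requests
    $remaining = $minInterval - ($now - $lastRequestAt); // time still owed

    return $remaining > 0 ? (int) ceil($remaining * 1000000) : 0;
}

// Usage sketch: hypothetical paths, at most 5 requests per second.
$lastRequestAt = 0.0;
foreach (['/page/1', '/page/2'] as $path) {
    usleep(politeDelayMicros($lastRequestAt, microtime(true), 5.0));
    $lastRequestAt = microtime(true);
    // ... fetch $path with your HTTP client here ...
}
```

Keeping the calculation in a pure function means a coding mistake in the scraping loop is less likely to turn into unintended load, and the limit can be checked without touching the network.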
Keep Robots.txt in Mind...Be a Good Bot
● https://www.google.com/robots.txt
● https://www.yahoo.com/robots.txt
● https://github.com/robots.txt (see the top comment)
* PHP Robots Parser: https://github.com/webignition/robots-txt-file
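To make the "good bot" idea concrete, here is a deliberately simplified robots.txt check written with plain PHP string functions. It is a sketch under stated assumptions: it only reads the `User-agent: *` group, treats `Disallow` values as path prefixes, and ignores `Allow` rules and wildcards. The function names are hypothetical, and this is not the API of the parser linked above, which is the better choice for real use.

```php
<?php
/**
 * Collects the Disallow path prefixes that apply to every bot ("User-agent: *").
 * Simplified: ignores Allow rules, wildcards, and agent-specific groups.
 */
function disallowedPrefixes(string $robotsTxt): array
{
    $prefixes = [];
    $appliesToAll = false;
    $previousWasAgent = false;

    foreach (preg_split('/\R/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*$/', '', $line)); // strip comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            // Consecutive User-agent lines share one group of rules.
            if (!$previousWasAgent) {
                $appliesToAll = false;
            }
            $appliesToAll = $appliesToAll || $value === '*';
            $previousWasAgent = true;
            continue;
        }
        $previousWasAgent = false;

        if ($field === 'disallow' && $appliesToAll && $value !== '') {
            $prefixes[] = $value;
        }
    }

    return $prefixes;
}

/** Returns false when $path starts with any disallowed prefix. */
function isPathAllowed(string $robotsTxt, string $path): bool
{
    foreach (disallowedPrefixes($robotsTxt) as $prefix) {
        if (strpos($path, $prefix) === 0) {
            return false;
        }
    }
    return true;
}

// Made-up robots.txt content for illustration.
$robots = "User-agent: *\nDisallow: /admin/ # private\nDisallow:\n\nUser-agent: BadBot\nDisallow: /\n";

var_dump(isPathAllowed($robots, '/admin/users')); // blocked for all bots
var_dump(isPathAllowed($robots, '/blog/post-1')); // not covered by any rule
```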
When Do We Scrape?
● What is the purpose?
● Can we live without the data?
● Do they have an API?
● If yes, does the API have everything we need?
● Do they allow scraping?
Downsides of Scraping
● Changes in the HTML/DOM break scrapers
● Changes in the HTML/DOM break scrapers
● Changes in the HTML/DOM break scrapers
● Changes in the HTML/DOM break scrapers
● Rich JavaScript apps can cause headaches
● Scraping can be process/memory and time intensive
● More manual processing/formatting of collected data
than an API
● Changes in the HTML/DOM break scrapers
How Do We Overcome the Downsides?
● Match DOM/Selectors defensively
● It's a bit of an art that takes practice and experience
● Make sure that you handle failure
● Good alerting, notifications, and reporting
○ https://www.bugsnag.com/
○ https://sentry.io/
● Learn to accept that scraping will break sometimes
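Matching defensively and handling failure can be sketched with PHP's built-in DOM extension: try several selectors from most to least specific, and treat "nothing matched" as a reportable event rather than a crash. The helper name `firstMatchingText`, the sample markup, and the XPath expressions are assumptions made for illustration.

```php
<?php
/**
 * Tries each XPath expression in order and returns the first non-empty text match.
 * Returns null when nothing matches so the caller can alert instead of crashing.
 */
function firstMatchingText(string $html, array $xpathCandidates): ?string
{
    $dom = new DOMDocument();
    // Real-world HTML is rarely valid; collect parse warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    foreach ($xpathCandidates as $expression) {
        $nodes = $xpath->query($expression);
        if ($nodes !== false && $nodes->length > 0) {
            $text = trim($nodes->item(0)->textContent);
            if ($text !== '') {
                return $text;
            }
        }
    }

    return null;
}

// Hypothetical markup: the preferred id="price" hook is missing.
$html = '<div class="product"><span data-price="19.99">$19.99</span></div>';

$price = firstMatchingText($html, [
    '//*[@id="price"]',                          // preferred hook
    '//*[@data-price]',                          // fallback: data attribute
    '//div[contains(@class, "product")]//span',  // last resort: structural match
]);

if ($price === null) {
    // This is where an alert (for example via Bugsnag or Sentry) would go.
    error_log('Price selector failed; the page structure may have changed.');
}
```

Ordering the candidates from most to least specific keeps the scraper working through small markup changes, while the `null` branch makes the unavoidable breakages visible.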
Scraping Tools
3 Categories of Web Scraping
● Anonymous HTTP Requests (HTML, Images, XML, etc.)
● Testing elements, asserting expected behavior
● Full Browser Automation Tasks
Anonymous Scraping - HTML, Images, etc.
● Fastest
● Easy to run and reproduce
● Just speaking HTTP
● PHP has good DOM parsing tools (Goutte)
Testing elements / asserting expected behavior
● May use HTTP to make basic response assertions
● May use a full browser (think testing Rich JavaScript Apps)
● Useful for user acceptance testing and browser testing
Full Browser Automation
● Like testing, but used for scraping
● Real browser or headless browser
● The closest thing to a real user
● Requires more tooling (e.g., Selenium, WebDriver, PhantomJS)
● Generally runs slowly
(Some) PHP Tools You Can Use for Scraping
● cURL
● Goutte (goot)
● Guzzle
● HTTPFul
● PHP-Webdriver
● file_get_contents()
What Other Tools Have You Used?
HTTP Scraping
Goutte is the Best Option (in my opinion)
Pronounced “goot”
Goutte Overview
● Uses Symfony/BrowserKit to Simulate the Browser
● Uses Symfony/DomCrawler for DOM Traversal/Filtering
● Uses Guzzle for HTTP Requests
● Get and Set Cookies
● History (allows you to go back, forward, clear)
Reference: https://github.com/FriendsOfPHP/Goutte
Goutte Capabilities
● Click on Links and navigate the web
● Extract data / filter data
● Submit forms
● Follows redirects (by default)
● Requests return an instance of
Symfony\Component\DomCrawler\Crawler
Let’s Look at Some Examples of HTTP Scraping
Goutte Examples on Github
Testing and Web Scrapers
Ways you might use web scraping for testing
● Test bulk site redirects before a migration
○ Request the old URLs
○ Assert a 3xx response
○ Assert the redirect location returns a 200
● Functional test suites (e.g., Symfony/Laravel)
● Healthcheck probes / HTTP validation (e.g., a 200 response)
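The bulk redirect check above can be sketched as a small function that takes the HTTP client as a callable, so the logic can be exercised offline. Everything named here is an assumption for illustration: `checkRedirect`, the shape of the array the callable returns, and the example.com-style URLs.

```php
<?php
/**
 * Checks one old URL: expects a 3xx with a Location header whose target returns 200.
 * $fetch is any callable returning ['status' => int, 'location' => ?string].
 */
function checkRedirect(callable $fetch, string $oldUrl): array
{
    $first = $fetch($oldUrl);
    if ($first['status'] < 300 || $first['status'] >= 400) {
        return ['ok' => false, 'reason' => "expected 3xx, got {$first['status']}"];
    }
    if (empty($first['location'])) {
        return ['ok' => false, 'reason' => 'redirect without Location header'];
    }

    $target = $fetch($first['location']);
    if ($target['status'] !== 200) {
        return ['ok' => false, 'reason' => "target returned {$target['status']}"];
    }

    return ['ok' => true, 'reason' => 'redirects to ' . $first['location']];
}

// Stand-in for a real HTTP client so the sketch runs without a network.
$fakeSite = [
    'http://old.example/about'   => ['status' => 301, 'location' => 'http://new.example/about'],
    'http://new.example/about'   => ['status' => 200, 'location' => null],
    'http://old.example/missing' => ['status' => 404, 'location' => null],
];
$fetch = function (string $url) use ($fakeSite): array {
    return $fakeSite[$url] ?? ['status' => 0, 'location' => null];
};

foreach (['http://old.example/about', 'http://old.example/missing'] as $url) {
    $result = checkRedirect($fetch, $url);
    echo $url, ' => ', $result['ok'] ? 'OK' : 'FAIL', ' (', $result['reason'], ")\n";
}
```

A real `$fetch` would wrap whichever client you use, configured not to follow redirects automatically; otherwise the 3xx response is never seen and the first assertion cannot be made.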
Example Functional Test Asserting HTML
http://symfony.com/doc/current/testing.html#your-first-functional-test
Example Functional Test Asserting Status
https://laravel.com/docs/5.4/http-tests#introduction
Example Functional Browser Test
https://laravel.com/docs/5.4/dusk#getting-started
Full Browser Automation
Why do we need full browser automation tools?
● Simulate real browsers
● Test/Work with Async JavaScript applications
● Automate testing that applications work as expected
● Replace repetitive manual QA with automation
● Run tests in multiple browsers
● Advanced Web Scraping (e.g., filtered reports)
Notable Tools in Browser Automation
● Selenium
● W3C WebDriver (https://www.w3.org/TR/webdriver/)
● Headless Browsers
○ PhantomJS
○ Chrome --headless*
○ ZombieJS
* Chromedriver isn’t quite working with --headless yet, at least for me ¯_(ツ)_/¯
Notable PHP Tools in Browser Automation
● Behat / Mink
● PHP-Webdriver
○ Codeception
○ Laravel Dusk (recently)
● Steward
● Any others you consider notable?
Notables in Other Languages...
● Python
○ Selenium WebDriver Bindings
○ BeautifulSoup
○ Requests: HTTP for Humans
○ Scrapy
● Ruby
○ Capybara
○ Nokogiri (DOM Parsing)
○ Mechanize Gem
● JavaScript
○ Nightwatch.js
○ Zombie
○ PhantomJS
○ Webdriver.io
○ CasperJS
○ SlimerJS
Why Use PHP for Web Browser Automation?
● Developers don’t have to learn a new language (good/bad)
● More participation in teams already writing PHP
● Reduce cross-language mental overhead
● Browser Automation can be closer to your domain logic
● PHP-Webdriver is Good Enough™ (and backed by Facebook)
How Do I Run PHP Browser Automation?
● `chrome --headless` - as of Chrome 59
● Standalone Selenium
● WebDriver
● PhantomJS
● Any other ways?
How Do I Run This Stuff?
Run Chrome Headless (Chrome 59 Stable)
$ alias chrome="/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"
$ chrome --headless --disable-gpu --print-to-pdf https://www.github.com/
$ open output.pdf
$ chrome --headless --disable-gpu --dump-dom
$ chrome --headless --disable-gpu --repl https://www.chromestatus.com/
Reference: https://developers.google.com/web/updates/2017/04/headless-chrome
Getting to Know PHP-WebDriver
WebDriver Examples on Github
Running the Chromedriver/Phantom Process
Techniques for Triggering Browser Automation
● Eager tasks - run on a schedule
● On-demand - one-off console commands
● Event trigger - event queue
● What are some other ways?
Intro to Laravel Dusk
● Browser testing for Laravel projects (primary use case)
● Browser abstraction on top of PHP-Webdriver <3
● Doesn’t require JDK or Selenium (you can still use them)
● Uses standalone ChromeDriver
Do I HAVE to use Laravel to Use Dusk!?
But I am going to show you why
it's great for web automation stuff...
Dusk Basics: Elements
Dusk Basics: Links/Events
Dusk Basics: Form Inputs
Dusk Basics: Waiting for Elements
Quick Comparison to Our Earlier Vanilla
PHP-Webdriver Example
Webdriver Dusk Examples on Github
Running Browser Automation
Key Laravel Features for Browser Automation
● Scheduler to run Commands on a schedule (eager)
● Create Custom Console Commands (one-off)
● Built-in Queues (triggered)
● Database Migrations for quick modeling of data storage
● Service Container for browser automation classes
Scheduler (app/Console/Kernel.php)
Custom Console Commands
● Easily run one-off commands
● Scheduler uses commands, giving you both
● Laravel uses the Symfony Console and adds conveniences
● Commands run my browser scraping
Queues
● Easily trigger web scraping jobs
● Queue jobs can trigger console commands
● Laravel has a built-in queue worker
● Redis is my preferred queue driver
Running Browser Automation in Docker
How Do I Run PHP Browser Automation on a Server!?
XVFB
XVFB. What the What!?
“Xvfb (short for X virtual framebuffer) is an in-memory display
server for UNIX-like operating system (e.g., Linux). It enables you
to run graphical applications without a display (e.g., browser
tests on a CI server) while also having the ability to take
screenshots.”
Reference: http://elementalselenium.com/tips/38-headless
Example Xvfb Usage
$ Xvfb :99 -screen 0 1920x1200x16 &
Our Requirements for a Docker Scheduler
● Google Chrome Stable
● Chromedriver
● Xvfb
● PHP
● Entrypoint to run the scheduler
Running in Docker
Our Docker Setup
● Docker Official php:7.1.6-cli (Scheduler)
● Docker Official php:7.1.6-fpm (Web Container)
● Docker Compose
● Redis
● MySQL
Why Not the Official Selenium Image?
● If you need File Downloads through Chrome
● Downloads through volumes aren’t ideal
● If you want the same PHP installation on app and scheduler
(I do)
Scheduler Dockerfile
● Extends php:7.1.6-cli
● Installs Chrome Stable + a script to take Chrome out of
sandbox mode
● Installs Chromedriver
● Installs Required PHP Modules
● Copies Application Files
● Runs a custom entrypoint script
Review the Scheduler Docker Files
How Do I Download Files through Chrome?
Extending Dusk Browser - Hooking it Together
● Provide our Own Browser class
● A DownloadsManager class for chrome downloads
● A DownloadedFile Class to Work with Downloaded Files
● Service Container Bindings in AppServiceProvider
● Example Command
● Let's see it in action...
Full Docker Setup in Action
(Demo)
My Projects
Lumen Programming Guide
http://www.apress.com/la/book/9781484221860
You will learn to write test-driven (TDD)
microservices, REST APIs, and web service
APIs with PHP using the Lumen micro-
framework.
* Zero bugs in the book source code ;)
My Projects
Docker for PHP Developers
https://leanpub.com/docker-for-php-developers
A hands-on guide to learning how to use
Docker as your primary development
environment. It covers a diverse range of
topics and scenarios you will face as a
PHP developer picking up Docker.
Final Questions?
Thank You!

Scraping the web with Laravel, Dusk, Docker, and PHP

  • 1. Scraping the Web with Laravel Dusk, Docker, and PHP By: Paul Redmond @paulredmond paulredmond
  • 2. What You’ll Learn? ● Different types of scraping and when to use them ● Use Laravel Dusk for rapid browser automation ● Different Ways to Run Browser Automation ● Run Browser Automation in a Server Environment
  • 3. What is Web Scraping? It’s a dirty job Gathering data from HTML and other media for the purposes of testing, data enrichment, and collection. https://flic.kr/p/8EZMNk
  • 4. Hundreds of Billions Google “Scrapes” Hundreds of Billions (Or More) of Pages and other media on the web. https://www.google.com/search/howsearchworks/crawling-indexing/
  • 5. Why Do We Need Scraping? ● Market analysis ● Gain a competitive advantage ● Increase learning and understanding ● Monitor trends ● Combine multiple offers into one portal (ie. Shopping comparisons) ● Analytics
  • 6. Other Types of Data Scraping ● Competitor Scanning ● Military Intelligence ● Surveillance ● Metering
  • 7. Other Types of Data Scraping
  • 8. Other Types of Data Scraping
  • 9. Is Web Scraping Legitimate? ● Yes, it can be. ● Scraping can have a negative/bad connotation, so... ○ Don’t do bad / illegal stuff ○ Be nice ○ Be careful ○ Be respectful
  • 10. Keeping Web Scraping Legitimate ● Speed ● Caution ● Intent ● Empathy ● Honesty
  • 11. Keeping Web Scraping Legitimate ● Speed. Go slow (watch requests/second) ● Caution. Code mistakes could create unintended load! ● Intent. Even if your intention is pure, always question. ● Empathy. Put yourself in the shoes of website owners ● Honesty. Don’t steal stuff (PII, copyrights, etc.)
  • 12. Keep Robots.txt in Mind...Be a Good Bot ● https://www.google.com/robots.txt ● https://www.yahoo.com/robots.txt ● https://github.com/robots.txt (see the top comment) * PHP Robots Parser: https://github.com/webignition/robots-txt-file
  • 13. When Do We Scrape? ● What is the purpose? ● Can we live without the data? ● Do they have an API? ● If yes, does the API have everything we need? ● Do they allow scraping?
  • 14. Downsides of Scraping ● Changes in the HTML/DOM breaks scrapers ● Changes in the HTML/DOM breaks scrapers ● Changes in the HTML/DOM breaks scrapers ● Changes in the HTML/DOM breaks scrapers ● Rich JavaScript apps can cause headaches ● Scraping can be process/memory and time intensive ● More manual processing/formatting of collected data than an API ● Changes in the HTML/DOM breaks scrapers
  • 15. How Do we Overcome the Downsides? ● Match DOM/Selectors defensively ● It's a bit of an art that takes practice and experience ● Make sure that you handle failure ● Good alerting, notifications, and reporting ○ https://www.bugsnag.com/ ○ https://sentry.io/ ● Learn to accept that scraping will break sometimes
  • 17. 3 Categories of Web Scraping ● Anonymous HTTP Requests (HTML, Images, XML, etc.) ● Testing elements, asserting expected behavior ● Full Browser Automation Tasks
  • 18. Anonymous Scraping - HTML, Images, etc. ● Fastest ● Easy to run and reproduce ● Just speaking HTTP ● PHP has a Good DOM Parsing Tools (Goutte)
  • 19. Testing elements / asserting expected behavior ● May use HTTP to make basic response assertions ● May use a full browser (think testing Rich JavaScript Apps) ● Useful for user acceptance testing and browser testing
  • 20. Full Browser Automation ● Like testing, but used for scraping ● Real browser or headless browser ● The closest thing to a real user ● Requires more tooling (ie. Selenium, WebDriver, Phantom) ● Runs slow in general
  • 21. ● cURL ● Goutte (goot) ● Guzzle ● HTTPFul ● PHP-Webdriver ● file_get_contents() (Some) PHP Tools You Can Use for Scraping
  • 22. What Other Tools Have You Used?
  • 24. Goutte is the Best Option (in my opinion) Pronounced “goot” HTTP Scraping
  • 25. Goutte Overview ● Uses Symfony/BrowserKit to Simulate the Browser ● Uses Symfony/DomCrawler for DOM Traversal/Filtering ● Uses Guzzle for HTTP Requests ● Get and Set Cookies ● History (allows you to go back, forward, clear) Reference: https://github.com/FriendsOfPHP/Goutte HTTP Scraping
  • 26. Goutte Capabilities ● Click on Links and navigate the web ● Extract data / filter data ● Submit forms ● Follows redirects (by default) ● Requests return an instance of SymfonyComponentDomCrawlerCrawler HTTP Scraping
  • 27. Let’s Look at Some Examples of HTTP Scraping Goutte Examples on Github HTTP Scraping
  • 28. Testing and Web Scrapers
  • 29. Ways you might use web scraping for testing ● Test bulk site redirects before a migration ○ Request the old URLs ○ Assert a 3xx response ○ Assert the redirect location returns a 200 ● Functional test suites (ie. Symfony/Laravel) ● Healthcheck Probes / HTTP validation (ie. 200 response) Testing and Web Scrapers
  • 30. Example Functional Test Asserting HTML Testing and Web Scrapers http://symfony.com/doc/current/testing.html#your-first-functional-test
  • 31. Example Functional Test Asserting Status Testing and Web Scrapers https://laravel.com/docs/5.4/http-tests#introduction
  • 32. Example Functional Browser Test Testing and Web Scrapers https://laravel.com/docs/5.4/dusk#getting-started
  • 34. Why do we need full browser automation tools? Full Browser Automation
  • 35. Why do we need full browser automation tools? ● Simulate real browsers ● Test/Work with Async JavaScript applications ● Automate testing that applications work as expected ● Replace repetitive manual QA with automation ● Run tests in multiple browsers ● Advanced Web Scraping (ie. filtered reports) Full Browser Automation
  • 36. Noteable Tools in Browser Automation ● Selenium ● W3 WebDriver (https://www.w3.org/TR/webdriver/) ● Headless Browsers ○ PhantomJS ○ Chrome --headless* ○ ZombieJS * Chromedriver isn’t quite working with --headless yet, at least for me ¯_(ツ)_/¯ Full Browser Automation
  • 37. Noteable PHP Tools in Browser Automation ● Behat / Mink ● PHP-Webdriver ○ Codeception ○ Laravel Dusk (recently) ● Steward ● Any others you consider noteable? Full Browser Automation
  • 38. Notables in Other Languages... ● Python ○ Selenium WebDriver Bindings ○ BeautifulSoup ○ Requests: HTTP for Humans ○ Scrapy ● Ruby ○ Capybara ○ Nokogiri (DOM Parsing) ○ Mechanize Gem Full Browser Automation
  • 39. Notables in Other Languages... ● JavaScript ○ Nightwatch.js ○ Zombie ○ PhantomJS ○ Webdriver.io ○ CasperJS ○ SlimerJS Full Browser Automation
  • 40. Why Use PHP for Web Browser Automation? ● Developers don’t have to learn a new language (good/bad) ● More participation in teams already writing PHP ● Reduce cross-language mental overhead ● Browser Automation can be closer to your domain logic ● PHP-Webdriver is Good Enough™ (and backed by Facebook) Full Browser Automation
  • 41. How Do I Run PHP Browser Automation?
  • 42. How Do I Run PHP Browser Automation? ● `chrome --headless` - as of Chrome 59 ● Standalone Selenium ● WebDriver ● PhantomJS ● Any other ways? How Do I Run This Stuff?
  • 43. Run Chrome Headless (Chrome 59 Stable) $ alias chrome="/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" $ chrome --headless --disable-gpu --print-to-pdf https://www.github.com/ $ open output.pdf $ chrome --headless --disable-gpu --dump-dom $ chrome --headless --disable-gpu --repl https://www.chromestatus.com/ Reference: https://developers.google.com/web/updates/2017/04/headless-chrome How Do I Run This Stuff?
  • 44. Getting to Know PHP-WebDriver WebDriver Examples on Github How Do I Run This Stuff?
  • 45. Running the Chromedriver/Phantom Process How Do I Run This Stuff?
  • 46. Techniques for Triggering Browser Automation ● Eager tasks - run on a schedule ● On-demand - one-off console commands ● Event trigger - event queue ● What are some other ways? How Do I Run This Stuff?
  • 48. Intro to Laravel Dusk ● Browser testing for Laravel projects (primary use case) ● Browser abstraction on top of PHP-Webdriver <3 ● Doesn’t require JDK or Selenium (you can still use them) ● Uses standalone ChromeDriver
  • 49. Do I HAVE to use Laravel to Use Dusk!?
  • 50. Do I HAVE to use Laravel to Use Dusk!?
  • 51. But I am going to show you why it’s great for web automation stuff...
  • 55. Dusk Basics: Waiting for Elements
  • 56. Quick Comparison to Our Earlier Vanilla PHP-Webdriver Example Webdriver Dusk Examples on Github
  • 58. Key Laravel Features for Browser Automation ● Scheduler to run Commands on a schedule (eager) ● Create Custom Console Commands (one-off) ● Built-in Queues (triggered) ● Database Migrations for quick modeling of data storage ● Service Container for browser automation classes
  • 60. Custom Console Commands ● Easily run one-off commands ● Scheduler uses commands, giving you both ● Laravel uses the Symfony Console and adds conveniences ● Commands run my browser scraping
  • 61. Queues ● Easily trigger web scraping jobs ● Queue jobs can trigger console commands ● Laravel has a built-in queue worker ● Redis is my preferred queue driver
  • 65. How Do I Run PHP Browser Automation on a Server!? How Do I Run This Stuff?
  • 66. How Do I Run PHP Browser Automation on a Server!? How Do I Run This Stuff? XVFB
  • 67. XVFB. What the What!? “Xvfb (short for X virtual framebuffer) is an in-memory display server for UNIX-like operating system (e.g., Linux). It enables you to run graphical applications without a display (e.g., browser tests on a CI server) while also having the ability to take screenshots.” Reference: http://elementalselenium.com/tips/38-headless How Do I Run This Stuff?
  • 68. Example Xvfb Usage $ Xvfb :99 -screen 0 1920x1200x16 & How Do I Run This Stuff?
  • 69. Example Xvfb Usage How Do I Run This Stuff?
  • 70. Our Requirements for a Docker Scheduler ● Google Chrome Stable ● Chromedriver ● Xvfb ● PHP ● Entrypoint to run the scheduler Running in Docker
  • 71. Our Docker Setup ● Docker Official php:7.1.6-cli (Scheduler) ● Docker Official php:7.1.6-fpm (Web Container) ● Docker Compose ● Redis ● MySQL Running in Docker
  • 72. Why Not the Official Selenium Image? ● If you need File Downloads through Chrome ● Downloads through volumes aren’t ideal ● If you want the same PHP installation on app and scheduler (I do) Running in Docker
  • 73. Scheduler Dockerfile ● Extends php:7.1.6-cli ● Installs Chrome Stable + a script to take chrome out of sandbox mode ● Installs Chromedriver ● Installs Required PHP Modules ● Copies Application Files ● Runs a custom entrypoint script Running in Docker
  • 74. Scheduler Dockerfile Review the Scheduler Docker Files Running in Docker
  • 75. How Do I Download Files through Chrome? Running in Docker
  • 76. Extending Dusk Browser - Hooking it Together ● Provide our Own Browser class ● A DownloadsManager class for chrome downloads ● A DownloadedFile Class to Work with Downloaded Files ● Service Container Bindings in AppServiceProvider ● Example Command ● Let’s see it in action... Running in Docker
  • 77. Full Docker Setup in Action (Demo) Running in Docker
  • 78. My Projects Lumen Programming Guide http://www.apress.com/la/book/9781484221860 You will learn to write test-driven (TDD) microservices, REST APIs, and web service APIs with PHP using the Lumen micro- framework. * Zero bugs in the book source code ;)
  • 79. My Projects Docker for PHP Developers https://leanpub.com/docker-for-php-developers A hands-on guide to learning how to use Docker as your primary development environment. It covers a diverse range of topics and scenarios you will face as a PHP developer picking up docker.

Editor's notes

  1. The presentation is about you. I would like an open discussion during the presentation so you can get the most out of it and share your ideas on the topic.
  2. I came up with my own definition. It’s neither wrong nor right. What do you think about when you think of web scraping?
  3. Google is the biggest web scraper in the world. I don’t have any hard data, but I’m right. Has anyone dealt with performance issues on a site at scale due to GoogleBot traffic?
  4. Scraping is used as a business tool to look for growth, opportunities, trends, history, and analytics. For example, in order for Best Buy to stay competitive both online and in a brick-and-mortar store, they need to understand the competition’s pricing. Google uses complex search algorithms to give you the most relevant content. This builds a relationship of trust and usefulness, and in turn you will likely visit sites through which Google provides advertising. Advertising revenue accounts for 88% of Google’s (Alphabet) revenue. (Source: http://www.businessinsider.com/how-google-apple-facebook-amazon-microsoft-make-money-chart-2017-5) Google Shopping collects deals for search results from multiple stores, which is an example of scraping data and then combining it in new and (potentially) useful ways.
  5. I like to relate topics to otherwise unrelated things. These are potentially related examples of data scraping for the purposes of gain in different contexts. They are neither wrong nor right; they are just ways to trigger both sides of your brain to think creatively about “data scraping”.
  6. Observation balloons were used heavily in World War I for artillery observation. They provided more data about enemy positions and artillery, which observers reported back to the ground. This was a huge competitive advantage that all sides exploited. In many ways the same principles apply to scraping: take a higher position, collect enrichment data, report back, and then make decisions from the new data.
  7. Maybe this is a stretch, but that’s why it’s a fun exercise. Water meters run all the time, and meter reads happen on a schedule. Generally someone drives around and logs the reading. This reading is processed for millions of homes, collected, and then invoiced. This data is then later collected by other parties to document water usage on a macro level, and provides many insights into predicted water usage as a populace grows. Relating programming and product concepts to unrelated things is a powerful skill. It helps the concepts stick for me, because I can associate them with familiar everyday things, kind of like how we refer to pointing devices as a “mouse”.
  8. Like many other things, scraping can be used for good and bad. As with any other business practice, common sense will help you steer away from shady/illegal web scraping practices. I have combed over access logs for top-50 US trafficked properties. I’ve seen so many weird user agents and bots, and I’ve banned many for overwhelming our infrastructure. When a bot’s intent is unclear, even if it’s good, site owners respond by blocking it. Don’t use scraping to do shady or questionable things, like producing artificial traffic to your site or collecting user data that you shouldn’t.
  9. Use these 5 words when designing web scrapers: Speed, Caution, Intent, Empathy, Honesty.
  10. Speed: Go slow. Respect that your automation can produce a lot of traffic. Caution: Mistakes in automation code can unintentionally delete things, create load through runaway recursion, etc. Intent: Even when you have good intentions, always question your approach. Which leads to… Empathy: Put yourself in the shoes of the site owners. What would you do if a single IP started sending hundreds of requests per second at your server? Honesty: To reiterate, don’t do illegal/shady things. Don’t collect Personally Identifiable Information, treat copyrights with respect, etc.
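
One way to act on the "go slow" advice is to put a small throttle in front of every request. The sketch below is a hypothetical helper, not something from the talk or from any library: the `RequestThrottle` class name and the requests-per-second parameter are assumptions for illustration.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: enforces a minimum gap between consecutive requests
 * so a scraper never exceeds a chosen requests-per-second rate.
 */
final class RequestThrottle
{
    /** @var float Minimum number of seconds between two requests. */
    private $minInterval;

    /** @var float|null Timestamp of the previous request, if any. */
    private $lastRequestAt;

    public function __construct(float $requestsPerSecond)
    {
        if ($requestsPerSecond <= 0) {
            throw new InvalidArgumentException('Rate must be positive.');
        }

        $this->minInterval = 1 / $requestsPerSecond;
    }

    /** Sleep just long enough to stay under the configured rate. */
    public function wait(): void
    {
        if ($this->lastRequestAt !== null) {
            $remaining = $this->minInterval - (microtime(true) - $this->lastRequestAt);

            if ($remaining > 0) {
                usleep((int) round($remaining * 1000000));
            }
        }

        $this->lastRequestAt = microtime(true);
    }
}
```

Calling `$throttle->wait()` immediately before each page visit in a loop keeps the pace bounded even when a bug makes the loop run far more often than intended.
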
  11. Be respectful to site owners. This talk doesn’t go deeply into how to parse/handle the robots.txt file, but be aware of it. In my experience, my use case has been data that I would gladly consume from an API or another source like RSS. Use those if they are available. Do use scraping when something isn’t possible (or provided) and you have deemed your use case is appropriate. Always err on the side of getting permission.
  12. This is just a high-level guideline of my thought process. Generally I look for an API to collect the data I want. I’ve even used inbound email webhooks over scraping because, while scraping can be effective, it is also error-prone sometimes. Which brings me to the downsides of scraping...
  13. Async elements can be challenging because you have to write logic to wait for something async to finish before your scraper proceeds. You will typically have to deal with more manual processing of data than with an API. Scraping can be process/memory/time intensive; you are running a browser, after all. Changes in the HTML/DOM will break your stuff, and there’s nothing you can do about it except update your code.
  14. Even a well-written defensive scraper is prone to breaking when the source changes their HTML / DOM / input names, etc. I try to make selectors as simple and flexible as possible, without losing integrity of selecting the right element. The more specific your selector, the easier a change will break it. Sometimes you can’t do anything to stop it from breaking other than updating your code. Make sure you have good error reporting so you can know about scraping issues. I use Bugsnag, and I’ve also heard many love Sentry. You can typically get them working in less than an hour.
  15. I am going to group scraping into 3 categories for our purposes. Roughly they include: anonymous HTTP requests and parsing responses; browser acceptance testing; and full browser automation jobs.
  16. Anonymous scraping is faster than full browser automated scraping. Use it whenever possible. You use this when you want to collect raw HTML on the page and do something with it. For example, you could write a check that makes sure your page has your Analytics code and the right ID. This type of scraping is easy because we are just speaking HTTP and getting HTTP responses back. PHP has good DOM parsing tools, for example DOMDocument directly or symfony/dom-crawler. We will cover a good tool that uses the symfony/dom-crawler component.
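
The analytics check mentioned in note 16 could look roughly like the following, using only PHP's built-in DOMDocument. This is a minimal sketch: the function name `pageHasAnalyticsId` and the `UA-12345-1` ID in the usage note are made up for illustration, and fetching the HTML (for example with Guzzle or Goutte) is left out.

```php
<?php

declare(strict_types=1);

/**
 * Return true when any <script> tag in the HTML mentions the analytics ID,
 * either in its src attribute or in its inline body.
 * Hypothetical helper for illustration; the ID format is an assumption.
 */
function pageHasAnalyticsId(string $html, string $analyticsId): bool
{
    if (trim($html) === '') {
        return false;
    }

    $document = new DOMDocument();

    // Real-world HTML is rarely valid, so collect libxml warnings silently.
    $previous = libxml_use_internal_errors(true);
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($document->getElementsByTagName('script') as $script) {
        $haystack = $script->getAttribute('src') . ' ' . $script->textContent;

        if (strpos($haystack, $analyticsId) !== false) {
            return true;
        }
    }

    return false;
}
```

Calling `pageHasAnalyticsId($html, 'UA-12345-1')` only inspects script tags, so an ID that merely appears in visible page text does not count as a match.
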
  17. The second type of scraping we will briefly look at is testing. Test suites might use basic HTTP to make assertions. Some test suites use full browser emulation (e.g. Behat). Scraping / browser emulation is great for user acceptance testing and for making sure the same code works in multiple browsers.
  18. Full browser automation is used in tests, but what I am talking about is full browser automation for the purposes of scraping websites. It’s the closest thing to a real user. It requires more tooling to run, such as Selenium standalone server, WebDriver, and/or PhantomJS. Full browser automation is slower than basic HTTP scraping.
  19. Here are some tools you could use for the purposes of scraping websites.
  20. Let’s dig a little bit into HTTP scraping sites. This is the most basic, fastest form of scraping I mentioned earlier.
  21. My favorite tool is Goutte (pronounced goot) because it provides some nice conveniences. We will jump into a few examples that use Goutte after a quick overview. How many of you are familiar with, or have used Goutte on a project?
  22. Goutte uses Symfony BrowserKit to simulate a browser. Goutte uses Symfony DomCrawler for DOM traversal, filtering, and working with elements. Under the hood, Goutte uses Guzzle, and you can tweak Guzzle settings if needed. Goutte can get and set cookies. Goutte provides a history you can use to review past requests and do things like go back/forward/clear.
  23. Some of Goutte’s capabilities that provide you some nice abstraction over the DOM: click on links to navigate; extract / filter data; submit forms; follow redirects by default; and get a Crawler instance to traverse the DOM.
  24. I don’t want to spend a ton of time on this section, but thought I’d mention a couple points about how functional testing is similar to scraping. Functional tests use the browser to make a request, interact with the page, and make assertions. Scraping does the same thing, but instead also collects data for storage, analysis, and enrichment.
  25. I’ve used Goutte to write an ad-hoc test suite to verify hundreds of thousands of redirects during a large site migration. I was able to create rewrite rules and then test them in a sandboxed production environment before going live. This was a huge benefit to making sure that our migrations went well, and that we didn’t lose pagerank because of some bad responses and redirects. Functional test suites are super helpful to automate boring/repetitive QA. I have been in some dire situations where regression with manual QA took weeks. Automated web tests save valuable time and give you confidence on a user / behavior level that your unit tests cannot. I’ve written healthcheck probes that I consider tests. They assert that your site or service is running as expected.
  26. An example of a functional test using the Symfony framework
  27. An example of a functional test using the Laravel framework
  28. This is the first simple example with Dusk, which uses chromedriver and makes PHPUnit assertions to perform functional testing in a real browser
  29. Sometimes browser automation is the only way you can possibly collect the data you need. I’ve recently had to write some browser automation to work with 3rd parties that are providing some reporting to us, but don’t provide an API. You might need to use a combination of web scraping and/or inbound email attachments (I use Mailgun with Webhooks) in order to get the data because the 3rd party doesn’t provide an API. This is one example where web scraping can help.
  30. These are some of the more notable tools in browser automation. Selenium’s WebDriver becomes a W3C Web Standard - https://www.linkedin.com/pulse/seleniums-web-driver-become-w3c-standard-tom-weekes
  31. I had actually never heard of Steward before preparing for this presentation. It seems really easy to get some PHPUnit assertions going with browser tests. Behat (http://behat.org/en/latest/): Behat is a Behavior Driven Development (BDD) framework automating browser acceptance testing with cucumber implementations. Mink (http://mink.behat.org/en/latest/): Mink is a “browser controller/emulator for web applications”. Steward (https://github.com/lmc-eu/steward): “Steward: easy and robust testing with Selenium WebDriver + PHPUnit. Steward is set of libraries made to simplify writing and running robust functional system tests in PHPUnit using Selenium WebDriver.” PHP-Webdriver (https://github.com/facebook/php-webdriver): PHP-Webdriver is the most complete and advanced set of PHP bindings to the W3C WebDriver specification. It’s used by Codeception, Dusk, and Steward, and probably many others that I’ve never used before. Codeception: http://codeception.com/ Dusk: https://laravel.com/docs/5.4/dusk
  32. I don’t have extensive experience with many of these tools. I have come to appreciate some of these libraries though. The Selenium WebDriver bindings in Python are solid. I am a little jealous of this package. Python: http://selenium-python.readthedocs.io/ https://www.crummy.com/software/BeautifulSoup/ http://docs.python-requests.org/en/master/ https://scrapy.org/ Ruby: http://teamcapybara.github.io/capybara/ http://www.rubydoc.info/github/sparklemotion/nokogiri https://rubygems.org/gems/mechanize
  33. Some of you might be familiar with these tools. Has anyone used any of these tools to do web scraping? Maybe describe a little bit about what you were doing? What you liked about these tools? What you didn’t like, or what was difficult? JavaScript http://nightwatchjs.org/ http://zombie.js.org/ http://phantomjs.org/ http://webdriver.io/ http://casperjs.org/ https://slimerjs.org/
  34. Recently I was struggling a little to put together some browser automation in PHP and I started down the path of using JavaScript with CasperJS. It’s a good tool, but a couple of things bothered me about it. My domain logic was written in PHP, and wasn’t portable with my codebase. I wanted to spend as much time in PHP as possible. I am personally a big fan of the most simple stack possible. I didn’t want to deal with context switching; I felt like I would be more productive writing my automation tools in the same language as my web application. I wanted my scraping to have access to some of the business domain code (that is well tested) I’ve written in PHP. I am not knocking these tools. I am simply saying that, for me, I knew I would be more productive in PHP.
  35. The main ways that I run PHP browser automation: PhantomJS, and WebDriver with Chrome. I threw in Chrome headless because Chrome recently (v59) shipped a headless mode.
  36. Here are a couple examples you can use to run chrome headless to experiment a bit with it. The slides are using a Mac, so you will have to adapt it to your OS.
  37. Let’s spend a little time getting familiar with PHP-Webdriver... Only show the webdriver.php example at this point… php webdriver.php
  38. I wanted to point out that if you look through the source code of Laravel Dusk you will see that Dusk ships with a copy of chromedriver and runs it as a process using Symfony’s Process component.
  39. What are some ways that you can trigger browser automation? I am not suggesting these are the only ways, but this is how I’ve categorized my own usage patterns. Eager: you want to eagerly go after some scraping data on a schedule. On-demand: you want to run automation when you ask for it (i.e. a console command). Event-based: some event triggers the need to scrape data (e.g. a user submits a sitemap.xml in WebMaster tools). What are some other ways? Any thoughts?
  40. Now that I’ve demonstrated PHP-Webdriver, I feel like it’s OK to use, but that a nice abstraction layer would speed development up nicely. Enter Dusk. Dusk is a project created to help browser test Laravel applications. It has stubs for setting up acceptance testing, but at its core, the Browser class is nicely abstracted with a lot of convenience around web browser automation. You don’t have to touch PHP-Webdriver directly, although you can easily get at Facebook WebDriverElement instances. As you noticed in the barebones webdriver demo, waiting takes manual work; Dusk also provides convenience around waiting for elements and satisfying assertions before moving on.
  41. Laravel Dusk’s main purpose is browser testing Laravel projects. At the heart of all this is a really great Browser class that abstracts away many tedious things you have to do with PHP-Webdriver directly. This is not a knock on PHP-Webdriver AT ALL. But I appreciate the high-level abstractions that Dusk provides for common things I need to accomplish.
  42. I am not asking that myself; I love using Laravel. It’s my go-to framework. But just in case you are asking whether you have to use the Laravel framework in order to use Laravel Dusk…
  43. You can install Laravel Dusk as a dependency in non-Laravel projects. You will be responsible for booting up chromedriver and creating a RemoteWebDriver instance (which I am going to demonstrate), but you can use the \Laravel\Dusk\Browser class as the foundation for your browsing needs with automation. This is actually how I started using Dusk - I used it to simplify my browser automation needs.
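
A minimal sketch of what note 43 describes, assuming `laravel/dusk` has been installed with Composer and that chromedriver is already running on its default port (9515). This is not the code from the talk's demo; the target URL and the `h1` selector are placeholders.

```php
<?php

require __DIR__ . '/vendor/autoload.php';

use Facebook\WebDriver\Remote\DesiredCapabilities;
use Facebook\WebDriver\Remote\RemoteWebDriver;
use Laravel\Dusk\Browser;

// Assumption: chromedriver was started separately and listens on port 9515.
$driver = RemoteWebDriver::create(
    'http://localhost:9515',
    DesiredCapabilities::chrome()
);

// Wrap the raw WebDriver session in Dusk's Browser abstraction.
$browser = new Browser($driver);

try {
    // Placeholder URL and selector, for illustration only.
    $heading = $browser->visit('https://example.com')
        ->waitFor('h1')
        ->text('h1');

    echo $heading, PHP_EOL;
} finally {
    // Always end the session so chromedriver does not leak browser processes.
    $browser->quit();
}
```

Outside of a Laravel application nothing starts chromedriver for you, which is why the script connects to an existing process instead of spawning one.
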
  44. There are convenient things I am using in my Laravel projects to run scheduled browser automation. Laravel makes running queues and scheduled tasks really convenient. For example, I need to download multiple files and run some data analysis on those files. Every day in the morning I run scheduled tasks to process the data via Laravel’s scheduler.
  45. Here are some basic methods you can use to work with elements: value() gets the value of an input; value() with a second argument sets the value of an input; text() gets the text of an element; attribute() gets an attribute from an element.
  46. Working with links: clickLink() clicks a link by finding it by its text; click() clicks an element matching a selector; mouseover() hovers over an element; drag() drags an element (I have personally never used it, but it’s cool, no?); with() scopes element tasks within a selector.
  47. Working with form inputs: type() types some text in an input; clear() clears an input; select() selects an option from a <select/>; select() without a value selects a random option; press() presses a button.
  48. Waiting for elements before proceeding, which is for async things: waitFor() waits for a selector to become available; waitFor() can also be given a timeout, for example waiting one second for an element with a selector; waitForText() waits for the text before proceeding; waitUntil() evaluates some JavaScript and doesn’t proceed until it’s true.
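
The waiting methods from note 48 chain together on a Dusk Browser instance. The following is a rough sketch rather than code from the slides: the URL, the selectors, the "Total" text, and the `window.reportLoaded` flag are all invented for illustration, and the function expects an already constructed Browser.

```php
<?php

use Laravel\Dusk\Browser;

/**
 * Hypothetical scraping step: wait for an asynchronously rendered report
 * and return the text of its total cell.
 */
function readReportTotal(Browser $browser): string
{
    return $browser->visit('https://example.com/reports')
        // Wait up to 10 seconds for the table element to appear.
        ->waitFor('#report-table', 10)
        // Then wait for text that is rendered by JavaScript.
        ->waitForText('Total', 10)
        // Then wait until a JavaScript expression evaluates to true.
        ->waitUntil('window.reportLoaded === true', 10)
        // Finally read the text of the element we care about.
        ->text('#report-table .total');
}
```

If any wait times out, the underlying WebDriver call throws, so the caller can catch the exception and report the failure instead of scraping a half-rendered page.
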
  49. All of these things can... a) help you to rapidly build applications b) quickly hook in some browser automation alongside your project So that you can… Build effective browser automation tools quickly
  50. You can define a task schedule with a fluent API that runs your console commands on a schedule. The way the Laravel scheduler works is that one cron entry is triggered every minute. I am going to show you how I run my scheduler in Docker in a slightly different way, but the concept is the same. You can schedule all your web automation here.
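
As a sketch of the fluent scheduling API, a console kernel might register the scraping commands like this. The `browser:download` and `browser:extract` command names come from the demo notes; the times and frequencies are assumptions, not the schedule used in the talk.

```php
<?php

namespace App\Console;

use Illuminate\Console\Scheduling\Schedule;
use Illuminate\Foundation\Console\Kernel as ConsoleKernel;

class Kernel extends ConsoleKernel
{
    /**
     * Define the application's command schedule.
     */
    protected function schedule(Schedule $schedule)
    {
        // Assumed timing: download files every morning at 06:00.
        $schedule->command('browser:download')
            ->dailyAt('06:00')
            ->withoutOverlapping();

        // Assumed timing: run the extraction once an hour.
        $schedule->command('browser:extract')
            ->hourly()
            ->withoutOverlapping();
    }
}
```

The single cron entry then only needs to run `php artisan schedule:run` every minute; Laravel decides which of the registered commands are due.
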
  51. You can define custom console commands in Laravel. This is a great place for triggering the browser automation. It allows you to run automation on-demand, and then you can hook your commands into the scheduler to run them on an automated schedule. This is how I run all my browser scraping.
  52. Laravel provides a queue with multiple drivers. You can trigger web scraping by dispatching a queue event. We will go over an example of this soon. Laravel has everything you need to run queue workers; you don’t need to reach for anything outside of Laravel to get a queue going apart from the queue driver you use, such as installing Redis, beanstalkd, etc.
  53. Example queue that dispatches a url to download. This is just a simple example to illustrate the flow.
  54. This is the queue handler method that actually does something. In this case I am programmatically calling a console command.
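
The dispatch-then-handle flow from notes 53 and 54 could be sketched as a queued job like the one below. The `DownloadUrl` class name is hypothetical, and passing a `url` argument to `browser:download` is an assumption about that command's signature rather than something the slides confirm.

```php
<?php

namespace App\Jobs;

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;
use Illuminate\Support\Facades\Artisan;

class DownloadUrl implements ShouldQueue
{
    use InteractsWithQueue, Queueable, SerializesModels;

    /** @var string URL the scraper should download. */
    private $url;

    public function __construct(string $url)
    {
        $this->url = $url;
    }

    /**
     * Runs on the queue worker: call the console command programmatically.
     */
    public function handle()
    {
        // Assumption: the command accepts a "url" argument.
        Artisan::call('browser:download', ['url' => $this->url]);
    }
}
```

A controller or route can then queue the work with `dispatch(new DownloadUrl('http://example.com'));`, and a worker started with `php artisan queue:work` picks it up.
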
  55. You might be wondering how you are supposed to run this automation in an environment without a browser window/screen.
  56. This is an example usage of Xvfb in a Docker entrypoint. We will examine this file a little more closely at the end.
  57. What do we need to run automation in Docker? A browser: we will use Google Chrome. Chromedriver: we are going to install and run chromedriver in the container so our code doesn’t have to worry about spinning up a process. Xvfb: we need to install Xvfb. PHP: we will run our app in PHP. We will use a custom entrypoint script to run chromedriver, Xvfb, and the Laravel scheduler.
  58. A few specifics about the Docker setup I am demonstrating here. I am extending the official PHP Docker images (https://hub.docker.com/_/php/). Docker Compose: this is how you can easily run/orchestrate your containers. Redis: for the queue. MySQL: we are not actually using MySQL in this demo for anything, but that’s what I would typically use for my application.
  59. ``` chromedriver & php artisan browser:download ```
  60. Demo the connecting files: review docker-compose.yml; run `docker-compose up`; run `tail -f storage/logs/laravel.log`; jump into the scheduler container; run `php artisan browser:extract`; run `php artisan browser:download`; trigger the queue locally with `http POST http://localhost:8080/api/queue-example url="http://example.com"`.