Scraping the web with Laravel, Dusk, Docker, and PHP

By: Paul Redmond
@paulredmond paulredmond

What You’ll Learn
● Different types of scraping and when to use them
● Use Laravel Dusk for rapid browser automation
● Different Ways to Run Browser Automation
● Run Browser Automation in a Server Environment

What is Web Scraping?
It’s a dirty job
Gathering data from HTML and
other media for the purposes
of testing, data enrichment,
and collection.
https://flic.kr/p/8EZMNk

Hundreds of Billions
Google “Scrapes” Hundreds of Billions (Or More)
of Pages and other media on the web.
https://www.google.com/search/howsearchworks/crawling-indexing/

Why Do We Need Scraping?
● Market analysis
● Gain a competitive advantage
● Increase learning and understanding
● Monitor trends
● Combine multiple offers into one portal (e.g., shopping comparisons)
● Analytics

Other Types of Data Scraping
● Competitor Scanning
● Military Intelligence
● Surveillance
● Metering

Is Web Scraping Legitimate?
● Yes, it can be.
● Scraping can have a negative/bad connotation, so...
○ Don’t do bad / illegal stuff
○ Be nice
○ Be careful
○ Be respectful

Keeping Web Scraping Legitimate
● Speed. Go slow (watch requests/second)
● Caution. Code mistakes could create unintended load!
● Intent. Even if your intention is pure, always question.
● Empathy. Put yourself in the shoes of website owners
● Honesty. Don’t steal stuff (PII, copyrights, etc.)
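One way to "go slow" is to put a small throttle in front of every request. This is a minimal sketch, not part of the talk's code: the `Throttle` class and the 10 requests/second figure are assumptions chosen for illustration.

```php
<?php
/**
 * Hypothetical helper: spaces out calls so a scraper never exceeds
 * a chosen number of requests per second.
 */
final class Throttle
{
    /** @var float Unix timestamp (with microseconds) of the last call */
    private $lastCall = 0.0;

    /** @var float Minimum number of seconds between two calls */
    private $minInterval;

    public function __construct(float $maxPerSecond)
    {
        if ($maxPerSecond <= 0) {
            throw new InvalidArgumentException('maxPerSecond must be positive');
        }
        $this->minInterval = 1.0 / $maxPerSecond;
    }

    /** Sleep just long enough to respect the minimum interval. */
    public function wait(): void
    {
        $elapsed = microtime(true) - $this->lastCall;
        if ($elapsed < $this->minInterval) {
            usleep((int) round(($this->minInterval - $elapsed) * 1000000));
        }
        $this->lastCall = microtime(true);
    }
}

// Usage: call wait() before each request (the URLs are placeholders).
$throttle = new Throttle(10.0); // assumed limit: 10 requests/second
foreach (['/page/1', '/page/2', '/page/3'] as $path) {
    $throttle->wait();
    // ... perform the HTTP request for $path here ...
}
```

A single shared instance keeps the limit global, so a coding mistake in one loop is less likely to create unintended load.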

Keep Robots.txt in Mind...Be a Good Bot
● https://www.google.com/robots.txt
● https://www.yahoo.com/robots.txt
● https://github.com/robots.txt (see the top comment)
● PHP Robots Parser: https://github.com/webignition/robots-txt-file
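To show what "being a good bot" means in code, here is a deliberately simplified sketch of a robots.txt check. The `isPathDisallowed` function is hypothetical: it only reads `Disallow` rules from `User-agent: *` groups and does plain prefix matching, ignoring `Allow` rules and wildcards. A real project would more likely use a dedicated parser such as the one linked above.

```php
<?php
/**
 * Simplified, hypothetical robots.txt check. Only "User-agent: *" groups and
 * prefix-matched Disallow rules are considered.
 */
function isPathDisallowed(string $robotsTxt, string $path): bool
{
    $appliesToUs = false;  // are we inside a group that targets "*"?
    $lastWasAgent = false; // consecutive User-agent lines form one group

    foreach (preg_split('/\R/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*$/', '', $line)); // strip comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }

        list($field, $value) = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            $appliesToUs = ($lastWasAgent && $appliesToUs) || $value === '*';
            $lastWasAgent = true;
            continue;
        }
        $lastWasAgent = false;

        if ($field === 'disallow' && $appliesToUs
            && $value !== '' && strpos($path, $value) === 0) {
            return true;
        }
    }

    return false;
}

// Usage with an inline example file (contents are made up for illustration).
$robots = "# demo\nUser-agent: *\nDisallow: /private/\nDisallow: /tmp\n\n"
        . "User-agent: BadBot\nDisallow: /\n";

$blocked = isPathDisallowed($robots, '/private/report.html'); // true
$allowed = isPathDisallowed($robots, '/public/index.html');   // false
```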

When Do We Scrape?
● What is the purpose?
● Can we live without the data?
● Do they have an API?
● If yes, does the API have everything we need?
● Do they allow scraping?

Downsides of Scraping
● Changes in the HTML/DOM break scrapers
● Rich JavaScript apps can cause headaches
● Scraping can be process/memory and time intensive
● More manual processing/formatting of collected data
than an API

How Do we Overcome the Downsides?
● Match DOM/Selectors defensively
● It's a bit of an art that takes practice and experience
● Make sure that you handle failure
● Good alerting, notifications, and reporting
○ https://www.bugsnag.com/
○ https://sentry.io/
● Learn to accept that scraping will break sometimes
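A rough sketch of matching defensively and handling failure, using only PHP's built-in DOM extension. The `extractFirst` helper, the sample HTML, and the selectors are assumptions for illustration; the idea is to try a preferred selector, fall back to a looser one, and return `null` so the caller can alert rather than crash.

```php
<?php
/**
 * Hypothetical helper: try several XPath expressions in order and return the
 * first non-empty text match, or null when nothing matches.
 */
function extractFirst(string $html, array $xpaths): ?string
{
    if (trim($html) === '') {
        return null;
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate messy real-world HTML
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return null;
    }

    $xpath = new DOMXPath($doc);
    foreach ($xpaths as $expression) {
        $nodes = $xpath->query($expression);
        if ($nodes !== false && $nodes->length > 0) {
            $text = trim($nodes->item(0)->textContent);
            if ($text !== '') {
                return $text;
            }
        }
    }

    return null;
}

$html = '<html><body><div class="product">'
      . '<span class="amount">$19.99</span></div></body></html>';

$price = extractFirst($html, [
    '//*[@id="price"]',                   // preferred selector (absent here)
    '//span[contains(@class, "amount")]', // looser fallback
]);

if ($price === null) {
    // Report to your alerting tool here instead of failing silently.
    error_log('Price selector no longer matches');
}
```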

3 Categories of Web Scraping
● Anonymous HTTP Requests (HTML, Images, XML, etc.)
● Testing elements, asserting expected behavior
● Full Browser Automation Tasks

Anonymous Scraping - HTML, Images, etc.
● Fastest
● Easy to run and reproduce
● Just speaking HTTP
● PHP has good DOM parsing tools (Goutte)

Testing elements / asserting expected behavior
● May use HTTP to make basic response assertions
● May use a full browser (think testing Rich JavaScript Apps)
● Useful for user acceptance testing and browser testing

Full Browser Automation
● Like testing, but used for scraping
● Real browser or headless browser
● The closest thing to a real user
● Requires more tooling (e.g., Selenium, WebDriver, PhantomJS)
● Runs slow in general

(Some) PHP Tools You Can Use for Scraping
● cURL
● Goutte (goot)
● Guzzle
● HTTPFul
● PHP-Webdriver
● file_get_contents()

What Other Tools Have You Used?

Goutte is the Best Option (in my opinion)
Pronounced “goot”
HTTP Scraping

Goutte Overview
● Uses Symfony/BrowserKit to Simulate the Browser
● Uses Symfony/DomCrawler for DOM Traversal/Filtering
● Uses Guzzle for HTTP Requests
● Get and Set Cookies
● History (allows you to go back, forward, clear)
Reference: https://github.com/FriendsOfPHP/Goutte

Goutte Capabilities
● Click on Links and navigate the web
● Extract data / filter data
● Submit forms
● Follows redirects (by default)
● Requests return an instance of Symfony\Component\DomCrawler\Crawler

Let’s Look at Some Examples of HTTP Scraping
Goutte Examples on Github

Ways you might use web scraping for testing
● Test bulk site redirects before a migration
○ Request the old URLs
○ Assert a 3xx response
○ Assert the redirect location returns a 200
● Functional test suites (e.g., Symfony/Laravel)
● Healthcheck Probes / HTTP validation (e.g., 200 response)
Testing and Web Scrapers
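The bulk redirect check above can be sketched as a small function. Everything here is an assumption for illustration: `checkRedirects` is a hypothetical name, and `$fetch` stands in for whatever HTTP client you use, as long as it does not follow redirects and returns the status code plus the `Location` header. The example uses a stubbed fetcher with made-up URLs so it runs without network access.

```php
<?php
/**
 * Hypothetical redirect check. $fetch is any callable that takes a URL and
 * returns ['status' => int, 'location' => ?string] WITHOUT following redirects.
 * Returns a list of human-readable failures (an empty list means all passed).
 */
function checkRedirects(array $oldUrls, callable $fetch): array
{
    $failures = [];

    foreach ($oldUrls as $url) {
        // 1. Request the old URL and assert a 3xx response.
        $first = $fetch($url);
        if ($first['status'] < 300 || $first['status'] >= 400) {
            $failures[] = "$url: expected 3xx, got {$first['status']}";
            continue;
        }
        if (empty($first['location'])) {
            $failures[] = "$url: redirect has no Location header";
            continue;
        }

        // 2. Assert the redirect location returns a 200.
        $target = $fetch($first['location']);
        if ($target['status'] !== 200) {
            $failures[] = "$url: target {$first['location']} returned {$target['status']}";
        }
    }

    return $failures;
}

// Stubbed responses standing in for real HTTP calls.
$responses = [
    'http://old.test/a' => ['status' => 301, 'location' => 'http://new.test/a'],
    'http://new.test/a' => ['status' => 200, 'location' => null],
    'http://old.test/b' => ['status' => 200, 'location' => null],
];
$fetch = function (string $url) use ($responses): array {
    return $responses[$url];
};

$failures = checkRedirects(['http://old.test/a', 'http://old.test/b'], $fetch);
```

Injecting the fetcher keeps the check independent of any one HTTP client, so the same function works with cURL, Guzzle, or a stub in tests.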

Example Functional Test Asserting HTML
http://symfony.com/doc/current/testing.html#your-first-functional-test

Example Functional Test Asserting Status
https://laravel.com/docs/5.4/http-tests#introduction

Example Functional Browser Test
https://laravel.com/docs/5.4/dusk#getting-started

Why do we need full browser automation tools?
● Simulate real browsers
● Test/Work with Async JavaScript applications
● Automate testing that applications work as expected
● Replace repetitive manual QA with automation
● Run tests in multiple browsers
● Advanced Web Scraping (e.g., filtered reports)

Notable Tools in Browser Automation
● Selenium
● W3C WebDriver (https://www.w3.org/TR/webdriver/)
● Headless Browsers
○ PhantomJS
○ Chrome --headless*
○ ZombieJS
* Chromedriver isn’t quite working with --headless yet, at least for me ¯_(ツ)_/¯

Notable PHP Tools in Browser Automation
● Behat / Mink
● PHP-Webdriver
○ Codeception
○ Laravel Dusk (recently)
● Steward
● Any others you consider notable?

Notables in Other Languages...
● Python
○ Selenium WebDriver Bindings
○ BeautifulSoup
○ Requests: HTTP for Humans
○ Scrapy
● Ruby
○ Capybara
○ Nokogiri (DOM Parsing)
○ Mechanize Gem

Notables in Other Languages...
● JavaScript
○ Nightwatch.js
○ Zombie
○ PhantomJS
○ Webdriver.io
○ CasperJS
○ SlimerJS

Why Use PHP for Web Browser Automation?
● Developers don’t have to learn a new language (good/bad)
● More participation in teams already writing PHP
● Reduce cross-language mental overhead
● Browser Automation can be closer to your domain logic
● PHP-Webdriver is Good Enough™ (and backed by Facebook)

How Do I Run PHP Browser Automation?
● `chrome --headless` - as of Chrome 59
● Standalone Selenium
● WebDriver
● PhantomJS
● Any other ways?
How Do I Run This Stuff?

Run Chrome Headless (Chrome 59 Stable)
$ alias chrome="/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"
$ chrome --headless --disable-gpu --print-to-pdf https://www.github.com/
$ open output.pdf
$ chrome --headless --disable-gpu --dump-dom
$ chrome --headless --disable-gpu --repl https://www.chromestatus.com/
Reference: https://developers.google.com/web/updates/2017/04/headless-chrome

Getting to Know PHP-WebDriver
WebDriver Examples on Github

Running the Chromedriver/Phantom Process

Techniques for Triggering Browser Automation
● Eager tasks - run on a schedule
● On-demand - one-off console commands
● Event trigger - event queue
● What are some other ways?

Intro to Laravel Dusk
● Browser testing for Laravel projects (primary use case)
● Browser abstraction on top of PHP-Webdriver <3
● Doesn’t require JDK or Selenium (you can still use them)
● Uses standalone ChromeDriver

Do I HAVE to use Laravel to Use Dusk!?

But I am going to show you why
it's great for web automation stuff...

Dusk Basics: Waiting for Elements

Quick Comparison to Our Earlier Vanilla PHP-Webdriver Example
Webdriver Dusk Examples on Github

Key Laravel Features for Browser Automation
● Scheduler to run Commands on a schedule (eager)
● Create Custom Console Commands (one-off)
● Built-in Queues (triggered)
● Database Migrations for quick modeling of data storage
● Service Container for browser automation classes

Scheduler (app/Console/Kernel.php)

Custom Console Commands
● Easily run one-off commands
● Scheduler uses commands, giving you both
● Laravel uses the Symfony Console and adds conveniences
● Commands run my browser scraping

Queues
● Easily trigger web scraping jobs
● Queue jobs can trigger console commands
● Laravel has a built-in queue worker
● Redis is my preferred queue driver

Running Browser Automation in Docker

How Do I Run PHP Browser Automation on a Server!?
XVFB

XVFB. What the What!?
“Xvfb (short for X virtual framebuffer) is an in-memory display
server for UNIX-like operating system (e.g., Linux). It enables you
to run graphical applications without a display (e.g., browser
tests on a CI server) while also having the ability to take
screenshots.”
Reference: http://elementalselenium.com/tips/38-headless

Example Xvfb Usage
$ Xvfb :99 -screen 0 1920x1200x16 &

Our Requirements for a Docker Scheduler
● Google Chrome Stable
● Chromedriver
● Xvfb
● PHP
● Entrypoint to run the scheduler
Running in Docker

Our Docker Setup
● Docker Official php:7.1.6-cli (Scheduler)
● Docker Official php:7.1.6-fpm (Web Container)
● Docker Compose
● Redis
● MySQL

Why Not the Official Selenium Image?
● If you need File Downloads through Chrome
● Downloads through volumes aren’t ideal
● If you want the same PHP installation on app and scheduler
(I do)

Scheduler Dockerfile
● Extends php:7.1.6-cli
● Installs Chrome Stable + a script to take Chrome out of sandbox mode
● Installs Chromedriver
● Installs Required PHP Modules
● Copies Application Files
● Runs a custom entrypoint script

Scheduler Dockerfile
Review the Scheduler Docker Files

How Do I Download Files through Chrome?

Extending Dusk Browser - Hooking it Together
● Provide our Own Browser class
● A DownloadsManager class for Chrome downloads
● A DownloadedFile Class to Work with Downloaded Files
● Service Container Bindings in AppServiceProvider
● Example Command
● Let's see it in action...

Full Docker Setup in Action
(Demo)

My Projects
Lumen Programming Guide
http://www.apress.com/la/book/9781484221860
You will learn to write test-driven (TDD)
microservices, REST APIs, and web service
APIs with PHP using the Lumen micro-
framework.
* Zero bugs in the book source code ;)

My Projects
Docker for PHP Developers
https://leanpub.com/docker-for-php-developers
A hands-on guide to learning how to use
Docker as your primary development
environment. It covers a diverse range of
topics and scenarios you will face as a
PHP developer picking up Docker.
