Jumpstart your web scraping automation in the cloud with Laravel Dusk, Docker, and friends. We will discuss the types of web scraping tools, the best tools for the job, and how to deal with running selenium in Docker.
Code examples @ https://github.com/paulredmond/scraping-with-laravel-dusk
Scraping the web with Laravel, Dusk, Docker, and PHP
1. Scraping the Web with
Laravel Dusk, Docker, and PHP
By: Paul Redmond
@paulredmond paulredmond
2. What You’ll Learn
● Different types of scraping and when to use them
● Use Laravel Dusk for rapid browser automation
● Different Ways to Run Browser Automation
● Run Browser Automation in a Server Environment
3. What is Web Scraping?
It’s a dirty job
Gathering data from HTML and
other media for the purposes
of testing, data enrichment,
and collection.
https://flic.kr/p/8EZMNk
4. Hundreds of Billions
Google “Scrapes” Hundreds of Billions (Or More)
of Pages and other media on the web.
https://www.google.com/search/howsearchworks/crawling-indexing/
5. Why Do We Need Scraping?
● Market analysis
● Gain a competitive advantage
● Increase learning and understanding
● Monitor trends
● Combine multiple offers into one portal (ie. Shopping
comparisons)
● Analytics
6. Other Types of Data Scraping
● Competitor Scanning
● Military Intelligence
● Surveillance
● Metering
9. Is Web Scraping Legitimate?
● Yes, it can be.
● Scraping can have a negative/bad connotation, so...
○ Don’t do bad / illegal stuff
○ Be nice
○ Be careful
○ Be respectful
11. Keeping Web Scraping Legitimate
● Speed. Go slow (watch requests/second)
● Caution. Code mistakes could create unintended load!
● Intent. Even if your intention is pure, always question.
● Empathy. Put yourself in the shoes of website owners
● Honesty. Don’t steal stuff (PII, copyrights, etc.)
12. Keep Robots.txt in Mind...Be a Good Bot
● https://www.google.com/robots.txt
● https://www.yahoo.com/robots.txt
● https://github.com/robots.txt (see the top comment)
* PHP Robots Parser: https://github.com/webignition/robots-txt-file
13. When Do We Scrape?
● What is the purpose?
● Can we live without the data?
● Do they have an API?
● If yes, does the API have everything we need?
● Do they allow scraping?
14. Downsides of Scraping
● Changes in the HTML/DOM break scrapers
● Changes in the HTML/DOM break scrapers
● Changes in the HTML/DOM break scrapers
● Changes in the HTML/DOM break scrapers
● Rich JavaScript apps can cause headaches
● Scraping can be process/memory and time intensive
● More manual processing/formatting of collected data
than an API
● Changes in the HTML/DOM break scrapers
15. How Do we Overcome the Downsides?
● Match DOM/Selectors defensively
● It's a bit of an art that takes practice and experience
● Make sure that you handle failure
● Good alerting, notifications, and reporting
○ https://www.bugsnag.com/
○ https://sentry.io/
● Learn to accept that scraping will break sometimes
17. 3 Categories of Web Scraping
● Anonymous HTTP Requests (HTML, Images, XML, etc.)
● Testing elements, asserting expected behavior
● Full Browser Automation Tasks
18. Anonymous Scraping - HTML, Images, etc.
● Fastest
● Easy to run and reproduce
● Just speaking HTTP
● PHP has good DOM parsing tools (Goutte)
19. Testing elements / asserting expected behavior
● May use HTTP to make basic response assertions
● May use a full browser (think testing Rich JavaScript Apps)
● Useful for user acceptance testing and browser testing
20. Full Browser Automation
● Like testing, but used for scraping
● Real browser or headless browser
● The closest thing to a real user
● Requires more tooling (ie. Selenium, WebDriver, Phantom)
● Runs slow in general
21. (Some) PHP Tools You Can Use for Scraping
● cURL
● Goutte (goot)
● Guzzle
● HTTPFul
● PHP-Webdriver
● file_get_contents()
24. Goutte is the Best Option (in my opinion)
Pronounced “goot”
HTTP Scraping
25. Goutte Overview
● Uses Symfony/BrowserKit to Simulate the Browser
● Uses Symfony/DomCrawler for DOM Traversal/Filtering
● Uses Guzzle for HTTP Requests
● Get and Set Cookies
● History (allows you to go back, forward, clear)
Reference: https://github.com/FriendsOfPHP/Goutte
HTTP Scraping
26. Goutte Capabilities
● Click on Links and navigate the web
● Extract data / filter data
● Submit forms
● Follows redirects (by default)
● Requests return an instance of
Symfony\Component\DomCrawler\Crawler
HTTP Scraping
27. Let’s Look at Some Examples of HTTP Scraping
Goutte Examples on Github
HTTP Scraping
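Before jumping to the repo, here is a minimal sketch of what HTTP scraping with Goutte looks like (the URL, selector, and link text are illustrative, not from the demo repository):

```php
<?php
// composer require fabpot/goutte
require 'vendor/autoload.php';

use Goutte\Client;

$client = new Client();

// Request a page; Goutte returns a Symfony DomCrawler Crawler instance
$crawler = $client->request('GET', 'https://example.com/');

// Filter elements with CSS selectors and extract their text
$crawler->filter('h1')->each(function ($node) {
    echo $node->text() . PHP_EOL;
});

// Click a link by its text to navigate to the next page
$link = $crawler->selectLink('More information')->link();
$crawler = $client->click($link);
```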
29. Ways you might use web scraping for testing
● Test bulk site redirects before a migration
○ Request the old URLs
○ Assert a 3xx response
○ Assert the redirect location returns a 200
● Functional test suites (ie. Symfony/Laravel)
● Healthcheck Probes / HTTP validation (ie. 200 response)
Testing and Web Scrapers
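The bulk-redirect check described above can be sketched with plain Guzzle (the URLs are illustrative, and it assumes the Location headers are absolute):

```php
<?php
// composer require guzzlehttp/guzzle
require 'vendor/autoload.php';

use GuzzleHttp\Client;

// Disable redirect following so we can inspect the 3xx responses directly
$client = new Client(['allow_redirects' => false, 'http_errors' => false]);

$oldUrls = ['https://example.com/old-page'];

foreach ($oldUrls as $url) {
    $response = $client->get($url);
    $status = $response->getStatusCode();
    assert($status >= 300 && $status < 400);

    // Follow the redirect target and make sure it resolves to a 200
    $location = $response->getHeaderLine('Location');
    $final = $client->get($location);
    assert($final->getStatusCode() === 200);
}
```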
30. Example Functional Test Asserting HTML
Testing and Web Scrapers
http://symfony.com/doc/current/testing.html#your-first-functional-test
31. Example Functional Test Asserting Status
Testing and Web Scrapers
https://laravel.com/docs/5.4/http-tests#introduction
32. Example Functional Browser Test
Testing and Web Scrapers
https://laravel.com/docs/5.4/dusk#getting-started
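The Dusk getting-started test from the linked docs looks roughly like this:

```php
<?php

namespace Tests\Browser;

use Tests\DuskTestCase;
use Laravel\Dusk\Browser;

class ExampleTest extends DuskTestCase
{
    /**
     * A basic browser test example.
     */
    public function testBasicExample()
    {
        $this->browse(function (Browser $browser) {
            $browser->visit('/')
                    ->assertSee('Laravel');
        });
    }
}
```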
34. Why do we need full browser automation tools?
Full Browser Automation
35. Why do we need full browser automation tools?
● Simulate real browsers
● Test/Work with Async JavaScript applications
● Automate testing that applications work as expected
● Replace repetitive manual QA with automation
● Run tests in multiple browsers
● Advanced Web Scraping (ie. filtered reports)
Full Browser Automation
36. Notable Tools in Browser Automation
● Selenium
● W3 WebDriver (https://www.w3.org/TR/webdriver/)
● Headless Browsers
○ PhantomJS
○ Chrome --headless*
○ ZombieJS
* Chromedriver isn’t quite working with --headless yet, at least for me ¯\_(ツ)_/¯
Full Browser Automation
37. Notable PHP Tools in Browser Automation
● Behat / Mink
● PHP-Webdriver
○ Codeception
○ Laravel Dusk (recently)
● Steward
● Any others you consider notable?
Full Browser Automation
38. Notables in Other Languages...
● Python
○ Selenium WebDriver Bindings
○ BeautifulSoup
○ Requests: HTTP for Humans
○ Scrapy
● Ruby
○ Capybara
○ Nokogiri (DOM Parsing)
○ Mechanize Gem
Full Browser Automation
39. Notables in Other Languages...
● JavaScript
○ Nightwatch.js
○ Zombie
○ PhantomJS
○ Webdriver.io
○ CasperJS
○ SlimerJS
Full Browser Automation
40. Why Use PHP for Web Browser Automation?
● Developers don’t have to learn a new language (good/bad)
● More participation in teams already writing PHP
● Reduce cross-language mental overhead
● Browser Automation can be closer to your domain logic
● PHP-Webdriver is Good Enough™ (and backed by Facebook)
Full Browser Automation
42. How Do I Run PHP Browser Automation?
● `chrome --headless` - as of Chrome 59
● Standalone Selenium
● WebDriver
● PhantomJS
● Any other ways?
How Do I Run This Stuff?
43. Run Chrome Headless (Chrome 59 Stable)
$ alias chrome="/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"
$ chrome --headless --disable-gpu --print-to-pdf https://www.github.com/
$ open output.pdf
$ chrome --headless --disable-gpu --dump-dom https://www.chromestatus.com/
$ chrome --headless --disable-gpu --repl https://www.chromestatus.com/
Reference: https://developers.google.com/web/updates/2017/04/headless-chrome
How Do I Run This Stuff?
44. Getting to Know PHP-WebDriver
WebDriver Examples on Github
How Do I Run This Stuff?
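Bare PHP-Webdriver usage looks roughly like this (it assumes chromedriver is already running on its default port 9515; the URL and selector are illustrative):

```php
<?php
// composer require facebook/webdriver
require 'vendor/autoload.php';

use Facebook\WebDriver\Remote\RemoteWebDriver;
use Facebook\WebDriver\Remote\DesiredCapabilities;
use Facebook\WebDriver\WebDriverBy;

// Connect to a locally running chromedriver
$driver = RemoteWebDriver::create(
    'http://localhost:9515',
    DesiredCapabilities::chrome()
);

$driver->get('https://example.com/');
echo $driver->getTitle() . PHP_EOL;

// Find an element and read its text
$heading = $driver->findElement(WebDriverBy::cssSelector('h1'));
echo $heading->getText() . PHP_EOL;

$driver->quit();
```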
46. Techniques for Triggering Browser Automation
● Eager tasks - run on a schedule
● On-demand - one-off console commands
● Event trigger - event queue
● What are some other ways?
How Do I Run This Stuff?
48. Intro to Laravel Dusk
● Browser testing for Laravel projects (primary use case)
● Browser abstraction on top of PHP-Webdriver <3
● Doesn’t require JDK or Selenium (you can still use them)
● Uses standalone ChromeDriver
58. Key Laravel Features for Browser Automation
● Scheduler to run Commands on a schedule (eager)
● Create Custom Console Commands (one-off)
● Built-in Queues (triggered)
● Database Migrations for quick modeling of data storage
● Service Container for browse automation classes
60. Custom Console Commands
● Easily run one-off commands
● Scheduler uses commands, giving you both
● Laravel uses the Symfony Console and adds conveniences
● Commands run all of my browser scraping
61. Queues
● Easily trigger web scraping jobs
● Queue jobs can trigger console commands
● Laravel has a built-in queue worker
● Redis is my preferred queue driver
65. How Do I Run PHP Browser Automation on a Server!?
How Do I Run This Stuff?
66. How Do I Run PHP Browser Automation on a Server!?
How Do I Run This Stuff?
XVFB
67. XVFB. What the What!?
“Xvfb (short for X virtual framebuffer) is an in-memory display
server for UNIX-like operating system (e.g., Linux). It enables you
to run graphical applications without a display (e.g., browser
tests on a CI server) while also having the ability to take
screenshots.”
Reference: http://elementalselenium.com/tips/38-headless
How Do I Run This Stuff?
70. Our Requirements for a Docker Scheduler
● Google Chrome Stable
● Chromedriver
● Xvfb
● PHP
● Entrypoint to run the scheduler
Running in Docker
71. Our Docker Setup
● Docker Official php:7.1.6-cli (Scheduler)
● Docker Official php:7.1.6-fpm (Web Container)
● Docker Compose
● Redis
● MySQL
Running in Docker
72. Why Not the Official Selenium Image?
● If you need File Downloads through Chrome
● Downloads through volumes aren’t ideal
● If you want the same PHP installation on app and scheduler
(I do)
Running in Docker
73. Scheduler Dockerfile
● Extends php:7.1.6-cli
● Installs Chrome Stable + a script to take chrome out of
sandbox mode
● Installs Chromedriver
● Installs Required PHP Modules
● Copies Application Files
● Runs a custom entrypoint script
Running in Docker
75. How Do I Download Files through Chrome?
Running in Docker
76. Extending Dusk Browser - Hooking it Together
● Provide our Own Browser class
● A DownloadsManager class for chrome downloads
● A DownloadedFile Class to Work with Downloaded Files
● Service Container Bindings in AppServiceProvider
● Example Command
● Lets see it in action...
Running in Docker
78. My Projects
Lumen Programming Guide
http://www.apress.com/la/book/9781484221860
You will learn to write test-driven (TDD)
microservices, REST APIs, and web service
APIs with PHP using the Lumen micro-
framework.
* Zero bugs in the book source code ;)
79. My Projects
Docker for PHP Developers
https://leanpub.com/docker-for-php-developers
A hands-on guide to learning how to use
Docker as your primary development
environment. It covers a diverse range of
topics and scenarios you will face as a
PHP developer picking up docker.
The presentation is about you. I would like an open discussion during the presentation so you can get the most out of it and share your ideas on the topic.
I came up with my own definition. It’s neither wrong nor right. What do you think about when you think of web scraping?
Google is the biggest web scraper in the world. I don’t have any hard data, but I’m right.
Has anyone dealt with performance issues on a site at scale due to GoogleBot traffic?
Scraping is used as a business tool to look for growth, opportunities, trends, history, and analytics. For example, in order for Best Buy to stay competitive both online and in a brick-and-mortar store, they need to understand the competition’s pricing.
Google uses complex search algorithms to give you the most relevant content. This builds a relationship of trust and usefulness, and in turn you will likely visit sites through which Google provides advertising. Advertising revenue accounts for 88% of Google’s (Alphabet) revenue. (Source: http://www.businessinsider.com/how-google-apple-facebook-amazon-microsoft-make-money-chart-2017-5)
Google Shopping collects deals for search results from multiple Stores, which is an example of scraping data, and then combining it in new and (potentially) useful ways.
I like to relate topics to otherwise non-related things.
These are potentially related examples of data scraping for the purposes of gain in different contexts. They are neither wrong nor right, they are just ways to trigger both sides of your brain in creative ways to think about “data scraping”.
Observation balloons were used heavily in World War I for artillery observation: they provided more data about enemy positions and artillery, and reported it back to the ground. This was a huge competitive advantage that all sides exploited.
In many ways the same principles are applied. Take a higher position, collect enrichment data, report back, and then make decisions from new data.
Maybe this is a stretch, but that’s why it's a fun exercise. Water meters run all the time, and meter reads happen on a schedule. Generally someone drives around and logs the reading. This reading is processed for millions of homes, collected, and then invoiced. This data is then later collected by other parties to document water usage on a macro level, and provides many insights into predicted water usage as a populace grows.
Relating programming and product concepts to otherwise unrelated things is a powerful skill. It helps the concepts stick more for me, because I can associate them with other familiar everyday things, kind of like how we refer to pointing devices as a “mouse”.
Like many other things, scraping can be used for good and bad. Like any other business practice, common sense will help you steer away from shady/illegal web scraping practices.
I have combed over access logs for some of the top 50 US-trafficked properties. I’ve seen so many weird user agents and bots, and I’ve banned many for overwhelming our infrastructure. When a bot’s intent is unclear, even if it’s actually good, you respond by blocking it.
Don’t use scraping to do shady or questionable things, like produce artificial traffic to your site, collect user data that you shouldn’t, etc.
Use these 5 words when designing web scrapers:
Speed
Caution
Intent
Empathy
Honesty
Speed: Go slow. Respect that your automation can produce a lot of traffic.
Caution: Automation code mistakes can unintentionally do things like delete things, create load because of recursion, etc
Intent: Even when you have good intentions, always question your approach. Which leads to…
Empathy: Put yourself in the shoes of the site owners. What would you do if a single IP started sending 100’s of requests per second at your server?
Honesty: To reiterate, don’t do illegal/shady things. Don’t collect Personally Identifiable Information, treat copyrights with respect, etc.
Be respectful to site owners. This talk doesn’t go deeply into how to parse/handle the robots.txt file, but be aware of it.
In my experience, my use case has been data that I would gladly consume from an API or another source like RSS. Use those if they are available. Do use scraping when something isn’t possible (or provided) and you have deemed your use case is appropriate. Always err on the side of getting permission.
This is just a high-level guideline of my thought process. Generally I look for an API to collect the data I want. I’ve even used inbound email webhooks over scraping because, while scraping can be effective, it is also error-prone sometimes.
Which brings me to the downsides of scraping...
Async elements can be challenging because you have to write logic to wait for something async to finish before your scraper proceeds
You will have to deal with more manual processing of data than an API typically
Scraping can be process/memory/time intensive. You are running a browser after all.
Changes in the HTML/DOM will break your stuff. There’s nothing you can do about it, except update your code.
Even a well-written defensive scraper is prone to breaking when the source changes their HTML / Dom / Input names, etc. I try to make selectors as simple and flexible as possible, without losing integrity of selecting the right element. The more specific your selector, the easier a change will break it. Sometimes you can’t do anything to stop it from breaking other than updating your code.
Make sure you have good error reporting so you can know about scraping issues. I use Bugsnag, and I’ve also heard many love Sentry. You can typically get them working in less than an hour.
I am going to group scraping into 3 categories for our purposes. Roughly they include:
anonymous HTTP requests and parsing responses
Browser acceptance testing
Full browser automation jobs
Anonymous scraping is faster than full browser automated scraping. Use this whenever possible. You use this when you want to collect raw HTML on the page and do something with it. For example, you could add a check that makes sure your page has your Analytics code and the right ID.
This type of scraping is easy because we are just speaking HTTP and getting HTTP responses back.
PHP has good DOM parsing tools. For example, using DOMDocument directly or symfony/dom-crawler. We will cover a good tool that uses the symfony/dom-crawler component.
The second type of scraping we will briefly look at is testing.
Test suites might use basic HTTP to make assertions
Some test suites use full browser emulation (ie. behat)
Scraping / browser emulation is great for user acceptance testing and for making sure the same code works in multiple browsers.
Full Browser Automation
Full browser automation is used in tests, but what I am talking about is full browser automation for the purposes of scraping websites
It’s the closest thing to a real user
It requires more tooling to run, such as selenium standalone server, webdriver, and/or PhantomJS
Full browser automation is slower than basic HTTP scraping
Here are some tools you could use for the purposes of scraping websites
Let’s dig a little bit into HTTP scraping sites. This is the most basic, fastest form of scraping I mentioned earlier.
My favorite tool is Goutte (pronounced goot) because it provides some nice conveniences. We will jump into a few examples that use Goutte after a quick overview.
How many of you are familiar with, or have used Goutte on a project?
Goutte uses Symfony BrowserKit to simulate a browser
Goutte uses Symfony DomCrawler for DOM traversal, filtering, and working with elements
Under the hood, Goutte uses Guzzle. You can also tweak guzzle settings if needed
Goutte can get and set cookies
Goutte provides a history you can use to review history, and do things like go back/forward/clear
Some of Goutte’s Capabilities that provide you some nice abstraction over the DOM:
Click on links to navigate
Extract / filter data
Submit forms
Follows redirects by default
Provides you a Crawler instance to traverse the DOM
I don’t want to spend a ton of time on this section, but thought I’d mention a couple points about how functional testing is similar to scraping.
Functional tests use the browser to make a request, interact with the page, and make assertions.
Scraping does the same thing, but instead also collects data for storage, analysis, and enrichment.
I’ve used Goutte to write an ad-hoc test suite to verify hundreds of thousands of redirects during a large site migration. I was able to create rewrite rules and then test them in a sandboxed production environment before going live. This was a huge benefit to making sure that our migrations went well, and that we didn’t lose pagerank because of some bad responses and redirects.
Functional test suites are super helpful to automate boring/repetitive QA. I have been in some dire situations where regression testing with manual QA took weeks. Automated web tests save valuable time and give you confidence on a user / behavior level that your unit tests cannot.
I’ve written healthcheck probes, that I consider tests. They assert that your site or service is running as expected.
An example of a functional test using the Symfony framework
An example of a functional test using the Laravel framework
This is the first simple example with Dusk, which uses chromedriver and makes PHPUnit assertions to perform functional testing in a real browser
Sometimes browser automation is the only way you can possibly collect the data you need. I’ve recently had to write some browser automation to work with 3rd parties that are providing some reporting to us, but don’t provide an API.
You might need to use a combination of web scraping and/or inbound email attachments (I use Mailgun with Webhooks) in order to get the data because the 3rd party doesn’t provide an API. This is one example where web scraping can help.
These are some of the more notable tools in browser automation.
Selenium’s WebDriver becomes a W3C Web Standard - https://www.linkedin.com/pulse/seleniums-web-driver-become-w3c-standard-tom-weekes
I had actually never heard of Steward before preparing for this presentation. It seems really easy to get some PHPUnit assertions going with browser tests.
Behat (http://behat.org/en/latest/)
Behat is a Behavior Driven Development (BDD) framework automating browser acceptance testing with cucumber implementations.
Mink (http://mink.behat.org/en/latest/)
Mink is a “browser controller/emulator for web applications”
Steward (https://github.com/lmc-eu/steward)
“Steward: easy and robust testing with Selenium WebDriver + PHPUnit. Steward is set of libraries made to simplify writing and running robust functional system tests in PHPUnit using Selenium WebDriver.”
PHP-Webdriver (https://github.com/facebook/php-webdriver)
PHP-Webdriver is the most complete and advanced PHP bindings to the W3C WebDriver specification. It’s used by Codeception, Dusk, and Steward, and probably many others that I’ve never used before.
http://codeception.com/
https://laravel.com/docs/5.4/dusk
I don’t have extensive experience with many of these tools, but I have come to appreciate some of these libraries. The Selenium WebDriver bindings in Python are solid. I am a little jealous of that package.
Python:
http://selenium-python.readthedocs.io/
https://www.crummy.com/software/BeautifulSoup/
http://docs.python-requests.org/en/master/
https://scrapy.org/
Ruby
http://teamcapybara.github.io/capybara/
http://www.rubydoc.info/github/sparklemotion/nokogiri
https://rubygems.org/gems/mechanize
Some of you might be familiar with these tools. Has anyone used any of these tools to do web scraping? Maybe describe a little bit about what you were doing? What you liked about these tools? What you didn’t like, or what was difficult?
JavaScript
http://nightwatchjs.org/
http://zombie.js.org/
http://phantomjs.org/
http://webdriver.io/
http://casperjs.org/
https://slimerjs.org/
Recently I was struggling a little to put together some browser automation in PHP and I started down the path of using JavaScript with CasperJS. It’s a good tool, but there were a couple things that bothered me about it.
My domain logic was written in PHP, and wasn’t portable with my codebase
I wanted to spend as much time in PHP as possible. I am personally a big fan of the most simple stack possible.
I didn’t want to deal with context switching, I felt like I would be more productive writing my automation tools in the same language as web application.
I wanted my scraping to have access to some of the business domain code (that is well tested) I’ve written in PHP
I am not knocking these tools, I am simply saying that for me, I knew I would be more productive in PHP.
The main ways that I run PHP browser automation:
PhantomJS
WebDriver with Chrome
I threw in Chrome headless because Chrome recently shipped headless mode (v59).
Here are a couple examples you can use to run chrome headless to experiment a bit with it. The slides are using a Mac, so you will have to adapt it to your OS.
Let’s spend a little time getting familiar with PHP-Webdriver...
Only show the webdriver.php example at this point…
php webdriver.php
I wanted to point out that if you look through the source code of Laravel Dusk you will see that dusk ships with a copy of chromedriver and runs it as a process using Symfony’s Process component.
What are some ways that you can trigger browser automation? I am not suggesting these are the only way, but this is how I’ve categorized my own usage patterns:
Eager - you want to eagerly go after some scraping data on a schedule
On-Demand - you want to run automation when you ask for it (ie. a console)
Event based - some event triggers the need to scrape data (ie. user submits a sitemap.xml in WebMaster tools)
What are some other ways? Any thoughts?
Now that I’ve demonstrated PHP-Webdriver, I feel like it’s OK to use, but that a nice abstraction layer would speed development up nicely.
Enter Dusk.
Dusk is a project created to help browser test Laravel applications. It has stubs for setting up acceptance testing, but at its core, the Browser class is nicely abstracted with a lot of convenience around web browser automation. You don’t have to touch PHP-Webdriver directly, although you can easily get at Facebook WebDriverElement instances.
As you noticed in the barebones webdriver demo, Dusk also provides convenience around waiting for elements and satisfying assertions before moving on.
Laravel Dusk’s main purpose is for browser testing laravel projects. At the heart of all this, is a really great Browser class that abstracts away many tedious things you have to do with PHP-Webdriver directly. This is not a knock on PHP-Webdriver AT ALL. But I appreciate the high-level abstractions that Dusk provides for common things I need to accomplish.
I am not asking that, I love using Laravel. It’s my go-to framework, but just in case you are asking if you have to use the Laravel framework in order to use Laravel dusk…
You can install Laravel Dusk as a dependency in non-laravel projects. You will be responsible for booting up chromedriver and creating a RemoteWebDriver instance (which I am going to demonstrate), but you can use the \Laravel\Dusk\Browser class as the foundation for your browsing needs with automation.
This is actually how I started using Dusk - I used it to simplify my browser automation needs.
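A sketch of that standalone usage: you boot chromedriver yourself (e.g. `chromedriver &`) and hand Dusk’s Browser class a RemoteWebDriver instance (the URL is illustrative):

```php
<?php
require 'vendor/autoload.php';

use Facebook\WebDriver\Remote\RemoteWebDriver;
use Facebook\WebDriver\Remote\DesiredCapabilities;
use Laravel\Dusk\Browser;

// You are responsible for starting chromedriver before running this;
// 9515 is chromedriver's default port
$driver = RemoteWebDriver::create(
    'http://localhost:9515',
    DesiredCapabilities::chrome()
);

$browser = new Browser($driver);
$browser->visit('https://example.com/')
        ->waitFor('h1');

echo $browser->driver->getTitle() . PHP_EOL;

$driver->quit();
```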
There are convenient things I am using in my Laravel projects to run scheduled browser automation. Laravel makes running queues and scheduled tasks really convenient. For example, I need to download multiple files and run some data analysis on those files. Every day in the morning I run scheduled tasks to process the data via Laravel’s scheduler.
Here are some basic methods you can use to work with elements
value() - to get the value of an input
value() - you can set the value of an input by passing a second argument
text() - get the text of an element
attribute() - get an attribute from an element
Working with links
clickLink() - Clicks on a link by finding a link with the text
click() event on a selector
mouseover() - hover over an element
drag() - drag an element (I have personally never used it, but it’s cool, no?)
with() - scope element tasks within a selector
Working with form inputs
type() - type some text in an input
clear() - clear an input
select() - select an option from a <select/>
select() - select a random option
press() - press a button
Waiting for elements before proceeding, this is for async things...
waitFor() - waits for a selector to become available
waitFor() - waits one second for an element with a selector
waitForText() - waits for the text before proceeding
waitUntil - evaluate some Javascript and don’t proceed until it’s true
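Chained together inside a Dusk test, the methods above read like a script of what a real user would do (the page, selectors, and text here are illustrative):

```php
$this->browse(function (Browser $browser) {
    $browser->visit('/reports')
            ->type('email', 'user@example.com')  // fill an input
            ->select('range')                    // pick a random option
            ->press('Generate')                  // press a button
            ->waitForText('Report ready')        // wait out the async work
            ->clickLink('Download CSV');         // follow a link by its text

    // Read data back out of the page
    $heading = $browser->text('h1');
});
```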
All of these things can...
a) help you to rapidly build applications
b) quickly hook in some browser automation alongside your project
So that you can…
Build effective browser automation tools quickly
You can define a task schedule with a fluent API that runs your console commands on a schedule.
The way the Laravel scheduler works is that one cron is triggered every minute. I am going to show you how I run my scheduler in docker in a slightly different way, but the concept is the same.
You can schedule all your web automation here.
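In App\Console\Kernel, that fluent schedule might look like this (the command names mirror the demo commands; the timings are illustrative):

```php
protected function schedule(Schedule $schedule)
{
    // Eagerly scrape on a schedule - go slow, be a good bot
    $schedule->command('browser:extract')->hourly();

    // Download and process files every morning
    $schedule->command('browser:download')->dailyAt('06:00');
}
```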
You can define custom console commands in Laravel. This is a great place for triggering the browser automation. It allows you to run automation on-demand and then you can hook your commands into the scheduler to run them on an automated schedule.
This is how I run my all my browser scraping
Laravel provides a queue with multiple drivers. You can trigger web scraping by dispatching a queue event. We will go over an example of this soon.
Laravel has everything you need to run queue workers, you don’t need to reach for anything outside of laravel to get a queue going apart from the queue driver you use, such as installing redis, beanstalkd, etc.
Example queue that dispatches a url to download. This is just a simple example to illustrate the flow.
This is the queue handler method that actually does something. In this case I am programmatically calling a console command.
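A sketch of that flow: a queued job whose handler programmatically calls the console command (the class and command names are illustrative):

```php
<?php

namespace App\Jobs;

use Illuminate\Bus\Queueable;
use Illuminate\Queue\SerializesModels;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Support\Facades\Artisan;

class DownloadUrl implements ShouldQueue
{
    use InteractsWithQueue, Queueable, SerializesModels;

    public $url;

    public function __construct($url)
    {
        $this->url = $url;
    }

    public function handle()
    {
        // Delegate the actual browser work to the console command
        Artisan::call('browser:download', ['url' => $this->url]);
    }
}
```

Dispatching it from a controller is then just `dispatch(new DownloadUrl($url));`.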
You might be wondering how you are supposed to run this automation in an environment without a browser window/screen.
This is an example usage of Xvfb in a Docker entrypoint. We will examine this file a little more closely at the end.
What do we need to run automation in docker?
We need a browser, we will use Google Chrome
Chromedriver - we are going to install and run chromedriver in the container so our code doesn’t have to worry about spinning up a process
Xvfb - we need to install xvfb
PHP - we will run our app in PHP
We will use a custom entrypoint script to run chromedriver, xvfb, and the laravel scheduler
A few specifics about the Docker setup I am demonstrating here
I am extending the official PHP docker images - https://hub.docker.com/_/php/
Docker Compose - this is how you can easily run/orchestrate your containers
Redis - for the queue
MySQL - we are not actually using MySQL in this demo for anything, but that’s what I would use for my application typically
```
Xvfb :99 &
export DISPLAY=:99
chromedriver &
php artisan browse:download
```
Demo the connecting files
docker-compose.yml
docker-compose up
tail -f storage/logs/laravel.log
Jump into the scheduler container
Run `php artisan browser:extract`
Run `php artisan browser:download`
Trigger the queue locally with `http POST http://localhost:8080/api/queue-example url="http://example.com"`