Scraping the web with Laravel, Dusk, Docker, and PHP

4,652 views

Jumpstart your web scraping automation in the cloud with Laravel Dusk, Docker, and friends. We will discuss the types of web scraping tools, the best tools for the job, and how to deal with running Selenium in Docker.

Code examples @ https://github.com/paulredmond/scraping-with-laravel-dusk

Published in: Technology

  1. 1. Scraping the Web with Laravel Dusk, Docker, and PHP By: Paul Redmond @paulredmond paulredmond
  2. 2. What You’ll Learn? ● Different types of scraping and when to use them ● Use Laravel Dusk for rapid browser automation ● Different Ways to Run Browser Automation ● Run Browser Automation in a Server Environment
  3. 3. What is Web Scraping? It’s a dirty job Gathering data from HTML and other media for the purposes of testing, data enrichment, and collection. https://flic.kr/p/8EZMNk
  4. 4. Hundreds of Billions Google “Scrapes” Hundreds of Billions (Or More) of Pages and other media on the web. https://www.google.com/search/howsearchworks/crawling-indexing/
  5. 5. Why Do We Need Scraping? ● Market analysis ● Gain a competitive advantage ● Increase learning and understanding ● Monitor trends ● Combine multiple offers into one portal (e.g., shopping comparisons) ● Analytics
  6. 6. Other Types of Data Scraping ● Competitor Scanning ● Military Intelligence ● Surveillance ● Metering
  7. 7. Other Types of Data Scraping
  8. 8. Other Types of Data Scraping
  9. 9. Is Web Scraping Legitimate? ● Yes, it can be. ● Scraping can have a negative/bad connotation, so... ○ Don’t do bad / illegal stuff ○ Be nice ○ Be careful ○ Be respectful
  10. 10. Keeping Web Scraping Legitimate ● Speed ● Caution ● Intent ● Empathy ● Honesty
  11. 11. Keeping Web Scraping Legitimate ● Speed. Go slow (watch requests/second) ● Caution. Code mistakes could create unintended load! ● Intent. Even if your intention is pure, always question. ● Empathy. Put yourself in the shoes of website owners ● Honesty. Don’t steal stuff (PII, copyrights, etc.)
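To make the "go slow" point concrete, here is a minimal sketch of a request throttle in plain PHP. The function name `throttleDelayMicros`, the limit of 2 requests per second, and the example.com URLs are all assumptions for illustration; none of them come from the talk.

```php
<?php

/**
 * Hypothetical helper: how many microseconds to sleep before the next request
 * so that requests are spaced at least 1 / $maxPerSecond seconds apart.
 */
function throttleDelayMicros(float $lastRequestAt, float $now, float $maxPerSecond): int
{
    $minInterval = 1.0 / $maxPerSecond;   // minimum seconds between requests
    $elapsed     = $now - $lastRequestAt; // seconds since the previous request
    $remaining   = $minInterval - $elapsed;

    return $remaining > 0 ? (int) round($remaining * 1000000) : 0;
}

// Usage: at most 2 requests per second over a list of placeholder URLs.
$urls = ['https://example.com/a', 'https://example.com/b'];
$lastRequestAt = 0.0;

foreach ($urls as $url) {
    usleep(throttleDelayMicros($lastRequestAt, microtime(true), 2.0));
    $lastRequestAt = microtime(true);
    // ... fetch $url here with the HTTP client of your choice ...
}
```

Keeping the delay calculation separate from `usleep()` makes the pacing logic easy to test without actually waiting.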
  12. 12. Keep Robots.txt in Mind...Be a Good Bot ● https://www.google.com/robots.txt ● https://www.yahoo.com/robots.txt ● https://github.com/robots.txt (see the top comment) * PHP Robots Parser: https://github.com/webignition/robots-txt-file
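As a rough illustration of what "being a good bot" involves, the sketch below checks a path against the `Disallow` rules that apply to all user agents. It is a deliberately simplified, hand-rolled check and not the API of the robots parser library linked above: it ignores `Allow` rules, wildcards, and agent-specific groups. The function name and the sample robots.txt content are made up.

```php
<?php

/**
 * Simplified check: is $path disallowed for all bots ("User-agent: *")?
 * Assumption: only plain prefix matching on Disallow lines is considered.
 */
function isDisallowedForAll(string $robotsTxt, string $path): bool
{
    $appliesToAll = false;
    $inAgentLines = false; // true while reading consecutive User-agent lines

    foreach (preg_split('/\R/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*$/', '', $line)); // strip comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            // A User-agent line that follows rules starts a new group.
            if (!$inAgentLines) {
                $appliesToAll = false;
            }
            $inAgentLines = true;
            $appliesToAll = $appliesToAll || $value === '*';
            continue;
        }
        $inAgentLines = false;

        if ($field === 'disallow' && $appliesToAll && $value !== ''
            && strpos($path, $value) === 0) {
            return true;
        }
    }

    return false;
}

// Usage with a made-up robots.txt.
$robots = "User-agent: *\nDisallow: /private/\n\nUser-agent: BadBot\nDisallow: /\n";
var_dump(isDisallowedForAll($robots, '/private/report')); // bool(true)
var_dump(isDisallowedForAll($robots, '/public'));         // bool(false)
```

For production use, a maintained parser is the safer choice, since real robots.txt files rely on the features this sketch skips.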
  13. 13. When Do We Scrape? ● What is the purpose? ● Can we live without the data? ● Do they have an API? ● If yes, does the API have everything we need? ● Do they allow scraping?
  14. 14. Downsides of Scraping ● Changes in the HTML/DOM break scrapers ● Changes in the HTML/DOM break scrapers ● Changes in the HTML/DOM break scrapers ● Changes in the HTML/DOM break scrapers ● Rich JavaScript apps can cause headaches ● Scraping can be process/memory and time intensive ● More manual processing/formatting of collected data than an API ● Changes in the HTML/DOM break scrapers
  15. 15. How Do We Overcome the Downsides? ● Match DOM/Selectors defensively ● It's a bit of an art that takes practice and experience ● Make sure that you handle failure ● Good alerting, notifications, and reporting ○ https://www.bugsnag.com/ ○ https://sentry.io/ ● Learn to accept that scraping will break sometimes
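One way to read "match selectors defensively" is to try several selectors in order of preference and treat "nothing matched" as an expected outcome rather than a crash. The sketch below does that with PHP's built-in DOM extension. The helper name `firstMatchText`, the sample markup, and the three XPath expressions are assumptions made for illustration.

```php
<?php

/**
 * Try several XPath expressions in order and return the first non-empty text
 * match, or null when none match (so the caller can alert instead of crashing).
 */
function firstMatchText(DOMXPath $xpath, array $expressions): ?string
{
    foreach ($expressions as $expression) {
        $nodes = $xpath->query($expression);
        if ($nodes !== false && $nodes->length > 0) {
            $text = trim($nodes->item(0)->textContent);
            if ($text !== '') {
                return $text;
            }
        }
    }

    return null;
}

// Made-up markup standing in for a scraped page.
$html = '<html><body><div class="product"><span itemprop="price"> $19.99 </span></div></body></html>';

libxml_use_internal_errors(true); // real-world HTML is rarely valid
$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

$price = firstMatchText($xpath, [
    '//span[@id="price"]',           // preferred selector (absent in this markup)
    '//*[@itemprop="price"]',        // fallback: semantic attribute
    '//div[@class="product"]//span', // last resort: structural match
]);

if ($price === null) {
    // Handle failure explicitly: log, notify, or report to an error tracker.
    error_log('Price selector no longer matches; the page markup may have changed.');
} else {
    echo $price, PHP_EOL; // prints "$19.99"
}
```

The null return is the hook for the alerting and reporting tools listed on the slide.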
  16. 16. Scraping Tools
  17. 17. 3 Categories of Web Scraping ● Anonymous HTTP Requests (HTML, Images, XML, etc.) ● Testing elements, asserting expected behavior ● Full Browser Automation Tasks
  18. 18. Anonymous Scraping - HTML, Images, etc. ● Fastest ● Easy to run and reproduce ● Just speaking HTTP ● PHP has good DOM parsing tools (Goutte)
  19. 19. Testing elements / asserting expected behavior ● May use HTTP to make basic response assertions ● May use a full browser (think testing Rich JavaScript Apps) ● Useful for user acceptance testing and browser testing
  20. 20. Full Browser Automation ● Like testing, but used for scraping ● Real browser or headless browser ● The closest thing to a real user ● Requires more tooling (e.g., Selenium, WebDriver, Phantom) ● Generally runs slowly
  21. 21. (Some) PHP Tools You Can Use for Scraping ● cURL ● Goutte (goot) ● Guzzle ● Httpful ● PHP-Webdriver ● file_get_contents()
  22. 22. What Other Tools Have You Used?
  23. 23. HTTP Scraping
  24. 24. Goutte is the Best Option (in my opinion) Pronounced “goot”
  25. 25. Goutte Overview ● Uses Symfony/BrowserKit to Simulate the Browser ● Uses Symfony/DomCrawler for DOM Traversal/Filtering ● Uses Guzzle for HTTP Requests ● Get and Set Cookies ● History (allows you to go back, forward, clear) Reference: https://github.com/FriendsOfPHP/Goutte
  26. 26. Goutte Capabilities ● Click on Links and navigate the web ● Extract data / filter data ● Submit forms ● Follows redirects (by default) ● Requests return an instance of Symfony\Component\DomCrawler\Crawler
  27. 27. Let’s Look at Some Examples of HTTP Scraping: Goutte Examples on Github
  28. 28. Testing and Web Scrapers
  29. 29. Ways you might use web scraping for testing ● Test bulk site redirects before a migration ○ Request the old URLs ○ Assert a 3xx response ○ Assert the redirect location returns a 200 ● Functional test suites (e.g., Symfony/Laravel) ● Healthcheck Probes / HTTP validation (e.g., 200 response)
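The bulk redirect check described above (request the old URL, assert a 3xx, assert the target returns 200) can be sketched as a small function. To keep the example runnable without network access, it takes a hypothetical `$fetch` callable that returns a status code and a `Location` value and does not follow redirects on its own. The canned responses and example.com URLs are made up; in practice `$fetch` would wrap a real HTTP client.

```php
<?php

/**
 * Check one old URL: expect a 3xx response whose Location target returns 200.
 * Returns null when the redirect is healthy, otherwise a description of the problem.
 * Assumption: $fetch(string $url) returns ['status' => int, 'location' => ?string].
 */
function checkRedirect(callable $fetch, string $oldUrl): ?string
{
    $first = $fetch($oldUrl);
    if ($first['status'] < 300 || $first['status'] > 399) {
        return "expected 3xx, got {$first['status']}";
    }
    if (empty($first['location'])) {
        return 'redirect without a Location header';
    }

    $target = $fetch($first['location']);
    if ($target['status'] !== 200) {
        return "redirect target returned {$target['status']}";
    }

    return null;
}

// Fake fetcher with canned responses, standing in for a real HTTP client.
$responses = [
    'https://old.example.com/about' => ['status' => 301, 'location' => 'https://new.example.com/about'],
    'https://new.example.com/about' => ['status' => 200, 'location' => null],
    'https://old.example.com/gone'  => ['status' => 404, 'location' => null],
];
$fetch = function (string $url) use ($responses): array {
    return $responses[$url] ?? ['status' => 0, 'location' => null];
};

foreach (['https://old.example.com/about', 'https://old.example.com/gone'] as $url) {
    $problem = checkRedirect($fetch, $url);
    echo $url, ': ', $problem ?? 'ok', PHP_EOL;
}
```

Injecting the fetcher also makes it straightforward to run the same check from a test suite or a console command.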
  30. 30. Example Functional Test Asserting HTML http://symfony.com/doc/current/testing.html#your-first-functional-test
  31. 31. Example Functional Test Asserting Status https://laravel.com/docs/5.4/http-tests#introduction
  32. 32. Example Functional Browser Test https://laravel.com/docs/5.4/dusk#getting-started
  33. 33. Full Browser Automation
  34. 34. Why do we need full browser automation tools?
  35. 35. Why do we need full browser automation tools? ● Simulate real browsers ● Test/Work with Async JavaScript applications ● Automate testing that applications work as expected ● Replace repetitive manual QA with automation ● Run tests in multiple browsers ● Advanced Web Scraping (e.g., filtered reports)
  36. 36. Notable Tools in Browser Automation ● Selenium ● W3C WebDriver (https://www.w3.org/TR/webdriver/) ● Headless Browsers ○ PhantomJS ○ Chrome --headless* ○ ZombieJS * Chromedriver isn’t quite working with --headless yet, at least for me ¯_(ツ)_/¯
  37. 37. Notable PHP Tools in Browser Automation ● Behat / Mink ● PHP-Webdriver ○ Codeception ○ Laravel Dusk (recently) ● Steward ● Any others you consider notable?
  38. 38. Notables in Other Languages... ● Python ○ Selenium WebDriver Bindings ○ BeautifulSoup ○ Requests: HTTP for Humans ○ Scrapy ● Ruby ○ Capybara ○ Nokogiri (DOM Parsing) ○ Mechanize Gem
  39. 39. Notables in Other Languages... ● JavaScript ○ Nightwatch.js ○ Zombie ○ PhantomJS ○ Webdriver.io ○ CasperJS ○ SlimerJS
  40. 40. Why Use PHP for Web Browser Automation? ● Developers don’t have to learn a new language (good/bad) ● More participation in teams already writing PHP ● Reduce cross-language mental overhead ● Browser Automation can be closer to your domain logic ● PHP-Webdriver is Good Enough™ (and backed by Facebook)
  41. 41. How Do I Run PHP Browser Automation?
  42. 42. How Do I Run PHP Browser Automation? ● `chrome --headless` - as of Chrome 59 ● Standalone Selenium ● WebDriver ● PhantomJS ● Any other ways? How Do I Run This Stuff?
  43. 43. Run Chrome Headless (Chrome 59 Stable)
      $ alias chrome="/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome"
      $ chrome --headless --disable-gpu --print-to-pdf https://www.github.com/
      $ open output.pdf
      $ chrome --headless --disable-gpu --dump-dom
      $ chrome --headless --disable-gpu --repl https://www.chromestatus.com/
      Reference: https://developers.google.com/web/updates/2017/04/headless-chrome
  44. 44. Getting to Know PHP-WebDriver: WebDriver Examples on Github
  45. 45. Running the Chromedriver/Phantom Process
  46. 46. Techniques for Triggering Browser Automation ● Eager tasks - run on a schedule ● On-demand - one-off console commands ● Event trigger - event queue ● What are some other ways?
  47. 47. Intro to Laravel Dusk
  48. 48. Intro to Laravel Dusk ● Browser testing for Laravel projects (primary use case) ● Browser abstraction on top of PHP-Webdriver <3 ● Doesn’t require JDK or Selenium (you can still use them) ● Uses standalone ChromeDriver
  49. 49. Do I HAVE to use Laravel to Use Dusk!?
  50. 50. Do I HAVE to use Laravel to Use Dusk!?
  51. 51. But I am going to show you why it's great for web automation stuff...
  52. 52. Dusk Basics: Elements
  53. 53. Dusk Basics: Links/Events
  54. 54. Dusk Basics: Form Inputs
  55. 55. Dusk Basics: Waiting for Elements
  56. 56. Quick Comparison to Our Earlier Vanilla PHP-Webdriver Example: Webdriver Dusk Examples on Github
  57. 57. Running Browser Automation
  58. 58. Key Laravel Features for Browser Automation ● Scheduler to run Commands on a schedule (eager) ● Create Custom Console Commands (one-off) ● Built-in Queues (triggered) ● Database Migrations for quick modeling of data storage ● Service Container for browser automation classes
  59. 59. Scheduler (app/Console/Kernel.php)
  60. 60. Custom Console Commands ● Easily run one-off commands ● Scheduler uses commands, giving you both ● Laravel uses the Symfony Console and adds conveniences ● Commands run my browser scraping
  61. 61. Queues ● Easily trigger web scraping jobs ● Queue jobs can trigger console commands ● Laravel has a built-in queue worker ● Redis is my preferred queue driver
  62. 62. Queues
  63. 63. Queues
  64. 64. Running Browser Automation in Docker
  65. 65. How Do I Run PHP Browser Automation on a Server!?
  66. 66. How Do I Run PHP Browser Automation on a Server!? XVFB
  67. 67. XVFB. What the What!? “Xvfb (short for X virtual framebuffer) is an in-memory display server for UNIX-like operating system (e.g., Linux). It enables you to run graphical applications without a display (e.g., browser tests on a CI server) while also having the ability to take screenshots.” Reference: http://elementalselenium.com/tips/38-headless
  68. 68. Example Xvfb Usage $ Xvfb :99 -screen 0 1920x1200x16 &
  69. 69. Example Xvfb Usage
  70. 70. Our Requirements for a Docker Scheduler ● Google Chrome Stable ● Chromedriver ● Xvfb ● PHP ● Entrypoint to run the scheduler
  71. 71. Our Docker Setup ● Docker Official php:7.1.6-cli (Scheduler) ● Docker Official php:7.1.6-fpm (Web Container) ● Docker Compose ● Redis ● MySQL
  72. 72. Why Not the Official Selenium Image? ● If you need File Downloads through Chrome ● Downloads through volumes aren’t ideal ● If you want the same PHP installation on app and scheduler (I do)
  73. 73. Scheduler Dockerfile ● Extends php:7.1.6-cli ● Installs Chrome Stable + a script to take Chrome out of sandbox mode ● Installs Chromedriver ● Installs Required PHP Modules ● Copies Application Files ● Runs a custom entrypoint script
  74. 74. Scheduler Dockerfile: Review the Scheduler Docker Files
  75. 75. How Do I Download Files through Chrome?
  76. 76. Extending Dusk Browser - Hooking it Together ● Provide our Own Browser class ● A DownloadsManager class for Chrome downloads ● A DownloadedFile Class to Work with Downloaded Files ● Service Container Bindings in AppServiceProvider ● Example Command ● Let’s see it in action...
  77. 77. Full Docker Setup in Action (Demo)
  78. 78. My Projects: Lumen Programming Guide http://www.apress.com/la/book/9781484221860 You will learn to write test-driven (TDD) microservices, REST APIs, and web service APIs with PHP using the Lumen micro-framework. * Zero bugs in the book source code ;)
  79. 79. My Projects: Docker for PHP Developers https://leanpub.com/docker-for-php-developers A hands-on guide to learning how to use Docker as your primary development environment. It covers a diverse range of topics and scenarios you will face as a PHP developer picking up Docker.
  80. 80. Final Questions?
  81. 81. Thank You!