Vilnius.js

A presentation about scraping in general, its techniques, and my personal experience with it.

  1. Attacking fire with fire, or how to get an API from any website
  2. I am Danielius Visockas. #givingBackToCommunity. Salut!
  3. Web harvesting
  4. Web harvesting: go to a page, extract the data, download a document
  5. Basic diagram of web harvesting
  6. Fundamental metrics: ◉ Freshness ◉ Age
  7. Revisiting policy: constant, or based on freshness
  8. “Edward Coffman et al. proposed that a crawler must minimize the fraction of time pages remain outdated.”
  9. Aaah, easy
  10. curl -i https://delfi.lt
  11. No SSL...
  12. curl -i http://delfi.lt
  13. Doesn’t work...
  14. Let’s try mobile: curl -i http://m.delfi.lt
  15. ………………. <script type="text/javascript">(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)})(window,document,'script','//www.google-analytics.com/analytics.js','ga');ga('create','UA-2428893-5','auto');var __ae=document.getElementsByClassName('delfi-author-name')[0]||document.getElementsByClassName('article-author-name')[0];if('undefined' !== typeof __ae){var au=__ae.textContent;au=au.replace(/[,;].*/g,'');au=au.replace(/^\s+|\s+$/g,'');au=au.toLowerCase();ga('set','dimension1',au);}if(m = navigator.userAgent.match(/Delfi\/([0-9.]+)/)){var ua='Other';if(/ip(hone|ad|od)/i.test(navigator.userAgent))ua='iOS App';else if(/android/i.test(navigator.userAgent))ua='Android App';else if(/(windows|msie)/i.test(navigator.userAgent))ua='Windows App';ga('set','dimension2',ua);}else if(/FBAV\//.test(navigator.userAgent))ga('set','dimension2','FBWV');else ga('set','dimension2','Browser');ga('set','dimension3',''+(window.__dabd && window.__dabd()));ga('send','pageview'); </script> <script type="text/javascript">var t=window.location.hostname.split('.').reverse();if(window._dct)_dct({s:'delfi/mobile',d:'t.'+t[1]+'.'+t[0]});</script> <script type="text/javascript"> var __ae=document.getElementsByClassName('delfi-author-name')[0]||document.getElementsByClassName('article-author-name')[0],au='',_sf_async_config = {}; if('undefined' !== typeof __ae){var au=__ae.textContent;au=au.replace(/,.*/g,'');au=au.replace(/^\s+|\s+$/g,'');au=au.toLowerCase();} _sf_async_config.uid=46335;_sf_async_config.domain='delfi.lt';_sf_async_config.sections='m.delfi';_sf_async_config.authors=au;_sf_async_config.useCanonical=true; (function(){function loadChartbeat(){window._sf_endpt=(new Date()).getTime();var e=document.createElement('script');e.setAttribute('language', 'javascript');e.setAttribute('type', 'text/javascript');e.setAttribute('src', '//static.chartbeat.com/js/chartbeat.js');document.body.appendChild(e);} var oldonload=window.onload; window.onload=(typeof window.onload != 'function') ? loadChartbeat : function() { oldonload(); loadChartbeat(); }; })(); </script> <script
  16. This looks familiar. Let’s use regex and it should be fine
  17. Overengineering
  18. Basic techniques: pick the right tool for the job
  19. Two ways to harvest: one-time (your computer is on) or automated (can run on a server)
  20. Copy and paste
  21. Client-side scripting
  22. Extensions and bookmarks!
  23. Online scrapers
  24. The fun part: automated scraping
  25. “Don’t forget to watch the network tab”
  26. Fetching of websites
  27. Extraction of data: Cheerio
  28. But then it all changed when the Fire Nation attacked
  29. I found a girl in Kaunas...
  30. 7 seconds: Traukiniobilietas.lt response time
  31. That’s
  32. Five
  33. Seconds
  34. More
  35. Than
  36. It
  37. Takes
  38. To
  39. Say
  40. Seven
  41. Seconds
  42. Screenshot: Traukiniobilietas.lt didn’t load...
  43. So I decided to learn React and built an app that helps you find trips
  44. Want big impact? Use big image. How do I get the data?!
  45. Headless browsers
  46. Bring together the best of both worlds
  47. I used Casper.js ◉ Runs on PhantomJS ◉ Resource intensive ◉ Can replicate everything ◉ Takes a bit longer ◉ DoS’ed traukiniobilietas… ◉ Works
  48. “So basically, you have to pick the right tool for the job” #noFreeLunchTheory
  49. Legal stuff
  50. Security: CAPTCHAs and friends...
  51. Interesting ideas ◉ Visual scraping using machine learning ◉ Macros + Casper.js (github.com/dvisockas/scrape)
  52. Please ask questions! Thank you! And if someone from TRAFI could help me with the traveling salesman problem...
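
Slides 6 to 8 name freshness and age as the fundamental metrics and mention a revisiting policy based on freshness. The sketch below is one way to read those terms, assuming the usual crawling-literature definitions: a local copy is fresh while the live page has not changed since it was fetched, and age is the time elapsed since an unfetched change. The record fields (`fetchedAt`, `changedAt`), the example URLs, and the helper `pickNextToRevisit` are hypothetical and not taken from the talk. A real crawler does not know `changedAt` and has to estimate it.

```javascript
// Hypothetical bookkeeping for locally stored copies of pages.
// fetchedAt: when we last downloaded the page (ms).
// changedAt: when the live page last changed (ms); assumed known here.

// A copy is fresh if the live page has not changed since we fetched it.
function isFresh(page) {
  return page.changedAt <= page.fetchedAt;
}

// Age is 0 for a fresh copy, otherwise the time elapsed since the change.
function age(page, now) {
  return isFresh(page) ? 0 : now - page.changedAt;
}

// Fraction of fresh copies in the local collection.
function collectionFreshness(pages) {
  if (pages.length === 0) return 1;
  return pages.filter(isFresh).length / pages.length;
}

// A naive policy: revisit the stale page whose copy has the greatest age.
function pickNextToRevisit(pages, now) {
  const stale = pages.filter((p) => !isFresh(p));
  if (stale.length === 0) return null;
  return stale.reduce((worst, p) => (age(p, now) > age(worst, now) ? p : worst));
}

const pages = [
  { url: 'http://example.com/a', fetchedAt: 1000, changedAt: 500 },
  { url: 'http://example.com/b', fetchedAt: 1000, changedAt: 2000 },
  { url: 'http://example.com/c', fetchedAt: 1000, changedAt: 4000 },
];
const now = 5000;

console.log(collectionFreshness(pages));        // one of three copies is fresh
console.log(pickNextToRevisit(pages, now).url); // the copy that has been stale longest
```

A constant policy would instead revisit every page at the same fixed interval, regardless of how stale each copy is.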

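
Slides 15 to 17 joke about pointing a regex at raw markup. The sketch below shows the idea and why it is brittle. The HTML snippet, the `headline` class, and `extractHeadlines` are made up for illustration and do not come from delfi.lt or the talk.

```javascript
// Hypothetical markup standing in for a fetched page.
const html = `
  <h2 class="headline"><a href="/news/1">First story</a></h2>
  <h2 class="headline"><a href="/news/2">Second &amp; last story</a></h2>
  <h2 class='headline'><a href="/news/3">Different quotes</a></h2>
`;

// Pull link targets and titles out of the markup with a single regex.
function extractHeadlines(source) {
  const pattern = /<h2 class="headline"><a href="([^"]+)">([^<]*)<\/a><\/h2>/g;
  const results = [];
  for (const match of source.matchAll(pattern)) {
    results.push({ href: match[1], title: match[2] });
  }
  return results;
}

const headlines = extractHeadlines(html);
console.log(headlines);
```

The third headline is silently skipped because its attribute uses single quotes, and the `&amp;` entity in the second title is left undecoded. Gaps like these are the usual reason to hand the markup to an HTML parser such as Cheerio (slide 27) instead.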