This document discusses how to build a multi-threaded web crawler in Ruby to drastically increase efficiency. It introduces the key components of threads, queues, and mutexes. It then outlines the components of the web crawler app: a Crawler module to set up the environment and database connection, Crawler::Threads class to spawn threads and queue jobs, and models to store retrieved data. Running the crawler with 10 threads completes the same task of visiting 10 pages in 1.51 seconds compared to 10 seconds for a single thread. The document also discusses ensuring thread safety when outputting data.
2. Hi,
I’m Kamil Durski, Senior Ruby Developer at Polcode.
If improving Ruby skills is what you’re after, stick around. I’ll
show you how to use multiple threads to drastically increase
the efficiency of your application.
As I focus on threads, only the relevant code will be displayed in the slideshow.
Find the full source here.
4. Ruby programmers have easy access to threads thanks to
built-in support.
Threads can be very useful, yet for some reason they don’t
receive much love.
Where can you use threads to see their prowess first-hand?
Crawling the web is a perfect example! Threads let you save
much of the time you’d otherwise spend waiting for data from
the remote server.
5. I’m going to build a simple app so you can really understand
the power of threads. It will fetch info on some popular U.S.
TV shows (the one with dragons and the one with an
ex-chemistry teacher too!) from a bunch of websites.
But before we take a look at the code, let’s start with a few
slides of good old theory.
7. A multi-threaded app is capable of doing a lot of things at the
same time.
That’s because the app has the ability to switch between
threads, letting each of them use some of the process time.
But it’s still a single process.
The same thing goes for running many apps on a single-core
processor. It’s the operating system that does the switching.
8. Another big difference
Use threads within a single process and you can share memory
and variables between all of them, making development easier
Use multiple processes and processor cores and it’s no longer
the case – sharing data gets harder.
Check Wikipedia to find out more on threads.
9. Now we can go back to the TV shows. Aside from Ruby on Rails’
Active Record library for database access, all I’m going to use
are:
Three components from Ruby’s thread library:
1) Thread – the core class that runs multiple parts of code at
the same time,
2) Queue – this class will let me schedule jobs to be used by all
the threads,
3) Mutex – the role of the Mutex component is to synchronize
access to shared resources, so that only one thread can
touch a given resource at a time.
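These three primitives can be seen working together in a minimal, self-contained sketch (the doubling job here is just a placeholder, not part of the crawler):

```ruby
# Thread, Queue, and Mutex are all core Ruby classes - no requires needed.
queue   = Queue.new  # thread-safe FIFO holding the jobs
mutex   = Mutex.new  # guards the shared results array
results = []

5.times { |i| queue << i }  # schedule five jobs up front

threads = 2.times.map do
  Thread.new do
    begin
      loop do
        job = queue.pop(true)                    # non-blocking pop
        mutex.synchronize { results << job * 2 } # shared state, guarded
      end
    rescue ThreadError
      # queue drained - let this worker finish
    end
  end
end

threads.each(&:join)
puts results.sort.inspect  # => [0, 2, 4, 6, 8]
```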
10. The app itself is also divided into three major components:
1) Module
I’m going to supply the app with a list of modules to
run. The module creates multiple threads and tells
the crawler what to do,
2) Crawler
I’m going to create crawler classes to fetch data
from websites,
3) Model
Models will allow me to store and retrieve data
from the database.
12. The Crawler module is responsible
for setting up the environment and
connecting to the database.
13. The autoload calls refer to major
components inside the lib/
directory. The setup_env method
connects to the database and
adds app/ directories to the
$LOAD_PATH variable and includes
all of the files under app/ directory.
A new Mutex instance is stored
inside the @mutex variable. We can
access it via Crawler.mutex.
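A sketch of what that module might look like; the directory layout and the commented-out ActiveRecord call are assumptions based on the slides, not the exact source:

```ruby
module Crawler
  ROOT = File.expand_path(__dir__)

  # Lazily load the major components from lib/
  autoload :Threads, File.join(ROOT, "lib", "crawler", "threads")

  def self.setup_env
    # The real app opens the database connection here, e.g.:
    # ActiveRecord::Base.establish_connection(db_config)

    # Add the app/ directories to $LOAD_PATH and require their files.
    Dir[File.join(ROOT, "app", "*")].each do |dir|
      $LOAD_PATH.unshift(dir)
      Dir[File.join(dir, "**", "*.rb")].each { |file| require file }
    end
  end

  def self.mutex
    @mutex ||= Mutex.new
  end
end
```

The memoized `Crawler.mutex` gives every part of the app the same lock object, which matters later when output needs to be synchronized.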
15. Now I’m going to create the core
feature of the app. I’m initializing a
few variables - @size, to know how
many threads to spawn, @threads
array to keep track of the threads,
and @queue to store the jobs to do.
16. I’m calling the #add method to add
each job to the queue. It accepts
optional arguments and a block.
Please, google block in Ruby if
you’re not familiar with the concept.
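Put together, the initializer and #add might look like this (the exact job format stored on the queue is my assumption):

```ruby
module Crawler
  class Threads
    def initialize(size = 10)
      @size    = size      # how many threads to spawn
      @threads = []        # keeps track of the spawned threads
      @queue   = Queue.new # stores the jobs to do
    end

    # Accepts optional arguments and a block; both are stored
    # together as a single job on the queue.
    def add(*args, &block)
      @queue << [block, args]
    end
  end
end
```

A caller can then schedule work with something like `threads.add("some-url") { |url| ... }`.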
17. Next, the #start method initializes
the threads and calls #join on each of
them. This is essential for the whole app
to work – otherwise, once the main
thread is done with its job, it would
instantly kill the spawned threads and
exit before they finish.
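A sketch of #start, with the worker loop from the next slide stubbed in as a private method (the method names beyond #start are my own):

```ruby
module Crawler
  class Threads
    def initialize(size = 10)
      @size    = size
      @threads = []
      @queue   = Queue.new
    end

    def add(*args, &block)
      @queue << [block, args]
    end

    def start
      @size.times do
        @threads << Thread.new { work }
      end
      # Without #join the main thread would exit immediately,
      # killing the spawned threads before they finish.
      @threads.each(&:join)
    end

    private

    def work
      loop do
        block, args = @queue.pop(true) # non-blocking, see next slide
        block.call(*args)
      end
    rescue ThreadError
      # queue is empty - this worker is done
    end
  end
end
```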
18. To complete the core functionality,
I’m calling the #pop method on the
queue to fetch a job, then running
its block with the arguments from
the earlier #add method. The true
argument makes sure that #pop runs
in non-blocking mode. Otherwise, I
would run into a deadlock, with a
thread waiting for a new job to be
added even after the queue is already
emptied (eventually raising the fatal
error "No live threads left. Deadlock?").
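The difference between the blocking and non-blocking #pop can be seen in isolation:

```ruby
queue = Queue.new
queue << :job

queue.pop          # blocking pop - fine while jobs remain

begin
  queue.pop(true)  # non_block = true: raise instead of waiting forever
rescue ThreadError => e
  puts "queue drained (#{e.class})"
end
```

With the plain blocking `pop`, a worker thread would sleep on the empty queue instead of raising, which is exactly the deadlock scenario described above.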
19. I can use the Crawler::Threads
class to crawl multiple pages at the
same time.
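Hypothetical usage, bundled with a minimal self-contained version of the class; the show slugs and the puts body stand in for the real crawler classes from the full source:

```ruby
module Crawler
  class Threads
    def initialize(size = 10)
      @size, @threads, @queue = size, [], Queue.new
    end

    def add(*args, &block)
      @queue << [block, args]
    end

    def start
      @size.times do
        @threads << Thread.new do
          begin
            loop do
              block, args = @queue.pop(true)
              block.call(*args)
            end
          rescue ThreadError
            # no jobs left
          end
        end
      end
      @threads.each(&:join)
    end
  end
end

threads = Crawler::Threads.new(10)

%w[breaking-bad game-of-thrones].each do |show|
  threads.add(show) do |slug|
    puts "fetched #{slug}" # a real crawler would request and parse here
  end
end

threads.start
```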
21. 10 seconds to visit 10 pages and fetch
some basic information. Alright, now
I’m going to try 10 threads.
22. All it took to do the same task is 1.51 s!
The app no longer wastes time doing nothing while waiting for
the remote server to deliver data.
Additionally, what’s interesting, the output order is different –
for the single thread it matches the config file, while for the
multi-threaded run it’s effectively random, as some threads finish
their jobs faster than others.
24. The code I used outputs information
using puts. That’s not a thread-safe
way of doing it, because puts
performs two separate writes:
- it outputs the given string,
- then it outputs the new line (NL)
character.
This may cause random instances of
NL characters appearing out of place,
as one thread switches in the middle
and another takes control before
the NL character is printed. See the
example below:
25. I fixed this with the mutex by creating
a custom #log method that wraps the
console output in it:
Now the console output is always
in order, as each thread waits for
puts to finish before another can print.
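A sketch of that #log helper; the method name comes from the slides, while the surrounding setup is my assumption:

```ruby
module Crawler
  def self.mutex
    @mutex ||= Mutex.new
  end
end

# puts performs two writes (the string, then the newline), so two
# threads can interleave them. Wrapping the call in the shared mutex
# makes the whole line print as one unit.
def log(message)
  Crawler.mutex.synchronize { puts message }
end

threads = 3.times.map do |i|
  Thread.new { log("thread #{i} done") }
end
threads.each(&:join)
```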
26. And that’s it.
Now you know more about how threads work.
I wrote this code as a side project, the topic of web crawling
being an important part of what I do. The previous version
included more features, such as proxy usage and TOR network
support. The latter improves anonymity but also slows
down the code a lot.
Thanks for your time and, again, feel free to tackle the entire
code at:
https://github.com/kdurski/crawler