Multi-threaded web
crawler in Ruby
Hi,
I’m Kamil Durski, Senior Ruby Developer at Polcode
If improving Ruby skills is what you’re after, stick around. I’ll
show you how to use multiple threads to drastically increase
the efficiency of your application.
As I focus on threads, only the relevant code will be displayed in the slideshow.
Find the full source here.
The (much) underestimated threads
Ruby programmers have easy access to threads thanks to
built-in support.
Threads can be very useful, yet for some reason they don’t
receive much love.
Where can you use threads to see their prowess first-hand?
Crawling the web is a perfect example! Threads let you save much of
the time you would otherwise spend waiting for data from a remote server.
I’m going to build a simple app so you can really understand
the power of threads. It will fetch info on some popular U.S.
TV shows (that one with dragons and an ex chemistry teacher
too!) from a bunch of websites.
But before we take a look at the code, let’s start with a few
slides of good old theory.
What’s the difference between
a thread and a process?
A multi-threaded app is capable of doing a lot of things at the
same time.
That’s because the app has the ability to switch between
threads, letting each of them use some of the process time.
But it’s still a single process.
The same thing goes for running many apps on a single-core
processor. There, it’s the operating system that does the switching.
Another big difference
Use threads within a single process and you can share memory
and variables between all of them, which makes development easier.
Use multiple processes and processor cores and that’s no longer
the case: sharing data gets harder.
Check Wikipedia to find out more on threads.
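To make the shared-memory point concrete, here is a minimal sketch. The `titles` array and its contents are hypothetical placeholders, not data from the app. Every thread reads the same array directly, and the main thread collects each result with `Thread#value`.

```ruby
# Hypothetical data; any object created before the threads start is visible to all of them.
titles = %w[show_a show_b show_c]

threads = titles.each_index.map do |i|
  # Each thread reads the same `titles` array; nothing is copied or sent between threads.
  Thread.new { titles[i].upcase }
end

# Thread#value waits for the thread to finish and returns the result of its block.
upcased = threads.map(&:value)
puts upcased.inspect # prints ["SHOW_A", "SHOW_B", "SHOW_C"]
```

With separate processes, the same exchange would need pipes, sockets, files, or a database in between.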
Now we can go back to the TV shows. Aside from Ruby on Rails’
Active Record library for database access, all I’m going to use
are three components from Ruby’s thread library:
1) Thread – the core class that runs multiple parts of code at
   the same time,
2) Queue – this class will let me schedule jobs to be picked up by
   all the threads,
3) Mutex – the role of the Mutex component is to synchronize
   access to shared resources. Thanks to that, a thread can finish
   working with a resource before another thread gets to touch it.
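The three classes can be seen working together in a small sketch. The job values and variable names below are made up for illustration and are not taken from the crawler. Four jobs go into a Queue, four threads take one job each, and a Mutex guards the shared total.

```ruby
jobs  = Queue.new
lock  = Mutex.new
total = 0

4.times { |n| jobs << n + 1 } # schedule the jobs 1, 2, 3 and 4

workers = 4.times.map do
  Thread.new do
    n = jobs.pop                    # one job per thread, so a blocking pop cannot hang here
    lock.synchronize { total += n } # only one thread at a time updates `total`
  end
end
workers.each(&:join)

puts total # prints 10
```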
The app itself is also divided into three major components:
1) Module
   I’m going to supply the app with a list of modules to
   run. The module creates multiple threads and tells
   the crawler what to do,
2) Crawler
   I’m going to create crawler classes to fetch data
   from websites,
3) Model
   Models will allow me to store and retrieve data
   from the database.
Crawler module
The Crawler module is responsible
for setting the environment and
connecting to the database.
The autoload calls refer to major
components inside the lib/
directory. The setup_env method
connects to the database, adds the
app/ directories to the $LOAD_PATH
variable and requires all of the
files under the app/ directory.
A new Mutex instance is stored
in the @mutex variable. We can
access it through Crawler.mutex.
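Since the slide's code is not reproduced in this text, here is a rough sketch of how such a module could look. The name `CrawlerSketch`, the autoload path and the `root` parameter are assumptions made for illustration, and the database connection is left out entirely.

```ruby
module CrawlerSketch
  # Hypothetical path; autoload only loads the file when the constant is first referenced.
  autoload :Threads, "crawler_sketch/threads"

  # One Mutex shared by the whole app.
  @mutex = Mutex.new

  class << self
    # Exposes the shared Mutex as CrawlerSketch.mutex.
    attr_reader :mutex

    # Adds every directory under <root>/app to $LOAD_PATH.
    # Connecting to the database and requiring the files is omitted in this sketch.
    def setup_env(root = Dir.pwd)
      Dir.glob(File.join(root, "app", "*")).each do |path|
        next unless File.directory?(path)
        $LOAD_PATH.unshift(path) unless $LOAD_PATH.include?(path)
      end
    end
  end
end

puts CrawlerSketch.mutex.class # prints Mutex
```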
Crawler::Threads class
core feature
Now I’m going to create the core
feature of the app. I’m initializing a
few variables - @size, to know how
many threads to spawn, @threads
array to keep track of the threads,
and @queue to store the jobs to do.
I’m calling the #add method to add
each job to the queue. It accepts
optional arguments and a block.
Look up blocks in Ruby if you’re
not familiar with the concept.
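A sketch of the constructor and #add described above might look as follows. `ThreadPoolSketch`, the `size:` keyword and its default of 10 are assumptions for illustration; the real class is Crawler::Threads in the linked repository.

```ruby
class ThreadPoolSketch
  attr_reader :size, :threads, :queue

  def initialize(size: 10)
    @size    = size      # how many threads to spawn
    @threads = []        # keeps track of the spawned threads
    @queue   = Queue.new # jobs waiting to be done
  end

  # Stores the block together with its arguments so a worker thread can call it later.
  def add(*args, &block)
    @queue << [block, args]
    self
  end
end

pool = ThreadPoolSketch.new(size: 2)
pool.add("http://example.com/1") { |url| puts "would fetch #{url}" }
pool.add { puts "job without arguments" }
puts pool.queue.size # prints 2
```

Nothing runs yet at this point: the blocks only sit in the queue until worker threads pick them up.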
Next, the #start method initializes
threads and calls #join on each of
them. It’s essential for the whole app
to work – otherwise once the main
thread is done with its job, it would
instantly kill spawned threads and
exit without finishing its job.
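The effect of #join can be tried in isolation. In this sketch the short `sleep` stands in for a slow network request and the names are hypothetical.

```ruby
finished = Queue.new

threads = 3.times.map do |i|
  Thread.new do
    sleep 0.05    # stands in for a slow network request
    finished << i # record that this thread completed its work
  end
end

# Without this line the script could reach its end, and exit,
# before any of the threads has pushed its result.
threads.each(&:join)

puts finished.size # prints 3
```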
To complete the core functionality,
each thread calls #pop on the queue
to take a job, then runs the job’s
block with the arguments given to
the earlier #add call. The true
argument makes #pop run in
non-blocking mode. Otherwise, I
would run into a deadlock, with the
thread waiting for a new job to be
added even after the queue has
already been emptied (eventually
raising the error "No live threads
left. Deadlock?").
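The worker loop can be sketched like this, assuming jobs are stored as [block, args] pairs; the two lambdas are placeholder jobs. A non-blocking `pop(true)` raises ThreadError once the queue is empty, and the sketch uses that as the signal for a thread to stop.

```ruby
queue = Queue.new
queue << [->(a, b) { a + b }, [1, 2]]
queue << [->(text) { text.upcase }, ["done"]]

results = Queue.new

workers = 2.times.map do
  Thread.new do
    loop do
      begin
        block, args = queue.pop(true) # non-blocking: raises ThreadError on an empty queue
      rescue ThreadError
        break                         # nothing left to do, let this thread finish
      end
      results << block.call(*args)    # run the job with its stored arguments
    end
  end
end
workers.each(&:join)

collected = []
collected << results.pop until results.empty?
puts collected.size # prints 2
```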
I can use the Crawler::Threads
class to crawl multiple pages at the
same time.
Now I can run some code to see what
all of it amounts to:
10 seconds to visit 10 pages and fetch
some basic information. Alright, now
I’m going to try 10 threads.
All it took to do the same task is 1.51 s!
The app no longer wastes time doing nothing while waiting for
the remote server to deliver data.
Interestingly, the output order is different too: with a single
thread it’s the same as in the config file. With multiple threads
it’s random, as some threads finish their job faster than others.
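The same effect can be reproduced without touching the network. In this sketch `fake_fetch` is a hypothetical stand-in that merely sleeps for 50 ms, so the timings illustrate the waiting behaviour and are not the measurements quoted above.

```ruby
require "benchmark"

# Hypothetical stand-in for fetching one page; the sleep simulates network latency.
def fake_fetch(page)
  sleep 0.05
  "page #{page}"
end

pages = (1..10).to_a
sequential = nil
threaded   = nil

# One page after another: the waits add up.
single = Benchmark.realtime do
  sequential = pages.map { |page| fake_fetch(page) }
end

# One thread per page: the waits overlap.
multi = Benchmark.realtime do
  threaded = pages.map { |page| Thread.new { fake_fetch(page) } }.map(&:value)
end

puts format("sequential: %.2f s, threaded: %.2f s", single, multi)
```

Collecting results with `Thread#value` in the original order keeps the returned array ordered, even though the threads themselves may finish in any order.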
Thread safety
The code I used outputs information
using puts. It’s not a thread-safe
way of doing this, as puts does two
separate things:
 - outputs a given string,
 - then outputs the new line (NL)
   character.
This may cause NL characters to
randomly appear out of place when
the thread is switched in the middle
and another one takes control before
the NL character is printed. See the
example below:
I fixed this with a mutex by creating
a custom #log method that wraps the
puts call in the mutex when writing
to the console:
Now the console output is always
in order, as each thread waits for
the current puts to finish.
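A sketch of such a helper, assuming a module-level Mutex; `LogSketch` and the optional `out` parameter are hypothetical names, not the repository's code.

```ruby
module LogSketch
  @mutex = Mutex.new

  # Writes one whole line at a time; other threads wait until the block has finished.
  def self.log(message, out = $stdout)
    @mutex.synchronize { out.puts(message) }
  end
end

threads = 5.times.map do |i|
  Thread.new { LogSketch.log("thread #{i} finished") }
end
threads.each(&:join)
```

The lines can still arrive in any order, but each one is written in full, together with its newline.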
And that’s it.
Now you know more about how threads work.
I wrote this code as a side project, since web crawling is
an important part of what I do. The previous version
included more features, such as the usage of proxies and TOR
network support. The latter improves anonymity but also slows
down the code a lot.
Thanks for your time and, again, feel free to tackle the entire
code at:
https://github.com/kdurski/crawler
