Understanding the characteristics of Web robot traffic is essential for building mechanisms that mitigate its impact on Web systems. This project presents an updated characterization of the navigational and session patterns of Web robot traffic across three Web servers in the United States, Europe, and the Middle East, using 30 different features. The results indicate that some features can be fitted to the same heavy-tailed model across the Web servers, whereas the best-fitting models for other features depend on the Web server. Because Web robots perform different tasks and website administrators set different security policies, some features of Web robot traffic cannot be universally modeled. The paper reporting this project, “Some (Non-)Universal Features of Web Robot Traffic,” has been accepted at the 52nd Annual Conference on Information Sciences and Systems (CISS).
1. Some (Non-)Universal Features
of Web Robot Traffic
Presentation by: Mahdieh Zabihimayvan
Advisor: Dr. Derek Doran
Department of Computer Science and Engineering, Kno.e.sis Research Center, Wright State
University Dayton, OH
3. What is a Web robot?
www.knoesis.org/mahdieh 3
A great number of modern Web-based technologies and services need to study, analyze, and collect information from massive Web repositories.
Such technologies and services employ Web robots (also called Web crawlers) to collect and scrutinize the dynamic content that these repositories contain.
4. What is a Web robot? (cont…)
[Figure: % of Web robot requests on Web servers; values shown: 49.5% and 60%]
But why?
To keep the data repositories up-to-date, contemporary Web robots need more
comprehensive searches, more specialized functionality, and more frequent visits.
5. What is a Web robot? (cont…)
Benign Web robots carry out useful tasks, including:
• Web content archiving
• link and HTML validation
• search engine indexing
• website mirroring
Malicious Web robots pose a threat to the performance, privacy of information, and security of Web servers. For instance:
• harvesting e-mail addresses
• performing click fraud
• accessing information behind ‘paywalls’ or login screens
6. Why should we characterize Web robot traffic?
• enable researchers to discover and compare the strategies different robots utilize in their navigation
• improve methods to distinguish between malicious and benign Web robots
• enable synthetic robot workloads for simulation studies to evaluate the capacity of a Web system
7. Related work on robot traffic characterization
• Dikaiakos et al. (2005): analyzing the activity of different robots belonging to Google, AltaVista, Inktomi, FastSearch, and CiteSeer
• D. Doran and S. S. Gokhale (2010): examining in more detail the heavy-tailed trends in the Web robot traffic of a single Web server
• Calzarossa and Massari (2013): analyzing the properties of the traffic generated by some commercial Web robots
• Calzarossa and Massari (2013): characterizing the access patterns and navigation profiles of the clients of two Web servers
• Tan and Kumar (2002): proposing 26 features to distinguish between Web robots and human users
8. Limitations of our current understanding
• Most past studies examine traffic at a single Web server
  Why this is not good
• Present understanding is based on studying a limited, selected number of Web robots
  Why this is not good
• Major studies were carried out at least half a decade ago
  Why this is not good
9. This Work
We seek to update our understanding of Web robot traffic
Study design:
Data set name   # of requests   # of sessions   Avg. session length (sec)   Avg. # of requests per session
WSU             5,232,765       25,680          551.15                      97
Pav             115,211         7,756           397.83                      15
IR              749,278         39,200          94.8                        10
10. Sample Features
Feature name      Description
Behavioral features
%HEAD             % of requests using HEAD
%GET              % of requests using GET
%POST             % of requests using POST
%4XX              % of requests receiving 4XX in response
%SF-StatusCode    % of switching factor of status code
%SF-HttpMethod    % of HTTP methods used in requests
Session features
#Requests         The number of HTTP requests sent
Session time      Time difference between the first and last requests
%Night            % of requests sent between 12 a.m. and 7 a.m.
%Day              % of requests sent between 7 a.m. and 11:59 p.m.
Data              Sum of data requested
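As a rough illustration of how a few of these features could be computed for one session, the sketch below assumes each request is a (timestamp, method, status) tuple and that night means midnight up to 7 a.m. The record layout and the function name `session_features` are hypothetical and are not taken from the study.

```python
from datetime import datetime


def session_features(requests):
    """Compute a few per-session features.

    `requests` is a non-empty list of (timestamp, method, status) tuples.
    This record layout is an assumption made for illustration only.
    """
    n = len(requests)

    def pct(predicate):
        # Percentage of requests in the session that satisfy `predicate`.
        return 100.0 * sum(1 for r in requests if predicate(r)) / n

    times = [ts for ts, _, _ in requests]
    return {
        "#Requests": n,
        # Seconds between the first and last request of the session.
        "Session time": (max(times) - min(times)).total_seconds(),
        "%HEAD": pct(lambda r: r[1] == "HEAD"),
        "%GET": pct(lambda r: r[1] == "GET"),
        "%POST": pct(lambda r: r[1] == "POST"),
        "%4XX": pct(lambda r: 400 <= r[2] <= 499),
        # Assumption: night is 00:00 to 06:59, day is 07:00 to 23:59.
        "%Night": pct(lambda r: r[0].hour < 7),
        "%Day": pct(lambda r: r[0].hour >= 7),
    }


example = [
    (datetime(2018, 1, 1, 6, 59, 0), "HEAD", 200),
    (datetime(2018, 1, 1, 7, 0, 0), "GET", 200),
    (datetime(2018, 1, 1, 7, 0, 30), "GET", 404),
    (datetime(2018, 1, 1, 7, 1, 0), "POST", 200),
]
features = session_features(example)
```

The switching-factor features are left out on purpose, since their exact definition is not given on the slide.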
11. Characterizing Web robot traffic
1. We consider a collection of feasible distributions that may characterize different features of Web robot traffic.
The candidate distributions cover:
• discrete or continuous support
• symmetric shapes (the mean, median, and mode occur at the same point)
• asymmetric shapes (the possibility of heavy- and long-tailed trends)
Description                               Distributions
Discrete support                          Binomial, Geometric, Poisson, Discrete uniform
Infinite, continuous support/Symmetric    Logistic, Normal, Continuous uniform, Gaussian q, Bimodal
Infinite, continuous support/Asymmetric   Lognormal, Exponential, Extreme value, Gamma, Generalized extreme value, Weibull, T location-scale, Generalized Pareto
12. Characterizing Web robot traffic
2. Use maximum likelihood estimation to identify the parameters of each candidate distribution
3. Employ Vuong’s closeness test to evaluate, for all pairs of distributions, whether one distribution fits the data better than the other
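Steps 2 and 3 can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses the exponential and lognormal distributions as stand-ins because their maximum likelihood estimates have closed forms, whereas the study's candidates (for example Generalized Pareto or Generalized extreme value) would typically need numerical optimization. The Vuong statistic is computed in its basic form, without a correction for the number of parameters, and all function names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def fit_exponential(xs):
    """MLE of the exponential rate: 1 / sample mean."""
    return len(xs) / sum(xs)


def fit_lognormal(xs):
    """MLE of lognormal (mu, sigma): mean and std of log(x), dividing by n."""
    logs = [math.log(x) for x in xs]
    mu = sum(logs) / len(logs)
    var = sum((v - mu) ** 2 for v in logs) / len(logs)
    return mu, math.sqrt(var)


def exp_logpdf(x, rate):
    return math.log(rate) - rate * x


def lognormal_logpdf(x, mu, sigma):
    z = (math.log(x) - mu) / sigma
    return -math.log(x * sigma * math.sqrt(2 * math.pi)) - 0.5 * z * z


def vuong_test(loglik_a, loglik_b):
    """Basic Vuong test for non-nested models.

    Takes pointwise log-likelihoods of the same observations under models
    A and B. A positive z favors A, a negative z favors B; p is two-sided.
    Assumes the pointwise differences are not all identical.
    """
    n = len(loglik_a)
    diffs = [a - b for a, b in zip(loglik_a, loglik_b)]
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / n
    z = math.sqrt(n) * mean / math.sqrt(var)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# Synthetic heavy-tailed sample standing in for one traffic feature.
random.seed(0)
sample = [random.lognormvariate(0.0, 1.5) for _ in range(5000)]

rate = fit_exponential(sample)
mu, sigma = fit_lognormal(sample)
ll_lognormal = [lognormal_logpdf(x, mu, sigma) for x in sample]
ll_exponential = [exp_logpdf(x, rate) for x in sample]
z, p = vuong_test(ll_lognormal, ll_exponential)
```

Repeating the comparison for every pair of candidates, and keeping the distribution that is never significantly beaten, is one plausible way to arrive at a best fit per feature.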
14. Universal Web robot features
Intriguingly, many features of robot traffic follow identical distributions around the world
Distribution name   Feature name
GP                  Session time, %Night, %Day, %NullReferrer, #Requests, %HEAD, %GET, %304, %CSR, %Others
GEV                 %Images, %BinaryDocs, %Multimedia, HTML/Image, %SF-FileType, %SF-csbytes, %SF-referrer
GEV: Generalized Extreme Value
GP: Generalized Pareto
15. Non-Universal Web robot features
Yet many features follow different types of distributions depending on the Web server
                  Distribution name
Feature name      WSU    Pav    IR
Data              TLS    GEV    GEV
SD_RPD            GP     GP     LGC
%POST             GP     GP     TLS
%4XX              GP     GP     GEV
%2XX              GP     GP     GEV
%SF-StatusCode    GP     GP     TLS
%SF-HttpMethod    GP     GP     TLS
%Compressed       GEV    TLS    GP
%Exe              GP     TLS    LGC
%RD               GP     LGC    GEV
TLS: T-location Scale
GEV: Generalized Extreme Value
GP: Generalized Pareto
LGC: Logistic
16. Request Type Behaviors
We also note non-uniform request type patterns across the three Web servers
Investigating the difference in %POST among the three data sets in more detail:
• Examining the HTTP methods used by Web robots on each server
[Figure: Markov chains of the HTTP methods used by Web robots; panels: WSU, Pav, IR]
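A Markov chain of this kind can be estimated by counting transitions between consecutive HTTP methods. The sketch below is a minimal illustration, assuming each session is available as an ordered list of method names; the function name `transition_probabilities` and the input layout are hypothetical.

```python
from collections import Counter, defaultdict


def transition_probabilities(sessions):
    """Estimate first-order transition probabilities between HTTP methods.

    `sessions` is a list of sessions, each an ordered list of method names.
    Transitions are counted within a session only, never across sessions.
    """
    counts = defaultdict(Counter)
    for methods in sessions:
        for current, following in zip(methods, methods[1:]):
            counts[current][following] += 1

    probs = {}
    for state, following_counts in counts.items():
        total = sum(following_counts.values())
        probs[state] = {m: c / total for m, c in following_counts.items()}
    return probs


toy_sessions = [
    ["GET", "GET", "HEAD", "GET"],
    ["HEAD", "POST", "HEAD"],
]
probs = transition_probabilities(toy_sessions)
```

In this toy input, `probs["GET"]["GET"]` is the GET self-loop probability and `probs["HEAD"]["POST"]` is the HEAD-to-POST transition probability, the quantities compared across servers.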
17. Request Type Behaviors
Universal features:
1. Self-loops of HEAD and GET and transitions between these states are approximately similar across the data sets, as expected of robots that simply request information.
2. A small but appreciable number of transitions to POST exist from HEAD (on all data sets) and from GET (except for IR).
• It is surprising to find robots submitting POST requests, which are used to submit resources to a Web server.
• Robots are more likely to issue a HEAD request after a POST request, to get information about other resources before requesting them.
Non-Universal features:
In IR, the transition probabilities from POST to POST are significantly lower. One possible reason is the security policies enforced by this university against some known robots that intend to submit malicious resources to the server.
18. Summary of Key Findings
• Characterized 30 different features of Web robot traffic across three Web servers around the world
• Conducted the experiments on three large data sets from three different countries
• Found some features that follow similar heavy-tailed models and may well be universal across all Web robot traffic
• Found some differences among the Web robots of the data sets
19. Future Work
• Exploring the theoretical implications of the similar and dissimilar features considered in this paper
• Investigating the intuitive arguments behind the contrast in Web robot traffic
• Extending this study to characterize benign and malicious Web robots separately, which could help detect malicious Web robots and enhance the security of Web servers
• Conducting a similar characterization study on human Web traffic
20. Thank you for your attention!
Editor's notes
This suggests that Web robot behaviors may have changed remarkably since their characteristics were last studied.
We consider symmetric distributions, in which the values of variables occur at regular frequencies and the mean, median, and mode occur at the same point, to investigate models where most outcomes are clustered relatively close to the distribution's center. We also scrutinize asymmetric or skewed distributions to examine the possibility of heavy- and long-tailed trends in a feature. For example, the Lognormal, Gamma, and Weibull distributions feature parameters that can control the skewness of the distribution.