Nikolaos Draganoudis, MSc dissertation
- i -
Internet Traffic Measurement and Analysis
Nikolaos Draganoudis
Master of Science in Mobile and Satellite Communications
from the
University of Surrey
Department of Electronic Engineering
Faculty of Engineering and Physical Sciences
University of Surrey
Guildford, Surrey, GU2 7XH, UK
August 2008
Supervised by: Prof. Zhili Sun
Nikolaos Draganoudis 2008
DECLARATION OF ORIGINALITY
I confirm that the project dissertation I am submitting is entirely my own work and that
any material used from other sources has been clearly identified and properly
acknowledged and referenced. In submitting this final version of my report to the JISC
anti-plagiarism software resource, I confirm that my work does not contravene the
university regulations on plagiarism as described in the Student Handbook. In so doing I
also acknowledge that I may be held to account for any particular instances of uncited
work detected by the JISC anti-plagiarism software, or as may be found by the project
examiner or project organiser. I also understand that if an allegation of plagiarism is upheld
via an Academic Misconduct Hearing, then I may forfeit any credit for this module or a
more severe penalty may be agreed.
Dissertation Title
Internet Traffic Measurement and Analysis
Author Name
Nikolaos Draganoudis
Author Signature Date: 11/08/2008
Supervisor’s name:
Prof. Zhili Sun
ACKNOWLEDGEMENT
The writing of this dissertation has been a great academic challenge. Without the support, guidance and patience of the following people, this study would not have been completed. I owe my deepest gratitude to Prof. Zhili Sun, my supervisor, to my friends and colleagues, and finally to my family, who supported me and gave me the opportunity to study abroad at the University of Surrey.
ABSTRACT
In the last few years, major improvements have been observed in the field of telecommunications. As a result, mobile terminals have become faster, with larger capacity and smaller size than before. This has created great opportunities for developing the services that mobile phones can provide. Nowadays mobile phones are used not only for calls and text messages but also to browse websites, send email, listen to music and record videos.
For this dissertation, an emulator of a mobile device connected to the Internet through a laptop will be used. Web browsing from this emulator will be performed on the BBC's mobile web site, as the BBC web site is a rich source of information, frequently updated and well structured. Wireshark will also be used to capture the packets arriving at the emulator and to calculate the inter-arrival times between them.
Obtaining the appropriate literature background is essential for understanding and working in this field. This dissertation also provides experience in planning measurements, working with large amounts of collected data and extracting useful results from them. It will be examined whether the sizes of the web pages follow any known mathematical distribution, which will help to characterise the traffic produced by a web page response. Furthermore, the inter-arrival times of the incoming packets of a web page response will be examined to determine whether these packets follow a distribution. This can help us to understand the QoS provided by the web service provider.
The study and examination of the BBC's web sites will give useful information about the traffic generated and the time consumed to download their contents, which could be used as a guideline for providing improved Internet services with higher QoS. It will also be a useful tool for understanding mobile Internet services and their impact on the network's resources.
TABLE OF CONTENTS
Internet Traffic Measurement and Analysis ...............................................................i
Nikolaos Draganoudis................................................................................................i
Declaration of originality..........................................................................................ii
Acknowledgement....................................................................................................iii
Abstract ....................................................................................................................iv
Table of Contents ......................................................................................................v
List of Figures .........................................................................................................vii
1 Introduction ..........................................................................................................1
1.1 Background and Context...............................................................................1
1.2 Scope and Objectives ....................................................................................2
1.3 Achievements................................................................................................4
1.4 Overview of Dissertation ..............................................................................5
2 Literature Review.................................................................................................6
2.1 Introduction ...................................................................................................6
2.2 Introduction to Internet Protocol Stack ............................................6
2.3 IP protocol .................................................................................................9
2.4 TCP Protocol ...............................................................................................11
2.5 World Wide Web .........................................................................................13
2.6 Mathematical Distributions for the Analysis...............................................14
2.6.1 Power Law and Pareto distributions.......................................................14
2.6.2 Normal Distribution................................................................................15
2.7 Summary .....................................................................................................16
3 Internet Traffic Measurements And Methodology .............................................17
3.1 Methodology of measurements ...................................................................17
3.1.1 Target of the measurements....................................................................17
3.1.2 Measurement tools..................................................................................18
3.2 Performed Measurements............................................................................22
3.3 Summary .....................................................................................................25
4 BBC’S Web Site Traffic Measurements And Analysis ......................................26
4.1 General Analysis of the BBC’s Web Site categories...................................26
4.2 Analysis of the BBC’s Web Site and Mathematical Distributions..............28
4.3 Conclusions.................................................................................................38
5 Measurements And Analysis of The Inter-arrival Time of Packets of a Web Page
Response......................................................................................................................40
5.1 Method of the Analysis of the BBC’s Web Site Inter-arrival time..............40
5.2 Inter-arrival Time Measurements of the Web Pages ...................................41
5.2.1 Monday Measurements ..........................................................................41
5.2.2 Tuesday Measurements ..........................................................................45
5.2.3 Wednesday Measurements .....................................................................51
5.2.4 Thursday Measurements.........................................................................55
5.2.5 Friday Measurements .............................................................................59
5.2.6 Saturday Measurements..........................................................................63
5.2.7 Sunday Measurements............................................................................66
5.3 Measurements that fit to the Pareto distribution .........................................68
5.4 Conclusions.................................................................................................71
6 Conclusion..........................................................................................................73
6.1 Summary and Evaluation ............................................................................73
6.2 Future Work.................................................................................................74
References...............................................................................................................76
Appendix 1 - Work plan..........................................................................................78
Appendix 2 – Matlab Code .....................................................................................79
Appendix 3 – Content in bytes of BBC's web sites..................................................82
Appendix 4 – Inter-Arrival Time Measurements of BBC's web sites ......................83
LIST OF FIGURES
Figure 1. Protocol stack ......................................................................................................6
Figure 2. Encapsulation of data as it goes down the protocol stack [1]..............................8
Figure 3. IP header fields [1]...............................................................................................9
Figure 4. TCP header fields [1].........................................................................................11
Figure 5. S60 3rd Edition emulator ....................................................................18
Figure 6. BBC website explore.........................................................................................19
Figure 7. Wireshark traffic presentation ...........................................................................21
Figure 8. BBC categories that measurements will be performed .....................................22
Figure 9. BBC News subcategories that measurements will be performed......................23
Figure 10. BBC Sport subcategories that measurements will be performed ..................24
Figure 11. Week traffic of BBC’s main categories .........................................................26
Figure 12. Average values of contents for BBC News and Sport subcategories ............27
Figure 13. PDF and CDF of Education Stories...............................................................29
Figure 14. Pareto CDF versus Empirical CDF of Education Stories..............................29
Figure 15. Normal PDF and CDF of Education Stories..................................................29
Figure 16. PDF and CDF of News Top Stories...............................................................30
Figure 17. Pareto CDF versus Empirical CDF of News Top Stories..............................30
Figure 18. Normal PDF and CDF of News Top Stories..................................................31
Figure 19. PDF and CDF of Politics Stories...................................................................31
Figure 20. Pareto CDF versus Empirical CDF of Politics Stories ..................................32
Figure 21. Normal PDF and CDF of Politics Stories......................................................32
Figure 22. PDF and CDF of Sport Top Stories ...............................................................33
Figure 23. Pareto CDF versus Empirical CDF of Sport Top Stories ..............................33
Figure 24. Normal PDF and CDF of Sport Top Stories ..................................................33
Figure 25. PDF and CDF of Tennis Stories.....................................................................34
Figure 26. Pareto CDF versus Empirical CDF of Tennis Stories....................................34
Figure 27. Normal PDF and CDF of Tennis Stories .......................................................35
Figure 28. PDF and CDF of Football Top Stories...........................................................35
Figure 29. Pareto CDF versus Empirical CDF of Football Top Stories..........................36
Figure 30. Normal PDF and CDF of Football Top Stories .............................................36
Figure 31. PDF and CDF of Championship Stories........................................................37
Figure 32. Pareto CDF versus Empirical CDF of Championship Stories.......................37
Figure 33. Normal PDF and CDF of Championship Stories...........................................38
Figure 34. PDF and CDF of Monday’s BBC Home page...............................................41
Figure 35. Pareto CDF versus Empirical CDF of Monday’s BBC Home page..............41
Figure 36. Normal PDF and CDF of Monday’s BBC Home page..................................42
Figure 37. PDF and CDF of Monday’s News Top Story 1 .............................................42
Figure 38. Pareto CDF versus Empirical CDF of Monday’s News Top Story 1 ............43
Figure 39. Normal PDF and CDF of Monday’s News Top Story 1 ................................43
Figure 40. PDF and CDF of Monday’s News Top Story 3 .............................................44
Figure 41. Pareto CDF versus Empirical CDF of Monday’s News Top Story 3 ............44
Figure 42. Normal PDF and CDF of Monday’s News Top Story 3 ................................45
Figure 43. PDF and CDF of Tuesday’s News web page.................................................45
Figure 44. Pareto CDF versus Empirical CDF of Tuesday’s News web page................46
Figure 45. Normal PDF and CDF of Tuesday’s News web page....................................46
Figure 46. PDF and CDF of Tuesday’s News Top Story 2 .............................................47
Figure 47. Pareto CDF versus Empirical CDF of Tuesday’s News Top Story 2 ............47
Figure 48. Normal PDF and CDF of Tuesday's News Top Story 2 ..................................47
Figure 49. PDF and CDF of Tuesday’s Business Story 2 ...............................................48
Figure 50. Pareto CDF versus Empirical CDF of Tuesday’s Business Story 2 ..............48
Figure 51. Normal PDF and CDF of Tuesday’s Business Story 2 .................................49
Figure 52. PDF and CDF of Tuesday’s Football Top Story 1 .........................................49
Figure 53. Pareto CDF versus Empirical CDF of Tuesday’s Football Top Story 1 ........50
Figure 54. Normal PDF and CDF of Tuesday’s Football Top Story 1............................50
Figure 55. PDF and CDF of Wednesday’s BBC Home web page ..................................51
Figure 56. Pareto CDF versus Empirical CDF of Wednesday’s BBC Home web page .51
Figure 57. Normal PDF and CDF of Wednesday’s BBC Home web page.....................52
Figure 58. PDF and CDF of Wednesday’s Technology Story 1......................................52
Figure 59. Pareto CDF versus Empirical CDF of Wednesday’s Technology Story 1.....53
Figure 60. Normal PDF and CDF of Wednesday’s Technology Story 1.........................53
Figure 61. PDF and CDF of Wednesday’s Tennis Story 1 ..............................................54
Figure 62. Pareto CDF versus Empirical CDF of Wednesday’s Tennis Story 1 .............54
Figure 63. Normal PDF and CDF of Wednesday’s Tennis Story 1.................................54
Figure 64. PDF and CDF of Thursday’s BBC Home web page......................................55
Figure 65. Pareto CDF versus Empirical CDF of Thursday’s BBC Home web page.....55
Figure 66. Normal PDF and CDF of Thursday’s BBC Home web page ........................56
Figure 67. PDF and CDF of Thursday’s BBC News ......................................................56
Figure 68. Pareto CDF versus Empirical CDF of Thursday’s BBC News......................57
Figure 69. Normal PDF and CDF of Thursday’s BBC News .........................................57
Figure 70. PDF and CDF of Thursday’s Football Top Story 1 .......................................58
Figure 71. Pareto CDF versus Empirical CDF of Thursday’s Football Top Story 1.......58
Figure 72. Normal PDF and CDF of Thursday’s Football Top Story 1 ..........................58
Figure 73. PDF and CDF of Friday’s Business Story 2 ..................................................59
Figure 74. Pareto CDF versus Empirical CDF of Friday’s Business Story 2 .................59
Figure 75. Normal PDF and CDF of Friday’s Business Story 2.....................................60
Figure 76. PDF and CDF of Friday’s Formula Story 1...................................................60
Figure 77. Pareto CDF versus Empirical CDF of Friday’s Formula Story 1..................61
Figure 78. Normal PDF and CDF of Friday’s Formula Story 1......................................61
Figure 79. PDF and CDF of Friday’s BBC News web page...........................................62
Figure 80. Pareto CDF versus Empirical CDF of Friday’s BBC News web page..........62
Figure 81. Normal PDF and CDF of Friday’s BBC News web page..............................62
Figure 82. PDF and CDF of Saturday’s BBC Home web page ......................................63
Figure 83. Pareto CDF versus Empirical CDF of Saturday’s BBC Home web page .....63
Figure 84. Normal PDF and CDF of Saturday’s BBC Home web page .........................64
Figure 85. PDF and CDF of Saturday’s BBC Sport web page .......................................64
Figure 86. Pareto CDF versus Empirical CDF of Saturday’s BBC Sport web page.......65
Figure 87. Normal PDF and CDF of Saturday’s BBC Sport web page ..........................65
Figure 88. PDF and CDF of Sunday’s BBC Education Story 1......................................66
Figure 89. Pareto CDF versus Empirical CDF of Sunday’s BBC Education Story 1.....66
Figure 90. Normal PDF and CDF of Sunday’s BBC Education Story 1 ........................66
Figure 91. PDF and CDF of Sunday’s BBC Formula Story 1 ........................................67
Figure 92. Pareto CDF versus Empirical CDF of Sunday’s BBC Formula Story 1 .......67
Figure 93. Normal PDF and CDF of Sunday’s BBC Formula Story 1 ...........................68
Figure 94. PDF and CDF of Thursday’s Sport Top Story 1 ............................................68
Figure 95. Pareto CDF versus Empirical CDF of Thursday’s Sport Top Story 1 ...........69
Figure 96. Normal PDF and CDF of Thursday’s Sport Top Story 1...............................69
Figure 97. PDF and CDF of Monday’s News Top Story 1 .............................................70
Figure 98. Pareto CDF versus Empirical CDF of Monday’s News Top Story 1 ............70
Figure 99. Normal PDF and CDF of Monday’s News Top Story 1 ................................71
1 INTRODUCTION
1.1 Background and Context
The technology of information gathering, processing and distribution was the key technology of the 20th century. It brought the development of worldwide telephone networks, the birth of the still-growing computer industry and the development of satellite communications [18].
Under the old concept of computer systems, all the work from different users was processed by one big computer. Nowadays this concept has been completely abandoned, and its place has been taken by the "computer network", in which many autonomous, interconnected computers process the incoming work. The interconnected computers can exchange information through copper wire, fibre optics, microwaves or satellites. The information is exchanged in small units of data called packets. These networks of computers can take many different forms, sizes and shapes, such as wireless networks and wide area networks [1] [3].
In the early stages of its development, in the early 1980s, the Internet was a single network; its predecessor was the ARPANET (Advanced Research Projects Agency Network), developed by the United States Department of Defence. Now the Internet consists of thousands of different networks connected to each other, each of which provides common services to customers and follows common protocols. These networks are controlled by ISPs (Internet Service Providers), which are responsible for providing their customers with connectivity to the Internet. The Internet interconnects ISPs of different sizes, forming a hierarchical structure. The most common ISPs are the transport providers, which deal with the provision of a wide range of services to customers; there are also the backbone providers, which connect to many other ISPs and carry the traffic that customers produce, and the web hosting providers, which host web pages for customers. The relationships between the different ISPs are business relationships and are related to the quality and the type of service provided to the customer. Nowadays the Internet is not only
used to communicate with other people all over the world but is mainly used to make money by providing many different services. Organizations, businesses small and large, consumers and even individuals now see the Internet from a different perspective and prefer to do their business through it. All these increasing expectations require the Internet to become more and more reliable [1] [3] [4].
For that reason, many academic researchers, companies and other groups have focused their attention on the Internet traffic that customers generate. They have made more and more measurements in order to examine this traffic and produce useful results that help to improve the Internet network and Internet traffic performance management [18]. From the measurements we can see the network's response and behaviour under any upgrade or degradation of performance [2].
As mentioned in the previous paragraph, measurements are very important for understanding the demands of a service, so application-level measurements will be performed in this dissertation, trying to understand the service's use of the network, the demands of the service and the effects the service has on the network and its performance.
1.2 Scope and Objectives
The scope of this dissertation is Internet traffic measurement and analysis. As mentioned in the previous paragraphs, application-level measurements will be made in this dissertation. The advantage of application-level measurements is that they provide an overall view of the application's performance, which would not be as clear if the measurements were made at lower levels. More specifically, web browsing measurements will be performed by downloading web page contents.
In the last few years, major improvements have been observed in the field of telecommunications. As a result, mobile terminals have become faster, with larger capacity and smaller size than before. This has created great opportunities for developing the services provided by mobile phones. So nowadays mobile
phones are used not only for calls and text messages but also to browse websites, send email, listen to music and record videos. For that reason, an Internet service provider over a dial-up telephone line is no longer the only way a client can access the Internet. On the contrary, every user can access the Internet through a mobile phone, laptop or palm PC. As a result, Internet access has become more flexible [2] [3] [16].
For this reason, the measurements to be performed will concern the traffic that a mobile phone can produce, and we will use an emulator installed on a laptop connected to the Internet through the University of Surrey campus network. The platform emulates a mobile phone, and through it we have the ability to browse web pages. For the measurements we needed a service provider with a well-defined web page structure and rich, up-to-date contents. For that reason we decided to browse the web pages of the mobile edition of the BBC, a provider that has these characteristics.
Among the objectives of the dissertation are to understand and gain knowledge of the way the Internet works and responds to requests. That can be achieved more easily through the measurement procedure, as every packet the mobile phone sends and receives will be captured and analysed. The way the measurement results are stored and organized is another important parameter, as it may affect the extraction of the conclusions. It is also important to mention that another objective of this dissertation is to calculate the time consumed from the request of a web page until the end of the responses to that request, as well as the inter-arrival times between the packets of the same response. Furthermore, measurements of the total size of the web pages in bytes will be taken in this dissertation. The main objective is to observe the measurement results, both for the inter-arrival times and for the sizes of the web pages, and to determine through their analysis whether the results fit a mathematical distribution. During the analysis of the total content size of the web pages we will have the chance to see how the size of the data changes over a week-long period.
1.3 Achievements
The first step in coping with the dissertation was to gain the appropriate background, in order to become familiar with the topic of the dissertation and understand the requirements. It was also very important to review other work in the field of Internet traffic measurement and analysis. Through the literature review, the type of measurements to be performed became clear, and the final decision about the web sites to be measured was taken. It was also decided that the S60 3rd Edition SDK emulator and the Wireshark network packet analyzer would be used for the measurements.
After the first step, registration with Nokia was completed in order to obtain the rights to use the emulator, and the emulator was then installed on the laptop. Familiarization with the emulator and the options it provides was carried out, and some trial measurements were performed on the BBC's mobile web sites. At the same time, the incoming and outgoing packets were captured with Wireshark and then examined. After exploring the structure of the BBC's web site, systematic measurements were performed, with the results stored for the further analysis described in the following step of the dissertation.
The following achievement is the analysis of the data collected from the measurements. The measurement results are used to identify a pattern that can be used to categorize them and to try to fit them to a known mathematical distribution. The changes in the content of the pages over a week-long period were also examined and will be presented. Finally, the results are presented together with the mathematical distribution that best fits the measured data and represents the analysis of the BBC's mobile web page contents.
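To sketch the kind of distribution fitting used in the analysis chapters, the code below estimates Pareto parameters from a sample by maximum likelihood and compares the fitted CDF with the empirical CDF. The dissertation's analysis was done in Matlab; this Python sketch and its sample values are illustrative assumptions, not the measured data.

```python
import math

def fit_pareto_mle(data):
    """Maximum-likelihood estimates of the Pareto scale (xm) and shape (alpha)."""
    xm = min(data)
    alpha = len(data) / sum(math.log(v / xm) for v in data)
    return xm, alpha

def pareto_cdf(x, xm, alpha):
    """Pareto CDF: 1 - (xm/x)^alpha for x >= xm, and 0 below xm."""
    return 0.0 if x < xm else 1.0 - (xm / x) ** alpha

def empirical_cdf(data, x):
    """Fraction of samples less than or equal to x."""
    return sum(1 for v in data if v <= x) / len(data)

# Hypothetical page sizes in bytes (illustrative only)
sizes = [12000, 13200, 14100, 15500, 18800, 22500, 40100, 60200]

xm, alpha = fit_pareto_mle(sizes)
# Largest gap between empirical and fitted CDF (a Kolmogorov-Smirnov style check)
gap = max(abs(empirical_cdf(sizes, v) - pareto_cdf(v, xm, alpha)) for v in sizes)
```

A small `gap` suggests the Pareto model describes the sample well; the same comparison can be repeated with a normal CDF to decide which distribution fits better.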
1.4 Overview of Dissertation
The next chapter of the dissertation presents the literature review and the background concerning the Internet protocol stack and the protocols used to transmit and receive data through the Internet. The World Wide Web will also be presented. Finally, the mathematical distributions that are going to be used to characterise the collected data are explained and presented.
The third chapter introduces the reader to the technical part of the dissertation. First the tools that will be used for the measurements are presented, and then the target on which the measurements will be performed. The structure of the target web site is presented, together with the specific paths along which measurements will be performed.
The fourth chapter contains the measurements performed on the BBC's web site, more specifically those concerning the total size in bytes of the web pages. An analysis of the collected data is then performed and the mathematical distribution that best fits the measurements is chosen.
The fifth chapter contains the measurements performed on the BBC's web pages concerning the inter-arrival times between the received packets of a web page response. Through the analysis performed in this chapter, the mathematical distribution that best fits the measurements is chosen.
The sixth chapter contains the conclusions and the evaluation of this dissertation, as well as the future work that could be done in this field.
There are also four appendices at the end of the dissertation. The first appendix presents the work plan followed during the year, the second presents the Matlab code used to analyse the measurements, the third presents the table with the measurements of the total content size of the BBC's web pages, and the fourth presents the tables with the inter-arrival time measurements of the BBC's web pages.
2 LITERATURE REVIEW
2.1 Introduction
This chapter presents and reviews issues related to the project, in order to help the reader understand concepts in the field of the dissertation topic and gain the appropriate knowledge. First the Internet protocol stack will be presented, with a brief summary of every layer and its use. After that the IP (Internet Protocol), the TCP (Transmission Control Protocol) and the well-known WWW (World Wide Web) will be presented. Finally, we will present the mathematical methods that are going to be used in this dissertation to analyse the data.
2.2 Introduction to Internet Protocol Stack
As mentioned before, in order to understand the way the Internet works we have to examine the protocol stack that is implemented to send or receive a packet over the Internet. First we will briefly present the Internet protocol architecture and see how it is organised in layers. After that we will examine closely the protocols that are used, such as IP (Internet Protocol) and TCP (Transmission Control Protocol), and we will also see how the WWW (World Wide Web) works, as it is important for obtaining the appropriate background for this dissertation.
The figure below shows the protocol stack of the Internet.
Figure 1. Protocol stack
The lowest layer of the protocol stack is the Physical Layer, whose main function is the transmission of bits. For mobile phones the channel over which the bits are transmitted is the air, but for our measurements, as they are going to be made from a laptop connected to the Internet, the channel will be the copper wire.
The layer above the Physical Layer is the Data Link Layer, whose main purpose is to maintain reliable and efficient communication between two adjacent machines at this layer. One of the most important elements of this layer is the MAC (Medium Access Control) address: every computer that connects to the Internet has one, and it is unique worldwide. As it is not important for this dissertation to examine this layer further, we will only keep this in mind.
The next layer, above the Data Link Layer, is the Network Layer, whose main operation is to transmit packets from the source to the destination. In contrast with the Data Link Layer, which is concerned only with the transmission of a packet from one end of a link to the other, this layer deals with end-to-end transmission. As the function of this layer is very important, in the following pages we will examine its functionality and the protocols that exist on it in more detail.
Above the Network Layer is the Transport Layer. The function of this layer is also very important, as it is responsible for providing reliable and cost-effective data transport from the source to the destination. It also communicates with the Application Layer, receiving requests from it and delivering data packets to it. In the following pages we will examine in detail the protocol that is used to send and receive the data packets.
Finally, the layer that sits on top of all the others is the Application Layer. It is responsible for the communication of the various applications with the protocols below it. For this layer, too, we will later examine the main protocol that is used for browsing the Internet [1] [2] [7].
Now that all the layers have been presented, we will examine how they communicate with each other and the data they exchange. Starting from the Application Layer: it produces data streams, mainly generated by the user’s requests. The Transport Layer takes these data streams and fragments them into datagrams. The maximum size of each datagram is up to 64 Kbytes, but in practice the length of each datagram does not exceed 1460 bytes, so that it fits into an Ethernet packet together with the IP and TCP headers that we will see later. Each datagram then goes to the Network Layer, where the IP protocol is used with a connectionless approach, so every packet can follow a different path to the destination. After that the Data Link Layer follows, and finally the Physical Layer, where the bits are transmitted over the channel. Below we can see an example of the format of an IP packet with the header of each layer, from the Application header down to the Data Link header [1] [7].
Figure 2. Encapsulation of data as it goes down the protocol stack [1]
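The header-overhead arithmetic behind the 1460-byte figure quoted above can be sketched in a few lines of Python. This is only an illustrative sketch (not part of the dissertation's Matlab code); the 1500-byte value is the standard Ethernet payload limit.

```python
# How the 1460-byte payload figure follows from typical header sizes.
ETHERNET_MTU = 1500   # maximum Ethernet payload in bytes
IP_HEADER = 20        # IPv4 header without options (see Section 2.3)
TCP_HEADER = 20       # TCP header without options (see Section 2.4)

def max_tcp_payload(mtu=ETHERNET_MTU, ip=IP_HEADER, tcp=TCP_HEADER):
    """Bytes of application data that fit in a single Ethernet frame."""
    return mtu - ip - tcp

print(max_tcp_payload())  # -> 1460
```

If either header carries options, the usable payload shrinks accordingly, which is why 1460 bytes is a practical upper bound rather than a fixed rule.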
2.3 IP protocol
In this section we will examine the protocol that is mainly used in the Network Layer, namely IP (Internet Protocol). There are currently two versions, IPv4 and IPv6; the latter is the newer version, which provides a wider range of IP addresses and a less complex header than IPv4. We will examine IPv4, as it is currently used more widely than IPv6. As can be observed in the picture below, the IP header has a 20-byte fixed part and a variable-length optional part.
Figure 3. IP header fields [1]
Now we will briefly explain the fields of the protocol and their usage.
Version: This field contains the version of the IP protocol in use, which is necessary for communication between two machines that use different versions of IP.
IHL: This field gives the length of the header in 32-bit words, which is needed because the Options field has variable length.
Type of service: This field is mainly used in order to distinguish between different
classes of services, for example for voice we need fast and accurate delivery of the packet.
Total length: This field includes the total length of the packet, including the header and
the data.
Identification: This field lets the receiver determine which datagram a received fragment belongs to.
DF bit (Don’t Fragment): When this bit is set to 1 the datagram cannot be fragmented
by routers of the network.
MF bit (More Fragments): This bit indicates that more fragments of the same
datagram are expected.
Fragment offset: This field indicates the position of the fragment’s data within the original datagram.
Time to live: This field indicates the maximum number of hops that a datagram can make on its way to the final destination. The number is decremented by one every time a router forwards the datagram, and when it reaches zero the datagram is dropped by the network.
Protocol: This field indicates the upper-layer protocol to which IP should deliver the packet.
Header checksum: It verifies only that the header has no errors.
Source address: Indicates the IP address of the sender of the packet.
Destination address: Indicates the IP address of the receiver.
Options field: This field can be used to add more functions to the protocol, such as security, timestamping and source routing [1] [2] [7].
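As an illustration of how these fixed fields are laid out on the wire, the following Python sketch decodes the 20-byte fixed part of an IPv4 header. It is illustrative only (not part of the dissertation's tooling), and the sample header bytes and addresses are hand-made for the example.

```python
import struct

def parse_ipv4_header(raw: bytes) -> dict:
    """Decode the 20-byte fixed part of an IPv4 header (fields above)."""
    ver_ihl, tos, total_len, ident, flags_frag, ttl, proto, checksum, src, dst = \
        struct.unpack("!BBHHHBBH4s4s", raw[:20])
    return {
        "version": ver_ihl >> 4,
        "ihl": ver_ihl & 0x0F,            # header length in 32-bit words
        "total_length": total_len,
        "identification": ident,
        "df": bool(flags_frag & 0x4000),  # Don't Fragment bit
        "mf": bool(flags_frag & 0x2000),  # More Fragments bit
        "fragment_offset": flags_frag & 0x1FFF,
        "ttl": ttl,
        "protocol": proto,                # e.g. 6 = TCP
        "src": ".".join(str(b) for b in src),
        "dst": ".".join(str(b) for b in dst),
    }

# A hand-made sample header: version 4, IHL 5, DF set, TTL 64, protocol TCP (6).
sample = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40, 1, 0x4000, 64, 6, 0,
                     bytes([192, 168, 0, 1]), bytes([8, 8, 8, 8]))
h = parse_ipv4_header(sample)
print(h["version"], h["ttl"], h["src"])  # -> 4 64 192.168.0.1
```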
2.4 TCP Protocol
Moving to the Transport Layer, TCP (Transmission Control Protocol) and UDP (User Datagram Protocol) are the two main protocols in use. We will focus on the TCP protocol, as the World Wide Web runs over the HTTP protocol in the Application Layer and the TCP protocol in the Transport Layer. In the picture below we can see the TCP header.
Figure 4. TCP header fields [1]
Now we will briefly explain the fields of the protocol and their usage.
Source port and Destination port: These fields identify where the packets should be delivered in the upper layer, i.e. the application to which the packet belongs.
Sequence number and Acknowledgement number: These fields are used to ensure that all packets are transmitted reliably, without loss.
TCP header length: This field stores the number of 32-bit words in the TCP header. It makes clear where the header ends and where the data start, as the header can have variable length.
URG (Urgent) flag: This is set when the Urgent pointer field is in use, indicating that the packet carries urgent data.
ACK flag: This is set when we want to acknowledge a received packet. When the ACK flag is set to 0, the Acknowledgement number field is ignored.
PSH (PUSHed data) flag: When this flag is set to 1, the data are delivered to the Application Layer on arrival and are not buffered at this stage.
RST flag: This is used to reset a connection or reject an invalid segment.
SYN flag: This is used to establish a connection between two entities.
FIN flag: This is used to release a connection between two entities.
Window size: This field stores the number of bytes that the receiver is willing to accept from the transmitter.
Checksum field: This field includes a checksum of the header and data for extra
reliability.
Urgent pointer: This field shows the byte offset where urgent data are.
Options: This field is used for extra options that are not provided in the regular header [1] [2] [7].
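To make the flag layout concrete, here is a small Python sketch that decodes the ports, header length and the six flags listed above from the first 16 bytes of a TCP header. It is illustrative only; the port numbers and segment below are made up for the example.

```python
import struct

def parse_tcp_flags(raw: bytes) -> dict:
    """Decode ports, header length and the six classic flag bits."""
    src_port, dst_port, seq, ack, off_flags, window = \
        struct.unpack("!HHIIHH", raw[:16])
    flags = off_flags & 0x3F
    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "header_words": off_flags >> 12,  # header length in 32-bit words
        "URG": bool(flags & 0x20),
        "ACK": bool(flags & 0x10),
        "PSH": bool(flags & 0x08),
        "RST": bool(flags & 0x04),
        "SYN": bool(flags & 0x02),
        "FIN": bool(flags & 0x01),
        "window": window,
    }

# A hand-made SYN segment from port 50000 to port 80 (HTTP).
sample = struct.pack("!HHIIHH", 50000, 80, 0, 0, (5 << 12) | 0x02, 65535)
seg = parse_tcp_flags(sample)
print(seg["dst_port"], seg["SYN"], seg["ACK"])  # -> 80 True False
```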
2.5 World Wide Web
The Application Layer sits on top of all the other layers we have examined, and as this dissertation focuses on the browsing of Internet sites, we will concentrate on the World Wide Web, which uses the HTTP protocol.
One of the most important services of the Internet is the World Wide Web, which began in 1989 at CERN, the European centre for nuclear research. It became very popular all over the world, as it is easy for beginners to use and its interface is well designed. At the beginning it was designed so that the scientists of CERN could share their research and exchange ideas, as many of them were working in different countries, but the World Wide Web (WWW) grew out of these needs and came to be used by the entire world. This happened when CERN and M.I.T. signed an agreement setting up the World Wide Web Consortium, an organization responsible for developing the World Wide Web by standardizing protocols and encouraging interoperability between the companies that had developed browsers at the time, Netscape and Microsoft [1] [3].
The World Wide Web consists of a huge number of documents, also called Web pages, distributed all over the world. Every web page may contain several links to other pages around the world. In this way a complicated web of connections between the pages is formed, and every user can have access to them. As it would be very difficult to keep track of the path that has been followed to reach a page, the World Wide Web uses the URL (Uniform Resource Locator), a unique identifier for each Web page. A user can therefore simply remember the URL of a Web site in order to access it. As the World Wide Web grew faster and faster, applications that helped the users were developed: the Web browsers. With Web browsers it was easier to browse different sites and keep a record of the URLs of pages one may want to visit again. Browsers made the World Wide Web friendly to use and attracted more users [1] [2] [3].
2.6 Mathematical Distributions for the Analysis
The analysis of the measurements is the process in which the collected data are examined in order to determine whether they can be modelled by a known distribution or model. Such a model can then be used to characterize related phenomena or similar types of data. After examining several mathematical distributions, we will focus on the power-law, Pareto and Normal distributions for the scope of this dissertation. By the end of the dissertation the reason for this selection will be clear.
2.6.1 Power Law and Pareto distributions
In recent years, a significant amount of research has focused on showing that many physical and social phenomena follow a power-law distribution. Some examples of these phenomena are the World Wide Web [9], metabolic networks, Internet router connections, journal paper reference networks, and sexual contact networks [8]. There is sometimes confusion between the power law and the Pareto distribution, but we will make this clear in the next paragraphs [8] [9].
We will try to explain both the Pareto and the power law through an example from Lada A. Adamic [9], and make their similarities and differences clear. Take the distribution of income as an example: in the Pareto form, instead of asking what the r-th largest income is, we ask how many people have an income greater than x [14]. So we arrive at the equation P[X > x] ≈ x^(-k). For this reason we can say that Pareto’s law is given in terms of the cumulative distribution function (CDF), i.e. the number of events larger than x is an inverse power of x. What we call the power-law distribution tells us not how many people had an income greater than x, but how many people have an income of exactly x. It is therefore the probability density function (PDF) associated with the CDF given by Pareto’s law. From this we obtain P[X = x] ≈ x^(-(k+1)) = x^(-a), where k is the Pareto distribution shape parameter [9] [13].
Now we will explain how we are going to work with these distributions and try to fit the collected data. In order to compare the data with the Pareto law, we have to find the CDF of both the data and a Pareto distribution with parameters that approach the curve of the data’s CDF. We know from the theory that the Pareto CDF is given by the formula F(x) = 1 - (b/x)^a for x > b, and 0 otherwise, where ‘a’ is the shape parameter and ‘b’ is the scale parameter. As we saw before, a = k - 1, where ‘k’ is the exponent of the power law. In the related literature, the ‘b’ parameter is commonly taken to be the smallest value of the data being examined. In order to find the value of the parameter ‘a’, we will use a program by Aaron Clauset, Cosma Rohilla Shalizi and M. E. J. Newman that estimates the value of ‘k’ for which the power law best fits the supplied data. The program estimates ‘k’ for each possible minimum value x of the incoming values via the method of maximum likelihood and calculates the Kolmogorov-Smirnov goodness-of-fit statistic; it then selects the x with the minimum Kolmogorov-Smirnov statistic and reports the corresponding ‘k’ value. Generally the Kolmogorov-Smirnov method is used when the sample size of each test is small, as in our case where we have from 3 to 15 values per test. The KS test is based on the value K = sup_x |F*(x) - S(x)|, where F*(x) is the hypothesized cumulative distribution function and S(x) is the empirical distribution function based on the sampled data [8] [6] [13] [17]. After obtaining the closest value of ‘k’ for the data, we can find the ‘a’ parameter as a = k - 1. The next step is to compare the CDF of the data with the CDF of the Pareto distribution and see whether the data follow, and therefore fit, this distribution.
Appendix 2 presents the code of the Matlab program and the functions that I wrote in order to get the information from the main program and present the results and the Pareto distribution.
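The fitting procedure described above can be sketched as follows. This is a minimal Python re-implementation of the idea, given only for illustration under the assumptions stated in its comments; the dissertation itself used the Matlab program by Clauset, Shalizi and Newman, and the toy data below are invented.

```python
import math

def fit_power_law(data):
    """For each candidate x_min, estimate the power-law exponent k by
    maximum likelihood, measure the KS distance between the empirical CDF
    and the fitted Pareto-style CDF F(x) = 1 - (x_min/x)**(k-1), and keep
    the (k, x_min) pair with the smallest KS distance."""
    xs = sorted(data)
    best = None
    for xmin in sorted(set(xs)):
        tail = [x for x in xs if x >= xmin]
        n = len(tail)
        if n < 2:
            continue
        # Continuous maximum-likelihood estimate of the exponent.
        k = 1.0 + n / sum(math.log(x / xmin) for x in tail)
        # Simplified KS distance: largest gap between the empirical CDF
        # and the model CDF, evaluated at the sorted tail points.
        ks = max(abs((i + 1) / n - (1.0 - (xmin / x) ** (k - 1.0)))
                 for i, x in enumerate(tail))
        if best is None or ks < best[0]:
            best = (ks, k, xmin)
    return best  # (ks_distance, k_estimate, x_min)

# Toy data only; the real analysis used the measured page sizes.
sample = [1.2, 1.5, 2.1, 3.0, 4.4, 6.8, 9.5, 15.0, 27.0, 80.0]
ks, k, xmin = fit_power_law(sample)
```

Note that the fitted CDF uses the exponent k - 1, matching the relation a = k - 1 between the power-law exponent and the Pareto shape parameter stated above.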
2.6.2 Normal Distribution
Now we will present the Normal distribution and the important parameters we have to know in order to plot it. It is important to mention that all normal distributions are symmetric and have bell-shaped density curves with a single peak. The parameters that characterize this distribution are the mean value of the data, ‘μ’, and the standard deviation, ‘σ’, which is a measure of the dispersion of the data. The probability density function of the normal distribution is given by the formula f(x) = (1 / (σ√(2π))) · exp(-(x - μ)² / (2σ²)). From the probability function of the measured data we will see whether they fit this distribution [10].
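For reference, the density formula above can be written out directly as a minimal Python sketch (the dissertation's actual analysis was done in Matlab):

```python
import math

def normal_pdf(x, mu, sigma):
    """f(x) = (1 / (sigma * sqrt(2*pi))) * exp(-((x - mu)**2) / (2 * sigma**2))"""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# The curve is symmetric about the mean, where it peaks:
print(round(normal_pdf(0.0, 0.0, 1.0), 4))  # -> 0.3989
```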
2.7 Summary
In this chapter we presented the background that is necessary for this dissertation. We looked at the Internet protocol stack with a small introduction to each layer, then presented the IP (Internet Protocol), the TCP (Transmission Control Protocol) and the WWW (World Wide Web). Finally we presented the mathematical distributions that will be used to analyse the measurements. These aspects are very important in order to become familiar with the dissertation theme and understand what follows: first we have to know what we are going to measure, and then make the measurements and analyse them. The next chapter covers the methodology of the measurements, their target, and the tools that are going to be used.
3 INTERNET TRAFFIC MEASUREMENTS AND METHODOLOGY
In the previous chapter we saw useful terms related to the theoretical part of the dissertation topic, in order to become familiar with it and gain the appropriate background. In this chapter we will examine the technical part of the dissertation, which is related to the Internet traffic measurements.
3.1 Methodology of measurements
3.1.1 Target of the measurements
The first step before starting the measurements is to determine the target on which the measurements will be performed. The selection of the site is very important, as we want the data analysis to give us useful, meaningful results. The final choice of site for the measurements is the BBC’s website for mobile edition. The BBC is the main British broadcasting corporation, with worldwide recognition and acceptance [15]. The BBC’s mobile edition website tries to satisfy the user’s need to stay informed about the news while away from home. People can easily browse to the mobile website through their mobile phone or PDA and have access to BBC News, BBC Sport and other categories. The BBC mobile website thus offers useful, frequently updated information to many people, which makes it a very appropriate target for our measurements.
3.1.2 Measurement tools
3.1.2.1 S60 3rd Edition emulator
It would be difficult to perform the measurements through a mobile phone, as the operator would charge for the Internet browsing and it would also be difficult to process and store the data from the measurements, so it was decided to use a mobile phone emulator. The S60 3rd Edition SDK for C++ platform was chosen as the emulator for browsing the BBC’s mobile websites. After registration with Nokia, the platform was ready to be installed on the laptop. Then, through the laptop’s connection to the Internet, we could access the BBC’s website without being charged.
Among the many services that this emulator provides is the browser application, which we will mainly use for accessing the contents of the BBC’s website. It supports features such as HTML 4.01, XHTML, JavaScript 1.5, plug-in support and file upload over HTTP [11].
Below we can see the form of the emulator with the input keys and its menu icons.
Figure 5. S60 3rd Edition emulator
The emulator is user-friendly, and familiarity with its options can be gained quickly. In order to access the browser of the emulator, the Services icon must be pressed. After that, as we can see in figure 8, we have to type the address that we want to browse. We will now access the BBC’s webpage and explain the usage of the emulator’s diagnostic tool.
Figure 6. BBC website explore
As we can see in the above figure, the emulator provides a diagnostic tool that reports information about the traffic that has taken place, the total size of every web page that has been visited, and the type of the incoming files (text, photographs or videos), in the form of requests and responses. This will help us fulfil the part of the measurements concerning the size of the web pages.
3.1.2.2 Wireshark
As one of the main goals of this dissertation is to capture the inter-arrival times of the incoming packets for a web page request, a tool that allows us to measure these time intervals is necessary. This tool is Wireshark, a network packet analyzer. In general, Wireshark can be used by:
- network administrators to troubleshoot network problems
- network security engineers to examine security problems
- developers to debug protocol implementations
- people to learn network protocol internals [12]
For this dissertation, Wireshark will be used to capture every packet that comes from the Internet, as well as every request that we make and that goes to the Internet. From the picture below we can see that the time each packet was sent or received is captured, so from these timestamps we are able to compute the inter-arrival times of the received packets, up to the last packet of the server’s response.
Figure 7. Wireshark traffic presentation
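The computation performed on the captured timestamps is simply a first difference. A minimal Python sketch (the timestamps below are hypothetical values, standing in for Wireshark's 'Time' column):

```python
def inter_arrival_times(timestamps):
    """Gaps, in seconds, between consecutive captured packets."""
    return [t2 - t1 for t1, t2 in zip(timestamps, timestamps[1:])]

# Hypothetical arrival times (seconds) of packets in one server response:
times = [0.000, 0.042, 0.051, 0.130, 0.131]
gaps = inter_arrival_times(times)
print(len(gaps))  # one gap fewer than the number of packets
```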
3.2 Performed Measurements
Before starting the measurements it is important to make clear that the sample should be large enough and representative, and that the measurements have to be repeated many times. After gaining experience from the preparatory phase, it was decided to make the measurements for the two main categories of the BBC mobile web page, BBC News and BBC Sport. These categories are frequently updated and contain many subcategories. Moreover, we will focus on the more important subcategories of these two categories. The following tree graphs present the web pages that were chosen to be measured and their containment relationships.
Figure 8. BBC categories on which measurements will be performed
Figure 9. BBC News subcategories on which measurements will be performed
As we can see from the above graph, we will focus the measurements on six subcategories of the BBC News category (Top Stories, Technology, Politics, Entertainment, Business and Education) and then on three stories from every subcategory. These categories are assumed to concentrate the users’ preferences.
Figure 10. BBC Sport subcategories on which measurements will be performed
In the case of BBC Sport, we will focus on four subcategories (Top Stories, Motorsport, Football and Tennis), but here two of them contain further subcategories: Formula 1 and World Rally are contained in the Motorsport subcategory, and Top Stories, Premiership and Championship are contained in the Football subcategory. Measurements will be performed on the displayed stories.
Before starting the measurements it is also important to decide on the frequency of taking measurements from these web pages. For the best results this frequency should match the frequency at which the information on the pages is updated. After observing the BBC’s web site, it would not be wise to take more than one measurement per web page per day, because changes to the contents are rare. This is reasonable, as it is pointless and difficult to change the contents at such short intervals. So we decided to collect measurements for a week, at intervals of one day.
3.3 Summary
In this chapter we presented the methodology that we are going to follow for the measurements, specifying the target of the measurements and the reasons for this choice. We also presented the tools that are going to be used for the measurements, namely the Nokia S60 emulator and Wireshark. After the presentation of the tools and the way they are going to be used, we presented the specific web pages of the BBC’s mobile web site structure on which measurements are going to be performed. In the following chapter we will perform the measurements of the total content size of the web pages and the analysis of the collected data.
4 BBC’S WEB SITE TRAFFIC MEASUREMENTS AND ANALYSIS
In this chapter we will analyse and present the collected data from the measurements that were performed, in order to obtain a view of the form of the BBC’s web site traffic profile. For these measurements the emulator presented in Chapter 3 was used, together with its diagnostic tool, in order to get the total size of the web pages.
4.1 General Analysis of the BBC’s Web Site categories
In the following graph we can see the minimum, average and maximum content size, in bytes, of the main page of the BBC web site and of its two main categories, BBC News and BBC Sport.
Figure 11. Week traffic of BBC’s main categories (bar charts of the minimum, average and maximum content in bytes over a week for BBC Home, BBC News and BBC Sport)
From the above graph, we can see that News and Sport generate almost the same amount of traffic, and both generate lower traffic than the BBC Home web page. This is logical, as the front page contains a larger amount of information than its subcategories. In the following graph we present the average content sizes, in bytes, of the subcategories of BBC News and BBC Sport.
Figure 12. Average values of contents for BBC News and Sport subcategories (bar charts of the content in bytes over a week for the News subcategories, including Top Stories, Politics and Education, and for the Sport subcategories Top Stories, Motorsport, Football and Tennis)
From the above figure we can observe that in the News category almost all subcategories have average values between about 3000 and 3250 bytes, except Top Stories with 3500 bytes and Business with 3450 bytes. On the other hand, there is a big difference between the Football subcategory of Sport and the other Sport subcategories, whose average value is around 2900 bytes. This can be explained by football being more popular than the other sports, so the BBC pays more attention to it and provides more information in this field.
4.2 Analysis of the BBC’s Web Site and Mathematical Distributions
In this part of the dissertation we will analyse in depth the results of the measurements of the content in bytes of the BBC’s web pages, and we will examine whether the data could follow one of the mathematical distributions presented in Sections 2.6.1 and 2.6.2.
In the first steps of the analysis we tried to see whether the measurements could fit the Pareto distribution. To do that, we constructed the PDF (Probability Density Function) and the CDF (Cumulative Distribution Function) of the collected data; then, through the method described in Section 2.6.1, we found the parameters of the Pareto distribution, plotted the Pareto CDF and compared it with the empirical CDF of the collected data. We also compared the data with the Normal distribution, to see which distribution fits the data better. In the following pages we will present these results and assess the acceptance of the Pareto and Normal distributions.
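The comparison itself reduces to evaluating the two CDFs over the same set of page sizes. A minimal Python sketch of this step follows; the page sizes and the shape value 'a' below are placeholders, not measured values, and the dissertation performed this step in Matlab.

```python
def empirical_cdf(data):
    """Return (x, F(x)) points of the empirical CDF of the sample."""
    xs = sorted(data)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]

def pareto_cdf(x, a, b):
    """Pareto CDF from Section 2.6.1: F(x) = 1 - (b/x)**a for x > b, else 0."""
    return 1.0 - (b / x) ** a if x > b else 0.0

# Hypothetical page sizes in bytes; b is taken as the smallest observation,
# as described in Section 2.6.1, and a = k - 1 would come from the fit.
sizes = [5200, 6100, 7000, 8100, 8487, 9300, 10500, 12000]
b = min(sizes)
a = 1.5  # placeholder shape value for illustration only
gaps = [abs(f - pareto_cdf(x, a, b)) for x, f in empirical_cdf(sizes)]
print(max(gaps))  # the largest vertical gap between the two curves
```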
From the total set of graphs that were produced, we can say that the Pareto distribution does not fit the collected data on the content in bytes of the BBC’s web pages well. Of course there are some exceptions, which we will also present, where the Pareto fits the data well, but the majority of the graphs show that it does not, and that the appropriate distribution could be the Normal. We will start with the Education stories of the BBC News category. The following figure presents the PDF and the CDF of the Education stories measured over a one-week period; we can see the distribution of the web pages according to their total size in bytes.
Figure 13. PDF and CDF of Education Stories
Figure 14. Pareto CDF versus Empirical CDF of Education Stories
Figure 15. Normal PDF and CDF of Education Stories
It is obvious from the figure14 that the Pareto CDF is not following the curve of the
empirical CDF of the data as the two curves have only two common points and then the
differences between them increasing. On the other hand the figure15 shows that the
measurements of the Education stories feet very well to normal distribution with average
point equal to 8487 bytes and standard deviation equal to 1752.23. We can extract this
conclusion when we look the PDF of the data and the PDF of Normal distribution and
the CDF of the data and the CDF of Normal distribution.
Now we will examine the Top Stories of the BBC News category. The results of the
measurements come from three different top stories and represent the total content size of
the web pages.
Figure 16. PDF and CDF of News Top Stories
Figure 17. Pareto CDF versus Empirical CDF of News Top Stories
Figure 18. Normal PDF and CDF of News Top Stories
Figure 16 presents the PDF and CDF of the content in bytes of the BBC News Top Stories web pages, and Figure 17 presents the curves of the Empirical CDF of the data and the Pareto CDF, with the Pareto parameters set to bring it as close as possible to the Empirical CDF. Even then the differences between the two curves are obvious, and they have only two common points. On the contrary, comparing the PDF of the data with the PDF of the Normal distribution in Figure 18, there are many similarities in the shape of the graph, and the same holds for the CDF of the data and the CDF of the Normal distribution. The Normal distribution has a mean of 8123 bytes and a standard deviation of 1359 bytes.
The next example is the Politics stories of the BBC News. The measurements are composed of three different Politics stories.
Figure 19. PDF and CDF of Politics Stories
Figure 20. Pareto CDF versus Empirical CDF of Politics Stories
Figure 21. Normal PDF and CDF of Politics Stories
From Figure 20 we can see that the Pareto and Empirical CDF curves have different slopes, so the data of the Politics stories do not follow the Pareto distribution. On the other hand, Figures 19 and 21 show that the Normal distribution lies closer to the data and fits them well for both the PDF and the CDF. The Normal distribution parameters are a mean of 7919 bytes and a standard deviation of 1192 bytes.
Up to this point the measurements tend to fit the Normal distribution, but as the following pages show, there are also some measurements, although few, that fit the Pareto distribution well. The following figures present one such case.
Figure 22. PDF and CDF of Sport Top Stories
Figure 23. Pareto CDF versus Empirical CDF of Sport Top Stories
Figure 24. Normal PDF and CDF of Sport Top Stories
Figure 22 presents the PDF and CDF of the content size of the BBC Sport Top Stories, containing measurements from three different stories. From Figure 23 we can see that the Pareto distribution fits the collected data very well, and this can also be suspected from the PDF in Figure 22, as the data have a long tail, which is a characteristic of the Pareto distribution. On the contrary, the Normal distribution does not have a good shape, as it tries to cover all the data while a few points lie far from the majority of the measurements and away from the main bell-shaped curve. But, as we already said, there are rare cases in which the Pareto distribution fits the measurements better than the Normal distribution.
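The curve comparison behind plots like Figure 23 can be sketched with the Pareto CDF, F(x) = 1 - (xm/x)^alpha for x >= xm, and a Kolmogorov-Smirnov-style maximum gap between the two curves; the sample values and the scale and shape parameters below are assumed for illustration:

```python
# Pareto CDF: F(x) = 1 - (xm / x) ** alpha for x >= xm, else 0.
def pareto_cdf(x, xm, alpha):
    return 0.0 if x < xm else 1.0 - (xm / x) ** alpha

# Illustrative heavy-tailed sample: most days near 9,500 bytes, a few far out.
sizes = sorted([9300, 9400, 9500, 9600, 9700, 18500, 21000])
xm, alpha = min(sizes), 3.0  # assumed scale and shape parameters

# Largest vertical gap between the empirical CDF and the Pareto CDF:
# the smaller the gap, the better the Pareto curve tracks the data.
n = len(sizes)
gap = max(abs(i / n - pareto_cdf(x, xm, alpha)) for i, x in enumerate(sizes, 1))
print(f"max CDF gap = {gap:.3f}")
```

In practice the parameters would be tuned (as was done for the plotted Pareto curves) before reading anything into the gap.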
Now we will examine BBC web pages that belong to the BBC Sport category. We will start with the Tennis stories; the next graph presents the PDF and CDF of the Tennis stories according to their total content size.
Figure 25. PDF and CDF of Tennis Stories
Figure 26. Pareto CDF versus Empirical CDF of Tennis Stories
Figure 27. Normal PDF and CDF of Tennis Stories
For the Tennis stories it can be observed from Figure 26 that the Pareto distribution does not follow the Empirical CDF curve of the data, so it is not the appropriate distribution to characterize them. Comparing the Normal PDF and CDF of Figure 27 with those of Figure 25, we find a much greater similarity, and the Normal distribution characterizes the collected data more appropriately.
Continuing with the BBC Sport category, we will present the results for the Football Top Stories.
Figure 28. PDF and CDF of Football Top Stories
Figure 29. Pareto CDF versus Empirical CDF of Football Top Stories
Figure 30. Normal PDF and CDF of Football Top Stories
Figure 28 presents the PDF and CDF of the collected data for the BBC Football Top Stories pages, and Figure 29 presents the comparison between the Pareto CDF and the Empirical CDF of the data. From this comparison we can see the main differences between the two curves, which show that there is no fit with the Pareto distribution: the curves share only three points at the start, and then the distance between them increases. On the contrary, Figure 30 has many similarities with Figure 28 and follows the collected data with a better approximation.
On the following pages we present one more example of the measurements made on the BBC web pages: the Championship Stories, a subcategory of Football in the Sport category.
Figure 31. PDF and CDF of Championship Stories
Figure 32. Pareto CDF versus Empirical CDF of Championship Stories
Figure 33. Normal PDF and CDF of Championship Stories
From this last example we confirm that the Pareto distribution is not the appropriate distribution to characterize the measurements of the total content size of the BBC web pages, as can be seen from Figure 32 and the differences between the two curves. According to the results of the measurements we made and the graphs we presented, the Normal distribution is the more appropriate distribution for our measurements and fits the PDF and CDF of the collected data.
4.3 Conclusions
From the previous analysis of the measurements made on the total content size of the BBC web pages, we can say that the majority of them follow the Normal distribution and not the Pareto distribution. This conclusion can help the provider to offer a better service: since the average size of a web page is known, there is a known value for the number of bytes the user has to download to view the page, so the service provider can adjust the bandwidth the user needs to download it. The provider can also calculate the total resources that have to be provided, as there is always an estimate of the number of customers using the service and of the size of the web page.
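This provisioning idea can be made concrete with a back-of-the-envelope calculation; the target download time and the number of concurrent users below are assumed figures, not values from the measurements:

```python
# Size per-user and total bandwidth from a known mean page size.
mean_page_bytes = 8487   # fitted mean for the Education stories (Chapter 4)
target_seconds = 2.0     # assumed acceptable download time
expected_users = 500     # assumed number of concurrent page requests

per_user_bps = mean_page_bytes * 8 / target_seconds  # bits per second per user
total_bps = per_user_bps * expected_users

print(f"per user: {per_user_bps / 1e3:.1f} kbit/s")
print(f"in total: {total_bps / 1e6:.1f} Mbit/s")
```

The same arithmetic scales to any page category once its mean size is known from the fitted distribution.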
We also saw that when the measurements follow the Pareto distribution, the web site can show large variation in its content size: even if the majority of the measurements for one web site vary little among themselves, some others vary greatly, as in the Sport Top Stories example of Figure 22, where five out of seven days had an average content size close to 9,500 bytes while the other two had content sizes larger than 18,000 bytes. From that case we can conclude that when the Pareto distribution is followed we cannot obtain a reliable estimate of the average content size of the web page, so we cannot estimate the bandwidth needed to download it, and we may face longer delays and lower QoS.
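The Sport Top Stories case can be reconstructed numerically to show why the mean stops being a useful planning figure; the individual day values below are invented around the figures quoted above (five days near 9,500 bytes, two above 18,000):

```python
import statistics

# Invented daily content sizes echoing Figure 22's pattern.
week = [9400, 9450, 9500, 9550, 9600, 18200, 19000]
typical = week[:5]  # the five "ordinary" days

# The two heavy days drag the weekly mean far from any day actually observed
# and inflate the standard deviation, so a bandwidth estimate based on the
# mean would fit neither the typical days nor the heavy ones.
print("mean of all 7 days  :", round(statistics.mean(week)))
print("mean of 5 typical   :", round(statistics.mean(typical)))
print("stdev of all 7 days :", round(statistics.stdev(week)))
print("stdev of 5 typical  :", round(statistics.stdev(typical)))
```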
5 MEASUREMENTS AND ANALYSIS OF THE INTER-ARRIVAL TIME OF PACKETS OF A WEB PAGE RESPONSE
In this chapter we will focus on the measurements performed on the web pages in order to examine the inter-arrival time of the packets received in response to a web page request. When the user tries to access a web page, a packet is sent to the service provider asking for the contents of the page. After processing the user's request, the provider sends the contents of the page back to the user. These may not fit into a single packet for several reasons, such as a large size in bytes or fragmentation of the packet by the network. For that reason the user receives many packets for one particular web page request, and these are the packets we need to capture in order to observe their inter-arrival times. This can be done with the Wireshark program presented in Chapter 3; the requests were produced by the emulator, also presented in Chapter 3. From these measurements we will try to extract some useful results about the received packets and the possibility that they follow a mathematical distribution.
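The inter-arrival computation itself is a simple differencing of consecutive timestamps; a minimal sketch (the timestamps below are invented, whereas in practice they would come from a Wireshark capture of one page's packets):

```python
# Invented packet arrival timestamps, in seconds, for one web-page response.
arrivals = [0.000, 0.062, 0.131, 0.198, 0.262, 0.330]

# Inter-arrival time = difference between consecutive packet timestamps.
inter_arrival_ms = [(b - a) * 1000.0 for a, b in zip(arrivals, arrivals[1:])]

print("inter-arrival times (ms):", [round(t, 1) for t in inter_arrival_ms])
print("mean inter-arrival (ms) :",
      round(sum(inter_arrival_ms) / len(inter_arrival_ms), 1))
```

These millisecond gaps are the quantity whose PDF and CDF are fitted against candidate distributions in the sections below.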
5.1 Method of the Analysis of the BBC’s Web Site Inter-arrival time
In order to obtain correct results from the performed measurements, it is very important to carry out the analysis correctly; otherwise the results will be useless. The analysis cannot be done as in the previous chapter, where we gathered the measurements from the same subcategory and analysed them all together. For example, we cannot gather all the Education stories together and extract results from them; instead, we need to analyse and study every web page on its own. In that way we observe the inter-arrival times of one web page's packets separately from the packets of other, unrelated web pages. On the next pages we present the measurements that have been made and the results extracted from them.
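The per-page separation described above can be sketched by grouping captured packets before differencing; the (timestamp, stream) tuples below are invented, on the assumption that each page request maps roughly to its own TCP stream in the capture:

```python
from collections import defaultdict

# Invented capture rows: (timestamp in seconds, TCP stream id).
packets = [
    (0.000, 1), (0.020, 2), (0.065, 1), (0.051, 2),
    (0.130, 1), (0.080, 2), (0.199, 1),
]

# Group timestamps per stream so each page is analysed on its own.
by_stream = defaultdict(list)
for ts, stream in packets:
    by_stream[stream].append(ts)

for stream, times in sorted(by_stream.items()):
    times.sort()
    gaps = [round((b - a) * 1000, 1) for a, b in zip(times, times[1:])]
    print(f"stream {stream}: inter-arrival (ms) = {gaps}")
```

Mixing the two streams would interleave unrelated packets and distort every gap, which is exactly the error this chapter's method avoids.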
5.2 Inter-arrival Time Measurements of the Web Pages
As the measurements produced a very large number of graphs, we decided to present a representative sample from which useful results can be extracted. We will present graphs from all the days of a week, as the measurements were made over a one-week period.
5.2.1 Monday Measurements
We will start by presenting the Monday measurements for different web sites. The next graph presents the PDF and CDF of the collected data for the BBC Home web page on Monday.
Figure 34. PDF and CDF of Monday’s BBC Home page
Figure 35. Pareto CDF versus Empirical CDF of Monday’s BBC Home page
Figure 36. Normal PDF and CDF of Monday’s BBC Home page
From the graphs presented above we can observe that the measurements of the inter-arrival time of the packets of the BBC Home page response do not follow the Pareto distribution. This can be seen clearly in Figure 35, where the empirical CDF of the data follows a different curve from the Pareto CDF and shares only two points with it. But comparing the CDF of the Normal distribution with the CDF of Figure 34, which is the CDF of the collected data, we can see that the measurements tend to follow the Normal distribution rather than the Pareto distribution. This can also be observed from the PDF of the data, because the data tend to have a bell-shaped curve, just like the Normal PDF.
Now we will present the results for News Top Story 1 of the BBC News category.
Figure 37. PDF and CDF of Monday’s News Top Story 1
Figure 38. Pareto CDF versus Empirical CDF of Monday’s News Top Story 1
Figure 39. Normal PDF and CDF of Monday’s News Top Story 1
Figure 37 presents the PDF and CDF of the measurements performed on the News Top Story 1 web page for Monday, and Figure 38 compares the Pareto CDF with the Empirical CDF of the collected data. In this comparison the Pareto CDF is close to the empirical CDF of the data, but Figure 39 shows that the Normal distribution is closer to the PDF and CDF of the collected data and fits better than the Pareto distribution. So for this measurement the Normal distribution is more appropriate than the Pareto.
We will continue the analysis with the BBC’s News Top Story 3 web page.
Figure 40. PDF and CDF of Monday’s News Top Story 3
Figure 41. Pareto CDF versus Empirical CDF of Monday’s News Top Story 3
Figure 42. Normal PDF and CDF of Monday’s News Top Story 3
For this set of measurements we can also see that the Pareto distribution is not the most appropriate one to characterize the collected data, as Figure 41 shows: there are parts where the two curves follow different paths and have no common points. Figure 42 indicates that the Normal distribution is the more appropriate distribution to characterize the data, which can be confirmed by comparing the PDF and CDF of the collected data with those of the Normal distribution.
5.2.2 Tuesday Measurements
Starting the Tuesday measurements, we will examine the BBC News web page.
Figure 43. PDF and CDF of Tuesday’s News web page
Figure 44. Pareto CDF versus Empirical CDF of Tuesday’s News web page
Figure 45.
From the figure 43 we can see the PDF and CDF of the collected data for the News web
page. We have to mention that sometimes the PDF isn’t the appropriate method to compare
with other distribution and that’s b
try to compute their inter
be the same as the majority of the received packets are coming with time intervals so they
have the same probability of arrival with the others except when more than one packets are
received with small time intervals. The figure 44 shows the Pareto CDF and the Empiricla
CDF of the collected data and from that graph we can see that the Pareto is not the most
appropriate distribution that fits to the data but from the CDF of the Normal distribution in
figure 45 we can see that the Normal distribution is more appropriate and fits better to the
collected data.
0
0.005
0.01
0.015
0.02
0.025
0.03
0 20 40
density
msec
Nikolaos Draganoudis, MSc dissertation
- 46 -
Pareto CDF versus Empirical CDF of Tuesday’s News web page
Normal PDF and CDF of Tuesday’s News web page
From the figure 43 we can see the PDF and CDF of the collected data for the News web
page. We have to mention that sometimes the PDF isn’t the appropriate method to compare
with other distribution and that’s because taking measurements from received packets and
try to compute their inter-arrival time will have as a result the probability of each packet to
be the same as the majority of the received packets are coming with time intervals so they
obability of arrival with the others except when more than one packets are
received with small time intervals. The figure 44 shows the Pareto CDF and the Empiricla
CDF of the collected data and from that graph we can see that the Pareto is not the most
ropriate distribution that fits to the data but from the CDF of the Normal distribution in
figure 45 we can see that the Normal distribution is more appropriate and fits better to the
60 80
msec
0
0.2
0.4
0.6
0.8
1
0 20
density
Nikolaos Draganoudis, MSc dissertation
Pareto CDF versus Empirical CDF of Tuesday’s News web page
Normal PDF and CDF of Tuesday’s News web page
From the figure 43 we can see the PDF and CDF of the collected data for the News web
page. We have to mention that sometimes the PDF isn’t the appropriate method to compare
ecause taking measurements from received packets and
arrival time will have as a result the probability of each packet to
be the same as the majority of the received packets are coming with time intervals so they
obability of arrival with the others except when more than one packets are
received with small time intervals. The figure 44 shows the Pareto CDF and the Empiricla
CDF of the collected data and from that graph we can see that the Pareto is not the most
ropriate distribution that fits to the data but from the CDF of the Normal distribution in
figure 45 we can see that the Normal distribution is more appropriate and fits better to the
40 60 80
msec
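The Pareto-versus-Normal comparison above was performed in Matlab (Appendix 2); the procedure can be sketched as follows in Python using only the standard library. The inter-arrival samples, the moment-based Normal fit and the maximum-likelihood Pareto fit below are illustrative assumptions, not the dissertation's measured data:

```python
# Sketch of the distribution comparison described above. The original
# analysis used Matlab (Appendix 2); the samples here are illustrative.
import math

samples_msec = [17.8, 18.1, 22.4, 25.8, 26.8, 29.2, 31.0,
                32.8, 33.5, 37.2, 40.5, 45.2, 48.9, 56.8]

def empirical_cdf(samples):
    """Return sorted points and the fraction of samples <= each point."""
    xs = sorted(samples)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]

def normal_cdf(x, mu, sigma):
    """CDF of the Normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def pareto_cdf(x, xm, alpha):
    """CDF of the Pareto distribution: 1 - (xm/x)^alpha for x >= xm."""
    return 0.0 if x < xm else 1.0 - (xm / x) ** alpha

# Moment estimates for the Normal fit.
n = len(samples_msec)
mu = sum(samples_msec) / n
sigma = math.sqrt(sum((s - mu) ** 2 for s in samples_msec) / n)

# Simple Pareto fit: xm is the smallest observation, and alpha is the
# maximum-likelihood estimate alpha = n / sum(ln(x_i / xm)).
xm = min(samples_msec)
alpha = n / sum(math.log(s / xm) for s in samples_msec)

xs, ecdf = empirical_cdf(samples_msec)
d_normal = max(abs(e - normal_cdf(x, mu, sigma)) for x, e in zip(xs, ecdf))
d_pareto = max(abs(e - pareto_cdf(x, xm, alpha)) for x, e in zip(xs, ecdf))

print(f"max CDF distance  Normal: {d_normal:.3f}  Pareto: {d_pareto:.3f}")
```

The maximum vertical distance between the empirical and fitted CDFs (the Kolmogorov-Smirnov statistic) turns the visual comparison made in the figures into a single number: the smaller distance identifies the better-fitting model.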
  • 4. Nikolaos Draganoudis, MSc dissertation - iv - ABSTRACT In the last few years a big improvement has been observed in the field of telecommunications. As a result, mobile terminals have become faster, with bigger capacity and even smaller size than before. This created a great opportunity for the progress of the services that mobile phones can provide, so nowadays mobile phones are not only used for calls and text messages but also to browse a website, send an email, or even listen to music and record videos. For this dissertation an emulator of a mobile device connected to the Internet through a laptop will be used. Web browsing from this emulator will be performed on the BBC’s mobile web site, as the BBC web site is a big resource of information, frequently updated and well structured. Wireshark will also be used to capture the incoming packets for the emulator and to calculate the inter-arrival time among them. Obtaining the appropriate literature background in the field of the dissertation is important in order to understand and work in this field. Through this dissertation, experience will also be gained in planning, working with, and extracting useful results from a large amount of collected data. It will also be examined whether the sizes of the web pages follow any known mathematical distribution, which will help to characterize the traffic produced by a web page response. Furthermore, the inter-arrival time of the incoming packets of a web page response will be examined in order to see whether these packets follow a distribution; this can help us understand the QoS provided by the web service provider. The study and examination of the BBC’s web sites will give useful information about the traffic generated and the time consumed to download their contents, which could be used as a guideline for providing improved internet services with higher QoS. It will also be a useful tool for understanding mobile internet services and their impact on the network’s resources.
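The core computation behind these measurements — turning the packet arrival timestamps captured by Wireshark into inter-arrival times — is a difference of consecutive timestamps. A minimal sketch, with illustrative timestamps rather than an actual Wireshark export:

```python
# Computing packet inter-arrival times from capture timestamps.
# The arrival times below are illustrative; in the dissertation
# they come from a Wireshark capture of the emulator's traffic.
def inter_arrival_times(timestamps_sec):
    """Return the gaps (in msec) between consecutive packet arrivals."""
    ts = sorted(timestamps_sec)
    return [(b - a) * 1000.0 for a, b in zip(ts, ts[1:])]

arrivals = [0.000, 0.018, 0.043, 0.069, 0.101]  # seconds since capture start
gaps = inter_arrival_times(arrivals)
print([round(g, 1) for g in gaps])  # → [18.0, 25.0, 26.0, 32.0]
```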
  • 5. Nikolaos Draganoudis, MSc dissertation - v - TABLE OF CONTENTS Internet Traffic Measurement and Analysis ...............................................................i Nikolaos Draganoudis................................................................................................i Declaration of originality..........................................................................................ii Acknowledgement....................................................................................................iii Abstract ....................................................................................................................iv Table of Contents ......................................................................................................v List of Figures .........................................................................................................vii 1 Introduction ..........................................................................................................1 1.1 Background and Context...............................................................................1 1.2 Scope and Objectives ....................................................................................2 1.3 Achievements................................................................................................4 1.4 Overview of Dissertation ..............................................................................5 2 Literature Review.................................................................................................6 2.1 Introduction ...................................................................................................6 2.2 Introduction to Internet Protocol Stack ............................................6 2.3 IP protocol .................................................................................................9 2.4 TCP Protocol 
...............................................................................................11 2.5 World Wide Web .........................................................................................13 2.6 Mathematical Distributions for the Analysis...............................................14 2.6.1 Power Law and Pareto distributions.......................................................14 2.6.2 Normal Distribution................................................................................15 2.7 Summary .....................................................................................................16 3 Internet Traffic Measurements And Methodology .............................................17 3.1 Methodology of measurements ...................................................................17 3.1.1 Target of the measurements....................................................................17 3.1.2 Measurement tools..................................................................................18 3.2 Performed Measurements............................................................................22 3.3 Summary .....................................................................................................25 4 BBC’S Web Site Traffic Measurements And Analysis ......................................26 4.1 General Analysis of the BBC’s Web Site categories...................................26 4.2 Analysis of the BBC’s Web Site and Mathematical Distributions..............28
  • 6. Nikolaos Draganoudis, MSc dissertation - vi - 4.3 Conclusions.................................................................................................38 5 Measurements And Analysis of The Inter-arrival Time of Packets of a Web Page Response......................................................................................................................40 5.1 Method of the Analysis of the BBC’s Web Site Inter-arrival time..............40 5.2 Inter-arrival Time Measurements of the Web Pages ...................................41 5.2.1 Monday Measurements ..........................................................................41 5.2.2 Tuesday Measurements ..........................................................................45 5.2.3 Wednesday Measurements .....................................................................51 5.2.4 Thursday Measurements.........................................................................55 5.2.5 Friday Measurements .............................................................................59 5.2.6 Saturday Measurements..........................................................................63 5.2.7 Sunday Measurements............................................................................66 5.3 Measurements that fit to the Pareto distribution .........................................68 5.4 Conclusions.................................................................................................71 6 Conclusion..........................................................................................................73 6.1 Summary and Evaluation ............................................................................73 6.2 Future Work.................................................................................................74 References...............................................................................................................76 Appendix 1 - Work 
plan..........................................................................................78 Appendix 2 – Matlab Code .....................................................................................79 Appendix 3- Content in bytes of BBC’s web sites..................................................82 Appendix 4 – Inter-Arrival Time Measurements of bbc’s web sites ......................83
  • 7. Nikolaos Draganoudis, MSc dissertation - vii - LIST OF FIGURES Figure 1. Protocol stack ......................................................................................................6 Figure 2. Encapsulation of data as it goes down the protocol stack [1]..............................8 Figure 3. IP header fields [1]...............................................................................................9 Figure 4. TCP header fields [1].........................................................................................11 Figure 5. S60 3rd Edition emulator....................................................................................18 Figure 6. BBC website explore.........................................................................................19 Figure 7. Wireshark traffic presentation ...........................................................................21 Figure 8. BBC categories that measurements will be performed .....................................22 Figure 9. BBC News subcategories that measurements will be performed......................23 Figure 10. BBC Sport subcategories that measurements will be performed ..................24 Figure 11. Week traffic of BBC’s main categories .........................................................26 Figure 12. Average values of contents for BBC News and Sport subcategories ............27 Figure 13. PDF and CDF of Education Stories...............................................................29 Figure 14. Pareto CDF versus Empirical CDF of Education Stories..............................29 Figure 15. Normal PDF and CDF of Education Stories..................................................29 Figure 16. PDF and CDF of News Top Stories...............................................................30 Figure 17. Pareto CDF versus Empirical CDF of News Top Stories..............................30 Figure 18. 
Normal PDF and CDF of News Top Stories..................................................31 Figure 19. PDF and CDF of Politics Stories...................................................................31 Figure 20. Pareto CDF versus Empirical CDF of Politics Stories ..................................32 Figure 21. Normal PDF and CDF of Politics Stories......................................................32 Figure 22. PDF and CDF of Sport Top Stories ...............................................................33 Figure 23. Pareto CDF versus Empirical CDF of Sport Top Stories ..............................33 Figure 24. Normal PDF and CDF of Sport Top Stories ..................................................33 Figure 25. PDF and CDF of Tennis Stories.....................................................................34 Figure 26. Pareto CDF versus Empirical CDF of Tennis Stories....................................34 Figure 27. Normal PDF and CDF of Tennis Stories .......................................................35 Figure 28. PDF and CDF of Football Top Stories...........................................................35 Figure 29. Pareto CDF versus Empirical CDF of Football Top Stories..........................36 Figure 30. Normal PDF and CDF of Football Top Stories .............................................36 Figure 31. PDF and CDF of Championship Stories........................................................37
  • 8. Nikolaos Draganoudis, MSc dissertation - viii - Figure 32. Pareto CDF versus Empirical CDF of Championship Stories.......................37 Figure 33. Normal PDF and CDF of Championship Stories...........................................38 Figure 34. PDF and CDF of Monday’s BBC Home page...............................................41 Figure 35. Pareto CDF versus Empirical CDF of Monday’s BBC Home page..............41 Figure 36. Normal PDF and CDF of Monday’s BBC Home page..................................42 Figure 37. PDF and CDF of Monday’s News Top Story 1 .............................................42 Figure 38. Pareto CDF versus Empirical CDF of Monday’s News Top Story 1 ............43 Figure 39. Normal PDF and CDF of Monday’s News Top Story 1 ................................43 Figure 40. PDF and CDF of Monday’s News Top Story 3 .............................................44 Figure 41. Pareto CDF versus Empirical CDF of Monday’s News Top Story 3 ............44 Figure 42. Normal PDF and CDF of Monday’s News Top Story 3 ................................45 Figure 43. PDF and CDF of Tuesday’s News web page.................................................45 Figure 44. Pareto CDF versus Empirical CDF of Tuesday’s News web page................46 Figure 45. Normal PDF and CDF of Tuesday’s News web page....................................46 Figure 46. PDF and CDF of Tuesday’s News Top Story 2 .............................................47 Figure 47. Pareto CDF versus Empirical CDF of Tuesday’s News Top Story 2 ............47 Figure 48. Normal PDF and CDF of Tuesday’s News Top tory 2 ..................................47 Figure 49. PDF and CDF of Tuesday’s Business Story 2 ...............................................48 Figure 50. Pareto CDF versus Empirical CDF of Tuesday’s Business Story 2 ..............48 Figure 51. Normal PDF and CDF of Tuesday’s Business Story 2 .................................49 Figure 52. 
PDF and CDF of Tuesday’s Football Top Story 1 .........................................49 Figure 53. Pareto CDF versus Empirical CDF of Tuesday’s Football Top Story 1 ........50 Figure 54. Normal PDF and CDF of Tuesday’s Football Top Story 1............................50 Figure 55. PDF and CDF of Wednesday’s BBC Home web page ..................................51 Figure 56. Pareto CDF versus Empirical CDF of Wednesday’s BBC Home web page .51 Figure 57. Normal PDF and CDF of Wednesday’s BBC Home web page.....................52 Figure 58. PDF and CDF of Wednesday’s Technology Story 1......................................52 Figure 59. Pareto CDF versus Empirical CDF of Wednesday’s Technology Story 1.....53 Figure 60. Normal PDF and CDF of Wednesday’s Technology Story 1.........................53 Figure 61. PDF and CDF of Wednesday’s Tennis Story 1 ..............................................54 Figure 62. Pareto CDF versus Empirical CDF of Wednesday’s Tennis Story 1 .............54 Figure 63. Normal PDF and CDF of Wednesday’s Tennis Story 1.................................54 Figure 64. PDF and CDF of Thursday’s BBC Home web page......................................55
  • 9. Nikolaos Draganoudis, MSc dissertation - ix - Figure 65. Pareto CDF versus Empirical CDF of Thursday’s BBC Home web page.....55 Figure 66. Normal PDF and CDF of Thursday’s BBC Home web page ........................56 Figure 67. PDF and CDF of Thursday’s BBC News ......................................................56 Figure 68. Pareto CDF versus Empirical CDF of Thursday’s BBC News......................57 Figure 69. Normal PDF and CDF of Thursday’s BBC News .........................................57 Figure 70. PDF and CDF of Thursday’s Football Top Story 1 .......................................58 Figure 71. Pareto CDF versus Empirical CDF of Thursday’s Football Top Story 1.......58 Figure 72. Normal PDF and CDF of Thursday’s Football Top Story 1 ..........................58 Figure 73. PDF and CDF of Friday’s Business Story 2 ..................................................59 Figure 74. Pareto CDF versus Empirical CDF of Friday’s Business Story 2 .................59 Figure 75. Normal PDF and CDF of Friday’s Business Story 2.....................................60 Figure 76. PDF and CDF of Friday’s Formula Story 1...................................................60 Figure 77. Pareto CDF versus Empirical CDF of Friday’s Formula Story 1..................61 Figure 78. Normal PDF and CDF of Friday’s Formula Story 1......................................61 Figure 79. PDF and CDF of Friday’s BBC News web page...........................................62 Figure 80. Pareto CDF versus Empirical CDF of Friday’s BBC News web page..........62 Figure 81. Normal PDF and CDF of Friday’s BBC News web page..............................62 Figure 82. PDF and CDF of Saturday’s BBC Home web page ......................................63 Figure 83. Pareto CDF versus Empirical CDF of Saturday’s BBC Home web page .....63 Figure 84. Normal PDF and CDF of Saturday’s BBC Home web page .........................64 Figure 85. 
PDF and CDF of Saturday’s BBC Sport web page .......................................64 Figure 86. Pareto CDF versus Empirical CDF of Saturday’s BBC Sport web page.......65 Figure 87. Normal PDF and CDF of Saturday’s BBC Sport web page ..........................65 Figure 88. PDF and CDF of Sunday’s BBC Education Story 1......................................66 Figure 89. Pareto CDF versus Empirical CDF of Sunday’s BBC Education Story 1.....66 Figure 90. Normal PDF and CDF of Sunday’s BBC Education Story 1 ........................66 Figure 91. PDF and CDF of Sunday’s BBC Formula Story 1 ........................................67 Figure 92. Pareto CDF versus Empirical CDF of Sunday’s BBC Formula Story 1 .......67 Figure 93. Normal PDF and CDF of Sunday’s BBC Formula Story 1 ...........................68 Figure 94. PDF and CDF of Thursday’s Sport Top Story 1 ............................................68 Figure 95. Pareto CDF versus Empirical CDF of Thursday’s Sport Top Story 1 ...........69 Figure 96. Normal PDF and CDF of Thursday’s Sport Top Story 1...............................69 Figure 97. PDF and CDF of Monday’s News Top Story 1 .............................................70
  • 10. Nikolaos Draganoudis, MSc dissertation - x - Figure 98. Pareto CDF versus Empirical CDF of Monday’s News Top Story 1 ............70 Figure 99. Normal PDF and CDF of Monday’s News Top Story 1 ................................71
  • 11. Nikolaos Draganoudis, MSc dissertation - 1 - 1 INTRODUCTION 1.1 Background and Context The technology of information gathering, processing and distribution is the key technology of the 20th century. Until now we saw the development of worldwide telephone networks, the birth and still growing computer industry and also the development of the satellite communication [18]. According to the old concept of the computer systems all the work from different users can be processed by one big computer but nowadays this concept is totally abandoned and its place took the “computer network” where many autonomous computers interconnected to each other can process the incoming work. The interconnected computers can exchange information through copper wire, fibre optics, microwaves or satellites. The information is exchanged through small units of data called packets. These networks of computers can have many different forms, sizes and shapes like wireless networks and wide area networks [1] [3]. At the first stages of the development of the Internet at the early 1980s, it was a single network and its predecessor is the ARPANET (Advanced Research Projects Agency Network), developed by the United States Department of Defence. Now Internet consists of thousands of different networks that are connected to each other and every single of them provide common services to the customers and follow common protocols. These different networks are controlled by the ISPs (Internet Service Providers) and are responsible to provide connectivity to the Internet to their customers. The Internet can have interconnected ISPs of different sizes, forming a hierarchical interconnected structure. 
The most common ISPs are the transport providers, which deal with the provision of a wide range of services to customers; there are also the backbone providers, which are connected to many other ISPs and carry the traffic that customers produce, and the web hosting providers, which host web pages for their customers. The relationships between the different ISPs are essentially business relationships, related to the quality and the type of service provided to the customer. Nowadays the Internet is not only
used to communicate with people all over the world; it is also widely used to do business, by providing many different services. Organizations, small and large businesses, consumers and even individuals now see the Internet from a different perspective and prefer to conduct their business through it. All these growing expectations create the need for the Internet to become more and more reliable [1] [3] [4]. For that reason many academic researchers, companies and other groups have focused their attention on the traffic that customers generate. They have carried out more and more measurements in order to examine this traffic and come up with useful results that help to improve the network and the management of Internet traffic performance [18]. From such measurements we can see how the network responds and behaves under any upgrade or degradation of performance [2].

As mentioned in the previous paragraph, measurements are very important for understanding the demands of a service, so application-level measurements will be performed in this dissertation, trying to understand the service's use of the network, the demands of the service, and the effects that the service has on the network and its performance.

1.2 Scope and Objectives

The scope of this dissertation is Internet traffic measurement and analysis. As mentioned in the previous paragraphs, application-level measurements will be made. The advantage of application measurements is that they provide an overall view of the application's performance, which would not be so clear if the measurements had been made at lower levels. More specifically, web browsing measurements will be performed by downloading web page contents.

In the last few years, great improvement has been observed in the field of telecommunications.
As a result of this improvement, mobile terminals have become faster, with larger capacity and even smaller size than before. This has been a great opportunity for the progress of the services provided by mobile phones. Nowadays mobile
phones are not only used for calls and text messages; we can also use them to browse a website, send an email, or even listen to music and record video. Consequently, a client no longer gains access to the Internet only through an Internet service provider over a dial-up telephone line. On the contrary, every user can access the Internet through a mobile phone, laptop or palmtop PC. As a result, Internet access has become far more flexible [2] [3] [16]. For this reason the measurements performed here concern the traffic that a mobile phone can produce, and to this end we will use an emulator installed on a laptop connected to the Internet through the campus network of the University of Surrey. The platform will emulate a mobile phone, and through this platform we will be able to browse the web pages.

For the measurements we needed a service provider with a well-defined web page structure and rich, up-to-date content. We therefore decided to browse the web pages of the mobile edition of the BBC, a provider that has these characteristics.

One objective of the dissertation is to understand and gain knowledge of the way the Internet works and responds to requests. This is most easily achieved through the measurement procedure, as every packet that the mobile phone sends and receives will be captured and analysed. The way the measurement results are stored and organised is another important consideration, as it may affect the conclusions that can be extracted. It is also important to mention another objective of this dissertation: to measure the time consumed from the request for a web page until the end of the responses to that particular request, and also the inter-arrival times between the packets of the same response.
Furthermore, in this dissertation we will measure the total size of the web pages in bytes. The main objective is to examine the measurement results, both for the inter-arrival times and for the sizes of the web pages, and through their analysis to see whether the results fit a mathematical distribution. During the analysis of the total content size of the web pages we will also have the chance to see how the size of the data changes over a week-long period.
1.3 Achievements

The first step in tackling the dissertation was to gain the appropriate background in order to become familiar with the topic and understand the requirements. It was also very important to carry out a literature review of other work in the field of Internet traffic measurement and analysis. Through the literature review, the type of measurements to be performed became clear, and the final decision about the web sites to measure was taken. It was also decided that the S60 3rd Edition SDK emulator and the Wireshark network packet analyser would be used for the measurements.

After this first step, registration with Nokia was completed in order to obtain the rights to use the emulator, and the emulator was then installed on the laptop. Familiarisation with the emulator and the options it provides followed, and some trial measurements of the BBC's mobile web sites were performed. At the same time, the incoming and outgoing packets were captured with Wireshark and then examined. After exploring the structure of the BBC's web site, systematic measurements were performed and the results stored for further analysis, as we will see in the following steps of the dissertation.

The next achievement is the analysis of the data collected from the measurements. The results are used to look for a pattern that allows them to be categorised and fitted to a known mathematical distribution. The changes in the contents of the pages over a week-long period were also examined and will be presented. Finally, the results are presented together with the mathematical distribution that best fits the measured data and represents the analysis of the BBC's mobile web page contents.
1.4 Overview of Dissertation

The next chapter of the dissertation presents the literature review and the background concerning the Internet protocol stack and the protocols used to transmit and receive data over the Internet. The World Wide Web is also presented. Finally, the mathematical distributions that will be used to characterise the collected data are explained and presented.

The third chapter introduces the reader to the technical part of the dissertation. First the tools that will be used for the measurements are presented, and then the target on which the measurements will be performed. The structure of the target web site is presented, along with the specific routes along which measurements will be made.

The fourth chapter contains the measurements performed on the BBC's web site, more specifically the measurements concerning the total size in bytes of the web pages. An analysis of the collected data is then performed and the mathematical distribution that best fits the measurements is chosen.

The fifth chapter contains the measurements performed on the BBC's web pages concerning the inter-arrival times between the received packets of a web page response. Through the analysis in this chapter the mathematical distribution that best fits the measurements is chosen.

The sixth chapter contains the conclusions and the evaluation of this dissertation, together with the future work that could be done in this field. There are also four appendices at the end of the dissertation.
The first appendix presents the work plan that was followed during the year; the second presents the Matlab code that was used to analyse the measurements; the third presents the table with the measurements of the total content size of the BBC's web pages; and the fourth presents the tables with the inter-arrival time measurements of the BBC's web pages.
2 LITERATURE REVIEW

2.1 Introduction

This chapter presents and reviews some issues related to the project. This is done to help the reader understand concepts in the field of the dissertation topic and gain the appropriate knowledge. First the Internet protocol stack is presented, with a brief summary of every layer and its purpose. After that, IP (Internet Protocol), TCP (Transmission Control Protocol) and the well-known WWW (World Wide Web) are presented. Finally, we present the mathematical methods that will be used in this dissertation to analyse the data.

2.2 Introduction to the Internet Protocol Stack

As mentioned before, in order to understand the way the Internet works we have to examine the protocol stack that is used to send or receive a packet over the Internet. First we briefly present the Internet protocol architecture and see how it is organised in layers. After that we examine more closely the protocols that are used, namely IP (Internet Protocol) and TCP (Transmission Control Protocol), and we also see how the WWW (World Wide Web) works, as this is important background for this dissertation. The figure below shows the protocol stack of the Internet.

Figure 1. Protocol stack
The lowest layer of the protocol stack is the Physical Layer, whose main function is the transmission of bits. For mobile phones the channel over which the bits are transmitted is the air, but for our measurements, which will be made from a laptop connected to the Internet, the channel will be copper wire.

The layer above the Physical Layer is the Data Link Layer, whose main purpose is to maintain reliable and efficient communication between two adjacent machines at this layer. One of the most important elements of this layer is the MAC (Medium Access Control) address: every computer that connects to the Internet has one, and it is unique worldwide. As this layer is not important for this dissertation, we will not examine it further.

The next layer above the Data Link Layer is the Network Layer, whose main operation is to transmit packets from the source to the destination. In contrast with the Data Link Layer, which is concerned only with the transmission of a packet from one end of a link to the other, this layer deals with end-to-end transmission. As the function of this layer is very important, in the following pages we will examine its functionality and its protocols in more detail.

Above the Network Layer is the Transport Layer. The function of this layer is also very important, as it is responsible for providing reliable and cost-effective data transport from the source to the destination. It also communicates with the Application Layer, receiving requests and delivering data packets. In the following pages we will examine in detail the protocol used to send and receive the data packets.

Finally, on top of all the others is the Application Layer.
It is responsible for the communication of the various applications with the protocols below it. For this layer too we will later examine the main protocol used for browsing the Internet [1] [2] [7].

Now that all the layers have been presented, we will try to understand the way they communicate with each other and the data they exchange. Starting from the
Application Layer: it produces data streams that mainly result from the user's requests. The Transport Layer takes these data streams and fragments them into datagrams. The maximum size of each datagram is up to 64 Kbytes, but in practice the length of each datagram does not exceed 1460 bytes, so that it fits in an Ethernet packet together with the IP and TCP headers that we will see later. Each datagram then goes to the Network Layer, where the IP protocol is used; a connectionless approach is taken, so every packet can follow a different path to the destination. After that the Data Link Layer follows, and finally the Physical Layer, where the bits are transmitted over the channel. Below we can see an example of the format of an IP packet with the header of each layer, from the Application Layer down to the Data Link header [1] [7].

Figure 2. Encapsulation of data as it goes down the protocol stack [1]
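As a rough numerical illustration of this encapsulation, the short Python sketch below (the 40 kB page size is an invented example) splits an application-layer byte stream into 1460-byte chunks, the largest amount of data that fits in a 1500-byte Ethernet payload once the 20-byte IP and 20-byte TCP headers are accounted for:

```python
# Sketch: fragmenting an application-layer data stream into chunks that
# fit an Ethernet frame, as described above. Header sizes follow the
# text: a 20-byte IP header plus a 20-byte TCP header inside a
# 1500-byte Ethernet payload leave 1460 bytes of data per packet.

ETHERNET_PAYLOAD = 1500
IP_HEADER = 20
TCP_HEADER = 20
MSS = ETHERNET_PAYLOAD - IP_HEADER - TCP_HEADER  # 1460 bytes of data

def fragment(stream: bytes) -> list:
    """Split a data stream into MSS-sized pieces, as the Transport
    Layer does before handing them down to the Network Layer."""
    return [stream[i:i + MSS] for i in range(0, len(stream), MSS)]

page = b"x" * 40000          # a hypothetical 40 kB web page
pieces = fragment(page)
print(MSS)                   # 1460
print(len(pieces))           # 28 pieces (27 full plus one partial)
print(len(pieces[-1]))       # 580 bytes left over in the last piece
```

A 40000-byte page therefore needs 28 packets: 27 carrying the full 1460 bytes and a final one carrying the remaining 580 bytes.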
2.3 IP Protocol

In this section we examine the protocol mainly used in the Network Layer: IP (Internet Protocol). There are currently two versions, IPv4 and IPv6; the newer IPv6 provides a wider range of IP addresses and a less complex header than IPv4. We will examine IPv4, as it is currently used far more than IPv6. As can be observed in the picture below, the IP header has a 20-byte fixed part and a variable-length optional part.

Figure 3. IP header fields [1]

We now briefly explain the fields of the protocol and their usage.

Version: contains the version of the IP protocol in use; this is necessary for communication between two machines that use different versions of IP.

IHL: gives the actual length of the packet header, since the Options field has variable length.

Type of service: mainly used to distinguish between different classes of service; for voice, for example, we need fast and accurate delivery of the packets.
Total length: the total length of the packet, including the header and the data.

Identification: lets the receiver determine which datagram a received fragment belongs to.

DF bit (Don't Fragment): when this bit is set to 1, the datagram cannot be fragmented by routers in the network.

MF bit (More Fragments): indicates that more fragments of the same datagram are expected.

Fragment offset: indicates the position of the fragment's data within the datagram.

Time to live: the maximum number of hops that a datagram may make before reaching its final destination. This number is decremented by one every time a router forwards the datagram, and when it hits zero the datagram is dropped by the network.

Protocol: indicates the upper-layer protocol to which IP should deliver the packet.

Header checksum: verifies that the header (only) contains no errors.

Source address: the IP address of the sender of the packet.

Destination address: the IP address of the receiver.

Options: this field can be used to add further functions to the protocol, such as security, timestamping and source routing [1] [2] [7].
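The layout of these fields can be made concrete with a short Python sketch that decodes the fixed 20-byte IPv4 header from raw bytes. The header bytes and the addresses below are invented purely for illustration:

```python
import struct
import socket

# Sketch: decoding the fixed 20-byte IPv4 header fields described above.
def parse_ipv4_header(raw: bytes) -> dict:
    (ver_ihl, tos, total_len, ident,
     flags_frag, ttl, proto, checksum) = struct.unpack("!BBHHHBBH", raw[:12])
    src, dst = raw[12:16], raw[16:20]
    return {
        "version": ver_ihl >> 4,        # IP version (4 for IPv4)
        "ihl": ver_ihl & 0x0F,          # header length in 32-bit words
        "total_length": total_len,      # header + data, in bytes
        "ttl": ttl,                     # remaining hop count
        "protocol": proto,              # upper-layer protocol (6 = TCP)
        "src": socket.inet_ntoa(src),
        "dst": socket.inet_ntoa(dst),
    }

# A hand-built header: version 4, IHL 5 words, total length 1500 bytes,
# TTL 64, protocol TCP, with two invented example addresses.
hdr = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 1500, 1, 0, 64, 6, 0,
                  socket.inet_aton("192.168.0.2"),
                  socket.inet_aton("212.58.244.69"))
info = parse_ipv4_header(hdr)
print(info["version"], info["ihl"], info["total_length"], info["ttl"])
# 4 5 1500 64
```

Note how version and IHL share one byte, which is why the first field is split with shifts and masks.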
2.4 TCP Protocol

Moving to the Transport Layer, TCP (Transmission Control Protocol) and UDP (User Datagram Protocol) are the two main protocols in use. We will focus on TCP, as the World Wide Web runs with the HTTP protocol at the Application Layer and the TCP protocol at the Transport Layer. The picture below shows the TCP header.

Figure 4. TCP header fields [1]

We now briefly explain the fields of the protocol and their usage.

Source port and Destination port: identify where the packets should be delivered in the upper layer, i.e. the application to which the packet belongs.

Sequence number and Acknowledgement number: used to ensure that all packets are transmitted safely, without loss.

TCP header length: stores the number of 32-bit words that the TCP
header contains. Since the header length is variable, this field makes clear where the header ends and the data begin.

URG (Urgent) flag: set when the urgent pointer is in use, indicating that the packet carries urgent data.

ACK flag: set when acknowledging a received packet. When the ACK flag is 0, the Acknowledgement number field is ignored.

PSH (Push) flag: when set to 1, the data are delivered to the Application Layer on arrival and are not buffered at this stage.

RST flag: used to reset a connection or reject an invalid segment.

SYN flag: used to establish a connection between two entities.

FIN flag: used to release a connection between two entities.

Window size: the number of bytes that the receiver is willing to accept from the transmitter.

Checksum: a checksum over the header and data, for extra reliability.

Urgent pointer: the byte offset at which the urgent data are located.

Options: used for extra options not provided in the regular header [1] [2] [7].
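As with the IP header, the TCP fields can be illustrated by decoding the fixed 20-byte header from raw bytes. The sample segment below is hand-built for the example (a SYN from an invented ephemeral port to port 80, the HTTP port):

```python
import struct

# Sketch: decoding the fixed 20-byte TCP header fields described above.
def parse_tcp_header(raw: bytes) -> dict:
    (src_port, dst_port, seq, ack,
     offset_flags, window, checksum, urgent) = struct.unpack("!HHIIHHHH",
                                                             raw[:20])
    flags = offset_flags & 0x3F          # low six bits: URG..FIN
    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "seq": seq,
        "ack": ack,
        "header_len": (offset_flags >> 12) * 4,  # data offset, in bytes
        "SYN": bool(flags & 0x02),
        "ACK": bool(flags & 0x10),
        "FIN": bool(flags & 0x01),
        "window": window,
    }

# A hand-built SYN segment: data offset 5 words, only the SYN bit set.
hdr = struct.pack("!HHIIHHHH", 49152, 80, 1000, 0,
                  (5 << 12) | 0x02, 65535, 0, 0)
info = parse_tcp_header(hdr)
print(info["dst_port"], info["header_len"], info["SYN"], info["ACK"])
# 80 20 True False
```

Because the ACK flag is clear, a real receiver would ignore the acknowledgement number field of this segment, as described above.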
2.5 World Wide Web

In the Application Layer, at the top of all the layers we have examined, and since this dissertation focuses on the browsing of Internet sites, we will concentrate on the World Wide Web, which uses the HTTP protocol.

One of the most important services of the Internet is the World Wide Web, which began in 1989 at CERN, a European centre for nuclear research. It became very popular all over the world, as it is friendly for beginners and its interface is well designed. At the beginning it was designed so that the scientists of CERN could share their research and exchange ideas, as many of them were working in different countries; the World Wide Web (WWW) then grew beyond these needs and came to be used by the entire world. This happened when CERN and M.I.T. signed an agreement setting up the World Wide Web Consortium. This organisation was responsible for developing the World Wide Web by standardising protocols and encouraging interoperability between the companies that had developed browsers at the time, Netscape and Microsoft [1] [3].

The World Wide Web consists of a huge number of documents, also called web pages, distributed across the world. Every web page may contain several links to other pages around the world. In this way a complicated web of connections between the pages is formed, through which every user can reach them. As it would be very difficult to keep track of the path followed to reach a page, the World Wide Web uses the URL (Uniform Resource Locator), a unique identifier for each web page. Thus every user can simply remember the URL of a web site in order to access it. As the World Wide Web grew faster and faster, applications that helped the users were developed: the web browsers. With web browsers it was easier to browse different sites and keep a record of the URLs of pages that you might want to visit again.
Browsers made the World Wide Web easy to use and attracted even more users [1] [2] [3].
2.6 Mathematical Distributions for the Analysis

The analysis of the service measurements is the process in which the collected data are examined to determine whether they can be modelled by a known distribution or model. Such a model can then be used to characterise related phenomena or data of the same type. After examining several mathematical distributions, we will focus on the power law, the Pareto and the normal distribution for the scope of this dissertation. By the end of the dissertation the reason for this selection will be clear.

2.6.1 Power Law and Pareto Distributions

In recent years, a significant amount of research has focused on showing that many physical and social phenomena follow a power-law distribution. Examples of such phenomena are the World Wide Web [9], metabolic networks, Internet router connections, journal paper reference networks, and sexual contact networks [8]. There is sometimes confusion between the power law and the Pareto distribution, which we will clear up in the following paragraphs [8] [9].

We will explain both the Pareto and the power law through an example of Lada A. Adamic [9] and try to make their similarities and differences clear. Taking the distribution of income as an example: in the Pareto formulation, instead of asking what the r-th largest income is, we ask how many people have an income greater than x [14]. This gives the equation P[X > x] ≈ x^-k. For this reason we can say that Pareto's law is given in terms of the cumulative distribution function (CDF): the number of events larger than x is an inverse power of x. What we call the power-law distribution tells us not how many people have an income greater than x, but how many people have an income of exactly x. It is therefore the probability density function (PDF) associated with the CDF given by Pareto's law.
From this we obtain P[X = x] ≈ x^-(k+1) = x^-a, where k is the Pareto distribution shape parameter and a = k + 1 is the exponent of the power law [9] [13].
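The relationship between the Pareto tail and its exponent can be checked numerically. The Python sketch below (an illustration only; the shape value 1.5 is an arbitrary choice) draws samples from a Pareto distribution with scale 1 and confirms that the empirical tail frequency P[X > x] tracks x raised to minus the shape parameter:

```python
import random

# For a Pareto variable with scale 1 and shape s, P[X > x] = x**(-s),
# so the fraction of samples exceeding x should track x**(-s).
random.seed(42)
shape = 1.5
samples = [random.paretovariate(shape) for _ in range(100000)]

def tail_freq(xs, x):
    """Empirical P[X > x]: fraction of samples larger than x."""
    return sum(1 for v in xs if v > x) / len(xs)

# Compare the empirical tail with the theoretical x**(-shape).
for x in (2.0, 4.0, 8.0):
    print(x, round(tail_freq(samples, x), 4), round(x ** -shape, 4))
```

On a log-log plot the two columns would fall on the same straight line of slope equal to minus the shape parameter, which is exactly the signature used to recognise Pareto-like data.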
Now we explain how we will work with these distributions and try to fit the collected data. In order to compare the data with Pareto's law we have to find the CDF of the data and the CDF of a Pareto distribution whose parameters approach the curve of the data's CDF. From theory, the Pareto CDF is given by F(x) = 1 - (b/x)^a for x > b, and 0 for x ≤ b. Here 'a' is the shape parameter and 'b' is the scale parameter. As we saw before, the Pareto shape parameter is one less than the power-law exponent, so a = k - 1, where 'k' now denotes the power-law exponent. In the related literature, the 'b' parameter commonly takes the smallest value of the data under examination. To find the value of the parameter 'a' we will use a program by Aaron Clauset, Cosma Rohilla Shalizi and M. E. J. Newman that estimates the value of the power-law exponent 'k' that best fits the supplied data. The program estimates 'k' for each possible minimum value of the input data x via the method of maximum likelihood and calculates the Kolmogorov-Smirnov goodness-of-fit statistic; it then selects the x with the minimum Kolmogorov-Smirnov statistic and reports the corresponding value of 'k'. The Kolmogorov-Smirnov method is generally used when the sample size of each test is small, as in our case, where we have from 3 to 15 values per test. The KS test is based on the value K = sup_x |F*(x) - S(x)|, where F*(x) is the hypothesised cumulative distribution function and S(x) is the empirical distribution function based on the sampled data [8] [6] [13] [17]. After obtaining the best value of 'k' for the data, we can find the 'a' parameter as a = k - 1. The next step is to compare the CDF of the data with the CDF of the Pareto distribution and see whether the data follow the Pareto distribution closely enough to be fitted by it.
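The fitting itself was carried out with the Matlab program mentioned above; the Python fragment below is only a simplified stand-in showing the two ingredients for a fixed scale b: the maximum-likelihood estimate of the Pareto shape, a = n / Σ ln(x_i / b), and the Kolmogorov-Smirnov distance between the empirical CDF and the fitted Pareto CDF F(x) = 1 - (b/x)^a:

```python
import math
import random

def fit_pareto_shape(data, b):
    """Maximum-likelihood estimate of the Pareto shape parameter 'a'
    for samples above a known scale 'b'."""
    return len(data) / sum(math.log(x / b) for x in data)

def ks_statistic(data, a, b):
    """Largest gap between the empirical CDF and the fitted
    Pareto CDF F(x) = 1 - (b / x)**a."""
    xs = sorted(data)
    n = len(xs)
    return max(abs((i + 1) / n - (1 - (b / x) ** a))
               for i, x in enumerate(xs))

# Synthetic check: generate Pareto data with a known shape and verify
# that the estimate recovers it and that the KS distance stays small.
random.seed(1)
b, true_shape = 1.0, 2.5
data = [b * random.paretovariate(true_shape) for _ in range(5000)]
a_hat = fit_pareto_shape(data, b)
print(round(a_hat, 2))                         # close to 2.5
print(round(ks_statistic(data, a_hat, b), 3))  # small for a good fit
```

The Clauset-Shalizi-Newman program additionally repeats this estimate for every candidate minimum value and keeps the one with the smallest KS statistic; the sketch above fixes b in advance for simplicity.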
Appendix 2 presents the code of the Matlab program and the functions that I wrote in order to take the output of the main program and present the results and the Pareto distribution.

2.6.2 Normal Distribution

We now present the normal distribution and the parameters needed to specify it. It is important to mention that all normal distributions are symmetric, with bell-shaped density curves and a single peak. The parameters that
characterise this distribution are the mean value of the data, μ, and the standard deviation, σ, which is a measure of the dispersion of the data. The probability density function of the normal distribution is given by f(x) = (1 / (σ√(2π))) exp(-(x - μ)² / (2σ²)). From the probability density function of the measured data we will see whether they fit this distribution [10].

2.7 Summary

In this chapter we presented the background necessary for this dissertation. We looked at the Internet protocol stack, with a short introduction to each layer, and then presented IP (Internet Protocol), TCP (Transmission Control Protocol) and the WWW (World Wide Web). Finally, we presented the mathematical distributions that will be used to analyse the measurements. These topics are very important for becoming familiar with the dissertation theme and understanding what follows: first we have to know what we are going to measure, and then make the measurements and analyse them. The next chapter covers the methodology of the measurements, the target of the measurements and the tools to be used.
3 INTERNET TRAFFIC MEASUREMENTS AND METHODOLOGY

In the previous chapter we saw useful terms related to the theoretical part of the dissertation topic, providing the appropriate background. In this chapter we examine the technical part of the dissertation, concerning the Internet traffic measurements.

3.1 Methodology of Measurements

3.1.1 Target of the Measurements

The first step in starting the measurements is to determine the target on which they will be made. The selection of the site is very important, as we want the data analysis to give useful, meaningful results. The final choice of site for the measurements is the mobile edition of the BBC's website. The BBC is the British Broadcasting Corporation, with worldwide recognition and acceptance [15]. The BBC's mobile edition website tries to satisfy the user's requirement of being able to follow the news while away from home. People can easily browse the mobile website on their mobile phone or PDA and access BBC News, BBC Sport and other categories. The BBC mobile website thus offers frequently updated, useful information to many people, which makes it a very appropriate target for our measurements.
3.1.2 Measurement Tools

3.1.2.1 S60 3rd Edition Emulator

It would be difficult to perform the measurements on a real mobile phone, as the operator would charge for the Internet browsing and it would also be difficult to process and store the measurement data, so it was decided to use a mobile phone emulator. The S60 3rd Edition SDK for C++ platform was chosen as the emulator for browsing the BBC's mobile websites. After registration with Nokia, the platform was ready to be installed on the laptop. Then, through the laptop's Internet connection, we could access the BBC's website without being charged. Among the many services this emulator provides is the browser application, which we will mainly use for accessing the contents of the BBC's website. It supports features such as HTML 4.01, XHTML, JavaScript 1.5, plug-in support and file upload over HTTP [11]. Below we can see the form of the emulator, with its input keys and menu icons.

Figure 5. S60 3rd Edition emulator
The emulator is friendly in use, and familiarity with its options can be gained quickly. To access the browser of the emulator, the Services icon must be pressed. After that, as we can see in figure 8, we have to type the address that we want to browse. We will now access the BBC's web page and explain the usage of the emulator's diagnostic tool.

Figure 6. BBC website explore
As we can see in the figure above, the emulator provides a diagnostic tool that reports information about the traffic generated, the total size of every web page visited, and the types of the incoming files (such as text, photographs or videos), in the form of requests and responses. This will help us to fulfil the part of the measurements concerned with the size of the web pages.
3.1.2.2 Wireshark

As one of the main goals of this dissertation is to capture the inter-arrival times of the incoming packets for a web page request, a tool is needed that allows us to measure these time periods. This tool is Wireshark, a network packet analyser. In general, Wireshark can be used by:

- network administrators, to troubleshoot network problems
- network security engineers, to examine security problems
- developers, to debug protocol implementations
- people who want to learn network protocol internals [12]

For this dissertation, Wireshark will be used to capture every packet that comes from the Internet and also every request of ours that goes out to the Internet. From the picture below we can see that the time at which every packet was sent or received is captured, so we are able to calculate the inter-arrival times of the received packets, up to the end of the packets received in the server's response.

Figure 7. Wireshark traffic presentation
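Once the packets are captured, the inter-arrival computation itself is simple subtraction of consecutive capture timestamps. A minimal Python sketch (the timestamps below are invented; in practice they come from the Wireshark capture, in seconds since the start of the capture):

```python
# Sketch: computing packet inter-arrival times from the capture
# timestamps that Wireshark records for each packet.

timestamps = [0.000, 0.182, 0.271, 0.305, 0.512, 0.518]

def inter_arrival(times):
    """Differences between consecutive packet arrival times."""
    return [round(t2 - t1, 3) for t1, t2 in zip(times, times[1:])]

gaps = inter_arrival(timestamps)
print(gaps)       # [0.182, 0.089, 0.034, 0.207, 0.006]
print(sum(gaps))  # total time from the first to the last packet
```

The sum of the gaps gives the elapsed time from the first to the last packet of the response, which is the per-request response time measured in this dissertation.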
3.2 Performed Measurements

Before starting the measurements, it is important to make clear that the sample should be large enough and representative, and that the measurements have to be repeated many times. After gaining experience from the preparatory phase, it was decided to make the measurements on the two main categories of the BBC mobile web page, BBC News and BBC Sport. These categories are frequently updated and contain many subcategories. We will focus on the most important subcategories of these two categories. The next tree diagrams present the web pages chosen for measurement and their containment relationships.

Figure 8. BBC categories on which measurements will be performed
  • 33. Nikolaos Draganoudis, MSc dissertation - 23 - Figure 9. BBC News subcategories on which measurements will be performed As we can see from the above graph, we will focus the measurements on six subcategories of the BBC News category (Top Stories, Technology, Politics, Entertainment, Business and Education) and then on three stories of every subcategory. These categories are assumed to concentrate most user preferences.
  • 34. Nikolaos Draganoudis, MSc dissertation - 24 - Figure 10. BBC Sport subcategories on which measurements will be performed In the case of BBC Sport we will focus on four subcategories (Top Stories, Motorsport, Football and Tennis), but here two of them contain further subcategories: Formula 1 and World Rally are contained in the Motorsport subcategory, while Top Stories, Premiership and Championship are contained in the Football subcategory. Measurements will be performed for the displayed stories.
  • 35. Nikolaos Draganoudis, MSc dissertation - 25 - Before starting the measurements it is important to mention the frequency of taking measurements from these web sites. For better results this frequency should match the frequency at which the information on the pages is updated. After observing the BBC's web site, it would not be wise to take more than one measurement per web page per day, because changes to the contents are rare. This is reasonable, because it is both pointless and difficult to change the contents at such small time intervals. So we decided to collect measurements for a week, at intervals of one day. 3.3 Summary In this chapter we presented the methodology that we are going to follow for the measurements, specifying the target of the measurements and the reasons for this choice. We also presented the tools that are going to be used for the measurements, namely the Nokia S60 Emulator and Wireshark. After the presentation of the tools and the way they are going to be used, we presented the specific web pages, from the entire structure of the BBC's mobile web site, on which the measurements are going to be performed. In the following chapter we will perform the measurements of the total content size of the web pages and the analysis of the collected data.
  • 36. Nikolaos Draganoudis, MSc dissertation - 26 - 4 BBC’S WEB SITE TRAFFIC MEASUREMENTS AND ANALYSIS In this chapter we will analyse and present the data collected from the measurements that were performed, in order to obtain a view of the BBC web site's traffic profile. For these measurements the emulator that was presented in Chapter 3 was used, together with the diagnostic tool contained in the emulator, in order to get the total size of the web pages. 4.1 General Analysis of the BBC’s Web Site Categories In the following graph we can see the minimum, maximum and average value in bytes of the main page of the BBC web site and of the two main categories, BBC News and BBC Sport. Figure 11. Week traffic of BBC’s main categories (panels: BBC Home, BBC News and BBC Sport content in bytes over a week)
  • 37. Nikolaos Draganoudis, MSc dissertation - 27 - From the above graph we can see that News and Sport generate almost the same amount of traffic, and both of them generate lower traffic compared to the BBC Home web page. This is logical, as the first page contains a bigger amount of information than its subcategories. In the following graph we present the average values in bytes of the subcategories of BBC News and Sport. Figure 12. Average values of contents for BBC News and Sport subcategories (panels: BBC News and BBC Sport subcategory content in bytes over a week)
  • 38. Nikolaos Draganoudis, MSc dissertation - 28 - From the above figure we can observe that in the News category almost all subcategories have an average value between 3000 and 3250 bytes, except Top Stories with 3500 bytes and Business with 3450 bytes. On the other hand, there is a big difference between the Football subcategory of Sport and the other subcategories, whose average value is around 2900 bytes. This can be explained by the fact that football is more popular than the other sports, so the BBC pays more attention to it and provides more information in this field. 4.2 Analysis of the BBC’s Web Site and Mathematical Distributions In this part of the dissertation we will analyse in depth the results of the measurements of the content size in bytes of the BBC’s web pages, and we will examine whether the data could follow one of the mathematical distributions presented in 2.6.1 and 2.6.2. In the first steps of the analysis we tried to see whether the measurements could fit the Pareto distribution. To do that, we constructed the PDF (Probability Density Function) and the CDF (Cumulative Distribution Function) of the collected data; then, through the method described in 2.6.1, we estimated the parameters of the Pareto distribution, plotted the Pareto CDF and compared it with the empirical CDF of the collected data. We also compared the data with the Normal distribution, to see which distribution fits the data better. In the following pages we will present these results and assess the acceptance of the Pareto and the Normal distribution. From the total set of graphs that were produced, we can say that the Pareto distribution does not fit well with the collected data on the content size in bytes of the BBC’s web pages.
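The kind of fit check described above can be sketched in a few lines. This is a standard maximum-likelihood estimate for the Pareto parameters and a simple CDF comparison; it is illustrative only and does not reproduce the exact fitting method of Section 2.6.1, and the byte counts are made up.

```python
# Sketch of one standard way to fit a Pareto distribution to page-size data
# (maximum-likelihood estimates; not the dissertation's exact procedure).
import math

def fit_pareto(data):
    """MLE for Pareto: scale x_m = min(data), shape alpha = n / sum(ln(x/x_m))."""
    xm = min(data)
    alpha = len(data) / sum(math.log(x / xm) for x in data)
    return xm, alpha

def pareto_cdf(x, xm, alpha):
    return 0.0 if x < xm else 1.0 - (xm / x) ** alpha

def empirical_cdf(x, data):
    return sum(1 for v in data if v <= x) / len(data)

sizes = [6200, 7100, 7500, 8100, 8400, 9000, 12500]   # hypothetical page sizes
xm, alpha = fit_pareto(sizes)
# Maximum gap between the fitted and empirical CDFs (a simple KS-style check):
gap = max(abs(pareto_cdf(x, xm, alpha) - empirical_cdf(x, sizes)) for x in sizes)
```

A small maximum gap indicates a good fit; in the measurements below, the gap between the Pareto CDF and the empirical CDF usually stays large.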
Of course there are some exceptions, which we are also going to present, where the Pareto distribution fits the data well; but the majority of the graphs show that this is not the case, and that the appropriate distribution could be the Normal. We will start with the Education stories of the BBC News category. The following figure presents the PDF and the CDF of the Education stories, for which measurements were performed over a week period. We can see the distribution of the web pages according to their total size in bytes.
  • 39. Nikolaos Draganoudis, MSc dissertation - 29 - Figure 13. PDF and CDF of Education Stories Figure 14. Pareto CDF versus Empirical CDF of Education Stories Figure 15. Normal PDF and CDF of Education Stories
  • 40. Nikolaos Draganoudis, MSc dissertation - 30 - It is obvious from figure 14 that the Pareto CDF does not follow the curve of the empirical CDF of the data, as the two curves have only two common points and then the differences between them increase. On the other hand, figure 15 shows that the measurements of the Education stories fit very well to the Normal distribution, with a mean equal to 8487 bytes and a standard deviation equal to 1752.23 bytes. We can extract this conclusion by comparing the PDF of the data with the PDF of the Normal distribution, and also the CDF of the data with the CDF of the Normal distribution. Now we will examine the Top Stories of the BBC News category. The results of the measurements come from three different top stories and represent the total content size of the web pages. Figure 16. PDF and CDF of News Top Stories Figure 17. Pareto CDF versus Empirical CDF of News Top Stories
  • 41. Nikolaos Draganoudis, MSc dissertation - 31 - Figure 18. Normal PDF and CDF of News Top Stories In figure 16 the PDF and CDF of the content in bytes of the BBC News Top Stories web pages are presented, and in the following figure the curves of the Empirical CDF of the data and of the Pareto CDF, with its parameters set to be as close as possible to the Empirical CDF, are presented. Even so, the differences between these two curves are obvious, and they have only two common points. On the contrary, comparing the PDF of the data with the PDF of the Normal distribution in figure 18, there are many similarities in the shape of the graph, and the same holds for the CDF of the data and the CDF of the Normal distribution. The Normal distribution has a mean value of 8123 bytes and a standard deviation of 1359 bytes. The next example will be the Politics stories of the BBC News. The measurements are composed of three different Politics stories. Figure 19. PDF and CDF of Politics Stories
  • 42. Nikolaos Draganoudis, MSc dissertation - 32 - Figure 20. Pareto CDF versus Empirical CDF of Politics Stories Figure 21. Normal PDF and CDF of Politics Stories From figure 20 we can see that the two curves of the Pareto and the Empirical CDF have different angles, so the data of the Politics stories do not follow the Pareto distribution. On the other hand, from figure 19 and figure 21 the Normal distribution seems to be closer to the data and fits well for both the PDF and the CDF. The Normal distribution parameters are 7919 bytes for the mean value and 1192 bytes for the standard deviation. As we can see, up to this point the measurements tend to fit the Normal distribution, but as we will present in the following pages there are some measurements that also fit the Pareto distribution well, even though these cases are few. In the following figures we will present one case that fits well to the Pareto distribution.
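The Normal-fit comparisons used throughout this section can be sketched the same way as the Pareto check: estimate the mean and standard deviation from the sample and measure the largest vertical distance between the fitted Normal CDF and the empirical CDF. This is an illustrative snippet with made-up byte counts, not the dissertation's code.

```python
# Sketch (illustrative): checking how well a Normal distribution with the
# fitted mean and standard deviation tracks the empirical CDF of page sizes.
import math

def normal_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2), computed via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def empirical_cdf(x, data):
    return sum(1 for v in data if v <= x) / len(data)

sizes = [6300, 6900, 7400, 7900, 8100, 8600, 9400]   # hypothetical sample
mu = sum(sizes) / len(sizes)
sigma = math.sqrt(sum((v - mu) ** 2 for v in sizes) / (len(sizes) - 1))
# Largest vertical distance between the two CDFs (smaller = better fit):
gap = max(abs(normal_cdf(x, mu, sigma) - empirical_cdf(x, sizes)) for x in sizes)
```

Comparing this gap against the one obtained for the Pareto fit gives a numeric counterpart to the visual comparisons in the figures.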
  • 43. Nikolaos Draganoudis, MSc dissertation - 33 - Figure 22. PDF and CDF of Sport Top Stories Figure 23. Pareto CDF versus Empirical CDF of Sport Top Stories Figure 24. Normal PDF and CDF of Sport Top Stories
  • 44. Nikolaos Draganoudis, MSc dissertation - 34 - In figure 22 the PDF and the CDF of the content size of the BBC’s Sport Top Stories are presented, containing measurements from 3 different stories. From figure 23 we can see that the Pareto distribution fits very well to the collected data, and this can also be suspected from the PDF in figure 22, as the data have a long tail, which is a characteristic of the Pareto distribution. On the contrary, the Normal distribution does not have a good shape, as it tries to cover all the data while a few of them lie far away from the majority of the measurements and away from the main bell-shaped curve. But as we already said, there are rare cases in which the Pareto distribution fits the measurements better than the Normal distribution. Now we will examine BBC’s web pages that belong to the BBC Sport category. We will start with the Tennis stories; in the next graph we present the PDF and CDF of the Tennis stories according to their total content size. Figure 25. PDF and CDF of Tennis Stories Figure 26. Pareto CDF versus Empirical CDF of Tennis Stories
  • 45. Nikolaos Draganoudis, MSc dissertation - 35 - Figure 27. Normal PDF and CDF of Tennis Stories For the Tennis stories it can be observed from figure 26 that the Pareto distribution does not follow the Empirical CDF curve of the data, so it is not the appropriate distribution to characterize the data. Comparing the Normal PDF and CDF of figure 27 with those of figure 25, we find a bigger similarity, and the Normal distribution characterizes the collected data more appropriately. Continuing with the BBC Sport category, we will present the results for the Football Top Stories. Figure 28. PDF and CDF of Football Top Stories
  • 46. Nikolaos Draganoudis, MSc dissertation - 36 - Figure 29. Pareto CDF versus Empirical CDF of Football Top Stories Figure 30. Normal PDF and CDF of Football Top Stories In figure 28 the PDF and the CDF of the collected data of the BBC’s Football Top Stories pages are presented, and in figure 29 the comparison between the Pareto CDF and the Empirical CDF of the data is presented. From this comparison we can see the main differences between these two curves, which show that there is no fit with the Pareto distribution, as the two curves have only 3 common points at the start of the curve and then the distance between them increases. On the contrary, figure 30 compared with figure 28 has many similarities and follows the collected data with a better approximation.
  • 47. Nikolaos Draganoudis, MSc dissertation - 37 - In the following pages we will present one more example of the measurements that have been made on the BBC’s web pages, namely the Championship stories, which are a subcategory of Football in the Sport category. Figure 31. PDF and CDF of Championship Stories Figure 32. Pareto CDF versus Empirical CDF of Championship Stories
  • 48. Nikolaos Draganoudis, MSc dissertation - 38 - Figure 33. Normal PDF and CDF of Championship Stories From this last example we confirm that the Pareto distribution is not the appropriate distribution to characterize and fit the measurements of the total content size of the BBC’s web pages. This can be seen from figure 32 and the differences between the two curves. According to the results of the measurements that we made and the graphs that we presented, we can see that the Normal distribution is the more appropriate distribution for our measurements and can fit the PDF and CDF of the collected data. 4.3 Conclusions From the previous analysis of the measurements that have been made on the BBC’s web pages, concerning their total content size, we can say that the majority of them follow the Normal distribution and not the Pareto distribution. This conclusion can help the provider to offer a better service: as the average size of the web page is known, there is a known value for the number of bytes that the user has to download to see the web page, so the service provider can adjust the bandwidth needed by the user to download the web page, and can also calculate the total resources that have to be provided, since there is always an estimation of the number of customers that use the service and an estimation of the size of the web page.
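The provisioning argument above can be made concrete with a back-of-the-envelope calculation. The sketch below is illustrative, not part of the dissertation: it assumes page sizes concentrate around a known mean (as the Normal fit suggests) and combines that mean with a hypothetical number of concurrent users and a target download time.

```python
# Illustrative sizing calculation (not from the dissertation): if page sizes
# are roughly Normal with a known mean, a provider can size the aggregate
# link from the expected page size, the concurrent users and a target time.

def required_bandwidth_bps(mean_page_bytes, concurrent_users, target_seconds):
    """Aggregate bandwidth so each user downloads a mean-sized page in time."""
    return mean_page_bytes * 8 * concurrent_users / target_seconds

# e.g. 8487-byte pages (the Education-stories mean), 100 users, 2-second target:
bw = required_bandwidth_bps(8487, 100, 2.0)
print(f"{bw / 1000:.1f} kbit/s")   # 3394.8 kbit/s
```

With a heavy-tailed (Pareto-like) size distribution, no single mean drives this estimate, which is exactly the QoS difficulty discussed next.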
  • 49. Nikolaos Draganoudis, MSc dissertation - 39 - We also saw that when the measurements follow the Pareto distribution the web site can have a large variation in its content size, meaning that even if the majority of the measurements for one web site show a small variation between them, there are some others with a big variation, like the example of the Sport Top Stories in figure 22, where five out of seven days had an average content size close to 9,500 bytes and the other two had a content size bigger than 18,000 bytes. From that case we can extract the conclusion that we cannot have a reliable estimation of the average content size of a web page if the Pareto distribution is followed, so we cannot estimate the bandwidth needed to download the web page, and we may face bigger delays and lower QoS.
  • 50. Nikolaos Draganoudis, MSc dissertation - 40 - 5 MEASUREMENTS AND ANALYSIS OF THE INTER- ARRIVAL TIME OF PACKETS OF A WEB PAGE RESPONSE In this chapter we will focus on the measurements that were performed at the web pages in order to examine the inter-arrival time of the received packets produced by a web page request. When the user tries to access to a web page then a packet is sent to the service provider asking for access to the contents of the page. Then the provider after processing the user’s request sends back to the user the contents of the page. These may not fit into a single packet for many reasons like big amount in bytes or fragmentation of the packet by the network. For that reason the user receives many packets for this particular web page request and these packets we need to capture to observe their inter-arrival time. This can be done with the Wireshark program that was presented in Chapter 3 and also the requests were produced by the emulator presented in the 3rd Chapter. From these measurements we will try to extract some useful results about these received packets and the possibility of this these packets following a mathematical distribution. 5.1 Method of the Analysis of the BBC’s Web Site Inter-arrival time In order to obtain right results from the performed measurements it is very important to make the analysis correctly, otherwise the results from the analysis will be useless. The analysis cannot be done like the previous Chapter where we gathered the measurements from the same subcategory and analysed them all together, for example we cannot gather the Education stories all together and extract results from them but we need to analyse and study every web page on its own. In that way we will observe the inter-arrival time of the packets of the web page separately from other web page packets that are irrelevant with it. 
On the following pages we present the measurements that were made and the results extracted from them.
  • 51. 5.2 Inter-arrival Time Measurements of the Web Pages
As the graphs produced by the measurements are too many, we decided to present a representative sample of them, capable of yielding useful results. We present graphs from all the days of a week, as the measurements were made over a one-week period.
5.2.1 Monday Measurements
We start by presenting Monday's measurements for different web sites. The next graphs present the PDF and CDF of the collected data for the BBC Home web page on Monday.
Figure 34. PDF and CDF of Monday's BBC Home page
Figure 35. Pareto CDF versus Empirical CDF of Monday's BBC Home page
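The "Pareto CDF versus empirical CDF" comparisons shown in these figures can be reproduced with two small functions. This is a hedged sketch under our own naming, not the dissertation's code:

```python
def empirical_cdf(data):
    """Points (x, F(x)) of the empirical CDF, where F(x) is the
    fraction of samples less than or equal to x."""
    xs = sorted(data)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]

def pareto_cdf(x, x_m, alpha):
    """CDF of a Pareto distribution with scale x_m and shape alpha:
    F(x) = 1 - (x_m / x)^alpha for x >= x_m, else 0."""
    return 0.0 if x < x_m else 1.0 - (x_m / x) ** alpha
```

Plotting both curves over the same range of inter-arrival times, as in figure 35, makes any mismatch between the fitted Pareto and the measured data visible at a glance.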
  • 52. Figure 36. Normal PDF and CDF of Monday's BBC Home page
From the graphs presented on the previous page we can observe that the measurements of the inter-arrival time of the packets of the BBC Home page response do not follow the Pareto distribution. This can be seen clearly in figure 35, where the empirical CDF of the data follows a different curve from the Pareto CDF and shares only two common points with it. Comparing the CDF of the Normal distribution with the CDF of the collected data in figure 34, however, we can see that the measurements tend to follow the Normal distribution rather than the Pareto distribution. This can also be observed from the PDF of the data, because the data tend to have a bell-shaped curve, just like the Normal PDF.
Now we will present the results for News Top Story 1 of the BBC News category.
Figure 37. PDF and CDF of Monday's News Top Story 1
  • 53. Figure 38. Pareto CDF versus Empirical CDF of Monday's News Top Story 1
Figure 39. Normal PDF and CDF of Monday's News Top Story 1
Figure 37 presents the PDF and CDF of the measurements performed on the News Top Story 1 web site for Monday, and figure 38 then compares the Pareto CDF with the empirical CDF of the collected data. In this comparison the Pareto CDF is close to the empirical CDF of the data, but from figure 39 we can see that the Normal distribution is closer to the PDF and CDF of the collected data and fits better than the Pareto distribution. So for this measurement the Normal distribution is more appropriate than the Pareto.
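The visual judgement that the Normal fits better than the Pareto can be made quantitative with the largest vertical distance between the empirical CDF and each candidate CDF (a Kolmogorov–Smirnov style statistic): the candidate with the smaller distance fits better. A hedged sketch, with function names of our own choosing:

```python
import math

def normal_cdf(x, mu, sigma):
    """CDF of a Normal(mu, sigma) distribution via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def max_cdf_gap(data, model_cdf):
    """Largest vertical distance between the empirical CDF of `data`
    and a model CDF; smaller means a better fit."""
    xs = sorted(data)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs):
        # Compare the model against the empirical CDF on both sides
        # of its jump at x (values i/n and (i+1)/n).
        gap = max(gap,
                  abs((i + 1) / n - model_cdf(x)),
                  abs(i / n - model_cdf(x)))
    return gap
```

Applied to the inter-arrival samples of one page, `max_cdf_gap(data, fitted_normal)` versus `max_cdf_gap(data, fitted_pareto)` would replace the eyeball comparison of figures 38 and 39 with a single number per candidate.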
  • 54. We will continue the analysis with the BBC's News Top Story 3 web page.
Figure 40. PDF and CDF of Monday's News Top Story 3
Figure 41. Pareto CDF versus Empirical CDF of Monday's News Top Story 3
  • 55. Figure 42. Normal PDF and CDF of Monday's News Top Story 3
For this set of measurements we can also see that the Pareto distribution is not the most appropriate one to characterize the collected data. This can be seen in figure 41, where there are parts of the curves that follow different paths and have no common points. Figure 42 indicates that the Normal distribution is the more appropriate distribution to characterize the data, which can be confirmed by comparing the PDF and CDF of the collected data with those of the Normal distribution.
5.2.2 Tuesday Measurements
Starting the measurements for Tuesday, we will examine the BBC News web page.
Figure 43. PDF and CDF of Tuesday's News web page
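The Normal curves compared against the data in these figures require a mean and standard deviation. The dissertation does not state how the fits were parameterised, but the usual moment estimates can be sketched as follows:

```python
import math

def fit_normal(data):
    """Moment estimates (mu, sigma) for a Normal fit to the samples,
    using the sample mean and the unbiased sample standard deviation."""
    n = len(data)
    mu = sum(data) / n
    var = sum((x - mu) ** 2 for x in data) / (n - 1)
    return mu, math.sqrt(var)
```

The returned `(mu, sigma)` pair fully determines the Normal PDF and CDF drawn over the measured inter-arrival times.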
  • 56. Figure 44. Pareto CDF versus Empirical CDF of Tuesday's News web page
Figure 45. Normal PDF and CDF of Tuesday's News web page
Figure 43 shows the PDF and CDF of the collected data for the News web page. We have to mention that the PDF is sometimes not an appropriate basis for comparison with another distribution. This is because, when taking measurements from received packets and computing their inter-arrival times, most packets arrive at similar time intervals and therefore each value has roughly the same probability of occurrence as the others, except when several packets are received within small time intervals. Figure 44 shows the Pareto CDF and the empirical CDF of the collected data; from that graph we can see that the Pareto is not the distribution that fits the data best, while the CDF of the Normal distribution in figure 45 shows that the Normal distribution is more appropriate and fits the collected data better.
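One practical way to make the PDF comparison less noisy is to bin the inter-arrival times into a histogram-based density estimate rather than treating each distinct value separately. A minimal sketch (the bin count is a free choice, and we assume the samples are not all identical):

```python
def binned_pdf(data, n_bins):
    """Histogram estimate of the PDF: returns (bin_centre, density)
    pairs, where the densities integrate to 1 over the data range."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in data:
        # Clamp the maximum value into the last bin.
        counts[min(int((x - lo) / width), n_bins - 1)] += 1
    n = len(data)
    return [(lo + (i + 0.5) * width, c / (n * width))
            for i, c in enumerate(counts)]
```

With a sensible bin width, the bell shape (or heavy tail) of the inter-arrival distribution becomes visible in the binned densities even when the raw per-value probabilities are nearly flat.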