Nikolaos Draganoudis, MSc dissertation
- i -
Internet Traffic Measurement and Analysis
Nikolaos Draganoudis
Master of Science in Mobile and Satellite Communications
from the
University of Surrey
Department of Electronic Engineering
Faculty of Engineering and Physical Sciences
University of Surrey
Guildford, Surrey, GU2 7XH, UK
August 2008
Supervised by: Prof. Zhili Sun
Nikolaos Draganoudis 2008
DECLARATION OF ORIGINALITY
I confirm that the project dissertation I am submitting is entirely my own work and that
any material used from other sources has been clearly identified and properly
acknowledged and referenced. In submitting this final version of my report to the JISC
anti-plagiarism software resource, I confirm that my work does not contravene the
university regulations on plagiarism as described in the Student Handbook. In so doing I
also acknowledge that I may be held to account for any particular instances of uncited
work detected by the JISC anti-plagiarism software, or as may be found by the project
examiner or project organiser. I also understand that if an allegation of plagiarism is upheld
via an Academic Misconduct Hearing, then I may forfeit any credit for this module or a
more severe penalty may be agreed.
Dissertation Title
Internet Traffic Measurement and Analysis
Author Name
Nikolaos Draganoudis
Author Signature Date: 11/08/2008
Supervisor’s name:
Prof. Zhili Sun
ACKNOWLEDGEMENT
The writing of this dissertation has been a great academic challenge. Without the support, guidance and patience of the following people, this study would not have been completed. I owe my deepest gratitude to Prof. Zhili Sun, my supervisor, to my friends and colleagues, and finally to my family, who supported me and gave me the opportunity to study abroad at the University of Surrey.
ABSTRACT
In the last few years, major improvements have been observed in the field of telecommunications. As a result, mobile terminals have become faster, with larger capacity and smaller size than before. This has created great opportunities for developing the services that mobile phones can provide. Nowadays mobile phones are used not only for calls and text messages but also to browse websites, send email, listen to music and record videos.
For this dissertation, an emulator of a mobile device connected to the Internet through a laptop will be used. Web browsing from this emulator will be performed on the BBC's mobile web site, as the BBC web site is a rich source of information, frequently updated and well structured. Wireshark will also be used to capture the packets arriving at the emulator and to calculate the inter-arrival times between them.
Obtaining the appropriate literature background is essential for understanding and working in this field. This dissertation also provides experience in planning measurements, working with large amounts of collected data and extracting useful results from them. It will be examined whether the sizes of the web pages follow any known mathematical distribution, which will help to characterise the traffic produced by a web page response. Furthermore, the inter-arrival times of the incoming packets of a web page response will be examined to determine whether these packets follow a distribution. This can help us to understand the QoS provided by the web service provider.
The study and examination of the BBC's web sites will give useful information about the traffic generated and the time consumed to download their contents, which could be used as a guideline for providing improved Internet services with higher QoS. It will also be a useful tool for understanding mobile Internet services and their impact on the network's resources.
TABLE OF CONTENTS
Internet Traffic Measurement and Analysis ...............................................................i
Nikolaos Draganoudis................................................................................................i
Declaration of originality..........................................................................................ii
Acknowledgement....................................................................................................iii
Abstract ....................................................................................................................iv
Table of Contents ......................................................................................................v
List of Figures .........................................................................................................vii
1 Introduction ..........................................................................................................1
1.1 Background and Context...............................................................................1
1.2 Scope and Objectives ....................................................................................2
1.3 Achievements................................................................................................4
1.4 Overview of Dissertation ..............................................................................5
2 Literature Review.................................................................................................6
2.1 Introduction ...................................................................................................6
2.2 Introduction to Internet Protocol Stack ............................................6
2.3 IP protocol .................................................................................................9
2.4 TCP Protocol ...............................................................................................11
2.5 World Wide Web .........................................................................................13
2.6 Mathematical Distributions for the Analysis...............................................14
2.6.1 Power Law and Pareto distributions.......................................................14
2.6.2 Normal Distribution................................................................................15
2.7 Summary .....................................................................................................16
3 Internet Traffic Measurements And Methodology .............................................17
3.1 Methodology of measurements ...................................................................17
3.1.1 Target of the measurements....................................................................17
3.1.2 Measurement tools..................................................................................18
3.2 Performed Measurements............................................................................22
3.3 Summary .....................................................................................................25
4 BBC’S Web Site Traffic Measurements And Analysis ......................................26
4.1 General Analysis of the BBC’s Web Site categories...................................26
4.2 Analysis of the BBC’s Web Site and Mathematical Distributions..............28
4.3 Conclusions.................................................................................................38
5 Measurements And Analysis of The Inter-arrival Time of Packets of a Web Page
Response......................................................................................................................40
5.1 Method of the Analysis of the BBC’s Web Site Inter-arrival time..............40
5.2 Inter-arrival Time Measurements of the Web Pages ...................................41
5.2.1 Monday Measurements ..........................................................................41
5.2.2 Tuesday Measurements ..........................................................................45
5.2.3 Wednesday Measurements .....................................................................51
5.2.4 Thursday Measurements.........................................................................55
5.2.5 Friday Measurements .............................................................................59
5.2.6 Saturday Measurements..........................................................................63
5.2.7 Sunday Measurements............................................................................66
5.3 Measurements that fit to the Pareto distribution .........................................68
5.4 Conclusions.................................................................................................71
6 Conclusion..........................................................................................................73
6.1 Summary and Evaluation ............................................................................73
6.2 Future Work.................................................................................................74
References...............................................................................................................76
Appendix 1 - Work plan..........................................................................................78
Appendix 2 – Matlab Code .....................................................................................79
Appendix 3 – Content in bytes of BBC's web sites..................................................82
Appendix 4 – Inter-Arrival Time Measurements of BBC's web sites ......................83
LIST OF FIGURES
Figure 1. Protocol stack ......................................................................................................6
Figure 2. Encapsulation of data as it goes down the protocol stack [1]..............................8
Figure 3. IP header fields [1]...............................................................................................9
Figure 4. TCP header fields [1].........................................................................................11
Figure 5. S60 3rd Edition emulator ....................................................................18
Figure 6. BBC website explore.........................................................................................19
Figure 7. Wireshark traffic presentation ...........................................................................21
Figure 8. BBC categories that measurements will be performed .....................................22
Figure 9. BBC News subcategories that measurements will be performed......................23
Figure 10. BBC Sport subcategories that measurements will be performed ..................24
Figure 11. Week traffic of BBC’s main categories .........................................................26
Figure 12. Average values of contents for BBC News and Sport subcategories ............27
Figure 13. PDF and CDF of Education Stories...............................................................29
Figure 14. Pareto CDF versus Empirical CDF of Education Stories..............................29
Figure 15. Normal PDF and CDF of Education Stories..................................................29
Figure 16. PDF and CDF of News Top Stories...............................................................30
Figure 17. Pareto CDF versus Empirical CDF of News Top Stories..............................30
Figure 18. Normal PDF and CDF of News Top Stories..................................................31
Figure 19. PDF and CDF of Politics Stories...................................................................31
Figure 20. Pareto CDF versus Empirical CDF of Politics Stories ..................................32
Figure 21. Normal PDF and CDF of Politics Stories......................................................32
Figure 22. PDF and CDF of Sport Top Stories ...............................................................33
Figure 23. Pareto CDF versus Empirical CDF of Sport Top Stories ..............................33
Figure 24. Normal PDF and CDF of Sport Top Stories ..................................................33
Figure 25. PDF and CDF of Tennis Stories.....................................................................34
Figure 26. Pareto CDF versus Empirical CDF of Tennis Stories....................................34
Figure 27. Normal PDF and CDF of Tennis Stories .......................................................35
Figure 28. PDF and CDF of Football Top Stories...........................................................35
Figure 29. Pareto CDF versus Empirical CDF of Football Top Stories..........................36
Figure 30. Normal PDF and CDF of Football Top Stories .............................................36
Figure 31. PDF and CDF of Championship Stories........................................................37
Figure 32. Pareto CDF versus Empirical CDF of Championship Stories.......................37
Figure 33. Normal PDF and CDF of Championship Stories...........................................38
Figure 34. PDF and CDF of Monday’s BBC Home page...............................................41
Figure 35. Pareto CDF versus Empirical CDF of Monday’s BBC Home page..............41
Figure 36. Normal PDF and CDF of Monday’s BBC Home page..................................42
Figure 37. PDF and CDF of Monday’s News Top Story 1 .............................................42
Figure 38. Pareto CDF versus Empirical CDF of Monday’s News Top Story 1 ............43
Figure 39. Normal PDF and CDF of Monday’s News Top Story 1 ................................43
Figure 40. PDF and CDF of Monday’s News Top Story 3 .............................................44
Figure 41. Pareto CDF versus Empirical CDF of Monday’s News Top Story 3 ............44
Figure 42. Normal PDF and CDF of Monday’s News Top Story 3 ................................45
Figure 43. PDF and CDF of Tuesday’s News web page.................................................45
Figure 44. Pareto CDF versus Empirical CDF of Tuesday’s News web page................46
Figure 45. Normal PDF and CDF of Tuesday’s News web page....................................46
Figure 46. PDF and CDF of Tuesday’s News Top Story 2 .............................................47
Figure 47. Pareto CDF versus Empirical CDF of Tuesday’s News Top Story 2 ............47
Figure 48. Normal PDF and CDF of Tuesday's News Top Story 2 ..................................47
Figure 49. PDF and CDF of Tuesday’s Business Story 2 ...............................................48
Figure 50. Pareto CDF versus Empirical CDF of Tuesday’s Business Story 2 ..............48
Figure 51. Normal PDF and CDF of Tuesday’s Business Story 2 .................................49
Figure 52. PDF and CDF of Tuesday’s Football Top Story 1 .........................................49
Figure 53. Pareto CDF versus Empirical CDF of Tuesday’s Football Top Story 1 ........50
Figure 54. Normal PDF and CDF of Tuesday’s Football Top Story 1............................50
Figure 55. PDF and CDF of Wednesday’s BBC Home web page ..................................51
Figure 56. Pareto CDF versus Empirical CDF of Wednesday’s BBC Home web page .51
Figure 57. Normal PDF and CDF of Wednesday’s BBC Home web page.....................52
Figure 58. PDF and CDF of Wednesday’s Technology Story 1......................................52
Figure 59. Pareto CDF versus Empirical CDF of Wednesday’s Technology Story 1.....53
Figure 60. Normal PDF and CDF of Wednesday’s Technology Story 1.........................53
Figure 61. PDF and CDF of Wednesday’s Tennis Story 1 ..............................................54
Figure 62. Pareto CDF versus Empirical CDF of Wednesday’s Tennis Story 1 .............54
Figure 63. Normal PDF and CDF of Wednesday’s Tennis Story 1.................................54
Figure 64. PDF and CDF of Thursday’s BBC Home web page......................................55
Figure 65. Pareto CDF versus Empirical CDF of Thursday’s BBC Home web page.....55
Figure 66. Normal PDF and CDF of Thursday’s BBC Home web page ........................56
Figure 67. PDF and CDF of Thursday’s BBC News ......................................................56
Figure 68. Pareto CDF versus Empirical CDF of Thursday’s BBC News......................57
Figure 69. Normal PDF and CDF of Thursday’s BBC News .........................................57
Figure 70. PDF and CDF of Thursday’s Football Top Story 1 .......................................58
Figure 71. Pareto CDF versus Empirical CDF of Thursday’s Football Top Story 1.......58
Figure 72. Normal PDF and CDF of Thursday’s Football Top Story 1 ..........................58
Figure 73. PDF and CDF of Friday’s Business Story 2 ..................................................59
Figure 74. Pareto CDF versus Empirical CDF of Friday’s Business Story 2 .................59
Figure 75. Normal PDF and CDF of Friday’s Business Story 2.....................................60
Figure 76. PDF and CDF of Friday’s Formula Story 1...................................................60
Figure 77. Pareto CDF versus Empirical CDF of Friday’s Formula Story 1..................61
Figure 78. Normal PDF and CDF of Friday’s Formula Story 1......................................61
Figure 79. PDF and CDF of Friday’s BBC News web page...........................................62
Figure 80. Pareto CDF versus Empirical CDF of Friday’s BBC News web page..........62
Figure 81. Normal PDF and CDF of Friday’s BBC News web page..............................62
Figure 82. PDF and CDF of Saturday’s BBC Home web page ......................................63
Figure 83. Pareto CDF versus Empirical CDF of Saturday’s BBC Home web page .....63
Figure 84. Normal PDF and CDF of Saturday’s BBC Home web page .........................64
Figure 85. PDF and CDF of Saturday’s BBC Sport web page .......................................64
Figure 86. Pareto CDF versus Empirical CDF of Saturday’s BBC Sport web page.......65
Figure 87. Normal PDF and CDF of Saturday’s BBC Sport web page ..........................65
Figure 88. PDF and CDF of Sunday’s BBC Education Story 1......................................66
Figure 89. Pareto CDF versus Empirical CDF of Sunday’s BBC Education Story 1.....66
Figure 90. Normal PDF and CDF of Sunday’s BBC Education Story 1 ........................66
Figure 91. PDF and CDF of Sunday’s BBC Formula Story 1 ........................................67
Figure 92. Pareto CDF versus Empirical CDF of Sunday’s BBC Formula Story 1 .......67
Figure 93. Normal PDF and CDF of Sunday’s BBC Formula Story 1 ...........................68
Figure 94. PDF and CDF of Thursday’s Sport Top Story 1 ............................................68
Figure 95. Pareto CDF versus Empirical CDF of Thursday’s Sport Top Story 1 ...........69
Figure 96. Normal PDF and CDF of Thursday’s Sport Top Story 1...............................69
Figure 97. PDF and CDF of Monday’s News Top Story 1 .............................................70
Figure 98. Pareto CDF versus Empirical CDF of Monday’s News Top Story 1 ............70
Figure 99. Normal PDF and CDF of Monday’s News Top Story 1 ................................71
1 INTRODUCTION
1.1 Background and Context
The technology of information gathering, processing and distribution was the key technology of the 20th century. It brought the development of worldwide telephone networks, the birth of the still-growing computer industry and the development of satellite communications [18].
Under the old concept of computer systems, all the work from different users was processed by one big computer. Nowadays this concept has been completely abandoned, and its place has been taken by the "computer network", in which many autonomous, interconnected computers process the incoming work. The interconnected computers can exchange information through copper wire, fibre optics, microwaves or satellites. The information is exchanged in small units of data called packets. These networks of computers can take many different forms, sizes and shapes, such as wireless networks and wide area networks [1] [3].
In the early stages of its development, in the early 1980s, the Internet was a single network; its predecessor was the ARPANET (Advanced Research Projects Agency Network), developed by the United States Department of Defence. Now the Internet consists of thousands of different networks connected to each other, each of which provides common services to customers and follows common protocols. These networks are controlled by ISPs (Internet Service Providers), which are responsible for providing their customers with connectivity to the Internet. The Internet interconnects ISPs of different sizes, forming a hierarchical structure. The most common ISPs are the transport providers, which deal with the provision of a wide range of services to customers; there are also the backbone providers, which connect to many other ISPs and carry the traffic that customers produce, and the web hosting providers, which host web pages for customers. The relationships between the different ISPs are business relationships and are related to the quality and the type of service provided to the customer. Nowadays the Internet is not only
used to communicate with other people all over the world but is mainly used to make money by providing many different services. Organizations, businesses small and large, consumers and even individuals now see the Internet from a different perspective and prefer to do their business through it. All these increasing expectations require the Internet to become more and more reliable [1] [3] [4].
For that reason, many academic researchers, companies and other groups have focused their attention on the Internet traffic that customers generate. They have made more and more measurements in order to examine this traffic and produce useful results that help to improve the Internet network and Internet traffic performance management [18]. From the measurements we can see the network's response and behaviour under any upgrade or degradation of performance [2].
As mentioned in the previous paragraph, measurements are very important for understanding the demands of a service, so application-level measurements will be performed in this dissertation, trying to understand the service's use of the network, the demands of the service and the effects the service has on the network and its performance.
1.2 Scope and Objectives
The scope of this dissertation is Internet traffic measurement and analysis. As mentioned in the previous paragraphs, application-level measurements will be made in this dissertation. The advantage of application-level measurements is that they provide an overall view of the application's performance, which would not be as clear if the measurements were made at lower levels. More specifically, web browsing measurements will be performed by downloading web page contents.
In the last few years, major improvements have been observed in the field of telecommunications. As a result, mobile terminals have become faster, with larger capacity and smaller size than before. This has created great opportunities for developing the services provided by mobile phones. So nowadays mobile
phones are used not only for calls and text messages but also to browse websites, send email, listen to music and record videos. For that reason, an Internet service provider over a dial-up telephone line is no longer the only way a client can access the Internet. On the contrary, every user can access the Internet through a mobile phone, laptop or palm PC. As a result, Internet access has become more flexible [2] [3] [16].
For this reason, the measurements to be performed will concern the traffic that a mobile phone can produce, and we will use an emulator installed on a laptop connected to the Internet through the University of Surrey campus network. The platform emulates a mobile phone, and through it we have the ability to browse web pages. For the measurements we needed a service provider with a well-defined web page structure and rich, up-to-date contents. For that reason we decided to browse the web pages of the mobile edition of the BBC, a provider that has these characteristics.
Among the objectives of the dissertation are to understand and gain knowledge of the way the Internet works and responds to requests. That can be achieved more easily through the measurement procedure, as every packet the mobile phone sends and receives will be captured and analysed. The way the measurement results are stored and organized is another important parameter, as it may affect the extraction of the conclusions. It is also important to mention that another objective of this dissertation is to calculate the time consumed from the request of a web page until the end of the responses to that request, as well as the inter-arrival times between the packets of the same response. Furthermore, measurements of the total size of the web pages in bytes will be taken in this dissertation. The main objective is to observe the measurement results, both for the inter-arrival times and for the sizes of the web pages, and to determine through their analysis whether the results fit a mathematical distribution. During the analysis of the total content size of the web pages we will have the chance to see how the size of the data changes over a week-long period.
1.3 Achievements
The first step in coping with the dissertation was to gain the appropriate background, in order to become familiar with the topic of the dissertation and understand the requirements. It was also very important to review other work in the field of Internet traffic measurement and analysis. Through the literature review, the type of measurements to be performed became clear, and the final decision about the web sites to be measured was taken. It was also decided that the S60 3rd Edition SDK emulator and the Wireshark network packet analyzer would be used for the measurements.
After the first step, registration with Nokia was completed in order to obtain the rights to use the emulator, and the emulator was then installed on the laptop. Familiarization with the emulator and the options it provides was carried out, and some trial measurements were performed on the BBC's mobile web sites. At the same time, the incoming and outgoing packets were captured with Wireshark and then examined. After exploring the structure of the BBC's web site, systematic measurements were performed, with the results stored for the further analysis described in the following step of the dissertation.
The following achievement is the analysis of the data collected from the measurements. The measurement results are used to identify a pattern that can be used to categorize them and to try to fit them to a known mathematical distribution. The changes in the content of the pages over a week-long period were also examined and will be presented. Finally, the results are presented together with the mathematical distribution that best fits the measured data and represents the analysis of the BBC's mobile web page contents.
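To sketch the kind of distribution fitting used in the analysis chapters, the code below estimates Pareto parameters from a sample by maximum likelihood and compares the fitted CDF with the empirical CDF. The dissertation's analysis was done in Matlab; this Python sketch and its sample values are illustrative assumptions, not the measured data.

```python
import math

def fit_pareto_mle(data):
    """Maximum-likelihood estimates of the Pareto scale (xm) and shape (alpha)."""
    xm = min(data)
    alpha = len(data) / sum(math.log(v / xm) for v in data)
    return xm, alpha

def pareto_cdf(x, xm, alpha):
    """Pareto CDF: 1 - (xm/x)^alpha for x >= xm, and 0 below xm."""
    return 0.0 if x < xm else 1.0 - (xm / x) ** alpha

def empirical_cdf(data, x):
    """Fraction of samples less than or equal to x."""
    return sum(1 for v in data if v <= x) / len(data)

# Hypothetical page sizes in bytes (illustrative only)
sizes = [12000, 13200, 14100, 15500, 18800, 22500, 40100, 60200]

xm, alpha = fit_pareto_mle(sizes)
# Largest gap between empirical and fitted CDF (a Kolmogorov-Smirnov style check)
gap = max(abs(empirical_cdf(sizes, v) - pareto_cdf(v, xm, alpha)) for v in sizes)
```

A small `gap` suggests the Pareto model describes the sample well; the same comparison can be repeated with a normal CDF to decide which distribution fits better.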
1.4 Overview of Dissertation
The next chapter of the dissertation presents the literature review and the background concerning the Internet protocol stack and the protocols used to transmit and receive data through the Internet. The World Wide Web will also be presented. Finally, the mathematical distributions that are going to be used to characterise the collected data are explained and presented.
The third chapter introduces the reader to the technical part of the dissertation. First the tools that will be used for the measurements are presented, and then the target on which the measurements will be performed. The structure of the target web site is presented, together with the specific paths along which measurements will be performed.
The fourth chapter contains the measurements performed on the BBC's web site, more specifically those concerning the total size in bytes of the web pages. An analysis of the collected data is then performed and the mathematical distribution that best fits the measurements is chosen.
The fifth chapter contains the measurements performed on the BBC's web pages concerning the inter-arrival times between the received packets of a web page response. Through the analysis performed in this chapter, the mathematical distribution that best fits the measurements is chosen.
The sixth chapter contains the conclusions and the evaluation of this dissertation, as well as the future work that could be done in this field.
There are also four appendices at the end of the dissertation. The first appendix presents the work plan followed during the year, the second presents the Matlab code used to analyse the measurements, the third presents the table with the measurements of the total content size of the BBC's web pages, and the fourth presents the tables with the inter-arrival time measurements of the BBC's web pages.
2 LITERATURE REVIEW
2.1 Introduction
This chapter presents and reviews issues related to the project, in order to help the reader understand concepts in the field of the dissertation topic and gain the appropriate knowledge. First the Internet protocol stack will be presented, with a brief summary of every layer and its use. After that the IP (Internet Protocol), the TCP (Transmission Control Protocol) and the well-known WWW (World Wide Web) will be presented. Finally, we will present the mathematical methods that are going to be used in this dissertation to analyse the data.
2.2 Introduction to Internet Protocol Stack
As mentioned before, in order to understand the way the Internet works we have to examine the protocol stack that is implemented to send or receive a packet over the Internet. First we will briefly present the Internet protocol architecture and see how it is organised in layers. After that we will examine closely the protocols that are used, such as IP (Internet Protocol) and TCP (Transmission Control Protocol), and we will also see how the WWW (World Wide Web) works, as it is important for obtaining the appropriate background for this dissertation.
The figure below shows the protocol stack of the Internet.
Figure 1. Protocol stack
The lowest layer of the protocol stack is the Physical Layer, whose main function is the transmission of bits. For mobile phones the channel over which the bits are transmitted is the air, but for our measurements, as they are going to be made from a laptop connected to the Internet, the channel will be the copper wire.
The layer above the Physical Layer is the Data Link Layer, whose main purpose is to maintain reliable and efficient communication between two adjacent machines at this layer. One of the most important elements of this layer is the MAC (Medium Access Control) address: every computer that connects to the Internet has one, and it is unique worldwide. As it is not important for this dissertation to examine this layer further, we will only keep this in mind.
The next layer, above the Data Link Layer, is the Network Layer, whose main operation is to transmit packets from the source to the destination. In contrast with the Data Link Layer, which is concerned only with the transmission of a packet from one end of a link to the other, this layer deals with end-to-end transmission. As the function of this layer is very important, in the following pages we will examine its functionality and the protocols that exist on it in more detail.
Above the Network Layer is the Transport Layer. The function of this layer is also very important, as it is responsible for providing reliable and cost-effective data transport from the source to the destination. It also communicates with the Application Layer, receiving requests from it and delivering data packets to it. In the following pages we will examine in detail the protocol that is used to send and receive the data packets.
Finally, the layer that sits on top of all the others is the Application Layer. It is responsible for the communication of the various applications with the protocols below it. For this layer, too, we will later examine the main protocol that is used for browsing the Internet [1] [2] [7].
Now that all the layers have been presented, we will examine how they communicate with each other and the data they exchange. Starting from the Application Layer: it produces data streams, mainly generated by the user’s requests. The Transport Layer takes these data streams and fragments them into datagrams. The maximum size of each datagram is up to 64 Kbytes, but in practice the length of each datagram does not exceed 1460 bytes, so that it fits into an Ethernet packet together with the IP and TCP headers that we will see later. Each datagram then goes to the Network Layer, where the IP protocol is used with a connectionless approach, so every packet can follow a different path to the destination. After that the Data Link Layer follows, and finally the Physical Layer, where the bits are transmitted over the channel. Below we can see an example of the format of an IP packet with the header of each layer, from the Application header down to the Data Link header [1] [7].
Figure 2. Encapsulation of data as it goes down the protocol stack [1]
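The header-overhead arithmetic behind the 1460-byte figure quoted above can be sketched in a few lines of Python. This is only an illustrative sketch (not part of the dissertation's Matlab code); the 1500-byte value is the standard Ethernet payload limit.

```python
# How the 1460-byte payload figure follows from typical header sizes.
ETHERNET_MTU = 1500   # maximum Ethernet payload in bytes
IP_HEADER = 20        # IPv4 header without options (see Section 2.3)
TCP_HEADER = 20       # TCP header without options (see Section 2.4)

def max_tcp_payload(mtu=ETHERNET_MTU, ip=IP_HEADER, tcp=TCP_HEADER):
    """Bytes of application data that fit in a single Ethernet frame."""
    return mtu - ip - tcp

print(max_tcp_payload())  # -> 1460
```

If either header carries options, the usable payload shrinks accordingly, which is why 1460 bytes is a practical upper bound rather than a fixed rule.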
2.3 IP protocol
In this section we will examine the protocol that is mainly used in the Network Layer, namely IP (Internet Protocol). There are currently two versions, IPv4 and IPv6; the latter is the newer version, which provides a wider range of IP addresses and a less complex header than IPv4. We will examine IPv4, as it is currently used more widely than IPv6. As can be observed in the picture below, the IP header has a 20-byte fixed part and a variable-length optional part.
Figure 3. IP header fields [1]
Now we will briefly explain the fields of the protocol and their usage.
Version: This field contains the version of the IP protocol in use, which is necessary for communication between two machines that use different versions of IP.
IHL: This field gives the length of the header in 32-bit words, which is needed because the Options field has variable length.
Type of service: This field is mainly used in order to distinguish between different
classes of services, for example for voice we need fast and accurate delivery of the packet.
Total length: This field includes the total length of the packet, including the header and
the data.
Identification: This field lets the receiver determine which datagram a received fragment belongs to.
DF bit (Don’t Fragment): When this bit is set to 1 the datagram cannot be fragmented
by routers of the network.
MF bit (More Fragments): This bit indicates that more fragments of the same
datagram are expected.
Fragment offset: This field indicates the position of the fragment’s data within the original datagram.
Time to live: This field indicates the maximum number of hops that a datagram can make on its way to the final destination. The number is decremented by one every time a router forwards the datagram, and when it reaches zero the datagram is dropped by the network.
Protocol: This field indicates the upper-layer protocol to which IP should deliver the packet.
Header checksum: It verifies only that the header has no errors.
Source address: Indicates the IP address of the sender of the packet.
Destination address: Indicates the IP address of the receiver.
Options field: This field can be used to add more functions to the protocol, such as security, timestamping and source routing [1] [2] [7].
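As an illustration of how these fixed fields are laid out on the wire, the following Python sketch decodes the 20-byte fixed part of an IPv4 header. It is illustrative only (not part of the dissertation's tooling), and the sample header bytes and addresses are hand-made for the example.

```python
import struct

def parse_ipv4_header(raw: bytes) -> dict:
    """Decode the 20-byte fixed part of an IPv4 header (fields above)."""
    ver_ihl, tos, total_len, ident, flags_frag, ttl, proto, checksum, src, dst = \
        struct.unpack("!BBHHHBBH4s4s", raw[:20])
    return {
        "version": ver_ihl >> 4,
        "ihl": ver_ihl & 0x0F,            # header length in 32-bit words
        "total_length": total_len,
        "identification": ident,
        "df": bool(flags_frag & 0x4000),  # Don't Fragment bit
        "mf": bool(flags_frag & 0x2000),  # More Fragments bit
        "fragment_offset": flags_frag & 0x1FFF,
        "ttl": ttl,
        "protocol": proto,                # e.g. 6 = TCP
        "src": ".".join(str(b) for b in src),
        "dst": ".".join(str(b) for b in dst),
    }

# A hand-made sample header: version 4, IHL 5, DF set, TTL 64, protocol TCP (6).
sample = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40, 1, 0x4000, 64, 6, 0,
                     bytes([192, 168, 0, 1]), bytes([8, 8, 8, 8]))
h = parse_ipv4_header(sample)
print(h["version"], h["ttl"], h["src"])  # -> 4 64 192.168.0.1
```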
2.4 TCP Protocol
Moving to the Transport Layer, TCP (Transmission Control Protocol) and UDP (User Datagram Protocol) are the two main protocols in use. We will focus on the TCP protocol, as the World Wide Web runs over the HTTP protocol in the Application Layer and the TCP protocol in the Transport Layer. In the picture below we can see the TCP header.
Figure 4. TCP header fields [1]
Now we will briefly explain the fields of the protocol and their usage.
Source port and Destination port: These fields identify where the packets should be delivered in the upper layer, i.e. the application to which the packet belongs.
Sequence number and Acknowledgement number: These fields are used to ensure that all packets are transmitted reliably, without loss.
TCP header length: This field stores the number of 32-bit words in the TCP header. It makes clear where the header ends and where the data start, as the header can have variable length.
URG (Urgent) flag: This is set when the Urgent pointer field is in use, indicating that the packet carries urgent data.
ACK flag: This is set when we want to acknowledge a received packet. When the ACK flag is set to 0, the Acknowledgement number field is ignored.
PSH (PUSHed data) flag: When this flag is set to 1, the data are delivered to the Application Layer on arrival and are not buffered at this stage.
RST flag: This is used to reset a connection or reject an invalid segment.
SYN flag: This is used to establish a connection between two entities.
FIN flag: This is used to release a connection between two entities.
Window size: This field stores the number of bytes that the receiver is willing to accept from the transmitter.
Checksum field: This field includes a checksum of the header and data for extra
reliability.
Urgent pointer: This field shows the byte offset where urgent data are.
Options: This field is used for extra options that are not provided in the regular header [1] [2] [7].
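To make the flag layout concrete, here is a small Python sketch that decodes the ports, header length and the six flags listed above from the first 16 bytes of a TCP header. It is illustrative only; the port numbers and segment below are made up for the example.

```python
import struct

def parse_tcp_flags(raw: bytes) -> dict:
    """Decode ports, header length and the six classic flag bits."""
    src_port, dst_port, seq, ack, off_flags, window = \
        struct.unpack("!HHIIHH", raw[:16])
    flags = off_flags & 0x3F
    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "header_words": off_flags >> 12,  # header length in 32-bit words
        "URG": bool(flags & 0x20),
        "ACK": bool(flags & 0x10),
        "PSH": bool(flags & 0x08),
        "RST": bool(flags & 0x04),
        "SYN": bool(flags & 0x02),
        "FIN": bool(flags & 0x01),
        "window": window,
    }

# A hand-made SYN segment from port 50000 to port 80 (HTTP).
sample = struct.pack("!HHIIHH", 50000, 80, 0, 0, (5 << 12) | 0x02, 65535)
seg = parse_tcp_flags(sample)
print(seg["dst_port"], seg["SYN"], seg["ACK"])  # -> 80 True False
```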
2.5 World Wide Web
The Application Layer sits on top of all the other layers we have examined, and as this dissertation focuses on the browsing of Internet sites, we will concentrate on the World Wide Web, which uses the HTTP protocol.
One of the most important services of the Internet is the World Wide Web, which began in 1989 at CERN, the European centre for nuclear research. It became very popular all over the world, as it is easy for beginners to use and its interface is well designed. At the beginning it was designed so that the scientists of CERN could share their research and exchange ideas, as many of them were working in different countries, but the World Wide Web (WWW) grew out of these needs and came to be used by the entire world. This happened when CERN and M.I.T. signed an agreement setting up the World Wide Web Consortium, an organization responsible for developing the World Wide Web by standardizing protocols and encouraging interoperability between the companies that had developed browsers at the time, Netscape and Microsoft [1] [3].
The World Wide Web consists of a huge number of documents, also called Web pages, distributed all over the world. Every web page may contain several links to other pages around the world. In this way a complicated web of connections between the pages is formed, and every user can have access to them. As it would be very difficult to keep track of the path that has been followed to reach a page, the World Wide Web uses the URL (Uniform Resource Locator), a unique identifier for each Web page. A user can therefore simply remember the URL of a Web site in order to access it. As the World Wide Web grew faster and faster, applications that helped the users were developed: the Web browsers. With Web browsers it was easier to browse different sites and keep a record of the URLs of pages one may want to visit again. Browsers made the World Wide Web friendly to use and attracted more users [1] [2] [3].
2.6 Mathematical Distributions for the Analysis
The analysis of the measurements is the process in which the collected data are examined in order to determine whether they can be modelled by a known distribution or model. Such a model can then be used to characterize related phenomena or similar types of data. After examining several mathematical distributions, we will focus on the power-law, Pareto and Normal distributions for the scope of this dissertation. By the end of the dissertation the reason for this selection will be clear.
2.6.1 Power Law and Pareto distributions
In recent years, a significant amount of research has focused on showing that many physical and social phenomena follow a power-law distribution. Some examples of these phenomena are the World Wide Web [9], metabolic networks, Internet router connections, journal paper reference networks, and sexual contact networks [8]. There is sometimes confusion between the power law and the Pareto distribution, but we will make this clear in the next paragraphs [8] [9].
We will try to explain both the Pareto and the power law through an example from Lada A. Adamic [9], and make their similarities and differences clear. Take the distribution of income as an example: in the Pareto form, instead of asking what the r-th largest income is, we ask how many people have an income greater than x [14]. So we arrive at the equation P[X > x] ≈ x^(-k). For this reason we can say that Pareto’s law is given in terms of the cumulative distribution function (CDF), i.e. the number of events larger than x is an inverse power of x. What we call the power-law distribution tells us not how many people had an income greater than x, but how many people have an income of exactly x. It is therefore the probability density function (PDF) associated with the CDF given by Pareto’s law. From this we obtain P[X = x] ≈ x^(-(k+1)) = x^(-a), where k is the Pareto distribution shape parameter [9] [13].
Now we will explain how we are going to work with these distributions and try to fit the collected data. In order to compare the data with the Pareto law, we have to find the CDF of both the data and a Pareto distribution with parameters that approach the curve of the data’s CDF. We know from the theory that the Pareto CDF is given by the formula F(x) = 1 - (b/x)^a for x > b, and 0 otherwise, where ‘a’ is the shape parameter and ‘b’ is the scale parameter. As we saw before, a = k - 1, where ‘k’ is the exponent of the power law. In the related literature, the ‘b’ parameter is commonly taken to be the smallest value of the data being examined. In order to find the value of the parameter ‘a’, we will use a program by Aaron Clauset, Cosma Rohilla Shalizi and M. E. J. Newman that estimates the value of ‘k’ for which the power law best fits the supplied data. The program estimates ‘k’ for each possible minimum value x of the incoming values via the method of maximum likelihood and calculates the Kolmogorov-Smirnov goodness-of-fit statistic; it then selects the x with the minimum Kolmogorov-Smirnov statistic and reports the corresponding ‘k’ value. Generally the Kolmogorov-Smirnov method is used when the sample size of each test is small, as in our case where we have from 3 to 15 values per test. The KS test is based on the value K = sup_x |F*(x) - S(x)|, where F*(x) is the hypothesized cumulative distribution function and S(x) is the empirical distribution function based on the sampled data [8] [6] [13] [17]. After obtaining the closest value of ‘k’ for the data, we can find the ‘a’ parameter as a = k - 1. The next step is to compare the CDF of the data with the CDF of the Pareto distribution and see whether the data follow, and therefore fit, this distribution.
Appendix 2 presents the code of the Matlab program and the functions that I wrote in order to get the information from the main program and present the results and the Pareto distribution.
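The fitting procedure described above can be sketched as follows. This is a minimal Python re-implementation of the idea, given only for illustration under the assumptions stated in its comments; the dissertation itself used the Matlab program by Clauset, Shalizi and Newman, and the toy data below are invented.

```python
import math

def fit_power_law(data):
    """For each candidate x_min, estimate the power-law exponent k by
    maximum likelihood, measure the KS distance between the empirical CDF
    and the fitted Pareto-style CDF F(x) = 1 - (x_min/x)**(k-1), and keep
    the (k, x_min) pair with the smallest KS distance."""
    xs = sorted(data)
    best = None
    for xmin in sorted(set(xs)):
        tail = [x for x in xs if x >= xmin]
        n = len(tail)
        if n < 2:
            continue
        # Continuous maximum-likelihood estimate of the exponent.
        k = 1.0 + n / sum(math.log(x / xmin) for x in tail)
        # Simplified KS distance: largest gap between the empirical CDF
        # and the model CDF, evaluated at the sorted tail points.
        ks = max(abs((i + 1) / n - (1.0 - (xmin / x) ** (k - 1.0)))
                 for i, x in enumerate(tail))
        if best is None or ks < best[0]:
            best = (ks, k, xmin)
    return best  # (ks_distance, k_estimate, x_min)

# Toy data only; the real analysis used the measured page sizes.
sample = [1.2, 1.5, 2.1, 3.0, 4.4, 6.8, 9.5, 15.0, 27.0, 80.0]
ks, k, xmin = fit_power_law(sample)
```

Note that the fitted CDF uses the exponent k - 1, matching the relation a = k - 1 between the power-law exponent and the Pareto shape parameter stated above.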
2.6.2 Normal Distribution
Now we will present the Normal distribution and the important parameters we have to know in order to plot it. It is important to mention that all normal distributions are symmetric and have bell-shaped density curves with a single peak. The parameters that characterize this distribution are the mean value of the data, ‘μ’, and the standard deviation, ‘σ’, which is a measure of the dispersion of the data. The probability density function of the normal distribution is given by the formula f(x) = (1 / (σ√(2π))) · exp(-(x - μ)² / (2σ²)). From the probability function of the measured data we will see whether they fit this distribution [10].
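For reference, the density formula above can be written out directly as a minimal Python sketch (the dissertation's actual analysis was done in Matlab):

```python
import math

def normal_pdf(x, mu, sigma):
    """f(x) = (1 / (sigma * sqrt(2*pi))) * exp(-((x - mu)**2) / (2 * sigma**2))"""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# The curve is symmetric about the mean, where it peaks:
print(round(normal_pdf(0.0, 0.0, 1.0), 4))  # -> 0.3989
```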
2.7 Summary
In this chapter we presented the background that is necessary for this dissertation. We looked at the Internet protocol stack with a small introduction to each layer, then presented the IP (Internet Protocol), the TCP (Transmission Control Protocol) and the WWW (World Wide Web). Finally we presented the mathematical distributions that will be used to analyse the measurements. These aspects are very important in order to become familiar with the dissertation theme and understand what follows: first we have to know what we are going to measure, and then make the measurements and analyse them. The next chapter covers the methodology of the measurements, their target, and the tools that are going to be used.
3 INTERNET TRAFFIC MEASUREMENTS AND METHODOLOGY
In the previous chapter we saw useful terms related to the theoretical part of the dissertation topic, in order to become familiar with it and gain the appropriate background. In this chapter we will examine the technical part of the dissertation, which is related to the Internet traffic measurements.
3.1 Methodology of measurements
3.1.1 Target of the measurements
The first step before starting the measurements is to determine the target on which the measurements will be performed. The selection of the site is very important, as we want the data analysis to give us useful, meaningful results. The final choice of site for the measurements is the BBC’s website for mobile edition. The BBC is the main British broadcasting corporation, with worldwide recognition and acceptance [15]. The BBC’s mobile edition website tries to satisfy the user’s need to stay informed about the news while away from home. People can easily browse to the mobile website through their mobile phone or PDA and have access to BBC News, BBC Sport and other categories. The BBC mobile website thus offers useful, frequently updated information to many people, which makes it a very appropriate target for our measurements.
3.1.2 Measurement tools
3.1.2.1 S60 3rd Edition emulator
It would be difficult to perform the measurements through a mobile phone, as the operator would charge for the Internet browsing and it would also be difficult to process and store the data from the measurements, so it was decided to use a mobile phone emulator. The S60 3rd Edition SDK for C++ platform was chosen as the emulator for browsing the BBC’s mobile websites. After registration with Nokia, the platform was ready to be installed on the laptop. Then, through the laptop’s connection to the Internet, we could access the BBC’s website without being charged.
Among the many services that this emulator provides is the browser application, which we will mainly use for accessing the contents of the BBC’s website. It supports features such as HTML 4.01, XHTML, JavaScript 1.5, plug-in support and file upload over HTTP [11].
Below we can see the form of the emulator with the input keys and its menu icons.
Figure 5. S60 3rd Edition emulator
The emulator is user-friendly, and familiarity with its options can be gained quickly. In order to access the browser of the emulator, the Services icon must be pressed. After that, as we can see in figure 8, we have to type the address that we want to browse. We will now access the BBC’s webpage and explain the usage of the emulator’s diagnostic tool.
Figure 6. BBC website explore
As we can see in the above figure, the emulator provides a diagnostic tool that reports information about the traffic that has taken place, the total size of every web page that has been visited, and the type of the incoming files (text, photographs or videos), in the form of requests and responses. This will help us fulfil the part of the measurements concerning the size of the web pages.
3.1.2.2 Wireshark
As one of the main goals of this dissertation is to capture the inter-arrival times of the incoming packets for a web page request, a tool that allows us to measure these time intervals is necessary. This tool is Wireshark, a network packet analyzer. In general, Wireshark can be used by:
- network administrators to troubleshoot network problems
- network security engineers to examine security problems
- developers to debug protocol implementations
- people to learn network protocol internals [12]
For this dissertation, Wireshark will be used to capture every packet that comes from the Internet, as well as every request that we make and that goes to the Internet. From the picture below we can see that the time each packet was sent or received is captured, so from these timestamps we are able to compute the inter-arrival times of the received packets, up to the last packet of the server’s response.
Figure 7. Wireshark traffic presentation
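The computation performed on the captured timestamps is simply a first difference. A minimal Python sketch (the timestamps below are hypothetical values, standing in for Wireshark's 'Time' column):

```python
def inter_arrival_times(timestamps):
    """Gaps, in seconds, between consecutive captured packets."""
    return [t2 - t1 for t1, t2 in zip(timestamps, timestamps[1:])]

# Hypothetical arrival times (seconds) of packets in one server response:
times = [0.000, 0.042, 0.051, 0.130, 0.131]
gaps = inter_arrival_times(times)
print(len(gaps))  # one gap fewer than the number of packets
```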
3.2 Performed Measurements
Before starting the measurements it is important to make clear that the sample should be large enough and representative, and that the measurements have to be repeated many times. After gaining experience from the preparatory phase, it was decided to make the measurements for the two main categories of the BBC mobile web page, BBC News and BBC Sport. These categories are frequently updated and contain many subcategories. Moreover, we will focus on the more important subcategories of these two categories. The following tree graphs present the web pages that were chosen to be measured and their containment relationships.
Figure 8. BBC categories on which measurements will be performed
Figure 9. BBC News subcategories on which measurements will be performed
As we can see from the above graph, we will focus the measurements on six subcategories of the BBC News category (Top Stories, Technology, Politics, Entertainment, Business and Education) and then on three stories from every subcategory. These categories are assumed to concentrate the users’ preferences.
Figure 10. BBC Sport subcategories on which measurements will be performed
In the case of BBC Sport, we will focus on four subcategories (Top Stories, Motorsport, Football and Tennis), but here two of them contain further subcategories: Formula 1 and World Rally are contained in the Motorsport subcategory, and Top Stories, Premiership and Championship are contained in the Football subcategory. Measurements will be performed on the displayed stories.
Before starting the measurements it is also important to decide on the frequency of taking measurements from these web pages. For the best results this frequency should match the frequency at which the information on the pages is updated. After observing the BBC’s web site, it would not be wise to take more than one measurement per web page per day, because changes to the contents are rare. This is reasonable, as it is pointless and difficult to change the contents at such short intervals. So we decided to collect measurements for a week, at intervals of one day.
3.3 Summary
In this chapter we presented the methodology that we are going to follow for the measurements, specifying the target of the measurements and the reasons for this choice. We also presented the tools that are going to be used for the measurements, namely the Nokia S60 emulator and Wireshark. After the presentation of the tools and the way they are going to be used, we presented the specific web pages of the BBC’s mobile web site structure on which measurements are going to be performed. In the following chapter we will perform the measurements of the total content size of the web pages and the analysis of the collected data.
4 BBC’S WEB SITE TRAFFIC MEASUREMENTS AND ANALYSIS
In this chapter we will analyse and present the collected data from the measurements that were performed, in order to obtain a view of the form of the BBC’s web site traffic profile. For these measurements the emulator presented in Chapter 3 was used, together with its diagnostic tool, in order to get the total size of the web pages.
4.1 General Analysis of the BBC’s Web Site categories
In the following graph we can see the minimum, average and maximum content size, in bytes, of the main page of the BBC web site and of its two main categories, BBC News and BBC Sport.
Figure 11. Week traffic of BBC’s main categories (bar charts of the minimum, average and maximum content in bytes over a week for BBC Home, BBC News and BBC Sport)
From the above graph, we can see that News and Sport generate almost the same amount of traffic, and both generate lower traffic than the BBC Home web page. This is logical, as the front page contains a larger amount of information than its subcategories. In the following graph we present the average content sizes, in bytes, of the subcategories of BBC News and BBC Sport.
Figure 12. Average values of contents for BBC News and Sport subcategories (bar charts of the content in bytes over a week for the News subcategories, including Top Stories, Politics and Education, and for the Sport subcategories Top Stories, Motorsport, Football and Tennis)
From the above figure we can observe that in the News category almost all subcategories have average values between about 3000 and 3250 bytes, except Top Stories with 3500 bytes and Business with 3450 bytes. On the other hand, there is a big difference between the Football subcategory of Sport and the other Sport subcategories, whose average value is around 2900 bytes. This can be explained by football being more popular than the other sports, so the BBC pays more attention to it and provides more information in this field.
4.2 Analysis of the BBC’s Web Site and Mathematical Distributions
In this part of the dissertation we will analyse in depth the results of the measurements of the content in bytes of the BBC’s web pages, and we will examine whether the data could follow one of the mathematical distributions presented in Sections 2.6.1 and 2.6.2.
In the first steps of the analysis we tried to see whether the measurements could fit the Pareto distribution. To do that, we constructed the PDF (Probability Density Function) and the CDF (Cumulative Distribution Function) of the collected data; then, through the method described in Section 2.6.1, we found the parameters of the Pareto distribution, plotted the Pareto CDF and compared it with the empirical CDF of the collected data. We also compared the data with the Normal distribution, to see which distribution fits the data better. In the following pages we will present these results and assess the acceptance of the Pareto and Normal distributions.
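The comparison itself reduces to evaluating the two CDFs over the same set of page sizes. A minimal Python sketch of this step follows; the page sizes and the shape value 'a' below are placeholders, not measured values, and the dissertation performed this step in Matlab.

```python
def empirical_cdf(data):
    """Return (x, F(x)) points of the empirical CDF of the sample."""
    xs = sorted(data)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]

def pareto_cdf(x, a, b):
    """Pareto CDF from Section 2.6.1: F(x) = 1 - (b/x)**a for x > b, else 0."""
    return 1.0 - (b / x) ** a if x > b else 0.0

# Hypothetical page sizes in bytes; b is taken as the smallest observation,
# as described in Section 2.6.1, and a = k - 1 would come from the fit.
sizes = [5200, 6100, 7000, 8100, 8487, 9300, 10500, 12000]
b = min(sizes)
a = 1.5  # placeholder shape value for illustration only
gaps = [abs(f - pareto_cdf(x, a, b)) for x, f in empirical_cdf(sizes)]
print(max(gaps))  # the largest vertical gap between the two curves
```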
From the total set of graphs that were produced, we can say that the Pareto distribution does not fit the collected data on the content in bytes of the BBC’s web pages well. Of course there are some exceptions, which we will also present, where the Pareto fits the data well, but the majority of the graphs show that it does not, and that the appropriate distribution could be the Normal. We will start with the Education stories of the BBC News category. The following figure presents the PDF and the CDF of the Education stories measured over a one-week period; we can see the distribution of the web pages according to their total size in bytes.
Figure 13. PDF and CDF of Education Stories
Figure 14. Pareto CDF versus Empirical CDF of Education Stories
Figure 15. Normal PDF and CDF of Education Stories
It is obvious from the figure14 that the Pareto CDF is not following the curve of the
empirical CDF of the data as the two curves have only two common points and then the
differences between them increasing. On the other hand the figure15 shows that the
measurements of the Education stories feet very well to normal distribution with average
point equal to 8487 bytes and standard deviation equal to 1752.23. We can extract this
conclusion when we look the PDF of the data and the PDF of Normal distribution and
the CDF of the data and the CDF of Normal distribution.
Now we will examine the Top Stories of the BBC News category. The results of the
measurements come from three different top stories and represent the total content size of
the web pages.
Figure 16. PDF and CDF of News Top Stories
Figure 17. Pareto CDF versus Empirical CDF of News Top Stories
Figure 18. Normal PDF and CDF of News Top Stories
Figure 16 presents the PDF and CDF of the content in bytes of the BBC News Top Stories web pages, and Figure 17 presents the curves of the Empirical CDF of the data and the Pareto CDF, with the Pareto parameters set to bring it as close as possible to the Empirical CDF. Even then the differences between the two curves are obvious, and they have only two common points. On the contrary, comparing the PDF of the data with the PDF of the Normal distribution in Figure 18, there are many similarities in the shape of the graph, and the same holds for the CDF of the data and the CDF of the Normal distribution. The Normal distribution has a mean of 8123 bytes and a standard deviation of 1359 bytes.
The next example is the Politics stories of the BBC News. The measurements are composed of three different Politics stories.
Figure 19. PDF and CDF of Politics Stories
Figure 20. Pareto CDF versus Empirical CDF of Politics Stories
Figure 21. Normal PDF and CDF of Politics Stories
From Figure 20 we can see that the Pareto and Empirical CDF curves have different slopes, so the data of the Politics stories do not follow the Pareto distribution. On the other hand, Figures 19 and 21 show that the Normal distribution lies closer to the data and fits them well for both the PDF and the CDF. The Normal distribution parameters are a mean of 7919 bytes and a standard deviation of 1192 bytes.
Up to this point the measurements tend to fit the Normal distribution, but as the following pages show, there are also some measurements, although few, that fit the Pareto distribution well. The following figures present one such case.
Figure 22. PDF and CDF of Sport Top Stories
Figure 23. Pareto CDF versus Empirical CDF of Sport Top Stories
Figure 24. Normal PDF and CDF of Sport Top Stories
Figure 22 presents the PDF and CDF of the content size of the BBC Sport Top Stories, containing measurements from three different stories. From Figure 23 we can see that the Pareto distribution fits the collected data very well, and this can also be suspected from the PDF in Figure 22, as the data have a long tail, which is a characteristic of the Pareto distribution. On the contrary, the Normal distribution does not have a good shape, as it tries to cover all the data while a few points lie far from the majority of the measurements and away from the main bell-shaped curve. But, as we already said, there are rare cases in which the Pareto distribution fits the measurements better than the Normal distribution.
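The curve comparison behind plots like Figure 23 can be sketched with the Pareto CDF, F(x) = 1 - (xm/x)^alpha for x >= xm, and a Kolmogorov-Smirnov-style maximum gap between the two curves; the sample values and the scale and shape parameters below are assumed for illustration:

```python
# Pareto CDF: F(x) = 1 - (xm / x) ** alpha for x >= xm, else 0.
def pareto_cdf(x, xm, alpha):
    return 0.0 if x < xm else 1.0 - (xm / x) ** alpha

# Illustrative heavy-tailed sample: most days near 9,500 bytes, a few far out.
sizes = sorted([9300, 9400, 9500, 9600, 9700, 18500, 21000])
xm, alpha = min(sizes), 3.0  # assumed scale and shape parameters

# Largest vertical gap between the empirical CDF and the Pareto CDF:
# the smaller the gap, the better the Pareto curve tracks the data.
n = len(sizes)
gap = max(abs(i / n - pareto_cdf(x, xm, alpha)) for i, x in enumerate(sizes, 1))
print(f"max CDF gap = {gap:.3f}")
```

In practice the parameters would be tuned (as was done for the plotted Pareto curves) before reading anything into the gap.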
Now we will examine BBC web pages that belong to the BBC Sport category. We will start with the Tennis stories; the next graph presents the PDF and CDF of the Tennis stories according to their total content size.
Figure 25. PDF and CDF of Tennis Stories
Figure 26. Pareto CDF versus Empirical CDF of Tennis Stories
Figure 27. Normal PDF and CDF of Tennis Stories
For the Tennis stories it can be observed from Figure 26 that the Pareto distribution does not follow the Empirical CDF curve of the data, so it is not the appropriate distribution to characterize them. Comparing the Normal PDF and CDF of Figure 27 with those of Figure 25, we find a much greater similarity, and the Normal distribution characterizes the collected data more appropriately.
Continuing with the BBC Sport category, we will present the results for the Football Top Stories.
Figure 28. PDF and CDF of Football Top Stories
Figure 29. Pareto CDF versus Empirical CDF of Football Top Stories
Figure 30. Normal PDF and CDF of Football Top Stories
Figure 28 presents the PDF and CDF of the collected data for the BBC Football Top Stories pages, and Figure 29 presents the comparison between the Pareto CDF and the Empirical CDF of the data. From this comparison we can see the main differences between the two curves, which show that there is no fit with the Pareto distribution: the curves share only three points at the start, and then the distance between them increases. On the contrary, Figure 30 has many similarities with Figure 28 and follows the collected data with a better approximation.
On the following pages we present one more example of the measurements made on the BBC web pages: the Championship Stories, a subcategory of Football in the Sport category.
Figure 31. PDF and CDF of Championship Stories
Figure 32. Pareto CDF versus Empirical CDF of Championship Stories
Figure 33. Normal PDF and CDF of Championship Stories
From this last example we confirm that the Pareto distribution is not the appropriate distribution to characterize the measurements of the total content size of the BBC web pages, as can be seen from Figure 32 and the differences between the two curves. According to the results of the measurements we made and the graphs we presented, the Normal distribution is the more appropriate distribution for our measurements and fits the PDF and CDF of the collected data.
4.3 Conclusions
From the previous analysis of the measurements made on the total content size of the BBC web pages, we can say that the majority of them follow the Normal distribution and not the Pareto distribution. This conclusion can help the provider to offer a better service: since the average size of a web page is known, there is a known value for the number of bytes the user has to download to view the page, so the service provider can adjust the bandwidth the user needs to download it. The provider can also calculate the total resources that have to be provided, as there is always an estimate of the number of customers using the service and of the size of the web page.
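This provisioning idea can be made concrete with a back-of-the-envelope calculation; the target download time and the number of concurrent users below are assumed figures, not values from the measurements:

```python
# Size per-user and total bandwidth from a known mean page size.
mean_page_bytes = 8487   # fitted mean for the Education stories (Chapter 4)
target_seconds = 2.0     # assumed acceptable download time
expected_users = 500     # assumed number of concurrent page requests

per_user_bps = mean_page_bytes * 8 / target_seconds  # bits per second per user
total_bps = per_user_bps * expected_users

print(f"per user: {per_user_bps / 1e3:.1f} kbit/s")
print(f"in total: {total_bps / 1e6:.1f} Mbit/s")
```

The same arithmetic scales to any page category once its mean size is known from the fitted distribution.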
We also saw that when the measurements follow the Pareto distribution, the web site can show large variation in its content size: even if the majority of the measurements for one web site vary little among themselves, some others vary greatly, as in the Sport Top Stories example of Figure 22, where five out of seven days had an average content size close to 9,500 bytes while the other two had content sizes larger than 18,000 bytes. From that case we can conclude that when the Pareto distribution is followed we cannot obtain a reliable estimate of the average content size of the web page, so we cannot estimate the bandwidth needed to download it, and we may face longer delays and lower QoS.
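The Sport Top Stories case can be reconstructed numerically to show why the mean stops being a useful planning figure; the individual day values below are invented around the figures quoted above (five days near 9,500 bytes, two above 18,000):

```python
import statistics

# Invented daily content sizes echoing Figure 22's pattern.
week = [9400, 9450, 9500, 9550, 9600, 18200, 19000]
typical = week[:5]  # the five "ordinary" days

# The two heavy days drag the weekly mean far from any day actually observed
# and inflate the standard deviation, so a bandwidth estimate based on the
# mean would fit neither the typical days nor the heavy ones.
print("mean of all 7 days  :", round(statistics.mean(week)))
print("mean of 5 typical   :", round(statistics.mean(typical)))
print("stdev of all 7 days :", round(statistics.stdev(week)))
print("stdev of 5 typical  :", round(statistics.stdev(typical)))
```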
5 MEASUREMENTS AND ANALYSIS OF THE INTER-ARRIVAL TIME OF PACKETS OF A WEB PAGE RESPONSE
In this chapter we will focus on the measurements performed on the web pages in order to examine the inter-arrival time of the packets received in response to a web page request. When the user tries to access a web page, a packet is sent to the service provider asking for the contents of the page. After processing the user's request, the provider sends the contents of the page back to the user. These may not fit into a single packet for several reasons, such as a large size in bytes or fragmentation of the packet by the network. For that reason the user receives many packets for one particular web page request, and these are the packets we need to capture in order to observe their inter-arrival times. This can be done with the Wireshark program presented in Chapter 3; the requests were produced by the emulator, also presented in Chapter 3. From these measurements we will try to extract some useful results about the received packets and the possibility that they follow a mathematical distribution.
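The inter-arrival computation itself is a simple differencing of consecutive timestamps; a minimal sketch (the timestamps below are invented, whereas in practice they would come from a Wireshark capture of one page's packets):

```python
# Invented packet arrival timestamps, in seconds, for one web-page response.
arrivals = [0.000, 0.062, 0.131, 0.198, 0.262, 0.330]

# Inter-arrival time = difference between consecutive packet timestamps.
inter_arrival_ms = [(b - a) * 1000.0 for a, b in zip(arrivals, arrivals[1:])]

print("inter-arrival times (ms):", [round(t, 1) for t in inter_arrival_ms])
print("mean inter-arrival (ms) :",
      round(sum(inter_arrival_ms) / len(inter_arrival_ms), 1))
```

These millisecond gaps are the quantity whose PDF and CDF are fitted against candidate distributions in the sections below.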
5.1 Method of the Analysis of the BBC’s Web Site Inter-arrival time
In order to obtain correct results from the performed measurements, it is very important to carry out the analysis correctly; otherwise the results will be useless. The analysis cannot be done as in the previous chapter, where we gathered the measurements from the same subcategory and analysed them all together. For example, we cannot gather all the Education stories together and extract results from them; instead, we need to analyse and study every web page on its own. In that way we observe the inter-arrival times of one web page's packets separately from the packets of other, unrelated web pages. On the next pages we present the measurements that have been made and the results extracted from them.
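The per-page separation described above can be sketched by grouping captured packets before differencing; the (timestamp, stream) tuples below are invented, on the assumption that each page request maps roughly to its own TCP stream in the capture:

```python
from collections import defaultdict

# Invented capture rows: (timestamp in seconds, TCP stream id).
packets = [
    (0.000, 1), (0.020, 2), (0.065, 1), (0.051, 2),
    (0.130, 1), (0.080, 2), (0.199, 1),
]

# Group timestamps per stream so each page is analysed on its own.
by_stream = defaultdict(list)
for ts, stream in packets:
    by_stream[stream].append(ts)

for stream, times in sorted(by_stream.items()):
    times.sort()
    gaps = [round((b - a) * 1000, 1) for a, b in zip(times, times[1:])]
    print(f"stream {stream}: inter-arrival (ms) = {gaps}")
```

Mixing the two streams would interleave unrelated packets and distort every gap, which is exactly the error this chapter's method avoids.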
5.2 Inter-arrival Time Measurements of the Web Pages
As the measurements produced a very large number of graphs, we decided to present a representative sample from which useful results can be extracted. We will present graphs from all the days of a week, as the measurements were made over a one-week period.
5.2.1 Monday Measurements
We will start by presenting the Monday measurements for different web sites. The next graph presents the PDF and CDF of the collected data for the BBC Home web page on Monday.
Figure 34. PDF and CDF of Monday’s BBC Home page
Figure 35. Pareto CDF versus Empirical CDF of Monday’s BBC Home page
Figure 36. Normal PDF and CDF of Monday’s BBC Home page
From the graphs presented above we can observe that the measurements of the inter-arrival time of the packets of the BBC Home page response do not follow the Pareto distribution. This can be seen clearly in Figure 35, where the empirical CDF of the data follows a different curve from the Pareto CDF and shares only two points with it. But comparing the CDF of the Normal distribution with the CDF of Figure 34, which is the CDF of the collected data, we can see that the measurements tend to follow the Normal distribution rather than the Pareto distribution. This can also be observed from the PDF of the data, because the data tend to have a bell-shaped curve, just like the Normal PDF.
Now we will present the results for News Top Story 1 of the BBC News category.
Figure 37. PDF and CDF of Monday’s News Top Story 1
Figure 38. Pareto CDF versus Empirical CDF of Monday’s News Top Story 1
Figure 39. Normal PDF and CDF of Monday’s News Top Story 1
Figure 37 presents the PDF and CDF of the measurements performed on the News Top Story 1 web page for Monday, and Figure 38 compares the Pareto CDF with the Empirical CDF of the collected data. In this comparison the Pareto CDF is close to the empirical CDF of the data, but Figure 39 shows that the Normal distribution is closer to the PDF and CDF of the collected data and fits better than the Pareto distribution. So for this measurement the Normal distribution is more appropriate than the Pareto.
We will continue the analysis with the BBC’s News Top Story 3 web page.
Figure 40. PDF and CDF of Monday’s News Top Story 3
Figure 41. Pareto CDF versus Empirical CDF of Monday’s News Top Story 3
Figure 42. Normal PDF and CDF of Monday’s News Top Story 3
For this set of measurements we can also see that the Pareto distribution is not the most appropriate one to characterize the collected data, as Figure 41 shows: there are parts where the two curves follow different paths and have no common points. Figure 42 indicates that the Normal distribution is the more appropriate distribution to characterize the data, which can be confirmed by comparing the PDF and CDF of the collected data with those of the Normal distribution.
5.2.2 Tuesday Measurements
Starting the Tuesday measurements, we will examine the BBC News web page.
Figure 43. PDF and CDF of Tuesday’s News web page
Figure 44. Pareto CDF versus Empirical CDF of Tuesday’s News web page
Figure 45.
From the figure 43 we can see the PDF and CDF of the collected data for the News web
page. We have to mention that sometimes the PDF isn’t the appropriate method to compare
with other distribution and that’s b
try to compute their inter
be the same as the majority of the received packets are coming with time intervals so they
have the same probability of arrival with the others except when more than one packets are
received with small time intervals. The figure 44 shows the Pareto CDF and the Empiricla
CDF of the collected data and from that graph we can see that the Pareto is not the most
appropriate distribution that fits to the data but from the CDF of the Normal distribution in
figure 45 we can see that the Normal distribution is more appropriate and fits better to the
collected data.
0
0.005
0.01
0.015
0.02
0.025
0.03
0 20 40
density
msec
Nikolaos Draganoudis, MSc dissertation
- 46 -
Pareto CDF versus Empirical CDF of Tuesday’s News web page
Normal PDF and CDF of Tuesday’s News web page
From the figure 43 we can see the PDF and CDF of the collected data for the News web
page. We have to mention that sometimes the PDF isn’t the appropriate method to compare
with other distribution and that’s because taking measurements from received packets and
try to compute their inter-arrival time will have as a result the probability of each packet to
be the same as the majority of the received packets are coming with time intervals so they
obability of arrival with the others except when more than one packets are
received with small time intervals. The figure 44 shows the Pareto CDF and the Empiricla
CDF of the collected data and from that graph we can see that the Pareto is not the most
ropriate distribution that fits to the data but from the CDF of the Normal distribution in
figure 45 we can see that the Normal distribution is more appropriate and fits better to the
60 80
msec
0
0.2
0.4
0.6
0.8
1
0 20
density
Nikolaos Draganoudis, MSc dissertation
Pareto CDF versus Empirical CDF of Tuesday’s News web page
Normal PDF and CDF of Tuesday’s News web page
From the figure 43 we can see the PDF and CDF of the collected data for the News web
page. We have to mention that sometimes the PDF isn’t the appropriate method to compare
ecause taking measurements from received packets and
arrival time will have as a result the probability of each packet to
be the same as the majority of the received packets are coming with time intervals so they
obability of arrival with the others except when more than one packets are
received with small time intervals. The figure 44 shows the Pareto CDF and the Empiricla
CDF of the collected data and from that graph we can see that the Pareto is not the most
ropriate distribution that fits to the data but from the CDF of the Normal distribution in
figure 45 we can see that the Normal distribution is more appropriate and fits better to the
40 60 80
msec
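The Pareto-versus-Normal comparison above was performed in Matlab (Appendix 2); the procedure can be sketched as follows in Python using only the standard library. The inter-arrival samples, the moment-based Normal fit and the maximum-likelihood Pareto fit below are illustrative assumptions, not the dissertation's measured data:

```python
# Sketch of the distribution comparison described above. The original
# analysis used Matlab (Appendix 2); the samples here are illustrative.
import math

samples_msec = [17.8, 18.1, 22.4, 25.8, 26.8, 29.2, 31.0,
                32.8, 33.5, 37.2, 40.5, 45.2, 48.9, 56.8]

def empirical_cdf(samples):
    """Return sorted points and the fraction of samples <= each point."""
    xs = sorted(samples)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]

def normal_cdf(x, mu, sigma):
    """CDF of the Normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def pareto_cdf(x, xm, alpha):
    """CDF of the Pareto distribution: 1 - (xm/x)^alpha for x >= xm."""
    return 0.0 if x < xm else 1.0 - (xm / x) ** alpha

# Moment estimates for the Normal fit.
n = len(samples_msec)
mu = sum(samples_msec) / n
sigma = math.sqrt(sum((s - mu) ** 2 for s in samples_msec) / n)

# Simple Pareto fit: xm is the smallest observation, and alpha is the
# maximum-likelihood estimate alpha = n / sum(ln(x_i / xm)).
xm = min(samples_msec)
alpha = n / sum(math.log(s / xm) for s in samples_msec)

xs, ecdf = empirical_cdf(samples_msec)
d_normal = max(abs(e - normal_cdf(x, mu, sigma)) for x, e in zip(xs, ecdf))
d_pareto = max(abs(e - pareto_cdf(x, xm, alpha)) for x, e in zip(xs, ecdf))

print(f"max CDF distance  Normal: {d_normal:.3f}  Pareto: {d_pareto:.3f}")
```

The maximum vertical distance between the empirical and fitted CDFs (the Kolmogorov-Smirnov statistic) turns the visual comparison made in the figures into a single number: the smaller distance identifies the better-fitting model.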
  • 4. Nikolaos Draganoudis, MSc dissertation - iv - ABSTRACT In the last few years a big improvement has been observed in the field of telecommunications. As a result, mobile terminals have become faster, with bigger capacity and even smaller size than before. This created a great opportunity for the progress of the services that mobile phones can provide, so nowadays mobile phones are not only used for calls and text messages but also to browse a website, send an email, or even listen to music and record videos. For this dissertation an emulator of a mobile device connected to the Internet through a laptop will be used. Web browsing from this emulator will be performed on the BBC’s mobile web site, as the BBC web site is a big resource of information, frequently updated and well structured. Wireshark will also be used to capture the incoming packets for the emulator and to calculate the inter-arrival time among them. Obtaining the appropriate literature background in the field of the dissertation is important in order to understand and work in this field. Through this dissertation, experience will also be gained in planning, working with, and extracting useful results from a large amount of collected data. It will also be examined whether the sizes of the web pages follow any known mathematical distribution, which will help to characterize the traffic produced by a web page response. Furthermore, the inter-arrival time of the incoming packets of a web page response will be examined in order to see whether these packets follow a distribution; this can help us understand the QoS provided by the web service provider. The study and examination of the BBC’s web sites will give useful information about the traffic generated and the time consumed to download their contents, which could be used as a guideline for providing improved internet services with higher QoS. It will also be a useful tool for understanding mobile internet services and their impact on the network’s resources.
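The core computation behind these measurements — turning the packet arrival timestamps captured by Wireshark into inter-arrival times — is a difference of consecutive timestamps. A minimal sketch, with illustrative timestamps rather than an actual Wireshark export:

```python
# Computing packet inter-arrival times from capture timestamps.
# The arrival times below are illustrative; in the dissertation
# they come from a Wireshark capture of the emulator's traffic.
def inter_arrival_times(timestamps_sec):
    """Return the gaps (in msec) between consecutive packet arrivals."""
    ts = sorted(timestamps_sec)
    return [(b - a) * 1000.0 for a, b in zip(ts, ts[1:])]

arrivals = [0.000, 0.018, 0.043, 0.069, 0.101]  # seconds since capture start
gaps = inter_arrival_times(arrivals)
print([round(g, 1) for g in gaps])  # → [18.0, 25.0, 26.0, 32.0]
```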
  • 5. Nikolaos Draganoudis, MSc dissertation - v - TABLE OF CONTENTS Internet Traffic Measurement and Analysis ...............................................................i Nikolaos Draganoudis................................................................................................i Declaration of originality..........................................................................................ii Acknowledgement....................................................................................................iii Abstract ....................................................................................................................iv Table of Contents ......................................................................................................v List of Figures .........................................................................................................vii 1 Introduction ..........................................................................................................1 1.1 Background and Context...............................................................................1 1.2 Scope and Objectives ....................................................................................2 1.3 Achievements................................................................................................4 1.4 Overview of Dissertation ..............................................................................5 2 Literature Review.................................................................................................6 2.1 Introduction ...................................................................................................6 2.2 Introduction to Internet Protocol Stack ............................................6 2.3 IP protocol .................................................................................................9 2.4 TCP Protocol 
...............................................................................................11 2.5 World Wide Web .........................................................................................13 2.6 Mathematical Distributions for the Analysis...............................................14 2.6.1 Power Law and Pareto distributions.......................................................14 2.6.2 Normal Distribution................................................................................15 2.7 Summary .....................................................................................................16 3 Internet Traffic Measurements And Methodology .............................................17 3.1 Methodology of measurements ...................................................................17 3.1.1 Target of the measurements....................................................................17 3.1.2 Measurement tools..................................................................................18 3.2 Performed Measurements............................................................................22 3.3 Summary .....................................................................................................25 4 BBC’S Web Site Traffic Measurements And Analysis ......................................26 4.1 General Analysis of the BBC’s Web Site categories...................................26 4.2 Analysis of the BBC’s Web Site and Mathematical Distributions..............28
  • 6. Nikolaos Draganoudis, MSc dissertation - vi - 4.3 Conclusions.................................................................................................38 5 Measurements And Analysis of The Inter-arrival Time of Packets of a Web Page Response......................................................................................................................40 5.1 Method of the Analysis of the BBC’s Web Site Inter-arrival time..............40 5.2 Inter-arrival Time Measurements of the Web Pages ...................................41 5.2.1 Monday Measurements ..........................................................................41 5.2.2 Tuesday Measurements ..........................................................................45 5.2.3 Wednesday Measurements .....................................................................51 5.2.4 Thursday Measurements.........................................................................55 5.2.5 Friday Measurements .............................................................................59 5.2.6 Saturday Measurements..........................................................................63 5.2.7 Sunday Measurements............................................................................66 5.3 Measurements that fit to the Pareto distribution .........................................68 5.4 Conclusions.................................................................................................71 6 Conclusion..........................................................................................................73 6.1 Summary and Evaluation ............................................................................73 6.2 Future Work.................................................................................................74 References...............................................................................................................76 Appendix 1 - Work 
plan..........................................................................................78 Appendix 2 – Matlab Code .....................................................................................79 Appendix 3- Content in bytes of BBC’s web sites..................................................82 Appendix 4 – Inter-Arrival Time Measurements of bbc’s web sites ......................83
  • 7. Nikolaos Draganoudis, MSc dissertation - vii - LIST OF FIGURES Figure 1. Protocol stack ......................................................................................................6 Figure 2. Encapsulation of data as it goes down the protocol stack [1]..............................8 Figure 3. IP header fields [1]...............................................................................................9 Figure 4. TCP header fields [1].........................................................................................11 Figure 5. S60 3rd Edition emulator....................................................................................18 Figure 6. BBC website explore.........................................................................................19 Figure 7. Wireshark traffic presentation ...........................................................................21 Figure 8. BBC categories that measurements will be performed .....................................22 Figure 9. BBC News subcategories that measurements will be performed......................23 Figure 10. BBC Sport subcategories that measurements will be performed ..................24 Figure 11. Week traffic of BBC’s main categories .........................................................26 Figure 12. Average values of contents for BBC News and Sport subcategories ............27 Figure 13. PDF and CDF of Education Stories...............................................................29 Figure 14. Pareto CDF versus Empirical CDF of Education Stories..............................29 Figure 15. Normal PDF and CDF of Education Stories..................................................29 Figure 16. PDF and CDF of News Top Stories...............................................................30 Figure 17. Pareto CDF versus Empirical CDF of News Top Stories..............................30 Figure 18. 
Normal PDF and CDF of News Top Stories..................................................31 Figure 19. PDF and CDF of Politics Stories...................................................................31 Figure 20. Pareto CDF versus Empirical CDF of Politics Stories ..................................32 Figure 21. Normal PDF and CDF of Politics Stories......................................................32 Figure 22. PDF and CDF of Sport Top Stories ...............................................................33 Figure 23. Pareto CDF versus Empirical CDF of Sport Top Stories ..............................33 Figure 24. Normal PDF and CDF of Sport Top Stories ..................................................33 Figure 25. PDF and CDF of Tennis Stories.....................................................................34 Figure 26. Pareto CDF versus Empirical CDF of Tennis Stories....................................34 Figure 27. Normal PDF and CDF of Tennis Stories .......................................................35 Figure 28. PDF and CDF of Football Top Stories...........................................................35 Figure 29. Pareto CDF versus Empirical CDF of Football Top Stories..........................36 Figure 30. Normal PDF and CDF of Football Top Stories .............................................36 Figure 31. PDF and CDF of Championship Stories........................................................37
  • 8. Nikolaos Draganoudis, MSc dissertation - viii - Figure 32. Pareto CDF versus Empirical CDF of Championship Stories.......................37 Figure 33. Normal PDF and CDF of Championship Stories...........................................38 Figure 34. PDF and CDF of Monday’s BBC Home page...............................................41 Figure 35. Pareto CDF versus Empirical CDF of Monday’s BBC Home page..............41 Figure 36. Normal PDF and CDF of Monday’s BBC Home page..................................42 Figure 37. PDF and CDF of Monday’s News Top Story 1 .............................................42 Figure 38. Pareto CDF versus Empirical CDF of Monday’s News Top Story 1 ............43 Figure 39. Normal PDF and CDF of Monday’s News Top Story 1 ................................43 Figure 40. PDF and CDF of Monday’s News Top Story 3 .............................................44 Figure 41. Pareto CDF versus Empirical CDF of Monday’s News Top Story 3 ............44 Figure 42. Normal PDF and CDF of Monday’s News Top Story 3 ................................45 Figure 43. PDF and CDF of Tuesday’s News web page.................................................45 Figure 44. Pareto CDF versus Empirical CDF of Tuesday’s News web page................46 Figure 45. Normal PDF and CDF of Tuesday’s News web page....................................46 Figure 46. PDF and CDF of Tuesday’s News Top Story 2 .............................................47 Figure 47. Pareto CDF versus Empirical CDF of Tuesday’s News Top Story 2 ............47 Figure 48. Normal PDF and CDF of Tuesday’s News Top tory 2 ..................................47 Figure 49. PDF and CDF of Tuesday’s Business Story 2 ...............................................48 Figure 50. Pareto CDF versus Empirical CDF of Tuesday’s Business Story 2 ..............48 Figure 51. Normal PDF and CDF of Tuesday’s Business Story 2 .................................49 Figure 52. 
PDF and CDF of Tuesday’s Football Top Story 1 .........................................49 Figure 53. Pareto CDF versus Empirical CDF of Tuesday’s Football Top Story 1 ........50 Figure 54. Normal PDF and CDF of Tuesday’s Football Top Story 1............................50 Figure 55. PDF and CDF of Wednesday’s BBC Home web page ..................................51 Figure 56. Pareto CDF versus Empirical CDF of Wednesday’s BBC Home web page .51 Figure 57. Normal PDF and CDF of Wednesday’s BBC Home web page.....................52 Figure 58. PDF and CDF of Wednesday’s Technology Story 1......................................52 Figure 59. Pareto CDF versus Empirical CDF of Wednesday’s Technology Story 1.....53 Figure 60. Normal PDF and CDF of Wednesday’s Technology Story 1.........................53 Figure 61. PDF and CDF of Wednesday’s Tennis Story 1 ..............................................54 Figure 62. Pareto CDF versus Empirical CDF of Wednesday’s Tennis Story 1 .............54 Figure 63. Normal PDF and CDF of Wednesday’s Tennis Story 1.................................54 Figure 64. PDF and CDF of Thursday’s BBC Home web page......................................55
  • 9. Nikolaos Draganoudis, MSc dissertation - ix - Figure 65. Pareto CDF versus Empirical CDF of Thursday’s BBC Home web page.....55 Figure 66. Normal PDF and CDF of Thursday’s BBC Home web page ........................56 Figure 67. PDF and CDF of Thursday’s BBC News ......................................................56 Figure 68. Pareto CDF versus Empirical CDF of Thursday’s BBC News......................57 Figure 69. Normal PDF and CDF of Thursday’s BBC News .........................................57 Figure 70. PDF and CDF of Thursday’s Football Top Story 1 .......................................58 Figure 71. Pareto CDF versus Empirical CDF of Thursday’s Football Top Story 1.......58 Figure 72. Normal PDF and CDF of Thursday’s Football Top Story 1 ..........................58 Figure 73. PDF and CDF of Friday’s Business Story 2 ..................................................59 Figure 74. Pareto CDF versus Empirical CDF of Friday’s Business Story 2 .................59 Figure 75. Normal PDF and CDF of Friday’s Business Story 2.....................................60 Figure 76. PDF and CDF of Friday’s Formula Story 1...................................................60 Figure 77. Pareto CDF versus Empirical CDF of Friday’s Formula Story 1..................61 Figure 78. Normal PDF and CDF of Friday’s Formula Story 1......................................61 Figure 79. PDF and CDF of Friday’s BBC News web page...........................................62 Figure 80. Pareto CDF versus Empirical CDF of Friday’s BBC News web page..........62 Figure 81. Normal PDF and CDF of Friday’s BBC News web page..............................62 Figure 82. PDF and CDF of Saturday’s BBC Home web page ......................................63 Figure 83. Pareto CDF versus Empirical CDF of Saturday’s BBC Home web page .....63 Figure 84. Normal PDF and CDF of Saturday’s BBC Home web page .........................64 Figure 85. 
PDF and CDF of Saturday’s BBC Sport web page .......................................64 Figure 86. Pareto CDF versus Empirical CDF of Saturday’s BBC Sport web page.......65 Figure 87. Normal PDF and CDF of Saturday’s BBC Sport web page ..........................65 Figure 88. PDF and CDF of Sunday’s BBC Education Story 1......................................66 Figure 89. Pareto CDF versus Empirical CDF of Sunday’s BBC Education Story 1.....66 Figure 90. Normal PDF and CDF of Sunday’s BBC Education Story 1 ........................66 Figure 91. PDF and CDF of Sunday’s BBC Formula Story 1 ........................................67 Figure 92. Pareto CDF versus Empirical CDF of Sunday’s BBC Formula Story 1 .......67 Figure 93. Normal PDF and CDF of Sunday’s BBC Formula Story 1 ...........................68 Figure 94. PDF and CDF of Thursday’s Sport Top Story 1 ............................................68 Figure 95. Pareto CDF versus Empirical CDF of Thursday’s Sport Top Story 1 ...........69 Figure 96. Normal PDF and CDF of Thursday’s Sport Top Story 1...............................69 Figure 97. PDF and CDF of Monday’s News Top Story 1 .............................................70
  • 10. Nikolaos Draganoudis, MSc dissertation - x - Figure 98. Pareto CDF versus Empirical CDF of Monday’s News Top Story 1 ............70 Figure 99. Normal PDF and CDF of Monday’s News Top Story 1 ................................71
  • 11. Nikolaos Draganoudis, MSc dissertation - 1 - 1 INTRODUCTION 1.1 Background and Context The technology of information gathering, processing and distribution is the key technology of the 20th century. Until now we saw the development of worldwide telephone networks, the birth and still growing computer industry and also the development of the satellite communication [18]. According to the old concept of the computer systems all the work from different users can be processed by one big computer but nowadays this concept is totally abandoned and its place took the “computer network” where many autonomous computers interconnected to each other can process the incoming work. The interconnected computers can exchange information through copper wire, fibre optics, microwaves or satellites. The information is exchanged through small units of data called packets. These networks of computers can have many different forms, sizes and shapes like wireless networks and wide area networks [1] [3]. At the first stages of the development of the Internet at the early 1980s, it was a single network and its predecessor is the ARPANET (Advanced Research Projects Agency Network), developed by the United States Department of Defence. Now Internet consists of thousands of different networks that are connected to each other and every single of them provide common services to the customers and follow common protocols. These different networks are controlled by the ISPs (Internet Service Providers) and are responsible to provide connectivity to the Internet to their customers. The Internet can have interconnected ISPs of different sizes, forming a hierarchical interconnected structure. 
The most common ISPs are the transport providers, which deal with the provision of a wide range of services to customers; there are also the backbone providers, which are connected to many other ISPs and carry the traffic that customers produce, and the web hosting providers, which host web pages for their customers. The relationships between the different ISPs are essentially business relationships, related to the quality and the type of service provided to the customer. Nowadays the Internet is not only
used to communicate with people all over the world; it is also widely used to do business, by providing many different services. Organizations, small and large businesses, consumers and even individuals now see the Internet from a different perspective and prefer to conduct their business through it. All these growing expectations create the need for the Internet to become more and more reliable [1] [3] [4]. For that reason many academic researchers, companies and other groups have focused their attention on the traffic that customers generate. They have carried out more and more measurements in order to examine this traffic and come up with useful results that help to improve the network and the management of Internet traffic performance [18]. From such measurements we can see how the network responds and behaves under any upgrade or degradation of performance [2].

As mentioned in the previous paragraph, measurements are very important for understanding the demands of a service, so application-level measurements will be performed in this dissertation, trying to understand the service's use of the network, the demands of the service, and the effects that the service has on the network and its performance.

1.2 Scope and Objectives

The scope of this dissertation is Internet traffic measurement and analysis. As mentioned in the previous paragraphs, application-level measurements will be made. The advantage of application measurements is that they provide an overall view of the application's performance, which would not be so clear if the measurements had been made at lower levels. More specifically, web browsing measurements will be performed by downloading web page contents.

In the last few years, great improvement has been observed in the field of telecommunications.
As a result of this improvement, mobile terminals have become faster, with larger capacity and even smaller size than before. This has been a great opportunity for the progress of the services provided by mobile phones. Nowadays mobile
phones are not only used for calls and text messages; we can also use them to browse a website, send an email, or even listen to music and record video. Consequently, a client no longer gains access to the Internet only through an Internet service provider over a dial-up telephone line. On the contrary, every user can access the Internet through a mobile phone, laptop or palmtop PC. As a result, Internet access has become far more flexible [2] [3] [16]. For this reason the measurements performed here concern the traffic that a mobile phone can produce, and to this end we will use an emulator installed on a laptop connected to the Internet through the campus network of the University of Surrey. The platform will emulate a mobile phone, and through this platform we will be able to browse the web pages.

For the measurements we needed a service provider with a well-defined web page structure and rich, up-to-date content. We therefore decided to browse the web pages of the mobile edition of the BBC, a provider that has these characteristics.

One objective of the dissertation is to understand and gain knowledge of the way the Internet works and responds to requests. This is most easily achieved through the measurement procedure, as every packet that the mobile phone sends and receives will be captured and analysed. The way the measurement results are stored and organised is another important consideration, as it may affect the conclusions that can be extracted. It is also important to mention another objective of this dissertation: to measure the time consumed from the request for a web page until the end of the responses to that particular request, and also the inter-arrival times between the packets of the same response.
Furthermore, in this dissertation we will measure the total size of the web pages in bytes. The main objective is to examine the measurement results, both for the inter-arrival times and for the sizes of the web pages, and through their analysis to see whether the results fit a mathematical distribution. During the analysis of the total content size of the web pages we will also have the chance to see how the size of the data changes over a week-long period.
1.3 Achievements

The first step in tackling the dissertation was to gain the appropriate background in order to become familiar with the topic and understand the requirements. It was also very important to carry out a literature review of other work in the field of Internet traffic measurement and analysis. Through the literature review, the type of measurements to be performed became clear, and the final decision about the web sites to measure was taken. It was also decided that the S60 3rd Edition SDK emulator and the Wireshark network packet analyser would be used for the measurements.

After this first step, registration with Nokia was completed in order to obtain the rights to use the emulator, and the emulator was then installed on the laptop. Familiarisation with the emulator and the options it provides followed, and some trial measurements of the BBC's mobile web sites were performed. At the same time, the incoming and outgoing packets were captured with Wireshark and then examined. After exploring the structure of the BBC's web site, systematic measurements were performed and the results stored for further analysis, as we will see in the following steps of the dissertation.

The next achievement is the analysis of the data collected from the measurements. The results are used to look for a pattern that allows them to be categorised and fitted to a known mathematical distribution. The changes in the contents of the pages over a week-long period were also examined and will be presented. Finally, the results are presented together with the mathematical distribution that best fits the measured data and represents the analysis of the BBC's mobile web page contents.
1.4 Overview of Dissertation

The next chapter of the dissertation presents the literature review and the background concerning the Internet protocol stack and the protocols used to transmit and receive data over the Internet. The World Wide Web is also presented. Finally, the mathematical distributions that will be used to characterise the collected data are explained and presented.

The third chapter introduces the reader to the technical part of the dissertation. First the tools that will be used for the measurements are presented, and then the target on which the measurements will be performed. The structure of the target web site is presented, along with the specific routes along which measurements will be made.

The fourth chapter contains the measurements performed on the BBC's web site, more specifically the measurements concerning the total size in bytes of the web pages. An analysis of the collected data is then performed and the mathematical distribution that best fits the measurements is chosen.

The fifth chapter contains the measurements performed on the BBC's web pages concerning the inter-arrival times between the received packets of a web page response. Through the analysis in this chapter the mathematical distribution that best fits the measurements is chosen.

The sixth chapter contains the conclusions and the evaluation of this dissertation, together with the future work that could be done in this field. There are also four appendices at the end of the dissertation.
The first appendix presents the work plan that was followed during the year; the second presents the Matlab code that was used to analyse the measurements; the third presents the table with the measurements of the total content size of the BBC's web pages; and the fourth presents the tables with the inter-arrival time measurements of the BBC's web pages.
2 LITERATURE REVIEW

2.1 Introduction

This chapter presents and reviews some issues related to the project. This is done to help the reader understand concepts in the field of the dissertation topic and gain the appropriate knowledge. First the Internet protocol stack is presented, with a brief summary of every layer and its purpose. After that, IP (Internet Protocol), TCP (Transmission Control Protocol) and the well-known WWW (World Wide Web) are presented. Finally, we present the mathematical methods that will be used in this dissertation to analyse the data.

2.2 Introduction to the Internet Protocol Stack

As mentioned before, in order to understand the way the Internet works we have to examine the protocol stack that is used to send or receive a packet over the Internet. First we briefly present the Internet protocol architecture and see how it is organised in layers. After that we examine more closely the protocols that are used, namely IP (Internet Protocol) and TCP (Transmission Control Protocol), and we also see how the WWW (World Wide Web) works, as this is important background for this dissertation. The figure below shows the protocol stack of the Internet.

Figure 1. Protocol stack
The lowest layer of the protocol stack is the Physical Layer, whose main function is the transmission of bits. For mobile phones the channel over which the bits are transmitted is the air, but for our measurements, which will be made from a laptop connected to the Internet, the channel will be copper wire.

The layer above the Physical Layer is the Data Link Layer, whose main purpose is to maintain reliable and efficient communication between two adjacent machines at this layer. One of the most important elements of this layer is the MAC (Medium Access Control) address: every computer that connects to the Internet has one, and it is unique worldwide. As this layer is not important for this dissertation, we will not examine it further.

The next layer above the Data Link Layer is the Network Layer, whose main operation is to transmit packets from the source to the destination. In contrast with the Data Link Layer, which is concerned only with the transmission of a packet from one end of a link to the other, this layer deals with end-to-end transmission. As the function of this layer is very important, in the following pages we will examine its functionality and its protocols in more detail.

Above the Network Layer is the Transport Layer. The function of this layer is also very important, as it is responsible for providing reliable and cost-effective data transport from the source to the destination. It also communicates with the Application Layer, receiving requests and delivering data packets. In the following pages we will examine in detail the protocol used to send and receive the data packets.

Finally, on top of all the others is the Application Layer.
It is responsible for the communication of the various applications with the protocols below it. For this layer too we will later examine the main protocol used for browsing the Internet [1] [2] [7].

Now that all the layers have been presented, we will try to understand the way they communicate with each other and the data they exchange. Starting from the
Application Layer: it produces data streams that mainly result from the user's requests. The Transport Layer takes these data streams and fragments them into datagrams. The maximum size of each datagram is up to 64 Kbytes, but in practice the length of each datagram does not exceed 1460 bytes, so that it fits in an Ethernet packet together with the IP and TCP headers that we will see later. Each datagram then goes to the Network Layer, where the IP protocol is used; a connectionless approach is taken, so every packet can follow a different path to the destination. After that the Data Link Layer follows, and finally the Physical Layer, where the bits are transmitted over the channel. Below we can see an example of the format of an IP packet with the header of each layer, from the Application Layer down to the Data Link header [1] [7].

Figure 2. Encapsulation of data as it goes down the protocol stack [1]
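As a rough numerical illustration of this encapsulation, the short Python sketch below (the 40 kB page size is an invented example) splits an application-layer byte stream into 1460-byte chunks, the largest amount of data that fits in a 1500-byte Ethernet payload once the 20-byte IP and 20-byte TCP headers are accounted for:

```python
# Sketch: fragmenting an application-layer data stream into chunks that
# fit an Ethernet frame, as described above. Header sizes follow the
# text: a 20-byte IP header plus a 20-byte TCP header inside a
# 1500-byte Ethernet payload leave 1460 bytes of data per packet.

ETHERNET_PAYLOAD = 1500
IP_HEADER = 20
TCP_HEADER = 20
MSS = ETHERNET_PAYLOAD - IP_HEADER - TCP_HEADER  # 1460 bytes of data

def fragment(stream: bytes) -> list:
    """Split a data stream into MSS-sized pieces, as the Transport
    Layer does before handing them down to the Network Layer."""
    return [stream[i:i + MSS] for i in range(0, len(stream), MSS)]

page = b"x" * 40000          # a hypothetical 40 kB web page
pieces = fragment(page)
print(MSS)                   # 1460
print(len(pieces))           # 28 pieces (27 full plus one partial)
print(len(pieces[-1]))       # 580 bytes left over in the last piece
```

A 40000-byte page therefore needs 28 packets: 27 carrying the full 1460 bytes and a final one carrying the remaining 580 bytes.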
2.3 IP Protocol

In this section we examine the protocol mainly used in the Network Layer: IP (Internet Protocol). There are currently two versions, IPv4 and IPv6; the newer IPv6 provides a wider range of IP addresses and a less complex header than IPv4. We will examine IPv4, as it is currently used far more than IPv6. As can be observed in the picture below, the IP header has a 20-byte fixed part and a variable-length optional part.

Figure 3. IP header fields [1]

We now briefly explain the fields of the protocol and their usage.

Version: contains the version of the IP protocol in use; this is necessary for communication between two machines that use different versions of IP.

IHL: gives the actual length of the packet header, since the Options field has variable length.

Type of service: mainly used to distinguish between different classes of service; for voice, for example, we need fast and accurate delivery of the packets.
Total length: the total length of the packet, including the header and the data.

Identification: lets the receiver determine which datagram a received fragment belongs to.

DF bit (Don't Fragment): when this bit is set to 1, the datagram cannot be fragmented by routers in the network.

MF bit (More Fragments): indicates that more fragments of the same datagram are expected.

Fragment offset: indicates the position of the fragment's data within the datagram.

Time to live: the maximum number of hops that a datagram may make before reaching its final destination. This number is decremented by one every time a router forwards the datagram, and when it hits zero the datagram is dropped by the network.

Protocol: indicates the upper-layer protocol to which IP should deliver the packet.

Header checksum: verifies that the header (only) contains no errors.

Source address: the IP address of the sender of the packet.

Destination address: the IP address of the receiver.

Options: this field can be used to add further functions to the protocol, such as security, timestamping and source routing [1] [2] [7].
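The layout of these fields can be made concrete with a short Python sketch that decodes the fixed 20-byte IPv4 header from raw bytes. The header bytes and the addresses below are invented purely for illustration:

```python
import struct
import socket

# Sketch: decoding the fixed 20-byte IPv4 header fields described above.
def parse_ipv4_header(raw: bytes) -> dict:
    (ver_ihl, tos, total_len, ident,
     flags_frag, ttl, proto, checksum) = struct.unpack("!BBHHHBBH", raw[:12])
    src, dst = raw[12:16], raw[16:20]
    return {
        "version": ver_ihl >> 4,        # IP version (4 for IPv4)
        "ihl": ver_ihl & 0x0F,          # header length in 32-bit words
        "total_length": total_len,      # header + data, in bytes
        "ttl": ttl,                     # remaining hop count
        "protocol": proto,              # upper-layer protocol (6 = TCP)
        "src": socket.inet_ntoa(src),
        "dst": socket.inet_ntoa(dst),
    }

# A hand-built header: version 4, IHL 5 words, total length 1500 bytes,
# TTL 64, protocol TCP, with two invented example addresses.
hdr = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 1500, 1, 0, 64, 6, 0,
                  socket.inet_aton("192.168.0.2"),
                  socket.inet_aton("212.58.244.69"))
info = parse_ipv4_header(hdr)
print(info["version"], info["ihl"], info["total_length"], info["ttl"])
# 4 5 1500 64
```

Note how version and IHL share one byte, which is why the first field is split with shifts and masks.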
2.4 TCP Protocol

Moving to the Transport Layer, TCP (Transmission Control Protocol) and UDP (User Datagram Protocol) are the two main protocols in use. We will focus on TCP, as the World Wide Web runs with the HTTP protocol at the Application Layer and the TCP protocol at the Transport Layer. The picture below shows the TCP header.

Figure 4. TCP header fields [1]

We now briefly explain the fields of the protocol and their usage.

Source port and Destination port: identify where the packets should be delivered in the upper layer, i.e. the application to which the packet belongs.

Sequence number and Acknowledgement number: used to ensure that all packets are transmitted safely, without loss.

TCP header length: stores the number of 32-bit words that the TCP
header contains. Since the header length is variable, this field makes clear where the header ends and the data begin.

URG (Urgent) flag: set when the urgent pointer is in use, indicating that the packet carries urgent data.

ACK flag: set when acknowledging a received packet. When the ACK flag is 0, the Acknowledgement number field is ignored.

PSH (Push) flag: when set to 1, the data are delivered to the Application Layer on arrival and are not buffered at this stage.

RST flag: used to reset a connection or reject an invalid segment.

SYN flag: used to establish a connection between two entities.

FIN flag: used to release a connection between two entities.

Window size: the number of bytes that the receiver is willing to accept from the transmitter.

Checksum: a checksum over the header and data, for extra reliability.

Urgent pointer: the byte offset at which the urgent data are located.

Options: used for extra options not provided in the regular header [1] [2] [7].
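As with the IP header, the TCP fields can be illustrated by decoding the fixed 20-byte header from raw bytes. The sample segment below is hand-built for the example (a SYN from an invented ephemeral port to port 80, the HTTP port):

```python
import struct

# Sketch: decoding the fixed 20-byte TCP header fields described above.
def parse_tcp_header(raw: bytes) -> dict:
    (src_port, dst_port, seq, ack,
     offset_flags, window, checksum, urgent) = struct.unpack("!HHIIHHHH",
                                                             raw[:20])
    flags = offset_flags & 0x3F          # low six bits: URG..FIN
    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "seq": seq,
        "ack": ack,
        "header_len": (offset_flags >> 12) * 4,  # data offset, in bytes
        "SYN": bool(flags & 0x02),
        "ACK": bool(flags & 0x10),
        "FIN": bool(flags & 0x01),
        "window": window,
    }

# A hand-built SYN segment: data offset 5 words, only the SYN bit set.
hdr = struct.pack("!HHIIHHHH", 49152, 80, 1000, 0,
                  (5 << 12) | 0x02, 65535, 0, 0)
info = parse_tcp_header(hdr)
print(info["dst_port"], info["header_len"], info["SYN"], info["ACK"])
# 80 20 True False
```

Because the ACK flag is clear, a real receiver would ignore the acknowledgement number field of this segment, as described above.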
2.5 World Wide Web

In the Application Layer, at the top of all the layers we have examined, and since this dissertation focuses on the browsing of Internet sites, we will concentrate on the World Wide Web, which uses the HTTP protocol.

One of the most important services of the Internet is the World Wide Web, which began in 1989 at CERN, a European centre for nuclear research. It became very popular all over the world, as it is friendly for beginners and its interface is well designed. At the beginning it was designed so that the scientists of CERN could share their research and exchange ideas, as many of them were working in different countries; the World Wide Web (WWW) then grew beyond these needs and came to be used by the entire world. This happened when CERN and M.I.T. signed an agreement setting up the World Wide Web Consortium. This organisation was responsible for developing the World Wide Web by standardising protocols and encouraging interoperability between the companies that had developed browsers at the time, Netscape and Microsoft [1] [3].

The World Wide Web consists of a huge number of documents, also called web pages, distributed across the world. Every web page may contain several links to other pages around the world. In this way a complicated web of connections between the pages is formed, through which every user can reach them. As it would be very difficult to keep track of the path followed to reach a page, the World Wide Web uses the URL (Uniform Resource Locator), a unique identifier for each web page. Thus every user can simply remember the URL of a web site in order to access it. As the World Wide Web grew faster and faster, applications that helped the users were developed: the web browsers. With web browsers it was easier to browse different sites and keep a record of the URLs of pages that you might want to visit again.
Browsers made the World Wide Web easy to use and attracted even more users [1] [2] [3].
2.6 Mathematical Distributions for the Analysis

The analysis of the service measurements is the process in which the collected data are examined to determine whether they can be modelled by a known distribution or model. Such a model can then be used to characterise related phenomena or data of the same type. After examining several mathematical distributions, we will focus on the power law, the Pareto and the normal distribution for the scope of this dissertation. By the end of the dissertation the reason for this selection will be clear.

2.6.1 Power Law and Pareto Distributions

In recent years, a significant amount of research has focused on showing that many physical and social phenomena follow a power-law distribution. Examples of such phenomena are the World Wide Web [9], metabolic networks, Internet router connections, journal paper reference networks, and sexual contact networks [8]. There is sometimes confusion between the power law and the Pareto distribution, which we will clear up in the following paragraphs [8] [9].

We will explain both the Pareto and the power law through an example of Lada A. Adamic [9] and try to make their similarities and differences clear. Taking the distribution of income as an example: in the Pareto formulation, instead of asking what the r-th largest income is, we ask how many people have an income greater than x [14]. This gives the equation P[X > x] ≈ x^-k. For this reason we can say that Pareto's law is given in terms of the cumulative distribution function (CDF): the number of events larger than x is an inverse power of x. What we call the power-law distribution tells us not how many people have an income greater than x, but how many people have an income of exactly x. It is therefore the probability density function (PDF) associated with the CDF given by Pareto's law.
From this we obtain P[X = x] ≈ x^-(k+1) = x^-a, where k is the Pareto distribution shape parameter and a = k + 1 is the exponent of the power law [9] [13].
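The relationship between the Pareto tail and its exponent can be checked numerically. The Python sketch below (an illustration only; the shape value 1.5 is an arbitrary choice) draws samples from a Pareto distribution with scale 1 and confirms that the empirical tail frequency P[X > x] tracks x raised to minus the shape parameter:

```python
import random

# For a Pareto variable with scale 1 and shape s, P[X > x] = x**(-s),
# so the fraction of samples exceeding x should track x**(-s).
random.seed(42)
shape = 1.5
samples = [random.paretovariate(shape) for _ in range(100000)]

def tail_freq(xs, x):
    """Empirical P[X > x]: fraction of samples larger than x."""
    return sum(1 for v in xs if v > x) / len(xs)

# Compare the empirical tail with the theoretical x**(-shape).
for x in (2.0, 4.0, 8.0):
    print(x, round(tail_freq(samples, x), 4), round(x ** -shape, 4))
```

On a log-log plot the two columns would fall on the same straight line of slope equal to minus the shape parameter, which is exactly the signature used to recognise Pareto-like data.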
Now we explain how we will work with these distributions and try to fit the collected data. In order to compare the data with Pareto's law we have to find the CDF of the data and the CDF of a Pareto distribution whose parameters approach the curve of the data's CDF. From theory, the Pareto CDF is given by F(x) = 1 - (b/x)^a for x > b, and 0 for x ≤ b. Here 'a' is the shape parameter and 'b' is the scale parameter. As we saw before, the Pareto shape parameter is one less than the power-law exponent, so a = k - 1, where 'k' now denotes the power-law exponent. In the related literature, the 'b' parameter commonly takes the smallest value of the data under examination. To find the value of the parameter 'a' we will use a program by Aaron Clauset, Cosma Rohilla Shalizi and M. E. J. Newman that estimates the value of the power-law exponent 'k' that best fits the supplied data. The program estimates 'k' for each possible minimum value of the input data x via the method of maximum likelihood and calculates the Kolmogorov-Smirnov goodness-of-fit statistic; it then selects the x with the minimum Kolmogorov-Smirnov statistic and reports the corresponding value of 'k'. The Kolmogorov-Smirnov method is generally used when the sample size of each test is small, as in our case, where we have from 3 to 15 values per test. The KS test is based on the value K = sup_x |F*(x) - S(x)|, where F*(x) is the hypothesised cumulative distribution function and S(x) is the empirical distribution function based on the sampled data [8] [6] [13] [17]. After obtaining the best value of 'k' for the data, we can find the 'a' parameter as a = k - 1. The next step is to compare the CDF of the data with the CDF of the Pareto distribution and see whether the data follow the Pareto distribution closely enough to be fitted by it.
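The fitting itself was carried out with the Matlab program mentioned above; the Python fragment below is only a simplified stand-in showing the two ingredients for a fixed scale b: the maximum-likelihood estimate of the Pareto shape, a = n / Σ ln(x_i / b), and the Kolmogorov-Smirnov distance between the empirical CDF and the fitted Pareto CDF F(x) = 1 - (b/x)^a:

```python
import math
import random

def fit_pareto_shape(data, b):
    """Maximum-likelihood estimate of the Pareto shape parameter 'a'
    for samples above a known scale 'b'."""
    return len(data) / sum(math.log(x / b) for x in data)

def ks_statistic(data, a, b):
    """Largest gap between the empirical CDF and the fitted
    Pareto CDF F(x) = 1 - (b / x)**a."""
    xs = sorted(data)
    n = len(xs)
    return max(abs((i + 1) / n - (1 - (b / x) ** a))
               for i, x in enumerate(xs))

# Synthetic check: generate Pareto data with a known shape and verify
# that the estimate recovers it and that the KS distance stays small.
random.seed(1)
b, true_shape = 1.0, 2.5
data = [b * random.paretovariate(true_shape) for _ in range(5000)]
a_hat = fit_pareto_shape(data, b)
print(round(a_hat, 2))                         # close to 2.5
print(round(ks_statistic(data, a_hat, b), 3))  # small for a good fit
```

The Clauset-Shalizi-Newman program additionally repeats this estimate for every candidate minimum value and keeps the one with the smallest KS statistic; the sketch above fixes b in advance for simplicity.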
Appendix 2 presents the code of the Matlab program and the functions that I wrote in order to take the output of the main program and present the results and the Pareto distribution.

2.6.2 Normal Distribution

We now present the normal distribution and the parameters needed to specify it. It is important to mention that all normal distributions are symmetric, with bell-shaped density curves and a single peak. The parameters that
characterise this distribution are the mean value of the data, μ, and the standard deviation, σ, which is a measure of the dispersion of the data. The probability density function of the normal distribution is given by f(x) = (1 / (σ√(2π))) exp(-(x - μ)² / (2σ²)). From the probability density function of the measured data we will see whether they fit this distribution [10].

2.7 Summary

In this chapter we presented the background necessary for this dissertation. We looked at the Internet protocol stack, with a short introduction to each layer, and then presented IP (Internet Protocol), TCP (Transmission Control Protocol) and the WWW (World Wide Web). Finally, we presented the mathematical distributions that will be used to analyse the measurements. These topics are very important for becoming familiar with the dissertation theme and understanding what follows: first we have to know what we are going to measure, and then make the measurements and analyse them. The next chapter covers the methodology of the measurements, the target of the measurements and the tools to be used.
3 INTERNET TRAFFIC MEASUREMENTS AND METHODOLOGY

In the previous chapter we saw useful terms related to the theoretical part of the dissertation topic, providing the appropriate background. In this chapter we examine the technical part of the dissertation, concerning the Internet traffic measurements.

3.1 Methodology of Measurements

3.1.1 Target of the Measurements

The first step in starting the measurements is to determine the target on which they will be made. The selection of the site is very important, as we want the data analysis to give useful, meaningful results. The final choice of site for the measurements is the mobile edition of the BBC's website. The BBC is the British Broadcasting Corporation, with worldwide recognition and acceptance [15]. The BBC's mobile edition website tries to satisfy the user's requirement of being able to follow the news while away from home. People can easily browse the mobile website on their mobile phone or PDA and access BBC News, BBC Sport and other categories. The BBC mobile website thus offers frequently updated, useful information to many people, which makes it a very appropriate target for our measurements.
3.1.2 Measurement Tools

3.1.2.1 S60 3rd Edition Emulator

It would be difficult to perform the measurements on a real mobile phone, as the operator would charge for the Internet browsing and it would also be difficult to process and store the measurement data, so it was decided to use a mobile phone emulator. The S60 3rd Edition SDK for C++ platform was chosen as the emulator for browsing the BBC's mobile websites. After registration with Nokia, the platform was ready to be installed on the laptop. Then, through the laptop's Internet connection, we could access the BBC's website without being charged. Among the many services this emulator provides is the browser application, which we will mainly use for accessing the contents of the BBC's website. It supports features such as HTML 4.01, XHTML, JavaScript 1.5, plug-in support and file upload over HTTP [11]. Below we can see the form of the emulator, with its input keys and menu icons.

Figure 5. S60 3rd Edition emulator
The emulator is friendly in use, and familiarity with its options can be gained quickly. To access the browser of the emulator, the Services icon must be pressed. After that, as we can see in figure 8, we have to type the address that we want to browse. We will now access the BBC's web page and explain the usage of the emulator's diagnostic tool.

Figure 6. BBC website explore
As we can see in the figure above, the emulator provides a diagnostic tool that reports information about the traffic generated, the total size of every web page visited, and the types of the incoming files (such as text, photographs or videos), in the form of requests and responses. This will help us to fulfil the part of the measurements concerned with the size of the web pages.
3.1.2.2 Wireshark

As one of the main goals of this dissertation is to capture the inter-arrival times of the incoming packets for a web page request, a tool is needed that allows us to measure these time periods. This tool is Wireshark, a network packet analyser. In general, Wireshark can be used by:

- network administrators, to troubleshoot network problems
- network security engineers, to examine security problems
- developers, to debug protocol implementations
- people who want to learn network protocol internals [12]

For this dissertation, Wireshark will be used to capture every packet that comes from the Internet and also every request of ours that goes out to the Internet. From the picture below we can see that the time at which every packet was sent or received is captured, so we are able to calculate the inter-arrival times of the received packets, up to the end of the packets received in the server's response.

Figure 7. Wireshark traffic presentation
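Once the packets are captured, the inter-arrival computation itself is simple subtraction of consecutive capture timestamps. A minimal Python sketch (the timestamps below are invented; in practice they come from the Wireshark capture, in seconds since the start of the capture):

```python
# Sketch: computing packet inter-arrival times from the capture
# timestamps that Wireshark records for each packet.

timestamps = [0.000, 0.182, 0.271, 0.305, 0.512, 0.518]

def inter_arrival(times):
    """Differences between consecutive packet arrival times."""
    return [round(t2 - t1, 3) for t1, t2 in zip(times, times[1:])]

gaps = inter_arrival(timestamps)
print(gaps)       # [0.182, 0.089, 0.034, 0.207, 0.006]
print(sum(gaps))  # total time from the first to the last packet
```

The sum of the gaps gives the elapsed time from the first to the last packet of the response, which is the per-request response time measured in this dissertation.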
3.2 Performed Measurements

Before starting the measurements, it is important to make clear that the sample should be large enough and representative, and that the measurements have to be repeated many times. After gaining experience from the preparatory phase, it was decided to make the measurements on the two main categories of the BBC mobile web page, BBC News and BBC Sport. These categories are frequently updated and contain many subcategories. We will focus on the most important subcategories of these two categories. The next tree diagrams present the web pages chosen for measurement and their containment relationships.

Figure 8. BBC categories on which measurements will be performed
  • 33. Nikolaos Draganoudis, MSc dissertation - 23 - Figure 9. BBC News subcategories on which measurements will be performed As we can see from the above graph, we will focus the measurements on six subcategories of the BBC News category (Top Stories, Technology, Politics, Entertainment, Business and Education) and then on three stories of every subcategory. These categories are assumed to concentrate most user preferences.
  • 34. Nikolaos Draganoudis, MSc dissertation - 24 - Figure 10. BBC Sport subcategories on which measurements will be performed In the case of BBC Sport we will focus on four subcategories (Top Stories, Motorsport, Football and Tennis), but here two of them contain further subcategories: Formula 1 and World Rally are contained in the Motorsport subcategory, while Top Stories, Premiership and Championship are contained in the Football subcategory. Measurements will be performed for the displayed stories.
  • 35. Nikolaos Draganoudis, MSc dissertation - 25 - Before starting the measurements it is important to mention the frequency of taking measurements from these web sites. For better results this frequency should match the frequency at which the information on the pages is updated. After observing the BBC's web site, it would not be wise to take more than one measurement per web page per day, because changes to the contents are rare. This is reasonable, because it is both pointless and difficult to change the contents at such small time intervals. So we decided to collect measurements for a week, at intervals of one day. 3.3 Summary In this chapter we presented the methodology that we are going to follow for the measurements, specifying the target of the measurements and the reasons for this choice. We also presented the tools that are going to be used for the measurements, namely the Nokia S60 Emulator and Wireshark. After the presentation of the tools and the way they are going to be used, we presented the specific web pages, from the entire structure of the BBC's mobile web site, on which the measurements are going to be performed. In the following chapter we will perform the measurements of the total content size of the web pages and the analysis of the collected data.
  • 36. Nikolaos Draganoudis, MSc dissertation - 26 - 4 BBC’S WEB SITE TRAFFIC MEASUREMENTS AND ANALYSIS In this chapter we will analyse and present the data collected from the measurements that were performed, in order to obtain a view of the BBC web site's traffic profile. For these measurements the emulator that was presented in Chapter 3 was used, together with the diagnostic tool contained in the emulator, in order to get the total size of the web pages. 4.1 General Analysis of the BBC’s Web Site Categories In the following graph we can see the minimum, maximum and average value in bytes of the main page of the BBC web site and of the two main categories, BBC News and BBC Sport. Figure 11. Week traffic of BBC’s main categories (panels: BBC Home, BBC News and BBC Sport content in bytes over a week)
  • 37. Nikolaos Draganoudis, MSc dissertation - 27 - From the above graph we can see that News and Sport generate almost the same amount of traffic, and both of them generate lower traffic compared to the BBC Home web page. This is logical, as the first page contains a bigger amount of information than its subcategories. In the following graph we present the average values in bytes of the subcategories of BBC News and Sport. Figure 12. Average values of contents for BBC News and Sport subcategories (panels: BBC News and BBC Sport subcategory content in bytes over a week)
  • 38. Nikolaos Draganoudis, MSc dissertation - 28 - From the above figure we can observe that in the News category almost all subcategories have an average value between 3000 and 3250 bytes, except Top Stories with 3500 bytes and Business with 3450 bytes. On the other hand, there is a big difference between the Football subcategory of Sport and the other subcategories, whose average value is around 2900 bytes. This can be explained by the fact that football is more popular than the other sports, so the BBC pays more attention to it and provides more information in this field. 4.2 Analysis of the BBC’s Web Site and Mathematical Distributions In this part of the dissertation we will analyse in depth the results of the measurements of the content size in bytes of the BBC’s web pages, and we will examine whether the data could follow one of the mathematical distributions presented in 2.6.1 and 2.6.2. In the first steps of the analysis we tried to see whether the measurements could fit the Pareto distribution. To do that, we constructed the PDF (Probability Density Function) and the CDF (Cumulative Distribution Function) of the collected data; then, through the method described in 2.6.1, we estimated the parameters of the Pareto distribution, plotted the Pareto CDF and compared it with the empirical CDF of the collected data. We also compared the data with the Normal distribution, to see which distribution fits the data better. In the following pages we will present these results and assess the acceptance of the Pareto and the Normal distribution. From the total set of graphs that were produced, we can say that the Pareto distribution does not fit well with the collected data on the content size in bytes of the BBC’s web pages.
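The kind of fit check described above can be sketched in a few lines. This is a standard maximum-likelihood estimate for the Pareto parameters and a simple CDF comparison; it is illustrative only and does not reproduce the exact fitting method of Section 2.6.1, and the byte counts are made up.

```python
# Sketch of one standard way to fit a Pareto distribution to page-size data
# (maximum-likelihood estimates; not the dissertation's exact procedure).
import math

def fit_pareto(data):
    """MLE for Pareto: scale x_m = min(data), shape alpha = n / sum(ln(x/x_m))."""
    xm = min(data)
    alpha = len(data) / sum(math.log(x / xm) for x in data)
    return xm, alpha

def pareto_cdf(x, xm, alpha):
    return 0.0 if x < xm else 1.0 - (xm / x) ** alpha

def empirical_cdf(x, data):
    return sum(1 for v in data if v <= x) / len(data)

sizes = [6200, 7100, 7500, 8100, 8400, 9000, 12500]   # hypothetical page sizes
xm, alpha = fit_pareto(sizes)
# Maximum gap between the fitted and empirical CDFs (a simple KS-style check):
gap = max(abs(pareto_cdf(x, xm, alpha) - empirical_cdf(x, sizes)) for x in sizes)
```

A small maximum gap indicates a good fit; in the measurements below, the gap between the Pareto CDF and the empirical CDF usually stays large.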
Of course there are some exceptions, which we are also going to present, where the Pareto distribution fits the data well; but the majority of the graphs show that this is not the case, and that the appropriate distribution could be the Normal. We will start with the Education stories of the BBC News category. The following figure presents the PDF and the CDF of the Education stories, for which measurements were performed over a week period. We can see the distribution of the web pages according to their total size in bytes.
  • 39. Nikolaos Draganoudis, MSc dissertation - 29 - Figure 13. PDF and CDF of Education Stories Figure 14. Pareto CDF versus Empirical CDF of Education Stories Figure 15. Normal PDF and CDF of Education Stories
  • 40. Nikolaos Draganoudis, MSc dissertation - 30 - It is obvious from figure 14 that the Pareto CDF does not follow the curve of the empirical CDF of the data, as the two curves have only two common points and then the differences between them increase. On the other hand, figure 15 shows that the measurements of the Education stories fit very well to the Normal distribution, with a mean equal to 8487 bytes and a standard deviation equal to 1752.23 bytes. We can extract this conclusion by comparing the PDF of the data with the PDF of the Normal distribution, and also the CDF of the data with the CDF of the Normal distribution. Now we will examine the Top Stories of the BBC News category. The results of the measurements come from three different top stories and represent the total content size of the web pages. Figure 16. PDF and CDF of News Top Stories Figure 17. Pareto CDF versus Empirical CDF of News Top Stories
  • 41. Nikolaos Draganoudis, MSc dissertation - 31 - Figure 18. Normal PDF and CDF of News Top Stories In figure 16 the PDF and CDF of the content in bytes of the BBC News Top Stories web pages are presented, and in the following figure the curves of the Empirical CDF of the data and of the Pareto CDF, with its parameters set to be as close as possible to the Empirical CDF, are presented. Even so, the differences between these two curves are obvious, and they have only two common points. On the contrary, comparing the PDF of the data with the PDF of the Normal distribution in figure 18, there are many similarities in the shape of the graph, and the same holds for the CDF of the data and the CDF of the Normal distribution. The Normal distribution has a mean value of 8123 bytes and a standard deviation of 1359 bytes. The next example will be the Politics stories of the BBC News. The measurements are composed of three different Politics stories. Figure 19. PDF and CDF of Politics Stories
  • 42. Nikolaos Draganoudis, MSc dissertation - 32 - Figure 20. Pareto CDF versus Empirical CDF of Politics Stories Figure 21. Normal PDF and CDF of Politics Stories From figure 20 we can see that the two curves of the Pareto and the Empirical CDF have different angles, so the data of the Politics stories do not follow the Pareto distribution. On the other hand, from figure 19 and figure 21 the Normal distribution seems to be closer to the data and fits well for both the PDF and the CDF. The Normal distribution parameters are 7919 bytes for the mean value and 1192 bytes for the standard deviation. As we can see, up to this point the measurements tend to fit the Normal distribution, but as we will present in the following pages there are some measurements that also fit the Pareto distribution well, even though these cases are few. In the following figures we will present one case that fits well to the Pareto distribution.
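The Normal-fit comparisons used throughout this section can be sketched the same way as the Pareto check: estimate the mean and standard deviation from the sample and measure the largest vertical distance between the fitted Normal CDF and the empirical CDF. This is an illustrative snippet with made-up byte counts, not the dissertation's code.

```python
# Sketch (illustrative): checking how well a Normal distribution with the
# fitted mean and standard deviation tracks the empirical CDF of page sizes.
import math

def normal_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2), computed via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def empirical_cdf(x, data):
    return sum(1 for v in data if v <= x) / len(data)

sizes = [6300, 6900, 7400, 7900, 8100, 8600, 9400]   # hypothetical sample
mu = sum(sizes) / len(sizes)
sigma = math.sqrt(sum((v - mu) ** 2 for v in sizes) / (len(sizes) - 1))
# Largest vertical distance between the two CDFs (smaller = better fit):
gap = max(abs(normal_cdf(x, mu, sigma) - empirical_cdf(x, sizes)) for x in sizes)
```

Comparing this gap against the one obtained for the Pareto fit gives a numeric counterpart to the visual comparisons in the figures.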
  • 43. Nikolaos Draganoudis, MSc dissertation - 33 - Figure 22. PDF and CDF of Sport Top Stories Figure 23. Pareto CDF versus Empirical CDF of Sport Top Stories Figure 24. Normal PDF and CDF of Sport Top Stories
  • 44. Nikolaos Draganoudis, MSc dissertation - 34 - In figure 22 the PDF and the CDF of the content size of the BBC’s Sport Top Stories are presented, containing measurements from 3 different stories. From figure 23 we can see that the Pareto distribution fits very well to the collected data, and this can also be suspected from the PDF in figure 22, as the data have a long tail, which is a characteristic of the Pareto distribution. On the contrary, the Normal distribution does not have a good shape, as it tries to cover all the data while a few of them lie far away from the majority of the measurements and away from the main bell-shaped curve. But as we already said, there are rare cases in which the Pareto distribution fits the measurements better than the Normal distribution. Now we will examine BBC’s web pages that belong to the BBC Sport category. We will start with the Tennis stories; in the next graph we present the PDF and CDF of the Tennis stories according to their total content size. Figure 25. PDF and CDF of Tennis Stories Figure 26. Pareto CDF versus Empirical CDF of Tennis Stories
  • 45. Nikolaos Draganoudis, MSc dissertation - 35 - Figure 27. Normal PDF and CDF of Tennis Stories For the Tennis stories it can be observed from figure 26 that the Pareto distribution does not follow the Empirical CDF curve of the data, so it is not the appropriate distribution to characterize the data. Comparing the Normal PDF and CDF of figure 27 with those of figure 25, we find a bigger similarity, and the Normal distribution characterizes the collected data more appropriately. Continuing with the BBC Sport category, we will present the results for the Football Top Stories. Figure 28. PDF and CDF of Football Top Stories
  • 46. Nikolaos Draganoudis, MSc dissertation - 36 - Figure 29. Pareto CDF versus Empirical CDF of Football Top Stories Figure 30. Normal PDF and CDF of Football Top Stories In figure 28 the PDF and the CDF of the collected data of the BBC’s Football Top Stories pages are presented, and in figure 29 the comparison between the Pareto CDF and the Empirical CDF of the data is presented. From this comparison we can see the main differences between these two curves, which show that there is no fit with the Pareto distribution, as the two curves have only 3 common points at the start of the curve and then the distance between them increases. On the contrary, figure 30 compared with figure 28 has many similarities and follows the collected data with a better approximation.
  • 47. Nikolaos Draganoudis, MSc dissertation - 37 - In the following pages we will present one more example of the measurements that have been made on the BBC’s web pages, namely the Championship stories, which are a subcategory of Football in the Sport category. Figure 31. PDF and CDF of Championship Stories Figure 32. Pareto CDF versus Empirical CDF of Championship Stories
  • 48. Nikolaos Draganoudis, MSc dissertation - 38 - Figure 33. Normal PDF and CDF of Championship Stories From this last example we confirm that the Pareto distribution is not the appropriate distribution to characterize and fit the measurements of the total content size of the BBC’s web pages. This can be seen from figure 32 and the differences between the two curves. According to the results of the measurements that we made and the graphs that we presented, we can see that the Normal distribution is the more appropriate distribution for our measurements and can fit the PDF and CDF of the collected data. 4.3 Conclusions From the previous analysis of the measurements that have been made on the BBC’s web pages, concerning their total content size, we can say that the majority of them follow the Normal distribution and not the Pareto distribution. This conclusion can help the provider to offer a better service: as the average size of the web page is known, there is a known value for the number of bytes that the user has to download to see the web page, so the service provider can adjust the bandwidth needed by the user to download the web page, and can also calculate the total resources that have to be provided, since there is always an estimation of the number of customers that use the service and an estimation of the size of the web page.
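The provisioning argument above can be made concrete with a back-of-the-envelope calculation. The sketch below is illustrative, not part of the dissertation: it assumes page sizes concentrate around a known mean (as the Normal fit suggests) and combines that mean with a hypothetical number of concurrent users and a target download time.

```python
# Illustrative sizing calculation (not from the dissertation): if page sizes
# are roughly Normal with a known mean, a provider can size the aggregate
# link from the expected page size, the concurrent users and a target time.

def required_bandwidth_bps(mean_page_bytes, concurrent_users, target_seconds):
    """Aggregate bandwidth so each user downloads a mean-sized page in time."""
    return mean_page_bytes * 8 * concurrent_users / target_seconds

# e.g. 8487-byte pages (the Education-stories mean), 100 users, 2-second target:
bw = required_bandwidth_bps(8487, 100, 2.0)
print(f"{bw / 1000:.1f} kbit/s")   # 3394.8 kbit/s
```

With a heavy-tailed (Pareto-like) size distribution, no single mean drives this estimate, which is exactly the QoS difficulty discussed next.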
  • 49. Nikolaos Draganoudis, MSc dissertation - 39 - We also saw that when the measurements follow the Pareto distribution the web site can have a large variation in its content size, meaning that even if the majority of the measurements for one web site show a small variation between them, there are some others with a big variation, like the example of the Sport Top Stories in figure 22, where five out of seven days had an average content size close to 9,500 bytes and the other two had a content size bigger than 18,000 bytes. From that case we can extract the conclusion that we cannot have a reliable estimation of the average content size of a web page if the Pareto distribution is followed, so we cannot estimate the bandwidth needed to download the web page, and we may face bigger delays and lower QoS.
  • 50. Nikolaos Draganoudis, MSc dissertation - 40 - 5 MEASUREMENTS AND ANALYSIS OF THE INTER- ARRIVAL TIME OF PACKETS OF A WEB PAGE RESPONSE In this chapter we will focus on the measurements that were performed at the web pages in order to examine the inter-arrival time of the received packets produced by a web page request. When the user tries to access to a web page then a packet is sent to the service provider asking for access to the contents of the page. Then the provider after processing the user’s request sends back to the user the contents of the page. These may not fit into a single packet for many reasons like big amount in bytes or fragmentation of the packet by the network. For that reason the user receives many packets for this particular web page request and these packets we need to capture to observe their inter-arrival time. This can be done with the Wireshark program that was presented in Chapter 3 and also the requests were produced by the emulator presented in the 3rd Chapter. From these measurements we will try to extract some useful results about these received packets and the possibility of this these packets following a mathematical distribution. 5.1 Method of the Analysis of the BBC’s Web Site Inter-arrival time In order to obtain right results from the performed measurements it is very important to make the analysis correctly, otherwise the results from the analysis will be useless. The analysis cannot be done like the previous Chapter where we gathered the measurements from the same subcategory and analysed them all together, for example we cannot gather the Education stories all together and extract results from them but we need to analyse and study every web page on its own. In that way we will observe the inter-arrival time of the packets of the web page separately from other web page packets that are irrelevant with it. 
On the following pages we present the measurements that were made and the results extracted from them.
  • 51. 5.2 Inter-arrival Time Measurements of the Web Pages
As the graphs produced by the measurements are too many, we decided to present a representative sample of them, capable of yielding useful results. We present graphs from all the days of a week, as the measurements were made over a one-week period.
5.2.1 Monday Measurements
We start by presenting Monday's measurements for different web sites. The next graphs present the PDF and CDF of the collected data for the BBC Home web page on Monday.
Figure 34. PDF and CDF of Monday's BBC Home page
Figure 35. Pareto CDF versus Empirical CDF of Monday's BBC Home page
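The "Pareto CDF versus empirical CDF" comparisons shown in these figures can be reproduced with two small functions. This is a hedged sketch under our own naming, not the dissertation's code:

```python
def empirical_cdf(data):
    """Points (x, F(x)) of the empirical CDF, where F(x) is the
    fraction of samples less than or equal to x."""
    xs = sorted(data)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]

def pareto_cdf(x, x_m, alpha):
    """CDF of a Pareto distribution with scale x_m and shape alpha:
    F(x) = 1 - (x_m / x)^alpha for x >= x_m, else 0."""
    return 0.0 if x < x_m else 1.0 - (x_m / x) ** alpha
```

Plotting both curves over the same range of inter-arrival times, as in figure 35, makes any mismatch between the fitted Pareto and the measured data visible at a glance.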
  • 52. Figure 36. Normal PDF and CDF of Monday's BBC Home page
From the graphs presented on the previous page we can observe that the measurements of the inter-arrival time of the packets of the BBC Home page response do not follow the Pareto distribution. This can be seen clearly in figure 35, where the empirical CDF of the data follows a different curve from the Pareto CDF and shares only two common points with it. Comparing the CDF of the Normal distribution with the CDF of the collected data in figure 34, however, we can see that the measurements tend to follow the Normal distribution rather than the Pareto distribution. This can also be observed from the PDF of the data, because the data tend to have a bell-shaped curve, just like the Normal PDF.
Now we will present the results for News Top Story 1 of the BBC News category.
Figure 37. PDF and CDF of Monday's News Top Story 1
  • 53. Figure 38. Pareto CDF versus Empirical CDF of Monday's News Top Story 1
Figure 39. Normal PDF and CDF of Monday's News Top Story 1
Figure 37 presents the PDF and CDF of the measurements performed on the News Top Story 1 web site for Monday, and figure 38 then compares the Pareto CDF with the empirical CDF of the collected data. In this comparison the Pareto CDF is close to the empirical CDF of the data, but from figure 39 we can see that the Normal distribution is closer to the PDF and CDF of the collected data and fits better than the Pareto distribution. So for this measurement the Normal distribution is more appropriate than the Pareto.
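The visual judgement that the Normal fits better than the Pareto can be made quantitative with the largest vertical distance between the empirical CDF and each candidate CDF (a Kolmogorov–Smirnov style statistic): the candidate with the smaller distance fits better. A hedged sketch, with function names of our own choosing:

```python
import math

def normal_cdf(x, mu, sigma):
    """CDF of a Normal(mu, sigma) distribution via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def max_cdf_gap(data, model_cdf):
    """Largest vertical distance between the empirical CDF of `data`
    and a model CDF; smaller means a better fit."""
    xs = sorted(data)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs):
        # Compare the model against the empirical CDF on both sides
        # of its jump at x (values i/n and (i+1)/n).
        gap = max(gap,
                  abs((i + 1) / n - model_cdf(x)),
                  abs(i / n - model_cdf(x)))
    return gap
```

Applied to the inter-arrival samples of one page, `max_cdf_gap(data, fitted_normal)` versus `max_cdf_gap(data, fitted_pareto)` would replace the eyeball comparison of figures 38 and 39 with a single number per candidate.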
  • 54. We will continue the analysis with the BBC's News Top Story 3 web page.
Figure 40. PDF and CDF of Monday's News Top Story 3
Figure 41. Pareto CDF versus Empirical CDF of Monday's News Top Story 3
  • 55. Figure 42. Normal PDF and CDF of Monday's News Top Story 3
For this set of measurements we can also see that the Pareto distribution is not the most appropriate one to characterize the collected data. This can be seen in figure 41, where there are parts of the curves that follow different paths and have no common points. Figure 42 indicates that the Normal distribution is the more appropriate distribution to characterize the data, which can be confirmed by comparing the PDF and CDF of the collected data with those of the Normal distribution.
5.2.2 Tuesday Measurements
Starting the measurements for Tuesday, we will examine the BBC News web page.
Figure 43. PDF and CDF of Tuesday's News web page
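The Normal curves compared against the data in these figures require a mean and standard deviation. The dissertation does not state how the fits were parameterised, but the usual moment estimates can be sketched as follows:

```python
import math

def fit_normal(data):
    """Moment estimates (mu, sigma) for a Normal fit to the samples,
    using the sample mean and the unbiased sample standard deviation."""
    n = len(data)
    mu = sum(data) / n
    var = sum((x - mu) ** 2 for x in data) / (n - 1)
    return mu, math.sqrt(var)
```

The returned `(mu, sigma)` pair fully determines the Normal PDF and CDF drawn over the measured inter-arrival times.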
  • 56. Figure 44. Pareto CDF versus Empirical CDF of Tuesday's News web page
Figure 45. Normal PDF and CDF of Tuesday's News web page
Figure 43 shows the PDF and CDF of the collected data for the News web page. We have to mention that the PDF is sometimes not an appropriate basis for comparison with another distribution. This is because, when taking measurements from received packets and computing their inter-arrival times, most packets arrive at similar time intervals and therefore each value has roughly the same probability of occurrence as the others, except when several packets are received within small time intervals. Figure 44 shows the Pareto CDF and the empirical CDF of the collected data; from that graph we can see that the Pareto is not the distribution that fits the data best, while the CDF of the Normal distribution in figure 45 shows that the Normal distribution is more appropriate and fits the collected data better.
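One practical way to make the PDF comparison less noisy is to bin the inter-arrival times into a histogram-based density estimate rather than treating each distinct value separately. A minimal sketch (the bin count is a free choice, and we assume the samples are not all identical):

```python
def binned_pdf(data, n_bins):
    """Histogram estimate of the PDF: returns (bin_centre, density)
    pairs, where the densities integrate to 1 over the data range."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in data:
        # Clamp the maximum value into the last bin.
        counts[min(int((x - lo) / width), n_bins - 1)] += 1
    n = len(data)
    return [(lo + (i + 0.5) * width, c / (n * width))
            for i, c in enumerate(counts)]
```

With a sensible bin width, the bell shape (or heavy tail) of the inter-arrival distribution becomes visible in the binned densities even when the raw per-value probabilities are nearly flat.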