                                     Web Browser
A web browser is a software application that enables a user to display and interact with
text, images, and other information typically located on a web page at a website on the
World Wide Web or a local area network. Text and images on a web page can contain
hyperlinks to other web pages at the same or different website. Web browsers allow a
user to quickly and easily access information provided on many web pages at many
websites by traversing these links. Web browsers format HTML information for display,
so the appearance of a web page may differ between browsers.
Some of the web browsers available for personal computers include Internet Explorer,
Mozilla Firefox, Safari, Netscape, and Opera in order of descending popularity (as of
August 2006).[1] Web browsers are the most commonly used type of HTTP user agent.
Although browsers are typically used to access the World Wide Web, they can also be
used to access information provided by web servers in private networks or content in file
systems.
Protocols and standards
Web browsers communicate with web servers primarily using HTTP (hypertext transfer
protocol) to fetch webpages. HTTP allows web browsers to submit information to web
servers as well as fetch web pages from them. The most commonly used version of HTTP
is HTTP/1.1, which is fully defined in RFC 2616. HTTP/1.1 imposes requirements that
Internet Explorer does not fully support, but most other current-generation web browsers
do.
Pages are located by means of a URL (uniform resource locator), which is treated as an
address, beginning with http: for HTTP access. Many browsers also support a variety of
other URL types and their corresponding protocols, such as ftp: for FTP (file transfer
protocol), rtsp: for RTSP (real-time streaming protocol), and https: for HTTPS (an SSL-
encrypted version of HTTP).
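For illustration, the scheme prefix of a URL tells the browser which protocol to use; the
following addresses are invented examples only:
    http://www.example.com/index.html     (fetched over HTTP)
    https://www.example.com/account       (fetched over HTTP secured with SSL)
    ftp://ftp.example.com/pub/readme.txt  (fetched over FTP)
    rtsp://media.example.com/lecture      (streamed over RTSP)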
The file format for a web page is usually HTML (hyper-text markup language) and is
identified in the HTTP protocol using a MIME content type. Most browsers natively
support a variety of formats in addition to HTML, such as the JPEG, PNG and GIF image
formats, and can be extended to support more through the use of plugins. The
combination of HTTP content type and URL protocol specification allows web page
designers to embed images, animations, video, sound, and streaming media into a web
page, or to make them accessible through the web page.
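As a small, hypothetical illustration (the file names and addresses are invented), a page
might embed an image directly and link out to a streamed clip:
    <img src="diagram.png" alt="Network diagram">
    <a href="rtsp://media.example.com/lecture">Watch the streaming lecture</a>
When the browser requests diagram.png, the server identifies it in the HTTP response with
the MIME content type image/png, so the browser knows to render it as a PNG image.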
Early web browsers supported only a very simple version of HTML. The rapid
development of proprietary web browsers led to the development of non-standard dialects
of HTML, leading to problems with Web interoperability. Modern web browsers support
a combination of standards-based and de facto HTML and XHTML, which should
display in the same way across all browsers. No browser fully supports HTML 4.01,
XHTML 1.x or CSS 2.1 yet. Currently many sites are designed using WYSIWYG HTML
generation programs such as Macromedia Dreamweaver or Microsoft FrontPage. These
often generate non-standard HTML by default, hindering the work of the W3C in
developing standards, specifically with XHTML and CSS (Cascading Style Sheets, used
for page layout).
Some of the more popular browsers include additional components to support Usenet
news, IRC (Internet relay chat), and e-mail. Protocols supported may include NNTP
(network news transfer protocol), SMTP (simple mail transfer protocol), IMAP (Internet
message access protocol), and POP (post office protocol). These browsers are often
referred to as Internet suites or application suites rather than merely web browsers.
Brief history
A NeXTcube was used by Tim Berners-Lee (who pioneered the use of hypertext for
sharing information) as the world's first web server, and also to write the first web
browser, WorldWideWeb in 1990. Berners-Lee introduced it to colleagues at CERN in
March 1991. Since then the development of web browsers has been inseparably
intertwined with the development of the web itself.
The first browser, Silversmith, was created by John Bottoms in 1987.[2] The browser,
based on SGML tags, used a tag set from the Electronic Document Project of the AAP
with minor modifications and was sold to a number of early adopters. At the time SGML
was used exclusively for the formatting of printed documents. The use of SGML for
electronically displayed documents signaled a shift in electronic publishing and was met
with considerable resistance. Silversmith included an integrated indexer, full-text
search, hypertext links between images, text and sound using SGML tags, and a return
stack for use with hypertext links. It included capabilities that are still not available in
today's browsers, such as the ability to restrict searches within document structures,
searches on indexed documents using wildcards, and the ability to search on tag attribute
values and attribute names.
In 1992, Tony Johnson released the MidasWWW browser. Based on Motif/X,
MidasWWW allowed viewing of PostScript files on the Web from Unix and VMS, and
even handled compressed PostScript.
Another early popular web browser was ViolaWWW, which was modeled after
HyperCard. However, the explosion in popularity of the web was triggered by NCSA
Mosaic which was a graphical browser running originally on Unix but soon ported to the
Apple Macintosh and Microsoft Windows platforms. Version 1.0 was released in
September 1993, and was dubbed the killer application of the Internet. Marc Andreessen,
who was the leader of the Mosaic team at NCSA, quit to form a company that would later
be known as Netscape Communications Corporation.
Netscape released its flagship Navigator product in October 1994, and it took off the next
year. Microsoft, which had thus far not marketed a browser, now entered the fray with its
Internet Explorer product, purchased from Spyglass Inc. This began what is known as the
browser wars, the fight for the web browser market between Microsoft and Netscape.
The wars put the web in the hands of millions of ordinary PC users, but showed how
commercialization of the web could stymie standards efforts. Both Microsoft and
Netscape liberally incorporated proprietary extensions to HTML in their products, and
tried to gain an edge by product differentiation. Starting with the W3C's acceptance of
Microsoft's proposed Cascading Style Sheets over Netscape's JavaScript Style Sheets
(JSSS), the Netscape browser came to be generally considered inferior to Microsoft's,
version after version, in features, application robustness, and standards compliance. The
wars effectively ended in 1998 when it became
clear that Netscape's declining market share trend was irreversible. This trend may have
been due in part to Microsoft's integrating its browser with its operating system and
bundling deals with OEMs; Microsoft faced antitrust litigation on these charges.
Netscape responded by open sourcing its product, creating Mozilla. This did nothing to
slow Netscape's declining market share. The company was purchased by America Online
in late 1998. At first, the Mozilla project struggled to attract developers, but by 2002 it
had evolved into a relatively stable and powerful internet suite. Mozilla 1.0 was released
to mark this milestone. Also in 2002, a spin off project that would eventually become the
popular Mozilla Firefox was released. In 2004, Firefox 1.0 was released; Firefox 1.5 was
released in November 2005. Firefox 2, a major update, was released in October 2006 and
work has already begun on Firefox 3 which is scheduled for release in 2007. As of 2006,
Mozilla and its derivatives account for approximately 12% of web traffic.
Opera, an innovative, speedy browser popular on handheld devices, particularly mobile
phones, as well as on PCs in some countries, was released in 1996 and remains a niche
player in the PC web browser market. It is available on Nintendo's DS, DS Lite and Wii
consoles[2]. The Opera Mini browser uses the Presto layout engine, like all versions of
Opera, but runs on most phones supporting Java MIDlets.
The Lynx browser remains popular with Unix shell users and with vision-impaired users
due to its entirely text-based nature. There are also several text-mode browsers with
advanced features, such as w3m, Links (which can operate both in text and graphical
mode), and the Links forks such as ELinks.
The Macintosh scene too has traditionally been dominated by Internet Explorer and
Netscape. However, Apple's Safari, the default browser on Mac OS X since version 10.3,
has slowly grown to dominate this market.
In 2003, Microsoft announced that Internet Explorer would no longer be made available
as a separate product but would be part of the evolution of its Windows platform, and that
no more releases for the Macintosh would be made. However, in early 2005, Microsoft
changed its plans, releasing version 7 of Internet Explorer for Windows XP, Windows
Server 2003, and Windows Vista in October 2006.
Features
Different browsers can be distinguished from each other by the features they support.
Modern browsers and web pages tend to utilize many features and techniques that did not
exist in the early days of the web. As noted earlier, with the browser wars there was a
rapid and chaotic expansion of browser and World Wide Web feature sets.
The following is a list of some of the most notable features:
   •   Standards support
   •   HTTP and HTTPS
    •   HTML, XML and XHTML
    •   Graphics file formats including GIF, PNG, JPEG, and SVG
    •   Cascading Style Sheets (CSS)
    •   JavaScript (Dynamic HTML) and XMLHttpRequest
    •   Cookie
    •   Digital certificates
    •   Favicons
    •   RSS, Atom
Fundamental features
    •   Bookmark manager
    •   Caching of web contents
    •   Support of media types via plugins such as Macromedia Flash and QuickTime
Usability and accessibility features
    •   Autocompletion of URLs and form data
    •   Tabbed browsing
    •   Spatial navigation
    •   Caret navigation
    •   Screen reader or full speech support




                                         HTML
HTML, short for HyperText Markup Language, is the predominant markup language
for the creation of web pages. It provides a means to describe the structure of text-based
information in a document — by denoting certain text as headings, paragraphs, lists, and
so on — and to supplement that text with interactive forms, embedded images, and other
objects. HTML is written in the form of labels (known as tags), created by greater-than
signs (>) and less-than signs (<). HTML can also describe, to some degree, the
appearance and semantics of a document, and can include embedded scripting language
code which can affect the behavior of web browsers and other HTML processors.
HTML is also often used to refer to content of the MIME type text/html or even more
broadly as a generic term for HTML whether in its XML-descended form (such as
XHTML 1.0 and later) or its form descended directly from SGML (such as HTML 4.01
and earlier).
   What is HTML?
HTML stands for Hypertext Markup Language.
Hypertext is ordinary text that has been dressed up with extra features, such as
formatting, images, multimedia, and links to other documents.
Markup is the process of taking ordinary text and adding extra symbols. Each of the
symbols used for markup in HTML is a command that tells a browser how to display the
text.
History of HTML
Tim Berners-Lee created the original HTML (and many associated protocols such as
HTTP) on a NeXTcube workstation using the NeXTSTEP development environment. At
the time, HTML was not a specification, but a collection of tools to solve an immediate
problem: the communication and dissemination of ongoing research among Berners-Lee
and a group of his colleagues. His solution later combined with the emerging
international and public internet to garner worldwide attention.
Early versions of HTML were defined with loose syntactic rules, which helped its
adoption by those unfamiliar with web publishing. Web browsers commonly made
assumptions about intent and proceeded with rendering of the page. Over time, as the use
of authoring tools increased, the trend in the official standards has been to create an
increasingly strict language syntax. However, browsers still continue to render pages that
are far from valid HTML.
HTML is defined in formal specifications that were developed and published throughout
the 1990s, inspired by Tim Berners-Lee's prior proposals to graft hypertext capability
onto a homegrown SGML-like markup language for the Internet. The first published
specification for a language called HTML was drafted by Berners-Lee with Dan
Connolly, and was published in 1993 by the IETF as a formal "application" of SGML
(with an SGML Document Type Definition defining the grammar). The IETF created an
HTML Working Group in 1994 and published HTML 2.0 in 1995, but further
development under the auspices of the IETF was stalled by competing interests. Since
1996, the HTML specifications have been maintained, with input from commercial
software vendors, by the World Wide Web Consortium (W3C).[1] However, in 2000,
HTML also became an international standard (ISO/IEC 15445:2000). The last HTML
specification published by the W3C is the HTML 4.01 Recommendation, published in
late 1999 and its issues and errors were last acknowledged by errata published in 2001.
Since the publication of HTML 4.0 in late 1997, the W3C's HTML Working Group has
increasingly — and from 2002 through 2006, exclusively — focused on the development
of XHTML, an XML-based counterpart to HTML that is described on one W3C web
page as HTML's "successor".[2][3][4] XHTML applies the more rigorous, less ambiguous
syntax requirements of XML to HTML to make it easier to process and extend, and as
support for XHTML has increased in browsers and tools, it has been embraced by many
web standards advocates in preference to HTML. XHTML is routinely characterized by
mass-media publications for both general and technical audiences as the newest "version"
of HTML, but W3C publications, as of 2006, do not make such a claim; neither HTML
3.2 nor HTML 4.01 have been explicitly rescinded, deprecated, or superseded by any
W3C publications, and, as of 2006, they continue to be listed alongside XHTML as
current Recommendations in the W3C's primary publication indices.[5][6][7]
In November 2006, the HTML Working Group published a new charter indicating its
intent to resume development of HTML in a manner that unifies HTML 4 and XHTML
1, allowing for this hybrid language to manifest in both an XML format and a "classic
HTML" format that is SGML-compatible but not strictly SGML-based. Among other
things, it is planned that the new specification, to be released and refined throughout 2007
through 2008, will include conformance and parsing requirements, DOM APIs, and new
widgets and APIs. The group also intends to publish test suites and validation tools.[8]
Version history of the standard
Hypertext Markup Language (First Version), published June 1993 as an Internet
Engineering Task Force (IETF) working draft (not standard).
HTML 2.0, published November 1995 as IETF RFC 1866, supplemented by RFC 1867
(form-based file upload) that same month, RFC 1942 (tables) in May 1996, RFC 1980
(client-side image maps) in August 1996, and RFC 2070 (internationalization) in January
1997; ultimately all were declared obsolete/historic by RFC 2854 in June 2000.
HTML 3.2, published January 14, 1997 as a W3C Recommendation.
HTML 4.0, published December 18, 1997 as a W3C Recommendation. It offers three
"flavors":
Strict, in which deprecated elements are forbidden
Transitional, in which deprecated elements are allowed
Frameset, in which mostly only frame related elements are allowed
HTML 4.01, published December 24, 1999 as a W3C Recommendation. It offers the
same three flavors as HTML 4.0, and its last errata was published May 12, 2001.
ISO/IEC 15445:2000 ("ISO HTML", based on HTML 4.01 Strict), published May 15,
2000 as an ISO/IEC international standard.
HTML 4.01 and ISO/IEC 15445:2000 are the most recent and final versions of HTML.
XHTML is a separate language that began as a reformulation of HTML 4.01 using XML
1.0. It continues to be developed:
XHTML 1.0, published January 26, 2000 as a W3C Recommendation, later revised and
republished August 1, 2002. It offers the same three flavors as HTML 4.0 and 4.01,
reformulated in XML, with minor restrictions.
XHTML 1.1, published May 31, 2001 as a W3C Recommendation. It is based on
XHTML 1.0 Strict, but includes minor changes and is reformulated using modules from
Modularization of XHTML, which was published April 10, 2001 as a W3C
Recommendation.
XHTML 2.0 is still a W3C Working Draft.
There is no official standard HTML 1.0 specification because there were multiple
informal HTML standards at the time. Berners-Lee's original version did not include an
IMG element type. Work on a successor for HTML, then called "HTML+", began in late
1993, designed originally to be "A superset of HTML…which will allow a gradual
rollover from the previous format of HTML". The first formal specification was therefore
given the version number 2.0 in order to distinguish it from these unofficial "standards".
Work on HTML+ continued, but it never became a standard.
The HTML 3.0 standard was proposed by the newly formed W3C in March 1995, and
provided many new capabilities such as support for tables, text flow around figures, and
the display of complex math elements. Even though it was designed to be compatible
with HTML 2.0, it was too complex at the time to be implemented, and when the draft
expired in September 1995, work in this direction was discontinued due to lack of
browser support. HTML 3.1 was never officially proposed, and the next standard
proposal was HTML 3.2 (code-named "Wilbur"), which dropped the majority of the new
features in HTML 3.0 and instead adopted many browser-specific element types and
attributes which had been created for the Netscape and Mosaic web browsers. Math
support as proposed by HTML 3.0 finally came about years later with a different
standard, MathML.
HTML 4.0 likewise adopted many browser-specific element types and attributes, but at
the same time began to try to "clean up" the standard by marking some of them as
deprecated, and suggesting they not be used.
Minor editorial revisions to the HTML 4.0 specification were published as HTML 4.01.
The most common filename extension for files containing HTML is .html. However,
older operating systems and filesystems, such as the DOS versions from the 80's and
early 90's and FAT, limit file extensions to three letters, so a .htm extension is also used.
Although perhaps less common now, the shorter form is still widely supported by current
software.
HTML as a hypertext format
HTML is the basis of a comparatively weak hypertext implementation. Earlier hypertext
systems had features such as typed links, transclusion and source tracking. Another
feature lacking today is fat links.[9]
Even some hypertext features that were in early versions of HTML have been ignored by
most popular web browsers until recently, such as the link element and editable web
pages.
Sometimes web services or browser manufacturers remedy these shortcomings. For
instance, members of the modern social software landscape such as wikis and content
management systems allow surfers to edit the web pages they visit.
HTML markup
HTML markup consists of several types of components, including elements, attributes,
data types and character references.
The Document Type Definition
In order to enable Document Type Definition (DTD)-based validation with SGML tools
and in order to avoid the Quirks mode in browsers, all HTML documents should start
with a Document Type Declaration (informally, a "DOCTYPE"). The DTD contains
machine readable grammar specifying the permitted and prohibited content for a
document conforming to such a DTD. Browsers do not read the DTD, however. Browsers
only look at the doctype in order to decide the layout mode. Not all doctypes trigger the
Standards layout mode avoiding the Quirks mode. For example:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
This declaration references the Strict DTD of HTML 4.01, which does not have
presentational elements like <font>, leaving formatting to Cascading Style Sheets.
SGML-based validators read the DTD in order to properly parse the document and to
perform validation. In modern browsers, the HTML 4.01 Strict doctype activates the
Standards layout mode for CSS as opposed to the Quirks mode.
In addition, HTML 4.01 provides Transitional and Frameset DTDs. The Transitional
DTD was intended to gradually phase in the changes made in the Strict DTD, while the
Frameset DTD was intended for those documents which contained frames.
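The corresponding declarations, as published by the W3C, are:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN"
"http://www.w3.org/TR/html4/frameset.dtd">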
Elements
See HTML elements for more detailed descriptions.
Elements are the basic structure for HTML markup. Elements have two basic properties:
attributes and content. Each attribute and each element's content has certain restrictions
that must be followed for an HTML document to be considered valid. An element usually
has a start tag (e.g. <element-name>) and an end tag (e.g. </element-name>). The
element's attributes are contained in the start tag and the content is located between the
tags (e.g. <element-name>Content</element-name>). Some elements, such as <br>, will
never have any content and do not need closing tags. Listed below are several types of
markup elements used in HTML.
Structural markup describes the purpose of text. For example, <h2>Golf</h2>
establishes "Golf" as a second-level heading, which most browsers render on its own line
in a larger, bold font with space above and below it. Structural markup does not denote
any specific rendering, but most web browsers have standardized on how elements should
be formatted. Further styling should be done with Cascading Style Sheets (CSS).
Presentational markup describes the appearance of the text, regardless of its function.
For example, <b>boldface</b> indicates that visual output devices should render
"boldface" in bold text, but has no clear semantics for aural devices that read the text
aloud for the sight-impaired. In the case of both <b>bold</b> and <i>italic</i>, there
are elements which usually have an equivalent visual rendering but are more semantic in
nature, namely <strong>strong emphasis</strong> and <em>emphasis</em>
respectively. It is easier to see how an aural user agent should interpret the latter two
elements. However, they are not equivalent to their presentational counterparts: it would
be undesirable for a screen reader to emphasize the name of a book, for instance, but on
screen such a name would be italicized. Most presentational markup elements have
become deprecated under the HTML 4.0 specification, in favor of CSS-based style
design.
Hypertext markup links parts of the document to other documents. HTML up through
XHTML 1.1 requires the use of an anchor element to create a hyperlink in the flow of
text: <a>Wikipedia</a>. However, the href attribute must also be set to a valid URL, so
for example the HTML code <a href="http://en.wikipedia.org/">Wikipedia</a> will
render the word "Wikipedia" as a hyperlink. To view the HTML source of a web page,
choose View > Source (or the equivalent) in the browser menu.
Attributes
The attributes of an element are name-value pairs, separated by "=", and written within
the start tag of an element, after the element's name. The value should be enclosed in
single or double quotes, although values consisting of certain characters can be left
unquoted in HTML (but not XHTML).[10][11] Leaving attribute values unquoted is
considered unsafe.[12]
Most elements take any of several common attributes: id, class, style and title. Most
also take language-related attributes: lang and dir.
The id attribute provides a document-wide unique identifier for an element. This can be
used by stylesheets to provide presentational properties, by browsers to focus attention on
the specific element or by scripts to alter the contents or presentation of an element. The
class attribute provides a way of classifying similar elements for presentation purposes.

For example, an HTML document (or a set of documents) may use the designation
class="notation" to indicate that all elements with this class value are subordinate
to the main text of the document (or documents). Such notation-class elements might
be gathered together and presented as footnotes on a page, rather than appearing in the
place where they occur in the source HTML.
An author may use the style attribute to assign presentational properties to a particular
element. It is considered better practice to use an element's id or class attribute and
select the element with a stylesheet, though sometimes this can be too cumbersome for a
simple ad hoc application of styled properties. The title attribute is used to attach a
subtextual explanation to an element. In most browsers this title attribute is displayed
as what is often referred to as a tooltip. The generic inline span element can be used to
demonstrate these various attributes, as in the example below.
<span id='anId' class='aClass' style='color:red;' title='HyperText Markup Language'>HTML</span>
which displays as HTML (pointing the cursor at the abbreviation should display the title
text in most browsers).
Other markup
As of version 4.0, HTML defines a set of 252 character entity references and a set of
1,114,050 numeric character references, both of which allow individual characters to be
written via simple markup, rather than literally. A literal character and its markup
equivalent are considered equivalent and are rendered identically.
The ability to "escape" characters in this way allows for the characters "<" and "&"
(when written as &lt; and &amp;, respectively) to be interpreted as character data, rather
than markup. For example, a literal "<" normally indicates the start of a tag, and "&"
normally indicates the start of a character entity reference or numeric character reference;
writing it as "&amp;" or "&#38;" allows "&" to be included in the content of elements or
the values of attributes. The double-quote character, ", when used to quote an attribute
value, must also be escaped as "&quot;" or "&#34;" when it appears within the
attribute value itself.
these characters, browsers tend to be very forgiving, treating them as markup only when
subsequent text appears to confirm that intent.
Escaping also allows for characters that are not easily typed or that aren't even available
in the document's character encoding to be represented within the element and attribute
content. For example, "é", a character typically found only on Western European
keyboards, can be written in any HTML document as the entity reference &eacute; or as
the numeric references &#233; or &#xE9;. The characters comprising those references
(that is, the "&", the ";", the letters in "eacute", and so on) are available on all keyboards
and are supported in all character encodings, whereas the literal "é" is not.
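For example, the following invented fragment uses references both to escape markup
characters and to produce characters that may be absent from the document's encoding:
    <p>AT&amp;T serves caf&eacute; au lait for &#163;2 (&quot;cheap!&quot;).</p>
A browser renders this as: AT&T serves café au lait for £2 ("cheap!").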
HTML also defines several data types for element content, such as script data and
stylesheet data, and a plethora of types for attribute values, including IDs, names, URIs,
numbers, units of length, languages, media descriptors, colors, character encodings, dates
and times, and so on. All of these data types are specializations of character data.
Semantic HTML
There is no official specification called "Semantic HTML", though the strict flavors of
HTML discussed below are a push in that direction. Rather, semantic HTML refers to an
objective and a practice to create documents with HTML that contain only the author's
intended meaning, without any reference to how this meaning is presented or conveyed.
A classic example is the distinction between the emphasis element (<em>) and the italics
element (<i>). Often the emphasis element is displayed in italics, so the presentation is
typically the same. However, emphasizing something is different from listing the title of
a book, for example, which may also be displayed in italics. In purely semantic HTML, a
book title would use a different element from the one used for emphasized text (for
example, a <span> with a suitable class), because they are meaningfully different things.
The goal of semantic HTML requires two things of authors:
1) to avoid the use of presentational markup (elements, attributes and other entities); and
2) to use the available markup to differentiate the meanings of phrases and structure in
the document. So, for example, the book title from above would need its own element and
class specified, such as <cite class="booktitle">The Grapes of Wrath</cite>. Here,
the <cite> element is used because it most closely matches the meaning of this phrase in
the text. However, the <cite> element alone is not specific enough for this task, since we
mean to cite specifically a book title as opposed to a newspaper article or an academic
journal; the class attribute supplies that extra distinction.
Semantic HTML also requires complementary specifications and software compliance
with these specifications. Primarily, the development and proliferation of CSS has led to
increasing support for semantic HTML because CSS provides designers with a rich
language to alter the presentation of semantic-only documents. With the development of
CSS the need to include presentational properties in a document has virtually
disappeared. With the advent and refinement of CSS and the increasing support for it in
web browsers, subsequent editions of HTML increasingly stress only using markup that
suggests the semantic structure and phrasing of the document, like headings, paragraphs,
quotes, and lists, instead of using markup which is written for visual purposes only, like
<font>, <b> (bold), and <i> (italics). Some of these elements are not permitted in certain

varieties of HTML, like HTML 4.01 Strict. CSS provides a way to separate document
semantics from the content's presentation, by keeping everything relevant to presentation
defined in a CSS file. See separation of style and content.
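As a small sketch of this separation (the class name and style rules are invented for
illustration), the HTML carries only meaning while a linked CSS file supplies the
presentation:
    <h1>Steinbeck's Novels</h1>
    <p>His best-known work is <cite class="booktitle">The Grapes of Wrath</cite>.</p>
and, in the stylesheet:
    h1             { font-family: sans-serif; }
    cite.booktitle { font-style: italic; }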
Semantic HTML offers many advantages. First, it ensures consistency in style across
elements that have the same meaning. Every heading, every quotation mark, every
similar element receives the same presentation properties.
Second, semantic HTML frees authors from the need to concern themselves with
presentation details. When writing the number two, for example, should it be written out
in words ("two"), or should it be written as a numeral (2)? A semantic markup might
enter something like <number>2</number> and leave presentation details to the
stylesheet designers. Similarly, an author might wonder where to break out quotations
into separate indented blocks of text - with purely semantic HTML, such details would be
left up to stylesheet designers. Authors would simply indicate quotations when they occur
in the text, and not concern themselves with presentation.
A third advantage is device independence and repurposing of documents. A semantic
HTML document can be paired with any number of stylesheets to provide output to
computer screens (through web browsers), high-resolution printers, handheld devices,
aural browsers or braille devices for those with visual impairments, and so on. To
accomplish this nothing needs to be changed in a well coded semantic HTML document.
Readily available stylesheets make this a simple matter of pairing a semantic HTML
document with the appropriate stylesheets (of course, the stylesheet's selectors need to
match the appropriate properties in the HTML document).
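For instance, the head of a document might pair the same markup with different
stylesheets for different output devices (the file names here are invented):
    <link rel="stylesheet" type="text/css" media="screen" href="screen.css">
    <link rel="stylesheet" type="text/css" media="print" href="print.css">
    <link rel="stylesheet" type="text/css" media="aural" href="speech.css">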
Some aspects of authoring documents make separating semantics from style (in other
words, meaning from presentation) difficult. Some elements are hybrids, using
presentation in their very meaning. For example, a table displays content in a tabular
form. Often this content only conveys the meaning when presented in this way.
Repurposing a table for an aural device typically involves somehow presenting the table
as an inherently visual element in an audible form. On the other hand, we frequently
take lyrical songs — something inherently meant for audible presentation — and instead
present them in textual form on a web page. For these types of elements, the
meaning is not so easily separated from their presentation. However, for a great many of
the elements used and meanings conveyed in HTML the translation is relatively smooth.
Delivery of HTML
HTML documents can be delivered by the same means as any other computer file;
however, they are most often delivered in one of two forms: over HTTP from a web
server, or by e-mail.
Publishing HTML with HTTP
The World Wide Web is primarily composed of HTML documents transmitted from a
web server to a web browser using the HyperText Transfer Protocol (HTTP). However,
HTTP can be used to serve images, sound and other content in addition to HTML. To
allow the web browser to know how to handle the document it received, an indication of
the file format of the document must be transmitted along with the document. This vital
metadata includes the MIME type (text/html for HTML 4.01 and earlier,
application/xhtml+xml for XHTML 1.0 and later) and the character encoding (see

Character encodings in HTML).
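For example, a server's response for an HTML page might begin with headers along these
lines (a simplified, hypothetical exchange), followed by the document itself:
    HTTP/1.1 200 OK
    Content-Type: text/html; charset=UTF-8

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
    "http://www.w3.org/TR/html4/strict.dtd">
    <html>...</html>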
In modern browsers, the MIME type that is sent with the HTML document affects how
the document is interpreted. A document sent with an XHTML MIME type, or served as
application/xhtml+xml, is expected to be well-formed XML and a syntax error may cause
the browser to fail to render the document. The same document sent with an HTML
MIME type, or served as text/html, might be displayed, since web browsers are more
lenient with HTML. However, XHTML parsed this way is considered neither proper
XHTML nor HTML, but so-called tag soup.
If the MIME type is not recognized as HTML, the web browser should not attempt to
render the document as HTML, even if the document is prefaced with a correct
Document Type Declaration. Nevertheless, some web browsers do examine the contents
or URL of the document and attempt to infer the file type, despite this being forbidden by
the HTTP 1.1 specification.
HTML e-mail
Most graphical e-mail clients allow the use of a subset of HTML (often ill-defined) to
provide formatting and semantic markup capabilities not available with plain text, like
emphasized text, block quotations for replies, and diagrams or mathematical formulas
that couldn't easily be described otherwise. Many of these clients include both a GUI
editor for composing HTML e-mails and a rendering engine for displaying received
HTML e-mails. Use of HTML in e-mail is controversial due to compatibility issues,
because it can be used in phishing/privacy attacks, because it can confuse spam filters,
and because the message size is larger than plain text.
Current flavors of HTML
Since its inception, HTML and its associated protocols gained acceptance relatively
quickly. However, no clear standards existed in the early years of the language. Though
its creators originally conceived of HTML as a semantic language devoid of presentation
details, practical uses pushed many presentational elements and attributes into the
language, driven largely by the various browser vendors. The latest standards
surrounding HTML reflect efforts to overcome the sometimes chaotic development of the
language and to create a rational foundation to build both meaningful and well-presented
documents. To return HTML to its role as a semantic language, the W3C has developed
style languages such as CSS and XSL to shoulder the burden of presentation. In
conjunction the HTML specification has slowly reined in the presentational elements
within the specification.
There are two axes differentiating various flavors of HTML as currently specified:
SGML-based HTML versus XML-based HTML (referred to as XHTML) on the one axis
and strict versus transitional (loose) versus frameset on the other axis.
Traditional versus XML-based HTML
One difference in the latest HTML specifications lies in the distinction between the
SGML-based specification and the XML-based specification. The XML-based
specification is often called XHTML to clearly distinguish it from the more traditional
definition; however, the root element name continues to be HTML even in the XHTML-
specified HTML. The W3C intends XHTML 1.0 to be identical with HTML 4.01 except
in the often stricter requirements of XML over traditional HTML. XHTML 1.0 likewise
has three sub-specifications: strict, loose and frameset. The strictness of XHTML in terms
of its syntax is often confused with the strictness of the strict versus the loose definitions
in terms of the content rules of the specifications. The strictness of XML lies in the need
to always explicitly close every element (for example, </p>) and to always use quotation
marks (double " or single ') to enclose attribute values. The use of implied closing tags
in HTML led to confusion for both editors and parsers.
Aside from the different opening declarations for a document, the differences between
HTML 4.01 and XHTML 1.0 — in each of the corresponding DTDs — are largely
syntactic. Adhering to valid and well-formed XHTML 1.0 will result in a well-formed
HTML 4.01 document in every way except one. XHTML introduces a new piece of
markup, the self-closing tag, as shorthand for handling empty elements. The shorthand
adds a slash (/) at the end of an opening tag, like this: <br/>. The introduction of this
shorthand, undefined in any HTML 4.01 DTD, may confuse earlier software unfamiliar
with the new convention. To help with the transition, the W3C recommends also including
a space character before the slash, like this: <br />. As validators and browsers adapt to
this evolution in the standard, the migration from traditional to XML-based HTML should
be relatively simple. The major problems occur when software does not conform to HTML
4.01 and its associated protocols to begin with, or erroneously implements the HTML
recommendations.
To understand the subtle differences between HTML and XHTML, consider the
transformation of a valid and well-formed XHTML 1.0 document into a valid and well-
formed HTML 4.01 document. Making this translation requires the following steps:
The language code for an element should be specified with a lang attribute rather than
the XHTML xml:lang attribute (HTML 4.01 defines its own attribute for language,
whereas XHTML uses the attribute defined by XML).
Remove the XML namespace (xmlns=URI). HTML does not require and has no
facilities for namespaces.
Change the DTD declaration from XHTML 1.0 to HTML 4.01 (see the DTD section
above for further explanation).
If present, remove the XML declaration (Typically this is: <?xml version="1.0"
encoding="utf-8"?>).

Change the document’s MIME type to text/html. This may come from a meta element,
from the HTTP header of the server, or possibly from a filename extension (for example,
change .xhtml to .html).
Change the XML empty-element shorthand to a standard opening tag (<br/> to <br>).
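To make the steps concrete, here is a minimal, invented document shown first as
XHTML 1.0 and then after translation to HTML 4.01:
    <?xml version="1.0" encoding="utf-8"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
    <head><title>Example</title></head>
    <body><p>First line.<br />Second line.</p></body>
    </html>
becomes, after the translation:
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
    "http://www.w3.org/TR/html4/strict.dtd">
    <html lang="en">
    <head><title>Example</title></head>
    <body><p>First line.<br>Second line.</p></body>
    </html>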
Those are the only changes necessary to translate a document from XHTML 1.0 to
HTML 4.01. The reverse operation can be much more complicated. HTML 4.01 allows
the omission of many tags in a complex pattern derived by determining which tags are
(in some sense) redundant for a valid document. In other words, if the document is
authored precisely to the associated HTML 4.01 content model, some tags need not be
expressed. For example, since a paragraph cannot contain another paragraph, when an
opening paragraph tag is followed by another opening paragraph tag, this implies the
previous paragraph element is now closed. Similarly, elements such as br have no
allowed content, so HTML does not require an explicit closing tag for this element.
Also, since HTML was the only specification targeted by user agents (browsers and other
HTML-consuming software), the specification even allows the omission of opening and
closing tags for html, head, and body, if the document's head has no content. To
translate from HTML to XHTML would first require the addition of any omitted closing
tags (or the use of the closing-tag shortcut for empty elements like <br/>).
Notice how XHTML’s requirement to always include explicit closing tags allows a
separation between the concepts of valid and well-formed. A well-formed XHTML
document adheres to all the syntax requirements of XML. A valid document adheres to
the content specification for XHTML; in other words, a valid document only includes
content, attributes and attribute values within each element in accord with the
specification. If a closing tag is omitted, an XHTML parser can first determine that the
document is not well-formed. Once the elements are all explicitly closed, the parser can
address the question of whether the document is also valid. For an HTML parser these
separate aspects of a document are not discernible. If a paragraph opening tag (p) is
followed by a div, is it because the document is not well-formed (the closing paragraph
tag is missing) or is the document invalid (a div does not belong in a paragraph)?
Whether coding in HTML or XHTML, it may be best to always include the optional
tags within an HTML document rather than remembering which tags can be omitted.
The W3C recommends several conventions to ensure an easy migration between HTML
and XHTML (see HTML Compatibility Guidelines). Basically the W3C recommends:
Including both xml:lang and lang attributes on any elements assigning language.
Using the self-closing tag only for elements specified as empty.
Making all tag names and attribute names lower-case.
Ensuring all attribute values are quoted with either single quotes (') or double quotes (").
Including an extra space in self-closing tags: for example <br /> instead of <br/>.
Including explicit closing tags for elements that permit content but are left empty (for
example, "<p></p>", not "<p />").
Note that by carefully following the W3C’s compatibility guidelines the difference
between the resulting HTML 4.01 document and the XHTML 1.0 document is merely the
DOCTYPE declaration, and the XML declaration preceding the document’s contents.
The W3C allows the resulting XHTML 1.0 (or any XHTML 1.0) document to be
delivered as either HTML or XHTML. For delivery as HTML, the document’s MIME
type should be set to 'text/html', while, for XHTML, the document’s MIME type should
be set to 'application/xhtml+xml'. When delivered as XHTML, browsers and other user
agents are expected to adhere strictly to the XML specifications in parsing, interpreting,
and displaying the document’s contents.
Transitional versus Strict
The latest SGML-based specification HTML 4.01 and the earliest XHTML version
include three sub-specifications: strict, transitional (also called loose), and frameset. The
difference between strict on the one hand and loose and frameset on the other is that the
strict definition tries to adhere more tightly to a presentation-free or style-free concept of
a semantic HTML. The loose standard maintains many of the various presentational
elements and attributes absent in the strict definition.
The primary differences between the transitional (loose) specification and the strict
specification (whether XHTML 1.0 or HTML 4.01) are listed below; a brief example
contrasting the two appears after the list.
A looser content model
Inline elements and character strings (#PCDATA) are allowed in: body, blockquote,
form, noscript, noframes

Presentation related elements
underline (u)
strike-through (s and strike)
center
font
basefont

Presentation related attributes
background and bgcolor attributes for body element.
align attribute on div, form, paragraph (p), and heading (h1...h6) elements

align, noshade, size, and width attributes on hr element

align, border, vspace, and hspace attributes on img and object elements

align attribute on legend and caption elements

align and bgcolor on table element

nowrap, bgcolor, width, height on td and th elements

bgcolor attribute on tr element

clear attribute on br element

compact attribute on dl, dir and menu elements

type, compact, and start attributes on ol and ul elements

type and value attributes on li element

width attribute on pre element

Additional elements in loose (transitional) specification
menu list (no substitute, though unordered list is recommended; may return in XHTML
2.0 specification)
dir list (no substitute, though unordered list is recommended)

isindex (element requires server-side support and is typically added to documents
server-side)
applet (deprecated in favor of object element)

The pre element does not allow: applet, font, and basefont (elements not defined in strict
DTD)
The language attribute on script element (presumably redundant with type attribute,
though this is maintained for legacy reasons).
Frame related entities
frameset element (used in place of body for frameset DTD)

frame element
iframe
noframes

target attribute on anchor, client-side image-map (imagemap), link, form, and base
elements
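As a brief, hypothetical contrast, markup that is valid only under the transitional DTD,
and a strict rewrite of the same content using CSS, might look like this:
    <p align="center"><font color="red">Warning!</font></p>
versus
    <p class="warning">Warning!</p>
with the stylesheet rule
    p.warning { text-align: center; color: red; }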
Frameset versus transitional
In addition to the above transitional differences, the frameset specifications (whether
XHTML 1.0 or HTML 4.01) specifies a different content model:
<html>
<head>
Any of the various head related elements.
</head>
<frameset>
At least one of either: another frameset or a frame and an optional noframes element.
</frameset>
</html>
Summary of flavors
As this list demonstrates, the loose flavors of the specification are maintained for legacy
support. However, contrary to popular misconceptions, the move to XHTML does not
imply a removal of this legacy support. Rather the X in XML stands for extensible and
the W3C is modularizing the entire specification and opening it up to independent
extensions. The primary achievement in the move from XHTML 1.0 to XHTML 1.1 is
the modularization of the entire specification. The strict version of HTML is deployed in
XHTML 1.1 through a set of modular extensions to the base XHTML 1.1 specification.
Likewise someone looking for the loose (transitional) or frameset specifications will find
similar extended XHTML 1.1 support (much of it is contained in the legacy or frame
modules). The modularization also allows for separate features to develop on their own
timetable. So, for example, XHTML 1.1 will allow quicker migration to emerging XML
standards such as MathML (a presentational and semantic math language based on XML)
and XForms — a new, highly advanced web-form technology to replace the existing
HTML forms.
In summary, the HTML 4.01 specification primarily reined in all the various HTML
implementations into a single clearly written specification based on SGML. XHTML 1.0
ported this specification, as is, to the new XML-defined specification. Next, XHTML 1.1
takes advantage of the extensible nature of XML and modularizes the whole
specification. XHTML 2.0 will be the first step in adding new features to the
specification in a standards-body-based approach.




                                       NetMeeting
Microsoft NetMeeting is a VoIP and multi-point videoconferencing client included in
many versions of Microsoft Windows (from Windows 95 OSR2 to Windows XP). It uses
the H.323 protocol for video and audio conferencing, and is interoperable with
OpenH323-based clients such as Ekiga, and with the Internet Locator Service (ILS) as a
mirror server. It also uses a slightly modified version of the ITU T.120 Protocol for
whiteboarding, application sharing, desktop sharing, remote desktop sharing (RDS) and
file transfers. The secondary Whiteboard in NetMeeting 2.1 and later utilizes the H.324
protocol.
Before video service became common on free IM clients such as Yahoo! Messenger and
MSN Messenger, NetMeeting was a popular way to perform video conferencing and
chatting over the Internet (with the help of public ILS servers).
Since the release of Windows XP, Microsoft has deprecated it in favour of Windows
Messenger, although it is still installed by default (Start > Run... > conf.exe). Note that
Windows Messenger, MSN Messenger and Windows Live Messenger hook directly into
NetMeeting for the application sharing, desktop sharing, and Whiteboard features
exposed by each application.
As of the release of Windows Vista, NetMeeting is no longer included and has been
replaced by Windows Meeting Space.


                                      Online chat
Online chat can refer to any kind of communication over the Internet, but primarily refers
to direct one-on-one chat or text-based group chat (formally also known as synchronous
conferencing), using tools such as instant messaging programs, Internet Relay Chat,
talkers and possibly MUDs, MUCKs, MUSHes and MOOs.
While many of the web's well-known custodians offer online chat and messaging services
for free, an increasing number of providers are beginning to show strong revenue streams
from paid-for services. Again, it is the adult service providers, profiting from the advent
of reliable, high-speed broadband (notably across Eastern Europe), who are at the
forefront of the paid-for online chat revolution.
For every business traveller engaging in a video call or conference call rather than
braving the check-in queue, there are countless web users replacing traditional
conversational means with online chat and messaging. Like e-mail, which has reduced the
need for letter, fax and memo communication, online chat is steadily replacing
telephony as a means of office and home communication. The early adopters in these
areas are undoubtedly teenage users of instant messaging. It might not be long before
SMS text messaging usage declines as mobile handsets provide the technology for online
chat.
Other forms of online communication that are not usually referred to as online chat
MUDs
A MUD, or multi-user dungeon, is a multi-user version of Dungeons and Dragons for the
Internet, and is an early use of the Internet. In a MUD, as well as playing the game,
people can chat with each other. Talkers were originally based on MUDs, and the earliest
versions of talkers were essentially MUDs without the gaming element. Other derivatives
of MUDs combined gaming with talking; these include MUSHes, MOOs and MUCKs.
Discussion boards
Besides real-time chat, another type of online community includes Internet forums and
bulletin board systems (BBSes), where users write posts (blocks of text) to which later
visitors may respond. Unlike the transient nature of chats, these systems generally archive
posts and save them for weeks or years. They can be used for technical troubleshooting,
advice, general conversation and more.
See also
General terms
   •    Chat room
   •    Web chat site
   •    Voice chat
   •    VoIP Voice over IP
   •    Live support software
   •    Online discussion
   •   Online discourse environment
Protocols/Programs
   •   Talker
   •   Internet Relay Chat
   •   Instant messenger
   •   PalTalk
   •   Talk (Unix)
   •   MUD
   •   MUSH
   •   MOO
   •   Google Talk
   •   Yahoo! Messenger
   •   Skype
   •   SILC
   •   Windows Live Messenger
   •   Campfire
Chat programs supporting multiple protocols
   •   Adium
   •   Gaim
   •   Miranda IM
   •   Trillian




                                        Plugins
A plugin (or plug-in) is a computer program that interacts with a main (or host)
application (a web browser or an email program, for example) to provide a certain,
usually very specific, function on-demand.
Typical examples are
   •   plugins that read or edit specific types of files (for instance, decode multimedia
       files)
   •   plugins that encrypt or decrypt e-mail (for instance, PGP)
   •   plugins that filter images in graphics programs in ways that the host application
       could not normally do
   •   plugins that play Flash presentations in a web browser
The host application provides services which the plugins can use, including a way for
plugins to register themselves with the host application and a protocol by which data is
exchanged with plugins. Plugins are dependent on these services provided by the main
application and do not usually work by themselves. Conversely, the main application is
independent of the plugins, making it possible for plugins to be added and updated
dynamically without changes to the main application.
Plugins are slightly different from extensions, which modify or add to existing
functionality. The main difference is that plugins generally rely on the main application's
user interface and have a well-defined boundary to their possible set of actions.
Extensions generally have fewer restrictions on their actions, and may provide their own
user interfaces. They sometimes are used to decrease the size of the main application and
offer optional functions. Mozilla Firefox uses a well-developed extension system to
reduce the feature creep that plagued the Mozilla Application Suite.
Perhaps the first software applications to include a plugin function were HyperCard and
QuarkXPress on the Macintosh, both released in 1987. In 1988, Silicon Beach Software
included plugin functionality in Digital Darkroom and SuperPaint, and the term plug-in
was coined by Ed Bomke. Currently, plugins are typically implemented as shared
libraries that must be installed in a place prescribed by the main application. HyperCard
supported a similar facility, but it was more common for the plugin code to be included in
the HyperCard documents (called stacks) themselves. This way, the HyperCard stack
became a self-contained application in its own right, which could be distributed as a
single entity that could be run by the user without the need for additional installation
steps.
Open application programming interfaces (APIs) provide a standard interface, allowing
third parties to create plugins that interact with the main application. A stable API allows
third-party plugins to function as the original version changes and to extend the lifecycle
of obsolete applications. The Adobe Photoshop and After Effects plugin APIs have
become a standard and been adopted to some extent by competing applications. Other
examples of such APIs include Audio Units and VST.
Examples
Many professional software packages offer plugin APIs to developers, in order to
increase the utility of the base product. Examples of these include:
   •     Eclipse
   •     GStreamer multimedia pipe handler
   •     jEdit Program Editor
   •     Quintessential Media Player, Winamp, foobar2000 and XMMS
   •     Notepad++
   •     OmniPeek packet analysis platform
   •     VST Audio Plugin Format




                                Communications protocol
In the field of telecommunications, a communications protocol is the set of standard
rules for data representation, signalling, authentication and error detection required to
send information over a communications channel. An example of a simple
communications protocol adapted to voice communication is the case of a radio
dispatcher talking to mobile stations. The communication protocols for digital computer
network communication have many features intended to ensure reliable interchange of
data over an imperfect communication channel. At its core, a communications protocol is
a set of agreed rules that the communicating parties follow so that the system works
properly.
Network protocol design principles
Systems engineering principles have been applied to create a set of common network
protocol design principles. These principles include effectiveness, reliability,
and resiliency.
Effectiveness
A protocol needs to be specified in such a way that engineers, designers, and in some cases
software developers can implement and/or use it. In human-machine systems, its design
needs to facilitate routine usage by humans. Protocol layering accomplishes these
objectives by dividing the protocol design into a number of smaller parts, each of which
performs closely related sub-tasks, and interacts with other layers of the protocol only in
a small number of well-defined ways.
Protocol layering allows the parts of a protocol to be designed and tested without a
combinatorial explosion of cases, keeping each design relatively simple. The
implementation of a sub-task on one layer can make assumptions about the behavior and
services offered by the layers beneath it. Thus, layering enables a "mix-and-match" of
protocols that permit familiar protocols to be adapted to unusual circumstances.
For an example that involves computing, consider an email protocol like the Simple Mail
Transfer Protocol (SMTP). An SMTP client can send messages to any server that
conforms to SMTP's specification. Actual applications can be (for example) an aircraft
with an SMTP server receiving messages from a ground controller over a radio-based
internet link. Any SMTP client can correctly interact with any SMTP server, because
they both conform to the same protocol specification, RFC 2821.
This paragraph informally provides some examples of layers, some required
functionalities, and some protocols that implement them, all from the realm of computing
protocols.
At the lowest level, bits are encoded in electrical, light or radio signals by the Physical
layer. Some examples include RS-232, SONET, and WiFi.
A somewhat higher Data link layer such as the point-to-point protocol (PPP) may detect
errors and configure the transmission system.
An even higher protocol may perform network functions. One very common protocol is
the Internet protocol (IP), which implements addressing for a large set of hosts and networks. A
common associated protocol is the Transmission control protocol (TCP) which
implements error detection and correction (by retransmission). TCP and IP are often
paired, giving rise to the familiar acronym TCP/IP.
A layer in charge of presentation might describe how to encode text (e.g. ASCII or Unicode).
An application protocol like SMTP may (among other things) describe how to inquire
about electronic mail messages.
These different tasks show why there's a need for a software architecture or reference
model that systematically places each task into context.
The reference model usually used for protocol layering is the OSI seven layer model,
which can be applied to any protocol, not just the OSI protocols of the International
Organization for Standardization (ISO). In particular, the Internet Protocol can be
analysed using the OSI model.
Reliability
Assuring reliability of data transmission involves error detection and correction, or some
means of requesting retransmission. It is a truism that communication media are always
faulty. The conventional measure of quality is the number of failed bits per bits
transmitted. This has the useful feature of being a dimensionless figure of merit that can
be compared across any speed or type of communication media.
In telephony, links with bit error rates (BER) of 10⁻⁴ or more are regarded as faulty (they
interfere with telephone conversations), while links with a BER of 10⁻⁵ or more should be
dealt with by routine maintenance (they can be heard).
Data transmission often requires bit error rates below 10⁻¹². Computer data transmissions
are so frequent that larger error rates would affect operations of customers like banks and
stock exchanges. Since most transmissions use networks with telephonic error rates, the
errors caused by these networks must be detected and then corrected.
Communications systems detect errors by transmitting a summary of the data with the
data. In TCP (the Internet's Transmission Control Protocol), the sum of the data bytes of the
packet is sent in each packet's header. Simple arithmetic sums do not detect out-of-order
data, or cancelling errors. A bit-wise binary polynomial, a cyclic redundancy check, can
detect these errors and more, but is slightly more expensive to calculate.
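To make the difference concrete, the sketch below compares a naive byte sum with a cyclic
redundancy check (illustrative only; TCP's real checksum is a 16-bit ones'-complement sum,
which is not shown here):

# Illustrative comparison: a plain arithmetic sum versus a CRC.
import zlib

original = b"HELLO WORLD"
reordered = b"WORLD HELLO"   # the same bytes, in a different order

# A simple sum of byte values cannot tell the two apart.
print(sum(original) == sum(reordered))                  # True  -> error undetected

# A cyclic redundancy check (CRC-32) detects the reordering.
print(zlib.crc32(original) == zlib.crc32(reordered))    # False -> error detected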
Communication systems correct errors by selectively resending bad parts of a message.
For example, in TCP when a checksum is bad, the packet is discarded. When a packet is
lost, the receiver acknowledges all of the packets up to, but not including the failed
packet. Eventually, the sender sees that too much time has elapsed without an
acknowledgement, so it resends all of the packets that have not been acknowledged. At
the same time, the sender backs off its rate of sending, in case the packet loss was caused
by saturation of the path between sender and receiver. (Note: this is an over-
simplification: see TCP and congestion collapse for more detail)
In general, the performance of TCP is severely degraded in conditions of high packet loss
(more than 0.1%), due to the need to resend packets repeatedly. For this reason, TCP/IP
connections are typically either run on highly reliable fiber networks, or over a lower-
level protocol with added error-detection and correction features (such as modem links
with ARQ). These connections typically have uncorrected bit error rates of 10⁻⁹ to 10⁻¹²,
ensuring high TCP/IP performance.
Resiliency
Resiliency addresses a form of network failure known as topological failure in which a
communications link is cut, or degrades below usable quality. Most modern
communication protocols periodically send messages to test a link. On T1 lines, for example,
a framing bit is sent with every frame of 24 channel samples. In phone systems, when "sync is lost", fail-safe
mechanisms reroute the signals around the failing equipment.
In packet switched networks, the equivalent functions are performed using router update
messages to detect loss of connectivity.
Standards organizations
Most recent protocols are designed by the IETF for Internet communications, and by the
IEEE or ISO for other types. The ITU-T handles telecommunications
protocols and formats for the public switched telephone network (PSTN). The ITU-R
handles protocols and formats for radio communications. As the PSTN, radio systems,
and the Internet converge, the different sets of standards are also being driven towards
technological convergence.
Protocol families
A number of major protocol stacks or families exist, including the following:
Open standards:
Internet protocol suite
Open Systems Interconnection (OSI)
Connection-oriented protocol
A connection-oriented networking protocol is one which identifies traffic flows by some
connection identifier rather than by explicitly listing source and destination addresses.
Typically, this connection identifier is a small integer (10 bits for Frame Relay, 24 for
ATM, for example). This makes network switches substantially faster (as routing tables
are just simple look-up tables, and are trivial to implement in hardware). The impact is so
great, in fact, that even characteristically connectionless protocols, such as IP traffic, are
being tagged with connection-oriented header prefixes (e.g., as with MPLS, or IPv6's
built-in Flow ID field).
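As a rough sketch of why small connection identifiers make switching cheap, the hypothetical
label table below forwards traffic with a single dictionary lookup (the ports and labels are
invented for illustration):

# Hypothetical label-switching table: (in_port, in_label) -> (out_port, out_label).
# Forwarding is a single table lookup, with no address parsing or
# longest-prefix matching required.
switch_table = {
    (1, 17): (3, 42),
    (2, 99): (1, 17),
}

def forward(in_port: int, in_label: int) -> tuple[int, int]:
    """Return the outgoing port and label for a labelled frame."""
    return switch_table[(in_port, in_label)]

print(forward(1, 17))   # (3, 42)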
Note that connection-oriented protocols are not necessarily reliable protocols. ATM and
Frame Relay, for example, are both connection-oriented but unreliable. There are also
reliable connectionless protocols, such as AX.25 when it passes data in I-frames, but this
combination is rare; reliable-connectionless operation is uncommon in commercial and
academic networks.
Note that connection-oriented protocols handle real-time traffic substantially more
efficiently than connectionless protocols, which is why ATM has yet to be replaced by
Ethernet for carrying real-time, isochronous traffic streams, especially in heavily
aggregated networks like backbones, where the motto "bandwidth is cheap" fails to
deliver on its promise. Experience has also shown that overprovisioning bandwidth does
not resolve all quality of service issues. Hence, (10-)gigabit Ethernet is not expected to
replace ATM at this time.
List of Connection-oriented protocols
   •   TCP
   •   Phone call (the user must dial and get an answer before transmitting data)
   •   ATM
   •   Frame Relay
Connectionless protocol
In telecommunications, connectionless describes communication between two network
end points in which a message can be sent from one end point to another without prior
arrangement. The device at one end of the communication transmits data to the other,
without first ensuring that the recipient is available and ready to receive the data. The
device sending a message simply sends it addressed to the intended recipient. As such
there are more frequent problems with transmission than with connection-oriented
protocols and it may be necessary to resend the data several times. Connectionless
protocols are often disfavoured by network administrators because it is much harder to
filter malicious packets from a connectionless protocol using a firewall. The Internet
Protocol (IP) and User Datagram Protocol (UDP) are connectionless protocols, but
TCP/IP (the most common use of IP) is connection-oriented.
Connectionless protocols are usually described as stateless because the endpoints have no
protocol-defined way to remember where they are in a "conversation" of message
exchanges. The alternative to the connectionless approach uses connection-oriented
protocols, which are sometimes described as stateful because they can keep track of a
conversation.
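For instance, sending a UDP datagram in Python involves no handshake at all; the sketch below
(the address and port are placeholders) simply addresses the message and sends it, whether or
not anything is listening:

# Minimal connectionless send: no connection is established before transmitting.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)   # UDP socket
sock.sendto(b"hello", ("127.0.0.1", 9999))                # fire and forget
sock.close()
# Whether the datagram arrives at all, or arrives in order, is not
# guaranteed by the protocol itself.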
List of Connectionless protocols
   •   IP
   •   UDP
   •   ICMP
   •   IPX
Protocol (computing)
In computing, a protocol is a convention or standard that controls or enables the
connection, communication, and data transfer between two computing endpoints. In its
simplest form, a protocol can be defined as the rules governing the syntax, semantics,
and synchronization of communication. Protocols may be implemented by hardware,
software, or a combination of the two. At the lowest level, a protocol defines the behavior
of a hardware connection.
   Typical properties
It is difficult to generalize about protocols because they vary so greatly in purpose and
sophistication. Most protocols specify one or more of the following properties:
   •   Detection of the underlying physical connection (wired or wireless), or the existence of the other endpoint or node
   •   Handshaking
   •   Negotiation of various connection characteristics
   •   How to start and end a message
   •   How to format a message
   •   What to do with corrupted or improperly formatted messages (error correction)
   •   How to detect unexpected loss of the connection, and what to do next
   •   Termination of the session or connection
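As a small, hypothetical illustration of the "how to start and end a message" and "how to
format a message" properties, the sketch below frames each message with a 4-byte length
prefix, one common convention rather than any particular standard:

# Length-prefixed framing: each message is preceded by its length as a
# 4-byte big-endian integer, so the receiver knows where the message ends.
import struct

def frame(payload: bytes) -> bytes:
    """Prepend a 4-byte length header to the payload."""
    return struct.pack("!I", len(payload)) + payload

def unframe(data: bytes) -> bytes:
    """Recover the payload, rejecting truncated or malformed messages."""
    (length,) = struct.unpack("!I", data[:4])
    payload = data[4:4 + length]
    if len(payload) != length:
        raise ValueError("truncated or improperly formatted message")
    return payload

wire = frame(b"GET /index.html")
print(unframe(wire))   # b'GET /index.html'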
Importance
The widespread use and expansion of communications protocols is both a prerequisite to
the Internet, and a major contributor to its power and success. The pair of Internet
Protocol (or IP) and Transmission Control Protocol (or TCP) are the most important of
these, and the term TCP/IP refers to a collection (or protocol suite) of its most used
protocols. Most of the Internet's communication protocols are described in the RFC
documents of the Internet Engineering Task Force (or IETF).
Object-oriented programming has extended the use of the term to include the
programming protocols available for connections and communication between objects.
Generally, only the simplest protocols are used alone. Most protocols, especially in the
context of communications or networking, are layered together into protocol stacks where
the various tasks listed above are divided among different protocols in the stack.
Whereas the protocol stack denotes a specific combination of protocols that work
together, the Reference Model is a software architecture that lists each layer and the
services each should offer. The classic seven-layer reference model is the OSI model,
which is used for conceptualizing protocol stacks and peer entities. This reference model
also provides an opportunity to teach more general software engineering concepts like
information hiding, modularity, and delegation of tasks. This model has endured in spite of the
demise of many of its protocols (and protocol stacks) originally sanctioned by the ISO.
The OSI model is not the only reference model however.
Common Protocols
   •   HTTP (Hyper Text Transfer Protocol)
   •   POP3 (Post Office Protocol 3)
   •   SMTP (Simple Mail Transfer Protocol)
   •   FTP (File Transfer Protocol)
   •   IP (Internet Protocol)
   •   DHCP (Dynamic Host Configuration Protocol)
   •   IMAP (Internet Message Access Protocol)




                                      Search Engine
A search engine is an information retrieval system designed to help find information
stored on a computer system, such as on the World Wide Web, inside a corporate or
proprietary network, or in a personal computer. The search engine allows one to ask for
content meeting specific criteria (typically those containing a given word or phrase) and
retrieves a list of items that match those criteria. This list is often sorted with respect to
some measure of relevance of the results. Search engines use regularly updated indexes to
operate quickly and efficiently.
Without further qualification, search engine usually refers to a Web search engine, which
searches for information on the public Web. Other kinds of search engine are enterprise
search engines, which search on intranets, personal search engines, and mobile search
engines. Different selection and relevance criteria may apply in different environments,
or for different uses.
Some search engines also mine data available in newsgroups, databases, or open
directories. Unlike Web directories, which are maintained by human editors, search
engines operate algorithmically or are a mixture of algorithmic and human input.
How search engines work
A search engine operates in the following order:
   •   Web crawling
   •   Indexing
   •   Searching
A web crawler (also known as a Web spider or Web robot) is a program or automated
script which browses the World Wide Web in a methodical, automated manner. Other
less frequently used names for Web crawlers are ants, automatic indexers, bots, and
worms (Kobayashi and Takeda, 2000).
This process is called Web crawling or spidering. Many legitimate sites, in particular
search engines, use spidering as a means of providing up-to-date data. Web crawlers are
mainly used to create a copy of all the visited pages for later processing by a search
engine, which will index the downloaded pages to provide fast searches. Crawlers can also
be used for automating maintenance tasks on a Web site, such as checking links or
validating HTML code. Also, crawlers can be used to gather specific types of information
from Web pages, such as harvesting e-mail addresses (usually for spam).
A Web crawler is one type of bot, or software agent. In general, it starts with a list of
URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the
hyperlinks in the page and adds them to the list of URLs to visit, called the crawl
frontier. URLs from the frontier are recursively visited according to a set of policies.
Web crawler architectures
[Figure: High-level architecture of a standard Web crawler]
A crawler must not only have a good crawling strategy, but it should also have a highly
optimized architecture.
Shkapenyuk and Suel (Shkapenyuk and Suel, 2002) noted that: "While it is fairly easy to
build a slow crawler that downloads a few pages per second for a short period of time,
building a high-performance system that can download hundreds of millions of pages
over several weeks presents a number of challenges in system design, I/O and network
efficiency, and robustness and manageability."
Web crawlers are a central part of search engines, and details on their algorithms and
architecture are kept as business secrets. When crawler designs are published, there is
often an important lack of detail that prevents others from reproducing the work. There
are also emerging concerns about "search engine spamming", which prevent major search
engines from publishing their ranking algorithms.


Search engine indexing entails how data is collected, parsed, and stored to facilitate fast and
accurate retrieval. Index design incorporates interdisciplinary concepts from linguistics,
cognitive psychology, mathematics, informatics, physics, and computer science. An
alternate name for the process is Web indexing, within the context of search engines
designed to find web pages on the Internet.
Popular engines focus on the full-text indexing of online, natural language documents,
yet there are other searchable media types such as video, audio, and graphics. Meta
search engines reuse the indices of other services and do not store a local index, whereas
cache-based search engines permanently store the index along with the corpus. Unlike
full text indices, partial text services restrict the depth indexed to reduce index size.
Larger services typically perform indexing at a predetermined interval due to the required
time and processing costs, whereas agent-based search engines index in real time.


   Indexing
The goal of storing an index is to optimize the speed and performance of finding relevant
documents for a search query. Without an index, the search engine would scan every
document in the corpus, which would take a considerable amount of time and computing
power. For example, an index of 1000 documents can be queried within milliseconds,
whereas a raw scan of 1000 documents could take hours. No search engine user would be
comfortable waiting several hours to get search results. The trade off for the time saved
during retrieval is that additional storage is required to store the index and that it takes a
considerable amount of time to update.
Index Design Factors
Major factors in designing a search engine's architecture include:
   •   Merge factors - how data enters the index, or how words or subject features are added to the index during corpus traversal, and whether multiple indexers can work asynchronously. The indexer must first check whether it is updating old content or adding new content. Traversal typically correlates to the data collection policy. Search engine index merging is similar in concept to the SQL Merge command and other merge algorithms.
   •   Storage techniques - how to store the index data - whether information should be compressed or filtered
   •   Index size - how much computer storage is required to support the index
   •   Lookup speed - how quickly a word can be found in the inverted index. How quickly an entry in a data structure can be found, versus how quickly it can be updated or removed, is a central focus of computer science
   •   Maintenance - maintaining the index over time
   •   Fault tolerance - how important it is for the service to be reliable, how to deal with index corruption, whether bad data can be treated in isolation, dealing with bad hardware, partitioning schemes such as hash-based or composite partitioning, data replication
Index Data Structures
Search engine architectures vary in how indexing is performed and in index storage to
meet the various design factors. Types of indices include:
   •   Suffix trees - figuratively structured like a tree, supporting linear time lookup. Built by storing the suffixes of words. Used for searching for patterns in DNA sequences and clustering. A major drawback is that storing a word in the tree may require more space than storing the word itself. An alternate representation is a suffix array, which is considered to require less memory and supports compression such as BWT.
   •   Tries - an ordered tree data structure that is used to store an associative array where the keys are strings. Regarded as faster than a hash table, but less space efficient. The suffix tree is a type of trie. Tries support extendible hashing, which is important for search engine indexing.
   •   Inverted indices - store a list of occurrences of each atomic search criterion, typically in the form of a hash table or binary tree.
   •   Citation indices - store the existence of citations or hyperlinks between documents to support citation analysis, a subject of bibliometrics.
   •   Ngram indices - for storing sequences of n length of data to support other types of retrieval or text mining.
   •   Term document matrices - used in latent semantic analysis; store the occurrences of words in documents in a two-dimensional sparse matrix.
Challenges in Parallelism
A major challenge in the design of search engines is the management of parallel
processes. There are many opportunities for race conditions and coherence faults. For
example, a new document is added to the corpus and the index must be updated, but the
index simultaneously needs to continue responding to search queries. This is a collision
between two competing tasks. Consider that authors are producers of information, and a
crawler is the consumer of this information, grabbing the text and storing it in a cache (or
corpus). The forward index is the consumer of the information produced by the corpus,
and the inverted index is the consumer of information produced by the forward index.
This is commonly referred to as a producer-consumer model. The indexer is the
producer of searchable information and users are the consumers that need to search. The
challenge is magnified when working with distributed storage and distributed processing.
In an effort to scale with larger amounts of indexed information, the search engine's
architecture may involve distributed computing, where the search engine consists of
several machines operating in unison. This increases the possibilities for incoherency and
makes it more difficult to maintain a fully-synchronized, distributed, parallel architecture.
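A highly simplified sketch of that producer-consumer idea, using a thread-safe queue (purely
illustrative; real indexers coordinate far more state than this):

# Toy producer-consumer pipeline: a "crawler" thread produces documents and
# an "indexer" thread consumes them to build a forward index.
import queue
import threading

docs = queue.Queue()
forward_index: dict[int, list[str]] = {}

def crawler():
    for doc_id, text in enumerate(["the cow says moo", "the cat and the hat"]):
        docs.put((doc_id, text))   # produce
    docs.put(None)                 # sentinel: no more documents

def indexer():
    while (item := docs.get()) is not None:
        doc_id, text = item        # consume
        forward_index[doc_id] = text.split()

threads = [threading.Thread(target=crawler), threading.Thread(target=indexer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(forward_index[0])   # ['the', 'cow', 'says', 'moo']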
Inverted indices
Many search engines incorporate an inverted index when evaluating a search query to
quickly locate the documents which contain the words in a query and rank these
documents by relevance. The inverted index stores a list of the documents for each word.
The search engine can retrieve the matching documents quickly using direct access to
find the documents for a word. The following is a simplified illustration of the inverted
index:
             Inverted Index
             Word Documents
             the Document 1, Document 3, Document 4, Document 5
             cow Document 2, Document 3, Document 4
             says Document 5
             moo Document 7
The above figure is a simplified form of a Boolean index. Such an index would only
serve to determine whether a document matches a query, but would not contribute to
ranking matched documents. In some designs the index includes additional information
such as the frequency of each word in each document or the positions of the word in each
document. With position, the search algorithm can identify word proximity to support
searching for phrases. Frequency can be used to help in ranking the relevance of
documents to the query. Such topics are the central research focus of information
retrieval.
The inverted index is a sparse matrix given that words are not present in each document.
It is stored differently than a two dimensional array to reduce memory requirements. The
index is similar to the term document matrices employed by latent semantic analysis. The
inverted index can be considered a form of a hash table. In some cases the index is a form
of a binary tree, which requires additional storage but may reduce the lookup time. In
larger indices the architecture is typically distributed. Inverted indices can be
programmed in several computer programming languages.
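A minimal sketch of an inverted index in its hash-table form, built from a few toy documents,
with a Boolean AND query expressed as a set intersection over the per-word document lists:

# A tiny inverted index stored as a hash table (dict): word -> set of documents.
from collections import defaultdict

documents = {
    "Document 1": "the cow says moo",
    "Document 2": "the cat and the hat",
    "Document 3": "the dish ran away with the spoon",
}

inverted = defaultdict(set)
for name, text in documents.items():
    for word in text.split():
        inverted[word].add(name)

# Documents containing both "the" and "cow": a Boolean AND is a set
# intersection over the two posting lists, found by direct access.
print(sorted(inverted["the"] & inverted["cow"]))   # ['Document 1']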
Index Merging
The inverted index is filled via a merge or rebuild. A rebuild is similar to a merge but
first deletes the contents of the inverted index. The architecture may be designed to
support incremental indexing, where a merge involves identifying the document or
documents to add into or update in the index and parsing each document into words. For
technical accuracy, a merge involves the unison of newly indexed documents, typically
residing in virtual memory, with the index cache residing on one or more computer hard
drives.
After parsing, the indexer adds the containing document to the document list for the
appropriate words. The process of finding each word in the inverted index in order to
denote that it occurred within a document may be too time consuming when designing a
larger search engine, and so this process is commonly split up into the development of a
forward index and the process of sorting the contents of the forward index for entry into
the inverted index. The inverted index is named inverted because it is an inversion of the
forward index.
The Forward Index
The forward index stores a list of words for each document. The following is a simplified
form of the forward index:
                    Forward Index
                    Document Words
                    Document 1 the,cow,says,moo
                    Document 2 the,cat,and,the,hat
                    Document 3 the,dish,ran,away,with,the,spoon
The rationale behind developing a forward index is that as documents are parsed, it is
better to immediately store the words per document. The delineation enables
asynchronous processing, which partially circumvents the inverted index update
bottleneck. The forward index is sorted to transform it into an inverted index. The forward
index is essentially a list of pairs consisting of a document and a word, collated by the
document. Converting the forward index to an inverted index is only a matter of sorting
the pairs by the words. In this regard, the inverted index is a word-sorted forward index.
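A compact sketch of that conversion: collect (word, document) pairs from the forward index,
sort them by word, then group them back into per-word document lists.

# Convert a forward index into an inverted index by sorting (word, document) pairs.
from itertools import groupby

forward_index = {
    "Document 1": ["the", "cow", "says", "moo"],
    "Document 2": ["the", "cat", "and", "the", "hat"],
    "Document 3": ["the", "dish", "ran", "away", "with", "the", "spoon"],
}

# 1. Flatten the forward index into (word, document) pairs.
pairs = [(word, doc) for doc, words in forward_index.items() for word in words]

# 2. Sort by word: this is the "inversion" step.
pairs.sort()

# 3. Group the sorted pairs to obtain each word's document list.
inverted_index = {word: sorted({doc for _, doc in group})
                  for word, group in groupby(pairs, key=lambda p: p[0])}

print(inverted_index["the"])   # ['Document 1', 'Document 2', 'Document 3']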
Compression
Generating or maintaining a large-scale search engine index represents a significant
storage and processing challenge. Many search engines utilize a form of compression to
reduce the size of the indices on disk. Consider the following scenario for a full text,
Internet, search engine.
   •   An estimated 2,000,000,000 different web pages exist as of the year 2000
   •   A fictitious estimate of 250 words per webpage on average, based on the assumption of being similar to the pages of a novel
   •   It takes 8 bits (or 1 byte) to store a single character; some encodings use 2 bytes per character
   •   The average number of characters in any given word on a page can be estimated at 5 (Wikipedia:Size comparisons)
   •   The average personal computer comes with about 20 gigabytes of usable space
Given these estimates, generating an uncompressed index (assuming a non-conflated,
simple index) for 2 billion web pages would need to store 500 billion word entries. At 1
byte per character, or 5 bytes per word, this would require 2,500 gigabytes of storage space
alone, far more than the average free disk space of a personal computer. This space is
further increased in the case of a distributed storage architecture that is fault-tolerant.
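Restating that estimate as a quick calculation from the assumptions listed above:

# Back-of-the-envelope index size, using the estimates listed above.
pages = 2_000_000_000        # web pages
words_per_page = 250         # average words per page
bytes_per_word = 5           # about 5 characters per word at 1 byte each

word_entries = pages * words_per_page        # 500,000,000,000 entries
size_bytes = word_entries * bytes_per_word   # 2,500,000,000,000 bytes
print(word_entries)                          # 500000000000 (500 billion)
print(size_bytes / 10**9)                    # 2500.0 gigabytes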
Using compression, the index size can be reduced to a portion of its size, depending on
which compression techniques are chosen. The trade off is the time and processing power
required to perform compression.
Notably, large scale search engine designs incorporate the cost of storage, and the costs
of electricity to power the storage. Compression, in this regard, is a measure of cost as
well.
Document Parsing
Document parsing involves breaking apart the components (words) of a document or
other form of media for insertion into the forward and inverted indices. For example, if
the full contents of a document consisted of the sentence "Hello World", there would
typically be two words found, the token "Hello" and the token "World". In the context of
search engine indexing and natural language processing, parsing is more commonly
referred to as tokenization, and sometimes word boundary disambiguation, tagging, Text
segmentation, Content analysis, text analysis, Text mining, Concordance generation,
Speech segmentation, lexing, or lexical analysis. The terms 'indexing', 'parsing', and
'tokenization' are used interchangeably in corporate slang.
Natural language processing, as of 2006, is the subject of continuous research and
technological improvement. There are a host of challenges in tokenization, in extracting
the necessary information from documents for indexing to support quality searching.
Tokenization for indexing involves multiple technologies, the implementation of which
are commonly kept as corporate secrets.
Challenges in Natural Language Processing
Word Boundary Ambiguity - native English speakers can at first consider tokenization to
be a straightforward task, but this is not the case with designing a multilingual indexer. In
digital form, the text of other languages such as Chinese, Japanese or Arabic represent a
greater challenge as words are not clearly delineated by whitespace. The goal during
tokenization is to identify words for which users will search. Language specific logic is
employed to properly identify the boundaries of words, which is often the rationale for
designing a parser for each language supported (or for groups of languages with similar
boundary markers and syntax).
Language Ambiguity - to assist with properly ranking matching documents, many search
engines collect additional information about words, such as their language or lexical
category (part of speech). These techniques are language-dependent as the syntax varies
among languages. Documents do not always clearly identify the language of the
document or represent it accurately. In tokenizing the document, some search engines
attempt to automatically identify the language of the document.
Diverse File Formats - in order to correctly identify what bytes of a document represent
characters, the file format must be correctly handled. Search engines which support
multiple file formats must be able to correctly open and access the document and be able
to tokenize the characters of the document.
Faulty Storage - the quality of the natural language data is not always assumed to be
perfect. An unspecified number of documents, particularly on the Internet, do not always
closely obey proper file protocol. Binary characters may be mistakenly encoded into
various parts of a document. Without recognition of these characters and appropriate
handling, the index quality or indexer performance could degrade.
Tokenization
Unlike literate human adults, computers are not inherently aware of the structure of a
natural language document and do not instantly recognize words and sentences. To a
computer, a document is only a big sequence of bytes. Computers do not know that a
space character between two sequences of characters means that there are two separate
words in the document. Instead, a computer program is developed by humans which
trains the computer, or instructs the computer, how to identify what constitutes an
individual or distinct word, referred to as a token. This program is commonly referred to
as a tokenizer or parser or lexer. Many search engines, as well as other natural language
processing software, incorporate specialized programs for parsing, such as YACC or Lex.
During tokenization, the parser identifies sequences of characters, which typically
represent words. Commonly recognized tokens include punctuation, sequences of
numerical characters, alphabetical characters, alphanumerical characters, binary
characters (backspace, null, print, and other antiquated print commands), whitespace
(space, tab, carriage return, line feed), and entities such as email addresses, phone
numbers, and URLs. When identifying each token, several characteristics may be stored
such as the token's case (upper, lower, mixed, proper), language or encoding, lexical
category (part of speech, like 'noun' or 'verb'), position, sentence number, sentence
position, length, and line number.
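A toy tokenizer along these lines, written as a regular-expression sketch (far simpler than
production lexers such as Lex, and covering only a small subset of the token classes listed
above):

# A naive tokenizer: classify runs of characters and record basic attributes.
import re

TOKEN_PATTERN = re.compile(
    r"(?P<email>[\w.+-]+@[\w-]+\.[\w.]+)"   # e-mail addresses
    r"|(?P<url>https?://\S+)"               # URLs
    r"|(?P<number>\d+)"                     # numeric sequences
    r"|(?P<word>[A-Za-z]+)"                 # alphabetic words
)

def tokenize(text):
    for position, match in enumerate(TOKEN_PATTERN.finditer(text)):
        token = match.group()
        yield {
            "token": token,
            "kind": match.lastgroup,   # word, number, url or email
            "case": "upper" if token.isupper()
                    else "proper" if token.istitle() else "lower",
            "position": position,
        }

for t in tokenize("Contact ada@example.org or visit https://example.org room 42"):
    print(t)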
Language Recognition
If the search engine supports multiple languages, a common initial step during
tokenization is to identify each document's language, given that many of the later steps
are language dependent (such as stemming and part of speech tagging). Language
recognition is the process by which a computer program attempts to automatically
identify, or categorize, the language of a document. Other names for language
recognition include language classification, language analysis, language identification,
and language tagging. Automated language recognition is the subject of ongoing research
in natural language processing. Finding which language the words belong to may involve
the use of a language recognition chart.
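A deliberately naive sketch of language recognition by counting a handful of common function
words (the tiny word lists are placeholders; real recognizers rely on character n-gram
statistics or trained classifiers):

# Guess a document's language by counting a few very common function words.
STOPWORDS = {
    "english": {"the", "and", "of", "to", "in"},
    "german":  {"der", "die", "und", "das", "ist"},
    "french":  {"le", "la", "et", "les", "des"},
}

def guess_language(text: str) -> str:
    words = text.lower().split()
    scores = {lang: sum(w in vocab for w in words)
              for lang, vocab in STOPWORDS.items()}
    return max(scores, key=scores.get)

print(guess_language("the cow says moo and the cat says nothing"))   # english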
Format Analysis
Depending on whether the search engine supports multiple document formats, documents
must be prepared for tokenization. The challenge is that many document formats contain,
in addition to textual content, formatting information. For example, HTML documents
contain HTML tags, which specify formatting information, like whether to start a new
line, or display a word in bold, or change the font size or family. If the search engine
were to ignore the difference between content and markup, extraneous markup would also be
included in the index, leading to poor search results. Format analysis involves the
identification and handling of formatting content embedded within documents which
control how the document is rendered on a computer screen or interpreted by a software
program. Format analysis is also referred to as structure analysis, format parsing, tag
stripping, format stripping, text normalization, text cleaning, or text preparation. The
challenge of format analysis is further complicated by the intricacies of various file
formats. Certain file formats are proprietary and very little information is disclosed, while
others are well documented. Common, well-documented file formats that many search
engines support include:
   •   Microsoft Word
   •   Microsoft Excel
   •   Microsoft PowerPoint
   •   IBM Lotus Notes
   •   HTML
   •   ASCII text files (a text document without any formatting)
   •   Adobe's Portable Document Format (PDF)
   •   PostScript (PS)
   •   LaTeX
   •   The UseNet archive (NNTP) and other deprecated bulletin board formats
   •   XML and derivatives like RSS
   •   SGML (this is more of a general protocol)
   •   Multimedia metadata formats like ID3
Techniques for dealing with various formats include:
   •   Using a publicly available commercial parsing tool that is offered by the organization which developed, maintains, or owns the format
   •   Writing a custom parser
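As a minimal illustration of the "custom parser" route for HTML, the standard-library sketch
below keeps only the text content and discards the markup (real format analysis does
considerably more, such as tracking hidden elements):

# Strip HTML markup, keeping only the text content for later tokenization.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for the text between tags; the tags themselves are ignored.
        if data.strip():
            self.chunks.append(data.strip())

extractor = TextExtractor()
extractor.feed("<html><body><h1>Hello</h1><p>Search <b>engine</b> indexing.</p></body></html>")
print(" ".join(extractor.chunks))   # Hello Search engine indexing.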
Some search engines support inspection of files that are stored in a compressed, or
encrypted, file format. If working with a compressed format, then the indexer first
decompresses the document, which may result in one or more files, each of which must
be indexed separately. Commonly supported compressed file formats include:
   •   ZIP - Zip File
   •   RAR - Archive File
   •   CAB - Microsoft Windows Cabinet File
   •   Gzip - Gzip file
   •   BZIP - Bzip file
   •   TAR, GZ, and TAR.GZ - Unix Gzip'ped archives
Format analysis can involve quality improvement methods to avoid including 'bad
information' in the index. Content authors can manipulate the formatting information to
include additional content. Examples of abusing document formatting for spamdexing:
   •   Including hundreds or thousands of words in a section which is hidden from view on the computer screen, but visible to the indexer, by use of formatting (e.g. a hidden "div" tag in HTML, which may incorporate the use of CSS or JavaScript to do so).
   •   Setting the foreground font color of words to the same as the background color, making words hidden on the computer screen to a person viewing the document, but not hidden to the indexer.
Section Recognition
Some search engines incorporate section recognition, the identification of major parts of
a document, prior to tokenization. Not all the documents in a corpus read like a well-
written book, divided into organized chapters and pages. Many documents on the web
contain erroneous content and side-sections which do not contain primary material, that
which the document is about, such as newsletters and corporate reports. For example, this
article may display a side menu with words inside links to other web pages. Some file
formats, like HTML or PDF, allow for content to be displayed in columns. Even though
the content is displayed, or rendered, in different areas of the view, the raw markup
content may store this information sequentially. Words that appear in the raw source
content sequentially are indexed sequentially, even though these sentences and
paragraphs are rendered in different parts of the computer screen. If search engines index
this content as if it were normal content, a dilemma ensues where the quality of the index
is degraded and search quality is degraded due to the mixed content and improper word
proximity. Two primary problems are noted:
   •   Content in different sections is treated as related in the index, when in reality it is not
   •   Organizational 'side bar' content is included in the index, but the side bar content does not contribute to the meaning of the document, and the index is filled with a poor representation of its documents, assuming the goal is to go after the meaning of each document, a sub-goal of providing quality search results.
Section analysis may require the search engine to implement the rendering logic of each
document, essentially an abstract representation of the actual document, and then index
the representation instead. For example, some content on the Internet is rendered via
JavaScript. Viewers of web pages in web browsers see this content. If the search engine
does not render the page and evaluate the JavaScript within the page, it would not 'see' this
content in the same way, and would index the document incorrectly. Given that some search
engines do not bother with rendering issues, many web page designers avoid displaying
content via JavaScript or use the noscript tag to ensure that the web page is indexed properly.
  • 5. 5 Markup is the process of taking ordinary text and adding extra symbols. Each of the symbols used for markup in HTML is a command that tells a browser how to display the text. History of HTML Tim Berners-Lee created the original HTML (and many associated protocols such as HTTP) on a NeXTcube workstation using the NeXTSTEP development environment. At the time, HTML was not a specification, but a collection of tools to solve an immediate problem: the communication and dissemination of ongoing research among Berners-Lee and a group of his colleagues. His solution later combined with the emerging international and public internet to garner worldwide attention. Early versions of HTML were defined with loose syntactic rules, which helped its adoption by those unfamiliar with web publishing. Web browsers commonly made assumptions about intent and proceeded with rendering of the page. Over time, as the use of authoring tools increased, the trend in the official standards has been to create an increasingly strict language syntax. However, browsers still continue to render pages that are far from valid HTML. HTML is defined in formal specifications that were developed and published throughout the 1990s, inspired by Tim Berners-Lee's prior proposals to graft hypertext capability onto a homegrown SGML-like markup language for the Internet. The first published specification for a language called HTML was drafted by Berners-Lee with Dan Connolly, and was published in 1993 by the IETF as a formal "application" of SGML (with an SGML Document Type Definition defining the grammar). The IETF created an HTML Working Group in 1994 and published HTML 2.0 in 1995, but further development under the auspices of the IETF was stalled by competing interests. Since 1996, the HTML specifications have been maintained, with input from commercial software vendors, by the World Wide Web Consortium (W3C).[1] However, in 2000, HTML also became an international standard (ISO/IEC 15445:2000). The last HTML specification published by the W3C is the HTML 4.01 Recommendation, published in late 1999 and its issues and errors were last acknowledged by errata published in 2001. Since the publication of HTML 4.0 in late 1997, the W3C's HTML Working Group has increasingly — and from 2002 through 2006, exclusively — focused on the development of XHTML, an XML-based counterpart to HTML that is described on one W3C web page as HTML's "successor".[2][3][4] XHTML applies the more rigorous, less ambiguous syntax requirements of XML to HTML to make it easier to process and extend, and as support for XHTML has increased in browsers and tools, it has been embraced by many web standards advocates in preference to HTML. XHTML is routinely characterized by mass-media publications for both general and technical audiences as the newest "version" of HTML, but W3C publications, as of 2006, do not make such a claim; neither HTML 3.2 nor HTML 4.01 have been explicitly rescinded, deprecated, or superseded by any W3C publications, and, as of 2006, they continue to be listed alongside XHTML as current Recommendations in the W3C's primary publication indices.[5][6][7]
  • 6. 6 In November 2006, the HTML Working Group published a new charter indicating its intent to resume development of HTML in a manner that unifies HTML 4 and XHTML 1, allowing for this hybrid language to manifest in both an XML format and a "classic HTML" format that is SGML-compatible but not strictly SGML-based. Among other things, it is planned that the new specification, to be released and refined throughout 2007 through 2008, will include conformance and parsing requirements, DOM APIs, and new widgets and APIs. The group also intends to publish test suites and validation tools.[8] Version history of the standard HTML Character encodings Dynamic HTML Font family HTML editor HTML element HTML scripting Layout engine comparison Style Sheets Unicode and HTML W3C Web browsers comparison Web colors XHTML This box: view • talk • edit Hypertext Markup Language (First Version), published June 1993 as an Internet Engineering Task Force (IETF) working draft (not standard). HTML 2.0, published November 1995 as IETF RFC 1866, supplemented by RFC 1867 (form-based file upload) that same month, RFC 1942 (tables) in May 1996, RFC 1980 (client-side image maps) in August 1996, and RFC 2070 (internationalization) in January 1997; ultimately all were declared obsolete/historic by RFC 2854 in June 2000. HTML 3.2, published January 14, 1997 as a W3C Recommendation. HTML 4.0, published December 18, 1997 as a W3C Recommendation. It offers three "flavors": Strict, in which deprecated elements are forbidden Transitional, in which deprecated elements are allowed Frameset, in which mostly only frame related elements are allowed HTML 4.01, published December 24, 1999 as a W3C Recommendation. It offers the same three flavors as HTML 4.0, and its last errata was published May 12, 2001. ISO/IEC 15445:2000 ("ISO HTML", based on HTML 4.01 Strict), published May 15, 2000 as an ISO/IEC international standard. HTML 4.01 and ISO/IEC 15445:2000 are the most recent and final versions of HTML. XHTML is a separate language that began as a reformulation of HTML 4.01 using XML 1.0. It continues to be developed:
  • 7. 7 XHTML 1.0, published January 26, 2000 as a W3C Recommendation, later revised and republished August 1, 2002. It offers the same three flavors as HTML 4.0 and 4.01, reformulated in XML, with minor restrictions. XHTML 1.1, published May 31, 2001 as a W3C Recommendation. It is based on XHTML 1.0 Strict, but includes minor changes and is reformulated using modules from Modularization of XHTML, which was published April 10, 2001 as a W3C Recommendation. XHTML 2.0 is still a W3C Working Draft. There is no official standard HTML 1.0 specification because there were multiple informal HTML standards at the time. Berners-Lee's original version did not include an IMG element type. Work on a successor for HTML, then called "HTML+", began in late 1993, designed originally to be "A superset of HTML…which will allow a gradual rollover from the previous format of HTML". The first formal specification was therefore given the version number 2.0 in order to distinguish it from these unofficial "standards". Work on HTML+ continued, but it never became a standard. The HTML 3.0 standard was proposed by the newly formed W3C in March 1995, and provided many new capabilities such as support for tables, text flow around figures, and the display of complex math elements. Even though it was designed to be compatible with HTML 2.0, it was too complex at the time to be implemented, and when the draft expired in September 1995, work in this direction was discontinued due to lack of browser support. HTML 3.1 was never officially proposed, and the next standard proposal was HTML 3.2 (code-named "Wilbur"), which dropped the majority of the new features in HTML 3.0 and instead adopted many browser-specific element types and attributes which had been created for the Netscape and Mosaic web browsers. Math support as proposed by HTML 3.0 finally came about years later with a different standard, MathML. HTML 4.0 likewise adopted many browser-specific element types and attributes, but at the same time began to try to "clean up" the standard by marking some of them as deprecated, and suggesting they not be used. Minor editorial revisions to the HTML 4.0 specification were published as HTML 4.01. The most common filename extension for files containing HTML is .html. However, older operating systems and filesystems, such as the DOS versions from the 80's and early 90's and FAT, limit file extensions to three letters, so a .htm extension is also used. Although perhaps less common now, the shorter form is still widely supported by current software. HTML as a hypertext format HTML is the basis of a comparatively weak hypertext implementation. Earlier hypertext systems had features such as typed links, transclusion and source tracking. Another feature lacking today is fat links.[9] Even some hypertext features that were in early versions of HTML have been ignored by most popular web browsers until recently, such as the link element and editable web pages.
  • 8. 8 Sometimes web services or browser manufacturers remedy these shortcomings. For instance, members of the modern social software landscape such as wikis and content management systems allow surfers to edit the web pages they visit. HTML markup HTML markup consists of several types of entities, including: elements, attributes, data types and character references. The Document Type Definition In order to enable Document Type Definition (DTD)-based validation with SGML tools and in order to avoid the Quirks mode in browsers, all HTML documents should start with a Document Type Declaration (informally, a "DOCTYPE"). The DTD contains machine readable grammar specifying the permitted and prohibited content for a document conforming to such a DTD. Browsers do not read the DTD, however. Browsers only look at the doctype in order to decide the layout mode. Not all doctypes trigger the Standards layout mode avoiding the Quirks mode. For example: <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"> This declaration references the Strict DTD of HTML 4.01, which does not have presentational elements like <font>, leaving formatting to Cascading Style Sheets. SGML-based validators read the DTD in order to properly parse the document and to perform validation. In modern browsers, the HTML 4.01 Strict doctype activates the Standards layout mode for CSS as opposed to the Quirks mode. In addition, HTML 4.01 provides Transitional and Frameset DTDs. The Transitional DTD was intended to gradually phase in the changes made in the Strict DTD, while the Frameset DTD was intended for those documents which contained frames. [edit] Elements See HTML elements for more detailed descriptions. Elements are the basic structure for HTML markup. Elements have two basic properties: attributes and content. Each attribute and each element's content has certain restrictions that must be followed for an HTML document to be considered valid. An element usually has a start label (eg. <label>) and an end label (eg. </label>). The element's attributes are contained in the start label and content is located between the labels (eg. <label>Content</label>). Some elements, such as <br>, will never have any content and do not need closing labels. Listed below are several types of markup elements used in HTML. Structural markup describes the purpose of text. For example, <h2>Golf</h2> establishes "Golf" as a second-level heading, which would be rendered in a browser in a manner similar to the "Markup element types" title at the start of this section. A blank line is included after the header. Structural markup does not denote any specific rendering, but most web browsers have standardized on how elements should be formatted. Further styling should be done with Cascading Style Sheets (CSS). Presentational markup describes the appearance of the text, regardless of its function. For example <b>boldface</b> indicates that visual output devices should render
  • 9. 9 "boldface" in bold text, but has no clear semantics for aural devices that read the text aloud for the sight-impaired. In the case of both <b>bold</b> and <i>italic</i> there are elements which usually have an equivalent visual rendering but are more semantic in nature, namely <strong>strong emphasis</strong> and <em>emphasis</em> respectively. It is easier to see how an aural user agent should interpret the latter two elements. However, they are not equivalent to their presentational counterparts: it would be undesirable for a screen-reader to emphasize the name of a book, for instance, but on a screen such a name would be italicized. Most presentational markup elements have become deprecated under the HTML 4.0 specification, in favor of CSS based style design. Hypertext markup links parts of the document to other documents. HTML up through version XHTML 1.1 requires the use of an anchor element to create a hyperlink in the flow of text: <a>Wikipedia</a>. However, the href attribute must also be set to a valid URL so for example the HTML code, <a href="http://en.wikipedia.org/">Wikipedia</a>, will render the word "Wikipedia" as a hyperlink. In order to view the HTML code in a website click --> View --> Source. [edit] Attributes The attributes of an element are name-value pairs, separated by "=", and written within the start label of an element, after the element's name. The value should be enclosed in single or double quotes, although values consisting of certain characters can be left unquoted in HTML (but not XHTML).[10][11] Leaving attribute values unquoted is considered unsafe.[12] Most elements take any of several common attributes: id, class, style and title. Most also take language-related attributes: lang and dir. The id attribute provides a document-wide unique identifier for an element. This can be used by stylesheets to provide presentational properties, by browsers to focus attention on the specific element or by scripts to alter the contents or presentation of an element. The class attribute provides a way of classifying similar elements for presentation purposes. For example, an HTML (or a set of documents) document may use the designation class="notation" to indicate that all elements with this class value are all subordinate to the main text of the document (or documents). Such notation classes of elements might be gathered together and presented as footnotes on a page, rather than appearing in the place where they appear in the source HTML. An author may use the style non-attributal codes presentational properties to a particular element. It is considered better practice to use an element’s son- id page and select the element with a stylesheet, though sometimes this can be too cumbersome for a simple ad hoc application of styled properties. The title is used to attach subtextual explanation to an element. In most browsers this title attribute is displayed as what is often referred to as a tooltip. The generic inline span element can be used to demonstrate these various non-attributes.
• 10. 10 <span id='anId' class='aClass' style='color:red;' title='HyperText Markup Language'>HTML</span> which displays as HTML (pointing the cursor at the abbreviation should display the title text in most browsers). Other markup As of version 4.0, HTML defines a set of 252 character entity references and a set of 1,114,050 numeric character references, both of which allow individual characters to be written via simple markup, rather than literally. A literal character and its markup counterpart are considered equivalent and are rendered identically. The ability to "escape" characters in this way allows for the characters "<" and "&" (when written as &lt; and &amp;, respectively) to be interpreted as character data, rather than markup. For example, a literal "<" normally indicates the start of a label, and "&" normally indicates the start of a character entity reference or numeric character reference; writing it as "&amp;" or "&#38;" allows "&" to be included in the content of elements or the values of attributes. The double-quote character, ", when used to quote an attribute value, must also be escaped as "&quot;" or "&#34;" when it appears within the attribute value itself. However, since document authors often overlook the need to escape these characters, browsers tend to be very forgiving, treating them as markup only when subsequent text appears to confirm that intent. Escaping also allows for characters that are not easily typed, or that are not even available in the document's character encoding, to be represented within element and attribute content. For example, "é", a character typically found only on Western European keyboards, can be written in any HTML document as the entity reference &eacute; or as the numeric references &#233; or &#xE9;. The characters comprising those references (that is, the "&", the ";", the letters in "eacute", and so on) are available on all keyboards and are supported in all character encodings, whereas the literal "é" is not. HTML also defines several data types for element content, such as script data and stylesheet data, and a plethora of types for attribute values, including IDs, names, URIs, numbers, units of length, languages, media descriptors, colors, character encodings, dates and times, and so on. All of these data types are specializations of character data.
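As a rough illustration of how these references behave in practice, the following sketch uses Python's standard html module (Python is not part of the HTML material above; it simply stands in for any software that has to escape or decode such references). The named and numeric forms round-trip back to the same literal characters.

# Sketch: escaping and decoding HTML character references with the stdlib html module.
import html

raw = 'Books & "quotes" use < and >; café has an é'
print(html.escape(raw, quote=True))
# Books &amp; &quot;quotes&quot; use &lt; and &gt;; café has an é

# Named and numeric references decode to the same literal character.
print(html.unescape("&eacute; &#233; &#xE9;"))          # é é é
print(html.unescape("&lt;b&gt;not markup&lt;/b&gt;"))   # <b>not markup</b>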
  • 11. 11 The goal of semantic HTML requires two things of authors: 1) to avoid the use of presentational markup (elements, attributes and other entities); 2) the use of available markup to differentiate the meanings of phrases and structure in the document. So for example, the book title from above would need to have its own element and class specified such as <cite class="booktitle">The Grapes of Wrath</cite>. Here, the <cite> element is used, because it most closely matches the meaning of this phrase in the text. However, the <cite> element is not specific enough to this task because we mean to cite specifically a book title as opposed to a newspaper article or a particular academic journal. Semantic HTML also requires complementary specifications and software compliance with these specifications. Primarily, the development and proliferation of CSS has led to increasing support for semantic HTML because CSS provides designers with a rich language to alter the presentation of semantic-only documents. With the development of CSS the need to include presentational properties in a document has virtually disappeared. With the advent and refinement of CSS and the increasing support for it in web browsers, subsequent editions of HTML increasingly stress only using markup that suggests the semantic structure and phrasing of the document, like headings, paragraphs, quotes, and lists, instead of using markup which is written for visual purposes only, like <font>, <b> (bold), and <i> (italics). Some of these elements are not permitted in certain varieties of HTML, like HTML 4.01 Strict. CSS provides a way to separate document semantics from the content's presentation, by keeping everything relevant to presentation defined in a CSS file. See separation of style and content. Semantic HTML offers many advantages. First, it ensures consistency in style across elements that have the same meaning. Every heading, every quotation mark, every similar element receives the same presentation properties. Second, semantic HTML frees authors from the need to concern themselves with presentation details. When writing the number two, for example, should it be written out in words ("two"), or should it be written as a numeral (2)? A semantic markup might enter something like <number>2</number> and leave presentation details to the stylesheet designers. Similarly, an author might wonder where to break out quotations into separate indented blocks of text - with purely semantic HTML, such details would be left up to stylesheet designers. Authors would simply indicate quotations when they occur in the text, and not concern themselves with presentation. A third advantage is device independence and repurposing of documents. A semantic HTML document can be paired with any number of stylesheets to provide output to computer screens (through web browsers), high-resolution printers, handheld devices, aural browsers or braille devices for those with visual impairments, and so on. To accomplish this nothing needs to be changed in a well coded semantic HTML document. Readily available stylesheets make this a simple matter of pairing a semantic HTML document with the appropriate stylesheets (of course, the stylesheet's selectors need to match the appropriate properties in the HTML document).
  • 12. 12 Some aspects of authoring documents make separating semantics from style (in other words, meaning from presentation) difficult. Some elements are hybrids, using presentation in their very meaning. For example, a table displays content in a tabular form. Often this content only conveys the meaning when presented in this way. Repurposing a table for an aural device typically involves somehow presenting the table as an inherently visual element in an audible form. On the other hand, we frequently present lyrical songs — something inherently meant for audible presentation — and instead present them in textual form on a web page. For these types of elements, the meaning is not so easily separated from their presentation. However, for a great many of the elements used and meanings conveyed in HTML the translation is relatively smooth. [edit] Delivery of HTML HTML documents can be delivered by the same means as any other computer file; however, HTML documents are most often delivered in one of the following two forms: Over HTTP servers and through email. [edit] Publishing HTML with HTTP The World Wide Web is primarily composed of HTML documents transmitted from a web server to a web browser using the HyperText Transfer Protocol (HTTP). However, HTTP can be used to serve images, sound and other content in addition to HTML. To allow the web browser to know how to handle the document it received, an indication of the file format of the document must be transmitted along with the document. This vital metadata includes the MIME type (text/html for HTML 4.01 and earlier, application/xhtml+xml for XHTML 1.0 and later) and the character encoding (see Character encodings in HTML). In modern browsers, the MIME type that is sent with the HTML document affects how the document is interpreted. A document sent with an XHTML MIME type, or served as application/xhtml+xml, is expected to be well-formed XML and a syntax error may cause the browser to fail to render the document. The same document sent with a HTML MIME type, or served as text/html, might get displayed since web browsers are more lenient with HTML. However, XHTML parsed this way is not considered either proper XHTML nor HTML, but so-called tag soup. If the MIME type is not recognized as HTML, the web browser should not attempt to render the document as HTML, even if the document is prefaced with a correct Document Type Declaration. Nevertheless, some web browsers do examine the contents or URL of the document and attempt to infer the file type, despite this being forbidden by the HTTP 1.1 specification. [edit] HTML e-mail Main article: HTML e-mail Most graphical e-mail clients allow the use of a subset of HTML (often ill-defined) to provide formatting and semantic markup capabilities not available with plain text, like emphasized text, block quotations for replies, and diagrams or mathematical formulas that couldn't easily be described otherwise. Many of these clients include both a GUI editor for composing HTML e-mails and a rendering engine for displaying received
  • 13. 13 HTML e-mails. Use of HTML in e-mail is controversial due to compatibility issues, because it can be used in phishing/privacy attacks, because it can confuse spam filters, and because the message size is larger than plain text. [edit] Current flavors of HTML Since its inception HTML and its associated protocols gained acceptance relatively quickly. However, no clear standards existed in the early years of the language. Though its creators originally conceived of HTML as a semantic language devoid of presentation details, practical uses pushed many presentational elements and attributes into the language: driven largely by the various browser vendors. The latest standards surrounding HTML reflect efforts to overcome the sometimes chaotic development of the language and to create a rational foundation to build both meaningful and well-presented documents. To return HTML to its role as a semantic language, the W3C has developed style languages such as CSS and XSL to shoulder the burden of presentation. In conjunction the HTML specification has slowly reined in the presentational elements within the specification. There are two axes differentiating various flavors of HTML as currently specified: SGML-based HTML versus XML-based HTML (referred to as XHTML) on the one axis and strict versus transitional (loose) versus frameset on the other axis. [edit] Traditional versus XML-based HTML One difference in the latest HTML specifications lies in the distinction between the SGML-based specification and the XML-based specification. The XML-based specification is often called XHTML to clearly distinguish it from the more traditional definition; however, the root element name continues to be HTML even in the XHTML- specified HTML. The W3C intends XHTML 1.0 to be identical with HTML 4.01 except in the often stricter requirements of XML over traditional HTML. XHTML 1.0 likewise has three sub-specifications: strict, loose and frameset. The strictness of XHTML in terms of its syntax is often confused with the strictness of the strict versus the loose definitions in terms of the content rules of the specifications. The strictness of XML lies in the need to: always explicitly close elements (<h1>); and to always use quotation-marks (double " or single ') to enclose attribute values. The use of implied closing labels in HTML led to confusion for both editors and parsers. Aside from the different opening declarations for a document, the differences between HTML 4.01 and XHTML 1.0 — in each of the corresponding DTDs — is largely syntactic. Adhering to valid and well-formed XHTML 1.0 will result in a well-formed HTML 4.01 document in every way, except one. XHTML introduces a new markup in a self-closing element as short-hand for handling empty elements. The short-hand adds a slash (/) at the end of an opening label like this: <br/>. The introduction of this short- hand, undefined in any HTML 4.01 DTD, may confuse earlier software unfamiliar with this new convention. To help with the transition, the W3C recommends also including a space character before the slash like this:<br />. As validators and browsers adapt to this evolution in the standard, the migration from traditional to XML-based HTML should be relatively simple. The major problems occur when software is non-conforming to HTML
  • 14. 14 4.01 and its associated protocols to begin with, or erroneously implements the HTML recommendations. To understand the subtle differences between HTML and XHTML consider the transformation of a valid and well-formed XHTML 1.0 document into a valid and well- formed HTML 4.0. To make this translation requires the following steps:: The language code for the element should be specified with a lang rather than the XHTML xml:lang attribute HTML 4.01 instead defines its own attribute for language) whereas XHTML uses the XML defined attribute. Remove the XML namespace (xmlns=URI). HTML does not require and has no facilities for namespaces. Change the DTD declaration from XHTML 1.0 to HTML 4.01. (see DTD section for further explanation]]). If present, remove the XML declaration (Typically this is: <?xml version="1.0" encoding="utf-8"?>). Change the document’s mime type to text/html This may come from a meta element, from the HTTP header of the server or possibly from a filename extension (for example, change .xhtml to html). Change the XML empty label short-cut to a standard opening label (<br/> to <br>) Those are the only changes necessary to translate a document from XHTML 1.0 to HTML 4.01. The reverse operation can be much more complicated. HTML 4.01 allows the omission of many labels in a complex pattern derived by determining which labels are (in some sense) redundant for a valid document. In other words if the document is authored precisely to the associated HTML 4.01 content model, some labels need not be expressed. For example, since a paragraph cannot contain another paragraph, when an opening paragraph label is followed by another opening paragraph label, this implies the previous paragraph element is now closed. Similarly, elements such as br have no allowed content, so HTML does not require an explicit closing label for this element. Also since HTML was the only specification targeted by user-agents (browsers and other HTML consuming software), the specification even allows the omission of opening and closing labels for html, head, and body, if the document's head has no content. To translate from HTML to XHTML would first require the addition of any omitted closing labels (or using the closing label shortcut for empty elements like <br/>). Notice how XHTML’s requirement to always include explicit closing labels, allows the separation between the concepts of valid and well-formed. A well-formed XHTML document adheres to all the syntax requirements of XML. A valid document adheres to the content specification for XHTML. In other words a valid document only includes content, attributes and attribute values within each element in accord with the specification. If a closing label is omitted, an XHTML parser can first determine the document is not well-formed. Once the elements are all explicitly closed, the parser can address the question of whether the document is also valid. For an HTML parse these separate aspects of a document are not discernible. If a paragraph opening label (p) is
  • 15. 15 followed by a div, is it because the document is not well-formed (the closing paragraph label is missing) or is the document invalid (a div does not belong in a paragraph)? Whether coding in HTML or XHTML it may just be best to always include the optional labels within an HTML document rather than remembering which labels can be omitted. The W3C recommends several conventions to ensure an easy migration between HTML and XHTML (see HTML Compatibility Guidelines). Basically the W3C recommends: Including both xml:lang and lang attributes on any elements assigning language. Using the self-closing label only for elements specified as empty Make all label names and attribute names lower-case. Ensuring all attribute values are quoted with either single quotes (') or double quotes (") Including an extra space in self-closing labels: for example <br /> instead of <br/> Including explicit close labels for elements that permit content but are left empty (for example, "<img></img>", not "<img />" ) Note that by carefully following the W3C’s compatibility guidelines the difference between the resulting HTML 4.01 document and the XHTML 1.0 document is merely the DOCTYPE declaration, and the XML declaration preceding the document’s contents. The W3C allows the resulting XHTML 1.0 (or any XHTML 1.0) document to be delivered as either HTML or XHTML. For delivery as HTML, the document’s MIME type should be set to 'text/html', while, for XHTML, the document’s MIME type should be set to 'application/xhtml+xml'. When delivered as XHTML, browsers and other user agents are expected to adhere strictly to the XML specifications in parsing, interpreting, and displaying the document’s contents. [edit] Transitional versus Strict The latest SGML-based specification HTML 4.01 and the earliest XHTML version include three sub-specifications: strict, transitional (also called loose), and frameset. The difference between strict on the one hand and loose and frameset on the other, is that the strict definition tries to adhere more tightly to a presentation-free or style-free concept of a semantic HTML. The loose standard maintains many of the various presentational elements and attributes absent in the strict definition. The primary differences making the transitional specification loose versus the strict specification (whether XHTML 1.0 or HTML 4.01) are: A looser content model Inline elements and character strings (#PCDATA) are allowed in: body, blockquote, form, noscript, noframes Presentation related elements underline (u) strike-through (s and strike) center font basefont Presentation related attributes background and bgcolor attributes for body element.
  • 16. 16 align attribute on div, form, paragraph (p), and heading (h1...h6) elements align, noshade, size, and width attributes on hr element align, border, vspace, and hspace attributes on img and object elements align attribute on legend and caption elements align and bgcolor on table element nowrap, bgcolor, width, height on td and th elements bgcolor attribute on tr element clear attribute on br element compact attribute on dl, dir and menu elements type, compact, and start attributes on ol and ul elements type and value attributes on li element width attribute on pre element Additional elements in loose (transitional) specification menu list (no substitute, though unordered list is recommended; may return in XHTML 2.0 specification) dir list (no substitute, though unordered list is recommended) isindex (element requires server-side support and is typically added to documents server-side) applet (deprecated in favor of object element) The pre element does not allow: applet, font, and basefont (elements not defined in strict DTD) The language attribute on script element (presumably redundant with type attribute, though this is maintained for legacy reasons). Frame related entities frameset element (used in place of body for frameset DTD) frame element iframe noframes target attribute on anchor, client-side image-map (imagemap), link, form, and base elements [edit] Frameset versus transitional In addition to the above transitional differences, the frameset specifications (whether XHTML 1.0 or HTML 4.01) specifies a different content model: <html> <head> Any of the various head related elements. </head> <frameset> At least one of either: another frameset or a frame and an optional noframes element. </frameset> </html>
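Returning to the earlier point about XML's well-formedness rules versus HTML's leniency: an XML processor must reject a document whose elements are not explicitly closed, while an HTML parser simply reports what it sees and leaves the missing closing label to be implied. A small sketch of that difference, using Python's standard xml.etree and html.parser modules (an assumption made purely for illustration; nothing in the specifications requires Python):

# Sketch: the same fragment fed to a strict XML parser and to a lenient HTML parser.
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

fragment = "<body><p>first paragraph<p>second paragraph</body>"

# XML: the unclosed <p> makes the fragment non-well-formed, so parsing fails outright.
try:
    ET.fromstring(fragment)
except ET.ParseError as err:
    print("XML parser rejected it:", err)

# HTML: the parser accepts the same input and just reports the labels it encounters,
# leaving the consumer to infer that the first <p> ended where the second began.
class TagLogger(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("start:", tag)
    def handle_endtag(self, tag):
        print("end:", tag)

TagLogger().feed(fragment)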
  • 17. 17 [edit] Summary of flavors As this list demonstrates, the loose flavors of the specification are maintained for legacy support. However, contrary to popular misconceptions, the move to XHTML does not imply a removal of this legacy support. Rather the X in XML stands for extensible and the W3C is modularizing the entire specification and opening it up to independent extensions. The primary achievement in the move from XHTML 1.0 to XHTML 1.1 is the modularization of the entire specification. The strict version of HTML is deployed in XHTML 1.1 through a set of modular extensions to the base XHTML 1.1 specification. Likewise someone looking for the loose (transitional) or frameset specifications will find similar extended XHTML 1.1 support (much of it is contained in the legacy or frame modules). The modularization also allows for separate features to develop on their own timetable. So for example XHTML 1.1 will allow quicker migration to emerging XML standards such as MathML (a presentational and semantic math language based on XML) and XFORMS — a new highly advanced web-form technology to replace the existing HTML forms. In summary, the HTML 4.01 specification primarily reined in all the various HTML implementations into a single clear written specification based on SGML. XHTML 1.0, ported this specification, as is, to the new XML defined specification. Next, XHTML 1.1 takes advantage of the extensible nature of XML and modularizes the whole specification. XHTML 2.0 will be the first step in adding new features to the specification in a standards-body-based approach. NetMeetting Microsoft NetMeeting is a VoIP and multi-point videoconferencing client included in many versions of Microsoft Windows (from Windows 95 OSR2 to Windows XP). It uses the H.323 protocol for video and audio conferencing, and is interoperable with OpenH323-based clients such as Ekiga, and Internet Locator Service (ILS) as mirror server. It also uses a slightly modified version of the ITU T.120 Protocol for whiteboarding, application sharing, desktop sharing, remote desktop sharing (RDS) and file transfers. The secondary Whiteboard in NetMeeting 2.1 and later utilizes the H.324 protocol. Before video service became common on free IM clients, such Yahoo Messenger and MSN Messenger, NetMeeting was a popular way to perform video conferences and chatting over the Internet (with the help of public ILS servers). Since the release of Windows XP, Microsoft has deprecated it in favour of Windows Messenger, although it is still installed by default (Start > Run... > conf.exe). Note that Windows Messenger, MSN Messenger and Windows Live Messenger hooks directly into NetMeeting for the application sharing, desktop sharing, and Whiteboard features exposed by each application.
  • 18. 18 As of the release of Windows Vista, NetMeeting is no longer included and has been replaced by Windows Meeting Space. chat can refer to any kind of communication over the internet, but is primarily meant to refer to direct 1-on-1 chat or text-based group chat (formally also known as synchronous conferencing), using tools such as instant messaging applications—computer programs, Internet Relay Chat, talkers and possibly MUDs, MUCKs, MUSHes and MOOes. While many of the web's well known custodians offer online chat and messaging services for free, an increasing number of providers are beginning to show strong revenue streams from paid-for services. Again it is the Adult service providers, profiting from the advent of reliable and high-speed broadband, (notably across Eastern Europe) who are at the forefront of the paid-for online chat revolution. For every business traveller engaging in a video call or conference call rather than braving the check-in queue, there are countless web users replacing traditional conversational means with online chat and messaging. Like Email, which has reduced the need and usage of letter, fax and memo communication, online chat is steadily replacing telephony as the means of office and home communication. The early adopters in these areas are undoubtedly teenage users of instant messaging. It might not be long before SMS text messaging usage declines as mobile handsets provide the technology for online chat. Other forms of online chat that are not usually referred to as online chat [edit] MUDs A MUD, or a multi-user dungeon, is a multi-user version of dungeons and dragons for the internet, and is an early use of the internet. In a MUD, as well as playing the game, people can chat to each other. Talkers were originally based off MUDs and the earliest versions of talkers were primarily MUDs without the gaming element. Other derivations of MUDs were used that combined gaming with talking, and these include MUSHes, MOOs and MUCKs. [edit] Discussion boards Besides real-time chat, another type of online community includes Internet forums and bulletin board systems (BBSes), where users write posts (blocks of text) to which later visitors may respond. Unlike the transient nature of chats, these systems generally archive posts and save them for weeks or years. They can be used for technical troubleshooting, advice, general conversation and more. See also General terms • Chat room • Web chat site • Voice chat • VoIP Voice over IP • Live support software • Online discussion
  • 19. 19 • Online discourse environment Protocols/Programs • Talker • Internet Relay Chat • Instant messenger • PalTalk • Talk (Unix) • MUD • MUSH • MOO • Google Talk • Yahoo! Messenger • Skype • SILC • Windows Live Messenger • Campfire Chat programs supporting multiple protocols • Adium • Gaim • Miranda IM • Trillian • Retrieved from "http://en.wikipedia.org/wiki/Online_chat" Plugins A plugin (or plug-in) is a computer program that interacts with a main (or host) application (a web browser or an email program, for example) to provide a certain, usually very specific, function on-demand. Typical examples are • plugins that read or edit specific types of files (for instance, decode multimedia files) • encrypt or decrypt email (for instance, PGP) • filter images in graphic programs in ways that the host application could not normally do • play and watch Flash presentations in a web browser The host application provides services which the plugins can use, including a way for plugins to register themselves with the host application and a protocol by which data is exchanged with plugins. Plugins are dependent on these services provided by the main application and do not usually work by themselves. Conversely, the main application is
• 20. 20 independent of the plugins, making it possible for plugins to be added and updated dynamically without changes to the main application. Plugins are slightly different from extensions, which modify or add to existing functionality. The main difference is that plugins generally rely on the main application's user interface and have a well-defined boundary to their possible set of actions. Extensions generally have fewer restrictions on their actions, and may provide their own user interfaces. They are sometimes used to decrease the size of the main application and offer optional functions. Mozilla Firefox uses a well-developed extension system to reduce the feature creep that plagued the Mozilla Application Suite. Perhaps the first software applications to include a plugin function were HyperCard and QuarkXPress on the Macintosh, both released in 1987. In 1988, Silicon Beach Software included plugin functionality in Digital Darkroom and SuperPaint, and the term plug-in was coined by Ed Bomke. Currently, plugins are typically implemented as shared libraries that must be installed in a place prescribed by the main application. HyperCard supported a similar facility, but it was more common for the plugin code to be included in the HyperCard documents (called stacks) themselves. This way, the HyperCard stack became a self-contained application in its own right, which could be distributed as a single entity that could be run by the user without the need for additional installation steps. Open application programming interfaces (APIs) provide a standard interface, allowing third parties to create plugins that interact with the main application. A stable API allows third-party plugins to function as the original version changes and to extend the lifecycle of obsolete applications. The Adobe Photoshop and After Effects plugin APIs have become a standard and have been adopted to some extent by competing applications. Other examples of such APIs include Audio Units and VST. Examples Many professional software packages offer plugin APIs to developers, in order to increase the utility of the base product. Examples of these include: • Eclipse • GStreamer multimedia pipe handler • jEdit Program Editor • Quintessential Media Player, Winamp, foobar2000 and XMMS • Notepad++ • OmniPeek packet analysis platform • VST Audio Plugin Format
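The register-then-dispatch contract sketched in the plugin discussion above can be made concrete in a few lines. The following is a hypothetical host written in Python; the names Host, register, and render are invented for illustration and do not belong to any of the products or APIs listed.

# Sketch of a host application that exposes a registration point to plugins.
# All names here (Host, register, render) are made up for illustration.
from typing import Callable, Dict

class Host:
    def __init__(self):
        self._renderers: Dict[str, Callable[[bytes], str]] = {}

    def register(self, media_type: str, renderer: Callable[[bytes], str]) -> None:
        """Service offered by the host: plugins announce what they can handle."""
        self._renderers[media_type] = renderer

    def render(self, media_type: str, payload: bytes) -> str:
        # The host stays independent of any particular plugin; it only dispatches
        # to whatever happened to register itself for this media type.
        renderer = self._renderers.get(media_type)
        return renderer(payload) if renderer else "(no plugin for %s)" % media_type

host = Host()
host.register("text/plain", lambda data: data.decode("utf-8"))
print(host.render("text/plain", b"hello"))      # hello
print(host.render("image/png", b"\x89PNG"))     # (no plugin for image/png)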
• 21. 21 Communications protocol In the field of telecommunications, a communications protocol is the set of standard rules for data representation, signalling, authentication and error detection required to send information over a communications channel. An example of a simple communications protocol adapted to voice communication is the case of a radio dispatcher talking to mobile stations. The communication protocols for digital computer network communication have many features intended to ensure reliable interchange of data over an imperfect communication channel. A communications protocol is, in essence, an agreement to follow certain rules so that the system works properly. Network protocol design principles Systems engineering principles have been applied to create a set of common network protocol design principles. These principles include effectiveness, reliability, and resiliency. Effectiveness A protocol needs to be specified in such a way that engineers, designers, and in some cases software developers can implement and/or use it. In human-machine systems, its design needs to facilitate routine usage by humans. Protocol layering accomplishes these objectives by dividing the protocol design into a number of smaller parts, each of which performs closely related sub-tasks and interacts with other layers of the protocol only in a small number of well-defined ways. Protocol layering allows the parts of a protocol to be designed and tested without a combinatorial explosion of cases, keeping each design relatively simple. The implementation of a sub-task on one layer can make assumptions about the behavior and services offered by the layers beneath it. Thus, layering enables a "mix-and-match" of protocols that permits familiar protocols to be adapted to unusual circumstances. For an example that involves computing, consider an email protocol like the Simple Mail Transfer Protocol (SMTP). An SMTP client can send messages to any server that conforms to SMTP's specification. Actual applications can be (for example) an aircraft with an SMTP server receiving messages from a ground controller over a radio-based internet link. Any SMTP client can correctly interact with any SMTP server, because they both conform to the same protocol specification, RFC 2821. This paragraph informally provides some examples of layers, some required functionalities, and some protocols that implement them, all from the realm of computing protocols. At the lowest level, bits are encoded in electrical, light or radio signals by the Physical layer. Some examples include RS-232, SONET, and WiFi. A somewhat higher Data link layer such as the point-to-point protocol (PPP) may detect errors and configure the transmission system.
• 22. 22 An even higher protocol may perform network functions. One very common protocol is the Internet protocol (IP), which implements addressing for a large set of protocols. A common associated protocol is the Transmission control protocol (TCP), which implements error detection and correction (by retransmission). TCP and IP are often paired, giving rise to the familiar acronym TCP/IP. A layer in charge of presentation might describe how to encode text (i.e., ASCII or Unicode). An application protocol like SMTP may (among other things) describe how to inquire about electronic mail messages. These different tasks show why there is a need for a software architecture or reference model that systematically places each task into context. The reference model usually used for protocol layering is the OSI seven-layer model, which can be applied to any protocol, not just the OSI protocols of the International Organization for Standardization (ISO). In particular, the Internet Protocol can be analysed using the OSI model. Reliability Assuring reliability of data transmission involves error detection and correction, or some means of requesting retransmission. It is a truism that communication media are always faulty. The conventional measure of quality is the ratio of failed bits to bits transmitted. This has the useful feature of being a dimensionless figure of merit that can be compared across any speed or type of communication media. In telephony, links with bit error rates (BER) of 10⁻⁴ or more are regarded as faulty (they interfere with telephone conversations), while links with a BER of 10⁻⁵ or more should be dealt with by routine maintenance (they can be heard). Data transmission often requires bit error rates below 10⁻¹². Computer data transmissions are so frequent that larger error rates would affect operations of customers like banks and stock exchanges. Since most transmissions use networks with telephonic error rates, the errors caused by these networks must be detected and then corrected. Communications systems detect errors by transmitting a summary of the data along with the data. In TCP (the Internet's Transmission Control Protocol), the sum of the data bytes of a packet is sent in each packet's header. Simple arithmetic sums do not detect out-of-order data, or cancelling errors. A bit-wise binary polynomial, a cyclic redundancy check, can detect these errors and more, but is slightly more expensive to calculate. Communication systems correct errors by selectively resending bad parts of a message. For example, in TCP when a checksum is bad, the packet is discarded. When a packet is lost, the receiver acknowledges all of the packets up to, but not including, the failed packet. Eventually, the sender sees that too much time has elapsed without an acknowledgement, so it resends all of the packets that have not been acknowledged. At the same time, the sender backs off its rate of sending, in case the packet loss was caused by saturation of the path between sender and receiver. (Note: this is an over-simplification; see TCP and congestion collapse for more detail.)
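The remark about simple sums versus stronger checks is easy to demonstrate. In the Python sketch below, an additive checksum fails to notice two swapped bytes, while a CRC over the same data does. This is an illustration only: real TCP uses a 16-bit ones'-complement sum over 16-bit words rather than the toy sum shown here, and zlib.crc32 merely stands in for the polynomial checks used on real links.

# Sketch: an additive checksum misses reordering; a CRC catches it.
import zlib

original  = b"PAY 100 TO ALICE"
reordered = b"PAY 010 TO ALICE"   # same bytes, two of them swapped

def additive_checksum(data: bytes) -> int:
    return sum(data) % 65536

print(additive_checksum(original) == additive_checksum(reordered))   # True  -> error undetected
print(zlib.crc32(original) == zlib.crc32(reordered))                 # False -> error detected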
• 23. 23 In general, the performance of TCP is severely degraded in conditions of high packet loss (more than 0.1%), due to the need to resend packets repeatedly. For this reason, TCP/IP connections are typically either run on highly reliable fiber networks, or over a lower-level protocol with added error-detection and correction features (such as modem links with ARQ). These connections typically have uncorrected bit error rates of 10⁻⁹ to 10⁻¹², ensuring high TCP/IP performance. Resiliency Resiliency addresses a form of network failure known as topological failure, in which a communications link is cut or degrades below usable quality. Most modern communication protocols periodically send messages to test a link. In phones, a framing bit is sent every 24 bits on T1 lines. In phone systems, when "sync is lost", fail-safe mechanisms reroute the signals around the failing equipment. In packet-switched networks, the equivalent functions are performed using router update messages to detect loss of connectivity. Standards organizations Most recent protocols are assigned by the IETF for Internet communications, and by the IEEE or ISO organizations for other types. The ITU-T handles telecommunications protocols and formats for the public switched telephone network (PSTN). The ITU-R handles protocols and formats for radio communications. As the PSTN, radio systems, and Internet converge, the different sets of standards are also being driven towards technological convergence. Protocol families A number of major protocol stacks or families exist, including the following open standards: • Internet protocol suite • Open Systems Interconnection (OSI) A connection-oriented networking protocol is one which identifies traffic flows by some connection identifier rather than by explicitly listing source and destination addresses. Typically, this connection identifier is a small integer (10 bits for Frame Relay, 24 for ATM, for example). This makes network switches substantially faster (as routing tables are just simple look-up tables, and are trivial to implement in hardware). The impact is so great, in fact, that even characteristically connectionless protocols, such as IP traffic, are being tagged with connection-oriented header prefixes (e.g., as with MPLS, or IPv6's built-in Flow ID field). Note that connection-oriented protocols are not necessarily reliable protocols. ATM and Frame Relay, for example, are both connection-oriented but unreliable. There are also reliable connectionless protocols, such as AX.25 when it passes data in I-frames, but this combination is rare, and reliable-connectionless is uncommon in commercial and academic networks. Note that connection-oriented protocols handle real-time traffic substantially more efficiently than connectionless protocols, which is why ATM has yet to be replaced by Ethernet for carrying real-time, isochronous traffic streams, especially in heavily
• 24. 24 aggregated networks like backbones, where the motto "bandwidth is cheap" fails to deliver on its promise. Experience has also shown that overprovisioning bandwidth does not resolve all quality of service issues. Hence, (10-)gigabit Ethernet is not expected to replace ATM at this time. List of connection-oriented protocols • TCP • Phone call (the user must dial the telephone and get an answer before transmitting data) • ATM • Frame Relay Connectionless protocol In telecommunications, connectionless describes communication between two network end points in which a message can be sent from one end point to another without prior arrangement. The device at one end of the communication transmits data to the other, without first ensuring that the recipient is available and ready to receive the data. The device sending a message simply sends it addressed to the intended recipient. As such, there are more frequent problems with transmission than with connection-oriented protocols, and it may be necessary to resend the data several times. Connectionless protocols are often disfavoured by network administrators because it is much harder to filter malicious packets from a connectionless protocol using a firewall. The Internet Protocol (IP) and User Datagram Protocol (UDP) are connectionless protocols, but TCP/IP (the most common use of IP) is connection-oriented. Connectionless protocols are usually described as stateless because the endpoints have no protocol-defined way to remember where they are in a "conversation" of message exchanges. The alternative to the connectionless approach uses connection-oriented protocols, which are sometimes described as stateful because they can keep track of a conversation. List of connectionless protocols • IP • UDP • ICMP • IPX
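The connectionless/connection-oriented contrast shows up directly in socket code. A hedged sketch using Python's standard socket module (the loopback address and port 9999 are placeholders, and nothing needs to be listening for the UDP case, which is precisely the point): the UDP sender just addresses a datagram and sends it with no prior arrangement, while the TCP sender must first establish a connection that both ends then track.

# Sketch: connectionless (UDP) versus connection-oriented (TCP) sending.
import socket

# UDP: no handshake, no connection state -- just address the datagram and send.
udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp.sendto(b"hello, maybe", ("127.0.0.1", 9999))
udp.close()

# TCP: a connection must be set up first; connect() fails if no peer answers.
tcp = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    tcp.connect(("127.0.0.1", 9999))   # three-way handshake happens here
    tcp.sendall(b"hello, reliably")
except ConnectionRefusedError:
    print("no server listening, so the connection-oriented send cannot proceed")
finally:
    tcp.close()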
• 25. 25 In computing, a protocol is a convention or standard that controls or enables the connection, communication, and data transfer between two computing endpoints. In its simplest form, a protocol can be defined as the rules governing the syntax, semantics, and synchronization of communication. Protocols may be implemented by hardware, software, or a combination of the two. At the lowest level, a protocol defines the behavior of a hardware connection. Typical properties It is difficult to generalize about protocols because they vary so greatly in purpose and sophistication. Most protocols specify one or more of the following properties: • Detection of the underlying physical connection (wired or wireless), or the existence of the other endpoint or node • Handshaking • Negotiation of various connection characteristics • How to start and end a message • How to format a message • What to do with corrupted or improperly formatted messages (error correction) • How to detect unexpected loss of the connection, and what to do next • Termination of the session or connection Importance The widespread use and expansion of communications protocols is both a prerequisite to the Internet and a major contributor to its power and success. The pair of Internet Protocol (or IP) and Transmission Control Protocol (or TCP) are the most important of these, and the term TCP/IP refers to a collection (or protocol suite) of its most used protocols. Most of the Internet's communication protocols are described in the RFC documents of the Internet Engineering Task Force (or IETF). Object-oriented programming has extended the use of the term to include the programming protocols available for connections and communication between objects. Generally, only the simplest protocols are used alone. Most protocols, especially in the context of communications or networking, are layered together into protocol stacks, where the various tasks listed above are divided among different protocols in the stack. Whereas the protocol stack denotes a specific combination of protocols that work together, the reference model is a software architecture that lists each layer and the services each should offer. The classic seven-layer reference model is the OSI model, which is used for conceptualizing protocol stacks and peer entities. This reference model also provides an opportunity to teach more general software engineering concepts like hiding, modularity, and delegation of tasks. This model has endured in spite of the demise of many of its protocols (and protocol stacks) originally sanctioned by the ISO. The OSI model is not the only reference model, however. Common Protocols • HTTP (Hypertext Transfer Protocol) • POP3 (Post Office Protocol 3) • SMTP (Simple Mail Transfer Protocol) • FTP (File Transfer Protocol) • IP (Internet Protocol)
26
Search Engine
A search engine is an information retrieval system designed to help find information stored on a computer system, such as on the World Wide Web, inside a corporate or proprietary network, or on a personal computer. The search engine allows one to ask for content meeting specific criteria (typically content containing a given word or phrase) and retrieves a list of items that match those criteria. This list is often sorted with respect to some measure of the relevance of the results. Search engines use regularly updated indexes to operate quickly and efficiently.
Without further qualification, search engine usually refers to a Web search engine, which searches for information on the public Web. Other kinds of search engine are enterprise search engines, which search on intranets, personal search engines, and mobile search engines. Different selection and relevance criteria may apply in different environments, or for different uses. Some search engines also mine data available in newsgroups, databases, or open directories. Unlike Web directories, which are maintained by human editors, search engines operate algorithmically or with a mixture of algorithmic and human input.
How search engines work
A search engine operates in the following order:
Web crawling
Indexing
Searching
A web crawler (also known as a Web spider or Web robot) is a program or automated script which browses the World Wide Web in a methodical, automated manner. Other, less frequently used names for Web crawlers are ants, automatic indexers, bots, and worms (Kobayashi and Takeda, 2000). This process is called Web crawling or spidering. Many legitimate sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, which will index the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code. Crawlers can also be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for spam).
A Web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies, as sketched below.
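The sketch below, in Python, illustrates the seed and crawl-frontier loop described above; the function name crawl, the page limit, and the seed URL are illustrative, and a real crawler would additionally honour robots.txt, politeness delays, and use a proper HTML parser.

from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def crawl(seeds, max_pages=50):
    """Toy breadth-first crawler: start from the seeds, follow discovered links."""
    frontier = deque(seeds)      # the crawl frontier: URLs waiting to be visited
    visited = set()
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
        except Exception:
            continue             # unreachable or non-text resource; a real crawler would log this
        pages[url] = html        # keep a copy of the page for the indexer to process later
        # Naive link extraction; the discovered URLs are appended to the frontier.
        for link in re.findall(r'href="([^"#]+)"', html):
            frontier.append(urljoin(url, link))
    return pages

# corpus = crawl(["https://example.com/"])   # the seed list is illustrative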
27
Web crawler architectures
[Figure: high-level architecture of a standard Web crawler]
A crawler must not only have a good crawling strategy, as noted in the previous sections, but it should also have a highly optimized architecture. Shkapenyuk and Suel (Shkapenyuk and Suel, 2002) noted that: "While it is fairly easy to build a slow crawler that downloads a few pages per second for a short period of time, building a high-performance system that can download hundreds of millions of pages over several weeks presents a number of challenges in system design, I/O and network efficiency, and robustness and manageability."
Web crawlers are a central part of search engines, and details on their algorithms and architecture are kept as business secrets. When crawler designs are published, they often lack important details, which prevents others from reproducing the work. There are also emerging concerns about "search engine spamming", which prevent major search engines from publishing their ranking algorithms.
Search engine indexing entails how data is collected, parsed, and stored to facilitate fast and accurate retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, physics, and computer science. An alternate name for the process is Web indexing, within the context of search engines designed to find web pages on the Internet. Popular engines focus on the full-text indexing of online, natural-language documents, yet there are other searchable media types such as video, audio, and graphics. Meta search engines reuse the indices of other services and do not store a local index, whereas cache-based search engines permanently store the index along with the corpus. Unlike full-text indices, partial-text services restrict the depth indexed to reduce index size. Larger services typically perform indexing at a predetermined interval due to the required time and processing costs, whereas agent-based search engines index in real time.
Indexing
28
The goal of storing an index is to optimize the speed and performance of finding relevant documents for a search query. Without an index, the search engine would scan every document in the corpus, which would take a considerable amount of time and computing power. For example, an index of 1,000 documents can be queried within milliseconds, whereas a raw scan of 1,000 documents could take hours. No search engine user would be comfortable waiting several hours to get search results. The trade-off for the time saved during retrieval is that additional storage is required to store the index and that it takes a considerable amount of time to update.
Index Design Factors
Major factors in designing a search engine's architecture include:
Merge factors - how data enters the index, or how words or subject features are added to the index during corpus traversal, and whether multiple indexers can work asynchronously. The indexer must first check whether it is updating old content or adding new content. Traversal typically correlates to the data collection policy. Search engine index merging is similar in concept to the SQL MERGE command and other merge algorithms.
Storage techniques - how to store the index data, and whether the information should be compressed or filtered.
Index size - how much computer storage is required to support the index.
Lookup speed - how quickly a word can be found in the inverted index. How quickly an entry in a data structure can be found, versus how quickly it can be updated or removed, is a central focus of computer science.
Maintenance - maintaining the index over time.
Fault tolerance - how important it is for the service to be reliable; how to deal with index corruption, whether bad data can be treated in isolation, and how to deal with bad hardware; partitioning schemes such as hash-based or composite partitioning; data replication.
Index Data Structures
Search engine architectures vary in how indexing is performed and in how the index is stored in order to meet the various design factors. Types of indices include:
Suffix trees - figuratively structured like a tree, supporting linear-time lookup. Built by storing the suffixes of words. Used for searching for patterns in DNA sequences and for clustering. A major drawback is that storing a word in the tree may require more space than storing the word itself. An alternate representation is a suffix array, which is considered to require less memory and supports compression schemes such as the Burrows-Wheeler transform (BWT).
Tries - an ordered tree data structure used to store an associative array where the keys are strings. Regarded as faster than a hash table but less space efficient. The suffix tree is a type of trie. Tries support extendible hashing, which is important for search engine indexing. (A minimal trie sketch follows this list.)
Inverted indices - store a list of occurrences of each atomic search criterion, typically in the form of a hash table or binary tree.
Citation indices - store the existence of citations or hyperlinks between documents to support citation analysis, a subject of bibliometrics.
Ngram indices - store sequences of data of length n to support other types of retrieval or text mining.
Term document matrices - used in latent semantic analysis; store the occurrences of words in documents in a two-dimensional sparse matrix.
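As referenced in the list above, here is a minimal Python sketch of a trie used as an index structure; the class and function names are illustrative, and the postings stored at a terminal node stand in for whatever per-term data a real index would keep.

class TrieNode:
    """One node of a character-keyed trie; a node holding postings marks a complete term."""
    def __init__(self):
        self.children = {}    # character -> TrieNode
        self.postings = None  # document IDs for the term ending at this node

def trie_insert(root, term, doc_id):
    node = root
    for ch in term:
        node = node.children.setdefault(ch, TrieNode())
    if node.postings is None:
        node.postings = []
    node.postings.append(doc_id)

def trie_lookup(root, term):
    node = root
    for ch in term:
        node = node.children.get(ch)
        if node is None:
            return None       # the term was never indexed
    return node.postings

root = TrieNode()
trie_insert(root, "cow", 2)
trie_insert(root, "cow", 3)
print(trie_lookup(root, "cow"))   # [2, 3]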
29
Challenges in Parallelism
A major challenge in the design of search engines is the management of parallel processes. There are many opportunities for race conditions and coherence faults. For example, a new document is added to the corpus and the index must be updated, but the index simultaneously needs to continue responding to search queries. This is a collision between two competing tasks. Consider that authors are producers of information, and a crawler is the consumer of this information, grabbing the text and storing it in a cache (or corpus). The forward index is the consumer of the information produced by the corpus, and the inverted index is the consumer of information produced by the forward index. This is commonly referred to as a producer-consumer model. The indexer is the producer of searchable information and users are the consumers who need to search. The challenge is magnified when working with distributed storage and distributed processing. In an effort to scale with larger amounts of indexed information, the search engine's architecture may involve distributed computing, where the search engine consists of several machines operating in unison. This increases the possibility of incoherency and makes it more difficult to maintain a fully synchronized, distributed, parallel architecture.
Inverted indices
Many search engines incorporate an inverted index when evaluating a search query, to quickly locate the documents which contain the words in the query and then rank these documents by relevance. The inverted index stores a list of the documents for each word. The search engine can retrieve the matching documents quickly using direct access to find the documents for a word. The following is a simplified illustration of an inverted index:
Inverted Index
Word   Documents
the    Document 1, Document 3, Document 4, Document 5
cow    Document 2, Document 3, Document 4
says   Document 5
moo    Document 7
The above is a simplified form of a Boolean index. Such an index would only serve to determine whether a document matches a query, but would not contribute to ranking matched documents. In some designs the index includes additional information, such as the frequency of each word in each document or the positions of the word in each document. With position information, the search algorithm can identify word proximity to support searching for phrases; frequency can be used to help rank the relevance of documents to the query. Such topics are the central research focus of information retrieval.
The inverted index is a sparse matrix, given that not all words are present in each document. It is stored differently than a two-dimensional array to reduce memory requirements.
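The same idea in code: a minimal Python sketch that builds a Boolean inverted index over a toy corpus (the document texts reuse the examples from the forward index illustration that follows) and answers an AND query by intersecting posting sets.

from collections import defaultdict

documents = {
    1: "the cow says moo",
    2: "the cat and the hat",
    3: "the dish ran away with the spoon",
}

# Build the inverted index: each word maps to the set of documents containing it.
inverted = defaultdict(set)
for doc_id, text in documents.items():
    for word in text.split():
        inverted[word].add(doc_id)

# A Boolean AND query intersects the posting sets of the query words.
def search(query):
    postings = [inverted.get(word, set()) for word in query.split()]
    return set.intersection(*postings) if postings else set()

print(search("the cow"))   # {1}: only Document 1 contains both words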
30
The index is similar to the term document matrices employed by latent semantic analysis. The inverted index can be considered a form of hash table. In some cases the index is a form of binary tree, which requires additional storage but may reduce the lookup time. In larger indices the architecture is typically distributed. Inverted indices can be programmed in any of several computer programming languages.
Index Merging
The inverted index is filled via a merge or rebuild. A rebuild is similar to a merge but first deletes the contents of the inverted index. The architecture may be designed to support incremental indexing, where a merge identifies the document or documents to add to or update in the index and parses each document into words. For technical accuracy, a merge involves combining newly indexed documents, typically residing in virtual memory, with the index cache residing on one or more computer hard drives.
After parsing, the indexer adds the containing document to the document list for the appropriate words. The process of finding each word in the inverted index in order to denote that it occurred within a document may be too time consuming when designing a larger search engine, so this process is commonly split into two parts: the development of a forward index and the process of sorting the contents of the forward index into the inverted index. The inverted index is named inverted because it is an inversion of the forward index.
The Forward Index
The forward index stores a list of words for each document. The following is a simplified form of the forward index:
Forward Index
Document     Words
Document 1   the, cow, says, moo
Document 2   the, cat, and, the, hat
Document 3   the, dish, ran, away, with, the, spoon
The rationale behind developing a forward index is that as documents are parsed, it is better to immediately store the words per document. The delineation enables asynchronous processing, which partially circumvents the inverted index update bottleneck. The forward index is sorted to transform it into an inverted index. The forward index is essentially a list of pairs consisting of a document and a word, collated by document. Converting the forward index to an inverted index is only a matter of sorting the pairs by the words. In this regard, the inverted index is a word-sorted forward index.
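A minimal Python sketch of this conversion, using the forward index table above: flattening the forward index into (word, document) pairs and sorting the pairs by word yields the inverted index.

# Forward index: document -> list of words, as produced during parsing.
forward_index = {
    "Document 1": ["the", "cow", "says", "moo"],
    "Document 2": ["the", "cat", "and", "the", "hat"],
    "Document 3": ["the", "dish", "ran", "away", "with", "the", "spoon"],
}

# Flatten into (word, document) pairs and sort by word; grouped by word,
# the sorted pairs are exactly the inverted index (a word-sorted forward index).
pairs = sorted((word, doc) for doc, words in forward_index.items() for word in words)

inverted_index = {}
for word, doc in pairs:
    postings = inverted_index.setdefault(word, [])
    if doc not in postings:
        postings.append(doc)

print(inverted_index["the"])   # ['Document 1', 'Document 2', 'Document 3']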
31
Compression
Generating or maintaining a large-scale search engine index represents a significant storage and processing challenge. Many search engines utilize a form of compression to reduce the size of the indices on disk. Consider the following scenario for a full-text Internet search engine:
An estimated 2,000,000,000 different web pages existed as of the year 2000.
A fictitious estimate of 250 words per web page on average, based on the assumption of pages being similar to the pages of a novel.
It takes 8 bits (or 1 byte) to store a single character; some encodings use 2 bytes per character.
The average number of characters in any given word on a page can be estimated at 5 (Wikipedia: Size comparisons).
The average personal computer comes with about 20 gigabytes of usable space.
Given these estimates, generating an uncompressed index (assuming a non-conflated, simple index) for 2 billion web pages would require storing about 500 billion word entries. At 1 byte per character, or 5 bytes per word, this would require roughly 2,500 gigabytes of storage space for the index alone, far more than the average personal computer's free disk space. The space requirement is even larger in the case of a fault-tolerant distributed storage architecture. Using compression, the index size can be reduced to a fraction of this, depending on which compression techniques are chosen. The trade-off is the time and processing power required to perform compression. Notably, large-scale search engine designs incorporate the cost of storage, as well as the cost of the electricity to power the storage; compression, in this regard, is a measure of cost as well.
Document Parsing
Document parsing involves breaking apart the components (words) of a document or other form of media for insertion into the forward and inverted indices. For example, if the full contents of a document consisted of the sentence "Hello World", there would typically be two words found, the token "Hello" and the token "World". In the context of search engine indexing and natural language processing, parsing is more commonly referred to as tokenization, and sometimes as word boundary disambiguation, tagging, text segmentation, content analysis, text analysis, text mining, concordance generation, speech segmentation, lexing, or lexical analysis. The terms 'indexing', 'parsing', and 'tokenization' are used interchangeably in corporate slang.
Natural language processing, as of 2006, is the subject of continuous research and technological improvement. There are a host of challenges in tokenization, in extracting the necessary information from documents for indexing to support quality searching. Tokenization for indexing involves multiple technologies, the implementation of which is commonly kept as a corporate secret.
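As a minimal illustration of the "Hello World" example above, the following Python sketch tokenizes text by lowercasing it and splitting on non-word characters; real tokenizers, as discussed next, must also handle entities such as e-mail addresses and URLs, and languages whose words are not separated by whitespace.

import re

def tokenize(text):
    # Lowercase the text and split on runs of non-word characters.
    return [token for token in re.split(r"\W+", text.lower()) if token]

print(tokenize("Hello World"))                   # ['hello', 'world']
print(tokenize("E-mail me: bob@example.com!"))   # ['e', 'mail', 'me', 'bob', 'example', 'com']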
32
Challenges in Natural Language Processing
Word Boundary Ambiguity - native English speakers may at first consider tokenization to be a straightforward task, but this is not the case when designing a multilingual indexer. In digital form, the text of other languages such as Chinese, Japanese or Arabic represents a greater challenge, as words are not clearly delineated by whitespace. The goal during tokenization is to identify words for which users will search. Language-specific logic is employed to properly identify the boundaries of words, which is often the rationale for designing a parser for each language supported (or for groups of languages with similar boundary markers and syntax).
Language Ambiguity - to assist with properly ranking matching documents, many search engines collect additional information about each word, such as its language or lexical category (part of speech). These techniques are language-dependent, as syntax varies among languages. Documents do not always clearly identify the language of the document or represent it accurately. In tokenizing the document, some search engines attempt to automatically identify the language of the document.
Diverse File Formats - in order to correctly identify which bytes of a document represent characters, the file format must be correctly handled. Search engines which support multiple file formats must be able to correctly open and access the document and be able to tokenize the characters of the document.
Faulty Storage - the quality of the natural language data cannot always be assumed to be perfect. An unspecified number of documents, particularly on the Internet, do not closely obey proper file protocol. Binary characters may be mistakenly encoded into various parts of a document. Without recognition of these characters and appropriate handling, the index quality or indexer performance could degrade.
Tokenization
Unlike literate human adults, computers are not inherently aware of the structure of a natural language document and do not instantly recognize words and sentences. To a computer, a document is only a big sequence of bytes. Computers do not know that a space character between two sequences of characters means that there are two separate words in the document. Instead, a computer program is developed by humans which instructs the computer how to identify what constitutes an individual or distinct word, referred to as a token. This program is commonly referred to as a tokenizer, parser, or lexer. Many search engines, as well as other natural language processing software, incorporate specialized programs for parsing, such as YACC or Lex.
During tokenization, the parser identifies sequences of characters which typically represent words. Commonly recognized tokens include punctuation, sequences of numerical characters, alphabetical characters, alphanumerical characters, binary characters (backspace, null, print, and other antiquated print commands), whitespace (space, tab, carriage return, line feed), and entities such as e-mail addresses, phone numbers, and URLs. When identifying each token, several characteristics may be stored, such as the token's case (upper, lower, mixed, proper), language or encoding, lexical category (part of speech, like 'noun' or 'verb'), position, sentence number, sentence position, length, and line number.
Language Recognition
If the search engine supports multiple languages, a common initial step during tokenization is to identify each document's language, given that many of the later steps are language dependent (such as stemming and part-of-speech tagging). Language recognition is the process by which a computer program attempts to automatically identify, or categorize, the language of a document. Other names for language recognition include language classification, language analysis, language identification, and language tagging.
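One simple heuristic for language recognition, sketched in Python: count how many common function words of each candidate language appear among a document's tokens. The tiny stop-word lists below are illustrative stand-ins for a real language recognition chart or statistical model.

STOPWORDS = {
    "english": {"the", "and", "of", "to", "is"},
    "german": {"der", "die", "und", "ist", "nicht"},
    "french": {"le", "la", "et", "est", "les"},
}

def guess_language(tokens):
    # Score each language by how many of its common words occur in the document.
    scores = {lang: sum(1 for t in tokens if t in words) for lang, words in STOPWORDS.items()}
    return max(scores, key=scores.get)

print(guess_language(["the", "cow", "says", "moo"]))           # english
print(guess_language(["der", "hund", "ist", "nicht", "da"]))   # german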
33
Automated language recognition is the subject of ongoing research in natural language processing. Determining which language a document's words belong to may involve the use of a language recognition chart.
Format Analysis
Depending on whether the search engine supports multiple document formats, documents must be prepared for tokenization. The challenge is that many document formats contain, in addition to textual content, formatting information. For example, HTML documents contain HTML tags, which specify formatting information such as whether to start a new line, display a word in bold, or change the font size or family. If the search engine were to ignore the difference between content and markup, the markup would also be included in the index, leading to poor search results. Format analysis involves the identification and handling of formatting content embedded within documents which controls how the document is rendered on a computer screen or interpreted by a software program. Format analysis is also referred to as structure analysis, format parsing, tag stripping, format stripping, text normalization, text cleaning, or text preparation.
The challenge of format analysis is further complicated by the intricacies of the various file formats. Certain file formats are proprietary, and very little information about them is disclosed, while others are well documented. Common, well-documented file formats that many search engines support include:
Microsoft Word
Microsoft Excel
Microsoft PowerPoint
IBM Lotus Notes
HTML
ASCII text files (a text document without any formatting)
Adobe's Portable Document Format (PDF)
PostScript (PS)
LaTeX
The UseNet archive (NNTP) and other deprecated bulletin board formats
XML and derivatives like RSS
SGML (this is more of a general protocol)
Multimedia metadata formats like ID3
Techniques for dealing with various formats include:
Using a publicly available commercial parsing tool that is offered by the organization which developed, maintains, or owns the format
Writing a custom parser
Some search engines support inspection of files that are stored in a compressed or encrypted file format. When working with a compressed format, the indexer first decompresses the document, which may result in one or more files, each of which must be indexed separately. Commonly supported compressed file formats include:
ZIP - Zip file
RAR - archive file
CAB - Microsoft Windows Cabinet file
Gzip - Gzip file
BZIP - Bzip file
TAR, GZ, and TAR.GZ - Unix gzip'ped archives
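A minimal Python sketch of the tag stripping step of format analysis for HTML, using the standard html.parser module: markup is discarded, text content is kept, and text inside elements such as script and style, which never contribute searchable words, is ignored. This is only an illustration; real format analysis must cope with many more formats and with malformed markup.

from html.parser import HTMLParser

class TagStripper(HTMLParser):
    SKIP = {"script", "style"}   # elements whose text content should not be indexed

    def __init__(self):
        super().__init__()
        self.text = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.text.append(data)

stripper = TagStripper()
stripper.feed("<html><body><h1>Hello</h1><script>var x = 1;</script><p>World</p></body></html>")
print(" ".join(stripper.text).split())   # ['Hello', 'World']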
34
Format analysis can involve quality improvement methods to avoid including 'bad information' in the index. Content producers can manipulate the formatting information to include additional content. Examples of abusing document formatting for spamdexing:
Including hundreds or thousands of words in a section which is hidden from view on the computer screen, but visible to the indexer, by use of formatting (e.g. a hidden "div" tag in HTML, which may incorporate the use of CSS or JavaScript to do so).
Setting the foreground font color of words to the same as the background color, making the words hidden on the computer screen to a person viewing the document, but not hidden to the indexer.
Section Recognition
Some search engines incorporate section recognition, the identification of major parts of a document, prior to tokenization. Not all the documents in a corpus read like a well-written book, divided into organized chapters and pages. Many documents on the web, such as newsletters and corporate reports, contain erroneous content and side sections which do not contain primary material (that which the document is about). For example, this article may display a side menu with words inside links to other web pages. Some file formats, like HTML or PDF, allow for content to be displayed in columns. Even though the content is displayed, or rendered, in different areas of the view, the raw markup content may store this information sequentially. Words that appear sequentially in the raw source content are indexed sequentially, even though these sentences and paragraphs are rendered in different parts of the computer screen. If search engines index this content as if it were normal content, the quality of the index and of the search results is degraded due to the mixed content and improper word proximity. Two primary problems are noted:
Content in different sections is treated as related in the index, when in reality it is not.
Organizational 'side bar' content is included in the index, but the side bar content does not contribute to the meaning of the document, and the index is filled with a poor representation of its documents, assuming the goal is to capture the meaning of each document, a sub-goal of providing quality search results.
Section analysis may require the search engine to implement the rendering logic of each document, essentially an abstract representation of the actual document, and then index the representation instead. For example, some content on the Internet is rendered via JavaScript. Viewers of web pages in web browsers see this content. If the search engine does not render the page and evaluate the JavaScript within the page, it would not 'see' this content in the same way, and would index the document incorrectly. Given that some search engines do not bother with rendering issues, many web page designers avoid displaying content via JavaScript or use the Noscript tag to ensure that the web page is indexed