                                     Web Browser
A web browser is a software application that enables a user to display and interact with
text, images, and other information typically located on a web page at a website on the
World Wide Web or a local area network. Text and images on a web page can contain
hyperlinks to other web pages at the same or different website. Web browsers allow a
user to quickly and easily access information provided on many web pages at many
websites by traversing these links. Web browsers format HTML information for display,
so the appearance of a web page may differ between browsers.
Some of the web browsers available for personal computers include Internet Explorer,
Mozilla Firefox, Safari, Netscape, and Opera in order of descending popularity (as of
August 2006).[1] Web browsers are the most commonly used type of HTTP user agent.
Although browsers are typically used to access the World Wide Web, they can also be
used to access information provided by web servers in private networks or content in file
systems.
Protocols and standards
Web browsers communicate with web servers primarily using HTTP (hypertext transfer
protocol) to fetch webpages. HTTP allows web browsers to submit information to web
servers as well as fetch web pages from them. The most commonly used version of HTTP
is HTTP/1.1, which is fully defined in RFC 2616. HTTP/1.1 imposes requirements that
Internet Explorer does not fully support, but most other current-generation web browsers
do.
Pages are located by means of a URL (uniform resource locator), which is treated as an
address, beginning with http: for HTTP access. Many browsers also support a variety of
other URL types and their corresponding protocols, such as ftp: for FTP (file transfer
protocol), rtsp: for RTSP (real-time streaming protocol), and https: for HTTPS (an SSL-
encrypted version of HTTP).
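For illustration, the scheme prefix of a URL tells the browser which protocol to use; the
following addresses are invented examples only:
    http://www.example.com/index.html     (fetched over HTTP)
    https://www.example.com/account       (fetched over HTTP secured with SSL)
    ftp://ftp.example.com/pub/readme.txt  (fetched over FTP)
    rtsp://media.example.com/lecture      (streamed over RTSP)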
The file format for a web page is usually HTML (hyper-text markup language) and is
identified in the HTTP protocol using a MIME content type. Most browsers natively
support a variety of formats in addition to HTML, such as the JPEG, PNG and GIF image
formats, and can be extended to support more through the use of plugins. The
combination of HTTP content type and URL protocol specification allows web page
designers to embed images, animations, video, sound, and streaming media into a web
page, or to make them accessible through the web page.
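As a small, hypothetical illustration (the file names and addresses are invented), a page
might embed an image directly and link out to a streamed clip:
    <img src="diagram.png" alt="Network diagram">
    <a href="rtsp://media.example.com/lecture">Watch the streaming lecture</a>
When the browser requests diagram.png, the server identifies it in the HTTP response with
the MIME content type image/png, so the browser knows to render it as a PNG image.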
Early web browsers supported only a very simple version of HTML. The rapid
development of proprietary web browsers led to the development of non-standard dialects
of HTML, leading to problems with Web interoperability. Modern web browsers support
a combination of standards-based and de facto HTML and XHTML, which should
display in the same way across all browsers. No browser fully supports HTML 4.01,
XHTML 1.x or CSS 2.1 yet. Currently many sites are designed using WYSIWYG HTML
generation programs such as Macromedia Dreamweaver or Microsoft FrontPage. These
often generate non-standard HTML by default, hindering the work of the W3C in
developing standards, specifically with XHTML and CSS (Cascading Style Sheets, used
for page layout).
Some of the more popular browsers include additional components to support Usenet
news, IRC (Internet relay chat), and e-mail. Protocols supported may include NNTP
(network news transfer protocol), SMTP (simple mail transfer protocol), IMAP (Internet
message access protocol), and POP (post office protocol). These browsers are often
referred to as Internet suites or application suites rather than merely web browsers.
Brief history
A NeXTcube was used by Tim Berners-Lee (who pioneered the use of hypertext for
sharing information) as the world's first web server, and also to write the first web
browser, WorldWideWeb in 1990. Berners-Lee introduced it to colleagues at CERN in
March 1991. Since then the development of web browsers has been inseparably
intertwined with the development of the web itself.
The first browser, Silversmith, was created by John Bottoms in 1987.[2] The browser,
based on SGML tags, used a tag set from the Electronic Document Project of the AAP
with minor modifications and was sold to a number of early adopters. At the time SGML
was used exclusively for the formatting of printed documents. The use of SGML for
electronically displayed documents signaled a shift in electronic publishing and was met
with considerable resistance. Silversmith included an integrated indexer, full-text
search, hypertext links between images, text and sound using SGML tags, and a return
stack for use with hypertext links. It included capabilities that are still not available in
today's browsers, such as the ability to restrict searches within document structures,
searches on indexed documents using wildcards, and the ability to search on tag attribute
values and attribute names.
In 1992, Tony Johnson released the MidasWWW browser. Based on Motif/X,
MidasWWW allowed viewing of PostScript files on the Web from Unix and VMS, and
even handled compressed PostScript.
Another early popular web browser was ViolaWWW, which was modeled after
HyperCard. However, the explosion in popularity of the web was triggered by NCSA
Mosaic which was a graphical browser running originally on Unix but soon ported to the
Apple Macintosh and Microsoft Windows platforms. Version 1.0 was released in
September 1993, and was dubbed the killer application of the Internet. Marc Andreessen,
who was the leader of the Mosaic team at NCSA, quit to form a company that would later
be known as Netscape Communications Corporation.
Netscape released its flagship Navigator product in October 1994, and it took off the next
year. Microsoft, which had thus far not marketed a browser, now entered the fray with its
Internet Explorer product, purchased from Spyglass Inc. This began what is known as the
browser wars, the fight for the web browser market between Microsoft and Netscape.
The wars put the web in the hands of millions of ordinary PC users, but showed how
commercialization of the web could stymie standards efforts. Both Microsoft and
Netscape liberally incorporated proprietary extensions to HTML in their products, and
tried to gain an edge by product differentiation. Starting with the W3C's acceptance of
Microsoft's proposed Cascading Style Sheets over Netscape's JavaScript Style Sheets
(JSSS), the Netscape browser came to be generally considered inferior to Microsoft's,
version after version, in features, application robustness, and standards compliance. The
wars effectively ended in 1998 when it became
clear that Netscape's declining market share trend was irreversible. This trend may have
been due in part to Microsoft's integrating its browser with its operating system and
bundling deals with OEMs; Microsoft faced antitrust litigation on these charges.
Netscape responded by open sourcing its product, creating Mozilla. This did nothing to
slow Netscape's declining market share. The company was purchased by America Online
in late 1998. At first, the Mozilla project struggled to attract developers, but by 2002 it
had evolved into a relatively stable and powerful internet suite. Mozilla 1.0 was released
to mark this milestone. Also in 2002, a spin off project that would eventually become the
popular Mozilla Firefox was released. In 2004, Firefox 1.0 was released; Firefox 1.5 was
released in November 2005. Firefox 2, a major update, was released in October 2006 and
work has already begun on Firefox 3 which is scheduled for release in 2007. As of 2006,
Mozilla and its derivatives account for approximately 12% of web traffic.
Opera, an innovative, speedy browser popular on handheld devices, particularly mobile
phones, as well as on PCs in some countries, was released in 1996 and remains a niche
player in the PC web browser market. It is available on Nintendo's DS, DS Lite and Wii
consoles[2]. The Opera Mini browser uses the Presto layout engine, like all versions of
Opera, but runs on most phones supporting Java MIDlets.
The Lynx browser remains popular with Unix shell users and with vision-impaired users
due to its entirely text-based nature. There are also several text-mode browsers with
advanced features, such as w3m, Links (which can operate both in text and graphical
mode), and the Links forks such as ELinks.
The Macintosh scene too has traditionally been dominated by Internet Explorer and
Netscape. However, Apple's Safari, the default browser on Mac OS X since version 10.3,
has slowly grown to dominate this market.
In 2003, Microsoft announced that Internet Explorer would no longer be made available
as a separate product but would be part of the evolution of its Windows platform, and that
no more releases for the Macintosh would be made. However, in early 2005, Microsoft
changed its plans, releasing version 7 of Internet Explorer for Windows XP, Windows
Server 2003, and Windows Vista in October 2006.
Features
Different browsers can be distinguished from each other by the features they support.
Modern browsers and web pages tend to utilize many features and techniques that did not
exist in the early days of the web. As noted earlier, with the browser wars there was a
rapid and chaotic expansion of browser and World Wide Web feature sets.
The following is a list of some of the most notable features:
   •   Standards support
   •   HTTP and HTTPS
    •   HTML, XML and XHTML
    •   Graphics file formats including GIF, PNG, JPEG, and SVG
    •   Cascading Style Sheets (CSS)
    •   JavaScript (Dynamic HTML) and XMLHttpRequest
    •   Cookie
    •   Digital certificates
    •   Favicons
    •   RSS, Atom
Fundamental features
    •   Bookmark manager
    •   Caching of web contents
    •   Support of media types via plugins such as Macromedia Flash and QuickTime
Usability and accessibility features
    •   Autocompletion of URLs and form data
    •   Tabbed browsing
    •   Spatial navigation
    •   Caret navigation
    •   Screen reader or full speech support




                                         HTML
HTML, short for HyperText Markup Language, is the predominant markup language
for the creation of web pages. It provides a means to describe the structure of text-based
information in a document — by denoting certain text as headings, paragraphs, lists, and
so on — and to supplement that text with interactive forms, embedded images, and other
objects. HTML is written in the form of labels (known as tags), created by greater-than
signs (>) and less-than signs (<). HTML can also describe, to some degree, the
appearance and semantics of a document, and can include embedded scripting language
code which can affect the behavior of web browsers and other HTML processors.
HTML is also often used to refer to content of the MIME type text/html or even more
broadly as a generic term for HTML whether in its XML-descended form (such as
XHTML 1.0 and later) or its form descended directly from SGML (such as HTML 4.01
and earlier).
   What is HTML?
HTML stands for Hypertext Markup Language.
Hypertext is ordinary text that has been dressed up with extra features, such as
formatting, images, multimedia, and links to other documents.
Markup is the process of taking ordinary text and adding extra symbols. Each of the
symbols used for markup in HTML is a command that tells a browser how to display the
text.
History of HTML
Tim Berners-Lee created the original HTML (and many associated protocols such as
HTTP) on a NeXTcube workstation using the NeXTSTEP development environment. At
the time, HTML was not a specification, but a collection of tools to solve an immediate
problem: the communication and dissemination of ongoing research among Berners-Lee
and a group of his colleagues. His solution later combined with the emerging
international and public internet to garner worldwide attention.
Early versions of HTML were defined with loose syntactic rules, which helped its
adoption by those unfamiliar with web publishing. Web browsers commonly made
assumptions about intent and proceeded with rendering of the page. Over time, as the use
of authoring tools increased, the trend in the official standards has been to create an
increasingly strict language syntax. However, browsers still continue to render pages that
are far from valid HTML.
HTML is defined in formal specifications that were developed and published throughout
the 1990s, inspired by Tim Berners-Lee's prior proposals to graft hypertext capability
onto a homegrown SGML-like markup language for the Internet. The first published
specification for a language called HTML was drafted by Berners-Lee with Dan
Connolly, and was published in 1993 by the IETF as a formal "application" of SGML
(with an SGML Document Type Definition defining the grammar). The IETF created an
HTML Working Group in 1994 and published HTML 2.0 in 1995, but further
development under the auspices of the IETF was stalled by competing interests. Since
1996, the HTML specifications have been maintained, with input from commercial
software vendors, by the World Wide Web Consortium (W3C).[1] However, in 2000,
HTML also became an international standard (ISO/IEC 15445:2000). The last HTML
specification published by the W3C is the HTML 4.01 Recommendation, published in
late 1999 and its issues and errors were last acknowledged by errata published in 2001.
Since the publication of HTML 4.0 in late 1997, the W3C's HTML Working Group has
increasingly — and from 2002 through 2006, exclusively — focused on the development
of XHTML, an XML-based counterpart to HTML that is described on one W3C web
page as HTML's "successor".[2][3][4] XHTML applies the more rigorous, less ambiguous
syntax requirements of XML to HTML to make it easier to process and extend, and as
support for XHTML has increased in browsers and tools, it has been embraced by many
web standards advocates in preference to HTML. XHTML is routinely characterized by
mass-media publications for both general and technical audiences as the newest "version"
of HTML, but W3C publications, as of 2006, do not make such a claim; neither HTML
3.2 nor HTML 4.01 have been explicitly rescinded, deprecated, or superseded by any
W3C publications, and, as of 2006, they continue to be listed alongside XHTML as
current Recommendations in the W3C's primary publication indices.[5][6][7]
In November 2006, the HTML Working Group published a new charter indicating its
intent to resume development of HTML in a manner that unifies HTML 4 and XHTML
1, allowing for this hybrid language to manifest in both an XML format and a "classic
HTML" format that is SGML-compatible but not strictly SGML-based. Among other
things, it is planned that the new specification, to be released and refined throughout 2007
through 2008, will include conformance and parsing requirements, DOM APIs, and new
widgets and APIs. The group also intends to publish test suites and validation tools.[8]
Version history of the standard
Hypertext Markup Language (First Version), published June 1993 as an Internet
Engineering Task Force (IETF) working draft (not standard).
HTML 2.0, published November 1995 as IETF RFC 1866, supplemented by RFC 1867
(form-based file upload) that same month, RFC 1942 (tables) in May 1996, RFC 1980
(client-side image maps) in August 1996, and RFC 2070 (internationalization) in January
1997; ultimately all were declared obsolete/historic by RFC 2854 in June 2000.
HTML 3.2, published January 14, 1997 as a W3C Recommendation.
HTML 4.0, published December 18, 1997 as a W3C Recommendation. It offers three
"flavors":
Strict, in which deprecated elements are forbidden
Transitional, in which deprecated elements are allowed
Frameset, in which mostly only frame related elements are allowed
HTML 4.01, published December 24, 1999 as a W3C Recommendation. It offers the
same three flavors as HTML 4.0, and its last errata was published May 12, 2001.
ISO/IEC 15445:2000 ("ISO HTML", based on HTML 4.01 Strict), published May 15,
2000 as an ISO/IEC international standard.
HTML 4.01 and ISO/IEC 15445:2000 are the most recent and final versions of HTML.
XHTML is a separate language that began as a reformulation of HTML 4.01 using XML
1.0. It continues to be developed:
XHTML 1.0, published January 26, 2000 as a W3C Recommendation, later revised and
republished August 1, 2002. It offers the same three flavors as HTML 4.0 and 4.01,
reformulated in XML, with minor restrictions.
XHTML 1.1, published May 31, 2001 as a W3C Recommendation. It is based on
XHTML 1.0 Strict, but includes minor changes and is reformulated using modules from
Modularization of XHTML, which was published April 10, 2001 as a W3C
Recommendation.
XHTML 2.0 is still a W3C Working Draft.
There is no official standard HTML 1.0 specification because there were multiple
informal HTML standards at the time. Berners-Lee's original version did not include an
IMG element type. Work on a successor for HTML, then called "HTML+", began in late
1993, designed originally to be "A superset of HTML…which will allow a gradual
rollover from the previous format of HTML". The first formal specification was therefore
given the version number 2.0 in order to distinguish it from these unofficial "standards".
Work on HTML+ continued, but it never became a standard.
The HTML 3.0 standard was proposed by the newly formed W3C in March 1995, and
provided many new capabilities such as support for tables, text flow around figures, and
the display of complex math elements. Even though it was designed to be compatible
with HTML 2.0, it was too complex at the time to be implemented, and when the draft
expired in September 1995, work in this direction was discontinued due to lack of
browser support. HTML 3.1 was never officially proposed, and the next standard
proposal was HTML 3.2 (code-named "Wilbur"), which dropped the majority of the new
features in HTML 3.0 and instead adopted many browser-specific element types and
attributes which had been created for the Netscape and Mosaic web browsers. Math
support as proposed by HTML 3.0 finally came about years later with a different
standard, MathML.
HTML 4.0 likewise adopted many browser-specific element types and attributes, but at
the same time began to try to "clean up" the standard by marking some of them as
deprecated, and suggesting they not be used.
Minor editorial revisions to the HTML 4.0 specification were published as HTML 4.01.
The most common filename extension for files containing HTML is .html. However,
older operating systems and filesystems, such as the DOS versions from the 80's and
early 90's and FAT, limit file extensions to three letters, so a .htm extension is also used.
Although perhaps less common now, the shorter form is still widely supported by current
software.
HTML as a hypertext format
HTML is the basis of a comparatively weak hypertext implementation. Earlier hypertext
systems had features such as typed links, transclusion and source tracking. Another
feature lacking today is fat links.[9]
Even some hypertext features that were in early versions of HTML have been ignored by
most popular web browsers until recently, such as the link element and editable web
pages.
Sometimes web services or browser manufacturers remedy these shortcomings. For
instance, members of the modern social software landscape such as wikis and content
management systems allow surfers to edit the web pages they visit.
HTML markup
HTML markup consists of several types of components, including elements, attributes,
data types and character references.
The Document Type Definition
In order to enable Document Type Definition (DTD)-based validation with SGML tools
and in order to avoid the Quirks mode in browsers, all HTML documents should start
with a Document Type Declaration (informally, a "DOCTYPE"). The DTD contains
machine readable grammar specifying the permitted and prohibited content for a
document conforming to such a DTD. Browsers do not read the DTD, however. Browsers
only look at the doctype in order to decide the layout mode. Not all doctypes trigger the
Standards layout mode avoiding the Quirks mode. For example:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
This declaration references the Strict DTD of HTML 4.01, which does not have
presentational elements like <font>, leaving formatting to Cascading Style Sheets.
SGML-based validators read the DTD in order to properly parse the document and to
perform validation. In modern browsers, the HTML 4.01 Strict doctype activates the
Standards layout mode for CSS as opposed to the Quirks mode.
In addition, HTML 4.01 provides Transitional and Frameset DTDs. The Transitional
DTD was intended to gradually phase in the changes made in the Strict DTD, while the
Frameset DTD was intended for those documents which contained frames.
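The corresponding declarations, as published by the W3C, are:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN"
"http://www.w3.org/TR/html4/frameset.dtd">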
Elements
See HTML elements for more detailed descriptions.
Elements are the basic structure for HTML markup. Elements have two basic properties:
attributes and content. Each attribute and each element's content has certain restrictions
that must be followed for an HTML document to be considered valid. An element usually
has a start tag (e.g. <element-name>) and an end tag (e.g. </element-name>). The
element's attributes are contained in the start tag and the content is located between the
tags (e.g. <element-name>Content</element-name>). Some elements, such as <br>, will
never have any content and do not need closing tags. Listed below are several types of
markup elements used in HTML.
Structural markup describes the purpose of text. For example, <h2>Golf</h2>
establishes "Golf" as a second-level heading, which most browsers render on its own line
in a larger, bold font with space above and below it. Structural markup does not denote
any specific rendering, but most web browsers have standardized on how elements should
be formatted. Further styling should be done with Cascading Style Sheets (CSS).
Presentational markup describes the appearance of the text, regardless of its function.
For example, <b>boldface</b> indicates that visual output devices should render
"boldface" in bold text, but has no clear semantics for aural devices that read the text
aloud for the sight-impaired. In the case of both <b>bold</b> and <i>italic</i>, there
are elements which usually have an equivalent visual rendering but are more semantic in
nature, namely <strong>strong emphasis</strong> and <em>emphasis</em>
respectively. It is easier to see how an aural user agent should interpret the latter two
elements. However, they are not equivalent to their presentational counterparts: it would
be undesirable for a screen reader to emphasize the name of a book, for instance, but on
screen such a name would be italicized. Most presentational markup elements have
become deprecated under the HTML 4.0 specification, in favor of CSS-based style
design.
Hypertext markup links parts of the document to other documents. HTML up through
XHTML 1.1 requires the use of an anchor element to create a hyperlink in the flow of
text: <a>Wikipedia</a>. However, the href attribute must also be set to a valid URL, so
for example the HTML code <a href="http://en.wikipedia.org/">Wikipedia</a> will
render the word "Wikipedia" as a hyperlink. To view the HTML source of a web page,
choose View > Source (or the equivalent) in the browser menu.
Attributes
The attributes of an element are name-value pairs, separated by "=", and written within
the start tag of an element, after the element's name. The value should be enclosed in
single or double quotes, although values consisting of certain characters can be left
unquoted in HTML (but not XHTML).[10][11] Leaving attribute values unquoted is
considered unsafe.[12]
Most elements take any of several common attributes: id, class, style and title. Most
also take language-related attributes: lang and dir.
The id attribute provides a document-wide unique identifier for an element. This can be
used by stylesheets to provide presentational properties, by browsers to focus attention on
the specific element or by scripts to alter the contents or presentation of an element. The
class attribute provides a way of classifying similar elements for presentation purposes.

For example, an HTML document (or a set of documents) may use the designation
class="notation" to indicate that all elements with this class value are subordinate
to the main text of the document (or documents). Such notation-class elements might
be gathered together and presented as footnotes on a page, rather than appearing in the
place where they occur in the source HTML.
An author may use the style attribute to assign presentational properties to a particular
element. It is considered better practice to use an element's id or class attribute and
select the element with a stylesheet, though sometimes this can be too cumbersome for a
simple ad hoc application of styled properties. The title attribute is used to attach a
subtextual explanation to an element. In most browsers this title attribute is displayed
as what is often referred to as a tooltip. The generic inline span element can be used to
demonstrate these various attributes, as in the example below.
<span id='anId' class='aClass' style='color:red;' title='HyperText Markup Language'>HTML</span>
which displays as HTML (pointing the cursor at the abbreviation should display the title
text in most browsers).
Other markup
As of version 4.0, HTML defines a set of 252 character entity references and a set of
1,114,050 numeric character references, both of which allow individual characters to be
written via simple markup, rather than literally. A literal character and its markup
equivalent are considered equivalent and are rendered identically.
The ability to "escape" characters in this way allows for the characters "<" and "&"
(when written as &lt; and &amp;, respectively) to be interpreted as character data, rather
than markup. For example, a literal "<" normally indicates the start of a tag, and "&"
normally indicates the start of a character entity reference or numeric character reference;
writing it as "&amp;" or "&#38;" allows "&" to be included in the content of elements or
the values of attributes. The double-quote character, ", when used to quote an attribute
value, must also be escaped as "&quot;" or "&#34;" when it appears within the
attribute value itself.
these characters, browsers tend to be very forgiving, treating them as markup only when
subsequent text appears to confirm that intent.
Escaping also allows for characters that are not easily typed or that aren't even available
in the document's character encoding to be represented within the element and attribute
content. For example, "é", a character typically found only on Western European
keyboards, can be written in any HTML document as the entity reference &eacute; or as
the numeric references &#233; or &#xE9;. The characters comprising those references
(that is, the "&", the ";", the letters in "eacute", and so on) are available on all keyboards
and are supported in all character encodings, whereas the literal "é" is not.
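For example, the following invented fragment uses references both to escape markup
characters and to produce characters that may be absent from the document's encoding:
    <p>AT&amp;T serves caf&eacute; au lait for &#163;2 (&quot;cheap!&quot;).</p>
A browser renders this as: AT&T serves café au lait for £2 ("cheap!").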
HTML also defines several data types for element content, such as script data and
stylesheet data, and a plethora of types for attribute values, including IDs, names, URIs,
numbers, units of length, languages, media descriptors, colors, character encodings, dates
and times, and so on. All of these data types are specializations of character data.
Semantic HTML
There is no official specification called "Semantic HTML", though the strict flavors of
HTML discussed below are a push in that direction. Rather, semantic HTML refers to an
objective and a practice to create documents with HTML that contain only the author's
intended meaning, without any reference to how this meaning is presented or conveyed.
A classic example is the distinction between the emphasis element (<em>) and the italics
element (<i>). Often the emphasis element is displayed in italics, so the presentation is
typically the same. However, emphasizing something is different from listing the title of
a book, for example, which may also be displayed in italics. In purely semantic HTML, a
book title would use a different element from the one used for emphasized text (for
example, a <span> with a suitable class), because they are meaningfully different things.
The goal of semantic HTML requires two things of authors:
1) to avoid the use of presentational markup (elements, attributes and other entities); and
2) to use the available markup to differentiate the meanings of phrases and structure in
the document. So, for example, the book title from above would need its own element and
class specified, such as <cite class="booktitle">The Grapes of Wrath</cite>. Here,
the <cite> element is used because it most closely matches the meaning of this phrase in
the text. However, the <cite> element alone is not specific enough for this task, since we
mean to cite specifically a book title as opposed to a newspaper article or an academic
journal; the class attribute supplies that extra distinction.
Semantic HTML also requires complementary specifications and software compliance
with these specifications. Primarily, the development and proliferation of CSS has led to
increasing support for semantic HTML because CSS provides designers with a rich
language to alter the presentation of semantic-only documents. With the development of
CSS the need to include presentational properties in a document has virtually
disappeared. With the advent and refinement of CSS and the increasing support for it in
web browsers, subsequent editions of HTML increasingly stress only using markup that
suggests the semantic structure and phrasing of the document, like headings, paragraphs,
quotes, and lists, instead of using markup which is written for visual purposes only, like
<font>, <b> (bold), and <i> (italics). Some of these elements are not permitted in certain

varieties of HTML, like HTML 4.01 Strict. CSS provides a way to separate document
semantics from the content's presentation, by keeping everything relevant to presentation
defined in a CSS file. See separation of style and content.
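As a small sketch of this separation (the class name and style rules are invented for
illustration), the HTML carries only meaning while a linked CSS file supplies the
presentation:
    <h1>Steinbeck's Novels</h1>
    <p>His best-known work is <cite class="booktitle">The Grapes of Wrath</cite>.</p>
and, in the stylesheet:
    h1             { font-family: sans-serif; }
    cite.booktitle { font-style: italic; }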
Semantic HTML offers many advantages. First, it ensures consistency in style across
elements that have the same meaning. Every heading, every quotation mark, every
similar element receives the same presentation properties.
Second, semantic HTML frees authors from the need to concern themselves with
presentation details. When writing the number two, for example, should it be written out
in words ("two"), or should it be written as a numeral (2)? A semantic markup might
enter something like <number>2</number> and leave presentation details to the
stylesheet designers. Similarly, an author might wonder where to break out quotations
into separate indented blocks of text - with purely semantic HTML, such details would be
left up to stylesheet designers. Authors would simply indicate quotations when they occur
in the text, and not concern themselves with presentation.
A third advantage is device independence and repurposing of documents. A semantic
HTML document can be paired with any number of stylesheets to provide output to
computer screens (through web browsers), high-resolution printers, handheld devices,
aural browsers or braille devices for those with visual impairments, and so on. To
accomplish this nothing needs to be changed in a well coded semantic HTML document.
Readily available stylesheets make this a simple matter of pairing a semantic HTML
document with the appropriate stylesheets (of course, the stylesheet's selectors need to
match the appropriate properties in the HTML document).
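For instance, the head of a document might pair the same markup with different
stylesheets for different output devices (the file names here are invented):
    <link rel="stylesheet" type="text/css" media="screen" href="screen.css">
    <link rel="stylesheet" type="text/css" media="print" href="print.css">
    <link rel="stylesheet" type="text/css" media="aural" href="speech.css">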
Some aspects of authoring documents make separating semantics from style (in other
words, meaning from presentation) difficult. Some elements are hybrids, using
presentation in their very meaning. For example, a table displays content in a tabular
form. Often this content only conveys the meaning when presented in this way.
Repurposing a table for an aural device typically involves somehow presenting the table
as an inherently visual element in an audible form. On the other hand, we frequently
take lyrical songs — something inherently meant for audible presentation — and instead
present them in textual form on a web page. For these types of elements, the
meaning is not so easily separated from their presentation. However, for a great many of
the elements used and meanings conveyed in HTML the translation is relatively smooth.
Delivery of HTML
HTML documents can be delivered by the same means as any other computer file;
however, they are most often delivered in one of two forms: over HTTP from a web
server, or by e-mail.
Publishing HTML with HTTP
The World Wide Web is primarily composed of HTML documents transmitted from a
web server to a web browser using the HyperText Transfer Protocol (HTTP). However,
HTTP can be used to serve images, sound and other content in addition to HTML. To
allow the web browser to know how to handle the document it received, an indication of
the file format of the document must be transmitted along with the document. This vital
metadata includes the MIME type (text/html for HTML 4.01 and earlier,
application/xhtml+xml for XHTML 1.0 and later) and the character encoding (see

Character encodings in HTML).
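For example, a server's response for an HTML page might begin with headers along these
lines (a simplified, hypothetical exchange), followed by the document itself:
    HTTP/1.1 200 OK
    Content-Type: text/html; charset=UTF-8

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
    "http://www.w3.org/TR/html4/strict.dtd">
    <html>...</html>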
In modern browsers, the MIME type that is sent with the HTML document affects how
the document is interpreted. A document sent with an XHTML MIME type, or served as
application/xhtml+xml, is expected to be well-formed XML and a syntax error may cause
the browser to fail to render the document. The same document sent with an HTML
MIME type, or served as text/html, might be displayed, since web browsers are more
lenient with HTML. However, XHTML parsed this way is considered neither proper
XHTML nor HTML, but so-called tag soup.
If the MIME type is not recognized as HTML, the web browser should not attempt to
render the document as HTML, even if the document is prefaced with a correct
Document Type Declaration. Nevertheless, some web browsers do examine the contents
or URL of the document and attempt to infer the file type, despite this being forbidden by
the HTTP 1.1 specification.
HTML e-mail
Most graphical e-mail clients allow the use of a subset of HTML (often ill-defined) to
provide formatting and semantic markup capabilities not available with plain text, like
emphasized text, block quotations for replies, and diagrams or mathematical formulas
that couldn't easily be described otherwise. Many of these clients include both a GUI
editor for composing HTML e-mails and a rendering engine for displaying received
HTML e-mails. Use of HTML in e-mail is controversial due to compatibility issues,
because it can be used in phishing/privacy attacks, because it can confuse spam filters,
and because the message size is larger than plain text.
Current flavors of HTML
Since its inception, HTML and its associated protocols gained acceptance relatively
quickly. However, no clear standards existed in the early years of the language. Though
its creators originally conceived of HTML as a semantic language devoid of presentation
details, practical uses pushed many presentational elements and attributes into the
language, driven largely by the various browser vendors. The latest standards
surrounding HTML reflect efforts to overcome the sometimes chaotic development of the
language and to create a rational foundation to build both meaningful and well-presented
documents. To return HTML to its role as a semantic language, the W3C has developed
style languages such as CSS and XSL to shoulder the burden of presentation. In
conjunction the HTML specification has slowly reined in the presentational elements
within the specification.
There are two axes differentiating various flavors of HTML as currently specified:
SGML-based HTML versus XML-based HTML (referred to as XHTML) on the one axis
and strict versus transitional (loose) versus frameset on the other axis.
Traditional versus XML-based HTML
One difference in the latest HTML specifications lies in the distinction between the
SGML-based specification and the XML-based specification. The XML-based
specification is often called XHTML to clearly distinguish it from the more traditional
definition; however, the root element name continues to be HTML even in the XHTML-
specified HTML. The W3C intends XHTML 1.0 to be identical with HTML 4.01 except
in the often stricter requirements of XML over traditional HTML. XHTML 1.0 likewise
has three sub-specifications: strict, loose and frameset. The strictness of XHTML in terms
of its syntax is often confused with the strictness of the strict versus the loose definitions
in terms of the content rules of the specifications. The strictness of XML lies in the need
to always explicitly close every element (for example, </p>) and to always use quotation
marks (double " or single ') to enclose attribute values. The use of implied closing tags
in HTML led to confusion for both editors and parsers.
Aside from the different opening declarations for a document, the differences between
HTML 4.01 and XHTML 1.0 — in each of the corresponding DTDs — are largely
syntactic. Adhering to valid and well-formed XHTML 1.0 will result in a well-formed
HTML 4.01 document in every way except one. XHTML introduces a new piece of
markup, the self-closing tag, as shorthand for handling empty elements. The shorthand
adds a slash (/) at the end of an opening tag, like this: <br/>. The introduction of this
shorthand, undefined in any HTML 4.01 DTD, may confuse earlier software unfamiliar
with the new convention. To help with the transition, the W3C recommends also including
a space character before the slash, like this: <br />. As validators and browsers adapt to
this evolution in the standard, the migration from traditional to XML-based HTML should
be relatively simple. The major problems occur when software does not conform to HTML
4.01 and its associated protocols to begin with, or erroneously implements the HTML
recommendations.
To understand the subtle differences between HTML and XHTML, consider the
transformation of a valid and well-formed XHTML 1.0 document into a valid and well-
formed HTML 4.01 document. Making this translation requires the following steps:
The language code for an element should be specified with a lang attribute rather than
the XHTML xml:lang attribute (HTML 4.01 defines its own attribute for language,
whereas XHTML uses the attribute defined by XML).
Remove the XML namespace (xmlns=URI). HTML does not require and has no
facilities for namespaces.
Change the DTD declaration from XHTML 1.0 to HTML 4.01 (see the DTD section
above for further explanation).
If present, remove the XML declaration (Typically this is: <?xml version="1.0"
encoding="utf-8"?>).

Change the document’s MIME type to text/html. This may come from a meta element,
from the HTTP header of the server, or possibly from a filename extension (for example,
change .xhtml to .html).
Change the XML empty-element shorthand to a standard opening tag (<br/> to <br>).
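To make the steps concrete, here is a minimal, invented document shown first as
XHTML 1.0 and then after translation to HTML 4.01:
    <?xml version="1.0" encoding="utf-8"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
    <head><title>Example</title></head>
    <body><p>First line.<br />Second line.</p></body>
    </html>
becomes, after the translation:
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
    "http://www.w3.org/TR/html4/strict.dtd">
    <html lang="en">
    <head><title>Example</title></head>
    <body><p>First line.<br>Second line.</p></body>
    </html>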
Those are the only changes necessary to translate a document from XHTML 1.0 to
HTML 4.01. The reverse operation can be much more complicated. HTML 4.01 allows
the omission of many tags in a complex pattern derived by determining which tags are
(in some sense) redundant for a valid document. In other words, if the document is
authored precisely to the associated HTML 4.01 content model, some tags need not be
expressed. For example, since a paragraph cannot contain another paragraph, when an
opening paragraph tag is followed by another opening paragraph tag, this implies the
previous paragraph element is now closed. Similarly, elements such as br have no
allowed content, so HTML does not require an explicit closing tag for this element.
Also, since HTML was the only specification targeted by user agents (browsers and other
HTML-consuming software), the specification even allows the omission of opening and
closing tags for html, head, and body, if the document's head has no content. To
translate from HTML to XHTML would first require the addition of any omitted closing
tags (or the use of the closing-tag shortcut for empty elements like <br/>).
Notice how XHTML’s requirement to always include explicit closing tags allows a
separation between the concepts of valid and well-formed. A well-formed XHTML
document adheres to all the syntax requirements of XML. A valid document adheres to
the content specification for XHTML; in other words, a valid document only includes
content, attributes and attribute values within each element in accord with the
specification. If a closing tag is omitted, an XHTML parser can first determine that the
document is not well-formed. Once the elements are all explicitly closed, the parser can
address the question of whether the document is also valid. For an HTML parser these
separate aspects of a document are not discernible. If a paragraph opening tag (p) is
followed by a div, is it because the document is not well-formed (the closing paragraph
tag is missing) or is the document invalid (a div does not belong in a paragraph)?
Whether coding in HTML or XHTML, it may be best to always include the optional
tags within an HTML document rather than remembering which tags can be omitted.
The W3C recommends several conventions to ensure an easy migration between HTML
and XHTML (see HTML Compatibility Guidelines). Basically the W3C recommends:
Including both xml:lang and lang attributes on any elements assigning language.
Using the self-closing tag only for elements specified as empty.
Making all tag names and attribute names lower-case.
Ensuring all attribute values are quoted with either single quotes (') or double quotes (").
Including an extra space in self-closing tags: for example <br /> instead of <br/>.
Including explicit closing tags for elements that permit content but are left empty (for
example, "<p></p>", not "<p />").
Note that by carefully following the W3C’s compatibility guidelines the difference
between the resulting HTML 4.01 document and the XHTML 1.0 document is merely the
DOCTYPE declaration, and the XML declaration preceding the document’s contents.
The W3C allows the resulting XHTML 1.0 (or any XHTML 1.0) document to be
delivered as either HTML or XHTML. For delivery as HTML, the document’s MIME
type should be set to 'text/html', while, for XHTML, the document’s MIME type should
be set to 'application/xhtml+xml'. When delivered as XHTML, browsers and other user
agents are expected to adhere strictly to the XML specifications in parsing, interpreting,
and displaying the document’s contents.
Transitional versus Strict
The latest SGML-based specification HTML 4.01 and the earliest XHTML version
include three sub-specifications: strict, transitional (also called loose), and frameset. The
difference between strict on the one hand and loose and frameset on the other is that the
strict definition tries to adhere more tightly to a presentation-free or style-free concept of
a semantic HTML. The loose standard maintains many of the various presentational
elements and attributes absent in the strict definition.
The primary differences between the transitional (loose) specification and the strict
specification (whether XHTML 1.0 or HTML 4.01) are listed below; a brief example
contrasting the two appears after the list.
A looser content model
Inline elements and character strings (#PCDATA) are allowed in: body, blockquote,
form, noscript, noframes

Presentation related elements
underline (u)
strike-through (s and strike)
center
font
basefont

Presentation related attributes
background and bgcolor attributes for body element.
align attribute on div, form, paragraph (p), and heading (h1...h6) elements

align, noshade, size, and width attributes on hr element

align, border, vspace, and hspace attributes on img and object elements

align attribute on legend and caption elements

align and bgcolor on table element

nowrap, bgcolor, width, height on td and th elements

bgcolor attribute on tr element

clear attribute on br element

compact attribute on dl, dir and menu elements

type, compact, and start attributes on ol and ul elements

type and value attributes on li element

width attribute on pre element

Additional elements in loose (transitional) specification
menu list (no substitute, though unordered list is recommended; may return in XHTML
2.0 specification)
dir list (no substitute, though unordered list is recommended)

isindex (element requires server-side support and is typically added to documents
server-side)
applet (deprecated in favor of object element)

The pre element does not allow: applet, font, and basefont (elements not defined in strict
DTD)
The language attribute on script element (presumably redundant with type attribute,
though this is maintained for legacy reasons).
Frame related entities
frameset element (used in place of body for frameset DTD)

frame element
iframe
noframes

target attribute on anchor, client-side image-map (imagemap), link, form, and base
elements
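As a brief, hypothetical contrast, markup that is valid only under the transitional DTD,
and a strict rewrite of the same content using CSS, might look like this:
    <p align="center"><font color="red">Warning!</font></p>
versus
    <p class="warning">Warning!</p>
with the stylesheet rule
    p.warning { text-align: center; color: red; }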
Frameset versus transitional
In addition to the above transitional differences, the frameset specifications (whether
XHTML 1.0 or HTML 4.01) specifies a different content model:
<html>
<head>
Any of the various head related elements.
</head>
<frameset>
At least one of either: another frameset or a frame and an optional noframes element.
</frameset>
</html>
Summary of flavors
As this list demonstrates, the loose flavors of the specification are maintained for legacy
support. However, contrary to popular misconceptions, the move to XHTML does not
imply a removal of this legacy support. Rather the X in XML stands for extensible and
the W3C is modularizing the entire specification and opening it up to independent
extensions. The primary achievement in the move from XHTML 1.0 to XHTML 1.1 is
the modularization of the entire specification. The strict version of HTML is deployed in
XHTML 1.1 through a set of modular extensions to the base XHTML 1.1 specification.
Likewise someone looking for the loose (transitional) or frameset specifications will find
similar extended XHTML 1.1 support (much of it is contained in the legacy or frame
modules). The modularization also allows for separate features to develop on their own
timetable. So, for example, XHTML 1.1 will allow quicker migration to emerging XML
standards such as MathML (a presentational and semantic math language based on XML)
and XForms — a new, highly advanced web-form technology to replace the existing
HTML forms.
In summary, the HTML 4.01 specification primarily reined in all the various HTML
implementations into a single clearly written specification based on SGML. XHTML 1.0
ported this specification, as is, to the new XML-defined specification. Next, XHTML 1.1
takes advantage of the extensible nature of XML and modularizes the whole
specification. XHTML 2.0 will be the first step in adding new features to the
specification in a standards-body-based approach.




                                       NetMeeting
Microsoft NetMeeting is a VoIP and multi-point videoconferencing client included in
many versions of Microsoft Windows (from Windows 95 OSR2 to Windows XP). It uses
the H.323 protocol for video and audio conferencing, and is interoperable with
OpenH323-based clients such as Ekiga, and with the Internet Locator Service (ILS) as a
mirror server. It also uses a slightly modified version of the ITU T.120 Protocol for
whiteboarding, application sharing, desktop sharing, remote desktop sharing (RDS) and
file transfers. The secondary Whiteboard in NetMeeting 2.1 and later utilizes the H.324
protocol.
Before video service became common on free IM clients such as Yahoo! Messenger and
MSN Messenger, NetMeeting was a popular way to perform video conferencing and
chatting over the Internet (with the help of public ILS servers).
Since the release of Windows XP, Microsoft has deprecated it in favour of Windows
Messenger, although it is still installed by default (Start > Run... > conf.exe). Note that
Windows Messenger, MSN Messenger and Windows Live Messenger hook directly into
NetMeeting for the application sharing, desktop sharing, and Whiteboard features
exposed by each application.
As of the release of Windows Vista, NetMeeting is no longer included and has been
replaced by Windows Meeting Space.


                                      Online chat
Online chat can refer to any kind of communication over the Internet, but primarily refers
to direct one-on-one chat or text-based group chat (formally also known as synchronous
conferencing), using tools such as instant messaging programs, Internet Relay Chat,
talkers and possibly MUDs, MUCKs, MUSHes and MOOs.
While many of the web's well-known custodians offer online chat and messaging services
for free, an increasing number of providers are beginning to show strong revenue streams
from paid-for services. Again, it is the adult service providers, profiting from the advent
of reliable, high-speed broadband (notably across Eastern Europe), who are at the
forefront of the paid-for online chat revolution.
For every business traveller engaging in a video call or conference call rather than
braving the check-in queue, there are countless web users replacing traditional
conversational means with online chat and messaging. Like e-mail, which has reduced the
need for letter, fax and memo communication, online chat is steadily replacing
telephony as a means of office and home communication. The early adopters in these
areas are undoubtedly teenage users of instant messaging. It might not be long before
SMS text messaging usage declines as mobile handsets provide the technology for online
chat.
Other forms of online communication that are not usually referred to as online chat
MUDs
A MUD, or multi-user dungeon, is a multi-user version of Dungeons and Dragons for the
Internet, and is an early use of the Internet. In a MUD, as well as playing the game,
people can chat with each other. Talkers were originally based on MUDs, and the earliest
versions of talkers were essentially MUDs without the gaming element. Other derivatives
of MUDs combined gaming with talking; these include MUSHes, MOOs and MUCKs.
Discussion boards
Besides real-time chat, another type of online community includes Internet forums and
bulletin board systems (BBSes), where users write posts (blocks of text) to which later
visitors may respond. Unlike the transient nature of chats, these systems generally archive
posts and save them for weeks or years. They can be used for technical troubleshooting,
advice, general conversation and more.
See also
General terms
   •    Chat room
   •    Web chat site
   •    Voice chat
   •    VoIP Voice over IP
   •    Live support software
   •    Online discussion
   •   Online discourse environment
Protocols/Programs
   •   Talker
   •   Internet Relay Chat
   •   Instant messenger
   •   PalTalk
   •   Talk (Unix)
   •   MUD
   •   MUSH
   •   MOO
   •   Google Talk
   •   Yahoo! Messenger
   •   Skype
   •   SILC
   •   Windows Live Messenger
   •   Campfire
Chat programs supporting multiple protocols
   •   Adium
   •   Gaim
   •   Miranda IM
   •   Trillian




                                        Plugins
A plugin (or plug-in) is a computer program that interacts with a main (or host)
application (a web browser or an email program, for example) to provide a certain,
usually very specific, function on-demand.
Typical examples are
   •   plugins that read or edit specific types of files (for instance, decode multimedia
       files)
   •   plugins that encrypt or decrypt e-mail (for instance, PGP)
   •   plugins that filter images in graphics programs in ways that the host application
       could not normally do
   •   plugins that play Flash presentations in a web browser
The host application provides services which the plugins can use, including a way for
plugins to register themselves with the host application and a protocol by which data is
exchanged with plugins. Plugins are dependent on these services provided by the main
application and do not usually work by themselves. Conversely, the main application is
independent of the plugins, making it possible for plugins to be added and updated
dynamically without changes to the main application.
Plugins are slightly different from extensions, which modify or add to existing
functionality. The main difference is that plugins generally rely on the main application's
user interface and have a well-defined boundary to their possible set of actions.
Extensions generally have fewer restrictions on their actions, and may provide their own
user interfaces. They sometimes are used to decrease the size of the main application and
offer optional functions. Mozilla Firefox uses a well-developed extension system to
reduce the feature creep that plagued the Mozilla Application Suite.
Perhaps the first software applications to include a plugin function were HyperCard and
QuarkXPress on the Macintosh, both released in 1987. In 1988, Silicon Beach Software
included plugin functionality in Digital Darkroom and SuperPaint, and the term plug-in
was coined by Ed Bomke. Currently, plugins are typically implemented as shared
libraries that must be installed in a place prescribed by the main application. HyperCard
supported a similar facility, but it was more common for the plugin code to be included in
the HyperCard documents (called stacks) themselves. This way, the HyperCard stack
became a self-contained application in its own right, which could be distributed as a
single entity that could be run by the user without the need for additional installation
steps.
Open application programming interfaces (APIs) provide a standard interface, allowing
third parties to create plugins that interact with the main application. A stable API allows
third-party plugins to function as the original version changes and to extend the lifecycle
of obsolete applications. The Adobe Photoshop and After Effects plugin APIs have
become a standard and been adopted to some extent by competing applications. Other
examples of such APIs include Audio Units and VST.
Examples
Many professional software packages offer plugin APIs to developers, in order to
increase the utility of the base product. Examples of these include:
   •     Eclipse
   •     GStreamer multimedia pipe handler
   •     jEdit Program Editor
   •     Quintessential Media Player, Winamp, foobar2000 and XMMS
   •     Notepad++
   •     OmniPeek packet analysis platform
   •     VST Audio Plugin Format




                                Communications protocol
In the field of telecommunications, a communications protocol is the set of standard
rules for data representation, signalling, authentication and error detection required to
send information over a communications channel. An example of a simple
communications protocol adapted to voice communication is the case of a radio
dispatcher talking to mobile stations. The communication protocols for digital computer
network communication have many features intended to ensure reliable interchange of
data over an imperfect communication channel. At its core, a communications protocol is
a set of agreed rules that the communicating parties follow so that the system works
properly.
Network protocol design principles
Systems engineering principles have been applied to create a set of common network
protocol design principles. These principles include effectiveness, reliability,
and resiliency.
Effectiveness
A protocol needs to be specified in such a way that engineers, designers, and in some cases
software developers can implement and/or use it. In human-machine systems, its design
needs to facilitate routine usage by humans. Protocol layering accomplishes these
objectives by dividing the protocol design into a number of smaller parts, each of which
performs closely related sub-tasks, and interacts with other layers of the protocol only in
a small number of well-defined ways.
Protocol layering allows the parts of a protocol to be designed and tested without a
combinatorial explosion of cases, keeping each design relatively simple. The
implementation of a sub-task on one layer can make assumptions about the behavior and
services offered by the layers beneath it. Thus, layering enables a "mix-and-match" of
protocols that permit familiar protocols to be adapted to unusual circumstances.
For an example that involves computing, consider an email protocol like the Simple Mail
Transfer Protocol (SMTP). An SMTP client can send messages to any server that
conforms to SMTP's specification. Actual applications can be (for example) an aircraft
with an SMTP server receiving messages from a ground controller over a radio-based
internet link. Any SMTP client can correctly interact with any SMTP server, because
they both conform to the same protocol specification, RFC 2821.
This paragraph informally provides some examples of layers, some required
functionalities, and some protocols that implement them, all from the realm of computing
protocols.
At the lowest level, bits are encoded in electrical, light or radio signals by the Physical
layer. Some examples include RS-232, SONET, and WiFi.
A somewhat higher Data link layer such as the point-to-point protocol (PPP) may detect
errors and configure the transmission system.
An even higher protocol may perform network functions. One very common protocol is
the Internet protocol (IP), which implements addressing for a large set of hosts and networks. A
common associated protocol is the Transmission control protocol (TCP) which
implements error detection and correction (by retransmission). TCP and IP are often
paired, giving rise to the familiar acronym TCP/IP.
A layer in charge of presentation might describe how to encode text (e.g. ASCII or Unicode).
An application protocol like SMTP may (among other things) describe how to inquire
about electronic mail messages.
These different tasks show why there's a need for a software architecture or reference
model that systematically places each task into context.
The reference model usually used for protocol layering is the OSI seven layer model,
which can be applied to any protocol, not just the OSI protocols of the International
Organization for Standardization (ISO). In particular, the Internet Protocol can be
analysed using the OSI model.
Reliability
Assuring reliability of data transmission involves error detection and correction, or some
means of requesting retransmission. It is a truism that communication media are always
faulty. The conventional measure of quality is the number of failed bits per bits
transmitted. This has the useful feature of being a dimensionless figure of merit that can
be compared across any speed or type of communication media.
In telephony, links with bit error rates (BER) of 10⁻⁴ or more are regarded as faulty (they
interfere with telephone conversations), while links with a BER of 10⁻⁵ or more should be
dealt with by routine maintenance (they can be heard).
Data transmission often requires bit error rates below 10⁻¹². Computer data transmissions
are so frequent that larger error rates would affect operations of customers like banks and
stock exchanges. Since most transmissions use networks with telephonic error rates, the
errors caused by these networks must be detected and then corrected.
Communications systems detect errors by transmitting a summary of the data with the
data. In TCP (the Internet's Transmission Control Protocol), the sum of the data bytes of the
packet is sent in each packet's header. Simple arithmetic sums do not detect out-of-order
data, or cancelling errors. A bit-wise binary polynomial, a cyclic redundancy check, can
detect these errors and more, but is slightly more expensive to calculate.
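To make the difference concrete, the sketch below compares a naive byte sum with a cyclic
redundancy check (illustrative only; TCP's real checksum is a 16-bit ones'-complement sum,
which is not shown here):

# Illustrative comparison: a plain arithmetic sum versus a CRC.
import zlib

original = b"HELLO WORLD"
reordered = b"WORLD HELLO"   # the same bytes, in a different order

# A simple sum of byte values cannot tell the two apart.
print(sum(original) == sum(reordered))                  # True  -> error undetected

# A cyclic redundancy check (CRC-32) detects the reordering.
print(zlib.crc32(original) == zlib.crc32(reordered))    # False -> error detected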
Communication systems correct errors by selectively resending bad parts of a message.
For example, in TCP when a checksum is bad, the packet is discarded. When a packet is
lost, the receiver acknowledges all of the packets up to, but not including the failed
packet. Eventually, the sender sees that too much time has elapsed without an
acknowledgement, so it resends all of the packets that have not been acknowledged. At
the same time, the sender backs off its rate of sending, in case the packet loss was caused
by saturation of the path between sender and receiver. (Note: this is an over-
simplification: see TCP and congestion collapse for more detail)
In general, the performance of TCP is severely degraded in conditions of high packet loss
(more than 0.1%), due to the need to resend packets repeatedly. For this reason, TCP/IP
connections are typically either run on highly reliable fiber networks, or over a lower-
level protocol with added error-detection and correction features (such as modem links
with ARQ). These connections typically have uncorrected bit error rates of 10⁻⁹ to 10⁻¹²,
ensuring high TCP/IP performance.
Resiliency
Resiliency addresses a form of network failure known as topological failure in which a
communications link is cut, or degrades below usable quality. Most modern
communication protocols periodically send messages to test a link. On T1 lines, for example,
a framing bit is sent with every frame of 24 channel samples. In phone systems, when "sync is lost", fail-safe
mechanisms reroute the signals around the failing equipment.
In packet switched networks, the equivalent functions are performed using router update
messages to detect loss of connectivity.
Standards organizations
Most recent protocols are designed by the IETF for Internet communications, and by the
IEEE or ISO for other types. The ITU-T handles telecommunications
protocols and formats for the public switched telephone network (PSTN). The ITU-R
handles protocols and formats for radio communications. As the PSTN, radio systems,
and the Internet converge, the different sets of standards are also being driven towards
technological convergence.
Protocol families
A number of major protocol stacks or families exist, including the following:
Open standards:
Internet protocol suite
Open Systems Interconnection (OSI)
Connection-oriented protocol
A connection-oriented networking protocol is one which identifies traffic flows by some
connection identifier rather than by explicitly listing source and destination addresses.
Typically, this connection identifier is a small integer (10 bits for Frame Relay, 24 for
ATM, for example). This makes network switches substantially faster (as routing tables
are just simple look-up tables, and are trivial to implement in hardware). The impact is so
great, in fact, that even characteristically connectionless protocols, such as IP traffic, are
being tagged with connection-oriented header prefixes (e.g., as with MPLS, or IPv6's
built-in Flow ID field).
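As a rough sketch of why small connection identifiers make switching cheap, the hypothetical
label table below forwards traffic with a single dictionary lookup (the ports and labels are
invented for illustration):

# Hypothetical label-switching table: (in_port, in_label) -> (out_port, out_label).
# Forwarding is a single table lookup, with no address parsing or
# longest-prefix matching required.
switch_table = {
    (1, 17): (3, 42),
    (2, 99): (1, 17),
}

def forward(in_port: int, in_label: int) -> tuple[int, int]:
    """Return the outgoing port and label for a labelled frame."""
    return switch_table[(in_port, in_label)]

print(forward(1, 17))   # (3, 42)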
Note that connection-oriented protocols are not necessarily reliable protocols. ATM and
Frame Relay, for example, are both connection-oriented but unreliable. There are also
reliable connectionless protocols, such as AX.25 when it passes data in I-frames, but this
combination is rare; reliable-connectionless operation is uncommon in commercial and
academic networks.
Note that connection-oriented protocols handle real-time traffic substantially more
efficiently than connectionless protocols, which is why ATM has yet to be replaced by
Ethernet for carrying real-time, isochronous traffic streams, especially in heavily
aggregated networks like backbones, where the motto "bandwidth is cheap" fails to
deliver on its promise. Experience has also shown that overprovisioning bandwidth does
not resolve all quality of service issues. Hence, (10-)gigabit Ethernet is not expected to
replace ATM at this time.
List of Connection-oriented protocols
   •   TCP
   •   Phone call (the user must dial and get an answer before transmitting data)
   •   ATM
   •   Frame Relay
Connectionless protocol
In telecommunications, connectionless describes communication between two network
end points in which a message can be sent from one end point to another without prior
arrangement. The device at one end of the communication transmits data to the other,
without first ensuring that the recipient is available and ready to receive the data. The
device sending a message simply sends it addressed to the intended recipient. As such
there are more frequent problems with transmission than with connection-oriented
protocols and it may be necessary to resend the data several times. Connectionless
protocols are often disfavoured by network administrators because it is much harder to
filter malicious packets from a connectionless protocol using a firewall. The Internet
Protocol (IP) and User Datagram Protocol (UDP) are connectionless protocols, but
TCP/IP (the most common use of IP) is connection-oriented.
Connectionless protocols are usually described as stateless because the endpoints have no
protocol-defined way to remember where they are in a "conversation" of message
exchanges. The alternative to the connectionless approach uses connection-oriented
protocols, which are sometimes described as stateful because they can keep track of a
conversation.
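For instance, sending a UDP datagram in Python involves no handshake at all; the sketch below
(the address and port are placeholders) simply addresses the message and sends it, whether or
not anything is listening:

# Minimal connectionless send: no connection is established before transmitting.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)   # UDP socket
sock.sendto(b"hello", ("127.0.0.1", 9999))                # fire and forget
sock.close()
# Whether the datagram arrives at all, or arrives in order, is not
# guaranteed by the protocol itself.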
List of Connectionless protocols
   •   IP
   •   UDP
   •   ICMP
   •   IPX
Protocol (computing)
In computing, a protocol is a convention or standard that controls or enables the
connection, communication, and data transfer between two computing endpoints. In its
simplest form, a protocol can be defined as the rules governing the syntax, semantics,
and synchronization of communication. Protocols may be implemented by hardware,
software, or a combination of the two. At the lowest level, a protocol defines the behavior
of a hardware connection.
   Typical properties
It is difficult to generalize about protocols because they vary so greatly in purpose and
sophistication. Most protocols specify one or more of the following properties:
   •   Detection of the underlying physical connection (wired or wireless), or the existence of the other endpoint or node
   •   Handshaking
   •   Negotiation of various connection characteristics
   •   How to start and end a message
   •   How to format a message
   •   What to do with corrupted or improperly formatted messages (error correction)
   •   How to detect unexpected loss of the connection, and what to do next
   •   Termination of the session or connection
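As a small, hypothetical illustration of the "how to start and end a message" and "how to
format a message" properties, the sketch below frames each message with a 4-byte length
prefix, one common convention rather than any particular standard:

# Length-prefixed framing: each message is preceded by its length as a
# 4-byte big-endian integer, so the receiver knows where the message ends.
import struct

def frame(payload: bytes) -> bytes:
    """Prepend a 4-byte length header to the payload."""
    return struct.pack("!I", len(payload)) + payload

def unframe(data: bytes) -> bytes:
    """Recover the payload, rejecting truncated or malformed messages."""
    (length,) = struct.unpack("!I", data[:4])
    payload = data[4:4 + length]
    if len(payload) != length:
        raise ValueError("truncated or improperly formatted message")
    return payload

wire = frame(b"GET /index.html")
print(unframe(wire))   # b'GET /index.html'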
Importance
The widespread use and expansion of communications protocols is both a prerequisite to
the Internet, and a major contributor to its power and success. The pair of Internet
Protocol (or IP) and Transmission Control Protocol (or TCP) are the most important of
these, and the term TCP/IP refers to a collection (or protocol suite) of its most used
protocols. Most of the Internet's communication protocols are described in the RFC
documents of the Internet Engineering Task Force (or IETF).
Object-oriented programming has extended the use of the term to include the
programming protocols available for connections and communication between objects.
Generally, only the simplest protocols are used alone. Most protocols, especially in the
context of communications or networking, are layered together into protocol stacks where
the various tasks listed above are divided among different protocols in the stack.
Whereas the protocol stack denotes a specific combination of protocols that work
together, the Reference Model is a software architecture that lists each layer and the
services each should offer. The classic seven-layer reference model is the OSI model,
which is used for conceptualizing protocol stacks and peer entities. This reference model
also provides an opportunity to teach more general software engineering concepts like
information hiding, modularity, and delegation of tasks. This model has endured in spite of the
demise of many of its protocols (and protocol stacks) originally sanctioned by the ISO.
The OSI model is not the only reference model however.
Common Protocols
   •   HTTP (Hyper Text Transfer Protocol)
   •   POP3 (Post Office Protocol 3)
   •   SMTP (Simple Mail Transfer Protocol)
   •   FTP (File Transfer Protocol)
   •   IP (Internet Protocol)
   •   DHCP (Dynamic Host Configuration Protocol)
   •   IMAP (Internet Message Access Protocol)




                                      Search Engine
A search engine is an information retrieval system designed to help find information
stored on a computer system, such as on the World Wide Web, inside a corporate or
proprietary network, or in a personal computer. The search engine allows one to ask for
content meeting specific criteria (typically those containing a given word or phrase) and
retrieves a list of items that match those criteria. This list is often sorted with respect to
some measure of relevance of the results. Search engines use regularly updated indexes to
operate quickly and efficiently.
Without further qualification, search engine usually refers to a Web search engine, which
searches for information on the public Web. Other kinds of search engine are enterprise
search engines, which search on intranets, personal search engines, and mobile search
engines. Different selection and relevance criteria may apply in different environments,
or for different uses.
Some search engines also mine data available in newsgroups, databases, or open
directories. Unlike Web directories, which are maintained by human editors, search
engines operate algorithmically or are a mixture of algorithmic and human input.
How search engines work
A search engine operates in the following order:
   •   Web crawling
   •   Indexing
   •   Searching
A web crawler (also known as a Web spider or Web robot) is a program or automated
script which browses the World Wide Web in a methodical, automated manner. Other
less frequently used names for Web crawlers are ants, automatic indexers, bots, and
worms (Kobayashi and Takeda, 2000).
This process is called Web crawling or spidering. Many legitimate sites, in particular
search engines, use spidering as a means of providing up-to-date data. Web crawlers are
mainly used to create a copy of all the visited pages for later processing by a search
engine, which will index the downloaded pages to provide fast searches. Crawlers can also
be used for automating maintenance tasks on a Web site, such as checking links or
validating HTML code. Also, crawlers can be used to gather specific types of information
from Web pages, such as harvesting e-mail addresses (usually for spam).
A Web crawler is one type of bot, or software agent. In general, it starts with a list of
URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the
hyperlinks in the page and adds them to the list of URLs to visit, called the crawl
frontier. URLs from the frontier are recursively visited according to a set of policies.
Web crawler architectures
[Figure: High-level architecture of a standard Web crawler]
A crawler must not only have a good crawling strategy, but it should also have a highly
optimized architecture.
Shkapenyuk and Suel (Shkapenyuk and Suel, 2002) noted that: "While it is fairly easy to
build a slow crawler that downloads a few pages per second for a short period of time,
building a high-performance system that can download hundreds of millions of pages
over several weeks presents a number of challenges in system design, I/O and network
efficiency, and robustness and manageability."
Web crawlers are a central part of search engines, and details on their algorithms and
architecture are kept as business secrets. When crawler designs are published, there is
often an important lack of detail that prevents others from reproducing the work. There
are also emerging concerns about "search engine spamming", which prevent major search
engines from publishing their ranking algorithms.


Search engine indexing entails how data is collected, parsed, and stored to facilitate fast and
accurate retrieval. Index design incorporates interdisciplinary concepts from linguistics,
cognitive psychology, mathematics, informatics, physics, and computer science. An
alternate name for the process is Web indexing, within the context of search engines
designed to find web pages on the Internet.
Popular engines focus on the full-text indexing of online, natural language documents,
yet there are other searchable media types such as video, audio, and graphics. Meta
search engines reuse the indices of other services and do not store a local index, whereas
cache-based search engines permanently store the index along with the corpus. Unlike
full text indices, partial text services restrict the depth indexed to reduce index size.
Larger services typically perform indexing at a predetermined interval due to the required
time and processing costs, whereas agent-based search engines index in real time.


   Indexing
The goal of storing an index is to optimize the speed and performance of finding relevant
documents for a search query. Without an index, the search engine would scan every
document in the corpus, which would take a considerable amount of time and computing
power. For example, an index of 1000 documents can be queried within milliseconds,
whereas a raw scan of 1000 documents could take hours. No search engine user would be
comfortable waiting several hours to get search results. The trade off for the time saved
during retrieval is that additional storage is required to store the index and that it takes a
considerable amount of time to update.
Index Design Factors
Major factors in designing a search engine's architecture include:
   •   Merge factors - how data enters the index, or how words or subject features are added to the index during corpus traversal, and whether multiple indexers can work asynchronously. The indexer must first check whether it is updating old content or adding new content. Traversal typically correlates to the data collection policy. Search engine index merging is similar in concept to the SQL Merge command and other merge algorithms.
   •   Storage techniques - how to store the index data - whether information should be compressed or filtered
   •   Index size - how much computer storage is required to support the index
   •   Lookup speed - how quickly a word can be found in the inverted index. How quickly an entry in a data structure can be found, versus how quickly it can be updated or removed, is a central focus of computer science
   •   Maintenance - maintaining the index over time
   •   Fault tolerance - how important it is for the service to be reliable, how to deal with index corruption, whether bad data can be treated in isolation, dealing with bad hardware, partitioning schemes such as hash-based or composite partitioning, data replication
Index Data Structures
Search engine architectures vary in how indexing is performed and in index storage to
meet the various design factors. Types of indices include:
   •   Suffix trees - figuratively structured like a tree, supporting linear time lookup. Built by storing the suffixes of words. Used for searching for patterns in DNA sequences and clustering. A major drawback is that storing a word in the tree may require more space than storing the word itself. An alternate representation is a suffix array, which is considered to require less memory and supports compression such as BWT.
   •   Tries - an ordered tree data structure that is used to store an associative array where the keys are strings. Regarded as faster than a hash table, but less space efficient. The suffix tree is a type of trie. Tries support extendible hashing, which is important for search engine indexing.
   •   Inverted indices - store a list of occurrences of each atomic search criterion, typically in the form of a hash table or binary tree.
   •   Citation indices - store the existence of citations or hyperlinks between documents to support citation analysis, a subject of bibliometrics.
   •   Ngram indices - for storing sequences of n length of data to support other types of retrieval or text mining.
   •   Term document matrices - used in latent semantic analysis; store the occurrences of words in documents in a two-dimensional sparse matrix.
Challenges in Parallelism
A major challenge in the design of search engines is the management of parallel
processes. There are many opportunities for race conditions and coherence faults. For
example, a new document is added to the corpus and the index must be updated, but the
index simultaneously needs to continue responding to search queries. This is a collision
between two competing tasks. Consider that authors are producers of information, and a
crawler is the consumer of this information, grabbing the text and storing it in a cache (or
corpus). The forward index is the consumer of the information produced by the corpus,
and the inverted index is the consumer of information produced by the forward index.
This is commonly referred to as a producer-consumer model. The indexer is the
producer of searchable information and users are the consumers that need to search. The
challenge is magnified when working with distributed storage and distributed processing.
In an effort to scale with larger amounts of indexed information, the search engine's
architecture may involve distributed computing, where the search engine consists of
several machines operating in unison. This increases the possibilities for incoherency and
makes it more difficult to maintain a fully-synchronized, distributed, parallel architecture.
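A highly simplified sketch of that producer-consumer idea, using a thread-safe queue (purely
illustrative; real indexers coordinate far more state than this):

# Toy producer-consumer pipeline: a "crawler" thread produces documents and
# an "indexer" thread consumes them to build a forward index.
import queue
import threading

docs = queue.Queue()
forward_index: dict[int, list[str]] = {}

def crawler():
    for doc_id, text in enumerate(["the cow says moo", "the cat and the hat"]):
        docs.put((doc_id, text))   # produce
    docs.put(None)                 # sentinel: no more documents

def indexer():
    while (item := docs.get()) is not None:
        doc_id, text = item        # consume
        forward_index[doc_id] = text.split()

threads = [threading.Thread(target=crawler), threading.Thread(target=indexer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(forward_index[0])   # ['the', 'cow', 'says', 'moo']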
Inverted indices
Many search engines incorporate an inverted index when evaluating a search query to
quickly locate the documents which contain the words in a query and rank these
documents by relevance. The inverted index stores a list of the documents for each word.
The search engine can retrieve the matching documents quickly using direct access to
find the documents for a word. The following is a simplified illustration of the inverted
index:
             Inverted Index
             Word Documents
             the Document 1, Document 3, Document 4, Document 5
             cow Document 2, Document 3, Document 4
             says Document 5
             moo Document 7
The above figure is a simplified form of a Boolean index. Such an index would only
serve to determine whether a document matches a query, but would not contribute to
ranking matched documents. In some designs the index includes additional information
such as the frequency of each word in each document or the positions of the word in each
document. With position, the search algorithm can identify word proximity to support
searching for phrases. Frequency can be used to help in ranking the relevance of
documents to the query. Such topics are the central research focus of information
retrieval.
The inverted index is a sparse matrix given that words are not present in each document.
It is stored differently than a two dimensional array to reduce memory requirements. The
index is similar to the term document matrices employed by latent semantic analysis. The
inverted index can be considered a form of a hash table. In some cases the index is a form
of a binary tree, which requires additional storage but may reduce the lookup time. In
larger indices the architecture is typically distributed. Inverted indices can be
programmed in several computer programming languages.
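A minimal sketch of an inverted index in its hash-table form, built from a few toy documents,
with a Boolean AND query expressed as a set intersection over the per-word document lists:

# A tiny inverted index stored as a hash table (dict): word -> set of documents.
from collections import defaultdict

documents = {
    "Document 1": "the cow says moo",
    "Document 2": "the cat and the hat",
    "Document 3": "the dish ran away with the spoon",
}

inverted = defaultdict(set)
for name, text in documents.items():
    for word in text.split():
        inverted[word].add(name)

# Documents containing both "the" and "cow": a Boolean AND is a set
# intersection over the two posting lists, found by direct access.
print(sorted(inverted["the"] & inverted["cow"]))   # ['Document 1']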
Index Merging
The inverted index is filled via a merge or rebuild. A rebuild is similar to a merge but
first deletes the contents of the inverted index. The architecture may be designed to
support incremental indexing, where a merge involves identifying the document or
documents to add into or update in the index and parsing each document into words. For
technical accuracy, a merge involves the unison of newly indexed documents, typically
residing in virtual memory, with the index cache residing on one or more computer hard
drives.
After parsing, the indexer adds the containing document to the document list for the
appropriate words. The process of finding each word in the inverted index in order to
denote that it occurred within a document may be too time consuming when designing a
larger search engine, and so this process is commonly split up into the development of a
forward index and the process of sorting the contents of the forward index for entry into
the inverted index. The inverted index is named inverted because it is an inversion of the
forward index.
The Forward Index
The forward index stores a list of words for each document. The following is a simplified
form of the forward index:
                    Forward Index
                    Document Words
                    Document 1 the,cow,says,moo
                    Document 2 the,cat,and,the,hat
                    Document 3 the,dish,ran,away,with,the,spoon
The rationale behind developing a forward index is that as documents are parsed, it is
better to immediately store the words per document. The delineation enables
asynchronous processing, which partially circumvents the inverted index update
bottleneck. The forward index is sorted to transform it into an inverted index. The forward
index is essentially a list of pairs consisting of a document and a word, collated by the
document. Converting the forward index to an inverted index is only a matter of sorting
the pairs by the words. In this regard, the inverted index is a word-sorted forward index.
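A compact sketch of that conversion: collect (word, document) pairs from the forward index,
sort them by word, then group them back into per-word document lists.

# Convert a forward index into an inverted index by sorting (word, document) pairs.
from itertools import groupby

forward_index = {
    "Document 1": ["the", "cow", "says", "moo"],
    "Document 2": ["the", "cat", "and", "the", "hat"],
    "Document 3": ["the", "dish", "ran", "away", "with", "the", "spoon"],
}

# 1. Flatten the forward index into (word, document) pairs.
pairs = [(word, doc) for doc, words in forward_index.items() for word in words]

# 2. Sort by word: this is the "inversion" step.
pairs.sort()

# 3. Group the sorted pairs to obtain each word's document list.
inverted_index = {word: sorted({doc for _, doc in group})
                  for word, group in groupby(pairs, key=lambda p: p[0])}

print(inverted_index["the"])   # ['Document 1', 'Document 2', 'Document 3']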
Compression
Generating or maintaining a large-scale search engine index represents a significant
storage and processing challenge. Many search engines utilize a form of compression to
reduce the size of the indices on disk. Consider the following scenario for a full text,
Internet, search engine.
   •   An estimated 2,000,000,000 different web pages exist as of the year 2000
   •   A fictitious estimate of 250 words per webpage on average, based on the assumption of being similar to the pages of a novel
   •   It takes 8 bits (or 1 byte) to store a single character; some encodings use 2 bytes per character
   •   The average number of characters in any given word on a page can be estimated at 5 (Wikipedia:Size comparisons)
   •   The average personal computer comes with about 20 gigabytes of usable space
Given these estimates, generating an uncompressed index (assuming a non-conflated,
simple index) for 2 billion web pages would need to store 500 billion word entries. At 1
byte per character, or 5 bytes per word, this would require 2,500 gigabytes of storage space
alone, far more than the average free disk space of a personal computer. This space is
further increased in the case of a distributed storage architecture that is fault-tolerant.
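Restating that estimate as a quick calculation from the assumptions listed above:

# Back-of-the-envelope index size, using the estimates listed above.
pages = 2_000_000_000        # web pages
words_per_page = 250         # average words per page
bytes_per_word = 5           # about 5 characters per word at 1 byte each

word_entries = pages * words_per_page        # 500,000,000,000 entries
size_bytes = word_entries * bytes_per_word   # 2,500,000,000,000 bytes
print(word_entries)                          # 500000000000 (500 billion)
print(size_bytes / 10**9)                    # 2500.0 gigabytes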
Using compression, the index size can be reduced to a portion of its size, depending on
which compression techniques are chosen. The trade off is the time and processing power
required to perform compression.
Notably, large scale search engine designs incorporate the cost of storage, and the costs
of electricity to power the storage. Compression, in this regard, is a measure of cost as
well.
Document Parsing
Document parsing involves breaking apart the components (words) of a document or
other form of media for insertion into the forward and inverted indices. For example, if
the full contents of a document consisted of the sentence "Hello World", there would
typically be two words found, the token "Hello" and the token "World". In the context of
search engine indexing and natural language processing, parsing is more commonly
referred to as tokenization, and sometimes word boundary disambiguation, tagging, Text
segmentation, Content analysis, text analysis, Text mining, Concordance generation,
Speech segmentation, lexing, or lexical analysis. The terms 'indexing', 'parsing', and
'tokenization' are used interchangeably in corporate slang.
Natural language processing, as of 2006, is the subject of continuous research and
technological improvement. There are a host of challenges in tokenization, in extracting
the necessary information from documents for indexing to support quality searching.
Tokenization for indexing involves multiple technologies, the implementation of which
are commonly kept as corporate secrets.
Challenges in Natural Language Processing
Word Boundary Ambiguity - native English speakers can at first consider tokenization to
be a straightforward task, but this is not the case with designing a multilingual indexer. In
digital form, the text of other languages such as Chinese, Japanese or Arabic represent a
greater challenge as words are not clearly delineated by whitespace. The goal during
tokenization is to identify words for which users will search. Language specific logic is
employed to properly identify the boundaries of words, which is often the rationale for
designing a parser for each language supported (or for groups of languages with similar
boundary markers and syntax).
Language Ambiguity - to assist with properly ranking matching documents, many search
engines collect additional information about words, such as their language or lexical
category (part of speech). These techniques are language-dependent as the syntax varies
among languages. Documents do not always clearly identify the language of the
document or represent it accurately. In tokenizing the document, some search engines
attempt to automatically identify the language of the document.
Diverse File Formats - in order to correctly identify what bytes of a document represent
characters, the file format must be correctly handled. Search engines which support
multiple file formats must be able to correctly open and access the document and be able
to tokenize the characters of the document.
Faulty Storage - the quality of the natural language data is not always assumed to be
perfect. An unspecified number of documents, particularly on the Internet, do not always
closely obey proper file protocol. Binary characters may be mistakenly encoded into
various parts of a document. Without recognition of these characters and appropriate
handling, the index quality or indexer performance could degrade.
Tokenization
Unlike literate human adults, computers are not inherently aware of the structure of a
natural language document and do not instantly recognize words and sentences. To a
computer, a document is only a big sequence of bytes. Computers do not know that a
space character between two sequences of characters means that there are two separate
words in the document. Instead, a computer program is developed by humans which
trains the computer, or instructs the computer, how to identify what constitutes an
individual or distinct word, referred to as a token. This program is commonly referred to
as a tokenizer or parser or lexer. Many search engines, as well as other natural language
processing software, incorporate specialized programs for parsing, such as YACC or Lex.
During tokenization, the parser identifies sequences of characters, which typically
represent words. Commonly recognized tokens include punctuation, sequences of
numerical characters, alphabetical characters, alphanumerical characters, binary
characters (backspace, null, print, and other antiquated print commands), whitespace
(space, tab, carriage return, line feed), and entities such as email addresses, phone
numbers, and URLs. When identifying each token, several characteristics may be stored
such as the token's case (upper, lower, mixed, proper), language or encoding, lexical
category (part of speech, like 'noun' or 'verb'), position, sentence number, sentence
position, length, and line number.
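A toy tokenizer along these lines, written as a regular-expression sketch (far simpler than
production lexers such as Lex, and covering only a small subset of the token classes listed
above):

# A naive tokenizer: classify runs of characters and record basic attributes.
import re

TOKEN_PATTERN = re.compile(
    r"(?P<email>[\w.+-]+@[\w-]+\.[\w.]+)"   # e-mail addresses
    r"|(?P<url>https?://\S+)"               # URLs
    r"|(?P<number>\d+)"                     # numeric sequences
    r"|(?P<word>[A-Za-z]+)"                 # alphabetic words
)

def tokenize(text):
    for position, match in enumerate(TOKEN_PATTERN.finditer(text)):
        token = match.group()
        yield {
            "token": token,
            "kind": match.lastgroup,   # word, number, url or email
            "case": "upper" if token.isupper()
                    else "proper" if token.istitle() else "lower",
            "position": position,
        }

for t in tokenize("Contact ada@example.org or visit https://example.org room 42"):
    print(t)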
Language Recognition
If the search engine supports multiple languages, a common initial step during
tokenization is to identify each document's language, given that many of the later steps
are language dependent (such as stemming and part of speech tagging). Language
recognition is the process by which a computer program attempts to automatically
identify, or categorize, the language of a document. Other names for language
recognition include language classification, language analysis, language identification,
and language tagging. Automated language recognition is the subject of ongoing research
in natural language processing. Finding which language the words belong to may involve
the use of a language recognition chart.
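A deliberately naive sketch of language recognition by counting a handful of common function
words (the tiny word lists are placeholders; real recognizers rely on character n-gram
statistics or trained classifiers):

# Guess a document's language by counting a few very common function words.
STOPWORDS = {
    "english": {"the", "and", "of", "to", "in"},
    "german":  {"der", "die", "und", "das", "ist"},
    "french":  {"le", "la", "et", "les", "des"},
}

def guess_language(text: str) -> str:
    words = text.lower().split()
    scores = {lang: sum(w in vocab for w in words)
              for lang, vocab in STOPWORDS.items()}
    return max(scores, key=scores.get)

print(guess_language("the cow says moo and the cat says nothing"))   # english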
Format Analysis
Depending on whether the search engine supports multiple document formats, documents
must be prepared for tokenization. The challenge is that many document formats contain,
in addition to textual content, formatting information. For example, HTML documents
contain HTML tags, which specify formatting information, like whether to start a new
line, or display a word in bold, or change the font size or family. If the search engine
were to ignore the difference between content and markup, extraneous markup would also be
included in the index, leading to poor search results. Format analysis involves the
identification and handling of formatting content embedded within documents which
control how the document is rendered on a computer screen or interpreted by a software
program. Format analysis is also referred to as structure analysis, format parsing, tag
stripping, format stripping, text normalization, text cleaning, or text preparation. The
challenge of format analysis is further complicated by the intricacies of various file
formats. Certain file formats are proprietary and very little information is disclosed, while
others are well documented. Common, well-documented file formats that many search
engines support include:
   •   Microsoft Word
   •   Microsoft Excel
   •   Microsoft PowerPoint
   •   IBM Lotus Notes
   •   HTML
   •   ASCII text files (a text document without any formatting)
   •   Adobe's Portable Document Format (PDF)
   •   PostScript (PS)
   •   LaTeX
   •   The UseNet archive (NNTP) and other deprecated bulletin board formats
   •   XML and derivatives like RSS
   •   SGML (this is more of a general protocol)
   •   Multimedia metadata formats like ID3
Techniques for dealing with various formats include:
   •   Using a publicly available commercial parsing tool that is offered by the organization which developed, maintains, or owns the format
   •   Writing a custom parser
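As a minimal illustration of the "custom parser" route for HTML, the standard-library sketch
below keeps only the text content and discards the markup (real format analysis does
considerably more, such as tracking hidden elements):

# Strip HTML markup, keeping only the text content for later tokenization.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for the text between tags; the tags themselves are ignored.
        if data.strip():
            self.chunks.append(data.strip())

extractor = TextExtractor()
extractor.feed("<html><body><h1>Hello</h1><p>Search <b>engine</b> indexing.</p></body></html>")
print(" ".join(extractor.chunks))   # Hello Search engine indexing.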
Some search engines support inspection of files that are stored in a compressed, or
encrypted, file format. If working with a compressed format, then the indexer first
decompresses the document, which may result in one or more files, each of which must
be indexed separately. Commonly supported compressed file formats include:
   •   ZIP - Zip File
   •   RAR - Archive File
   •   CAB - Microsoft Windows Cabinet File
   •   Gzip - Gzip file
   •   BZIP - Bzip file
   •   TAR, GZ, and TAR.GZ - Unix Gzip'ped archives
Format analysis can involve quality improvement methods to avoid including 'bad
information' in the index. Content authors can manipulate the formatting information to
include additional content. Examples of abusing document formatting for spamdexing:
   •   Including hundreds or thousands of words in a section which is hidden from view on the computer screen, but visible to the indexer, by use of formatting (e.g. a hidden "div" tag in HTML, which may incorporate the use of CSS or JavaScript to do so).
   •   Setting the foreground font color of words to the same as the background color, making words hidden on the computer screen to a person viewing the document, but not hidden to the indexer.
Section Recognition
Some search engines incorporate section recognition, the identification of major parts of
a document, prior to tokenization. Not all the documents in a corpus read like a well-
written book, divided into organized chapters and pages. Many documents on the web
contain erroneous content and side-sections which do not contain primary material, that
which the document is about, such as newsletters and corporate reports. For example, this
article may display a side menu with words inside links to other web pages. Some file
formats, like HTML or PDF, allow for content to be displayed in columns. Even though
the content is displayed, or rendered, in different areas of the view, the raw markup
content may store this information sequentially. Words that appear in the raw source
content sequentially are indexed sequentially, even though these sentences and
paragraphs are rendered in different parts of the computer screen. If search engines index
this content as if it were normal content, a dilemma ensues where the quality of the index
is degraded and search quality is degraded due to the mixed content and improper word
proximity. Two primary problems are noted:
   •   Content in different sections is treated as related in the index, when in reality it is not
   •   Organizational 'side bar' content is included in the index, but the side bar content does not contribute to the meaning of the document, and the index is filled with a poor representation of its documents, assuming the goal is to go after the meaning of each document, a sub-goal of providing quality search results.
Section analysis may require the search engine to implement the rendering logic of each
document, essentially an abstract representation of the actual document, and then index
the representation instead. For example, some content on the Internet is rendered via
JavaScript. Viewers of web pages in web browsers see this content. If the search engine
does not render the page and evaluate the JavaScript within the page, it would not 'see' this
content in the same way, and would index the document incorrectly. Given that some search
engines do not bother with rendering issues, many web page designers avoid displaying
content via JavaScript or use the noscript tag to ensure that the web page is indexed properly.
  • 5. 5 Markup is the process of taking ordinary text and adding extra symbols. Each of the symbols used for markup in HTML is a command that tells a browser how to display the text. History of HTML Tim Berners-Lee created the original HTML (and many associated protocols such as HTTP) on a NeXTcube workstation using the NeXTSTEP development environment. At the time, HTML was not a specification, but a collection of tools to solve an immediate problem: the communication and dissemination of ongoing research among Berners-Lee and a group of his colleagues. His solution later combined with the emerging international and public internet to garner worldwide attention. Early versions of HTML were defined with loose syntactic rules, which helped its adoption by those unfamiliar with web publishing. Web browsers commonly made assumptions about intent and proceeded with rendering of the page. Over time, as the use of authoring tools increased, the trend in the official standards has been to create an increasingly strict language syntax. However, browsers still continue to render pages that are far from valid HTML. HTML is defined in formal specifications that were developed and published throughout the 1990s, inspired by Tim Berners-Lee's prior proposals to graft hypertext capability onto a homegrown SGML-like markup language for the Internet. The first published specification for a language called HTML was drafted by Berners-Lee with Dan Connolly, and was published in 1993 by the IETF as a formal "application" of SGML (with an SGML Document Type Definition defining the grammar). The IETF created an HTML Working Group in 1994 and published HTML 2.0 in 1995, but further development under the auspices of the IETF was stalled by competing interests. Since 1996, the HTML specifications have been maintained, with input from commercial software vendors, by the World Wide Web Consortium (W3C).[1] However, in 2000, HTML also became an international standard (ISO/IEC 15445:2000). The last HTML specification published by the W3C is the HTML 4.01 Recommendation, published in late 1999 and its issues and errors were last acknowledged by errata published in 2001. Since the publication of HTML 4.0 in late 1997, the W3C's HTML Working Group has increasingly — and from 2002 through 2006, exclusively — focused on the development of XHTML, an XML-based counterpart to HTML that is described on one W3C web page as HTML's "successor".[2][3][4] XHTML applies the more rigorous, less ambiguous syntax requirements of XML to HTML to make it easier to process and extend, and as support for XHTML has increased in browsers and tools, it has been embraced by many web standards advocates in preference to HTML. XHTML is routinely characterized by mass-media publications for both general and technical audiences as the newest "version" of HTML, but W3C publications, as of 2006, do not make such a claim; neither HTML 3.2 nor HTML 4.01 have been explicitly rescinded, deprecated, or superseded by any W3C publications, and, as of 2006, they continue to be listed alongside XHTML as current Recommendations in the W3C's primary publication indices.[5][6][7]
  • 6. 6 In November 2006, the HTML Working Group published a new charter indicating its intent to resume development of HTML in a manner that unifies HTML 4 and XHTML 1, allowing for this hybrid language to manifest in both an XML format and a "classic HTML" format that is SGML-compatible but not strictly SGML-based. Among other things, it is planned that the new specification, to be released and refined throughout 2007 through 2008, will include conformance and parsing requirements, DOM APIs, and new widgets and APIs. The group also intends to publish test suites and validation tools.[8] Version history of the standard HTML Character encodings Dynamic HTML Font family HTML editor HTML element HTML scripting Layout engine comparison Style Sheets Unicode and HTML W3C Web browsers comparison Web colors XHTML This box: view • talk • edit Hypertext Markup Language (First Version), published June 1993 as an Internet Engineering Task Force (IETF) working draft (not standard). HTML 2.0, published November 1995 as IETF RFC 1866, supplemented by RFC 1867 (form-based file upload) that same month, RFC 1942 (tables) in May 1996, RFC 1980 (client-side image maps) in August 1996, and RFC 2070 (internationalization) in January 1997; ultimately all were declared obsolete/historic by RFC 2854 in June 2000. HTML 3.2, published January 14, 1997 as a W3C Recommendation. HTML 4.0, published December 18, 1997 as a W3C Recommendation. It offers three "flavors": Strict, in which deprecated elements are forbidden Transitional, in which deprecated elements are allowed Frameset, in which mostly only frame related elements are allowed HTML 4.01, published December 24, 1999 as a W3C Recommendation. It offers the same three flavors as HTML 4.0, and its last errata was published May 12, 2001. ISO/IEC 15445:2000 ("ISO HTML", based on HTML 4.01 Strict), published May 15, 2000 as an ISO/IEC international standard. HTML 4.01 and ISO/IEC 15445:2000 are the most recent and final versions of HTML. XHTML is a separate language that began as a reformulation of HTML 4.01 using XML 1.0. It continues to be developed:
  • 7. 7 XHTML 1.0, published January 26, 2000 as a W3C Recommendation, later revised and republished August 1, 2002. It offers the same three flavors as HTML 4.0 and 4.01, reformulated in XML, with minor restrictions. XHTML 1.1, published May 31, 2001 as a W3C Recommendation. It is based on XHTML 1.0 Strict, but includes minor changes and is reformulated using modules from Modularization of XHTML, which was published April 10, 2001 as a W3C Recommendation. XHTML 2.0 is still a W3C Working Draft. There is no official standard HTML 1.0 specification because there were multiple informal HTML standards at the time. Berners-Lee's original version did not include an IMG element type. Work on a successor for HTML, then called "HTML+", began in late 1993, designed originally to be "A superset of HTML…which will allow a gradual rollover from the previous format of HTML". The first formal specification was therefore given the version number 2.0 in order to distinguish it from these unofficial "standards". Work on HTML+ continued, but it never became a standard. The HTML 3.0 standard was proposed by the newly formed W3C in March 1995, and provided many new capabilities such as support for tables, text flow around figures, and the display of complex math elements. Even though it was designed to be compatible with HTML 2.0, it was too complex at the time to be implemented, and when the draft expired in September 1995, work in this direction was discontinued due to lack of browser support. HTML 3.1 was never officially proposed, and the next standard proposal was HTML 3.2 (code-named "Wilbur"), which dropped the majority of the new features in HTML 3.0 and instead adopted many browser-specific element types and attributes which had been created for the Netscape and Mosaic web browsers. Math support as proposed by HTML 3.0 finally came about years later with a different standard, MathML. HTML 4.0 likewise adopted many browser-specific element types and attributes, but at the same time began to try to "clean up" the standard by marking some of them as deprecated, and suggesting they not be used. Minor editorial revisions to the HTML 4.0 specification were published as HTML 4.01. The most common filename extension for files containing HTML is .html. However, older operating systems and filesystems, such as the DOS versions from the 80's and early 90's and FAT, limit file extensions to three letters, so a .htm extension is also used. Although perhaps less common now, the shorter form is still widely supported by current software. HTML as a hypertext format HTML is the basis of a comparatively weak hypertext implementation. Earlier hypertext systems had features such as typed links, transclusion and source tracking. Another feature lacking today is fat links.[9] Even some hypertext features that were in early versions of HTML have been ignored by most popular web browsers until recently, such as the link element and editable web pages.
  • 8. 8 Sometimes web services or browser manufacturers remedy these shortcomings. For instance, members of the modern social software landscape such as wikis and content management systems allow surfers to edit the web pages they visit. HTML markup HTML markup consists of several types of entities, including: elements, attributes, data types and character references. The Document Type Definition In order to enable Document Type Definition (DTD)-based validation with SGML tools and in order to avoid the Quirks mode in browsers, all HTML documents should start with a Document Type Declaration (informally, a "DOCTYPE"). The DTD contains machine readable grammar specifying the permitted and prohibited content for a document conforming to such a DTD. Browsers do not read the DTD, however. Browsers only look at the doctype in order to decide the layout mode. Not all doctypes trigger the Standards layout mode avoiding the Quirks mode. For example: <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"> This declaration references the Strict DTD of HTML 4.01, which does not have presentational elements like <font>, leaving formatting to Cascading Style Sheets. SGML-based validators read the DTD in order to properly parse the document and to perform validation. In modern browsers, the HTML 4.01 Strict doctype activates the Standards layout mode for CSS as opposed to the Quirks mode. In addition, HTML 4.01 provides Transitional and Frameset DTDs. The Transitional DTD was intended to gradually phase in the changes made in the Strict DTD, while the Frameset DTD was intended for those documents which contained frames. [edit] Elements See HTML elements for more detailed descriptions. Elements are the basic structure for HTML markup. Elements have two basic properties: attributes and content. Each attribute and each element's content has certain restrictions that must be followed for an HTML document to be considered valid. An element usually has a start label (eg. <label>) and an end label (eg. </label>). The element's attributes are contained in the start label and content is located between the labels (eg. <label>Content</label>). Some elements, such as <br>, will never have any content and do not need closing labels. Listed below are several types of markup elements used in HTML. Structural markup describes the purpose of text. For example, <h2>Golf</h2> establishes "Golf" as a second-level heading, which would be rendered in a browser in a manner similar to the "Markup element types" title at the start of this section. A blank line is included after the header. Structural markup does not denote any specific rendering, but most web browsers have standardized on how elements should be formatted. Further styling should be done with Cascading Style Sheets (CSS). Presentational markup describes the appearance of the text, regardless of its function. For example <b>boldface</b> indicates that visual output devices should render
  • 9. 9 "boldface" in bold text, but has no clear semantics for aural devices that read the text aloud for the sight-impaired. In the case of both <b>bold</b> and <i>italic</i> there are elements which usually have an equivalent visual rendering but are more semantic in nature, namely <strong>strong emphasis</strong> and <em>emphasis</em> respectively. It is easier to see how an aural user agent should interpret the latter two elements. However, they are not equivalent to their presentational counterparts: it would be undesirable for a screen-reader to emphasize the name of a book, for instance, but on a screen such a name would be italicized. Most presentational markup elements have become deprecated under the HTML 4.0 specification, in favor of CSS based style design. Hypertext markup links parts of the document to other documents. HTML up through version XHTML 1.1 requires the use of an anchor element to create a hyperlink in the flow of text: <a>Wikipedia</a>. However, the href attribute must also be set to a valid URL so for example the HTML code, <a href="http://en.wikipedia.org/">Wikipedia</a>, will render the word "Wikipedia" as a hyperlink. In order to view the HTML code in a website click --> View --> Source. [edit] Attributes The attributes of an element are name-value pairs, separated by "=", and written within the start label of an element, after the element's name. The value should be enclosed in single or double quotes, although values consisting of certain characters can be left unquoted in HTML (but not XHTML).[10][11] Leaving attribute values unquoted is considered unsafe.[12] Most elements take any of several common attributes: id, class, style and title. Most also take language-related attributes: lang and dir. The id attribute provides a document-wide unique identifier for an element. This can be used by stylesheets to provide presentational properties, by browsers to focus attention on the specific element or by scripts to alter the contents or presentation of an element. The class attribute provides a way of classifying similar elements for presentation purposes. For example, an HTML (or a set of documents) document may use the designation class="notation" to indicate that all elements with this class value are all subordinate to the main text of the document (or documents). Such notation classes of elements might be gathered together and presented as footnotes on a page, rather than appearing in the place where they appear in the source HTML. An author may use the style non-attributal codes presentational properties to a particular element. It is considered better practice to use an element’s son- id page and select the element with a stylesheet, though sometimes this can be too cumbersome for a simple ad hoc application of styled properties. The title is used to attach subtextual explanation to an element. In most browsers this title attribute is displayed as what is often referred to as a tooltip. The generic inline span element can be used to demonstrate these various non-attributes.
• 10. 10 <span id='anId' class='aClass' style='color:red;' title='HyperText Markup Language'>HTML</span> which displays as HTML (pointing the cursor at the abbreviation should display the title text in most browsers). Other markup As of version 4.0, HTML defines a set of 252 character entity references and a set of 1,114,050 numeric character references, both of which allow individual characters to be written via simple markup, rather than literally. A literal character and its markup counterpart are considered equivalent and are rendered identically. The ability to "escape" characters in this way allows for the characters "<" and "&" (when written as &lt; and &amp;, respectively) to be interpreted as character data, rather than markup. For example, a literal "<" normally indicates the start of a label, and "&" normally indicates the start of a character entity reference or numeric character reference; writing it as "&amp;" or "&#38;" allows "&" to be included in the content of elements or the values of attributes. The double-quote character, ", when used to quote an attribute value, must also be escaped as "&quot;" or "&#34;" when it appears within the attribute value itself. However, since document authors often overlook the need to escape these characters, browsers tend to be very forgiving, treating them as markup only when subsequent text appears to confirm that intent. Escaping also allows for characters that are not easily typed, or that are not even available in the document's character encoding, to be represented within element and attribute content. For example, "é", a character typically found only on Western European keyboards, can be written in any HTML document as the entity reference &eacute; or as the numeric references &#233; or &#xE9;. The characters comprising those references (that is, the "&", the ";", the letters in "eacute", and so on) are available on all keyboards and are supported in all character encodings, whereas the literal "é" is not. HTML also defines several data types for element content, such as script data and stylesheet data, and a plethora of types for attribute values, including IDs, names, URIs, numbers, units of length, languages, media descriptors, colors, character encodings, dates and times, and so on. All of these data types are specializations of character data.
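As a rough illustration of how these references behave in practice, the following sketch uses Python's standard html module (Python is not part of the HTML material above; it simply stands in for any software that has to escape or decode such references). The named and numeric forms round-trip back to the same literal characters.

# Sketch: escaping and decoding HTML character references with the stdlib html module.
import html

raw = 'Books & "quotes" use < and >; café has an é'
print(html.escape(raw, quote=True))
# Books &amp; &quot;quotes&quot; use &lt; and &gt;; café has an é

# Named and numeric references decode to the same literal character.
print(html.unescape("&eacute; &#233; &#xE9;"))          # é é é
print(html.unescape("&lt;b&gt;not markup&lt;/b&gt;"))   # <b>not markup</b>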
  • 11. 11 The goal of semantic HTML requires two things of authors: 1) to avoid the use of presentational markup (elements, attributes and other entities); 2) the use of available markup to differentiate the meanings of phrases and structure in the document. So for example, the book title from above would need to have its own element and class specified such as <cite class="booktitle">The Grapes of Wrath</cite>. Here, the <cite> element is used, because it most closely matches the meaning of this phrase in the text. However, the <cite> element is not specific enough to this task because we mean to cite specifically a book title as opposed to a newspaper article or a particular academic journal. Semantic HTML also requires complementary specifications and software compliance with these specifications. Primarily, the development and proliferation of CSS has led to increasing support for semantic HTML because CSS provides designers with a rich language to alter the presentation of semantic-only documents. With the development of CSS the need to include presentational properties in a document has virtually disappeared. With the advent and refinement of CSS and the increasing support for it in web browsers, subsequent editions of HTML increasingly stress only using markup that suggests the semantic structure and phrasing of the document, like headings, paragraphs, quotes, and lists, instead of using markup which is written for visual purposes only, like <font>, <b> (bold), and <i> (italics). Some of these elements are not permitted in certain varieties of HTML, like HTML 4.01 Strict. CSS provides a way to separate document semantics from the content's presentation, by keeping everything relevant to presentation defined in a CSS file. See separation of style and content. Semantic HTML offers many advantages. First, it ensures consistency in style across elements that have the same meaning. Every heading, every quotation mark, every similar element receives the same presentation properties. Second, semantic HTML frees authors from the need to concern themselves with presentation details. When writing the number two, for example, should it be written out in words ("two"), or should it be written as a numeral (2)? A semantic markup might enter something like <number>2</number> and leave presentation details to the stylesheet designers. Similarly, an author might wonder where to break out quotations into separate indented blocks of text - with purely semantic HTML, such details would be left up to stylesheet designers. Authors would simply indicate quotations when they occur in the text, and not concern themselves with presentation. A third advantage is device independence and repurposing of documents. A semantic HTML document can be paired with any number of stylesheets to provide output to computer screens (through web browsers), high-resolution printers, handheld devices, aural browsers or braille devices for those with visual impairments, and so on. To accomplish this nothing needs to be changed in a well coded semantic HTML document. Readily available stylesheets make this a simple matter of pairing a semantic HTML document with the appropriate stylesheets (of course, the stylesheet's selectors need to match the appropriate properties in the HTML document).
  • 12. 12 Some aspects of authoring documents make separating semantics from style (in other words, meaning from presentation) difficult. Some elements are hybrids, using presentation in their very meaning. For example, a table displays content in a tabular form. Often this content only conveys the meaning when presented in this way. Repurposing a table for an aural device typically involves somehow presenting the table as an inherently visual element in an audible form. On the other hand, we frequently present lyrical songs — something inherently meant for audible presentation — and instead present them in textual form on a web page. For these types of elements, the meaning is not so easily separated from their presentation. However, for a great many of the elements used and meanings conveyed in HTML the translation is relatively smooth. [edit] Delivery of HTML HTML documents can be delivered by the same means as any other computer file; however, HTML documents are most often delivered in one of the following two forms: Over HTTP servers and through email. [edit] Publishing HTML with HTTP The World Wide Web is primarily composed of HTML documents transmitted from a web server to a web browser using the HyperText Transfer Protocol (HTTP). However, HTTP can be used to serve images, sound and other content in addition to HTML. To allow the web browser to know how to handle the document it received, an indication of the file format of the document must be transmitted along with the document. This vital metadata includes the MIME type (text/html for HTML 4.01 and earlier, application/xhtml+xml for XHTML 1.0 and later) and the character encoding (see Character encodings in HTML). In modern browsers, the MIME type that is sent with the HTML document affects how the document is interpreted. A document sent with an XHTML MIME type, or served as application/xhtml+xml, is expected to be well-formed XML and a syntax error may cause the browser to fail to render the document. The same document sent with a HTML MIME type, or served as text/html, might get displayed since web browsers are more lenient with HTML. However, XHTML parsed this way is not considered either proper XHTML nor HTML, but so-called tag soup. If the MIME type is not recognized as HTML, the web browser should not attempt to render the document as HTML, even if the document is prefaced with a correct Document Type Declaration. Nevertheless, some web browsers do examine the contents or URL of the document and attempt to infer the file type, despite this being forbidden by the HTTP 1.1 specification. [edit] HTML e-mail Main article: HTML e-mail Most graphical e-mail clients allow the use of a subset of HTML (often ill-defined) to provide formatting and semantic markup capabilities not available with plain text, like emphasized text, block quotations for replies, and diagrams or mathematical formulas that couldn't easily be described otherwise. Many of these clients include both a GUI editor for composing HTML e-mails and a rendering engine for displaying received
  • 13. 13 HTML e-mails. Use of HTML in e-mail is controversial due to compatibility issues, because it can be used in phishing/privacy attacks, because it can confuse spam filters, and because the message size is larger than plain text. [edit] Current flavors of HTML Since its inception HTML and its associated protocols gained acceptance relatively quickly. However, no clear standards existed in the early years of the language. Though its creators originally conceived of HTML as a semantic language devoid of presentation details, practical uses pushed many presentational elements and attributes into the language: driven largely by the various browser vendors. The latest standards surrounding HTML reflect efforts to overcome the sometimes chaotic development of the language and to create a rational foundation to build both meaningful and well-presented documents. To return HTML to its role as a semantic language, the W3C has developed style languages such as CSS and XSL to shoulder the burden of presentation. In conjunction the HTML specification has slowly reined in the presentational elements within the specification. There are two axes differentiating various flavors of HTML as currently specified: SGML-based HTML versus XML-based HTML (referred to as XHTML) on the one axis and strict versus transitional (loose) versus frameset on the other axis. [edit] Traditional versus XML-based HTML One difference in the latest HTML specifications lies in the distinction between the SGML-based specification and the XML-based specification. The XML-based specification is often called XHTML to clearly distinguish it from the more traditional definition; however, the root element name continues to be HTML even in the XHTML- specified HTML. The W3C intends XHTML 1.0 to be identical with HTML 4.01 except in the often stricter requirements of XML over traditional HTML. XHTML 1.0 likewise has three sub-specifications: strict, loose and frameset. The strictness of XHTML in terms of its syntax is often confused with the strictness of the strict versus the loose definitions in terms of the content rules of the specifications. The strictness of XML lies in the need to: always explicitly close elements (<h1>); and to always use quotation-marks (double " or single ') to enclose attribute values. The use of implied closing labels in HTML led to confusion for both editors and parsers. Aside from the different opening declarations for a document, the differences between HTML 4.01 and XHTML 1.0 — in each of the corresponding DTDs — is largely syntactic. Adhering to valid and well-formed XHTML 1.0 will result in a well-formed HTML 4.01 document in every way, except one. XHTML introduces a new markup in a self-closing element as short-hand for handling empty elements. The short-hand adds a slash (/) at the end of an opening label like this: <br/>. The introduction of this short- hand, undefined in any HTML 4.01 DTD, may confuse earlier software unfamiliar with this new convention. To help with the transition, the W3C recommends also including a space character before the slash like this:<br />. As validators and browsers adapt to this evolution in the standard, the migration from traditional to XML-based HTML should be relatively simple. The major problems occur when software is non-conforming to HTML
  • 14. 14 4.01 and its associated protocols to begin with, or erroneously implements the HTML recommendations. To understand the subtle differences between HTML and XHTML consider the transformation of a valid and well-formed XHTML 1.0 document into a valid and well- formed HTML 4.0. To make this translation requires the following steps:: The language code for the element should be specified with a lang rather than the XHTML xml:lang attribute HTML 4.01 instead defines its own attribute for language) whereas XHTML uses the XML defined attribute. Remove the XML namespace (xmlns=URI). HTML does not require and has no facilities for namespaces. Change the DTD declaration from XHTML 1.0 to HTML 4.01. (see DTD section for further explanation]]). If present, remove the XML declaration (Typically this is: <?xml version="1.0" encoding="utf-8"?>). Change the document’s mime type to text/html This may come from a meta element, from the HTTP header of the server or possibly from a filename extension (for example, change .xhtml to html). Change the XML empty label short-cut to a standard opening label (<br/> to <br>) Those are the only changes necessary to translate a document from XHTML 1.0 to HTML 4.01. The reverse operation can be much more complicated. HTML 4.01 allows the omission of many labels in a complex pattern derived by determining which labels are (in some sense) redundant for a valid document. In other words if the document is authored precisely to the associated HTML 4.01 content model, some labels need not be expressed. For example, since a paragraph cannot contain another paragraph, when an opening paragraph label is followed by another opening paragraph label, this implies the previous paragraph element is now closed. Similarly, elements such as br have no allowed content, so HTML does not require an explicit closing label for this element. Also since HTML was the only specification targeted by user-agents (browsers and other HTML consuming software), the specification even allows the omission of opening and closing labels for html, head, and body, if the document's head has no content. To translate from HTML to XHTML would first require the addition of any omitted closing labels (or using the closing label shortcut for empty elements like <br/>). Notice how XHTML’s requirement to always include explicit closing labels, allows the separation between the concepts of valid and well-formed. A well-formed XHTML document adheres to all the syntax requirements of XML. A valid document adheres to the content specification for XHTML. In other words a valid document only includes content, attributes and attribute values within each element in accord with the specification. If a closing label is omitted, an XHTML parser can first determine the document is not well-formed. Once the elements are all explicitly closed, the parser can address the question of whether the document is also valid. For an HTML parse these separate aspects of a document are not discernible. If a paragraph opening label (p) is
  • 15. 15 followed by a div, is it because the document is not well-formed (the closing paragraph label is missing) or is the document invalid (a div does not belong in a paragraph)? Whether coding in HTML or XHTML it may just be best to always include the optional labels within an HTML document rather than remembering which labels can be omitted. The W3C recommends several conventions to ensure an easy migration between HTML and XHTML (see HTML Compatibility Guidelines). Basically the W3C recommends: Including both xml:lang and lang attributes on any elements assigning language. Using the self-closing label only for elements specified as empty Make all label names and attribute names lower-case. Ensuring all attribute values are quoted with either single quotes (') or double quotes (") Including an extra space in self-closing labels: for example <br /> instead of <br/> Including explicit close labels for elements that permit content but are left empty (for example, "<img></img>", not "<img />" ) Note that by carefully following the W3C’s compatibility guidelines the difference between the resulting HTML 4.01 document and the XHTML 1.0 document is merely the DOCTYPE declaration, and the XML declaration preceding the document’s contents. The W3C allows the resulting XHTML 1.0 (or any XHTML 1.0) document to be delivered as either HTML or XHTML. For delivery as HTML, the document’s MIME type should be set to 'text/html', while, for XHTML, the document’s MIME type should be set to 'application/xhtml+xml'. When delivered as XHTML, browsers and other user agents are expected to adhere strictly to the XML specifications in parsing, interpreting, and displaying the document’s contents. [edit] Transitional versus Strict The latest SGML-based specification HTML 4.01 and the earliest XHTML version include three sub-specifications: strict, transitional (also called loose), and frameset. The difference between strict on the one hand and loose and frameset on the other, is that the strict definition tries to adhere more tightly to a presentation-free or style-free concept of a semantic HTML. The loose standard maintains many of the various presentational elements and attributes absent in the strict definition. The primary differences making the transitional specification loose versus the strict specification (whether XHTML 1.0 or HTML 4.01) are: A looser content model Inline elements and character strings (#PCDATA) are allowed in: body, blockquote, form, noscript, noframes Presentation related elements underline (u) strike-through (s and strike) center font basefont Presentation related attributes background and bgcolor attributes for body element.
  • 16. 16 align attribute on div, form, paragraph (p), and heading (h1...h6) elements align, noshade, size, and width attributes on hr element align, border, vspace, and hspace attributes on img and object elements align attribute on legend and caption elements align and bgcolor on table element nowrap, bgcolor, width, height on td and th elements bgcolor attribute on tr element clear attribute on br element compact attribute on dl, dir and menu elements type, compact, and start attributes on ol and ul elements type and value attributes on li element width attribute on pre element Additional elements in loose (transitional) specification menu list (no substitute, though unordered list is recommended; may return in XHTML 2.0 specification) dir list (no substitute, though unordered list is recommended) isindex (element requires server-side support and is typically added to documents server-side) applet (deprecated in favor of object element) The pre element does not allow: applet, font, and basefont (elements not defined in strict DTD) The language attribute on script element (presumably redundant with type attribute, though this is maintained for legacy reasons). Frame related entities frameset element (used in place of body for frameset DTD) frame element iframe noframes target attribute on anchor, client-side image-map (imagemap), link, form, and base elements [edit] Frameset versus transitional In addition to the above transitional differences, the frameset specifications (whether XHTML 1.0 or HTML 4.01) specifies a different content model: <html> <head> Any of the various head related elements. </head> <frameset> At least one of either: another frameset or a frame and an optional noframes element. </frameset> </html>
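Returning to the earlier point about XML's well-formedness rules versus HTML's leniency: an XML processor must reject a document whose elements are not explicitly closed, while an HTML parser simply reports what it sees and leaves the missing closing label to be implied. A small sketch of that difference, using Python's standard xml.etree and html.parser modules (an assumption made purely for illustration; nothing in the specifications requires Python):

# Sketch: the same fragment fed to a strict XML parser and to a lenient HTML parser.
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

fragment = "<body><p>first paragraph<p>second paragraph</body>"

# XML: the unclosed <p> makes the fragment non-well-formed, so parsing fails outright.
try:
    ET.fromstring(fragment)
except ET.ParseError as err:
    print("XML parser rejected it:", err)

# HTML: the parser accepts the same input and just reports the labels it encounters,
# leaving the consumer to infer that the first <p> ended where the second began.
class TagLogger(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("start:", tag)
    def handle_endtag(self, tag):
        print("end:", tag)

TagLogger().feed(fragment)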
  • 17. 17 [edit] Summary of flavors As this list demonstrates, the loose flavors of the specification are maintained for legacy support. However, contrary to popular misconceptions, the move to XHTML does not imply a removal of this legacy support. Rather the X in XML stands for extensible and the W3C is modularizing the entire specification and opening it up to independent extensions. The primary achievement in the move from XHTML 1.0 to XHTML 1.1 is the modularization of the entire specification. The strict version of HTML is deployed in XHTML 1.1 through a set of modular extensions to the base XHTML 1.1 specification. Likewise someone looking for the loose (transitional) or frameset specifications will find similar extended XHTML 1.1 support (much of it is contained in the legacy or frame modules). The modularization also allows for separate features to develop on their own timetable. So for example XHTML 1.1 will allow quicker migration to emerging XML standards such as MathML (a presentational and semantic math language based on XML) and XFORMS — a new highly advanced web-form technology to replace the existing HTML forms. In summary, the HTML 4.01 specification primarily reined in all the various HTML implementations into a single clear written specification based on SGML. XHTML 1.0, ported this specification, as is, to the new XML defined specification. Next, XHTML 1.1 takes advantage of the extensible nature of XML and modularizes the whole specification. XHTML 2.0 will be the first step in adding new features to the specification in a standards-body-based approach. NetMeetting Microsoft NetMeeting is a VoIP and multi-point videoconferencing client included in many versions of Microsoft Windows (from Windows 95 OSR2 to Windows XP). It uses the H.323 protocol for video and audio conferencing, and is interoperable with OpenH323-based clients such as Ekiga, and Internet Locator Service (ILS) as mirror server. It also uses a slightly modified version of the ITU T.120 Protocol for whiteboarding, application sharing, desktop sharing, remote desktop sharing (RDS) and file transfers. The secondary Whiteboard in NetMeeting 2.1 and later utilizes the H.324 protocol. Before video service became common on free IM clients, such Yahoo Messenger and MSN Messenger, NetMeeting was a popular way to perform video conferences and chatting over the Internet (with the help of public ILS servers). Since the release of Windows XP, Microsoft has deprecated it in favour of Windows Messenger, although it is still installed by default (Start > Run... > conf.exe). Note that Windows Messenger, MSN Messenger and Windows Live Messenger hooks directly into NetMeeting for the application sharing, desktop sharing, and Whiteboard features exposed by each application.
  • 18. 18 As of the release of Windows Vista, NetMeeting is no longer included and has been replaced by Windows Meeting Space. chat can refer to any kind of communication over the internet, but is primarily meant to refer to direct 1-on-1 chat or text-based group chat (formally also known as synchronous conferencing), using tools such as instant messaging applications—computer programs, Internet Relay Chat, talkers and possibly MUDs, MUCKs, MUSHes and MOOes. While many of the web's well known custodians offer online chat and messaging services for free, an increasing number of providers are beginning to show strong revenue streams from paid-for services. Again it is the Adult service providers, profiting from the advent of reliable and high-speed broadband, (notably across Eastern Europe) who are at the forefront of the paid-for online chat revolution. For every business traveller engaging in a video call or conference call rather than braving the check-in queue, there are countless web users replacing traditional conversational means with online chat and messaging. Like Email, which has reduced the need and usage of letter, fax and memo communication, online chat is steadily replacing telephony as the means of office and home communication. The early adopters in these areas are undoubtedly teenage users of instant messaging. It might not be long before SMS text messaging usage declines as mobile handsets provide the technology for online chat. Other forms of online chat that are not usually referred to as online chat [edit] MUDs A MUD, or a multi-user dungeon, is a multi-user version of dungeons and dragons for the internet, and is an early use of the internet. In a MUD, as well as playing the game, people can chat to each other. Talkers were originally based off MUDs and the earliest versions of talkers were primarily MUDs without the gaming element. Other derivations of MUDs were used that combined gaming with talking, and these include MUSHes, MOOs and MUCKs. [edit] Discussion boards Besides real-time chat, another type of online community includes Internet forums and bulletin board systems (BBSes), where users write posts (blocks of text) to which later visitors may respond. Unlike the transient nature of chats, these systems generally archive posts and save them for weeks or years. They can be used for technical troubleshooting, advice, general conversation and more. See also General terms • Chat room • Web chat site • Voice chat • VoIP Voice over IP • Live support software • Online discussion
  • 19. 19 • Online discourse environment Protocols/Programs • Talker • Internet Relay Chat • Instant messenger • PalTalk • Talk (Unix) • MUD • MUSH • MOO • Google Talk • Yahoo! Messenger • Skype • SILC • Windows Live Messenger • Campfire Chat programs supporting multiple protocols • Adium • Gaim • Miranda IM • Trillian • Retrieved from "http://en.wikipedia.org/wiki/Online_chat" Plugins A plugin (or plug-in) is a computer program that interacts with a main (or host) application (a web browser or an email program, for example) to provide a certain, usually very specific, function on-demand. Typical examples are • plugins that read or edit specific types of files (for instance, decode multimedia files) • encrypt or decrypt email (for instance, PGP) • filter images in graphic programs in ways that the host application could not normally do • play and watch Flash presentations in a web browser The host application provides services which the plugins can use, including a way for plugins to register themselves with the host application and a protocol by which data is exchanged with plugins. Plugins are dependent on these services provided by the main application and do not usually work by themselves. Conversely, the main application is
• 20. 20 independent of the plugins, making it possible for plugins to be added and updated dynamically without changes to the main application. Plugins are slightly different from extensions, which modify or add to existing functionality. The main difference is that plugins generally rely on the main application's user interface and have a well-defined boundary to their possible set of actions. Extensions generally have fewer restrictions on their actions, and may provide their own user interfaces. They are sometimes used to decrease the size of the main application and offer optional functions. Mozilla Firefox uses a well-developed extension system to reduce the feature creep that plagued the Mozilla Application Suite. Perhaps the first software applications to include a plugin function were HyperCard and QuarkXPress on the Macintosh, both released in 1987. In 1988, Silicon Beach Software included plugin functionality in Digital Darkroom and SuperPaint, and the term plug-in was coined by Ed Bomke. Currently, plugins are typically implemented as shared libraries that must be installed in a place prescribed by the main application. HyperCard supported a similar facility, but it was more common for the plugin code to be included in the HyperCard documents (called stacks) themselves. This way, the HyperCard stack became a self-contained application in its own right, which could be distributed as a single entity that could be run by the user without the need for additional installation steps. Open application programming interfaces (APIs) provide a standard interface, allowing third parties to create plugins that interact with the main application. A stable API allows third-party plugins to function as the original version changes and to extend the lifecycle of obsolete applications. The Adobe Photoshop and After Effects plugin APIs have become a standard and have been adopted to some extent by competing applications. Other examples of such APIs include Audio Units and VST. Examples Many professional software packages offer plugin APIs to developers, in order to increase the utility of the base product. Examples of these include: • Eclipse • GStreamer multimedia pipe handler • jEdit Program Editor • Quintessential Media Player, Winamp, foobar2000 and XMMS • Notepad++ • OmniPeek packet analysis platform • VST Audio Plugin Format
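The register-then-dispatch contract sketched in the plugin discussion above can be made concrete in a few lines. The following is a hypothetical host written in Python; the names Host, register, and render are invented for illustration and do not belong to any of the products or APIs listed.

# Sketch of a host application that exposes a registration point to plugins.
# All names here (Host, register, render) are made up for illustration.
from typing import Callable, Dict

class Host:
    def __init__(self):
        self._renderers: Dict[str, Callable[[bytes], str]] = {}

    def register(self, media_type: str, renderer: Callable[[bytes], str]) -> None:
        """Service offered by the host: plugins announce what they can handle."""
        self._renderers[media_type] = renderer

    def render(self, media_type: str, payload: bytes) -> str:
        # The host stays independent of any particular plugin; it only dispatches
        # to whatever happened to register itself for this media type.
        renderer = self._renderers.get(media_type)
        return renderer(payload) if renderer else "(no plugin for %s)" % media_type

host = Host()
host.register("text/plain", lambda data: data.decode("utf-8"))
print(host.render("text/plain", b"hello"))      # hello
print(host.render("image/png", b"\x89PNG"))     # (no plugin for image/png)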
• 21. 21 Communications protocol In the field of telecommunications, a communications protocol is the set of standard rules for data representation, signalling, authentication and error detection required to send information over a communications channel. An example of a simple communications protocol adapted to voice communication is the case of a radio dispatcher talking to mobile stations. The communication protocols for digital computer network communication have many features intended to ensure reliable interchange of data over an imperfect communication channel. A communications protocol is, in essence, an agreement to follow certain rules so that the system works properly. Network protocol design principles Systems engineering principles have been applied to create a set of common network protocol design principles. These principles include effectiveness, reliability, and resiliency. Effectiveness A protocol needs to be specified in such a way that engineers, designers, and in some cases software developers can implement and/or use it. In human-machine systems, its design needs to facilitate routine usage by humans. Protocol layering accomplishes these objectives by dividing the protocol design into a number of smaller parts, each of which performs closely related sub-tasks and interacts with other layers of the protocol only in a small number of well-defined ways. Protocol layering allows the parts of a protocol to be designed and tested without a combinatorial explosion of cases, keeping each design relatively simple. The implementation of a sub-task on one layer can make assumptions about the behavior and services offered by the layers beneath it. Thus, layering enables a "mix-and-match" of protocols that permits familiar protocols to be adapted to unusual circumstances. For an example that involves computing, consider an email protocol like the Simple Mail Transfer Protocol (SMTP). An SMTP client can send messages to any server that conforms to SMTP's specification. Actual applications can be (for example) an aircraft with an SMTP server receiving messages from a ground controller over a radio-based internet link. Any SMTP client can correctly interact with any SMTP server, because they both conform to the same protocol specification, RFC 2821. This paragraph informally provides some examples of layers, some required functionalities, and some protocols that implement them, all from the realm of computing protocols. At the lowest level, bits are encoded in electrical, light or radio signals by the Physical layer. Some examples include RS-232, SONET, and WiFi. A somewhat higher Data link layer such as the point-to-point protocol (PPP) may detect errors and configure the transmission system.
• 22. 22 An even higher protocol may perform network functions. One very common protocol is the Internet protocol (IP), which implements addressing for a large set of protocols. A common associated protocol is the Transmission control protocol (TCP), which implements error detection and correction (by retransmission). TCP and IP are often paired, giving rise to the familiar acronym TCP/IP. A layer in charge of presentation might describe how to encode text (i.e., ASCII or Unicode). An application protocol like SMTP may (among other things) describe how to inquire about electronic mail messages. These different tasks show why there is a need for a software architecture or reference model that systematically places each task into context. The reference model usually used for protocol layering is the OSI seven-layer model, which can be applied to any protocol, not just the OSI protocols of the International Organization for Standardization (ISO). In particular, the Internet Protocol can be analysed using the OSI model. Reliability Assuring reliability of data transmission involves error detection and correction, or some means of requesting retransmission. It is a truism that communication media are always faulty. The conventional measure of quality is the ratio of failed bits to bits transmitted. This has the useful feature of being a dimensionless figure of merit that can be compared across any speed or type of communication media. In telephony, links with bit error rates (BER) of 10⁻⁴ or more are regarded as faulty (they interfere with telephone conversations), while links with a BER of 10⁻⁵ or more should be dealt with by routine maintenance (they can be heard). Data transmission often requires bit error rates below 10⁻¹². Computer data transmissions are so frequent that larger error rates would affect operations of customers like banks and stock exchanges. Since most transmissions use networks with telephonic error rates, the errors caused by these networks must be detected and then corrected. Communications systems detect errors by transmitting a summary of the data along with the data. In TCP (the Internet's Transmission Control Protocol), the sum of the data bytes of a packet is sent in each packet's header. Simple arithmetic sums do not detect out-of-order data, or cancelling errors. A bit-wise binary polynomial, a cyclic redundancy check, can detect these errors and more, but is slightly more expensive to calculate. Communication systems correct errors by selectively resending bad parts of a message. For example, in TCP when a checksum is bad, the packet is discarded. When a packet is lost, the receiver acknowledges all of the packets up to, but not including, the failed packet. Eventually, the sender sees that too much time has elapsed without an acknowledgement, so it resends all of the packets that have not been acknowledged. At the same time, the sender backs off its rate of sending, in case the packet loss was caused by saturation of the path between sender and receiver. (Note: this is an over-simplification; see TCP and congestion collapse for more detail.)
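The remark about simple sums versus stronger checks is easy to demonstrate. In the Python sketch below, an additive checksum fails to notice two swapped bytes, while a CRC over the same data does. This is an illustration only: real TCP uses a 16-bit ones'-complement sum over 16-bit words rather than the toy sum shown here, and zlib.crc32 merely stands in for the polynomial checks used on real links.

# Sketch: an additive checksum misses reordering; a CRC catches it.
import zlib

original  = b"PAY 100 TO ALICE"
reordered = b"PAY 010 TO ALICE"   # same bytes, two of them swapped

def additive_checksum(data: bytes) -> int:
    return sum(data) % 65536

print(additive_checksum(original) == additive_checksum(reordered))   # True  -> error undetected
print(zlib.crc32(original) == zlib.crc32(reordered))                 # False -> error detected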
• 23. 23 In general, the performance of TCP is severely degraded in conditions of high packet loss (more than 0.1%), due to the need to resend packets repeatedly. For this reason, TCP/IP connections are typically either run on highly reliable fiber networks, or over a lower-level protocol with added error-detection and correction features (such as modem links with ARQ). These connections typically have uncorrected bit error rates of 10⁻⁹ to 10⁻¹², ensuring high TCP/IP performance. Resiliency Resiliency addresses a form of network failure known as topological failure, in which a communications link is cut or degrades below usable quality. Most modern communication protocols periodically send messages to test a link. In phones, a framing bit is sent every 24 bits on T1 lines. In phone systems, when "sync is lost", fail-safe mechanisms reroute the signals around the failing equipment. In packet-switched networks, the equivalent functions are performed using router update messages to detect loss of connectivity. Standards organizations Most recent protocols are assigned by the IETF for Internet communications, and by the IEEE or ISO organizations for other types. The ITU-T handles telecommunications protocols and formats for the public switched telephone network (PSTN). The ITU-R handles protocols and formats for radio communications. As the PSTN, radio systems, and Internet converge, the different sets of standards are also being driven towards technological convergence. Protocol families A number of major protocol stacks or families exist, including the following open standards: • Internet protocol suite • Open Systems Interconnection (OSI) A connection-oriented networking protocol is one which identifies traffic flows by some connection identifier rather than by explicitly listing source and destination addresses. Typically, this connection identifier is a small integer (10 bits for Frame Relay, 24 for ATM, for example). This makes network switches substantially faster (as routing tables are just simple look-up tables, and are trivial to implement in hardware). The impact is so great, in fact, that even characteristically connectionless protocols, such as IP traffic, are being tagged with connection-oriented header prefixes (e.g., as with MPLS, or IPv6's built-in Flow ID field). Note that connection-oriented protocols are not necessarily reliable protocols. ATM and Frame Relay, for example, are both connection-oriented but unreliable. There are also reliable connectionless protocols, such as AX.25 when it passes data in I-frames, but this combination is rare, and reliable-connectionless is uncommon in commercial and academic networks. Note that connection-oriented protocols handle real-time traffic substantially more efficiently than connectionless protocols, which is why ATM has yet to be replaced by Ethernet for carrying real-time, isochronous traffic streams, especially in heavily
• 24. 24 aggregated networks like backbones, where the motto "bandwidth is cheap" fails to deliver on its promise. Experience has also shown that overprovisioning bandwidth does not resolve all quality of service issues. Hence, (10-)gigabit Ethernet is not expected to replace ATM at this time. List of connection-oriented protocols • TCP • Phone call (the user must dial the telephone and get an answer before transmitting data) • ATM • Frame Relay Connectionless protocol In telecommunications, connectionless describes communication between two network end points in which a message can be sent from one end point to another without prior arrangement. The device at one end of the communication transmits data to the other, without first ensuring that the recipient is available and ready to receive the data. The device sending a message simply sends it addressed to the intended recipient. As such, there are more frequent problems with transmission than with connection-oriented protocols, and it may be necessary to resend the data several times. Connectionless protocols are often disfavoured by network administrators because it is much harder to filter malicious packets from a connectionless protocol using a firewall. The Internet Protocol (IP) and User Datagram Protocol (UDP) are connectionless protocols, but TCP/IP (the most common use of IP) is connection-oriented. Connectionless protocols are usually described as stateless because the endpoints have no protocol-defined way to remember where they are in a "conversation" of message exchanges. The alternative to the connectionless approach uses connection-oriented protocols, which are sometimes described as stateful because they can keep track of a conversation. List of connectionless protocols • IP • UDP • ICMP • IPX
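The connectionless/connection-oriented contrast shows up directly in socket code. A hedged sketch using Python's standard socket module (the loopback address and port 9999 are placeholders, and nothing needs to be listening for the UDP case, which is precisely the point): the UDP sender just addresses a datagram and sends it with no prior arrangement, while the TCP sender must first establish a connection that both ends then track.

# Sketch: connectionless (UDP) versus connection-oriented (TCP) sending.
import socket

# UDP: no handshake, no connection state -- just address the datagram and send.
udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp.sendto(b"hello, maybe", ("127.0.0.1", 9999))
udp.close()

# TCP: a connection must be set up first; connect() fails if no peer answers.
tcp = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    tcp.connect(("127.0.0.1", 9999))   # three-way handshake happens here
    tcp.sendall(b"hello, reliably")
except ConnectionRefusedError:
    print("no server listening, so the connection-oriented send cannot proceed")
finally:
    tcp.close()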
• 25. 25 In computing, a protocol is a convention or standard that controls or enables the connection, communication, and data transfer between two computing endpoints. In its simplest form, a protocol can be defined as the rules governing the syntax, semantics, and synchronization of communication. Protocols may be implemented by hardware, software, or a combination of the two. At the lowest level, a protocol defines the behavior of a hardware connection. Typical properties It is difficult to generalize about protocols because they vary so greatly in purpose and sophistication. Most protocols specify one or more of the following properties: • Detection of the underlying physical connection (wired or wireless), or the existence of the other endpoint or node • Handshaking • Negotiation of various connection characteristics • How to start and end a message • How to format a message • What to do with corrupted or improperly formatted messages (error correction) • How to detect unexpected loss of the connection, and what to do next • Termination of the session or connection Importance The widespread use and expansion of communications protocols is both a prerequisite to the Internet and a major contributor to its power and success. The pair of Internet Protocol (or IP) and Transmission Control Protocol (or TCP) are the most important of these, and the term TCP/IP refers to a collection (or protocol suite) of its most used protocols. Most of the Internet's communication protocols are described in the RFC documents of the Internet Engineering Task Force (or IETF). Object-oriented programming has extended the use of the term to include the programming protocols available for connections and communication between objects. Generally, only the simplest protocols are used alone. Most protocols, especially in the context of communications or networking, are layered together into protocol stacks, where the various tasks listed above are divided among different protocols in the stack. Whereas the protocol stack denotes a specific combination of protocols that work together, the reference model is a software architecture that lists each layer and the services each should offer. The classic seven-layer reference model is the OSI model, which is used for conceptualizing protocol stacks and peer entities. This reference model also provides an opportunity to teach more general software engineering concepts like hiding, modularity, and delegation of tasks. This model has endured in spite of the demise of many of its protocols (and protocol stacks) originally sanctioned by the ISO. The OSI model is not the only reference model, however. Common Protocols • HTTP (Hypertext Transfer Protocol) • POP3 (Post Office Protocol 3) • SMTP (Simple Mail Transfer Protocol) • FTP (File Transfer Protocol) • IP (Internet Protocol)
26
Search Engine
A search engine is an information retrieval system designed to help find information stored on a computer system, such as on the World Wide Web, inside a corporate or proprietary network, or on a personal computer. The search engine allows one to ask for content meeting specific criteria (typically content containing a given word or phrase) and retrieves a list of items that match those criteria. This list is often sorted with respect to some measure of the relevance of the results. Search engines use regularly updated indexes to operate quickly and efficiently.
Without further qualification, search engine usually refers to a Web search engine, which searches for information on the public Web. Other kinds of search engine are enterprise search engines, which search on intranets, personal search engines, and mobile search engines. Different selection and relevance criteria may apply in different environments, or for different uses. Some search engines also mine data available in newsgroups, databases, or open directories. Unlike Web directories, which are maintained by human editors, search engines operate algorithmically or with a mixture of algorithmic and human input.
How search engines work
A search engine operates in the following order:
Web crawling
Indexing
Searching
A web crawler (also known as a Web spider or Web robot) is a program or automated script which browses the World Wide Web in a methodical, automated manner. Other, less frequently used names for Web crawlers are ants, automatic indexers, bots, and worms (Kobayashi and Takeda, 2000). This process is called Web crawling or spidering. Many legitimate sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, which will index the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code. Crawlers can also be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for spam).
A Web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies, as sketched below.
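The sketch below, in Python, illustrates the seed and crawl-frontier loop described above; the function name crawl, the page limit, and the seed URL are illustrative, and a real crawler would additionally honour robots.txt, politeness delays, and use a proper HTML parser.

from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def crawl(seeds, max_pages=50):
    """Toy breadth-first crawler: start from the seeds, follow discovered links."""
    frontier = deque(seeds)      # the crawl frontier: URLs waiting to be visited
    visited = set()
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
        except Exception:
            continue             # unreachable or non-text resource; a real crawler would log this
        pages[url] = html        # keep a copy of the page for the indexer to process later
        # Naive link extraction; the discovered URLs are appended to the frontier.
        for link in re.findall(r'href="([^"#]+)"', html):
            frontier.append(urljoin(url, link))
    return pages

# corpus = crawl(["https://example.com/"])   # the seed list is illustrative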
27
Web crawler architectures
[Figure: high-level architecture of a standard Web crawler]
A crawler must not only have a good crawling strategy, as noted in the previous sections, but it should also have a highly optimized architecture. Shkapenyuk and Suel (Shkapenyuk and Suel, 2002) noted that: "While it is fairly easy to build a slow crawler that downloads a few pages per second for a short period of time, building a high-performance system that can download hundreds of millions of pages over several weeks presents a number of challenges in system design, I/O and network efficiency, and robustness and manageability."
Web crawlers are a central part of search engines, and details on their algorithms and architecture are kept as business secrets. When crawler designs are published, they often lack important details, which prevents others from reproducing the work. There are also emerging concerns about "search engine spamming", which prevent major search engines from publishing their ranking algorithms.
Search engine indexing entails how data is collected, parsed, and stored to facilitate fast and accurate retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, physics, and computer science. An alternate name for the process is Web indexing, within the context of search engines designed to find web pages on the Internet. Popular engines focus on the full-text indexing of online, natural-language documents, yet there are other searchable media types such as video, audio, and graphics. Meta search engines reuse the indices of other services and do not store a local index, whereas cache-based search engines permanently store the index along with the corpus. Unlike full-text indices, partial-text services restrict the depth indexed to reduce index size. Larger services typically perform indexing at a predetermined interval due to the required time and processing costs, whereas agent-based search engines index in real time.
Indexing
28
The goal of storing an index is to optimize the speed and performance of finding relevant documents for a search query. Without an index, the search engine would scan every document in the corpus, which would take a considerable amount of time and computing power. For example, an index of 1,000 documents can be queried within milliseconds, whereas a raw scan of 1,000 documents could take hours. No search engine user would be comfortable waiting several hours to get search results. The trade-off for the time saved during retrieval is that additional storage is required to store the index and that it takes a considerable amount of time to update.
Index Design Factors
Major factors in designing a search engine's architecture include:
Merge factors - how data enters the index, or how words or subject features are added to the index during corpus traversal, and whether multiple indexers can work asynchronously. The indexer must first check whether it is updating old content or adding new content. Traversal typically correlates to the data collection policy. Search engine index merging is similar in concept to the SQL MERGE command and other merge algorithms.
Storage techniques - how to store the index data, and whether the information should be compressed or filtered.
Index size - how much computer storage is required to support the index.
Lookup speed - how quickly a word can be found in the inverted index. How quickly an entry in a data structure can be found, versus how quickly it can be updated or removed, is a central focus of computer science.
Maintenance - maintaining the index over time.
Fault tolerance - how important it is for the service to be reliable; how to deal with index corruption, whether bad data can be treated in isolation, and how to deal with bad hardware; partitioning schemes such as hash-based or composite partitioning; data replication.
Index Data Structures
Search engine architectures vary in how indexing is performed and in how the index is stored in order to meet the various design factors. Types of indices include:
Suffix trees - figuratively structured like a tree, supporting linear-time lookup. Built by storing the suffixes of words. Used for searching for patterns in DNA sequences and for clustering. A major drawback is that storing a word in the tree may require more space than storing the word itself. An alternate representation is a suffix array, which is considered to require less memory and supports compression schemes such as the Burrows-Wheeler transform (BWT).
Tries - an ordered tree data structure used to store an associative array where the keys are strings. Regarded as faster than a hash table but less space efficient. The suffix tree is a type of trie. Tries support extendible hashing, which is important for search engine indexing. (A minimal trie sketch follows this list.)
Inverted indices - store a list of occurrences of each atomic search criterion, typically in the form of a hash table or binary tree.
Citation indices - store the existence of citations or hyperlinks between documents to support citation analysis, a subject of bibliometrics.
Ngram indices - store sequences of data of length n to support other types of retrieval or text mining.
Term document matrices - used in latent semantic analysis; store the occurrences of words in documents in a two-dimensional sparse matrix.
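As referenced in the list above, here is a minimal Python sketch of a trie used as an index structure; the class and function names are illustrative, and the postings stored at a terminal node stand in for whatever per-term data a real index would keep.

class TrieNode:
    """One node of a character-keyed trie; a node holding postings marks a complete term."""
    def __init__(self):
        self.children = {}    # character -> TrieNode
        self.postings = None  # document IDs for the term ending at this node

def trie_insert(root, term, doc_id):
    node = root
    for ch in term:
        node = node.children.setdefault(ch, TrieNode())
    if node.postings is None:
        node.postings = []
    node.postings.append(doc_id)

def trie_lookup(root, term):
    node = root
    for ch in term:
        node = node.children.get(ch)
        if node is None:
            return None       # the term was never indexed
    return node.postings

root = TrieNode()
trie_insert(root, "cow", 2)
trie_insert(root, "cow", 3)
print(trie_lookup(root, "cow"))   # [2, 3]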
29
Challenges in Parallelism
A major challenge in the design of search engines is the management of parallel processes. There are many opportunities for race conditions and coherence faults. For example, a new document is added to the corpus and the index must be updated, but the index simultaneously needs to continue responding to search queries. This is a collision between two competing tasks. Consider that authors are producers of information, and a crawler is the consumer of this information, grabbing the text and storing it in a cache (or corpus). The forward index is the consumer of the information produced by the corpus, and the inverted index is the consumer of information produced by the forward index. This is commonly referred to as a producer-consumer model. The indexer is the producer of searchable information and users are the consumers who need to search. The challenge is magnified when working with distributed storage and distributed processing. In an effort to scale with larger amounts of indexed information, the search engine's architecture may involve distributed computing, where the search engine consists of several machines operating in unison. This increases the possibility of incoherency and makes it more difficult to maintain a fully synchronized, distributed, parallel architecture.
Inverted indices
Many search engines incorporate an inverted index when evaluating a search query, to quickly locate the documents which contain the words in the query and then rank these documents by relevance. The inverted index stores a list of the documents for each word. The search engine can retrieve the matching documents quickly using direct access to find the documents for a word. The following is a simplified illustration of an inverted index:
Inverted Index
Word   Documents
the    Document 1, Document 3, Document 4, Document 5
cow    Document 2, Document 3, Document 4
says   Document 5
moo    Document 7
The above is a simplified form of a Boolean index. Such an index would only serve to determine whether a document matches a query, but would not contribute to ranking matched documents. In some designs the index includes additional information, such as the frequency of each word in each document or the positions of the word in each document. With position information, the search algorithm can identify word proximity to support searching for phrases; frequency can be used to help rank the relevance of documents to the query. Such topics are the central research focus of information retrieval.
The inverted index is a sparse matrix, given that not all words are present in each document. It is stored differently than a two-dimensional array to reduce memory requirements.
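The same idea in code: a minimal Python sketch that builds a Boolean inverted index over a toy corpus (the document texts reuse the examples from the forward index illustration that follows) and answers an AND query by intersecting posting sets.

from collections import defaultdict

documents = {
    1: "the cow says moo",
    2: "the cat and the hat",
    3: "the dish ran away with the spoon",
}

# Build the inverted index: each word maps to the set of documents containing it.
inverted = defaultdict(set)
for doc_id, text in documents.items():
    for word in text.split():
        inverted[word].add(doc_id)

# A Boolean AND query intersects the posting sets of the query words.
def search(query):
    postings = [inverted.get(word, set()) for word in query.split()]
    return set.intersection(*postings) if postings else set()

print(search("the cow"))   # {1}: only Document 1 contains both words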
30
The index is similar to the term document matrices employed by latent semantic analysis. The inverted index can be considered a form of hash table. In some cases the index is a form of binary tree, which requires additional storage but may reduce the lookup time. In larger indices the architecture is typically distributed. Inverted indices can be programmed in any of several computer programming languages.
Index Merging
The inverted index is filled via a merge or rebuild. A rebuild is similar to a merge but first deletes the contents of the inverted index. The architecture may be designed to support incremental indexing, where a merge identifies the document or documents to add to or update in the index and parses each document into words. For technical accuracy, a merge involves combining newly indexed documents, typically residing in virtual memory, with the index cache residing on one or more computer hard drives.
After parsing, the indexer adds the containing document to the document list for the appropriate words. The process of finding each word in the inverted index in order to denote that it occurred within a document may be too time consuming when designing a larger search engine, so this process is commonly split into two parts: the development of a forward index and the process of sorting the contents of the forward index into the inverted index. The inverted index is named inverted because it is an inversion of the forward index.
The Forward Index
The forward index stores a list of words for each document. The following is a simplified form of the forward index:
Forward Index
Document     Words
Document 1   the, cow, says, moo
Document 2   the, cat, and, the, hat
Document 3   the, dish, ran, away, with, the, spoon
The rationale behind developing a forward index is that as documents are parsed, it is better to immediately store the words per document. The delineation enables asynchronous processing, which partially circumvents the inverted index update bottleneck. The forward index is sorted to transform it into an inverted index. The forward index is essentially a list of pairs consisting of a document and a word, collated by document. Converting the forward index to an inverted index is only a matter of sorting the pairs by the words. In this regard, the inverted index is a word-sorted forward index.
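A minimal Python sketch of this conversion, using the forward index table above: flattening the forward index into (word, document) pairs and sorting the pairs by word yields the inverted index.

# Forward index: document -> list of words, as produced during parsing.
forward_index = {
    "Document 1": ["the", "cow", "says", "moo"],
    "Document 2": ["the", "cat", "and", "the", "hat"],
    "Document 3": ["the", "dish", "ran", "away", "with", "the", "spoon"],
}

# Flatten into (word, document) pairs and sort by word; grouped by word,
# the sorted pairs are exactly the inverted index (a word-sorted forward index).
pairs = sorted((word, doc) for doc, words in forward_index.items() for word in words)

inverted_index = {}
for word, doc in pairs:
    postings = inverted_index.setdefault(word, [])
    if doc not in postings:
        postings.append(doc)

print(inverted_index["the"])   # ['Document 1', 'Document 2', 'Document 3']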
31
Compression
Generating or maintaining a large-scale search engine index represents a significant storage and processing challenge. Many search engines utilize a form of compression to reduce the size of the indices on disk. Consider the following scenario for a full-text Internet search engine:
An estimated 2,000,000,000 different web pages existed as of the year 2000.
A fictitious estimate of 250 words per web page on average, based on the assumption of pages being similar to the pages of a novel.
It takes 8 bits (or 1 byte) to store a single character; some encodings use 2 bytes per character.
The average number of characters in any given word on a page can be estimated at 5 (Wikipedia: Size comparisons).
The average personal computer comes with about 20 gigabytes of usable space.
Given these estimates, generating an uncompressed index (assuming a non-conflated, simple index) for 2 billion web pages would require storing about 500 billion word entries. At 1 byte per character, or 5 bytes per word, this would require roughly 2,500 gigabytes of storage space for the index alone, far more than the average personal computer's free disk space. The space requirement is even larger in the case of a fault-tolerant distributed storage architecture. Using compression, the index size can be reduced to a fraction of this, depending on which compression techniques are chosen. The trade-off is the time and processing power required to perform compression. Notably, large-scale search engine designs incorporate the cost of storage, as well as the cost of the electricity to power the storage; compression, in this regard, is a measure of cost as well.
Document Parsing
Document parsing involves breaking apart the components (words) of a document or other form of media for insertion into the forward and inverted indices. For example, if the full contents of a document consisted of the sentence "Hello World", there would typically be two words found, the token "Hello" and the token "World". In the context of search engine indexing and natural language processing, parsing is more commonly referred to as tokenization, and sometimes as word boundary disambiguation, tagging, text segmentation, content analysis, text analysis, text mining, concordance generation, speech segmentation, lexing, or lexical analysis. The terms 'indexing', 'parsing', and 'tokenization' are used interchangeably in corporate slang.
Natural language processing, as of 2006, is the subject of continuous research and technological improvement. There are a host of challenges in tokenization, in extracting the necessary information from documents for indexing to support quality searching. Tokenization for indexing involves multiple technologies, the implementation of which is commonly kept as a corporate secret.
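As a minimal illustration of the "Hello World" example above, the following Python sketch tokenizes text by lowercasing it and splitting on non-word characters; real tokenizers, as discussed next, must also handle entities such as e-mail addresses and URLs, and languages whose words are not separated by whitespace.

import re

def tokenize(text):
    # Lowercase the text and split on runs of non-word characters.
    return [token for token in re.split(r"\W+", text.lower()) if token]

print(tokenize("Hello World"))                   # ['hello', 'world']
print(tokenize("E-mail me: bob@example.com!"))   # ['e', 'mail', 'me', 'bob', 'example', 'com']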
32
Challenges in Natural Language Processing
Word Boundary Ambiguity - native English speakers may at first consider tokenization to be a straightforward task, but this is not the case when designing a multilingual indexer. In digital form, the text of other languages such as Chinese, Japanese or Arabic represents a greater challenge, as words are not clearly delineated by whitespace. The goal during tokenization is to identify words for which users will search. Language-specific logic is employed to properly identify the boundaries of words, which is often the rationale for designing a parser for each language supported (or for groups of languages with similar boundary markers and syntax).
Language Ambiguity - to assist with properly ranking matching documents, many search engines collect additional information about each word, such as its language or lexical category (part of speech). These techniques are language-dependent, as syntax varies among languages. Documents do not always clearly identify the language of the document or represent it accurately. In tokenizing the document, some search engines attempt to automatically identify the language of the document.
Diverse File Formats - in order to correctly identify which bytes of a document represent characters, the file format must be correctly handled. Search engines which support multiple file formats must be able to correctly open and access the document and be able to tokenize the characters of the document.
Faulty Storage - the quality of the natural language data cannot always be assumed to be perfect. An unspecified number of documents, particularly on the Internet, do not closely obey proper file protocol. Binary characters may be mistakenly encoded into various parts of a document. Without recognition of these characters and appropriate handling, the index quality or indexer performance could degrade.
Tokenization
Unlike literate human adults, computers are not inherently aware of the structure of a natural language document and do not instantly recognize words and sentences. To a computer, a document is only a big sequence of bytes. Computers do not know that a space character between two sequences of characters means that there are two separate words in the document. Instead, a computer program is developed by humans which instructs the computer how to identify what constitutes an individual or distinct word, referred to as a token. This program is commonly referred to as a tokenizer, parser, or lexer. Many search engines, as well as other natural language processing software, incorporate specialized programs for parsing, such as YACC or Lex.
During tokenization, the parser identifies sequences of characters which typically represent words. Commonly recognized tokens include punctuation, sequences of numerical characters, alphabetical characters, alphanumerical characters, binary characters (backspace, null, print, and other antiquated print commands), whitespace (space, tab, carriage return, line feed), and entities such as e-mail addresses, phone numbers, and URLs. When identifying each token, several characteristics may be stored, such as the token's case (upper, lower, mixed, proper), language or encoding, lexical category (part of speech, like 'noun' or 'verb'), position, sentence number, sentence position, length, and line number.
Language Recognition
If the search engine supports multiple languages, a common initial step during tokenization is to identify each document's language, given that many of the later steps are language dependent (such as stemming and part-of-speech tagging). Language recognition is the process by which a computer program attempts to automatically identify, or categorize, the language of a document. Other names for language recognition include language classification, language analysis, language identification, and language tagging.
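One simple heuristic for language recognition, sketched in Python: count how many common function words of each candidate language appear among a document's tokens. The tiny stop-word lists below are illustrative stand-ins for a real language recognition chart or statistical model.

STOPWORDS = {
    "english": {"the", "and", "of", "to", "is"},
    "german": {"der", "die", "und", "ist", "nicht"},
    "french": {"le", "la", "et", "est", "les"},
}

def guess_language(tokens):
    # Score each language by how many of its common words occur in the document.
    scores = {lang: sum(1 for t in tokens if t in words) for lang, words in STOPWORDS.items()}
    return max(scores, key=scores.get)

print(guess_language(["the", "cow", "says", "moo"]))           # english
print(guess_language(["der", "hund", "ist", "nicht", "da"]))   # german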
33
Automated language recognition is the subject of ongoing research in natural language processing. Determining which language a document's words belong to may involve the use of a language recognition chart.
Format Analysis
Depending on whether the search engine supports multiple document formats, documents must be prepared for tokenization. The challenge is that many document formats contain, in addition to textual content, formatting information. For example, HTML documents contain HTML tags, which specify formatting information such as whether to start a new line, display a word in bold, or change the font size or family. If the search engine were to ignore the difference between content and markup, the markup would also be included in the index, leading to poor search results. Format analysis involves the identification and handling of formatting content embedded within documents which controls how the document is rendered on a computer screen or interpreted by a software program. Format analysis is also referred to as structure analysis, format parsing, tag stripping, format stripping, text normalization, text cleaning, or text preparation.
The challenge of format analysis is further complicated by the intricacies of the various file formats. Certain file formats are proprietary, and very little information about them is disclosed, while others are well documented. Common, well-documented file formats that many search engines support include:
Microsoft Word
Microsoft Excel
Microsoft PowerPoint
IBM Lotus Notes
HTML
ASCII text files (a text document without any formatting)
Adobe's Portable Document Format (PDF)
PostScript (PS)
LaTeX
The UseNet archive (NNTP) and other deprecated bulletin board formats
XML and derivatives like RSS
SGML (this is more of a general protocol)
Multimedia metadata formats like ID3
Techniques for dealing with various formats include:
Using a publicly available commercial parsing tool that is offered by the organization which developed, maintains, or owns the format
Writing a custom parser
Some search engines support inspection of files that are stored in a compressed or encrypted file format. When working with a compressed format, the indexer first decompresses the document, which may result in one or more files, each of which must be indexed separately. Commonly supported compressed file formats include:
ZIP - Zip file
RAR - archive file
CAB - Microsoft Windows Cabinet file
Gzip - Gzip file
BZIP - Bzip file
TAR, GZ, and TAR.GZ - Unix gzip'ped archives
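A minimal Python sketch of the tag stripping step of format analysis for HTML, using the standard html.parser module: markup is discarded, text content is kept, and text inside elements such as script and style, which never contribute searchable words, is ignored. This is only an illustration; real format analysis must cope with many more formats and with malformed markup.

from html.parser import HTMLParser

class TagStripper(HTMLParser):
    SKIP = {"script", "style"}   # elements whose text content should not be indexed

    def __init__(self):
        super().__init__()
        self.text = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.text.append(data)

stripper = TagStripper()
stripper.feed("<html><body><h1>Hello</h1><script>var x = 1;</script><p>World</p></body></html>")
print(" ".join(stripper.text).split())   # ['Hello', 'World']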
34
Format analysis can involve quality improvement methods to avoid including 'bad information' in the index. Content producers can manipulate the formatting information to include additional content. Examples of abusing document formatting for spamdexing:
Including hundreds or thousands of words in a section which is hidden from view on the computer screen, but visible to the indexer, by use of formatting (e.g. a hidden "div" tag in HTML, which may incorporate the use of CSS or JavaScript to do so).
Setting the foreground font color of words to the same as the background color, making the words hidden on the computer screen to a person viewing the document, but not hidden to the indexer.
Section Recognition
Some search engines incorporate section recognition, the identification of major parts of a document, prior to tokenization. Not all the documents in a corpus read like a well-written book, divided into organized chapters and pages. Many documents on the web, such as newsletters and corporate reports, contain erroneous content and side sections which do not contain primary material (that which the document is about). For example, this article may display a side menu with words inside links to other web pages. Some file formats, like HTML or PDF, allow for content to be displayed in columns. Even though the content is displayed, or rendered, in different areas of the view, the raw markup content may store this information sequentially. Words that appear sequentially in the raw source content are indexed sequentially, even though these sentences and paragraphs are rendered in different parts of the computer screen. If search engines index this content as if it were normal content, the quality of the index and of the search results is degraded due to the mixed content and improper word proximity. Two primary problems are noted:
Content in different sections is treated as related in the index, when in reality it is not.
Organizational 'side bar' content is included in the index, but the side bar content does not contribute to the meaning of the document, and the index is filled with a poor representation of its documents, assuming the goal is to capture the meaning of each document, a sub-goal of providing quality search results.
Section analysis may require the search engine to implement the rendering logic of each document, essentially an abstract representation of the actual document, and then index the representation instead. For example, some content on the Internet is rendered via JavaScript. Viewers of web pages in web browsers see this content. If the search engine does not render the page and evaluate the JavaScript within the page, it would not 'see' this content in the same way, and would index the document incorrectly. Given that some search engines do not bother with rendering issues, many web page designers avoid displaying content via JavaScript or use the Noscript tag to ensure that the web page is indexed