3. What is XML?
• Some important facts about XML
– XML stands for eXtensible Markup Language
– It was developed by the W3C
• World Wide Web Consortium
• www.w3.org
– XML 1.0 (2nd Edition)
• W3C recommendation
• http://www.w3.org/TR/REC-xml
– XML 1.1
• Candidate recommendation
4. Evolution of WWW
• The Web was once only a publishing tool for scientific documents.
• Now it is a full-fledged medium, like TV or print.
– Furthermore, the Web is an interactive medium
– Over 800 million Web pages are written in HTML
5. Problems of HTML (I)
• Over the years, HTML has been extended
– HTML has close to 100 tags
– Supporting technologies have been introduced by vendors
– Still more tags are needed!
• Example
– E-commerce applications need tags for prices and product references
– Streaming would need tags to control the flow of media
– HTML is already on the verge of collapsing under its own weight!
6. Problems of HTML (II)
• Some applications would benefit greatly
from a reduction in the tag count!
– More and more people are accessing the Web from PDAs and smartphones
• Mobile devices are not as powerful as PCs
• They cannot easily process a complex markup language
• Pages often contain more markup than actual content
7. Basic Principles of XML
• Increasingly specialized applications need more tags, while other applications want a simpler language
– The W3C resolved this dilemma by making two changes to HTML
• No predefined tags
• Stricter syntax
8. No Predefined Tags (I)
• XML has no predefined tags.
– The author creates all the tags they need
• If you need a certain tag, just make it
HTML
<table>
  <tr>
    <td>Price USD 499 </td>
    <td><a href="/newsletter"><b>Pineapplesoft Link</b></a></td>
  </tr>
</table>
XML
<price currency="usd">499.00</price>
<toc xlink:href="/newsletter">Pineapplesoft Link</toc>
9. No Predefined Tags (II)
• How does the browser know how to display an author-defined tag?
– Style sheet
• Can we compare different prices?
• What about the current and previous browsers?
• Can we simplify Web site maintenance?
10. Stricter Syntax
• More than 50% of the code in a browser is devoted to handling errors or sloppiness on the author's part.
– Due to the increasing use of HTML editors
– Browsers are growing in size and becoming slower
• XML adopts a strict syntax to allow smaller and faster browsers
HTML <p>Welcome to our site! <img src=logo.jpg>
XML <p>Welcome to our site! <img src="logo.jpg"/></p>
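The difference between the two snippets can be checked with any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the choice of tool and the helper name is_well_formed are ours, not part of the slides.

```python
import xml.etree.ElementTree as ET

# Sloppy HTML-style markup: unquoted attribute, no end tags.
sloppy = '<p>Welcome to our site! <img src=logo.jpg>'
# Strict XML markup: quoted attribute, every element closed.
strict = '<p>Welcome to our site! <img src="logo.jpg"/></p>'


def is_well_formed(text):
    """Return True if text parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print("sloppy:", is_well_formed(sloppy))
print("strict:", is_well_formed(strict))
```

An XML parser simply rejects the first snippet instead of guessing what the author meant, which is what allows it to stay small.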
11. Document Structures (I)
• An example
[title]
INTERNAL MEMO
[header]
From: Bh Huang
To: Conrad Ho
Regarding: Using User Attention Model in Watermarking
[body]
Have you finished the job? Can I adopt the program directly? I think using the user attention model will be of great benefit.
Bh
12. Document Structures (II)
<?xml version="1.0"?>
<memo>
  <header>
    <from>Bh Huang</from>
    <to>Conrad Ho</to>
    <subject>Using User Attention Model in Watermarking</subject>
  </header>
  <body>
    <para>Have you finished the job? Can I adopt the program directly? I think using the user attention model will be of great benefit.</para>
    <signature>Bh</signature>
  </body>
</memo>
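Because the memo's structure is explicit, software can pick out individual fields by name. A minimal sketch, assuming Python's standard xml.etree.ElementTree module (our choice of tool) and a shortened copy of the memo:

```python
import xml.etree.ElementTree as ET

# Shortened copy of the memo from the slide.
memo_xml = """<?xml version="1.0"?>
<memo>
  <header>
    <from>Bh Huang</from>
    <to>Conrad Ho</to>
    <subject>Using User Attention Model in Watermarking</subject>
  </header>
  <body>
    <para>Have you finished the job?</para>
    <signature>Bh</signature>
  </body>
</memo>"""

memo = ET.fromstring(memo_xml)

# Each field is addressed by its path through the element hierarchy.
sender = memo.findtext("header/from").strip()
recipient = memo.findtext("header/to").strip()
subject = memo.findtext("header/subject").strip()

print(sender, "->", recipient)
print("Subject:", subject)
```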
13. Application of XML
• Most popular applications of XML
– Document applications manipulate information
primarily intended for human consumption
– Data applications manipulate information primarily
intended for software communications
14. Document Publishing (I)
• XML concentrates on the structure of the
document, making it independent of the delivery
medium
[Diagram: a single XML document is published as HTML, PDF, and WML]
15. Document Publishing (II)
• It is possible to edit and maintain documents in
XML and automatically publish them on different
media
– More and more publications are available both online and in print
– Web is changing rapidly
– New markup languages are introduced for specific
devices
16. Data Applications
• If the structure of a document can be expressed in XML, so can the structure of a database.
• An XML web site can be regarded as a large database that applications can tap
17. Near-term Applications of XML
• Large web site maintenance
• Information exchange between organizations
• Content made available to different web sites
• E-commerce applications where different organizations collaborate to serve a customer
• Scientific applications with new markup languages for formulas or specifications
• E-books that need to express rights and ownership
19. An Example
Plain text file
John Doe
34 Fountain Square Plaza
Cincinnati, OH 45202
US
513-744-8889 (preferred)
513-744-7098
jdoe@emailaholic.com
Jack Smith
513-744-3465
jsmith@emailaholic.com
Never leave messages on his answering machine. Email instead.
XML Document
<?xml version="1.0"?>
<!-- Download from www.marchal.com or www.mcp.com -->
<address-book>
  <entry>
    <name>John Doe</name>
    <address>
      <street>34 Fountain Square Plaza</street>
      <region>OH</region>
      <postal-code>45202</postal-code>
      <locality>Cincinnati</locality>
      <country>US</country>
    </address>
    <tel preferred="true">513-744-8889</tel>
    <tel>513-744-7098</tel>
    <email href="mailto:john@emailaholic.com"/>
  </entry>
  <entry>
    <name>Jack Smith</name>
    <tel>513-744-3465</tel>
    <email href="mailto:jack@emailaholic.com"/>
    <comments>Never leave messages on his answering
    machine. <b>Email instead.</b></comments>
  </entry>
</address-book>
• Which one is easier to read?
• Which one is easier for software to interpret?
20. Elements
• Fundamental Units of XML
– E.g. <tel>513-744-7098</tel>
– Each element is delimited by a start tag and an end tag, much like HTML tags
• A start tag is the element name enclosed in “<” and “>”
• An end tag adds a “/” before the element name
– Both a start tag and an end tag are required for an element
21. Naming an Element
• The names of elements must follow specific rules.
– The element name must start with a letter or an underscore (_)
– The rest of the name can consist of letters, digits, hyphens (-), periods (.), or underscores (_)
– Spaces are not allowed in an element name
– Element names are case-sensitive
Legal: <copyright-information> <p> <base64> <decompte.client> <firstname>
Illegal: <123> <first name> <Tom&jerry>
Case sensitivity: <address> <ADDRESS> <Address> are three different elements
Suggested writing: address-book or AddressBook
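The naming rules can be approximated with a regular expression. The sketch below is a simplified, ASCII-only check written for illustration: the real XML specification also accepts many non-ASCII letters and the colon, and the names NAME_RE and is_legal_name are our own.

```python
import re

# Simplified, ASCII-only approximation of the rules on this slide:
# first character is a letter or underscore; the rest may also
# include digits, hyphens, and periods.
NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")


def is_legal_name(name):
    """Return True if name satisfies the simplified naming rules."""
    return NAME_RE.fullmatch(name) is not None


candidates = ["copyright-information", "p", "base64", "decompte.client",
              "123", "first name", "Tom&jerry"]
for candidate in candidates:
    print(candidate, is_legal_name(candidate))
```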
22. Attributes
• Attributes attach additional information to elements
– <tel preferred="true">513-744-8889</tel>
• An attribute consists of a name and a value.
• Attribute names follow the same rules as element names
• The start tag of an element can contain zero, one, or several attributes
• Quotation marks around the value are required (either ' or ")
– <confidentiality level="I don't know">This document is not confidential
</confidentiality>
• Attributes are not part of the element name
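To see how a parser separates the element name, its attributes, and its content, here is a small sketch using Python's standard xml.etree.ElementTree module (our choice of tool, not something the slides prescribe):

```python
import xml.etree.ElementTree as ET

tel = ET.fromstring('<tel preferred="true">513-744-8889</tel>')

print(tel.tag)               # the element name, without attributes
print(tel.attrib)            # all attributes as a dictionary
print(tel.get("preferred"))  # a single attribute value
print(tel.text)              # the element content

# Single and double quotes are interchangeable around a value.
single = ET.fromstring("<tel preferred='true'/>")
double = ET.fromstring('<tel preferred="true"/>')
same_attributes = single.attrib == double.attrib
```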
23. Special Attributes
• xml:space
– Specifies how whitespace in the element should be handled
• preserve: keep all whitespace as written
• default: let the application apply its default whitespace handling
• xml:lang
– Specifies the language in which the element's content is written
• <p xml:lang="en-GB">What colour is it?</p>
• <p xml:lang="en-US">What color is it?</p>
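An application can read xml:lang like any other attribute. In Python's standard xml.etree.ElementTree (used here only as an illustration), the reserved xml: prefix is reported under its namespace URI, so the lookup key looks slightly different from the markup; the constant name XML_NS is ours.

```python
import xml.etree.ElementTree as ET

# Namespace URI that the reserved "xml:" prefix is bound to.
XML_NS = "http://www.w3.org/XML/1998/namespace"

british = ET.fromstring('<p xml:lang="en-GB">What colour is it?</p>')
american = ET.fromstring('<p xml:lang="en-US">What color is it?</p>')

# ElementTree stores namespaced attributes as "{uri}localname".
lang_key = "{%s}lang" % XML_NS
for paragraph in (british, american):
    print(paragraph.get(lang_key), "->", paragraph.text)
```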
24. Empty Elements
• Elements having no contents are called empty
elements
– <email href="bhhuang@ms23.hinet.net" />
– <email href="bhhuang@ms23.hinet.net"></email>
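The two notations are equivalent to a parser. A quick check, sketched with Python's standard xml.etree.ElementTree module (our choice of tool):

```python
import xml.etree.ElementTree as ET

short_form = ET.fromstring('<email href="bhhuang@ms23.hinet.net" />')
long_form = ET.fromstring('<email href="bhhuang@ms23.hinet.net"></email>')

# Compare name, attributes, text content, and number of children.
equivalent = (
    short_form.tag == long_form.tag
    and short_form.attrib == long_form.attrib
    and (short_form.text or "") == (long_form.text or "")
    and len(short_form) == len(long_form) == 0
)
print("equivalent:", equivalent)
```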
25. Hierarchical Structure of Elements
• An element may contain
– text, e.g. <name>
– other elements, e.g. <entry>
– a mixture of both, e.g. <comments>
<?xml version="1.0"?>
<!-- Download from www.marchal.com or www.mcp.com -->
<address-book>
  <entry>
    <name>John Doe</name>
    <address>
      <street>34 Fountain Square Plaza</street>
      <region>OH</region>
      <postal-code>45202</postal-code>
      <locality>Cincinnati</locality>
      <country>US</country>
    </address>
    <tel preferred="true">513-744-8889</tel>
    <tel>513-744-7098</tel>
    <email href="mailto:john@emailaholic.com"/>
  </entry>
  <entry>
    <name>Jack Smith</name>
    <tel>513-744-3465</tel>
    <email href="mailto:jack@emailaholic.com"/>
    <comments>Never leave messages on his answering
    machine. <b>Email instead.</b></comments>
  </entry>
</address-book>
26. Hierarchical Structure of Elements (cont.)
• Elements containing other elements are called parents
• Elements contained in other elements are called children
• Children must be fully contained within their parents
Correct
<entry>
  <name>Jack Smith</name>
  <tel>513-744-3465</tel>
  <email href="mailto:jack@emailaholic.com"/>
  <comments>Never leave messages on his answering
  machine. <b>Email instead.</b></comments>
</entry>
Wrong
<entry>
  <name>Jack Smith</name>
  <tel>513-744-3465</tel>
  <email href="mailto:jack@emailaholic.com"/>
  <comments>Never leave messages on his answering
  machine. <b>Email instead. </entry> </comments>
</b>
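A parser enforces the containment rule. The sketch below uses Python's standard xml.etree.ElementTree module (our choice of tool) on shortened versions of the two cases; the helper name parse_error is ours.

```python
import xml.etree.ElementTree as ET

well_nested = "<comments>Never leave messages. <b>Email instead.</b></comments>"
overlapping = "<comments>Never leave messages. <b>Email instead.</comments></b>"


def parse_error(text):
    """Return None if text is well-formed, else the parser's message."""
    try:
        ET.fromstring(text)
        return None
    except ET.ParseError as error:
        return str(error)


print("well nested:", parse_error(well_nested))
print("overlapping:", parse_error(overlapping))

# In the well-formed version, <b> is a child of <comments>.
comments = ET.fromstring(well_nested)
child_names = [child.tag for child in comments]
```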
27. The Root Element
• Each document must have exactly one root element
– All other elements must be descendants of the root element
Wrong
<?xml version="1.0"?>
<entry>
  <name>John Doe</name>
  <email href="mailto:john@emailaholic.com"/>
</entry>
<entry>
  <name>Jack Smith</name>
  <email href="mailto:jack@emailaholic.com"/>
</entry>
Correct
<?xml version="1.0"?>
<address-book>
  <entry>
    <name>John Doe</name>
    <email href="mailto:john@emailaholic.com"/>
  </entry>
  <entry>
    <name>Jack Smith</name>
    <email href="mailto:jack@emailaholic.com"/>
  </entry>
</address-book>
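The single-root rule can be demonstrated by parsing two sibling entries with and without a wrapping element. This is a sketch using Python's standard xml.etree.ElementTree module (our choice of tool), with the entries shortened:

```python
import xml.etree.ElementTree as ET

# Two sibling elements with no common parent: not a valid document.
two_roots = ("<entry><name>John Doe</name></entry>"
             "<entry><name>Jack Smith</name></entry>")

try:
    ET.fromstring(two_roots)
    parsed_without_wrapper = True
except ET.ParseError:
    parsed_without_wrapper = False

# Wrapping the entries in a single root element fixes the problem.
book = ET.fromstring("<address-book>" + two_roots + "</address-book>")
names = [entry.findtext("name") for entry in book.findall("entry")]

print("parsed without wrapper:", parsed_without_wrapper)
print("names:", names)
```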
28. The XML Declaration
• The first line of an XML document is usually the XML declaration
– <?xml version="1.0"?>
• The declaration identifies the document as an XML document
• It also states the XML version
• The XML declaration is optional, but including it is recommended
• The current version of XML is 1.0.
• The second edition is simply the first edition with errors corrected.
29. Comments
• Comments are surrounded by “<!--” and “-->”
• Comments are meant for human readers, so XML applications normally ignore them.
– E.g. <!-- Download from www.marchal.com or www.mcp.com -->
• Comments cannot appear inside a tag
– E.g. <name <!-- an invalid comment -->>Jack </name>
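Both points can be observed with a parser. The sketch below assumes Python's standard xml.etree.ElementTree module, whose default settings drop comments from the parsed tree; other parsers may expose them to the application.

```python
import xml.etree.ElementTree as ET

doc = """<address-book>
  <!-- Download from www.marchal.com or www.mcp.com -->
  <entry><name>Jack</name></entry>
</address-book>"""

root = ET.fromstring(doc)
# With default settings the comment does not appear among the children.
child_tags = [child.tag for child in root]
print(child_tags)

# A comment placed inside a tag makes the document malformed.
try:
    ET.fromstring("<name <!-- an invalid comment -->>Jack </name>")
    comment_in_tag_accepted = True
except ET.ParseError:
    comment_in_tag_accepted = False
print("comment inside tag accepted:", comment_in_tag_accepted)
```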
30. Unicode
• Unicode supports virtually all of the world's living languages, plus mathematical and other symbols
• In the UTF-16 encoding, most characters are represented by 16 bits
– A UTF-16 XML file can be about twice the size of the same text in an 8-bit encoding
– Solution: specify the encoding, such as "UTF-8" or "UTF-16", in the XML declaration; UTF-8 stores ASCII characters in a single byte
– E.g. <?xml version="1.0" encoding="ISO-8859-1" ?>
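The size effect of the encoding is easy to measure. A minimal sketch in Python (our choice of language), using a sample element of our own; note that Python's "utf-16" codec also writes a 2-byte byte-order mark:

```python
import xml.etree.ElementTree as ET

text = "<name>Beno\u00eet Marchal</name>"  # \u00ee is i with circumflex

latin1_bytes = text.encode("iso-8859-1")  # 1 byte per character
utf8_bytes = text.encode("utf-8")         # 1 byte for ASCII, 2 for \u00ee
utf16_bytes = text.encode("utf-16")       # 2 bytes per character + BOM

print(len(text), len(latin1_bytes), len(utf8_bytes), len(utf16_bytes))

# The declared encoding tells the parser how to decode the bytes.
declared = ('<?xml version="1.0" encoding="ISO-8859-1"?>' + text)
name = ET.fromstring(declared.encode("iso-8859-1"))
print(name.text)
```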
31. Entity
• Complex XML documents are often split across several files
• The unit of physical organization in an XML document is the entity
• E.g. if we define an entity “us” with the value “United States”, these two lines are equivalent:
– <country>&us;</country>
– <country>United States</country>
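One way to define such an entity is in the document's internal DTD subset. The slide does not show the definition, so the DOCTYPE below is our assumption about how "us" would be declared; the parsing code is a sketch using Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# The <!ENTITY ...> declaration is an assumed way of defining "us".
doc = """<?xml version="1.0"?>
<!DOCTYPE country [
  <!ENTITY us "United States">
]>
<country>&us;</country>"""

country = ET.fromstring(doc)
# The parser replaces the entity reference with its value.
print(country.text)
```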
32. Predefined Entities
• &lt; <
• &amp; &
• &gt; > (needed when “]]>” appears in content)
• &apos; '
• &quot; "
Entity reference:
<company>Marks &amp; Spencer</company>
Character reference:
<name>Beno&#238;t Marchal</name>
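The predefined entity references (&lt;, &amp;, &gt;, &apos;, &quot;) rarely need to be written by hand in generated XML. A minimal sketch using the escape helper from Python's standard xml.sax.saxutils module, plus a parse to confirm the round trip (our choice of tools, not part of the slides):

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "Marks & Spencer"
escaped = escape(raw)  # replaces &, < and > with entity references
print(escaped)

# Parsing turns the entity reference back into the plain character.
company = ET.fromstring("<company>" + escaped + "</company>")
print(company.text)

# A character reference names a character by its Unicode number.
name = ET.fromstring("<name>Beno&#238;t Marchal</name>")
print(name.text)
```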
33. Processing Instruction
• A mechanism for inserting non-XML statements into an XML document
– A compromise on the strictly structured nature of XML
– Enclosed in “<?” and “?>”
– The first word is the target: the application or device to which the instruction is directed
• <?xml version="1.0" encoding="ISO-8859-1" ?>
• <?xml-stylesheet href="simple-ie5.xsl" type="text/xsl" ?>
34. CDATA Sections
• Enclosed in “<![CDATA[” and “]]>”
• The XML parser treats everything inside as literal text, so markup characters need no escaping
• Useful when entity references would otherwise be needed very frequently, or when another XML document is included as text
<?xml version="1.0"?>
<example>
<![CDATA[
<?xml version="1.0"?>
<entry>
<name>John Doe</name>
</entry>]]>
</example>
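The effect of a CDATA section shows up when the document is parsed: the embedded markup arrives as plain text, not as child elements. A minimal sketch, assuming Python's standard xml.etree.ElementTree module:

```python
import xml.etree.ElementTree as ET

doc = """<?xml version="1.0"?>
<example>
<![CDATA[<?xml version="1.0"?>
<entry>
<name>John Doe</name>
</entry>]]>
</example>"""

example = ET.fromstring(doc)

# The CDATA content is text of <example>, not a nested <entry> element.
child_count = len(example)
print("child elements:", child_count)
print(example.text)
```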
35. Common Errors
• Forgetting the end tag
• Mismatching case between tags (XML is case-sensitive)
• Using spaces in element names
• Omitting the quotes around an attribute value