2. 2
• XML stands for eXtensible Markup Language.
• A markup language is used to provide information
about a document.
• Tags are added to the document to provide the extra
information.
• HTML tags tell a browser how to display the
document.
• Separates presentation (HTML) from data (XML)
• XML tags give a reader some idea what some of the
data means.
3. 3
• XML documents are used to transfer data from one place to
another often over the Internet.
• XML subsets are designed for particular applications.
• One is RSS (Rich Site Summary or Really Simple Syndication ).
It is used to send breaking news bulletins from one web site
to another.
• A number of fields have their own subsets. These include
chemistry, mathematics, and books publishing.
• Most of these subsets are registered with the W3Consortium
and are available for anyone’s use.
5. XML is a “use everywhere” data specification
Documents
Configuration
Database
Application X
Repository
XML XML
XML XML
6. Why XML is so Popular
• XML is popular because:
– it allows very complex sets of structured data to
be transmitted over the internet as a simple
stream of characters (a document), but…
– It allows processing of that simple stream as a rich
object model in the receiving application
• XML is ubiquitous on the Internet, and is the
basis of all web-services applications
7. Documents vs. Data
• XML is used to represent
two main types of things:
– Documents
• Lots of text with tags to
identify and annotate portions
of the document
– Data
• Hierarchical data structures
8. Overview
XML and Structured Data
• Pre-XML representation of data:
• XML representation of the same data:“PO-1234”,”CUST001”,”X9876”,”5”,”14.98”
<PURCHASE_ORDER>
<PO_NUM> PO-1234 </PO_NUM>
<CUST_ID> CUST001 </CUST_ID>
<ITEM_NUM> X9876 </ITEM_NUM>
<QUANTITY> 5 </QUANTITY>
<PRICE> 14.98 </PRICE>
</PURCHASE_ORDER>
9. Advantages of XML
• XML is text (Unicode) based.
– Takes up less space.
– Can be transmitted efficiently.
• One XML document can be displayed differently in different media.
– Html, video, CD, DVD,
– You only have to change the XML document in order to change all the rest.
• XML documents can be modularized. Parts can be reused.
• Electronic data interchange is critical in today’s networked
world and needs to be standardized
– Examples:
• Banking: funds transfer
• Education: e-learning contents
• Scientific data
– Chemistry: ChemML, …
– Genetics: BSML (Bio-Sequence Markup Language), …
10. The defunct W3C text-centric XML framework for XML-based documents
(XHTML2 is dead, the programmer-centric HTML5 rules)
HTML
XMLapplications.
PICS
P3P
RDF Semantics
XML
RDF
app. 2.0
SGML
XSL
CSS
DSSL
(for
SGML)
XLink
XPointer
Typical document
SVG
.....ML
SMIL
(subset of SGML)
XSLT
applications
XPath
The W3C consortium defines many XML-based languages
... details later
11. Syntax and Structure
Components of an XML Document
• Elements
– Each element has a beginning and ending tag
• <TAG_NAME>...</TAG_NAME>
– Elements can be empty (<TAG_NAME />)
• Attributes
– Describes an element; e.g. data type, data range, etc.
– Can only appear on beginning tag
• Processing instructions
– Encoding specification (Unicode by default)
– Namespace declaration
– Schema declaration
12. Syntax and Structure
Components of an XML Document
<?xml version=“1.0” ?>
<?xml-stylesheet type="text/xsl” href=“template.xsl"?>
<ROOT>
<ELEMENT1><SUBELEMENT1 /><SUBELEMENT2 /></ELEMENT1>
<ELEMENT2> </ELEMENT2>
<ELEMENT3 type=‘string’> </ELEMENT3>
<ELEMENT4 type=‘integer’ value=‘9.3’> </ELEMENT4>
</ROOT>
Prologue (processing instructions)
Elements
Elements with Attributes
14. Syntax and Structure
Rules For Well-Formed XML
• There must be one, and only one, root element
• Sub-elements must be properly nested
– A tag must end within the tag in which it was started
• Attributes are optional
– Defined by an optional schema
• Attribute values must be enclosed in “” or ‘’
• Processing instructions are optional
• XML is case-sensitive
– <tag> and <TAG> are not the same type of element
15. An XML Document Example
<imdb>
<show year=“1993”>
<title>Fugitive, The</title>
<review>
<suntimes>
<reviewer>Roger Ebert</reviewer> gives <rating>two thumbs
up</rating>! A fun action movie, Harrison Ford at his best.
</suntimes>
</review>
<review>
<nyt>The standard &hollywood; summer movie strikes back.</nyt>
</review>
<box_office>183,752,965</box_office>
</show>
<show year=“1994”>
<title>X Files,The</title>
<seasons>4</seasons>
</show>
</imdb>
Start Tag
End Tag
Attribute
Element
Mixed Content
17. Comparisons
• Extensible set of tags
• Content orientated
• Standard Data
infrastructure
• Allows multiple output
forms
• Fixed set of tags
• Presentation oriented
• No data validation
capabilities
• Single presentation
XML HTML
18. HTML vs. XML
<h1> Bibliography </h1>
<p> <i> Foundations of Databases </i>
Abiteboul, Hull, Vianu
<br> Addison Wesley, 1995
<p> <i> Data on the Web </i>
Abiteoul, Buneman, Suciu
<br> Morgan Kaufmann, 1999
<bibliography>
<book> <title> Foundations…
</title>
<author> Abiteboul
</author>
<author> Hull </author>
<author> Vianu </author>
<publisher> Addison Wesley
</publisher>
<year> 1995 </year>
</book>
…
</bibliography>
19. <h1> Bibliography </h1>
<p> <i> Foundations of Databases </i>
Abiteboul, Hull, Vianu
<br> Addison Wesley, 1995
<p> <i> Data on the Web </i>
Abiteoul, Buneman, Suciu
<br> Morgan Kaufmann, 1999
<bibliography>
<book> <title> Foundations… </title>
<author> Abiteboul </author>
<author> Hull </author>
<author> Vianu </author>
<publisher> Addison Wesley </publisher>
<year> 1995 </year>
</book>
…
</bibliography>
HTML describes presentation
XML describes contentXSL (stylesheets)
can be used to
specify the conversion
20. DKS – 15/09/2019 Slide 20
XML information structures (1)
Book
FrontMatter
BookTitle
Author(s)
PubInfo
Chapter(s)
ChapterTitle
Paragraph(s)
BackMatter
References
Index
Example 1: A possible book structure
21. XML information structures (2)
• Premise: A text is the sum of its component parts
– A <Book> could be defined as containing:
<FrontMatter>, <Chapter>s, <BackMatter>
– <FrontMatter> could contain:
<BookTitle> <Author>s <PubInfo>
– A <Chapter> could contain:
<ChapterTitle> <Paragraph>s
– A <Paragraph> could contain:
<Sentence>s or <Table>s or <Figure>s …
• Components chosen for book markup language
should reflect anticipated use ….
22. XML information structures (3)
<Book>
<FrontMatter>
<BookTitle>XML Is Easy</BookTitle>
<Author>Tim Cole</Author>
<Author>Tom Habing</Author>
<PubInfo>CDP Press, 2002</PubInfo>
</FrontMatter>
<Chapter>
<ChapterTitle>First Was SGML</ChapterTitle>
<Paragraph>Once upon a time …</Paragraph>
</Chapter>
</Book>
A corresponding XML fragment
(based on a corresponding XML application):
begin
element
end
element
23. XML information structures (4)
Example 2: Movies
• Elements can have attributes
<movies>
<movie genre="action" star="Halle Berry">
<name>Catwoman</name>
<date>(2004)</date>
<length>104 minutes</length>
</movie>
<movie genre="horror" star="Halle Berry">
<name>Gothika</name>
<date>(2003)</date>
<length>98 minutes</length>
</movie>
<movie genre="drama" star="Halle Berry">
<name>Monster's Ball</name>
<date>(2001)</date>
<length>111 minutes</length>
</movie>
</movies>
attribute
24. Encoding
• XML (like Java) uses Unicode to encode characters.
• Unicode comes in many flavors. The most common one used
in the West is UTF-8.
• UTF-8 is a variable length code. Characters are encoded in 1
byte, 2 bytes, or 4 bytes.
• The first 128 characters in Unicode are ASCII.
• In UTF-8, the numbers between 128 and 255 code for some of
the more common characters used in western Europe, such as
ã, á, å, or ç.
• Two byte codes are used for some characters not listed in the
first 256 and some Asian ideographs.
• Four byte codes can handle any ideographs that are left.
• Those using non-western languages should investigate other
versions of Unicode.
25. DKS – 15/09/2019 Slide 25
What is an XML document ?
• An XML document is a content marked up
with XML (can be a file, a string, a message
content or any other sort of data storage)
• There are 2 levels of conforming documents:
– Well-formed: respects the XML syntax
– Valid: In addition, respects one (or more)
associated grammars (schemas).
26. What is a well-formed XML document (1) ?
Well-formed documents follow basic syntax rules e.g.
• there is an XML declaration in the first line
• there is a single document root
• all tags use proper delimiters
• all elements have start and end tags
– But can be minimized if empty: <br/> instead of <br></br>
• all elements are properly nested
<author> <firstname>Mark</firstname>
<lastname>Twain</lastname> </author>
• appropriate use of special characters
• all attribute values are quoted:
<subject scheme=“LCSH”>Music</subject>
27. DKS – 15/09/2019 Slide 27
What is a well-formed XML document (2) ?
• Good example
<addressBook>
<person>
<name> <family>Wallace</family> <given>Bob</given> </name>
<email>bwallace@megacorp.com</email>
<address>Rue de Lausanne, Genève</address>
</person>
</addressBook>
• Bad example
<addressBook>
<address>Rue de Lausanne, Genève <person></address>
<name>
<family>Schneider</family> <firstName>Nina</firstName>
</name>
<email>nina@nina.name</email>
</person>
<name><family> Muller </family> <name>
</addressBook>
28. What is a valid XML document (3) ?
• Parser (i.e. that program that reads the XML) can
check markup of individual document against rules
expressed in a schema (DTD, XML Schema, etc.)
• Typically, a schema (grammar):
– Defines available elements
– Defines attributes of elements
– Defines how elements can be embedded
– Defines mandatory and optional information
• Authoring tools usually can enforce rules of
DTD/Schema while document is edited
29. DKS – 15/09/2019 Slide 29
Document Type Definitions (DTDs 1)
• XML document types can be specified using a DTD
• DTD constraints structure of XML data
– What elements can occur
– What attributes can/must an element have
– What subelements can/must occur inside each element, and how
many times.
• DTD does not constrain data types
– All values represented as strings in XML
• DTD definition syntax
– <!ELEMENT element (subelements-specification) >
– <!ATTLIST element (attributes) >
– … more details later
• Valid XML documents refer to a DTD (or other Schema)
30. DKS – 15/09/2019 Slide 30
Document Type Definitions (DTDs 2)
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE test PUBLIC "-//Webster//DTD test V1.0//EN"
<test> "test" is a document element </test>
<!DOCTYPE test [
<!ELEMENT test EMPTY> ]>
<test/>
External Public DTD Declaration
Internal DTD Declaration
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE test SYSTEM "test.dtd">
<test> "test" is a document element </test>
External DTD Declaration referring to a file or a URL
test = name
of the root
element
DTD is defined
in file test.dtd
DTD is defined
inside XML
Application
should know
DTD
31. XML Files are Trees
address
name email phone birthday
first last year month day
32. XML Trees
• An XML document has a single root node.
• The tree is a general ordered tree.
– A parent node may have any number of children.
– Child nodes are ordered, and may have siblings.
• Preorder traversals are usually used for getting
information out of the tree.
33. Validity
• A well-formed document has a tree structure and
obeys all the XML rules.
• A particular application may add more rules in either
a DTD (document type definition) or in a schema.
• Many specialized DTDs and schemas have been
created to describe particular areas.
• These range from disseminating news bulletins (RSS)
to chemical formulas.
• DTDs were developed first, so they are not as
comprehensive as schema.
34. Document Type Definitions
• A DTD describes the tree structure of a
document and something about its data.
• There are two data types, PCDATA and CDATA.
– PCDATA is parsed character data.
– CDATA is character data, not usually parsed.
• A DTD determines how many times a node
may appear, and how child nodes are ordered.
35. DTD for address Example
<!ELEMENT address (name, email, phone, birthday)>
<!ELEMENT name (first, last)>
<!ELEMENT first (#PCDATA)>
<!ELEMENT last (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT phone (#PCDATA)>
<!ELEMENT birthday (year, month, day)>
<!ELEMENT year (#PCDATA)>
<!ELEMENT month (#PCDATA)>
<!ELEMENT day (#PCDATA)>
36. Schemas
• Schemas are themselves XML documents.
• They were standardized after DTDs and provide more
information about the document.
• They have a number of data types including string,
decimal, integer, boolean, date, and time.
• They divide elements into simple and complex types.
• They also determine the tree structure and how
many children a node may have.
38. DKS – 15/09/2019 Slide 38
XML Schemas
• XML Schema is a more sophisticated schema language which
addresses the drawbacks of DTDs. Supports
– Typing of values
• E.g. integer, string, etc
• Also, constraints on min/max values
– User-defined, complex types
– Many more features, including
• uniqueness and foreign key constraints, inheritance
• XML Schema is itself specified in XML syntax, unlike DTDs
– More-standard representation, but verbose
• XML Scheme is integrated with namespaces
• BUT: XML Schema is significantly more complicated than
DTDs.
40. Explanation of Example Schema
<?xml version="1.0" encoding="ISO-8859-1" ?>
• ISO-8859-1, Latin-1, is the same as UTF-8 in the first 128 characters.
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
• www.w3.org/2001/XMLSchema contains the schema standards.
<xs:element name="address">
<xs:complexType>
• This states that address is a complex type element.
<xs:sequence>
• This states that the following elements form a sequence and must come in the
order shown.
<xs:element name="name" type="xs:string"/>
• This says that the element, name, must be a string.
<xs:element name="birthday" type="xs:date"/>
• This states that the element, birthday, is a date. Dates are always of the form
yyyy-mm-dd.
41. XSLT
Extensible Stylesheet Language Transformations
• XSLT is used to transform one xml document into
another, often an html document.
• The Transform classes are now part of Java 1.4.
• A program is used that takes as input one xml
document and produces as output another.
• If the resulting document is in html, it can be viewed
by a web browser.
• This is a good way to display xml data.
43. The Result of the Transformation
Alice Lee
alee@aol.com
123-45-6789
1983-7-15
44. Parsers
• There are two principal models for parsers.
• SAX – Simple API for XML
– Uses a call-back method
– Similar to javax listeners
• DOM – Document Object Model
– Creates a parse tree
– Requires a tree traversal