XML
HTML and XML, I

    XML stands for eXtensible Markup Language
    HTML is used to mark up text so it    XML is used to mark up data so it
    can be displayed to users             can be processed by computers

    HTML describes both structure (e.g.   XML describes only content, or
    <p>, <h2>, <em>) and appearance       “meaning”
    (e.g. <br>, <font>, <i>)

    HTML uses a fixed, unchangeable       In XML, you make up your own
    set of tags                           tags

HTML and XML, II
    • HTML and XML look similar, because they are
      both SGML languages (SGML = Standard
      Generalized Markup Language)
      – Both HTML and XML use elements enclosed in tags
        (e.g. <body>This is an element</body>)
      – Both use tag attributes (e.g.,
        <font face="Verdana" size="+1" color="red">)
      – Both use entities (&lt;, &gt;, &amp;, &quot;,
        &apos;)
    • More precisely,
      – HTML is defined in SGML
      – XML is a (very small) subset of SGML
HTML and XML
    • HTML is for humans
      – HTML describes web pages
      – You don’t want to see error messages about the
        web pages you visit
      – Browsers ignore and/or correct as many HTML
        errors as they can, so HTML is often sloppy
    • XML is for computers
      – XML describes data
      – The rules are strict and errors are not allowed
         • In this way, XML is like a programming language
      – Current versions of most browsers can display XML
         • However, browser support of XML is spotty at best
XML-related technologies
• DTD (Document Type Definition) and XML Schemas are used
  to define legal XML tags and their attributes for particular
  purposes

• CSS (Cascading Style Sheets) describe how to display HTML or
  XML in a browser

• XSLT (eXtensible Stylesheet Language Transformations) and
  XPath are used to translate from one form of XML to another

• DOM (Document Object Model), SAX (Simple API for XML), and
  JAXP (Java API for XML Processing) are all APIs for XML parsing

Example XML document

    <?xml version="1.0"?>
    <weatherReport>
      <date>7/14/97</date>
      <city>North Place</city>, <state>NX</state>
      <country>USA</country>
      High Temp: <high scale="F">103</high>
      Low Temp: <low scale="F">70</low>
      Morning: <morning>Partly cloudy, Hazy</morning>
      Afternoon: <afternoon>Sunny &amp; hot</afternoon>
      Evening: <evening>Clear and Cooler</evening>
    </weatherReport>




    From: XML: A Primer, by Simon St. Laurent
Overall structure
• An XML document may start with one or more
  processing instructions (PIs) or directives:
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/css" href="ss.css"?>
• Following the directives, there must be exactly
  one root element containing all the rest of the
  XML:
    <weatherReport>
       ...
    </weatherReport>
XML building blocks
    • Aside from the directives, an XML document
      is built from:
      – elements: high in <high scale="F">103</high>
      – tags, in pairs: <high scale="F">103</high>
      – attributes: <high scale="F">103</high>
      – entities: <afternoon>Sunny &amp; hot</afternoon>
      – character data, which may be:
         • parsed (processed as XML)--this is the default
         • unparsed (all characters stand for themselves)
Elements and attributes
• Attributes and elements are somewhat interchangeable
• Example using just elements:
      <name>
        <first>David</first>
        <last>Matuszek</last>
      </name>
• Example using attributes:
      <name first="David" last="Matuszek"></name>
• You will find that elements are easier to use in your programs--this
  is a good reason to prefer them
• Attributes often contain metadata, such as unique IDs
• Generally speaking, browsers display only elements (values
  enclosed by tags), not tags and attributes


Well-formed XML
• Every element must have both a start tag and an end tag, e.g.
  <name> ... </name>
     – But empty elements can be abbreviated: <break />.
     – XML tags are case sensitive
     – XML tags may not begin with the letters xml, in any
       combination of cases
• Elements must be properly nested, e.g. not <b><i>bold and
  italic</b></i>
• Every XML document must have one and only one root element
• The values of attributes must be enclosed in single or double
  quotes, e.g. <time unit="days">
• Character data cannot contain < or &
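
The rules above can be checked mechanically by handing the text to a parser. The following is a minimal sketch, assuming the standard JAXP DocumentBuilder; the class name WellFormedCheck and the method isWellFormed are made up for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper, not part of the original slides.
class WellFormedCheck {

    /** Returns true if the parser accepts the text as well-formed XML. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Install our own handler so errors are thrown rather than
            // reported by the parser's default handler.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) throws SAXException { throw e; }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;   // a well-formedness rule was broken
        } catch (Exception e) {
            throw new IllegalStateException("Could not run the parser", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<name><first>David</first></name>"));
        System.out.println(isWellFormed("<b><i>bold and italic</b></i>")); // bad nesting
        System.out.println(isWellFormed("<time unit=days/>"));             // unquoted value
        System.out.println(isWellFormed("<a/><b/>"));                      // two roots
    }
}
```

Only the first call in main follows every rule on this slide; the other three each break exactly one of them.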

Entities
     • Five special characters must be written as
       entities:
         &amp; for    &   (almost always necessary)
         &lt;   for   <    (almost always necessary)
         &gt; for     >   (not usually necessary)
         &quot; for   "    (necessary inside double quotes)
         &apos; for   '   (necessary inside single quotes)
     • These entities can be used even in places
       where they are not absolutely required
     • These are the only predefined entities in XML
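
When a program writes XML text itself, it has to make these substitutions before emitting character data or attribute values. A minimal sketch follows; the class XmlEscape and its escape method are names invented for this example, and the method simply replaces all five characters, which is always allowed even where not strictly required.

```java
// Hypothetical helper, not part of the original slides.
class XmlEscape {

    /** Replaces the five special characters with their predefined entities. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Build the <afternoon> element from the weather report example.
        System.out.println("<afternoon>" + escape("Sunny & hot") + "</afternoon>");
    }
}
```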
XML declaration
     • The XML declaration looks like this:
       <?xml version="1.0" encoding="UTF-8"
       standalone="yes"?>
        – The XML declaration is optional in XML 1.0, but it is good practice
          to include it
        – If present, the XML declaration must be first--not even whitespace
          may precede it
        – Note that the brackets are <? and ?>
        – version is required whenever the declaration is present; "1.0" is
          by far the most common value (XML 1.1 exists but is rarely used)
        – encoding can be "UTF-8" or "UTF-16" (both are encodings of
          Unicode; plain ASCII text is already valid UTF-8), or something
          else, or it can be omitted
        – standalone="yes" says the document does not depend on
          declarations in an external DTD


Processing instructions
• PIs (Processing Instructions) may occur anywhere in the XML
  document (but usually first)
• A PI is a command to the program processing the XML
  document to handle it in a certain way
• XML documents are typically processed by more than one
  program
• Programs that do not recognize a given PI should just ignore it
• General format of a PI: <?target instructions?>
• Example: <?xml-stylesheet type="text/css"
  href="mySheet.css"?>
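
A program that does recognize a PI can pick it out of the parsed document by its target. The sketch below assumes the JAXP DOM parser; ProcessingInstructionDemo and firstInstruction are hypothetical names used only here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the original slides.
class ProcessingInstructionDemo {

    /** Returns "target|data" for the first top-level PI, or null if none. */
    static String firstInstruction(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            // PIs placed before the root element are children of the Document.
            // The <?xml ...?> declaration is not reported as a PI node by DOM.
            for (Node n = doc.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE) {
                    ProcessingInstruction pi = (ProcessingInstruction) n;
                    return pi.getTarget() + "|" + pi.getData();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the document", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?>"
                   + "<?xml-stylesheet type=\"text/css\" href=\"mySheet.css\"?>"
                   + "<doc/>";
        System.out.println(firstInstruction(xml));
    }
}
```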


Comments
• <!-- This is a comment in both HTML and XML -->
• Comments can be put anywhere in an XML document
• Comments are useful for:
     – Explaining the structure of an XML document
     – Commenting out parts of the XML during development and testing
•    Comments are not elements and do not have an end tag
•    The blanks after <!-- and before --> are optional
•    The character sequence -- cannot occur in the comment
•    The closing bracket must be -->
•    Comments are not displayed by browsers, but can be seen by
     anyone who looks at the source code

CDATA
• By default, all text inside an XML document is parsed
• You can force text to be treated as unparsed character data by
  enclosing it in <![CDATA[ ... ]]>
• Any characters, even & and <, can occur inside a CDATA
• Whitespace inside a CDATA is (usually) preserved
• The only real restriction is that the character sequence ]]>
  cannot occur inside a CDATA
• CDATA is useful when your text has a lot of illegal characters
  (for example, if your XML document contains some HTML
  text)
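
To a program reading the document, a CDATA section and the equivalent entity-escaped text look the same. The following sketch illustrates this with the JAXP DOM parser; CdataDemo and textOf are names made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the original slides.
class CdataDemo {

    /** Parses the XML and returns the text content of its root element. */
    static String textOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the document", e);
        }
    }

    public static void main(String[] args) {
        // The same HTML fragment, once inside CDATA and once with entities.
        String withCdata    = "<note><![CDATA[<b>5 < 6 & 7 > 2</b>]]></note>";
        String withEntities = "<note>&lt;b&gt;5 &lt; 6 &amp; 7 &gt; 2&lt;/b&gt;</note>";
        System.out.println(textOf(withCdata));
        System.out.println(textOf(withCdata).equals(textOf(withEntities)));
    }
}
```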


Names in XML
• Names (as used for tags and attributes) must
  begin with a letter or underscore, and can consist
  of:
     – Letters, both Roman (English) and foreign
     – Digits, both Roman and foreign
       . (dot)
       - (hyphen)
       _ (underscore)
       : (colon) should be used only for namespaces
     – Combining characters and extenders (not used in
       English)

Namespaces
• Recall that DTDs are used to define the tags that
  can be used in an XML document
• An XML document may reference more than one
  DTD
• Namespaces are a way to specify which DTD
  defines a given tag
• XML, like Java, uses qualified names
     –   This helps to avoid collisions between names
     –   Java: myObject.myVariable
     –   XML: myDTD:myTag
     –   Note that XML uses a colon (:) rather than a dot (.)
Namespaces and URIs
     • A namespace is defined as a unique string
       – To guarantee uniqueness, typically a URI
         (Uniform Resource Identifier) is used, because
         the author “owns” the domain
       – It doesn't have to be a “real” URI; it just has to
         be a unique string
       – Example: http://www.matuszek.org/ns
     • There are two ways to use namespaces:
       – Declare a default namespace
       – Associate a prefix with a namespace, then use
         the prefix in the XML to refer to the namespace
Namespace syntax
• In any start tag you can use the reserved attribute name xmlns:
      <book xmlns="http://www.matuszek.org/ns">
   – This namespace will be used as the default for all elements up to the
      corresponding end tag
   – You can override it with a specific prefix

• You can use almost this same form to declare a prefix:
      <book xmlns:dave="http://www.matuszek.org/ns">
   – Use this prefix on every tag and attribute you want to use from this
      namespace, including end tags--it is not a default prefix
      <dave:chapter dave:number="1">To Begin</dave:chapter>

• You can use the prefix in the start tag in which it is defined:
      <dave:book xmlns:dave="http://www.matuszek.org/ns">
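
A parser can report which namespace each tag belongs to, whether it was declared by prefix or as a default. This is a rough sketch using the JAXP DOM parser, which is only namespace-aware when asked to be; NamespaceDemo and parseRoot are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the original slides.
class NamespaceDemo {

    /** Parses with namespace support turned on and returns the root element. */
    static Element parseRoot(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);   // off by default in JAXP
            return factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the document", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<dave:book xmlns:dave=\"http://www.matuszek.org/ns\">"
                   + "<dave:chapter dave:number=\"1\">To Begin</dave:chapter>"
                   + "</dave:book>";
        Element chapter = (Element) parseRoot(xml).getFirstChild();
        System.out.println(chapter.getNodeName());      // qualified name, with prefix
        System.out.println(chapter.getLocalName());     // name without the prefix
        System.out.println(chapter.getNamespaceURI());  // the unique string
        System.out.println(
            chapter.getAttributeNS("http://www.matuszek.org/ns", "number"));
    }
}
```

Lookups such as getAttributeNS use the namespace string, not the prefix, so the document author is free to pick any prefix.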


Review of XML rules
• Start with <?xml version="1.0"?>
• XML is case sensitive
• You must have exactly one root element that
  encloses all the rest of the XML
• Every element must have a closing tag
• Elements must be properly nested
• Attribute values must be enclosed in double or
  single quotation marks
• There are only five predeclared entities

Another well-formed example
     <novel>
       <foreword>
         <paragraph>This is the great American novel.
         </paragraph>
       </foreword>
       <chapter number="1">
         <paragraph>It was a dark and stormy night.
         </paragraph>
         <paragraph>Suddenly, a shot rang out!
         </paragraph>
       </chapter>
     </novel>


XML as a tree
• An XML document represents a hierarchy; a hierarchy is a tree

                          novel


           foreword                     chapter
                                      number="1"



        paragraph            paragraph          paragraph


      This is the great     It was a dark     Suddenly, a shot
      American novel.     and stormy night.      rang out!
Valid XML
• You can make up your own XML tags and attributes, but...
     – ...any program that uses the XML must know what to expect!
• A DTD (Document Type Definition) defines what tags are legal
  and where they can occur in the XML
• An XML document does not require a DTD
• XML is well-formed if it follows the rules given earlier
• In addition, XML is valid if it declares a DTD and conforms to that
  DTD
• A DTD can be included in the XML, but is typically a separate
  document
• Errors in XML documents will stop XML programs
• Some alternatives to DTDs are XML Schemas and RELAX NG
Viewing XML
• XML is designed to be processed by computer
  programs, not to be displayed to humans
• Nevertheless, almost all current browsers can
  display XML documents
     – They don’t all display it the same way
     – They may not display it at all if it has errors
     – For best results, update your browsers to the newest
       available versions
• Remember:
    HTML is designed to be viewed,
    XML is designed to be used

Extended document standards
• You can define your own XML tag sets, but here
  are some already available:
     –   XHTML: HTML redefined in XML
     –   SMIL: Synchronized Multimedia Integration Language
     –   MathML: Mathematical Markup Language
     –   SVG: Scalable Vector Graphics
     –   DrawML: Drawing MetaLanguage
     –   ICE: Information and Content Exchange
     –   ebXML: Electronic Business with XML
     –   cxml: Commerce XML
     –   CBL: Common Business Library

Vocabulary
     • SGML: Standard Generalized Markup Language
     • XML : Extensible Markup Language
     • DTD: Document Type Definition
     • element: a start and end tag, along with their contents
     • attribute: a value given in the start tag of an element
     • entity: a representation of a particular character or string
     • PI: a Processing Instruction, to possibly be used by a
       program that processes this XML
     • namespace: a unique string that references a DTD
     • well-formed XML: XML that follows the basic syntax rules
     • valid XML: well-formed XML that conforms to a DTD
XML Schemas
• “Schemas” is a general term--DTDs are a form
  of XML schemas
  – According to the dictionary, a schema is “a
    structured framework or plan”
• DTDs and XML Schemas are all XML schema
  languages
Why XML Schemas?
• DTDs provide a very weak specification language
   – You can’t put any restrictions on text content
   – You have very little control over mixed content (text plus elements)
   – You have little control over ordering of elements
• DTDs are written in a strange (non-XML) format
   – You need separate parsers for DTDs and XML
• The XML Schema Definition language solves these problems
   – XSD gives you much more control over structure and content
   – XSD is written in XML
Why not XML schemas?
• DTDs have been around longer than XSD
  – Therefore they are more widely used
  – Also, more tools support them
• XSD is very verbose, even by XML standards
• More advanced XML Schema instructions can
  be non-intuitive and confusing

• Nevertheless, XSD is not likely to go away
  quickly
Referring to a schema
• To refer to a DTD in an XML document, the reference goes before the root
  element:
    – <?xml version="1.0"?>
      <!DOCTYPE rootElement SYSTEM "url">
      <rootElement> ... </rootElement>
• To refer to an XML Schema in an XML document, the reference goes in the
  root element:
    – <?xml version="1.0"?>
      <rootElement
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                (The XML Schema Instance reference is required)
         xsi:noNamespaceSchemaLocation="url.xsd">
                (This is where your XML Schema definition can be found)
        ...
      </rootElement>
<schema>
• The <schema> element may have attributes:
  – xmlns:xs="http://www.w3.org/2001/XMLSchema"
     • This is necessary to specify where all our XSD tags are
       defined
  – elementFormDefault="qualified"
     • This means that all XML elements must be qualified
“Simple” and “complex” elements
• A “simple” element is one that contains text and
  nothing else
   –   A simple element cannot have attributes
   –   A simple element cannot contain other elements
   –   A simple element cannot be empty
   –   However, the text can be of many different types, and
       may have various restrictions applied to it
• If an element isn’t simple, it’s “complex”
   – A complex element may have attributes
   – A complex element may be empty, or it may contain
     text, other elements, or both text and other elements
Defining a simple element
• A simple element is defined as
    <xs:element name="name" type="type" />
  where:
  – name is the name of the element
  – the most common values for type are
       xs:boolean       xs:integer
       xs:date          xs:string
       xs:decimal       xs:time
• Other attributes a simple element may have:
  – default="default value" if no other value is
    specified
  – fixed="value"         no other value may be
    specified
Defining an attribute
• Attributes themselves are always declared as simple types
• An attribute is defined as
    <xs:attribute name="name" type="type" />
  where:
   – name and type are the same as for xs:element
• Other attributes an xs:attribute declaration may have:
   –   default="default value" if no other value is specified
   –   fixed="value"              no other value may be specified
   –   use="optional"        the attribute is not required (default)
   –   use="required"       the attribute must be present
Restrictions, or “facets”
• The general form for putting a restriction
  on a text value is:
  – <xs:element name="name">                (or xs:attribute)
       <xs:simpleType>
          <xs:restriction base="type">
              ... the restrictions ...
          </xs:restriction>
       </xs:simpleType>
    </xs:element>
• For example:
  – <xs:element name="age">
       <xs:simpleType>
          <xs:restriction base="xs:integer">
              <xs:minInclusive value="0"/>
              <xs:maxInclusive value="140"/>
          </xs:restriction>
       </xs:simpleType>
    </xs:element>
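
To see a restriction enforced, a document can be validated against the schema from Java. The sketch below assumes the standard javax.xml.validation API and a schema for the age element in which the restriction is wrapped in xs:simpleType, as a schema processor requires; AgeSchemaDemo and isValid are names invented for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of the original slides.
class AgeSchemaDemo {

    // Schema for a single <age> element restricted to 0..140.
    static final String XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='age'>"
        + "<xs:simpleType>"
        + "<xs:restriction base='xs:integer'>"
        + "<xs:minInclusive value='0'/>"
        + "<xs:maxInclusive value='140'/>"
        + "</xs:restriction>"
        + "</xs:simpleType>"
        + "</xs:element>"
        + "</xs:schema>";

    /** Returns true if the document is valid against the age schema. */
    static boolean isValid(String xml) {
        Schema schema;
        try {
            schema = SchemaFactory
                .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("The schema itself is broken", e);
        }
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;   // not well-formed, or a facet was violated
        } catch (IOException e) {
            throw new IllegalStateException("Could not read the document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<age>30</age>"));
        System.out.println(isValid("<age>200</age>"));  // above maxInclusive
        System.out.println(isValid("<age>old</age>"));  // not an integer
    }
}
```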
Restrictions on numbers
• minInclusive -- number must be ≥ the given value
• minExclusive -- number must be > the given value
• maxInclusive -- number must be ≤ the given value
• maxExclusive -- number must be < the given value
• totalDigits -- number must have no more than value digits in total
• fractionDigits -- number must have no more than value
  digits after the decimal point
Restrictions on strings
• length -- the string must contain exactly value characters
• minLength -- the string must contain at least value characters
• maxLength -- the string must contain no more than value characters
• pattern -- the value is a regular expression that the string must match
• whiteSpace -- not really a “restriction”--tells what to do with whitespace
    – value="preserve"    Keep all whitespace
    – value="replace"     Change all whitespace characters to spaces
    – value="collapse"    Remove leading and trailing whitespace, and replace
                         all sequences of whitespace with a single space
Enumeration
• An enumeration restricts the value to be one
  of a fixed set of values
• Example:
  – <xs:element name="season">
       <xs:simpleType>
           <xs:restriction base="xs:string">
              <xs:enumeration value="Spring"/>
              <xs:enumeration value="Summer"/>
              <xs:enumeration value="Autumn"/>
              <xs:enumeration value="Fall"/>
              <xs:enumeration value="Winter"/>
           </xs:restriction>
       </xs:simpleType>
    </xs:element>
Complex elements
• A complex element is defined as
    <xs:element name="name">
       <xs:complexType>
           ... information about the complex type...
       </xs:complexType>
    </xs:element>
• Example:
    <xs:element name="person">
       <xs:complexType>
           <xs:sequence>
              <xs:element name="firstName" type="xs:string" />
              <xs:element name="lastName" type="xs:string" />
           </xs:sequence>
       </xs:complexType>
    </xs:element>
• <xs:sequence> says that elements must occur in this order
• Remember that attributes are always simple types
Global and local definitions
• Elements declared at the “top level” of a <schema> are available for
  use throughout the schema
• Elements declared within a xs:complexType are local to that type
• Thus, in
    <xs:element name="person">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="firstName" type="xs:string" />
                <xs:element name="lastName" type="xs:string" />
            </xs:sequence>
        </xs:complexType>
    </xs:element>
  the elements firstName and lastName are only locally declared
• The order of declarations at the “top level” of a <schema> does not
  specify the order in the XML data document
Declaration and use
• So far we’ve been talking about how to
  declare types, not how to use them
• To use a type we have declared, use it as the
  value of type="..."
  – Examples:
     • <xs:element name="student" type="person"/>
     • <xs:element name="professor" type="person"/>
  – Scope is important: you cannot use a type if it is
    local to some other type
xs:sequence
• We’ve already seen an example of a complex
  type whose elements must occur in a specific
  order:
• <xs:element name="person">
     <xs:complexType>
        <xs:sequence>
           <xs:element name="firstName" type="xs:string" />
           <xs:element name="lastName" type="xs:string" />
        </xs:sequence>
     </xs:complexType>
   </xs:element>
xs:all
• xs:all allows elements to appear in any order
•   <xs:element name="person">
       <xs:complexType>
          <xs:all>
             <xs:element name="firstName" type="xs:string" />
             <xs:element name="lastName" type="xs:string" />
          </xs:all>
       </xs:complexType>
     </xs:element>
• Despite the name, the members of an xs:all group can occur
  once or not at all
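
The difference between xs:sequence and xs:all shows up when the same document is validated against both. This is a sketch only, assuming the javax.xml.validation API; GroupOrderDemo, personSchema, and isValid are hypothetical names, and the schema is the person example with the grouping element swapped.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of the original slides.
class GroupOrderDemo {

    /** Builds the person schema using the given group: "sequence" or "all". */
    static String personSchema(String group) {
        return "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
             + "<xs:element name='person'><xs:complexType>"
             + "<xs:" + group + ">"
             + "<xs:element name='firstName' type='xs:string'/>"
             + "<xs:element name='lastName' type='xs:string'/>"
             + "</xs:" + group + ">"
             + "</xs:complexType></xs:element></xs:schema>";
    }

    /** Returns true if the document is valid against the schema text. */
    static boolean isValid(String xsd, String xml) {
        Schema schema;
        try {
            schema = SchemaFactory
                .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(xsd)));
        } catch (SAXException e) {
            throw new IllegalStateException("The schema itself is broken", e);
        }
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException("Could not read the document", e);
        }
    }

    public static void main(String[] args) {
        // lastName comes first, which is the reverse of the declared order.
        String reversed = "<person><lastName>Matuszek</lastName>"
                        + "<firstName>David</firstName></person>";
        System.out.println(isValid(personSchema("sequence"), reversed));
        System.out.println(isValid(personSchema("all"), reversed));
    }
}
```

Making a member of the xs:all group optional would additionally need minOccurs="0" on that element; without it, each member is expected exactly once.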
Referencing
• Once you have defined an element or
  attribute (with name="..."), you can refer to it
  with ref="..."
• Example:
  – <xs:element name="person">
       <xs:complexType>
           <xs:all>
               <xs:element name="firstName" type="xs:string" />
               <xs:element name="lastName" type="xs:string" />
           </xs:all>
       </xs:complexType>
     </xs:element>
  – <xs:element ref="person"/>
  – Note that name and ref cannot be combined in the same declaration
Text element with attributes
• If a text element has attributes, it is no longer
  a simple type
  – <xs:element name="population">
       <xs:complexType>
           <xs:simpleContent>
              <xs:extension base="xs:integer">
                  <xs:attribute name="year" type="xs:integer"/>
              </xs:extension>
           </xs:simpleContent>
       </xs:complexType>
    </xs:element>
Empty elements
• Empty elements are (ridiculously) complex

• <xs:complexType name="counter">
     <xs:complexContent>
        <xs:restriction base="xs:anyType">
           <xs:attribute name="count" type="xs:integer"/>
        </xs:restriction>
     </xs:complexContent>
  </xs:complexType>
Mixed elements
• Mixed elements may contain both text and
  elements
• We add mixed="true" to the xs:complexType element
• The text itself is not mentioned in the element,
  and may go anywhere (it is basically ignored)

•   <xs:complexType name="paragraph" mixed="true">
       <xs:sequence>
          <xs:element name="someName" type="xs:anyType"/>
      </xs:sequence>
    </xs:complexType>
Extensions
• You can base a complex type on another
  complex type
• <xs:complexType name="newType">
     <xs:complexContent>
        <xs:extension base="otherType">
           ...new stuff...
        </xs:extension>
     </xs:complexContent>
  </xs:complexType>
Predefined string types
• Recall that a simple element is defined as:
    <xs:element name="name" type="type" />
• Here are a few of the possible string types:
   – xs:string -- a string
   – xs:normalizedString -- a string that doesn’t
     contain tabs, newlines, or carriage returns
   – xs:token -- a string that doesn’t contain any
     whitespace other than single spaces
• Allowable restrictions on strings:
   – enumeration, length, maxLength, minLength,
     pattern, whiteSpace
Predefined date and time types
• xs:date -- A date in the format CCYY-MM-DD,
  for example, 2002-11-05
• xs:time -- A time in the format hh:mm:ss
  (hours, minutes, seconds)
• xs:dateTime -- Format is CCYY-MM-DDThh:mm:ss
• Allowable restrictions on dates and times:
  – enumeration, minInclusive, minExclusive, maxInclusive,
    maxExclusive, pattern, whiteSpace
Predefined numeric types
• Here are some of the predefined numeric types:
 xs:decimal                    xs:positiveInteger
 xs:byte                       xs:negativeInteger
 xs:short                      xs:nonPositiveInteger
 xs:int                        xs:nonNegativeInteger
 xs:long

• Allowable restrictions on numeric types:
   – enumeration, minInclusive, minExclusive, maxInclusive,
     maxExclusive, fractionDigits, totalDigits, pattern,
     whiteSpace
DOM




      31-Oct-12
SAX and DOM
• SAX and DOM are standards for XML parsers--
  program APIs to read and interpret XML files
  – DOM is a W3C standard
  – SAX is an ad-hoc (but very popular) standard
• There are various implementations available
• Java implementations are provided in JAXP
  (Java API for XML Processing)
• Unlike many XML technologies, SAX and DOM
  are relatively easy
Difference between SAX and DOM
• DOM reads the entire XML document into memory and
  stores it as a tree data structure
• SAX reads the XML document and sends an event for each
  element that it encounters
• Consequences:
   – DOM provides “random access” into the XML document
   – SAX provides only sequential access to the XML document
   – DOM is comparatively slow and memory-hungry, so it is a poor fit
     for very large XML documents
   – SAX is fast and requires very little memory, so it can be used for
     huge documents (or large numbers of documents)
       • This makes SAX much more popular for web sites
   – Some DOM implementations have methods for changing the XML
     document in memory; SAX implementations do not
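
For contrast with the DOM programs that follow, here is a minimal SAX sketch that counts elements as the parser reports them, without ever building a tree. It assumes the JAXP SAXParser and DefaultHandler classes; SaxCountDemo and countElements are names invented for this example.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of the original slides.
class SaxCountDemo {

    /** Counts how many times each tag name starts in the document. */
    static Map<String, Integer> countElements(String xml) {
        final Map<String, Integer> counts = new LinkedHashMap<>();
        // The parser calls startElement once per start tag, in document order.
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                counts.merge(qName, 1, Integer::sum);
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the document", e);
        }
        return counts;
    }

    public static void main(String[] args) {
        String xml = "<novel>"
                   + "<chapter num=\"1\">The Beginning</chapter>"
                   + "<chapter num=\"2\">The Middle</chapter>"
                   + "<chapter num=\"3\">The End</chapter>"
                   + "</novel>";
        System.out.println(countElements(xml));
    }
}
```

The handler keeps only the running counts, so memory use does not grow with the size of the document.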
Simple DOM program, I

• import javax.xml.parsers.*;
  import org.w3c.dom.*;
• public class SecondDom {
    public static void main(String args[]) {
       try {
          ...Main part of program goes here...
       } catch (Exception e) {
          e.printStackTrace(System.out);
       }
    }
  }
Simple DOM program, II
• First we need to create a DOM parser, called a
  “DocumentBuilder”
• The parser is created, not by a constructor, but by
  calling a static factory method
   – This is a common technique in advanced Java
     programming
   – The use of a factory method makes it easier if you
     later switch to a different parser
  DocumentBuilderFactory factory =
    DocumentBuilderFactory.newInstance();
  DocumentBuilder builder =
    factory.newDocumentBuilder();
Simple DOM program, III
• The next step is to load in the XML file
• Here is the XML file, named hello.xml:
     <?xml version="1.0"?>
     <display>Hello World!</display>
• To read this file in, we add the following line to our
  program:
     Document document = builder.parse("hello.xml");
• Notes:
   – document contains the entire XML file (as a tree); it is the
     Document Object Model
   – If you run this from the command line, your XML file should be in
     the same directory as your program
   – An IDE may look in a different directory for your file; if you get a
     java.io.FileNotFoundException, this is probably why
Simple DOM program, IV
• The following code finds the content of the root
  element and prints it:
  Element root = document.getDocumentElement();
  Node textNode = root.getFirstChild();
  System.out.println(textNode.getNodeValue());

• This code should be mostly self-explanatory; we’ll
  get into the details shortly

• The output of the program is: Hello World!
Reading in the tree
• The parse method reads in the entire XML
  document and represents it as a tree in memory
  – For a large document, parsing could take a while
  – If you want to interact with your program while it is
    parsing, you need to parse in a separate thread
     • Once parsing starts, you cannot interrupt or stop it
     • Do not try to access the parse tree until parsing is done

• An XML parse tree may require up to ten times
  as much memory as the original XML document
  – If you have a lot of tree manipulation to do, DOM is
    much more convenient than SAX
  – If you don’t have a lot of tree manipulation to do,
    consider using SAX instead
Structure of the DOM tree
• The DOM tree is composed of Node objects
• Node is an interface
  – Some of the more important subinterfaces are
    Element, Attr, and Text
     • An Element node may have children
     • Attr and Text nodes are leaves
  – Additional types are Document,
    ProcessingInstruction, Comment, Entity,
    CDATASection and several others
• Hence, the DOM tree is composed entirely of
  Node objects, but the Node objects can be
  downcast into more specific types as needed
Operations on Nodes, I
• The results returned by getNodeName(), getNodeValue(),
  getNodeType() and getAttributes() depend on the
  subtype of the node, as follows:

                      Element        Text             Attr
  getNodeName()       tag name       "#text"          name of attribute
  getNodeValue()      null           text contents    value of attribute
  getNodeType()       ELEMENT_NODE   TEXT_NODE        ATTRIBUTE_NODE
  getAttributes()     NamedNodeMap   null             null
Distinguishing Node types
• Here’s an easy way to tell what kind of a node you are dealing with:
      switch(node.getNodeType()) {
          case Node.ELEMENT_NODE:
               Element element = (Element)node;
               ...;
               break;
          case Node.TEXT_NODE:
               Text text = (Text)node;
               ...
               break;
          case Node.ATTRIBUTE_NODE:
               Attr attr = (Attr)node;
               ...
               break;
          default: ...
      }
Operations on Nodes, II
• Tree-walking operations that return a Node:
   –   getParentNode()
   –   getFirstChild()
   –   getNextSibling()
   –   getPreviousSibling()
   –   getLastChild()

• Tests that return a   boolean:
   – hasAttributes()
   – hasChildNodes()
Operations for Elements
• String getTagName()
   – Returns the name of the tag
• boolean hasAttribute(String name)
   – Returns true if this Element has the named attribute
• String getAttribute(String name)
   – Returns the (String) value of the named attribute
• boolean hasAttributes()
   – Returns true if this Element has any attributes
   – This method is actually inherited from Node
       • Returns false if it is applied to a Node that isn’t an Element
• NamedNodeMap getAttributes()
   – Returns a NamedNodeMap of all the Element’s attributes
   – This method is actually inherited from Node
       • Returns null if it is applied to a Node that isn’t an Element
NamedNodeMap
• The node.getAttributes() operation returns a
  NamedNodeMap
   – Because NamedNodeMaps are used for other kinds of nodes
     (elsewhere in Java), the contents are treated as general Nodes, not
     specifically as Attrs
• Some operations on a NamedNodeMap are:
   – getNamedItem(String name) returns (as a Node) the attribute with
     the given name
   – getLength() returns (as an int) the number of Nodes in this
     NamedNodeMap
   – item(int index) returns (as a Node) the indexth item
       • This operation lets you conveniently step through all the nodes in the
         NamedNodeMap
       • Java does not guarantee the order in which nodes are returned
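
Stepping through a NamedNodeMap with getLength() and item() might look like the sketch below. AttributeListDemo and attributesOfRoot are hypothetical names; because the order of the nodes is not guaranteed, the example copies them into a sorted map before printing.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the original slides.
class AttributeListDemo {

    /** Returns the root element's attributes as a name-sorted map. */
    static Map<String, String> attributesOfRoot(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
            NamedNodeMap attrs = root.getAttributes();
            Map<String, String> result = new TreeMap<>();
            for (int i = 0; i < attrs.getLength(); i++) {
                Node attr = attrs.item(i);   // each item is an Attr, seen as a Node
                result.put(attr.getNodeName(), attr.getNodeValue());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(attributesOfRoot(
            "<font face=\"Verdana\" size=\"+1\" color=\"red\"/>"));
    }
}
```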
Operations on Texts
• Text is a subinterface of CharacterData and
  inherits the following operations (among others):
  – public String getData() throws DOMException
     • Returns the text contents of this Text node
  – public int getLength()
     • Returns the number of Unicode characters in the text
  – public String substringData(int offset, int count)
                    throws DOMException
     • Returns a substring of the text contents
Operations on Attrs
• String getName()
  – Returns the name of this attribute.
• Element getOwnerElement()
  – Returns the Element node this attribute is attached
    to, or null if this attribute is not in use
• boolean getSpecified()
  – Returns true if this attribute was explicitly given a
    value in the original document
• String getValue()
  – Returns the value of the attribute as a String
Preorder traversal
• The DOM is stored in memory as a tree
• An easy way to traverse a tree is in preorder
• You should remember how to do this from
  your course in Data Structures
• The general form of a preorder traversal is:
  – Visit the root
  – Traverse each subtree, in order
Preorder traversal in Java
•   static void simplePreorderPrint(String indent, Node node) {
         printNode(indent, node);
         if(node.hasChildNodes()) {
            Node child = node.getFirstChild();
            while (child != null) {
               simplePreorderPrint(indent + " ", child);
               child = child.getNextSibling();
            }
         }
      }
•   static void printNode(String indent, Node node) {
         System.out.print(indent);
         System.out.print(node.getNodeType() + " ");
         System.out.print(node.getNodeName() + " ");
         System.out.print(node.getNodeValue() + " ");
         System.out.println(node.getAttributes());
      }
Trying out the program



  Input:

  <?xml version="1.0"?>
  <novel>
   <chapter num="1">The Beginning</chapter>
   <chapter num="2">The Middle</chapter>
   <chapter num="3">The End</chapter>
  </novel>
A DOM XML parser reads the XML file below and prints out
           each element one by one.
file.xml
<?xml version="1.0"?>
<company>
    <staff>
        <firstname>yong</firstname>
        <lastname>mook kim</lastname>
        <nickname>mkyong</nickname>
        <salary>100000</salary>
    </staff>
    <staff>
        <firstname>low</firstname>
        <lastname>yin fong</lastname>
        <nickname>fong fong</nickname>
        <salary>200000</salary>
    </staff>
</company>
import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class ReadXMLFile {
    public static void main(String argv[]) {
        try {
            // Path as given on the slide; adjust it to where file.xml is stored
            File fXmlFile = new File("c:file.xml");
            DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
            DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
            Document doc = dBuilder.parse(fXmlFile);
            // Merge adjacent text nodes so each element's text is a single node
            doc.getDocumentElement().normalize();
            System.out.println("Root element :" + doc.getDocumentElement().getNodeName());
            NodeList nList = doc.getElementsByTagName("staff");
            System.out.println("-----------------------");
            for (int temp = 0; temp < nList.getLength(); temp++) {
                Node nNode = nList.item(temp);
                if (nNode.getNodeType() == Node.ELEMENT_NODE) {
                    Element eElement = (Element) nNode;
                    System.out.println("First Name : " + getTagValue("firstname", eElement));
                    System.out.println("Last Name : " + getTagValue("lastname", eElement));
                    System.out.println("Nick Name : " + getTagValue("nickname", eElement));
                    System.out.println("Salary : " + getTagValue("salary", eElement));
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Returns the text of the first child element named sTag
    private static String getTagValue(String sTag, Element eElement) {
        NodeList nlList = eElement.getElementsByTagName(sTag).item(0).getChildNodes();
        Node nValue = nlList.item(0);
        return nValue.getNodeValue();
    }
}
SAX


      A parser for XML Documents
XML Parsers
• Two types of parser
  – SAX (Simple API for XML)
     • Event driven API
     • Sends events to the application as the document is read
  – DOM (Document Object Model)
     • Reads the entire document into memory in a tree
       structure
SAX Parser
• When should I use it?
  – Large documents
  – Memory constrained devices
• When should I use something else?
  – If you need to modify the document
  – SAX doesn’t remember previous events unless you
    write explicit code to do so.
SAX Parser
• Which languages are supported?
  – Java
  – Perl
  – C++
  – Python
Difference between SAX and DOM
• DOM reads the entire XML document into memory and
  stores it as a tree data structure
• SAX reads the XML document and calls one of your methods
  for each element or block of text that it encounters
• Consequences:
   – DOM provides “random access” into the XML document
   – SAX provides only sequential access to the XML document
   – DOM is slower and needs memory in proportion to the size of the document,
     so it is impractical for very large XML documents
   – SAX is fast and requires very little memory, so it can be used for huge
     documents (or large numbers of documents)
       • This makes SAX much more popular for web sites
   – The DOM API has methods for changing the XML document in memory;
     SAX has no such methods
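To make the contrast concrete, here is a small sketch that counts the elements of a document both ways. The class CountElements and its two methods are invented for this illustration. The DOM version builds the whole tree and then queries it; the SAX version keeps only a counter while the document streams past.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of the original slides
class CountElements {
    // DOM: the entire document is held in memory as a tree, then queried
    static int countWithDom(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagName("*").getLength(); // "*" matches every element
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // SAX: nothing is stored except the running count
    static int countWithSax(String xml) {
        final int[] count = {0};
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new InputSource(new StringReader(xml)),
                    new DefaultHandler() {
                        @Override
                        public void startElement(String uri, String localName,
                                                 String qName, Attributes atts) {
                            count[0]++; // called once per start tag
                        }
                    });
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        String xml = "<novel><chapter/><chapter/></novel>";
        System.out.println(countWithDom(xml) + " " + countWithSax(xml));
    }
}
```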
Callbacks
• SAX works through callbacks: you call the parser,
  it calls methods that you supply
  [Diagram: in your program, main(...) calls the SAX parser's parse(...);
   the parser then calls back your methods startDocument(...),
   startElement(...), characters(...), endElement(...), and endDocument(...)]
Simple SAX program
• The following program is adapted from CodeNotes® for XML
  by Gregory Brill, pages 158-159
• The program consists of two classes:
   – Sample -- This class contains the main method; it
       •   Gets a factory to make parsers
       •   Gets a parser from the factory
       •   Creates a Handler object to handle callbacks from the parser
       •   Tells the parser which handler to send its callbacks to
       •   Reads and parses the input XML file
   – Handler -- This class contains handlers for three kinds of callbacks:
       • startElement callbacks, generated when a start tag is seen
       • endElement callbacks, generated when an end tag is seen
       • characters callbacks, generated for the contents of an element
The Sample class, I
• import javax.xml.parsers.*; // for both SAX and DOM
  import org.xml.sax.*;
  import org.xml.sax.helpers.*;

• // For simplicity, we let the operating system handle exceptions
  // In "real life" this is poor programming practice
  public class Sample {
    public static void main(String args[]) throws Exception {
•      // Create a parser factory
       SAXParserFactory factory = SAXParserFactory.newInstance();
•      // Tell factory that the parser must understand namespaces
       factory.setNamespaceAware(true);
•      // Make the parser
        SAXParser saxParser = factory.newSAXParser();
        XMLReader parser = saxParser.getXMLReader();
The Sample class, II
• In the previous slide we made a parser, of type XMLReader

•       // Create a handler
        Handler handler = new Handler();
•       // Tell the parser to use this handler
        parser.setContentHandler(handler);
•       // Finally, read and parse the document
        parser.parse("hello.xml");
•      } // end of main
     } // end of Sample class

• You will need to put the file hello.xml :
    – In the same directory, if you run the program from the command line
    – Or where it can be found by the particular IDE you are using
The Handler class, I
• public class Handler extends DefaultHandler {
   – DefaultHandler is an adapter class that defines these methods and
     others as do-nothing methods, to be overridden as desired
   – We will define three very similar methods to handle (1) start tags, (2)
     contents, and (3) end tags--our methods will just print a line
   – Each of these three methods could throw a SAXException

•     // SAX calls this method when it encounters a start tag
      public void startElement(String namespaceURI,
                               String localName,
                               String qualifiedName,
                               Attributes attributes)
            throws SAXException {
        System.out.println("startElement: " + qualifiedName);
      }
The Handler class, II
•     // SAX calls this method to pass in character data
      public void characters(char ch[], int start, int length)
            throws SAXException {
         System.out.println("characters: \"" +
                            new String(ch, start, length) + "\"");
      }

•     // SAX calls this method when it encounters an end tag
      public void endElement(String namespaceURI,
                              String localName,
                              String qualifiedName)
            throws SAXException {
         System.out.println("Element: /" + qualifiedName);
      }
    } // End of Handler class
Results
• If the file hello.xml contains:

     <?xml version="1.0"?>
     <display>Hello World!</display>

• Then the output from running java Sample
  will be:

     startElement: display
     characters: "Hello World!"
     Element: /display
More results
• Now suppose the file hello.xml contains:
   – <?xml version="1.0"?>
     <display>
        <i>Hello</i> World!
     </display>
• Notice that the root element, <display>, now contains a nested
  element <i> and some whitespace (including newlines)
• The result will be:
     startElement: display
     characters: "" // empty string
     characters: "
     " // newline
     characters: "     " // spaces
     startElement: i
     characters: "Hello"
     Element: /i
     characters: "World!"
     characters: "
     " // another newline
     Element: /display
Parser factories
• A factory is an alternative to constructors
• To create a SAX parser factory, call this method:
  SAXParserFactory.newInstance()
   – This returns an object of type SAXParserFactory
   – It may throw a FactoryConfigurationError
• You can then say what kind of parser you want:
   – public void setNamespaceAware(boolean awareness)
       • Call this with true if you are using namespaces
       • The default (if you don’t call this method) is false
   – public void setValidating(boolean validating)
       • Call this with true if you want to validate against a DTD
       • The default (if you don’t call this method) is false
       • Validation will give an error if you don’t have a DTD
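A brief sketch of configuring the factory and then asking the resulting parser how it was set up. The class FactoryConfig and the method describeParser are hypothetical names; isNamespaceAware() and isValidating() are standard SAXParser methods.

```java
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

// Hypothetical helper class, not part of the original slides
class FactoryConfig {
    // Builds a parser with the requested settings and reports what it actually got
    static String describeParser(boolean namespaceAware, boolean validating) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(namespaceAware); // default is false
            factory.setValidating(validating);         // default is false
            SAXParser parser = factory.newSAXParser();
            return "namespaceAware=" + parser.isNamespaceAware()
                    + " validating=" + parser.isValidating();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describeParser(true, false));
    }
}
```

Settings must be made on the factory before newSAXParser() is called; they apply to the parsers created afterwards.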
Getting a parser
• Once you have a SAXParserFactory set up (say it’s named
  factory), you can create a parser with:
    SAXParser saxParser = factory.newSAXParser();
   XMLReader parser = saxParser.getXMLReader();
Declaring which handler to use
• Since the SAX parser will be calling our methods, we need to
  supply these methods
• In the example these are in a separate class, Handler
• We need to tell the parser where to find the methods:
     Handler handler = new Handler();
     parser.setContentHandler(handler);
• These statements could be combined:
     parser.setContentHandler(new Handler());
• Finally, we call the parser and tell it what file to parse:
      parser.parse("hello.xml");
• Everything else will be done in the handler methods
SAX handlers
• SAX defines four handler interfaces; a full callback handler implements all of them:
    – interface ContentHandler
        • This is the most important interface--it handles basic parsing callbacks, such as
          element starts and ends
    – interface DTDHandler
        • Handles only notation and unparsed entity declarations
    – interface EntityResolver
        • Does customized handling for external entities
    – interface ErrorHandler
        • Must be implemented or parsing errors will be ignored!
• You could implement all these interfaces yourself, but that’s a lot of
  work--it’s easier to use an adapter class
Class DefaultHandler
• DefaultHandler is in package
  org.xml.sax.helpers
• DefaultHandler implements ContentHandler,
  DTDHandler, EntityResolver, and ErrorHandler
• DefaultHandler is an adapter class--it provides
  empty methods for every method declared in
  each of the four interfaces
  – Empty methods don’t do anything
• To use this class, extend it and override the
  methods that are important to your application
  – We will cover some of the methods in the
    ContentHandler and ErrorHandler interfaces
ContentHandler methods, I
• public void setDocumentLocator(Locator loc)
   – This method is called once, when parsing first starts
   – The Locator contains either a URL or a URN, or both, that specifies
     where the document is located
   – You may need this information if you need to find a document whose
     position is specified relative to this XML document
   – Locator methods include:
       • public String getPublicId() returns the public identifier for the current
         document
       • public String getSystemId() returns the system identifier for the current
         document
   – Every ContentHandler method except this one may throw a
     SAXException
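One common use of the Locator is to remember it and ask it for positions during later callbacks. The sketch below records the line on which each start tag is reported; the class LocatorDemo and the method linesOf are invented for the example, and it assumes element names are unique in the input. Line numbers from getLineNumber() are approximate by specification, and a parser is not strictly required to supply a Locator, hence the null check.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, not part of the original slides
class LocatorDemo extends DefaultHandler {
    private Locator locator; // supplied by the parser; may stay null
    final Map<String, Integer> startLines = new LinkedHashMap<>();

    @Override
    public void setDocumentLocator(Locator loc) {
        this.locator = loc; // keep it for use in later callbacks
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        int line = (locator == null) ? -1 : locator.getLineNumber();
        startLines.put(qName, line);
    }

    // Parses xml and returns element name -> reported line number
    static Map<String, Integer> linesOf(String xml) {
        LocatorDemo handler = new LocatorDemo();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return handler.startLines;
    }

    public static void main(String[] args) {
        System.out.println(linesOf("<a>\n<b/>\n</a>"));
    }
}
```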
ContentHandler methods, II
• public void processingInstruction(String target,
                                        String data)
                       throws SAXException
• This method is called once for each processing instruction (PI)
  that is encountered
• The PI is presented as two strings: <?target data?>
• According to XML rules, PIs may occur anywhere in the
  document after the initial <?xml ...?> line
   – This means calls to processingInstruction do not
     necessarily occur before startElement is called
     with the document root--they may occur later
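The ordering point can be seen by recording events as they arrive. In this sketch (PiRecorder and record are hypothetical names) one PI appears before the root element and one inside it; the initial <?xml ...?> declaration is not reported as a processing instruction.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, not part of the original slides
class PiRecorder extends DefaultHandler {
    final List<String> events = new ArrayList<>();

    @Override
    public void processingInstruction(String target, String data) {
        events.add("pi:" + target + "=" + data);
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        events.add("start:" + qName);
    }

    // Parses xml and returns the recorded events in the order they occurred
    static List<String> record(String xml) {
        PiRecorder handler = new PiRecorder();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return handler.events;
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?>"
                + "<?xml-stylesheet type=\"text/css\" href=\"ss.css\"?>"
                + "<doc><?later note?></doc>";
        record(xml).forEach(System.out::println);
    }
}
```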
ContentHandler methods, III
• public void startDocument()
               throws SAXException
   – This is called just once, at the beginning of parsing

• public void endDocument()
              throws SAXException
   – This is called just once, and is the last method called by the parser

• Remember: when you override a method, you can throw fewer kinds
  of exceptions, but you can’t throw any new kinds
   – In other words: your methods don’t have to throw a
      SAXException
   – But if they must throw an exception, it can only be a
      SAXException
ContentHandler methods, IV
• public void startElement(String namespaceURI,
                               String localName,
                               String qualifiedName,
                               Attributes atts)
       throws SAXException
• This method is called at the beginning of every element
• If the parser is namespace-aware,
   – namespaceURI will hold the namespace URI bound to the element’s prefix
     (not the prefix itself), or the empty string if there is none
   – localName will hold the element name (without a prefix)
   – qualifiedName is optional and may be the empty string, although many
     parsers still supply the prefixed name
• If the parser is not using namespaces,
   – namespaceURI and localName will be empty strings
   – qualifiedName will hold the element name (possibly with prefix)
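A minimal sketch for comparing the two modes: it captures the three name parameters from the first startElement call. NamespaceDemo and firstStartElement are made-up names. In namespace-aware mode the first parameter carries the namespace URI that the prefix is bound to; whether qualifiedName is also filled in depends on the parser.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of the original slides
class NamespaceDemo {
    // Returns {namespaceURI, localName, qualifiedName} of the first start tag
    static String[] firstStartElement(String xml, boolean namespaceAware) {
        final String[] seen = new String[3];
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(namespaceAware);
            factory.newSAXParser().parse(
                    new InputSource(new StringReader(xml)),
                    new DefaultHandler() {
                        @Override
                        public void startElement(String uri, String localName,
                                                 String qName, Attributes atts) {
                            if (seen[0] == null) { // record the root element only
                                seen[0] = uri;
                                seen[1] = localName;
                                seen[2] = qName;
                            }
                        }
                    });
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        String xml = "<dave:book xmlns:dave=\"http://www.matuszek.org/ns\"/>";
        System.out.println(String.join(" | ", firstStartElement(xml, true)));
        System.out.println(String.join(" | ", firstStartElement(xml, false)));
    }
}
```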
Attributes, I
• When SAX calls startElement, it passes in a parameter of
  type Attributes
• Attributes is an interface that defines a number of useful
  methods; here are a few of them:
   –   getLength() returns the number of attributes
   –   getLocalName(index) returns the attribute’s local name
   –   getQName(index) returns the attribute’s qualified name
   –   getValue(index) returns the attribute’s value
   –   getType(index) returns the attribute’s type, which will be one of
       the Strings "CDATA", "ID", "IDREF", "IDREFS", "NMTOKEN",
       "NMTOKENS", "ENTITY", "ENTITIES", or "NOTATION"
• As with elements, if the local name is the empty string, then
  the attribute’s name is in the qualified name
Attributes, II
• SAX does not guarantee that the attributes will be returned in
  the same order they are written
   – After all, the order is irrelevant in XML
• The following methods look up attributes by name rather than
  by index:
   –   public int getIndex(String qualifiedName)
   –   public int getIndex(String uri, String localName)
   –   public String getValue(String qualifiedName)
   –   public String getValue(String uri, String localName)
• An Attributes object is valid only during the call to
  startElement
   – If you need to remember attributes longer, use:
     AttributesImpl attrImpl = new AttributesImpl(attributes);
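The following sketch walks the attributes by index inside startElement and keeps a copy for use after the callback has returned. The class AttributeKeeper and its fields are hypothetical; the copy is made with the AttributesImpl constructor shown above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, not part of the original slides
class AttributeKeeper extends DefaultHandler {
    final List<String> pairs = new ArrayList<>(); // "name=value" for each attribute seen
    Attributes saved;                             // copy that outlives the callback

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes attributes) {
        for (int i = 0; i < attributes.getLength(); i++) {
            pairs.add(attributes.getQName(i) + "=" + attributes.getValue(i));
        }
        if (saved == null) {
            // Snapshot of the first element's attributes; the parser may
            // reuse the original object once startElement returns
            saved = new AttributesImpl(attributes);
        }
    }

    static AttributeKeeper parse(String xml) {
        AttributeKeeper handler = new AttributeKeeper();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return handler;
    }

    public static void main(String[] args) {
        AttributeKeeper k = parse("<high scale=\"F\">103</high>");
        System.out.println(k.pairs + " saved scale=" + k.saved.getValue("scale"));
    }
}
```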
ContentHandler methods, V
• public void endElement(String namespaceURI,
                           String localName,
                           String qualifiedName)
     throws SAXException

• The parameters to endElement are the same as those to
  startElement, except that the Attributes parameter is
  omitted
ContentHandler methods, VI
• public void characters(char[] ch,
                          int start,
                          int length) throws SAXException

• ch is an array of characters
   – Only length characters, starting from ch[start], are
     the contents of the element
• The String constructor new String(ch, start, length) is an
  easy way to extract the relevant characters from the char array
• characters may be called multiple times for one element
   – Newlines and entities may break the data characters into separate calls
   – characters may be called with length = 0
   – All data characters of the element will eventually be given to characters
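Because one element's text may arrive in several characters calls, a common pattern is to append every chunk to a buffer and read the buffer in endElement. The sketch below (TextCollector and collect are invented names) assumes elements that contain only text, not nested elements.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, not part of the original slides
class TextCollector extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    final List<String> texts = new ArrayList<>(); // "element:text" entries

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        buffer.setLength(0); // start a fresh buffer for this element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length); // may be called several times
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        texts.add(qName + ":" + buffer); // the text is complete only here
        buffer.setLength(0);
    }

    static List<String> collect(String xml) {
        TextCollector handler = new TextCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return handler.texts;
    }

    public static void main(String[] args) {
        System.out.println(collect("<display>Hello &amp; goodbye</display>"));
    }
}
```

However the parser splits the text around the entity, the buffer ends up holding Hello & goodbye.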
Example
• If hello.xml contains:
  – <?xml version="1.0"?>
    <display>
        Hello World!
    </display>
• Then the sample program we started with gives:
  – startElement: display
    characters:                 <-- zero length string
    characters:                 <-- LF character (ASCII 10)
    characters:   Hello World! <-- spaces are preserved
    characters:                <-- LF character (ASCII 10)
    Element: /display
Whitespace
• Whitespace is a major nuisance
  – Whitespace is characters; characters are PCDATA
  – If you are validating, the parser will ignore whitespace
    where PCDATA is not allowed by the DTD
  – If you are not validating, the parser cannot ignore
    whitespace
  – If you ignore whitespace, you lose your indentation
• To ignore whitespace when validating:
  – Happens automatically
• To ignore whitespace when not validating:
  – Use the String function trim() to remove whitespace
  – Check the result to see if it is the empty string
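The trim() approach for a nonvalidating parser might look like the sketch below, where WhitespaceFilter and nonBlankChunks are hypothetical names. Note one limitation: trimming each chunk separately also removes spaces at chunk boundaries that may have been meaningful, such as the space between an inline element and the text that follows it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, not part of the original slides
class WhitespaceFilter extends DefaultHandler {
    final List<String> chunks = new ArrayList<>();

    @Override
    public void characters(char[] ch, int start, int length) {
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) { // skip chunks that were only whitespace
            chunks.add(text);
        }
    }

    static List<String> nonBlankChunks(String xml) {
        WhitespaceFilter handler = new WhitespaceFilter();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return handler.chunks;
    }

    public static void main(String[] args) {
        String xml = "<display>\n   <i>Hello</i> World!\n</display>";
        System.out.println(nonBlankChunks(xml));
    }
}
```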
Handling ignorable whitespace
• A nonvalidating parser cannot ignore whitespace, because it
  cannot distinguish it from real data
• A validating parser can, and does, ignore whitespace where
  character data is not allowed
   – For processing XML, this is usually what you want
   – However, if you are manipulating and writing out XML, discarding
     whitespace ruins your indentation
   – To capture ignorable whitespace, you can override this method
     (defined in DefaultHandler):
        public void ignorableWhitespace(char[] ch,
                                              int start,
                                              int length)
                       throws SAXException
       • Parameters are the same as those for characters
Error Handling, I
• SAX error handling is unusual
• Most errors are ignored unless you register
  an error handler
  (org.xml.sax.ErrorHandler)
  – Ignored errors can cause bizarre behavior
  – Failing to provide an error handler is unwise
• The ErrorHandler interface declares:
  – public void fatalError (SAXParseException exception)
                throws SAXException // XML not well structured
  – public void error (SAXParseException exception)
                throws SAXException // XML validation error
  – public void warning (SAXParseException exception)
                throws SAXException // minor problem
Error Handling, II
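• If you are extending DefaultHandler, it already implements ErrorHandler
   – It is registered as the error handler automatically only if you pass it to
     SAXParser.parse(...); with an XMLReader you must also call setErrorHandler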
   – DefaultHandler’s version of fatalError() throws a SAXException, but...
   – its error() and warning() methods do nothing!
• You can (and should) override these methods
• Note that the only kind of exception your override methods can
  throw is a SAXException
   – When you override a method, you cannot add exception types
   – If you need to throw another kind of exception, say an IOException, you
     can encapsulate it in a SAXException:
       • catch (IOException ioException) {
            throw new SAXException("I/O error: ", ioException);
         }
Error Handling, III
• If you are not extending DefaultHandler:
  – Create a new class (say, MyErrorHandler) that
    implements ErrorHandler (by supplying the three
    methods fatalError, error, and warning)
  – Create a new object of this class
  – Tell your XMLReader object about it by sending it the
    following message:
     setErrorHandler(ErrorHandler handler)
• Example:
     XMLReader parser = saxParser.getXMLReader();
     parser.setErrorHandler(new MyErrorHandler());
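A possible MyErrorHandler is sketched below. The policy it applies (print warnings, rethrow errors and fatal errors) is one reasonable choice rather than the only one, and the ErrorDemo class with its parsesCleanly method is invented to show the handler in use.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

class MyErrorHandler implements ErrorHandler {
    @Override
    public void warning(SAXParseException e) {
        // Minor problem: report it and let parsing continue
        System.err.println("warning, line " + e.getLineNumber() + ": " + e.getMessage());
    }

    @Override
    public void error(SAXParseException e) throws SAXException {
        throw e; // validation error: stop parsing
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXException {
        throw e; // XML not well structured: stop parsing
    }
}

// Hypothetical driver class, not part of the original slides
class ErrorDemo {
    // Returns true if the document parses without an error being raised
    static boolean parsesCleanly(String xml) {
        try {
            XMLReader parser = SAXParserFactory.newInstance()
                    .newSAXParser().getXMLReader();
            parser.setErrorHandler(new MyErrorHandler());
            parser.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // raised by the error handler or the parser
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parsesCleanly("<a/>"));        // well structured
        System.out.println(parsesCleanly("<a><b></a>"));  // mismatched tags
    }
}
```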

Unit3wt

  • 1. XML
  • 2. HTML and XML, XML stands for eXtensible Markup Language HTML is used to mark up text so it XML is used to mark up data so it can be displayed to users can be processed by computers HTML describes both structure (e.g. XML describes only content, or <p>, <h2>, <em>) and appearance “meaning” (e.g. <br>, <font>, <i>) HTML uses a fixed, unchangeable In XML, you make up your own set of tags tags 2
  • 3. HTML and XML, II • HTML and XML look similar, because they are both SGML languages (SGML = Standard Generalized Markup Language) – Both HTML and XML use elements enclosed in tags (e.g. <body>This is an element</body>) – Both use tag attributes (e.g., <font face="Verdana" size="+1" color="red">) – Both use entities (&lt;, &gt;, &amp;, &quot;, &apos;) • More precisely, – HTML is defined in SGML – XML is a (very small) subset of SGML 3
  • 4. HTML and XML • HTML is for humans – HTML describes web pages – You don’t want to see error messages about the web pages you visit – Browsers ignore and/or correct as many HTML errors as they can, so HTML is often sloppy • XML is for computers – XML describes data – The rules are strict and errors are not allowed • In this way, XML is like a programming language – Current versions of most browsers can display XML • However, browser support of XML is spotty at best 4
  • 5. XML-related technologies • DTD (Document Type Definition) and XML Schemas are used to define legal XML tags and their attributes for particular purposes • CSS (Cascading Style Sheets) describe how to display HTML or XML in a browser • XSLT (eXtensible Stylesheet Language Transformations) and XPath are used to translate from one form of XML to another • DOM (Document Object Model), SAX (Simple API for XML, and JAXP (Java API for XML Processing) are all APIs for XML parsing 5
  • 6. Example XML document <?xml version="1.0"?> <weatherReport> <date>7/14/97</date> <city>North Place</city>, <state>NX</state> <country>USA</country> High Temp: <high scale="F">103</high> Low Temp: <low scale="F">70</low> Morning: <morning>Partly cloudy, Hazy</morning> Afternoon: <afternoon>Sunny &amp; hot</afternoon> Evening: <evening>Clear and Cooler</evening> </weatherReport> 6 From: XML: A Primer, by Simon St. Laurent
  • 7. Overall structure • An XML document may start with one or more processing instructions (PIs) or directives: <?xml version="1.0"?> <?xml-stylesheet type="text/css" href="ss.css"?> • Following the directives, there must be exactly one root element containing all the rest of the XML: <weatherReport> ... </weatherReport> 7
  • 8. XML building blocks • Aside from the directives, an XML document is built from: – elements: high in <high scale="F">103</high> – tags, in pairs: <high scale="F">103</high> – attributes: <high scale="F">103</high> – entities: <afternoon>Sunny &amp; hot</afternoon> – character data, which may be: • parsed (processed as XML)--this is the default • unparsed (all characters stand for themselves) 8
  • 9. Elements and attributes • Attributes and elements are somewhat interchangeable • Example using just elements: <name> <first>David</first> <last>Matuszek</last> </name> • Example using attributes: <name first="David" last="Matuszek"></name> • You will find that elements are easier to use in your programs--this is a good reason to prefer them • Attributes often contain metadata, such as unique IDs • Generally speaking, browsers display only elements (values enclosed by tags), not tags and attributes 9
  • 10. Well-formed XML • Every element must have both a start tag and an end tag, e.g. <name> ... </name> – But empty elements can be abbreviated: <break />. – XML tags are case sensitive – XML tags may not begin with the letters xml, in any combination of cases • Elements must be properly nested, e.g. not <b><i>bold and italic</b></i> • Every XML document must have one and only one root element • The values of attributes must be enclosed in single or double quotes, e.g. <time unit="days"> • Character data cannot contain < or & 10
  • 11. Entities • Five special characters must be written as entities: &amp; for & (almost always necessary) &lt; for < (almost always necessary) &gt; for > (not usually necessary) &quot; for " (necessary inside double quotes) &apos; for ' (necessary inside single quotes) • These entities can be used even in places where they are not absolutely required • These are the only predefined entities in XML 11
  • 12. XML declaration • The XML declaration looks like this: <?xml version="1.0" encoding="UTF-8" standalone="yes"?> – The XML declaration is not required by browsers, but is required by most XML processors (so include it!) – If present, the XML declaration must be first--not even whitespace should precede it – Note that the brackets are <? and ?> – version="1.0" is required (this is the only version so far) – encoding can be "UTF-8" (ASCII) or "UTF-16" (Unicode), or something else, or it can be omitted – standalone tells whether there is a separate DTD 12
  • 13. Processing instructions • PIs (Processing Instructions) may occur anywhere in the XML document (but usually first) • A PI is a command to the program processing the XML document to handle it in a certain way • XML documents are typically processed by more than one program • Programs that do not recognize a given PI should just ignore it • General format of a PI: <?target instructions?> • Example: <?xml-stylesheet type="text/css" href="mySheet.css"?> 13
  • 14. Comments • <!-- This is a comment in both HTML and XML --> • Comments can be put anywhere in an XML document • Comments are useful for: – Explaining the structure of an XML document – Commenting out parts of the XML during development and testing • Comments are not elements and do not have an end tag • The blanks after <!-- and before --> are optional • The character sequence -- cannot occur in the comment • The closing bracket must be --> • Comments are not displayed by browsers, but can be seen by anyone who looks at the source code 14
  • 15. CDATA • By default, all text inside an XML document is parsed • You can force text to be treated as unparsed character data by enclosing it in <![CDATA[ ... ]]> • Any characters, even & and <, can occur inside a CDATA • Whitespace inside a CDATA is (usually) preserved • The only real restriction is that the character sequence ]]> cannot occur inside a CDATA • CDATA is useful when your text has a lot of illegal characters (for example, if your XML document contains some HTML text) 15
  • 16. Names in XML • Names (as used for tags and attributes) must begin with a letter or underscore, and can consist of: – Letters, both Roman (English) and foreign – Digits, both Roman and foreign . (dot) - (hyphen) _ (underscore) : (colon) should be used only for namespaces – Combining characters and extenders (not used in English) 16
  • 17. Namespaces • Recall that DTDs are used to define the tags that can be used in an XML document • An XML document may reference more than one DTD • Namespaces are a way to specify which DTD defines a given tag • XML, like Java, uses qualified names – This helps to avoid collisions between names – Java: myObject.myVariable – XML: myDTD:myTag – Note that XML uses a colon (:) rather than a dot (.) 17
  • 18. Namespaces and URIs • A namespace is defined as a unique string – To guarantee uniqueness, typically a URI (Uniform Resource Indicator) is used, because the author “owns” the domain – It doesn't have to be a “real” URI; it just has to be a unique string – Example: http://www.matuszek.org/ns There are two ways to use namespaces: – Declare a default namespace – Associate a prefix with a namespace, then use the prefix in the XML to refer to the namespace 18
  • 19. Namespace syntax • In any start tag you can use the reserved attribute name xmlns: <book xmlns="http://www.matuszek.org/ns"> – This namespace will be used as the default for all elements up to the corresponding end tag – You can override it with a specific prefix • You can use almost this same form to declare a prefix: <book xmlns:dave="http://www.matuszek.org/ns"> – Use this prefix on every tag and attribute you want to use from this namespace, including end tags--it is not a default prefix <dave:chapter dave:number="1">To Begin</dave:chapter> • You can use the prefix in the start tag in which it is defined: <dave:book xmlns:dave="http://www.matuszek.org/ns"> 19
  • 20. Review of XML rules • Start with <?xml version="1.0"?> • XML is case sensitive • You must have exactly one root element that encloses all the rest of the XML • Every element must have a closing tag (or be written as an empty-element tag such as <br/>) • Elements must be properly nested • Attribute values must be enclosed in double or single quotation marks • There are only five predeclared entities
  • 21. Another well-structured example <novel> <foreword> <paragraph> This is the great American novel. </paragraph> </foreword> <chapter number="1"> <paragraph>It was a dark and stormy night. </paragraph> <paragraph>Suddenly, a shot rang out! </paragraph> </chapter> </novel> 21
  • 22. XML as a tree • An XML document represents a hierarchy; a hierarchy is a tree • [Figure: tree diagram] novel has two children, foreword and chapter (number="1") – foreword has one paragraph: “This is the great American novel.” – chapter has two paragraphs: “It was a dark and stormy night.” and “Suddenly, a shot rang out!”
  • 23. Valid XML • You can make up your own XML tags and attributes, but... – ...any program that uses the XML must know what to expect! • A DTD (Document Type Definition) defines what tags are legal and where they can occur in the XML • An XML document does not require a DTD • XML is well-formed if it follows the rules given earlier • In addition, XML is valid if it declares a DTD and conforms to that DTD • A DTD can be included in the XML, but is typically a separate document • Errors in XML documents will stop XML programs • Some alternatives to DTDs are XML Schemas and RELAX NG
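The statement that errors stop XML programs can be seen directly: a conforming parser raises an exception on the first well-formedness error instead of guessing. The sketch below assumes the JDK's JAXP DOM parser; WellFormedDemo and isWellFormed are hypothetical names. It checks basic syntax only, not validity against a DTD.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of the original slides.
class WellFormedDemo {
    // Returns true if the string parses as XML, false if the parser rejects it.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // syntax error: the document is not well-formed
        } catch (Exception e) {
            throw new IllegalStateException("parser setup or I/O failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<a><b/></a>"));  // properly nested
        System.out.println(isWellFormed("<a><b></a>"));   // <b> is never closed
    }
}
```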
  • 24. Viewing XML • XML is designed to be processed by computer programs, not to be displayed to humans • Nevertheless, almost all current browsers can display XML documents – They don’t all display it the same way – They may not display it at all if it has errors – For best results, update your browsers to the newest available versions • Remember: HTML is designed to be viewed, XML is designed to be used 24
  • 25. Extended document standards • You can define your own XML tag sets, but here are some already available: – XHTML: HTML redefined in XML – SMIL: Synchronized Multimedia Integration Language – MathML: Mathematical Markup Language – SVG: Scalable Vector Graphics – DrawML: Drawing MetaLanguage – ICE: Information and Content Exchange – ebXML: Electronic Business with XML – cXML: Commerce XML – CBL: Common Business Library
  • 26. Vocabulary • SGML: Standard Generalized Markup Language • XML : Extensible Markup Language • DTD: Document Type Definition • element: a start and end tag, along with their contents • attribute: a value given in the start tag of an element • entity: a representation of a particular character or string • PI: a Processing Instruction, to possibly be used by a program that processes this XML • namespace: a unique string that references a DTD • well-formed XML: XML that follows the basic syntax rules • valid XML: well-formed XML that conforms to a DTD 26
  • 28. XML Schemas • “Schemas” is a general term--DTDs are a form of XML schemas – According to the dictionary, a schema is “a structured framework or plan” • DTDs and XML Schemas are all XML schema languages
  • 29. Why XML Schemas? • DTDs provide a very weak specification language – You can’t put any restrictions on text content – You have very little control over mixed content (text plus elements) – You have little control over ordering of elements • DTDs are written in a strange (non-XML) format – You need separate parsers for DTDs and XML • The XML Schema Definition language solves these problems – XSD gives you much more control over structure and content – XSD is written in XML
  • 30. Why not XML schemas? • DTDs have been around longer than XSD – Therefore they are more widely used – Also, more tools support them • XSD is very verbose, even by XML standards • More advanced XML Schema instructions can be non-intuitive and confusing • Nevertheless, XSD is not likely to go away quickly
  • 31. Referring to a schema • To refer to a DTD in an XML document, the reference goes before the root element: – <?xml version="1.0"?> <!DOCTYPE rootElement SYSTEM "url"> <rootElement> ... </rootElement> • To refer to an XML Schema in an XML document, the reference goes in the root element: – <?xml version="1.0"?> <rootElement xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="url.xsd"> ... </rootElement> – The xmlns:xsi attribute (the XML Schema Instance reference) is required – xsi:noNamespaceSchemaLocation says where your XML Schema definition can be found
  • 32. <schema> • The <schema> element may have attributes: – xmlns:xs="http://www.w3.org/2001/XMLSchema" • This is necessary to specify where all our XSD tags are defined – elementFormDefault="qualified" • This means that all XML elements must be qualified
  • 33. “Simple” and “complex” elements • A “simple” element is one that contains text and nothing else – A simple element cannot have attributes – A simple element cannot contain other elements – A simple element cannot be empty – However, the text can be of many different types, and may have various restrictions applied to it • If an element isn’t simple, it’s “complex” – A complex element may have attributes – A complex element may be empty, or it may contain text, other elements, or both text and other elements
  • 34. Defining a simple element • A simple element is defined as <xs:element name="name" type="type" /> where: – name is the name of the element – the most common values for type are xs:boolean xs:integer xs:date xs:string xs:decimal xs:time • Other attributes a simple element may have: – default="default value" if no other value is specified – fixed="value" no other value may be specified
  • 35. Defining an attribute • Attributes themselves are always declared as simple types • An attribute is defined as <xs:attribute name="name" type="type" /> where: – name and type are the same as for xs:element • Other attributes an xs:attribute declaration may have: – default="default value" if no other value is specified – fixed="value" no other value may be specified – use="optional" the attribute is not required (default) – use="required" the attribute must be present
  • 36. Restrictions, or “facets” • The general form for putting a restriction on a text value is: – <xs:element name="name"> (or xs:attribute) <xs:simpleType> <xs:restriction base="type"> ... the restrictions ... </xs:restriction> </xs:simpleType> </xs:element> • For example: – <xs:element name="age"> <xs:simpleType> <xs:restriction base="xs:integer"> <xs:minInclusive value="0"/> <xs:maxInclusive value="140"/> </xs:restriction> </xs:simpleType> </xs:element>
  • 37. Restrictions on numbers • minInclusive -- number must be ≥ the given value • minExclusive -- number must be > the given value • maxInclusive -- number must be ≤ the given value • maxExclusive -- number must be < the given value • totalDigits -- number must have no more than value digits in total • fractionDigits -- number must have no more than value digits after the decimal point
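One way to try out numeric facets is to validate small documents against a schema from Java. The sketch below assumes the javax.xml.validation API that ships with the JDK; the class AgeFacetDemo and its helper isValid are hypothetical names, and the schema is the age example (0 to 140 inclusive) written with its xs:simpleType wrapper.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

// Hypothetical demo class; not part of the original slides.
class AgeFacetDemo {
    // Schema for a single <age> element restricted to the range 0..140.
    static final String XSD =
        "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
      + "<xs:element name=\"age\"><xs:simpleType>"
      + "<xs:restriction base=\"xs:integer\">"
      + "<xs:minInclusive value=\"0\"/><xs:maxInclusive value=\"140\"/>"
      + "</xs:restriction></xs:simpleType></xs:element>"
      + "</xs:schema>";

    // Returns true if the XML document conforms to the schema above.
    static boolean isValid(String xml) {
        Validator validator;
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            validator = schema.newValidator();
        } catch (SAXException e) {
            throw new IllegalStateException("the schema itself is broken", e);
        }
        try {
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // a facet (or the syntax) was violated
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<age>42</age>"));
        System.out.println(isValid("<age>141</age>"));
    }
}
```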
  • 38. Restrictions on strings • length -- the string must contain exactly value characters • minLength -- the string must contain at least value characters • maxLength -- the string must contain no more than value characters • pattern -- the value is a regular expression that the string must match • whiteSpace -- not really a “restriction”--tells what to do with whitespace – value="preserve" Keep all whitespace – value="replace" Change all whitespace characters to spaces – value="collapse" Remove leading and trailing whitespace, and replace all sequences of whitespace with a single space
  • 39. Enumeration • An enumeration restricts the value to be one of a fixed set of values • Example: – <xs:element name="season"> <xs:simpleType> <xs:restriction base="xs:string"> <xs:enumeration value="Spring"/> <xs:enumeration value="Summer"/> <xs:enumeration value="Autumn"/> <xs:enumeration value="Fall"/> <xs:enumeration value="Winter"/> </xs:restriction> </xs:simpleType> </xs:element>
  • 40. Complex elements • A complex element is defined as <xs:element name="name"> <xs:complexType> ... information about the complex type... </xs:complexType> </xs:element> • Example: <xs:element name="person"> <xs:complexType> <xs:sequence> <xs:element name="firstName" type="xs:string" /> <xs:element name="lastName" type="xs:string" /> </xs:sequence> </xs:complexType> </xs:element> • <xs:sequence> says that elements must occur in this order • Remember that attributes are always simple types
  • 41. Global and local definitions • Elements declared at the “top level” of a <schema> are available for use throughout the schema • Elements declared within a xs:complexType are local to that type • Thus, in <xs:element name="person"> <xs:complexType> <xs:sequence> <xs:element name="firstName" type="xs:string" /> <xs:element name="lastName" type="xs:string" /> </xs:sequence> </xs:complexType> </xs:element> the elements firstName and lastName are only locally declared • The order of declarations at the “top level” of a <schema> does not specify the order in the XML data document
  • 42. Declaration and use • So far we’ve been talking about how to declare types, not how to use them • To use a type we have declared, use it as the value of type="..." – Examples: • <xs:element name="student" type="person"/> • <xs:element name="professor" type="person"/> – Scope is important: you cannot use a type if it is local to some other type
  • 43. xs:sequence • We’ve already seen an example of a complex type whose elements must occur in a specific order: • <xs:element name="person"> <xs:complexType> <xs:sequence> <xs:element name="firstName" type="xs:string" /> <xs:element name="lastName" type="xs:string" /> </xs:sequence> </xs:complexType> </xs:element>
  • 44. xs:all • xs:all allows elements to appear in any order • <xs:element name="person"> <xs:complexType> <xs:all> <xs:element name="firstName" type="xs:string" /> <xs:element name="lastName" type="xs:string" /> </xs:all> </xs:complexType> </xs:element> • Despite the name, each member of an xs:all group can occur at most once: exactly once by default, or not at all if it is declared with minOccurs="0"
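The difference between xs:sequence and xs:all shows up when the child elements arrive in reverse order. This is a sketch assuming the JDK's javax.xml.validation API; OrderDemo, personSchema, and isValid are hypothetical names. The same person schema is generated twice, once per grouping element.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

// Hypothetical demo class; not part of the original slides.
class OrderDemo {
    // Builds the person schema with the given group: "sequence" or "all".
    static String personSchema(String group) {
        return "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
             + "<xs:element name=\"person\"><xs:complexType><xs:" + group + ">"
             + "<xs:element name=\"firstName\" type=\"xs:string\"/>"
             + "<xs:element name=\"lastName\" type=\"xs:string\"/>"
             + "</xs:" + group + "></xs:complexType></xs:element>"
             + "</xs:schema>";
    }

    // Returns true if xml conforms to the schema text xsd.
    static boolean isValid(String xsd, String xml) {
        Validator validator;
        try {
            validator = SchemaFactory
                .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(xsd)))
                .newValidator();
        } catch (SAXException e) {
            throw new IllegalStateException("the schema itself is broken", e);
        }
        try {
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String reversed = "<person><lastName>Kim</lastName>"
                        + "<firstName>Yong</firstName></person>";
        System.out.println("sequence: " + isValid(personSchema("sequence"), reversed));
        System.out.println("all:      " + isValid(personSchema("all"), reversed));
    }
}
```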
  • 45. Referencing • Once you have defined an element or attribute (with name="..."), you can refer to it with ref="..." • Example: – <xs:element name="person"> <xs:complexType> <xs:all> <xs:element name="firstName" type="xs:string" /> <xs:element name="lastName" type="xs:string" /> </xs:all> </xs:complexType> </xs:element> – To use it inside another type: <xs:element ref="person"/> – Note that name and ref cannot both appear on the same xs:element
  • 46. Text element with attributes • If a text element has attributes, it is no longer a simple type – <xs:element name="population"> <xs:complexType> <xs:simpleContent> <xs:extension base="xs:integer"> <xs:attribute name="year" type="xs:integer"/> </xs:extension> </xs:simpleContent> </xs:complexType> </xs:element>
  • 47. Empty elements • Empty elements are (ridiculously) complex • <xs:complexType name="counter"> <xs:complexContent> <xs:restriction base="xs:anyType"> <xs:attribute name="count" type="xs:integer"/> </xs:restriction> </xs:complexContent> </xs:complexType>
  • 48. Mixed elements • Mixed elements may contain both text and elements • We add mixed="true" to the xs:complexType element • The text itself is not mentioned in the element, and may go anywhere (it is basically ignored) • <xs:complexType name="paragraph" mixed="true"> <xs:sequence> <xs:element name="someName" type="xs:anyType"/> </xs:sequence> </xs:complexType>
  • 49. Extensions • You can base a complex type on another complex type • <xs:complexType name="newType"> <xs:complexContent> <xs:extension base="otherType"> ...new stuff... </xs:extension> </xs:complexContent> </xs:complexType>
  • 50. Predefined string types • Recall that a simple element is defined as: <xs:element name="name" type="type" /> • Here are a few of the possible string types: – xs:string -- a string – xs:normalizedString -- a string that doesn’t contain tabs, newlines, or carriage returns – xs:token -- a string that doesn’t contain any whitespace other than single spaces • Allowable restrictions on strings: – enumeration, length, maxLength, minLength, pattern, whiteSpace
  • 51. Predefined date and time types • xs:date -- A date in the format CCYY-MM-DD, for example, 2002-11-05 • xs:time -- A time in the format hh:mm:ss (hours, minutes, seconds) • xs:dateTime -- Format is CCYY-MM-DDThh:mm:ss • Allowable restrictions on dates and times: – enumeration, minInclusive, minExclusive, maxInclusive, maxExclusive, pattern, whiteSpace
  • 52. Predefined numeric types • Here are some of the predefined numeric types: xs:decimal xs:positiveInteger xs:byte xs:negativeInteger xs:short xs:nonPositiveInteger xs:int xs:nonNegativeInteger xs:long • Allowable restrictions on numeric types: – enumeration, minInclusive, minExclusive, maxInclusive, maxExclusive, fractionDigits, totalDigits, pattern, whiteSpace
  • 53. DOM 31-Oct-12
  • 54. SAX and DOM • SAX and DOM are standards for XML parsers-- program APIs to read and interpret XML files – DOM is a W3C standard – SAX is an ad-hoc (but very popular) standard • There are various implementations available • Java implementations are provided in JAXP (Java API for XML Processing) • Unlike many XML technologies, SAX and DOM are relatively easy
  • 55. Difference between SAX and DOM • DOM reads the entire XML document into memory and stores it as a tree data structure • SAX reads the XML document and sends an event for each element that it encounters • Consequences: – DOM provides “random access” into the XML document – SAX provides only sequential access to the XML document – DOM is slow and requires huge amounts of memory, so it cannot be used for large XML documents – SAX is fast and requires very little memory, so it can be used for huge documents (or large numbers of documents) • This makes SAX much more popular for web sites – Some DOM implementations have methods for changing the XML document in memory; SAX implementations do not
  • 56. Simple DOM program, I • import javax.xml.parsers.*; import org.w3c.dom.*; • public class SecondDom { public static void main(String args[]) { try { ...Main part of program goes here... } catch (Exception e) { e.printStackTrace(System.out); } } }
  • 57. Simple DOM program, II • First we need to create a DOM parser, called a “DocumentBuilder” • The parser is created, not by a constructor, but by calling a static factory method – This is a common technique in advanced Java programming – The use of a factory method makes it easier if you later switch to a different parser DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance(); DocumentBuilder builder = factory.newDocumentBuilder();
  • 58. Simple DOM program, III • The next step is to load in the XML file • Here is the XML file, named hello.xml: <?xml version="1.0"?> <display>Hello World!</display> • To read this file in, we add the following line to our program: Document document = builder.parse("hello.xml"); • Notes: – document contains the entire XML file (as a tree); it is the Document Object Model – If you run this from the command line, your XML file should be in the same directory as your program – An IDE may look in a different directory for your file; if you get a java.io.FileNotFoundException, this is probably why
  • 59. Simple DOM program, IV • The following code finds the content of the root element and prints it: Element root = document.getDocumentElement(); Node textNode = root.getFirstChild(); System.out.println(textNode.getNodeValue()); • This code should be mostly self-explanatory; we’ll get into the details shortly • The output of the program is: Hello World!
  • 60. Reading in the tree • The parse method reads in the entire XML document and represents it as a tree in memory – For a large document, parsing could take a while – If you want to interact with your program while it is parsing, you need to parse in a separate thread • Once parsing starts, you cannot interrupt or stop it • Do not try to access the parse tree until parsing is done • An XML parse tree may require up to ten times as much memory as the original XML document – If you have a lot of tree manipulation to do, DOM is much more convenient than SAX – If you don’t have a lot of tree manipulation to do, consider using SAX instead
  • 61. Structure of the DOM tree • The DOM tree is composed of Node objects • Node is an interface – Some of the more important subinterfaces are Element, Attr, and Text • An Element node may have children • Attr and Text nodes are leaves – Additional types are Document, ProcessingInstruction, Comment, Entity, CDATASection and several others • Hence, the DOM tree is composed entirely of Node objects, but the Node objects can be downcast into more specific types as needed
  • 62. Operations on Nodes, I • The results returned by getNodeName(), getNodeValue(), getNodeType() and getAttributes() depend on the subtype of the node, as follows: – getNodeName(): Element → tag name; Text → "#text"; Attr → name of attribute – getNodeValue(): Element → null; Text → text contents; Attr → value of attribute – getNodeType(): Element → ELEMENT_NODE; Text → TEXT_NODE; Attr → ATTRIBUTE_NODE – getAttributes(): Element → NamedNodeMap; Text → null; Attr → null
  • 63. Distinguishing Node types • Here’s an easy way to tell what kind of a node you are dealing with: switch(node.getNodeType()) { case Node.ELEMENT_NODE: Element element = (Element)node; ...; break; case Node.TEXT_NODE: Text text = (Text)node; ... break; case Node.ATTRIBUTE_NODE: Attr attr = (Attr)node; ... break; default: ... }
  • 64. Operations on Nodes, II • Tree-walking operations that return a Node: – getParentNode() – getFirstChild() – getNextSibling() – getPreviousSibling() – getLastChild() • Tests that return a boolean: – hasAttributes() – hasChildNodes()
  • 65. Operations for Elements • String getTagName() – Returns the name of the tag • boolean hasAttribute(String name) – Returns true if this Element has the named attribute • String getAttribute(String name) – Returns the (String) value of the named attribute • boolean hasAttributes() – Returns true if this Element has any attributes – This method is actually inherited from Node • Returns false if it is applied to a Node that isn’t an Element • NamedNodeMap getAttributes() – Returns a NamedNodeMap of all the Element’s attributes – This method is actually inherited from Node • Returns null if it is applied to a Node that isn’t an Element
  • 66. NamedNodeMap • The node.getAttributes() operation returns a NamedNodeMap – Because NamedNodeMaps are used for other kinds of nodes (elsewhere in Java), the contents are treated as general Nodes, not specifically as Attrs • Some operations on a NamedNodeMap are: – getNamedItem(String name) returns (as a Node) the attribute with the given name – getLength() returns (as an int) the number of Nodes in this NamedNodeMap – item(int index) returns (as a Node) the indexth item • This operation lets you conveniently step through all the nodes in the NamedNodeMap • Java does not guarantee the order in which nodes are returned
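Stepping through a NamedNodeMap with getLength() and item(int) might look like the sketch below. It assumes the JDK's JAXP DOM parser; AttributeListDemo and attributesOf are hypothetical names. Because the order of the nodes is not guaranteed, the attributes are copied into a sorted map.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the original slides.
class AttributeListDemo {
    // Returns the root element's attributes as a name -> value map, sorted by name.
    static Map<String, String> attributesOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NamedNodeMap attrs = doc.getDocumentElement().getAttributes();
            Map<String, String> result = new TreeMap<>();
            for (int i = 0; i < attrs.getLength(); i++) {
                Node attr = attrs.item(i); // each item is an Attr, seen as a Node
                result.put(attr.getNodeName(), attr.getNodeValue());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(attributesOf("<chapter title=\"The Beginning\" num=\"1\"/>"));
    }
}
```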
  • 67. Operations on Texts • Text is a subinterface of CharacterData and inherits the following operations (among others): – public String getData() throws DOMException • Returns the text contents of this Text node – public int getLength() • Returns the number of Unicode characters in the text – public String substringData(int offset, int count) throws DOMException • Returns a substring of the text contents
  • 68. Operations on Attrs • String getName() – Returns the name of this attribute. • Element getOwnerElement() – Returns the Element node this attribute is attached to, or null if this attribute is not in use • boolean getSpecified() – Returns true if this attribute was explicitly given a value in the original document • String getValue() – Returns the value of the attribute as a String
  • 69. Preorder traversal • The DOM is stored in memory as a tree • An easy way to traverse a tree is in preorder • You should remember how to do this from your course in Data Structures • The general form of a preorder traversal is: – Visit the root – Traverse each subtree, in order
  • 70. Preorder traversal in Java • static void simplePreorderPrint(String indent, Node node) { printNode(indent, node); if(node.hasChildNodes()) { Node child = node.getFirstChild(); while (child != null) { simplePreorderPrint(indent + " ", child); child = child.getNextSibling(); } } } • static void printNode(String indent, Node node) { System.out.print(indent); System.out.print(node.getNodeType() + " "); System.out.print(node.getNodeName() + " "); System.out.print(node.getNodeValue() + " "); System.out.println(node.getAttributes()); }
  • 71. Trying out the program Input: <?xml version="1.0"?> <novel> <chapter num="1">The Beginning</chapter> <chapter num="2">The Middle</chapter> <chapter num="3">The End</chapter> </novel>
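A self-contained variant of the preorder traversal, for experimenting without a file on disk, is sketched below. It assumes the JDK's JAXP DOM parser; PreorderDemo, collect, and preorderNames are hypothetical names. Instead of printing, it collects each node's name in visiting order.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the original slides.
class PreorderDemo {
    // Visit the node first, then each of its subtrees from left to right.
    static void collect(Node node, List<String> out) {
        out.add(node.getNodeName());
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            collect(child, out);
        }
    }

    // Parses the XML string and returns node names in preorder, starting at the root element.
    static List<String> preorderNames(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            List<String> names = new ArrayList<>();
            collect(doc.getDocumentElement(), names);
            return names;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<novel><chapter num=\"1\">The Beginning</chapter>"
                   + "<chapter num=\"2\">The Middle</chapter></novel>";
        System.out.println(preorderNames(xml));
    }
}
```

If the input is indented across several lines, the whitespace between tags shows up as additional #text nodes in the traversal.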
  • 72. A DOM XML parser reads the XML file below and prints out each element one by one. file.xml: <?xml version="1.0"?> <company> <staff> <firstname>yong</firstname> <lastname>mook kim</lastname> <nickname>mkyong</nickname> <salary>100000</salary> </staff> <staff> <firstname>low</firstname> <lastname>yin fong</lastname> <nickname>fong fong</nickname> <salary>200000</salary> </staff> </company>
  • 73. import javax.xml.parsers.DocumentBuilderFactory; import javax.xml.parsers.DocumentBuilder; import org.w3c.dom.Document; import org.w3c.dom.NodeList; import org.w3c.dom.Node; import org.w3c.dom.Element; import java.io.File; public class ReadXMLFile { public static void main(String argv[]) { try { File fXmlFile = new File("c:file.xml"); DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance(); DocumentBuilder dBuilder = dbFactory.newDocumentBuilder(); Document doc = dBuilder.parse(fXmlFile); doc.getDocumentElement().normalize(); System.out.println("Root element :" + doc.getDocumentElement().getNodeName()); NodeList nList = doc.getElementsByTagName("staff"); System.out.println("-----------------------"); for (int temp = 0; temp < nList.getLength(); temp++) { Node nNode = nList.item(temp); if (nNode.getNodeType() == Node.ELEMENT_NODE) { Element eElement = (Element) nNode; System.out.println("First Name : " + getTagValue("firstname", eElement)); System.out.println("Last Name : " + getTagValue("lastname", eElement)); System.out.println("Nick Name : " + getTagValue("nickname", eElement)); System.out.println("Salary : " + getTagValue("salary", eElement)); } } } catch (Exception e) { e.printStackTrace(); } } /* Returns the text of the first sTag child of eElement */ private static String getTagValue(String sTag, Element eElement) { NodeList nlList = eElement.getElementsByTagName(sTag).item(0).getChildNodes(); Node nValue = (Node) nlList.item(0); return nValue.getNodeValue(); } }
  • 74. SAX A parser for XML Documents
  • 75. XML Parsers • Two types of parser – SAX (Simple API for XML) • Event driven API • Sends events to the application as the document is read – DOM (Document Object Model) • Reads the entire document into memory in a tree structure
  • 76. SAX Parser • When should I use it? – Large documents – Memory constrained devices • When should I use something else? – If you need to modify the document – SAX doesn’t remember previous events unless you write explicit code to do so.
  • 77. SAX Parser • Which languages are supported? – Java – Perl – C++ – Python
  • 78. Difference between SAX and DOM • DOM reads the entire XML document into memory and stores it as a tree data structure • SAX reads the XML document and calls one of your methods for each element or block of text that it encounters • Consequences: – DOM provides “random access” into the XML document – SAX provides only sequential access to the XML document – DOM is slow and requires huge amounts of memory, so it cannot be used for large XML documents – SAX is fast and requires very little memory, so it can be used for huge documents (or large numbers of documents) • This makes SAX much more popular for web sites – Some DOM implementations have methods for changing the XML document in memory; SAX implementations do not
  • 79. Callbacks • SAX works through callbacks: you call the parser, it calls methods that you supply • [Diagram: in your program, main(...) calls the SAX parser's parse(...); the parser in turn calls back your startDocument(...), startElement(...), characters(...), endElement(...), and endDocument(...) methods]
  • 80. Simple SAX program • The following program is adapted from CodeNotes® for XML by Gregory Brill, pages 158-159 • The program consists of two classes: – Sample -- This class contains the main method; it • Gets a factory to make parsers • Gets a parser from the factory • Creates a Handler object to handle callbacks from the parser • Tells the parser which handler to send its callbacks to • Reads and parses the input XML file – Handler -- This class contains handlers for three kinds of callbacks: • startElement callbacks, generated when a start tag is seen • endElement callbacks, generated when an end tag is seen • characters callbacks, generated for the contents of an element
  • 81. The Sample class, I • import javax.xml.parsers.*; // for both SAX and DOM import org.xml.sax.*; import org.xml.sax.helpers.*; • // For simplicity, we let the operating system handle exceptions // In "real life" this is poor programming practice public class Sample { public static void main(String args[]) throws Exception { • // Create a parser factory SAXParserFactory factory = SAXParserFactory.newInstance(); • // Tell factory that the parser must understand namespaces factory.setNamespaceAware(true); • // Make the parser SAXParser saxParser = factory.newSAXParser(); XMLReader parser = saxParser.getXMLReader();
  • 82. The Sample class, II • In the previous slide we made a parser, of type XMLReader • // Create a handler Handler handler = new Handler(); • // Tell the parser to use this handler parser.setContentHandler(handler); • // Finally, read and parse the document parser.parse("hello.xml"); • } // end of Sample class • You will need to put the file hello.xml : – In the same directory, if you run the program from the command line – Or where it can be found by the particular IDE you are using
  • 83. The Handler class, I • public class Handler extends DefaultHandler { – DefaultHandler is an adapter class that defines these methods and others as do-nothing methods, to be overridden as desired – We will define three very similar methods to handle (1) start tags, (2) contents, and (3) end tags--our methods will just print a line – Each of these three methods could throw a SAXException • // SAX calls this method when it encounters a start tag public void startElement(String namespaceURI, String localName, String qualifiedName, Attributes attributes) throws SAXException { System.out.println("startElement: " + qualifiedName); }
  • 84. The Handler class, II • // SAX calls this method to pass in character data public void characters(char ch[], int start, int length) throws SAXException { System.out.println("characters: \"" + new String(ch, start, length) + "\""); } • // SAX calls this method when it encounters an end tag public void endElement(String namespaceURI, String localName, String qualifiedName) throws SAXException { System.out.println("Element: /" + qualifiedName); } } // End of Handler class
  • 85. Results • If the file hello.xml contains: <?xml version="1.0"?> <display>Hello World!</display> • Then the output from running java Sample will be: startElement: display characters: "Hello World!" Element: /display
  • 86. More results • Now suppose the file hello.xml contains: – <?xml version="1.0"?> <display> <i>Hello</i> World! </display> • Notice that the root element, <display>, now contains a nested element <i> and some whitespace (including newlines) • The result will be something like the following (how the text is split across characters calls depends on the parser): startElement: display characters: "" // empty string characters: " " // newline characters: " " // spaces startElement: i characters: "Hello" Element: /i characters: "World!" characters: " " // another newline Element: /display
  • 87. Parser factories • A factory is an alternative to constructors • To create a SAX parser factory, call this method: SAXParserFactory.newInstance() – This returns an object of type SAXParserFactory – It may throw a FactoryConfigurationError • You can then say what kind of parser you want: – public void setNamespaceAware(boolean awareness) • Call this with true if you are using namespaces • The default (if you don’t call this method) is false – public void setValidating(boolean validating) • Call this with true if you want to validate against a DTD • The default (if you don’t call this method) is false • Validation will give an error if you don’t have a DTD
  • 88. Getting a parser • Once you have a SAXParserFactory set up (say it’s named factory), you can create a parser with: SAXParser saxParser = factory.newSAXParser(); XMLReader parser = saxParser.getXMLReader();
  • 89. Declaring which handler to use • Since the SAX parser will be calling our methods, we need to supply these methods • In the example these are in a separate class, Handler • We need to tell the parser where to find the methods: Handler handler = new Handler(); parser.setContentHandler(handler); • These statements could be combined: parser.setContentHandler(new Handler()); • Finally, we call the parser and tell it what file to parse: parser.parse("hello.xml"); • Everything else will be done in the handler methods
  • 90. SAX handlers • A callback handler for SAX must implement these four interfaces: – interface ContentHandler • This is the most important interface--it handles basic parsing callbacks, such as element starts and ends – interface DTDHandler • Handles only notation and unparsed entity declarations – interface EntityResolver • Does customized handling for external entities – interface ErrorHandler • Must be implemented or parsing errors will be ignored! • You could implement all these interfaces yourself, but that’s a lot of work--it’s easier to use an adapter class
  • 91. Class DefaultHandler • DefaultHandler is in package org.xml.sax.helpers • DefaultHandler implements ContentHandler, DTDHandler, EntityResolver, and ErrorHandler • DefaultHandler is an adapter class--it provides empty methods for every method declared in each of the four interfaces – Empty methods don’t do anything • To use this class, extend it and override the methods that are important to your application – We will cover some of the methods in the ContentHandler and ErrorHandler interfaces
  • 92. ContentHandler methods, I • public void setDocumentLocator(Locator loc) – This method is called once, when parsing first starts – The Locator contains either a URL or a URN, or both, that specifies where the document is located – You may need this information if you need to find a document whose position is specified relative to this XML document – Locator methods include: • public String getPublicId() returns the public identifier for the current document • public String getSystemId() returns the system identifier for the current document – Every ContentHandler method except this one may throw a SAXException
  • 93. ContentHandler methods, II • public void processingInstruction(String target, String data) throws SAXException • This method is called once for each processing instruction (PI) that is encountered • The PI is presented as two strings: <?target data?> • According to XML rules, PIs may occur anywhere in the document after the initial <?xml ...?> line – This means calls to processingInstruction do not necessarily occur before startElement is called with the document root--they may occur later
  • 94. ContentHandler methods, III • public void startDocument() throws SAXException – This is called just once, at the beginning of parsing • public void endDocument() throws SAXException – This is called just once, and is the last method called by the parser • Remember: when you override a method, you can throw fewer kinds of exceptions, but you can’t throw any new kinds – In other words: your methods don’t have to throw a SAXException – But if they must throw an exception, it can only be a SAXException
  • 95. ContentHandler methods, IV • public void startElement(String namespaceURI, String localName, String qualifiedName, Attributes atts) throws SAXException • This method is called at the beginning of every element • If the parser is namespace-aware, – namespaceURI will hold the namespace URI that the element's prefix (or the default namespace) is bound to, not the prefix itself – localName will hold the element name (without a prefix) – qualifiedName may be the empty string, although many parsers still supply the prefixed name • If the parser is not using namespaces, – namespaceURI and localName will be empty strings – qualifiedName will hold the element name (possibly with prefix)
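The startElement parameters can be inspected with a small handler. This is a sketch assuming the JDK's JAXP SAX parser with namespace awareness switched on; SaxNamesDemo and startEvents are hypothetical names. Each start tag is recorded as {namespaceURI}localName.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of the original slides.
class SaxNamesDemo {
    // Records "{namespaceURI}localName" for every start tag, in document order.
    static List<String> startEvents(String xml) {
        List<String> events = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String namespaceURI, String localName,
                                             String qualifiedName, Attributes atts) {
                        events.add("{" + namespaceURI + "}" + localName);
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return events;
    }

    public static void main(String[] args) {
        String xml = "<dave:book xmlns:dave=\"http://www.matuszek.org/ns\">"
                   + "<dave:chapter/><plain/></dave:book>";
        for (String event : startEvents(xml)) {
            System.out.println(event);
        }
    }
}
```

An element that belongs to no namespace, such as <plain/> here, is reported with an empty namespaceURI.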
  • 96. Attributes, I • When SAX calls startElement, it passes in a parameter of type Attributes • Attributes is an interface that defines a number of useful methods; here are a few of them: – getLength() returns the number of attributes – getLocalName(index) returns the attribute’s local name – getQName(index) returns the attribute’s qualified name – getValue(index) returns the attribute’s value – getType(index) returns the attribute’s type, which will be one of the Strings "CDATA", "ID", "IDREF", "IDREFS", "NMTOKEN", "NMTOKENS", "ENTITY", "ENTITIES", or "NOTATION" • As with elements, if the local name is the empty string, then the attribute’s name is in the qualified name
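The index-based methods might be used as in the sketch below. AttributeLister and the sample element are invented for illustration; the fallback from local name to qualified name follows the last bullet above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: lists every attribute as "name=value (type)".
class AttributeLister extends DefaultHandler {
    final List<String> lines = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        for (int i = 0; i < atts.getLength(); i++) {
            String name = atts.getLocalName(i);
            if (name.isEmpty()) {
                name = atts.getQName(i);           // not namespace-aware: use the qualified name
            }
            lines.add(name + "=" + atts.getValue(i) + " (" + atts.getType(i) + ")");
        }
    }

    static List<String> list(String xml) throws Exception {
        AttributeLister handler = new AttributeLister();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.lines;
    }

    public static void main(String[] args) throws Exception {
        list("<item id=\"42\" color=\"red\"/>").forEach(System.out::println);
    }
}
```

Without a DTD that declares the attributes, every type is reported as "CDATA".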
  • 97. Attributes, II • SAX does not guarantee that the attributes will be returned in the same order they are written – After all, the order is irrelevant in XML • The following methods look up attributes by name rather than by index: – public int getIndex(String qualifiedName) – public int getIndex(String uri, String localName) – public String getValue(String qualifiedName) – public String getValue(String uri, String localName) • An Attributes object is valid only during the call to startElement – If you need to remember attributes longer, use: AttributesImpl attrImpl = new AttributesImpl(attributes);
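A short sketch of name-based lookup and of keeping a copy that outlives the callback. The class name AttributeKeeper and the sample document are assumptions; AttributesImpl is the helper class from org.xml.sax.helpers.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: looks up an attribute by name and keeps a safe copy of all of them.
class AttributeKeeper extends DefaultHandler {
    String id;            // value found by qualified name
    Attributes saved;     // copy that stays valid after startElement returns

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        id = atts.getValue("id");                  // null if there is no such attribute
        saved = new AttributesImpl(atts);          // the parser may reuse atts after this call
    }

    static AttributeKeeper run(String xml) throws Exception {
        AttributeKeeper handler = new AttributeKeeper();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        AttributeKeeper k = run("<item id=\"42\" color=\"red\"/>");
        System.out.println("id = " + k.id);
        System.out.println("color = " + k.saved.getValue("color"));
        System.out.println("index of missing = " + k.saved.getIndex("missing"));
    }
}
```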
  • 98. ContentHandler methods, V • public void endElement(String namespaceURI, String localName, String qualifiedName) throws SAXException • The parameters to endElement are the same as those to startElement, except that the Attributes parameter is omitted
  • 99. ContentHandler methods, VI • public void characters(char[] ch, int start, int length) throws SAXException • ch is an array of characters – Only length characters, starting from ch[start], are the contents of the element • The String constructor new String(ch, start, length) is an easy way to extract the relevant characters from the char array • characters may be called multiple times for one element – Newlines and entities may break the data characters into separate calls – characters may be called with length = 0 – All data characters of the element will eventually be given to characters
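Because the text of one element can arrive in several pieces, a common approach is to append every piece to a buffer and read the buffer in endElement. The sketch below assumes a document with a single text-only element; TextCollector is a made-up name.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: joins all characters() calls of an element into one string.
class TextCollector extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    int calls;            // how many times characters() was invoked
    String text;          // complete text, available once the element has ended

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        buffer.setLength(0);                       // start fresh for this element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        calls++;
        buffer.append(ch, start, length);          // only ch[start .. start+length-1] is ours
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        text = buffer.toString();
    }

    static TextCollector run(String xml) throws Exception {
        TextCollector handler = new TextCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        TextCollector c = run("<display>Hello &amp; goodbye</display>");
        System.out.println(c.calls + " call(s): " + c.text);
    }
}
```

The number of calls depends on the parser, so code should rely only on the accumulated text.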
  • 100. Example • If hello.xml contains: – <?xml version="1.0"?> <display> Hello World! </display> • Then the sample program we started with gives: – startElement: display characters: <-- zero length string characters: <-- LF character (ASCII 10) characters: Hello World! <-- spaces are preserved characters: <-- LF character (ASCII 10) Element: /display
  • 101. Whitespace • Whitespace is a major nuisance – Whitespace is characters; characters are PCDATA – If you are validating, the parser will ignore whitespace where PCDATA is not allowed by the DTD – If you are not validating, the parser cannot ignore whitespace – If you ignore whitespace, you lose your indentation • To ignore whitespace when validating: – Happens automatically • To ignore whitespace when not validating: – Use the String method trim() to remove whitespace – Check the result to see if it is the empty string
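For the non-validating case, one possible arrangement is to buffer character data and apply trim() whenever an element starts or ends, keeping only non-empty results. This is a sketch under that assumption, with WhitespaceFilter as an invented name; note that trim() also strips leading and trailing spaces from real text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: keeps text runs that contain something other than whitespace.
class WhitespaceFilter extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    final List<String> texts = new ArrayList<>();

    // Trims whatever has accumulated and keeps it only if something is left.
    private void flush() {
        String trimmed = buffer.toString().trim();
        if (!trimmed.isEmpty()) {
            texts.add(trimmed);
        }
        buffer.setLength(0);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        flush();
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flush();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length);
    }

    static List<String> run(String xml) throws Exception {
        WhitespaceFilter handler = new WhitespaceFilter();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.texts;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run("<a>\n  <b> hi </b>\n  <c>there</c>\n</a>"));
    }
}
```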
  • 102. Handling ignorable whitespace • A nonvalidating parser cannot ignore whitespace, because it cannot distinguish it from real data • A validating parser can, and does, ignore whitespace where character data is not allowed – For processing XML, this is usually what you want – However, if you are manipulating and writing out XML, discarding whitespace ruins your indentation – To capture ignorable whitespace, you can override this method (defined in DefaultHandler): public void ignorableWhitespace(char[] ch, int start, int length) throws SAXException • Parameters are the same as those for characters
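A sketch of capturing ignorable whitespace with a validating parser. The inline DTD, the element names, and the class name IgnorableDemo are assumptions made for the example; validation is switched on with SAXParserFactory.setValidating.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: separates real character data from ignorable whitespace.
class IgnorableDemo extends DefaultHandler {
    final StringBuilder text = new StringBuilder();        // data from characters()
    final StringBuilder ignorable = new StringBuilder();   // data from ignorableWhitespace()

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        ignorable.append(ch, start, length);
    }

    static IgnorableDemo run(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(true);                       // needs a DTD in the document
        IgnorableDemo handler = new IgnorableDemo();
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\"?>\n"
                   + "<!DOCTYPE a [<!ELEMENT a (b)> <!ELEMENT b (#PCDATA)>]>\n"
                   + "<a>\n  <b>x</b>\n</a>";
        IgnorableDemo d = run(xml);
        System.out.println("text = '" + d.text + "', ignorable chars = " + d.ignorable.length());
    }
}
```

Here element a may contain only b, so the line breaks and indentation inside a are reported through ignorableWhitespace rather than characters.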
  • 103. Error Handling, I • SAX error handling is unusual • Most errors are ignored unless you register an error handler (org.xml.sax.ErrorHandler) – Ignored errors can cause bizarre behavior – Failing to provide an error handler is unwise • The ErrorHandler interface declares: – public void fatalError(SAXParseException exception) throws SAXException // XML not well-formed – public void error(SAXParseException exception) throws SAXException // XML validation error – public void warning(SAXParseException exception) throws SAXException // minor problem
  • 104. Error Handling, II • If you are extending DefaultHandler, it already implements ErrorHandler, and passing it to SAXParser’s parse method registers it as the error handler – DefaultHandler’s version of fatalError() throws a SAXException, but... – its error() and warning() methods do nothing! • You can (and should) override these methods • Note that the only kind of checked exception your override methods can throw is a SAXException – When you override a method, you cannot add exception types – If you need to throw another kind of exception, say an IOException, you can encapsulate it in a SAXException: • catch (IOException ioException) { throw new SAXException("I/O error: ", ioException); }
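The effect of the do-nothing error() method can be demonstrated with a sketch like this one. StrictHandler, the rejects helper, and the inline DTD are invented for illustration; the document is well-formed but invalid, so only error() is involved.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: turns validation errors into exceptions instead of ignoring them.
class StrictHandler extends DefaultHandler {
    // Well-formed but invalid: the DTD says <a> must contain <b>, not <c>.
    static final String INVALID_XML = "<?xml version=\"1.0\"?>"
            + "<!DOCTYPE a [<!ELEMENT a (b)> <!ELEMENT b EMPTY>]>"
            + "<a><c/></a>";

    @Override
    public void error(SAXParseException e) throws SAXException {
        throw e;                                   // stop parsing on the first validation error
    }

    @Override
    public void warning(SAXParseException e) {
        System.err.println("warning: " + e.getMessage());
    }

    // Returns true if a validating parse with the given handler ends in a SAXException.
    static boolean rejects(String xml, DefaultHandler handler) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(true);
        try {
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
            return false;
        } catch (SAXException e) {
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("DefaultHandler rejects: " + rejects(INVALID_XML, new DefaultHandler()));
        System.out.println("StrictHandler rejects:  " + rejects(INVALID_XML, new StrictHandler()));
    }
}
```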
  • 105. Error Handling, III • If you are not extending DefaultHandler: – Create a new class (say, MyErrorHandler) that implements ErrorHandler (by supplying the three methods fatalError, error, and warning) – Create a new object of this class – Tell your XMLReader object about it by sending it the following message: setErrorHandler(ErrorHandler handler) • Example: XMLReader parser = saxParser.getXMLReader(); parser.setErrorHandler(new MyErrorHandler());
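A fuller sketch of the same steps, assuming a MyErrorHandler that counts problems and rethrows errors and fatal errors. The wrapper class ErrorHandlerDemo, its parse helper, and the counters are illustrative choices, not part of SAX.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class ErrorHandlerDemo {
    // Standalone error handler: counts each kind of problem, rethrows the serious ones.
    static class MyErrorHandler implements ErrorHandler {
        int warnings, errors, fatals;

        public void warning(SAXParseException e) {
            warnings++;
        }

        public void error(SAXParseException e) throws SAXException {
            errors++;
            throw e;
        }

        public void fatalError(SAXParseException e) throws SAXException {
            fatals++;
            throw e;
        }
    }

    // Parses xml with an XMLReader and returns the handler so its counters can be inspected.
    static MyErrorHandler parse(String xml) throws Exception {
        SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();
        XMLReader parser = saxParser.getXMLReader();
        MyErrorHandler handler = new MyErrorHandler();
        parser.setErrorHandler(handler);
        parser.setContentHandler(new DefaultHandler());    // content is ignored in this sketch
        try {
            parser.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            // already counted by the handler
        }
        return handler;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("fatal errors in <a><b></a>: " + parse("<a><b></a>").fatals);
        System.out.println("fatal errors in <a/>: " + parse("<a/>").fatals);
    }
}
```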