Using DOM
Since XML is a tree-structured document, it has two standard parsing APIs:
- DOM: Reads the entire XML at once and represents it as a tree structure in memory;
- SAX: Reads XML as a stream, using event callbacks.
Let's first look at how to use DOM to read XML.
DOM stands for Document Object Model. The DOM model treats the XML structure as a tree, starting from the root node, where each node can contain any number of child nodes.
Using the following XML as an example:
xml
<?xml version="1.0" encoding="UTF-8" ?>
<book id="1">
    <name>Core Java</name>
    <author>Cay S. Horstmann</author>
    <isbn lang="CN">1234567</isbn>
    <tags>
        <tag>Java</tag>
        <tag>Network</tag>
    </tags>
    <pubDate/>
</book>
If parsed into a DOM structure, it would look approximately like this:
                      ┌─────────┐
                      │document │
                      └─────────┘
                           │
                           ▼
                      ┌─────────┐
                      │  book   │
                      └─────────┘
                           │
     ┌──────────┬──────────┼──────────┬──────────┐
     ▼          ▼          ▼          ▼          ▼
┌─────────┐┌─────────┐┌─────────┐┌─────────┐┌─────────┐
│  name   ││ author  ││  isbn   ││  tags   ││ pubDate │
└─────────┘└─────────┘└─────────┘└─────────┘└─────────┘
                                      │
                                 ┌────┴────┐
                                 ▼         ▼
                             ┌───────┐ ┌───────┐
                             │  tag  │ │  tag  │
                             └───────┘ └───────┘
Notice that the top-level document node represents the XML document itself, which is the true "root". Although <book> is the root element, it is a child node of document.
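The relationship between the document node and the root element can be checked directly. The following is a minimal sketch, assuming the XML is supplied as an in-memory string; the class name DomRootDemo and the helper parse are illustrative and not part of the original text.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class DomRootDemo {
    // Illustrative helper: parses an XML string into a DOM Document.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<book id=\"1\"><name>Core Java</name></book>");
        Element root = doc.getDocumentElement();

        System.out.println(doc.getNodeName());                     // #document
        System.out.println(root.getNodeName());                    // book
        // The root element's parent is the document node itself.
        System.out.println(root.getParentNode().isSameNode(doc));  // true
        // The document node has no parent: it is the top of the tree.
        System.out.println(doc.getParentNode() == null);           // true
    }
}
```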
Java provides the DOM API to parse XML, which uses the following objects to represent the XML content:
- Document: Represents the entire XML document;
- Element: Represents an XML element;
- Attr: Represents an attribute of an element.
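As a rough illustration of how these objects fit together, the sketch below reads attributes from elements. In the JDK's org.w3c.dom package the attribute type is named Attr. The class name DomAttrDemo and the helpers parse and describeAttributes are hypothetical names chosen for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;

class DomAttrDemo {
    // Illustrative helper: parses an XML string into a DOM Document.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    // Formats every attribute of an element as "name=value", separated by spaces.
    static String describeAttributes(Element e) {
        NamedNodeMap attrs = e.getAttributes();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < attrs.getLength(); i++) {
            Attr a = (Attr) attrs.item(i);
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(a.getName()).append('=').append(a.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<book id=\"1\"><isbn lang=\"CN\">1234567</isbn></book>");
        Element book = doc.getDocumentElement();
        Element isbn = (Element) book.getElementsByTagName("isbn").item(0);

        System.out.println(describeAttributes(book));          // id=1
        System.out.println(isbn.getAttribute("lang"));         // CN
        // Attributes are not child nodes: <book> has a single child, <isbn>.
        System.out.println(book.getChildNodes().getLength());  // 1
    }
}
```

Note that getAttribute returns an empty string when the attribute is absent, so hasAttribute is the safer check when absence and an empty value need to be told apart.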
The code to parse an XML document using the DOM API is as follows:
java
InputStream input = Main.class.getResourceAsStream("/book.xml");
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(input);
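For experimenting without a book.xml resource on the classpath, the same calls can be wrapped in a self-contained class. This is only a sketch under stated assumptions: the XML comes from an in-memory string, and the names DomParseDemo, parse, and isWellFormed are made up for this example. It also shows that input which is not well-formed causes parse() to throw a SAXException.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

class DomParseDemo {
    // Parses an XML string; the stream is closed automatically by try-with-resources.
    static Document parse(String xml)
            throws ParserConfigurationException, IOException, SAXException {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        DocumentBuilder db = dbf.newDocumentBuilder();
        try (InputStream input = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))) {
            return db.parse(input);
        }
    }

    // Returns false when the parser rejects the input as not well-formed.
    static boolean isWellFormed(String xml) throws ParserConfigurationException, IOException {
        try {
            parse(xml);
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isWellFormed("<book><name>Core Java</name></book>")); // true
        System.out.println(isWellFormed("<book><name>Core Java</book>"));        // false
    }
}
```

For the malformed input, the JDK's default parser may also print a diagnostic to standard error before the exception is thrown.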
DocumentBuilder.parse() parses an XML document and accepts an InputStream, a File, or a URL given as a String. If parsing succeeds, we obtain a Document object that represents the tree structure of the entire XML document. We need to traverse it to read the values of specific elements:
java
void printNode(Node n, int indent) {
    for (int i = 0; i < indent; i++) {
        System.out.print(' ');
    }
    switch (n.getNodeType()) {
    case Node.DOCUMENT_NODE: // Document node
        System.out.println("Document: " + n.getNodeName());
        break;
    case Node.ELEMENT_NODE: // Element node
        System.out.println("Element: " + n.getNodeName());
        break;
    case Node.TEXT_NODE: // Text node
        System.out.println("Text: " + n.getNodeName() + " = " + n.getNodeValue());
        break;
    case Node.ATTRIBUTE_NODE: // Attribute node
        System.out.println("Attr: " + n.getNodeName() + " = " + n.getNodeValue());
        break;
    default: // Other nodes
        System.out.println("NodeType: " + n.getNodeType() + ", NodeName: " + n.getNodeName());
    }
    for (Node child = n.getFirstChild(); child != null; child = child.getNextSibling()) {
        printNode(child, indent + 1);
    }
}
The parsed structure is as follows:
Document: #document
 Element: book
  Text: #text =
  Element: name
   Text: #text = Core Java
  Text: #text =
  Element: author
   Text: #text = Cay S. Horstmann
  Text: #text =
  ...
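The seemingly empty Text: #text = entries in this output are the line breaks and indentation between elements, which DOM keeps as text nodes. One possible way to skip them during traversal is sketched below; the class name DomWalkDemo and the helpers parse and describe are illustrative, and the sketch collects lines into a list instead of printing so that the result is easy to inspect.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

class DomWalkDemo {
    // Illustrative helper: parses an XML string into a DOM Document.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    // Depth-first walk that records elements and non-blank text, one space of indent per level.
    static List<String> describe(Node n, int indent, List<String> out) {
        StringBuilder pad = new StringBuilder();
        for (int i = 0; i < indent; i++) {
            pad.append(' ');
        }
        if (n.getNodeType() == Node.ELEMENT_NODE) {
            out.add(pad + "Element: " + n.getNodeName());
        } else if (n.getNodeType() == Node.TEXT_NODE && !n.getNodeValue().trim().isEmpty()) {
            // Whitespace-only text nodes (formatting between tags) are skipped.
            out.add(pad + "Text: " + n.getNodeValue().trim());
        }
        for (Node child = n.getFirstChild(); child != null; child = child.getNextSibling()) {
            describe(child, indent + 1, out);
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<book>\n  <name>Core Java</name>\n  <pubDate/>\n</book>");
        for (String line : describe(doc.getDocumentElement(), 0, new ArrayList<>())) {
            System.out.println(line);
        }
    }
}
```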
For the structure parsed by the DOM API, starting from the root node Document, you can traverse the tree to obtain all elements, text data, and comments. Attributes are not child nodes; they are reached through their owning element's getAttributes() method. All of these are collectively referred to as Node. Each Node has its own type, which distinguishes whether a Node is an element, attribute, text, etc.
When using the DOM API, if you need to read the text of a particular element, you must access its child node of type Text, which makes the API somewhat cumbersome to use.
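To make the difference concrete, the sketch below reads an element's text first through its Text child and then through Node.getTextContent(), a standard DOM method that returns the concatenated text of a node and its descendants. The class name DomTextDemo and the helpers parse, textOf, and textsOf are hypothetical names used only for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class DomTextDemo {
    // Illustrative helper: parses an XML string into a DOM Document.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    // Text of the first descendant element with the given tag, or null if there is none.
    static String textOf(Element parent, String tag) {
        NodeList matches = parent.getElementsByTagName(tag);
        if (matches.getLength() == 0) {
            return null;
        }
        return matches.item(0).getTextContent();
    }

    // Text of every descendant element with the given tag, in document order.
    static List<String> textsOf(Element parent, String tag) {
        NodeList matches = parent.getElementsByTagName(tag);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < matches.getLength(); i++) {
            result.add(matches.item(i).getTextContent());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<book id=\"1\"><name>Core Java</name>"
                + "<tags><tag>Java</tag><tag>Network</tag></tags><pubDate/></book>");
        Element book = doc.getDocumentElement();

        // Manual route: element -> first child (a Text node) -> node value.
        System.out.println(book.getElementsByTagName("name").item(0)
                .getFirstChild().getNodeValue());                // Core Java
        // Shorter route via getTextContent().
        System.out.println(textOf(book, "name"));                // Core Java
        System.out.println(textsOf(book, "tag"));                // [Java, Network]
        // An empty element has no Text child; getTextContent() yields "".
        System.out.println(textOf(book, "pubDate").isEmpty());   // true
    }
}
```

Because getTextContent() also includes the text of nested elements, it is best suited to leaf elements such as <name>; calling it on <tags> would return the text of both <tag> children joined together.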
Exercise
Use DOM to parse XML.
Summary
- Java's DOM API can parse XML into a DOM structure, represented by the Document object;
- DOM can fully represent the XML data structure in memory;
- DOM parsing is relatively slow and memory-intensive, because the entire document is read at once and held in memory as a tree.