NewsML Toolkit 2.0 Manual

Contents

License
1. Introduction
2. System Requirements and Setup
2.1. Runtime System Requirements
2.2. Compilation Requirements
3. Quick Start
4. Reading and Writing XML
4.1. Reading XML
4.2. Writing XML
5. Navigating
5.1. Moving Up: Parent and Root Nodes
5.2. Moving Down: Child and Descendant Nodes
5.2.1. Attribute-Based Child Nodes
5.2.2. Element-Based Child Nodes
5.3. Stopping: Leaf Values
5.3.1. Text
5.3.2. DataContent
6. Searching
6.1. NewsML XPath Examples
6.2. Notes for XSLT Users
7. Creating and Modifying
7.1. Creating Nodes
7.1.1. Creating New Nodes
7.1.2. Copying Existing Nodes
7.2. Modifying Nodes
7.2.1. Modifying Values
7.2.2. Inserting Children
7.2.3. Replacing Children
7.2.4. Removing Children
8. Conformance Testing
A. Troubleshooting
B. Minimal Sample NewsML Document
C. References

License

Version 2.0 of the NewsML Toolkit is copyright (c) 2002 by Reuters PLC and is released under the terms of the GNU Lesser General Public License [LGPL], which explicitly allows it to be used in non-free software.

This library is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or (at your option) any later version.

This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with this library; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.

1. Introduction

This manual is designed for computer programmers who want to incorporate version 2.0 of the NewsML Toolkit into their own Java programs. It does not provide a general introduction to XML, NewsML, or Java (though it is possible to use most of the toolkit without any specialized XML knowledge).

The NewsML Toolkit [TOOLKIT] is designed to read and write XML documents based on the NewsML 1.0 Functional Specification [NEWSML]. Most of the XML details are hidden from the user so that she can concentrate on the logical structure of a NewsML package.

To learn more about NewsML and other news industry specifications, you can visit the International Press Telecommunications Council (IPTC) Web site at

http://www.iptc.org/

2. System Requirements and Setup

If you want to build or rebuild the toolkit from source code, you will need to read through this entire chapter. If you just plan on using the library, you hate reading system requirements, and you just want to get up and running quickly, simply add the following JAR files from the NewsML Toolkit distribution to your Java class path:

Different Java environments have different ways of setting the class path. For Sun's reference Java Development Environment, you can set the class path with the CLASSPATH environment variable, separating each entry with a colon (Unix-based systems) or a semicolon (DOS-based systems). If you're using a different environment, read the documentation that came with it.
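
As a rough illustration of the separator difference, the following sketch builds a class path string in a platform-independent way with the standard java.io.File.pathSeparator constant. The class name and the JAR file names are placeholders invented for this example; they are not the names of the files shipped in the distribution.

```java
import java.io.File;

class ClassPathSketch
{
    // Joins class path entries with the platform's separator:
    // ":" on Unix-based systems, ";" on DOS-based systems.
    static String join (String[] entries)
    {
        return String.join(File.pathSeparator, entries);
    }

    public static void main (String[] args)
    {
        // Placeholder names; substitute the JAR files you actually use.
        String[] jars = { "lib/parser.jar", "lib/xpath.jar", "lib/toolkit.jar" };
        System.out.println(join(jars));
    }
}
```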

The rest of this chapter explains the system requirements in more detail, for those who want or need to know.

2.1. Runtime System Requirements

The NewsML Toolkit requires the following, additional Open-Source runtime libraries. Copies of all of these are included in the lib/ subdirectory at the top level of the distribution:

  1. An XML parser supporting SAX2 [SAX] and DOM level 2 [DOM] (tested with Xerces-J 1.4.0 [XERCES]). The DOM level 2 interfaces are also required, but they will usually be bundled with the parser.

    The XML parser handles low-level reading and error-reporting for XML documents.

  2. SAXPath [SAXPATH] (tested with version 1.0beta5).

    The SAXPath library provides parsing support for XPath [XPATH] syntax; together with Jaxen, it makes the NewsML Toolkit XPath-aware.

  3. Jaxen [JAXEN] (tested with version 1.0beta6).

    Together with SAXPath, Jaxen provides XPath support for the NewsML Toolkit. NewsML requires XPath for resolving formal names and bases for choice, and XPath can also be a useful syntax for referring to parts of a document.

If you plan on using the conformance-testing suite in the org.newsml.toolkit.conformance package, you will also require the following:

  1. Gnu Regular Expression library [GNUREGEXP] (tested with 1.1.3).

    This library adds support for regular-expression processing. The conformance suite uses regular expressions extensively in its tests to verify data formats.

2.2. Compilation Requirements

The requirements in this section apply only if you plan to compile or recompile the NewsML Toolkit itself. You might want to do this because you plan to contribute to the toolkit's development, because you want to use a specialized Java compiler (such as one that compiles to machine code), or because you want to keep up to date with the latest CVS version. If you don't plan on doing anything like that (or perhaps don't even understand what some of those things are), feel free to skip this section.

To compile and test the NewsML Toolkit, you will need the following utilities:

  1. Ant [ANT] (tested with version 1.4).

    Ant is a project-building tool, similar to make but written for and in Java. Strictly speaking, this tool is not required -- you can import the Java files into any Java environment you wish. However, the distribution comes with a prewritten build.xml file for Ant, so rebuilding and running the unit tests is as simple as typing "ant" in the root directory of the distribution.

  2. JUnit [JUNIT] (tested with version 3.7).

    JUnit is an extremely popular unit-testing package for Java. The NewsML Toolkit currently contains nearly 500 individual unit tests in the org.newsml.dom.unittests package: these tests help to ensure that changes to the toolkit do not break any existing functionality.

These packages are not included in the main distribution: if you do not have them already, you will have to download and install them yourself.

3. Quick Start

First, you need to install the required JAR libraries as described in the System Requirements and Setup chapter. When everything's in place, try this quick test from the src subdirectory of the NewsML distribution directory:

java NormalizeNewsML ExampleText.xml

(If your development environment has a different way of invoking the Java runtime environment, use it instead.) If everything is installed correctly, you will see a complete XML document scrolled quickly down the screen. If you get any exceptions or error messages, first make sure that you're in the src/ subdirectory of the distribution, then go back and make sure that all of the required libraries are actually on your class path and that there are no other versions shadowing them.

Now that you know that the toolkit is installed and functioning properly, you can try writing your first program with it. This chapter walks step-by-step through a simple Java program to load, examine, modify, and save a NewsML document (the document used in the examples is in the appendix Minimal Sample NewsML Document).

First, we need to load the NewsML document into the toolkit; that requires two lines of code:

NewsMLFactory factory = new DOMNewsMLFactory();
NewsML newsml = factory.createNewsML(input_file);

The first line creates a NewsMLFactory for building NewsML nodes, and the second uses the factory to open a NewsML document and return the root NewsML node. It doesn't matter if input_file is a local filename or a remote URL; the toolkit will work with either (local files will be faster and more secure, of course). The NewsMLFactory.createNewsML(String) method will throw a regular Java IOException if there is any problem loading the document.

Next, let's pull some information out of the document. NewsML 1.0 requires every NewsML document to contain a NewsEnvelope and every NewsEnvelope to contain a DateAndTime stating when the NewsML package was sent. The following code extracts that date string and prints it to standard output:

Text date = newsml.getNewsEnvelope().getDateAndTime();
System.out.println("Information sent at " + date.toString());

The method calls simply follow the path of the document: the NewsML root contains a NewsEnvelope, the NewsEnvelope contains the DateAndTime, and the DateAndTime contains text. (Normally, it is not quite this simple, because many nodes are optional and the getters may return null values, which you will have to check.)
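
A more defensive version of that lookup might look like the following sketch. The nested interfaces are minimal stand-ins written for this example so that it compiles without the toolkit; only the getNewsEnvelope() and getDateAndTime() method names are taken from the text above, and the class and helper names are hypothetical.

```java
class NullCheckSketch
{
    // Minimal stand-ins for the toolkit interfaces (assumptions for this
    // example only; the real interfaces have many more methods).
    interface Text { String toString(); }
    interface NewsEnvelope { Text getDateAndTime(); }
    interface NewsML { NewsEnvelope getNewsEnvelope(); }

    // Returns the date string, or null if any node on the path is missing.
    static String sentDate (NewsML newsml)
    {
        if (newsml == null)
            return null;
        NewsEnvelope envelope = newsml.getNewsEnvelope();
        if (envelope == null)
            return null;
        Text date = envelope.getDateAndTime();
        return (date == null) ? null : date.toString();
    }

    public static void main (String[] args)
    {
        // A package whose envelope has no DateAndTime node.
        NewsEnvelope noDate = () -> null;
        NewsML newsml = () -> noDate;
        String date = sentDate(newsml);
        System.out.println(date == null ? "No date available" : date);
    }
}
```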

Next, let's assume that the date in the sample document ("20020411T132700-0400", or 11 April 2002 at 1:27pm EDT) is incorrect, and we want to change it to 12 April:

date.setString("20020412T132700-0400");
System.out.println("Date changed to " + date.toString());
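
The example above hard-codes the corrected date string. If you would rather compute it, one possible approach is sketched below. It uses the java.time classes from newer Java releases (Java 8 and later), which are not part of the toolkit, and the class and method names are invented for this example. The toolkit itself simply stores whatever string you pass to setString.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;

class DateShiftSketch
{
    // Pattern for date strings such as "20020411T132700-0400".
    static final DateTimeFormatter FORMAT =
        DateTimeFormatter.ofPattern("yyyyMMdd'T'HHmmssZ");

    // Returns the same date string moved forward by the given number of days.
    static String plusDays (String date, long days)
    {
        return OffsetDateTime.parse(date, FORMAT).plusDays(days).format(FORMAT);
    }

    public static void main (String[] args)
    {
        // The result could then be passed to date.setString(...).
        System.out.println(plusDays("20020411T132700-0400", 1));
    }
}
```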

Now that the document has been corrected, we can write it back out as XML:

FileWriter output = new FileWriter(output_file);
newsml.writeXML(output, true);
output.close();

The BaseNode.writeXML(Writer, boolean) method writes the XML document to an output character stream; if you prefer a string, you can use BaseNode.toXML(boolean).

That's it -- you've loaded, examined, modified, and saved a NewsML document using the toolkit. Of course, a full Java application always requires some extra code for importing classes, checking arguments, catching exceptions, and so on. Here's a complete Java application based on the code above:

import java.io.IOException;
import java.io.FileWriter;

import org.newsml.toolkit.NewsML;
import org.newsml.toolkit.NewsMLFactory;
import org.newsml.toolkit.Text;

import org.newsml.toolkit.dom.DOMNewsMLFactory;

public class NewsMLExample
{

    public static void main (String args[])
    {
	if (args.length != 2) {
	    System.err.println("Usage: java NewsMLExample <input> <output>");
	    System.exit(2);
	}
	String input_file = args[0];
	String output_file = args[1];

	NewsMLFactory factory = new DOMNewsMLFactory();

	try {
	    NewsML newsml = factory.createNewsML(input_file);
	    Text date = newsml.getNewsEnvelope().getDateAndTime();
	    System.out.println("Information sent at " + date.toString());
	    date.setString("20020412T132700-0400");
	    System.out.println("Date changed to " + date.toString());
	    FileWriter output = new FileWriter(output_file);
	    newsml.writeXML(output, true);
	    output.close();
	} catch (IOException e) {
	    System.err.println("Failed to open NewsML document "
			       + args[0] + ": " + e.getMessage());
	    System.exit(-1);
	}
    }

}

Enjoy. You can probably do a lot of useful work now just by browsing the JavaDoc, but the rest of this manual provides more information and examples if you need them.

4. Reading and Writing XML

The NewsML Toolkit is designed to allow programmers to read and write XML documents without any specialized XML knowledge.

4.1. Reading XML

The methods for reading XML appear in the org.newsml.toolkit.NewsMLFactory interface:

public NewsML createNewsML (String url)
  throws IOException;

public NewsML createNewsML (Reader input, String baseURL)
  throws IOException;

public BaseNode createNode (String url)
  throws IOException;

public BaseNode createNode (Reader input, String baseURL)
  throws IOException;

The createNewsML methods read a full NewsML document (with the root element NewsML), and the createNode methods read a NewsML document with any root element, such as Catalog or TopicSet. Each of these comes in two flavours: one that reads the document over the Web from a URL, and one that reads the document from any Java Reader.

When you read the document from a Java Reader, you may also supply a base URL for resolving references in the NewsML document; if you provide a null argument, the method will generate a base URL based on the current directory.

Here is a simple example that reads a NewsML document directly from an external URL:

import java.io.IOException;
import org.newsml.toolkit.NewsML;
import org.newsml.toolkit.NewsMLFactory;
import org.newsml.toolkit.dom.DOMNewsMLFactory;

public NewsML read_newsml (String url)
{
  NewsML newsml = null;
  NewsMLFactory factory = new DOMNewsMLFactory();
  try {
    newsml = factory.createNewsML(url);
  } catch (IOException e) {
    System.err.println("Failed to read NewsML from " + url + ": "
                       + e.getMessage());
  }
  return newsml;
}

(The org.newsml.toolkit.dom.DOMNewsMLFactory class is an implementation of the NewsMLFactory interface, built on top of the Document Object Model [DOM]. In version 2.0 of the toolkit, it is the only implementation provided, but others may appear in the future.)

Here is another example, that reads a NewsML document from a file and lets the factory build a default base URL:

NewsML newsml = null;
NewsMLFactory factory = new DOMNewsMLFactory();
try {
  FileReader reader = new FileReader("mynewsml.xml");
  newsml = factory.createNewsML(reader, null);
} catch (IOException e) {
  System.err.println("Error reading mynewsml.xml: " + e.getMessage());
}

To read a NewsML document from a string, simply use the Java StringReader class instead of FileReader (both are in the java.io package).

Sometimes the root element of the NewsML document is not NewsML. For example, some providers publish standalone catalogs and topic sets, where the root element is Catalog or TopicSet; in other cases, you might want to create any arbitrary NewsML element from XML text provided by a user. In both cases, you need to use the createNode methods and cast to the correct type. Here is an example that creates a comment node from an XML string:

Comment comment = null;
NewsMLFactory factory = new DOMNewsMLFactory();
String xmlData = "<Comment Duid=\"XXX\">Hello, world!</Comment>";
try {
  comment = (Comment)factory.createNode(new StringReader(xmlData), null);
} catch (IOException e) {
  System.err.println("Error parsing XML string: " + e.getMessage());
}

Finally, here is a second example that reads a topic set from a (fictitious) remote URL:

TopicSet people = null;
NewsMLFactory factory = new DOMNewsMLFactory();
try {
  people = 
    (TopicSet)factory.createNode("http://www.acmenews.net/people.xml");
} catch (IOException e) {
  System.err.println("Failed to read topic set: " + e.getMessage());  
}

4.2. Writing XML

The methods for writing XML appear in the org.newsml.toolkit.BaseNode interface, which all NewsML nodes implement:

public void writeXML (Writer output, boolean isDocument)
  throws IOException;

public void writeXML (Writer output, String encoding, String internalSubset)
  throws IOException;

public String toXML (boolean isDocument);

public String toXML (String encoding, String internalSubset);

The writeXML methods write to any Java Writer, while the toXML methods (analogous to java.lang.Object.toString) write the XML representation to a string and return the string.

Each of the methods comes in two flavours. The first has an isDocument flag that determines whether the XML should be written as a fragment or as a standalone document with its own XML declaration and document-type declaration. The second provides greater control, by allowing the user to specify a character encoding and internal DTD subset to be included.

For example, to write a TopicSet out as a document fragment to the file "newsml-out.xml", you could use the following:

FileWriter output = new FileWriter("newsml-out.xml");
topicset.writeXML(output, false);
output.close();

Since isDocument is false, this method will write out something like this:

<TopicSet FormalName="places">
 ...
</TopicSet>

This is perfectly good XML by itself, but contains no information about the character encoding and no document type declaration. In case you want those (or need them to be fully NewsML-compliant), you can get a default XML declaration and document type declaration by setting isDocument to true:

topicset.writeXML(output, true);

If you originally loaded the document from XML, you will get the original public identifier, system identifier, and internal DTD subset. The result might look something like this:

<?xml version="1.0"?>

<!DOCTYPE TopicSet SYSTEM "NewsMLv1.0.dtd">

<TopicSet FormalName="places">
 ...
</TopicSet>

For even finer control, you can use a different version of the method where you can provide your own character encoding and internal DTD subset:

topicset.writeXML(output, "ISO-8859-1", "<!ENTITY acme \"ACME News Inc.\">");

The result might look like this:

<?xml version="1.0" encoding="ISO-8859-1"?>

<!DOCTYPE TopicSet SYSTEM "NewsMLv1.0.dtd" [
<!ENTITY acme "ACME News Inc.">
]>

<TopicSet FormalName="places">
 ...
</TopicSet>

The toXML methods work exactly the same way, except that they do not take a Writer argument and they return the XML as a string.
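
The fragment/document distinction behind the isDocument flag is not specific to the toolkit. As a rough analogy only, the JDK's own serializer (javax.xml.transform) has a similar switch for the XML declaration. The sketch below uses it in place of writeXML so that it can run without the toolkit; the class and method names are made up for this example, and it does not reproduce the document type declaration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class FragmentSketch
{
    // Serializes XML text either as a fragment (no XML declaration)
    // or as a standalone document (with an XML declaration).
    static String serialize (String xml, boolean isDocument)
    {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION,
                                isDocument ? "no" : "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main (String[] args)
    {
        String xml = "<TopicSet FormalName=\"places\"/>";
        System.out.println(serialize(xml, false));
        System.out.println(serialize(xml, true));
    }
}
```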

5. Navigating

The NewsML Toolkit stores NewsML nodes in a tree. Every node except the root (usually NewsML, but sometimes TopicSet, Catalog, or another type) has a parent node, and every node except the leaf nodes has children.

(Actually, things are not quite that simple. The design of NewsML requires that an API retain some lexical details of XML markup, and one of those details is the distinction between XML elements and attributes. More on that point later.)

The navigation methods in the toolkit make it possible to navigate from any node to any other node in the same NewsML document. This approach works well when you know the exact path from where you are to where you want to go; for an alternative approach, see the Searching chapter.

5.1. Moving Up: Parent and Root Nodes

To get a node's parent, you use the BaseNode.getParent() method, and then cast to the appropriate type:

NewsItem newsitem = (NewsItem)identification.getParent();

When a node can have several different types of parent, you can use the BaseNode.getXMLName() method to figure out what to cast to.

Since every node except the root has a non-null parent, you can find the root easily from any node like this:

BaseNode root = node;
while (root.getParent() != null)
  root = root.getParent();

5.2. Moving Down: Child and Descendant Nodes

The best way to get a child node is to use the appropriate specialized method in the node's interface.

For example, to get the second ByLine node inside a NewsLines node, use the NewsLines.getByLine(int) method:

OriginText byline = newslines.getByLine(1);

If you want to know how many ByLine children are available, use the NewsLines.getByLineCount() method:

int nByLines = newslines.getByLineCount();
for (int i = 0; i < nByLines; i++) {
  process_byline(newslines.getByLine(i));
}

Finally, to get all of the ByLine children in a single array, use the NewsLines.getByLine() method:

OriginText bylines[] = newslines.getByLine();

All of the NewsML interfaces follow exactly this pattern for child nodes that can appear more than once: there is a get*Count() method that returns the number of children, a get*(int) method that returns a specific child, and a get*() method that returns all of the children in a single array. For example, the org.newsml.toolkit.NewsEnvelope interface has NewsEnvelope.getNewsServiceCount() and NewsEnvelope.getNewsProductCount(), NewsEnvelope.getNewsService(int) and NewsEnvelope.getNewsProduct(int), and NewsEnvelope.getNewsService() and NewsEnvelope.getNewsProduct().

When a child type can appear only once, there is no need to get a count or to collect all children of the same type into an array, so an interface will have simply a get*() method to return the child, such as NewsEnvelope.getDateAndTime():

IdText date = newsenvelope.getDateAndTime();

These accessors work well when you are writing code to follow a specific path through a NewsML document. There will be times, however, when you want to write more generic code that is not tied to specific NewsML node types (such as interactive editors or search engines). In these cases, you can take advantage of the generic methods in the org.newsml.toolkit.BaseNode interface.

To use the generic methods, you need to know first whether the node you want is represented in the XML markup as an element or an attribute.

5.2.1. Attribute-Based Child Nodes

In XML documents, attributes have several special properties:

  • They are not repeatable: a parent element may have only one attribute with any single name.

  • They are unordered: the order in which attributes appear in a document is not allowed to matter, and they have no ordering relative to elements.

  • They may not contain children: attributes always represent leaf nodes.

The navigation methods in the node-specific NewsML interfaces hide these differences, but the generic navigation methods cannot.

To get a child node represented by an attribute, you use the BaseNode.getAttr(String) method:

Text rank = node.getAttr("Rank");

Note that, like most leaf nodes, attribute-based nodes (currently) all implement the org.newsml.toolkit.Text interface. You'll need to use the Text.toInt(), Text.toBoolean(), or Text.toString() method to get the actual attribute value.

Text rankNode = node.getAttr("Rank");
int rank = 0;  // arbitrary default for this example when the attribute is absent
if (rankNode != null)
  rank = rankNode.toInt();

5.2.2. Element-Based Child Nodes

Because elements are repeatable, are ordered, and can have their own child nodes, the generic navigation methods for nodes based on elements are more complex than for those based on attributes.

The primitive methods for retrieving element-based child nodes are BaseNode.getChildCount() and BaseNode.getChild(int):

int nChildren = node.getChildCount();
for (int i = 0; i < nChildren; i++) {
  BaseNode child = node.getChild(i);
  System.out.println("Child " + i + " is a " + child.getXMLName());
}

The getChildCount method counts only element-based nodes, not attribute-based nodes. There is also a BaseNode.getChild() method for retrieving all of the element-based child nodes in a single array:

BaseNode children[] = node.getChild();

Much of the time, however, you already know the XML element name of the node you're looking for, so you're interested only in child nodes with that name. In that case, you can use the derived convenience methods BaseNode.getChildCount(String) and BaseNode.getChild(String, int):

int nChildren = node.getChildCount("Comment");
for (int i = 0; i < nChildren; i++)
  process_comment((Comment)node.getChild("Comment", i));

Note that the various getChild methods return the type BaseNode, so you must cast down to the appropriate type (as in the previous example).

Finally, there is a BaseNode.getChild(String) method for retrieving all of the element-based child nodes with a specified name in a single array (note that the array will be of type BaseNode[], not of the derived interface type):

BaseNode comments[] = node.getChild("Comment");

5.3. Stopping: Leaf Values

While the higher-level structure of a NewsML document is important, eventually you need to get at the basic information like dates, names, places, and numbers. In the NewsML 1.0 XML format [NEWSML], this information exists as character data and attribute values, and represents the leaf nodes of the NewsML tree. There are two different ways of accessing the information, depending on whether it occurs as part of the NewsML markup, or as part of some other markup language inside a DataContent element.

5.3.1. Text

Nearly all nodes that contain leaf data implement the org.newsml.toolkit.Text interface or one of its subinterfaces, and the Text.toString(), Text.toInt(), and Text.toBoolean() methods provide the standard way to get leaf values for NewsML.

Internally, leaf values are always managed as strings; the int and boolean methods are simply for convenience.

The following example gets the text of a comment:

String text = comment.toString();

The following example gets the revision number of a NewsIdentifier:

int rev = newsIdentifier.getRevisionId().toInt();

The toBoolean method returns true if the value is 'y' and false otherwise.

5.3.2. DataContent

There is one type of leaf data that does not use the Text interface. The DataContent element in NewsML contains an inline payload in a NewsML document, which is not (usually) in the NewsML format. The org.newsml.toolkit.DataContent interface provides several different ways of accessing the non-NewsML data, including the getText, getXMLString, and getDOMNodes methods described below.

For example, consider the following data content (using a very simple, fictitious sports markup language):

<DataContent>
 <SportsScore>
  <sport>hockey</sport>
  <team>
   <name>Montreal</name>
   <score>3</score>
  </team>
  <team>
   <name>Boston</name>
   <score>2</score>
  </team>
 </SportsScore>
</DataContent>

The getText method would return a Java string containing the above example with all XML tags stripped out (and any entity references expanded):


  hockey

   Montreal
   3


   Boston
   2


The getXMLString method would return all of the above example except for the opening and closing DataContent tags in a single Java String:

 <SportsScore>
  <sport>hockey</sport>
  <team>
   <name>Montreal</name>
   <score>3</score>
  </team>
  <team>
   <name>Boston</name>
   <score>2</score>
  </team>
 </SportsScore>

Finally, the getDOMNodes method would return a list of DOM nodes representing the SportsScore element and any whitespace text around it.

Obviously, getText is not that useful if the data content contains XML markup; it is designed mainly for payloads like plain (non-XML) text, including base64-encoded binary objects like photos.
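
To see what "all XML tags stripped out" means in practice, the following sketch performs the same kind of stripping with the JDK's DOM parser and its getTextContent method. It stands in for DataContent.getText so that it can run without the toolkit; the class and method names are invented for this example, and the sample markup has been written on one line, so the whitespace differs from the listing above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class StripTagsSketch
{
    // Returns the character data of an XML string with all tags removed
    // and entity references expanded.
    static String textOf (String xml)
    {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main (String[] args)
    {
        String xml =
            "<DataContent><SportsScore><sport>hockey</sport>"
            + "<team><name>Montreal</name><score>3</score></team>"
            + "<team><name>Boston</name><score>2</score></team>"
            + "</SportsScore></DataContent>";
        System.out.println(textOf(xml));
    }
}
```

Because every tag disappears, the team names and scores run together, which is why getText is a poor fit for payloads that carry XML markup.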

6. Searching

The NewsML Toolkit is not primarily designed as a search or query engine; however, the built-in XPath [XPATH] support does allow for certain types of contextual searching.

The org.newsml.toolkit.NewsMLSession interface includes two methods for executing XPath queries:

public BaseNode [] getNodesByXPath (BaseNode contextNode, String xpath)
  throws NewsMLException;

public BaseNode getNodeByXPath (BaseNode contextNode, String xpath)
  throws NewsMLException;

The first method returns an array of all nodes that match the XPath expression; the second returns only the first match.

The current session is accessible from any node through the org.newsml.toolkit.BaseNode interface, so a typical query might look like this:

BaseNode matches[] = node.getSession().getNodesByXPath(node, xpath);

6.1. NewsML XPath Examples

This section contains some NewsML-specific examples of XPath queries.

Find every node using Swiss French:

//node()[@xml:lang="fr-CH"]

Find the closest ancestor NewsComponent:

ancestor-or-self::NewsComponent[1]

Find every content item using the "image/jpeg" MIME type:

//ContentItem[MimeType/@FormalName="image/jpeg"]

Find every news item that contains an update:

//NewsItem/Update

Note that you must cast each result from BaseNode to the appropriate type.

6.2. Notes for XSLT Users

If your only experience with XPath is through XSLT [XSLT], you may find that the getNodesByXPath and getNodeByXPath methods do not behave exactly as you expect. For example, if you wanted to find every comment that appears inside a NewsComponent, you might be tempted to try

  // WRONG!!
BaseNode matches[] = session.getNodesByXPath(node, "NewsComponent/Comment");

However, this XPath query will return nothing at all unless the context node happens itself to contain a NewsComponent child with a Comment child. XSLT uses a processing model that usually iterates through an entire XML document, trying an XPath expression with each node as context node: that's the only reason that an expression like "NewsComponent/Comment" works in XSLT. For general XPath use, there is only a single context node. To find every Comment inside a NewsComponent descendant of the node, you need to enter the expression like this:

  // Correct for a specific subtree.
BaseNode matches[] = session.getNodesByXPath(node, ".//NewsComponent/Comment");

The leading ".//" tells the XPath engine to look for any NewsComponent descendant of the context node rather than just its immediate children. (Without the leading ".", an expression beginning with "//" is absolute and starts from the document root.) To search the entire document rather than just the descendants of the context node, you need to use something like

  // Correct for the entire document.
BaseNode matches[] = 
  session.getNodesByXPath(node, "/descendant::NewsComponent/Comment");

For more information, review the XPath specification listed in the References appendix.
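
The difference between relative and absolute expressions can be checked with the XPath engine bundled with the JDK (javax.xml.xpath), which is used here in place of the toolkit's session methods so that the sketch runs on its own. The sample document and the class and method names are made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathContextSketch
{
    // Two news items; the first holds a nested NewsComponent.
    static final String XML =
        "<NewsML>"
        + "<NewsItem><NewsComponent><Comment>a</Comment>"
        + "<NewsComponent><Comment>b</Comment></NewsComponent>"
        + "</NewsComponent></NewsItem>"
        + "<NewsItem><NewsComponent><Comment>c</Comment></NewsComponent></NewsItem>"
        + "</NewsML>";

    // Parses the sample and returns its first NewsItem element.
    static Node firstNewsItem ()
    {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            return doc.getElementsByTagName("NewsItem").item(0);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Counts the nodes that an XPath expression selects from a context node.
    static int count (Node context, String xpath)
    {
        try {
            NodeList matches = (NodeList)XPathFactory.newInstance().newXPath()
                .evaluate(xpath, context, XPathConstants.NODESET);
            return matches.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main (String[] args)
    {
        Node item = firstNewsItem();
        // Relative: only NewsComponent children of the context node.
        System.out.println(count(item, "NewsComponent/Comment"));
        // Relative with descendants: the context node's whole subtree.
        System.out.println(count(item, ".//NewsComponent/Comment"));
        // Absolute: the entire document, whatever the context node is.
        System.out.println(count(item, "/descendant::NewsComponent/Comment"));
    }
}
```

With the first NewsItem as context node, the three expressions select one, two, and three Comment nodes respectively.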

7. Creating and Modifying

Starting with version 2.0, the NewsML Toolkit includes support for creating new nodes and modifying existing ones. Client applications can now use the toolkit to perform simple or complex modifications on a NewsML package before saving it back to XML, or even to create a new NewsML package entirely from scratch.

7.1. Creating Nodes

Creating nodes is very similar to reading from an XML document (see 4.1. Reading XML); in fact, reading an XML document is just a special case of creating a new node. The methods for creating new NewsML nodes also appear in the org.newsml.toolkit.NewsMLFactory interface, and are divided into two groups: methods for creating new, empty nodes, and methods for copying existing nodes.

7.1.1. Creating New Nodes

In the org.newsml.toolkit.NewsMLFactory interface, there is a no-argument create method for every XML element and attribute type in NewsML 1.0 [NEWSML]. Here are some examples:

public OriginText createByLine ()
  throws IOException;

public NewsML createNewsML ()
  throws IOException;

public Text createAssignedByAttr ()
  throws IOException;

public Text createContextAttr ()
  throws IOException;

The factory methods for nodes represented in NewsML by XML attributes always end in "Attr" and always return an instance of org.newsml.toolkit.Text; the factory methods for nodes represented by XML elements return the appropriate node type.

Note that in version 2.0, the NewsML Toolkit does not yet have DTD knowledge built in. When you create a new node, it will be empty, even if it is based on an XML element that has required attributes or child elements in the NewsML 1.0 DTD. For example, the NewsML 1.0 DTD requires that the XML NewsItem element have at least Identification and NewsManagement child elements; however, if you invoke

NewsItem item = factory.createNewsItem();
System.out.print(item.toXML(false));

the output will be simply

<NewsItem></NewsItem>

It is the application's responsibility to ensure that the NewsML document is valid and conformant.

If the application has not saved a reference to the current NewsMLFactory, it can look up the reference through the current session, which is available through the BaseNode.getSession() method:

NewsMLFactory factory = node.getSession().getFactory();

NewsMLFactory also has the generic methods NewsMLFactory.createNewNode(String) and NewsMLFactory.createNewNodeAttr(String) for creating any type of node. Note that these require you to know whether the node is represented by an element or attribute in the NewsML 1.0 XML markup, and that they also require you to cast the result to the appropriate type. Here is an example:

NewsEnvelope envelope = (NewsEnvelope)factory.createNewNode("NewsEnvelope");

7.1.2. Copying Existing Nodes

The NewsMLFactory interface also contains support for copying existing NewsML nodes. For every factory method like this

public NewsEnvelope createNewsEnvelope()
  throws IOException;

there is also one like this

public NewsEnvelope createNewsEnvelope(NewsEnvelope node) 
  throws IOException;

The second type of method will create a deep copy of the argument provided: any modifications to the copy will not affect the original. Note also that the copy will not have any parent until you add it to a NewsML document explicitly. Here is an easy way to clone a NewsComponent:

NewsComponent component2 = factory.createNewsComponent(component);
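
The two properties of a deep copy mentioned above (changes to the copy leave the original alone, and the copy starts out without a parent) are the same ones the standard DOM offers through Node.cloneNode(true). The sketch below demonstrates them with the JDK's DOM classes rather than the toolkit's factory methods, so that it runs on its own; the class and method names are invented for this example.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class DeepCopySketch
{
    // Builds <NewsComponent><Comment>original</Comment></NewsComponent>
    // attached to a fresh document.
    static Element sample ()
    {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
            Element component = doc.createElement("NewsComponent");
            Element comment = doc.createElement("Comment");
            comment.setTextContent("original");
            component.appendChild(comment);
            doc.appendChild(component);
            return component;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main (String[] args)
    {
        Element original = sample();
        Element copy = (Element)original.cloneNode(true);  // deep copy
        copy.getFirstChild().setTextContent("changed");

        System.out.println(original.getTextContent());     // prints "original"
        System.out.println(copy.getTextContent());         // prints "changed"
        System.out.println(copy.getParentNode() == null);  // prints "true"
    }
}
```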

As with creating, there are also generic methods NewsMLFactory.createNewNode(BaseNode) and NewsMLFactory.createNewNodeAttr(Text) for copying any type of NewsML node:

NewsEnvelope envelope_copy = (NewsEnvelope)factory.createNewNode(envelope);

Again, with the generic methods, you must know whether NewsML 1.0 represents the node in XML as an element or an attribute.

7.2. Modifying Nodes

The major change between version 1.1 and version 2.0 of the NewsML Toolkit is the ability to modify a NewsML document programmatically. There are two types of modification possible:

  1. You can change the data value of a node.
  2. You can add to, take away from, or rearrange the children of a node (and by extension, change a node's position in the tree).

7.2.1. Modifying Values

Starting with version 2.0, the NewsML Toolkit allows client applications to modify the leaf values of nodes. For example, the application can change a date, correct a spelling mistake, or reassign a formal name using the methods described in this section. When the application is creating a new NewsML document from scratch rather than modifying an existing one, these methods allow the application to populate the document with actual information.

Most of the time, you will be using three key methods in the org.newsml.toolkit.Text interface: Text.setBoolean(boolean), Text.setInt(int), and Text.setString(String); these correspond to the accessor methods Text.toBoolean(), Text.toInt(), and Text.toString().

For example, to set the text of a BasisForChoice, you could use the following:

basisForChoice.setString("//Format/@FormalName");

As a slightly more complex example, to increment the RevisionId of a NewsIdentifier, you could use the following:

id.getRevisionId().setInt(id.getRevisionId().toInt()+1);

Internally, leaf values are always managed as strings; the int and boolean methods are simply for convenience.

There are also DataContent.setText(String), DataContent.setXMLString(String), and DataContent.setDOMNodes(NodeList) methods corresponding to the accessor methods described in the DataContent section.

The corresponding setText method is convenient for setting raw, non-XML text like base64-encoded graphics or encryption keys, because it does not require you to escape any special XML characters like '<' or '&':

content.setText("AT&T");

The setXMLString method is a little more restrictive than the getXMLString method: the string returned by getXMLString might not be a well-formed XML document, since there may be more than one top-level element; setXMLString requires a single root element and a well-formed XML document, since it will actually parse the string as XML. Note that entity references might cause problems. Here is an example of setting a very simple piece of XML content:

content.setXMLString("<company>AT&amp;T</company>");

Finally, the setDOMNodes method is useful if your XML is already parsed in memory, and you want to avoid the overhead of parsing it again. It is the application's responsibility to ensure that the document remains DTD valid, if validation is being performed.
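
Because setXMLString expects a well-formed document with a single root element, an application may want to test a string before handing it over. The following is a minimal sketch that uses only the JDK's own parser (javax.xml.parsers); the class XmlContentCheck and its methods are hypothetical names, and the toolkit may report errors differently from what is shown here. The same parse also yields a NodeList of the kind that setDOMNodes accepts.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class XmlContentCheck {
    /** Parses the string as a complete XML document; returns null if it is not well-formed. */
    static Document parseOrNull(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to System.err.
            builder.setErrorHandler(new DefaultHandler());
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            return null;
        }
    }

    static boolean isWellFormed(String xml) {
        return parseOrNull(xml) != null;
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<company>AT&amp;T</company>")); // true
        System.out.println(isWellFormed("<company>AT&T</company>"));     // false: unescaped '&'
        System.out.println(isWellFormed("<a/><b/>"));                     // false: two root elements

        // An already-parsed document exposes its top-level nodes as a NodeList.
        Document doc = parseOrNull("<company>AT&amp;T</company>");
        NodeList nodes = doc.getChildNodes();
        System.out.println(nodes.getLength());                            // 1
    }
}
```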

7.2.2. Inserting Children

The primitive method for inserting a new child into a node is BaseNode.insertChild(int, BaseNode). This method inserts a child base node into the current node at the absolute position specified, pushing any child originally in that position (and all subsequent children) forward by one position. All positions are zero-indexed.

For example, consider the following NewsML XML markup:

<Resource>
 <Urn>urn:xxx:yyy</Urn>
 <Url>http://www.acmenews.com/vocab/</Url>
</Resource>

The Resource (an instance of org.newsml.toolkit.Resource) has two child nodes, Urn and Url (both instances of org.newsml.toolkit.IdText). The following code adds a new Url in position 1:

resource.insertChild(1, new_url);

Depending on the contents of new_url, the resulting XML rendition would look something like this:

<Resource>
 <Urn>urn:xxx:yyy</Urn>
 <Url>http://www.newsrus.com/metadata/</Url>
 <Url>http://www.acmenews.com/vocab/</Url>
</Resource>

An application can use this method to insert a new child anywhere in the NewsML tree. Version 2.0 of the toolkit does not apply DTD constraints, so it is the responsibility of the application to ensure that the document remains NewsML-conformant after the new child has been inserted.

An index of -1 will always insert the child in the last position, so

resource.insertChild(-1, new_url);

would result in something like

<Resource>
 <Urn>urn:xxx:yyy</Urn>
 <Url>http://www.acmenews.com/vocab/</Url>
 <Url>http://www.newsrus.com/metadata/</Url>
</Resource>

This convention is useful for appending new children to a node incrementally.
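
The index rules described above can be modelled without the toolkit. In the sketch below a plain java.util.List stands in for a node's children; the helper insertAt is hypothetical and only mimics the behaviour described for BaseNode.insertChild(int, BaseNode), including the -1 convention.

```java
import java.util.ArrayList;
import java.util.List;

public class InsertIndexSketch {
    /** Hypothetical stand-in for the insertChild index rule; not toolkit code. */
    static <T> void insertAt(List<T> children, int index, T child) {
        if (index == -1) {
            children.add(child);          // -1 always means "last position"
        } else {
            children.add(index, child);   // existing children from index on shift forward by one
        }
    }

    /** The two children of the Resource example, as simple strings. */
    static List<String> resourceChildren() {
        List<String> children = new ArrayList<>();
        children.add("Urn urn:xxx:yyy");
        children.add("Url http://www.acmenews.com/vocab/");
        return children;
    }

    public static void main(String[] args) {
        String newUrl = "Url http://www.newsrus.com/metadata/";

        List<String> atOne = resourceChildren();
        insertAt(atOne, 1, newUrl);
        System.out.println(atOne.indexOf(newUrl));   // 1: the old Url moved to position 2

        List<String> atEnd = resourceChildren();
        insertAt(atEnd, -1, newUrl);
        System.out.println(atEnd.indexOf(newUrl));   // 2: appended after the old Url
    }
}
```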

It is not always convenient to provide an absolute position for inserting a child, however; in many cases, an application will need to add a child to the beginning or end of a list of similar children. For that, the methods BaseNode.insertBefore(String, int, BaseNode) and BaseNode.insertAfter(String, int, BaseNode) are helpful:

node.insertBefore("Comment", 0, comment);

This example will insert comment immediately before the first existing Comment node.

There are several other convenience methods for inserting. Here is the entire list:

public BaseNode insertChild (int index, BaseNode child);
public BaseNode[] insertChild (int index, BaseNode children[]);
public BaseNode insertFirst (BaseNode child);
public BaseNode[] insertFirst (BaseNode children[]);
public BaseNode insertLast (BaseNode child);
public BaseNode[] insertLast (BaseNode children[]);
public BaseNode insertBefore (String name, int index, BaseNode child);
public BaseNode insertBefore (String name, BaseNode child);
public BaseNode[] insertBefore (String name, int index, BaseNode children[]);
public BaseNode[] insertBefore (String name, BaseNode children[]);
public BaseNode insertBeforeDuid (String duid, BaseNode child);
public BaseNode[] insertBeforeDuid (String duid, BaseNode children[]);
public BaseNode insertAfter (String name, int index, BaseNode child);
public BaseNode insertAfter (String name, BaseNode child);
public BaseNode[] insertAfter (String name, int index, BaseNode children[]);
public BaseNode[] insertAfter (String name, BaseNode children[]);
public BaseNode insertAfterDuid (String duid, BaseNode child);
public BaseNode[] insertAfterDuid (String duid, BaseNode children[]);

For more information, see the documentation for org.newsml.toolkit.BaseNode.

7.2.3. Replacing Children

Replacing children is like Inserting Children, except that the child at the existing index is removed rather than being pushed forward, and the replace methods always return the node that has been removed rather than any nodes that have been added. The fundamental method is BaseNode.replaceChild(int, BaseNode). Consider the following document:

<Resource>
 <Urn>urn:xxx:yyy</Urn>
 <Url>http://www.acmenews.com/vocab/</Url>
</Resource>

When the application executes the following command

resource.replaceChild(1, url);

the URL in position 1 will be removed, and the new URL will be added in its place:

<Resource>
 <Urn>urn:xxx:yyy</Urn>
 <Url>http://www.newsrus.com/metadata/</Url>
</Resource>

As with inserting, there is an extensive collection of convenience methods in the org.newsml.toolkit.BaseNode interface:

public BaseNode replaceChild (int index, BaseNode child);
public BaseNode replaceChild (int index, BaseNode children[]);
public BaseNode replaceChild (String xmlName, int index, BaseNode child);
public BaseNode replaceChild (String xmlName, BaseNode child);
public BaseNode replaceChild (String xmlName, int index, BaseNode children[]);
public BaseNode replaceChild (String xmlName, BaseNode children[]);

7.2.4. Removing Children

The methods for removing children are similar to those for Replacing Children, except that the application does not supply any replacement for the node being removed. The basic method is BaseNode.removeChild(int).

For example, consider once again the following NewsML markup:

<Resource>
 <Urn>urn:xxx:yyy</Urn>
 <Url>http://www.acmenews.com/vocab/</Url>
</Resource>

When the application executes the following code

resource.removeChild(1);

the result will be

<Resource>
 <Urn>urn:xxx:yyy</Urn>
</Resource>

As with inserting and replacing, the org.newsml.toolkit.BaseNode interface contains a couple of convenience methods:

public BaseNode removeChild (int index);
public BaseNode removeChild (String xmlName, int index);
public void removeSelf () throws NewsMLException;

Note again that version 2.0 of the NewsML Toolkit does not enforce DTD constraints, so it is the application's responsibility to ensure that the document remains DTD-valid and NewsML-conformant when removing nodes.

8. Conformance Testing

Sometimes, especially on the authoring side, an application will need to check a NewsML document for conformance errors. Since the NewsML 1.0 specification includes a DTD, some of the simpler, more obvious structural errors can be detected simply by enabling DTD validation using the NewsMLFactory.setValidation(boolean) method before reading an XML document. With this flag set to true, any DTD validation errors in the NewsML document will cause an exception at load time.

DTDs (or other schemas) can detect only simple, obvious errors, however. A large part of the conformance requirements of NewsML are not covered by general-purpose schemas. As a result, the NewsML toolkit includes an org.newsml.toolkit.conformance package for running additional, non-DTD-based tests for data-type constraints, referential integrity, proper formal-name use, and so on.

The main entry point to the conformance testing is the org.newsml.toolkit.conformance.NewsMLTestManager class. This class allows the application to register a series of tests and then to run the tests against a NewsML document. Here is a simple example that uses the built-in default tests:

NewsMLTestManager mgr = new NewsMLTestManager();
mgr.addDefaultTests();
mgr.runTests(newsml, true);

By default, the test manager will print messages for warnings and errors to System.err. The application can override the default by supplying its own implementation of org.newsml.toolkit.conformance.ErrorVisitor using the org.newsml.toolkit.conformance.NewsMLTestManager.setErrorVisitor(ErrorVisitor) method.

An application writer can choose not to use the default tests, and can add new tests for local business rules by extending the org.newsml.toolkit.conformance.TestBase class. The NewsMLTestManager.addTest(String, TestBase) method takes as its first argument an XPath expression matching the nodes to which the tests should be applied.

For example, a test to warn if the provider id is not "acmenews.com" would look like this:

import org.newsml.toolkit.BaseNode;
import org.newsml.toolkit.NewsMLException;
import org.newsml.toolkit.conformance.TestBase;

public class ACMEProviderTest extends TestBase
{
  public void run (BaseNode contextNode, boolean useExternal)
    throws NewsMLException
  {
    if (!"acmenews.com".equals(contextNode.toString()))
      warn("ProviderId is " + contextNode.toString() +
           " rather than acmenews.com");
  }
}

The application would register this test for all ProviderId nodes like this:

mgr.addTest("//ProviderId", new ACMEProviderTest());

Now, every time the tests are run, the conformance test manager will report a warning for any provider ID other than "acmenews.com".

A. Troubleshooting

  1. Why do I get a java.lang.NoClassDefFoundError when I try to run the toolkit?

    You probably haven't installed all of the required libraries, or you haven't set up your CLASSPATH correctly so that Java can find them. See the System Requirements and Setup chapter for more information.

  2. Why do I get an XML parsing exception when I put non-English characters in my NewsML document?

    You probably haven't set up the encoding correctly. The default eight-bit encoding for an XML document is UTF-8, not ISO-8859-1 (ISO Latin 1). The two are the same up to character position 127 (ASCII), but beyond that UTF-8 uses multibyte sequences. If you want to include accented characters or non-English characters, either configure your editor to save with the UTF-8 encoding, or specify a different encoding in the XML declaration (this is not guaranteed to work with all parsers):

    <?xml version="1.0" encoding="ISO-8859-1"?>
    
  3. Why does it take so long to load my document?

    If you are loading your document over the network, it will be limited by network speed; perhaps the remote server is slow or your connection is dropping packets.

    The problem can also occur with documents on the local computer when you have enabled DTD validation and the document references a DTD at another site. For instance, if you have something like

    <!DOCTYPE NewsML SYSTEM "http://www.acmenews.com/newsml/NewsMLv1.0.dtd">
    

    your whole system is at the mercy of the acmenews.com server -- every time you load a document, the XML parser will go back to acmenews.com and request another copy of the DTD file. If acmenews.com goes offline or moves the DTD file, your system will grind to a halt; if acmenews.com has a security breach, the DTD file may be maliciously altered to force false information into your news stories.

    The best solution is to disable DTD validation or use only locally-hosted DTD files. This is good general advice for any XML system, not just for NewsML.

  4. Why won't the XPath expression "NewsComponent" find any NewsComponent nodes (etc.)?

    XPath anchors all searches in a context node, and reads the path relative to that node. The XPath expression "NewsComponent" will find only NewsComponents that are direct children of the context node; what you probably mean is "//NewsComponent".

    XSLT users get especially confused on this point, because an XSLT implementation is an engine that systematically walks through a document, trying every node as the context node in turn -- that's an XSLT thing, not an XPath thing. Read the XPath spec [XPATH] for more information.
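
The context-node behaviour described in question 4 can be reproduced with the XPath engine bundled with the JDK (javax.xml.xpath). This is only an illustrative sketch: the toolkit evaluates XPath through its own classes rather than through this API, and the class XPathContextSketch, its count helper, and the three-element document are assumptions made for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathContextSketch {
    // A NewsComponent that is a grandchild, not a child, of the root element.
    static final String XML =
        "<NewsML><NewsItem><NewsComponent/></NewsItem></NewsML>";

    /** Counts the nodes matched by the expression, using the root NewsML element as context. */
    static int count(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList hits = (NodeList) xpath.evaluate(
                expression, doc.getDocumentElement(), XPathConstants.NODESET);
            return hits.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(count("NewsComponent"));          // 0: not a direct child of NewsML
        System.out.println(count("NewsItem/NewsComponent")); // 1: explicit relative path
        System.out.println(count("//NewsComponent"));        // 1: searches the whole document
    }
}
```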

B. Minimal Sample NewsML Document

This simple NewsML document describes the manual you are currently reading. It is also available separately (your browser may do strange things trying to display it).


<?xml version="1.0"?>
<!--
************************************************************************
Minimal useful NewsML Document.
************************************************************************
-->

<!-- note that a copy of the DTD file must be in the same directory -->
<!DOCTYPE NewsML SYSTEM "NewsMLv1.0.dtd">

<NewsML>

 <!-- Use a single, inline vocabulary for everything -->
 <TopicSet Duid="vocab" FormalName="vocab" Vocabulary="#vocab">
  <Topic>
   <TopicType FormalName="topic" Vocabulary="#vocab"/>
   <FormalName>topic</FormalName>
   <Description>A formally-identified topic.</Description>
  </Topic>
  <Topic>
   <TopicType FormalName="topic" Vocabulary="#vocab"/>
   <FormalName>vocab</FormalName>
   <Description>A vocabulary of topics and formal names.</Description>
  </Topic>
  <Topic>
   <TopicType FormalName="topic" Vocabulary="#vocab"/>
   <FormalName>released</FormalName>
   <Description>The status of a document released to the public.</Description>
  </Topic>
  <Topic>
   <TopicType FormalName="topic" Vocabulary="#vocab"/>
   <FormalName>manual</FormalName>
   <Description>Technical documentation.</Description>
  </Topic>
  <Topic>
   <TopicType FormalName="topic" Vocabulary="#vocab"/>
   <FormalName>text/html</FormalName>
   <Description>An HTML document.</Description>
  </Topic>
 </TopicSet>

 <NewsEnvelope>
  <DateAndTime>20020411T132700-0400</DateAndTime>
 </NewsEnvelope>

 <NewsItem>

  <Identification>
   <NewsIdentifier>
    <ProviderId>megginson.com</ProviderId>
    <DateId>20020411</DateId>
    <NewsItemId>newsml-toolkit-manual</NewsItemId>
    <RevisionId PreviousRevision="0" Update="N">1</RevisionId>
    <PublicIdentifier>urn:newsml:megginson.com:20020411:newsml-toolkit-manual:1</PublicIdentifier>
   </NewsIdentifier>
  </Identification>

  <NewsManagement>
   <NewsItemType FormalName="manual" Vocabulary="#vocab"/>
   <FirstCreated>20020411T132600-0400</FirstCreated>
   <ThisRevisionCreated>20020411T132600-0400</ThisRevisionCreated>
   <Status FormalName="released" Vocabulary="#vocab"/>
  </NewsManagement>

  <NewsComponent>
   <NewsLines>
    <HeadLine>NewsML Toolkit 2.0 Manual</HeadLine>
    <CopyrightLine>Copyright (c) 2002 by Reuters PLC</CopyrightLine>
    <RightsLine>Free redistribution permitted</RightsLine>
   </NewsLines>
   <ContentItem Href="http://newsml-toolkit.sourceforge.net/newsml-toolkit-manual.html">
    <MimeType FormalName="text/html" Vocabulary="#vocab"/>
   </ContentItem>
  </NewsComponent>

 </NewsItem>

</NewsML>

<!-- end of newsml-sample.xml -->

C. References

[ANT]
The Jakarta Project. Apache Ant. Version 1.4. URL: http://jakarta.apache.org/ant/
[DOM]
World Wide Web Consortium. Document Object Model (DOM) Level 2 Core Specification. Version 1.0, 13 November 2000. URL: http://www.w3.org/TR/DOM-Level-2-Core/
[GNUREGEXP]
The Gnu Project. Regular Expressions for Java. Version 1.1.3. URL: http://savannah.gnu.org/projects/gnu-regexp/
[JAXEN]
Bob McWhirter and James Strachan. Jaxen: Java XPath Engine. Version 1.0beta6. URL: http://jaxen.org/
[JUNIT]
Erich Gamma and Kent Beck. JUnit. Version 3.7. URL: http://junit.org/
[LGPL]
Free Software Foundation. Gnu Lesser General Public License (LGPL). Version 2.1, February 1999. URL: http://www.gnu.org/licenses/lgpl.html
[NEWSML]
International Press Telecommunications Council. NewsML Version 1.0 Functional Specification. 24 October 2000. URL: http://www.iptc.org/site/NewsML/specification/NewsMLv1.0.pdf
[SAX]
The XML-Dev Mailing List. SAX: The Simple API for XML. Version 2.0. URL: http://www.saxproject.org/
[SAXPATH]
Bob McWhirter and James Strachan. SAXPath: Simple API for XPath. Version 1.0beta5. URL: http://www.saxpath.org/
[TOOLKIT]
Reuters PLC. NewsML Toolkit. Version 2.0. URL: http://newsml-toolkit.sourceforge.net/
[XERCES]
The Apache XML Project. Xerces Java Parser. Version 1.4.0. URL: http://xml.apache.org/xerces-j/
[XPATH]
World Wide Web Consortium. XML Path Language (XPath). Version 1.0. 16 November 1999. URL: http://www.w3.org/TR/xpath
[XSLT]
World Wide Web Consortium. XSL Transformations (XSLT). Version 1.0. 16 November 1999. URL: http://www.w3.org/TR/xslt