Version 2.0 of the NewsML Toolkit is copyright (c) 2002 by Reuters PLC and is released under the terms of the GNU Lesser General Public License [LGPL], which explicitly allows it to be used in non-free software.
This library is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or (at your option) any later version.
This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.
You should have received a copy of the GNU Lesser General Public License along with this library; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.
This manual is designed for computer programmers who want to incorporate version 2.0 of the NewsML Toolkit into their own Java programs. It does not provide a general introduction to XML, NewsML, or Java (though it is possible to use most of the toolkit without any specialized XML knowledge).
The NewsML Toolkit [TOOLKIT] is designed to read and write XML documents based on the NewsML 1.0 Functional Specification [NEWSML]. Most of the XML details are hidden from the user so that she can concentrate on the logical structure of a NewsML package.
To learn more about NewsML and other news industry specifications, you can visit the International Press Telecommunications Council (IPTC) Web site at
http://www.iptc.org/
If you want to build or rebuild the toolkit from source code, you will need to read through this entire chapter. If you just plan on using the library, you hate reading system requirements, and you just want to get up and running quickly, simply add the following JAR files from the NewsML Toolkit distribution to your Java class path:
newsml-toolkit.jar
lib/gnuregexp.jar
lib/jaxen-full.jar
lib/saxpath.jar
lib/xerces.jar
Different Java environments have different ways of setting the class path. For Sun's reference Java Development Environment, you can set the class path with the CLASSPATH environment variable, separating each entry with a colon (Unix-based systems) or a semicolon (DOS-based systems). If you're using a different environment, read the documentation that came with it.
The rest of this chapter explains the system requirements in more detail, for those who want or need to know.
The NewsML Toolkit requires the following additional open-source runtime libraries. Copies of all of these are included in the lib/ subdirectory at the top level of the distribution:
An XML parser supporting SAX2 [SAX] and DOM level 2 [DOM] (tested with Xerces-J 1.4.0 [XERCES]). The DOM level 2 interfaces are also required, but they will usually be bundled with the parser.
The XML parser handles low-level reading and error-reporting for XML documents.
SAXPath [SAXPATH] (tested with version 1.0beta5).
The SAXPath library provides parsing support for XPath [XPATH] syntax; together with Jaxen, it makes the NewsML Toolkit XPath-aware.
Jaxen [JAXEN] (tested with version 1.0beta6)
Together with SAXPath, Jaxen provides XPath support for the NewsML Toolkit. NewsML requires XPath for resolving formal names and bases for choice, and XPath can also be a useful syntax for referring to parts of a document.
If you plan on using the conformance-testing suite in the org.newsml.toolkit.conformance package, you will also require the following:
Gnu Regular Expression library [GNUREGEXP] (tested with 1.1.3).
This library adds support for regular-expression processing. The conformance suite uses regular expressions extensively in its tests to verify data formats.
The requirements in this section apply only if you plan to compile or recompile the NewsML Toolkit itself. You might want to do this because you plan to contribute to the toolkit's development, because you want to use a specialized Java compiler (such as one that compiles to machine code), or because you want to keep up to date with the latest CVS version. If you don't plan on doing anything like that (or perhaps don't even understand what some of those things are), feel free to skip this section.
To compile and test the NewsML Toolkit, you will need the following utilities:
Ant [ANT] (tested with version 1.4).
Ant is a project-building tool, similar to make but written for and in Java. Strictly speaking, this tool is not required -- you can import the Java files into any Java environment you wish. However, the distribution comes with a prewritten build.xml file for Ant, so rebuilding and running the unit tests is as simple as typing "ant" in the root directory of the distribution.
JUnit [JUNIT] (tested with version 3.7).
JUnit is an extremely popular unit-testing package for
Java. The NewsML Toolkit currently contains nearly 500 individual
unit tests in the org.newsml.dom.unittests
package: these tests help to
ensure that changes to the toolkit do not break any existing
functionality.
These packages are not included in the main distribution: if you do not have them already, you will have to download and install them yourself.
First, you need to install the required JAR libraries as described
in the System Requirements and
Setup chapter. When everything's in place, try this quick
test from the src
subdirectory of the NewsML distribution
directory:
java NormalizeNewsML ExampleText.xml
(If your development environment has a different way of invoking
the Java runtime environment, use it instead.) If everything is
installed correctly, you will see a complete XML document scrolled
quickly down the screen. If you get any exceptions or error messages,
first make sure that you're in the src/
subdirectory of
the distribution, then go back and make sure that all of the required
libraries are actually on your class path and that there are no other
versions shadowing them.
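If the quick test fails, it can help to confirm which of the required JAR files are actually named on the class path. The following is a minimal sketch that uses only the standard Java library; the class name ClassPathCheck and its missingJars helper are hypothetical and are not part of the toolkit. It checks file names only, so it will not detect a second, shadowing copy of a library.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical diagnostic helper; not part of the NewsML Toolkit.
class ClassPathCheck {
    // JAR file names taken from the quick-start list in this chapter.
    static final List<String> REQUIRED = Arrays.asList(
        "newsml-toolkit.jar", "gnuregexp.jar", "jaxen-full.jar",
        "saxpath.jar", "xerces.jar");

    /** Returns the required JAR names that do not appear in the class path. */
    static List<String> missingJars(String classPath, String separator) {
        List<String> present = new ArrayList<>();
        for (String entry : classPath.split(Pattern.quote(separator))) {
            // Compare by file name only, ignoring the directory part.
            present.add(new File(entry).getName());
        }
        List<String> missing = new ArrayList<>(REQUIRED);
        missing.removeAll(present);
        return missing;
    }

    public static void main(String[] args) {
        String classPath = System.getProperty("java.class.path");
        System.out.println("Missing JARs: "
            + missingJars(classPath, File.pathSeparator));
    }
}
```

Run it with the same class path that you use for the toolkit; an empty list means that every required file name was found.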
Now that you know that the toolkit is installed and functioning properly, you can try writing your first program with it. This chapter walks step-by-step through a simple Java program to load, examine, modify, and save a NewsML document (the document used in the examples is in the appendix Minimal Sample NewsML Document).
First, we need to load the NewsML document into the toolkit; that requires two lines of code:
    NewsMLFactory factory = new DOMNewsMLFactory();
    NewsML newsml = factory.createNewsML(input_file);
The first line creates a NewsMLFactory for building NewsML nodes, and the second uses the factory to open a NewsML document and return the root NewsML node.
It doesn't matter if input_file is a local filename or a
remote URL; the toolkit will work with either (local files will be
faster and more secure, of course). The NewsMLFactory.createNewsML(String)
method will throw a
regular Java IOException if there is any problem loading
the document.
Next, let's pull some information out of the document. NewsML 1.0 requires every NewsML document to contain a NewsEnvelope and every NewsEnvelope to contain a DateAndTime stating when the NewsML package was sent. The following code extracts that date string and prints it to standard output:
    Text date = newsml.getNewsEnvelope().getDateAndTime();
    System.out.println("Information sent at " + date.toString());
The method calls simply follow the path of the document: the NewsML root contains a NewsEnvelope, the NewsEnvelope contains the DateAndTime, and the DateAndTime contains text. (Normally, it is not quite this simple, because many nodes are optional and the getters may return null values, which you will have to check.)
Next, let's assume that the date in the sample document ("20020411T132700-0400", or 11 April 2002 at 1:27pm EDT) is incorrect, and we want to change it to 12 April:
    date.setString("20020412T132700-0400");
    System.out.println("Date changed to " + date.toString());
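The toolkit treats the date as an ordinary string, so computing a new value is up to the application. As a rough sketch that is independent of the toolkit, the standard java.time classes can parse and reformat the compact date form used in the sample document; the class name NewsMLDateDemo and the plusDays helper are hypothetical.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper; uses only the standard Java library.
class NewsMLDateDemo {
    // Compact ISO 8601 form from the sample document,
    // for example 20020411T132700-0400.
    static final DateTimeFormatter FORMAT =
        DateTimeFormatter.ofPattern("uuuuMMdd'T'HHmmssZ");

    /** Parses a date string, adds the given number of days, and reformats it. */
    static String plusDays(String stamp, long days) {
        OffsetDateTime parsed = OffsetDateTime.parse(stamp, FORMAT);
        return parsed.plusDays(days).format(FORMAT);
    }

    public static void main(String[] args) {
        // prints 20020412T132700-0400
        System.out.println(plusDays("20020411T132700-0400", 1));
    }
}
```

The resulting string could then be passed to date.setString() in place of a hard-coded value.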
Now that the document has been corrected, we can write it back out as XML:
    FileWriter output = new FileWriter(output_file);
    newsml.writeXML(output, true);
    output.close();
The BaseNode.writeXML(Writer, boolean) method writes the XML document to an output character stream; if you prefer a string, you can use BaseNode.toXML(boolean).
That's it -- you've loaded, examined, modified, and saved a NewsML document using the toolkit. Of course, a full Java application always requires some extra code for importing classes, checking arguments, catching exceptions, and so on. Here's a complete Java application based on the code above:
    import java.io.IOException;
    import java.io.FileWriter;

    import org.newsml.toolkit.NewsML;
    import org.newsml.toolkit.NewsMLFactory;
    import org.newsml.toolkit.Text;
    import org.newsml.toolkit.dom.DOMNewsMLFactory;

    public class NewsMLExample {

        public static void main (String args[]) {

            // Check the command-line arguments.
            if (args.length != 2) {
                System.err.println("Usage: java NewsMLExample <input> <output>");
                System.exit(2);
            }
            String input_file = args[0];
            String output_file = args[1];

            NewsMLFactory factory = new DOMNewsMLFactory();
            try {
                // Load the document and print the date it was sent.
                NewsML newsml = factory.createNewsML(input_file);
                Text date = newsml.getNewsEnvelope().getDateAndTime();
                System.out.println("Information sent at " + date.toString());

                // Correct the date.
                date.setString("20020412T132700-0400");
                System.out.println("Date changed to " + date.toString());

                // Write the modified document back out as XML.
                FileWriter output = new FileWriter(output_file);
                newsml.writeXML(output, true);
                output.close();
            } catch (IOException e) {
                System.err.println("Failed to open NewsML document " + args[0]
                                   + ": " + e.getMessage());
                System.exit(-1);
            }
        }
    }
Enjoy. You can probably do a lot of useful work now just by browsing the JavaDoc, but the rest of this manual provides more information and examples if you need them.
The NewsML Toolkit is designed to allow programmers to read and write XML documents without any specialized XML knowledge.
The methods for reading XML appear in the org.newsml.toolkit.NewsMLFactory
interface:
    public NewsML createNewsML (String url) throws IOException;
    public NewsML createNewsML (Reader input, String baseURL) throws IOException;
    public BaseNode createNode (String url) throws IOException;
    public BaseNode createNode (Reader input, String baseURL) throws IOException;
The createNewsML methods read a full NewsML document (with the root element NewsML), and the createNode methods read a NewsML document with any root element, such as Catalog or TopicSet. Each of these comes in two flavours: one that reads the document over the Web from a URL, and one that reads the document from any Java Reader.
When you read the document from a Java Reader, you may also supply a base URL for resolving references in the NewsML document; if you provide a null argument, the method will generate a base URL based on the current directory.
Here is a simple example that reads a NewsML document directly from an external URL:
    import java.io.IOException;
    import org.newsml.toolkit.NewsML;
    import org.newsml.toolkit.NewsMLFactory;
    import org.newsml.toolkit.dom.DOMNewsMLFactory;

    public NewsML read_newsml (String url) {
        NewsML newsml = null;
        NewsMLFactory factory = new DOMNewsMLFactory();
        try {
            newsml = factory.createNewsML(url);
        } catch (IOException e) {
            System.err.println("Failed to read NewsML from " + url
                               + ": " + e.getMessage());
        }
        // Returns null if the document could not be read.
        return newsml;
    }
(The org.newsml.toolkit.dom.DOMNewsMLFactory
class is an
implementation of the NewsMLFactory interface, built on top
of the Document Object Model [DOM]. In version
2.0 of the toolkit, it is the only implementation provided, but others
may appear in the future.)
Here is another example, that reads a NewsML document from a file and lets the factory build a default base URL:
    NewsML newsml = null;
    NewsMLFactory factory = new DOMNewsMLFactory();
    try {
        FileReader reader = new FileReader("mynewsml.xml");
        newsml = factory.createNewsML(reader, null);
    } catch (IOException e) {
        System.err.println("Error reading mynewsml.xml: " + e.getMessage());
    }
To read a NewsML document from a string, simply use the Java StringReader class instead of FileReader (both are in the java.io package).
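To illustrate the general idea of pairing a Reader with an optional base URL, here is a small sketch that uses the JAXP parser bundled with the JDK rather than the toolkit itself. How the toolkit handles the base URL internally is not described here; the class name ReaderParseDemo, the parse helper, and the example.org URL are assumptions made for illustration.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical illustration using the JDK's own XML parser.
class ReaderParseDemo {
    /** Parses XML from any Reader; baseURL may be null. */
    static Document parse(Reader input, String baseURL) throws Exception {
        InputSource source = new InputSource(input);
        if (baseURL != null) {
            // The system identifier is the base for relative references.
            source.setSystemId(baseURL);
        }
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(source);
    }

    public static void main(String[] args) throws Exception {
        String xmlData = "<Comment Duid=\"XXX\">Hello, world!</Comment>";
        Document doc = parse(new StringReader(xmlData),
                             "http://www.example.org/news/");
        System.out.println(doc.getDocumentElement().getTagName());
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

A FileReader can be passed in exactly the same way as the StringReader shown here.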
Sometimes the root element of the NewsML document is not NewsML. For example, some providers publish standalone catalogs and topic sets, where the root element is Catalog or TopicSet; in other cases, you might want to create any arbitrary NewsML element from XML text provided by a user. In both cases, you need to use the createNode methods and cast to the correct type. Here is an example that creates a comment node from an XML string:
    Comment comment = null;
    NewsMLFactory factory = new DOMNewsMLFactory();
    String xmlData = "<Comment Duid=\"XXX\">Hello, world!</Comment>";
    try {
        comment = (Comment)factory.createNode(new StringReader(xmlData), null);
    } catch (IOException e) {
        System.err.println("Error parsing XML string: " + e.getMessage());
    }
Finally, here is a second example that reads a topic set from a (fictitious) remote URL:
    TopicSet people = null;
    NewsMLFactory factory = new DOMNewsMLFactory();
    try {
        people = (TopicSet)factory.createNode("http://www.acmenews.net/people.xml");
    } catch (IOException e) {
        System.err.println("Failed to read topic set: " + e.getMessage());
    }
The methods for writing XML appear in the org.newsml.toolkit.BaseNode
interface, which all NewsML
nodes implement:
    public void writeXML (Writer output, boolean isDocument) throws IOException;
    public void writeXML (Writer output, String encoding, String internalSubset) throws IOException;
    public String toXML (boolean isDocument);
    public String toXML (String encoding, String internalSubset);
The writeXML methods write to any Java Writer, while the toXML methods (analogous to java.lang.Object.toString) write the XML representation to a string and return the string.
Each of the methods comes in two flavours. The first has an isDocument flag that determines whether the XML should be written as a fragment or as a standalone document with its own XML declaration and document-type declaration. The second provides greater control, by allowing the user to specify a character encoding and internal DTD subset to be included.
For example, to write a TopicSet out as a document fragment to the file "newsml-out.xml", you could use the following:
    FileWriter output = new FileWriter("newsml-out.xml");
    topicset.writeXML(output, false);
    output.close();
Since isDocument is false, this method will write out something like this:
    <TopicSet FormalName="places">
      ...
    </TopicSet>
This is perfectly good XML by itself, but contains no information about the character encoding and no document type declaration. In case you want those (or need them to be fully NewsML-compliant), you can get a default XML declaration and document type declaration by setting isDocument to true:
topicset.writeXML(output, true);
If you originally loaded the document from XML, you will get the original public identifier, system identifier, and internal DTD subset. The result might look something like this:
    <?xml version="1.0"?>
    <!DOCTYPE TopicSet SYSTEM "NewsMLv1.0.dtd">
    <TopicSet FormalName="places">
      ...
    </TopicSet>
For even finer control, you can use a different version of the method where you can provide your own character encoding and internal DTD subset:
topicset.writeXML(output, "ISO-8859-1", "<!ENTITY acme \"ACME News Inc.\">");
The result might look like this:
    <?xml version="1.0" encoding="ISO-8859-1"?>
    <!DOCTYPE TopicSet SYSTEM "NewsMLv1.0.dtd" [
      <!ENTITY acme "ACME News Inc.">
    ]>
    <TopicSet FormalName="places">
      ...
    </TopicSet>
The toXML methods work exactly the same way, except that they do not take a Writer argument and they return the XML as a string.
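The distinction between a fragment and a standalone document can also be seen with the XML serializer that ships with the JDK. The sketch below is only an analogy: it does not show how the toolkit implements writeXML or toXML, and the class name SerializeDemo and its toXML helper are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical analogy built on the JDK serializer, not on the toolkit.
class SerializeDemo {
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /** Serializes as a standalone document or as a bare fragment. */
    static String toXML(Document doc, boolean isDocument) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        if (isDocument) {
            // Emit an XML declaration and a document type declaration.
            t.setOutputProperty(OutputKeys.DOCTYPE_SYSTEM, "NewsMLv1.0.dtd");
        } else {
            // Fragment: no XML declaration, no document type declaration.
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        }
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<TopicSet FormalName=\"places\"/>");
        System.out.println(toXML(doc, false));
        System.out.println(toXML(doc, true));
    }
}
```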
The NewsML Toolkit stores NewsML nodes in a tree. Every node except the root (usually NewsML, but sometimes TopicSet, Catalog, or another type) has a parent node, and every node except the leaf nodes has children.
(Actually, things are not quite that simple. The design of NewsML requires that an API retain some lexical details of XML markup, and one of those details is the distinction between XML elements and attributes. More on that point later.)
The navigation methods in the toolkit make it possible to navigate from any node to any other node in the same NewsML document. This approach works well when you know the exact path from where you are to where you want to go; for an alternative approach, see the Searching chapter.
To get a node's parent, you use the BaseNode.getParent()
method, and then cast to the appropriate
type:
NewsItem newsitem = (NewsItem)identification.getParent();
When a node can have several different types of parent, you can use
the BaseNode.getXMLName()
method to figure out what to cast
to.
Since every node except the root has a non-null
parent, you can find the root easily from any node like this:
    BaseNode root = node;
    while (root.getParent() != null)
        root = root.getParent();
The best way to get a child node is to use the appropriate specialized method in the node's interface.
For example, to get the second ByLine node inside a
NewsLines node, use the NewsLines.getByLine(int)
method:
OriginText byline = newslines.getByLine(1);
If you want to know how many ByLine children are
available, use the NewsLines.getByLineCount()
method:
    int nByLines = newslines.getByLineCount();
    for (int i = 0; i < nByLines; i++) {
        process_byline(newslines.getByLine(i));
    }
Finally, to get all of the ByLine children in a single array, use the NewsLines.getByLine() method:
    OriginText bylines[] = newslines.getByLine();
All of the NewsML interfaces follow exactly this pattern for child nodes that can appear more than once: there is a get*Count() method that returns the number of children, a get*(int) method that returns a specific child, and a get*() method that returns all of the children in a single array. For example, the org.newsml.toolkit.NewsEnvelope interface has NewsEnvelope.getNewsServiceCount() and NewsEnvelope.getNewsProductCount(), NewsEnvelope.getNewsService(int) and NewsEnvelope.getNewsProduct(int), and NewsEnvelope.getNewsService() and NewsEnvelope.getNewsProduct().
When a child type can appear only once, there is no need to get a
count or to collect all children of the same type into an array, so an
interface will have simply a get*() method to return the
child, such as NewsEnvelope.getDateAndTime()
:
IdText date = newsenvelope.getDateAndTime();
These accessors work well when you are writing code to follow a
specific path through a NewsML document. There will be times,
however, when you want to write more generic code that is not tied to
specific NewsML node types (such as interactive editors or search
engines). In these cases, you can take advantage of the generic
methods in the org.newsml.toolkit.BaseNode
interface.
To use the generic methods, you need to know first whether the node you want is represented in the XML markup as an element or an attribute.
In XML documents, attributes have several special properties:
They are not repeatable: a parent element may have only one attribute with any single name.
They are unordered: the order in which attributes appear in a document is not allowed to matter, and they have no ordering relative to elements.
They may not contain children: attributes always represent leaf nodes.
The navigation methods in the node-specific NewsML interfaces hide these differences, but the generic navigation methods cannot.
To get a child node represented by an attribute, you use the
BaseNode.getAttr(String)
method:
Text rank = node.getAttr("Rank");
Note that, like most leaf nodes, attribute-based nodes (currently) all implement the org.newsml.toolkit.Text interface. You'll need to use the Text.toInt(), Text.toBoolean(), or Text.toString() method to get the actual attribute value.
    Text rankNode = node.getAttr("Rank");
    int rank = 0;  // default used when the attribute is absent
    if (rankNode != null)
        rank = rankNode.toInt();
Because elements are repeatable, are ordered, and can have their own child nodes, the generic navigation methods for nodes based on elements are more complex than for those based on attributes.
The primitive methods for retrieving element-based child nodes are
BaseNode.getChildCount()
and BaseNode.getChild(int)
:
    int nChildren = node.getChildCount();
    for (int i = 0; i < nChildren; i++) {
        BaseNode child = node.getChild(i);
        System.out.println("Child " + i + " is a " + child.getXMLName());
    }
The getChildCount method counts only element-based
nodes, not attribute-based nodes. There is also a BaseNode.getChild()
method for retrieving all of the
element-based child nodes in a single array:
BaseNode children[] = node.getChild();
Much of the time, however, you already know the XML element name of
the node you're looking for, so you're interested only in child nodes
with that name. In that case, you can use the derived convenience
methods BaseNode.getChildCount(String)
and BaseNode.getChild(String, int)
:
    int nChildren = node.getChildCount("Comment");
    for (int i = 0; i < nChildren; i++)
        process_comment((Comment)node.getChild("Comment", i));
Note that the various getChild methods return the type BaseNode, so you must cast down to the appropriate type (as in the previous example).
Finally, there is a BaseNode.getChild(String) method for retrieving all of the element-based child nodes with a specified name in a single array (note that the array will be of type BaseNode[], not of the derived interface type):
    BaseNode comments[] = node.getChild("Comment");
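For readers who know the W3C DOM, the element and attribute distinction described above has a direct counterpart there. The following sketch uses only the DOM classes in the JDK to collect element children, optionally filtered by name, while the attribute is reached through a separate call. The class name ChildElementsDemo and its helpers are hypothetical, and the sketch is not a description of the toolkit's internals.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical DOM analogy for the generic navigation methods.
class ChildElementsDemo {
    static Element parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
    }

    /** Returns element children in document order; a null name matches all. */
    static List<Element> childElements(Element parent, String name) {
        List<Element> result = new ArrayList<>();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            // Attributes never appear here: they are not DOM children.
            if (n.getNodeType() == Node.ELEMENT_NODE
                    && (name == null || name.equals(n.getNodeName()))) {
                result.add((Element) n);
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        Element item = parse("<NewsItem Duid=\"x\"><Comment>a</Comment>"
                           + "<Update/><Comment>b</Comment></NewsItem>");
        System.out.println(childElements(item, null).size());      // prints 3
        System.out.println(childElements(item, "Comment").size()); // prints 2
        System.out.println(item.getAttribute("Duid"));             // prints x
    }
}
```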
While the higher-level structure of a NewsML document is important, eventually you need to get at the basic information like dates, names, places, and numbers. In the NewsML 1.0 XML format [NEWSML], this information exists as character data and attribute values, and represents the leaf nodes of the NewsML tree. There are two different ways of accessing the information, depending on whether it occurs as part of the NewsML markup, or as part of some other markup language inside a DataContent element.
Nearly all nodes that contain leaf data implement the org.newsml.toolkit.Text interface or one of its subinterfaces, and the Text.toString(), Text.toInt(), and Text.toBoolean() methods provide the standard way to get leaf values for NewsML.
Internally, leaf values are always managed as strings; the int and boolean methods are simply for convenience.
The following example gets the text of a comment:
String text = comment.toString();
The following example gets the revision number of a NewsIdentifier:
int rev = newsIdentifier.getRevisionId().toInt();
The toBoolean method returns true if the value is 'y' and false otherwise.
There is one type of leaf data that does not use the
Text interface. The DataContent element in
NewsML contains an inline payload in a NewsML document, which is not
(usually) in the NewsML format. The org.newsml.toolkit.DataContent
interface provides several
different ways of accessing the non-NewsML data:
as a plain text string with any XML markup removed, using the
DataContent.getText()
method;
as a text string with XML markup included, using the DataContent.getXMLString() method; or
as a preparsed tree of DOM [DOM] nodes,
using the DataContent.getDOMNodes()
method.
For example, consider the following data content (using a very simple, fictitious sports markup language):
    <DataContent>
      <SportsScore>
        <sport>hockey</sport>
        <team>
          <name>Montreal</name>
          <score>3</score>
        </team>
        <team>
          <name>Boston</name>
          <score>2</score>
        </team>
      </SportsScore>
    </DataContent>
The getText method would return a Java string containing the above example with all XML tags stripped out (and any entity references expanded):
hockey Montreal 3 Boston 2
The getXMLString method would return all of the above example except for the opening and closing DataContent tags in a single Java String:
    <SportsScore>
      <sport>hockey</sport>
      <team>
        <name>Montreal</name>
        <score>3</score>
      </team>
      <team>
        <name>Boston</name>
        <score>2</score>
      </team>
    </SportsScore>
Finally, the getDOMNodes method would return a list of DOM nodes representing the SportsScore element and any whitespace text around it.
Obviously, getText is not that useful if the data content contains XML markup; it is designed mainly for payloads like plain (non-XML) text, including base64-encoded binary objects like photos.
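The two uses of plain text mentioned above (markup-stripped text and base64 payloads) can be sketched with the DOM and Base64 classes from the JDK. This is an illustration under stated assumptions, not the toolkit's implementation: the class name DataContentDemo and its helpers are hypothetical, and collapsing runs of whitespace to single spaces is a choice made for this sketch.

```java
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical illustration using JDK classes only.
class DataContentDemo {
    static Element parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
    }

    /** Text with all tags stripped and whitespace runs collapsed. */
    static String text(Element dataContent) {
        return dataContent.getTextContent().trim().replaceAll("\\s+", " ");
    }

    /** Decodes a base64 payload; the MIME decoder tolerates line breaks. */
    static byte[] decodeBase64(Element dataContent) {
        return Base64.getMimeDecoder().decode(dataContent.getTextContent().trim());
    }

    public static void main(String[] args) throws Exception {
        Element score = parse("<DataContent>\n <SportsScore>\n"
            + "  <sport>hockey</sport>\n"
            + "  <team><name>Montreal</name> <score>3</score></team>\n"
            + "  <team><name>Boston</name> <score>2</score></team>\n"
            + " </SportsScore>\n</DataContent>");
        System.out.println(text(score)); // prints hockey Montreal 3 Boston 2

        Element binary = parse("<DataContent>aGVsbG8=</DataContent>");
        System.out.println(new String(decodeBase64(binary),
                                      StandardCharsets.US_ASCII)); // prints hello
    }
}
```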
The NewsML Toolkit is not primarily designed as a search or query engine; however, the built-in XPath [XPATH] support does allow for certain types of contextual searching.
The org.newsml.toolkit.NewsMLSession
interface includes two
methods for executing XPath queries:
    public BaseNode [] getNodesByXPath (BaseNode contextNode, String xpath) throws NewsMLException;
    public BaseNode getNodeByXPath (BaseNode contextNode, String xpath) throws NewsMLException;
The first method returns an array of all nodes that match the XPath expression; the second returns only the first match.
The current session is accessible from any node through the
org.newsml.toolkit.BaseNode
interface, so a typical query
might look like this:
BaseNode matches[] = node.getSession().getNodesByXPath(node, xpath);
This section contains some NewsML-specific examples of XPath queries.
Find every node using Swiss French:
//node()[@xml:lang="fr-CH"]
Find the closest ancestor NewsComponent:
ancestor-or-self::NewsComponent
Find every content item using the "image/jpeg" MIME type:
//ContentItem[MimeType/@FormalName="image/jpeg"]
Find every news item that contains an update:
//NewsItem[Update]
Note that you must cast each result from BaseNode to the appropriate type.
If your only experience with XPath is through XSLT [XSLT], you may find that the getNodesByXPath and getNodeByXPath methods do not behave exactly as you expect. For example, if you wanted to find every comment that appears inside a NewsComponent, you might be tempted to try
    // WRONG!!
    BaseNode matches[] = session.getNodesByXPath(node, "NewsComponent/Comment");
However, this XPath query will return nothing at all unless the context node happens itself to contain a NewsComponent child with a Comment child. XSLT uses a processing model that usually iterates through an entire XML document, trying an XPath expression with each node as context node: that's the only reason that an expression like "NewsComponent/Comment" works in XSLT.
For general XPath use, there is only a single context node. To find every Comment inside a NewsComponent descendant of the node, you need to enter the expression like this:
    // Correct for a specific subtree.
    BaseNode matches[] = session.getNodesByXPath(node, ".//NewsComponent/Comment");
The leading ".//" tells the XPath engine to look for any NewsComponent descendant of the context node rather than just its immediate children. To search the entire document rather than just the descendants of the context node, start the expression at the document root with something like
    // Correct for the entire document.
    BaseNode matches[] = session.getNodesByXPath(node, "/descendant::NewsComponent/Comment");
For more information, review the XPath specification listed in the References appendix.
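The effect of the context node is easy to check with the XPath 1.0 engine bundled with the JDK. The sketch below does not use the toolkit or Jaxen, so treat it as an illustration of standard XPath behaviour; the class name XPathContextDemo, its helpers, and the sample document are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration of context nodes in standard XPath 1.0.
class XPathContextDemo {
    static final String XML =
        "<NewsML>"
      + "<NewsItem><NewsComponent><Comment>one</Comment></NewsComponent></NewsItem>"
      + "<NewsItem><NewsComponent><Comment>two</Comment></NewsComponent></NewsItem>"
      + "</NewsML>";

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /** Counts the nodes matched by the expression from the given context. */
    static int count(Node context, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList hits = (NodeList) xpath.evaluate(
            expression, context, XPathConstants.NODESET);
        return hits.getLength();
    }

    public static void main(String[] args) throws Exception {
        Node root = parse(XML).getDocumentElement();
        Node firstItem = root.getFirstChild();

        // Relative path: only children of the context node are tried.
        System.out.println(count(root, "NewsComponent/Comment"));         // prints 0
        // Descendants of the context node only.
        System.out.println(count(firstItem, ".//NewsComponent/Comment")); // prints 1
        // Absolute path: the whole document, whatever the context node.
        System.out.println(count(firstItem, "//NewsComponent/Comment"));  // prints 2
    }
}
```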
Starting with version 2.0, NewsML Toolkit includes support for creating new nodes and modifying existing ones. Client applications can now use the toolkit to perform simple or complex modifications on a NewsML package before saving it back to XML, or even to create a new NewsML package entirely from scratch.
Creating nodes is very similar to reading from an XML document (see
4.1. Reading XML); in fact, reading an XML
document is just a special case of creating a new node. The methods
for creating new NewsML nodes also appear in the org.newsml.toolkit.NewsMLFactory
interface, and are
divided into two groups: methods for creating new, empty nodes, and
methods for copying existing nodes.
In the org.newsml.toolkit.NewsMLFactory
interface, there is a
no-argument create method for every XML element and attribute type in
NewsML 1.0 [NEWSML]. Here are some
examples:
    public OriginText createByLine () throws IOException;
    public NewsML createNewsML () throws IOException;
    public Text createAssignedByAttr () throws IOException;
    public Text createContextAttr () throws IOException;
The factory methods for nodes represented in NewsML by XML
attributes always end in "Attr" and always return an instance of
org.newsml.toolkit.Text
; the factory methods for nodes
represented by XML elements return the appropriate node type.
Note that in version 2.0, the NewsML Toolkit does not yet have DTD knowledge built in. When you create a new node, it will be empty, even if it is based on an XML element that has required attributes or child elements in the NewsML 1.0 DTD. For example, the NewsML 1.0 DTD requires that the NewsItem element have at least Identification and NewsManagement child elements; however, if you invoke
    NewsItem item = factory.createNewsItem();
    System.out.print(item.toXML(false));
the output will be simply
<NewsItem></NewsItem>
It is the application's responsibility to ensure that the NewsML document is valid and conformant.
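One lightweight way for an application to take on part of that responsibility is to check for required children before writing a document out. The sketch below shows the idea on plain DOM elements from the JDK; it is not a substitute for validating against the DTD, and the class name RequiredChildrenCheck and its helpers are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical sanity check on DOM elements; not part of the toolkit.
class RequiredChildrenCheck {
    /** Creates a new, empty root element with the given name. */
    static Element newElement(String name) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element element = doc.createElement(name);
        doc.appendChild(element);
        return element;
    }

    /** Returns the required child element names that are absent. */
    static List<String> missingChildren(Element parent, String... required) {
        List<String> missing = new ArrayList<>();
        for (String name : required) {
            boolean found = false;
            for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() == Node.ELEMENT_NODE
                        && name.equals(n.getNodeName())) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws Exception {
        Element item = newElement("NewsItem");
        // prints [Identification, NewsManagement]
        System.out.println(
            missingChildren(item, "Identification", "NewsManagement"));
    }
}
```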
If the application has not saved a reference to the current NewsMLFactory, it can look up the reference through the current session, which is available through the BaseNode.getSession() method:
NewsMLFactory factory = node.getSession().getFactory();
NewsMLFactory also has the generic methods NewsMLFactory.createNewNode(String)
and NewsMLFactory.createNewNodeAttr(String)
for creating any
type of node. Note that these require you to know whether the node is
represented by an element or attribute in the NewsML 1.0 XML markup,
and that they also require you to cast the result to the appropriate
type. Here is an example:
NewsEnvelope envelope = (NewsEnvelope)factory.createNewNode("NewsEnvelope");
The NewsMLFactory
interface also contains support for copying
existing NewsML nodes. For every factory method like this
public NewsEnvelope createNewsEnvelope() throws IOException;
There is also one like this
public NewsEnvelope createNewsEnvelope(NewsEnvelope node) throws IOException;
The second type of method will create a deep copy of the argument provided: any modifications to the copy will not affect the original. Note also that the copy will not have any parent until you add it to a NewsML document explicitly. Here is an easy way to clone a NewsComponent:
NewsComponent component2 = factory.createNewsComponent(component);
As with creating, there are also generic methods NewsMLFactory.createNewNode(BaseNode)
and NewsMLFactory.createNewNodeAttr(Text)
for copying any type
of NewsML node:
NewsEnvelope envelope_copy = (NewsEnvelope)factory.createNewNode(envelope);
Again, with the generic methods, you must know whether NewsML 1.0 represents the node in XML as an element or an attribute.
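The two properties of a deep copy described above (independence from the original, and no parent) match the behaviour of cloneNode(true) in the W3C DOM. The sketch below demonstrates them with JDK classes only. Whether the DOM-based factory uses cloneNode internally is an assumption that this manual does not confirm, and the class name DeepCopyDemo is hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical DOM analogy for copying a node with the factory.
class DeepCopyDemo {
    /** Builds <NewsComponent><Comment>original</Comment></NewsComponent>. */
    static Element sample() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element component = doc.createElement("NewsComponent");
        doc.appendChild(component);
        Element comment = doc.createElement("Comment");
        comment.setTextContent("original");
        component.appendChild(comment);
        return component;
    }

    public static void main(String[] args) throws Exception {
        Element original = sample();
        Element copy = (Element) original.cloneNode(true);

        // Changing the copy leaves the original untouched.
        copy.getFirstChild().setTextContent("changed");
        System.out.println(original.getTextContent()); // prints original
        System.out.println(copy.getTextContent());     // prints changed

        // The copy has no parent until it is added somewhere explicitly.
        System.out.println(copy.getParentNode() == null); // prints true
    }
}
```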
The major change between version 1.1 and version 2.0 of the NewsML Toolkit is the ability to modify a NewsML document programmatically. There are two types of modification possible:
Starting with version 2.0, the NewsML Toolkit allows client applications to modify the leaf values of nodes. For example, the application can change a date, correct a spelling mistake, or reassign a formal name using the methods described in this section. When the application is creating a new NewsML document from scratch rather than modifying an existing one, these methods allow the application to populate the document with actual information.
Most of the time, you will be using three key methods in the
org.newsml.toolkit.Text
interface: Text.setBoolean(boolean)
, Text.setInt(int)
, and Text.setString(String)
; these correspond to the accessor
methods Text.toBoolean()
, Text.toInt()
, and Text.toString()
.
For example, to set the text of a BasisForChoice, you could use the following:
basisForChoice.setString("//Format/@FormalName");
As a slightly more complex example, to increment the RevisionId of a NewsIdentifier, you could use the following:
id.getRevisionId().setInt(id.getRevisionId().toInt()+1);
Internally, leaf values are always managed as strings; the int and boolean methods are simply for convenience.
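To make the idea of string-backed leaf values concrete, here is a small standalone sketch. The class LeafValue is hypothetical and is not the toolkit's Text implementation; in particular, writing 'n' for false is an assumption, since this manual only states that 'y' reads as true.

```java
// Hypothetical sketch of a string-backed leaf value; not the toolkit's code.
class LeafValue {
    private String value;

    LeafValue(String value) {
        this.value = value;
    }

    @Override
    public String toString() {
        return value;
    }

    /** Convenience view: parses the stored string as an integer. */
    int toInt() {
        return Integer.parseInt(value.trim());
    }

    /** Convenience view: true only when the stored string is "y". */
    boolean toBoolean() {
        return "y".equals(value);
    }

    void setString(String newValue) {
        value = newValue;
    }

    void setInt(int newValue) {
        value = Integer.toString(newValue);
    }

    void setBoolean(boolean newValue) {
        value = newValue ? "y" : "n"; // "n" for false is an assumption
    }

    public static void main(String[] args) {
        LeafValue revisionId = new LeafValue("3");
        revisionId.setInt(revisionId.toInt() + 1);
        System.out.println(revisionId); // prints 4
    }
}
```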
There are also DataContent.setText(String)
, DataContent.setXMLString(String)
, and DataContent.setDOMNodes(NodeList)
methods corresponding to
the accessor methods described in the DataContent
section.
The corresponding setText method is convenient for setting raw, non-XML text like base64-encoded graphics or encryption keys, because it does not require you to escape any special XML characters like '<' or '&':
content.setText("AT&T");
The setXMLString method is a little more restrictive than the getXMLString method: the string returned by getXMLString might not be a well-formed XML document, since there may be more than one top-level element; setXMLString requires a single root element and a well-formed XML document, since it will actually parse the string as XML. Note that entity references might cause problems. Here is an example of setting a very simple piece of XML content:
content.setXMLString("<company>AT&amp;T</company>");
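Because setXMLString parses its argument, a bare '&' or a second top-level element makes the string unusable. The following sketch checks a string for well-formedness up front, using the JDK's built-in parser rather than the toolkit; the class and method names are illustrative only.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WellFormedCheck {

    // Returns true if the string parses as a single well-formed XML document.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed
        } catch (Exception e) {
            throw new IllegalStateException(e); // parser setup or I/O problem
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<company>AT&amp;T</company>"));
        System.out.println(isWellFormed("<company>AT&T</company>"));
        System.out.println(isWellFormed("<a/><b/>"));
    }
}
```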
Finally, the setDOMNodes method is useful if your XML is already parsed in memory, and you want to avoid the overhead of parsing it again. It is the application's responsibility to ensure that the document remains DTD-valid, if validation is being performed.
The primitive method for inserting a new child into a node is BaseNode.insertChild(int, BaseNode). This method inserts a child base node into the current node at the absolute position specified, pushing any child originally in that position (and all subsequent children) forward by one position. All positions are zero-indexed.
For example, consider the following NewsML XML markup:
<Resource> <Urn>urn:xxx:yyy</Urn> <Url>http://www.acmenews.com/vocab/</Url> </Resource>
The Resource (an instance of org.newsml.toolkit.Resource) has two child nodes, Urn and Url (both instances of org.newsml.toolkit.IdText). The following code adds a new Url in position 1:
resource.insertChild(1, new_url);
Depending on the contents of new_url, the resulting XML rendition would look something like this:
<Resource> <Urn>urn:xxx:yyy</Urn> <Url>http://www.newsrus.com/metadata/</Url> <Url>http://www.acmenews.com/vocab/</Url> </Resource>
An application can use this method to insert a new child anywhere in the NewsML tree. Version 2.0 of the toolkit does not apply DTD constraints, so it is the responsibility of the application to ensure that the document remains NewsML-conformant after the new child has been inserted.
An index of -1 will always insert the child in the last position, so
resource.insertChild(-1, new_url);
would result in something like
<Resource> <Urn>urn:xxx:yyy</Urn> <Url>http://www.acmenews.com/vocab/</Url> <Url>http://www.newsrus.com/metadata/</Url> </Resource>
This convention is useful for appending new children to a node incrementally.
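The index convention can be tried out on an ordinary java.util.List. The helper below is a stand-in written for this illustration, not the toolkit's implementation; it assumes only the behaviour described above (zero-based positions, later children pushed forward, -1 meaning "append").

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class InsertIndexSketch {

    // Inserts child at the given zero-based index; -1 appends at the end.
    static <T> void insertChild(List<T> children, int index, T child) {
        if (index == -1) {
            children.add(child);
        } else {
            children.add(index, child); // shifts later children forward by one
        }
    }

    // Starts from the two children of the Resource example and inserts a new Url.
    static List<String> demo(int index) {
        List<String> children =
            new ArrayList<>(Arrays.asList("Urn", "Url(acmenews)"));
        insertChild(children, index, "Url(newsrus)");
        return children;
    }

    public static void main(String[] args) {
        System.out.println(demo(1));
        System.out.println(demo(-1));
    }
}
```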
It is not always convenient to provide an absolute position for
inserting a child, however; in many cases, an application will need to
add a child to the beginning or end of a list of similar children.
For that, the methods BaseNode.insertBefore(String, int, BaseNode) and BaseNode.insertAfter(String, int, BaseNode) are helpful:
node.insertBefore("Comment", 0, comment);
This example will insert comment immediately before the first existing Comment node.
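The idea of positioning a new child relative to the n-th sibling with a given name can also be expressed against the plain W3C DOM. The sketch below uses only JDK classes and a made-up Parent element; nthChild is a hypothetical helper, and nothing here is claimed to reflect how the toolkit implements insertBefore.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class InsertBeforeSketch {

    // Returns the index-th (zero-based) child element with the given name, or null.
    static Element nthChild(Element parent, String name, int index) {
        int seen = 0;
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE && n.getNodeName().equals(name)) {
                if (seen == index) return (Element) n;
                seen++;
            }
        }
        return null;
    }

    // Inserts a new Comment before the first existing Comment and lists the children.
    static String demo() {
        String xml = "<Parent><Other>x</Other><Comment>old</Comment></Parent>";
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            Element parent = doc.getDocumentElement();
            Element comment = doc.createElement("Comment");
            comment.setTextContent("new");
            parent.insertBefore(comment, nthChild(parent, "Comment", 0));

            StringBuilder out = new StringBuilder();
            for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
                out.append(n.getNodeName()).append('=').append(n.getTextContent()).append(' ');
            }
            return out.toString().trim();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```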
There are several other convenience methods for inserting. Here is the entire list:
public BaseNode insertChild (int index, BaseNode child);
public BaseNode[] insertChild (int index, BaseNode children[]);
public BaseNode insertFirst (BaseNode child);
public BaseNode[] insertFirst (BaseNode children[]);
public BaseNode insertLast (BaseNode child);
public BaseNode[] insertLast (BaseNode children[]);
public BaseNode insertBefore (String name, int index, BaseNode child);
public BaseNode insertBefore (String name, BaseNode child);
public BaseNode[] insertBefore (String name, int index, BaseNode children[]);
public BaseNode[] insertBefore (String name, BaseNode children[]);
public BaseNode insertBeforeDuid (String duid, BaseNode child);
public BaseNode[] insertBeforeDuid (String duid, BaseNode children[]);
public BaseNode insertAfter (String name, int index, BaseNode child);
public BaseNode insertAfter (String name, BaseNode child);
public BaseNode[] insertAfter (String name, int index, BaseNode children[]);
public BaseNode[] insertAfter (String name, BaseNode children[]);
public BaseNode insertAfterDuid (String duid, BaseNode child);
public BaseNode[] insertAfterDuid (String duid, BaseNode children[]);
For more information, see the documentation for org.newsml.toolkit.BaseNode.
Replacing children is like Inserting Children, except that the child at the existing index is removed rather than being pushed forward, and the replace methods always return the node that has been removed rather than any nodes that have been added. The fundamental method is BaseNode.replaceChild(int, BaseNode). Consider the following document:
<Resource> <Urn>urn:xxx:yyy</Urn> <Url>http://www.acmenews.com/vocab/</Url> </Resource>
When the application executes the following command
resource.replaceChild(1, url);
the URL in position 1 will be removed, and the new URL will be added in its place:
<Resource> <Urn>urn:xxx:yyy</Urn> <Url>http://www.newsrus.com/metadata/</Url> </Resource>
As with inserting, there is an extensive collection of convenience methods in the org.newsml.toolkit.BaseNode interface:
public BaseNode replaceChild (int index, BaseNode child);
public BaseNode replaceChild (int index, BaseNode children[]);
public BaseNode replaceChild (String xmlName, int index, BaseNode child);
public BaseNode replaceChild (String xmlName, BaseNode child);
public BaseNode replaceChild (String xmlName, int index, BaseNode children[]);
public BaseNode replaceChild (String xmlName, BaseNode children[]);
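The "return what was removed" convention is the same one java.util.List.set follows, which makes it easy to try out in isolation. The helper below is a stand-in for illustration and does not use the toolkit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ReplaceSketch {

    // Replaces the child at index and returns the child that was there before.
    static <T> T replaceChild(List<T> children, int index, T child) {
        return children.set(index, child);
    }

    // Replaces the Url of the Resource example and reports both old and new state.
    static String demo() {
        List<String> children =
            new ArrayList<>(Arrays.asList("Urn", "Url(acmenews)"));
        String removed = replaceChild(children, 1, "Url(newsrus)");
        return removed + " -> " + children;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Keeping the returned node lets an application undo the replacement or move the old child elsewhere.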
The methods for removing children are similar to those for Replacing Children, except that the application does not supply any replacement for the node being removed. The basic method is BaseNode.removeChild(int).
For example, consider once again the following NewsML markup:
<Resource> <Urn>urn:xxx:yyy</Urn> <Url>http://www.acmenews.com/vocab/</Url> </Resource>
When the application executes the following code
resource.removeChild(1);
the result will be
<Resource> <Urn>urn:xxx:yyy</Urn> </Resource>
As with inserting and replacing, the org.newsml.toolkit.BaseNode interface contains a couple of convenience methods:
public BaseNode removeChild (int index);
public BaseNode removeChild (String xmlName, int index);
public void removeSelf () throws NewsMLException;
Note again that version 2.0 of the NewsML Toolkit does not enforce DTD constraints, so it is the application's responsibility to ensure that the document remains DTD-valid and NewsML-conformant when removing nodes.
Sometimes, especially on the authoring side, an application will need to check a NewsML document for conformance errors. Since the NewsML 1.0 specification includes a DTD, some of the simpler, more obvious structural errors can be detected simply by enabling DTD validation using the NewsMLFactory.setValidation(boolean) method before reading an XML document. With this flag set to true, any DTD validation errors in the NewsML document will cause an exception at load time.
DTDs (or other schemas) can detect only simple, obvious errors,
however. A large part of the conformance requirements of NewsML are
not covered by general-purpose schemas. As a result, the NewsML
toolkit includes an org.newsml.toolkit.conformance
package for running
additional, non-DTD-based tests for data-type constraints, referential
integrity, proper formal-name use, and so on.
The main entry point to the conformance testing is the org.newsml.toolkit.conformance.NewsMLTestManager
class.
This class allows the application to register a series of tests and
then to run the tests against a NewsML document. Here is a simple
example that uses the built-in default tests:
NewsMLTestManager mgr = new NewsMLTestManager();
mgr.addDefaultTests();
mgr.runTests(newsml, true);
By default, the test manager will print messages for warnings and errors to System.err. The application can override the default by supplying its own implementation of org.newsml.toolkit.conformance.ErrorVisitor using the org.newsml.toolkit.conformance.NewsMLTestManager.setErrorVisitor(ErrorVisitor) method.
An application writer can choose not to use the default tests, and
can add new tests for local business rules by extending the org.newsml.toolkit.conformance.TestBase
class. The NewsMLTestManager.addTest(String, TestBase)
method takes
as its first argument an XPath expression matching the nodes to which
the tests should be applied.
For example, a test to warn if the provider id is not "acmenews.com" would look like this:
import org.newsml.toolkit.BaseNode;
import org.newsml.toolkit.NewsMLException;
import org.newsml.toolkit.conformance.TestBase;

public class ACMEProviderTest extends TestBase {
    public void run (BaseNode contextNode, boolean useExternal)
        throws NewsMLException
    {
        if (!"acmenews.com".equals(contextNode.toString()))
            warn("ProviderId is " + contextNode.toString() + " rather than acmenews.com");
    }
}
The application would register this test for all ProviderId nodes like this:
mgr.addTest("//ProviderId", new ACMEProviderTest());
Now, every time the tests are run, the conformance test manager will report a warning for any provider ID other than "acmenews.com".
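To experiment with a rule like this outside the toolkit, the same check can be written against the JDK's own XPath and DOM classes. The sketch below is only an approximation of the idea (select nodes by path, test each one, collect warnings); it does not use TestBase or NewsMLTestManager, and the class name and sample markup are made up for the illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ProviderRuleSketch {

    // Collects one warning for every ProviderId whose text is not acmenews.com.
    static List<String> warnings(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList ids = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//ProviderId", doc, XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < ids.getLength(); i++) {
                String value = ids.item(i).getTextContent().trim();
                if (!"acmenews.com".equals(value)) {
                    out.add("ProviderId is " + value + " rather than acmenews.com");
                }
            }
            return out;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Items><ProviderId>acmenews.com</ProviderId>"
                   + "<ProviderId>megginson.com</ProviderId></Items>";
        warnings(xml).forEach(System.err::println);
    }
}
```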
Why do I get a java.lang.NoClassDefFoundError when I try to run the toolkit?
You probably haven't installed all of the required libraries, or you haven't set up your CLASSPATH correctly so that Java can find them. See the System Requirements and Setup chapter for more information.
Why do I get an XML parsing exception when I put non-English characters in my NewsML document?
You probably haven't set up the encoding correctly. The default eight-bit encoding for an XML document is UTF-8, not ISO-8859-1 (ISO Latin 1). The two are the same up to character position 127 (ASCII), but beyond that UTF-8 uses multibyte sequences. If you want to include accented characters or non-English characters, either configure your editor to save with the UTF-8 encoding, or specify a different encoding in the XML declaration (this is not guaranteed to work with all parsers):
<?xml version="1.0" encoding="ISO-8859-1"?>
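The mismatch is easy to reproduce with the JDK's built-in parser. The sketch below saves the same accented text in different encodings and reports whether each byte stream parses; the class name and sample element are illustrative, and other parsers may behave differently for declared encodings.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class EncodingSketch {

    static final String BODY = "<HeadLine>Caf\u00e9 society</HeadLine>";
    static final String LATIN1_DECL = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>";

    // Encodes the text with the given charset, as an editor would on save,
    // and reports whether the resulting bytes parse as XML.
    static boolean parses(String xml, Charset savedAs) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler()); // keep stderr quiet
            builder.parse(new ByteArrayInputStream(xml.getBytes(savedAs)));
            return true;
        } catch (SAXException | IOException e) {
            return false; // malformed bytes or malformed XML
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Saved as UTF-8, no declaration needed.
        System.out.println(parses(BODY, StandardCharsets.UTF_8));
        // Saved as Latin 1 without saying so.
        System.out.println(parses(BODY, StandardCharsets.ISO_8859_1));
        // Saved as Latin 1 and declared as such.
        System.out.println(parses(LATIN1_DECL + BODY, StandardCharsets.ISO_8859_1));
    }
}
```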
Why does it take so long to load my document?
If you are loading your document over the network, it will be limited by network speed; perhaps the remote server is slow or your connection is dropping packets.
The problem can also occur with documents on the local computer when you have enabled DTD validation and the document references a DTD at another site. For instance, if you have something like
<!DOCTYPE NewsML SYSTEM "http://www.acmenews.com/newsml/NewsMLv1.0.dtd">
your whole system is at the mercy of the acmenews.com server -- every time you load a document, the XML parser will go back to acmenews.com and request another copy of the DTD file.
If acmenews.com goes offline or moves the DTD file, your system will grind to a halt; if acmenews.com has a security breach, the DTD file may be maliciously altered to force false information into your news stories.
The best solution is to disable DTD validation or use only locally-hosted DTD files. This is good general advice for any XML system, not just for NewsML.
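With a plain JAXP parser, one common way to keep DTD requests off the network is to install an org.xml.sax.EntityResolver that answers them locally. The sketch below uses only JDK classes and returns an empty DTD for every request, which is enough for non-validating parsing; a real setup would more likely return a local copy of NewsMLv1.0.dtd. How (or whether) the toolkit exposes such a hook is not covered here, so treat this as a general XML technique rather than toolkit API.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class LocalDtdSketch {

    // Parses the document without fetching any external DTD and returns
    // the name of its root element.
    static String rootNameWithoutNetwork(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(false);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Answer every external entity request with an empty local source
            // instead of letting the parser open the system identifier's URL.
            builder.setEntityResolver(
                (publicId, systemId) -> new InputSource(new StringReader("")));
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getNodeName();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml =
            "<!DOCTYPE NewsML SYSTEM \"http://www.acmenews.com/newsml/NewsMLv1.0.dtd\">"
            + "<NewsML/>";
        System.out.println(rootNameWithoutNetwork(xml));
    }
}
```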
Why won't the XPath expression "NewsComponent" find any NewsComponent nodes (etc.)?
XPath anchors all searches in a context node, and reads the path relative to that node. The XPath expression "NewsComponent" will find only NewsComponents that are direct children of the context node; what you probably mean is "//NewsComponent".
XSLT users get especially confused on this point, because an XSLT implementation is an engine that systematically walks through a document, trying every node as the context node in turn -- that's an XSLT thing, not an XPath thing. Read the XPath spec [XPATH] for more information.
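The difference is easy to see by counting matches from a fixed context node. The sketch below uses the JDK's javax.xml.xpath package instead of the toolkit's own XPath support, with the root NewsML element as the context node; the sample markup is a made-up skeleton, not a conformant NewsML document.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathContextSketch {

    static final String XML =
        "<NewsML><NewsItem><NewsComponent><NewsComponent/></NewsComponent>"
        + "</NewsItem></NewsML>";

    // Counts the nodes the expression selects, using the root element as context.
    static int count(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc.getDocumentElement(), XPathConstants.NODESET);
            return hits.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(count("NewsComponent"));          // direct children only
        System.out.println(count("NewsItem/NewsComponent")); // explicit relative path
        System.out.println(count("//NewsComponent"));        // anywhere in the document
    }
}
```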
This simple NewsML document describes the manual you are currently reading. It is also available separately (your browser may do strange things trying to display it).
<?xml version="1.0"?>
<!-- ************************************************************************
     Minimal useful NewsML Document.
     ************************************************************************ -->
<!-- note that a copy of the DTD file must be in the same directory -->
<!DOCTYPE NewsML SYSTEM "NewsMLv1.0.dtd">
<NewsML>
  <!-- Use a single, inline vocabulary for everything -->
  <TopicSet Duid="vocab" FormalName="vocab" Vocabulary="#vocab">
    <Topic>
      <TopicType FormalName="topic" Vocabulary="#vocab"/>
      <FormalName>topic</FormalName>
      <Description>A formally-identified topic.</Description>
    </Topic>
    <Topic>
      <TopicType FormalName="topic" Vocabulary="#vocab"/>
      <FormalName>vocab</FormalName>
      <Description>A vocabulary of topics and formal names.</Description>
    </Topic>
    <Topic>
      <TopicType FormalName="topic" Vocabulary="#vocab"/>
      <FormalName>released</FormalName>
      <Description>The status of a document released to the public.</Description>
    </Topic>
    <Topic>
      <TopicType FormalName="topic" Vocabulary="#vocab"/>
      <FormalName>manual</FormalName>
      <Description>Technical documentation.</Description>
    </Topic>
    <Topic>
      <TopicType FormalName="topic" Vocabulary="#vocab"/>
      <FormalName>text/html</FormalName>
      <Description>An HTML document.</Description>
    </Topic>
  </TopicSet>
  <NewsEnvelope>
    <DateAndTime>20020411T132700-0400</DateAndTime>
  </NewsEnvelope>
  <NewsItem>
    <Identification>
      <NewsIdentifier>
        <ProviderId>megginson.com</ProviderId>
        <DateId>20020411</DateId>
        <NewsItemId>newsml-toolkit-manual</NewsItemId>
        <RevisionId PreviousRevision="0" Update="N">1</RevisionId>
        <PublicIdentifier>urn:newsml:megginson.com:20020411:newsml-toolkit-manual:1</PublicIdentifier>
      </NewsIdentifier>
    </Identification>
    <NewsManagement>
      <NewsItemType FormalName="manual" Vocabulary="#vocab"/>
      <FirstCreated>20020411T132600-0400</FirstCreated>
      <ThisRevisionCreated>20020411T132600-0400</ThisRevisionCreated>
      <Status FormalName="released" Vocabulary="#vocab"/>
    </NewsManagement>
    <NewsComponent>
      <NewsLines>
        <HeadLine>NewsML Toolkit 2.0 Manual</HeadLine>
        <CopyrightLine>Copyright (c) 2002 by Reuters PLC</CopyrightLine>
        <RightsLine>Free redistribution permitted</RightsLine>
      </NewsLines>
      <ContentItem Href="http://newsml-toolkit.sourceforge.net/newsml-toolkit-manual.html">
        <MimeType FormalName="text/html" Vocabulary="#vocab"/>
      </ContentItem>
    </NewsComponent>
  </NewsItem>
</NewsML>
<!-- end of newsml-sample.xml -->
http://jakarta.apache.org/ant/
http://www.w3.org/TR/DOM-Level-2-Core/
http://savannah.gnu.org/projects/gnu-regexp/
http://jaxen.org/
http://junit.org/
http://www.gnu.org/licenses/lgpl.html
http://www.iptc.org/site/NewsML/specification/NewsMLv1.0.pdf
http://www.saxproject.org/
http://www.saxpath.org/
http://newsml-toolkit.sourceforge.net/
http://xml.apache.org/xerces-j/
http://www.w3.org/TR/xpath
http://www.w3.org/TR/xslt