The XML Files: Introducing XPath 2.0

Article
10/22/2019

Aaron Skonnard

Contents

XPath 1.0 Limitations
XPath 2.0
One Data Model to Rule Them All
Sequences
Set Operations
Sequence Processing
Sequences and Predicates
Sequence Coercions
Comparisons
Explicit Quantification
Other Language Enhancements
Improved String Support
One Person's Trash, Another's Treasure

Over two years ago, in one of the first installments of this column, I wrote about XPath version 1.0 (see The XML Files: Addressing Infosets with XPath for a quick review). As stated in the original specification: "XPath is a language for addressing parts of an XML document." Using XPath expressions, developers can easily identify nodes in an XML document for further processing. This has made it possible to replace complex traversal algorithms with simple declarative expressions. For example, the following expression identifies all child elements of every LineItem element that has a Sku attribute value of 123 and is a descendant of the root Invoice element:

/Invoice//LineItem[@Sku='123']/*

Writing the same logic with a traditional XML API would be extremely tedious and difficult to get right. As a result, XPath is commonly supported as a layered service in various API implementations today (such as DOM, XPathNavigator, and so on):

// C# DOM Code
XmlNodeList nodes = doc.SelectNodes("/Invoice//LineItem[@Sku='123']/*");
for (int i = 0; i < nodes.Count; i++)
{
    ... // process selection here
}
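For readers outside .NET, here is a rough sketch of the same selection using Python's standard xml.etree.ElementTree module, which supports only a small subset of XPath 1.0. The sample invoice and the helper name select_line_item_children are invented for illustration. Because ElementTree evaluates paths relative to an element, the absolute path is rewritten relative to the Invoice root.

```python
import xml.etree.ElementTree as ET

# Hypothetical invoice document, made up for this sketch.
INVOICE_XML = """
<Invoice>
  <LineItems>
    <LineItem Sku="123"><Description>Widget</Description><Price>9.95</Price></LineItem>
    <LineItem Sku="456"><Description>Gadget</Description><Price>19.95</Price></LineItem>
  </LineItems>
</Invoice>
"""


def select_line_item_children(xml_text, sku):
    """Return the child elements of every LineItem whose Sku attribute equals sku."""
    root = ET.fromstring(xml_text)  # root is the Invoice element
    # Relative form of /Invoice//LineItem[@Sku='...']/*
    return root.findall(f".//LineItem[@Sku='{sku}']/*")


selected_tags = [child.tag for child in select_line_item_children(INVOICE_XML, "123")]
print(selected_tags)
```

The point is the same as in the C# version: one declarative path replaces a hand-written traversal.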

XPath is also used heavily in XSLT 1.0 for addressing input documents, as illustrated here:

<xsl:apply-templates select="/Invoice//LineItem[@Sku='123']/*"/>

Overall, XPath is a huge win for XML developers and has been quickly embraced as an essential part of their programming toolkit. The rest of this column assumes familiarity with XPath 1.0 (see https://www.w3.org/TR/xpath).

XPath 1.0 Limitations

Although XPath 1.0 simplified many common programming tasks, developers were left wanting more. The XPath 1.0 specification is limited or confusing in several areas and is in need of a facelift. The W3C has been pushing to add more significant features to the language, mostly in support of the other evolving W3C XML specifications (like XML Schema, XML Query 1.0, and XSLT 2.0).

Since XPath 1.0 was released, XML Schema has become a W3C recommendation and has quickly been positioned as the "official" type system for several other works in progress, such as XQuery and others related to Web Services. Because XML Schema is becoming a more integral part of the XML landscape, the W3C wants to make strongly typed XPath a reality (wouldn't it be nice to be able to select all elements of type double?). Furthermore, the recent work on XQuery 1.0 and XSLT 2.0 uncovered a great deal of common ground: areas in which both languages could share the same data model and expression syntax. This least common denominator has become XPath 2.0.

XPath 2.0

Those chartered to work on XPath 2.0 (https://www.w3.org/TR/XPath20) optimistically set out to fix the problems in XPath 1.0 and fulfill the following list of requirements (summarized from the XPath 2.0 Requirements document at https://www.w3.org/TR/xpath20req):

should maintain backward compatibility
must improve ease of use
must improve string manipulation and matching
must support XML family of standards (XSLT 2.0 and XQuery 1.0)
must support XML Schema (simple and complex types)

Backward compatibility is a "should" rather than a "must" because it is considered less important than the other requirements, which it may occasionally conflict with. Even so, the W3C has done a good job of respecting this requirement overall.

Improving ease of use and simplifying common use cases (such as working with strings) is a clear goal, but the last two requirements drive the more significant changes. In the remainder of this column, I'll highlight the most salient aspects to help you prepare for what's coming.

If you want to experiment with some of the sample expressions as you're reading through the column, download Michael Kay's SAXON 7.2 XSLT processor, which includes a fairly complete reference implementation (see https://saxon.sourceforge.net/saxon7.2/), with the exception of full XML Schema support. None of the Microsoft XML processors (or APIs) support XPath 2.0 as of the time of this writing, but it's expected along with their XQuery 1.0 and XSLT 2.0 implementations. There are no specific release dates available, but it would make sense to align it with the release of the specifications (probably in the coming year).

One Data Model to Rule Them All

The most profound change in XPath 2.0 can be found deep inside its data model. The details of the XPath 2.0 data model are found in the XQuery 1.0 and XPath 2.0 Data Model specification, which is shared by XPath 2.0, XSLT 2.0, and XQuery 1.0. The XPath 2.0 data model is based on the Infoset with the necessary extensions to support XML Schema. The data model defines seven node types that make up the structure of an XML document: document (root), element, text, attribute, namespace, processing instruction, and comment nodes.

The node types are similar to those found in the XPath 1.0 data model, but in the case of elements and attributes they've been extended to include XML Schema type information after validation occurs. The resulting "typed" data model is referred to as the Post Schema-Validation Infoset (PSVI).

In addition to the basic node types, XPath 1.0 defines four basic types: node-set, string, number, and Boolean. The string, number, and Boolean types assist in processing text values found in elements or attributes, making it possible to treat text values as specific types when necessary. For example, consider the following versions of a price element:

<price>10</price>
<price>10.0</price>
<price>10.00</price>

Testing for a specific value in a text-only world is quite cumbersome since you have to check all possible lexical formats:

price = '10' or price = '10.0' or price = '10.00'

But if you are armed with a numeric type, you can use the following expression that will take care of coercing the text into a numeric value automatically:

number(price) = 10

This shields you from having to deal with the various lexical formats that might have been used in the document.
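To make the lexical-versus-value distinction concrete, here is a small Python sketch. It only models the idea: Python's float conversion stands in for the XPath number function, and the wrapping prices element is an assumption added so the three samples form one document.

```python
import xml.etree.ElementTree as ET

# Three lexical forms of the same numeric value, wrapped in a made-up parent element.
PRICES_XML = "<prices><price>10</price><price>10.0</price><price>10.00</price></prices>"

prices = ET.fromstring(PRICES_XML).findall("price")

# Text comparison: only the exact lexical form '10' matches.
text_matches = [p.text == "10" for p in prices]

# Numeric comparison: every lexical form coerces to the same value.
numeric_matches = [float(p.text) == 10 for p in prices]

print(text_matches)
print(numeric_matches)
```

Only the first element passes the string test, while all three pass once the text is coerced to a number.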

In XPath 2.0, the atomic type system is based on this same principle, but it has been extended to include all of the data types defined in the XML Schema Datatypes spec (see https://www.w3.org/TR/xmlschema-2/). Figure 1 illustrates the XML Schema data type hierarchy, where derivation occurs by restriction. There are 19 primitive data types and many more derived types. XPath 2.0 also defines two new subtypes of duration, yearMonthDuration and dayTimeDuration, which simplify working with duration values. All of these types can be used in XPath 2.0.

Figure 1 XML Schema Datatype Hierarchy

As before, the XPath 2.0 type system shields developers from most lexical details, allowing them to deal primarily in value-spaces, but there are many more value-spaces to choose from now. You can access a node's typed value through the data function and you can construct typed values (or cast values) with the various constructor functions defined in the XQuery 1.0 and XPath 2.0 Functions and Operators specification. For example, the following expression tests whether any birthdate element's value (of type xs:date, where xs is bound to the XML Schema namespace) comes before January 1, 1972:

data(birthdate) < xs:date('1972-01-01')
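The comparison above works on date values rather than on strings. A loose Python analogy, with invented sample data and a hypothetical helper named any_before, parses each lexical form into a date and then asks whether any of them precedes the cutoff:

```python
from datetime import date

# Hypothetical birthdate values in the xs:date lexical form (YYYY-MM-DD).
birthdates = ["1969-07-20", "1985-03-14"]


def any_before(lexical_dates, cutoff):
    """True if at least one date parses to a value earlier than cutoff."""
    return any(date.fromisoformat(text) < cutoff for text in lexical_dates)


print(any_before(birthdates, date(1972, 1, 1)))
```

This is only an analogy for the typed comparison; it ignores time zones, which xs:date values may carry.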

In addition to being able to work with strongly typed values, the PSVI makes it possible to take any element or attribute and access its type information at run time, just as you can access basic name information today. Elements can be bound to either simple or complex types, while attributes are always bound to simple types. There are several new operators that facilitate type inspection and coercion: instance of, cast as, and treat as.

The instance of operator allows you to verify that the operand is of a specific type, as illustrated in the equivalent expressions that are shown here:

LineItem/node()[. instance of element of type xsd:double]
LineItem/*[. instance of xsd:double]

Both expressions identify all children of the LineItem that are elements of type xsd:double. The cast as operator makes it possible to explicitly convert a value from one datatype to another. This operator may not be used with user-defined types. The treat as operator, on the other hand, may be used with user-defined types to check whether the value of the supplied expression is an instance of the supplied data type. If it is, the treat expression returns the value of the expression; otherwise it generates an error. Treat is typically used to guarantee that the dynamic type of a value is a particular subtype of the value's static type. The XQuery 1.0 and XPath 2.0 Functions and Operators specification defines numerous coercion rules for casting between types. This is probably the most complicated and overwhelming area of the specification.

The other major change in the data model has to do with node-sets, or lack thereof. In XPath 1.0 certain expressions returned node-sets while others returned atomic values. In XPath 2.0 everything is a sequence.

Sequences

A sequence is just what its name suggests: an ordered series of zero or more items. An item may be an atomic value or one of seven node types. When a node is placed in a sequence, it maintains its identity and can be duplicated in the same sequence.

Sequences are much different from XPath 1.0 node-sets, which are unordered and contain only nodes without duplicates. Although node-sets are unordered, they are still processed by XSLT 1.0 in document order. To help maintain backward compatibility, XPath 2.0 path expressions always produce sequences in document order without duplicates (this guarantee is not made for other expression types). And in XSLT 2.0, sequences are simply processed according to the order of the sequence. As you'll see shortly, sequences give you much more control and flexibility in your processing.

Every value in XPath 2.0 is a sequence. Sequences are written within parentheses using a comma delimiter, as shown here:

("Nathan", 1.32e0, true(), xs:date('2001-05-24'))

This sequence consists of four values of types xs:string, xs:double, xs:boolean, and xs:date, respectively. The following sequence consists of a list of Sku elements, followed by a list of Price elements, followed by a list of Description elements:

(//Sku, //Price, //Description)

The empty sequence is written as: (). Even atomic values are considered sequences with a length of 1 (referred to as singleton sequences). Singleton sequences can be written as ("Nathan") or "Nathan". There is no distinction between the two.

Another important characteristic of sequences is that they're flat, meaning sequences may not contain other sequences. For example, the following three sequences are identical:

(1, 2, 3, 4)
((1, 2), (3, 4))
(((1), (2, 3), (4)))
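Python tuples do nest, so the flattening rule has to be modeled explicitly there. The following sketch (the flatten helper is hypothetical) shows that three differently nested inputs collapse to the same flat list of items, which is the behavior described above:

```python
def flatten(sequence):
    """Flatten arbitrarily nested tuples into a single flat list of items."""
    items = []
    for item in sequence:
        if isinstance(item, tuple):
            items.extend(flatten(item))  # nested sequences dissolve into the parent
        else:
            items.append(item)
    return items


a = flatten((1, 2, 3, 4))
b = flatten(((1, 2), (3, 4)))
c = flatten((((1,), (2, 3), (4,)),))  # Python needs (1,) to spell a one-item tuple
print(a == b == c)
```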

Since the entire language revolves around sequences, there are a variety of operators and functions available for working with them.

Set Operations

I've already introduced you to one sequence operator, the comma (","). The comma represents the concatenation operator, which simply concatenates two sequences together, maintaining order. There are several additional set-based operators for working with sequences: the pipe ("|"), intersect, and except.

The | operator returns the union of the two sequences, removing duplicates (as in XPath 1.0). For example, given that $node1, $node2, $node3, and $node4 all reference distinct nodes, the following example illustrates a union:

($node1, $node2, $node3) | ($node2, $node3, $node4) -> ($node1, $node2, $node3, $node4)

The intersect operator returns the intersection of the two sequences, removing duplicates:

($node1, $node2, $node3) intersect ($node2, $node3, $node4) -> ($node2, $node3)

And the except operator returns every node that occurs in the first sequence but not in the second sequence, removing duplicates:

($node1, $node2, $node3) except ($node2, $node3, $node4) -> ($node1)
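The three operators compare nodes by identity, not by value. Here is a minimal Python sketch of that idea, using ElementTree elements as stand-ins for $node1 through $node4. The function names are hypothetical, and returning results in document order is an assumption of this sketch (it matches how the final XPath 2.0 recommendation defines these operators).

```python
import xml.etree.ElementTree as ET

# Four distinct nodes standing in for $node1 .. $node4.
root = ET.fromstring("<r><n1/><n2/><n3/><n4/></r>")
node1, node2, node3, node4 = list(root)
document_order = {id(node): position for position, node in enumerate(root.iter())}


def _dedupe_in_document_order(nodes):
    unique = {id(node): node for node in nodes}  # keyed by identity, not value
    return sorted(unique.values(), key=lambda node: document_order[id(node)])


def union(left, right):
    return _dedupe_in_document_order(list(left) + list(right))


def intersect(left, right):
    right_ids = {id(node) for node in right}
    return _dedupe_in_document_order(n for n in left if id(n) in right_ids)


def except_(left, right):
    right_ids = {id(node) for node in right}
    return _dedupe_in_document_order(n for n in left if id(n) not in right_ids)


first, second = [node1, node2, node3], [node2, node3, node4]
print([n.tag for n in union(first, second)])
print([n.tag for n in intersect(first, second)])
print([n.tag for n in except_(first, second)])
```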

In addition to these set operations, there are several other functions that facilitate working with sequences. For more information on operators, see https://www.w3.org/TR/xquery-operators.

Sequence Processing

XPath 2.0 provides the expected functions for basic list manipulation: insert, remove, item-at, index-of, and subsequence. The first two allow you to add and remove items from the sequence, as shown here:

insert((1, 3, 4), 2, 2) -> (1, 2, 3, 4)
remove((1, 2, 3), 2) -> (1, 3)

The item-at function returns the item at the supplied index (equivalent to using a simple position predicate [index]), while the index-of function returns the positions of the items that match the supplied value:

index-of((10, 20, 30), 20) -> 2

The subsequence function returns the contiguous run of items identified by the supplied starting position and length (as another sequence).
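One detail that trips up programmers coming from C-style languages is that sequence positions start at 1. The Python sketch below models insert, remove, item-at, and index-of on plain lists under that convention. The function names and argument order mirror the examples above, but the implementations are illustrative assumptions, not the specification's definitions.

```python
def insert(seq, position, new_items):
    """Insert new_items before the item at 1-based position."""
    seq, new_items = list(seq), list(new_items)
    return seq[:position - 1] + new_items + seq[position - 1:]


def remove(seq, position):
    """Drop the item at 1-based position."""
    seq = list(seq)
    return seq[:position - 1] + seq[position:]


def item_at(seq, position):
    """Return the item at 1-based position."""
    return list(seq)[position - 1]


def index_of(seq, value):
    """Return every 1-based position whose item equals value."""
    return [i for i, item in enumerate(seq, start=1) if item == value]


print(insert((1, 3, 4), 2, (2,)))
print(remove((1, 2, 3), 2))
print(item_at((10, 20, 30), 2))
print(index_of((10, 20, 30, 20), 20))
```

Note that index_of returns a list of positions, since the same value can occur more than once in a sequence.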

There are several ways to check the length of a sequence. You can use exists, empty, and count. Exists returns true if the sequence is not empty; it returns false otherwise. Empty returns true if the sequence is empty; it returns false otherwise. Count returns the number of items in the sequence, as it does in XPath 1.0.

empty(()) -> true
exists((1, 2, 3)) -> true
count((1, 2, 3)) -> 3

There are also several functions for performing math on a sequence: sum, avg, min, and max:

sum((1, 2, 3)) -> 6
avg((1, 2, 3)) -> 2
min((1, 2, 3)) -> 1
max((1, 2, 3)) -> 3

My favorite sequence-related function is distinct-values. It returns a new sequence containing only items with unique values (it removes items with duplicate values, just like SELECT DISTINCT in SQL). It assumes that the sequence either contains all nodes or all values, otherwise an error occurs. The following expression identifies all of the unique Sku elements in the document:

distinct-values(//Sku)

This is more convenient than the XPath 1.0 equivalent:

//Sku[not(preceding::Sku = .)]

There is also a distinct-nodes function, which removes duplicate nodes based on identity, not value.
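For comparison, value-based de-duplication can be sketched in Python in one line. The distinct_values helper and the sample Sku values are made up, and keeping the first occurrence of each value in its original order is a choice of this sketch rather than something the function is described as guaranteeing.

```python
def distinct_values(values):
    """Keep the first occurrence of each value, preserving order."""
    return list(dict.fromkeys(values))


skus = ["123", "456", "123", "789", "456"]
print(distinct_values(skus))
```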

Sequences and Predicates

In XPath 1.0 it's possible to apply predicates to specific node-sets through parentheses. Consider the following expression:

//Price[1]

At first glance it appears that this expression returns the first Price element in document order, but in fact it returns every Price element that is the first Price child of its parent (the predicate binds more tightly than the // abbreviation, so it is applied within each parent). You can fix this by using parentheses around the node-set you want to filter:

(//Price)[1]

This same model applies to sequences. You can use predicates to filter out unwanted items. Applying a predicate to a sequence returns a new sequence containing only the items that satisfy the predicate expression. For example, the following predicate tests the item's position (equivalent to item-at):

(10, 20, 30)[2] -> (20)

This example uses a slightly more sophisticated predicate to check the number of children along with the value of the id attribute:

$seq[count(*) > 1 and @id < 1000]

In general, everything you learned about predicates in XPath 1.0 also applies to using them on sequences in XPath 2.0.
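The precedence difference between //Price[1] and (//Price)[1] is easy to see in a quick experiment. This sketch uses Python's xml.etree.ElementTree, whose limited XPath support applies a positional predicate per parent just as XPath 1.0 does; the catalog document is invented, and list indexing stands in for the parenthesized form, which ElementTree does not support.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog with two parents holding Price children.
CATALOG_XML = (
    "<Catalog>"
    "<Item><Price>1.00</Price><Price>2.00</Price></Item>"
    "<Item><Price>3.00</Price></Item>"
    "</Catalog>"
)
root = ET.fromstring(CATALOG_XML)

# Like //Price[1]: the predicate is applied within each parent.
first_per_parent = [p.text for p in root.findall(".//Price[1]")]

# Like (//Price)[1]: select everything first, then take the first item.
first_overall = root.findall(".//Price")[0].text

print(first_per_parent)
print(first_overall)
```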

Sequence Coercions

If you use a sequence in a context where another type is required, an implicit coercion is attempted. The rules for coercing sequences to strings, numbers, and Booleans are mostly the same as those found in XPath 1.0. For example, if you supply a sequence where a string is expected, the value of the first item will be coerced to a string and used. The rules for evaluating the string-value of a node are as follows: for text/attributes it's just the value, while for elements it's the concatenation of descendant text nodes. If you supply a sequence where a number is expected, it's first coerced to a string, which is then coerced to a double. The only rule that's a bit different is the one for coercing to Boolean. If the sequence is empty, it's coerced to false. A sequence containing a single Boolean value is treated as that value; otherwise, a non-empty sequence containing at least one node is treated as true.

Functions exist for explicitly coercing between the string, number, and Boolean types as in XPath 1.0 and there are several additional functions for coercing to the various XML Schema data types (xsd:date, xsd:unsignedInteger, and so on) as mentioned earlier in this column.

Comparisons

One of the nastier areas of XPath 1.0 is how it evaluates node-set comparisons. For example, take a look at the following simple expression:

Catalog/Prices/Price = 9.95

This evaluates to true if at least one Price element exists (under Catalog/Prices) with a numeric value equal to 9.95. When sequences are used in Boolean expressions like this, the result is true if the comparison is true for any of its members, otherwise known as existential quantification. Interestingly, this applies to any comparison operator (<, <=, >, >=, and so on), even the not equal (!=) operator, as illustrated in this line:

Catalog/Prices/Price != 9.95

This evaluates to true if at least one Price element exists (under Catalog/Prices) with a numeric value that does not equal 9.95. Hence, it's quite possible, if not probable, that both comparisons will return true for the same node-set. Needless to say, this has been the cause of much confusion.

It is possible to test all items in a sequence, otherwise known as universal quantification, by using the Boolean not function. For example, to confirm that all Price elements have a value of 9.95, you use the logical not of the last expression:

not(Catalog/Prices/Price != 9.95)

To verify that not a single Price element has a numeric value equal to 9.95, you have to use the logical not of the first expression:

not(Catalog/Prices/Price = 9.95)
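If the four comparisons above still feel slippery, it may help to see them spelled out with Python's any function, which has the same "for any" flavor. The price list is invented sample data and the variable names are only labels for this sketch.

```python
# Hypothetical numeric values of the Price elements under Catalog/Prices.
prices = [9.95, 12.50, 9.95]

any_equal = any(p == 9.95 for p in prices)       # Price = 9.95
any_not_equal = any(p != 9.95 for p in prices)   # Price != 9.95
all_equal = not any(p != 9.95 for p in prices)   # not(Price != 9.95)
none_equal = not any(p == 9.95 for p in prices)  # not(Price = 9.95)

print(any_equal, any_not_equal, all_equal, none_equal)
```

With this data both = and != come out true, which is exactly the situation that confuses people. Note also that for an empty list both negated forms are true, since there is no item to contradict them.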

In XPath 2.0 they've tried to eliminate as much of this confusion as possible by distinguishing between different comparison expressions and making it possible to explicitly choose between existential and universal quantification. General comparison expressions, which use the standard =, !=, <, <=, >, and >= operators, keep the XPath 1.0 default behavior I've described (existential quantification or "for any" semantics) to preserve backward compatibility. They also added a few new functions, sequence-deep-equal and sequence-node-equal, for comparing two sequences based on either value or node identity, respectively.

XPath 2.0 defines a separate mechanism for simple value comparisons. Value comparisons use the operators eq, ne, lt, le, gt, and ge (the counterparts of =, !=, <, <=, >, and >=), but they compare single atomic values rather than whole sequences. XPath 2.0 provides some additional comparisons for dealing with single nodes. For instance, the is and isnot operators test node identity:

Price[1] is Price[1] -> true
Price[1] isnot Sku[1] -> true

You can also test node ordering using the << and >> operators. The << operator returns true if the first operand node comes before the second operand node in document order. And >> does the reverse. Since value and node expressions expect singletons, the "for any" versus "for all" semantics simply don't come into play.

Explicit Quantification

XPath 2.0 also makes it possible to write explicit quantified expressions through a new expression type. You can express whether you want to use existential (some) or universal (every) quantification when applying the constraining expression (satisfies) to the nodes in the supplied sequence:

some $item in //LineItem satisfies (($item/Price * $item/Quantity) > 100)
every $item in //LineItem satisfies (($item/Price * $item/Quantity) > 100)

The first expression returns true if there's at least one LineItem element whose extended price (Price * Quantity) is greater than 100. The second expression returns true only if every LineItem's extended price is greater than 100.

Since the constraining expression can be anything, this model provides much more flexibility than =, !=, <, and the others. Furthermore, there's no limit to the number of sequences involved in the expression, as illustrated by this example that inspects the Cartesian product of the supplied sequences:

some $x in (1, 2, 3), $y in (2, 3, 4) satisfies $x + $y = 4
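Quantified expressions map closely onto Python's any and all, which makes for a handy mental model. In this sketch the line items are invented (price, quantity) pairs, and itertools.product plays the role of the Cartesian product over two sequences.

```python
from itertools import product

# Hypothetical line items as (price, quantity) pairs.
line_items = [(60.0, 2), (25.0, 3)]

# some $item in //LineItem satisfies (Price * Quantity) > 100
some_over_100 = any(price * qty > 100 for price, qty in line_items)

# every $item in //LineItem satisfies (Price * Quantity) > 100
every_over_100 = all(price * qty > 100 for price, qty in line_items)

# some $x in (1, 2, 3), $y in (2, 3, 4) satisfies $x + $y = 4
some_pair_sums_to_4 = any(x + y == 4 for x, y in product((1, 2, 3), (2, 3, 4)))

print(some_over_100, every_over_100, some_pair_sums_to_4)
```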

Taken together, these comparison features make XPath 2.0 easier to reason about and more flexible than XPath 1.0.

Other Language Enhancements

You can now supply multiple node tests in a single step, an enhancement to the way location steps work in XPath 2.0:

LineItems/(Sku|Price)/text()

In XPath 1.0, you would take the union of the following two expressions to achieve the same result:

LineItems/Sku/text() | LineItems/Price/text()

XPath 2.0 also provides a "for" expression for iteration purposes. This type of expression is useful for generating new sequences from multiple sources (such as joins) and is similar to explicit quantification expressions. For example, the following expression generates a new sequence containing the extended prices of each LineItem element:

for $i in //LineItem return ($i/Price * $i/Quantity)

This becomes extremely useful because you can pass the new sequence into another function for further processing. For example, the following expression calculates the invoice total:

sum(for $i in //LineItem return ($i/Price * $i/Quantity))

This wasn't possible at all in XPath 1.0 and was extremely tedious in XSLT 1.0 (requiring recursion and temporary result tree fragments). For expressions can be used for even more sophisticated transformations. This example generates a list of salespeople and the products they've sold:

for $sp in distinct-values(//Salesperson) return ($sp, for $item in //LineItem[Salesperson = $sp] return $item/Description)

The resulting sequence contains the salespeople, each followed by the Description elements of the LineItem elements that salesperson sold.
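The "for" expressions in this section behave much like list comprehensions. The Python sketch below reworks the extended-price, invoice-total, and salesperson examples against an invented invoice document. Decimal is used for the prices as an assumption of this sketch to avoid floating-point noise, and the result of the last loop is deliberately one flat list, mirroring the flat sequence the XPath expression produces.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Hypothetical invoice; element names follow the expressions above.
INVOICE_XML = (
    "<Invoice>"
    "<LineItem><Salesperson>Ann</Salesperson><Description>Widget</Description>"
    "<Price>9.95</Price><Quantity>2</Quantity></LineItem>"
    "<LineItem><Salesperson>Bob</Salesperson><Description>Gadget</Description>"
    "<Price>100.00</Price><Quantity>1</Quantity></LineItem>"
    "<LineItem><Salesperson>Ann</Salesperson><Description>Gizmo</Description>"
    "<Price>5.00</Price><Quantity>3</Quantity></LineItem>"
    "</Invoice>"
)
line_items = list(ET.fromstring(INVOICE_XML).iter("LineItem"))

# for $i in //LineItem return ($i/Price * $i/Quantity)
extended_prices = [
    Decimal(item.findtext("Price")) * int(item.findtext("Quantity"))
    for item in line_items
]

# sum(for $i in //LineItem return ($i/Price * $i/Quantity))
invoice_total = sum(extended_prices)

# Each distinct salesperson, followed by the descriptions of what that person sold.
sales_report = []
for person in dict.fromkeys(item.findtext("Salesperson") for item in line_items):
    sales_report.append(person)
    sales_report.extend(
        item.findtext("Description")
        for item in line_items
        if item.findtext("Salesperson") == person
    )

print(invoice_total)
print(sales_report)
```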

XPath 2.0 also provides a built-in conditional expression (if). The syntax consists of the traditional if/then/else keywords. The following expression calculates a discounted price based on the magnitude of the original price:

if ($Price > 100) then ($Price * .90) else ($Price * .95)

These new language constructs can be used anywhere an XPath expression is expected.

Improved String Support

In addition to these enhancements, XPath 2.0 also beefed up its string processing support with several new functions: upper-case, lower-case, string-pad, matches, replace, and tokenize. Before upper-case and lower-case, the only way to uppercase or lowercase a string was through the tedious and inefficient translate function. And before the string-pad function, you had to resort to recursion for such things:

upper-case('Michael') -> 'MICHAEL'
string-pad('-', 7) -> '-------'

The last three functions bring regular expressions to the table. The matches function allows you to test whether the supplied input string matches the supplied regular expression. The following sample checks the SSNumber element against the standard format for U.S. Social Security numbers:

matches(SSNumber, '\d{3}-\d{2}-\d{4}')

The replace function makes it possible to replace substrings matched by a pattern, while the tokenize function facilitates breaking strings into substrings separated by the supplied pattern. Regular expression processing adds a whole new level of flexibility to the language.
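If you've used regular expressions elsewhere, these three functions will feel familiar. As a rough analogy, here is how matches, replace, and tokenize line up with Python's re module. The sample strings and the matches helper are invented, and Python's regex dialect differs in places from the XML Schema-based syntax XPath uses, so treat this as a sketch of the idea only. The helper uses re.search because the final XPath 2.0 matches function finds a match anywhere in the input unless the pattern is anchored.

```python
import re

SSN_PATTERN = r"\d{3}-\d{2}-\d{4}"


def matches(text, pattern):
    """True if pattern matches anywhere in text (unanchored)."""
    return re.search(pattern, text) is not None


is_ssn = matches("123-45-6789", SSN_PATTERN)

# replace: substitute every substring matched by the pattern.
masked = re.sub(r"\d", "#", "123-45-6789")

# tokenize: split on a separator pattern.
tokens = re.split(r",\s*", "red, green,blue")

print(is_ssn, masked, tokens)
```

To insist that the whole string has the Social Security format, anchor the pattern with ^ and $.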

One Person's Trash, Another's Treasure

XPath 2.0 is positioned center-stage in the family of XML specifications. It serves as the official addressing language to be used by other specifications, like XQuery 1.0, XSLT 2.0, and potentially many others. XPath 2.0 improves usability while increasing functionality and maintaining backward compatibility as much as possible. It also adds support for XML Schema, which promises added value to developers working in strongly typed worlds.

At this point, it wouldn't be proper to leave out the fact that XPath 2.0 has its fair share of opponents. There are many XML developers (especially those deep in the trenches of XSLT) who are put off by the increased complexity of XPath 2.0. Their biggest complaint has to do with requiring support for XML Schema instead of making it an additional, optional layer. This long-running XPath 2.0 debate has recently boiled over into some pretty heated political discussions.

Developers working heavily with databases or Web Services will undoubtedly find the benefit worth the price. But I do feel sympathy for the XSLT wonks who want the new-and-improved XPath without having strong-typing forced upon them. How this will ultimately shake out is hard to tell. If you have an opinion on the matter, or would like to share any other comments related to the language, send them to the public W3C XPath/XQuery mailing list at public-qt-comments@w3.org.

Send your questions and comments for Aaron to xmlfiles@microsoft.com.

Aaron Skonnard is an instructor/researcher at DevelopMentor, where he develops the XML and Web Service-related curriculum. Aaron coauthored Essential XML Quick Reference (Addison-Wesley, 2001) and Essential XML (Addison-Wesley, 2000).
