
Data Modeling Using xml Data Type
This section discusses data modeling topics for native XML storage. These include indexing XML data, property promotion, and typed xml data type.
Same or Different Table
An xml data type column can be created in a table that contains other relational columns, or in a separate table with a foreign key relationship to a main table.
Create an xml data type column in the same table when one of the following conditions is true:
-
Your application performs data retrieval on the XML column and does not require an XML index on the XML column.
-
You want to build an XML index on the xml data type column and the primary key of the main table is the same as its clustering key. For more information, see Indexing an xml Data Type Column.
Create the xml data type column in a separate table if the following conditions are true:
-
You want to build an XML index on the xml data type column, but the primary key of the main table is different from its clustering key, or the main table does not have a primary key, or the main table is a heap (no clustering key). This may be true if the main table already exists.
-
You do not want table scans to slow down because of the presence of the XML column in the table. This uses space whether it is stored in-row or out-of-row.
Granularity of XML Data
The granularity of the XML data stored in an XML column is very important for locking and, to a lesser degree, it is also important for updates. SQL Server uses the same locking mechanism for both XML and non-XML data. Therefore, row-level locking causes all XML instances in the row to be locked. When the granularity is large, locking large XML instances for updates causes throughput to decline in a multiuser scenario. On the other hand, severe decomposition loses object encapsulation and increases reassembly cost.
A balance between data modeling requirements and locking and update characteristics is important for good design. However, in SQL Server 2005, the size of actual stored XML instances is not as critical.
For example, updates to an XML instance are performed by using new support for partial binary large object (BLOB) and partial index updates in which the existing stored XML instance is compared to its updated version. Partial binary large object (BLOB) update performs a differential comparison between the two XML instances and updates only the differences. Partial index updates modify only those rows that must be changed in the XML index.
Untyped, Typed, and Constrained xml Data Type
The SQL Server 2005 the xml data type implements the ISO SQL-2003 standard xml data type. Therefore, it can store well-formed XML version 1.0 documents and also so-called XML content fragments with text nodes and an arbitrary number of top-level elements in an untyped XML column. The system checks that the data is well formed, does not require the column to be bound to XML schemas, and rejects data that is not well formed in the extended sense. This is true also of untyped XML variables and parameters.
If you have XML schemas describing your XML data, you can associate the schemas with the XML column to yield typed XML. The XML schemas are used to validate the data, perform more precise type checks during compilation of query and data modification statements than untyped XML, and optimize storage and query processing.
Use untyped xml data type in the following situations:
-
You do not have a schema for your XML data.
-
You have schemas, but you do not want the server to validate the data. This is sometimes the case when an application performs client-side validation before storing the data at the server, or temporarily stores XML data that is invalid according to the schema, or uses schema components unsupported at the server (for example, key/keyref).
Use typed xml data type in the following situations:
-
You have schemas for your XML data and you want the server to validate your XML data according to the XML schemas.
-
You want to take advantage of storage and query optimizations based on type information.
-
You want to take better advantage of type information during compilation of your queries.
Typed XML columns, parameters, and variables can store XML documents or content. However, you have to specify with a flag whether you are storing a document or content at the time of declaration. Additionally, you have to provide the collection of XML schemas. Specify DOCUMENT if each XML instance has exactly one top-level element. Otherwise, use CONTENT. The query compiler uses the DOCUMENT flag in type checks during query compilation to infer singleton top-level elements.
Besides typing an XML column, you can use relational (column or row) constraints on typed or untyped xml data type columns. Use constraints in the following situations:
-
Your business rules cannot be expressed in XML schemas. For example, the delivery address of a flower shop must be within 50 miles of its business location. This can be written as a constraint on the XML column. The constraint may involve xml data type methods.
-
Your constraint involves other XML or non-XML columns in the table. An example is the enforcement of the ID of a Customer (/Customer/@CustId) found in an XML instance to match the value in a relational CustomerID column.
Document Type Definition (DTD)
The xml data type columns, variables, and parameters can be typed by using XML schema, but not by using DTD. However, inline DTD can be used for both untyped and typed XML to supply default values and to replace entity references with their expanded form.
You can convert DTDs to XML schema documents by using third-party tools, and load the XML schemas into the database.
Indexing an xml Data Type Column
XML indexes can be created on xml data type columns. It indexes all tags, values and paths over the XML instances in the column and benefits query performance. Your application may benefit from an XML index in the following situations:
-
Queries on XML columns are common in your workload. XML index maintenance cost during data modification must be considered.
-
Your XML values are relatively large and the retrieved parts are relatively small. Building the index avoids parsing the whole data at run time and benefits index lookups for efficient query processing.
The first index on an XML column is the primary XML index. In using it, three types of secondary XML indexes can be created on the XML column to speed up common classes of queries, as described in the following section.
Primary XML Index
This indexes all tags, values, and paths within the XML instances in an XML column. The base table, the table in which the XML column occurs, must have a clustered index on the primary key of the table. The primary key is used to correlate index rows with the rows in the base table. Full XML instances are retrieved from the XML columns, for example, SELECT *. Queries use the primary XML index, and return scalar values or XML subtrees by using the index itself.
Example: Creating a Primary XML Index
Table T (pk INT PRIMARY KEY, xCol XML) with an untyped XML column is used in most of the examples. These can be extended to typed XML in a straightforward way. For more information about how to use typed XML, see xml Data Type.) For simplicity, queries are described for XML data instances as shown in the following:
<book genre="security" publicationdate="2002" ISBN="0-7356-1588-2">
<title>Writing Secure Code</title>
<author>
<first-name>Michael</first-name>
<last-name>Howard</last-name>
</author>
<author>
<first-name>David</first-name>
<last-name>LeBlanc</last-name>
</author>
<price>39.99</price>
</book>
The following statement creates an XML index, called idx_xCol, on the XML column xCol of table T:
CREATE PRIMARY XML INDEX idx_xCol on T (xCol)
Secondary XML Indexes
After the primary XML index has been created, you may want to create secondary XML indexes to speed up different classes of queries within your workload. Three types of secondary XML indexes, PATH, PROPERTY and VALUE, benefit path-based queries, custom property management scenarios, and value-based queries, respectively. The PATH index builds a B+-tree on the (path, value) pair of each XML node in document order over all XML instances in the column. The PROPERTY index creates a B+-tree clustered on the (PK, path, value) pair within each XML instance, where PK is the primary key of the base table. Finally, the VALUE index creates a B+-tree on the (value, path) pair of each node in document order across all XML instances in the XML column.
Following are some guidelines for creating one or more of these indexes:
-
If your workload uses path expressions significantly on XML columns, the PATH secondary XML index is likely to speed up your workload. The most common case is the use of the exist() method on XML columns in the WHERE clause of Transact-SQL.
-
If your workload retrieves multiple values from individual XML instances by using path expressions, clustering paths within each XML instance in the PROPERTY index may be helpful. This scenario typically occurs in a property bag scenario when properties of an object are fetched and its primary key value is known.
-
If your workload involves querying for values within XML instances without knowing the element or attribute names that contain those values, you may want to create the VALUE index. This typically occurs with descendant axes lookups, such as //author[last-name="Howard"], where <author> elements can occur at any level of the hierarchy. It also occurs in wildcard queries, such as /book [@* = "novel"], where the query looks for <book> elements that have some attribute having the value "novel".
Example: Path-based Lookup
For illustration, assume that the following query is common in your workload:
SELECT pk, xCol
FROM T
WHERE xCol.exist ('/book/@genre[.="novel"]') = 1
The path expression /book/@genre and the value "novel" corresponds to the key fields of the PATH index. As a result, secondary XML index of type PATH is helpful for this workload:
CREATE XML INDEX idx_xCol_Path on T (xCol)
USING XML INDEX idx_xCol FOR PATH
Example: Fetching the Properties of an Object
Consider the following query that retrieves the properties genre, title, and ISBN of a book from each row in table T:
SELECT xCol.value ('(/book/@genre)[1]', 'varchar(50)'),
xCol.value ('(/book/title/text())[1]', 'varchar(50)'),
xCol.value ('(/book/@ISBN)[1]', 'varchar(50)')
FROM T
The property index is useful in this case and is created as follows:
CREATE XML INDEX idx_xCol_Property on T (xCol)
USING XML INDEX idx_xCol FOR PROPERTY
Example: Value-based Query
In the following query, the descendant-or-self axis (//) specifies a partial path so that the lookup based on the value of ISBN benefits from the use of the VALUE index:
SELECT xCol
FROM T
WHERE xCol.exist ('//book/@ISBN[. = "0-7356-1588-2"]') = 1
The VALUE index is created as follows:
CREATE XML INDEX idx_xCol_Value on T (xCol)
USING XML INDEX idx_xCol FOR VALUE
Full-Text Index on an XML Column
You can create a full-text index on XML columns that indexes the content of the XML values, but ignores the XML markup. Attribute values are not full-text indexed, because they are considered part of the markup, and element tags are used as token boundaries. When possible, you can combine full-text search with XML index in the following way:
-
First, filter the XML values of interest by using SQL full-text search.
-
Next, query those XML values that use XML index on the XML column.
Example: Combining Full-text Search with XML Querying
After the full-text index has been created on the XML column, the following query checks that an XML value contains the word "custom" in the title of a book:
SELECT *
FROM T
WHERE CONTAINS(xCol,'custom')
AND xCol.exist('/book/title/text()[contains(.,"custom")]') =1
The contains() method uses the full-text index to subset the XML values that contain the word "custom" anywhere in the document. The exist() clause ensures that the word "custom" occurs in the title of a book.
A full-text search that uses contains() and XQuery contains() has different semantics. The latter is a substring match and the former is a token match that uses stemming. Therefore, if the search is for the string that has "run" in the title, the matches will include "run", "runs", and "running", because both the full-text contains() and the Xquery contains() are satisfied. However, the query does not match the word "customizable" in the title in that the full-text contains() fails, but the Xquery contains () is satisfied. Generally, for pure substring match, the full-text contains() clause should be removed.
Additionally, full-text search uses word stemming, but XQuery contains() is a literal match. This difference is illustrated in the next example.
Example: Full-text Search on XML Values Using Stemming
The XQuery contains() check that was performed in the previous example generally cannot be eliminated. Consider this query:
SELECT *
FROM T
WHERE CONTAINS(xCol,'run')
The word "ran" in the document matches the search condition because of stemming. Additionally, the search context is not checked by using XQuery.
When XML is decomposed into relational columns by using AXSD that are full-text indexed, XPath queries that occur over the XML view do not perform full-text search on the underlying tables.
Property Promotion
If queries are made principally on a small number of element and attribute values, you may want to promote those quantities into relational columns. This is helpful when queries are issued on a small part of the XML data while the whole XML instance is retrieved. Creating an XML index on the XML column is not required. Instead, the promoted column can be indexed. Queries must be written to use the promoted column. That is, the query optimizer does not target again the queries on the XML column to the promoted column.
The promoted column can be a computed column in the same table or it can be a separate, user-maintained column in a table. This is sufficient when singleton values are promoted from each XML instance. However, for multi-valued properties, you have to create a separate table for the property, as described in the following section.
Computed Column Based on the xml Data Type
A computed column can be created by using a user-defined function that invokes xml data type methods. The type of the computed column can be any SQL type, including XML. This is illustrated in the following example.
Example: Computed Column Based on the xml Data Type Method
Create the user-defined function for a book ISBN number:
CREATE FUNCTION udf_get_book_ISBN (@xData xml)
RETURNS varchar(20)
BEGIN
DECLARE @ISBN varchar(20)
SELECT @ISBN = @xData.value('/book[1]/@ISBN', 'varchar(20)')
RETURN @ISBN
END
Add a computed column to the table for the ISBN:
ALTER TABLE T
ADD ISBN AS dbo.udf_get_book_ISBN(xCol)
The computed column can be indexed in the usual way.
Example: Queries on a Computed Column Based on xml Data Type Methods
To obtain the <book> whose ISBN is 0-7356-1588-2:
SELECT xCol
FROM T
WHERE xCol.exist('/book/@ISBN[. = "0-7356-1588-2"]') = 1
The query on the XML column can be rewritten to use the computed column as follows:
SELECT xCol
FROM T
WHERE ISBN = '0-7356-1588-2'
You can create a user-defined function to return the xml data type and a computed column by using the user-defined function. However, you cannot create an XML index on the computed, XML column.
Creating Property Tables
You may want to promote some of the multivalued properties from your XML data into one or more tables, create indexes on those tables, and target again your queries to use them. A typical scenario is one in which a small number of properties covers most of your query workload. You can do the following:
-
Create one or more tables to hold the multivalued properties. You may find it convenient to store one property per table and duplicate the primary key of the base table in the property tables for back joining with the base table.
-
If you want to maintain the relative order of the properties, you have to introduce a separate column for the relative order.
-
Create triggers on the XML column to maintain the property tables. Within the triggers, do one of the following:
-
Use xml data type methods, such as nodes() and value(), to insert and delete rows of the property tables.
-
Create streaming table-valued functions in the common language runtime (CLR) to insert and delete rows of the property tables.
-
Write queries for SQL access to the property tables and for XML access to the XML column in the base table, with joins between the tables by using their primary key.
Example: Create a Property Table
For illustration, assume that you want to promote the first name of the authors. Books have one or more authors, so that first name is a multivalued property. Each first name is stored in a separate row of a property table. The primary key of the base table is duplicated in the property table for back join.
create table tblPropAuthor (propPK int, propAuthor varchar(max))
Example: Create a User-defined Function to Generate a Rowset from an XML Instance
The following table-valued function, udf_XML2Table, accepts a primary key value and an XML instance. It retrieves the first name of all authors of the <book> elements and returns a rowset of primary key, first name pairs.
create function udf_XML2Table (@pk int, @xCol xml)
returns @ret_Table table (propPK int, propAuthor varchar(max))
with schemabinding
as
begin
insert into @ret_Table
select @pk, nref.value('.', 'varchar(max)')
from @xCol.nodes('/book/author/first-name') R(nref)
return
end
Example: Create Triggers to Populate a Property Table
The insert trigger inserts rows into the property table:
create trigger trg_docs_INS on T for insert
as
declare @wantedXML xml
declare @FK int
select @wantedXML = xCol from inserted
select @FK = PK from inserted
insert into tblPropAuthor
select * from dbo.udf_XML2Table(@FK, @wantedXML)
The delete trigger deletes the rows from the property table based on the primary key value of the deleted rows:
create trigger trg_docs_DEL on T for delete
as
declare @FK int
select @FK = PK from deleted
delete tblPropAuthor where propPK = @FK
The update trigger deletes the existing rows in the property table corresponding to the updated XML instance and inserts new rows into the property table:
create trigger trg_docs_UPD
on T
for update
as
if update(xCol) or update(pk)
begin
declare @FK int
declare @wantedXML xml
select @FK = PK from deleted
delete tblPropAuthor where propPK = @FK
select @wantedXML = xCol from inserted
select @FK = pk from inserted
insert into tblPropAuthor
select * from dbo.udf_XML2Table(@FK, @wantedXML)
end
Example: Find XML Instances Whose Authors Have the First Name of "David"
The query can be formed on the XML column. Alternatively, it can search the property table for first name "David" and perform a back join with the base table to return the XML instance. For example:
SELECT xCol
FROM T JOIN tblPropAuthor ON T.pk = tblPropAuthor.propPK
WHERE tblPropAuthor.propAuthor = 'David'
Example: Solution Using the CLR Streaming Table-valued Function
This solution is made up of the following steps:
-
Define a CLR class, SqlReaderBase, that implements ISqlReader and generates a streaming, table-valued output by applying a path expression on an XML instance.
-
Create an assembly and a Transact-SQL user-defined function to start the CLR class.
-
Define the insert, update, and delete triggers by using the user-defined function to maintain a property tables.
To do this, you first create the streaming CLR function. The xml data type is exposed as a managed class SqlXml in ADO.NET and supports the CreateReader() method that returns an XmlReader.
Note: |
|---|
|
The example code in this section uses XPathDocument and XPathNavigator. These force you to load all the XML documents into memory. If you are using similar code in your application to process several large XML documents, this code is not scalable. Instead, keep memory allocations small and use streaming interfaces whenever possible. For more information about performance, see Architecture of CLR Integration.
|
public class c_streaming_xml_tvf {
public static ISqlReader streaming_xml_tvf
(SqlXml xmlDoc, string pathExpression) {
return (new TestSqlReaderBase (xmlDoc, pathExpression));
}
}
// Class that implements ISqlReader
public class TestSqlReaderBase : ISqlReader {
XPathNodeIterator m_iterator;
public SqlChars FirstName;
// Metadata for current resultset
private SqlMetaData[] m_rgSqlMetaData;
public TestSqlReaderBase (SqlXml xmlDoc, string pathExpression) {
// Variables for XPath navigation
XPathDocument xDoc;
XPathNavigator xNav;
XPathExpression xPath;
// Set sql metadata
m_rgSqlMetaData = new SqlMetaData[1];
m_rgSqlMetaData[0] = new SqlMetaData ("FirstName",
SqlDbType.NVarChar,50);
//Set up the Navigator
if (!xmlDoc.IsNull)
xDoc = new XPathDocument (xmlDoc.CreateReader());
else
xDoc = new XPathDocument ();
xNav = xDoc.CreateNavigator();
xPath = xNav.Compile (pathExpression);
m_iterator = xNav.Select(xPath);
}
public bool Read() {
bool moreRows = true;
if (moreRows = m_iterator.MoveNext())
FirstName = new SqlChars (m_iterator.Current.Value);
return moreRows;
}
}
Next, create an assembly and a Transact-SQL user-defined function, SQL_streaming_xml_tvf (not shown), that corresponds to the CLR function, streaming_xml_tvf. The user-defined function is used to define the table-valued function, CLR_udf_XML2Table, for rowset generation:
create function CLR_udf_XML2Table (@pk int, @xCol xml)
returns @ret_Table table (FK int, FirstName varchar(max))
with schemabinding
as
begin
insert into @ret_Table
select @pk, FirstName
FROM SQL_streaming_xml_tvf (@xCol, '/book/author/first-name')
return
end
Finally, define triggers as shown in the example, "Create triggers to populate a property table", but replace udf_XML2Table with the CLR_udf_XML2Table function. The insert trigger is shown in the following example:
create trigger CLR_trg_docs_INS on T for insert
as
declare @wantedXML xml
declare @FK int
select @wantedXML = xCol from inserted
select @FK = PK from inserted
insert into tblPropAuthor
select *
from dbo.CLR_udf_XML2Table(@FK, @wantedXML)
The delete trigger is identical to the non-CLR version. However, the update trigger just replaces the function udf_XML2Table() with CLR_udf_XML2Table().
XML Schema Collections
An XML schema collection is a metadata entity that is scoped by a relational schema. It contains one or more XML schemas that may be related, such as through <xs:import>), or that may be unrelated. Individual XML schemas within an XML schema collection are identified by using their target namespace.
An XML schema collection is created by using CREATE XML SCHEMA COLLECTION (Transact-SQL) syntax and providing one or more XML schemas. More XML schema components can be added to an existing XML schema, and more schemas can be added to an XML schema collection by using ALTER XML SCHEMA COLLECTION (Transact-SQL) syntax. XML schema collections can be secured like any SQL object by using the security model in SQL Server 2005.
Multi-Typed Column
An XML schema collection C types an XML column, xCol, according to multiple XML schemas. Additionally, the DOCUMENT and CONTENT flags specify whether XML trees or fragments, respectively, can be stored in column xCol.
For DOCUMENT, each XML instance specifies the target namespace of its top-level element in the instance, and which is typed and validated according to it. For CONTENT, on the other hand, each top-level element can specify any one of the target namespaces in C. The XML instance is validated and typed according to all the target namespaces occurring in an instance.
Schema Evolution
XML schema collection is used to type XML columns, variables, and parameters. It provides a mechanism for XML schema evolution. For illustration, assume that you add an XML schema with target namespace BOOK-V1 to an XML schema collection C. An XML column, xCol typed by using C, can store XML data that conforms to the BOOK-V1 schema.
Next assume that an application wants to extend the XML schema with new schema components, such as complex type definitions and top-level element declarations. These new schema components can be added to the BOOK-V1 schema and do not require revalidation of the existing XML data in column xCol.
Assume that the application later wants to provide a new version of the XML schema and it selects the target namespace BOOK-V2. This XML schema can be added to C. The XML column can store instances of both BOOK-V1 and BOOK-V2, and execute queries and data modification on XML instances that conform to these namespaces.