How to: Create Web Services That Parse the Contents of a Web Page
Web services created using ASP.NET provide an HTML parsing solution that enables developers to parse content from a remote HTML page and programmatically expose the resulting data. For a detailed explanation, see HTML Parsing by ASP.NET XML Web Services.
To specify an operation and input parameters
- Create a Web Services Description Language (WSDL) document, which is typically saved with the file name extension .wsdl. The document's content must be valid XML that conforms to the WSDL schema. For a prototype, you can use a WSDL document dynamically generated for a Web service running on ASP.NET: make a request with ?wsdl appended to the Web service URL.
- Specify the elements that define the operation for each Web service method that parses HTML text. This step and the next one require knowledge of the WSDL format.
- If the parsing method takes input parameters, specify the elements that represent those parameters and associate them with the operation.
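As a rough illustration of what the steps above produce, the following Python sketch builds the ?wsdl request URL and lists the operations and input parameters declared in a small service description. The GetHeading operation, the pageName parameter, and the Service.asmx URL are hypothetical names chosen for the example, and the embedded WSDL is a minimal assumption about how an HTTP GET operation with one string parameter might be declared, not output captured from a real service.

```python
import xml.etree.ElementTree as ET

WSDL_NS = {"wsdl": "http://schemas.xmlsoap.org/wsdl/"}

# Hypothetical service description: one operation with one input parameter.
SAMPLE_WSDL = """<?xml version="1.0"?>
<definitions xmlns="http://schemas.xmlsoap.org/wsdl/"
             xmlns:s="http://www.w3.org/2001/XMLSchema"
             xmlns:s0="http://tempuri.org/"
             targetNamespace="http://tempuri.org/">
  <message name="GetHeadingHttpGetIn">
    <part name="pageName" type="s:string"/>
  </message>
  <message name="GetHeadingHttpGetOut"/>
  <portType name="ParserHttpGet">
    <operation name="GetHeading">
      <input message="s0:GetHeadingHttpGetIn"/>
      <output message="s0:GetHeadingHttpGetOut"/>
    </operation>
  </portType>
</definitions>"""


def wsdl_url(service_url):
    """Return the URL that asks an ASP.NET Web service for its generated WSDL."""
    return service_url + "?wsdl"


def list_operations(wsdl_text):
    """Map each portType operation name to the names of its input message parts."""
    root = ET.fromstring(wsdl_text)
    # Index the part names of every message by message name.
    parts = {
        msg.get("name"): [p.get("name") for p in msg.findall("wsdl:part", WSDL_NS)]
        for msg in root.findall("wsdl:message", WSDL_NS)
    }
    operations = {}
    for op in root.findall("wsdl:portType/wsdl:operation", WSDL_NS):
        input_el = op.find("wsdl:input", WSDL_NS)
        if input_el is None:
            operations[op.get("name")] = []
            continue
        # The message attribute is a QName such as "s0:GetHeadingHttpGetIn".
        local_name = input_el.get("message").split(":")[-1]
        operations[op.get("name")] = parts.get(local_name, [])
    return operations


if __name__ == "__main__":
    print(wsdl_url("http://localhost/Service.asmx"))
    print(list_operations(SAMPLE_WSDL))
```

Running a check like this against a hand-edited WSDL file can confirm that each input message is actually associated with its operation before the file is passed to Wsdl.exe.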
To specify the data returned from a parsed HTML page
- Add a namespace-qualified <text> XML element within the <output> element located at the XPath /definitions/binding/operation/output. The <operation> element represents the Web service method that retrieves parsed HTML.
  Note: The operation name inside a binding must be globally unique. Alternatively, run Wsdl.exe with a namespace specified to prevent naming collisions caused by other WSDL files imported in the same application.
- Within the <text> XML element, add one <match> XML element for each piece of data you want to return from the parsed HTML page.
- Apply attributes to each <match> element. The valid attributes are presented in a table under the topic HTML Parsing by ASP.NET XML Web Services.
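To check that the <text> and <match> elements ended up in the right place and namespace, a sketch along the following lines can help. It is written in Python purely for illustration. The binding and operation names are hypothetical, and only the name and pattern attributes are used.

```python
import xml.etree.ElementTree as ET

NS = {
    "wsdl": "http://schemas.xmlsoap.org/wsdl/",
    "tm": "http://microsoft.com/wsdl/mime/textMatching/",
}

# Hypothetical binding fragment; only the structure matters here.
BINDING_FRAGMENT = """<definitions xmlns="http://schemas.xmlsoap.org/wsdl/">
  <binding name="ExampleHttpGet">
    <operation name="GetHeading">
      <output>
        <text xmlns="http://microsoft.com/wsdl/mime/textMatching/">
          <match name="Heading" pattern="H1&gt;(.*?)&lt;"/>
        </text>
      </output>
    </operation>
  </binding>
</definitions>"""


def find_matches(wsdl_text):
    """Return the attributes of every <match> under binding/operation/output/text."""
    root = ET.fromstring(wsdl_text)
    # Path is relative to the <definitions> root element.
    path = "wsdl:binding/wsdl:operation/wsdl:output/tm:text/tm:match"
    return [dict(m.attrib) for m in root.findall(path, NS)]


if __name__ == "__main__":
    for attrs in find_matches(BINDING_FRAGMENT):
        print(attrs)
```

Note that the pattern is written with &gt; and &lt; inside the XML attribute, while the parser hands back the decoded regular expression H1>(.*?)<. An empty result usually means the <text> element is missing its namespace declaration or sits outside the <output> element.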
To generate client proxy code for the Web service
- Run the Wsdl.exe tool from the .NET Framework SDK, passing the WSDL file you created as input.
Example
The following code example is a simple Web page containing <TITLE> and <H1> tags.
<HTML>
<HEAD>
<TITLE>Sample Title</TITLE>
</HEAD>
<BODY>
<H1>Some Heading Text</H1>
</BODY>
</HTML>
The following code example is a service description that parses the HTML page and extracts the text within the <TITLE> and <H1> tags. It defines a TestHeaders operation for the GetTitleHttpGet binding. The operation declares two <match> XML elements, Title and H1, which return the contents of the <TITLE> and <H1> tags, respectively.
<?xml version="1.0"?>
<definitions xmlns:s="http://www.w3.org/2001/XMLSchema"
             xmlns:http="http://schemas.xmlsoap.org/wsdl/http/"
             xmlns:mime="http://schemas.xmlsoap.org/wsdl/mime/"
             xmlns:soapenc="http://schemas.xmlsoap.org/soap/encoding/"
             xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/"
             xmlns:s0="http://tempuri.org/"
             targetNamespace="http://tempuri.org/"
             xmlns="http://schemas.xmlsoap.org/wsdl/">
  <types>
    <s:schema targetNamespace="http://tempuri.org/"
              attributeFormDefault="qualified"
              elementFormDefault="qualified">
      <s:element name="TestHeaders">
        <s:complexType derivedBy="restriction"/>
      </s:element>
      <s:element name="TestHeadersResult">
        <s:complexType derivedBy="restriction">
          <s:all>
            <s:element name="result" type="s:string" nullable="true"/>
          </s:all>
        </s:complexType>
      </s:element>
      <s:element name="string" type="s:string" nullable="true"/>
    </s:schema>
  </types>
  <message name="TestHeadersHttpGetIn"/>
  <message name="TestHeadersHttpGetOut">
    <part name="Body" element="s0:string"/>
  </message>
  <portType name="GetTitleHttpGet">
    <operation name="TestHeaders">
      <input message="s0:TestHeadersHttpGetIn"/>
      <output message="s0:TestHeadersHttpGetOut"/>
    </operation>
  </portType>
  <binding name="GetTitleHttpGet" type="s0:GetTitleHttpGet">
    <http:binding verb="GET"/>
    <operation name="TestHeaders">
      <http:operation location="MatchServer.html"/>
      <input>
        <http:urlEncoded/>
      </input>
      <output>
        <text xmlns="http://microsoft.com/wsdl/mime/textMatching/">
          <match name='Title' pattern='TITLE&gt;(.*?)&lt;'/>
          <match name='H1' pattern='H1&gt;(.*?)&lt;'/>
        </text>
      </output>
    </operation>
  </binding>
  <service name="GetTitle">
    <port name="GetTitleHttpGet" binding="s0:GetTitleHttpGet">
      <http:address location="http://localhost" />
    </port>
  </service>
</definitions>
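To see what the two patterns in this service description extract, the following sketch applies the regular expressions TITLE>(.*?)< and H1>(.*?)< to the sample page shown earlier. It uses Python's re module rather than the .NET regular expression engine used by the generated proxy, and it takes the first capture group of the first match, which is an assumption about how the values are selected. The function name parse_page is hypothetical.

```python
import re

# The sample Web page from the example above.
SAMPLE_PAGE = """<HTML>
<HEAD>
<TITLE>Sample Title</TITLE>
</HEAD>
<BODY>
<H1>Some Heading Text</H1>
</BODY>
</HTML>"""

# Match names and patterns as they appear in the service description,
# with the XML entities decoded.
PATTERNS = {
    "Title": r"TITLE>(.*?)<",
    "H1": r"H1>(.*?)<",
}


def parse_page(html, patterns):
    """Return the first capture group of each pattern's first match, or None."""
    results = {}
    for name, pattern in patterns.items():
        found = re.search(pattern, html)
        results[name] = found.group(1) if found else None
    return results


if __name__ == "__main__":
    for name, value in parse_page(SAMPLE_PAGE, PATTERNS).items():
        print(name, "=", value)
```

The lazy quantifier (.*?) stops at the first < after the opening tag, so each pattern captures only the text up to the closing tag.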
The following code example is a portion of the proxy class generated by Wsdl.exe for the previous service description.
Note