Thursday, November 17, 2005

XSLT development with ruby - picking text nodes

Setting up for XSLT development

If you plan to use XSLT in your ruby program, read on:

Let's assume the command line is a more agreeable code viewer than your graphical web browser and write a minimal XSLT processor. Make sure you have up-to-date versions of libxml2, libxslt, and ruby-xslt, then save the following as min.rb:

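#!/usr/bin/ruby
# min.rb - apply test.xsl to test.xml and print the result
require 'xml/xslt'
xslt = XML::XSLT.new()
xslt.xsl = IO.read("test.xsl")  # stylesheet source
xslt.xml = IO.read("test.xml")  # input document source
out = xslt.serve()              # run the transform
print out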

Setting Emacs up for three windows (one for test.xsl, one for test.xml, and one for output) and typing M-! min.rb (or C-x ESC ESC to repeat it) made for a fine XSLT development environment for me. Emacs has a nice xsl mode, too. If you look around you'll see that this program is equally trivial in any language.

The best introductory article I've found on ruby-xslt is Alex Netkachev's, and the best mailing list I have found for XSLT help is xsl-list@lists.mulberrytech.com.

I'm currently reading Inside XSLT by Steven Holzner, and I'm glad I found it. XSLT is one of those markup languages that takes some real time to master (it's Turing-complete!). A major goal of the W3C working group is to make XSLT 2.0 easier to learn and use. Ah well, a bit late for me.

Why XSLT?

In most cases I can think of, writing an XSLT transform (stylesheet) is probably easier on the programmer than interfacing to a tree-parser, reading, transforming manually, and writing out the result. However, you have to weigh that against the fact that it takes a couple days to learn XSLT and implement your first practical, non-trivial transforms.
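
For comparison, here is a rough sketch of the manual route in Ruby, using the REXML parser instead of ruby-xslt. The input string and the method name foo_text_of_top_level_radios are made up for illustration; the task (print only the foo text of Radio tags outside SomeGroup) mirrors one of the rules in the exercise further down.

```ruby
require 'rexml/document'

# Hypothetical input, modeled loosely on the test.xml shown later in this post.
XML_SRC = <<~XML
  <TestRoot>
    <SomeGroup>
      <Radio>inside group <foo>grouped foo</foo></Radio>
    </SomeGroup>
    <Radio><foo>keep me</foo> drop me</Radio>
  </TestRoot>
XML

# Walk the tree by hand: for every Radio that is a direct child of the root,
# collect the text of its foo children. Radio tags nested in SomeGroup are
# not direct children of the root, so they are skipped.
def foo_text_of_top_level_radios(xml)
  doc = REXML::Document.new(xml)
  result = []
  doc.root.each_element('Radio') do |radio|
    radio.each_element('foo') { |foo| result << foo.text }
  end
  result
end

puts foo_text_of_top_level_radios(XML_SRC)  # prints: keep me
```

Even for one rule this is already a loop inside a loop; each further rule adds more hand-written traversal, which is the work a stylesheet template would otherwise do.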

Learning XSLT does expose you to lots of XML standards (namespaces, XPath, XBase, etc.), so if you are behind on learning them, you might just consider an XSLT project a practical means to learning XML standards. I'm glad I did.

And the number one reason to learn XSLT is that I now know a good bit of it ;). For the usual multitude of reasons, the more people who know a language, the more useful code written in it can be for all of us.

While studying I wrote a well-documented example of using XSLT to pick text out of XML, including some of the tricky parts of the easy parts. In particular, this little example explores how to deal with text in nested tags.
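
Before the full listing, a small Ruby sketch (using REXML rather than XSLT, with a made-up one-line document) of the three different things "the text of a tag" can mean once tags are nested. The helper name string_value is only an illustrative choice; the XSLT expressions in the comments are the ones used in test.xsl below.

```ruby
require 'rexml/document'

# Made-up sample: a tag with text on both sides of a nested tag.
radio = REXML::Document.new('<Radio>outer <foo>inner</foo> tail</Radio>').root

# Concatenate every text node below a node, in document order.
# This is roughly what <xsl:value-of select="."/> produces.
def string_value(node)
  case node
  when REXML::Text    then node.value
  when REXML::Element then node.children.map { |child| string_value(child) }.join
  else ''
  end
end

all_text = string_value(radio)            # text of Radio plus text of foo
own_text = radio.texts.map(&:value).join  # like for-each select="text()"
foo_text = radio.elements['foo'].text     # like value-of select="foo"

puts all_text
puts own_text
puts foo_text
```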

The XML file:


<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- test.xml - Source XML for self-explanatory XSLT exercise -->
<TestRoot>
This output is a result of applying the transformation test.xsl to
test.xml. Default XSLT processing rules will handle anything we don't
specifically handle ourselves in test.xsl (i.e. this paragraph).

<SomeGroup>
<Radio crud="blah">First, we would like all <foo>"Radio" tags
contained within tag set "SomeGroup"</foo> to be processed
identically, copying the text within the tags to the output.
</Radio>

<Radio crud="hooha">The text of any inner <foo>tags will be
copied as well. As you might expect, text in the .XML file
that is not specifically handled is</foo> copied to the output by
default.</Radio> However, "SomeGroup" *is* specifically handled.

</SomeGroup>

<Radio crud="watoozie"><foo>Second, when we encounter any "Radio"
tags outside of "SomeGroup" tags - we will print only the content of
the "foo" tags within those "Radio" tags .</foo>So don't print
this.</Radio>

<YetAnotherTag blah="and don't print this," yadayada="Third, we will
print this attribute. ">...but not this text.</YetAnotherTag>

<YetAnotherTag who="no printola.">Fourth, <foo>don't print this
inner foo tag,</foo> print this text followed by the tag
name:</YetAnotherTag>

<YetAnotherTag>Auugg! <foo>And fifth, print only this foo tag
text.</foo> Muhuhuhahaha!</YetAnotherTag>
<ForgottenTag>Unfortunately, we will forget to handle
this...<YetAnotherTag> But fortunately there is a catchall handler
for YetAnotherTags in our xsl file, so the inside bit won't print.
</YetAnotherTag>.</ForgottenTag>

Bye! <!-- Some literal text to test -->

</TestRoot>


The XSL file:


<?xml version="1.0" encoding="utf-8"?>
<!-- test.xsl - Transform for self-explanatory XSLT exercise -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">

<!-- Handle the root tag -->
<xsl:template match="/"> <!-- Plain text in an XSLT sheet gets sent to the output -->
Hello! I'm covering all the basic XSLT I needed to get most
simple things done. Note that newlines after tags in the .xsl
file count. Whitespace processing is left for another
exercise.
<xsl:apply-templates/> <!-- Apply all of the templates below to continue processing this section -->
</xsl:template>

<!-- Handle "SomeGroup" tags -->
<xsl:template match="SomeGroup">
<!-- Handle Radio tags that lie within SomeGroup tags -->
<xsl:for-each select="Radio">
<xsl:value-of select="."></xsl:value-of> <!-- loop through the Radio tags and print everything they contain -->
</xsl:for-each>
</xsl:template>

<!-- Handle the YetAnotherTags -->
<xsl:template match="YetAnotherTag[1]"> <!-- match only the first YetAnotherTag child of its parent -->
<!-- print the yadayada attribute -->
<xsl:value-of select="@yadayada"/>
</xsl:template>
<xsl:template match="YetAnotherTag[2]"> <!-- print only the text nodes directly inside this tag -->
<!-- value-of only gets the first match, and we want them all -->
<xsl:for-each select="text()">
<xsl:value-of select="."/>
</xsl:for-each>
<xsl:value-of select="name()"/>
</xsl:template>
<xsl:template match="YetAnotherTag"> <!-- print only the text of the first child element -->
<xsl:value-of select="*"></xsl:value-of>
</xsl:template>

<!-- Handle any "Radio" tags that are unhandled thus far -->
<xsl:template match="Radio">
<xsl:value-of select="foo"></xsl:value-of>
</xsl:template>

<!-- You can also pick text nodes that start with specific text by using the XPath starts-with() function in a predicate, e.g. text()[starts-with(., 'Fourth')] -->

<!-- That's it. If there is anything I didn't handle, I might not like the results -->
</xsl:stylesheet>
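
To try the starts-with() idea outside a stylesheet, here is a hedged sketch in Ruby. It assumes REXML's XPath engine (not ruby-xslt) and a made-up two-element document; the method name text_starting_with is hypothetical. Note that starts-with() is an XPath function used inside a predicate on text(), not an argument to text() itself.

```ruby
require 'rexml/document'

# Made-up sample with two sibling tags; only the first begins with "Fourth".
SAMPLE = <<~XML
  <TestRoot>
    <YetAnotherTag>Fourth, print this text</YetAnotherTag>
    <YetAnotherTag>Auugg! Not this one</YetAnotherTag>
  </TestRoot>
XML

# Return the text nodes under any YetAnotherTag whose content begins with
# prefix. The prefix is interpolated into the XPath string, so it must not
# contain a single quote.
def text_starting_with(xml, prefix)
  doc = REXML::Document.new(xml)
  xpath = "//YetAnotherTag/text()[starts-with(., '#{prefix}')]"
  REXML::XPath.match(doc, xpath).map(&:value)
end

puts text_starting_with(SAMPLE, 'Fourth')
```

starts-with() compares from the very first character, so a text node that begins with a newline or indentation will not match unless the test is wrapped as starts-with(normalize-space(.), ...).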


The Output:



Hello! I'm covering all the basic XSLT I needed to get most
simple things done. Note that newlines after tags in the .xsl
file count. Whitespace processing is left for another
exercise.

This output is a result of applying the transformation test.xsl to
test.xml. Default XSLT processing rules will handle anything we don't
specifically handle ourselves in test.xsl (i.e. this paragraph).

First, we would like all "Radio" tags
contained within tag set "SomeGroup" to be processed
identically, copying the text within the tags to the output.
The text of any inner tags will be
copied as well. As you might expect, text in the .XML file
that is not specifically handled is copied to the output by
default.

Second, when we encounter any "Radio"
tags outside of "SomeGroup" tags - we will print only the content of
the "foo" tags within those "Radio" tags .

Third, we will print this attribute.

Fourth, print this text followed by the tag
name:YetAnotherTag

And fifth, print only this foo tag
text.

Unfortunately, we will forget to handle
this....

Bye!
