Using XSLT to Create Wiki Markup

In my second post detailing my forays into MSBuild, I mentioned using XSLT to transform the XML documentation produced by the C# compiler into wiki markup. The primary motivation behind this was so I could integrate the API reference with the rest of the documentation for my framework, providing a more streamlined interface to the information. I recently started fleshing out the build process for SlimDX, a task that included crafting a similar wiki markup task.

Writing the transformation isn’t too difficult, but there are a few gotchas along the way; for those of you interested in trying it yourself, I’ve compiled a collection of tips based on my experiences. A basic understanding of XSLT is required, of course, but it’s not a particularly difficult language to pick up, and Google provides a bountiful selection of learning resources. You’ll also need a good XSLT processing tool. I prefer command-line tools that don’t require installation beyond unzipping an archive: command-line operation is a must for anything that’s part of your build process, and the distribution of self-contained binary archives makes it much easier to install the tool into your source repository and upgrade it as required. At the moment, I’m happily using the Saxon processor (the “Saxon B” version). In addition to meeting the above requirements, it supports XSLT 2.0, which offers some useful new functionality over XSLT 1.0. Finally, you’ll want to understand the nature of the XML documentation you’ll be using as the source of the transform. Check out the C# XML Documentation Comments section on MSDN.

Know Your Wiki

It is critical that you understand how the wiki you’re using stores its data, or the path by which you can batch import content. DokuWiki, for example, uses the filesystem directly — pages are plain text files organized into directories that provide namespaces. MediaWiki, on the other hand, provides an import option that accepts a single XML file. That file uses elements to define pages and metadata, and contains the raw wiki markup in text nodes.

Most wikis will offer some kind of caching mechanism to speed the redraw of pages that haven’t been edited. If you have to import your documentation using a method external to the wiki (for example, overwriting the flat files), you may have to make sure the wiki isn’t caching those pages or your changes may not stick.

Whitespace Wrangling

Since XSLT is a transformation mechanism, not a formatting mechanism, it has whitespace handling rules that you may find irksome. When transforming from one XML representation to another, whitespace is usually insignificant, because the XML provides structure to the data. But if your target format is plain text (highly likely for wiki markup), whitespace is probably extremely significant. There are a number of ways to handle whitespace in XSLT; after some experimentation I found I preferred to use <xsl:value-of> elements, along with <xsl:strip-space elements="*"/>, to produce all my output text. This gives you very fine control over the content of your output, although you will need to insert newlines manually using the appropriate character reference (&#10;).
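
To make that concrete, here is a minimal sketch of the approach described above. It assumes the usual doc/members/member layout of the compiler's XML output, and the "=====" heading markup is only a placeholder for whatever your wiki expects; adjust both to suit.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Emit plain text rather than XML. -->
  <xsl:output method="text" encoding="UTF-8"/>
  <!-- Drop whitespace-only text nodes from the source document. -->
  <xsl:strip-space elements="*"/>

  <xsl:template match="/">
    <!-- Assumption: the source uses the doc/members/member layout. -->
    <xsl:for-each select="doc/members/member">
      <!-- Heading line; '=====' is placeholder wiki syntax. -->
      <xsl:value-of select="concat('===== ', substring-after(@name, ':'), ' =====')"/>
      <!-- Newlines are written explicitly as character references. -->
      <xsl:value-of select="'&#10;'"/>
      <!-- Collapse the indentation and line breaks inside the summary text. -->
      <xsl:value-of select="normalize-space(summary[1])"/>
      <xsl:value-of select="'&#10;&#10;'"/>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

Because every character of output comes from an explicit <xsl:value-of>, the indentation of the stylesheet itself never leaks into the generated page.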

<member> Isn’t Nested

The bulk of the information you’ll want to access is contained in the <member> element. <member> defines all kinds of members — namespaces, types, methods, fields, and so on. While the element has children (<param> and <returns> for example), members that are hierarchically organized within the language are not hierarchically organized within the XML. Put another way, the <member> element describing your class’s constructor is a sibling of the element describing your class, not a child:

<member name="T:SlimDX.XInput.Controller" decl="false" source="controller.h" line="30">
  <remarks>
    Represents a controller, including methods for querying button and thumbstick state.
  </remarks>
</member>
<member name="M:SlimDX.XInput.Controller.#ctor(SlimDX.XInput.UserIndex)" decl="true" source="controller.h" line="39">
  <summary>
    Initializes a new instance of Controller.
  </summary>
  <param name="userIndex">Index of the user's controller.</param>
</member>

The practical implication of this is that you’ll probably end up having to do some textual processing of attribute values in order to extract the useful, meaty information. You’ll learn to love the XPath functions.
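
As an illustration of the kind of string processing involved, the sketch below splits a name attribute into its kind prefix, simple name, and owning type. The <entry> element and its attribute names are hypothetical, invented for this example, and the logic assumes plain names like the ones shown above; generic or explicitly implemented members may need extra handling.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/">
    <entries>
      <xsl:apply-templates select="//member"/>
    </entries>
  </xsl:template>

  <xsl:template match="member">
    <!-- Kind prefix before the colon, such as T (type) or M (method). -->
    <xsl:variable name="Kind" select="substring-before(@name, ':')"/>
    <!-- Everything after the prefix, minus any parameter list. -->
    <xsl:variable name="Qualified" select="substring-after(@name, ':')"/>
    <xsl:variable name="FullName"
                  select="if (contains($Qualified, '('))
                          then substring-before($Qualified, '(')
                          else $Qualified"/>
    <!-- Split on dots: the last token is the simple name, the rest is the owner. -->
    <xsl:variable name="Parts" select="tokenize($FullName, '\.')"/>
    <!-- 'entry' is a hypothetical element used only for this sketch. -->
    <entry kind="{$Kind}"
           name="{$Parts[last()]}"
           owner="{string-join($Parts[position() lt last()], '.')}"/>
  </xsl:template>
</xsl:stylesheet>
```

For the constructor shown earlier, this should yield a kind of M, a name of #ctor, and an owner of SlimDX.XInput.Controller.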

.csproj or .vcproj?

When building a .csproj, you get a single .xml file for the entire result assembly. When building a .vcproj, you get one .xdc file per translation unit as well as a final monolithic .xml. Unless you have an import system that takes a single file defining multiple pages, you will probably want to produce (at least) a single wiki markup file per class or type.

In the case of the C# compiler, your XSLT transform should leverage <xsl:result-document>. Note, however, that <xsl:result-document> is XSLT 2.0, so if you don’t have a 2.0-compliant processor, you’ll need to find a workaround or extension supported by your tool.
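
Here is a rough sketch of what that can look like, writing one text file per documented type. The file naming scheme (lower-cased type name plus ".txt") is purely an assumption for illustration, and the page body is deliberately trivial.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="/">
    <!-- One output file per documented type (names starting with 'T:'). -->
    <xsl:for-each select="doc/members/member[starts-with(@name, 'T:')]">
      <xsl:variable name="TypeName" select="substring-after(@name, ':')"/>
      <!-- Hypothetical naming scheme: lower-cased full name plus '.txt'. -->
      <xsl:result-document href="{lower-case($TypeName)}.txt" method="text">
        <xsl:value-of select="concat($TypeName, '&#10;')"/>
        <xsl:value-of select="concat(normalize-space(remarks[1]), '&#10;')"/>
      </xsl:result-document>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

The href is resolved relative to the base output location of the transformation, so where the files land depends on how you invoke your processor.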

You’d think the one-file-per-translation-unit approach you get when building a C++/CLI .vcproj would be easier to deal with. Alas, it’s not. The compiler injects the documentation for types defined in that translation unit as well as types referenced by that translation unit. Fortunately, the compiler will weave all these fragments together for you in the end, so you end up with the same kind of monolithic file you get with C#, for better or for worse.

Use Multiple Passes

Because of the loose, almost nonexistent, structure of the raw XML documentation, it can be to your benefit to perform the transformation in two stages. The first stage can produce (either as a distinct XML file or an in-memory node set) a more structured tree. I found it much easier to restructure the XML using a “pulling” paradigm, where the child data was directly selected during the processing of the parent element. On the second pass, a slightly more natural “pushing” paradigm can be employed, where you allow the natural recursive nature of the XSLT engine to drive your processing.

For example, here’s a fragment of the transform for handling type members:

<xsl:when test="starts-with(@name,'T:')">
  <xsl:variable name="MemberName" select="substring-after(@name,':')"/>
  <documented-type longname="{$MemberName}" shortname="{tokenize($MemberName,'\.')[last()]}">
    <!-- Select the member elements belonging to this type to nest them within the
         tree for this type. starts-with() anchors the match at the beginning of the
         name, so a type name appearing in a parameter list is not picked up. -->
    <xsl:apply-templates select="./remarks"/>
    <xsl:apply-templates select="//member[starts-with(@name,concat('M:',$MemberName,'.'))]"/>
    <xsl:apply-templates select="//member[starts-with(@name,concat('P:',$MemberName,'.'))]"/>
  </documented-type>
</xsl:when>

Using two stages like this allows you to isolate the restructuring logic from the presentation logic. This makes the resulting transformation easier to read and maintain, since the two aren’t intimately intertwined. It also allows you to extend your overall processing to support more than one target format by reusing your restructuring transform and simply writing additional formatting transforms.
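
If you’d rather keep both stages in a single stylesheet, XSLT 2.0 lets you hold the intermediate tree in a variable and separate the passes with modes. The sketch below is a simplified illustration rather than the actual SlimDX transform: the <documented-method> element, the mode names, and the output format are all made up for the example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="/">
    <!-- Pass 1: build the restructured tree in memory. -->
    <xsl:variable name="Structured">
      <xsl:apply-templates select="doc/members/member[starts-with(@name, 'T:')]"
                           mode="structure"/>
    </xsl:variable>
    <!-- Pass 2: format the restructured tree as text. -->
    <xsl:apply-templates select="$Structured/documented-type" mode="format"/>
  </xsl:template>

  <!-- Pass 1 ("pull"): explicitly select the methods that belong to this type. -->
  <xsl:template match="member" mode="structure">
    <xsl:variable name="TypeName" select="substring-after(@name, ':')"/>
    <documented-type longname="{$TypeName}">
      <xsl:for-each select="../member[starts-with(@name, concat('M:', $TypeName, '.'))]">
        <!-- 'documented-method' is a hypothetical element for this sketch. -->
        <documented-method signature="{substring-after(@name, ':')}">
          <xsl:value-of select="normalize-space(summary[1])"/>
        </documented-method>
      </xsl:for-each>
    </documented-type>
  </xsl:template>

  <!-- Pass 2 ("push"): let template matching drive the output. -->
  <xsl:template match="documented-type" mode="format">
    <xsl:value-of select="concat('Type: ', @longname, '&#10;')"/>
    <xsl:apply-templates mode="format"/>
  </xsl:template>

  <xsl:template match="documented-method" mode="format">
    <xsl:value-of select="concat('  * ', @signature, ': ', ., '&#10;')"/>
  </xsl:template>
</xsl:stylesheet>
```

Supporting another target format then means adding a second set of "format" templates (or a separate stylesheet) while leaving the "structure" templates alone. Note that this simple prefix match would also sweep in the methods of nested types, so a real transform may need a stricter test.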

Dealing With Parameters

Handling parameters (specifically, associating parameter types with names) is another tricky issue. If you want to make this association (you don’t need to), you need to tokenize the name attribute to get the list of types in the method signature.

<xsl:variable name="ParameterTypes"
              select="tokenize(translate(substring-before(substring-after(@name,'('),')'),'@',''),',')"/>
<xsl:for-each select="./param">
  <xsl:variable name="ParameterIndex" select="position()"/>
  <parameter longtype="{$ParameterTypes[$ParameterIndex]}"
             shorttype="{tokenize($ParameterTypes[$ParameterIndex],'\.')[last()]}"
             name="{@name}">
    <xsl:apply-templates select="."/>
  </parameter>
</xsl:for-each>

Once tokenized, you can iteratively pull each of the child <param> elements, and store the type and name as attributes of your more-structured XML output (in the example above, I’m using a <parameter> element).

Seeing it in Action

I committed a preliminary version of the processing into the SlimDX repository a little while ago. If you’re interested in the real nuts and bolts, you can check out the code. The transforms are in the docs/xsl directory. There is one to restructure the XML (api-structure.xsl) and one to format it as wiki markup suitable for import into our wiki software (api-format.xsl). Since I still have a few bugs to work out, the actual documentation isn’t live on the wiki, at least not directly. You can probably browse some of the testing pages if you poke through the change history, though.