XSLTC Compiler Design

Compiler Overview
Building the Abstract Syntax Tree
Type-check and Cast Expressions
JVM byte-code generation


	Compiler overview

The main component of the XSLTC compiler is the class

org.apache.xalan.xsltc.compiler.XSLTC

This class uses three parsers to consume the input stylesheet(s):

javax.xml.parsers.SAXParser

is used to parse the stylesheet document and pass its contents to the compiler as basic SAX2 events.

com.sun.xslt.compiler.XPathParser

is a parser used to parse XPath expressions and patterns. This parser is generated using JavaCUP and JavaLEX from Princeton University.

com.sun.xslt.compiler.Parser

is a wrapper for the other two parsers. This parser is responsible for using the other two parsers to build the compiler's abstract syntax tree (which is described in more detail in the next section of this document).


	Building an Abstract Syntax Tree

An abstract syntax tree (AST) is a data-structure commonly used by compilers to separate the parse-phase from the later phases of the compilation. The AST has one node for each parsed token from the stylesheet and can easily be parsed at the stages of type-checking and bytecode generation.

Mapping stylesheet elements to AST nodes
Building the AST from AST nodes
Mapping XPath expressions and patterns to additional AST nodes

The SAX parser passes the contents of the stylesheet to XSLTC's main parser. The SAX events represent a decomposition of the XML document that contains the stylesheet. The main parser needs to create one AST node from each node that it receives from the SAX parser. It also needs to use the XPath parser to decompose attributes that contain XPath expressions and patterns. Remember that XSLT is in effect two languages: XML and XPath, and one parser is needed for each of these languages. The SAX parser breaks down the stylesheet document, the XPath parser breaks down XPath expressions and patterns, and the main parser maps the decomposed elements into nodes in the abstract syntax tree.


	Mapping stylesheets elements to AST nodes

Every element that is defined in the XSLT 1.0 spec is represented by a a class in the org.apache.xalan.xsltc.compiler package. The main parser class contains a Hashtable that that maps XSL elements into Java classes that make up the nodes in the AST. These Java classes all reside in the org.apache.xalan.xsltc.compiler package and extend either the TopLevelElement or the Instruction classes. (Both these classes extend the SyntaxTreeNode class.)

The mapping from XSL element names to Java classes/AST nodes is set up in the initClasses() method of the main parser:

    private void initStdClasses() {
	try {
	    initStdClass("template",    "Template");
	    initStdClass("param",       "Param");
	    initStdClass("with-param",  "WithParam");
	    initStdClass("variable",    "Variable");
	    initStdClass("output",      "Output");
	    :
	    :
	    :
	}
    }

    private void initClass(String elementName, String className)
	throws ClassNotFoundException {
	_classes.put(elementName,
		     Class.forName(COMPILER_PACKAGE + '.' + className));
    }


	Building the AST from AST nodes

The parser builds an AST from the various syntax tree nodes. Each node contains a reference to its parent node, a vector containing references to all child nodes and a structure containing all attribute nodes:


	protected SyntaxTreeNode _parent; // Parent node private Vector _contents; // Child nodes protected Attributes _attributes; // Attributes of this element

These variables should be accessed using these methods:


	protected final SyntaxTreeNode getParent(); protected final Vector getContents(); protected String getAttribute(String qname); protected Attributes getAttributes();

At this time the AST only contains nodes that represent the XSL elements from the stylesheet. A SAX parse is generic and can only handle XML files and will not break up and identify XPath patterns/expressions (these are stored as attributes to the various nodes in the tree). Each XSL instruction gets its own node in the AST, and the XPath patterns/expressions are stored as attributes of these nodes. A stylesheet looking like this:


	<xsl:stylesheet .......> <xsl:template match="chapter"> <xsl:text>Chapter</xsl:text> <xsl:value-of select="."> </xsl:template> </xsl>stylesheet>

will be stored in the AST as indicated in the following picture:

Figure 1: The AST in its first stage

All objects that make up the nodes in the initial AST have a parseContents() method. This method is responsible for:

parsing the values of those attributes that contain XPath expressions or patterns, breaking each expression/pattern into AST nodes and inserting them into the tree.
reading/checking all other required attributes
propagate the parseContents() call down the tree


	Mapping XPath expressions and patterns to additional AST nodes

The nodes that represent the XPath expressions and patterns extend either the Expression or Pattern class respectively. These nodes are not appended to the _contents vectory of each node, but rather stored as individual references in each AST element node. One example is the ForEach class that represents the <xsl:for-each> element. This class has a variable that contains a reference to the AST sub-tree that represents its select attribute:


	private Expression _select;

There is no standard way of storing these XPath expressions and each AST node that contains one or more XPath expression/pattern must handle that itself. This handling basically involves passing the attribute's value to the XPath parser and receiving back an AST sub-tree.

With all XPath expressions/patterns expanded, the AST will look somewhat like this:

Fiugre 2: The AST in its second stage


	Type-check and Cast Expressions

In many cases we will need to typecast the top node in the expression sub-tree to suit the expected result-type of the expression, or to typecast child nodes to suit the allowed types for the various operators in the expression. This is done by calling 'typeCheck()' on the root-node in the XSL tree. Each SyntaxTreeNode node is responsible for inserting type-cast nodes between itself and its child nodes or XPath nodes. These type-cast nodes will convert the output-types of the child/XPath nodes to the expected input-type of the parent node. Let look at our AST again and the node that represents the <xsl:value-of> element. This element expects to receive a string from its select XPath expression, but the Step expression will return either a node-set or a single node. An extra node is inserted into the AST to perform the necessary type conversions:

Figure 3: XPath expression type cast

The typeCheck() method of each SyntaxTreeNode object will call typeCheck() on each of its XPath expressions. This method will return the native type returned by the expression. The AST node will insert an additional type-conversion node if the return-type does not match the expected data-type. Each possible return type is represented by a class in the org.apache.xalan.xsltc.compiler.util package. These classes all contain methods that will generate bytecodes needed to perform the actual type conversions (at runtime). The type-cast nodes in the AST mainly consist of calls to these methods.


	JVM byte-code generation

Compiling the stylesheet
Compiling top-level elements
Compiling template code
Compiling instructions, functions expressions and patterns

Evey node in the AST extends the SyntaxTreeNode base class and implements the translate() method. This method is responsible for outputting the actual bytecodes that make up the functionality required for each element, function, expression or pattern.


	Compiling the stylesheet

Some nodes in the AST require more complex code than others. The best example is the <xsl:stylesheet> element. The code that represents this element has to tie together the code that is generated by all the other elements and generate the actual class definition for the main translet class. The Stylesheet class generates the translet's constructor and methods that handle all top-level elements.


	Compiling top-level elements

The bytecode that handles top-level elements must be generated before any other code. The 'translate()' method in these classes are mainly called from these methods in the Stylesheet class:


	private String compileBuildKeys(ClassGenerator); private String compileTopLevel(ClassGenerator, Enumeration); private void compileConstructor(ClassGenerator, Output);

These methods handle most top-level elements, such as global variables and parameters, <xsl:output> and <xsl:decimal-format> instructions.


	Compiling template code

All XPath patterns in <xsl:apply-template> instructions are converted into numeric values (known as the pattern's kernel 'type'). All templates with identical pattern kernel types are grouped together and inserted into a table known as a test sequence. (The table of test sequences is found in the Mode class in the compiler package. There will be one such table for each mode that is used in the stylesheet). This table is used to build a big switch() statement in the translet's applyTemplates() method. This method is initially called with the root node of the input document.

The applyTemplates() method determines the node's type and passes this type to the switch() statement to look up the matching template. The test sequence code (the TestSeq class) is responsible for inserting bytecodes to find one matching template in cases where more than one template matches the current node type.

There may be several templates that share the same pattern kernel type. Here are a few examples of templates with patterns that all have the same kernel type:


	<xsl:template match="A/C"> <xsl:template match="A/B/C"> <xsl:template match="A \| C">

All these templates will be grouped under the type for <C> and will all get the same kernel type (the type for "C"). The last template will be grouped both under "C" and "A", since it matches either element. If the type identifier for "C" in this case is 8, all these templates will be put under case 8: in applyTemplates()'s big switch() statement. The TestSeq class will insert some code under the case 8: statement (similar to if's and then's) in order to determine which of the three templates to trigger.


	Compiling instructions, functions, expressions and patterns

The template code is generated by calling translate() on each Template object in the abstract syntax tree. This call will be propagated down the abstract syntax tree and every element will output the bytecodes necessary to complete its task.

The Java Virtual Machine is stack-based, which goes hand-in-hand with the tree structure of a stylesheet and the AST. A node in the AST will call translate() on its child nodes and any XPath nodes before it generates its own bytecodes. In that way the correct sequence of JVM instructions is generated. Each one of the child nodes is responsible of creating code that leaves the node's output value (if any) on the stack. The typical procedure for the parent node is to create JVM code that consumes these values off the stack and then leave its own output on the stack (for its parent).

The tree-structure of the stylesheet is in this way closely tied with the stack-based JVM. The design does not offer any obvious way of extending the compiler to output code for other non-stack-based VMs or processors.