Since this blog opened, there have been only a few posts explaining how we actually do the transformation and the issues we face day after day... So today I will explain one of the tricky things we had to implement - I hope it will convince you that we are also doing technical stuff! ;-)

I already mentioned that OpenDocument and OpenXml have very different ways of handling formatting properties: while OpenDocument uses "automatic styles" (that means: for every single formatting property, a style is defined and applied to the content; but that style is not intended to be shown to the user), OpenXml uses something we could call "direct formatting" (properties are directly attached to paragraphs or characters). There is one similarity between the two formats, though: they both make the same distinction between paragraph and character properties (paragraph properties cover indentation, text alignment, etc., while character properties cover font, size, color, etc.).

When we started to work on the transformation, we found out that in OpenXml you can use "hidden styles" - styles that don't appear in the user interface. So we decided to simply transform every automatic style into a hidden style. At first glance, that seemed to work quite well. But we later faced several issues:

1. In the Word 2007 user interface, when a "hidden style" is used, the relation to its parent style is lost. For instance, if we define an automatic style called "P1", and that style is based on the "Standard" style, we expect the user interface to display "Standard" as the style used. But Word 2007 only displays "Clear style", which is not very user friendly for us...

2. We noticed that Word couldn't open very big files with a lot of automatic styles inside. This was not related to the size of the file, but to the number of styles defined. This should be OK for normal use (it happened after several tens of thousands of styles!) but in our case it was problematic.

3. We encountered a problem with toggle properties (properties that behave differently when they are set in a style than when they are used as direct formatting). Let me give a simple illustration with the bold property: when used in a character style, it toggles the previous state of the character (if it was bold, it becomes normal, and vice versa); but when it is used as a direct formatting property, it enforces the state of the text (if it is switched on, then the text is bold, whatever its previous state was). As you may have guessed, OpenDocument does not make this distinction: bold always means bold, wherever it is defined - in a normal style or in an automatic style. So to handle that, we had to override every toggle property defined in a style by adding the same definition as a direct formatting property. In XSLT, that rapidly became a nightmare.
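To make the toggle behaviour a bit more concrete, here is a deliberately simplified Python model of what is described above. It is only a sketch of the idea, not Word's actual algorithm: the function name `effective_bold` and its parameters are made up for this illustration, and it assumes that each style level that sets bold simply flips the current state.

```python
def effective_bold(style_bold_flags, direct_bold=None):
    """Return whether text ends up bold in this simplified model.

    style_bold_flags: bold settings contributed by each style applied
        to the text, in order (hypothetical representation).
    direct_bold: True/False if bold is set as direct formatting,
        None if no direct formatting is present.
    """
    # Direct formatting enforces the state, whatever the styles say.
    if direct_bold is not None:
        return direct_bold

    # In a style, bold behaves as a toggle: each "on" flips the state.
    state = False
    for flag in style_bold_flags:
        if flag:
            state = not state
    return state


one_style = effective_bold([True])
two_styles = effective_bold([True, True])
forced = effective_bold([True, True], direct_bold=True)
```

With two styles that both say "bold", the toggles cancel out and the text is not bold in this model, whereas OpenDocument would expect bold. Repeating the property as direct formatting (the `forced` case) is the workaround mentioned above.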

So in the end, we decided to write a post processor dedicated to transforming automatic styles into direct formatting properties. In the .NET framework, this can be done at a very low level (even lower than SAX, for those who know SAX): you intercept every event encountered while the XML is processed: document start, element start, attribute start, string content, attribute end, element end, document end (to keep it short). So your program is responsible, for instance, for associating the value of an attribute with its name (they are transmitted in successive method calls). Actually, that can be handled quite simply with a stack: when an element starts, we push the corresponding node on top of the stack, and we pop it when the element closes. That makes the code quite difficult to read, though, and we don't regret our choice to use XSLT!
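The stack idea can be sketched in a few lines. The example below is in Python rather than .NET, and the names (`Element`, `EventCollector` and its methods) are hypothetical; it only mimics the kind of events listed above and is not our actual post processor.

```python
class Element:
    """A node being rebuilt from low-level events (hypothetical type)."""

    def __init__(self, name):
        self.name = name
        self.attributes = {}
        self.text = ""
        self.children = []


class EventCollector:
    """Rebuilds nodes from start/end events using a stack."""

    def __init__(self):
        self.stack = []                # elements that are currently open
        self.current_attribute = None  # attribute being written, if any
        self.completed = []            # top-level elements already closed

    def start_element(self, name):
        # An element is starting: push its node on top of the stack.
        self.stack.append(Element(name))

    def start_attribute(self, name):
        # Remember the name; the value arrives in later string() calls.
        self.current_attribute = name
        self.stack[-1].attributes[name] = ""

    def string(self, text):
        top = self.stack[-1]
        if self.current_attribute is not None:
            # We are inside an attribute: the text is its value.
            top.attributes[self.current_attribute] += text
        else:
            top.text += text

    def end_attribute(self):
        self.current_attribute = None

    def end_element(self):
        # The element is closing: pop it and attach it to its parent.
        node = self.stack.pop()
        if self.stack:
            self.stack[-1].children.append(node)
        else:
            self.completed.append(node)
        return node


# Replaying the events for <w:pPr><w:pStyle w:val="P1"/></w:pPr>
collector = EventCollector()
collector.start_element("w:pPr")
collector.start_element("w:pStyle")
collector.start_attribute("w:val")
collector.string("P1")
collector.end_attribute()
style_ref = collector.end_element()
ppr = collector.end_element()
```

Even in this tiny version, the attribute name and its value only meet because the collector remembers `current_attribute` between calls, which is exactly the kind of bookkeeping that makes this style of code harder to read than XSLT.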

In our case, we decided to keep the XSL as it is - that means we continue to replace automatic styles with hidden styles in our XSL transformation, and we replace those hidden styles with direct formatting during the post processing. By doing so, we keep a "clean" XSL that can still be used in another context. We first intercept every style declaration and store it in a hashtable, removing the automatic styles from the output "on the fly". Then, we intercept each paragraph properties element (pPr) to fill it with the automatic style properties (when needed); and we do the same with each run properties element (rPr). In fact, it is not as simple as it seems here, because paragraph styles also often define run properties (that apply to each run of the paragraph), and we have to replicate those properties in each run of the paragraph. Moreover, there are some specific cases to deal with, for instance when we have a run with no rPr - we might need to create one to apply the run properties coming from the paragraph.
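Here is a rough Python sketch of that post-processing step, working on plain dictionaries instead of real XML events to keep it short. Everything in it is an assumption made for illustration: the function name `inline_automatic_styles`, the dictionary layout, the rule that a run's own properties win over those coming from the paragraph style, and the choice to point the paragraph back at the parent style.

```python
import copy


def inline_automatic_styles(styles, paragraphs):
    """Turn automatic (hidden) styles into direct formatting.

    styles: dict mapping style name -> {"automatic", "parent", "pPr", "rPr"}
    paragraphs: list of {"style", optional "pPr", "runs": [...]}
    Returns (styles kept in the output, rewritten paragraphs).
    """
    # Store the automatic styles in a lookup table and drop them
    # from the styles that remain in the output.
    automatic = {n: s for n, s in styles.items() if s["automatic"]}
    kept = {n: s for n, s in styles.items() if not s["automatic"]}

    result = []
    for para in paragraphs:
        para = copy.deepcopy(para)  # leave the input untouched
        style = automatic.get(para.get("style"))
        inherited_rpr = {}
        if style is not None:
            # Assumption: reference the parent style instead.
            para["style"] = style["parent"]
            # Fill pPr with the automatic style's paragraph properties.
            para["pPr"] = {**style["pPr"], **para.get("pPr", {})}
            inherited_rpr = style["rPr"]
        for run in para["runs"]:
            # Replicate paragraph-level run properties in every run,
            # creating the rPr when the run has none.
            merged = {**inherited_rpr, **run.get("rPr", {})}
            if merged:
                run["rPr"] = merged
        result.append(para)
    return kept, result


styles = {
    "Standard": {"automatic": False, "parent": None, "pPr": {}, "rPr": {}},
    "P1": {"automatic": True, "parent": "Standard",
           "pPr": {"jc": "center"}, "rPr": {"b": True}},
}
paragraphs = [{
    "style": "P1",
    "runs": [{"text": "Hello"}, {"text": "World", "rPr": {"i": True}}],
}]
kept, converted = inline_automatic_styles(styles, paragraphs)
```

In this sketch the first run had no rPr at all, so one is created to carry the bold coming from "P1"; the second run keeps its own italic and gains the bold as well.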

Well, altogether, there are a little more than 700 lines of code, just to handle this little part of the transformation. So once again, we are very glad that we did not have to code the whole transformation this way! I really wonder how Office or OpenOffice.org coders can live without XSLT... ;-)