ODF Add-in for Microsoft Word

ODF Converter Team Blog

Friday 27 October 2006

Reaching the limits

I would have loved to replace this title with something like "Pushing the limits", but that would not have been very honest... Indeed, we sometimes reach the limits of the two document formats we are working with: OpenDocument and OpenXML. I don't intend to compare the pros and cons of each in detail here (I think there are people in the "blogosphere" who do it much better than I would ;-), but just to give two examples illustrating that neither format is perfect. I mean, a perfect format should be totally independent of the way it is rendered by one application or another, and there should not be any loss during a transformation (for features covered by both formats, of course).

In OpenDocument, page style changes can be implicit. For instance, if you want to put a landscape-oriented page inside a portrait-oriented document, you have to declare a page style with landscape orientation, and then insert a page break associated with this page style. OK, that's fine. But a page style also has a property that specifies the style of the following page - just as for paragraphs. The difference between pages and paragraphs is that a paragraph always ends with some special character or element (the user has to press the Return key), while a page usually ends "by itself" within the text flow - I mean, there is not necessarily a "page break" instruction. So in many cases, unless you actually render the page to see how its content fills it, you don't know where a page ends, and therefore you don't know where the page style changes for the following one. In practice, we have some "bugs" in our conversion that are directly linked to this issue, and that simply cannot be fixed. For instance, headers or footers sometimes change within the document, but nothing in the file tells us where they change - and of course, OpenXML needs explicit page style changes (otherwise it would have been far too easy). A user should normally always declare explicit page breaks when he wants to modify the page layout (document maintenance would be a lot easier, just as when preferring styles to direct formatting), but unfortunately it is not a common practice... Users have always loved to insert new paragraphs to fill empty spaces! I personally think that the ODF specification would do better to forbid page style changes without an explicit declaration.
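To make the "following style" issue concrete, here is a minimal sketch in Python. It assumes a hypothetical, heavily reduced fragment of an ODF styles.xml (the `master-styles` wrapper, the style names and the helper function are made up for illustration) and lists the master pages whose `style:next-style-name` points at a different page style. Those are the places where the page style can change without any explicit page break in the content.

```python
import xml.etree.ElementTree as ET

STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"

# Hypothetical fragment, reduced to the master pages only.
SAMPLE = f"""
<master-styles xmlns:style="{STYLE_NS}">
  <style:master-page style:name="First" style:next-style-name="Standard"/>
  <style:master-page style:name="Standard"/>
  <style:master-page style:name="Landscape" style:next-style-name="Landscape"/>
</master-styles>
"""

def implicit_page_style_changes(xml_text):
    """Return (style, next_style) pairs where the page style changes by itself."""
    root = ET.fromstring(xml_text)
    changes = []
    for page in root.iter(f"{{{STYLE_NS}}}master-page"):
        name = page.get(f"{{{STYLE_NS}}}name")
        next_name = page.get(f"{{{STYLE_NS}}}next-style-name")
        # A different "next" style means the change happens at a page end
        # that only a layout engine can locate.
        if next_name and next_name != name:
            changes.append((name, next_name))
    return changes

print(implicit_page_style_changes(SAMPLE))
```

A converter can detect such styles this way, but it still cannot tell where in the text the first page ends, which is exactly the information OpenXML needs.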

Another example, illustrating the limits of OpenXML this time: cell splitting in tables. In OpenDocument, cell splitting can be handled in two ways: either by splitting the table into smaller cells and joining them where necessary, or by defining subtables (the way OpenOffice.org works). In OpenXML, the second alternative does not exist: when you want to split a cell, you have to modify the whole table to define new columns or rows, and then join all the new cells that are not affected by the split. The following example should make it clearer:

Consider a 2x2 table:

| cell A1 | cell B1 | 
|- - - - -|- - - - -| 
| cell A2 | cell B2 | 

You want to split the first cell vertically:

| cell A1a |         | 
|- - - - - | cell B1 | 
| cell A1b |         | 
|- - - - - |- - - - -| 
| cell A2  | cell B2 | 

In OpenDocument, the simplest way would be to define a 2x2 table, and a 1x2 table inside the first cell (A1). The "subtable" property ensures that the cell borders will join. In OpenXML, to achieve the same result, you have to declare a 2x3 table with the first two cells of the second column joined. OK, that doesn't seem so terrible. But now, consider that you want to split the B1 cell into three cells in the same way, to get something like:

| cell A1a | cell B1a | 
|          |- - - - - | 
|- - - - - | cell B1b | 
| cell A1b |- - - - - | 
|          | cell B1c | 
|- - - - - |- - - - - | 
| cell A2  | cell B2  | 

In OpenXML, the number of rows depends on the heights of the different cells: in the previous example, 5 rows must be declared to describe the full table. But if the cells were laid out like this:

| cell A1a | cell B1a | 
|- - - - - |- - - - - | 
|          | cell B1b | 
| cell A1b |- - - - - | 
|          | cell B1c | 
|- - - - - |- - - - - | 
| cell A2  | cell B2  | 

then you would only need 4 rows to describe the table! As a cell's height is often implicit (it depends on the cell's content), it is sometimes impossible to reproduce the correct table layout... To avoid this issue, when converting a subtable from ODF to OpenXML, we simply define a new table inside the cell, exactly the same way ODF does. This solves the layout problem, but has a drawback: the table is not unified any more, it is composed of several nested tables. The simplest way to improve the OpenXML specification in this case would be to allow subtables.
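The row-counting argument can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the cell heights are made-up integer units (in a real document they are usually not known before rendering), and `rows_needed` is a hypothetical helper, not part of the converter. Every cell border has to become a row border in the flattened table, so the number of rows is the number of distinct border positions.

```python
from itertools import accumulate

def rows_needed(columns):
    """Count grid rows needed when every cell border must be a row border.

    `columns` is a list of columns, each a list of rendered cell heights
    (top to bottom, arbitrary integer units). All columns must have the
    same total height.
    """
    totals = {sum(col) for col in columns}
    if len(totals) != 1:
        raise ValueError("columns must have the same total height")
    boundaries = set()
    for col in columns:
        # Running sums give the position of the bottom border of each cell.
        boundaries.update(accumulate(col))
    return len(boundaries)

# First layout: the A1a/A1b border falls in the middle of B1b.
first = [[3, 3, 2], [2, 2, 2, 2]]
# Second layout: the A1a/A1b border is aligned with the B1a/B1b border.
second = [[2, 4, 2], [2, 2, 2, 2]]

print(rows_needed(first))   # 5
print(rows_needed(second))  # 4
```

The same cells give a different row count as soon as one height changes, which is why the flattened form cannot be computed reliably from the ODF file alone.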

These two examples aim to illustrate the fact that neither format can be seen as "better" than the other - each has its own characteristics, its own strengths and weaknesses. One of our goals when working on the converter is to find the "incompatibilities" between the two formats - features that cannot be converted from one to the other. We try to keep a list that will be made public at the end of the project, hoping that the organizations behind each format will have a look at it and maybe get some ideas to take a step in the direction of the other format. Sweet dreams...

Monday 23 October 2006

Final Release and the Train Model

It's been a while since I posted the last entry on this blog, but that doesn't mean that we've been quiet during this period: actually we're actively preparing the so-called "0.2-final" release, which will be made public on October 30th (next Monday). In the initial roadmap, this release was planned to cover the whole direct conversion (transforming ODF documents into DOCX). But due to difficulties we met during the implementation (I explained some of them on this blog), we could not achieve this result. We therefore decided to postpone several features (like drop caps, password protection or digital signatures) to a later release (hopefully 0.3-M1), so that we can make available a new version with quite a lot of improvements over the previous one (0.2-M3).

That's a project management model we usually call the "train model" (and which is typical of Open Source project management): instead of being driven by the features and adjusting the release dates according to the development speed, the idea is to keep fixed release dates and remove or add features depending on which are available at the time of the release. That way, users can test the product on a regular basis, and we avoid the well-known "tunnel effect". The analogy with a train is the following: features are like coaches in a train, and release dates are like train stations. The latter are fixed, and we keep the possibility to remove coaches from the train when needed in order to always arrive on time at the next station.

So please don't misunderstand: the "final" label does not mean that the development of the direct conversion will stop after this release. The good news is that we put a prototype version of the reverse conversion (transforming DOCX files into ODF) into this release, so that you'll be able to start testing the whole process: open and save ODF documents in Word! This reverse conversion will be available on the three targeted platforms: Word 2007, Word 2003 and Word XP. When coaches are ready ahead of time, why should we remove them from the train? ;-)

Tuesday 3 October 2006

Word doesn't like automatic styles...

Since the opening of this blog, there have been only a few posts explaining how we actually do the transformation and the issues we face day after day... So today I will explain one of the tricky things we had to implement - I hope it will convince you that we are also doing technical stuff! ;-)

I already mentioned that OpenDocument and OpenXML have very different ways of handling formatting properties: while OpenDocument uses "automatic styles" (that means: for every single formatting property, a style is defined and applied to the content; but that style is not intended to be shown to the user), OpenXML uses something we could call "direct formatting" (properties are directly attached to paragraphs or characters). There is a similarity between the two formats, though: they both make the same distinction between paragraph and character properties (paragraph properties are things like indentation, text alignment, etc., while character properties cover text font, size, color, etc.).

When we started working on the transformation, we found out that in OpenXML you can use "hidden styles" - styles that don't appear in the user interface. So we decided to simply transform every automatic style into a hidden style. Actually, at first glance that seemed to work quite well. But we later faced several issues:

1. In the Word 2007 user interface, when a "hidden style" is used, the relation to its parent style is lost. For instance, if we define an automatic style called "P1", and that style is based on the "Standard" style, we expect the user interface to display "Standard" as the style used. But Word 2007 only displays "Clear style", which is not very user-friendly for us...

2. We noticed that Word couldn't open very big files with a lot of automatic styles inside. This was not related to the size of the file, but to the number of styles defined. This should be OK for normal use (it happened after several tens of thousands of styles!) but in our case it was problematic.

3. We encountered a problem with toggle properties (properties that behave differently when they are applied through a style than when they are used as direct formatting). Let me give a simple illustration with the bold property: when used in a character style, it toggles the previous state of the character (if it was bold, it becomes normal, and vice versa); but when it is used as a direct formatting property, it enforces the state of the text (if it is switched on, then the text is bold, whatever its previous state). As you may have guessed, OpenDocument does not make this distinction: bold always means bold, wherever it is defined - in a normal style or in an automatic style. So to handle that, we had to override every toggle property defined in a style by adding the same definition as a direct formatting property. In XSLT, that rapidly became a nightmare.
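The toggle behaviour from the third point can be modelled with a small Python sketch. It follows the simplified description given above (a style-level setting flips the current state, direct formatting forces it) and is not meant as Word's exact resolution algorithm; `effective_bold` and its parameters are hypothetical names used for illustration.

```python
def effective_bold(style_chain, direct=None):
    """Resolve bold under the simplified model described above.

    style_chain: bold flags set by the styles applied to the text,
                 outermost first (True means "bold is set in that style").
    direct:      True/False if bold is set as direct formatting, else None.
    """
    state = False
    for flag in style_chain:
        if flag:
            state = not state  # a style-level toggle flips the current state
    if direct is not None:
        state = direct         # direct formatting forces the state
    return state

# Bold paragraph style plus a character style that also sets bold:
print(effective_bold([True, True]))               # False
# The same styles, with bold repeated as direct formatting:
print(effective_bold([True, True], direct=True))  # True
```

The second call shows why repeating the property as direct formatting restores the OpenDocument meaning, where bold always means bold.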

So in the end, we decided to write a post-processor dedicated to transforming automatic styles into direct formatting properties. In the .NET framework, this can be done at a very low level (even lower than SAX, for those who know SAX): you intercept every event encountered during the XML parsing: document start, element start, attribute start, string content, attribute end, element end, document end (to keep it short). So your program is responsible, for instance, for associating the value of an attribute with its name (they are transmitted in successive method calls). Actually, that can be handled quite simply with a stack: when an element starts, we push the corresponding node onto the stack, and pop it when the element closes. That makes the code quite difficult to read, though, and we don't regret our choice to use XSLT!
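For readers who have never written event-driven XML code, here is a minimal sketch of the stack technique, written in Python with the standard `xml.sax` module rather than .NET. Note the difference: SAX already delivers the attributes together with the element start, whereas the lower-level interface described above reports them as separate events. The `StackHandler` class and the sample document are assumptions made for illustration.

```python
import xml.sax

class StackHandler(xml.sax.ContentHandler):
    """Keeps the chain of open elements on a stack while events stream in."""

    def __init__(self):
        super().__init__()
        self.stack = []   # nodes for the elements that are currently open
        self.closed = []  # (path, text) recorded each time an element ends

    def startElement(self, name, attrs):
        # Element start: push a node describing it on top of the stack.
        self.stack.append({"name": name, "attrs": dict(attrs), "text": ""})

    def characters(self, content):
        # String content belongs to the element on top of the stack.
        if self.stack:
            self.stack[-1]["text"] += content

    def endElement(self, name):
        # Element end: build the path from the stack, then pop the node.
        path = "/".join(node["name"] for node in self.stack)
        node = self.stack.pop()
        self.closed.append((path, node["text"].strip()))

handler = StackHandler()
xml.sax.parseString(b"<doc><p><r>Hello</r></p></doc>", handler)
print(handler.closed)
```

At any point the stack tells the handler where it is in the document, which is what a post-processor needs in order to decide whether the current properties belong to a style, a paragraph or a run.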

In our case, we decided to keep the XSL as it is - that means we continue to replace automatic styles with hidden styles in our XSL transformation, and we replace those hidden styles with direct formatting during the post-processing. By doing so, we keep a "clean" XSL that can still be used in another context. We first intercept every style declaration and store it in a hashtable, removing the automatic styles from the output "on the fly". Then, we intercept each paragraph property (pPr) to fill it with automatic style properties (when needed); and we do the same with each run property (rPr). In fact, it is not as simple as it seems here, because paragraph properties also often define run properties (that apply to each run of the paragraph), and we have to replicate those properties in each run of the paragraph. Moreover, there are some specific cases to deal with, for instance when we have a run with no rPr - we might need to create one to apply the run properties from the paragraph.
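The two passes described above can be sketched as follows. This is a simplified Python illustration, not the actual post-processor (which is event-based .NET code): it works on an in-memory tree, it puts styles and content under one hypothetical `w:root` element although a real package keeps them in separate parts, it takes the list of automatic style ids as a parameter, and it ignores both the ordering rules inside pPr/rPr and the replication of paragraph-level run properties into each run. Pointing the paragraph back at the parent style named in `basedOn` is also an assumption of this sketch.

```python
import copy
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def q(tag):
    """Qualified name in the WordprocessingML namespace."""
    return f"{{{W}}}{tag}"

def inline_styles(root, automatic_ids):
    """Replace references to automatic styles with their properties."""
    # Pass 1: store automatic style declarations, drop them from the output.
    table = {}
    for parent in list(root.iter()):
        for style in parent.findall(q("style")):
            if style.get(q("styleId")) in automatic_ids:
                table[style.get(q("styleId"))] = style
                parent.remove(style)
    # Pass 2: fill each pPr / rPr that points at a stored style.
    for kind, ref in (("pPr", "pStyle"), ("rPr", "rStyle")):
        for props in list(root.iter(q(kind))):
            link = props.find(q(ref))
            style = table.get(link.get(q("val"))) if link is not None else None
            if style is None:
                continue
            based_on = style.find(q("basedOn"))
            if based_on is not None:
                # Assumption: refer to the parent style instead.
                link.set(q("val"), based_on.get(q("val")))
            else:
                props.remove(link)
            source = style.find(q(kind))
            for child in (source if source is not None else []):
                props.append(copy.deepcopy(child))
    return root

SAMPLE = f"""
<w:root xmlns:w="{W}">
  <w:style w:styleId="P1"><w:basedOn w:val="Standard"/><w:pPr><w:jc w:val="center"/></w:pPr></w:style>
  <w:p><w:pPr><w:pStyle w:val="P1"/></w:pPr><w:r><w:t>Hello</w:t></w:r></w:p>
</w:root>
"""

result = inline_styles(ET.fromstring(SAMPLE), {"P1"})
print(ET.tostring(result, encoding="unicode"))
```

After the call, the "P1" declaration is gone and the paragraph carries the centered alignment itself while referring to "Standard".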

Well, altogether, there are a little more than 700 lines of code, just to handle this small part of the transformation. So once again, we are very glad that we did not have to code the whole transformation this way! I really wonder how Office or OpenOffice.org coders can live without XSLT... ;-)
