ODF Add-in for Microsoft Word

ODF Converter Team Blog

Tuesday 1 July 2008

Announcing Milestone 1 Release for Version 2.0 of ODF Translators

Following the success of ODF Translator V1.0, the Milestone 1 release of ODF Translator V2.0 is now available for download from the ODF Converter SourceForge website. The next and final milestone is planned for the first week of August 2008.

M1 - Release

The current Milestone 1 release of ODF Translator V2.0 extends the translation of text documents, spreadsheets and presentations from OpenXML to ODF and vice versa with the following features.

Excel (Download Excel Translator -M1)

  • Chart Features such as Chart Area, Chart Wall, Legends
  • Implementation of Reported Defects (Cell borders, Commandline, Freeze Pane, Font change, ODS filters, Reverse Roundtrip, Crash Defect etc)

Word (Download Word Translator -M1)

  • Basic conversion of VML shapes
  • Implementation of Reported Defects

PowerPoint (Download PPT Translator -M1)

  • Slide Transition
  • Basic OLE
  • Implementation of Reported Defects (Text of notes, Animation actions, Image color properties, Default save as file name, Gradient direction, Roundtrip, Option dialog etc)


Upcoming M2 - Release

The M2 release of V2.0 will be available on the ODF Converter SourceForge website during the first week of August 2008. Here is a heads-up on the features of the upcoming release.

Excel

  • Improvement of Chart properties
  • Digital Code Signing
  • Implementation of Reported Defects (Filters, Format of cells, Conversion time, Print etc)

Word


  • Improvement of VML shapes implementation
  • Digital Code Signing
  • Implementation of Reported Defects (Drawing object, Options Dialog for EN, Fidelity message for Japanese, Roundtrip, Localization, Build etc)

PowerPoint


  • Improvement of OLE objects
  • Digital Code Signing
  • Implementation of Reported Defects (Add-In dialog box, Tables, Uninstall, Read only files of PPT, Locale, Point of connectors, Default Tab Stop etc)

Stay tuned and check out new updates of V2.0 in this blog space.

Wednesday 21 May 2008

Back Again for v2.0

After we released v1.1 in March, we received a lot of feedback from various sources, including public administrations from all over Europe (thank you very much for it!). We used the rainy spring months here in Europe to collect, analyse and classify this feedback and thus to define the scope of OpenXML/ODF Translator v2.0. Development work started last week with the well-known teams on board: Sonata, Cleverage, DIaLOGIKa and, last but not least, Microsoft.



Just to mention a number of highlights we will provide in v2.0:


  • supporting ODF 1.1 features (ODF 1.1 compliance)
  • improving roundtrip conversion, in particular in the reverse direction from DOCX to ODT
  • enhancing table and field translation
  • adding shape translation (VML)
  • providing OLE translation in PPT
  • supporting chart properties
  • signing the add-in code, thus allowing the add-in to be deployed in environments where the Office security level is set to high
  • usability enhancements
  • Mono compatibility
  • fixing a number of bugs (thanks again for the excellent feedback from public authorities)


The detailed roadmap describing the various milestones (the final version is planned for the beginning of August) will be made available on the project web site soon. Stay tuned :-)

Friday 7 March 2008

Long Time No See

It has really been a long time since our last posting here. However, we have put this time to good use enhancing the OpenXML/ODF Translator: Release 1.1 is now available for translating text documents, spreadsheets and presentations from OpenXML to ODF and vice versa, offering a large number of enhancements compared to release 1.0. What are the improvements and new features in detail?

Translator for Text Documents

Release 1.1 has been implemented to remedy some of the bugs encountered in v1.0, but also to provide a more complete feature translation, in particular for documents exchanged in the public administration of the EU (aka Eurolook and LegisWrite documents). Consequently, we focussed in v1.1 on performance improvements, page layout settings (including footer and header), OLE objects, headings, lists and table of contents (ToC).

Translator for Spreadsheets

Improved performance and increased reliability were two of the key features in release 1.1. In addition, we focussed on an enhanced roundtrip translation experience and fixed a large number of bugs reported for the first release. Furthermore, the setup procedure has been simplified: one (localized) setup program installs the add-in for all supported Office versions.

Translator for Presentations

Release 1.1 of the presentation translator also improves performance by considerably reducing the loading time of the translation tables. Apart from that, the translated feature set has been extended, e.g. by supporting additional shapes, and many v1.0 features have been reworked. The setup procedure has also been simplified in the same way as for the text document and spreadsheet translator add-in, i.e. one (localized) setup program installs the add-in for all supported Office versions.

Command Line Translator

One single executable is now available for all three document types; thus, deployment and configuration of the command line translator, e.g. on a translation server, has become easier and more straightforward. The command line translator is the ideal tool in domain specific scenarios where desktop deployments are restricted, e.g. to implement a centralized conversion service, to do batch processing, etc.

We are looking forward to receiving your feedback on release 1.1 of the OpenXML/ODF Translator.

Stay tuned for future updates here…

Monday 23 July 2007

Open XML Translators’ Latest Releases!!!

It’s Release time again!!! This time we have Spreadsheet (Excel) and Presentation (PowerPoint) translators’ M3 releases and Document (Word) 1.1 translator’s M1 release for all of you to download, play with and provide your candid feedback on.

Over the past month, our teams have been slogging hard to bring the following feature sets and bug fixes (applicable to Word 1.1 releases) in the latest release versions:

Translator for Spreadsheet / Excel M3 features: Charts (Covers 2D+3D with Line, Area, Column, Bar and Pie Charts), Hyperlinks, Conditional formatting (only direct conversion implemented), Pictures, Data styles (includes Time, Fractions, Scientific formats) and Annotations (for reverse conversion: shapes properties are not implemented yet).

Translator for Presentation / PowerPoint M3 features: Design Layouts, Text Box, Paragraph, Shapes, Format Background, Text, Paragraph, Formatting Indent & Spacing and List numbering & bullets.

Translator for Document / Word 1.1 M1 scope: Bug fixes for loss in paragraph formats, style formats, field content loss and Text flow problems in particular scenarios. Other than these fixes, for the first time, you’ll see Word translator in Japanese, besides French, German, Polish, Dutch, and, but of course, English!

As always, all the known issues for the releases are tracked in the form of bugs, release notes for Excel, PowerPoint, Word 1.1 translators and format conversion known issues.

Till I post you with updates on the next set of releases, keep the faith, and please share with us your earnest feedback…

Friday 15 June 2007

Did I just mention Word 1.1!

Well folks, you heard right! After Word 1.0 release, we have been listening and gathering your feedback and comments and have decided to roll out the sequel to Word 1.0. Word 1.1 is slated to be released in fall 2007, along with Excel and PowerPoint translators’ v1 releases, with intermediate CTP updates.

At a high level, Word 1.1 aims to address the following issues:

  • Enhancements to end user usability in terms of:
  1. Installation
  2. Predictable translation
  3. Localization - Additional Languages supported will include Chinese and Japanese with updates to Dutch
  4. Error messages & documentation
  • Fixing all critical defects
  • Performance Improvements

Some major bug fixes will be made with regard to content loss, formatting, headers, footers, lists and shapes. We are still finalizing the roadmap in terms of intermediate releases, etc. for 1.1 and will keep you posted on that front with newer updates. Our Word 1.0 translator team, which includes Cleverage, Aztecsoft and DIaLOGIKa, will be supporting this release as well. We look forward to your support and feedback in ensuring we enhance the translator to meet customer needs.

Thursday 7 June 2007

Introducing Office Excel and PowerPoint Translators

Well, what comes after Word translator 1.0? We have been hearing about Excel and PowerPoint translators every now and then, but now we present M2 releases of both these translators for all to download. The 1.0 / final versions of both these translators are expected to be released in fall 2007. The roadmaps for the translators provide further details in terms of intermediate milestone releases.

The Excel and PowerPoint translators’ lifecycle began in Feb 2007, with the M1 (Milestone 1) release of Excel happening at the end of March. The latest version, M2 (Milestone 2), released on May 21st, features release drops for Excel and PowerPoint add-ins with the command line option. Support for ODF 1.0 -> Open XML and vice versa (colloquially termed forward / direct and reverse transforms within our team) is implemented and released together to enable end users to exercise a complete round trip scenario for a feature.

M1 milestone for Excel premiered Basic and Advanced Table Model, Basic Text and Paragraph Formatting, Document Metadata and Document structure. Milestone M2 for Excel, constructed on this foundation, features Annotations (available only in direct mode), Basic and Advanced Table Model, Headers & Footers, Styles and Default styles, Text & Paragraph formatting, Page Styles and Layout and some Data styles. M2 for PowerPoint converter covers Page Setup, Custom Slide Show, Footer, Design Layouts (Blank and Title only), Text Font Formatting and Rectangle & Textbox shapes. All the known issues for the releases are tracked in the form of bugs, release notes for Excel and PowerPoint translators and format conversion known issues.



This is yet another stepping stone in Microsoft’s interoperability initiative so that support for Spreadsheets and Presentation (ODF 1.0) formats is also made available along with the already supported document formats. The official announcement can be viewed here.

Now, how about team introductions! Along with Cleverage working on developing the Excel translator, and Aztecsoft and DIaLOGIKa testing both the translators, we have Sonata who has joined the translators’ bandwagon by working on PowerPoint translator development.

We are anticipating your active participation this time as well…Will keep you posted with further updates and releases…

Friday 2 February 2007

Release 1.0 now available!

As you might have noticed, the long awaited Release 1.0 is now available for download!

It has been 8 months since the project started in early June and we've been working hard all that time to ensure the add-in meets a high quality level. It has been tested on Word XP, Word 2003 and Word 2007 in five different languages (English, Dutch, French, German and Polish) thanks to DIaLOGIKa (http://www.dialogika.de/) and Aztecsoft (http://www.aztecsoft.com/).

Since the official announcement from Microsoft (see the press news), over 9,000 copies have been downloaded from SourceForge.org (at the time of posting this entry)! It is a great pleasure for us to see that the add-in has been such an enormous success.

Already, we have received quite a lot of press coverage. A Google search for ODF-Converter returns 871 000 results! We are proud to bring life to a project that has stirred a big crowd, sometimes positively, sometimes not, as the whole open standard issue seems to have become a political controversy...

Nonetheless, we sincerely hope this add-in will benefit the entire community in exchanging electronic documentation. Yes, exchanging documents won't be a chore any more. Yes, you will be able to read and write ODF documents even if you don't have the OpenOffice suite (or other ODF implementations). And the converse holds as well, since Novell (http://www.novell.com/) has announced that the Translator will be natively implemented in its next version of OpenOffice.

But we won't stop here.... We are also planning to work on Excel and PowerPoint, and this should be available by the end of the year (November 2007), so stay tuned and keep on providing feedback as this will help us in future development!

Thanks all for participating in this exciting project! I'm looking forward to the next releases... ;)

Thursday 7 December 2006

Open XML and extensibility

MS Workshop on Open XML

During the last few days, I was invited to participate in a workshop about Open XML targeted at developers. It took place at the Microsoft Technology Center in Paris and was organized by Guillaume Renaud (from Microsoft France), Doug Mahugh (Office 2007 Technical "Evangelist", coming from Microsoft Corp.) and Wouter van Vugt (from InfoSupport). They kindly invited me to present the ODF Converter project, which I was happy to do. Being at the technology center on Wednesday afternoon, I was also invited to repeat the presentation (in a much shorter time!) for the "developers wednesdays", an event organized weekly by the MSDN France team (a nice report was posted, in French, by Julien Chable on his blog). I must say that it was a real pleasure to meet all those people whose writing I had read many times on the internet. We had some very interesting discussions about several aspects of the project: differences between ODF and Open XML, interoperability, Open Source...

The subtable issue, second

Having recently faced a critical issue related to Open XML when working on the converter, I did not miss the occasion to ask such world-famous experts for advice. But let me first explain the problem. It is closely linked to the subtable issue I reported a few weeks ago on this blog. In OpenDocument, when a table inside a cell has this "subtable" attribute set to "true", it means that the rendering engine has to join the borders of the tables, so that it looks like a single, split table. But subtable doesn't have any equivalent in OpenXML. As I explained in my previous post, we ended up simply embedding tables inside cells and removing the outer borders to obtain an acceptable result. It is more of a "better than nothing" workaround than a satisfying solution. But anyway, we did not have any better alternative.

When converting our table (containing a subtable) back to ODF, we would like to find our subtable attribute back, so that we don't lose anything during the whole conversion process. To achieve such a result, we need to find a way to add some custom property to the converted table. This property must be:

  • transparent for the user (Word rendering engine must ignore it)
  • preserved by Word when saving the document
  • recognized by our converter during the reverse conversion.

How to extend Open XML with custom properties

Among all the extensibility features provided by Open XML, I truly thought that we would find some candidates to solve our problem. Let's examine them one after the other.

Custom XML Markup & Smart Tags

Those two features would have been good options to store a custom property for our subtable. Unfortunately, they are not transparent for the user: SmartTags appear underlined, whereas custom XML markup is shown as little boxes. It is possible to ask Word not to show them, but that depends on the user's configuration.

Custom XML Parts & Content Controls

Open XML came up with the notion of content controls and custom XML parts. But content controls are also always visible in some way to the user, even if they are empty. That does not prevent us from defining custom XML parts in our document in order to store some additional information, e.g. "subtable" properties for tables. But unfortunately, there is no way to identify an element outside its definition in a way that would not change upon user actions (such as saving).

Processing instructions

We also thought of using processing instructions, but it appeared that Word did not preserve them when saving a document.

Participate in the contest!

I discussed this issue with Doug and Wouter, but we could not find any satisfying solution, that is, one that would be totally transparent to the user. I find it quite disappointing, because this sounds like a basic extensibility feature to me. That does not mean that it is impossible, only that we have not found the right way to do it yet. One of the workshop attendees suggested improving the ECMA specification by adding an attribute to custom properties that would prevent Word from rendering them. Well, that's not a bad idea. If some of the Open XML schema designers could hear it...

But in the meantime, any suggestion will be much appreciated!

UPDATE: Doug and Wouter posted nice reports on the Open XML Workshop in Paris on their blogs: here and here

Thursday 30 November 2006

Launching of 0.3-M1 release

Last week we released version 0.3-M1 of the converter. What do those numbers mean?

  • 0.3 means that we are now working mainly on the reverse conversion (from DOCX to ODT); the direct conversion will continue to be improved, but it will be far less visible than during the previous months (we have fixed a lot of bugs since the last release, though: the number of open bugs on SourceForge dropped from more than 100 to fewer than 50 at the time of the release)
  • M1 stands for "Milestone 1" and corresponds to a set of features that were implemented according to the roadmap of the project.

For simple documents, the reverse conversion works quite well, allowing users to manipulate OpenDocument text files directly in Word. Our main concern is now to make the process of opening an ODT file and saving it back to ODT as accurate as possible. That means that if we have to implement workarounds to convert features that are not directly available in one format or the other, those workarounds will have to be preserved during the reverse conversion. To ensure that this process works well, we iterate it several times on one file, and see the final result as something we could call the "fixed point" of the converter (referring to the famous fixed-point theorems in mathematics).
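The idea of iterating the conversion until the result stops changing can be sketched as follows. This is only an illustration in Python, not the converter's code: `find_fixed_point` and the toy `round_trip` function are hypothetical names, and a real round trip would convert ODT to DOCX and back instead of rewriting a string.

```python
def find_fixed_point(document, round_trip, max_passes=10):
    """Apply round_trip repeatedly until the output equals the input.

    Returns the stable document and the number of passes that were run.
    """
    for passes in range(1, max_passes + 1):
        converted = round_trip(document)
        if converted == document:
            return document, passes
        document = converted
    raise RuntimeError("no fixed point reached within max_passes")


# Toy stand-in for an ODT -> DOCX -> ODT round trip (assumption):
# each pass halves runs of spaces, so several passes are needed.
def round_trip(text):
    return text.replace("  ", " ")


stable, passes = find_fixed_point("a    b", round_trip)
print(stable, passes)
```

If the loop never settles, the round trip keeps altering something on every pass, which is exactly the kind of workaround loss the test is meant to reveal.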

Once we have an acceptable result for direct / reverse conversions, we will enhance our transformations so that they can also work correctly on legacy doc files produced by previous versions of Word (there are tons of features that are marked as deprecated in the OpenXML specification).

And to finish, some good news: the Google team finally fixed the problem of ODT export in Google Docs. You can now import those documents into Word without problems!

Friday 3 November 2006

What's the problem with Google Docs?

You may have learnt that we released a new version of the ODF Translator yesterday, called 0.2-Final (even if it's not really a final version, as I've already explained on this blog), providing a lot of new features for the direct conversion (from ODT to DOCX) and a prototype of the reverse conversion (from DOCX to ODT). While the reverse conversion is only at its beginning, the direct conversion now looks quite good and provides most of the features that we expect from a converter. A complete list of features is available on SourceForge. The main functionalities that are still to be developed are:

  • digital signature
  • encryption
  • section protection
  • drop caps

along with several other small features. The roadmap for both direct and reverse conversion is also available for download on SourceForge.

This release has been intensively tested by our test teams from Dialogika and AztecSoft. Test scenarios are described here and here. But this time, we also wanted to have an overview of how the converter behaves with real-life files. To this end, Dialogika gathered more than 450 ODT files on the internet and here are the results:

  • 411 documents were converted, validated and opened successfully in Word
  • 10 documents were valid, but could not be opened in Word
  • 28 documents were invalid but could be opened in Word
  • 7 documents were invalid and could not be opened in Word
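As a quick check of these figures, the four counts can be summed and turned into percentages. The snippet below is a small Python sketch; the dictionary keys are labels made up for this illustration, while the counts are the ones listed above.

```python
# Counts taken from the list above; key names are illustrative only.
results = {
    "valid_and_opened": 411,
    "valid_not_opened": 10,
    "invalid_but_opened": 28,
    "invalid_not_opened": 7,
}

total = sum(results.values())
opened = results["valid_and_opened"] + results["invalid_but_opened"]
valid = results["valid_and_opened"] + results["valid_not_opened"]


def share(count):
    """Percentage of the whole sample, rounded to one decimal."""
    return round(100 * count / total, 1)


print(total)                               # 456
print(share(results["valid_and_opened"]))  # 90.1
print(share(opened))                       # 96.3
print(share(valid))                        # 92.3
```

So the sample holds 456 files, of which about 90% passed every check.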

Of course, this does not show anything about the way the documents were actually rendered in Word! But anyway, the results were not as good as we might have expected, so we made some quick investigations to identify the issues that led to those results. After a one-day exploration, most of them could be fixed, and we therefore decided to publish a "Hot Fix" for the 0.2-Final next week. It won't add any new features, but will fix those file crashes.

During these "real life" tests, we noticed that none of the files created with the online application "Google Docs" were converted successfully. This was strange enough for us to look in detail at what was wrong. And we found out that Google Docs was simply not able to export to ODF. Actually, the file menu says "Save as OpenOffice" and not "Save as OpenDocument". The output file is an SXW file (the legacy format from previous versions of Star Office and OpenOffice.org)... with an ODT extension! I don't know if by doing it this way the guys from Google wanted to make people think that they had implemented the ODF format, but that was a nice try! ;-) I guess that they are working hard to achieve compatibility, but in the meantime our converter won't be able to open documents made with Google Docs. No need to complain: we have committed to handling the OASIS OpenDocument format, not all the formats on earth!
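One way to tell a real ODT file from an SXW file carrying an ODT extension is to look inside the ZIP package rather than at the file name. The Python sketch below assumes the package contains a `mimetype` entry, as ODF packages normally do; the function names are hypothetical and this is not how the converter itself performs the check.

```python
import zipfile

# Media types of the two formats discussed above.
ODT_MIME = "application/vnd.oasis.opendocument.text"
SXW_MIME = "application/vnd.sun.xml.writer"


def package_mimetype(path_or_file):
    """Return the content of the package's 'mimetype' entry, or None."""
    with zipfile.ZipFile(path_or_file) as package:
        try:
            return package.read("mimetype").decode("ascii").strip()
        except KeyError:
            return None  # no mimetype entry in this package


def is_opendocument_text(path_or_file):
    """True only if the package declares itself as OpenDocument text."""
    return package_mimetype(path_or_file) == ODT_MIME
```

A file for which `package_mimetype` returns the `SXW_MIME` value is a legacy OpenOffice.org 1.x document, whatever its extension says.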

Friday 27 October 2006

Reaching the limits

I would have loved to replace this title with something like "Pushing the limits", but that would not have been very honest... Indeed, we sometimes reach the limits of both document formats we are working with: OpenDocument and OpenXML. I don't intend to compare the pros and cons of each of them in detail here (I think there are people in the "blogosphere" who do it much better than I would ;-), but just to give two examples to illustrate that neither format is perfect. I mean, a perfect format should be totally independent of the way it is rendered by one application or another, and there should not be any loss during a transformation (for features covered by both formats, of course).

In OpenDocument, page styles can be implicitly declared. For instance, if you want to put a landscape-oriented page inside a portrait-oriented document, you have to declare a page style with landscape orientation, and then insert a page break associated with this page style. Ok, that's fine. But there is a property for a page style that specifies the following style, just as for paragraphs. The difference between pages and paragraphs is that a paragraph always ends with some special character or element (the user has to press the carriage return key), while a page usually ends "by itself" within the text flow: there is not necessarily a "page break" instruction. So in many cases, unless you actually render the page to see how it is filled by its content, you don't know where a page ends, and therefore you don't know when the page style changes for the following one. Practically, we have some "bugs" in our conversion that are directly linked to this issue, and that simply cannot be fixed. For instance, it happens that headers or footers change in the document, but we have no element telling us when they change, and of course OpenXML needs explicit page style changes (otherwise it would have been far too easy). A user should normally always declare explicit page breaks when he wants to modify the page layout (document maintenance would be a lot easier, just as when preferring styles to direct formatting), but unfortunately it is not a common practice... Users have always loved to insert new paragraphs to fill empty spaces! I personally think that the ODF specification would do better to forbid page style changes without an explicit declaration.

Another example, illustrating the limits of OpenXML this time: cell splitting in tables. In OpenDocument, cell splitting can be handled in two ways: either by splitting the table into smaller cells and joining them when necessary, or by defining subtables (the way OpenOffice.org works). In OpenXML, the second alternative does not exist: when you want to split a cell, you have to modify the whole table to define new columns or rows, and then join all the new cells that are not concerned by the splitting. The following example should make it clearer:

Consider a 2x2 table:

| cell A1 | cell B1 | 
|- - - - -|- - - - -| 
| cell A2 | cell B2 | 

You want to split the first cell vertically:

| cell A1a |         | 
|- - - - - | cell B1 | 
| cell A1b |         | 
|- - - - - |- - - - -| 
| cell A2  | cell B2 | 

In OpenDocument, the simplest way would be to define a 2x2 table, and a 1x2 table inside the first cell (A1). The "subtable" property ensures that the cell borders will join. In OpenXML, to achieve the same result, you have to declare a 2x3 table with the first two cells of the second column joined. OK, that doesn't seem so terrible. But now, consider that you want to split the B1 cell into three cells horizontally, to have something like:

| cell A1a | cell B1a | 
|          |- - - - - | 
|- - - - - | cell B1b | 
| cell A1b |- - - - - | 
|          | cell B1c | 
|- - - - - |- - - - - | 
| cell A2  | cell B2  | 

In OpenXML, the number of rows depends on the height of the different cells: in the previous example, we must have 5 rows declared to describe the full table. But if the cells were organized like this:

| cell A1a | cell B1a | 
|- - - - - |- - - - - | 
|          | cell B1b | 
| cell A1b |- - - - - | 
|          | cell B1c | 
|- - - - - |- - - - - | 
| cell A2  | cell B2  | 

then you would only need 4 rows to describe the table! As the cell's height is often implicit (depending on the cell's content), it is sometimes impossible to reproduce the correct table layout... To avoid this issue, when converting a subtable from ODF to OOX, we simply define a new table inside the cell exactly the same way as ODF does. This solves the layout problem, but has a drawback: the table is no longer unified, it is composed of several nested tables. The simplest way to improve the OpenXML specification in this case would be to allow subtables.
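The row counts quoted for the two layouts can be reproduced with a small calculation: every horizontal border in any column forces a row boundary in the flattened table, so the number of rows is the number of distinct boundary positions. The Python sketch below uses made-up relative cell heights (assumptions chosen to match the two drawings), and `grid_rows` is a hypothetical helper name.

```python
from itertools import accumulate


def grid_rows(*columns):
    """Count the rows a flat grid needs.

    Each column is a list of cell heights from top to bottom; all columns
    are assumed to have the same total height.
    """
    boundaries = set()
    for heights in columns:
        # Running totals give the position of each cell's bottom border.
        boundaries.update(accumulate(heights))
    return len(boundaries)


# First layout: the A1a/A1b border falls between the borders of B1a and B1b.
print(grid_rows([3, 3, 2], [2, 2, 2, 2]))  # 5

# Second layout: the A1a/A1b border lines up with the B1a/B1b border.
print(grid_rows([2, 4, 2], [2, 2, 2, 2]))  # 4
```

Since the heights are usually implied by the content, the converter cannot know these positions without rendering the table, which is why the flattening approach breaks down.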

Those two examples aim to illustrate the fact that neither format can be seen as "better" than the other: each has its own characteristics, its own strengths and weaknesses. One of our goals when working on the converter is to find the "incompatibilities" between the two formats, that is, features that cannot be converted from one to the other. We try to keep a list that will be made public at the end of the project, hoping that the organizations behind each format will have a look at it and maybe get some ideas to step forward in the direction of the other format. Sweet dreams...

Monday 23 October 2006

Final Release and the Train Model

It's been a while since I posted the last entry on this blog, but that doesn't mean that we've been quiet during this period: actually we're actively preparing the so-called "0.2-final" release, which will be made public on October 30th (next Monday). In the initial roadmap, this release was planned to cover the whole direct conversion (transforming ODF documents into DOCX). But due to difficulties we met during the implementation (I explained some of them on this blog), we could not achieve this result. We therefore decided to postpone several features (like drop caps, password protection or digital signature) to a later release (hopefully the 0.3-M1), so that we can make available a new version with quite a lot of improvements compared to the previous one (0.2-M3).

That's a project management model we usually call the "train model" (and which is typical of Open Source project management): instead of being driven by the features and adjusting the release dates according to the development speed, the idea is to keep fixed release dates and remove or add features depending on which are available at the time of the release. In that way, users can test the product on a regular basis, and we avoid the well-known "tunnel effect". The analogy with a train is the following: features are like coaches in a train, and release dates are like train stations: the latter are fixed, and we keep the possibility of removing coaches from the train when needed in order to always arrive on time at the next station.

So please don't misunderstand: the "final" label does not mean that the development of the direct conversion will stop after this release. The good news is that we put a prototype version of the reverse conversion (transforming DOCX files into ODF) into this release, so that you'll be able to start testing the whole process: open and save ODF documents in Word! This reverse conversion will be available on the three targeted platforms: Word 2007, Word 2003 and Word XP. When coaches are ready before time, why should we remove them from the train? ;-)

Tuesday 3 October 2006

Word doesn't like automatic styles...

Since the opening of this blog, there have been only a few posts explaining how we actually do the transformation and the issues we face day after day... So today I will explain one of the tricky things we had to implement. I hope that it will convince you that we are also doing technical stuff! ;-)

I already mentioned that OpenDocument and OpenXml have very different ways of handling formatting properties: while OpenDocument uses "automatic styles" (that means: for every single formatting property, a style is defined and applied to the content; but that style is not intended to be shown to the user), OpenXml uses something we could call "direct formatting" (properties are directly associated with paragraphs or characters). There is a similarity between the two formats, though: they both make the same distinction between paragraph and character properties (paragraph properties are things like indentation, text alignment, etc., while character properties cover text fonts, size, color, etc.).

When starting to work on the transformation, we found out that in OpenXml you can use "hidden styles", styles that don't appear in the user interface. So we decided to simply transform every automatic style into a hidden style. Actually, at first glance that seemed to work quite well. But we later faced several issues:

1. In the Word 2007 user interface, when a "hidden style" is used, the relation to its parent style is lost. For instance, if we define an automatic style called "P1", and that style is based on the "Standard" style, we expect the user interface to display "Standard" as the style used. But Word 2007 only displays "Clear style", which is not very user friendly for us...

2. We noticed that Word couldn't open very big files with a lot of automatic styles inside. This was not related to the size of the file, but to the number of styles defined. This should be OK for normal use (it happened after several tens of thousands of styles!) but in our case it was problematic.

3. We encountered a problem with toggle properties (properties that behave differently when they are applied in a style than when they are used as direct formatting). Let me give a simple illustration with the bold property: when used in a character style, it toggles the previous state of the character (if it was bold, it becomes normal, and vice versa); but when it is used as a direct formatting property, it enforces the state of the text (if it is switched on, then the text is bold, whatever its previous status was). As you may have guessed, OpenDocument does not make this distinction: bold always means bold, wherever it is defined, in a normal style or in an automatic style. So to handle that, we had to override every toggle property defined in a style by adding the same definition as a direct formatting property. In XSLT, that rapidly became a nightmare.
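The toggle behaviour described in point 3 can be modelled in a few lines. This Python sketch only mirrors the simplified description given here (a style flips the inherited state, direct formatting sets it outright); `effective_bold` is a hypothetical name and the full rules in the OpenXML specification are more detailed.

```python
def effective_bold(inherited, style_bold, direct_bold=None):
    """Resolve the bold state of a run.

    inherited   -- bold state coming from the surrounding context
    style_bold  -- True if the character style sets the bold toggle
    direct_bold -- True/False if set as direct formatting, None if absent
    """
    state = inherited
    if style_bold:
        state = not state      # a style toggles the previous state
    if direct_bold is not None:
        state = direct_bold    # direct formatting enforces the state
    return state


# Text that is already bold, with a character style that also says "bold":
print(effective_bold(True, True))                    # False
# Repeating the property as direct formatting restores the ODF meaning:
print(effective_bold(True, True, direct_bold=True))  # True
```

The second call shows why the same definition had to be repeated as direct formatting: only then does "bold" mean bold regardless of the inherited state.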

So in the end, we took the decision to write a post-processor dedicated to transforming automatic styles into direct formatting properties. In the .NET framework, this can be done at a very low level (even lower than SAX, for those who know SAX): you intercept every event encountered during the XML parsing: document start, element start, attribute start, string content, attribute end, element end, document end (in short). So your program has the responsibility, for instance, of associating the value of an attribute with its name (they are transmitted in successive method calls). Actually, that can be handled quite simply through the use of a stack: when an element starts, we put the corresponding node on top of the stack, and we retrieve it when the element closes. That makes the code quite difficult to read, though, and we don't regret our choice to use XSLT!
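The stack idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the event method names (start_element, start_attribute, string, end_attribute, end_element) are hypothetical stand-ins for the low-level calls mentioned above, and the real post-processor is written in C#.

```python
class Node:
    """One XML element being assembled from low-level events."""

    def __init__(self, name):
        self.name = name
        self.attributes = {}
        self.children = []
        self.text = ""


class StackingHandler:
    """Rebuilds nodes from successive events, using a stack."""

    def __init__(self):
        self.stack = []                # elements that are currently open
        self.completed = []            # finished top-level elements
        self.current_attribute = None  # name of the attribute being written

    def start_element(self, name):
        self.stack.append(Node(name))

    def start_attribute(self, name):
        self.current_attribute = name

    def string(self, value):
        node = self.stack[-1]
        if self.current_attribute is not None:
            # The value arrives separately from the attribute name.
            previous = node.attributes.get(self.current_attribute, "")
            node.attributes[self.current_attribute] = previous + value
        else:
            node.text += value

    def end_attribute(self):
        self.current_attribute = None

    def end_element(self):
        node = self.stack.pop()
        if self.stack:
            self.stack[-1].children.append(node)
        else:
            self.completed.append(node)
        return node


handler = StackingHandler()
handler.start_element("w:p")
handler.start_element("w:pPr")
handler.start_element("w:pStyle")
handler.start_attribute("w:val")
handler.string("P1")
handler.end_attribute()
handler.end_element()  # w:pStyle
handler.end_element()  # w:pPr
handler.end_element()  # w:p
```

After the last call, handler.completed holds the paragraph node, with the style name attached to the right attribute of the right element.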

In our case, we decided to keep the XSL as it is - that means we continue to replace automatic styles with hidden styles in our XSL transformation, and we convert those styles into direct formatting during the post-processing. By doing so, we keep a "clean" XSL that can still be used in another context. We first intercept every style declaration and store it in a hashtable, removing the automatic styles from the output "on the fly". Then, we intercept each paragraph property (pPr) to fill it with automatic style properties (when needed); and we do the same with each run property (rPr). In fact, it is not as simple as it seems here, because paragraph properties also often define run properties (that apply to each run of the paragraph), and we have to replicate those properties in each run of the paragraph. Moreover, there are some specific cases to deal with, for instance when we have a run with no rPr - we might need to create one to apply the run properties from the paragraph.
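As a rough illustration of that logic, here is a Python sketch working on plain dictionaries instead of an XML stream. The data layout, the function name, and the choice to point the paragraph back at the parent style are all assumptions made for the example; the actual post-processor works on the fly on XML events.

```python
def inline_automatic_styles(automatic_styles, paragraphs):
    """Replace references to automatic styles with direct formatting.

    automatic_styles: {name: {"parent": str or None, "pPr": {...}, "rPr": {...}}}
                      (plays the role of the hashtable of intercepted styles)
    paragraphs:       [{"style": name, "pPr": {...}, "runs": [{"text": ...}]}]
    """
    for paragraph in paragraphs:
        name = paragraph.get("style")
        if name not in automatic_styles:
            continue  # a "real" style: leave the paragraph untouched
        style = automatic_styles[name]
        # Assumption: the paragraph then refers to the parent style.
        paragraph["style"] = style.get("parent")
        # Properties already set directly win over those from the style.
        paragraph["pPr"] = {**style.get("pPr", {}), **paragraph.get("pPr", {})}
        for run in paragraph["runs"]:
            # A run without rPr gets one created for it here.
            run["rPr"] = {**style.get("rPr", {}), **run.get("rPr", {})}
    return paragraphs


styles = {"P1": {"parent": "Standard", "pPr": {"jc": "center"}, "rPr": {"b": True}}}
document = [{
    "style": "P1",
    "pPr": {},
    "runs": [{"text": "Hello"}, {"text": "world", "rPr": {"i": True}}],
}]
inline_automatic_styles(styles, document)
```

In this sketch the first run, which had no rPr at all, receives one carrying the bold property from the paragraph's automatic style, and the second run keeps its own italic property alongside it.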

Well, altogether, there are a little more than 700 lines of code, just to handle this little part of the transformation. So once again, we are very glad that we did not have to code the whole transformation this way! I really wonder how Office or OpenOffice.org coders can live without XSLT... ;-)

Tuesday 26 September 2006

About pre and post-processings

If you remember one of my first posts, I mentioned the need to do pre- and post-processing to be able to convert certain features. One of the first needs was indeed to build a ZIP file after the conversion. For this purpose, we created a processor called "ZipArchiveWriter" that takes the XML flow produced by the XSL transformation and creates the desired ZIP entries. We are now facing other situations where pre- or post-processing is needed.

Special characters treatment

In OpenDocument, special Unicode characters are used for non-breaking spaces, soft hyphens... whereas OpenXml uses XML tags. We could handle those conversions within our XSL files, but that would be very time-consuming (each single character must be checked). To avoid that, we implemented a filter that just takes the output of the transformation and replaces the special characters with the appropriate XML elements.
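The core of such a filter can be sketched in Python as follows. The mapping table is an assumption chosen for illustration (soft hyphen and non-breaking hyphen mapped to the WordprocessingML elements w:softHyphen and w:noBreakHyphen); the list of characters the converter actually handles may differ.

```python
import re

# Assumed mapping for the example: Unicode character -> element name.
SPECIAL_CHARS = {
    "\u00ad": "w:softHyphen",     # soft hyphen
    "\u2011": "w:noBreakHyphen",  # non-breaking hyphen
}

# The capturing group makes re.split keep the special characters.
_PATTERN = re.compile("([" + "".join(SPECIAL_CHARS) + "])")


def split_special_characters(text):
    """Split text into ("text", str) and ("element", name) pieces."""
    parts = []
    for chunk in _PATTERN.split(text):
        if chunk in SPECIAL_CHARS:
            parts.append(("element", SPECIAL_CHARS[chunk]))
        elif chunk:
            parts.append(("text", chunk))
    return parts


print(split_special_characters("co\u00adoperate"))
```

The filter would then write each "text" piece as character content and each "element" piece as an empty XML element, instead of checking every character inside the XSL templates.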

Technically, each post-processor is an XmlWriter, and we just chain those XmlWriters one after the other (the output of the first one being the input of the second one and so on, the last one being the ZipArchiveWriter). That allows us to dedicate each post-processor to a specific task, while having very little impact on the global performance of the converter. The only drawback of such a method is that we can only work on the fly on the XML flow. But until now, it has not been a problem. And it shouldn't be for the next post-processor we'll have to implement: the one that will convert the automatic styles that we find in OpenDocument into run properties.
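The chaining principle can be sketched in Python like this. The class and method names are hypothetical and much simpler than the .NET XmlWriter interface; RecordingWriter merely stands in for the last link of the chain (the ZipArchiveWriter in our case).

```python
class ChainedWriter:
    """Pass-through writer: forwards every event to the next writer."""

    def __init__(self, next_writer):
        self.next_writer = next_writer

    def start_element(self, name):
        self.next_writer.start_element(name)

    def text(self, value):
        self.next_writer.text(value)

    def end_element(self):
        self.next_writer.end_element()


class SoftHyphenWriter(ChainedWriter):
    """Replaces soft hyphen characters with an empty element, on the fly."""

    def text(self, value):
        for index, piece in enumerate(value.split("\u00ad")):
            if index > 0:
                self.next_writer.start_element("softHyphen")
                self.next_writer.end_element()
            if piece:
                self.next_writer.text(piece)


class RecordingWriter:
    """End of the chain: records the events it receives."""

    def __init__(self):
        self.events = []

    def start_element(self, name):
        self.events.append(("start", name))

    def text(self, value):
        self.events.append(("text", value))

    def end_element(self):
        self.events.append(("end",))


sink = RecordingWriter()
chain = ChainedWriter(SoftHyphenWriter(sink))
chain.start_element("t")
chain.text("co\u00adoperate")
chain.end_element()
```

Each writer only overrides the events it cares about, so adding a new post-processing step means inserting one more link in the chain.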

Password protected documents

Another situation we are facing is password-protected documents. As you know, OpenDocument files are in fact ZIP files containing XML data. When you choose to protect an ODF document with a password, the XML files embedded in the ZIP archive are simply encrypted with a dedicated algorithm (Blowfish) before they are stored into the archive. So for our XSLT engine to be able to process those XML files, we need to decrypt them first. That will be done through a pre-processor.

Actually we are already using a pre-processor to extract the files from the archive - but it is somewhat hidden by the "resolver" mechanism of the .NET framework: when instantiating the XSLT processor, we specify a custom resolver (called "ZipResolver") that must be used to find the needed resources. This custom resolver simply retrieves the streams from the ZIP archive. To handle password-protected files, we will insert a decoding mechanism inside the ZipResolver.
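To illustrate the idea, here is a minimal Python sketch of a resolver that serves streams from a ZIP archive, with an optional decoding hook. It only borrows the name ZipResolver from the text: the resolve method and the decrypt parameter are assumptions, and the hook used in the example is a trivial placeholder, not Blowfish.

```python
import io
import zipfile


class ZipResolver:
    """Retrieves resources as streams from a ZIP archive."""

    def __init__(self, archive, decrypt=None):
        self._archive = zipfile.ZipFile(archive)
        # Hypothetical hook: called as decrypt(entry_name, data) -> bytes.
        self._decrypt = decrypt

    def resolve(self, entry_name):
        data = self._archive.read(entry_name)
        if self._decrypt is not None:
            data = self._decrypt(entry_name, data)
        return io.BytesIO(data)


# Build a small archive in memory to stand in for an ODF package.
package = io.BytesIO()
with zipfile.ZipFile(package, "w") as archive:
    archive.writestr("content.xml", "<doc/>")
package.seek(0)

resolver = ZipResolver(package)
print(resolver.resolve("content.xml").read())
```

With such a hook in place, the XSLT engine keeps asking the resolver for streams and does not need to know whether the entries were encrypted.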

You certainly noticed that I'm speaking in the future tense. That's because we don't have any implementation of the Blowfish algorithm in C# yet... We need an Open Source implementation compatible with the BSD license, and there does not seem to be one. This is exactly the same issue we had for ZIP compression / decompression. For obvious reasons, we don't want to code a new implementation from scratch (it would be very time-consuming, with the risk of adding bugs or even security holes to our code). So the most reasonable solution may be to find a C library and to add a C# wrapper around it - exactly the same way that we did with zlib. If anybody has such experience and wants to help us do this job, feel free to contact us!

Tuesday 19 September 2006

On Functionality Testing...

As Wolfgang introduced the topic of functional and EU-specific ‘testing the ODF translator tool’ in his blog last week, I would like to talk a bit more about the functional, setup and system testing areas that Aztecsoft is involved with.

High level picture: We started off with testing the translator tool prototype back in June 2006 with the development of a test plan to organize our testing efforts and ongoing creation of test scenarios to cover the functional aspects of testing. With the increase in feature set of the translator tool we started off with creation and execution of performance scenarios as well. With the different flavors of the translator being made available for working on Word 2003 and Word XP, we also created a compatibility matrix and have been involved with testing the different translator flavors and investigating installation dependencies when required. The test scenarios for performance and setup and test plan are available for viewing in the documentation area of the project site. We have also worked out a process with the development team on procuring two builds a week for focused testing and filing bugs early on during the development phase. Identification of a set of common end-user scenarios or Build Verification Tests (BVT) helps us in ‘accepting’ a build for further rigorous testing. The Aztecsoft test team also contributed in putting up the end-user feature list for the translator tool.

On processes and tools: To test the functionality of the translator tool, we create test cases for each functional area, for instance coming up with test scenarios for fonts and formatting, paragraphs, tables, etc. and then in-depth test cases for the same. To ensure good test coverage we are employing the orthogonal array technique for creating pair-wise test cases. This means that, using the ‘pairs’ tool, we pass inputs so as to get a combination of test cases generated. For example, if we pass font styles like Arial, Verdana and font faces like Bold, Italic, etc. with font sizes like 8, 10, etc. we would be getting test cases like “Test Arial font with Bold and size 8”, “Test Verdana font with Bold and size 10” and so on. For the tests identified by this process, data generation is another challenging activity, needed to ensure that the appropriate code paths are exercised. The tests are executed to test both the UI and the command line tool. Setup test cases ensure that the product installs fine on the platforms identified with the appropriate Word and translator flavor combinations, and that the user experience is good. Our performance tests currently test the document size limit and the performance of the translator under low memory conditions, besides exercising some negative scenarios. Apart from testing the builds for feature completeness, conducting timely performance tests and filing bugs on the same, the Aztecsoft test team has designed and is developing an automation tool to automate most of the scenarios, in order to build a regression test automation suite for testing the ‘growing’ ODF translator tool. This is based on image / visual comparison techniques for ensuring document fidelity post conversion. The framework is being coded in C#, and XML files are used to pass the inputs that route execution to the required module, which keeps the automation modular.
Our plan is also to make this automation framework flexible so as to test other similar conversions after plugging in necessary code. Hence one thing to watch out for in the upcoming releases would be the test automation results!
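To give an idea of what pair-wise generation produces, here is a small Python sketch. It is a simple greedy all-pairs generator written for illustration, not the ‘pairs’ tool mentioned above and not a true orthogonal array construction; the function name and the parameter values are assumptions.

```python
from itertools import combinations, product


def pairwise_cases(parameters):
    """Greedily pick test cases until every pair of values is covered.

    parameters: {parameter name: list of possible values}
    Returns a list of dicts, one per test case.
    """
    names = list(parameters)

    def pairs_of(case):
        # All (parameter, value) pairs exercised by one test case.
        return {((a, case[a]), (b, case[b])) for a, b in combinations(names, 2)}

    # Every full combination is a candidate test case.
    candidates = [dict(zip(names, values))
                  for values in product(*parameters.values())]
    uncovered = set()
    for candidate in candidates:
        uncovered |= pairs_of(candidate)

    cases = []
    while uncovered:
        # Take the candidate that covers the most still-uncovered pairs.
        best = max(candidates, key=lambda c: len(pairs_of(c) & uncovered))
        cases.append(best)
        uncovered -= pairs_of(best)
    return cases


fonts = {"font": ["Arial", "Verdana"], "face": ["Bold", "Italic"], "size": [8, 10]}
for case in pairwise_cases(fonts):
    print("Test {font} font with {face} and size {size}".format(**case))
```

The point of the technique is that every pair of values appears in at least one test case, while the number of cases stays below the full set of combinations.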
