ODF Add-in for Microsoft Word

ODF Converter Team Blog

Tuesday 26 September 2006

About pre- and post-processing

If you remember one of my first posts, I mentioned the need for pre- and post-processing steps to be able to convert certain features. One of the first needs was indeed to build a ZIP file after the conversion. For this purpose, we created a processor called "ZipArchiveWriter" that takes the XML flow produced by the XSL transformation and creates the desired ZIP entries. We are now facing other situations where pre- or post-processing is needed.
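Before describing them, here is a purely illustrative example of what such a single XML flow can look like before ZipArchiveWriter splits it into entries (the marker element names below are made up for the example; the converter's actual names may differ):

```xml
<!-- One flat XML stream for the whole package; each marker element
     becomes a separate entry in the resulting ZIP archive. -->
<zip:archive xmlns:zip="urn:example:zip-marker">
  <zip:entry name="[Content_Types].xml"> ... </zip:entry>
  <zip:entry name="word/document.xml"> ... </zip:entry>
  <zip:entry name="word/styles.xml"> ... </zip:entry>
</zip:archive>
```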

Special character treatment

In OpenDocument, special Unicode characters are used for non-breaking spaces, soft hyphens and the like, whereas OpenXml uses XML tags. We could handle those conversions within our XSL files, but that would be very time-consuming (every single character would have to be checked). To avoid that, we implemented a filter that simply takes the output of the transformation and replaces the special characters with the appropriate XML elements.
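To give an idea of what such a filter can look like, here is a minimal, simplified sketch (the class names, the namespace constant and the handling of a single character, the soft hyphen, are illustrative only; the real filter covers more characters and more cases). Most of the wrapper base class is mechanical forwarding; only WriteString does real work:

```csharp
using System.Xml;

// Forwards every XmlWriter call to an inner writer; filters subclass this.
public abstract class XmlWrappingWriter : XmlWriter
{
    protected XmlWriter inner;
    protected XmlWrappingWriter(XmlWriter inner) { this.inner = inner; }

    public override WriteState WriteState { get { return inner.WriteState; } }
    public override void WriteStartDocument() { inner.WriteStartDocument(); }
    public override void WriteStartDocument(bool standalone) { inner.WriteStartDocument(standalone); }
    public override void WriteEndDocument() { inner.WriteEndDocument(); }
    public override void WriteDocType(string name, string pubid, string sysid, string subset) { inner.WriteDocType(name, pubid, sysid, subset); }
    public override void WriteStartElement(string prefix, string localName, string ns) { inner.WriteStartElement(prefix, localName, ns); }
    public override void WriteEndElement() { inner.WriteEndElement(); }
    public override void WriteFullEndElement() { inner.WriteFullEndElement(); }
    public override void WriteStartAttribute(string prefix, string localName, string ns) { inner.WriteStartAttribute(prefix, localName, ns); }
    public override void WriteEndAttribute() { inner.WriteEndAttribute(); }
    public override void WriteCData(string text) { inner.WriteCData(text); }
    public override void WriteComment(string text) { inner.WriteComment(text); }
    public override void WriteProcessingInstruction(string name, string text) { inner.WriteProcessingInstruction(name, text); }
    public override void WriteEntityRef(string name) { inner.WriteEntityRef(name); }
    public override void WriteCharEntity(char ch) { inner.WriteCharEntity(ch); }
    public override void WriteWhitespace(string ws) { inner.WriteWhitespace(ws); }
    public override void WriteString(string text) { inner.WriteString(text); }
    public override void WriteSurrogateCharEntity(char lowChar, char highChar) { inner.WriteSurrogateCharEntity(lowChar, highChar); }
    public override void WriteChars(char[] buffer, int index, int count) { inner.WriteChars(buffer, index, count); }
    public override void WriteRaw(char[] buffer, int index, int count) { inner.WriteRaw(buffer, index, count); }
    public override void WriteRaw(string data) { inner.WriteRaw(data); }
    public override void WriteBase64(byte[] buffer, int index, int count) { inner.WriteBase64(buffer, index, count); }
    public override void Close() { inner.Close(); }
    public override void Flush() { inner.Flush(); }
    public override string LookupPrefix(string ns) { return inner.LookupPrefix(ns); }
}

// Replaces soft hyphens (U+00AD) in text nodes with <w:softHyphen/>.
// Simplified on purpose: the real filter must also close and reopen the
// enclosing <w:t> element, since w:softHyphen is a sibling of the text.
public class SpecialCharacterFilter : XmlWrappingWriter
{
    const string W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    public SpecialCharacterFilter(XmlWriter inner) : base(inner) { }

    public override void WriteString(string text)
    {
        // Leave attribute values alone; only element text is filtered.
        if (this.WriteState == WriteState.Attribute || text.IndexOf('\u00AD') < 0)
        {
            inner.WriteString(text);
            return;
        }
        int start = 0;
        for (int i = 0; i < text.Length; i++)
        {
            if (text[i] == '\u00AD')
            {
                inner.WriteString(text.Substring(start, i - start));
                inner.WriteStartElement("w", "softHyphen", W);
                inner.WriteEndElement();
                start = i + 1;
            }
        }
        inner.WriteString(text.Substring(start));
    }
}
```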

Technically, each post-processor is an XmlWriter, and we just chain those XmlWriters one after the other (the output of the first one being the input of the second one and so on, the last one being the ZipArchiveWriter). That allows us to dedicate each post-processor to a specific task, while keeping the impact on the overall performance of the converter very low. The only drawback of such a method is that we can only work on the XML flow on the fly. But until now, that has not been a problem. And it shouldn't be for the next post-processor we'll have to implement: the one that will convert the automatic styles that we find in OpenDocument into run properties.
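Building the chain then looks roughly like this (again a hedged sketch; the ZipArchiveWriter constructor signature is hypothetical):

```csharp
// The XSLT engine writes into the head of the chain; the ZipArchiveWriter
// sits at the tail and builds the final package.
XmlWriter tail = new ZipArchiveWriter("output.docx");    // hypothetical ctor
XmlWriter chain = new SpecialCharacterFilter(tail);
// A further post-processor would simply wrap the chain again, e.g.:
// chain = new AutomaticStyleProcessor(chain);
```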

Password-protected documents

Another situation we are facing is password-protected documents. As you know, OpenDocument files are in fact ZIP files containing XML data. When you choose to protect an ODF document with a password, the XML files embedded in the ZIP archive are simply encrypted with a dedicated algorithm (Blowfish) before being stored in the archive. So for our XSLT engine to be able to process those XML files, we need to decrypt them first. That will be done through a pre-processor.

Actually, we are already using a pre-processor to extract the files from the archive - but it is somewhat hidden by the "resolver" mechanism of the .NET framework: when instantiating the XSLT processor, we specify a custom resolver (called "ZipResolver") that must be used to find the needed resources. This custom resolver simply retrieves the streams from the ZIP archive. To handle password-protected files, we will insert a decryption mechanism inside the ZipResolver.
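The idea can be sketched like this (OdfPackage stands in for our zlib-based ZIP wrapper and is a hypothetical name here; the URI handling is reduced to the bare minimum for the example):

```csharp
using System;
using System.IO;
using System.Xml;

// A simplified sketch of the "ZipResolver" idea.
public class ZipResolver : XmlUrlResolver
{
    private OdfPackage package;   // hypothetical ZIP wrapper

    public ZipResolver(OdfPackage package) { this.package = package; }

    public override object GetEntity(Uri absoluteUri, string role, Type ofObjectToReturn)
    {
        // Map the requested URI to an entry name inside the archive
        // (the real resolver has to be much more careful here).
        string entryName = absoluteUri.Segments[absoluteUri.Segments.Length - 1];
        Stream entry = package.GetEntryStream(entryName);

        // For password-protected documents, a decryption step will be
        // chained in here once a BSD-compatible Blowfish implementation
        // is available, e.g.:
        //   entry = new BlowfishDecryptionStream(entry, key);  // future work
        return entry;
    }
}
```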

You certainly noticed that I'm speaking in the future tense. That's because we don't have any implementation of the Blowfish algorithm in C# yet... We need an Open Source implementation compatible with the BSD license, and none seems to exist. This is exactly the same issue we had with ZIP compression / decompression. For obvious reasons, we don't want to code a new implementation from scratch (it would be very time-consuming, with the risk of adding bugs or even security holes to our code). So the most reasonable solution may be to find a C library and add a C# wrapper around it - exactly the way we did with zlib. If anybody has such experience and wants to help us with this job, feel free to contact us!

Tuesday 19 September 2006

On Functionality Testing...

As Wolfgang introduced the topic of functional and EU-specific testing of the ODF translator tool in his blog post last week, I would like to talk a bit more about the functional, setup and system testing areas that Aztecsoft is involved with.

High-level picture: We started testing the translator tool prototype back in June 2006, developing a test plan to organize our testing efforts and continuously creating test scenarios to cover the functional aspects of testing. As the translator's feature set grew, we also started creating and executing performance scenarios. With different flavors of the translator being made available for Word 2003 and Word XP, we also created a compatibility matrix and have been testing the different translator flavors and investigating installation dependencies when required. The test scenarios for performance and setup, as well as the test plan, are available for viewing in the documentation area of the project site. We have also worked out a process with the development team to procure two builds a week for focused testing and for filing bugs early in the development phase. Identifying a set of common end-user scenarios, or Build Verification Tests (BVT), helps us 'accept' a build for further rigorous testing. The Aztecsoft test team also contributed to putting together the end-user feature list for the translator tool.

On processes and tools: To test the functionality of the translator tool, we create test cases per functional area - for instance, test scenarios for fonts and formatting, paragraphs, tables, etc. - and then in-depth test cases for each. To ensure good test coverage, we employ the orthogonal array technique to create pair-wise test cases: using the 'pairs' tool, we pass inputs and get a combination of test cases generated. For example, if we pass font styles like Arial and Verdana, font faces like Bold and Italic, and font sizes like 8 and 10, we get test cases like "Test Arial font with Bold and size 8", "Test Verdana font with Bold and size 10", and so on. For the tests identified by this process, data generation is another challenging activity, needed to exercise the appropriate code paths. The tests are executed against both the UI and the command-line tool.

Setup test cases ensure that the product installs properly on the identified platforms, with the appropriate Word and translator flavor combinations, and that the user experience is good. Our performance tests currently check the document size limit and the behavior of the translator under low-memory conditions, besides exercising some negative scenarios.

Apart from testing the builds for feature completeness, conducting timely performance tests and filing bugs, the Aztecsoft test team has designed and is developing an automation tool to automate most of the scenarios, building a regression test automation suite for the 'growing' ODF translator tool. It is based on image / visual comparison techniques to ensure document fidelity after conversion. The framework is being coded in C#, and XML files are used to pass inputs and route execution to the required module, which keeps the automation modular. Our plan is also to make this automation framework flexible enough to test other, similar conversions after plugging in the necessary code. So one thing to watch out for in the upcoming releases will be the test automation results!
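For the curious, the pair-wise idea described above can be illustrated with a toy sketch in C# (this is not the 'pairs' tool we use, just a naive greedy reduction over the full cartesian product; it keeps a combination only if it covers at least one value pair not seen before, so every pair of values ends up covered with far fewer cases than the full product on larger inputs):

```csharp
using System;
using System.Collections.Generic;

class PairwiseSketch
{
    static void Main()
    {
        string[][] parameters = {
            new string[] { "Arial", "Verdana" },   // font style
            new string[] { "Bold", "Italic" },     // font face
            new string[] { "8", "10" }             // font size
        };

        Dictionary<string, bool> covered = new Dictionary<string, bool>();
        foreach (string[] combo in Product(parameters))
        {
            bool coversNewPair = false;
            for (int i = 0; i < combo.Length; i++)
                for (int j = i + 1; j < combo.Length; j++)
                {
                    string pair = i + ":" + combo[i] + "|" + j + ":" + combo[j];
                    if (!covered.ContainsKey(pair)) { covered[pair] = true; coversNewPair = true; }
                }
            if (coversNewPair)
                Console.WriteLine("Test " + string.Join(" with ", combo));
        }
    }

    // Enumerate the cartesian product of all parameter values.
    static IEnumerable<string[]> Product(string[][] p)
    {
        int[] idx = new int[p.Length];
        while (true)
        {
            string[] combo = new string[p.Length];
            for (int i = 0; i < p.Length; i++) combo[i] = p[i][idx[i]];
            yield return combo;
            int k = p.Length - 1;
            while (k >= 0 && ++idx[k] == p[k].Length) { idx[k] = 0; k--; }
            if (k < 0) yield break;
        }
    }
}
```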

Monday 18 September 2006

M2 is out... announcing M3!

We just published a new release on SourceForge, called 0.2-M2. In this release, we focused on improving functionalities that already existed in the previous milestone (such as formatting properties, lists, tables...) and on adding new ones: indexes, comments, footnotes and endnotes, frames, sections and page layout (including footers and headers). We are aware of some bugs - even files that make the converter crash - but globally the output is starting to be acceptable for daily usage.

Unfortunately, our roadmap was not synchronized with Word 2007's: a new release, called Beta 2 TR, has just been published and is not compatible with our plug-in... The main problem is that if you want to test the ODF Converter, you have to install Word 2007 Beta 2 (not TR), which is simply not available any more! However, we could not postpone our own release, because it was already 10 days past the date we had initially planned. So we simply decided to schedule a new milestone, 0.2-M3, before the final 0.2 release. M3 will focus only on making the converter work with Beta 2 TR (along with some bug fixes) and should be available as soon as the end of next week. It should not be too complicated, as the changes between Beta 2 and Beta 2 TR are not that important. But we will take this opportunity to clean up our code a bit. Yves and Karolina, our two technical leaders on the project (Yves in France, and Karolina in Poland), will work hard on the code during the next couple of days to ensure it is consistent and of high quality. As it is quite difficult to keep working on a piece of code during a global refactoring, we decided to have the remaining developers start working on the reverse transformation (from DOCX to ODT) during this period. So maybe 0.2-final will include some reverse transformation too (in the initial roadmap, the reverse transformation was only planned for the 0.3 milestone, starting at the end of October).

By the way, we set up a new regression test framework that will help us track regressions. Due to the relative complexity of the code, and the number of people working on it simultaneously, we get regressions quite often - files that Word won't open, formatting bugs, etc. Our new framework will allow us to ensure, before every commit, that a set of predefined files still opens in Word. For those who wonder why we did not set up a real unit testing environment, I would answer that with XSL transformations it is almost impossible. Unit testing involves very small pieces of code being tested separately. But as we are continuously improving our transformations, it would have meant thousands of XML sample files - and sometimes we simply cannot isolate a functionality, so our test files would have had to evolve in the meantime: that is not really compatible with unit testing. I think it would not have been totally impossible - but far too complicated and time-consuming for this project. So until now, we only tested the converter against some representative test files - but that was not always enough. From now on, this will be systematized through our regression test framework, consisting of a large set of files and a double validation process (against schemas, and by trying to open the resulting files in Word) - all automated. We expect this new framework to speed up the development process by allowing us to detect regressions very quickly.
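To illustrate the schema half of that double validation, here is a minimal sketch (the OdfConverter entry point, the file locations and the schema name are all hypothetical; the "open in Word" step, which goes through COM automation, is only hinted at because the interop calls are verbose):

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Schema;

class RegressionRunner
{
    static void Main()
    {
        foreach (string odt in Directory.GetFiles(@"testfiles", "*.odt"))
        {
            string docx = Path.ChangeExtension(odt, ".docx");
            OdfConverter.Convert(odt, docx);   // hypothetical entry point

            // Step 1: schema validation of the main document part
            // (assumed to have been extracted from the resulting package).
            Validate(Path.Combine("extracted", "document.xml"), "wml.xsd");

            // Step 2: drive Word through COM automation and check that
            // Documents.Open succeeds on the converted file.
        }
    }

    static void Validate(string xmlPath, string schemaPath)
    {
        XmlReaderSettings settings = new XmlReaderSettings();
        settings.ValidationType = ValidationType.Schema;
        settings.Schemas.Add(null, schemaPath);
        settings.ValidationEventHandler += delegate(object s, ValidationEventArgs e)
        {
            Console.WriteLine(xmlPath + ": " + e.Message);
        };
        using (XmlReader reader = XmlReader.Create(xmlPath, settings))
            while (reader.Read()) { }   // reading the whole file triggers validation
    }
}
```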

Wednesday 13 September 2006

Double-Check Testing

Jean asked me to also write an article for his blog; thanks for this offer, Jean.

Having learned a lot about the trench warfare behind the scenes in one of Jean's articles, let's come back to normal working life.

Implementing such a converter is only one side of the coin. The other is testing all the code drops the developers produce. As already mentioned by Jean, this takes place twice a week. Consequently, testing is a continuous process which has to check the new feature conversions integrated in a release, but also whether the new release re-introduced "old" errors (this is done via the regression tests).

There are two companies involved in testing: AztecSoft in India and us, DIaLOGIKa, in Germany.

Why two companies?

There are some good arguments for splitting the testing. AztecSoft focuses more on feature testing, i.e. they send a lot of documents containing only one specific formatting or layout feature through the converter and check the result, e.g. documents containing frames, documents containing footnotes, etc. Such feature testing is necessary, even mandatory, and AztecSoft really does a great job here.

But what do we do?

We also test; however, we test real documents, i.e. day-to-day documents created in the EU institutions or found on the Internet. Our testing is oriented towards actual usage scenarios, such as: a European company sends a technical specification in ODT format to the European Commission to be commented on. The Commission internally uses Word only, i.e. the document must be converted to Word, commented on by the Commission, re-converted to ODT and returned to that company.

Another example: an EU agency makes standardised CVs and language passports available to EU citizens on its web site. Since a considerable number of citizens might use OpenOffice instead of Word, the agency publishes the documents in both Word and ODT format. In order to save time, the documents are created in one format, converted to the other and then fine-tuned.

Our real documents contain a good mix of various and sometimes rare features which challenge the converter (and probably its developers as well). This mix is what distinguishes our work from mere feature testing: a feature alone might be properly converted; it is the feature mix that exhibits the problems.

And there is another argument: when the converter has finally passed this real-document feature-mix testing, we can be quite sure that it will also be usable in our real world.

And why have we been chosen to do this kind of testing?

We have been in the EU document creation and editing business for more than a decade. The European Commission's corporate style package for official and legislative documents was developed by us. Similar systems of ours run at the European Council and other EU and member-state institutions. Thanks to the in-depth knowledge we have accumulated in this multilingual European IT environment, we are in an ideal position to contribute to the OpenXML/ODT conversion project by identifying, emphasizing and testing the document features required for day-to-day work with international documents in an increasingly interoperable world.

Tuesday 12 September 2006

Clever Age, licenses and other thoughts...

Let me introduce in a few words the company I'm currently working for: Clever Age (who said Microsoft? ;-). It is a French IT consulting company that was created by Frederic Bon in 2001 with two other associates. Coming from big consulting companies, they felt they had lost their independence (due to political decisions, financial pressure, vendor partnerships or whatever) and wanted to regain it. But they soon realised that they had to look deeper into the products and "put their hands inside" (as we would say in French) if they wanted to give valuable advice to their clients. That's why an "integration" department was opened in 2002, focusing on PHP, Java and .NET technologies. Since then, Clever Age has grown slowly (but safely) and has started to become quite well-known (at least in France) in several technical areas, among them portals and CMS, XML and Web Services. In 2004, our first subsidiary was opened by Maciek Borowka in Gdansk (Poland), followed by three new ones in 2005 and 2006: Lyon (France), Bordeaux (France) and Katowice (Poland). Today, around 60 employees work for Clever Age (40 in France and 20 in Poland).

So as you can see, we are not a "big" company, and you may wonder why we were chosen by Microsoft to run this project. To understand that, let's go back to January 2005. At that time, people from Microsoft France were being questioned by people working in French ministries about the compatibility of Microsoft Office with emerging standards (OpenDocument was not yet final, if I remember well, but there were discussions about making it the default format for official exchanges). The answer from the Microsoft guys at the time was quite simple: we published the specifications of the XML format used in Word 2003 (known as WordML), so feel free to write converters. And to prove it, they asked a small company (Clever Age) to build a proof of concept that would demonstrate the feasibility of such a project. That led to the first converter (which allowed opening OpenOffice.org 1.0 SXW files in Word 2003), released on SourceForge in September 2005. At that time, we expected French ministries to be interested in the project and to put some resources on the table to take it further. But while they were actually enthusiastic about the idea, no one wanted to pay for it.

Just to give an idea of the context (only one year back from now!): when Eric Le Marois (responsible for relations with public institutions at MS France, and thus much concerned with questions of interoperability) raised his hand at the Office conference (or something like that) that took place in Seattle in September 2005 to ask about any plan to make Office 12 compatible with OpenDocument, everybody looked at him, wondering what he was talking about (he told me this story; Eric, please correct me if necessary). So at that time, this sounded like a French-only preoccupation (actually, I'm pretty sure there were also some deep thoughts about it at MS Corp., but they were kept very secret).

So now you might better understand why we were chosen to develop the plug-in: when people at Microsoft started to think about doing something with OpenDocument (under the pressure of an increasing number of public institutions, among them... French ministries, but not only), the guys from MS France told Corp. about this little prototype that had been developed one year earlier by Clever Age - and that was quite promising. After some discussions, we reached an agreement that would allow Clever Age to develop a new converter, based this time on OpenDocument and OpenXML, under a BSD-like license.

Why the hell choose a BSD-like license? There are plenty of other licenses (GPL, LGPL, Mozilla, ...) that are far better accepted by the Open Source community, so if our goal in making this project Open Source was to gain contributors, that was not the best choice. In fact, again, we must go back to the context. When Microsoft asked us to develop the converter, Office 12 (now known as Office 2007) had not even reached Beta 2. And they did not really know what they would finally do with our converter. I think it is useful to point out that we were mandated not by the Office development team, but by Microsoft's interoperability department (headed by Jean Paoli). So the integration of the plug-in into the final version of Office was not very clear. For this reason, Microsoft wanted to keep the option of taking the code and... simply putting it into their product, without any legal restriction. And for this purpose, the BSD license sounded like the best choice, because anyone (meaning: not only Microsoft) is allowed to do whatever they like with the code.

Now that the integration as a downloadable plug-in has been decided (but it can still change, who knows?), we know that we could have chosen another license, such as the GPL or LGPL. A lot of Open Source contributors don't like the BSD licence because it allows others to take their work to build closed-source, commercial products - and that's not in the "Open Source spirit". But I think in this case there was no alternative: Microsoft wanted to be able to take the code, possibly modify it, and integrate it into Office 2007. They could have chosen a Mozilla-like license: it is similar to the GPL (or LGPL?), except that one company (or one organization) keeps the right to build a commercial, closed-source product based on the source code (including external contributions). As for external contributors, do you really think it would have changed anything if Microsoft had published the project (uh, sorry: if Clever Age had published the project, with Microsoft's agreement) under a GPL-like licence with this kind of provision allowing them to take all external contributions? I really don't think so. And that's why I still think the BSD license was a good choice.

Sorry, I chattered a lot again this time, and there is almost no room left for the "other thoughts" I announced in the title... I just wanted to say that we are working very, very hard these days to fix bugs in order to publish the M2 release at the end of the week! Sometimes we run into such difficulties with our XSL transformations (I am thinking especially of subtables - arghhh - and automatic styles - arghhh again - which will possibly require post-processing to be handled in an acceptable way) that we wonder whether our technical choices were so good... I will try to find time to explain those problems for those who really want to know what kind of technical issues this project is facing (no, there are not only political issues ;-).

Thursday 7 September 2006

Are we traitors or mercenaries?

Yesterday I had lunch with two emissaries, one from OpenOffice.org (the well-known Open Source competitor of Microsoft Office) and one from OASIS (the organization that works on OpenDocument). After the project launch, I had contacted both organizations: the first to ask for contributors (our plug-in is designed to be usable by other applications, so we thought that OpenOffice.org could be interested in joining the team to make its product able to open and save docx files) and the second to get its agreement to use the OASIS logo in our plug-in (an agreement we never got, so we decided to use another picture).

The two emissaries wanted to tell us how we were seen by OpenOffice.org and OASIS, and also to ask us some questions regarding our position. So we learnt that the OpenOffice.org team considered us "traitors" (who are those traitors who work for the Big Satan Microsoft?) and that people from OASIS "did not like us" (even if they don't really care about our little company). In fact, that was not a definitive judgement: they wanted to know exactly in which camp we were - because in this war, you cannot remain independent; you have to choose your side. So by default, as we work for Microsoft, we are seen as enemies; but if we show good will and, for instance, join the OpenDocument consortium, that could be a sign that we may on the contrary be friends.

There were subtle differences between the OpenOffice.org and OASIS positions: OpenOffice.org is a competitor of Microsoft on the office applications market. So if we work for Microsoft, we are either an insignificant Microsoft subsidiary or, if not, at least "mercenaries" - we work for those who pay best, without any moral judgement. OASIS speaks a different language: they support OpenDocument - not OpenOffice.org - and should be happy about any initiative that tends to spread OpenDocument usage and interoperability. But in fact, as Microsoft is developing a competing format, its initiative must necessarily be an attack against OpenDocument - so they are also against it (I must add that OASIS does not have any "official" position regarding the plug-in - and how could it be different? Microsoft is a member of OASIS! So when I say "OASIS", you should read "most people from OASIS"). One could wonder how the add-in could be an attack against ODF. The answer is simple: by not implementing ODF as a default format and only sponsoring a third-party Open Source project, Microsoft discourages users from switching to ODF.

So in the end, it's quite simple: there is a war (a real war) between Microsoft and the rest of the world. In this war, there is no middle position: only friends and enemies. If you're not a friend (meaning: if you're not fighting Microsoft), then you are an enemy.

What bothers me in all this talk is the extreme Manichaeism - on one side, the good guys; on the other, the bad guys. Clever Age is an independent IT consulting company. We don't have any commercial partnership with anyone. We don't belong to any organization that would influence the decisions we take for our clients. We are technology-agnostic - we work on .NET, Java and PHP projects indifferently. People at Clever Age like Open Source (that's not a secret). We use Linux and OpenOffice.org a lot (that's not a secret either). When Microsoft consultants come to our office, they don't feel comfortable, because they know they will be attacked from all sides. When we have to advise our clients on a technological choice, we look at their interests (short-, middle- and long-term benefits), not ours. So are we just mercenaries? If we were, we wouldn't have suggested developing the project in an Open Source way; we wouldn't have put it on SourceForge (CodePlex would have been fine enough); we wouldn't have designed it so that it can be reused by other applications. Are we being manipulated? I can't be sure that we are not - only the future will show what was behind all of this - if there is anything behind it, which I personally don't think.

I understand very well that OpenOffice.org is a competitor of MS Office. So its primary interest would be the project's failure, so that people forced to work with OpenDocument would have no choice but to use OpenOffice.org or any other ODF-compatible product (hey, that reminds me of something with another well-known document format...). But that's a very short-sighted view: could anyone imagine that Microsoft would let such a thing happen without reacting? For sure, OpenOffice.org has no special interest in Word being fully compatible with OpenDocument - but let's see this as unavoidable. On the contrary, OpenOffice.org does have an interest in being fully compatible with OpenXml (unless you think OpenXml has no future at all - but that would be a daring forecast). That's why I posted an announcement on the OOo dev mailing list to advertise the project. From a technical point of view, there is an interest in working together. Science is often seen as a way to build bridges between people at war. Let's forget this aerial war for a while and work together on making the best converter between the two formats.

I have more difficulty understanding OASIS's position in this debate: they see Microsoft's initiative as an attack against OpenDocument because Microsoft does not fully support the format. OK, the integration in Word could be improved (and still can be in the future - you can for instance have a look at Patrick Schmid's investigations). But in that case, the only logical attitude would be to do one's utmost to make the conversion work as well as possible! The add-in is an Open Source project: anyone can contribute and improve it. I personally don't feel like Microsoft is trying to make the project fail - but if that happened, nothing would prevent other parties from taking over the project and keeping it alive.

What I see is that Microsoft is taking a new turn in the interoperability field - yes, it is going slowly, very slowly, but hey! that's Microsoft! We're not talking about a little agile company. Yes, they could have done better regarding ODF compatibility, but they could have done less as well. So we have here an opportunity to do a good job and allow MS Office users to work with ODF documents: let's give this initiative a chance. I'm not saying that OpenOffice.org's or OASIS's hostility is a threat to the success of this project - but for sure, things wouldn't go worse if they supported it!

Monday 4 September 2006

C#, XSLT: Why did we choose them?

Let's talk a little about the technical choices we made for this project. Several aspects of this project could have led to different technologies. The first need was to write a plug-in for Word 2007 that could open and save ODF text documents. Besides that, we wanted to provide some command-line tools too, which are very useful for development and testing. We also kept in mind that to build an Open Source community, we had to use open technologies as often as possible. Finally, we also had to take into account our own interests (internal competencies, cost vs. performance, development speed, etc.).

We did not have many choices for the Word integration part: we could either use an (old) C API provided by Microsoft (the one used to write filters) or write COM or .NET shared add-ins. The C API would have allowed a closer integration with Word: it gives access to the file formats used in the "Open" and "Save As" menus. But we did not feel like writing an entire converter in C - it would have been really complicated, and at Clever Age we have more competencies in newer technologies, such as Java, .NET or PHP. Moreover, for obvious political reasons, we wanted to base our converter on Microsoft's new Office Open XML format - it would be an example of the new possibilities offered by XML technology. The C API was based on the good old RTF format, so we definitively threw it away. Having some C#/.NET competencies in-house, we decided to build our plug-in on this technology.

But that was for the integration part only - that is, adding new menu entries and launching the conversions. For the conversion itself, we still had several choices available: we could have written the whole converter in C# (using a SAX-like approach, based on event handling), or benefited from XML technology and used either XQuery or XSLT. The first approach would certainly have offered the most flexibility and performance. But it would have required a lot more development effort (it is less structured than the other two, and we would have had to code a lot of things that are done automatically in XQuery or XSLT). From Microsoft's point of view, XQuery might have been preferred, as they announced that they would progressively give up XSLT support in the future (they don't plan to release an XSLT 2.0 engine). But the main problem with XQuery (apart from the fact that it is less suited to transforming documents than XSLT) is that there is still no processor available in the .NET framework... Once again, we had XSLT skills (we wrote a converter for OpenOffice.org 1.0 in the past) and we thought it was the best compromise between performance (Microsoft's .NET 2.0 XSLT engine performs very well) and development speed. Moreover (and not the least argument from our point of view), it could allow other applications to reuse the converter in other contexts - we thought of OpenOffice.org, for instance, which already had converters based on XSLT - and we therefore believed that we would have better chances of building a community.

Before starting the project, we discussed all those possibilities with Microsoft architects working on Word and interoperability, and they approved our recommendations. There were still some technical points to decide, among them the library to use for ZIP compression/decompression (surprisingly, Microsoft doesn't provide any, and the licence we chose - BSD - prevented us from using SharpZipLib, which is released under the GPL) and the way to handle multi-file generation in XSLT (should we run several XSLT engines, or a single engine and split the single XML output into several files during post-processing?). For the first issue, we finally developed a wrapper around the unmanaged zlib library; for the second, we chose to produce a single XML flow that is automatically split into different files - you can find more information in the technical documentation available for download on SourceForge.
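Put together, the wiring looks roughly like this (a simplified sketch: OdfPackage, ZipArchiveWriter and the stylesheet name are illustrative, while ZipResolver and SpecialCharacterFilter refer to the post-processing sketches further up this page):

```csharp
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public class ConversionSketch
{
    public static void Convert(string odtPath, string docxPath)
    {
        OdfPackage package = new OdfPackage(odtPath);   // hypothetical ZIP wrapper
        ZipResolver resolver = new ZipResolver(package);

        XslCompiledTransform xslt = new XslCompiledTransform();
        // document() is disabled by default and must be enabled explicitly.
        xslt.Load("odf2oox.xsl", new XsltSettings(true, false), new XmlUrlResolver());

        using (Stream content = package.GetEntryStream("content.xml"))
        using (XmlReader input = XmlReader.Create(content))
        using (XmlWriter output = new SpecialCharacterFilter(new ZipArchiveWriter(docxPath)))
        {
            // The resolver serves document() calls (styles.xml, meta.xml, ...)
            // directly from the ZIP archive; the writer chain splits the
            // single output flow into the package files.
            xslt.Transform(input, null, output, resolver);
        }
    }
}
```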

As both ODF and OOX formats are ZIP-based, we had no choice but to write some pre- and post-processing (it is technically impossible to read or generate ZIP files using only XSLT), the question being: what to do in XSLT, and what to delegate to the post-processor? Even today we still have to make such decisions, mainly to avoid performance issues and sometimes because of XSLT's technical restrictions (but that will be the subject of another discussion - stay tuned! ;-).
