XML Performance, and now for something completely different…

The following is a blog post I did on Jul 20, 2009 on GullFOSS, the OpenOffice.org Engineering blog at Oracle

While Armin Le Grand did a great job improving the load/save performance for presentation documents by tweaking the application core itself, I took a step back and thought about improving performance by using different technologies. The first step was to look at other techniques for dealing with xml documents that would have one or more advantages over the current implementation. Since, according to Helmuth von Moltke, “no battle plan survives contact with the enemy”, I had to test my assumptions, so my theoretical work resulted in a prototype import filter implementation for impress. This is a short summary of the techniques I looked at, why and how I used them in the prototype, and the interesting results I got from it.
Mission Statement
The mission of this prototype was to gather data on how the utilization of new technology could enhance the performance of the native OpenDocument Format (ODF) filters for OpenOffice.org (OOo). The focus of this prototype was to first look at the import of impress documents and to achieve the following three goals:

Goal 1 : improve overall filter performance

Goal 2 : make use of multiple cores/CPUs through threading (in places where threading is not constantly blocked by calls to the OOo core, which currently does not support multiple threads without blocking)

Goal 3 : support the implementation of load on demand (for example to display the first slides to the user and load the rest of the document in the background)
Current state
Currently ODF is imported with a SAX based filter that is accessed over the UNO API. The SAX parser pushes notifications about the xml elements to the ODF filter implementation in the order they appear in the xml stream. The filter itself has no control over which elements to parse first, nor can it postpone the parsing of the current element to a later time. This tight coupling between the SAX parser and the ODF filter also makes it nearly impossible to measure the time spent on the actual xml parsing.

XML streams from ODF documents are usually encoded as UTF-8. The UNO API uses strings in UTF-16 encoding. Therefore, the SAX parser converts all strings from the xml stream to UTF-16, the encoding used by the UNO string implementation. To identify xml element and attribute names, expensive string compares are performed.

Between the ODF filter and the current OOo application core is another UNO API layer. The filter has to convert the xml events to something that can be sent over this layer to the OOo application core. In most cases the implementation of the API layer must also convert the given data to the format used in the application core.

The detailed flow of data is currently as follows:

The SAX parser reads xml data from utf-8 streams inside zip storages
The SAX parser UNO implementation handles namespaces and converts element names, attribute names, attribute values and text content to utf-16 strings, which it feeds to the ODF filter implementation
The ODF filter implementation transforms the utf-16 xml data into UNO representations for the OOo UNO API
The application's UNO API implementation transforms the UNO data into a core data representation
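
To illustrate the push model, such a handler could look roughly like the following sketch (the class and method names are invented for this post, it is not the actual OOo code). All work happens inside callbacks, on converted utf-16 strings, in stream order:

    #include <string>

    // Hypothetical sketch of a push based SAX handler. Every element name
    // arrives as a UTF-16 string converted from the UTF-8 stream and has
    // to be identified with string compares.
    class OdfImportHandler
    {
    public:
        // called by the SAX parser for every start tag, in stream order
        void startElement( const std::u16string& rName )
        {
            // expensive: one UTF-16 string compare per known element name
            if( rName == u"draw:page" )
                importPage();
            else if( rName == u"draw:frame" )
                importFrame();
            else if( rName == u"text:p" )
                importParagraph();
            // ... many more compares for the remaining ODF elements
        }

    private:
        void importPage() {}
        void importFrame() {}
        void importParagraph() {}
    };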

Assumptions

While it is unknown how much the xml parsing contributes to the overall time to load a document, most developers agree that converting all strings from utf-8 to utf-16 is a significant and needless overhead.
String comparison should be reduced to a minimum.
Making the application core multi-threaded would require a major rewrite, so it is not a short-term solution. To make use of multiple cores/CPUs, the filter itself has to delegate work that does not need access to the application core to other threads.
Implementing load on demand would require some kind of random access to the xml elements in the content and styles streams.

Alternative technology
XML Pull Parser (XPP)

XPP is a streaming pull XML parser. Unlike SAX, where the parser calls into the filter, with XPP the filter itself calls the parser to parse the next xml element.
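
To make the difference concrete, a pull based filter loop could look roughly like this (the parser interface shown is invented for illustration and is not the real XPP API):

    #include <string>

    // Illustrative pull interface; it only shows the inversion of control
    // compared to a push based sax filter.
    enum class XmlEvent { StartElement, EndElement, Text, EndOfDocument };

    struct PullParser
    {
        XmlEvent next()          { return XmlEvent::EndOfDocument; } // stub
        std::string name() const { return std::string(); }           // stub
    };

    void importSlide( PullParser& ) { /* import one draw:page subtree */ }

    void importContent( PullParser& rParser )
    {
        // the filter drives the parser and can stop after any element,
        // e.g. return after the first slide and continue later
        for( XmlEvent eEvent = rParser.next();
             eEvent != XmlEvent::EndOfDocument;
             eEvent = rParser.next() )
        {
            if( eEvent == XmlEvent::StartElement && rParser.name() == "draw:page" )
                importSlide( rParser );
        }
    }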

+ In contrast to sax, the filter can interrupt parsing after any element and continue parsing later.
+ Pull instead of Push leads to a cleaner filter implementation (cleaner code is usually easier to service and improve).

- Performance is equal to a sax parser
- No random access
Document Object Model (DOM)

A DOM is a parsed memory representation of an xml stream. This technology is used by modern browsers and the odftoolkit.org project.

+ Filter has random access to all xml elements
+ Random access leads to a cleaner filter implementation (cleaner code is usually easier to service and improve).

- A DOM has to keep a complete copy of the xml stream, plus management overhead, in memory during the filter process.
Fast sax parser

I developed the fast sax parser during the initial implementation of the Office 12 XML filters. It is basically a sax parser but uses integer tokens to represent known namespaces, element names and attribute names. Tools like gnu gperf can be used to create perfect hash functions that transform the xml names to integer tokens. Scripts can parse the dtd or relax ng of an xml format to automatically extract all the xml names that need to be convertible to integers. With utf-8 xml streams, the tokens can be created without the need to transform the strings to another encoding first. Namespaces can be combined with xml names so each element or attribute name with a namespace can be identified with just one integer compare.
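
A minimal sketch of the idea (the token values are invented; the real tokens and the perfect hash are generated from the schema at build time):

    #include <cstdint>

    // Illustrative tokens: the namespace goes into the upper bits, the
    // local name into the lower bits, so one integer identifies a
    // namespaced element or attribute name.
    enum : int32_t
    {
        NS_DRAW   = 0x01000000,
        NS_TEXT   = 0x02000000,
        XML_PAGE  = 0x0001,
        XML_FRAME = 0x0002,
        XML_P     = 0x0003
    };

    // One integer compare replaces the string compares of the classic filter.
    void startElement( int32_t nElementToken )
    {
        switch( nElementToken )
        {
            case NS_DRAW | XML_PAGE:  /* import a slide     */ break;
            case NS_DRAW | XML_FRAME: /* import a frame     */ break;
            case NS_TEXT | XML_P:     /* import a paragraph */ break;
            default:                  /* skip unknown names */ break;
        }
    }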

+ Reduces string handling (encoding, comparing, storing)
+ Leads to a cleaner filter implementation (e.g. switch statements can be used to identify child elements instead of if/else cascades that rely on string compare functions)
+ Usage of perfect hash algorithms which are automatically generated at compile time

- No random access
The Prototype

Since random access looks like the key to a filter that supports painless load on demand, I decided to go with a DOM solution. To minimize the memory footprint of the DOM tree, I used the fast sax parser to build the DOM tree and used the integer tokens for xml names instead of the strings from the stream. Parsing the xml stream itself does not need any interaction with the application core, so this could be done in a separate thread. Converting the xml representation of attribute values to a UNO representation is also something that could be done in almost all cases without the application core, so this should be done in the same thread.

The problem here is that a classic DOM is a generic and typeless representation of the xml data. The solution is a technique we introduced in the odftoolkit.org project. Initially an xslt transformation was used to create dom node implementations for each element of the ODF format, with type safe access to its attributes, using only the relax ng. Maintaining xslt templates proved too costly for such complex transformations, so this was replaced with a code generator that I implemented in Java, which uses simple templates and configuration files to transform a relax ng schema into code files.

This code generator also allows creating DOM tree element implementations for languages other than Java, which is the language used in the odftoolkit.org project. Therefore I used the generator to create c++ source files for all ODF elements, and I adapted the configuration files to create types for the attributes that are equal to the UNO types that the filter needs to pass to the application core. (The key difference between the ODFDOM from odftoolkit.org and the DOM tree for the prototype is that the former uses ODF based types for the attributes and the latter uses UNO types.)
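
To give an impression of what such generated code might look like, here is a hypothetical element class for draw:page (hand-written for this post, not actual generator output):

    #include <cstdint>
    #include <memory>
    #include <string>
    #include <utility>
    #include <vector>

    // Minimal stand-in base class for this sketch; the real base class
    // and the attribute types come from the prototype and from UNO.
    class OdfElement
    {
    public:
        virtual ~OdfElement() = default;
        std::vector<std::unique_ptr<OdfElement>> maChildren; // the DOM tree
    };

    // Hypothetical generated class for draw:page: attribute values are
    // stored with the types the UNO API expects, not as raw xml strings.
    class DrawPageElement : public OdfElement
    {
    public:
        const std::u16string& getName() const       { return maName; }   // draw:name
        int32_t               getStyleToken() const { return mnStyle; }  // draw:style-name as token

        void setName( std::u16string aName )  { maName = std::move( aName ); }
        void setStyleToken( int32_t nStyle )  { mnStyle = nStyle; }

    private:
        std::u16string maName;
        int32_t        mnStyle = -1;
    };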

So in conclusion, a DOM tree builder is started in a worker thread and parses the xml stream using the fast sax parser (the prototype actually starts two worker threads, one to build the tree for the styles.xml stream and one for the content.xml stream). It uses the sax events to create a tree where each element is an instance of a class that was specifically generated for that element from the relax ng schema. All attribute values are parsed and stored in a UNO Any, preferably with the same type as used in the UNO API of the OOo application. For example, if the attribute is a length value, then something like “12cm” is converted into a UNO Any containing an integer value of 12000 (12 cm converted to 1/100 mm).
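
For illustration, a simplified version of such an attribute conversion might look like this (the real code handles all ODF units and error cases; this sketch only covers whole “cm” and “mm” values):

    #include <cstdint>
    #include <cstdlib>
    #include <cstring>
    #include <string>

    // Simplified sketch: convert an ODF length like "12cm" to 1/100 mm,
    // so "12cm" becomes 12000. The real conversion handles all units
    // ("in", "pt", "pc", ...), fractions and malformed input.
    bool convertLengthTo100thMM( const std::string& rValue, int32_t& rResult )
    {
        char* pEnd = nullptr;
        const double fValue = std::strtod( rValue.c_str(), &pEnd );
        if( pEnd == rValue.c_str() )
            return false;                                       // no number found
        if( std::strcmp( pEnd, "cm" ) == 0 )
            rResult = static_cast<int32_t>( fValue * 1000.0 );  // 1 cm = 1000 * 1/100 mm
        else if( std::strcmp( pEnd, "mm" ) == 0 )
            rResult = static_cast<int32_t>( fValue * 100.0 );   // 1 mm = 100 * 1/100 mm
        else
            return false;                                       // unit not covered here
        return true;
    }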

These worker threads never block or wait for the OOo thread. But this would not make sense if at the same time the office thread idles and waits for the tree builder to finish. So the tree builder notifies the filter as soon as an element has been fully parsed. For example, once the office:styles element is completely parsed, the filter thread can start and import the styles. If the filter gets notified that a slide has been completely parsed, it can check whether the needed styles and master pages are already imported and then import this slide. If the filter is also executed in a separate thread, the office thread can paint the already imported slides to enhance the responsiveness of the application to the user, which results in a ‘subjective’ performance gain.
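
A rough sketch of the hand-over between the tree builder thread and the filter thread (the names and the queue based notification are my illustration, not necessarily how the prototype wires it up):

    #include <condition_variable>
    #include <deque>
    #include <memory>
    #include <mutex>
    #include <utility>

    struct OdfElement;  // generated DOM element classes, see the sketch above

    // Producer/consumer hand-over: the worker thread pushes every fully
    // parsed element, the filter thread pops and imports it. The worker
    // never waits for the OOo thread; only the filter thread may wait.
    class ReadyElementQueue
    {
    public:
        void push( std::shared_ptr<OdfElement> pElement )
        {
            {
                std::lock_guard<std::mutex> aGuard( maMutex );
                maReady.push_back( std::move( pElement ) );
            }
            maCondition.notify_one();   // wake the importing filter thread
        }

        std::shared_ptr<OdfElement> pop()
        {
            std::unique_lock<std::mutex> aGuard( maMutex );
            maCondition.wait( aGuard, [this]{ return !maReady.empty(); } );
            std::shared_ptr<OdfElement> pElement = std::move( maReady.front() );
            maReady.pop_front();
            return pElement;
        }

    private:
        std::mutex maMutex;
        std::condition_variable maCondition;
        std::deque<std::shared_ptr<OdfElement>> maReady;
    };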

To implement the filter I borrowed code from the existing ODF filter and transformed it to use the DOM tree instead of sax events. This mostly resulted in much shorter and cleaner code, as expected. For a reliable comparison with the existing filter I had to implement a minimum set of functionality, so that for selected real world documents the prototype imports all the functionality used in those documents.

I ended up implementing

Import of graphic and presentation styles (with the exception of rarely used attributes like 3d or form attributes)
Import of numbering styles, page layouts and presentation layouts
Import of draw styles like gradients, bitmaps, hatches and markers
Import of master pages including their notes
Import of automatic styles for shapes, paragraphs and spans
Import of draw pages including their notes
Import of common shapes (placeholder, text, rectangle, graphic and ole)

Results

First I used an average real world document with 47 slides and lots of graphics and some chart ole. It showed that the prototype filter was around 2% faster than the original filter. This was less than expected.

Next I created an artificial document: a presentation with 188 slides and only formatted text, no graphics, no ole. Such a pure xml document would be uncommon for a presentation but is a good approximation of what the gain could be for a writer or calc document, where the xml to graphic/ole ratio is much higher. This led to a performance gain of 10%, which is not bad since the prototype itself is not yet profiled and optimized.

I tested this on a dual core 3 GHz system. Experiments with the original filter showed that we are cpu bound, and since we use only one core, processor usage is only 50%. So I expected better results with the threaded prototype by making use of the 3 GHz of the second core. A quick look at a cpu monitor showed that this did not happen; processor usage was still capped at 50%.

This made me suspect that the actual parsing, tree building and xml to UNO conversion accounted for only an insignificant share of the time the filter needs to import this document. After removing the threading I measured the time it took to parse the xml streams and to actually import the document. It turns out that building the DOM trees for styles.xml and content.xml accounted for less than 2% of the overall time. So when opening a document that takes 10 seconds to load, the second core is only used for 0.2 seconds. Remember that this includes not only the actual xml parsing but also creating the DOM tree in memory and converting most of the attribute values to UNO types.

The interesting finding here is that the overhead from the xml parsing is much smaller than the typical ‘developer gut feeling’ about xml suggests. So any performance work in the filter or the application core would be much more effective than trying to speed up the xml parsing.

Since the usage of a DOM is often criticized for its memory consumption, I also took a look at this. Because I replaced most strings with 32 bit integer tokens, I figured that the memory consumption of the DOM tree should not be greater than that of the xml stream it was parsed from. My expectation was that in most cases it should be even smaller.

The first measurement showed that it was actually 10 times the size of its xml stream. After further investigation I found the source of the problem: the memory manager's overhead for each allocated instance. After converting attributes from individual instances to a single vector, this dropped to 2x the size of the xml stream. For an impress document this is not a problem, as the xml stream size for average documents is seldom more than 1 MB. For a writer document this may also be no problem, as for example the huge OpenDocument specification has less than 10 MB of xml streams. For calc this may be a problem, as a calc document with many cells can have xml streams of 100 MB and more. For current office workstations this may not be a problem at all, but if OOo runs on a server for multiple users, or if OOo were ported to small devices, then this could become an issue.
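
The layout change behind that drop is roughly the following (illustrative types, not the actual prototype code):

    #include <cstdint>
    #include <vector>

    struct UnoAnyLike { /* stands in for the real UNO Any */ };

    struct Attribute
    {
        int32_t    mnToken;   // tokenized attribute name
        UnoAnyLike maValue;   // value already converted to a UNO type
    };

    // Before: every attribute was an individually heap allocated object,
    // so the memory manager's per-allocation overhead dominated.
    struct ElementBefore
    {
        std::vector<Attribute*> maAttributes;
    };

    // After: each element keeps its attributes in one contiguous vector,
    // one allocation for all of them.
    struct ElementAfter
    {
        std::vector<Attribute> maAttributes;
    };
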
Conclusion

While the prototype showed that this method of implementing an ODF import filter does result in a performance gain and the option to support load on demand, for impress the performance gain of only around 2% for real world documents is outweighed by the actual cost of implementing this filter. I currently estimate an effort of at least 6 man months to implement only the impress import filter (this is without time for testing, which is also crucial to find regressions). For calc and writer filters these figures may differ.

Since writer and calc documents are more xml centric than impress documents, a prototype for these applications may show that the overall gain still outweighs the costs.