strict versus lenient parsing of binary documents

Thu Jun 16 08:42:21 BST 2011

Now that there is some filter work going on, people are waking up to the 
idea of adding more features. Notably, some would like the msoscheme parser 
to be more lenient with invalid data.

The technical background is this: in PowerPoint files, the data is split into 
records. Each record starts with a header that has a type number and a size. 
Records can be nested. Even without knowing the details of all the records, 
one can still walk the record structure, because the size in each header says 
where the next record starts. One simply cannot assign a meaning to the 
records one does not know.
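
To make the record structure concrete, here is a minimal sketch in Python. It 
is not the msoscheme parser. It assumes the 8-byte little-endian header layout 
from the published PowerPoint documentation (4 bits version, 12 bits instance, 
16 bits type, 32 bits length) and assumes that a version value of 0xF marks a 
record that contains nested records. The names Record and parse_records are 
made up for this illustration.

```python
import struct
from dataclasses import dataclass, field

# Assumed header layout: version/instance (2 bytes), type (2 bytes), length (4 bytes).
HEADER = struct.Struct("<HHI")


@dataclass
class Record:
    rec_type: int   # type number from the header
    offset: int     # position of the header in the file
    size: int       # size of the record body, excluding the header
    children: list = field(default_factory=list)


def parse_records(data, start=0, end=None):
    """Walk the records in data[start:end] without interpreting their bodies."""
    end = len(data) if end is None else end
    records = []
    pos = start
    while pos < end:
        if pos + HEADER.size > end:
            raise ValueError(f"truncated header at offset {pos}")
        ver_inst, rec_type, size = HEADER.unpack_from(data, pos)
        body = pos + HEADER.size
        if body + size > end:
            raise ValueError(f"record at offset {pos} overruns its parent")
        rec = Record(rec_type, pos, size)
        # Assumption: the low 4 bits being 0xF mark a container record.
        if (ver_inst & 0xF) == 0xF:
            rec.children = parse_records(data, body, body + size)
        records.append(rec)
        pos = body + size
    return records
```

The walker never looks at the type number to decide how far to read. That is 
why unknown records can be skipped, but not understood.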

The parser in msoscheme fails when it does not recognize some data. There are 
three cases where some would like the parser to be more lenient:
 - if the order of data members in a record is wrong
 - if some record has data members that are invalid
 - if a record has unknown data members

The historic reason why the parser is very strict is simple: we want to follow 
the documentation published by Microsoft and be clear on any exceptions in it. 
So far there are quite a few exceptions noted in the mso.xml file from which 
the parser is generated. Yet, there are still ppt files out there that cannot 
be parsed. The reasons vary from bugs in the creating software to lacking 
documentation.

If the parser were more lenient, it would probably be able to parse these 
files.

I think making the parser more lenient is a good idea, provided these 
conditions are met:
 - the mso.xml will not change: being lenient does not change the original 
   definition
 - each of the three kinds of leniency listed above can be enabled separately
 - there are callbacks to report where a file violates the specification 
   (position in the file, size of record, type of record)
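
As an illustration of what such an interface could look like, here is a 
sketch in Python. Nothing like this exists in the parser generator today. All 
names (Leniency, Violation, StrictParseError, handle_violation) are 
hypothetical. The idea is one switch per kind of leniency plus a callback that 
gets the position, size and type of the offending record.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Violation:
    kind: str      # "wrong_order", "invalid_member" or "unknown_member"
    offset: int    # position in the file
    rec_type: int  # type of record
    size: int      # size of record


@dataclass
class Leniency:
    # Everything is off by default, so the default behaviour stays strict.
    allow_wrong_order: bool = False
    allow_invalid_members: bool = False
    allow_unknown_members: bool = False
    on_violation: Optional[Callable[[Violation], None]] = None


class StrictParseError(Exception):
    pass


def handle_violation(options: Leniency, violation: Violation) -> None:
    """Report the violation, then fail unless this kind of leniency is enabled."""
    allowed = {
        "wrong_order": options.allow_wrong_order,
        "invalid_member": options.allow_invalid_members,
        "unknown_member": options.allow_unknown_members,
    }[violation.kind]
    if options.on_violation is not None:
        options.on_violation(violation)
    if not allowed:
        raise StrictParseError(
            f"{violation.kind} at offset {violation.offset} "
            f"(type {violation.rec_type:#06x}, size {violation.size})"
        )
```

A caller that only wants to tolerate unknown members would pass 
Leniency(allow_unknown_members=True, on_violation=log.append) and still get an 
error for the other two kinds of violation.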

Patches to the parser generator that meet these requirements are very welcome.

There is an additional cost though. If the parser is more lenient, this has 
consequences for the assumptions made in code that uses the results of the 
parser. For the three types of leniency here are the consequences:
 - if the order of data members in a record is wrong, just parse them
     This has no consequences if each member in a record has a unique type. 
     If there are two members with the same type, they might be swapped. 
     Typically, swapping will not have large consequences.
 - if some record has data members that are invalid, just parse them
     This is dangerous and invasive. If you need to check whether each member 
     is valid after parsing, the size of the code interpreting the parser 
     results will blow up with 'if' statements.
 - if a record has unknown data members, just ignore them
     This is fine, unless you want to be able to save them back or are 
     worried about losing information.
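
The first and third consequence can be shown with a small Python sketch. It 
is only an illustration, not how the generated parser works. The function 
match_members and the type numbers 1, 2 and 99 are made up. Members are 
assigned by type instead of by position, and members of an unknown type are 
set aside instead of dropped, so they could still be written back.

```python
def match_members(expected, found):
    """Assign found members to expected slots by type, ignoring order.

    expected: list of (name, rec_type) in the order the specification gives.
    found:    list of (rec_type, payload) in the order they occur in the file.
    Returns (assigned, unknown).
    """
    # For each type, the names that still need a value, in specification order.
    slots = {}
    for name, rec_type in expected:
        slots.setdefault(rec_type, []).append(name)

    assigned = {}
    unknown = []
    for rec_type, payload in found:
        names = slots.get(rec_type)
        if names:
            # With two members of the same type there is no way to tell
            # which is which, so the first one found fills the first slot.
            assigned[names.pop(0)] = payload
        else:
            # Unknown or surplus member: keep it so it is not lost on save.
            unknown.append((rec_type, payload))
    return assigned, unknown


expected = [("first", 1), ("second", 1), ("other", 2)]
found = [(2, b"c"), (1, b"a"), (1, b"b"), (99, b"?")]
assigned, unknown = match_members(expected, found)
```

Here "other" ends up with the right payload even though it came first in the 
file. For "first" and "second" the code can only guess: if the file had them 
swapped, nothing in the data reveals it.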

Cheers,
Jos

-- 
Jos van den Oever, software architect