strict versus lenient parsing of binary documents
Jos van den Oever
jos.van.den.oever at kogmbh.com
Thu Jun 16 08:42:21 BST 2011
Now that there is some filter work going on, people are waking up to the
idea of adding more features. Notably, some would like the msoscheme parser
to be more lenient with invalid data.
The technical background is this: in powerpoint files, the data is split in
records. Each record starts with a header that has a type number and a size.
Records can be nested. Even without knowing the details of all the records,
one can still parse them. One simply cannot assign a meaning to them.
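To make the record structure concrete, here is a minimal sketch of such a generic parser. It assumes the common 8-byte record header used in PowerPoint files (2 bytes version/instance, 2 bytes type, 4 bytes payload length, all little-endian, with a version nibble of 0xF marking a container record); the record type numbers in the example are made up, and this is not the msoscheme implementation:

```python
import struct

def parse_records(data, offset=0, end=None):
    """Parse a byte stream into a tree of (type, payload) tuples.

    No meaning is assigned to any record; the structure alone
    (type number, size, nesting) is enough to walk the stream.
    """
    if end is None:
        end = len(data)
    records = []
    while offset + 8 <= end:
        ver_inst, rec_type, rec_len = struct.unpack_from('<HHI', data, offset)
        offset += 8
        payload = data[offset:offset + rec_len]
        if ver_inst & 0x000F == 0x000F:
            # container record: the payload is itself a list of records
            records.append((rec_type, parse_records(payload)))
        else:
            records.append((rec_type, payload))
        offset += rec_len
    return records
```

A strict parser like msoscheme's additionally checks each record against its definition in mso.xml and fails on any deviation; the sketch above is the fully lenient extreme that accepts anything structurally well-formed.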
The parser in msoscheme fails when it does not recognize some data. There are
three cases where some would like the parser to be more lenient:
- if the order of data members in a record is wrong
- if some record has data members that are invalid
- if a record has unknown data members
The historic reason why the parser is very strict is simple: we want to follow
the documentation published by Microsoft and be clear on any exceptions in it.
So far there are quite a few exceptions noted in the mso.xml file from which
the parser is generated. Yet there are still ppt files out there that cannot
be parsed. The reasons vary from bugs in the creating software to incomplete
documentation.
If the parser were more lenient, it would probably be able to parse these
structures.
I think making the parser more lenient is a good idea, considering these
limitations:
- the mso.xml will not change: being lenient does not change the original
definition
- each aspect of leniency (see above) can be enabled separately
- there are callbacks to report where a file violates the specification
(position in the file, size of record, type of record)
Patches to the parser generator that meet these requirements are very welcome.
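Such a violation callback could carry the three pieces of information listed above. A rough sketch, with all names hypothetical (this is not the msoscheme API):

```python
from dataclasses import dataclass

@dataclass
class Violation:
    position: int   # byte offset in the file
    size: int       # size of the offending record
    rec_type: int   # type number of the offending record
    reason: str     # what deviated from the specification

def describe_violation(v: Violation) -> str:
    # A lenient parser would invoke a callback with this information
    # instead of aborting the parse, so tools can log exactly where
    # a file violates the documentation.
    return (f"spec violation at offset {v.position:#x}: "
            f"record type {v.rec_type:#06x}, {v.size} bytes: {v.reason}")
```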
There is an additional cost though. If the parser is more lenient, this has
consequences for the assumptions made in code that uses the results of the
parser. For the three types of leniency here are the consequences:
- if the order of data members in a record is wrong, just parse them
This has no consequences if each member in a record has a unique type.
If there are two members with the same type, they might be swapped. Typically,
swapping will not have large consequences.
- if some record has data members that are invalid, just parse them
This is dangerous and invasive. If you need to check whether each member is
valid after parsing, the size of the code interpreting the parser results will
blow up with 'if' statements.
- if a record has unknown data members, just ignore them
This is fine, unless you want to be able to save them back or are
worried about losing information.
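To illustrate the second point, here is what consumer code starts to look like when parsed members can no longer be trusted. The record and member names are invented for the example; the real msoscheme structures differ:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical parse results for a slide title.
@dataclass
class TextAtom:
    value: str
    def is_valid(self) -> bool:
        # a stand-in for whatever constraints the spec imposes
        return len(self.value) > 0

@dataclass
class Header:
    rec_type: int

@dataclass
class TitleRecord:
    header: Optional[Header]
    text: Optional[TextAtom]

def slide_title(rec: Optional[TitleRecord]) -> Optional[str]:
    # With a strict parser, none of these guards are needed: a parsed
    # record is valid by construction. With a lenient parser, every
    # access must be defended, and this pattern repeats everywhere.
    if rec is None:
        return None
    if rec.header is None or rec.header.rec_type != 0x0FA0:
        return None
    if rec.text is None or not rec.text.is_valid():
        return None
    return rec.text.value
```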
Cheers,
Jos
--
Jos van den Oever, software architect
+49 391 25 19 15 53
074 3491911
http://kogmbh.com/legal/