Supporting the MAFF web archive format, based on ZIP

Sun Mar 21 22:31:15 GMT 2010

On Sunday 21 March 2010, Paolo Amadini wrote:
> Let me know if you think you need other specific test cases. I'm
>  still working on the trunk build in the meantime.

My comments on the MAFF specs:

Pro:
- Placing the webpage content into a directory of its own inside the
ZIP archive is a good idea. Or rather, it is bad that WAR archives do
not use directories. If you accidentally extract a WAR archive with
'tar' you will clutter your current directory.

- Structured metadata in "index.rdf". WAR stores only one piece of
metadata: the original URL, inside an HTML comment.

- Using ZIP means fast access to individual files. Reading WAR,
i.e. TGZ, means the whole WAR archive has to be extracted to some
temporary place first, which is probably slower and uses more memory.

- Easier exchange with Windows as Windows natively supports reading 
ZIP archives.
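
To illustrate the ZIP versus TGZ point, here is a minimal Python
sketch using only the standard zipfile and tarfile modules. The file
names and helper functions are made up for this example; it is not how
the webarchiver or MAFF is implemented, just a way to see the access
pattern of each format.

```python
import io
import tarfile
import zipfile

# Hypothetical page contents, placed in a directory of their own.
FILES = {
    "page/index.html": b"<html><body>Hello</body></html>",
    "page/style.css": b"body { color: black; }",
    "page/index.rdf": b"<RDF/>",
}


def build_zip(files):
    """Return the bytes of a ZIP archive holding the given files."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return buf.getvalue()


def build_tgz(files):
    """Return the bytes of a gzip-compressed tar holding the given files."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tf:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def read_from_zip(blob, name):
    # A ZIP ends with a central directory listing every member's offset,
    # so a single member can be read without touching the others.
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        return zf.read(name)


def read_from_tgz(blob, name):
    # A gzip stream has no index: members are visited in order, and
    # everything stored before the wanted one has to be decompressed.
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:gz") as tf:
        skipped = 0
        for member in tf:
            if member.name == name:
                return tf.extractfile(member).read(), skipped
            skipped += 1
    raise KeyError(name)
```

With the sample data, read_from_zip() goes straight to "page/index.rdf",
while read_from_tgz() has to walk past the members stored before it.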

Con:
IMO the scope of the webarchiver should be rather narrow:
1) it should only archive webpages. MAFF seems to allow PNG and
   SVG as well.
2) it should only archive one webpage at a time. I am referring to
   the "extended" conformance level that allows multiple webpages
   to be archived.

The primary reason is that a narrow scope is easy to understand and
sticks to the Unix principle: do one job and do it well. Right now the
webarchiver "freezes" the currently visible webpage and does nothing
else. It is an easy and obvious concept.

Besides, wrapping one PNG file in a ZIP archive does not make much
sense to me.

What this means is I am going to work on supporting a subset of the 
"basic" conformance level: archived HTML files with metadata.
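
As a rough sketch of what "one archived HTML page with metadata" could
mean in practice, the Python check below accepts a ZIP only if
everything lives in a single top-level directory that contains both a
main HTML file and "index.rdf". The function name, the "index.html"
file name and the exact rule are assumptions made for illustration;
they are not taken from the MAFF specs.

```python
import io
import zipfile


def fits_narrow_subset(blob):
    """Return True if the ZIP bytes hold exactly one page directory.

    Assumed rule (hypothetical, for illustration only): no files at the
    top level, exactly one top-level directory, and that directory
    contains "index.html" and "index.rdf".
    """
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        # Ignore explicit directory entries, which end with "/".
        names = [n for n in zf.namelist() if not n.endswith("/")]

    loose_files = [n for n in names if "/" not in n]
    top_dirs = {n.split("/", 1)[0] for n in names if "/" in n}
    if loose_files or len(top_dirs) != 1:
        return False

    root = next(iter(top_dirs))
    return f"{root}/index.html" in names and f"{root}/index.rdf" in names
```

An archive with two page directories, or one whose main file is a PNG,
would be rejected by this check.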

I am not saying MAFF is bad here. It is just that everything in MAFF 
that goes beyond the KISS principle is going to be a hard sell for me.

Matthias