Supporting the MAFF web archive format, based on ZIP
Matthias Grimrath
maps4711 at gmx.de
Sun Mar 21 22:31:15 GMT 2010
On Sunday 21 March 2010, Paolo Amadini wrote:
> Let me know if you think you need other specific test cases. I'm
> still working on the trunk build in the meantime.
My comments on the MAFF specs:
Pro:
- Placing the webpage content into a directory of its own inside the
ZIP archive is a good idea. Or rather it is bad that WAR archives do
not use directories. If you accidentally extract a WAR archive with
'tar' you will clutter your current directory.
- Structured metadata in "index.rdf". WAR stores only one piece of
metadata: the original URL inside an HTML comment.
- Using ZIP means fast access to individual files. Reading a WAR,
i.e. a TGZ, means the whole archive has to be extracted to some
temporary place first, which is probably slower and uses more memory.
- Easier exchange with Windows as Windows natively supports reading
ZIP archives.
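
To illustrate the point about access to individual files, here is a
minimal Python sketch, not code from the webarchiver. The member names
and the in-memory archives are made up for the example. ZIP has a
central directory, so one member can be located and decompressed on its
own; a gzip-compressed tar has no index, so the stream has to be
decompressed from the start until the wanted member is reached.

```python
import io
import tarfile
import zipfile

PAGE = b"<html><body>archived page</body></html>"
FILLER = b"\x00" * 4096  # stands in for images, stylesheets, etc.


def make_zip():
    """Build a small ZIP with the page inside its own directory."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("page/image.png", FILLER)   # hypothetical member names
        zf.writestr("page/index.html", PAGE)
    return buf.getvalue()


def make_tgz():
    """Build a small gzip-compressed tar with a flat layout."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tf:
        for name, data in (("image.png", FILLER), ("index.html", PAGE)):
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def read_from_zip(blob, name):
    # The central directory gives the member's offset, so only this
    # one member is decompressed.
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        return zf.read(name)


def read_from_tgz(blob, name):
    # No index: tarfile walks the decompressed stream member by member
    # until it finds the requested name.
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:gz") as tf:
        return tf.extractfile(name).read()
```

Both functions return the same bytes; the difference is only in how
much of the archive has to be processed to get there.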
Con:
IMO the scope of the webarchiver should be rather narrow:
1) it should only archive webpages. MAFF seems to allow PNG and
SVG as well.
2) it should only archive one webpage at a time. I am referring to
the "extended" conformance level that allows multiple webpages
to be archived.
The primary reason is that a narrow scope is easy to understand and
sticks to the Unix principle: do one job and do it well. Right now the webarchiver
"freezes" the currently visible webpage and does nothing else. It is
an easy and obvious concept.
Besides, wrapping one PNG file in a ZIP archive does not make much
sense to me.
What this means is I am going to work on supporting a subset of the
"basic" conformance level: archived HTML files with metadata.
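
A rough Python sketch of what accepting only this subset could look
like, assuming an archive qualifies when it has exactly one top-level
directory and that directory contains "index.rdf". The function names
and the demo archive are hypothetical, and the check is my reading of
the narrow scope described here, not a rule quoted from the MAFF
specs.

```python
import io
import zipfile


def top_level_dirs(names):
    """Return the set of first path components of all nested entries."""
    return {n.split("/", 1)[0] for n in names if "/" in n}


def is_single_page_archive(zf):
    """True if the ZIP holds one page directory with an index.rdf."""
    names = zf.namelist()
    dirs = top_level_dirs(names)
    if len(dirs) != 1:
        return False  # zero or several page directories: out of scope
    (root,) = dirs
    return f"{root}/index.rdf" in names


def build_demo(page_dirs):
    """Create an in-memory ZIP with one directory per entry (demo only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for d in page_dirs:
            zf.writestr(f"{d}/index.rdf", "<RDF/>")          # placeholder metadata
            zf.writestr(f"{d}/index.html", "<html></html>")  # assumed file name
    buf.seek(0)
    return zipfile.ZipFile(buf)
```

An archive with two page directories, as the "extended" level would
allow, is simply rejected by this check.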
I am not saying MAFF is bad here. It is just that everything in MAFF
that goes beyond the KISS principle is going to be a hard sell for me.
Matthias