[Okular-devel] md5 hash for annotation file name

Thu Sep 11 00:16:05 CEST 2008

On Thursday 11 September 2008, Markus Grabner wrote:
> On Wednesday, 10 September 2008, Albert Astals Cid wrote:
> > On Wednesday 10 September 2008, Markus Grabner wrote:
> > > 	Hi!
> > >
> > >     It has been discussed (http://bugs.kde.org/show_bug.cgi?id=151614)
> > > to use a hash function to determine the name of the annotation file
> > > created by okular. The attached patch implements this behaviour (thanks
> > > Ivo for pointing me to QCryptographicHash - I looked for such a thing
> > > but somehow missed it).
> > >
> > > It works nicely in several ways:
> > > *) Annotations stay associated with the file after renaming it.
> > > *) It also works for non-local URLs (http://...) since we don't need to
> > > care for mapping the URL to some valid file name.
> > > *) Annotations stay associated with the file after downloading it from
> > > the web and opening a local copy (possibly under a different name).
> >
> > It does not work nicely in several ways:
> >  *) MD5 sucks, use SHA1
>
> I don't see any serious security threat from using a weak hash function at
> this point. All an attacker could do would be to create a modified file for
> which the same annotations would be displayed as for the file the
> annotations were initially created for.
> I like Ivo's proposal to use QCryptographicHash, which supports MD4, MD5,
> and SHA1, so these are natural candidates.

It's not about an attacker, it's about you having two files that collide, so
that the annotations from one show up on the other.
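
A minimal sketch of the naming scheme under discussion, written in Python with
hashlib standing in for QCryptographicHash. The function name
annotation_file_name and the ".xml" suffix are assumptions made for
illustration and are not taken from the patch. The name depends only on the
file contents, so two different files share annotations only if their digests
collide:

```python
import hashlib


def annotation_file_name(content: bytes, algorithm: str = "sha1") -> str:
    """Derive an annotation file name from the document contents only.

    Hypothetical helper: the real patch may name the file differently.
    """
    digest = hashlib.new(algorithm, content).hexdigest()
    return digest + ".xml"  # the ".xml" suffix is an assumption


# The path never enters the computation, so renaming or re-downloading the
# document leaves the annotation file name unchanged.
first = annotation_file_name(b"%PDF-1.4 first document")
second = annotation_file_name(b"%PDF-1.4 second document")
```

Here first and second differ because the contents differ; they would only
coincide if the chosen hash function produced the same digest for both inputs.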

>
> >  *) Reading the whole file sucks, I don't want the 100MB of my PDF file
> > to be piped through a hash, it'll probably take *some* time
>
> Just tried it on my ancient AMD64 2GHz machine and found the following
> computing times for a 500MB file:

Calling an AMD64 2GHz machine ancient makes me wonder what an EeePC is, prehistoric?

> MD4: 1.3 seconds
> MD5: 2 seconds
> SHA1: 4 seconds
> Loading the file from a local hard disk takes considerably longer
How much is that? 

> , so I'm not very much concerned about the hash computation time. However,
> the "readAll()" definitely has to be replaced by reading smaller chunks and
> processing them sequentially, that was just for the "proof of concept".

So can you see if splitting the read gives us an improvement? 4 seconds on an
AMD64 2GHz seems like "a lot" to me.
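
A minimal sketch of the chunked read mentioned above, in Python with hashlib
standing in for QCryptographicHash. The function name hash_file_chunked and
the 1MB default chunk size are assumptions for illustration. The file is fed
to the digest piece by piece, so memory use stays bounded by the chunk size
while the result is identical to hashing the whole file at once:

```python
import hashlib


def hash_file_chunked(path: str, algorithm: str = "sha1",
                      chunk_size: int = 1024 * 1024) -> str:
    """Hash a file by feeding fixed-size chunks to the digest sequentially."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes means end of file
                break
            digest.update(chunk)
    return digest.hexdigest()
```

Chunking mainly avoids holding the whole document in memory; whether it also
reduces the wall-clock time would have to be measured. Note that "md5" and
"sha1" are always available in hashlib, while "md4" depends on the local
OpenSSL build.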

>
> > so reading at most 1MB would be much better imho.
>
> If an annotation refers to a typo on the last page of a huge document, and
> this gets fixed, the same annotation would still be displayed for the
> corrected file if the correction appears after the portion of the file for
> which the hash value is computed (at least for uncompressed formats such as
> PostScript). BTW, the current implementation in okular has the same problem
> since changing a single character in a PostScript file usually doesn't
> change its size.

You have a point here.
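
The trade-off can be illustrated with a small Python sketch, again with
hashlib standing in for QCryptographicHash. The names prefix_hash, full_hash
and PREFIX_LIMIT are hypothetical, and the test data is made up. A typo fix
beyond the first 1MB keeps the file size the same, so both a size-based check
and a prefix-only hash treat the two files as identical, while a hash over the
whole file tells them apart:

```python
import hashlib

PREFIX_LIMIT = 1024 * 1024  # the 1MB cap proposed in the thread


def prefix_hash(data: bytes, limit: int = PREFIX_LIMIT) -> str:
    """Hash only the first `limit` bytes of the document."""
    return hashlib.sha1(data[:limit]).hexdigest()


def full_hash(data: bytes) -> str:
    """Hash the whole document."""
    return hashlib.sha1(data).hexdigest()


# Two made-up documents that differ only by a typo fix near the end.
body = b"x" * (2 * PREFIX_LIMIT)
original = body + b"teh end\n"
corrected = body + b"the end\n"

same_size = len(original) == len(corrected)
prefix_matches = prefix_hash(original) == prefix_hash(corrected)
full_matches = full_hash(original) == full_hash(corrected)
```

With this data, same_size and prefix_matches are true and full_matches is
false, which is the case described for uncompressed formats such as PostScript.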

Albert

>
> 	Kind regards,
> 		Markus