[Digikam-devel] file hash creation: asking for short test

Thu Dec 9 11:04:19 GMT 2010

Hi,

We are using an MD5 hash over parts of a file to uniquely identify images and 
display thumbnails. This has worked quite well, but recently I have seen two 
or three cases where the hash fails (same hash for completely different 
images).
There is another problem with the current hash: it relies on a binary blob of 
the metadata produced by Exiv2, but this format is not guaranteed to be stable 
(the hash may change with a new Exiv2 version).

The recommendation by Andreas Huggel was to simply use the first 100kB of a 
file, which will typically include the file header, the metadata, and the 
start of the actual image data.
A variant would be to include the last 100kB as well.
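
To make the two variants concrete, here is a minimal sketch, not the code from 
the attachment. It assumes standard C++ only, so it uses FNV-1a as a stand-in 
for MD5 (in digiKam the same bytes would presumably be fed to an MD5 
implementation such as QCryptographicHash). The names fnv1a, readChunk and 
partialFileHash are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// FNV-1a (64 bit). Stand-in for MD5 so the sketch needs no external library.
inline std::uint64_t fnv1a(const std::vector<char>& data,
                           std::uint64_t h = 14695981039346656037ULL)
{
    for (char c : data) {
        h ^= static_cast<unsigned char>(c);
        h *= 1099511628211ULL;
    }
    return h;
}

// Read up to `count` bytes starting at `offset`; shorter at end of file.
inline std::vector<char> readChunk(std::ifstream& in, std::streamoff offset,
                                   std::size_t count)
{
    std::vector<char> buf(count);
    in.clear();                       // reset EOF state from an earlier read
    in.seekg(offset, std::ios::beg);
    in.read(buf.data(), static_cast<std::streamsize>(count));
    buf.resize(static_cast<std::size_t>(in.gcount()));
    return buf;
}

// Hash the first `chunk` bytes of a file; with includeTail also the last
// `chunk` bytes (never overlapping the head). Returns false if unreadable.
inline bool partialFileHash(const std::string& path, std::uint64_t& out,
                            bool includeTail = false,
                            std::size_t chunk = 100 * 1024)
{
    assert(chunk > 0);
    std::ifstream in(path, std::ios::binary);
    if (!in)
        return false;

    in.seekg(0, std::ios::end);
    const std::streamoff size = in.tellg();
    const std::streamoff step = static_cast<std::streamoff>(chunk);

    std::uint64_t h = fnv1a(readChunk(in, 0, chunk));
    if (includeTail && size > step) {
        const std::streamoff tailStart = std::max(step, size - step);
        h = fnv1a(readChunk(in, tailStart, chunk), h); // continue from head hash
    }
    out = h;
    return true;
}
```

Calling partialFileHash(path, h) gives the head-only variant, and 
partialFileHash(path, h, true) the head-plus-tail variant. Two files that 
share their first 100kB collide in the first variant but not necessarily in 
the second.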

Attached is a small application which scans a given collection directory, 
creates the hash for each file, and reports whether the hash succeeds in 
distinguishing all files.

I have run this on my collection, but I would ask you to repeat the test on 
your collections to find out whether it works for you as well:

qmake testhash.pro
make
./testhash /toplevel/directory/to/your/collection

Here it takes 15 s per 1000 files.
At the end, it will tell you whether any files failed or whether it 
succeeded. If it fails, it would be interesting to find out whether the files 
are actually very similar, and whether they have the same file size. (A hard 
failure would be two dissimilar files with the same file size.)
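
The failure check described above could look roughly like the following 
sketch. It is not taken from the attached main.cpp; FileEntry, Collision and 
findCollisions are hypothetical names, and the hash is assumed to be available 
as a string already. The code can only flag pairs with equal hash and equal 
size as candidates for a hard failure; whether the images are really 
dissimilar still has to be judged by eye.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// One scanned file: path, size in bytes, and its (already computed) hash.
struct FileEntry {
    std::string   path;
    std::uint64_t size;
    std::string   hash;
};

// A pair of different files that produced the same hash.
struct Collision {
    std::string first;
    std::string second;
    bool        sameSize;  // true: candidate for a "hard failure"
};

// Group files by hash and report every pair that shares one.
inline std::vector<Collision> findCollisions(const std::vector<FileEntry>& files)
{
    std::map<std::string, std::vector<const FileEntry*>> byHash;
    for (const FileEntry& f : files)
        byHash[f.hash].push_back(&f);

    std::vector<Collision> result;
    for (const auto& kv : byHash) {
        const std::vector<const FileEntry*>& group = kv.second;
        for (std::size_t i = 0; i < group.size(); ++i) {
            for (std::size_t j = i + 1; j < group.size(); ++j) {
                result.push_back({group[i]->path, group[j]->path,
                                  group[i]->size == group[j]->size});
            }
        }
    }
    return result;
}
```

An empty result means the hash differentiated all files. Any entry with 
sameSize set is worth opening in a viewer to see whether the two images are 
near-duplicates or genuinely different.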

Thanks
Marcel

-------------- next part --------------
A non-text attachment was scrubbed...
Name: main.cpp
Type: text/x-c++src
Size: 4939 bytes
Desc: not available
URL: <http://mail.kde.org/pipermail/digikam-devel/attachments/20101209/11775214/attachment.cpp>
-------------- next part --------------
SOURCES += main.cpp
CONFIG += qt debug