[Digikam-devel] file hash creation: asking for short test
Marcel Wiesweg
marcel.wiesweg at gmx.de
Thu Dec 9 11:04:19 GMT 2010
Hi,
We are using an MD5 hash over parts of a file to uniquely identify images and
display thumbnails. This has worked quite well, but recently I have seen two
or three cases where the hash fails (same hash for completely different
images).
There is another problem with the current hash: it relies on a binary blob of
the metadata produced by Exiv2, but this format is not guaranteed to be stable
(the hash may change with a new Exiv2 version).
The recommendation by Andreas Huggel was to simply hash the first 100kB of a
file, which will typically include the file header and the metadata, and reach
into the actual image data.
A variant would be to include the last 100kB as well.
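
To make the two variants concrete, here is a minimal sketch in plain C++ of how the hash input could be collected. It is not the code from the attached main.cpp. The names readHashInput and fnv1a64 are hypothetical, and FNV-1a is only a stand-in so the example compiles without extra libraries; the proposal itself is about MD5 over these bytes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>

// Hypothetical helper: collect the bytes that would be fed to the hash.
// Reads the first chunkSize bytes and, if includeTail is set, also the last
// chunkSize bytes. On short files the tail never overlaps the head.
std::string readHashInput(const std::string& path,
                          std::size_t chunkSize = 100 * 1024,
                          bool includeTail = false)
{
    assert(chunkSize > 0);
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in)
        return std::string();          // unreadable file: empty input

    const std::size_t fileSize = static_cast<std::size_t>(in.tellg());
    const std::size_t headLen = std::min(chunkSize, fileSize);

    std::string data(headLen, '\0');
    in.seekg(0, std::ios::beg);
    in.read(&data[0], static_cast<std::streamsize>(headLen));

    if (includeTail && fileSize > headLen) {
        const std::size_t tailLen = std::min(chunkSize, fileSize - headLen);
        std::string tail(tailLen, '\0');
        in.seekg(static_cast<std::streamoff>(fileSize - tailLen), std::ios::beg);
        in.read(&tail[0], static_cast<std::streamsize>(tailLen));
        data += tail;
    }
    return data;
}

// Stand-in hash (64-bit FNV-1a), NOT MD5. A real test would pass the same
// bytes to an MD5 implementation instead.
std::uint64_t fnv1a64(const std::string& bytes)
{
    std::uint64_t h = 14695981039346656037ULL;   // FNV offset basis
    for (unsigned char c : bytes) {
        h ^= c;
        h *= 1099511628211ULL;                    // FNV prime
    }
    return h;
}
```

With this sketch, two files that differ only beyond the first 100kB produce the same input in the first variant, and a different input in the second variant if the difference lies within the last 100kB.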
Attached is a small application which scans a given collection directory,
computes the hash for each file, and reports whether the hash succeeds in
distinguishing all files.
I have run this on my collection, but I would ask you to repeat testing with
your collections to find out if it works for you as well:
qmake testhash.pro
make
./testhash /toplevel/directory/to/your/collection
On my machine it takes 15s per 1000 files.
At the end, it will tell you whether any files failed or whether it succeeded.
If it fails, it would be interesting to find out if the files are actually very
similar, and if they have the same file size. (A hard failure would be two
dissimilar files with the same file size.)
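
The failure check described above can be sketched roughly as follows, again in plain C++ and not taken from the attached main.cpp. FileEntry, Collision and findCollisions are hypothetical names. The sketch only flags groups in which at least two files share a size as candidates for a hard failure; whether the images are actually dissimilar still has to be judged by looking at them.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <vector>

// One scanned file: path, size in bytes, and its hash as a string.
struct FileEntry {
    std::string path;
    std::uint64_t size;
    std::string hash;
};

// A group of files that share one hash value.
struct Collision {
    std::vector<FileEntry> files;
    bool anySameSize;   // true if at least two files in the group have equal size
};

// Group entries by hash and return every group with more than one file.
std::vector<Collision> findCollisions(const std::vector<FileEntry>& entries)
{
    std::map<std::string, std::vector<FileEntry>> byHash;
    for (const FileEntry& e : entries)
        byHash[e.hash].push_back(e);

    std::vector<Collision> result;
    for (const auto& kv : byHash) {
        const std::vector<FileEntry>& group = kv.second;
        if (group.size() < 2)
            continue;                       // hash is unique, nothing to report

        std::set<std::uint64_t> sizes;
        for (const FileEntry& e : group)
            sizes.insert(e.size);

        // Fewer distinct sizes than files means two files share a size.
        const bool anySameSize = sizes.size() < group.size();
        result.push_back({group, anySameSize});
    }
    return result;
}
```

An empty result means the hash distinguished all files. A group with anySameSize set is the case worth reporting back, since the file size cannot help to tell those files apart.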
Thanks
Marcel
-------------- next part --------------
A non-text attachment was scrubbed...
Name: main.cpp
Type: text/x-c++src
Size: 4939 bytes
Desc: not available
URL: <http://mail.kde.org/pipermail/digikam-devel/attachments/20101209/11775214/attachment.cpp>
-------------- next part --------------
SOURCES += main.cpp
CONFIG += qt debug