Hi Marcel,<div><br></div><div>here are my results:</div><div><br></div><div><div>Directory scanning and hash generation took 35.9236 ms/file </div><div>Success: All 4557 files have a different hash. </div><div><br></div><div>


Also, I may be misunderstanding this, but wouldn't reading the beginning of the file be cheaper than reading the end in terms of I/O? To read the end you first have to move the file "cursor" to somewhere near the end, whereas for the beginning you just open and read.</div>
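<div>To make the question concrete, here is a minimal sketch in Python of the two access patterns. It is not the code from testhash; read_head, read_tail and CHUNK are hypothetical names, and 100 kB is assumed to mean 100 * 1024 bytes.</div>

```python
import os

CHUNK = 100 * 1024  # assumed block size: 100 kB taken as 100 * 1024 bytes


def read_head(path, chunk=CHUNK):
    """Open the file and read from offset 0; no seek is needed."""
    with open(path, "rb") as f:
        return f.read(chunk)


def read_tail(path, chunk=CHUNK):
    """Read the last `chunk` bytes; this needs one seek before the read."""
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        # Move the file position to `chunk` bytes before the end
        # (or to the start if the file is shorter than `chunk`).
        f.seek(max(size - chunk, 0))
        return f.read(chunk)
```

<div>The only extra step in read_tail is the size lookup plus the seek; whether that costs anything measurable would have to be timed on a real collection.</div>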


<div><br></div><div>Marty</div><br><div class="gmail_quote">On Thu, Dec 9, 2010 at 12:04, Marcel Wiesweg <span dir="ltr">&lt;<a href="mailto:marcel.wiesweg@gmx.de">marcel.wiesweg@gmx.de</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">


Hi,<br>

<br>

we are using an MD5 hash over parts of a file to uniquely identify images and<br>

display thumbnails. This has worked quite well, but recently I have seen two<br>

or three cases where the hash fails (same hash for completely different<br>

images).<br>

There is another problem with the current hash: it relies on a binary blob of<br>

the metadata produced by Exiv2, but this format is not guaranteed to be stable<br>

(possibly, the hash changes with a new Exiv2 version).<br>

<br>

The recommendation by Andreas Huggel was to simply use the first 100kB of a<br>

file, which will typically include the file header, the metadata, and reach<br>

actual image data.<br>

A variant would be to include the last 100kB as well.<br>
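<br>

A minimal sketch of such a partial-file hash, written in Python rather than taken from the attached application. The name partial_md5, its parameters, and reading 100 kB as 100 * 1024 bytes are assumptions for illustration only.<br>

```python
import hashlib
import os

CHUNK = 100 * 1024  # assumed meaning of "100kB": 100 * 1024 bytes


def partial_md5(path, include_tail=False, chunk=CHUNK):
    """Return the MD5 hex digest of the first `chunk` bytes of `path`.

    With include_tail=True, the last `chunk` bytes are hashed as well.
    """
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        md5.update(f.read(chunk))
        if include_tail:
            size = os.fstat(f.fileno()).st_size
            # Start the tail no earlier than the end of the head, so that
            # bytes of files smaller than 2 * chunk are not hashed twice.
            f.seek(max(size - chunk, min(chunk, size)))
            md5.update(f.read(chunk))
    return md5.hexdigest()
```

partial_md5(path) gives the head-only hash; partial_md5(path, include_tail=True) gives the variant that also covers the end of the file.<br>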

<br>

Attached is a small application which scans a given collection directory,<br>

creates the hash, and will output if the hash is successful in differentiating<br>

all files.<br>
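<br>

The check itself can be sketched roughly as follows. This is a Python illustration, not the attached Qt program: head_md5 and find_collisions are hypothetical names, every regular file under the directory is hashed (not only images), and only the first 100 * 1024 bytes are used.<br>

```python
import hashlib
import os
from collections import defaultdict


def head_md5(path, limit=100 * 1024):
    """MD5 hex digest of the first `limit` bytes of the file."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read(limit)).hexdigest()


def find_collisions(top):
    """Map each hash shared by two or more files under `top` to those files."""
    by_hash = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_hash[head_md5(path)].append(path)
    # Keep only the hashes that failed to tell files apart.
    return {h: sorted(paths) for h, paths in by_hash.items() if len(paths) > 1}
```

An empty result means every file received a different hash; otherwise each entry lists the files that share one, which can then be compared by size and content.<br>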

<br>

I have run this on my collection, but I would ask you to repeat testing with<br>

your collections to find out if it works for you as well:<br>

<br>

qmake testhash.pro<br>

make<br>

./testhash /toplevel/directory/to/your/collection<br>

<br>

Here it takes 15s per 1000 files.<br>

At the end, it will tell you if any files failed, or if it succeeded. If it<br>

fails, it would be interesting to find out if the files are actually very<br>

similar, and if they have the same file size. (A hard failure would be two<br>

dissimilar files with the same file size.)<br>

<br>

Thanks<br>

<font color="#888888">Marcel<br>

<br>

</font><br>_______________________________________________<br>

Digikam-devel mailing list<br>

<a href="mailto:Digikam-devel@kde.org">Digikam-devel@kde.org</a><br>

<a href="https://mail.kde.org/mailman/listinfo/digikam-devel" target="_blank">https://mail.kde.org/mailman/listinfo/digikam-devel</a><br>

<br></blockquote></div><br></div>