[Digikam-users] Images.uniquehash calculation

Mon Jun 17 19:49:26 BST 2013

On Monday 17 June 2013 19:24:45 Marcel Wiesweg wrote:
> 
> >     Disclaimer: probably this is not the right list to ask this. if so,
> > just let me know. also, I'm not subscribed, so please CC me in the
> > answers.
> > 
> >     I'm trying to write a script that is able to take an image already
> > in digikam's database and resize it, apply the same tags as the
> > original, and possibly remove the original. so far the idea is that this
> > script will be independent of digikam, touching its database when
> > needed. so I checked the database structure and it looks ok, except for
> > the md5sum. I tried to reimplement DImgLoader::uniqueHashV2() in
> > libs/dimg/loaders/dimgloader.cpp:329, and even reimplementing it in
> > python with the same libraries (qt4's md5) and copying the algo line by
> > line, I get different values in the database and with the script. am I
> > missing something? for comparison, I attach the script I did.
> 
> That's the fun of a hash... Well, I don't know.
> For debugging, I would record the binary data you feed into the hash in
> Python and C++ to a file and compare the two. If they differ, you'll be
> able to locate the problem. If not, there's a difference in the hash
> implementation, but I doubt that.
> 
> Marcel

According to the code, the same hashing routine is used (not only the same 
algorithm, but actually the same implementation).

There is one difference between the two routines, though:
- in the digiKam C++ routine, a data block is only fed to the hash if data
was actually read
- in the Python routine, this check is omitted, and the data block is added
to the data to be hashed /unconditionally/.

For the second data block (the last 100 kB), which is read right after a
seek, that could make a difference if the file is smaller than 100 kB:
- in C++, the file is probably in an error state after the seek, so no data
is read and the second data block is not fed to the hash routine.
- in Python, the data block /is/ fed, but will probably contain rubbish if
the file is smaller than 100 kB...
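
A minimal sketch of what the Python side could look like with the same
checks in place. This is not the digiKam code: it uses hashlib's MD5 instead
of Qt's, the function name and BLOCK_SIZE are made up for illustration, and
it assumes that a seek to before the start of the file simply yields no
second block.

```python
import hashlib
import os

# The "100 kB" from the discussion; the exact value is an assumption.
BLOCK_SIZE = 100 * 1024


def head_tail_md5(path, block_size=BLOCK_SIZE):
    """MD5 over the first and last block of a file, skipping empty reads."""
    md5 = hashlib.md5()
    file_size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(block_size)
        if head:  # feed the block only if data was actually read
            md5.update(head)

        tail_offset = file_size - block_size
        # Assumed behaviour: for files shorter than one block the seek
        # fails, so no second block is read or hashed.
        if tail_offset >= 0:
            f.seek(tail_offset)
            tail = f.read(block_size)
            if tail:
                md5.update(tail)
    return md5.hexdigest()
```

If the values still differ from the database for small files only, the
assumption about the failed seek is the first thing I would check.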

Also, if the Python script changes anything in the metadata (e.g. by
recording the correct image size...), the first 100 kB will differ.
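
To see which of these applies, Marcel's suggestion of dumping the bytes fed
to each hash and comparing them could be done roughly like this. It is only
a sketch: the helper names are hypothetical, and it assumes both sides have
already written their hash input to a file.

```python
def first_difference(a: bytes, b: bytes):
    """Return the offset of the first differing byte, or None if equal."""
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    if len(a) != len(b):
        # One dump is a prefix of the other, e.g. a block was skipped.
        return min(len(a), len(b))
    return None


def compare_dumps(path_a, path_b):
    """Compare two dump files holding the bytes that were fed to the hash."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        return first_difference(fa.read(), fb.read())
```

A difference near the start would point at changed metadata, one at or
after the 100 kB mark at the handling of the second block.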

Remco