Speed of Similarity/Duplicates

Fri Aug 29 16:00:43 BST 2025

Dear digiKam fans and users,

This email is about potentially replacing the Haar matrix search with 
pgvector.

I'm trying to run "Find Duplicates" on a collection of about 300,000 
images. With SQLite I was able to scan the collection and "Update 
fingerprints", but it crashed during "Find Duplicates". I then moved 
from SQLite to MySQL, and I'm waiting right now to see whether "Find 
Duplicates" will complete. While waiting, I looked into the database and 
found the ImageHaarMatrix table. Upon seeing it, I put together this 
demonstration <https://github.com/kenberland/digikam-pgvector> of using 
vector search instead of comparing the Haar matrix for each image. Here 
is the benchmark's summary:

--- Benchmark Summary ---
Runs: 5

--- Individual Run Times ---
Run 1: MySQL: 8.8436s, PostgreSQL: 0.0765s
Run 2: MySQL: 8.9818s, PostgreSQL: 0.0666s
Run 3: MySQL: 8.9786s, PostgreSQL: 0.0713s
Run 4: MySQL: 8.7938s, PostgreSQL: 0.0658s
Run 5: MySQL: 9.1870s, PostgreSQL: 0.0636s

--- Average Times ---
MySQL (simulated search): 8.9570 seconds
PostgreSQL (pgvector search): 0.0688 seconds

Improvement Factor: 130.25x
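
As a sanity check on the summary above, the averages and the ratio can be 
recomputed from the per-run times. This is only a minimal sketch using the 
rounded figures printed in this email, not the benchmark script from the 
repository; the variable names are my own. Small differences in the last 
digits of the factor would be expected if the benchmark worked from 
unrounded timings.

```python
# Per-run times in seconds, copied from the summary in this email.
mysql_runs = [8.8436, 8.9818, 8.9786, 8.7938, 9.1870]
pg_runs = [0.0765, 0.0666, 0.0713, 0.0658, 0.0636]

# Arithmetic mean of each series.
mysql_avg = sum(mysql_runs) / len(mysql_runs)
pg_avg = sum(pg_runs) / len(pg_runs)

# Speed-up expressed as the ratio of the two averages.
factor = mysql_avg / pg_avg

print(f"MySQL average:      {mysql_avg:.4f} s")
print(f"PostgreSQL average: {pg_avg:.4f} s")
print(f"Improvement factor: {factor:.2f}x")
```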

If "Find Duplicates" crashes again, I'll probably write a script to 
remove the duplicates using the pgvector information.
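
For illustration, here is a rough sketch of how such a removal script 
might group duplicate candidates. Everything in it is an assumption: the 
table and column names (image_vectors, image_id, embedding), the distance 
threshold, and the toy vectors are hypothetical and not taken from digiKam 
or from the demonstration repository. The only pgvector feature it relies 
on is the `<->` operator, which orders by L2 distance. The in-memory loop 
below compares all pairs purely to keep the example self-contained; in 
practice the candidate pairs would come from the database query.

```python
import math
from itertools import combinations

# Hypothetical nearest-neighbour query a script might send to PostgreSQL.
# Table and column names are assumptions; "<->" is pgvector's L2 distance.
NEAREST_SQL = """
SELECT image_id, embedding <-> %(query)s AS distance
FROM image_vectors
ORDER BY embedding <-> %(query)s
LIMIT 10
"""


def group_duplicates(vectors, threshold):
    """Group image ids whose vectors lie within `threshold` of each other.

    `vectors` maps image id -> list of floats. Groups are built with a
    small union-find, so chains of near-identical images end up together.
    """
    parent = {image_id: image_id for image_id in vectors}

    def find(image_id):
        # Follow parent links to the group's representative,
        # shortening the path along the way.
        while parent[image_id] != image_id:
            parent[image_id] = parent[parent[image_id]]
            image_id = parent[image_id]
        return image_id

    for a, b in combinations(vectors, 2):
        if math.dist(vectors[a], vectors[b]) <= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for image_id in vectors:
        groups.setdefault(find(image_id), []).append(image_id)

    # Only groups with more than one member are duplicate candidates.
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


# Toy data: images 1/2 and 3/4 are near-identical, image 5 is unique.
vectors = {
    1: [0.0, 0.0],
    2: [0.0, 0.1],
    3: [5.0, 5.0],
    4: [5.0, 5.05],
    5: [9.0, 0.0],
}
duplicate_groups = group_duplicates(vectors, threshold=0.2)
print(duplicate_groups)
```

A real script would then keep one image per group and flag the rest for 
review rather than deleting them outright.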

-KB