Speed of Similarity/Duplicates
Ken Berland
ken at hero.net
Fri Aug 29 00:12:46 BST 2025
Dear digiKam fans and users,
I'm trying to "Find duplicates" on a collection of about 300,000
images. I was able to scan the collection and "Update fingerprints" with
SQLite, but it crashed during "Find duplicates". I then moved from
SQLite to MySQL and am waiting right now to see whether "Find duplicates"
will complete. While I was waiting, I looked into the database and found
the ImageHaarMatrix table. Upon seeing it, I put together this
demonstration <https://github.com/kenberland/digikam-pgvector> of using
vector search instead of comparing the Haar matrix for each image. Here
is the benchmark's summary:
--- Benchmark Summary ---
Runs: 5
--- Individual Run Times ---
Run 1: MySQL: 8.8436s, PostgreSQL: 0.0765s
Run 2: MySQL: 8.9818s, PostgreSQL: 0.0666s
Run 3: MySQL: 8.9786s, PostgreSQL: 0.0713s
Run 4: MySQL: 8.7938s, PostgreSQL: 0.0658s
Run 5: MySQL: 9.1870s, PostgreSQL: 0.0636s
--- Average Times ---
MySQL (simulated search): 8.9570 seconds
PostgreSQL (pgvector search): 0.0688 seconds
Improvement Factor: 130.25x
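
In case it's useful, here's a minimal sketch of the kind of per-image
lookup the pgvector side boils down to. The table and column names, the
vector dimensionality, and the distance threshold are placeholders I've
made up for illustration here, not the actual schema from the repository
linked above.

# Rough sketch only -- table/column names, dimensionality, and threshold
# are placeholders, not the schema from the demonstration repo.
#
# Assumed one-time setup in PostgreSQL with the pgvector extension:
#   CREATE EXTENSION IF NOT EXISTS vector;
#   CREATE TABLE image_signatures (
#       imageid   bigint PRIMARY KEY,
#       signature vector(128)          -- flattened Haar signature
#   );
#   CREATE INDEX ON image_signatures USING hnsw (signature vector_l2_ops);

import psycopg

DISTANCE_THRESHOLD = 0.1  # placeholder L2 cut-off for "looks like a duplicate"

def near_duplicates(conn, imageid, limit=10):
    """Images whose stored signature is within DISTANCE_THRESHOLD of imageid's."""
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT s.imageid, s.signature <-> t.signature AS distance
            FROM image_signatures AS s
            JOIN image_signatures AS t ON t.imageid = %s
            WHERE s.imageid <> t.imageid
              AND s.signature <-> t.signature < %s
            ORDER BY distance
            LIMIT %s
            """,
            (imageid, DISTANCE_THRESHOLD, limit),
        )
        return cur.fetchall()

if __name__ == "__main__":
    with psycopg.connect("dbname=digikam_vectors") as conn:
        for other_id, dist in near_duplicates(conn, imageid=42):
            print(f"image {other_id} at L2 distance {dist:.4f}")
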
If finding duplicates crashes again, I'll probably create a script to
remove them using the pgvector information.
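A rough outline of what such a clean-up script could look like, using the
same placeholder schema as the sketch above: keep the lowest imageid in
each near-duplicate group and flag the rest for review. It only prints
candidates; actually removing files would still be a separate step.

# Same placeholder schema as above. Lists removal candidates only;
# it does not delete files or touch digiKam's own database.

import psycopg

DISTANCE_THRESHOLD = 0.1  # placeholder L2 cut-off

def removal_candidates(conn):
    """For each near-duplicate group, keep the lowest imageid and flag the rest."""
    flagged = set()
    with conn.cursor() as cur:
        cur.execute("SELECT imageid FROM image_signatures ORDER BY imageid")
        all_ids = [row[0] for row in cur.fetchall()]
        for imageid in all_ids:
            if imageid in flagged:
                continue  # already marked as a duplicate of an earlier image
            cur.execute(
                """
                SELECT s.imageid
                FROM image_signatures AS s
                JOIN image_signatures AS t ON t.imageid = %s
                WHERE s.imageid > t.imageid
                  AND s.signature <-> t.signature < %s
                """,
                (imageid, DISTANCE_THRESHOLD),
            )
            flagged.update(dup for (dup,) in cur.fetchall())
    return sorted(flagged)

if __name__ == "__main__":
    with psycopg.connect("dbname=digikam_vectors") as conn:
        for imageid in removal_candidates(conn):
            print(imageid)
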
-KB