Speed of Similarity/Duplicates

Gilles Caulier caulier.gilles at gmail.com
Mon Sep 1 07:01:53 BST 2025


Hi,

How did you test the PostgreSQL performance? digiKam does not support
this kind of database yet...

Using pgvector will not be portable, I think. How will SQLite support it?

Can you compare SQLite vs. PostgreSQL?

Best regards

Gilles Caulier

On Fri, Aug 29, 2025 at 17:01, Ken Berland <ken at hero.net> wrote:
>
> Dear digiKam fans and users,
>
> This email is about potentially replacing the Haar matrix search with pgvector.
>
> I'm trying to "Find duplicates" on a collection with about 300,000 images. I was able to scan the collection and "Update fingerprints" with SQLite, but it crashed during "Find duplicates". Then I moved from SQLite to MySQL and am waiting (right now) to see whether "Find duplicates" will complete. While I was waiting, I looked into the database and found the ImageHaarMatrix table. Upon seeing it, I put together this demonstration of using vector search instead of comparing the Haar matrix for each image. Here is the benchmark's summary; a rough sketch of the query at its core follows the numbers below:
>
> --- Benchmark Summary ---
> Runs: 5
>
> --- Individual Run Times ---
> Run 1: MySQL: 8.8436s, PostgreSQL: 0.0765s
> Run 2: MySQL: 8.9818s, PostgreSQL: 0.0666s
> Run 3: MySQL: 8.9786s, PostgreSQL: 0.0713s
> Run 4: MySQL: 8.7938s, PostgreSQL: 0.0658s
> Run 5: MySQL: 9.1870s, PostgreSQL: 0.0636s
>
> --- Average Times ---
> MySQL (simulated search): 8.9570 seconds
> PostgreSQL (pgvector search): 0.0688 seconds
>
> Improvement Factor: 130.25x
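>
> Roughly, the demonstration boils down to something like this (only a minimal sketch in Python with psycopg2; the table name, column names, and vector dimension are placeholders I picked for the test, not digiKam's actual schema):
>
> import psycopg2
>
> DIM = 128  # placeholder dimensionality for the flattened Haar signature
>
> conn = psycopg2.connect("dbname=digikam_test")
> cur = conn.cursor()
>
> # One-time setup: enable pgvector and store one signature vector per image.
> cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
> cur.execute(
>     f"CREATE TABLE IF NOT EXISTS image_vectors "
>     f"(imageid BIGINT PRIMARY KEY, sig vector({DIM}))"
> )
> # An approximate index keeps nearest-neighbour queries fast on large collections.
> cur.execute(
>     "CREATE INDEX IF NOT EXISTS image_vectors_sig_idx "
>     "ON image_vectors USING ivfflat (sig vector_l2_ops) WITH (lists = 100)"
> )
> conn.commit()
>
> def nearest(signature, k=10):
>     """Return the k images whose signatures are closest (L2 distance) to the given one."""
>     literal = "[" + ",".join(str(x) for x in signature) + "]"
>     cur.execute(
>         "SELECT imageid, sig <-> %s::vector AS dist FROM image_vectors "
>         "ORDER BY sig <-> %s::vector LIMIT %s",
>         (literal, literal, k),
>     )
>     return cur.fetchall()
>
> The search itself is that single indexed ORDER BY ... LIMIT query; that is where the speedup over re-comparing every Haar matrix comes from.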
>
> If finding duplicates crashes again, I'll probably create a script to remove them using the pgvector information.
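>
> Something along these lines (again only a sketch, reusing the placeholder image_vectors table above; the distance threshold is arbitrary and would need tuning):
>
> def duplicate_pairs(cur, threshold=0.1, k=5):
>     """Yield (imageid, other_imageid, distance) for near-duplicate candidates."""
>     cur.execute("SELECT imageid, sig FROM image_vectors")
>     for imageid, sig in cur.fetchall():
>         # sig comes back in pgvector's text form ('[...]'), which can be cast straight back.
>         cur.execute(
>             "SELECT imageid, sig <-> %s::vector AS dist FROM image_vectors "
>             "WHERE imageid <> %s ORDER BY sig <-> %s::vector LIMIT %s",
>             (sig, imageid, sig, k),
>         )
>         for other, dist in cur.fetchall():
>             if dist <= threshold:
>                 yield imageid, other, dist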
>
> -KB

