<!DOCTYPE html>
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p>Dear digiKam fans and users,</p>
<p>I'm trying to run "Find duplicates" on a collection of about
300,000 images. With SQLite, I was able to scan the collection and
"Update fingerprints", but it crashed during "Find duplicates".
I then moved from SQLite to MySQL, and I'm waiting (right now) to
see whether "Find duplicates" will complete. While
waiting, I looked into the database and found the ImageHaarMatrix
table. After seeing it, I put together <a
href="https://github.com/kenberland/digikam-pgvector">this
demonstration</a> of using vector search instead of comparing
the Haar matrix of each image. Here is the benchmark summary:<br>
<br>
<font face="monospace">--- Benchmark Summary ---<br>
Runs: 5<br>
<br>
--- Individual Run Times ---<br>
Run 1: MySQL: 8.8436s, PostgreSQL: 0.0765s<br>
Run 2: MySQL: 8.9818s, PostgreSQL: 0.0666s<br>
Run 3: MySQL: 8.9786s, PostgreSQL: 0.0713s<br>
Run 4: MySQL: 8.7938s, PostgreSQL: 0.0658s<br>
Run 5: MySQL: 9.1870s, PostgreSQL: 0.0636s<br>
<br>
--- Average Times ---<br>
MySQL (simulated search): 8.9570 seconds<br>
PostgreSQL (pgvector search): 0.0688 seconds<br>
<br>
Improvement Factor: 130.25x</font></p>
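<p>To illustrate what "vector search" means here, the following is a
minimal pure-Python sketch, not the code from the demonstration. It
assumes each image fingerprint has already been flattened into a
fixed-length list of floats; the four vectors, the image ids, and the
function name <font face="monospace">nearest</font> are made up for
illustration. pgvector performs the same kind of distance ranking
inside PostgreSQL and can use an index instead of scanning every
row.</p>
<pre>
```python
import heapq
import math


def nearest(query_id, vectors, k=3):
    """Return the k closest (distance, image_id) pairs to query_id.

    vectors maps image id to a fixed-length list of floats.
    The query image itself is excluded from the result.
    """
    query = vectors[query_id]
    candidates = (
        (math.dist(query, vector), image_id)  # Euclidean (L2) distance
        for image_id, vector in vectors.items()
        if image_id != query_id
    )
    return heapq.nsmallest(k, candidates)


# Hypothetical fingerprints: 1 and 2 are near-duplicates, as are 3 and 4.
vectors = {
    1: [0.10, 0.20, 0.30, 0.40],
    2: [0.10, 0.20, 0.30, 0.41],
    3: [0.90, 0.80, 0.10, 0.00],
    4: [0.88, 0.81, 0.10, 0.02],
}

if __name__ == "__main__":
    for image_id in vectors:
        distance, match = nearest(image_id, vectors, k=1)[0]
        print(f"image {image_id}: closest is {match} at distance {distance:.4f}")
```
</pre>
<p>A small distance suggests a likely duplicate. Which distance metric
and cutoff suit the Haar fingerprints is something the sketch leaves
open.</p>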
<p>If "Find duplicates" crashes again, I'll probably write a
script that removes the duplicates using the pgvector information.</p>
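<p>One possible shape for such a script, as a rough sketch only: it
assumes a similarity query has already produced
<font face="monospace">(id_a, id_b, distance)</font> tuples, and the
threshold, the sample pairs, and the function name
<font face="monospace">group_duplicates</font> are hypothetical. It
groups images that are linked by close matches and only lists removal
candidates; nothing is deleted.</p>
<pre>
```python
def group_duplicates(pairs, threshold):
    """Group image ids connected by pairs whose distance is within threshold.

    Uses a small union-find so that chains (1~2, 2~5) land in one group.
    Returns a sorted list of sorted groups, each with at least two ids.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, distance in pairs:
        if threshold >= distance:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                # Keep the smaller id as the representative of the group.
                parent[max(root_a, root_b)] = min(root_a, root_b)

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), []).append(x)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical query output: (image id, image id, distance).
pairs = [(1, 2, 0.01), (2, 5, 0.02), (3, 4, 0.03), (1, 3, 1.10)]
groups = group_duplicates(pairs, threshold=0.05)

# Dry run: keep the first id of each group, list the rest as candidates.
to_remove = [image_id for group in groups for image_id in group[1:]]

if __name__ == "__main__":
    print("groups:", groups)
    print("removal candidates:", to_remove)
```
</pre>
<p>Keeping the lowest id is an arbitrary choice; a real script would
more likely keep the largest file or the highest resolution, and
should be reviewed by hand before anything is removed.</p>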
<p>-KB<br>
</p>
</body>
</html>