[KPhotoAlbum] Search performance (at least) linear in number of search terms?

Robert Krawitz rlk at alum.mit.edu
Sun Jul 16 04:34:31 BST 2017


On Sun,  9 Jul 2017 20:00:21 -0400 (EDT), Robert Krawitz wrote:
> I'm reorganizing my database as follows:
>
> I have traditionally applied a keyword to each set of photos.  When I
> select the photos I'm using, I apply a second keyword named "$keyword
> selection" (for appropriate value of $keyword).
>
> This led to a lot of clutter, so I've created a new keyword named "All
> Selections", which I apply to the selected images.  So I can select the
> images I want by searching for the AND of the keyword and All Selections.
>
> I went about this via the search dialog, by selecting all of the
> keywords named "$keyword selection" (something like 100 keywords,
> IIRC), and then applying the new keyword to that.  This procedure was
> very time-consuming; it took maybe a minute on my Core i7-920XM.  This
> seems too long for 222,000 images and 100 keywords (each image usually
> has no more than two keywords).
> trace (which took me about 30 minutes to collect, using only 40
> keywords) and haven't been able thus far to work out quite what's
> going on in there.  For reference, it made about 80 million calls each
> to DB::ImageInfo::hasCategoryInfo(QString const&, QString const&) and DB::ImageInfo::hasCategoryInfo(QString const&, QSet<QString> const&).
>
>
> It appears that most of the work is done by intersecting the set of
> tags applied to the image and the set of tags being searched for, but
> the way this intersection is done results in a lot of string hashes
> being calculated (as opposed to the hashes being cached).
>
> Furthermore, since each keyword (or other category member) has a
> unique integer assigned, the matching could be done by integer
> comparisons.  That wouldn't work if someone wanted to do a search on a
> wildcard/regexp, but it appears we don't allow that anyway.
>
> So I'm still trying to puzzle my way through this.

Well, I did find one piece of really low-hanging fruit: the overlap
function (which, mind you, is used in only one place, so Set.cpp and
Set.h might just as well be removed altogether) was computing the full
intersection of the two sets of strings rather than just determining
whether any intersection exists.

For my test case, it cut the time from about 18 seconds to maybe 13.
Given only 222,000 images and a search over about 40 tags, it should
be able to do much better than that, but this is at least a step in
the right direction.

If you want, I'll do a patch that removes Set.h and Set.cpp altogether.

diff --git a/DB/ImageInfo.cpp b/DB/ImageInfo.cpp
index d5b8cb25..c5897fe7 100644
--- a/DB/ImageInfo.cpp
+++ b/DB/ImageInfo.cpp
@@ -132,7 +132,7 @@ bool ImageInfo::hasCategoryInfo( const QString& key, const QString& value ) cons
 
 bool DB::ImageInfo::hasCategoryInfo( const QString& key, const StringSet& values ) const
 {
-    return Utilities::overlap( m_categoryInfomation[key], values );
+    return values.intersects( m_categoryInfomation[key] );
 }
 
-- 
Robert Krawitz                                     <rlk at alum.mit.edu>

***  MIT Engineers   A Proud Tradition   http://mitathletics.com  ***
Member of the League for Programming Freedom  --  http://ProgFree.org
Project lead for Gutenprint   --    http://gimp-print.sourceforge.net

"Linux doesn't dictate how I work, I dictate how Linux works."
--Eric Crampton


