D13747: Fuzzy filename search for Baloo

Stefan BrĂ¼ns noreply at phabricator.kde.org
Sat Jul 14 16:53:08 BST 2018


bruns added a comment.


  I think this design has several problems:
  
  1. adding or renaming documents becomes very write heavy
    - consider adding e.g. foobar.png - this will add 7 bigrams. Each bigram has an associated list of matching documents. The document list is sorted. This means we have 7 read-modify-write operations.
  
  2. Lots of additional data
    - to be fast, we have to keep this data in memory. We also have to fetch it from disk at startup.
  
  3. It ties the searching algorithm to the data structure
  
  4. The searching may become inefficient when the data set becomes large
    - The lookup of each bigram is fast
    - You have to lookup several bigrams when the search term becomes longer
    - You have to evaluate all result sets and combine them in some way

INLINE COMMENTS

> fuzzysearchtest.cpp:63
> +}
> +
> +void FuzzySearchTest::testFeatures()

The test cases above are good in general, as these test a small aspect.
Improvement:
These are not different test cases, but iterations of the same test. This calls for data driven testing, see:
http://doc.qt.io/qt-5/qttestlib-tutorial2-example.html

> fuzzysearchtest.cpp:69
> +  QMap<FuzzyFeature, FuzzyDataList> correct;
> +
> +  auto make = [](quint8 wid, quint8 len) -> FuzzyDataList {

I think this test should be written in a different way:

- check if the map has the correct size
- check if each entry has the correct docId
- check the terms:

  for (const auto& feat : { "no", "ot", "te", "es"} ) {
      QCOMPARE(exported[feat].size(), 1); // one document
      QCOMPARE(exported[feat][0].wid, 0); // first word
      QCOMPARE(exported[feat][0].len, 5); // strlen("notes")
  }
  ...
  QCOMPARE(exported["md"][0].wid, 3);
  QCOMPARE(exported["md"][0].len, 2);

> fuzzysearchtest.cpp:119
> +  db.unite(FuzzySearch::features(2, file2));
> +  db.unite(FuzzySearch::features(3, file3));
> +  db.unite(FuzzySearch::features(4, file4));

You should have a separate test case here, or even 2 - you should test if the merging works as expected:

- for same feats from different documents ("notes")
- for same feat from one documents ("not_es_", "wedn_es_day", i.e. "es")

> fuzzysearchtest.cpp:129
> +  db.unite(FuzzySearch::features(12, file12));
> +
> +  FuzzyDataList list;

If you create a list of the files, you can use a loop for the insertion here.

  QList<QStringList> files = {
      {"notes", "april8", "2018", "md"},
      {"notes", "wednesday", "04092018", "md"},
  ...
      {"LMC2200"}
  }
  ...
  for (size_t i = 0; i < files.size(); i++) {
    db.unite(FuzzySearch::features(i, files[i]));
  }

> fuzzysearchtest.cpp:131
> +  FuzzyDataList list;
> +  FuzzyDataGetter getter = [&](const FuzzyFeature& feat) -> const FuzzyDataList& {
> +    list.m_datalist.clear();

This function itself calls for a unit test - for a given feature, return the matching documents
Also, either capture the list, or return it (then, by value, not by reference), but don't do both. I strongly prefer return by value here.

> fuzzysearchtest.cpp:143
> +  results = fuzzy.search(QString("wensday"), getter);
> +  QCOMPARE(results, QList<quint64>({ 2 }));
> +

If you create a `QSet<QStringlist> foundFiles = files[result[0]];`, you can compare it with `QSet<QStringList>({{"notes", "wednesday", "04092018", "md"}});`
This way, it becomes obvious what you expect as output here.
To make it more obvious what the QStringList refers to, you can use `using FileNameTerms = QStringList;`

> CMakeLists.txt:18
>      postingdb.cpp
> +    fuzzydb.cpp
>      postingiterator.cpp

Please remove the db from the patch, this should be a separate patch, after the fuzzy matcher itself is paved out.

> database.cpp:25
>  #include "postingdb.h"
> +#include "fuzzydb.h"
>  #include "documentdb.h"

later ...

> database.cpp:108
>       */
> -    mdb_env_set_maxdbs(m_env, 12);
> +    mdb_env_set_maxdbs(m_env, 13);
>  

later ...

> fuzzysearch.cpp:80
> +      // Keep track of the score of each document and its length
> +      auto score = scores.find(doc);
> +      if (score == scores.end()) {

You are matching for document *and* term here, I don't think thats what you want.

REPOSITORY
  R293 Baloo

REVISION DETAIL
  https://phabricator.kde.org/D13747

To: michaeleden, vhanda, #baloo
Cc: bruns, kde-frameworks-devel, #baloo, ashaposhnikov, michaelh, astippich, spoorun, ngraham, abrahams
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.kde.org/pipermail/kde-frameworks-devel/attachments/20180714/e162a6b4/attachment-0001.html>


More information about the Kde-frameworks-devel mailing list