<table><tr><td style="">bruns added a comment.
</td><a style="text-decoration: none; padding: 4px 8px; margin: 0 8px 8px; float: right; color: #464C5C; font-weight: bold; border-radius: 3px; background-color: #F7F7F9; background-image: linear-gradient(to bottom,#fff,#f1f0f1); display: inline-block; border: 1px solid rgba(71,87,120,.2);" href="https://phabricator.kde.org/D13747">View Revision</a></tr></table><br /><div><div><p>I think this design has several problems:</p>
<ol class="remarkup-list">
<li class="remarkup-list-item">adding or renaming documents becomes very write-heavy<ul class="remarkup-list">
<li class="remarkup-list-item">consider adding e.g. foobar.png: this adds 7 bigrams. Each bigram has an associated, sorted list of matching documents, so we end up with 7 read-modify-write operations.</li>
</ul></li>
</ol>
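<p>To make the write cost in point 1 concrete, here is a minimal sketch in plain C++. It is not the patch's code: <tt>PostingIndex</tt>, <tt>bigrams()</tt> and <tt>addDocument()</tt> are hypothetical names, the in-memory map only stands in for the on-disk database, and the file name is assumed to be split at non-alphanumeric characters (which is what gives 5 + 2 = 7 bigrams for foobar.png).</p>

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the on-disk index: bigram -> sorted document ids.
using PostingIndex = std::map<std::string, std::vector<std::uint64_t>>;

// Split the file name into alphanumeric words (assumption) and emit
// every pair of adjacent characters within each word.
std::vector<std::string> bigrams(const std::string& fileName)
{
    std::vector<std::string> result;
    std::string word;
    auto flush = [&]() {
        for (std::size_t i = 0; i + 1 < word.size(); ++i) {
            result.push_back(word.substr(i, 2));
        }
        word.clear();
    };
    for (unsigned char c : fileName) {
        if (std::isalnum(c)) {
            word.push_back(static_cast<char>(c));
        } else {
            flush();
        }
    }
    flush();
    return result;
}

// Adds one document and returns how many posting lists had to be
// read, modified and (in a real store) written back.
std::size_t addDocument(PostingIndex& index, std::uint64_t docId,
                        const std::string& fileName)
{
    std::size_t touched = 0;
    for (const std::string& feat : bigrams(fileName)) {
        std::vector<std::uint64_t>& docs = index[feat];   // read
        auto pos = std::lower_bound(docs.begin(), docs.end(), docId);
        if (pos == docs.end() || *pos != docId) {
            docs.insert(pos, docId);                      // modify, keeps the list sorted
        }
        ++touched;                                        // one write-back per bigram
    }
    return touched;
}
```

<p>With these assumptions, <tt>addDocument(index, 42, "foobar.png")</tt> touches 7 posting lists, one per bigram, and a rename pays that cost again for the new name on top of removing the old entries.</p>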
<ol class="remarkup-list" start="2">
<li class="remarkup-list-item">Lots of additional data<ul class="remarkup-list">
<li class="remarkup-list-item">to be fast, we have to keep this data in memory. We also have to fetch it from disk at startup.</li>
</ul></li>
</ol>
<ol class="remarkup-list" start="3">
<li class="remarkup-list-item">It ties the searching algorithm to the data structure</li>
</ol>
<ol class="remarkup-list" start="4">
<li class="remarkup-list-item">The searching may become inefficient when the data set becomes large<ul class="remarkup-list">
<li class="remarkup-list-item">The lookup of each bigram is fast</li>
<li class="remarkup-list-item">You have to look up several bigrams when the search term becomes longer</li>
<li class="remarkup-list-item">You have to evaluate all result sets and combine them in some way</li>
</ul></li>
</ol></div></div><br /><div><strong>INLINE COMMENTS</strong><div><div style="margin: 6px 0 12px 0;"><div style="border: 1px solid #C7CCD9; border-radius: 3px;"><div style="padding: 0; background: #F7F7F7; border-color: #e3e4e8; border-style: solid; border-width: 0 0 1px 0; margin: 0;"><div style="color: #74777d; background: #eff2f4; padding: 6px 8px; overflow: hidden;"><a style="float: right; text-decoration: none;" href="https://phabricator.kde.org/D13747#inline-72660">View Inline</a><span style="color: #4b4d51; font-weight: bold;">fuzzysearchtest.cpp:63</span></div>
<div style="font: 11px/15px "Menlo", "Consolas", "Monaco", monospace; white-space: pre-wrap; clear: both; padding: 4px 0; margin: 0;"><div style="padding: 0 8px; margin: 0 4px; background: rgba(151, 234, 151, .6);"><span class="p">}</span>
</div><div style="padding: 0 8px; margin: 0 4px; background: rgba(151, 234, 151, .6);">
</div><div style="padding: 0 8px; margin: 0 4px; background: rgba(151, 234, 151, .6);"><span style="color: #aa4000">void</span> <span class="n">FuzzySearchTest</span><span style="color: #aa2211">::</span><span class="n">testFeatures</span><span class="p">()</span>
</div></div></div>
<div style="margin: 8px 0; padding: 0 12px;"><p style="padding: 0; margin: 8px;">The test cases above are good in general, as each one tests a small aspect.<br />
Improvement:<br />
These are not different test cases, but iterations of the same test. This calls for data-driven testing, see:<br />
<a href="http://doc.qt.io/qt-5/qttestlib-tutorial2-example.html" class="remarkup-link" target="_blank" rel="noreferrer">http://doc.qt.io/qt-5/qttestlib-tutorial2-example.html</a></p></div></div><br /><div style="border: 1px solid #C7CCD9; border-radius: 3px;"><div style="padding: 0; background: #F7F7F7; border-color: #e3e4e8; border-style: solid; border-width: 0 0 1px 0; margin: 0;"><div style="color: #74777d; background: #eff2f4; padding: 6px 8px; overflow: hidden;"><a style="float: right; text-decoration: none;" href="https://phabricator.kde.org/D13747#inline-72667">View Inline</a><span style="color: #4b4d51; font-weight: bold;">fuzzysearchtest.cpp:69</span></div>
<div style="font: 11px/15px "Menlo", "Consolas", "Monaco", monospace; white-space: pre-wrap; clear: both; padding: 4px 0; margin: 0;"><div style="padding: 0 8px; margin: 0 4px; background: rgba(151, 234, 151, .6);"> <span class="n">QMap</span><span style="color: #aa2211"><</span><span class="n">FuzzyFeature</span><span class="p">,</span> <span class="n">FuzzyDataList</span><span style="color: #aa2211">></span> <span class="n">correct</span><span class="p">;</span>
</div><div style="padding: 0 8px; margin: 0 4px; background: rgba(151, 234, 151, .6);">
</div><div style="padding: 0 8px; margin: 0 4px; background: rgba(151, 234, 151, .6);"> <span style="color: #aa4000">auto</span> <span class="n">make</span> <span style="color: #aa2211">=</span> <span class="p">[](</span><span class="n">quint8</span> <span class="n">wid</span><span class="p">,</span> <span class="n">quint8</span> <span class="n">len</span><span class="p">)</span> <span style="color: #aa2211">-></span> <span class="n">FuzzyDataList</span> <span class="p">{</span>
</div></div></div>
<div style="margin: 8px 0; padding: 0 12px;"><p style="padding: 0; margin: 8px;">I think this test should be written in a different way:</p>
<ul class="remarkup-list">
<li class="remarkup-list-item">check if the map has the correct size</li>
<li class="remarkup-list-item">check if each entry has the correct docId</li>
<li class="remarkup-list-item">check the terms:</li>
</ul>
<div class="remarkup-code-block" style="margin: 12px 0;" data-code-lang="text" data-sigil="remarkup-code-block"><pre class="remarkup-code" style="font: 11px/15px "Menlo", "Consolas", "Monaco", monospace; padding: 12px; margin: 0; background: rgba(71, 87, 120, 0.08);">for (const auto& feat : { "no", "ot", "te", "es"} ) {
QCOMPARE(exported[feat].size(), 1); // one document
QCOMPARE(exported[feat][0].wid, 0); // first word
QCOMPARE(exported[feat][0].len, 5); // strlen("notes")
}
...
QCOMPARE(exported["md"][0].wid, 3);
QCOMPARE(exported["md"][0].len, 2);</pre></div></div></div><br /><div style="border: 1px solid #C7CCD9; border-radius: 3px;"><div style="padding: 0; background: #F7F7F7; border-color: #e3e4e8; border-style: solid; border-width: 0 0 1px 0; margin: 0;"><div style="color: #74777d; background: #eff2f4; padding: 6px 8px; overflow: hidden;"><a style="float: right; text-decoration: none;" href="https://phabricator.kde.org/D13747#inline-72670">View Inline</a><span style="color: #4b4d51; font-weight: bold;">fuzzysearchtest.cpp:119</span></div>
<div style="font: 11px/15px "Menlo", "Consolas", "Monaco", monospace; white-space: pre-wrap; clear: both; padding: 4px 0; margin: 0;"><div style="padding: 0 8px; margin: 0 4px; background: rgba(151, 234, 151, .6);"> <span class="n">db</span><span class="p">.</span><span class="n">unite</span><span class="p">(</span><span class="n">FuzzySearch</span><span style="color: #aa2211">::</span><span class="n">features</span><span class="p">(</span><span style="color: #601200">2</span><span class="p">,</span> <span class="n">file2</span><span class="p">));</span>
</div><div style="padding: 0 8px; margin: 0 4px; background: rgba(151, 234, 151, .6);"> <span class="n">db</span><span class="p">.</span><span class="n">unite</span><span class="p">(</span><span class="n">FuzzySearch</span><span style="color: #aa2211">::</span><span class="n">features</span><span class="p">(</span><span style="color: #601200">3</span><span class="p">,</span> <span class="n">file3</span><span class="p">));</span>
</div><div style="padding: 0 8px; margin: 0 4px; background: rgba(151, 234, 151, .6);"> <span class="n">db</span><span class="p">.</span><span class="n">unite</span><span class="p">(</span><span class="n">FuzzySearch</span><span style="color: #aa2211">::</span><span class="n">features</span><span class="p">(</span><span style="color: #601200">4</span><span class="p">,</span> <span class="n">file4</span><span class="p">));</span>
</div></div></div>
<div style="margin: 8px 0; padding: 0 12px;"><p style="padding: 0; margin: 8px;">You should have a separate test case here, or even 2 - you should test if the merging works as expected:</p>
<ul class="remarkup-list">
<li class="remarkup-list-item">for the same feats from different documents ("notes")</li>
<li class="remarkup-list-item">for the same feat from one document ("not_es_", "wedn_es_day", i.e. "es")</li>
</ul></div></div><br /><div style="border: 1px solid #C7CCD9; border-radius: 3px;"><div style="padding: 0; background: #F7F7F7; border-color: #e3e4e8; border-style: solid; border-width: 0 0 1px 0; margin: 0;"><div style="color: #74777d; background: #eff2f4; padding: 6px 8px; overflow: hidden;"><a style="float: right; text-decoration: none;" href="https://phabricator.kde.org/D13747#inline-72671">View Inline</a><span style="color: #4b4d51; font-weight: bold;">fuzzysearchtest.cpp:129</span></div>
<div style="font: 11px/15px "Menlo", "Consolas", "Monaco", monospace; white-space: pre-wrap; clear: both; padding: 4px 0; margin: 0;"><div style="padding: 0 8px; margin: 0 4px; background: rgba(151, 234, 151, .6);"> <span class="n">db</span><span class="p">.</span><span class="n">unite</span><span class="p">(</span><span class="n">FuzzySearch</span><span style="color: #aa2211">::</span><span class="n">features</span><span class="p">(</span><span style="color: #601200">12</span><span class="p">,</span> <span class="n">file12</span><span class="p">));</span>
</div><div style="padding: 0 8px; margin: 0 4px; background: rgba(151, 234, 151, .6);">
</div><div style="padding: 0 8px; margin: 0 4px; background: rgba(151, 234, 151, .6);"> <span class="n">FuzzyDataList</span> <span class="n">list</span><span class="p">;</span>
</div></div></div>
<div style="margin: 8px 0; padding: 0 12px;"><p style="padding: 0; margin: 8px;">If you create a list of the files, you can use a loop for the insertion here.</p>
<div class="remarkup-code-block" style="margin: 12px 0;" data-code-lang="text" data-sigil="remarkup-code-block"><pre class="remarkup-code" style="font: 11px/15px "Menlo", "Consolas", "Monaco", monospace; padding: 12px; margin: 0; background: rgba(71, 87, 120, 0.08);">QList<QStringList> files = {
{"notes", "april8", "2018", "md"},
{"notes", "wednesday", "04092018", "md"},
...
{"LMC2200"}
};
...
for (int i = 0; i < files.size(); ++i) {
db.unite(FuzzySearch::features(i, files[i]));
}</pre></div></div></div><br /><div style="border: 1px solid #C7CCD9; border-radius: 3px;"><div style="padding: 0; background: #F7F7F7; border-color: #e3e4e8; border-style: solid; border-width: 0 0 1px 0; margin: 0;"><div style="color: #74777d; background: #eff2f4; padding: 6px 8px; overflow: hidden;"><a style="float: right; text-decoration: none;" href="https://phabricator.kde.org/D13747#inline-72672">View Inline</a><span style="color: #4b4d51; font-weight: bold;">fuzzysearchtest.cpp:131</span></div>
<div style="font: 11px/15px "Menlo", "Consolas", "Monaco", monospace; white-space: pre-wrap; clear: both; padding: 4px 0; margin: 0;"><div style="padding: 0 8px; margin: 0 4px; background: rgba(151, 234, 151, .6);"> <span class="n">FuzzyDataList</span> <span class="n">list</span><span class="p">;</span>
</div><div style="padding: 0 8px; margin: 0 4px; background: rgba(151, 234, 151, .6);"> <span class="n">FuzzyDataGetter</span> <span class="n">getter</span> <span style="color: #aa2211">=</span> <span class="p">[</span><span style="color: #aa2211">&</span><span class="p">](</span><span style="color: #aa4000">const</span> <span class="n">FuzzyFeature</span><span style="color: #aa2211">&</span> <span class="n">feat</span><span class="p">)</span> <span style="color: #aa2211">-></span> <span style="color: #aa4000">const</span> <span class="n">FuzzyDataList</span><span style="color: #aa2211">&</span> <span class="p">{</span>
</div><div style="padding: 0 8px; margin: 0 4px; background: rgba(151, 234, 151, .6);"> <span class="n">list</span><span class="p">.</span><span class="n">m_datalist</span><span class="p">.</span><span class="n">clear</span><span class="p">();</span>
</div></div></div>
<div style="margin: 8px 0; padding: 0 12px;"><p style="padding: 0; margin: 8px;">This function itself calls for a unit test: for a given feature, it should return the matching documents.<br />
Also, either capture the list, or return it (then, by value, not by reference), but don't do both. I strongly prefer return by value here.</p></div></div><br /><div style="border: 1px solid #C7CCD9; border-radius: 3px;"><div style="padding: 0; background: #F7F7F7; border-color: #e3e4e8; border-style: solid; border-width: 0 0 1px 0; margin: 0;"><div style="color: #74777d; background: #eff2f4; padding: 6px 8px; overflow: hidden;"><a style="float: right; text-decoration: none;" href="https://phabricator.kde.org/D13747#inline-72673">View Inline</a><span style="color: #4b4d51; font-weight: bold;">fuzzysearchtest.cpp:143</span></div>
<div style="font: 11px/15px "Menlo", "Consolas", "Monaco", monospace; white-space: pre-wrap; clear: both; padding: 4px 0; margin: 0;"><div style="padding: 0 8px; margin: 0 4px; background: rgba(151, 234, 151, .6);"> <span class="n">results</span> <span style="color: #aa2211">=</span> <span class="n">fuzzy</span><span class="p">.</span><span class="n">search</span><span class="p">(</span><span class="n">QString</span><span class="p">(</span><span style="color: #766510">"wensday"</span><span class="p">),</span> <span class="n">getter</span><span class="p">);</span>
</div><div style="padding: 0 8px; margin: 0 4px; background: rgba(151, 234, 151, .6);"> <span class="n">QCOMPARE</span><span class="p">(</span><span class="n">results</span><span class="p">,</span> <span class="n">QList</span><span style="color: #aa2211"><</span><span class="n">quint64</span><span style="color: #aa2211">></span><span class="p">({</span> <span style="color: #601200">2</span> <span class="p">}));</span>
</div></div></div>
<div style="margin: 8px 0; padding: 0 12px;"><p style="padding: 0; margin: 8px;">If you create a <tt style="background: #ebebeb; font-size: 13px;">QSet<QStringList> foundFiles = files[results[0]];</tt>, you can compare it with <tt style="background: #ebebeb; font-size: 13px;">QSet<QStringList>({{"notes", "wednesday", "04092018", "md"}});</tt><br />
This way, it becomes obvious what you expect as output here.<br />
To make it more obvious what the QStringList refers to, you can use <tt style="background: #ebebeb; font-size: 13px;">using FileNameTerms = QStringList;</tt></p></div></div><br /><div style="border: 1px solid #C7CCD9; border-radius: 3px;"><div style="padding: 0; background: #F7F7F7; border-color: #e3e4e8; border-style: solid; border-width: 0 0 1px 0; margin: 0;"><div style="color: #74777d; background: #eff2f4; padding: 6px 8px; overflow: hidden;"><a style="float: right; text-decoration: none;" href="https://phabricator.kde.org/D13747#inline-72674">View Inline</a><span style="color: #4b4d51; font-weight: bold;">CMakeLists.txt:18</span></div>
<div style="font: 11px/15px "Menlo", "Consolas", "Monaco", monospace; white-space: pre-wrap; clear: both; padding: 4px 0; margin: 0;"><div style="padding: 0 8px; margin: 0 4px; "> postingdb.cpp
</div><div style="padding: 0 8px; margin: 0 4px; background: rgba(151, 234, 151, .6);"> fuzzydb.cpp
</div><div style="padding: 0 8px; margin: 0 4px; "> postingiterator.cpp
</div></div></div>
<div style="margin: 8px 0; padding: 0 12px;"><p style="padding: 0; margin: 8px;">Please remove the db from this patch; it should be a separate patch, once the fuzzy matcher itself is sorted out.</p></div></div><br /><div style="border: 1px solid #C7CCD9; border-radius: 3px;"><div style="padding: 0; background: #F7F7F7; border-color: #e3e4e8; border-style: solid; border-width: 0 0 1px 0; margin: 0;"><div style="color: #74777d; background: #eff2f4; padding: 6px 8px; overflow: hidden;"><a style="float: right; text-decoration: none;" href="https://phabricator.kde.org/D13747#inline-72675">View Inline</a><span style="color: #4b4d51; font-weight: bold;">database.cpp:25</span></div>
<div style="font: 11px/15px "Menlo", "Consolas", "Monaco", monospace; white-space: pre-wrap; clear: both; padding: 4px 0; margin: 0;"><div style="padding: 0 8px; margin: 0 4px; "><span style="color: #304a96">#include</span> <span class="cpf">"postingdb.h"</span><span style="color: #304a96"></span>
</div><div style="padding: 0 8px; margin: 0 4px; background: rgba(151, 234, 151, .6);"><span style="color: #304a96">#include</span> <span class="cpf">"fuzzydb.h"</span><span style="color: #304a96"></span>
</div><div style="padding: 0 8px; margin: 0 4px; "><span style="color: #304a96">#include</span> <span class="cpf">"documentdb.h"</span><span style="color: #304a96"></span>
</div></div></div>
<div style="margin: 8px 0; padding: 0 12px;"><p style="padding: 0; margin: 8px;">later ...</p></div></div><br /><div style="border: 1px solid #C7CCD9; border-radius: 3px;"><div style="padding: 0; background: #F7F7F7; border-color: #e3e4e8; border-style: solid; border-width: 0 0 1px 0; margin: 0;"><div style="color: #74777d; background: #eff2f4; padding: 6px 8px; overflow: hidden;"><a style="float: right; text-decoration: none;" href="https://phabricator.kde.org/D13747#inline-72676">View Inline</a><span style="color: #4b4d51; font-weight: bold;">database.cpp:108</span></div>
<div style="font: 11px/15px "Menlo", "Consolas", "Monaco", monospace; white-space: pre-wrap; clear: both; padding: 4px 0; margin: 0;"><div style="padding: 0 8px; margin: 0 4px; "><span style="color: #74777d"> */</span>
</div><div style="padding: 0 8px; margin: 0 4px; background: rgba(251, 175, 175, .7);"> <span class="n">mdb_env_set_maxdbs</span><span class="p">(</span><span class="n">m_env</span><span class="p">,</span> <span style="color: #601200">1<span class="bright">2</span></span><span class="p">);</span>
</div><div style="padding: 0 8px; margin: 0 4px; background: rgba(151, 234, 151, .6);"> <span class="n">mdb_env_set_maxdbs</span><span class="p">(</span><span class="n">m_env</span><span class="p">,</span> <span style="color: #601200">1<span class="bright">3</span></span><span class="p">);</span>
</div></div></div>
<div style="margin: 8px 0; padding: 0 12px;"><p style="padding: 0; margin: 8px;">later ...</p></div></div><br /><div style="border: 1px solid #C7CCD9; border-radius: 3px;"><div style="padding: 0; background: #F7F7F7; border-color: #e3e4e8; border-style: solid; border-width: 0 0 1px 0; margin: 0;"><div style="color: #74777d; background: #eff2f4; padding: 6px 8px; overflow: hidden;"><a style="float: right; text-decoration: none;" href="https://phabricator.kde.org/D13747#inline-72677">View Inline</a><span style="color: #4b4d51; font-weight: bold;">fuzzysearch.cpp:80</span></div>
<div style="font: 11px/15px "Menlo", "Consolas", "Monaco", monospace; white-space: pre-wrap; clear: both; padding: 4px 0; margin: 0;"><div style="padding: 0 8px; margin: 0 4px; background: rgba(151, 234, 151, .6);"> <span style="color: #74777d">// Keep track of the score of each document and its length</span>
</div><div style="padding: 0 8px; margin: 0 4px; background: rgba(151, 234, 151, .6);"> <span style="color: #aa4000">auto</span> <span class="n">score</span> <span style="color: #aa2211">=</span> <span class="n">scores</span><span class="p">.</span><span class="n">find</span><span class="p">(</span><span class="n">doc</span><span class="p">);</span>
</div><div style="padding: 0 8px; margin: 0 4px; background: rgba(151, 234, 151, .6);"> <span style="color: #aa4000">if</span> <span class="p">(</span><span class="n">score</span> <span style="color: #aa2211">==</span> <span class="n">scores</span><span class="p">.</span><span class="n">end</span><span class="p">())</span> <span class="p">{</span>
</div></div></div>
<div style="margin: 8px 0; padding: 0 12px;"><p style="padding: 0; margin: 8px;">You are matching on document *and* term here; I don't think that's what you want.</p></div></div></div></div></div><br /><div><strong>REPOSITORY</strong><div><div>R293 Baloo</div></div></div><br /><div><strong>REVISION DETAIL</strong><div><a href="https://phabricator.kde.org/D13747">https://phabricator.kde.org/D13747</a></div></div><br /><div><strong>To: </strong>michaeleden, vhanda, Baloo<br /><strong>Cc: </strong>bruns, kde-frameworks-devel, Baloo, ashaposhnikov, michaelh, astippich, spoorun, ngraham, abrahams<br /></div>