fuzzy-matching in quickopen...

Sun Sep 25 22:57:32 BST 2022

Rethinking some of the approaches, I think we can still improve in some
ways.

For example, matches that are not at the start but still occur at a
separator or camel hump could be scored higher.

The gap between two matched letters could be taken into account so that
really bad matches score lower, for example the image-match in your initial
email. This would also improve results for in-sequence matches versus
matches that sit at a separator but are too far apart. I need to think this
through carefully, though.
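
A rough sketch of the gap idea, only to make it concrete. The helper name,
the per-letter penalty and the cap are all assumptions of mine; nothing
like this exists in kfts_fuzzy_match.h today, and the weights are untuned.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: sums a penalty for the unmatched letters that lie
// between consecutive matched positions. `matchPositions` holds the
// ascending indices in the candidate string where each pattern letter
// was matched. Both weights are placeholder values, not tuned ones.
int gapPenalty(const std::vector<std::size_t> &matchPositions,
               int penaltyPerGapLetter = -2,
               int maxPenaltyPerGap = -10)
{
    int total = 0;
    for (std::size_t i = 1; i < matchPositions.size(); ++i) {
        // Number of skipped letters between two neighbouring matches.
        const int gap = static_cast<int>(matchPositions[i] - matchPositions[i - 1]) - 1;
        int penalty = gap * penaltyPerGapLetter;
        // Cap the penalty so one very long gap cannot dominate the score.
        if (penalty < maxPenaltyPerGap) {
            penalty = maxPenaltyPerGap;
        }
        total += penalty;
    }
    return total;
}
```

With these made-up weights, contiguous positions such as {3, 4, 5} get no
penalty, while {8, 9, 14} (two adjacent letters, then four skipped ones)
get -8.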

Camel/separator/acronym matches could be scored lower when a letter in
between is just a normal match and not a separator match, e.g.:

FooBarTaz with pattern "fot". F and T both get full separator scores even
though there is an o in between which didn't get matched, and thus it's
likely this isn't what the user wanted.
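
One possible reading of this, as a sketch only: a boundary match keeps its
full bonus only when the previously matched pattern letter was also a
boundary match, and gets a reduced bonus otherwise. The function names, the
bonus values and the separator set ('_' and ' ') are assumptions for
illustration, not the current implementation.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical predicate: true when str[i] starts a "word", i.e. it is the
// first letter, follows '_' or ' ', or is an uppercase letter that follows
// a lowercase one (camel hump).
bool isBoundary(const std::string &str, std::size_t i)
{
    if (i == 0) {
        return true;
    }
    const unsigned char prev = static_cast<unsigned char>(str[i - 1]);
    const unsigned char cur = static_cast<unsigned char>(str[i]);
    if (prev == '_' || prev == ' ') {
        return true;
    }
    return std::islower(prev) && std::isupper(cur);
}

// Hypothetical scoring step: full bonus for a boundary match only if the
// previous matched letter was a boundary match too, reduced bonus otherwise.
int boundaryBonus(const std::string &str,
                  const std::vector<std::size_t> &matchPositions,
                  int fullBonus = 10,
                  int reducedBonus = 5)
{
    int total = 0;
    bool prevWasBoundary = true; // the first pattern letter is always eligible
    for (std::size_t pos : matchPositions) {
        const bool boundary = isBoundary(str, pos);
        if (boundary) {
            total += prevWasBoundary ? fullBonus : reducedBonus;
        }
        prevWasBoundary = boundary;
    }
    return total;
}
```

For FooBarTaz, the positions {0, 1, 6} ("fot") would then collect 15, while
the acronym positions {0, 3, 6} ("fbt") would collect 30.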

More boundaries could be considered, for example `-` or `/`.
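
As a sketch, widening the separator set could look like this. I am assuming
here that '_' and ' ' are the separators recognised so far; the function
names are made up for the example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical separator test: '_' and ' ' (assumed existing set) plus the
// proposed additions '-' and '/'.
bool isSeparator(char c)
{
    return c == '_' || c == ' ' || c == '-' || c == '/';
}

// True when the letter at index i directly follows a separator character.
bool followsSeparator(const std::string &str, std::size_t i)
{
    return i > 0 && i < str.size() && isSeparator(str[i - 1]);
}
```

With '/' in the set, the first letter of every path component would count
as a boundary when matching against full paths, which may or may not be
desirable and would need testing.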

But all this needs a lot of testing. To you it may seem like we can just
adjust the scores and be done, but please understand that I have spent
countless hours, even days, tuning the algorithm for various cases.

On Mon, Sep 26, 2022, 2:13 AM Waqar Ahmed <waqar.17a at gmail.com> wrote:

> You seem to think that an in-sequence match should always be preferred.
> That's not how it works _by design_. And the examples you gave are not very
> good examples at all. If you are doing "ese"-like searches for the given
> filenames, then don't expect good results; rather, improve your searches
> so that the tool is able to help you better.
>
> - matches at the beginning are preferred because people usually tend to
> search like that and because all other fuzzy filter implementations
> that I came across did the same thing. This is not changing, sorry.
> - in-sequence matches are preferred once the pattern length is >= 4.
>
> This is called fuzzy filter for a reason. It's not exact matching as you
> want it to be.
>
> I have created an MR which should prefer open files over non-open ones. If
> you can try that, it would be great.
>
> Thanks.
>
>
>
> On Mon, Sep 26, 2022, 1:49 AM Alexander Neundorf <neundorf at kde.org> wrote:
>
>> Hi,
>>
>> On Samstag, 24. September 2022 00:06:34 CEST Waqar Ahmed wrote:
>> > I am against adding the old way, but if it's optional, ok sure as long as
>> > it is disabled by default.
>> >
>> > Your approach is completely incorrect though and the only reason I will say
>> > ok to the patch is because Christoph already said ok. We can and should
>> > improve the algorithm instead rather than just bringing back the old way on
>> > the first complaint.
>>
>> Here are 3 examples (in the kate source tree) where the calculated score is
>> IMO not good:
>>
>> I want to switch to "KateSearchCommand.cpp", which is already open.
>> filter "ese":
>> KateSearchCommand.cpp gets a score of 113
>> MultilineStartEndOfLineMatch.txt gets a higher score of 116, even though it
>> does not contain the string "ese", but only the "eS" and "E" with 4
>> characters in between.
>> I think a string which contains the filter exactly should get a higher score
>> than a string which "just" contains the characters.
>>
>>
>> filter "tes":
>> KateSearchCommand.cpp gets a score of 118 and comes in place 23, i.e.
>> not visible without scrolling.
>> tests.qrc gets a higher score of 159, probably because it starts with
>> "tes", but it is not open yet. There are about 20 files which start with
>> "test", they are all not open.
>> I often leave out the start of the filename, because often this is the same
>> for many files in a project (e.g. "kate" in kate, or "q" in Qt, or "algo" in
>> some other project), so I start typing with something in the middle of the
>> filename.
>> So I'd suggest that the "is open" bonus should be bigger than the "starts
>> with" bonus.
>>
>> Different example: I want to switch to "kfts_fuzzy_match.h"
>> filter "fts":
>> kfts_fuzzy_match.h gets a score of 100
>> filetree_model_test.cpp gets a higher score of 120. Again, I'd suggest that
>> a string which contains the filter string exactly should get a higher score
>> than a string which "just" contains the characters.
>>
>> The following gives IMO better results:
>>
>> bonus for "already open" = 15
>>
>> if (matched) {
>>    int sequentialBonus = 25;
>>    int separatorBonus = 10; // bonus if match occurs after a separator
>>    int camelBonus = 10; // bonus if match is uppercase and prev is lower
>>    int firstLetterBonus = 10; // bonus if the first letter is matched
>>    int leadingLetterPenalty = 0; // penalty applied for every letter in str before the first match
>>    int maxLeadingLetterPenalty = 0; // maximum penalty for leading letters
>>    int unmatchedLetterPenalty = -1; // penalty for every letter that doesn't match
>>    int nonBeginSequenceBonus = 20;
>>
>>
>> I'm not sure I understand this. Doesn't this mean that a long filename gets
>> a big bonus?
>>             // extra points if file exists in project root
>>             // This gives priority to the files at the root
>>             // of the project over others. This is important
>>             // because otherwise getting to root files may
>>             // not be that easy
>>             if (!matchPath) {
>>                 score += (sm->idxToFilePath(sourceRow) == name) * name.size();
>>
>>
>> Alex
>>
>>
>>
>>