D21865: [TermGenerator] Do Term truncation prior to UTF-8 conversion
Stefan Brüns
noreply at phabricator.kde.org
Sun Jun 16 23:42:06 BST 2019
bruns created this revision.
bruns added reviewers: Baloo, ngraham, astippich, poboiko.
Herald added projects: Frameworks, Baloo.
Herald added a subscriber: kde-frameworks-devel.
bruns requested review of this revision.
REVISION SUMMARY
The (somewhat arbitrary) term truncation was applied to the UTF-8 encoded
data, sometimes truncating the term in the middle of a codepoint.
Truncate the QString instead. This also has the effect of leaving more
useful characters for languages where the majority of codepoints are
encoded as 2 or more bytes.
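The difference between the two truncation orders can be illustrated with a
minimal standard-library sketch. This is not the Baloo code: it uses
std::u32string in place of QString, and the helper names and the limit passed
in are hypothetical, chosen only for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Encode one Unicode code point as UTF-8 and append it to `out`.
void appendUtf8(std::string &out, char32_t cp)
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

std::string toUtf8(const std::u32string &s)
{
    std::string out;
    for (char32_t cp : s) {
        appendUtf8(out, cp);
    }
    return out;
}

// Old order: encode first, then cut the byte array at maxLen bytes.
std::string truncateAfterEncoding(const std::u32string &term, std::size_t maxLen)
{
    return toUtf8(term).substr(0, maxLen);
}

// New order: cut the decoded string at maxLen code points, then encode.
std::string truncateBeforeEncoding(const std::u32string &term, std::size_t maxLen)
{
    return toUtf8(term.substr(0, maxLen));
}

// Returns true if the byte string ends in an incomplete multi-byte sequence.
bool endsMidCodepoint(const std::string &utf8)
{
    std::size_t i = utf8.size();
    std::size_t continuation = 0;
    // Walk back over trailing continuation bytes (10xxxxxx).
    while (i > 0 && (static_cast<unsigned char>(utf8[i - 1]) & 0xC0) == 0x80) {
        --i;
        ++continuation;
    }
    if (i == 0) {
        return continuation > 0;
    }
    const unsigned char lead = static_cast<unsigned char>(utf8[i - 1]);
    const std::size_t expected = lead < 0x80 ? 1
                               : (lead >> 5) == 0x6 ? 2
                               : (lead >> 4) == 0xE ? 3
                               : 4;
    return continuation + 1 < expected;
}
```

With a term made of four 2-byte codepoints and a limit of 3, the old order
keeps 3 bytes and cuts the second codepoint in half, while the new order keeps
3 whole codepoints (6 bytes). Note that QString stores UTF-16 code units, so
the real code may still need care around surrogate pairs; the UTF-32 sketch
sidesteps that question.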
This requires some extra storage in the DB when a term whose UTF-8 encoding
would previously have been truncated now goes in as is, but likely only a few
terms / languages are affected (for English words UTF-8 encodes most
codepoints in 1 byte).
There is a small caveat for the SearchStore. As queries were truncated
likewise, an untruncated query would no longer find untruncated terms from
new index runs. To allow matches nevertheless, truncated terms use
StartsWith instead of Equal matches.
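One possible reading of the StartsWith fallback, as a minimal sketch. The
names MatchType, TermQuery, makeTermQuery and matches, as well as the default
limit, are hypothetical and are not taken from searchstore.cpp; the sketch
only shows the idea of switching to a prefix match once a query term has been
truncated.

```cpp
#include <cstddef>
#include <string>

enum class MatchType { Equal, StartsWith };

// Hypothetical limit, for illustration only.
constexpr std::size_t kMaxTermLength = 25;

struct TermQuery {
    std::u32string term;
    MatchType type;
};

// Build a query for a single term. A term that had to be truncated can no
// longer be compared for equality, so it is matched as a prefix instead.
TermQuery makeTermQuery(const std::u32string &term,
                        std::size_t maxLen = kMaxTermLength)
{
    if (term.size() > maxLen) {
        return {term.substr(0, maxLen), MatchType::StartsWith};
    }
    return {term, MatchType::Equal};
}

// Check one indexed term against the query.
bool matches(const TermQuery &query, const std::u32string &indexedTerm)
{
    if (query.type == MatchType::StartsWith) {
        return indexedTerm.compare(0, query.term.size(), query.term) == 0;
    }
    return indexedTerm == query.term;
}
```

Under these assumptions a truncated query still finds index entries that were
stored with a shorter or a longer cut-off, at the cost of possibly matching
other terms that share the same prefix.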
TEST PLAN
ctest
REPOSITORY
R293 Baloo
BRANCH
phrasestorage_fixes
REVISION DETAIL
https://phabricator.kde.org/D21865
AFFECTED FILES
src/engine/termgenerator.cpp
src/lib/searchstore.cpp
To: bruns, #baloo, ngraham, astippich, poboiko
Cc: kde-frameworks-devel, LeGast00n, fbampaloukas, domson, ashaposhnikov, michaelh, astippich, spoorun, ngraham, bruns, abrahams