D21865: [TermGenerator] Do Term truncation prior to UTF-8 conversion

Stefan Brüns noreply at phabricator.kde.org
Sun Jun 16 23:42:06 BST 2019


bruns created this revision.
bruns added reviewers: Baloo, ngraham, astippich, poboiko.
Herald added projects: Frameworks, Baloo.
Herald added a subscriber: kde-frameworks-devel.
bruns requested review of this revision.

REVISION SUMMARY
  The (somewhat arbitrary) term truncation was applied to the UTF-8 encoded
  data, sometimes truncating the term in the middle of a codepoint.
  
  Truncate the QString instead. This also has the effect of leaving more
  useful characters for languages where the majority of codepoints are
  encoded as 2 or more bytes.
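
  To illustrate the difference in ordering, here is a minimal standalone
  sketch. It is not the Baloo code: it uses std::u32string as a stand-in for
  QString (which actually counts UTF-16 code units), and the helper names
  encodeUtf8, toUtf8, truncateAfterEncoding and truncateBeforeEncoding are
  hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Encode one Unicode code point as UTF-8 (assumes a valid scalar value).
std::string encodeUtf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Convert a whole code point string to UTF-8.
std::string toUtf8(const std::u32string &s)
{
    std::string out;
    for (char32_t cp : s) {
        out += encodeUtf8(cp);
    }
    return out;
}

// Old order: encode first, then cut at a byte count.
// The cut can land in the middle of a multi-byte sequence.
std::string truncateAfterEncoding(const std::u32string &term, std::size_t maxLen)
{
    return toUtf8(term).substr(0, maxLen);
}

// New order: cut at a code point count, then encode.
// The result is always a sequence of complete code points.
std::string truncateBeforeEncoding(const std::u32string &term, std::size_t maxLen)
{
    return toUtf8(term.substr(0, maxLen));
}
```

  With the three-codepoint term "äöü" (two bytes each in UTF-8) and a limit
  of 3, the old order keeps 3 bytes and ends on a dangling lead byte, while
  the new order keeps all three codepoints in 6 bytes.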
  
  This requires some extra storage in the DB when a term which would
  previously have been truncated now goes in as is, but likely only a few
  terms / languages are affected (for English words UTF-8 encodes most
  codepoints in 1 byte).
  
  There is a small caveat for the SearchStore. As queries were truncated
  likewise, an untruncated query would no longer find untruncated terms from
  new index runs. To allow matches nevertheless, truncated terms use
  StartsWith instead of Equal matches.
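
  The match-type selection described above could look roughly like the
  following sketch. All names (MatchType, TermQuery, makeTermQuery,
  kMaxTermLength) and the limit value are hypothetical and are not taken
  from src/lib/searchstore.cpp; std::u32string again stands in for QString.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class MatchType { Equal, StartsWith };

struct TermQuery {
    std::u32string term;
    MatchType type;
};

// Hypothetical limit, chosen only for illustration.
constexpr std::size_t kMaxTermLength = 25;

// Build a query for one word. A word that had to be truncated is matched
// as a prefix, so it can still find index entries that were stored with a
// different cut-off; an untouched word is matched exactly.
TermQuery makeTermQuery(const std::u32string &word,
                        std::size_t maxLen = kMaxTermLength)
{
    if (word.size() > maxLen) {
        return {word.substr(0, maxLen), MatchType::StartsWith};
    }
    return {word, MatchType::Equal};
}
```

  The trade-off in this sketch is that a prefix match may return extra
  hits for very long terms, in exchange for not missing entries.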

TEST PLAN
  ctest

REPOSITORY
  R293 Baloo

BRANCH
  phrasestorage_fixes

REVISION DETAIL
  https://phabricator.kde.org/D21865

AFFECTED FILES
  src/engine/termgenerator.cpp
  src/lib/searchstore.cpp

To: bruns, #baloo, ngraham, astippich, poboiko
Cc: kde-frameworks-devel, LeGast00n, fbampaloukas, domson, ashaposhnikov, michaelh, astippich, spoorun, ngraham, bruns, abrahams