D11552: [WIP] Handle CJK characters
Michael Heidelbach
noreply at phabricator.kde.org
Thu Mar 22 21:38:21 UTC 2018
michaelh added a comment.
In D11552#231784 <https://phabricator.kde.org/D11552#231784>, @bruns wrote:
> In D11552#231330 <https://phabricator.kde.org/D11552#231330>, @hein wrote:
>
> > For the record though - a better way to do this is to use QTextBoundaryFinder which will operate e.g. on grapheme cluster boundaries. This still isn't super great for Chinese though. If you want to really-properly do it you'll end up depending on ICU and using its BreakIterator combined with dict-based support for Chinese, which isn't terribly fast however.
>
>
> There are a few implications here:
>
> - splitting too finely generates terms that are too unspecific, especially for full-text indexing (think of splitting a Western language at character level: most texts likely contain almost the full alphabet. The same likely applies to Katakana with its roughly 100 graphemes)
> - term generation at query time and at index time has to agree on what a term is, otherwise a search will likely return nothing. Changing the splitting later will require reindexing all affected files
> - better splitting will cost some more time at index generation, but likely makes searching faster (the additional time for term generation will be negligible, but the search terms are less complex, e.g. "abc" instead of "a" AND "b" AND "c").
Currently `termgenerator` uses `QTextBoundaryFinder bf(QTextBoundaryFinder::Word, text);`
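
To make the quoted points concrete, here is a minimal, self-contained C++ sketch. It is not Baloo code: `ToyIndex`, `splitWords` and `splitChars` are hypothetical names, plain ASCII stands in for CJK text, and whitespace splitting stands in for real word-boundary detection. It is only meant to illustrate that index-time and query-time splitting have to match, and that character-level terms are unspecific.

```cpp
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical splitter type: turns text into index or query terms.
using Splitter = std::vector<std::string> (*)(const std::string&);

// Word-level splitting on whitespace: "abc def" -> {"abc", "def"}.
std::vector<std::string> splitWords(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> terms;
    for (std::string word; in >> word;) {
        terms.push_back(word);
    }
    return terms;
}

// Character-level splitting: "abc def" -> {"a", "b", "c", "d", "e", "f"}.
std::vector<std::string> splitChars(const std::string& text) {
    std::vector<std::string> terms;
    for (char c : text) {
        if (c != ' ') {
            terms.push_back(std::string(1, c));
        }
    }
    return terms;
}

// Toy inverted index (an assumption for illustration): term -> document ids.
struct ToyIndex {
    std::map<std::string, std::set<int>> postings;

    // Index a document using the given splitter.
    void add(int docId, const std::string& text, Splitter split) {
        for (const auto& term : split(text)) {
            postings[term].insert(docId);
        }
    }

    // AND query: documents that contain every term the splitter produces.
    std::set<int> search(const std::string& query, Splitter split) const {
        std::set<int> result;
        bool first = true;
        for (const auto& term : split(query)) {
            auto it = postings.find(term);
            if (it == postings.end()) {
                return {};  // one unknown term makes the AND query empty
            }
            if (first) {
                result = it->second;
                first = false;
                continue;
            }
            std::set<int> kept;
            for (int id : result) {
                if (it->second.count(id)) {
                    kept.insert(id);
                }
            }
            result = kept;
        }
        return result;
    }
};
```

With documents 1 ("abc def") and 2 ("cab fed") indexed, searching for "abc" with word-level terms on both sides finds only document 1. Indexing by word but querying by character finds nothing, because "a", "b" and "c" were never stored as terms. Character-level splitting on both sides matches both documents, since both contain all three letters.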
REPOSITORY
R293 Baloo
REVISION DETAIL
https://phabricator.kde.org/D11552
To: michaelh, hein
Cc: bruns, lbeltrame, #frameworks, alexeymin, cfeck, ashaposhnikov, michaelh, astippich, spoorun, nicolasfella, ngraham