<table><tr><td style="">poboiko created this revision.<br />poboiko added a project: Frameworks.
</td><a style="text-decoration: none; padding: 4px 8px; margin: 0 8px 8px; float: right; color: #464C5C; font-weight: bold; border-radius: 3px; background-color: #F7F7F9; background-image: linear-gradient(to bottom,#fff,#f1f0f1); display: inline-block; border: 1px solid rgba(71,87,120,.2);" href="https://phabricator.kde.org/D4995" rel="noreferrer">View Revision</a></tr></table><br /><div><strong>REVISION SUMMARY</strong><div><p>I've noticed that "balooshow -x file.pdf" segfaults on some PDF files. The backtrace showed that it crashed due to <a href="https://cgit.kde.org/baloo.git/tree/src/tools/balooshow/main.cpp#n201" class="remarkup-link" target="_blank" rel="noreferrer">a term consisting of a single "X" (see line 201)</a>. Moreover, the file had a number of terms containing uppercase characters, which should never occur: all search terms are lowercase, and uppercase is reserved for metadata.<br />
Further investigation showed that the PDF file (after extraction) contained exotic Unicode characters (e.g. "𝐻𝑒𝑑𝑔𝑒"). Calling toLower() left that string unchanged; normalization then turned it into "Hedge", and it went straight into the DB with the uppercase character intact.</p></div></div><br /><div><strong>TEST PLAN</strong><div><p>I've tested it on the affected file; "balooshow -x" no longer crashes and the output no longer contains uppercase terms.</p>
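<p>The effect described in the summary can be reproduced outside Baloo. The sketch below uses Python's standard unicodedata module purely as an illustration; it is not the Baloo code. The NFKD normalization form and the function names are assumptions, and the "normalize first" variant is only one plausible way to avoid the problem, not necessarily what this patch does.</p>

```python
import unicodedata

# "Hedge" spelled with Mathematical Italic letters (U+1D43B, U+1D452, ...).
# These code points have no lowercase mapping, so lower() leaves them alone.
raw = "\U0001D43B\U0001D452\U0001D451\U0001D454\U0001D452"


def lower_then_normalize(text):
    """Hypothetical order matching the symptom: lowercase, then normalize."""
    return unicodedata.normalize("NFKD", text.lower())


def normalize_then_lower(text):
    """Hypothetical alternative order: normalize first, then lowercase."""
    return unicodedata.normalize("NFKD", text).lower()


buggy = lower_then_normalize(raw)
fixed = normalize_then_lower(raw)
print(buggy, fixed)
```

<p>With the first ordering the compatibility decomposition happens after lowercasing, so the resulting ASCII "H" is never lowercased; with the second ordering it is.</p>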
<p>An additional check for this problematic case could probably be added to the "balooctl checkDb" command.<br />
I can prepare a separate patch, if necessary.</p></div></div><br /><div><strong>REPOSITORY</strong><div><div>R293 Baloo</div></div></div><br /><div><strong>REVISION DETAIL</strong><div><a href="https://phabricator.kde.org/D4995" rel="noreferrer">https://phabricator.kde.org/D4995</a></div></div><br /><div><strong>AFFECTED FILES</strong><div><div>src/engine/termgenerator.cpp</div></div></div><br /><div><strong>To: </strong>poboiko<br /><strong>Cc: </strong>Frameworks<br /></div>