[Okular-devel] [okular] [Bug 334068] Failed to find a string occurrence in a PDF document, but other readers find it successfully

Fri May 2 22:13:04 UTC 2014

https://bugs.kde.org/show_bug.cgi?id=334068

--- Comment #4 from Jaan Vajakas <jaanvajakas at hot.ee> ---
The problem with this file is that the bounding boxes of "T" and "A" overlap.
Okular's layout detection algorithm considers two glyphs to belong to the same
word only if the left side of the second glyph's bounding box coincides exactly
with the right side of the first one's (after rounding to integer pixels at a
certain resolution); any overlap or gap splits the word. I think I can write a
small patch to solve it: accept an overlap (or maybe also a gap) of up to some
percentage of the width of the following character.
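
As a rough illustration of the tolerance check proposed above (a Python sketch,
not Okular's actual C++ code; the names Box, touches_exactly and same_word, and
the 25% default tolerance, are assumptions made up for this example):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Horizontal extent of a glyph's bounding box (hypothetical type)."""
    left: float
    right: float


def touches_exactly(prev: Box, nxt: Box) -> bool:
    # Strict rule as described above: after rounding to integer pixels,
    # the next glyph must start exactly where the previous one ends.
    return round(prev.right) == round(nxt.left)


def same_word(prev: Box, nxt: Box, tolerance: float = 0.25) -> bool:
    # Relaxed rule: allow an overlap or a gap of up to `tolerance` times
    # the width of the following glyph. The 0.25 default is an assumption.
    width = nxt.right - nxt.left
    offset = nxt.left - prev.right  # negative = overlap, positive = gap
    return abs(offset) <= tolerance * width


# Kerned pair such as "T" followed by "A": the boxes overlap by one pixel.
t_box = Box(left=0.0, right=10.0)
a_box = Box(left=9.0, right=19.0)
strict_result = touches_exactly(t_box, a_box)
relaxed_result = same_word(t_box, a_box)
```

With these made-up boxes the strict rule splits the pair into two words, while
the relaxed rule keeps it together; a gap wider than the tolerance is still
treated as a word break.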

In the long run, layout detection will never be 100% perfect, and the XY Cut
approach that Okular uses has some fundamental limitations. I therefore think
layout detection in Okular would benefit from a major refactoring to 1) use
existing text flow information from the file where available (Tagged PDF, ePUB,
OpenDocument etc.) and 2) for files that really lack text flow data, reuse
algorithms from other similar projects to save the research and development
effort. For the current file, however, 1) would not help, since it is not a
Tagged PDF, i.e. it is one of the kind that Albert described in his comment.
