[Okular-devel] [okular] [Bug 334068] Failed to find a string occurence in a PDF document, but other readers find it successfully

Jaan Vajakas jaanvajakas at hot.ee
Thu Jun 19 18:34:20 UTC 2014


https://bugs.kde.org/show_bug.cgi?id=334068

--- Comment #7 from Jaan Vajakas <jaanvajakas at hot.ee> ---
When testing with some PDF documents on my hard drive, I found that fixing
this bug would cause a regression for some PDFs (OCR'ed papers) from JSTOR
which have slightly wrong bounding rectangles; for those documents the current
rule "two glyphs belong to the same word iff their bounding box edges exactly
match" works best. (An example is http://www.jstor.org/stable/1970717 but
unfortunately JSTOR charges for downloading the PDF unless you belong to a
university that has a contract with them.) However, those JSTOR PDFs are Tagged
PDFs and their Tagged PDF actual text content (which can be obtained by copying
text from Acrobat Reader) is good. So, in order to avoid regressions, Tagged
PDF support (i.e., not doing layout detection for Tagged PDFs) should also be
added to Okular when fixing this bug.
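
To make the word-grouping rule concrete, here is a minimal, self-contained
sketch. It is not Okular's actual code: the Glyph struct, the groupIntoWords()
function and the tolerance parameter are hypothetical names made up for
illustration, and only the horizontal edges of the bounding boxes are compared.
A tolerance of 0.0 corresponds to the current "edges exactly match" rule; a
positive tolerance stands in for one possible relaxed rule.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical glyph record: one character plus the horizontal extent of
// its bounding box (not a poppler or Okular type).
struct Glyph {
    char ch;
    double left;
    double right;
};

// Split a left-to-right run of glyphs into words. Two neighbouring glyphs
// stay in the same word when the distance between the right edge of the
// first and the left edge of the second is at most `tolerance`.
// tolerance == 0.0 means the edges must match exactly.
std::vector<std::string> groupIntoWords(const std::vector<Glyph>& glyphs,
                                        double tolerance)
{
    assert(tolerance >= 0.0);
    std::vector<std::string> words;
    for (std::size_t i = 0; i < glyphs.size(); ++i) {
        const bool startsNewWord =
            (i == 0) ||
            std::fabs(glyphs[i].left - glyphs[i - 1].right) > tolerance;
        if (startsNewWord)
            words.push_back(std::string());
        words.back().push_back(glyphs[i].ch);
    }
    return words;
}
```

With glyphs a [0, 5], b [5, 10], c [10.5, 15.5] and d [15.5, 20.5], a
tolerance of 0.0 yields the two words "ab" and "cd", while a tolerance of 1.0
merges everything into "abcd". That is one plausible way a looser rule could
regress on documents whose rectangles are slightly off, though I have not
verified that this is exactly what happens with the JSTOR files.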

However, I didn't find a method returning the Tagged PDF actual text in the Qt4
interface of poppler. The only promising one was Poppler::Page::textList(),
which Okular currently uses (with some layout detection of its own on top of
it). But from testing with poppler 0.26.0 and 0.26.1, I found that textList()
still doesn't return the Tagged PDF text, only the results of the layout
detection done by poppler.
