[Nepomuk] Review Request: Extend popplerextractor with firstpage parsing

Jörg Ehrichs joerg.ehrichs at gmx.de
Sun Dec 23 13:21:18 UTC 2012



> On Dec. 23, 2012, 12:50 p.m., Vishesh Handa wrote:
> > services/fileindexer/indexer/popplerextractor.cpp, line 69
> > <http://git.reviewboard.kde.org/r/107870/diff/1/?file=100721#file100721line69>
> >
> >     This line is no longer required since the title has now been trimmed

Actually, this line is debatable, and it is not meant to detect whitespace-only titles.

The reason for this line is to detect titles without any whitespace (that is, titles consisting of a single word).
In many cases such a title might be a valid match, but for research papers it certainly never is.
Instead, whenever the metadata title was just one word, its content was some "random number/identifier/author abbreviation".
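
To illustrate the check described above, here is a minimal sketch in plain standard C++. It is not the code from the patch: the function name isPlausibleTitle is hypothetical, and it assumes the title has already been trimmed.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper, not part of popplerextractor.cpp.
// Returns true when the (already trimmed) metadata title contains at least
// one whitespace character, i.e. when it consists of more than one word.
bool isPlausibleTitle(const std::string& trimmedTitle)
{
    return std::any_of(trimmedTitle.begin(), trimmedTitle.end(),
                       [](unsigned char c) { return std::isspace(c) != 0; });
}
```

A one-word value such as an identifier is rejected by this check, so the extractor could then fall back to another source for the title.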


- Jörg


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://git.reviewboard.kde.org/r/107870/#review23896
-----------------------------------------------------------


On Dec. 23, 2012, 1:15 p.m., Jörg Ehrichs wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> http://git.reviewboard.kde.org/r/107870/
> -----------------------------------------------------------
> 
> (Updated Dec. 23, 2012, 1:15 p.m.)
> 
> 
> Review request for Nepomuk and Vishesh Handa.
> 
> 
> Description
> -------
> 
> Extend popplerextractor with firstpage parsing
> 
> Often the PDF metadata is not available, or wrong data is stored in
> the title field (PDF exporter names instead of the title).
>     
> This patch adds the possibility to parse the first page for a possible
> title. A possible title is the connected text with the biggest font
> that is more than one character long.
> 
> 
> Diffs
> -----
> 
>   services/fileindexer/indexer/popplerextractor.h c7dfa50 
>   services/fileindexer/indexer/popplerextractor.cpp 7015195 
> 
> Diff: http://git.reviewboard.kde.org/r/107870/diff/
> 
> 
> Testing
> -------
> 
> Tested various PDF files; the title is added correctly whenever it was possible to find one.
> 
> 
> Thanks,
> 
> Jörg Ehrichs
> 
>
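
The first-page heuristic from the quoted description could be sketched roughly as follows. This is only an illustration under stated assumptions, not the patch itself: the Word struct, the guessTitle function and the 0.5 tolerance are hypothetical, the height of a word's bounding box stands in for its font size, and the words are assumed to arrive in reading order.

```cpp
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a word on the first page: its text and the
// height of its bounding box, used here as a proxy for the font size.
struct Word {
    std::string text;
    double height;
};

// Joins consecutive words of (nearly) equal height into runs and returns the
// text of the tallest run that is longer than one character.
// Returns an empty string when no such run exists.
std::string guessTitle(const std::vector<Word>& words)
{
    const double tolerance = 0.5; // assumed tolerance for "same font size"
    std::string best;
    double bestHeight = 0.0;

    std::size_t i = 0;
    while (i < words.size()) {
        const double height = words[i].height;
        std::string run = words[i].text;

        // Extend the run while the following words have the same height.
        std::size_t j = i + 1;
        while (j < words.size()
               && std::fabs(words[j].height - height) <= tolerance) {
            run += ' ';
            run += words[j].text;
            ++j;
        }

        // Keep the tallest run, ignoring single-character runs.
        if (run.size() > 1 && height > bestHeight) {
            best = run;
            bestHeight = height;
        }
        i = j;
    }
    return best;
}
```

Skipping single-character runs keeps a large isolated glyph, such as a decorative initial or a page number, from being picked over the actual title.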
