[Nepomuk] Review Request 113217: Nepomuk File Extractor for binary MS Office files (doc, xls, ppt)

Vishesh Handa me at vhanda.in
Tue Oct 15 08:24:17 UTC 2013


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://git.reviewboard.kde.org/r/113217/#review41765
-----------------------------------------------------------



services/fileindexer/indexer/officeextractor.cpp
<http://git.reviewboard.kde.org/r/113217/#comment30490>

    It occurred to me that you don't really need a CustomCriteria for matching. You could build the list of supported mimetypes in the constructor and return it as the standard mimetype list. That way you also avoid the extra checks at runtime.
    
    Feel free to ship this patch; if you think the criteria should be changed, do it in a separate patch.


- Vishesh Handa


On Oct. 12, 2013, 1:43 p.m., Denis Steckelmacher wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> http://git.reviewboard.kde.org/r/113217/
> -----------------------------------------------------------
> 
> (Updated Oct. 12, 2013, 1:43 p.m.)
> 
> 
> Review request for Nepomuk.
> 
> 
> Repository: nepomuk-core
> 
> 
> Description
> -------
> 
> This patch adds a File Extractor for doc, xls and ppt files (the binary MS Office formats). The current version of the extractor is very simple and only indexes the plain text content of the files (neither title nor owner information is extracted). The extractor is a tiny wrapper around the "catdoc", "catppt" and "xls2csv" command-line utilities. These tools are packaged in the "catdoc" package of Debian and openSUSE.
> 
> These utilities are released under the GNU GPLv2. If I recall correctly, the LGPLv2.1 Nepomuk libraries can use these tools provided no library calls are made to them. The extractor uses QProcess to launch an instance of catdoc, catppt or xls2csv, giving it the name of the file to index, and gets the plain text from the standard output of this process. I hope this complies with the GPL.
> 
> The commands are located at run-time using KStandardDirs. This way, no new build dependency is added to Nepomuk, and it is up to the user or the distribution to add "catdoc" to the dependency list of Nepomuk. If a command is not found, the indexer is disabled for the specific MIME type handled by the command.
> 
> 
> Diffs
> -----
> 
>   services/fileindexer/indexer/officeextractor.cpp PRE-CREATION 
>   services/fileindexer/indexer/officeextractor.h PRE-CREATION 
>   services/fileindexer/indexer/nepomukofficeextractor.desktop PRE-CREATION 
> 
> Diff: http://git.reviewboard.kde.org/r/113217/diff/
> 
> 
> Testing
> -------
> 
> I have run the indexer on several DOC, XLS and PPT files I have on my computer. The indexer doesn't work on encrypted files (catdoc refuses to parse them). This is unfortunate because some interesting Excel files are password-protected only on select pages, or only the editing of certain cells is prohibited. The rest of the file can contain valuable data and should be indexed.
> 
> 
> Thanks,
> 
> Denis Steckelmacher
> 
>
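
The process-based approach in the Description (start the tool with the file name, read the plain text from its standard output) can be illustrated as follows. This is a hedged sketch, not the patch: the patch uses QProcess, while this example swaps in POSIX `popen` so that it is self-contained, and the function names are hypothetical.

```cpp
// Sketch only: POSIX popen() stands in for the QProcess call in the patch.
#include <array>
#include <cstddef>
#include <cstdio>
#include <stdexcept>
#include <string>

// Quote a string for POSIX sh: wrap it in single quotes and escape any
// embedded single quote as '\''.
std::string shellQuote(const std::string& s) {
    std::string out = "'";
    for (char c : s) {
        if (c == '\'') out += "'\\''";
        else out += c;
    }
    out += "'";
    return out;
}

// Run `tool file` and return what the tool wrote to standard output.
// Returns an empty string if the tool exits with a non-zero status
// (for example when it refuses to parse the file).
std::string extractPlainText(const std::string& tool, const std::string& file) {
    const std::string cmd =
        shellQuote(tool) + " " + shellQuote(file) + " 2>/dev/null";
    FILE* pipe = popen(cmd.c_str(), "r");
    if (!pipe) throw std::runtime_error("could not start " + tool);

    std::string text;
    std::array<char, 4096> buf;
    std::size_t n;
    while ((n = std::fread(buf.data(), 1, buf.size(), pipe)) > 0)
        text.append(buf.data(), n);

    const int status = pclose(pipe);
    if (status != 0) return {};
    return text;
}
```

Because the tool runs as a separate process and only its output is read, nothing is linked against it, which is the property the Description relies on for the licensing question.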
