[Nepomuk] Review Request 113217: Nepomuk File Extractor for binary MS Office files (doc, xls, ppt)

Denis Steckelmacher steckdenis at yahoo.fr
Tue Oct 15 11:20:20 UTC 2013


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://git.reviewboard.kde.org/r/113217/
-----------------------------------------------------------

(Updated Oct. 15, 2013, 11:20 a.m.)


Status
------

This change has been marked as submitted.


Review request for Nepomuk.


Repository: nepomuk-core


Description
-----------

This patch adds a File Extractor for doc, xls and ppt files (the binary MS Office formats). The current version of the extractor is very simple and only indexes the plain-text content of the files (neither title nor owner information is extracted). The extractor is a thin wrapper around the "catdoc", "catppt" and "xls2csv" command-line utilities. These tools are shipped in the "catdoc" package on Debian and openSUSE.

These utilities are released under the GNU GPLv2. If I recall correctly, the LGPLv2.1 Nepomuk libraries can use these tools as long as no library calls are made into them. The extractor uses QProcess to launch an instance of catdoc, catppt or xls2csv, passing it the name of the file to index, and reads the plain text from the standard output of this process. I hope this complies with the GPL.
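To illustrate the separate-process approach described above, here is a minimal sketch in Python rather than the C++/QProcess code of the patch itself. The function name `extract_plain_text`, the timeout value, the UTF-8 decoding and the choice to treat a non-zero exit status as "no text" are all assumptions made for the example, not details taken from the patch.

```python
import subprocess


def extract_plain_text(command, file_path, timeout=30):
    """Run `command file_path` as a child process and return its stdout.

    `command` is a list such as ["catdoc"]. The tool is only ever executed
    as a separate process; nothing is linked or imported from it.
    Returns an empty string if the tool is missing, times out or fails
    (assumption: a non-zero exit status means no usable text).
    """
    try:
        result = subprocess.run(
            list(command) + [file_path],
            stdout=subprocess.PIPE,      # the plain text arrives here
            stderr=subprocess.DEVNULL,   # diagnostics are ignored in this sketch
            timeout=timeout,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return ""
    if result.returncode != 0:
        return ""
    # Assumption: the output is UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

A call such as `extract_plain_text(["catdoc"], "report.doc")` would then hand the returned string to the indexer; `report.doc` is a placeholder file name.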

The commands are located at run-time using KStandardDirs. This way, no new build dependency is added to Nepomuk, and it is up to the user or the distribution to add "catdoc" to the dependency list of Nepomuk. If a command is not found, the indexer is disabled for the specific MIME type handled by the command.
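The run-time lookup can be sketched as follows, again in Python, with `shutil.which` standing in for the KStandardDirs lookup. The MIME-type-to-command table and the function name `available_extractors` are hypothetical; the patch may well use different MIME type strings.

```python
import shutil

# Hypothetical table: which command handles which MIME type.
CANDIDATES = {
    "application/msword": "catdoc",
    "application/vnd.ms-excel": "xls2csv",
    "application/vnd.ms-powerpoint": "catppt",
}


def available_extractors(candidates=CANDIDATES, which=shutil.which):
    """Return {mime_type: executable_path} for the commands found on PATH.

    A MIME type whose command is missing is simply left out, so the
    extractor is disabled for that type only.
    """
    found = {}
    for mime_type, command in candidates.items():
        path = which(command)
        if path is not None:
            found[mime_type] = path
    return found
```

On a system where only part of the toolset is installed, the returned dictionary contains only the matching MIME types, and the other formats are skipped without any build-time check.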


Diffs
-----

  services/fileindexer/indexer/officeextractor.cpp PRE-CREATION 
  services/fileindexer/indexer/officeextractor.h PRE-CREATION 
  services/fileindexer/indexer/nepomukofficeextractor.desktop PRE-CREATION 

Diff: http://git.reviewboard.kde.org/r/113217/diff/


Testing
-------

I have run the indexer on several DOC, XLS and PPT files I have on my computer. The indexer doesn't work on encrypted files (catdoc refuses to parse them). This is unfortunate because some interesting Excel files are password-protected only on selected sheets, or only the editing of certain cells is prohibited. The rest of the file can contain valuable data and should be indexed.


Thanks,

Denis Steckelmacher
