Extracting plain text and meta data
Vishesh Handa
me at vhanda.in
Wed Sep 12 15:55:10 BST 2012
Hey everyone
I'm currently working on improving KDE's file indexing infrastructure.
One of the areas where we are lacking is proper support for Open Document
Formats and Microsoft Document Formats. It occurred to me that maybe I
could use the Calligra libraries to do so. I even looked at the code base
(a little bit), and extracting the basic metadata is really simple
(KoDocumentInfo).
I also looked at the Calligra Converter code, which seems to be using a
print job to convert the formats. It can convert the file to a PDF, which I
can then easily parse, but that seems like a bit too much effort. Not to
mention that it's probably very slow.
So my question is: is it possible to use Calligra to quickly extract the
plain text from the file?
Also, what kind of dependencies am I looking at? Just calligra-libs or
something else?
--
Vishesh Handa
PS: Please keep me CCed. I'm not on the mailing list.