[dot] The Road to KDE 4: Strigi and File Information Extraction
Dot Stories
stories at kdenews.org
Wed Apr 11 18:57:29 CEST 2007
URL: http://dot.kde.org/1176310483/
From: Troy Unrau <troy.unrau at gmail.com>
Dept: you've-got-to-dig-a-little-deeper
Date: Wednesday 11/Apr/2007, @09:54
The Road to KDE 4: Strigi and File Information Extraction
=========================================================
After a short delay due to a heavy dosage of Real Life(tm), I return
to bring you more on the technologies behind KDE 4. This week I am
featuring Strigi [http://strigi.sf.net/], an information extraction
subsystem that is being fully deployed for KDE 4.0. KDE has previously
had the ability to extract information about files of various types, and
has used it in a variety of functional contexts, such as the
Properties Dialog. Strigi promises many improvements over the existing
implementation. Read on for more...
Strigi is a library that sits at a lower level than KDE. It is
written in C++, and is designed to present a series of generic calls
that a program can use to find more information about a given file or
files. It is in no way tied to KDE except that the development version
[http://websvn.kde.org/trunk/KDE/kdesupport/strigi] lives in KDE's SVN
repository. It also has search capabilities, which are not really the
focus of this article.
The Strigi libraries are used to get information from within files,
such as the dimensions of an image, or the length of an audio clip,
embedded thumbnails, number of lines in a log, source code licensing
info or just to search a text file for a given string. Strigi has other
advantages, as it can work inside compressed files, archives, and so
forth seamlessly. In fact, it ships a couple of utility programs,
deepgrep and deepfind. These command-line programs allow you to search
for information within binary file formats as easily as
using grep or find on plain text files. KDE is inheriting the same
libraries, so we also get this unique advantage of being able to pull
information out of files that are buried within binary formats, such as
.tgz files. There is a prototype kio_jstreams powered by Strigi that
treats archives like local folders, allowing you to visit
/home/user/tarball.tar.gz/icons/ for example... This works great when
you are using solely KDE integrated applications, but there are
currently problems when mixing with other programs. For example, if
you're browsing with Konqueror, click on a file within a tarball, and
want to open it in the GIMP, passing that sort of path would break the
GIMP, since no such directory exists on the real filesystem. So for the
time being, this mode of operation is an experimental ioslave only, and
will continue to be until these sorts of problems are solved. (The other
problem is making a tgz or odp file behave as both a file and a
directory simultaneously.)
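
To make the "search inside archives" idea concrete, here is a minimal
Python sketch of a recursive search through nested containers. It is not
Strigi's code and does not use its API; the function name deep_grep and
its parameters are made up for illustration, and it only understands
tar (optionally compressed), zip and gzip data.

```python
import gzip
import io
import tarfile
import zipfile
import zlib


def deep_grep(name, data, needle, depth=0, max_depth=4):
    """Return display paths of all entries under `name` whose bytes contain `needle`.

    Hypothetical helper for illustration only; not part of Strigi.
    """
    hits = []
    if depth > max_depth:
        return hits
    # Case 1: tar archive, optionally gzip/bzip2/xz compressed (covers .tgz).
    try:
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:*") as tar:
            for member in tar:
                if member.isfile():
                    inner = tar.extractfile(member).read()
                    hits += deep_grep(f"{name}/{member.name}", inner,
                                      needle, depth + 1, max_depth)
        return hits
    except tarfile.TarError:
        pass
    # Case 2: zip container (covers OpenDocument files such as .odp).
    if zipfile.is_zipfile(io.BytesIO(data)):
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            for entry in archive.namelist():
                if not entry.endswith("/"):
                    hits += deep_grep(f"{name}/{entry}", archive.read(entry),
                                      needle, depth + 1, max_depth)
        return hits
    # Case 3: a single gzip-compressed stream that is not a tarball.
    if data[:2] == b"\x1f\x8b":
        try:
            return deep_grep(name, gzip.decompress(data), needle,
                             depth + 1, max_depth)
        except (OSError, EOFError, zlib.error):
            pass
    # Leaf: ordinary bytes, so do a plain substring search.
    if needle in data:
        hits.append(name)
    return hits
```

The returned paths look like tarball.tar.gz/icons/readme.txt, the same
"archive as a folder" notation described above. That is also why handing
such a path to a program that expects a real file on disk fails: only
code that knows how to walk into the container can resolve it.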
There are many useful ways that Strigi can return data, once a
query has been performed. For example, Jos notes: "The program
xmlindexer is useful for extracting data from files in a very efficient
manner. Because it outputs xml, it is easy to use from any program.
Other search projects such as Beagle and Tracker would greatly benefit
from using xmlindexer." The xmlindexer program is a binary, so programs
can easily call it externally without having to link to Qt or use C++.
That said, there are many ways to directly use the Strigi libraries...
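
As a rough illustration of calling xmlindexer externally, the Python
sketch below runs the binary and walks whatever XML it prints. Two
things are assumptions rather than documented behaviour: that xmlindexer
is on the PATH and takes the path to index as its argument, and that its
output is a single well-formed XML document. The helper names are
hypothetical, and the parser deliberately assumes nothing about element
or attribute names.

```python
import subprocess
import xml.etree.ElementTree as ET


def run_xmlindexer(path, binary="xmlindexer"):
    """Run the external indexer on `path` and return its standard output as text.

    Assumption: the binary is on PATH and accepts the path as an argument.
    """
    result = subprocess.run([binary, path], capture_output=True,
                            text=True, check=True)
    return result.stdout


def flatten_xml(xml_text):
    """Return (tag, attributes, text) for every element, in document order.

    No schema is assumed; callers inspect the tuples to find what they need.
    """
    root = ET.fromstring(xml_text)
    return [(el.tag, dict(el.attrib), (el.text or "").strip())
            for el in root.iter()]
```

A caller would combine the two as flatten_xml(run_xmlindexer(path)).
Nothing here links against Strigi or Qt, which is the point Jos makes:
any language that can spawn a process and parse XML can consume the
data.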
The KDE libraries have had methods of extracting information (such
as meta data via KFileMetaInfo) from files before, but in many cases
they were either slow, or of limited functionality. With Strigi, we have
seen speedups of up to several-fold when extracting data
from PNG files. I am not aware of any other speed tests actually being
performed, but the general impression is that it is much faster at
retrieving file data than most of the previously existing methods.
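
One reason metadata extraction can be cheap for a format like PNG is
that the basic properties sit at the very start of the file. The Python
sketch below reads an image's dimensions from the first 24 bytes of a
stream, following the PNG file layout (8-byte signature, then the IHDR
chunk). It is only an illustration of the kind of work involved, not a
description of how Strigi or KFileMetaInfo do it; png_dimensions is a
made-up name.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(stream):
    """Return (width, height) by reading only the first 24 bytes of a PNG stream."""
    header = stream.read(24)
    # Layout: 8-byte signature, 4-byte chunk length, 4-byte chunk type "IHDR".
    if (len(header) < 24 or header[:8] != PNG_SIGNATURE
            or header[12:16] != b"IHDR"):
        raise ValueError("not a PNG stream")
    # Bytes 16..23 hold width and height as big-endian unsigned 32-bit integers.
    width, height = struct.unpack(">II", header[16:24])
    return width, height
```

Because the function only needs a readable stream, the same code works
on a file on disk or on an entry pulled out of an archive, without
decoding any pixel data.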
Within KDE, there are not really any good screenshots to show Strigi
in action, as it's really just a library. That's not to say that its
effects will be invisible though, as things like the File Properties
dialogs are already taking advantage of the Strigi backend to pull the
data that was previously provided by KFileMetaInfo. Strigi is also
planned (and in some cases already used) to supply the thumbnails and
other metadata displayed in the file browsers, and preliminary results
show massive speed improvements. So far, though, this has had little
visible effect on the KDE experience for the end user. However, as more
KDE subsystems become aware of Strigi, we should start to see more
unique and useful applications of the features that Strigi supports.
For example, one of the biggest beneficiaries of the Strigi work is
NEPOMUK [http://nepomuk.semanticdesktop.org/]. According to Jos:
"Nepomuk is a big European research project on enhancing computer
applications to make them semantic and connected. Nepomuk-KDE
[http://nepomuk-kde.semanticdesktop.org/] is the work on a KDE
implementation of the standards and ideas that come out of that project.
I work together with the people of Nepomuk and especially Sebastian
Trueg of Nepomuk-KDE to make sure our work fits together. At the moment
Sebastian is writing [an] additional index implementation for Strigi
that is better able to work with semantic data." This project uses a lot
of metadata and other file contents (like the text of IRC logs, for
example) to provide an easy-to-use search system for the desktop. NEPOMUK
will undergo a name change before its final implementation is set.
So while Strigi does the actual digging through the data, applications
such as Dolphin and Konqueror, the File Properties Dialog and NEPOMUK
are the ones that will see the fruits of this work.
At the moment, however, work is mostly focused on porting the previously
existing KFilePlugins to use the new backend classes. For status reports
on this effort, check out the Porting KFilePlugins Progress
[http://wiki.kde.org/tiki-index.php?page=Porting+KFilePlugin+Progress]
page on the KDE wiki.
To learn more about Strigi, visit the website
[http://strigi.sf.net/] or join #strigi on irc.kde.org.