[Owncloud] Initial work on full text search with Lucene Search app, requesting feedback

Jörn Friedrich Dreyer jfd at butonic.de
Tue Apr 24 21:07:42 UTC 2012


Hi everyone,

I took the day off to polish the initial work on the Lucene Search app
that was born during the sweet owncloud developer meeting.

Before issuing a merge request I would like some more feedback on the
integration with the web frontend. Especially async ajax calls, as I
seem to be doing something wrong: my browser is frozen while receiving
an OC_Eventsource stream.

Let me give you a rundown of the current state and the hacks still in use:

After checking out the app from [1] it will automagically reindex your
files (even encrypted files) upon a page reload. There is the first
hack: currently I synchronize an indexer state table with the
oc_fscache table on every web page reload (@klaas webdav access
bypasses the indexing for speed, but requires marking changed files as
dirty). Upon a page reload I meant to use an ajax call to run the
indexing in the background while the connection is open. This somehow
locks my browser, so I'm doing it wrong.
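
To make the synchronization step concrete, here is a minimal,
language-neutral sketch in Python (the app itself is PHP). It assumes
the state table can be read as a mapping from path to a single flag,
and that oc_fscache can be reduced to a set of known paths; the names
and the flag values "indexed" and "dirty" are hypothetical.

```python
def plan_index_updates(cached_paths, index_state):
    """Decide which files to (re)index and which to drop from the index.

    cached_paths: set of paths currently known to the file cache
                  (stand-in for oc_fscache; assumption).
    index_state:  dict mapping path -> "indexed" or "dirty"
                  (stand-in for the indexer state table; assumption).
    Returns (to_index, to_remove) as sorted lists of paths.
    """
    # New files have no state entry yet; changed files are flagged dirty.
    to_index = sorted(
        path for path in cached_paths
        if index_state.get(path) != "indexed"
    )
    # Files that vanished from the cache must leave the index as well.
    to_remove = sorted(path for path in index_state if path not in cached_paths)
    return to_index, to_remove
```

With such a plan in hand, the expensive part (reading and indexing the
files in to_index) is separate from the cheap comparison.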

Nevertheless, it is happily building an index which will be used to
present new search results. File deletion is also handled correctly
and cleans up the lucene index now.

Improvements:
* We now have full text search in plain text files!
* We now have full text search in HTML files by using the classes
provided by Zend Lucene Search (BSD license)!
* We have limited full text search in PDF files with code from [2]
which lacks a proper license [3] and features a github project [4]
with outdated sources ... meh.
* We could use the nice lucene query language [5] but I implemented it
to be as similar to the current search as possible.
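
One possible way to keep plain-search behaviour on top of a query
language is to escape the characters that carry syntax before handing
the input to the query parser. The sketch below is in Python and is
not taken from the app; it assumes the special characters of the
classic Lucene query syntax, and the function name is hypothetical.

```python
import re

# Characters with special meaning in the classic Lucene query syntax
# (assumed to apply here as well).
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\&|])')

def escape_query_terms(user_input):
    """Turn free-form user input into whitespace-separated literal terms."""
    terms = user_input.split()
    # Prefix every special character with a backslash so it is matched literally.
    escaped = [_SPECIAL.sub(r"\\\1", term) for term in terms]
    return " ".join(escaped)
```

Whether the escaped terms should additionally be combined with wildcards
or boolean operators depends on how closely the old search is to be
mimicked.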

Problems I still need to figure out how to solve:
* The Zend classes for msoffice 2007 files use ZipArchive which
bypasses the OC_Filesystem layer and thus breaks indexing of encrypted
files. @robin any idea?
* Still no support for Open/LibreOffice, ODF, older word, rtf ... do
we want to index sourcecode?
* My ajax background code still locks the browser ... a progressbar on
the status page would be nice. I tried to understand the ajax code
from the gallery and calendar apps and copied some of the code to come
up with something usable. At some point in time it stopped working and
I switched to jquery ajax calls instead of Eventsource ... and I admit
now I'm lost. Furthermore, I would like to start the background
indexing via ajax when a file has been uploaded.
* Can we somehow filter out or overwrite search results from the default search?
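
For the background indexing and the progress bar, one option would be
to index in small batches and report progress after each batch, so
that no single request has to stay open for the whole run. The
following Python sketch only illustrates the idea; index_file is a
hypothetical callback that indexes one path, and the batch size is an
arbitrary assumption.

```python
def index_in_batches(pending, index_file, batch_size=10):
    """Index the paths in `pending` batch by batch.

    index_file: hypothetical callback that indexes a single path.
    Yields (done, total) after each batch, which a status page could
    render as a progress bar.
    """
    total = len(pending)
    done = 0
    for start in range(0, total, batch_size):
        for path in pending[start:start + batch_size]:
            index_file(path)
            done += 1
        yield done, total
```

Each yielded pair could correspond to one short ajax response, with
the client asking for the next batch until done equals total.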

Tedious work:
* store more meta information from getID3 in the index. This would
obsolete the current database based full text search. But then I would
also like to merge the current lucene search status table into the
oc_fscache table. It has only one flag column, anyway.

I tried to document the code and hope everything is well in place and
ready for inclusion in owncloud/master. maybe disabled by default ;)

so long

Jörn

[1] https://gitorious.org/~butonic/owncloud/butonics-owncloud/trees/lucene_search/apps/search_lucene
[2] http://www.hashbangcode.com/blog/zend-lucene-and-pdf-documents-part-2-pdf-data-extraction-437.html
[3] the "Our Philosophy" on http://www.hashbangcode.com/about states
the following: 'All of the code placed onto this site has been tested
to the best of our ability and resources so it should work out of the
box. If you spot any problems then please let us know! You should be
aware the all the code here is "use at your own risk" and we can't
take any responsibility for loss of data or server downtime as a
result of the code on this site.'
[4] http://github.com/philipnorton42/PDFSearch

-- 
A. Because it breaks the logical sequence of discussion
Q. Why is top posting bad?


