[Owncloud] Initial work on full text search with Lucene Search app, requesting feedback

Robin Appelman icewind at owncloud.com
Tue Apr 24 21:35:18 UTC 2012


The main problem with using eventsource is that it keeps a connection open 
with the server. Most servers don't really like that and will block the next 
connection until the eventsource connection is closed. This causes the 
interface to lock up when another ajax call is fired from somewhere.
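
One possible client-side workaround is to hold back other ajax calls until
the eventsource connection has been closed. What follows is only a minimal
sketch of that idea, not code from ownCloud: the StreamGate class and its
method names are hypothetical, and it assumes the page can tell when the
stream opens and closes.

```typescript
// Hypothetical helper, not part of ownCloud: defers ajax-style tasks
// while a long-lived eventsource connection is open.
type Task<T> = () => Promise<T>;

class StreamGate {
  private open = false;
  private waiting: Array<() => void> = [];

  // Call right after the eventsource connection has been opened.
  streamOpened(): void {
    this.open = true;
  }

  // Call after the eventsource connection has been closed;
  // starts every task that was queued in the meantime, in order.
  streamClosed(): void {
    this.open = false;
    const queued = this.waiting;
    this.waiting = [];
    for (const release of queued) release();
  }

  // Start the task now if no stream is open, otherwise queue it
  // until streamClosed() is called.
  run<T>(task: Task<T>): Promise<T> {
    if (!this.open) return task();
    return new Promise<T>((resolve, reject) => {
      this.waiting.push(() => task().then(resolve, reject));
    });
  }
}
```

In a browser one would call streamOpened() after creating the EventSource
and streamClosed() after calling its close() method, and route the other
ajax calls through run(). This only avoids firing requests that would be
blocked anyway; it does not make the server handle them in parallel.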

On Tuesday 24 April 2012 23:07:42 Jörn Friedrich Dreyer wrote:
> Hi everyone,
> 
> I took the day off to polish the initial work on the Lucene Search app
> that was born during the sweeet owncloud developer meeting.
> 
> Before issuing a merge request I would like some more feedback on the
> integration with the web frontend. Especially async ajax calls, as I
> seem to be doing something wrong: my browser is frozen while receiving
> an OC_Eventsource stream.
> 
> Let me give you a rundown of the current state and the hacks still in use:
> 
> After checking out the app from [1] it will automagically reindex your
> files (even encrypted files) upon a page reload. There is the first
> hack: currently I synchronize an indexer state table with the
> oc_fscache table on every web page reload (@klaas webdav access
> bypasses the indexing for speed, but requires marking changed files as
> dirty). Upon a page reload I meant to use an ajax call to run the
> indexing in the background while the connection is open. This somehow
> locks my browser, so I'm doing it wrong.
> 
> Nevertheless, it is happily building an index which will be used to
> present new search results. File deletion is also handled correctly
> and cleans up the lucene index now.
> 
> Improvements:
> * We now have full text search in plain text files!
> * We now have full text search in HTML files by using the classes
> provided by Zend Lucene Search (BSD license)!
> * We have limited full text search in PDF files with code from [2]
> which lacks a proper license [3] and features a github project [4]
> with outdated sources ... meh.
> * We could use the nice lucene query language [5] but I implemented it
> to be as similar to the current search as possible.
> 
> Problems I still need to figure out how to solve:
> * The Zend classes for msoffice 2007 files use ZipArchive which
> bypasses the OC_Filesystem layer and thus breaks indexing of encrypted
> files. @robin any idea?
> * Still no support for Open/LibreOffice, ODF, older word, rtf ... do
> we want to index sourcecode?
> * My ajax background code still locks the browser ... a progressbar on
> the status page would be nice. I tried to understand the ajax code
> from the gallery and calendar apps and copied some of the code to come
> up with something usable. At some point in time it stopped working and
> I switched to jquery ajax calls instead of Eventsource ... and I admit
> now I'm lost. Furthermore, I would like to start the background
> indexing via ajax when a file has been uploaded.
> * Can we somehow filter out or overwrite search results from the default
> search?
> 
> Tedious work:
> * store more meta information from getID3 in the index. This would
> obsolete the current database based full text search. But then I would
> also like to merge the current lucene search status table into the
> oc_fscache table. It has only one flag column, anyway.
> 
> I tried to document the code and hope everything is well in place and
> ready for inclusion in owncloud/master. maybe disabled by default ;)
> 
> so long
> 
> Jörn
> 
> [1] https://gitorious.org/~butonic/owncloud/butonics-owncloud/trees/lucene_search/apps/search_lucene
> [2] http://www.hashbangcode.com/blog/zend-lucene-and-pdf-documents-part-2-pdf-data-extraction-437.html
> [3] the "Our Philosophy" on http://www.hashbangcode.com/about states the
> following: 'All of the code placed onto this site has been tested to the
> best of our ability and resources so it should work out of the box. If you
> spot any problems then please let us know! You should be aware the all the
> code here is "use at your own risk" and we can't take any responsibility
> for loss of data or server downtime as a result of the code on this site.'
> [4] http://github.com/philipnorton42/PDFSearch

 - Robin Appelman
