[Owncloud] ownCloud 5.0.10 - lucene fails to index .txt files?

Jörn Friedrich Dreyer jfd at owncloud.com
Mon Aug 26 14:28:02 UTC 2013


On 23.08.2013 12:32, Stefan Vollmar wrote:
> Dear Jörn,
>
> On 22.08.2013, at 23:49, Jörn Friedrich Dreyer wrote:
>
>> The warnings about pdf and word are from getid3 lib and can be ignored if you are using search lucene. It comes with special indexers for these filetypes.
>>
>> The error about not being able to determine the file format for txt files is also from getid3 and might be caused by empty txt files.
> Can we get rid of the error messages?
Setting the log level to error (3) should stop logging them.

>> Can you check if the reported txt file has 0 bytes? Can you search for a text in the pdf or word files and see if you get any results?
> The text file is not empty. We have manually scheduled a re-scan of all files and this might be the reason that now *some* search terms yield results with that txt-file, we also have hits inside the PDF file. So, in principle, search_lucene does seem to do something. Is there a way to monitor what lucene is doing exactly and whether it has already indexed a particular file at all?
search_lucene tracks the indexing status in the oc_lucene_status table.
There is no UI yet, so you will have to join that table to the oc_filecache
table to get meaningful information. Setting the log level to debug (0) and
tailing the owncloud.log file with a grep on "search_lucene" will give
you only the search_lucene related output.
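
The log filtering described above can be sketched in Python as a
stand-in for the grep. This is a minimal illustration, not part of the
app: the function name and the sample log entries are made up, and the
only thing it relies on is that the relevant lines contain the string
"search_lucene".

```python
from typing import Iterable, Iterator


def lucene_lines(lines: Iterable[str], needle: str = "search_lucene") -> Iterator[str]:
    """Yield only the log lines that mention `needle`, like `grep needle`."""
    for line in lines:
        if needle in line:
            yield line.rstrip("\n")


# Hypothetical sample entries; real owncloud.log content will differ.
sample = [
    '{"app":"getid3","message":"could not determine file format","level":2}\n',
    '{"app":"search_lucene","message":"indexing file 42","level":0}\n',
    '{"app":"core","message":"login","level":1}\n',
]

matches = list(lucene_lines(sample))
for entry in matches:
    print(entry)
```

To run it against a real log, pass an open file object instead of the
sample list; the location of owncloud.log depends on your installation.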

> However, simple matching of file names (which should be much simpler and is really helpful if you have a nested directory structure with many files) is not nearly as good as it could be: it required the full "readme" before "readme.txt" is offered as a hit, likewise all characters of "tourismus" before "tourismus.jpg" turns up as a potential hit.
>
> Likewise "Serverraum" finds "Serverraum" in a PDF, however "server" or "raum" triggers nothing. I will not say that this is useless, but it does not compare favorably with either the Google or the Spotlight search engine - is this maybe something that is configurable?
Yes, this is something we have not yet decided how to handle. search_lucene
uses the Lucene query language instead of the simple 'LIKE "%<term>%"', which
is too expensive for most systems. You should be able to use '<term>*'
and also '*<term>*' in the search field when you want to find partial
matches. This also allows for more complex searches, but since the app
is currently marked as experimental, this really is stuff we need
performance comparisons and usage reports for *hint* *hint*
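
To illustrate why "server" finds nothing while 'server*' or '*raum*'
should, here is a rough Python sketch of the two matching styles. It
uses Python's fnmatch as a stand-in for wildcard term matching and a
substring test as a stand-in for 'LIKE "%<term>%"'. It is not how
Lucene evaluates queries (tokenization and the index are ignored), and
both function names are made up for this example.

```python
from fnmatch import fnmatchcase


def term_matches(query: str, term: str) -> bool:
    """Approximate a single-term query: '*' matches any run of characters.

    Without a wildcard the query has to equal the whole term.
    Both sides are lowercased to keep the illustration case-insensitive.
    Note that fnmatch also gives '?' and '[...]' a special meaning.
    """
    return fnmatchcase(term.lower(), query.lower())


def like_matches(query: str, term: str) -> bool:
    """Approximate SQL LIKE '%<query>%': a plain substring search."""
    return query.lower() in term.lower()


queries = ("Serverraum", "server", "server*", "*raum*")
results = {q: term_matches(q, "Serverraum") for q in queries}
print(results)
```

In this sketch only the bare "server" query misses the term, whereas
like_matches("server", "Serverraum") would hit, which mirrors the
difference between the two search behaviours described above.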

You can get back the simple search term behaviour of the stock search by
uncommenting
https://github.com/owncloud/apps/blob/master/search_lucene/lib/lucene.php#L196
but please bear in mind that PHP has to load the whole index on every
request, which might take a while. We still need to investigate how
to optimize this for large indexes.

so long

Jörn

-- 
Jörn Friedrich Dreyer (jfd at owncloud.com)
Senior Software Engineer
ownCloud GmbH

Your Data, Your Cloud, Your Way!

ownCloud GmbH, GF: Markus Rex, Holger Dyroff
Schloßäckerstrasse 26a, 90443 Nürnberg, HRB 28050 (AG Nürnberg)
