<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">On 23.08.2013 12:32, Stefan Vollmar
wrote:<br>
</div>
<blockquote
cite="mid:00032CC5-E312-4239-9AA0-6B683D207C34@nf.mpg.de"
type="cite">
<pre wrap="">Dear Jörn,
On 22.08.2013, at 23:49, Jörn Friedrich Dreyer wrote:
</pre>
<blockquote type="cite">
<pre wrap="">The warnings about pdf and word are from the getid3 lib and can be ignored if you are using search lucene. It comes with special indexers for these filetypes.
The error about not being able to determine the file format for txt files is also from getid3 and might be caused by empty txt files.
</pre>
</blockquote>
<pre wrap="">
Can we get rid of the error messages?</pre>
</blockquote>
Setting the log level to error (3) should stop them from being logged.<br>
<br>
<blockquote
cite="mid:00032CC5-E312-4239-9AA0-6B683D207C34@nf.mpg.de"
type="cite">
<blockquote type="cite">
<pre wrap="">Can you check if the reported txt file has 0 bytes? Can you search for a text in the pdf or word files and see if you get any results?
</pre>
</blockquote>
<pre wrap="">
The text file is not empty. We have manually scheduled a re-scan of all files and this might be the reason that now *some* search terms yield results with that txt-file, we also have hits inside the PDF file. So, in principle, search_lucene does seem to do something. Is there a way to monitor what lucene is doing exactly and whether it has already indexed a particular file at all?</pre>
</blockquote>
search_lucene tracks the indexing status in the oc_lucene_status
table. There is no UI yet, so you will have to join the table to the
oc_filecache table to get meaningful information. Setting the log level
to debug (0) and tailing the owncloud.log file with a grep on
"search_lucene" will give you only search_lucene related output.<br>
<br>
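The join mentioned above could look roughly like the following. This is a
minimal sketch and not the app's own tooling: it runs against an in-memory
SQLite database so it is self-contained, and the column names (fileid, path,
status) as well as the status value 'I' are assumptions that should be
checked against the actual schema of your installation.<br>
<br>
<pre wrap="">
```python
import sqlite3

# In-memory stand-in for the ownCloud database.
# Assumption: both tables share a "fileid" column; the real schema may differ.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE oc_filecache (fileid INTEGER PRIMARY KEY, path TEXT);
    CREATE TABLE oc_lucene_status (fileid INTEGER PRIMARY KEY, status TEXT);
    INSERT INTO oc_filecache VALUES (1, 'files/readme.txt');
    INSERT INTO oc_filecache VALUES (2, 'files/Serverraum.pdf');
    -- 'I' is a made-up status value used only for illustration.
    INSERT INTO oc_lucene_status VALUES (1, 'I');
    """
)

# LEFT JOIN keeps files that have no status row yet, so they stay visible.
rows = conn.execute(
    """
    SELECT fc.path, ls.status
    FROM oc_filecache AS fc
    LEFT JOIN oc_lucene_status AS ls ON ls.fileid = fc.fileid
    ORDER BY fc.fileid
    """
).fetchall()

for path, status in rows:
    print(path, status if status is not None else "no status row")
```
</pre>
Files without a row in the status table show up with an empty status, which
is one way to spot files the indexer has not touched so far.<br>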
<blockquote
cite="mid:00032CC5-E312-4239-9AA0-6B683D207C34@nf.mpg.de"
type="cite">
<pre wrap="">However, simple matching of file names (which should be much simpler and is really helpful if you have a nested directory structure with many files) is not nearly as good as it could be: it requires the full "readme" before "readme.txt" is offered as a hit, likewise all characters of "tourismus" before "tourismus.jpg" turns up as a potential hit.
Likewise "Serverraum" finds "Serverraum" in a PDF, however "server" or "raum" triggers nothing. I will not say that this is useless, but it does not compare favorably with either the Google or the Spotlight search engine - is this maybe something that is configurable?</pre>
</blockquote>
Yes, this is something we have not yet decided how to handle. Lucene search
uses the Lucene query language instead of the simple 'LIKE
"%&lt;term&gt;%"', which is too expensive for most systems. You
should be able to use '&lt;term&gt;*' and also '*&lt;term&gt;*' in
the search field when you want to find partial matches. This also
allows for more complex searches, but since the app is currently
marked as experimental, this really is something we need performance
comparisons and usage reports for *hint* *hint*<br>
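<br>
To illustrate why "server" alone finds nothing while a wildcard query does,
here is a small sketch of whole-term versus wildcard matching. It uses
Python's fnmatch module as a stand-in and is not how Lucene evaluates
queries internally; the list of lower-cased index terms is hypothetical and
simply reuses the examples from this thread.<br>
<br>
<pre wrap="">
```python
from fnmatch import fnmatchcase

def matching_terms(pattern, terms):
    """Return the terms matching a whole-term pattern with optional * wildcards."""
    lowered = pattern.lower()
    return [term for term in terms if fnmatchcase(term, lowered)]

# Hypothetical lower-cased index terms, reusing the examples from this thread.
terms = ["readme.txt", "tourismus.jpg", "serverraum"]

print(matching_terms("server", terms))    # no wildcard: the whole term must match
print(matching_terms("server*", terms))   # trailing wildcard: prefix match
print(matching_terms("*raum*", terms))    # wildcards on both sides: substring match
print(matching_terms("readme*", terms))   # prefix match on a file name
```
</pre>
Without a wildcard only a complete term counts as a hit, which matches the
behaviour described above for "readme" and "Serverraum".<br>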
<br>
You can get back the simple search term behaviour of the stock
search by uncommenting
<a
href="https://github.com/owncloud/apps/blob/master/search_lucene/lib/lucene.php#L196">https://github.com/owncloud/apps/blob/master/search_lucene/lib/lucene.php#L196</a>
but please bear in mind that PHP has to load the whole index on
every request, which might take a while. We still need to
investigate how to optimize this for large indexes.<br>
<br>
so long<br>
<br>
Jörn<br>
<br>
<pre class="moz-signature" cols="72">--
Jörn Friedrich Dreyer (<a class="moz-txt-link-abbreviated" href="mailto:jfd@owncloud.com">jfd@owncloud.com</a>)
Senior Software Engineer
ownCloud GmbH
Your Data, Your Cloud, Your Way!
ownCloud GmbH, GF: Markus Rex, Holger Dyroff
Schloßäckerstrasse 26a, 90443 Nürnberg, HRB 28050 (AG Nürnberg)
</pre>
</body>
</html>