Baloo - Not Indexing everything by default

Thu Oct 16 11:39:27 UTC 2014

On Thursday, 16 October 2014, 13:20:57, Vishesh Handa wrote:
> Hey guys

Hi Vishesh,

> While Baloo performs better than Nepomuk, it does have its share of
> problems - mostly large text files, and high IO usage. Additionally, users
> on Linux often seem to have the craziest files. Currently, we do not index
> plain text files which do not have a `.txt` extension, because otherwise we
> land up indexing genome data and other strange files. (Actual bugs)

How about limiting the size for problematic files, i.e. indexing only smaller
text files? Here Baloo runs quite well. But I'd like it to also index *.txt
files.
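
To illustrate the size limit idea, here is a minimal sketch in Python. It is
not Baloo's actual code: the function name `should_index_text` and the 10 MiB
default are assumptions made up for this example, since no concrete threshold
has been proposed in this thread.

```python
import os

# Hypothetical cap; no concrete number has been proposed in this thread.
MAX_TEXT_BYTES = 10 * 1024 * 1024


def should_index_text(path, max_bytes=MAX_TEXT_BYTES):
    """Return True if a plain text file is small enough to be indexed."""
    try:
        return os.path.getsize(path) <= max_bytes
    except OSError:
        # File vanished or is unreadable: do not hand it to the indexer.
        return False
```

With a check like this, huge plain text files such as genome data would be
skipped because of their size, regardless of their extension.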

Anything else that can be done to make it more efficient? In my experience
it's already a lot more efficient than Nepomuk. It indexed a lot of text files
here, about a million or more. My mails, that is :).

> I've been thinking about actually disabling the file indexing by default.
> However, that might be too radical. Instead, we could only index -
> 
> * $HOME - Not including any subfolders.
> * Desktop, Documents, Videos, Pictures and Music. All of these are xdg user
> directories.
> 
> Gnome Tracker actually does something quite similar.

Hmmm, I actually don't use these, except for an images folder. I store my
files in categories / directories I want. I usually don't sort by file type,
but by purpose. Okay, I have an images folder, but mostly for Digikam, and
music and audio meditations I already have split into two main directories.
Thus for me the above structure just doesn't fit.

> Comments?

I'd rather like Baloo to be *intelligent* about errors, i.e.:

If an indexer fails on a file, skip it next time. Optionally, at some point
present a list of the files it failed to index to the user, maybe via a
non-intrusive summary notification at the end of an indexing cycle, and report
each failed file just once in it.
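
Roughly what I have in mind, as a minimal Python sketch. The class
`FailureLog` and its method names are hypothetical and not part of Baloo.
Retrying a file once its modification time has changed is an extra assumption
of mine, and a real indexer would have to persist this state across runs.

```python
class FailureLog:
    """Remember files an indexer failed on and report each one only once."""

    def __init__(self):
        self._failed = {}       # path -> mtime at the time of the failure
        self._reported = set()  # paths already shown to the user

    def record_failure(self, path, mtime):
        self._failed[path] = mtime

    def should_try(self, path, mtime):
        # Skip a file that failed before, unless it changed since then
        # (assumption: a modified file deserves another attempt).
        return self._failed.get(path) != mtime

    def take_unreported(self):
        # Failed files not yet reported, for one summary notification.
        new = sorted(p for p in self._failed if p not in self._reported)
        self._reported.update(new)
        return new
```

At the end of an indexing cycle the indexer would call `take_unreported()`
and show the result in a single notification. A second call returns an empty
list until new failures are recorded.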

Extra points for offering to report a bug with the file. But that is a bit
difficult, because it may well be a private file the user does not want to
share.

Actually I'd also like to have advanced configuration options. On my Debian
the settings are very simplistic: I can just say where not to search. No
extension list, no file size restrictions, no nothing. I think this could help
users who have problems with extra large text files.

But… I think advanced error handling, i.e. not trying again and again and
again on a file that is known to fail, might be able to remove the need for
further configuration options.

I'd like to scan it for text files and source files though. Just probably with
some delay… to avoid I/O load during git checkout or compile runs. Right now
I do not seem to be able to set anything. I'd also like to see what file types
it actually indexes. I wonder whether it indexes OpenDocument files, for
example, or PDF files. Judging from my files, it seems to find less than
Nepomuk. OK, but PDF it seems to find.
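
The delay could work roughly like this, again as a minimal Python sketch and
not how Baloo does it. `DelayedQueue`, `quiet_seconds` and the method names
are made up for the example. The caller passes the current time in seconds,
e.g. from `time.monotonic()`.

```python
class DelayedQueue:
    """Hold back changed files until they have been quiet for a while."""

    def __init__(self, quiet_seconds):
        self.quiet_seconds = quiet_seconds
        self._last_change = {}  # path -> time of the most recent change

    def touch(self, path, now):
        # Each further change restarts the waiting period for that file.
        self._last_change[path] = now

    def pop_ready(self, now):
        # Return and forget files unchanged for at least quiet_seconds.
        ready = sorted(
            p for p, t in self._last_change.items()
            if now - t >= self.quiet_seconds
        )
        for p in ready:
            del self._last_change[p]
        return ready
```

Files that are rewritten over and over during a git checkout or a compile run
would then be indexed only once, after things have settled down.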

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7