Baloo - Not Indexing everything by default

Martin Steigerwald Martin at lichtvoll.de
Thu Oct 16 12:28:45 UTC 2014


On Thursday, 16 October 2014, 14:20:06, Luca Beltrame wrote:
> On Thursday, 16 October 2014, 14:15:15, Martin Gräßlin wrote:
> > Genome data is really huge. Wouldn't it make more sense to go by file
> > size, or to abort the indexing if it's obviously random gibberish?
> 
> As the person who mentioned this first (hey, I'm famous ;), I'm guessing
> that limiting on file size would work in principle.
> 
> For reference on the sizes, these kinds of files range from tens of MB to a
> few GB. Perhaps a size cutoff would work without giving up on indexing
> everything (which IMO is a nice feature and shouldn't be disabled).

Could the file size limit also work like this:

Index just the first 100 KiB or so of a large file, instead of not indexing it
at all, and have the search results include a hint that the file has only been
partially indexed? Or would that be worse than not indexing it at all?
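
To make the idea concrete, here is a rough sketch in Python. It is not Baloo code and says nothing about how Baloo's extractors actually work; the name read_indexable_prefix and the 100 KiB constant are made up for illustration. It reads at most the first 100 KiB of a file and reports whether anything was left out, so a caller could store a "partially indexed" hint next to the result.

```python
import os

# 100 KiB, the figure floated in this mail; an assumption, not a Baloo setting.
PREFIX_LIMIT = 100 * 1024


def read_indexable_prefix(path, limit=PREFIX_LIMIT):
    """Return (data, partial).

    data    -- at most `limit` bytes from the start of the file
    partial -- True if the file is larger than what was read
    """
    size = os.path.getsize(path)
    with open(path, "rb") as handle:
        data = handle.read(limit)
    return data, size > len(data)
```

An indexer built this way would hand `data` to the usual text extraction and keep `partial` as a flag on the document, so the search UI could show the hint.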

For my file index I currently have:

martin at merkaba:~/.local/share/baloo> LANG=C du -sch file/* | sort -rh
1.2G    total
638M    file/position.DB
250M    file/postlist.DB
160M    file/termlist.DB
103M    file/fileMap.sqlite3
2.5M    file/fileMap.sqlite3-wal
19M     file/record.DB
4.0K    file/termlist.baseB
4.0K    file/termlist.baseA
4.0K    file/record.baseB
4.0K    file/record.baseA
4.0K    file/postlist.baseB
4.0K    file/postlist.baseA
4.0K    file/iamchert
32K     file/fileMap.sqlite3-shm
12K     file/position.baseB
12K     file/position.baseA
0       file/flintlock

That's less than the last Nepomuk index:

martin at merkaba:~/.kde/share/apps/nepomuk/repository/main/data/virtuosobackend> LANG=C du -sch * | sort -rh
3.1G    total
3.1G    soprano-virtuoso.db
2.1M    soprano-virtuoso.log
8.0K    soprano-virtuoso-temp.db
20K     missed_flush.txt
0       soprano-virtuoso.trx
0       soprano-virtuoso.pxa
0       soprano-virtuoso.lock


And as it's still performant, I wouldn't mind if it indexed some nice *.txt or
source files :). Actually, I think I would like to be able to do a full-text
search in these.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


More information about the Plasma-devel mailing list