Baloo - Not Indexing everything by default
Martin Steigerwald
Martin at lichtvoll.de
Thu Oct 16 12:28:45 UTC 2014
On Thursday, 16 October 2014, 14:20:06, Luca Beltrame wrote:
> On Thursday, 16 October 2014, 14:15:15, Martin Gräßlin wrote:
> > Genome data is really huge. Wouldn't it make more sense to go by file
> > size, or to abort the indexing if it's obviously random gibberish?
>
> As the person who mentioned this first (hey, I'm famous ;), I'm guessing
> that limiting on file size would work in principle.
>
> For reference on the sizes, these kinds of files range from tens of M to a
> few G. Perhaps a size cutoff would work without giving up on indexing
> everything (which IMO is a nice feature and shouldn't be disabled).
Could limiting on file size also be done like this:
Just index the first, say, 100 KiB or so of a file instead of not indexing it
at all? And in the search results, perhaps include a hint that the file has
only been partially indexed? Or would that be worse than not indexing it at all?
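To make the idea concrete, here is a minimal sketch in Python. It is not how Baloo's extractor works; the names `MAX_INDEXED_BYTES`, `ExtractedText` and `extract_prefix` are made up for illustration, and it assumes plain UTF-8 text files. It reads only the first 100 KiB and remembers whether the file was longer, so that a search result could carry a "partially indexed" hint.

```python
from dataclasses import dataclass
from pathlib import Path

# Hypothetical cutoff, taken from the "100 KiB or so" figure above.
MAX_INDEXED_BYTES = 100 * 1024


@dataclass
class ExtractedText:
    text: str
    partially_indexed: bool  # True when the file was longer than the cutoff


def extract_prefix(path: Path, limit: int = MAX_INDEXED_BYTES) -> ExtractedText:
    """Read at most `limit` bytes of a plain-text file for indexing."""
    with path.open("rb") as handle:
        # Ask for one extra byte: if we get it, more content follows the cutoff.
        data = handle.read(limit + 1)
    partial = len(data) > limit
    # Drop the extra byte; errors="ignore" discards a multi-byte character
    # that was cut in half at the boundary (and any other invalid bytes).
    text = data[:limit].decode("utf-8", errors="ignore")
    return ExtractedText(text=text, partially_indexed=partial)
```

An indexer built along these lines would pass `text` to the full-text index and store `partially_indexed` next to the file's entry, so the search front end can show the hint. Whether a truncated index is more useful than none at all is exactly the open question.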
For my file index I currently have:
martin at merkaba:~/.local/share/baloo> LANG=C du -sch file/* | sort -rh
1.2G total
638M file/position.DB
250M file/postlist.DB
160M file/termlist.DB
103M file/fileMap.sqlite3
2.5M file/fileMap.sqlite3-wal
19M file/record.DB
4.0K file/termlist.baseB
4.0K file/termlist.baseA
4.0K file/record.baseB
4.0K file/record.baseA
4.0K file/postlist.baseB
4.0K file/postlist.baseA
4.0K file/iamchert
32K file/fileMap.sqlite3-shm
12K file/position.baseB
12K file/position.baseA
0 file/flintlock
That's less than the last Nepomuk index:
martin at merkaba:~/.kde/share/apps/nepomuk/repository/main/data/virtuosobackend> LANG=C du -sch * | sort -rh
3.1G total
3.1G soprano-virtuoso.db
2.1M soprano-virtuoso.log
8.0K soprano-virtuoso-temp.db
20K missed_flush.txt
0 soprano-virtuoso.trx
0 soprano-virtuoso.pxa
0 soprano-virtuoso.lock
And as it's still performant, I wouldn't care if it indexed some nice *.txt or
source files :). Actually, I think I would like to be able to do full-text
searches in these.
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7