Scrap baloo?

Wed Sep 14 21:29:22 UTC 2016

Hi,

first, please read this excerpt from my mail to the maintainer thread:

<snip>

Hi,

after looking a bit more at the code, I think there are ATM a lot of things that need fixing:

1) 32-bit systems: I see no fix; with more than 1GB of index, baloo and all applications using baloo fail

  see bugs like https://bugs.kde.org/show_bug.cgi?id=356114 here we have the 5GB limit, which is now raised
  for 64-bit, but not for 32-bit

2) Larger filesystems: unfortunately someone decided to ignore the upper 32 bits of the inodes

/**
 * Convert the QT_STATBUF into a 64 bit unique identifier for the file.
 * This identifier is combination of the device id and inode number.
 */
inline quint64 statBufToId(const QT_STATBUF& stBuf)
{
    // We're losing 32 bits of info, so this could potentially break
    // on file systems with really large inode and device ids
    return devIdAndInodeToId(static_cast<quint32>(stBuf.st_dev),
                             static_cast<quint32>(stBuf.st_ino));
}

=> random breakage e.g. on my NFS drive here as the IDs clash and all invariants no longer hold.
(e.g. something can be a file but in addition a directory, ....)

3) No error handling of most lmdb faults (like already mentioned)

4) No error handling for any data corruption: e.g. many places will just loop endlessly or malloc, like
  DocumentUrlDB::get(quint64 docId) (we have bugs for that)

5) lmdb locking issues: crash one read-write process => all other things stall (or crash because of 3+4)

6) No resource management nor crash handling for the baloo_file_extractor which either OOMs you or corrupts the database on crash leading to 5)

CC'd Vishesh, perhaps I am wrong about these issues and misunderstand the code; unfortunately e.g. the database
structure is not that well documented, unless I simply failed to find the correct docs in git.

</snip>

Now the executive summary, after one more day of looking at the code.

1) 32-bit systems: will never be usable, thanks to lmdb, at least not with non-trivial index sizes

2) network file system homes: will never be usable, thanks to lmdb (ask its author: http://lmdb.tech/doc/ "Do not use LMDB databases on remote filesystems, even between processes on the same host. This breaks flock() on some OSes, possibly memory map sync, and certainly sync between programs on different hosts.")

3) close to no error handling in the code => see the crash reports, I cleaned up a bit, but they are piling up
  https://bugs.kde.org/reports.cgi?product=frameworks-baloo&output=show_chart&datasets=CONFIRMED&datasets=ASSIGNED&datasets=REOPENED&datasets=UNCONFIRMED&datasets=RESOLVED&banner=1

4) fundamental problems like: wrong data structure for index (32-bit inodes in the 21st century?) and close to zero docs on what it does internally
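
To make the inode problem from 4) concrete, here is a minimal sketch of how the truncation causes ID clashes. It is not the baloo code: packTruncated() and idsCollide() are hypothetical names, and the layout (device id in the upper 32 bits, inode in the lower 32 bits) is only my assumption about what devIdAndInodeToId() does.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for statBufToId()/devIdAndInodeToId().
// ASSUMPTION: device id goes into the upper 32 bits and the inode into
// the lower 32 bits; the real layout is not shown in this mail.
constexpr std::uint64_t packTruncated(std::uint64_t dev, std::uint64_t ino)
{
    // Same casts as in the quoted code: the upper 32 bits of both values
    // are silently thrown away here.
    const std::uint32_t dev32 = static_cast<std::uint32_t>(dev);
    const std::uint32_t ino32 = static_cast<std::uint32_t>(ino);
    return (static_cast<std::uint64_t>(dev32) << 32) | ino32;
}

// Returns true if two *different* (device, inode) pairs end up with the
// same 64-bit document id after truncation.
constexpr bool idsCollide(std::uint64_t dev1, std::uint64_t ino1,
                          std::uint64_t dev2, std::uint64_t ino2)
{
    const bool distinct = (dev1 != dev2) || (ino1 != ino2);
    return distinct
        && packTruncated(dev1, ino1) == packTruncated(dev2, ino2);
}

// Small runtime check of the example discussed below.
inline void checkCollisionExample()
{
    // Same device, inodes differ only above bit 31: the ids clash.
    assert(idsCollide(1, 0x100000005ULL, 1, 0x5ULL));
    // Inodes that differ in the lower 32 bits stay distinct.
    assert(!idsCollide(1, 5, 1, 6));
}
```

With that layout, two files on the same device with inodes 0x100000005 and 0x5 get the same id. If one is a file and the other a directory, you get exactly the kind of broken invariant described in 2) of the quoted mail.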

Proposal:

Scrap baloo_file* and Co. and just reimplement the public API (modulo the settings for the then non-existing indexer daemon)
to use tracker.

Benefits:

1) Tracker is maintained: https://github.com/GNOME/tracker/graphs/contributors
2) We share the index with GNOME/* and save double indexing on "many" Linux systems which are not plain KDE Plasma Desktop based
3) We can delete 99% of the code (the question is whether we can also remove the very buggy extractors from KFileMetaData at some point afterwards).

=> Opinions?

Greetings
Christoph

-- 
----------------------------- Dr.-Ing. Christoph Cullmann ---------
AbsInt Angewandte Informatik GmbH      Email: cullmann at AbsInt.com
Science Park 1                         Tel:   +49-681-38360-22
66123 Saarbrücken                      Fax:   +49-681-38360-20
GERMANY                                WWW:   http://www.AbsInt.com
--------------------------------------------------------------------
Geschäftsführung: Dr.-Ing. Christian Ferdinand
Eingetragen im Handelsregister des Amtsgerichts Saarbrücken, HRB 11234