KIO directory listing - CPU slows down SSD

Wed May 14 13:31:08 UTC 2014

On Wed, May 14, 2014 at 12:50 PM, Aaron J. Seigo <aseigo at kde.org> wrote:
> On Sunday, May 11, 2014 21:57:58 Mark Gaiser wrote:
>> Note: my recent incremental cleanups already made it about a second
>> faster then it used to be. It used to be around 5.500ms. However, the
>> current speed is still ~7x slower then using raw C++ and Qt. My goal
>> is to get it within 2x slower then raw C++/Qt.
>
> i did some benchmarking and profiling of this in the past and threading won't
> help that much. i did actually have a pretty simple threading model
> implemented which provided a fairly significantly speed up that was noticeable
> (e.g. 20%+) in the large (>50k files) folder case, but that was really just
> masking the problem: it offloaded the reaaaaaally slow part so that i/o could
> continue, which is classic latency vs throughput. the real solution is to cut
> latency here, by which i mean get rid of UDSEntry and stop using Qt's slow
> collection classes for things like kio_file. and because it was offoading a
> really slow part that came in large chunks (collections of dirents) it didn't
> parallelize all that nicely: on large dirs i/o would finish quickly and the
> thread pool would starve as each thread pushed its way through the UDSEntry
> dance as jobs piled up from the now unencumbered i/o path. the latency of
> threading killed throughput as a result.

Yeah, i gave up on the parallel approach. I only did one test with it
(as can be read in my previous message here) just to (stubbornly) try
it out and observe the slowdown and because it's fairly simple with
QtConcurrent. It's not worth it in this code path.
>
> with kio_file, the reaaaaaaally slow bit, and i mean REAAAAALY slow, as in 90%
> of wall clock time, is populating and serializing UDSEntrys. in particular,
> adding fields is very expensive. each field is added one at a time and the
> mechanism for storing those fields is horrendously slow. honestly, i don't even
> know why UDSEntry is used for the IOSlave side: it means creating huge numbers
> of small objects which each have their own collection objects holding a
> further inefficient structure (the Field struct in the FieldHash). each of those
> objects has a private data member that is a QSharedData ...

Yes, i observed the exact same thing.
- Creating UDSEntry objects is expensive
- Putting them in a QDataStream is even more expensive

Those two alone cause certainly 75% of all time spend while listing a folder.

>
> UDSEntry makes sense for the client side (though these days i'd rather just
> have a more efficient private data structure with a QAIM in front of that, but
> compatibility must be preserved) ... but for the ioslave side, especially ones
> like kio_file, it's completely needless overhead as a result of an
> overgeneralization in the code.
>
> for kio_file, the ioslave side basically does:
>
> * list entries via the protocol (e.g. readdir in the case of kio-file)
> * turn each entry into a UDSEntry
> * add the UDSEntry to a cold store
> * pump each UDSEntry out to the client once the cold store is full
>
> what would make so much more sense imho is having kio_file write directly to a
> stream ... which is exactly what those UDSEntry objects end up doing anyways!
> they are created one at a time, and then when there are N of them (or N
> seconds have passed) they serialize out to a QDataStream. for kio_file, the
> UDSEntry fields are NEVER changed and they are NEVER queried individually on
> the ioslave side. they are bulked up and then drained, and for no particular
> reason other than it lets one re-use UDSEntry and have a symmetry on both
> sides of the code base.
>
> it would be fairly trivial to have a class that has an API very much like
> UDSEntry but instead has a QByteArray internally that it just writes
> everything to. the contract with its users would be:
>
> * one entry at a time, serialized
> * no changing data once put there
> * no getting the data back out
>
> the class would then just batch up data into its QByteArray and then shunt it
> across the wire when needed. no middle men, no 100s of 1000s of objects and
> QLists and << operators. just an API sth like:
>
>         startEntry
>         addField
>         endEntry
>
> internally it could either batch up the fields and then stream them out when
> endEntry is called, or (and this is what i would do personally) on startEntry
> it would set up the header but with bad data (e.g. field count of 0) and then
> start appending the fields as they come in and increment a field count variable.
> when endEntry is called, update the header with the right field count. this
> would be much faster, though it would require a strictly serial approach to
> creating items. all 'safeties' such as "if you set the 'is a symlink' field
> twice, it gets updated" and would rely much more on the ioslave being well
> behaved. batching up the fields internally (so the equivalent of one UDSEntry's
> internals) and then writing them out would be enough to return these safeties,
> but for kio_file i don't think that would *ever* be useful.

I am very happy with your feedback here :)

The current UDSEntry (as it is on git now) cannot be used directly for
your suggestion. However, the UDSEntry i have here is already very
close to your suggestion. The data it stores looks like this:

    struct Field {
        inline Field() : m_long(0) { }
        QString m_str;
        long long m_long;
    };
    typedef QHash<uint, Field> FieldHash;
    FieldHash fields;
    QString m_name;         // UDS_NAME
    qint64 m_size;          // UDS_SIZE
    quint64 m_device_id;    // UDS_DEVICE_ID
    quint64 m_inode;        // UDS_INODE
    quint32 m_mode;         // UDS_ACCESS
    qint64 m_mtime;         // UDS_MODIFICATION_TIME
    qint64 m_atime;         // UDS_ACCESS_TIME
    quint32 m_user;         // UDS_USER_ID
    quint32 m_group;        // UDS_GROUP_ID

In other words, i already store the needed stat fields directly in
UDSEntry. The fieldhash is only being used when custom fields are
injected.

The serializing/unserializing now looks like this:

s = QDataStream

    s << a.d->m_name.toUtf8().constData();
    s << a.d->m_size;
    s << a.d->m_device_id;
    s << a.d->m_atime;
    s << a.d->m_mtime;
    s << a.d->m_mode;
    s << a.d->m_user;
    s << a.d->m_group;
    s << a.d->fields.size();

And if the fiels.size() count is > 0 then it starts injecting those
custom fields to the stream.

Thus far i've been trying to make UDSEntry more efficient and trying
to put stuff into QDataStream as quickly as possible. An approach i
didn't even think about is leaving UDSEntry out of it completely from
the kio_file slave.

All your idea requires is:
1. kio_file to do the batching internally (not via SlaveBase::listEntry)
2. kio_file to never call SlaveBase::listEntry but call
SlaveBase::listEntries instead with a QByteArray
3. A new SlaveBase::listEntries which takes a QByteArray and basically
calls send(MSG_LIST_ENTRIES, QByteArray) directly. Or make the send
function public in which case kio_file can directly send it's data to
the socket.. It's going to be an API change either way. Or a second
listEntries function or a public send function.

I'm implementing this approach now. I'm very curious how this will turn out :)

>
> as for determinate order of fields (to take advantage of the static list of
> QStrings to share QString data trick), that would be preserved as the order of
> addField calls in kio_file is fully deterministic: it is the exact same order
> every time.
>
> so while i disagree with some of dfaure's arguments, in particular that we
> should care about single CPU systems (in practice , they simply do not exist
> anymore for KDE software; even mobile is nearly all multicore now) and that
> the intermediate data structures have a lot of overhead (UDSEntry is already
> the definition of "intermediate data structure" and "overhead"; using readdir_r
> instead of readdir has exceptionally little impact next to that) ... he is
> essentially correct: the answer is not threading. it is decreasing the
> operations. such as by kicking UDSEntry to the curb on the ioslave side.
>
> UDSEntry could still use optimization on the client side of ioslave
> relationship, of course ...
>
> --
> Aaron J. Seigo
> _______________________________________________
> Kde-frameworks-devel mailing list
> Kde-frameworks-devel at kde.org
> https://mail.kde.org/mailman/listinfo/kde-frameworks-devel
>