KIO directory listing - CPU slows down SSD

Mark Gaiser markg85 at gmail.com
Wed May 14 19:42:56 UTC 2014


On Wed, May 14, 2014 at 3:31 PM, Mark Gaiser <markg85 at gmail.com> wrote:
> On Wed, May 14, 2014 at 12:50 PM, Aaron J. Seigo <aseigo at kde.org> wrote:
>> On Sunday, May 11, 2014 21:57:58 Mark Gaiser wrote:
>>> Note: my recent incremental cleanups already made it about a second
>>> faster than it used to be. It used to be around 5,500 ms. However, the
>>> current speed is still ~7x slower than using raw C++ and Qt. My goal
>>> is to get it to within 2x of raw C++/Qt.
>>
>> i did some benchmarking and profiling of this in the past and threading won't
>> help that much. i did actually have a pretty simple threading model
>> implemented which provided a fairly significant speed-up that was noticeable
>> (e.g. 20%+) in the large (>50k files) folder case, but that was really just
>> masking the problem: it offloaded the reaaaaaally slow part so that i/o could
>> continue, which is classic latency vs throughput. the real solution is to cut
>> latency here, by which i mean get rid of UDSEntry and stop using Qt's slow
>> collection classes for things like kio_file. and because it was offloading a
>> really slow part that came in large chunks (collections of dirents) it didn't
>> parallelize all that nicely: on large dirs i/o would finish quickly and the
>> thread pool would starve as each thread pushed its way through the UDSEntry
>> dance as jobs piled up from the now unencumbered i/o path. the latency of
>> threading killed throughput as a result.
>
> Yeah, i gave up on the parallel approach. I only did one test with it
> (as can be read in my previous message here), just to (stubbornly) try
> it out and observe the slowdown, and because it's fairly simple to do
> with QtConcurrent. It's not worth it in this code path.
>>
>> with kio_file, the reaaaaaaally slow bit, and i mean REAAAAALY slow, as in 90%
>> of wall clock time, is populating and serializing UDSEntrys. in particular,
>> adding fields is very expensive. each field is added one at a time and the
>> mechanism for storing those fields is horrendously slow. honestly, i don't even
>> know why UDSEntry is used for the IOSlave side: it means creating huge numbers
>> of small objects, each of which has its own collection object holding a
>> further inefficient structure (the Field struct in the FieldHash). each of those
>> objects has a private data member that is a QSharedData ...
>
> Yes, i observed the exact same thing.
> - Creating UDSEntry objects is expensive
> - Putting them in a QDataStream is even more expensive
>
> Those two alone account for a good 75% of all time spent while listing a folder.
>
>>
>> UDSEntry makes sense for the client side (though these days i'd rather just
>> have a more efficient private data structure with a QAIM in front of that, but
>> compatibility must be preserved) ... but for the ioslave side, especially ones
>> like kio_file, it's completely needless overhead as a result of an
>> overgeneralization in the code.
>>
>> for kio_file, the ioslave side basically does:
>>
>> * list entries via the protocol (e.g. readdir in the case of kio-file)
>> * turn each entry into a UDSEntry
>> * add the UDSEntry to a cold store
>> * pump each UDSEntry out to the client once the cold store is full
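>>
>> in code, that flow is roughly this (a simplified sketch, not the actual
>> kio_file source; the real slave stat()s each entry and fills in many
>> more fields):
>>
>>     #include <dirent.h>
>>     #include <QFile>
>>     #include <kio/udsentry.h>
>>
>>     void FileProtocol::listDir(const QUrl &url)
>>     {
>>         DIR *dir = opendir(QFile::encodeName(url.toLocalFile()).constData());
>>         if (!dir) {
>>             return; // error reporting omitted in this sketch
>>         }
>>         while (struct dirent *ent = readdir(dir)) {
>>             KIO::UDSEntry entry;
>>             entry.insert(KIO::UDSEntry::UDS_NAME, QFile::decodeName(ent->d_name));
>>             // ... stat() and add UDS_SIZE, UDS_ACCESS, mtime, owner, ...
>>             listEntry(entry, false); // buffered in the cold store
>>         }
>>         listEntry(KIO::UDSEntry(), true); // "ready": pump the batch to the client
>>         closedir(dir);
>>     }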
>>
>> what would make so much more sense imho is having kio_file write directly to a
>> stream ... which is exactly what those UDSEntry objects end up doing anyways!
>> they are created one at a time, and then when there are N of them (or N
>> seconds have passed) they serialize out to a QDataStream. for kio_file, the
>> UDSEntry fields are NEVER changed and they are NEVER queried individually on
>> the ioslave side. they are bulked up and then drained, and for no particular
>> reason other than it lets one re-use UDSEntry and have a symmetry on both
>> sides of the code base.
>>
>> it would be fairly trivial to have a class that has an API very much like
>> UDSEntry but instead has a QByteArray internally that it just writes
>> everything to. the contract with its users would be:
>>
>> * one entry at a time, serialized
>> * no changing data once put there
>> * no getting the data back out
>>
>> the class would then just batch up data into its QByteArray and then shunt it
>> across the wire when needed. no middle men, no 100s of 1000s of objects and
>> QLists and << operators. just an API something like:
>>
>>         startEntry
>>         addField
>>         endEntry
>>
>> internally it could either batch up the fields and then stream them out when
>> endEntry is called, or (and this is what i would do personally) on startEntry
>> it would set up the header but with bad data (e.g. field count of 0) and then
>> start appending the fields as they come in and increment a field count variable.
>> when endEntry is called, update the header with the right field count. this
>> would be much faster, though it would require a strictly serial approach to
>> creating items. all 'safeties', such as "if you set the 'is a symlink' field
>> twice, it gets updated", would be gone, and it would rely much more on the
>> ioslave being well behaved. batching up the fields internally (so the
>> equivalent of one UDSEntry's internals) and then writing them out would be
>> enough to bring back these safeties, but for kio_file i don't think that
>> would *ever* be useful.
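>>
>> a minimal sketch of that class (illustrative: the name, the quint32
>> sizes and the exact wire format are assumptions here; the real thing
>> would have to write whatever the client side expects):
>>
>>     #include <QByteArray>
>>     #include <QDataStream>
>>     #include <QString>
>>
>>     class StreamingEntryWriter
>>     {
>>     public:
>>         void startEntry() {
>>             m_headerPos = m_buffer.size();
>>             m_fieldCount = 0;
>>             QDataStream s(&m_buffer, QIODevice::WriteOnly | QIODevice::Append);
>>             s << quint32(0); // placeholder field count, patched in endEntry()
>>         }
>>         void addField(quint32 uds, const QString &value) { // plus numeric overloads
>>             QDataStream s(&m_buffer, QIODevice::WriteOnly | QIODevice::Append);
>>             s << uds << value;
>>             ++m_fieldCount;
>>         }
>>         void endEntry() {
>>             // patch the header in place with the real field count
>>             QByteArray count;
>>             QDataStream s(&count, QIODevice::WriteOnly);
>>             s << m_fieldCount;
>>             m_buffer.replace(m_headerPos, count.size(), count);
>>         }
>>         QByteArray takeBatch() { // hand the whole batch over for sending
>>             QByteArray batch;
>>             m_buffer.swap(batch);
>>             return batch;
>>         }
>>     private:
>>         QByteArray m_buffer;
>>         int m_headerPos = 0;
>>         quint32 m_fieldCount = 0;
>>     };
>>
>> usage is strictly serial: startEntry / addField... / endEntry per
>> dirent, then ship writer.takeBatch() across the wire every N entries.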
>
> I am very happy with your feedback here :)
>
> The current UDSEntry (as it is in git now) cannot be used directly for
> your suggestion. However, the UDSEntry i have here is already very
> close to your suggestion. The data it stores looks like this:
>
>     struct Field {
>         inline Field() : m_long(0) { }
>         QString m_str;
>         long long m_long;
>     };
>     typedef QHash<uint, Field> FieldHash;
>     FieldHash fields;       // custom fields only; stat fields live below
>     QString m_name;         // UDS_NAME
>     qint64 m_size;          // UDS_SIZE
>     quint64 m_device_id;    // UDS_DEVICE_ID
>     quint64 m_inode;        // UDS_INODE
>     quint32 m_mode;         // UDS_ACCESS
>     qint64 m_mtime;         // UDS_MODIFICATION_TIME
>     qint64 m_atime;         // UDS_ACCESS_TIME
>     quint32 m_user;         // UDS_USER_ID
>     quint32 m_group;        // UDS_GROUP_ID
>
> In other words, i already store the needed stat fields directly in
> UDSEntry. The FieldHash is only used when custom fields are
> injected.
>
> The serializing/deserializing now looks like this (s is the QDataStream):
>
>     s << a.d->m_name.toUtf8().constData();
>     s << a.d->m_size;
>     s << a.d->m_device_id;
>     s << a.d->m_atime;
>     s << a.d->m_mtime;
>     s << a.d->m_mode;
>     s << a.d->m_user;
>     s << a.d->m_group;
>     s << a.d->fields.size();
>
> And if the fields.size() count is > 0, it then injects those custom
> fields into the stream.
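>
> That injection loop is roughly this (a sketch; the exact order of
> key, string and long on the wire is illustrative):
>
>     for (auto it = a.d->fields.constBegin(); it != a.d->fields.constEnd(); ++it) {
>         s << it.key() << it.value().m_str << it.value().m_long;
>     }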
>
> Thus far i've been trying to make UDSEntry more efficient and to put
> stuff into QDataStream as quickly as possible. An approach i didn't
> even think about is leaving UDSEntry out of the kio_file slave
> completely.
>
> All your idea requires is:
> 1. kio_file does the batching internally (not via SlaveBase::listEntry).
> 2. kio_file never calls SlaveBase::listEntry, but instead calls a new
> SlaveBase::listEntries that takes a QByteArray.
> 3. That new SlaveBase::listEntries basically calls
> send(MSG_LIST_ENTRIES, QByteArray) directly. Alternatively, make the
> send function public so kio_file can write its data to the socket
> itself. It's going to be an API change either way: a second
> listEntries function or a public send function. A sketch of the
> first option follows below.
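>
> Assuming the existing private send(int cmd, const QByteArray &data)
> can simply be wrapped:
>
>     // SlaveBase: ship a pre-serialized batch of entries straight to
>     // the client, without building or re-serializing UDSEntry objects
>     void SlaveBase::listEntries(const QByteArray &batch)
>     {
>         send(MSG_LIST_ENTRIES, batch);
>     }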
>
> I'm implementing this approach now. I'm very curious how this will turn out :)
>
>>
>> as for a deterministic order of fields (to take advantage of the static list of
>> QStrings to share QString data trick), that would be preserved, as the order of
>> addField calls in kio_file is fully deterministic: it is the exact same order
>> every time.
>>
>> so while i disagree with some of dfaure's arguments, in particular that we
>> should care about single CPU systems (in practice, they simply do not exist
>> anymore for KDE software; even mobile is nearly all multicore now) and that
>> the intermediate data structures have a lot of overhead (UDSEntry is already
>> the definition of "intermediate data structure" and "overhead"; using readdir_r
>> instead of readdir has exceptionally little impact next to that) ... he is
>> essentially correct: the answer is not threading. it is decreasing the
>> operations. such as by kicking UDSEntry to the curb on the ioslave side.
>>
>> UDSEntry could still use optimization on the client side of the ioslave
>> relationship, of course ...
>>
>> --
>> Aaron J. Seigo

Small update.

I've implemented the "aaron approach" and you were very right!

Listing 500,000 files:
- Before: ~3,500 ms
- After: ~1,500 ms

These are the numbers as they arrive in the application that requests
the data. So yes, this is with socket and multi-process overhead
included.

Note: the "before" is with quite a few optimizations in UDSEntry and
some other code paths that are used while listing a directory. If you
compared this against KIO as it is now in git, the "before" number
would be ~4,500 ms.

Compared to the code from a week ago (some patches are already in
KIO), it would be around ~5,500 ms, which is the KDE 4.x level.

Raw speed is ~700 ms and i want to be - at most - twice as slow while
keeping all of KIO's features. Twice raw is ~1,400 ms and i'm now at
~1,500 ms, so i'm just 100 ms away from my minimal goal :D

