IndexedString vs. implicit sharing

Thu Oct 18 10:57:55 UTC 2012

Hi,

On Thu, Oct 18, 2012 at 10:57 AM, Milian Wolff <mail at milianw.de> wrote:
> On Thursday 18 October 2012 09:12:45 Andreas Pakulat wrote:
>> On Wed, Oct 17, 2012 at 9:46 PM, Milian Wolff <mail at milianw.de> wrote:
>> > On Wednesday 17 October 2012 20:52:22 Andreas Pakulat wrote:
>> >> > But couldn't we use
>> >> > QString's in there directly?
>> >>
>> >> That would mean keeping all the bits/bytes for a QString in memory,
>> >> which was the whole reason for dropping their use.
>> >
>> > Right. I forgot that kind of. Probably makes the implementation of what I
>> > want even more complicated or even impossible. Let me give an example
>> > here, replace>
>> > QString with KUrl to get the same problem:
>> >       QString qString;
>> >       qString.fill('.', 1000);
>> >       KDevelop::IndexedString indexedString(qString);
>> >       const int repeat = 10000;
>> >       QVector<QString> strings;
>> >       strings.resize(repeat);
>> >       for(int i = 0; i < repeat; ++i) {
>> >
>> >         strings[i] = indexedString.str();
>> >         QCOMPARE(qString, strings[i]);
>> >
>> >       }
>> >
>> > What do you expect that does memory wise? Well, it will hog up more and
>> > more memory here. That's the Use-Case I'd like to optimize if possible.
>> So the implicit sharing of QString fails here? Do you happen to know
>> why, since I don't understand how the implicit sharing works all that
>> well I'm just wondering wether there's a way to make it work in this
>> situation...
>
> Implicit sharing only works if you copy one QString (or any other implicitly
> shared data) to another. I.e. this shares memory (replace QShared with any
> implicitly shared data type):

Heh, somehow I always thought it would magically work even if two
QStrings are create out of nowhere :)

> *But* if I look at IndexedStringData I wonder whether that could be leveraged
> somehow... Right now it looks as if it stores a refCount and the length of the
> string. The actual string data is stored just after that data, see e.g.
> IndexedStringRepositoryItemRequest::itemSize(). Now if there would be a way to
> construct a QString out of this data and reuse that, but destroy it as soon as
> the refcount drops... that would be optimal. But as far as I can see, we can't
> just add a QString data member to IndexedStringData, as we cannot/must not
> serialize that (it's internals are hidden from us after all)... But well, this
> is more or less just guessing on my side. Maybe I misunderstand the
> ItemRepository magic. But considering there is code like this in the
> ItemRequest:

If you have a place that does the refCount'ing in a central place,
that same thing can be used to store a static map of
IndexedString/QString, i.e. something like

// Use QHash if you have a proper efficient hash function for IndexedString
static QHash<IndexedString,QString>& indexedStringCache() {
    static QHash<IndexedString,QString>* cachedData = new
QMap<IndexedString,QString>();
    return *cachedData;
}

Then wherever the refCounting happens do something like this, when the
refCount is 0:
indexedStringCache().remove(*this);

and in the .str() function do
QHash<IndexedString,QString>::iterator it = indexedStringCache().find( *this );
if( it == indexedStringCache().end()  ) {
  it = indexedStringCache().insert(*this, createQString() );
}
return *it;

As long as you can be sure that the destroy function is always
properly triggered when the indexedstring is not needed anymore this
should work. And of course this only works if all the relevant code is
in the same .cpp file. I think the runtime overhead of the contains
and the additional lookup as well as the cleanup is acceptable, given
that constructing the QString is rather time-consuming and the
hash-lookup should be very efficient.

>> Also does this happen in the codebase without being able to change the
>> relevant code? I mean adding the same indexedstring's QString content
>> to a list multiple times looks like somethings wrong there...
>
> Of course its not as extreme as in the example shown above. But what did
> happen (I fixed it differently, by replacing KUrl with IndexedString) was that
> we essentially duplicated the KUrl's of every project file. That was already
> quite some data (>200mb for my tests). I wouldn't find it too suprising to see
> that we do that at other places as well. I'll continue to investigate. For
> now, the "simple" solution is to use IndexedString in place of QString/KUrl
> wherever possible.

But is this memory kept over a longer time? If so the question is why
both the indexedstring as well as the kurl's are being kept around all
the time and thus changing code to rather use IndexedString and only
fetch the url when needed is "good".

The models are a problem anyway in this scenario and the only way to
solve that would be by having a custom delegate which can "render"
IndexedString's (and making it possible to transport them through
QVariant). On the other hand with todays itemviews only the visible
items should have a QString copy of the IndexedString kept in memory
for rendering (or at least I hope).

Of course creating and deleting all these objects may still increase
memory consumption, because fragmentation might occur. So another
thing to investigate is wether kdevelop could benefit from a custom
memory allocator that gets memory in bigger chunks...

>> Seeing the c-string and bytearray conversion immediately triggers my
>> encoding-alarm, how does IndexedString decide which encoding to use
>> and are all callers of these two function aware of the encoding being
>> used?
>
> It's not encoding aware at all - it asumes UTF-8 everywhere. The callers must
> take care to use the proper function, and the byte array is just there as an
> optimization at one place (forgot again where). Since we usually only operate
> on UTF-8 data this is all fine. Non-UTF-8 data will break our code in many
> places, as there are multiple code paths in the DUChain that implicitly assume
> UTF-8.

Thats actually fine as long as all callers of the C-String/ByteArray
API are aware they're getting UTF-8 (aka this should be crystal clear
from the api docs), they can do whatever conversion they need to do
themselves.

Andreas