IndexedString vs. implicit sharing

Thu Oct 18 08:57:50 UTC 2012

On Thursday 18 October 2012 09:12:45 Andreas Pakulat wrote:
> Hi,
> 
> On Wed, Oct 17, 2012 at 9:46 PM, Milian Wolff <mail at milianw.de> wrote:
> > On Wednesday 17 October 2012 20:52:22 Andreas Pakulat wrote:
> >> > But couldn't we use
> >> > QString's in there directly?
> >> 
> >> That would mean keeping all the bits/bytes for a QString in memory,
> >> which was the whole reason for dropping their use.
> > 
> > Right. I forgot that kind of. Probably makes the implementation of what I
> > want even more complicated or even impossible. Let me give an example
> > here, replace> 
> > QString with KUrl to get the same problem:
> >       QString qString;
> >       qString.fill('.', 1000);
> >       KDevelop::IndexedString indexedString(qString);
> >       const int repeat = 10000;
> >       QVector<QString> strings;
> >       strings.resize(repeat);
> >       for(int i = 0; i < repeat; ++i) {
> >       
> >         strings[i] = indexedString.str();
> >         QCOMPARE(qString, strings[i]);
> >       
> >       }
> > 
> > What do you expect that does memory wise? Well, it will hog up more and
> > more memory here. That's the Use-Case I'd like to optimize if possible.
> So the implicit sharing of QString fails here? Do you happen to know
> why, since I don't understand how the implicit sharing works all that
> well I'm just wondering wether there's a way to make it work in this
> situation...

Implicit sharing only works if you copy one QString (or any other implicitly 
shared data) to another. I.e. this shares memory (replace QShared with any 
implicitly shared data type):

QShared orig = ...;
QShared shared1 = orig;
QShared shared2(orig);
QShared shared3(shared1);
...

but that does not, assuming rawData is not also a QShared:
QShared orig(rawData);
QShared copy1(rawData);

And that is exatly what is being done here. The conversion code in 
IndexedString uses QString::fromUtf8(const char*, int) to construct a QString. 
To leverage implicit sharing here, one needs to keep e.g. a 
QHash<IndexedString, QString> around and do something like this:

QString sharedValue(const IndexedString& index, const QString& value) {
  static QHash<IndexedString, QString> cache;
  auto it = cache.constFind(index);
  if (it != cache.constEnd()) {
    return it.value();
  }
  cache.insert(index, value);
  return value;
}

But, as you noted in your last email, this brings us a whole new problem:

When should we clean up the cache? There is afaiks no way to get notified that 
the ref counter of the implicitly shared QString data dropped to 1 (i.e. only 
referenced in the QHash) which would be the correct time to remove it there as 
well... If we don't do that, the cache will increase over time and we probably 
end up with a much higher memory consumption.

*But* if I look at IndexedStringData I wonder whether that could be leveraged 
somehow... Right now it looks as if it stores a refCount and the length of the 
string. The actual string data is stored just after that data, see e.g. 
IndexedStringRepositoryItemRequest::itemSize(). Now if there would be a way to 
construct a QString out of this data and reuse that, but destroy it as soon as 
the refcount drops... that would be optimal. But as far as I can see, we can't 
just add a QString data member to IndexedStringData, as we cannot/must not 
serialize that (it's internals are hidden from us after all)... But well, this 
is more or less just guessing on my side. Maybe I misunderstand the 
ItemRepository magic. But considering there is code like this in the 
ItemRequest:

  static void destroy(IndexedStringData* item, 
KDevelop::AbstractItemRepository&) {
    Q_UNUSED(item);
    //Nothing to do here (The object is not intelligent)
  }

I assume we could do something intelligent here, no?

> Also does this happen in the codebase without being able to change the
> relevant code? I mean adding the same indexedstring's QString content
> to a list multiple times looks like somethings wrong there...

Of course its not as extreme as in the example shown above. But what did 
happen (I fixed it differently, by replacing KUrl with IndexedString) was that 
we essentially duplicated the KUrl's of every project file. That was already 
quite some data (>200mb for my tests). I wouldn't find it too suprising to see 
that we do that at other places as well. I'll continue to investigate. For 
now, the "simple" solution is to use IndexedString in place of QString/KUrl 
wherever possible.

But that would never solve the case of having e.g. two QAbstractItemModels 
that operate on the same list of IndexedStrings. As soon as you plug them into 
a view and return something like IndexedString::str() in your model's ::data() 
function, you'll have the string twice in memory, not once...

> >> > Maybe that would require us to make IndexedString
> >> > only operate on QString's but I'd be all for that anyways to increase
> >> > type
> >> > safety. But well, food for though...
> >> 
> >> What kind of type-safety do you gain by replacing IndexedString with
> >> QString?
> > 
> > IndexedString has .str() .c_str() .byteArray() and finally .toUrl() -
> > which
> > one is the correct one?
> > 
> > DUContext::url() -> should always be a KUrl
> > Declaration::identifier() -> should alays be a QString
> > 
> > That's what I meant with "type-safety" - probably the wrong word.
> 
> Ok, then rename IndexString to IndexObject, make the constructor
> private and drop those 4 functions from it. Create 4 subclasses:
> IndexedUrl, IndexedQString, IndexedByteArray and IndexedCString which
> each have one of those 4 functions in their public API. Then change
> all the code to use the suitable Index* class instead of
> IndexedString.

Yes that is exactly what I had in mind.

> Seeing the c-string and bytearray conversion immediately triggers my
> encoding-alarm, how does IndexedString decide which encoding to use
> and are all callers of these two function aware of the encoding being
> used?

It's not encoding aware at all - it asumes UTF-8 everywhere. The callers must 
take care to use the proper function, and the byte array is just there as an 
optimization at one place (forgot again where). Since we usually only operate 
on UTF-8 data this is all fine. Non-UTF-8 data will break our code in many 
places, as there are multiple code paths in the DUChain that implicitly assume 
UTF-8.
-- 
Milian Wolff
mail at milianw.de
http://milianw.de
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 198 bytes
Desc: This is a digitally signed message part.
URL: <http://mail.kde.org/pipermail/kdevelop-devel/attachments/20121018/e9dccdd1/attachment.sig>