[Kde-pim] Character encoding handling in KMail/KMime

Wed Dec 7 20:50:33 GMT 2011

On Tuesday 29 November 2011, Andras Mantia wrote:
> Ingo Klöcker wrote:
> > On Monday 28 November 2011, Andras Mantia wrote:
> >> Hi,
> >> 
> >> over the weekend I fixed a problem introduced with the HTML reply
> >> branch, where the characters in the reply were wrongly encoded.
> >> While trying to find my way around the code and see where the
> >> strings get broken, I realized that what we do there with the
> >> character encodings is just insane. The messages are encoded and
> >> decoded often (and here I refer both to charset encoding and to
> >> string encoding for transfer), even though the message does not
> >> leave KMail (or its libraries, like composer, templateparser)
> >> code. This of course also includes a lot of QString->QByteArray
> >> and QByteArray->QString conversions. In my opinion there should
> >> be exactly two conversions: when the message is read into KMime
> >> and when it is written out from KMime (whether to disk, sent over
> >> the network, or some other way). For everything else it should
> >> just be Unicode in a QString.
> >> 
> >> Of course this is an intrusive change and KDE 5 (or whatever
> >> comes next) material. What do you think of it? Does it even make
> >> sense to go in this direction? Any better ideas?
> > 
> > This is difficult to answer because it depends on what the
> > representation of the message is used for. Of course, it makes
> > sense to reduce the conversions as much as possible, but I don't
> > think that using a Unicode representation of the whole message
> > makes sense. Do you mean the message text?
> 
> I was mainly referring to the message body, but user-editable headers
> like from/to/subject also have the same issue.
> 
> > Actually, I think the message as a whole should never be serialized
> > as Unicode string because it is never needed in this form.
> > (Correct me if I'm wrong.) The message should either be in KMime
> > or it should be serialized into a QByteArray (for storage on disk
> > and sending). Apart from this only the body and the header content
> > of individual message parts should be converted to Unicode for
> > display, composing and other tasks where Unicode is really needed.
> 
> I think inside the app those parts should be Unicode. Here is what
> happens right now when you click on a message to view and reply to
> it:
> - the message viewer gets a KMime::Message (that has a QByteArray
> string and the encoding in the header). The body is converted to a
> Unicode string before it is displayed, based on the encoding in the
> message or the override encoding specified in the settings.
> - when replying, the KMime::Message is passed together with a
> selection to the messagefactory to create the reply message. The
> message is in the original encoding, the selection is in Unicode
> (it comes from the viewer).
> - the message is passed to the template parser. This creates an
> object tree parser that, as in the case of the viewer, converts the
> message to Unicode. It then creates the reply message content as a
> KMime::Content, where the reply (Unicode) string is converted to
> the charset selected as the default for the composer and put into
> the KMime::Content.
> - if "force original charset" is used, the message is converted
> back to Unicode, the original charset is applied, and the result is
> saved back into the message.
> - the message is passed to the composer window: this creates an OTP
> that again converts the message to Unicode for displaying.
> - I cannot find right now where this happens, but at some point the
> text from the editor (which is a QString AFAIK) is put back into
> the KMime::Message in an encoded form and finally sent through the
> network.

Yeah. That's mostly as it was in KMail1.

> This might make sense (individual parts always get a KMime::Message
> that is a real representation of an email, so it is not always
> unicode), but I find it to be:
> - suboptimal (too many conversions)
> - fragile. The encoding can go wrong in any place and it is hard to
> find out where this happened. The last bug was that the template
> parser saved back the data as unicode with a non-unicode encoding
> header. Then furthermore when this was converted to unicode, a
> double conversion was performed.

Well, the advantage is that there is a well-defined interface between 
KMail's different components. This interface is KMime.
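
To make the failure mode in the quoted text concrete, here is a minimal Python sketch of a mislabeled charset followed by a second conversion. It is not KMime code and not a reconstruction of the actual bug: the function names, the sample string, and the choice of UTF-8 and ISO-8859-1 are assumptions made only for illustration.

```python
# Hypothetical illustration of a "double conversion"; not KMime code.
# Assumption: the data was really UTF-8, but the header claimed ISO-8859-1.

def mislabel_and_double_encode(text, real="utf-8", claimed="iso-8859-1"):
    """Return (misread text, re-saved bytes) for a mislabeled body."""
    stored = text.encode(real)        # bytes written with the real charset
    misread = stored.decode(claimed)  # reader trusts the wrong header
    resaved = misread.encode(real)    # second conversion on top of the first
    return misread, resaved


def repair(resaved, real="utf-8", claimed="iso-8859-1"):
    """Undo the two conversions, which works only if both charsets are known."""
    return resaved.decode(real).encode(claimed).decode(real)


if __name__ == "__main__":
    misread, resaved = mislabel_and_double_encode("Klöcker")
    print(misread)          # the garbled form a user would see
    print(resaved)          # the bytes that end up being stored or sent
    print(repair(resaved))  # recoverable here, since nothing was lost
```

In this particular pairing the damage is reversible because ISO-8859-1 maps every byte to a character. With other charset pairs the misread step can fail or lose data, which is part of why such bugs are hard to trace.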

> My rough idea is that KMime itself stores every string as Unicode
> and applies the encoding only through some special methods, like:
> - setBodyFromBytearray() - applies the encoding in the header and
> stores the result internally as Unicode
> - decodedMessage() (or something named like that) - returns a
> QByteArray in the right encoding, as specified in the header. This
> would return the whole assembled message.
> - we could have similar convenience methods to get the
> body/headers/message in the original encoding, similar to how we
> now have asUnicodeString() and as7BitString().
> 
> So the difference is in how KMime stores the message internally.
> 
> I hope it is now clearer what my idea is. Do you still think this
> is bad?

Yes. It will require KMime to treat text/* parts differently from non-
text parts. I think this will make KMime unnecessarily complex. 
Moreover, this would require the content to be decoded each time the 
KMime-structure of a message is created. This could easily become a more 
serious performance problem than the current one.
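
The performance concern can be sketched as follows. This is a hypothetical Python model of the "store everything as Unicode" proposal, not KMime's API: the class `UnicodeBody`, its methods (loosely named after the proposed setBodyFromBytearray()), and the `decode_count` counter are all invented for the demonstration.

```python
# Hypothetical model of a text part that stores its body as Unicode.
# None of these names exist in KMime; they only mirror the proposal.

class UnicodeBody:
    def __init__(self, charset):
        self.charset = charset   # charset named in the message header
        self._text = ""
        self.decode_count = 0    # instrumentation for this demo only

    def set_body_from_bytes(self, raw):
        # Decoding happens eagerly, every time a message is loaded,
        # whether or not anyone ever looks at the text.
        self._text = raw.decode(self.charset)
        self.decode_count += 1

    def text(self):
        return self._text

    def encoded_body(self):
        # Encoding is applied only when bytes are needed for output.
        return self._text.encode(self.charset)


def load_folder(raw_bodies, charset):
    """Build one UnicodeBody per raw body, as a folder listing might."""
    parts = []
    for raw in raw_bodies:
        part = UnicodeBody(charset)
        part.set_body_from_bytes(raw)
        parts.append(part)
    return parts


if __name__ == "__main__":
    raws = [s.encode("iso-8859-1") for s in ("Grüße", "Hallo", "Tschüß")]
    parts = load_folder(raws, "iso-8859-1")
    # Every body was decoded although none was displayed.
    print(sum(p.decode_count for p in parts))
```

The sketch covers only text bodies. It says nothing about how non-text parts would be handled, which is the other objection raised above.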

OTOH, maybe it can be solved without making KMime (the API) much more 
complex by making all entities available in an encoded variant and a 
decoded variant both implementing the same interface. Or maybe both 
variants are better hidden transparently behind the same interface using 
composition. KMime could transparently switch between both internal 
variants and keep both or only the most recently used variant in memory 
depending on the general memory consumption. This is similar to your 
idea, but avoids any unnecessary conversions because all conversions 
would be done lazily on demand (as it's done now) but with additional 
internal caching of the conversion results. The advantages of such a 
solution are that the API stays as it is (so no using code needs to be 
changed) and that it can be implemented separately for each entity.
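
As a rough illustration of this brainstormed design, here is a minimal Python sketch of one entity that hides an encoded and a decoded variant behind a single interface, converts lazily, and caches the result. All names (`LazyTextPart`, `trim`, the `conversions` counter) are hypothetical, and the memory policy is reduced to an explicit call rather than anything driven by real memory consumption.

```python
# Hypothetical sketch of lazy conversion with caching; not KMime code.

class LazyTextPart:
    def __init__(self, charset, encoded=None, decoded=None):
        if encoded is None and decoded is None:
            raise ValueError("need an encoded or a decoded variant")
        self.charset = charset
        self._encoded = encoded   # bytes, or None if not materialized
        self._decoded = decoded   # str, or None if not materialized
        self.conversions = 0      # instrumentation for this demo only

    @classmethod
    def from_bytes(cls, raw, charset):
        return cls(charset, encoded=raw)

    @classmethod
    def from_text(cls, text, charset):
        return cls(charset, decoded=text)

    def decoded(self):
        # Convert on first demand only, then serve from the cache.
        if self._decoded is None:
            self._decoded = self._encoded.decode(self.charset)
            self.conversions += 1
        return self._decoded

    def encoded(self):
        if self._encoded is None:
            self._encoded = self._decoded.encode(self.charset)
            self.conversions += 1
        return self._encoded

    def set_text(self, text):
        # Editing invalidates the encoded variant; it is rebuilt lazily.
        self._decoded = text
        self._encoded = None

    def trim(self, keep):
        """Keep only one variant in memory; keep is 'encoded' or 'decoded'."""
        if keep == "encoded":
            self.encoded()
            self._decoded = None
        elif keep == "decoded":
            self.decoded()
            self._encoded = None
        else:
            raise ValueError("keep must be 'encoded' or 'decoded'")


if __name__ == "__main__":
    part = LazyTextPart.from_bytes("Grüße".encode("iso-8859-1"), "iso-8859-1")
    part.decoded()
    part.decoded()           # second call is served from the cache
    part.encoded()           # original bytes are still cached
    print(part.conversions)  # only the first decoded() call converted
```

Callers see only `decoded()` and `encoded()`, so code using such an entity would not need to know which variant is currently held. Whether this maps cleanly onto KMime's real class hierarchy is exactly the open question noted below.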

All of this is just brainstorming. I didn't have a closer look at KMime 
to see whether this is a feasible approach.

Regards,
Ingo