[Kde-pim] Character encoding handling in KMail/KMime

Tue Nov 29 06:59:47 GMT 2011

Ingo Klöcker wrote:

> On Monday 28 November 2011, Andras Mantia wrote:
>> Hi,
>> 
>>  on the weekend I fixed a problem introduced with the HTML reply
>> branch, where the characters in the reply were wrongly encoded.
>> While trying to find my way around the code and see where the strings
>> get broken I realized that it is just insane what we do there with the
>> character encodings. The messages are encoded and decoded often (and
>> now I refer both to charset encoding and string encoding for
>> transfer), while the message does not leave KMail (or its libraries,
>> like composer, templateparser) code. This also includes of course a
>> lot of QString->QByteArray and QByteArray->QString conversions. In my
>> opinion there should be exactly two conversions: when the message is
>> read into KMime and when it is written out from KMime (either to
>> disk, sent over the network or in some other way). For all the rest it
>> should be just Unicode in a QString.
>>  Of course this is an intrusive change and KDE 5 (or whatever comes
>> next) material. What do you think of it? Does it even make sense to
>> go in this direction? Any better ideas?
> 
> This is difficult to answer because it depends on what the
> representation of the message is used for. Of course, it makes sense to
> reduce the conversions as much as possible, but I don't think that using
> a Unicode representation of the whole message makes sense. Do you mean
> the message text?

I was mainly referring to the message body, but user-editable headers like 
from/to/subject also have the same issue.

> Actually, I think the message as a whole should never be serialized as
> Unicode string because it is never needed in this form. (Correct me if
> I'm wrong.) The message should either be in KMime or it should be
> serialized into a QByteArray (for storage on disk and sending). Apart
> from this only the body and the header content of individual message
> parts should be converted to Unicode for display, composing and other
> tasks where Unicode is really needed.

I think inside the app those parts should be Unicode. Here is what happens 
right now when you click on a message to view it and then reply to it:
- the message viewer gets a KMime::Message (which holds a QByteArray string 
and the encoding in the header). The body is converted to a Unicode string 
before it is displayed, based on the encoding in the message or the override 
encoding specified in the settings.
- when replying, the KMime::Message is passed together with a selection to 
the messagefactory to create the reply message. The message is in the 
original encoding, the selection is in unicode (comes from the viewer).
- the message is passed to the template parser. This creates an object tree 
parser (OTP) which, as in the viewer, converts the message to Unicode. It 
then creates the reply message content as a KMime::Content, where the reply 
(Unicode) string is converted to the charset selected as the default for 
the composer and put into the KMime::Content.
- if the "force original charset" option is used, the message is converted 
back to Unicode, the original charset is applied, and the result is saved 
back into the message.
- the message is passed to the composer window: this creates an OTP that 
again converts the message to Unicode for display.
- I cannot find right now where this happens, but at one point the text from 
the editor (which is a QString AFAIK) is put back to the KMime::Message in 
an encoded form and finally sent through the network.
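
To make the number of conversions in the list above concrete, here is a 
rough Python sketch. It is not KMail code: the function name 
reply_round_trips and the charsets are made up for illustration, and the 
reply text is kept identical to the original body to keep it short. It only 
counts the charset conversions the body goes through.

```python
def reply_round_trips(raw_body: bytes, original_charset: str,
                      composer_charset: str, force_original: bool):
    """Mimic the steps listed above; return (bytes sent, conversion count)."""
    conversions = 0

    # Viewer: QByteArray-like bytes -> Unicode for display.
    raw_body.decode(original_charset)
    conversions += 1

    # Template parser: its own object tree parser decodes the message again.
    text = raw_body.decode(original_charset)
    conversions += 1

    # Template parser: reply text is encoded with the composer default charset.
    reply = text.encode(composer_charset)
    current_charset = composer_charset
    conversions += 1

    # Optional "force original charset": decode again, re-encode as original.
    if force_original:
        text = reply.decode(current_charset)
        conversions += 1
        reply = text.encode(original_charset)
        current_charset = original_charset
        conversions += 1

    # Composer window: another object tree parser decodes for the editor.
    text = reply.decode(current_charset)
    conversions += 1

    # Sending: the editor text is encoded one last time.
    sent = text.encode(current_charset)
    conversions += 1

    return sent, conversions
```

In this simplified model the body is converted five times without the 
"force original charset" option and seven times with it, although the text 
never leaves the application before the final step.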

This might make sense (individual parts always get a KMime::Message that is 
a real representation of an email, so it is not always unicode), but I find 
it to be:
- suboptimal (too many conversions)
- fragile. The encoding can go wrong in any place and it is hard to find out 
where this happened. The last bug was that the template parser saved the 
data back as Unicode, but with a non-Unicode encoding header. On top of 
that, when this was converted to Unicode again, a double conversion was 
performed.
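
A small Python sketch of how this kind of mismatch shows up. This is only 
an illustration under assumptions: the charsets (UTF-8 data labelled as 
ISO-8859-1) and the sample text are chosen by me, not taken from the actual 
bug.

```python
# The body text as the user typed it.
text = "Gr\u00fcn"  # "Grün"

# Bug, step 1: the data is stored as Unicode (here: UTF-8 bytes) ...
stored = text.encode("utf-8")
# ... while the header still claims a non-Unicode charset.
header_charset = "iso-8859-1"

# Bug, step 2: a later stage trusts the header and decodes with it.
garbled = stored.decode(header_charset)

# Converting once more "to be safe" bakes the damage in: these bytes are
# now valid UTF-8 for the wrong characters, so no decoder will complain.
double_encoded = garbled.encode("utf-8")

# The damage is only reversible if you know exactly which steps happened.
repaired = double_encoded.decode("utf-8").encode(header_charset).decode("utf-8")
```

Here garbled is "GrÃ¼n" rather than "Grün", and nothing in the 
pipeline raises an error, which is what makes this class of bug hard to 
track down.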

My rough idea is that KMime itself stores every string as Unicode and 
applies the encoding only through some special methods, like:
- setBodyFromBytearray() - decodes the bytes using the encoding in the 
header and stores the result internally as Unicode
- decodedMessage() (or something named like that) - returns a QByteArray in 
the right encoding, as specified in the header. This would return the whole 
assembled message.
- we could have similar convenience methods to get the body/headers/message 
in the original encoding, similar to the asUnicodeString() and 
as7BitString() we have now.
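
To show the shape of that idea, here is a minimal Python sketch. It is not 
a proposal for the real KMime API: the class UnicodeContent and its method 
names are hypothetical stand-ins for the methods listed above, and it 
covers only the body, not headers or message assembly.

```python
from typing import Optional


class UnicodeContent:
    """Hypothetical content object that keeps its body as Unicode internally."""

    def __init__(self, charset: str):
        # Charset named in the header; used only at the byte boundaries.
        self.charset = charset
        # Internal storage is always a Unicode string.
        self.body = ""

    def set_body_from_bytes(self, raw: bytes) -> None:
        """Conversion on the way in: decode once, using the header charset."""
        self.body = raw.decode(self.charset)

    def encoded_body(self, charset: Optional[str] = None) -> bytes:
        """Conversion on the way out: encode for disk or network."""
        return self.body.encode(charset or self.charset)


# Viewer, template parser and composer would all work on content.body
# directly, with no conversions in between.
content = UnicodeContent("iso-8859-1")
content.set_body_from_bytes(b"Gr\xfcn")
content.body = "> " + content.body
wire_bytes = content.encoded_body()
```

With this layout the charset is looked at in exactly two places, on input 
and on output, so a wrong charset header can no longer disagree with the 
data somewhere in the middle of the pipeline.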

So the difference is in how KMime stores the message internally. 

I hope my idea is clearer now. Do you still think this is bad?

Andras