[Kde-pim] Character encoding handling in KMail/KMime
Andras Mantia
amantia at kde.org
Tue Nov 29 06:59:47 GMT 2011
Ingo Klöcker wrote:
> On Monday 28 November 2011, Andras Mantia wrote:
>> Hi,
>>
>> on the weekend I fixed a problem introduced with the HTML reply
>> branch, where the characters in the reply were wrongly encoded.
>> While working through the code to see where the strings get broken,
>> I realized that what we do there with the character encodings is
>> just insane. The messages are encoded and decoded often (and here I
>> mean both charset encoding and the encoding of strings for
>> transfer), even though the message never leaves KMail code (or its
>> libraries, like composer and templateparser). This of course also
>> includes a lot of QString->QByteArray and QByteArray->QString
>> conversions. In my opinion there should be exactly two conversions:
>> when the message is read into KMime and when it is written out from
>> KMime (to disk, sent over the network, or some other way). For
>> everything else it should just be Unicode in a QString.
>> Of course this is an intrusive change and KDE 5 (or whatever comes
>> next) material. What do you think of it? Does it even make sense to
>> go in this direction? Any better ideas?
>
> This is difficult to answer because it depends on what the
> representation of the message is used for. Of course, it makes sense to
> reduce the conversions as much as possible, but I don't think that using
> a Unicode representation of the whole message makes sense. Do you mean
> the message text?
I was mainly referring to the message body, but user-editable headers like
From/To/Subject have the same issue.
> Actually, I think the message as a whole should never be serialized as
> Unicode string because it is never needed in this form. (Correct me if
> I'm wrong.) The message should either be in KMime or it should be
> serialized into a QByteArray (for storage on disk and sending). Apart
> from this only the body and the header content of individual message
> parts should be converted to Unicode for display, composing and other
> tasks where Unicode is really needed.
I think inside the app those parts should be Unicode. Here is what happens
right now when you click on a message to view it and then reply to it:
- the message viewer gets a KMime::Message (which holds a QByteArray string,
with the encoding given in the header). The body is converted to a Unicode
string before it is displayed, based on the encoding in the message or the
override encoding specified in the settings.
- when replying, the KMime::Message is passed together with a selection to
the messagefactory to create the reply message. The message is in the
original encoding; the selection is in Unicode (it comes from the viewer).
- the message is passed to the template parser. This creates an object tree
parser (OTP) that, as in the viewer, converts the message to Unicode. It
then creates the reply message content as a KMime::Content, where the reply
(Unicode) string is converted to the charset selected as the default for
the composer and put into the KMime::Content.
- if the "force original charset" option is used, the message is converted
back to Unicode, the original charset is applied, and the result is saved
back into the message.
- the message is passed to the composer window: this creates an OTP that
again converts the message to Unicode for display.
- I cannot find right now where this happens, but at some point the text from
the editor (which is a QString AFAIK) is put back into the KMime::Message in
an encoded form and finally sent over the network.
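The steps above can be traced with a small sketch. Python is used here only for illustration: its str/bytes types stand in for QString/QByteArray, the two charsets are assumed values, and the stage names are labels for the steps in the list, not real KMail functions.

```python
conversions = []  # log of every charset conversion performed


def to_unicode(raw: bytes, charset: str, stage: str) -> str:
    """Decode bytes to text and record the conversion."""
    conversions.append(f"{stage}: decode {charset}")
    return raw.decode(charset)


def to_bytes(text: str, charset: str, stage: str) -> bytes:
    """Encode text to bytes and record the conversion."""
    conversions.append(f"{stage}: encode {charset}")
    return text.encode(charset)


original_charset = "iso-8859-1"  # charset of the incoming message (assumed)
composer_charset = "utf-8"       # composer default charset (assumed)
incoming = "Grüße".encode(original_charset)

# Viewer: decode for display.
shown = to_unicode(incoming, original_charset, "viewer")
# Template parser: decode again, build the reply, encode to composer default.
quoted = to_unicode(incoming, original_charset, "template parser")
reply = to_bytes("> " + quoted, composer_charset, "template parser")
# "Force original charset": back to Unicode, then re-encode.
reply = to_bytes(
    to_unicode(reply, composer_charset, "force charset"),
    original_charset,
    "force charset",
)
# Composer: decode for editing, then encode for sending.
edited = to_unicode(reply, original_charset, "composer")
outgoing = to_bytes(edited, original_charset, "send")

for entry in conversions:
    print(entry)
```

Under these assumptions the log holds seven conversions for a single view-and-reply cycle, although the text never leaves the process in between.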
This might make sense (each individual component always gets a KMime::Message
that is a real representation of an email, so it is not always Unicode), but
I find it to be:
- suboptimal (too many conversions)
- fragile. The encoding can go wrong in any place and it is hard to find out
where this happened. The last bug was that the template parser saved the
data back as Unicode with a non-Unicode encoding header. Then, when this
was converted to Unicode again, a double conversion was performed.
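To make that failure mode concrete, here is a minimal sketch of a double conversion. It is not the KMail code: Python's codecs stand in for the QString/QByteArray handling, and the charset name is an assumed example of a "non-Unicode encoding header".

```python
original = "Klöcker"

# Step 1: the text is written out as UTF-8 bytes...
stored_bytes = original.encode("utf-8")
# ...but the header still declares a non-Unicode charset (assumed value).
declared_charset = "iso-8859-1"

# Step 2: a later stage trusts the header and decodes with the wrong charset.
misread = stored_bytes.decode(declared_charset)

# Step 3: the mangled text is encoded again, so it is now converted twice.
resent_bytes = misread.encode("utf-8")

print(repr(original))      # the text as the user typed it
print(repr(misread))       # what the next stage believes the text is
print(resent_bytes)        # bytes that would go out on the wire
```

Each step looks valid in isolation, and no exception is raised anywhere, which is why this kind of bug is hard to localize once several stages each decode and re-encode.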
My rough idea is that KMime itself stores every string as Unicode and applies
the encoding only through some special methods, like:
- setBodyFromBytearray() - decodes using the encoding in the header and
stores the result internally as Unicode
- decodedMessage() (or something named like that) - returns a QByteArray in
the right encoding, as specified in the header. This would return the whole
assembled message.
- we could have similar convenience methods to get the body/headers/message
in the original encoding, similar to how we now have asUnicodeString() and
as7BitString().
So the difference is in how KMime stores the message internally.
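As a rough sketch of the proposal above, the class below keeps the body as Unicode and touches the charset only at the two boundaries. It is a hypothetical Python model, not KMime: the class and method names are made up for illustration and only loosely mirror the method names suggested in the list.

```python
class UnicodeMessage:
    """Hypothetical model: text is stored as Unicode internally."""

    def __init__(self, charset: str = "utf-8") -> None:
        self.charset = charset  # stands in for the charset in the header
        self.body = ""          # always a Unicode string internally

    def set_body_from_bytes(self, raw: bytes) -> None:
        # Conversion 1: decode once, when data enters the object.
        self.body = raw.decode(self.charset)

    def encoded_body(self) -> bytes:
        # Conversion 2: encode once, when data leaves the object.
        return self.body.encode(self.charset)


msg = UnicodeMessage(charset="iso-8859-2")
msg.set_body_from_bytes("Árvíztűrő".encode("iso-8859-2"))

# Viewer, template parser and composer would all work on msg.body directly.
msg.body = "> " + msg.body

wire = msg.encoded_body()   # encoded only when written out
msg.charset = "utf-8"       # changing the charset needs no re-decoding
wire_utf8 = msg.encoded_body()
```

In this model, switching the outgoing charset is just a change of the header value, because the stored text is never tied to a particular byte encoding.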
I hope my idea is clearer now. Do you still think this
is bad?
Andras