wide char patch

Yoshiki Yazawa yaz at cc.rim.or.jp
Fri Nov 24 17:17:51 CET 2006


Thank you for replying, Scott.

I understand that the ID3 standards only allow latin1 or Unicode as
encodings for metadata. Unfortunately, as an application developer, I
can't ignore the large number of media files tagged with non-Unicode
encodings. I have implemented an automatic character encoding
detector in audacious, and it has been well received by many users in
countries where non-latin characters are mainly used.

To make the detector work correctly, "raw" metadata output is
essential. With UTF-8 specified, each original raw byte is encoded
into a one-byte or two-byte UTF-8 sequence, depending on its value.
This forces an unwanted inverse conversion to obtain the raw string.
Besides, it is quite difficult to distinguish a meaningful UTF-8
string derived from a Unicode string from a pseudo UTF-8 string
derived from a non-Unicode string, because both are valid UTF-8.
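
To illustrate the round trip described above, here is a minimal
Python sketch. It is not taglib code: it assumes a tag whose bytes are
really Shift_JIS (chosen only as an example) but which the library
treats as latin1, and all variable names are made up for illustration.

```python
# Raw tag bytes in a non-Unicode encoding (Shift_JIS is only an example).
raw = "日本語".encode("shift_jis")

# Assumption: the library treats the field as latin1, so every raw byte
# becomes one code point in the range U+0000..U+00FF.
as_latin1_text = raw.decode("latin-1")

# Requesting UTF-8 output re-encodes each code point: bytes below 0x80
# stay one byte, bytes at or above 0x80 become a two-byte sequence.
utf8_out = as_latin1_text.encode("utf-8")

# The application has to undo that step before it can run its own
# encoding detection on the raw bytes.
recovered = utf8_out.decode("utf-8").encode("latin-1")

print(len(raw), len(utf8_out), recovered == raw)
```

The pseudo UTF-8 output decodes without error, just as a genuine
Unicode tag would, which is why the two cases are hard to tell apart
from the output alone.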

The reasons why I proposed that taglib return a UTF-8 string in the
situation I've explained are:

1. When latin1 output is requested for a Unicode string, it is safer
to return a meaningful UTF-8 byte sequence than a corrupted sequence
made of every other byte. Moreover, nothing changes if the original
string is non-Unicode, so users who request latin1 output for latin1
metadata will not be affected by this change.

2. If we decide to define a new interface for raw output, I'm quite
sure that the new API would return strings in the same manner as the
patched latin1 interface, since returning ucs2be would run into the
NUL termination problem. Therefore, I think what the patch does is
necessary and sufficient for raw output.

3. At the very least, I think it is valuable that taglib never emits
a broken byte sequence. The proposed patch achieves this. Even if my
patch is not acceptable, taglib should check the internal string and
never return a broken sequence.
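
The behaviour argued for in the three points above can be sketched as
follows. This is a hypothetical Python model, not the actual patch or
taglib's API: the internal ucs2be string is represented by a Python
str limited to the Basic Multilingual Plane, and the names lower_bytes
and proposed_8bit are invented for this example.

```python
def lower_bytes(text: str) -> bytes:
    """Behaviour described in this thread: keep only the low byte of
    each 16-bit character (assumes BMP-only input)."""
    return bytes(ord(ch) & 0xFF for ch in text)


def proposed_8bit(text: str) -> bytes:
    """Proposed behaviour: if any character does not fit in one byte,
    return UTF-8 instead of a truncated byte sequence."""
    if any(ord(ch) > 0xFF for ch in text):
        return text.encode("utf-8")
    return lower_bytes(text)


# Genuine latin1 metadata: output is unchanged by the proposal.
print(proposed_8bit("café") == "café".encode("latin-1"))

# Genuine Unicode metadata: valid UTF-8 instead of every other byte.
print(lower_bytes("日本語"), proposed_8bit("日本語"))

# Non-Unicode bytes stored as latin1 code points: returned as raw bytes.
raw = "日本語".encode("shift_jis")
print(proposed_8bit(raw.decode("latin-1")) == raw)
```

In this model the 8-bit output is either the untouched raw bytes or
valid UTF-8, and contains no NUL bytes unless the string itself
contains U+0000.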

I am open to better solutions. Please let me know your opinions.

---------------------
Yoshiki Yazawa



From: Scott Wheeler <wheeler at kde.org>
Subject: Re: wide char patch
Date: Fri, 24 Nov 2006 13:28:38 +0100

> Yoshiki Yazawa wrote:
> > Dear authors of taglib,
> >
> > I am Yoshiki Yazawa, a developer of audacious media player.
> >
> > Taglib is the primary tag library of our software. Thank you for the
> > great library.
> >
> > I have a proposal to change the behavior when taglib is asked to
> > return metadata in latin1. The current implementation simply takes
> > the lower byte of each internal ucs2be character and composes the
> > result string from them. However, this behavior easily turns wide
> > characters into unrecoverable garbage.
> >
> > I think it is very reasonable and safe for taglib to check for wide
> > characters and return a utf-8 string instead of a chain of lower
> > bytes when the internal string has wide characters, even though it
> > was asked to return latin1.
> >   
> Hi Yoshiki --
> 
> Why not just request UTF-8 data?  I don't think it's acceptable to 
> introduce indeterminism like this directly at the TagLib::String level.  
> Maybe if I understood the problem that you're trying to solve we could 
> solve it at a higher level...
> 
> -Scott
> 
> 
> 
> 


More information about the taglib-devel mailing list