Encoding problems

Wed May 18 00:27:30 CEST 2005

On Tuesday 03 May 2005 16:43, Mikhail Yakshin wrote:
> Hello and greetings to all Taglib developers!
>
> I've tried to contact wheeler at kde.org directly, but, alas, I've received
> no answer so far.

Sorry, I miss a lot of things in my mailbox (especially if they're long) 
because I get so much mail.  And here I tend to respond in batch.  ;-)

> 1. First major problem is with String class. It's more or less nice when
> it collaborates with Qt/QString, but some methods really make me cry.
> The most evil one is to8Bit() method and operator << that try to convert
> string into Latin1 std::string().

to8Bit() has a bool option to convert to UTF-8 and the operator<< is only used 
for debugging, where I don't think it really matters.

The string class isn't meant to be a general purpose string class; it's one 
that just does the stuff that TagLib needs.

> First, they fail to do as stated in documentation, really. For example,
> output of UTF8 strings containing cyrillic symbols looks just like 8-bit
> stripped string, making it totally unreadable. Sure, it's impossible to
> represent these symbols in Latin1, but there are "?" signs for
> replacement and, well,

Huh?  If the string goes in properly (i.e. marked as UTF-8 with UTF-8 data; 
marked as UTF-16, with UTF-16 data) it will come out as valid UTF-8 when used 
with to8Bit(true) or toCString(true).
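
To make the difference between the two modes concrete, here is a rough 
sketch.  It is not TagLib's code: to8BitSketch is a made-up name, 
std::u16string stands in for the string's internal UTF-16 data, and the 
assumption that the non-Unicode mode simply keeps the low byte of each 
character is only for illustration.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the two modes of to8Bit(); not TagLib's code.
std::string to8BitSketch(const std::u16string &s, bool unicode)
{
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned long cp = s[i];

        if (!unicode) {
            // Assumed Latin-1 mode: keep the low byte only, so anything
            // above U+00FF (Cyrillic, for instance) comes out mangled.
            out += static_cast<char>(cp & 0xFF);
            continue;
        }

        // UTF-8 mode: first combine a surrogate pair into one code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < s.size()
            && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (s[i + 1] - 0xDC00);
            ++i;
        }

        // Then emit the code point as 1 to 4 UTF-8 bytes.
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

With a Cyrillic string, the first mode produces unreadable bytes and the 
second produces valid UTF-8, which is why the bool argument matters.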

I maintain an application that defaults to UTF-8 tagging and I regularly use 
extended characters.

> Second thing, they shouldn't really restrict everything to Latin1.
> Operator << and to8Bit() seem to be used as a helper to output
> everything to console

They're not made to handle output to the console except for internal 
debugging.  If you're writing a locale-aware application, that's something 
your application needs to handle.

> If we fix this problem, at least Ogg Vorbis (always proper UTF-8)
> comments would be displayed and saved properly with tagreader/tagwriter.

Those are just demo applications to show how the framework works; they're not 
built by default or shipped by anyone.

> 2. Second major problem is ID3v2 tags encoding.

[...]

> The problem is that de-facto, 99% of Russian MP3s with ID3v2 tags also
> contain windows-1251-encoded data, written as Latin1. Unicode support is
> generally very weak within majority of players (including popular
> Windows-based players, such as some versions of WinAmp or Windows Media
> Player and most hardware players that support localized interface and
> character sets), and most taggers use local 8-bit encoding instead of
> Unicode.
>
> So, there should also be a recoding handler for ID3v2 tags that will
> only touch reading/writing of Latin1-encoded tags, with some flexible
> way to decide whether we want to write tags in windows-1251 (if the user
> wants it) or Unicode (by default, I guess?)

I'm still torn on this one.  This is just so spectacularly broken that it 
hurts me to implement.

I might consider doing it just for reading.  I really don't like the idea 
of writing non-ISO-8859-1 data into a place and specifically marking it as 
ISO-8859-1 -- especially when both UTF-8 and UTF-16 are supported...
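
A read-only workaround could look roughly like the sketch below, done on the 
application side.  Everything in it is hypothetical: cp1251ToUtf8 is a 
made-up helper, not part of TagLib, and it only covers ASCII plus the Russian 
letters (bytes 0xC0-0xFF, 0xA8 and 0xB8); a real handler would need the 
complete windows-1251 table.

```cpp
#include <cstddef>
#include <string>

// Append one code point below U+10000 to a UTF-8 string.
static void appendUtf8(std::string &out, unsigned int cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: treat the raw bytes of a field that is marked as
// ISO-8859-1 as windows-1251 instead, and return them as UTF-8.
std::string cp1251ToUtf8(const std::string &raw)
{
    std::string out;
    for (std::size_t i = 0; i < raw.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(raw[i]);
        unsigned int cp;
        if (c < 0x80)
            cp = c;                    // ASCII is unchanged
        else if (c >= 0xC0)
            cp = 0x0410 + (c - 0xC0);  // U+0410..U+044F, in order
        else if (c == 0xA8)
            cp = 0x0401;               // capital IO
        else if (c == 0xB8)
            cp = 0x0451;               // small io
        else
            cp = 0xFFFD;               // not covered by this sketch
        appendUtf8(out, cp);
    }
    return out;
}
```

Since this only reinterprets bytes after they have been read, nothing 
mislabeled ever gets written back into a tag.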

> 3. Lots of applications use Taglib, but, so far, only amaroK has included
> a selection for string handler encoding. This shouldn't be an
> application-related issue; it could be some sort of library-wide
> setting (for example, a config file in ~/.taglib) that would

Locale handling is complex.  TagLib only implements UTF-8, UTF-16 and 
ISO-8859-1 because (a) those are the only encodings that are actually 
supposed to go into the tag types and (b) locale handling is beyond the scope 
of TagLib.  TagLib isn't an application toolkit, it's a tag reading and 
writing toolkit.  Qt, for example, contains about 2 MB of source for its text 
codecs.  That's about 4 times the size of TagLib.  Including those would also 
significantly increase the binary size, and with it the memory footprint of 
every application using the library -- most of which are already linking to 
toolkits that *do* have locale-aware string abstractions.

-Scott

-- 
For a successful technology, reality must take precedence over public 
relations, for nature cannot be fooled. 
--Richard Feynman