Encoding problems
Scott Wheeler
wheeler at kde.org
Wed May 18 00:27:30 CEST 2005
On Tuesday 03 May 2005 16:43, Mikhail Yakshin wrote:
> Hello and greetings to all Taglib developers!
>
> I've tried to contact wheeler at kde.org directly, but, alas, I've received
> no answer so far.
Sorry, I miss a lot of things in my mailbox (especially if they're long)
because I get so much mail. And here I tend to respond in batch. ;-)
> 1. First major problem is with String class. It's more or less nice when
> it collaborates with Qt/QString, but some methods really make me cry.
> The most evil one is to8Bit() method and operator << that try to convert
> string into Latin1 std::string().
to8Bit() has a bool option to convert to UTF-8 and the operator<< is only used
for debugging, where I don't think it really matters.
The string class isn't meant to be a general purpose string class; it's one
that just does the stuff that TagLib needs.
> First, they fail to do as stated in documentation, really. For example,
> output of UTF8 strings containing cyrillic symbols looks just like 8-bit
> stripped string, making it totally unreadable. Sure, it's impossible to
> represent these symbols in Latin1, but there are "?" signs for
> replacement and, well,
Huh? If the string goes in properly (i.e. marked as UTF-8 with UTF-8 data;
marked as UTF-16, with UTF-16 data) it will come out as valid UTF-8 when used
with to8Bit(true) or toCString(true).
I maintain an application that defaults to UTF-8 tagging and I regularly use
extended characters.
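
The difference between the two 8-bit conversions discussed above can be sketched without TagLib. This is not TagLib's implementation: the helpers toLatin1() and toUtf8() are hypothetical names, the '?' substitution is an assumption made for the sketch, and only characters in the Basic Multilingual Plane are handled. It only shows why Cyrillic text survives a UTF-8 conversion (as with to8Bit(true)) but cannot survive a Latin-1 one.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not TagLib code): narrow each UTF-16 unit to Latin-1.
// Anything above U+00FF has no Latin-1 representation; this sketch
// substitutes '?' for it, which is an assumption, not documented behaviour.
std::string toLatin1(const std::u16string &s)
{
    std::string out;
    for (char16_t c : s)
        out += (c <= 0xFF) ? static_cast<char>(c) : '?';
    return out;
}

// Hypothetical helper (not TagLib code): encode each UTF-16 unit as UTF-8.
// Surrogate pairs are not combined, so only BMP characters are correct.
std::string toUtf8(const std::u16string &s)
{
    std::string out;
    for (char16_t c : s) {
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else if (c < 0x800) {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        } else {
            out += static_cast<char>(0xE0 | (c >> 12));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}

// Usage check: a Cyrillic word is preserved as UTF-8 but lost as Latin-1.
inline void checkNarrowing()
{
    const std::u16string mir = u"\u041C\u0438\u0440"; // Cyrillic "Mir"
    assert(toUtf8(mir) == "\xD0\x9C\xD0\xB8\xD1\x80");
    assert(toLatin1(mir) == "???");
}
```
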
> Second thing, they shouldn't really restrict everything to Latin1.
> Operator << and to8Bit() seem to be used as a helper to output
> everything to console
They're not made to handle output to the console except for internal
debugging. If you're writing a locale aware application that's something
your application needs to handle.
> If we'll fix this problem, at least Ogg Vorbis (always proper UTF-8)
> comments would be displayed and saved properly with tagreader/tagwriter.
Those are just demo applications to show how the framework works; they're not
built by default or shipped by anyone.
> 2. Second major problem is ID3v2 tags encoding.
[...]
> The problem is that de-facto, 99% of Russian MP3s with ID3v2 tags also
> contain windows-1251-encoded data, written as Latin1. Unicode support is
> generally very weak within majority of players (including popular
> Windows-based players, such as some versions of WinAmp or Windows Media
> Player and most hardware players that support localized interface and
> character sets), and most taggers use local 8-bit encoding instead of
> Unicode.
>
> So, there should be also a recoding handler for ID3v2 tags that will
> only touch reading/writing of Latin1-encoded tags, allowing somehow
> flexible to decide if we want to write tags in windows-1251 (if user
> wants it) or Unicode (by default, I guess?)
I'm still torn on this one. This is just so spectacularly broken that it
hurts me to implement.
I might consider doing it only for reading. I
really don't like the idea of writing non-ISO-8859-1 data into a place and
specifically marking it as ISO-8859-1 -- especially when both UTF-8 and
UTF-16 are supported...
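
A read-only recoding of the kind considered above could look roughly like the following. This is a sketch under stated assumptions, not something TagLib provides: cp1251ToUtf8() and latin1ToUtf8() are hypothetical names, and the windows-1251 mapping is deliberately partial (ASCII, the main Cyrillic block at 0xC0-0xFF, and the two "yo" letters at 0xA8 and 0xB8; every other byte becomes U+FFFD). It shows the same bytes read strictly as ISO-8859-1 and read as windows-1251, with UTF-8 as the result in both cases so that nothing mislabeled is ever written back.

```cpp
#include <cassert>
#include <string>

// Append one code point (BMP only) to a UTF-8 string.
static void appendUtf8(std::string &out, unsigned cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Strict reading: the field says ISO-8859-1, so byte N is code point N.
std::string latin1ToUtf8(const std::string &bytes)
{
    std::string out;
    for (unsigned char b : bytes)
        appendUtf8(out, b);
    return out;
}

// Hypothetical lenient reading: treat the bytes as windows-1251.
// Partial table (an assumption of this sketch): unmapped bytes give U+FFFD.
std::string cp1251ToUtf8(const std::string &bytes)
{
    std::string out;
    for (unsigned char b : bytes) {
        unsigned cp;
        if (b < 0x80)       cp = b;                   // ASCII
        else if (b >= 0xC0) cp = 0x0410 + (b - 0xC0); // U+0410..U+044F
        else if (b == 0xA8) cp = 0x0401;              // capital "yo"
        else if (b == 0xB8) cp = 0x0451;              // small "yo"
        else                cp = 0xFFFD;              // not covered here
        appendUtf8(out, cp);
    }
    return out;
}

// Usage check: the same three bytes under both interpretations.
inline void checkRecoding()
{
    const std::string raw = "\xCC\xE8\xF0";  // "Mir" in windows-1251
    assert(cp1251ToUtf8(raw) == "\xD0\x9C\xD0\xB8\xD1\x80"); // Cyrillic
    assert(latin1ToUtf8(raw) == "\xC3\x8C\xC3\xA8\xC3\xB0"); // accented Latin
}
```

Limiting the recoding to the read path keeps the written tag honest: whatever was recovered can then be saved as UTF-8 or UTF-16 and marked as such.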
> 3. Lots of application use Taglib, but, so far, only amaroK included
> selection for string handler encoding. This shouldn't be
> application-related issues, it could be some sort of library-wide
> setting (for example, a config file in ~/.taglib) that would
Locale handling is complex. TagLib only implements UTF-8, UTF-16 and
ISO-8859-1 because (a) those are the only formats that are actually supposed
to go into the tag types and (b) locale handling is beyond the scope of
TagLib. TagLib isn't an application toolkit, it's a tag reading and writing
toolkit. Qt, for example, contains about 2 MB of source for its text
codecs. That's about 4 times the size of TagLib. Including those would also
significantly increase the binary size, and with it the memory consumption
of every application using the library -- most of which are
already linking to toolkits that *do* have locale-aware string abstractions.
-Scott
--
For a successful technology, reality must take precedence over public
relations, for nature cannot be fooled.
--Richard Feynman