Concept of "hashing" strings in Amarok (sih1)

Mon Jan 11 18:35:01 CET 2010

> In fact, we've always over the years gotten emails asking for *more*
> specificity, which is why we recently switched the database to be
> case-sensitive. People *want* to know when they have both AC-DC and
> AC/DC so that they can fix their tags. (Jeff)

Indeed, you just described the reason I started thinking about sih1
(http://wiki.github.com/LukasLt/Collector), and sih1 is one of the main
pieces of homework still to do. I'm thinking of it as a plugin, since users
will always disagree about how similar two things must be to count as the same.

Creating a worldwide standard would be great, but unless someone really big
starts supporting it, it's close to impossible.

> /Stripping punctuation works fine usually, but when I google e.g. "C++
> memory allocation" that's what I want, not "C memory allocation"./
>
> Also if you remove duplicated characters it's a confusion for the user
> for example Madonna is written as *Madonna* (with two N's) not *Madona*.

This is a very important topic too. But a few things should be taken into
account in this case:
* Music is not technical literature. Yes, such a method would su*k if it
were used on bar codes etc.
* sih1 should not be visible to the user under normal conditions, just as
Amarok doesn't show the uniqueid.
* I couldn't find any artists that would share the same sih1, as most
artists use unique names, and it's very unlikely that someone would call
himself Maddona or Madonna2, so even if false collisions are possible
they would be rare.
* Amarok already uses LIKE %keyword% syntax, so if I'm looking for e.g. Don
McLean and type in don, Madonna is also found ;)
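
To make the discussion concrete, here is a minimal sketch of what such a key could look like. It assumes sih1 means "case-fold, strip punctuation and spaces, collapse repeated characters", which is only pieced together from the objections quoted above; the function body is hypothetical and is not the actual Collector or Amarok code.

```python
def sih1(text: str) -> str:
    """Hypothetical similarity key (an assumption, not the real sih1)."""
    # Case-fold and keep only letters and digits, so punctuation
    # and whitespace never influence the key.
    kept = [ch for ch in text.casefold() if ch.isalnum()]

    # Collapse runs of the same character ("nn" becomes "n").
    out = []
    for ch in kept:
        if not out or out[-1] != ch:
            out.append(ch)
    return "".join(out)


# Intended matches: tag variants of the same artist share one key.
acdc_keys = {sih1("AC-DC"), sih1("AC/DC"), sih1("ac dc")}

# False collisions raised in the thread: these also share one key.
madonna_collides = sih1("Madonna") == sih1("Maddona")
cpp_collides = sih1("C++ memory allocation") == sih1("C memory allocation")
```

Under these assumptions AC-DC and AC/DC map to the same key, which is the point of the exercise, while Madonna/Maddona and "C++"/"C" collide as well. That is why the key would only be used internally for matching and never shown in place of the real tag.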

On 01/11/2010 02:44 PM, Jakob Kummerow wrote:
>> In fact, we've always over the years gotten emails asking for *more*
>> specificity, which is why we recently switched the database to be
>> case-sensitive. People *want* to know when they have both AC-DC and
>> AC/DC so that they can fix their tags.

This raises a few points.
I think it is reasonable to want artist and track names spelled the way
the original author titled them. Of course, the problem is how to know
what the real original spelling is. In most cases an order of precedence
could be used:
1. The user edited the tag manually.
2. The correct spelling was fetched from the Internet.
3. The first spelling used (e.g. the first variant added to the DB, in case
of case conflicts).
4. The spelling taken directly from the file's ID3 tag.
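
The precedence order above could be sketched roughly like this. All names here (Source, Variant, canonical_spelling) are made up for illustration and do not correspond to anything in Amarok.

```python
from enum import IntEnum
from typing import NamedTuple, Sequence


class Source(IntEnum):
    # Lower value means higher precedence, following the order in the text.
    USER_EDIT = 1    # user edited the tag manually
    INTERNET = 2     # spelling fetched from the Internet
    FIRST_IN_DB = 3  # first variant that was added to the DB
    FILE_TAG = 4     # taken directly from the file's ID3 tag


class Variant(NamedTuple):
    spelling: str
    source: Source


def canonical_spelling(variants: Sequence[Variant]) -> str:
    """Pick the spelling that comes from the highest-precedence source.

    On a tie the earliest variant wins, because min() returns the
    first minimal element it encounters.
    """
    if not variants:
        raise ValueError("no variants given")
    return min(variants, key=lambda v: v.source).spelling


variants = [
    Variant("ac/dc", Source.FILE_TAG),
    Variant("AC/DC", Source.INTERNET),
    Variant("Ac/Dc", Source.FIRST_IN_DB),
]
chosen = canonical_spelling(variants)  # "AC/DC"
```

All variants that share one sih1 key would be grouped first, and this rule would then decide which spelling is displayed for the group.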

Talking about the "case-sensitive" DB scheme (a bit off topic), I didn't catch
why Amarok uses utf8_bin (case-sensitive) instead of utf8_general_ci
(case-insensitive). Both allow AC/DC and ac/dc to be stored as separate
records, so users who want a case-sensitive environment can still have
both. But with utf8_bin, when tracks are read from the DB the collation is
converted to utf8_general_ci on the fly, which doesn't help performance.
If utf8_general_ci were used, such a conversion would be needed only
when writing. And Amarok does many more reads than writes ;)
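
As a rough illustration of the difference between a binary and a case-insensitive collation, here is a small sketch. It uses SQLite's BINARY and NOCASE collations as stand-ins, because they ship with Python; it is not MySQL, it does not use Amarok's schema, and it says nothing about the performance question. The table and column names are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# No UNIQUE constraint, so both spellings can be stored side by side.
conn.execute("CREATE TABLE artists (name TEXT)")
conn.executemany(
    "INSERT INTO artists (name) VALUES (?)",
    [("AC/DC",), ("ac/dc",)],
)


def find(collation: str) -> list:
    # The collation name is only ever one of the fixed strings used
    # below, so formatting it into the statement is safe here.
    rows = conn.execute(
        f"SELECT name FROM artists WHERE name = ? COLLATE {collation} "
        "ORDER BY rowid",
        ("ac/dc",),
    )
    return [row[0] for row in rows]


binary_hits = find("BINARY")  # byte-wise comparison: exact spelling only
nocase_hits = find("NOCASE")  # case-insensitive: both stored spellings
```

Both rows stay separate records in either case; the collation only changes which of them a comparison treats as equal.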

2010/1/11 Jeff Mitchell <mitchell at kde.org>

> On 01/10/2010 02:40 PM, Milot Shala wrote:
> > If this is about to be implemented in my opinion this feature should be
> > a user's choice.
> >
> > I agree with John in:
> > /Stripping punctuation works fine usually, but when I google e.g. "C++
> > memory allocation" that's what I want, not "C memory allocation"./
> >
> > Also if you remove duplicated characters it's a confusion for the user
> > for example Madonna is written as *Madonna* (with two N's) not *Madona*.
>
>
>
> --Jeff