Concept of "hashing" strings in Amarok (sih1)

Milot Shala milot.shala at gmail.com
Sun Jan 10 20:40:02 CET 2010


If this is going to be implemented, then in my opinion the feature should be
the user's choice.

I agree with John in:
*Stripping punctuation works fine usually, but when I google e.g. "C++
memory allocation" that's what I want, not "C memory allocation".*

Also, removing duplicated characters is confusing for the user: for example,
Madonna is written *Madonna* (with two N's), not *Madona*.

On Sun, Jan 10, 2010 at 8:23 PM, John Atkinson <john at fauxnetic.co.uk> wrote:

> Hi all,
>
> I thought I'd quickly chime in with my thoughts on this suggestion. In my
> opinion the biggest problem with a system like this is the inability to
> (easily) turn it off. This is probably easiest to do the de-facto way by
> treating anything inside quote marks as literal.
>
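
A minimal Python sketch of the quoting idea, to make it concrete. The names
parse_filter, matches and loose_key are made up for this example, and
loose_key is only a stand-in for the proposed hash (it just lowercases and
drops everything except a-z and 0-9). None of this is Amarok code.

```python
import re

# Matches either a double-quoted phrase or a bare word.
TOKEN = re.compile(r'"([^"]*)"|(\S+)')


def loose_key(s):
    # Stand-in for the proposed hash (assumption): lowercase, keep a-z and 0-9.
    return re.sub(r"[^a-z0-9]", "", s.lower())


def parse_filter(query):
    """Split a filter string into (text, is_literal) pairs."""
    return [(q, True) if q else (w, False) for q, w in TOKEN.findall(query)]


def matches(query, candidate):
    """True if every term of the filter is found in the candidate string."""
    for text, is_literal in parse_filter(query):
        if is_literal:
            # Quoted terms are compared exactly, punctuation included.
            if text not in candidate:
                return False
        elif loose_key(text) not in loose_key(candidate):
            # Bare terms are compared through the loose key.
            return False
    return True
```

With this, the filter "C++" memory (quotes included) matches
"C++ memory allocation" but not "C memory allocation", while the unquoted
c++ memory matches both.
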
> Sometimes users might really want to filter for the string, and only that
> string. I can't think of a band name equivalent off the top of my head but
> there was nothing more infuriating when, back in the day, search engines
> used to do this. Stripping punctuation works fine usually, but when I google
> e.g. "C++ memory allocation" that's what I want, not "C memory allocation".
>
> - John
>
>
> 2010/1/10 Lukas <1lukas1 at gmail.com>
>
>> The acronym *sih1* ("Simplified string Hash version 1") is used here to
>> avoid confusion with other hashing methods.
>>
>> As Amarok gets data from various sources (ID3 tags, file names, last.fm,
>> Discogs, MusicBrainz, user input, etc.), there is always a high chance of
>> getting the same string in various forms, e.g. AC/DC, Ac/Dc, ac/dc, AcDc,
>> AC-DC, AC DC, acdc; the list can get very long.
>> This is a problem because Amarok treats each one as a separate entry. This
>> causes longer lists (the same artist is repeated more than once),
>> difficulty browsing (with duplicates the user has to enter the filter
>> twice), and difficulty managing the collection (which one is the real one?).
>> Also, when searching, it is easy to mistype a name or title, especially
>> where punctuation is involved (should I type It's or Its, etc.). The
>> punctuation problem also applies to scripts designed to automate tagging,
>> because a byte-for-byte comparison does not always give a match when it
>> should.
>>
>>
>> The idea of sih1 is to solve most of these problems. When I tried it on
>> various data sets, it worked about 90% of the time.
>>
>> sih1 is a one-way hashing method, designed to be used internally to help
>> match similar strings: sih1(AC/DC) == sih1(acdc) == sih1(AC-DC) == acdc
>>
>> In pseudo-code it looks like this:
>>
>>
>> 1) Lowercase: JayZ -> jayz
>> 2.1) Convert accented (non-ASCII) letters to their plain Latin
>> equivalents: š -> s, ų,ū -> u
>> 2.2) Convert similar letters: w -> v, n -> m, y -> i, j -> i, d -> t
>> 3) Strip punctuation (anything not in "a-z0-9 "): AC/DC -> ACDC,
>> It's -> Its, Jay-Z -> JayZ
>> 4.1) Remove useless words using a predefined dictionary:
>>    feat, ft, featuring - they don't change the meaning of the string to
>>    the application
>>    (www.*) - various tags added by ripping applications
>>    ^[0-2][0-9] - in case the track number is embedded in the title
>> 4.2) Remove duplicated characters:
>>    madonna -> madona, robbie -> robie
>>
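
The steps above can be sketched in Python roughly as follows. This is only an
illustration under my own assumptions, not the Collector plugin's code: the
function name sih1 and its two flags are hypothetical, the dictionary step
(4.1) is applied before the letter folding so the dictionary can hold plain
words, and repeated-character removal is limited to letters.

```python
import re
import unicodedata

# Step 2.2 table, taken from the post: w->v, n->m, y->i, j->i, d->t.
SIMILAR = str.maketrans("wnyjd", "vmiit")

# Step 4.1 dictionary; only these three words are named in the post.
STOP_WORDS = {"feat", "ft", "featuring"}


def sih1(text, fold_similar=True, squeeze=True):
    """Illustrative sih1-style key (hypothetical helper, not Amarok code)."""
    s = text.lower()                                    # 1) lowercase
    # 2.1) decompose accented letters and drop the combining marks (š -> s)
    s = unicodedata.normalize("NFKD", s)
    s = "".join(c for c in s if not unicodedata.combining(c))
    # 4.1) applied early (assumption): www.* tokens and a leading track number
    s = re.sub(r"\bwww\.\S+", " ", s)
    s = re.sub(r"^\s*[0-2][0-9]\b", " ", s)
    s = re.sub(r"[^a-z0-9 ]", "", s)                    # 3) strip punctuation
    # 4.1) drop dictionary words; this also normalizes whitespace
    s = " ".join(w for w in s.split() if w not in STOP_WORDS)
    if fold_similar:
        s = s.translate(SIMILAR)                        # 2.2) similar letters
    if squeeze:
        s = re.sub(r"([a-z])\1+", r"\1", s)             # 4.2) repeated letters
    return s
```

With the defaults, sih1("AC/DC") gives "actc" rather than the "acdc" shown in
the post, because step 2.2 maps d to t; with fold_similar=False it gives
"acdc". The two flags are one way the lossy steps could be left to the user's
choice. Note that "C++ memory allocation" and "C memory allocation" still get
the same key whatever the flags are set to.
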
>>
>> 2.2 is useful when the user who is searching does not know the correct
>> spelling.
>> 4.1: I'd suggest shipping a default, almost static dictionary. If the user
>> wants to customize it, regenerating the sih1 values doesn't take long.
>> Regenerating the hash index for my collection of 6,475 tracks used 13,124
>> queries and took 9,862 ms, and about half of those queries are excessive
>> ones created by the framework I use.
>>
>>
>> Bottom line
>>
>> I'm currently prototyping the Collectors plugin for Amarok
>> (http://wiki.github.com/LukasLt/Collector), where sih1 is going to be used
>> widely. I think Amarok in general could make use of it.
>>
>> Collectors plugin in short
>> The aim is to create a batch-processing plugin that corrects typos in
>> titles, artists and album names, fetches metadata from online resources,
>> and helps identify duplicates and previously deleted tracks in a
>> user-friendly way.
>>
>> _______________________________________________
>> Amarok-devel mailing list
>> Amarok-devel at kde.org
>> https://mail.kde.org/mailman/listinfo/amarok-devel
>>
>>
>


-- 
Milot Shala
gtalk: milot.shala at gmail.com
blog: http://www.codespartan.org/