Concept of "hashing" strings in Amarok (sih1)

Sun Jan 10 20:23:01 CET 2010

Hi all,

I thought I'd quickly chime in with my thoughts on this suggestion. In my
opinion the biggest problem with a system like this is the inability to
(easily) turn it off. This is probably easiest to do the de-facto way by
treating anything inside quote marks as literal.

Sometimes users might really want to filter for the string, and only that
string. I can't think of a band name equivalent off the top of my head, but
there was nothing more infuriating than when, back in the day, search engines
used to do this. Stripping punctuation usually works fine, but when I google
e.g. "C++ memory allocation" that's what I want, not "C memory allocation".
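
To make the quoting idea concrete, here is a minimal Python sketch. It is
only an illustration: normalize() is a hypothetical stand-in for whatever
simplification ends up being used (here just lowercasing and stripping
punctuation), and matches() is a made-up name, not an existing Amarok
function. A filter wrapped in double quotes is compared literally; anything
else is compared in simplified form.

```python
import re


def normalize(text):
    """Hypothetical stand-in for the proposed hash: lowercase, drop punctuation."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower())


def matches(query, candidate):
    """Return True if the filter string `query` matches `candidate`."""
    query = query.strip()
    # A filter wrapped in double quotes is taken literally (exact substring,
    # case-sensitive); this is an assumption about what "literal" should mean.
    if len(query) >= 2 and query[0] == '"' and query[-1] == '"':
        return query[1:-1] in candidate
    # Otherwise both sides are simplified before comparing.
    return normalize(query) in normalize(candidate)
```

With this, the unquoted filter C++ also matches a title like C memory
allocation, while the quoted filter "C++" matches only titles that really
contain C++.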

- John

2010/1/10 Lukas <1lukas1 at gmail.com>

> The acronym *sih1*, for "Simplified string Hash version 1", is used here
> to avoid confusion with other hashing methods.
>
> Amarok gets its data from various sources (ID3 tags, file names, last.fm,
> Discogs, MusicBrainz, user input, etc.), so there is always a high chance
> of getting the same string in various forms, e.g. AC/DC, Ac/Dc, ac/dc,
> AcDc, AC-DC, AC DC, acdc; the list can get very long.
> This is a problem because Amarok treats each form as a separate entry.
> This causes longer lists (the same artist appears more than once),
> difficult browsing (with duplicates the user has to enter the filter twice)
> and difficult collection management (which one is the real one?).
> Also, when searching, it is easy to mistype a name or title, especially
> where punctuation is involved (should I type It's or Its?). The
> punctuation problem also applies to scripts designed to automate tagging,
> because a byte-by-byte comparison does not always give a match when it
> should.
>
>
> The idea of sih1 is to solve most of these problems. When I tried it on
> various sets of data, it worked ~90% of the time.
>
> sih1 is a one-way hashing method, designed to be used internally to help
> match similar strings: sih1(AC/DC) == sih1(acdc) == sih1(AC-DC) == acdc
>
> In pseudo-code it looks like:
>
>
> 1) lowercase: JayZ -> jayz
> 2.1) Convert non-latin letters to latin equivalents: š -> s, ų,ū -> u
> 2.2) Convert similar letters: w -> v, n -> m, y -> i, j -> i, d -> t
> 3) strip punctuation (non "a-z0-9 ") AC/DC -> ACDC, It's -> Its, Jay-Z ->
> JayZ
> 4.1) Remove useless words using a predefined dictionary:
> feat, ft, featuring - these don't change the meaning of the string for the
> application
> (www.*) - various tags added by ripping applications
> ^[0-2][0-9] - in case the track number is embedded in the title
> 4.2) Remove duplicated chars:
> madonna -> madona, robbie -> robie
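
For reference, the steps above could be sketched in Python roughly like
this. It is a sketch under assumptions, not the actual implementation: the
regexes for step 4.1 are guesses at the listed rules, accents are removed
via Unicode decomposition (which does not cover letters such as ł), and the
noise removal runs before punctuation stripping so that "(www.*)" is still
recognizable. Step 2.2 is off by default because with d -> t applied, AC/DC
hashes to actc rather than the acdc shown in the example.

```python
import re
import unicodedata

# Step 2.2 table from the proposal: w->v, n->m, y->i, j->i, d->t.
SIMILAR = str.maketrans("wnyjd", "vmiit")

# Step 4.1 noise patterns (hypothetical regexes for the listed rules).
NOISE = [
    re.compile(r"\(www\.[^)]*\)"),             # "(www.*)" tags added by rippers
    re.compile(r"^\s*[0-2][0-9]\b"),           # leading track number, e.g. "03"
    re.compile(r"\b(?:featuring|feat|ft)\b"),  # "featuring" markers
]


def sih1(text, fold_similar=False):
    """Return a simplified form of `text` for matching similar strings."""
    s = text.lower()                                    # step 1
    # Step 2.1: decompose accented letters and drop the combining marks.
    # Letters without a decomposition are removed later with the punctuation.
    s = unicodedata.normalize("NFKD", s)
    s = "".join(ch for ch in s if not unicodedata.combining(ch))
    for pattern in NOISE:                               # step 4.1 (moved up)
        s = pattern.sub(" ", s)
    s = re.sub(r"[^a-z0-9 ]", "", s)                    # step 3
    if fold_similar:                                    # step 2.2 (optional)
        s = s.translate(SIMILAR)
    # Step 4.2: collapse repeated letters; digits are left alone here
    # (an assumption, so that e.g. "100" and "10" stay distinct).
    s = re.sub(r"([a-z])\1+", r"\1", s)
    return " ".join(s.split())                          # tidy whitespace
```

For example, sih1("AC/DC"), sih1("AC-DC") and sih1("acdc") all give acdc,
sih1("Madonna") gives madona, and sih1("03 - Song (www.example.com)") gives
song. Since spaces are kept, as in step 3, "AC DC" stays ac dc in this
sketch.
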
>
>
> 2.2 is useful when the user who is searching does not know the correct
> spelling.
> 4.1 I'd suggest having a default dictionary that is almost static. But if
> the user wants to customize it, regenerating the sih1 hashes doesn't take
> long. Regenerating the hash index for my collection of 6475 tracks used
> 13124 queries and took 9862 ms, where half of the queries are excessive
> and created by the framework I use.
>
>
> Bottom line
>
> I'm currently prototyping the Collectors plugin for Amarok
> (http://wiki.github.com/LukasLt/Collector), and sih1 is going to be used
> widely. I think Amarok in general could make use of it.
>
> The Collectors plugin in short
> The aim is to create a batch-processing plugin that corrects typos in
> titles, artists and album names, fetches metadata from online resources,
> and helps to identify duplicates and previously deleted tracks in a
> user-friendly way.
>
> _______________________________________________
> Amarok-devel mailing list
> Amarok-devel at kde.org
> https://mail.kde.org/mailman/listinfo/amarok-devel
>
>