Concept of "hashing" strings in Amarok (sih1)

Lukas 1lukas1 at gmail.com
Sun Jan 10 19:27:16 CET 2010


The acronym *sih1*, short for "Simplified string Hash version 1", is used
below to avoid confusion with other hashing methods.

Amarok gets its data from various sources (id3, file name, last.fm, discogs,
musicbrainz, user input, etc.), so there is always a high chance of getting
the same string in several forms, e.g. AC/DC, Ac/Dc, ac/dc, AcDc, AC-DC,
AC DC, acdc; the list can get very long.
This is a problem, because Amarok treats each form as a separate entry. The
result: longer lists (the same artist appears more than once), harder
browsing (with duplicates the user has to enter the filter twice), and a
collection that is harder to manage (which one is the real one?).
Searching has a similar problem: it is easy to mistype a name or title,
especially where punctuation is involved (should I type It's or Its?). The
punctuation problem also applies to scripts designed to automate tagging,
because a byte-by-byte comparison does not always report a match when it
should.


The idea of sih1 is to solve most of these problems. When I tried it on
various sets of data, it worked about 90% of the time.

sih1 is a one-way hashing method, designed to be used internally to help
match similar strings: sih1(AC/DC) == sih1(acdc) == sih1(AC-DC) == acdc

In pseudo-code it looks like:


1) Lowercase: JayZ -> jayz
2.1) Convert non-Latin letters to Latin equivalents: š -> s, ų,ū -> u
2.2) Convert similar letters: w -> v, n -> m, y -> i, j -> i, d -> t
3) Strip punctuation (anything not in "a-z0-9 "): AC/DC -> ACDC, It's -> Its,
Jay-Z -> JayZ
4.1) Remove useless words using a predefined dictionary:
feat, ft, featuring - they do not change the meaning of the string for the
application
(www.*) - various tags added by ripping applications
^[0-2][0-9] - in case the track number is embedded in the title
4.2) Remove duplicated characters:
madonna -> madona, robbie -> robie
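
The steps above can be sketched in Python roughly as follows. This is only an
illustration under several assumptions, not the plugin's actual code: the
function name and the fuzzy_letters flag are made up here; step 2.2 is made
optional because with d -> t the hash of AC/DC would no longer be acdc; the
www.* and track-number rules run before punctuation is stripped, and the word
dictionary is applied before letter folding; spaces are dropped at the end so
that "AC DC" matches "acdc"; step 2.1 is approximated with Unicode NFKD
decomposition, which handles accented letters such as š or ū but not every
non-Latin letter; and step 4.2 is limited to letters so that numbers like
2000 stay intact.

```python
import re
import unicodedata

# Step 2.2 letter folding, copied from the list above.
SIMILAR_LETTERS = str.maketrans({"w": "v", "n": "m", "y": "i", "j": "i", "d": "t"})
# Step 4.1 dictionary; only the three words named above.
USELESS_WORDS = frozenset({"feat", "ft", "featuring"})


def sih1(text, fuzzy_letters=False):
    """Return a simplified key for text (hypothetical sketch of sih1)."""
    # 1) Lowercase.
    s = text.lower().strip()
    # 4.1) Drop www.* tags while the dot is still there to recognise them,
    #      and a leading two-digit track number (00-29).
    s = re.sub(r"www\.\S*", " ", s)
    s = re.sub(r"^[0-2][0-9]", "", s)
    # 2.1) Decompose accented letters and drop the combining marks.
    s = unicodedata.normalize("NFKD", s)
    s = "".join(c for c in s if not unicodedata.combining(c))
    # 3) Strip everything that is not a-z, 0-9 or a space.
    s = re.sub(r"[^a-z0-9 ]", "", s)
    # 4.1) Remove dictionary words, then join the rest without spaces
    #      (joining is an assumption, see the text above).
    s = "".join(w for w in s.split() if w not in USELESS_WORDS)
    # 2.2) Optional folding of similar letters.
    if fuzzy_letters:
        s = s.translate(SIMILAR_LETTERS)
    # 4.2) Collapse runs of the same letter: madonna -> madona.
    return re.sub(r"([a-z])\1+", r"\1", s)
```

With this sketch, sih1("AC/DC"), sih1("Ac-Dc") and sih1("AC DC") all give
acdc, and sih1("Madonna") gives madona. The fuzzy_letters variant is meant
only for forgiving searches, since it merges far more strings.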


2.2 is useful when the user who is searching does not know the correct
spelling.
4.1: I'd suggest having a default, almost static dictionary. If the user
wants to customize it, regenerating the sih1 hashes does not take long:
regenerating the hash index for my collection of 6475 tracks used 13124
queries and took 9862 ms, and half of those queries are excessive ones
created by the framework I use.
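
To illustrate why a dictionary change means regenerating the hashes, here is
a small Python sketch. Everything in it is hypothetical: make_hasher is a
reduced stand-in for sih1 (lowercase, strip punctuation, drop dictionary
words), the index is kept in memory rather than in a database, and the track
titles and the extra word "remastered" are invented sample data.

```python
import re
from collections import defaultdict


def make_hasher(useless_words):
    """Return a reduced stand-in for sih1 bound to one word dictionary."""
    useless = frozenset(w.lower() for w in useless_words)

    def hasher(text):
        # Lowercase, strip punctuation, drop dictionary words, join the rest.
        cleaned = re.sub(r"[^a-z0-9 ]", "", text.lower())
        return "".join(w for w in cleaned.split() if w not in useless)

    return hasher


def rebuild_index(tracks, useless_words):
    """Recompute every hash and group the track ids that share one."""
    hasher = make_hasher(useless_words)
    index = defaultdict(list)
    for track_id, title in tracks.items():
        index[hasher(title)].append(track_id)
    return dict(index)


# Invented sample data: track id -> title.
tracks = {1: "Back in Black", 2: "back in black (remastered)", 3: "Hells Bells"}

default_index = rebuild_index(tracks, {"feat", "ft", "featuring"})
custom_index = rebuild_index(tracks, {"feat", "ft", "featuring", "remastered"})
```

With the default dictionary tracks 1 and 2 get different hashes; once
"remastered" is added and the index is rebuilt, both end up under the same
key. Every stored hash depends on the dictionary, so the whole index has to
be recomputed after such a change.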


Bottom line

I'm currently prototyping the Collectors plugin for Amarok
http://wiki.github.com/LukasLt/Collector and sih1 is going to be used
widely in it. I think Amarok in general could make use of it.

The Collectors plugin in short
The aim is to create a batch-processing plugin that corrects typos in titles,
artist names and album names, fetches metadata from online resources, and
helps to identify duplicates and previously deleted tracks in a
user-friendly way.