The acronym *sih1*, short for "Simplified string Hash version 1", is used here to avoid confusion with other hashing methods.<br><br>Amarok gets its data from various sources, such as ID3 tags, file names, <a href="http://last.fm" target="_blank">last.fm</a>, Discogs, MusicBrainz and user input, so there is always a high chance of getting the same string in several forms, e.g. AC/DC, Ac/Dc, ac/dc, AcDc, AC-DC, AC DC, acdc; the list can get very long.<br>
This is a problem, because Amarok treats each form as a separate entry. It causes longer lists (the same artist appears more than once), harder browsing (with duplicates the user has to enter a filter twice) and a collection that is difficult to manage (which entry is the real one?).<br>
Searching has a similar problem: it is easy to mistype a name or title, especially where punctuation is involved (should I type It's or Its?). The punctuation problem also applies to scripts designed to automate tagging, because byte-by-byte comparison does not always report a match when it should.<br>
<br><br>The idea of sih1 is to solve most of these problems. In my trials with various data sets it worked about 90% of the time.<br><br>sih1 is a one-way hashing method, designed to be used internally to help match similar strings: sih1(AC/DC) == sih1(acdc) == sih1(AC-DC) == acdc<br>
<br>In pseudo-code it looks like this:<br><br><br>1) Lowercase: JayZ -> jayz<br>2.1) Convert non-Latin letters to Latin equivalents: š -> s, ų,ū -> u <br>2.2) Convert similar letters: w -> v, n -> m, y -> i, j -> i, d -> t<br>
3) Strip punctuation (anything not in "a-z0-9 "): AC/DC -> ACDC, It's -> Its, Jay-Z -> JayZ<br>
4.1) Remove useless words using a predefined dictionary: <br>feat, ft, featuring - they do not change the meaning of the string for the application<br>(www.*) - various tags added by ripping applications<br>^[0-2][0-9] - in case the track number is embedded in the title <br>
4.2) Remove duplicated characters: <br>madonna -> madona, robbie -> robie<br><br><br>2.2 is useful when the user who is searching does not know the correct spelling. <br>4.1 I'd suggest shipping a default, almost static dictionary. If the user wants to customize it, regenerating the sih1 values does not take long: regenerating the hash index for my collection of 6,475 tracks used 13,124 queries and took 9,862 ms, and about half of those queries are excessive ones created by the framework I use. <br>
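<br>The steps above can be sketched in Python roughly as follows. This is only an illustration, not the code used in Collector: the names `sih1`, `NOISE_WORDS`, `SIMILAR_LETTERS` and the `fold_similar` switch are made up for the example. Two assumptions are baked in: step 2.2 is treated as optional and is off by default (with d -> t switched on, AC/DC hashes to actc rather than acdc), and it runs after step 4.1 so that the dictionary words still match. The leading track number is only dropped when a space follows it.<br>

```python
import re
import unicodedata

# Step 4.1 dictionary (hypothetical name): words that add nothing for matching.
NOISE_WORDS = frozenset({"feat", "ft", "featuring"})

# Step 2.2 table (hypothetical name): letters that are easily confused.
SIMILAR_LETTERS = str.maketrans({"w": "v", "n": "m", "y": "i", "j": "i", "d": "t"})


def sih1(text, fold_similar=False, noise_words=NOISE_WORDS):
    """Return a simplified matching key for an artist, album or title string."""
    # 1) Lowercase.
    s = text.lower()

    # 2.1) Decompose accented letters and drop the combining marks (š -> s).
    s = unicodedata.normalize("NFKD", s)
    s = "".join(ch for ch in s if not unicodedata.combining(ch))

    # 3) Delete everything that is not a-z, 0-9 or a space.
    s = re.sub(r"[^a-z0-9 ]", "", s).strip()

    # 4.1) Drop a leading track number (assumption: only when a space follows),
    #      then dictionary words and leftover "www..." tags.
    s = re.sub(r"^[0-2][0-9] +", "", s)
    words = [w for w in s.split()
             if w not in noise_words and not w.startswith("www")]
    s = " ".join(words)

    # 2.2) Optional: fold similar letters. Done after 4.1 in this sketch so
    #      that "featuring" and "www" are still recognisable above.
    if fold_similar:
        s = s.translate(SIMILAR_LETTERS)

    # 4.2) Collapse repeated letters (madonna -> madona). Digits are left
    #      alone here; collapsing them too is an equally plausible reading.
    s = re.sub(r"([a-z])\1+", r"\1", s)
    return s


if __name__ == "__main__":
    for name in ["AC/DC", "Ac-Dc", "acdc", "Madonna", "Artist feat. Guest"]:
        print(name, "->", sih1(name))
```

With the default settings `sih1("AC/DC")`, `sih1("Ac-Dc")` and `sih1("acdc")` all give `acdc`. AC DC keeps its space, since step 3 leaves spaces in place. Letters without a Unicode decomposition (e.g. ł, ø) are not mapped by this sketch and would need an explicit table.<br>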
<br><br>Bottom line<br><br>I'm currently prototyping the Collectors plugin for Amarok <a href="http://wiki.github.com/LukasLt/Collector">http://wiki.github.com/LukasLt/Collector</a>, and sih1 is going to be used widely in it. I think Amarok in general could make use of it.<br>
<br> Collectors plugin in short<br>The aim is to create a batch-processing plugin that corrects typos in titles, artist names and album names, fetches metadata from online resources, and helps to identify duplicates and previously deleted tracks in a user-friendly way.<br>