If this is going to be implemented, I think the feature should be the user's choice. <div><br></div><div>I agree with John on this point: </div><div><i>Stripping punctuation works fine usually, but when I google e.g. "C++ memory allocation" that's what I want, not "C memory allocation".</i><br>
<div><div><div><br></div><div>Also, removing duplicated characters would confuse users: for example, Madonna is written <b>Madonna</b> (with two N's), not <b>Madona</b>.</div><div><br><div class="gmail_quote">
On Sun, Jan 10, 2010 at 8:23 PM, John Atkinson <span dir="ltr"><<a href="mailto:john@fauxnetic.co.uk">john@fauxnetic.co.uk</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
Hi all,<br><br>I thought I'd quickly chime in with my thoughts on this suggestion. In my opinion the biggest problem with a system like this is the inability to (easily) turn it off. This is probably easiest to do the de-facto way by treating anything inside quote marks as literal.<br>
<br>Sometimes users might really want to filter for the string, and only that string. I can't think of a band name equivalent off the top of my head, but there was nothing more infuriating than when, back in the day, search engines used to do this. Stripping punctuation works fine usually, but when I google e.g. "C++ memory allocation" that's what I want, not "C memory allocation".<br>
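<br>A minimal sketch of the quote-mark opt-out, assuming a hypothetical split_query helper that is not part of Amarok: quoted phrases are returned untouched, and only the remaining terms would be handed to any normalization.<br>

```python
import re

# Hypothetical helper: the thread only suggests treating quoted text as
# literal; the name and return shape here are assumptions.
QUOTED = re.compile(r'"([^"]*)"')


def split_query(query):
    """Return (literal_phrases, loose_terms) for a search query."""
    # Text inside double quotes must be matched exactly as typed.
    literals = [phrase for phrase in QUOTED.findall(query) if phrase]
    # Everything outside the quotes may be normalized freely.
    remainder = QUOTED.sub(" ", query)
    loose_terms = remainder.split()
    return literals, loose_terms


if __name__ == "__main__":
    print(split_query('"C++" memory allocation'))
```

An unbalanced quote mark is simply left in the loose terms here; a real implementation would have to decide how to treat it.<br>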
<br>- John<br><br><br><div class="gmail_quote">2010/1/10 Lukas <span dir="ltr"><<a href="mailto:1lukas1@gmail.com" target="_blank">1lukas1@gmail.com</a>></span><br><blockquote class="gmail_quote" style="border-left:1px solid rgb(204, 204, 204);margin:0pt 0pt 0pt 0.8ex;padding-left:1ex">
<div><div></div><div class="h5">
The acronym *sih1*, short for "Simplified string Hash version 1", will be used to avoid confusion with other hashing methods.<br><br>Because Amarok gets data from various sources (ID3 tags, file names, <a href="http://last.fm" target="_blank">last.fm</a>, Discogs, MusicBrainz, user input, etc.), there is always a high chance of getting the same string in various forms, e.g. AC/DC, Ac/Dc, ac/dc, AcDc, AC-DC, AC DC, acdc; the list can get very long. <br>
This is bad, because Amarok treats each form as a separate entry. This causes longer lists (the same artist is repeated more than once), harder browsing (with duplicates the user has to enter a filter twice), and a collection that is difficult to manage (which one is the real one?).<br>
Also, when searching, it is easy to mistype a name or title, especially where punctuation varies (should I type It's or Its, etc.). The punctuation problem also applies to scripts designed to automate tagging, because a byte-by-byte comparison does not always report a match when it should.<br>
<br><br>The idea of sih1 is to solve most of these problems. When I tried it on various sets of data, it worked about 90% of the time.<br><br>sih1 is a one-way hashing method, designed to be used internally to help match similar strings: sih1(AC/DC) == sih1(acdc) == sih1(AC-DC) == acdc<br>
<br>In pseudo-code it looks like:<br><br><br>1) Lowercase: JayZ -> jayz<br>2.1) Convert non-Latin letters to Latin equivalents: š -> s, ų,ū -> u <br>2.2) Convert similar letters: w -> v, n -> m, y -> i, j -> i, d -> t<br>
3) Strip punctuation (anything other than "a-z0-9 "): AC/DC -> ACDC, It's -> Its, Jay-Z -> JayZ<br>
4.1) Remove useless words using a predefined dictionary: <br>feat, ft, featuring - as they don't change the meaning of the string to the application<br>(www.*) - various tags added by ripping applications<br>^[0-2][0-9] - in case the track number is embedded in the title <br>
4.2) Remove duplicated chars: <br>madonna -> madona, robbie -> robie<br><br><br>2.2 is useful when the user who is searching does not know the correct spelling. <br>4.1 I'd suggest having a default dictionary, almost static. But if the user wants to customize it, regenerating the sih1 values doesn't take long. Regenerating the hash index for my collection of 6,475 tracks used 13124 queries and took 9862 ms, and half of those queries are excessive ones created by the framework I use. <br>
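<br>The steps above can be sketched in Python roughly as follows. This is only an illustration under several assumptions, not the actual implementation: diacritics are dropped via Unicode decomposition (letters with no decomposition are left alone and then removed in step 3), the "(www...)" tag is removed before punctuation is stripped because its dots and parentheses are needed to recognize it, the track-number pattern is read as two leading digits followed by whitespace, only repeated letters are collapsed, and step 2.2 sits behind a hypothetical fold_similar flag that is off by default.<br>

```python
import re
import unicodedata

# Step 2.2 table as listed in the message.
SIMILAR = str.maketrans({"w": "v", "n": "m", "y": "i", "j": "i", "d": "t"})
# Step 4.1 dictionary as listed in the message.
STOP_WORDS = {"feat", "ft", "featuring"}


def sih1(text, fold_similar=False):
    """Sketch of the sih1 steps; fold_similar is a hypothetical switch."""
    # 1) Lowercase.
    s = text.lower()
    # 2.1) Decompose accented letters and drop the combining marks.
    s = unicodedata.normalize("NFKD", s)
    s = "".join(ch for ch in s if not unicodedata.combining(ch))
    # 4.1) Drop "(www...)" tags while the punctuation is still present
    #      (assumed ordering, see the note above).
    s = re.sub(r"\(www\.[^)]*\)", " ", s)
    # 3) Keep only a-z, 0-9 and spaces.
    s = re.sub(r"[^a-z0-9 ]", "", s)
    # 4.1) Drop a leading two-digit track number (00-29).
    s = re.sub(r"^\s*[0-2][0-9]\s+", "", s)
    # 4.1) Drop dictionary words and normalize whitespace.
    s = " ".join(word for word in s.split() if word not in STOP_WORDS)
    # 2.2) Optional letter folding, applied after the dictionary lookup
    #      so that the stop words still match.
    if fold_similar:
        s = s.translate(SIMILAR)
    # 4.2) Collapse runs of the same letter.
    return re.sub(r"([a-z])\1+", r"\1", s)


if __name__ == "__main__":
    for name in ("AC/DC", "AC-DC", "acdc", "Madonna", "Jay-Z"):
        print(name, "=", sih1(name))
```

With these assumptions, sih1("AC/DC"), sih1("AC-DC") and sih1("acdc") all give acdc, while "AC DC" stays "ac dc" because step 3 keeps spaces. Turning on fold_similar changes the result for AC/DC to actc, which is why step 2.2 is left off by default in this sketch.<br>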
<br><br>Bottom line<br><br>I'm currently prototyping the Collectors plugin for Amarok <a href="http://wiki.github.com/LukasLt/Collector" target="_blank">http://wiki.github.com/LukasLt/Collector</a> and sih1 is going to be used widely in it. I think Amarok in general could make use of it.<br>
<br> Collectors plugin in short<br>The aim is to create a batch-processing plugin that corrects typos in titles, artists and album names, fetches metadata from online resources, and helps identify duplicates and previously deleted tracks in a user-friendly way.<br>
<br></div></div>_______________________________________________<br>
Amarok-devel mailing list<br>
<a href="mailto:Amarok-devel@kde.org" target="_blank">Amarok-devel@kde.org</a><br>
<a href="https://mail.kde.org/mailman/listinfo/amarok-devel" target="_blank">https://mail.kde.org/mailman/listinfo/amarok-devel</a><br>
<br></blockquote></div><br>
<br></blockquote></div><br><br clear="all"><br>-- <br>Milot Shala<br>gtalk: <a href="mailto:milot.shala@gmail.com">milot.shala@gmail.com</a><br>blog: <a href="http://www.codespartan.org/">http://www.codespartan.org/</a><br>
</div></div></div></div>