[KDE-Sonnet] Should we detect dialects?

Kevin Scannell kscanne at gmail.com
Wed Jan 17 21:47:49 CET 2007


On Wednesday 17 January 2007 13:40, Jacob R Rideout wrote:
> Hello everyone, this is the first post to the new kde-sonnet list!
>
> I just wrote a blog post explaining why Sonnet doesn't detect dialects
> and how it handles them.
>
> http://blog.jacobrideout.net/2007/01/queen-and-country.html
>
> So, am I right? Or is there a need to change the behavior? I've made
> a series of trade-offs based on a set of assumptions about end-user use
> cases. Are these assumptions and use cases correct? Should we keep the
> current behavior, but trigger additional heuristics in certain cases?

Hi Jacob,

   Great question.  I thought about this a lot when working on my
web crawler.    

   As you note in your blog, there is some fuzziness here regarding
language vs. dialect vs. orthography (and no, I'm not about to play
the pedant!).  

   I'll say this much though - distinguishing dialects or orthographies
is sometimes easier than distinguishing languages.  I've done
some work on statistical methods for distinguishing the
orthographies of Cornish (Britain), and also the dialects of Ladin
(Northern Italy), and more-or-less naive approaches work fine in
these cases.  On the other hand, as I recall, distinguishing Indonesian
vs. Malay (distinct languages, at least from the point of view of ISO-639-1)
is harder.  Xhosa vs. Zulu was problematic too.
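
To give a rough idea of what a "more-or-less naive" statistical approach
can look like, here is a minimal sketch that compares character trigram
counts by cosine similarity.  This is an illustration only and not
necessarily the method used for Cornish or Ladin.  The labels
"orthography-a" and "orthography-b" and the training strings are invented
placeholders, not real data from any language.

```python
from collections import Counter
import math


def trigram_profile(text):
    """Count character trigrams in lowercased, whitespace-normalized text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(text, profiles):
    """Return the label whose profile is most similar to the text."""
    sample = trigram_profile(text)
    return max(profiles, key=lambda label: cosine(sample, profiles[label]))


# Invented placeholder "training" text: two spellings of made-up words.
PROFILES = {
    "orthography-a": trigram_profile("kwal kwen kwir"),
    "orthography-b": trigram_profile("qual quen quir"),
}

print(classify("kwon", PROFILES))
```

With real training text in place of the placeholders, the same comparison
would also produce the pairwise similarity scores that show which pairs
are hard to separate.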

The full similarity table (languages only) is here:
http://borel.slu.edu/crubadan/table.html

So it would be a shame to preclude the possibility
of distinguishing dialects in easy cases like these,
by having Sonnet only treat things at
the granularity of ISO-639-1 (or -2 or -3).  

It would be nice if Sonnet were flexible enough to
return language tags according to RFC 4646
(http://www.rfc-editor.org/rfc/rfc4646.txt)
in cases where it's possible to do so, and in cases
where it isn't easy/reliable/important (e.g. en_US vs. en_GB),
just give the best answer possible (e.g. "en").
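
One way the fallback described above might work is sketched below, under
stated assumptions.  The function name best_tag, the score dictionary,
and the min_margin threshold are all hypothetical and are not part of
Sonnet's API.  The idea is simply to return the full tag when it clearly
beats the other variants of the same language, and the bare language
subtag otherwise.

```python
def best_tag(scores, min_margin=0.2):
    """Pick the most specific tag that the scores can support.

    scores maps hyphen-separated language tags (e.g. "en-US") to a
    confidence value.  min_margin is an arbitrary illustrative threshold.
    """
    if not scores:
        raise ValueError("no candidate tags")
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    top_tag, top_score = ranked[0]
    language = top_tag.split("-")[0]
    # Scores of the other candidates that share the same primary language.
    rivals = [score for tag, score in ranked[1:]
              if tag.split("-")[0] == language]
    if rivals and top_score - rivals[0] < min_margin:
        # Variants are too close to call: give only the language.
        return language
    return top_tag


print(best_tag({"en-US": 0.51, "en-GB": 0.49}))
print(best_tag({"en-US": 0.90, "en-GB": 0.30}))
```

The first call has two variants that are nearly tied, so only the
language comes back; the second has a clear winner, so the full tag is
returned.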

-Kevin


