why the % cruft?

Vadim Plessky lucy-ples at mtu-net.ru
Tue Jul 9 11:45:08 BST 2002


On Tuesday 09 July 2002 11:15 am, Waldo Bastian wrote:
|  On Tuesday 09 July 2002 12:01 am, Lars Knoll wrote:
|  > > URLs are spec'ed as a sequence of octets (8-bit values); "Unicode URLs"
|  > > basically don't exist. Despite that we try to handle them anyway and
|  > > apparently that doesn't always work. (E.g. we need to convert Unicode
|  > > to an 8-bit sequence before we can transfer it to the website but the
|  > > encoding to use for that is unspecified, so we can only guess.)
|  >
|  > As Dirk already pointed out, IE sends URLs in UTF-8 by default. I'm
|  > pretty sure we could do the same without breaking a lot of web pages
|  > (they'd be broken with IE as well). Maybe there's an HTTP header field we
|  > can set to indicate this?
|
|  My impression was that many non-latin1 (e.g. Russian, Japanese, Korean,
| etc.) websites use the "local locale" as encoding and not UTF-8. Maybe Vadim
| can comment on that from the Russian point of view.

OK, on my Linux setup (I use an English locale, as I don't need Cyrillic in 
the console, etc.) I tried the same task (a Google search for a Cyrillic 
word), this time using Mozilla (1.0-rc2).

First, Mozilla redirected me from google.com to google.com.ru, and I tried the 
same word, 'пример', from that page:
a)
http://www.google.com.ru/search?q=%D0%BF%D1%80%D0%B8%D0%BC%D0%B5%D1%80&ie=UTF-8&oe=UTF-8&hl=ru&btnG=%D0%9F%D0%BE%D0%B8%D1%81%D0%BA+%D0%B2+Google
So, despite being on a web page displayed in Cyrillic, Google converted the 
search word (in the URL) to UTF-8.

Then I forced Mozilla to http://www.google.com/en (the "Google in English" button).
b) Result:
http://www.google.com/search?hl=en&ie=UTF-8&oe=UTF-8&q=%D0%BF%D1%80%D0%B8%D0%BC%D0%B5%D1%80&btnG=Google+Search

So, the difference was:
a) hl=ru
b) hl=en
but in both cases:
ie=UTF-8&oe=UTF-8
(I don't know, though, what the difference between those two fields is)
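
To illustrate how a receiver could use such a field, here is a minimal Python sketch that decodes the query string of URL (b). It assumes that `ie` names the charset of the percent-encoded bytes, which the message above leaves open; `decode_query` is a hypothetical helper, not anything Google or KDE provides.

```python
from urllib.parse import urlsplit, parse_qs

# URL (b) from the message above.
URL = ("http://www.google.com/search?hl=en&ie=UTF-8&oe=UTF-8"
       "&q=%D0%BF%D1%80%D0%B8%D0%BC%D0%B5%D1%80&btnG=Google+Search")


def decode_query(url, default_charset="utf-8"):
    """Decode query parameters using the charset named in 'ie'.

    Assumption: 'ie' declares the encoding of the escaped bytes.
    """
    query = urlsplit(url).query
    # First pass: latin-1 maps every byte to a character, so it cannot
    # fail, and it is enough to read the plain-ASCII 'ie' value.
    raw = parse_qs(query, encoding="latin-1")
    charset = raw.get("ie", [default_charset])[0]
    # Second pass: decode the escaped bytes with the declared charset.
    decoded = parse_qs(query, encoding=charset, errors="strict")
    return {key: values[0] for key, values in decoded.items()}


params = decode_query(URL)
print(params["q"], params["hl"])
```

Without such a hint in the URL, the receiver has to guess the charset, which is the problem Waldo describes.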

Then I took a search engine with local roots ;-)
Yandex (which claims to be the biggest search engine here):
http://www.yandex.ru
word: пример
Result:
http://www.yandex.ru/yandsearch?text=%EF%F0%E8%EC%E5%F0

So, Yandex *doesn't convert* the URL to UTF-8.
As far as I can see, they use windows-1251 (cp1251) by default, and 
percent-encode the cp1251 bytes in the URL, without extra fields.

Another big search engine, Rambler:
http://www.rambler.ru
word: пример
http://search.rambler.ru/cgi-bin/rambler_search?words=%D0%D2%C9%CD%C5%D2&where=1

And those guys return the page in 'koi8-r' (not in 'cp1251')!
The percent-encoded byte values are different too...
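
The three query strings above can be reproduced with a short Python sketch: the same word is converted to bytes in each charset and every byte is then percent-escaped. `percent_encode` is a name made up for this illustration; the charsets are the ones observed above.

```python
from urllib.parse import quote

WORD = "пример"


def percent_encode(text, charset):
    """Encode text to bytes in the given charset, then %-escape each byte."""
    return quote(text, safe="", encoding=charset)


# Same word, three charsets, three different byte sequences.
for charset in ("utf-8", "cp1251", "koi8-r"):
    print(f"{charset:7} {percent_encode(WORD, charset)}")
```

The UTF-8 result matches the Google URLs, cp1251 matches Yandex, and koi8-r matches Rambler.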

So, it seems that *returning the URL string in the same encoding as the 
original page* is a valid (and existing) approach.
But Google probably has the best solution, using UTF-8 by default.
We can't change other search engines, though...
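
A rough sketch of that "same encoding as the page" approach, in Python: encode the text in the page's charset and fall back to UTF-8 only when the charset cannot represent it. The function name and the fallback rule are assumptions made for illustration, not what Konqueror or any other browser actually does.

```python
from urllib.parse import quote


def encode_for_page(text, page_charset):
    """Return (percent-encoded text, charset actually used).

    Hypothetical policy: prefer the charset of the originating page,
    fall back to UTF-8 if the text cannot be represented in it.
    """
    try:
        encoded = quote(text, safe="", encoding=page_charset, errors="strict")
        return encoded, page_charset
    except UnicodeEncodeError:
        # E.g. a Cyrillic word typed into a form on a latin-1 page.
        return quote(text, safe="", encoding="utf-8"), "utf-8"


print(encode_for_page("пример", "koi8-r"))
print(encode_for_page("пример", "latin-1"))
```

The fallback case is where guessing starts: the server has no way to tell which charset was used unless the URL carries a hint such as Google's `ie` field.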
 
|
|  Cheers,
|  Waldo

-- 

Vadim Plessky
http://kde2.newmail.ru  (English)
33 Window Decorations and 6 Widget Styles for KDE
http://kde2.newmail.ru/kde_themes.html
KDE mini-Themes
http://kde2.newmail.ru/themes/




