[Owncloud] RFC: Unicode normalization

Evert Pot evert at rooftopsolutions.nl
Fri Nov 9 14:18:08 UTC 2012


Hi!

Very good that this is happening. It's a complicated subject. I find that it's extremely important for any web application to have a strong awareness of the character set of _any_ incoming data, and to _always_ deal with a consistent character set internally.

One thing I would want to add to this, which may already be on your radar:

If you run ownCloud on a Windows server, you pretty much cannot store files with Unicode in their filenames.
Even though Windows may internally use NFC for storage, the PHP filesystem APIs only deal with CP-1252 (or whatever the default is for your locale). The only way to get around this is to use COM objects to interact with the filesystem.

On Mac servers this will be NFD, and Linux takes any bytes and makes no assumptions.

This means that if you indeed want to support Windows (as a server), you should really encode all filenames to something ASCII-compatible before storage.

I feel that the clearest way to deal with this for WebDAV servers that are supposed to run on any platform is to convert any path coming from a client to one specific Unicode normalization form for internal use, and to encode this string with urlencode before storage.

This is the only way you can predictably and consistently store filenames cross-platform.
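
As a rough sketch of what I mean (in Python rather than PHP, purely for illustration; `to_storage_name` and `from_storage_name` are made-up names, and I'm assuming NFC as the internal form and that this is applied per path segment):

```python
import unicodedata
from urllib.parse import quote, unquote


def to_storage_name(client_name: str) -> str:
    """Normalize one path segment to NFC, then percent-encode it to ASCII."""
    nfc = unicodedata.normalize("NFC", client_name)
    # safe="" also encodes "/", so call this per segment, not on a whole path.
    return quote(nfc, safe="")


def from_storage_name(stored_name: str) -> str:
    """Reverse the percent-encoding to get the NFC name back."""
    return unquote(stored_name)


# The same visible name in composed and decomposed form.
composed = "Am\u00e9lie.txt"
decomposed = "Ame\u0301lie.txt"

# Both map to one ASCII-only name on disk.
assert to_storage_name(composed) == to_storage_name(decomposed)
print(to_storage_name(decomposed))
```

The point is that the on-disk name no longer depends on what the server's filesystem does with non-ASCII bytes.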

Note that SabreDAV will attempt to detect any incoming Latin1 encoded path, and already transcode it to Unicode.
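
This is not SabreDAV's actual code, just a minimal sketch of the general idea under the assumption that anything which is not valid UTF-8 is Latin-1 (`path_bytes_to_text` is a hypothetical helper):

```python
def path_bytes_to_text(raw: bytes) -> str:
    """Decode a path segment as UTF-8, falling back to Latin-1.

    Heuristic only: bytes that are not valid UTF-8 are assumed to be Latin-1.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")


# "Amélie" sent as UTF-8 and as Latin-1 ends up as the same string.
assert path_bytes_to_text(b"Am\xc3\xa9lie") == path_bytes_to_text(b"Am\xe9lie")
```

A heuristic like this can misfire on Latin-1 input that happens to be valid UTF-8, so it is a best-effort fallback at most.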

Hope this helps, and good luck with this.

Evert

On Nov 9, 2012, at 2:53 PM, Daniel Molkentin <danimo at owncloud.com> wrote:

> Hi,
> 
> this week, we made some promising advances on the syncing client to reduce the number of problems leading to incomplete syncs and conflicts. However, one of these fixes requires some input: Unicode normalization, and the different ways operating systems handle it.
> 
> What is Unicode normalization?
> 
> In Unicode, some special characters can be stored in two ways: decomposed and composed. Converting them to one or the other is called "Unicode normalization". There are 4 forms of normalization: NFC (Normalization Form C, i.e. composed normalization) and NFD (Normalization Form D, i.e. decomposed normalization), plus one compatibility variant of each (read http://en.wikipedia.org/wiki/Unicode_equivalence if you are interested in the details).
> 
> Example: In NFC, the 'é' in "Amélie" is stored as the single character 'é'; in NFD, it's stored as two characters, 'e' + '  ◌́' (where the latter means "accent on top of the previous character").
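
To make the example above concrete, here is a small sketch using Python's standard `unicodedata` module (Python is only my choice for illustration here):

```python
import unicodedata

composed = "Am\u00e9lie"  # 'é' as the single code point U+00E9
decomposed = unicodedata.normalize("NFD", composed)  # 'e' followed by U+0301

# The two strings look the same but are not equal.
assert composed != decomposed
assert len(composed) == 6 and len(decomposed) == 7

# In UTF-8 they are also different byte sequences, hence different filenames.
assert composed.encode("utf-8") != decomposed.encode("utf-8")

# Normalizing both to the same form makes them comparable again.
assert unicodedata.normalize("NFC", decomposed) == composed
```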
> 
> Mac OS, by default, stores all its filenames as NFD, whereas Linux and Windows use NFC. The W3C also mandates that special characters in URLs should be in NFC prior to percent-encoding them (check the IRI RFC for details).
> 
> What is Unicode normalization not?
> 
> - URL percent encoding
> - Character encoding (UTF-7, UTF-8, UTF-16, UTF-32)
> 
> Why is that a problem?
> 
> - Files that should be the same are not. Create the same file with an 'é' on Linux (or Windows) and on Mac, and upload both to the server: you will see two files with identical-looking names on the server (and on the clients after sync). And in fact, they are both there, and both are valid -> certainly unexpected.
> 
> - Bizarre problems when syncing directories with umlauts to a Mac (could also be shadowing another bug, we are investigating this atm)
> 
> So now I have a fix for the ownCloud Client that normalizes all files towards the URL "interface" (which mandates NFC) when sending any request to the server. Other webdav clients for Mac seem to do the same. Still the server needs to do the same on its side: Normalize whatever hits it from the client side into what the server OS needs (usually NFC, unless it's a Mac server) and vice versa (NFC towards the client). Ideally this should still go into 4.5.2.
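
A minimal sketch of that server-side rule, assuming the server picks its form purely by platform (NFD on a Mac, NFC elsewhere). The function names are made up for illustration, and the form Mac filesystems actually use may differ in details from standard NFD:

```python
import sys
import unicodedata


def native_form(platform: str = sys.platform) -> str:
    """Assumed rule: NFD on a Mac server, NFC everywhere else."""
    return "NFD" if platform == "darwin" else "NFC"


def to_server(name: str, platform: str = sys.platform) -> str:
    """Normalize a name coming from a client into the server's form."""
    return unicodedata.normalize(native_form(platform), name)


def to_client(name: str) -> str:
    """Names going back to clients are always NFC."""
    return unicodedata.normalize("NFC", name)


# An NFC name from a client becomes NFD on a Mac server and NFC again on the way out.
on_mac = to_server("Am\u00e9lie", platform="darwin")
assert on_mac == "Ame\u0301lie"
assert to_client(on_mac) == "Am\u00e9lie"
```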
> 
> PHP has Normalizer::normalize (http://php.net/manual/en/normalizer.normalize.php), suggested in https://github.com/owncloud/mirall/issues/45, which mandates the intl extension (a new dependency, although fairly standard). I have not yet figured out if PHP iconv (already a hard dependency) is capable of doing normalization, and could use some help there.
> 
> We also need to make sure to release a patched 4.5.2 along with a patched 1.1.2 in this scenario to make sure there are no issues with existing client installations. Also, the server might want to try and look for an NFD-normalized (or NFC-normalized on a Mac server) version of the file if it does not exist in its native form, and rename it in that case.
> 
> Also, what do we do if both versions exist on the server (should be a rare case though)?
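
The lookup-and-rename idea could look roughly like this. It is only a sketch: `find_with_fallback` is a hypothetical helper, and it deliberately does nothing clever when both versions exist (it just returns the native one and leaves the other alone), so the open question above stays open:

```python
import os
import unicodedata


def find_with_fallback(directory, name, native_form="NFC"):
    """Return the path of `name` in the native form, renaming a stray
    other-form file if that is the only version present. Returns None
    if neither version exists."""
    other_form = "NFD" if native_form == "NFC" else "NFC"
    native = os.path.join(directory, unicodedata.normalize(native_form, name))
    if os.path.exists(native):
        # Native version wins; a file in the other form, if any, is left untouched.
        return native
    other = os.path.join(directory, unicodedata.normalize(other_form, name))
    if os.path.exists(other):
        os.rename(other, native)
        return native
    return None
```

On a filesystem that treats both forms as the same name, the first check already succeeds and no rename happens.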
> 
> Cheers,
>   Daniel
> --
> www.owncloud.com - Your Data, Your Cloud, Your Way!
> 
> ownCloud GmbH, GF: Markus Rex, Holger Dyroff
> Schloßäckerstrasse 26a, 90443 Nürnberg, HRB 28050 (AG Nürnberg)



