[Owncloud] RFC: Unicode normalization

Daniel Molkentin danimo at owncloud.com
Fri Nov 9 13:53:00 UTC 2012


Hi,

this week, we made some promising advances on the syncing client to reduce the number of problems leading to incomplete syncs and conflicts. However, one of these fixes requires some input. It concerns Unicode normalization, which different operating systems handle differently:

What is Unicode normalization?

In Unicode, some special characters can be stored in two ways: decomposed and composed. Converting text to one or the other is called "Unicode normalization". There are four normalization forms: NFC (Normalization Form C, i.e. composed) and NFD (Normalization Form D, i.e. decomposed), plus one compatibility variant of each (NFKC and NFKD). Read http://en.wikipedia.org/wiki/Unicode_equivalence if you are interested in the details.

Example: In NFC, the 'é' in "Amélie" is stored as the single character 'é' (U+00E9). In NFD, it is stored as two characters, 'e' (U+0065) + '◌́' (U+0301), where the latter means "accent on top of the previous character".
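
To make the example concrete, here is a minimal sketch in Python. The standard unicodedata module is used purely for illustration; it is not what the client or the server uses.

```python
import unicodedata

name = "Am\u00e9lie"  # "Amélie", written with an escape to avoid ambiguity

nfc = unicodedata.normalize("NFC", name)  # composed: 'é' is U+00E9
nfd = unicodedata.normalize("NFD", name)  # decomposed: 'e' + U+0301

print(len(nfc), len(nfd))          # 6 7
print(nfc == nfd)                  # False, although both render the same
print([hex(ord(c)) for c in nfd])  # contains 0x65 followed by 0x301
```

Both strings look identical on screen, which is exactly why the resulting problems are hard to spot.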

Mac OS, by default, stores all its file names as NFD, whereas file names on Linux and Windows are usually NFC. The W3C also mandates that special characters in URLs should be in NFC prior to percent-encoding them (check the IRI RFC for details).

What is Unicode normalization not?

- URL percent encoding
- Character encoding (UTF-7, UTF-8, UTF-16, UTF-32)
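
The three layers in the list above can be told apart with a short Python sketch (again only an illustration, using the standard library): normalization decides which code points are used, the encoding decides how they become bytes, and percent-encoding escapes those bytes for a URL without normalizing anything.

```python
import unicodedata
from urllib.parse import quote

name = "Am\u00e9lie"

nfc = unicodedata.normalize("NFC", name)
nfd = unicodedata.normalize("NFD", name)

# Encoding: same normalization form, different byte representations.
utf8_bytes = nfc.encode("utf-8")       # b'Am\xc3\xa9lie'
utf16_bytes = nfc.encode("utf-16-le")  # two bytes per character here

# Percent-encoding works on the UTF-8 bytes and does not normalize,
# so the two forms yield different URL path segments.
print(quote(nfc))  # Am%C3%A9lie
print(quote(nfd))  # Ame%CC%81lie
```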

Why is that a problem?

- Files that should be the same are not. Create the same file with an 'é' in its name on Linux (or Windows) and on Mac, then upload both to the server: you will see two files with seemingly identical names on the server (and on the clients after sync). In fact, both are there, and both are valid. That is certainly unexpected.

- Bizarre problems when syncing directories with umlauts to a Mac (could also be shadowing another bug, we are investigating this atm)

So now I have a fix for the ownCloud Client that normalizes all file names towards the URL "interface" (which mandates NFC) when sending any request to the server. Other WebDAV clients for Mac seem to do the same. Still, the server needs to do the same on its side: normalize whatever hits it from the client side into what the server OS needs (usually NFC, unless it's a Mac server) and vice versa (NFC towards the client). Ideally this should still go into 4.5.2.
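
As a rough sketch of the client-side idea (not the actual client code; the function name to_request_path is made up for this example), normalizing to NFC before percent-encoding could look like this in Python:

```python
import unicodedata
from urllib.parse import quote


def to_request_path(local_path: str) -> str:
    """Return a percent-encoded URL path with all characters in NFC.

    Hypothetical helper for illustration only.
    """
    nfc_path = unicodedata.normalize("NFC", local_path)
    # quote() encodes as UTF-8 and leaves '/' untouched by default.
    return quote(nfc_path)


mac_name = "Photos/Ame\u0301lie.jpg"   # NFD, as a Mac would report it
linux_name = "Photos/Am\u00e9lie.jpg"  # NFC

print(to_request_path(mac_name))    # Photos/Am%C3%A9lie.jpg
print(to_request_path(linux_name))  # Photos/Am%C3%A9lie.jpg
```

With this, both clients address the same resource on the server, regardless of how the local file system stores the name.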

PHP has Normalizer::normalize (http://php.net/manual/en/normalizer.normalize.php), suggested in https://github.com/owncloud/mirall/issues/45, which mandates the intl extension (a new dependency, although fairly standard). I have not yet figured out if PHP iconv (already a hard dependency) is capable of doing normalization, and could use some help there.

We also need to make sure to release a patched 4.5.2 (server) along with a patched 1.1.2 (client) in this scenario, so that there are no issues with existing client installations. Also, the server might want to look for an NFD-normalized (or, on a Mac server, NFC-normalized) version of a file if it does not exist in the native form, and rename it in that case.
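
The fallback lookup described above could be sketched as follows. This is Python for illustration only, not server code; resolve_name and its parameters are hypothetical, and it works on a plain collection of names rather than on a real file system.

```python
import unicodedata


def resolve_name(requested, existing_names, native_form="NFC"):
    """Find the stored name that matches `requested`.

    Returns (stored_name, needs_rename), or (None, False) if no
    normalization variant of the name exists. Hypothetical helper.
    """
    native = unicodedata.normalize(native_form, requested)
    if native in existing_names:
        return native, False
    # Fall back to any stored name that is equal after normalization.
    for stored in existing_names:
        if unicodedata.normalize(native_form, stored) == native:
            return stored, True
    return None, False


stored = {"Ame\u0301lie.txt", "notes.txt"}  # NFD name left by an old client
name, needs_rename = resolve_name("Am\u00e9lie.txt", stored)
print(needs_rename)  # True: the file exists, but under its NFD name
```

On a Mac server, native_form would be "NFD" instead, under the assumption stated above that the server OS wants NFD there.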

Also, what do we do if both versions exist on the server (should be a rare case though)?
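
Whatever we decide, such cases would at least need to be detected first. A minimal sketch in Python (find_collisions is a made-up name, and this does not answer the policy question) groups names that become identical after normalization:

```python
import unicodedata
from collections import defaultdict


def find_collisions(names):
    """Group names that become identical once normalized to NFC.

    Hypothetical helper for illustration only.
    """
    groups = defaultdict(list)
    for name in names:
        groups[unicodedata.normalize("NFC", name)].append(name)
    # Keep only the groups with more than one spelling.
    return {key: sorted(group) for key, group in groups.items() if len(group) > 1}


listing = ["Am\u00e9lie.txt", "Ame\u0301lie.txt", "notes.txt"]
collisions = find_collisions(listing)
print(len(collisions))  # 1: both spellings of the same name collide
```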

Cheers,
  Daniel
--
www.owncloud.com - Your Data, Your Cloud, Your Way!

ownCloud GmbH, GF: Markus Rex, Holger Dyroff
Schloßäckerstrasse 26a, 90443 Nürnberg, HRB 28050 (AG Nürnberg)