[Owncloud] Sync Client needs server help

Klaas Freitag freitag at owncloud.com
Fri May 18 13:38:10 UTC 2012


On 17.05.2012 22:26, Brad McEvoy wrote:
Hi Brad,

thanks for your interesting feedback! I think your post did not make it 
to the mailing list, so I'll forward it along with this answer.

>
> I'm not a developer on ownCloud, but I did a dotcom startup a while back
> trying to be a file sync service like Dropbox, but was a bit late to the
> party.
>
> I'm now converting that to an open source project (see
> https://github.com/Spliffy/spliffy - similar goals, but much less
> advanced than ownCloud, Java based). I posted to this list a few months
> back suggesting that we share experience and work towards a standards
> based and interoperable toolset. I think standards and interoperability
> would generally strengthen the open source offerings as opposed to the
> closed source services currently proliferating.
Yes, standards are good. I have tried to stay as close to WebDAV as 
possible in order to keep the door open for interoperability.
>
> Regarding your question below I'd like to share my experience. I first
> implemented path based sync, as you have done. I have since come to
> believe this is far from optimal. And others from mature and established
> sync product companies share that view.
>
> What git does, and I think this is a good model for any sync tool, is
> calculate hashes (i.e. checksums) for files and for directories, where the
> hash for a directory is the checksum of a formatted list of its members'
> names and hashes. This means that the root folder has a hash which
> uniquely identifies the current state of everything inside it. The
> client can calculate the same hash for its contents. So, to check if
> files are in sync you simply compare the hash of the root directories on
> client and server. If they are different you walk down the directory
> tree, ignoring directories that have the same hash on client and server,
> and locating changed items based on their relative checksums. This is
> very fast, very efficient, and very very robust. It's easy to integrate
> into a WebDAV server as it's just an extra property in PROPFIND or header
> in a HEAD response. It requires server support so that any change to any
> resource results in updated hashes right up to the synchronisation root.
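
To make the scheme described above concrete, here is a minimal sketch
in Python. It is only an illustration under assumptions: the in-memory
tree representation (bytes for a file, dict for a directory), the
function names, the listing format and the choice of SHA-1 are all
hypothetical and are not taken from git, Spliffy or ownCloud.

```python
import hashlib


def file_hash(data: bytes) -> str:
    """Checksum of a file's content."""
    return hashlib.sha1(data).hexdigest()


def tree_hash(node) -> str:
    """Hash a node: bytes is a file, dict (name -> node) is a directory.

    A directory's hash is the checksum of a sorted, formatted list of
    its members' names and hashes, so it changes whenever anything
    below it changes.
    """
    if isinstance(node, bytes):
        return file_hash(node)
    listing = "".join(
        f"{name}\0{tree_hash(child)}\n" for name, child in sorted(node.items())
    )
    return hashlib.sha1(listing.encode("utf-8")).hexdigest()


def changed_paths(local, remote, prefix=""):
    """Yield paths that differ, skipping subtrees whose hashes match.

    For brevity this sketch recomputes hashes on every call; a real
    client and server would store them and only compare.
    """
    if tree_hash(local) == tree_hash(remote):
        return  # identical subtree, nothing to walk
    if isinstance(local, bytes) or isinstance(remote, bytes):
        yield prefix or "/"
        return
    for name in sorted(set(local) | set(remote)):
        path = f"{prefix}/{name}"
        if name not in local or name not in remote:
            yield path  # present on one side only
        else:
            yield from changed_paths(local[name], remote[name], path)
```

With two trees that differ in a single file, the walk descends only
into the directory containing that file; every sibling directory with
a matching hash is skipped without being listed.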

I understand the concept and indeed it's good. It's very close to what I 
want to implement, with the only difference that instead of the hash 
sums, I'd like to use the mtimes, as csync does. Why do we think that's a 
benefit? Well, based on the mtimes it is decidable which version is 
newer. Moreover, the mtime is already natural metadata in every file 
system, so we do not have to add something new. Because of that, csync 
currently runs without any server support.

What is missing is the propagation of the mtimes from individual files 
and directories to their parent directory. If we do that with the 
ownCloud server support, I think we will have the same benefits that you 
described above. As we have the data in a database server side we will 
be able to retrieve the data fast.
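
As a rough illustration of the mtime propagation, here is a small
Python sketch using sqlite3. Everything in it is an assumption made
for the example: the table layout (id, parent, name, mtime) is a
hypothetical stand-in and not the real ownCloud fscache schema, and
the function names are made up.

```python
import sqlite3


def make_cache():
    """Create a hypothetical, in-memory stand-in for the fscache table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE fscache ("
        "id INTEGER PRIMARY KEY, parent INTEGER, name TEXT, mtime INTEGER)"
    )
    return db


def add_entry(db, parent, name, mtime):
    """Insert a file or directory entry and return its id."""
    cur = db.execute(
        "INSERT INTO fscache (parent, name, mtime) VALUES (?, ?, ?)",
        (parent, name, mtime),
    )
    return cur.lastrowid


def touch(db, entry_id, mtime):
    """Record a new mtime and push it up to every ancestor that is older."""
    while entry_id is not None:
        db.execute(
            "UPDATE fscache SET mtime = MAX(mtime, ?) WHERE id = ?",
            (mtime, entry_id),
        )
        row = db.execute(
            "SELECT parent FROM fscache WHERE id = ?", (entry_id,)
        ).fetchone()
        entry_id = row[0] if row else None


def needs_sync(db, root_id, last_seen_mtime):
    """One cheap lookup on the sync root replaces a scan of the whole tree."""
    (mtime,) = db.execute(
        "SELECT mtime FROM fscache WHERE id = ?", (root_id,)
    ).fetchone()
    return mtime > last_seen_mtime
```

The client would then only compare the mtime of the top sync directory
with the one it saw last. Note that in this sketch an mtime that moves
backwards (for example a restored older file) does not propagate, so
that case would need separate handling.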

>
> Note that there is a related RFC - http://tools.ietf.org/html/rfc6578 -
> however I'm not confident that the approach outlined there is quite right.
Do you know if it's already implemented in a WebDAV server?

> Of course finding what files are new or updated is one thing,
> communicating those changes efficiently is another. Spliffy uses a
> similar approach to Bup (https://github.com/apenwarr/bup) to split files
> into blobs which are stable with respect to file changes. Only changed
> blobs are transmitted.
>
> The hashsplitting algorithm is **very** simple, and if you're not doing
> something like this yet I suggest you take a peek -
> https://github.com/HashSplit4J/hashsplit-lib
That's cool, and it is a problem we still have on our list to tackle.
I stumbled over this already and wonder whether there is a C or C++ lib for that.
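
For readers who have not seen hash splitting, the following Python
sketch shows the general idea of content-defined chunking. It is not
the algorithm used by bup or hashsplit-lib: the plain rolling sum, the
window size, the mask and the function name are simplifications
assumed for the example.

```python
WINDOW = 16           # bytes covered by the rolling sum (assumed value)
MASK = (1 << 6) - 1   # boundary when the low 6 bits are all set (assumed)


def split(data: bytes):
    """Split data into chunks whose boundaries depend only on content.

    A boundary is placed wherever the sum of the last WINDOW bytes has
    all MASK bits set. Because that test looks only at nearby bytes, an
    insertion early in a file does not move the boundaries further on,
    so most chunks keep their content and need not be sent again. Real
    implementations use a stronger rolling checksum and enforce minimum
    and maximum chunk sizes; this sketch does neither.
    """
    chunks = []
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= WINDOW:
            rolling -= data[i - WINDOW]  # drop the byte leaving the window
        if (rolling & MASK) == MASK:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])      # trailing chunk without a boundary
    return chunks
```

A client would hash every chunk and transmit only those whose hash the
server does not already know; chunks far from an edit are unchanged.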

> Sorry for the long post, and I hope this is of some assistance.
Great, I really appreciate your input.

Best,

Klaas

>
> On 17/05/2012 9:12 p.m., Klaas Freitag wrote:
>> Hi,
>>
>> one of the biggest shortcomings of the sync client currently is that
>> it does a full scan of the ownCloud directories via WebDAV to
>> query the last modified times. That causes load and other trouble. It
>> would be great to find out if something has changed server side more
>> cheaply.
>>
>> We have the file system cache which also has the mod times in the
>> database. My idea is now, instead of querying every single file, I
>> just issue a HEAD request on the top sync directory and get the latest
>> modtime of all files in that dir back. If that is younger than the one
>> I know, I have to do a sync.
>>
>> I know that it could be even more cool, i.e. delivering the list of
>> files back etc., but let's do small steps. Doing just one HEAD instead
>> of querying the whole tree already will be great.
>>
>> The implementation seems easy: Just get all database IDs of the
>> fscache table entries below the top directory of the sync dir and do
>> kind of
>> SELECT MAX(mtime) FROM fscache WHERE id in ( list-of-all-ids-in dir );
>> That should be fast enough.
>>
>> My question now is: How do we do that? Should we have another app
>> called /files/sync? Or do we want to enhance the WebDAV server to be
>> able to do the described logic if a HEAD request on a dir comes in?
>>
>> I think the latter is more "within the concept" of doing the sync via
>> WebDAV, OTOH a sync app could be useful anyway for other sync related
>> server support.
>>
>> What do you think?
>>
>> Thanks,
>>
>> Klaas