[Owncloud] Who to give input to for desktop and on-/off-line synchronisation of files?

Wed Feb 22 17:17:43 UTC 2012

On Wednesday 22 February 2012 15:27:46 Bjorn Madsen wrote:
> *Thanks Klaas,*

Hi,

> As computational complexity is my favourite subject I started thinking how I
> would perform a full replication the day when our data-repositories pass
> 2-4 TB which is just around the corner....
> 
> On my private ubuntu 10.04 I ran $ ls -aR | wc -l  and found 126037 files,
> equal to 1.45 TB
> I copied the full filetree from my two machines (just the paths and the
> md5sum)- let's call them A & B into SQLite (appx. 9.3Mb each). In addition
> I set up inotify to write any changes to the SQLite database on both
> machines. Nothing more.
> 
> Now my experiment was to load the whole filetree into SQLite and perform
> three join operations:
> 
>    1. What is on machine A but not on machine B, as a LEFT excluding join,
>    i.e. SELECT <select_list> FROM Table_A A LEFT JOIN Table_B B ON A.Key =
>    B.Key WHERE B.Key IS NULL
>    2. What is on machine B but not on machine A, as a RIGHT excluding join,
>    i.e. SELECT <select_list> FROM Table_A A RIGHT JOIN Table_B B ON A.Key =
>    B.Key WHERE A.Key IS NULL
>    3. What is on A and B, as an inner join, i.e. SELECT <select_list> FROM
>    Table_A A INNER JOIN Table_B B ON A.Key = B.Key
> 
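The three joins above can be sketched with Python's sqlite3 module. This is only an illustration under assumptions: the table names files_a and files_b, the columns path and md5, and the helper names are hypothetical and not taken from the setup described. SQLite releases before 3.39 do not accept RIGHT JOIN, so the second query is written as a LEFT JOIN with the tables swapped. Declaring path as PRIMARY KEY gives each table an index the join can use.

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory database with one table per machine (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    for table in ("files_a", "files_b"):
        # PRIMARY KEY on path creates an index, so joins look rows up by key.
        conn.execute(
            f"CREATE TABLE {table} (path TEXT PRIMARY KEY, md5 TEXT NOT NULL)"
        )
    conn.executemany(
        "INSERT INTO files_a VALUES (?, ?)",
        [("docs/a.txt", "111"), ("docs/shared.txt", "222")],
    )
    conn.executemany(
        "INSERT INTO files_b VALUES (?, ?)",
        [("docs/b.txt", "333"), ("docs/shared.txt", "222")],
    )
    return conn


def diff_trees(conn):
    """Return (only_on_a, only_on_b, on_both) as sorted lists of paths."""
    only_a = [row[0] for row in conn.execute(
        "SELECT a.path FROM files_a a LEFT JOIN files_b b ON a.path = b.path "
        "WHERE b.path IS NULL ORDER BY a.path"
    )]
    # Same excluding join with the tables swapped, instead of RIGHT JOIN.
    only_b = [row[0] for row in conn.execute(
        "SELECT b.path FROM files_b b LEFT JOIN files_a a ON a.path = b.path "
        "WHERE a.path IS NULL ORDER BY b.path"
    )]
    both = [row[0] for row in conn.execute(
        "SELECT a.path FROM files_a a INNER JOIN files_b b ON a.path = b.path "
        "ORDER BY a.path"
    )]
    return only_a, only_b, both
```

Calling diff_trees(make_demo_db()) yields the three lists in the order #1, #2, #3.
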
> With this operation I produce lists #1 and #2, which I intend to feed to
> rsync to transfer over ssh between the two machines (pull, not push); doing
> so would be a piece of cake. I use rsync's option to delete after the file
> transfer, as the time during which a file is unavailable is unnoticeable.
> 
> However the first operation of this kind takes some time (17m4.096s on my
> Intel Atom) and as our databases grow bigger, exponentially longer. In
> addition the memory footprint also doesn't get prettier.

Did you expect something else?

If I run csync on my home directory with cold caches, it needs 140.91 seconds
to walk 836549 files, and about 200MB of memory to store the information about
the files.

Comparing side A with side B takes less than a second.

[stderr] 20120222 16:37:00.709 DEBUG    csync.api- Reconciliation for local 
replica took 0.52 seconds visiting 836549 files.
[stderr] 20120222 16:37:01.203 DEBUG    csync.api- Reconciliation for remote 
replica took 0.49 seconds visiting 836549 files.

So 17min vs 0.52 sec ;)
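
To illustrate why comparing two replicas in memory can be this cheap, here is a minimal sketch using hash-based lookups. It is not csync's actual code or algorithm; the function name reconcile and the path-to-checksum dictionaries are assumptions made for the example. Each path is looked up in constant time on average, so the whole comparison is roughly linear in the number of files.

```python
def reconcile(local, remote):
    """Compare two replicas given as {path: checksum} dictionaries.

    Returns (only_local, only_remote, changed) as sorted lists of paths.
    """
    local_paths = local.keys()
    remote_paths = remote.keys()

    # Dictionary key views support set operations directly.
    only_local = sorted(local_paths - remote_paths)
    only_remote = sorted(remote_paths - local_paths)

    # Paths present on both sides whose checksums differ.
    changed = sorted(
        path for path in local_paths & remote_paths
        if local[path] != remote[path]
    )
    return only_local, only_remote, changed
```
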

> 
> Now as all new files are written to the sqlite database, I can timestamp
> the operation and only use incremental operations (at least until I have
> performed 10^14 file operations, where it would be appropriate to recreate
> the database).
> 
> This means I have a few powerful operations available:

17min doesn't sound powerful ... :)

> #A I can segment the join operations to match the available memory
> footprint using SELECT and an appropriate interval which would reflect the
> allocated memory footprint.
> 
> #B More interestingly, I can use change propagation in the database
> operation to avoid redundant operations, by only selecting files updated
> since last operation or last "nirvana" when the synchronisation daemon
> reinitiates its check. In my case the usage of change propagation brings my
> memory footprint down to a few kb's (which I could run even on my old
> phone).
> 
> #C I can measure the entropy (frequency of update, based on list #3) per
> time-interval as an indicator for the age of my files, which means that
> high-entropy files (typically browser and cache data) rarely get
> synchronised, and that ultra-low entropy stuff could be moved to slower
> drives. In practice this would permit all my drives quickly to have the
> latest files I’m working on and then synchronise everything else later (if
> there is space).
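
Points #A and #B above can be sketched together: select only the rows changed since the last synchronisation, and read them in bounded batches so memory use stays small. This is a sketch under assumptions, not the setup described; the table name files, the column changed_at (an integer timestamp that an inotify handler would write), and the helper names are hypothetical.

```python
import sqlite3


def make_change_log():
    """Build a tiny in-memory change log (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE files (path TEXT PRIMARY KEY, changed_at INTEGER NOT NULL)"
    )
    # An index on the timestamp lets the query skip unchanged rows.
    conn.execute("CREATE INDEX idx_files_changed_at ON files (changed_at)")
    conn.executemany(
        "INSERT INTO files VALUES (?, ?)",
        [("old.txt", 50), ("a.txt", 150), ("b.txt", 200), ("c.txt", 250)],
    )
    return conn


def changed_since(conn, last_sync, batch_size=1000):
    """Yield lists of paths changed after last_sync, at most batch_size per list."""
    cursor = conn.execute(
        "SELECT path FROM files WHERE changed_at > ? ORDER BY changed_at, path",
        (last_sync,),
    )
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield [row[0] for row in rows]
```

Each yielded batch could then be written to a file list for the transfer step; how that list is handed over is left out here.
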

I don't think sqlite has been written for file synchronization. I don't know 
what you really want to achieve but 2-way file synchronization isn't done by 
sqlite and rsync is only a one-way file synchronizer.

	-- andreas

-- 
Andreas Schneider                   GPG-ID: F33E3FC6
www.cryptomilk.org                asn at cryptomilk.org