[Owncloud] Who to give input to for desktop and on-/off-line synchronisation of files?
asn at cryptomilk.org
Wed Feb 22 17:17:43 UTC 2012
On Wednesday 22 February 2012 15:27:46 Bjorn Madsen wrote:
> *Thanks Klaas,*
> As computational complexity is my favourite subject, I started thinking
> about how I would perform a full replication on the day our data
> repositories pass 2-4 TB, which is just around the corner...
> On my private Ubuntu 10.04 I ran $ ls -aR | wc -l and found 126037 files,
> equal to 1.45 TB.
> I copied the full file tree from my two machines (just the paths and the
> md5sums), let's call them A & B, into SQLite (approx. 9.3Mb each). In
> addition I set up inotify to write any changes to the SQLite database on
> both machines. Nothing more.
> Now my experiment was to load the whole file tree into SQLite and perform
> three join operations:
> 1. What is on machine A but not on machine B, as a LEFT excluding join,
> i.e. SELECT <select_list> FROM Table_A A LEFT JOIN Table_B B ON A.Key =
> B.Key WHERE B.Key IS NULL
> 2. What is on machine B but not on machine A, as a RIGHT excluding join,
> i.e. SELECT <select_list> FROM Table_A A RIGHT JOIN Table_B B ON A.Key =
> B.Key WHERE A.Key IS NULL
> 3. What is on A and B, as an inner join, i.e. SELECT <select_list> FROM
> Table_A A INNER JOIN Table_B B ON A.Key = B.Key
> With this operation I produce the lists #1 and #2, which I intend to feed
> to rsync to send with ssh across both machines (pull, not push), and doing
> so would be a piece of cake. I use rsync's option to delete after the file
> transfer, as the time during which a file is unavailable is unnoticeable.
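
The three joins listed above can be sketched with Python's built-in sqlite3
module. This is only an illustration under assumptions: the table names `a`
and `b`, the `path` and `md5` columns, and the helper `compare_trees` are
hypothetical and not taken from the original experiment. SQLite releases
before 3.39 do not accept RIGHT JOIN, so list #2 is written here as a LEFT
JOIN with the tables swapped. Declaring `path` as PRIMARY KEY gives each
table an index on the join key, which is probably worth checking when a join
over roughly 126000 rows takes minutes.

```python
import sqlite3


def compare_trees(files_a, files_b):
    """Return (only_a, only_b, both) path lists for two {path: md5} maps.

    Table and column names are illustrative assumptions.
    """
    con = sqlite3.connect(":memory:")
    # PRIMARY KEY creates an index on path, so each join is an index lookup.
    con.execute("CREATE TABLE a (path TEXT PRIMARY KEY, md5 TEXT)")
    con.execute("CREATE TABLE b (path TEXT PRIMARY KEY, md5 TEXT)")
    con.executemany("INSERT INTO a VALUES (?, ?)", files_a.items())
    con.executemany("INSERT INTO b VALUES (?, ?)", files_b.items())

    # List 1: on A but not on B (left excluding join).
    only_a = [row[0] for row in con.execute(
        "SELECT a.path FROM a LEFT JOIN b ON a.path = b.path "
        "WHERE b.path IS NULL ORDER BY a.path")]
    # List 2: on B but not on A (same join with the tables swapped).
    only_b = [row[0] for row in con.execute(
        "SELECT b.path FROM b LEFT JOIN a ON b.path = a.path "
        "WHERE a.path IS NULL ORDER BY b.path")]
    # List 3: on both machines (inner join).
    both = [row[0] for row in con.execute(
        "SELECT a.path FROM a INNER JOIN b ON a.path = b.path "
        "ORDER BY a.path")]
    con.close()
    return only_a, only_b, both
```

Lists #1 and #2 could then be written to files and handed to rsync, one
direction at a time.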
> However, the first operation of this kind takes some time (17m4.096s on my
> Intel Atom) and, as our databases grow bigger, will take exponentially
> longer. In addition, the memory footprint also doesn't get prettier.
Did you expect something else?
If I run csync on my home directory with cold caches, it needs 140.91 seconds
to walk 836549 files, and about 200MB to store the information about the files.
Comparing side A with side B takes less than a second.
[stderr] 20120222 16:37:00.709 DEBUG csync.api- Reconciliation for local
replica took 0.52 seconds visiting 836549 files.
[stderr] 20120222 16:37:01.203 DEBUG csync.api- Reconciliation for remote
replica took 0.49 seconds visiting 836549 files.
So 17min vs 0.52 sec ;)
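
To illustrate why comparing two sides can be this cheap once the file
metadata is in memory, here is a minimal sketch that uses hash lookups. It is
not csync's actual reconciliation code; the function `reconcile` and the
`{path: checksum}` layout are assumptions made for the example. Each path is
looked up in a hash table, so the work grows roughly linearly with the number
of files.

```python
def reconcile(local, remote):
    """Compare two {path: checksum} maps using hash-based set operations.

    Hypothetical helper for illustration; not taken from csync.
    """
    # Paths present on one side only.
    only_local = local.keys() - remote.keys()
    only_remote = remote.keys() - local.keys()
    # Paths present on both sides whose checksums disagree.
    differing = {path for path in local.keys() & remote.keys()
                 if local[path] != remote[path]}
    return only_local, only_remote, differing
```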
> Now, as all new files are written to the SQLite database, I can timestamp
> the operation and only use incremental operations (at least until I have
> performed 10^14 file operations, at which point it would be appropriate to
> recreate the database).
> This means I have a few powerful operations available:
17min doesn't sound powerful ... :)
> #A I can segment the join operations to match the available memory
> footprint using SELECT and an appropriate interval which would reflect the
> allocated memory footprint.
> #B More interestingly, I can use change propagation in the database
> operation to avoid redundant operations, by only selecting files updated
> since last operation or last "nirvana" when the synchronisation daemon
> reinitiates its check. In my case the usage of change propagation brings my
> memory footprint down to a few kb's (which I could run even on my old
> #C I can measure the entropy (frequency of update, based on list #3) per
> time-interval as an indicator for the age of my files, so that
> high-entropy files (typically browser and cache data) rarely get
> synchronised, and that ultra-low entropy stuff could be moved to slower
> drives. In practice this would permit all my drives quickly to have the
> latest files I’m working on and then synchronise everything else later (if
> there is space).
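
Points #B and #C above can be sketched as queries on a change log. This is a
rough illustration only: the `changes` table, its columns, and the helper
names are assumptions, and in the setup described the rows would be written
by the inotify watcher rather than by hand. The first query selects only the
paths touched since the last synchronisation; the second counts updates per
path in a time interval as a simple stand-in for the "entropy" measure.

```python
import sqlite3


def open_change_log():
    """Create an in-memory change log (hypothetical schema)."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE changes (path TEXT, changed_at REAL)")
    # Index the timestamp so incremental queries avoid a full table scan.
    con.execute("CREATE INDEX idx_changes_time ON changes (changed_at)")
    return con


def record_change(con, path, changed_at):
    """Append one change event, as an inotify handler might do."""
    con.execute("INSERT INTO changes VALUES (?, ?)", (path, changed_at))


def changed_since(con, last_sync):
    """Paths modified after the last synchronisation (point #B)."""
    rows = con.execute(
        "SELECT DISTINCT path FROM changes WHERE changed_at > ? "
        "ORDER BY path", (last_sync,))
    return [row[0] for row in rows]


def update_counts(con, start, end):
    """Number of updates per path in [start, end) (point #C)."""
    rows = con.execute(
        "SELECT path, COUNT(*) FROM changes "
        "WHERE changed_at >= ? AND changed_at < ? GROUP BY path",
        (start, end))
    return dict(rows)
```

Paths with a high count in `update_counts` would be candidates for less
frequent synchronisation, and paths with no recent entries for slower drives.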
I don't think SQLite has been written for file synchronization. I don't know
what you really want to achieve, but 2-way file synchronization isn't done by
SQLite, and rsync is only a one-way file synchronizer.
Andreas Schneider GPG-ID: F33E3FC6
www.cryptomilk.org asn at cryptomilk.org