[Owncloud] Who to give input to for desktop and on-/off-line synchronisation of files?

Bjorn Madsen bjorn.madsen at operationsresearchgroup.com
Wed Feb 22 15:27:46 UTC 2012


Thanks Klaas,
As computational complexity is my favourite subject, I started thinking about
how I would perform a full replication the day our data repositories pass
2-4 TB, which is just around the corner...

On my private Ubuntu 10.04 machine I ran $ ls -aR | wc -l and found 126037
files, totalling 1.45 TB.
I copied the full file tree from each of my two machines - let's call them
A & B - (just the paths and the md5sums) into SQLite (approx. 9.3 MB each).
In addition, I set up inotify to write any changes to the SQLite database on
both machines. Nothing more.
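
To give an idea, a minimal sketch of that scan (table and column names are
illustrative, not my actual schema):

# Sketch: walk a directory tree and store path, md5 and scan time in SQLite.
import hashlib
import os
import sqlite3
import time

def md5_of(path, chunk=1 << 20):
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(chunk), b''):
            h.update(block)
    return h.hexdigest()

def scan(root, db):
    con = sqlite3.connect(db)
    con.execute("CREATE TABLE IF NOT EXISTS files "
                "(path TEXT PRIMARY KEY, md5 TEXT, seen REAL)")
    now = time.time()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                con.execute("INSERT OR REPLACE INTO files VALUES (?, ?, ?)",
                            (os.path.relpath(full, root), md5_of(full), now))
            except (IOError, OSError):
                pass  # unreadable file - skip it
    con.commit()
    con.close()

# scan('/home/bjorn', 'machine_A.sqlite') on A, and likewise on B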

Now my experiment was to load the whole file tree into SQLite and perform
three join operations (a sketch follows the list):

   1. What is on machine A but not on machine B, as a LEFT excluding join,
   i.e. SELECT <select_list> FROM Table_A A LEFT JOIN Table_B B ON A.Key =
   B.Key WHERE B.Key IS NULL
   2. What is on machine B but not on machine A, as a RIGHT excluding join
   (which SQLite lacks, so in practice it is a LEFT JOIN with the tables
   swapped), i.e. SELECT <select_list> FROM Table_B B LEFT JOIN Table_A A
   ON A.Key = B.Key WHERE A.Key IS NULL
   3. What is on A and B, as an inner join, i.e. SELECT <select_list> FROM
   Table_A A INNER JOIN Table_B B ON A.Key = B.Key
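
For reference, the three lists could be produced roughly like this against
the two small SQLite files (treating the path as the join key, and assuming
B's ~9.3 MB database has first been pulled over to A) - a sketch, not my
actual code:

# Sketch: produce lists #1, #2 and #3 from the two file-tree databases.
import sqlite3

con = sqlite3.connect('machine_A.sqlite')
con.execute("ATTACH DATABASE 'machine_B.sqlite' AS b")

# 1. on A but not on B (left excluding join)
only_on_A = con.execute(
    "SELECT A.path FROM files A "
    "LEFT JOIN b.files B ON A.path = B.path "
    "WHERE B.path IS NULL").fetchall()

# 2. on B but not on A (the "right excluding join", written as a LEFT JOIN
#    with the tables swapped because SQLite has no RIGHT JOIN)
only_on_B = con.execute(
    "SELECT B.path FROM b.files B "
    "LEFT JOIN files A ON A.path = B.path "
    "WHERE A.path IS NULL").fetchall()

# 3. on both A and B (inner join)
on_both = con.execute(
    "SELECT A.path, A.md5, B.md5 FROM files A "
    "INNER JOIN b.files B ON A.path = B.path").fetchall()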

With these operations I produce lists #1 and #2, which I intend to feed to
rsync to send over ssh between the two machines (pull, not push), and doing
so would be a piece of cake. I use rsync's delete-after-transfer option
(--delete-after), as the time during which a file is unavailable is
unnoticeable.
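
Feeding list #2 to rsync could then look roughly like this (a sketch only:
host, user and paths are made up, and the exact flags would need tuning):

# Sketch: pull the files that exist on B but not on A, over ssh.
import subprocess

with open('only_on_B.txt', 'w') as f:
    for (path,) in only_on_B:           # list #2 from the joins above
        f.write(path + '\n')

subprocess.check_call([
    'rsync', '-az',
    '--files-from=only_on_B.txt',       # transfer only the files in list #2
    'bjorn@machine-b:/home/bjorn/',     # pull from B ...
    '/home/bjorn/',                     # ... into the same tree on A
])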

However, the first operation of this kind takes some time (17m4.096s on my
Intel Atom), and it will only take longer as our databases grow bigger. In
addition, the memory footprint doesn't get any prettier.

Now, as all new files are written to the SQLite database, I can timestamp
each operation and use only incremental operations (at least until I have
performed 10^14 file operations, at which point it would be appropriate to
recreate the database).
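
The bookkeeping for that is tiny: a timestamp column on the inotify event
log, plus a record of when the last successful sync finished (again a sketch
with made-up names):

# Sketch: inotify event log with timestamps, plus a sync checkpoint table.
import sqlite3
import time

con = sqlite3.connect('machine_A.sqlite')
con.execute("CREATE TABLE IF NOT EXISTS events "
            "(path TEXT, op TEXT, ts REAL)")  # op: 'create'/'modify'/'delete'
con.execute("CREATE TABLE IF NOT EXISTS checkpoints (finished REAL)")

def record_event(path, op):
    """Called by the inotify handler for every filesystem event."""
    con.execute("INSERT INTO events VALUES (?, ?, ?)", (path, op, time.time()))
    con.commit()

def mark_sync_done():
    """Called after a successful synchronisation run."""
    con.execute("INSERT INTO checkpoints VALUES (?)", (time.time(),))
    con.commit()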

This means I have a few powerful operations available:
#A I can segment the join operations to fit the available memory footprint,
by SELECTing an appropriate interval of keys at a time (a sketch follows).
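
A sketch of that segmentation, walking the key space one slice at a time so
that only a bounded number of rows is ever held in memory (the batch size is
arbitrary):

# Sketch: run the "only on A" query in key-range batches.
def only_on_A_batched(con, batch=10000):
    last = ''
    while True:
        rows = con.execute(
            "SELECT A.path FROM files A "
            "LEFT JOIN b.files B ON A.path = B.path "
            "WHERE B.path IS NULL AND A.path > ? "
            "ORDER BY A.path LIMIT ?", (last, batch)).fetchall()
        if not rows:
            break
        for (path,) in rows:
            yield path
        last = rows[-1][0]   # resume after the last path of this batch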

#B More interestingly, I can use change propagation in the database
operation to avoid redundant work, by only selecting files updated since the
last operation, or since the last "nirvana" when the synchronisation daemon
reinitiates its check. In my case, change propagation brings the memory
footprint down to a few kB (which I could run even on my old phone).
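
A sketch of that incremental selection, reusing the events and checkpoints
tables from above:

# Sketch: only consider paths that have seen an event since the last sync.
def changed_since_last_sync(con):
    (last,) = con.execute(
        "SELECT COALESCE(MAX(finished), 0) FROM checkpoints").fetchone()
    return [path for (path,) in con.execute(
        "SELECT DISTINCT path FROM events WHERE ts > ?", (last,))]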

#C I can measure the entropy (frequency of update, based on list #3) per
time interval as an indicator of the age of my files. This permits
high-entropy files (typically browser caches and the like) to be
synchronised only rarely, and ultra-low-entropy stuff to be moved to slower
drives. In practice this would let all my drives quickly have the latest
files I'm working on, and then synchronise everything else later (if there
is space).
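
The entropy measure can come straight out of the same event log, e.g.
updates per path over the last 24 hours (window and thresholds are
arbitrary):

# Sketch: update frequency per path as a crude "entropy" measure.
import time

def update_frequency(con, window=24 * 3600):
    since = time.time() - window
    return dict(con.execute(
        "SELECT path, COUNT(*) FROM events "
        "WHERE ts > ? AND op IN ('create', 'modify') "
        "GROUP BY path", (since,)))

# e.g. skip paths seen hundreds of times a day (caches) when synchronising,
# and flag paths not touched for a year as candidates for slower storage.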

I know that other sync-service providers stick to the user's /home/$user
directory and the non-dot levels below it, so the high-speed/high-entropy
issue is not a concern for them. The low-entropy stuff, however, could
matter to a user who is running low on disk space, as he could park his
low-entropy (archival) files on a remote RAIDed hard drive.

This means that I can:
- add a new machine at the cost of the run-time of #A,
- keep my files updated within the run-time of #B,
- and never run out of hard-drive space, because my low-entropy files are
stored on the cloud server (at whomever/owncloud/...) per #C.

In addition, the hardware requirements for the server park at the host are
minimal, as all it has to maintain is a database with MAC addresses, ssh
keys, file trees and md5sums, plus a blob store for the cases where the user
chooses to push files to your cloud (such as when sharing files).
(The file sharing and associated authentication can - if needed - run
elsewhere / on another set of servers.)
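
As a rough indication of how little that is, the server-side metadata could
be sketched as simply as this (purely illustrative, not a proposal for
ownCloud's actual schema):

# Sketch: minimal server-side metadata - one row per machine, one per file.
import sqlite3

srv = sqlite3.connect('server_metadata.sqlite')
srv.executescript("""
CREATE TABLE IF NOT EXISTS machines (
    mac     TEXT PRIMARY KEY,  -- machine identifier
    ssh_key TEXT               -- public key used for the rsync/ssh pulls
);
CREATE TABLE IF NOT EXISTS files (
    mac  TEXT,                 -- which machine reported the file
    path TEXT,
    md5  TEXT,
    PRIMARY KEY (mac, path)
);
""")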

So... being a computational complexity guy and not a hardcore slashdot
programmer, I have to ask: does this make sense to you guys? Do you do
something similar? Do you actively avoid doing the diff on the servers,
offloading it to the local machines instead?

Appendix: Lazy man's pseudo-code ...
First we read the file path trees of the two machines (A and B) recursively
into two SQLite tables:
machine A.
path1/file1 + md5sum
path1/file2 + md5sum
path1/file3 + md5sum
path2/file1 + md5sum
path2/file2 + md5sum
path2/file3 + md5sum

machine B.
path1/file2 + md5sum
path1/file3 + md5sum
path2/file1 + md5sum
path2/file2 + md5sum
path2/file4 + md5sum

Next we use a join operation to map out the overlapping structure.

Next we check the inotify log in SQLite for deletions and propagate them to
the wastebin, so the user's undo option is maintained (the inotify handler
reads from and writes to the SQLite database).
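
A sketch of that deletion step, assuming the other machine's deletion events
have been merged into the local events table (the wastebin location is made
up):

# Sketch: move deleted files to a wastebin instead of removing them outright,
# so the user keeps an undo option.
import os
import shutil

WASTEBIN = os.path.expanduser('~/.wastebin')   # made-up location

def propagate_deletions(con, root):
    (last,) = con.execute(
        "SELECT COALESCE(MAX(finished), 0) FROM checkpoints").fetchone()
    for (path,) in con.execute(
            "SELECT DISTINCT path FROM events WHERE op = 'delete' AND ts > ?",
            (last,)):
        target = os.path.join(root, path)
        if os.path.exists(target):
            dest = os.path.join(WASTEBIN, path)
            dest_dir = os.path.dirname(dest)
            if not os.path.isdir(dest_dir):
                os.makedirs(dest_dir)
            shutil.move(target, dest)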

On 22 February 2012 10:32, Klaas Freitag <freitag at owncloud.com> wrote:

> On 22.02.2012 10:30, Bjorn Madsen wrote:
> hi Bjorn,
>
>
>> Who should one give input to for desktop and on-/off-line synchronisation
>> of files?
>>
> We'll happily take your input on this list.
>
> Thanks,
>
> Klaas
>
> _______________________________________________
> Owncloud mailing list
> Owncloud at kde.org
> https://mail.kde.org/mailman/listinfo/owncloud
>



-- 
Bjorn Madsen
Researcher, Complex Systems Research
Ph.: (+44) 0 7792 030 720 Ph.2: (+44) 0 1767 220 828
bjorn.madsen at operationsresearchgroup.com