<div><b style="font-family:'Times New Roman';font-size:medium"><span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap">Thanks Klaas,</span></b></div><span id="internal-source-marker_0.7237568276468664" style="font-family:'Times New Roman';font-size:medium"><span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap">As computational complexity is my favourite subject, I started thinking about how I would perform a full replication on the day our data repositories pass 2-4 TB, which is just around the corner... </span><br>
<span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap"></span><br><span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap">On my private Ubuntu 10.04 machine I ran $ ls -aR | wc -l and counted 126037 entries (files plus directory lines), totalling 1.45 TB.</span><br>
<span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap">I copied the full filetree (just the paths and the md5sums) from my two machines, let's call them A and B, into SQLite (approx. 9.3 MB each). In addition I set up inotify to write any changes to the SQLite database on both machines. Nothing more.</span><br>
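<span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap">A minimal sketch of the indexing step, assuming a single table of relative paths and md5 sums. The names index_tree, md5_of and the table name files are made up for illustration and are not part of any existing tool; the inotify hook is left out.</span><br>

```python
import hashlib
import os
import sqlite3


def md5_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files never sit fully in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def index_tree(root, db_path, table="files"):
    """Store (relative path, md5) for every regular file under root.

    The table name is a hypothetical choice and is interpolated directly,
    so it must come from trusted code, never from user input.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} (path TEXT PRIMARY KEY, md5 TEXT)"
    )
    for folder, _dirs, names in os.walk(root):
        for name in names:
            full = os.path.join(folder, name)
            if not os.path.isfile(full):
                continue  # skip sockets, broken symlinks and the like
            rel = os.path.relpath(full, root)
            conn.execute(
                f"INSERT OR REPLACE INTO {table} (path, md5) VALUES (?, ?)",
                (rel, md5_of(full)),
            )
    conn.commit()
    return conn
```

<span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap">Storing paths relative to the root is a design assumption here: it lets the tables from A and B be joined on the path even when the two trees are mounted in different places.</span><br>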
<span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap"></span><br><span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap">Now my experiment was to load the whole filetree into SQLite and perform three join operations:</span><br>
<span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap"></span><ol style="font-weight:bold"><li style="list-style-type:decimal;font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline">
<span style="vertical-align:baseline;white-space:pre-wrap">What is on machine A but not on machine B, as a LEFT excluding join, i.e. SELECT &lt;select_list&gt; FROM Table_A A LEFT JOIN Table_B B ON A.Key = B.Key WHERE B.Key IS NULL</span></li>
<li style="list-style-type:decimal;font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline"><span style="vertical-align:baseline;white-space:pre-wrap">What is on machine B but not on machine A, as a RIGHT excluding join, i.e. SELECT &lt;select_list&gt; FROM Table_A A RIGHT JOIN Table_B B ON A.Key = B.Key WHERE A.Key IS NULL</span></li>
<li style="list-style-type:decimal;font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline"><span style="vertical-align:baseline;white-space:pre-wrap">What is on A and B, as an inner join, i.e. SELECT &lt;select_list&gt; FROM Table_A A INNER JOIN Table_B B ON A.Key = B.Key</span></li>
</ol><span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap"></span><span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap">With this operation I produce lists #1 and #2, which I intend to feed to rsync to transfer over ssh between the two machines (pull, not push); doing so would be a piece of cake. I use rsync's option to delete after the file transfer (--delete-after), so the time during which a file is unavailable is unnoticeable. </span><br>
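<span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap">The three joins can be sketched with Python's sqlite3 module as follows. The table names files_a and files_b are assumptions, the rows are the sample paths from the appendix, and the md5 values are placeholders. SQLite releases before 3.39.0 do not support RIGHT JOIN, so list #2 is written here as a LEFT JOIN with the tables swapped, which gives the same result.</span><br>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE files_a (path TEXT PRIMARY KEY, md5 TEXT);
    CREATE TABLE files_b (path TEXT PRIMARY KEY, md5 TEXT);
""")

# Sample paths from the appendix; the md5 column holds a placeholder.
a_paths = ["path1/file1", "path1/file2", "path1/file3",
           "path2/file1", "path2/file2", "path2/file3"]
b_paths = ["path1/file2", "path1/file3",
           "path2/file1", "path2/file2", "path2/file4"]
conn.executemany("INSERT INTO files_a VALUES (?, 'md5-placeholder')",
                 [(p,) for p in a_paths])
conn.executemany("INSERT INTO files_b VALUES (?, 'md5-placeholder')",
                 [(p,) for p in b_paths])

# List 1: on A but not on B (left excluding join).
ONLY_A = """SELECT a.path FROM files_a a
            LEFT JOIN files_b b ON a.path = b.path
            WHERE b.path IS NULL ORDER BY a.path"""
# List 2: on B but not on A (same join with the tables swapped).
ONLY_B = """SELECT b.path FROM files_b b
            LEFT JOIN files_a a ON a.path = b.path
            WHERE a.path IS NULL ORDER BY b.path"""
# List 3: on both machines (inner join).
BOTH = """SELECT a.path FROM files_a a
          INNER JOIN files_b b ON a.path = b.path ORDER BY a.path"""


def paths(sql):
    """Run a query and return its first column as a plain list."""
    return [row[0] for row in conn.execute(sql)]


only_a = paths(ONLY_A)
only_b = paths(ONLY_B)
on_both = paths(BOTH)
print(only_a, only_b, on_both)
```

<span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap">Lists #1 and #2 could then be written one path per line to a file and handed to rsync through its --files-from option.</span><br>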
<span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap"></span><br><span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap">However, the first operation of this kind takes some time (17m4.096s on my Intel Atom), and as our databases grow it will take disproportionately longer. The memory footprint doesn't get any prettier either.</span><br>
<span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap"></span><br><span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap">Now, as all new files are written to the SQLite database, I can timestamp the operation and only use incremental operations (at least until I have performed 10^14 file operations, at which point it would be appropriate to recreate the database). </span><br>
<span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap"></span><br><span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap">This means I have a few powerful operations available:</span><br>
<span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap">#A I can segment the join operations to fit the available memory by running each SELECT over an interval of rows sized to the memory I have allocated.</span><br>
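<span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap">One way to read #A, sketched under the assumption that the tables are called files_a and files_b, are ordinary rowid tables, and have positive rowids: the excluding join is run one rowid window at a time, so only chunk_rows rows of A are in play per query. The function name and the chunk size are hypothetical.</span><br>

```python
import sqlite3


def only_in_a_chunked(conn, chunk_rows=1000):
    """Yield paths that are in files_a but not in files_b, window by window."""
    (max_rowid,) = conn.execute(
        "SELECT COALESCE(MAX(rowid), 0) FROM files_a"
    ).fetchone()
    for start in range(1, max_rowid + 1, chunk_rows):
        end = start + chunk_rows - 1
        # Restrict the left side to one rowid interval, then exclude matches.
        rows = conn.execute(
            "SELECT a.path FROM files_a a "
            "LEFT JOIN files_b b ON a.path = b.path "
            "WHERE a.rowid BETWEEN ? AND ? AND b.path IS NULL "
            "ORDER BY a.rowid",
            (start, end),
        ).fetchall()
        for (path,) in rows:
            yield path
```

<span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap">The window size would be chosen from the memory budget; with an index on the join column of files_b each window stays cheap.</span><br>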
<span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap"></span><br><span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap">#B More interestingly, I can use change propagation in the database operation to avoid redundant operations, by selecting only the files updated since the last operation or the last "nirvana", when the synchronisation daemon reinitiates its check. In my case the use of change propagation brings my memory footprint down to a few kB (which I could run even on my old phone).</span><br>
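<span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap">A sketch of the incremental selection in #B, assuming the inotify listener appends one row per event to a table named changes with the columns path and changed_at (a Unix timestamp). Both the table layout and the function name are assumptions made for this example.</span><br>

```python
import sqlite3


def pending_changes(conn, last_sync):
    """Return (path, latest change time) for paths touched after last_sync."""
    rows = conn.execute(
        "SELECT path, MAX(changed_at) FROM changes "
        "WHERE changed_at > ? GROUP BY path ORDER BY path",
        (last_sync,),
    )
    return [(path, stamp) for path, stamp in rows]
```

<span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap">Only the rows newer than the last synchronisation are read, so the working set depends on the amount of recent activity rather than on the size of the whole filetree.</span><br>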
<span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap"></span><br><span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap">#C I can measure the entropy (frequency of update, based on list #3) per time interval as an indicator of the age of my files. This means that high-entropy files (typically browser and cache data) rarely get synchronised, and that ultra-low-entropy files can be moved to slower drives. In practice this would let all my drives quickly receive the latest files I’m working on and then synchronise everything else later (if there is space).</span><br>
<span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap"></span><br><span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap">I know that other sync-service providers stick to the user's /home/$user directory and the non-dot levels below it, so the high-speed/high-entropy issue is not a concern for them. The low-speed stuff, however, could matter: a user who is running low on disk space could park his low-entropy (archival) files on a remote RAID drive.</span><br>
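<span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap">To make #C concrete, here is a small sketch that treats "entropy" in the sense used above, i.e. updates per day within a time window, and sorts paths into three tiers. The thresholds, the tier names and both function names are hypothetical and would need tuning against real data.</span><br>

```python
from collections import Counter

SECONDS_PER_DAY = 86400.0


def update_rate(events, all_paths, window_start, window_end):
    """Updates per day for every known path within the window.

    events is an iterable of (path, unix_timestamp); paths without any
    event in the window get a rate of 0.0.
    """
    days = (window_end - window_start) / SECONDS_PER_DAY
    counts = Counter(
        path for path, stamp in events
        if stamp >= window_start and window_end > stamp
    )
    return {path: counts[path] / days for path in all_paths}


def classify(rates, hot_per_day=24.0, cold_per_day=0.01):
    """Map each path to 'defer' (high churn), 'archive' (idle) or 'sync'."""
    tiers = {}
    for path, rate in rates.items():
        if rate >= hot_per_day:
            tiers[path] = "defer"      # e.g. browser caches: sync rarely
        elif cold_per_day >= rate:
            tiers[path] = "archive"    # candidates for slower or remote drives
        else:
            tiers[path] = "sync"       # working files: sync first
    return tiers
```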
<span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap"></span><br><span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap">This means that I can:</span><br>
<span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap">- I can add a new machine at a computation cost equal to the run-time of #A,</span><br><span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap">- my files are updated at the computation run-time of #B,</span><br>
<span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap">- and I never run out of hard-drive space, because my low-entropy files are stored on the cloud server (at whomever/owncloud/...) #C</span><br>
<span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap"></span><br><span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap">In addition, the hardware requirements for the server park at the host are minimal, as all it has to maintain is a database with MAC addresses, ssh keys, filetrees and md5sums, plus a blob for the cases where the user chooses to push a file to your cloud (such as when sharing files).</span><br>
<span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap">(The file-sharing and associated authentication can - if needed - run elsewhere / on another set of servers).</span><br>
<span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap"></span><br><span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap">So... being a computational complexity guy and not a hardcore slash-dot programmer, I have to ask: does this make sense to you guys? Do you do something similar? </span><span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap">Do you actively avoid doing the diff on the servers and offload it to the local machine instead?</span><br>
<span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap"></span><br><span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap">Appendix: Lazy man's pseudo-code ...</span><br>
<span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap">First we recursively read the file path trees of the two machines (A and B) into two SQLite tables as:</span><br>
<span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap">machine A. </span><br><span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap">path1/file1 + md5sum</span><br>
<span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap">path1/file2 + md5sum</span><br><span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap">path1/file3 + md5sum</span><br>
<span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap">path2/file1 + md5sum</span><br><span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap">path2/file2 + md5sum</span><br>
<span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap">path2/file3 + md5sum</span><br><span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap"></span><br>
<span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap">machine B. </span><br><span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap">path1/file2 + md5sum</span><br>
<span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap">path1/file3 + md5sum</span><br><span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap">path2/file1 + md5sum</span><br>
<span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap">path2/file2 + md5sum</span><br><span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap">path2/file4 + md5sum</span><br>
<span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap"></span><br><span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap">Next we use a join operation to map out the overlapping structure.</span><br>
<span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap"></span><br><span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap">Next we check the inotify log in SQLite for deletions and propagate the deletions to the wastebin, so that the user's undo option is maintained (the inotify operations read from and write to the SQLite database).</span></span><br>
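<span style="font-size:15px;font-family:Arial;font-weight:normal;vertical-align:baseline;white-space:pre-wrap">The deletion step could look roughly like this, assuming the deleted paths have already been read from the inotify log as paths relative to the synchronised root. The function name and the wastebin location are hypothetical; files are moved rather than unlinked so that an undo stays possible.</span><br>

```python
import os
import shutil


def move_to_wastebin(root, wastebin, deleted_paths):
    """Move each deleted path from root into wastebin, keeping its layout."""
    moved = []
    for rel in deleted_paths:
        source = os.path.join(root, rel)
        if not os.path.exists(source):
            continue  # already gone on this machine, nothing to undo
        target = os.path.join(wastebin, rel)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        shutil.move(source, target)
        moved.append(rel)
    return moved
```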
<br><div class="gmail_quote">On 22 February 2012 10:32, Klaas Freitag <span dir="ltr">&lt;<a href="mailto:freitag@owncloud.com">freitag@owncloud.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
On 22.02.2012 10:30, Bjorn Madsen wrote:<br>
hi Bjorn,<div class="im"><br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Who should one give input to for desktop and on-/off-line synchronisation<br>
of files?<br>
</blockquote></div>
We'll happily take your input on this list.<br>
<br>
Thanks,<br>
<br>
Klaas<br>
<br>
_______________________________________________<br>
Owncloud mailing list<br>
<a href="mailto:Owncloud@kde.org" target="_blank">Owncloud@kde.org</a><br>
<a href="https://mail.kde.org/mailman/listinfo/owncloud" target="_blank">https://mail.kde.org/mailman/listinfo/owncloud</a><br>
</blockquote></div><br><br clear="all"><div><br></div>-- <br><div>Bjorn Madsen</div><div><i>Researcher Complex Systems Research</i></div><div>Ph.: (+44) 0 7792 030 720 Ph.2: (+44) 0 1767 220 828</div><div><a href="mailto:bjorn.madsen@operationsresearchgroup.com" target="_blank">bjorn.madsen@operationsresearchgroup.com</a></div>
<div><br></div><br>