[Owncloud] Who to give input to for desktop and on-/off-line synchronisation of files?

Bjorn Madsen bjorn.madsen at operationsresearchgroup.com
Wed Feb 22 18:36:33 UTC 2012


Hi Andreas,
First, my apologies for writing in the jargon of my community of
practice.

There is a lot more to it. Your csync example compares apples with
bananas (cached vs. on-disk files) and hence does not represent the
problem, as I am not writing about matching/synchronising two plain
directories.

I am rather presenting a solution to the *sad fact* that cloud services
with synchronisation do not acknowledge that not all devices have the
same amount of disk space, and that some logic is required to manage
this constraint.

Here is an example:
- My mobile phone has a few GB of space, compared to
- my laptop, which has 20x more, compared to
- my desktop, which has 200x more, compared to
- my server, which tops out at 30 TB.

In particular I refer to the incremental, statistical nature of file
usage: files used infrequently may be moved to another location in a
meaningful manner, whilst still being economical about the storage
media used for that purpose. This requires some logic which guarantees
that I have the newest version of a file on any device which has
accessed that file before, whereas if I have not accessed a file
earlier, it is appropriate to store it remotely.

In plain English: I could assign the desktop and the server to be
complete repositories for all files, and the laptop and mobile to be
partial repositories which always hold the newest files - and are
permitted to overwrite less frequently used files if they are about to
run out of their assigned disk space. Hereby I save space and bandwidth
using the entropy recorded in the SQLite database.
An example: I have never opened a given movie (1.2 GB) on my mobile,
and hence that file should not be synchronised to it. However, my
txt notepad has been used 150k times, so that is the first thing that
gets synchronised.

SQLite is not used for sync. It is used for the logic.
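As a sketch of that logic (the schema, file names and numbers are illustrative assumptions, not ownCloud's actual tables): keep a per-device access count in SQLite, sync only files that have been opened on this device, most-used first, and stop at the device's quota.

```python
import sqlite3

def pick_files_to_sync(conn, quota_bytes):
    """Return paths to keep locally: most-used first, within the quota.

    Files never opened on this device (access_count = 0) stay remote.
    """
    rows = conn.execute(
        "SELECT path, size FROM files "
        "WHERE access_count > 0 "
        "ORDER BY access_count DESC")
    chosen, used = [], 0
    for path, size in rows:
        if used + size > quota_bytes:
            continue  # too big for the remaining quota; leave it remote
        chosen.append(path)
        used += size
    return chosen

# Illustrative data: a heavily used notepad, an unopened movie, a report.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (path TEXT, size INTEGER, access_count INTEGER)")
conn.executemany("INSERT INTO files VALUES (?, ?, ?)", [
    ("notes.txt", 10_000, 150_000),
    ("movie.mkv", 1_200_000_000, 0),
    ("report.pdf", 5_000_000, 12),
])
print(pick_files_to_sync(conn, 50_000_000))  # ['notes.txt', 'report.pdf']
```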

Is it clearer now?

/B

On 22 February 2012 17:17, Andreas Schneider <asn at cryptomilk.org> wrote:

> On Wednesday 22 February 2012 15:27:46 Bjorn Madsen wrote:
> > *Thanks Klaas,*
>
> Hi,
>
> > As computational complexity is my favourite subject, I started
> > thinking about how I would perform a full replication the day our
> > data repositories pass 2-4 TB, which is just around the corner...
> >
> > On my private Ubuntu 10.04 I ran $ ls -aR | wc -l and found 126037
> > files, equal to 1.45 TB.
> > I copied the full filetree from my two machines (just the paths and
> > the md5sum) - let's call them A & B - into SQLite (approx. 9.3 MB
> > each). In addition I set up inotify to write any changes to the
> > SQLite database on both machines. Nothing more.
> >
> > Now my experiment was to run the whole filetree in SQLite, perform three
> > join operations:
> >
> >    1. What is on machine A but not on machine B, as a LEFT excluding
> >       join, i.e. SELECT <select_list> FROM Table_A A LEFT JOIN
> >       Table_B B ON A.Key = B.Key WHERE B.Key IS NULL
> >    2. What is on machine B but not on machine A, as a RIGHT excluding
> >       join, i.e. SELECT <select_list> FROM Table_A A RIGHT JOIN
> >       Table_B B ON A.Key = B.Key WHERE A.Key IS NULL
> >    3. What is on A and B, as an inner join, i.e. SELECT <select_list>
> >       FROM Table_A A INNER JOIN Table_B B ON A.Key = B.Key
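The three joins above can be run against SQLite roughly as follows (table contents are illustrative). One caveat: SQLite versions of that era reject RIGHT JOIN, so query #2 has to be emulated by swapping the tables in a LEFT JOIN:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table_A (Key TEXT PRIMARY KEY);
    CREATE TABLE Table_B (Key TEXT PRIMARY KEY);
    INSERT INTO Table_A VALUES ('common'), ('only_a');
    INSERT INTO Table_B VALUES ('common'), ('only_b');
""")

# 1. On A but not B: left excluding join.
only_on_a = conn.execute(
    "SELECT A.Key FROM Table_A A LEFT JOIN Table_B B "
    "ON A.Key = B.Key WHERE B.Key IS NULL").fetchall()

# 2. On B but not A: RIGHT JOIN emulated by swapping the tables.
only_on_b = conn.execute(
    "SELECT B.Key FROM Table_B B LEFT JOIN Table_A A "
    "ON B.Key = A.Key WHERE A.Key IS NULL").fetchall()

# 3. On both: inner join.
on_both = conn.execute(
    "SELECT A.Key FROM Table_A A INNER JOIN Table_B B "
    "ON A.Key = B.Key").fetchall()

print(only_on_a, only_on_b, on_both)
# [('only_a',)] [('only_b',)] [('common',)]
```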
> >
> > With this operation I produce the lists #1 and #2, which I intend to
> > feed to rsync to send over ssh between the two machines (pull not
> > push), and doing so would be a piece of cake. I use rsync's
> > --delete-after option, as the time during which a file is unavailable
> > is unnoticeable.
> >
> > However the first operation of this kind takes some time (17m4.096s on my
> > Intel Atom) and as our databases grow bigger, exponentially longer. In
> > addition the memory footprint also doesn't get prettier.
>
> Did you expect something else?
>
>
> If I run csync on my home directory with cold caches, it needs 140.91
> seconds walking 836549 files, and about 200 MB of memory to store
> information about the files.
>
> Comparing side A with side B takes less than a second.
>
> [stderr] 20120222 16:37:00.709 DEBUG    csync.api- Reconciliation for local
> replica took 0.52 seconds visiting 836549 files.
> [stderr] 20120222 16:37:01.203 DEBUG    csync.api- Reconciliation for remote
> replica took 0.49 seconds visiting 836549 files.
>
> So 17min vs 0.52 sec ;)
>
> >
> > Now as all new files are written to the SQLite database, I can
> > timestamp the operation and only use incremental operations (at
> > least until I have performed 10^14 file operations, where it would
> > be appropriate to recreate the database).
> >
> > This means I have a few powerful operations available:
>
> 17min doesn't sound powerful ... :)
>
> > #A I can segment the join operations to match the available memory
> > footprint, using SELECT with an appropriate interval which reflects
> > the allocated memory footprint.
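A minimal sketch of such segmenting, assuming the key column is indexed (table and column names are illustrative): page through the anti-join by key, so only one batch is ever held in memory.

```python
import sqlite3

def missing_on_b(conn, batch_size=1000):
    """Yield keys present in Table_A but not Table_B, one batch at a time."""
    last_key = ""  # assumes keys are non-empty strings
    while True:
        batch = conn.execute(
            "SELECT A.Key FROM Table_A A LEFT JOIN Table_B B "
            "ON A.Key = B.Key "
            "WHERE B.Key IS NULL AND A.Key > ? "
            "ORDER BY A.Key LIMIT ?", (last_key, batch_size)).fetchall()
        if not batch:
            return
        for (key,) in batch:
            yield key
        last_key = batch[-1][0]  # resume after the last key seen

# Illustrative data: ten files on A, only the even-numbered ones on B.
conn = sqlite3.connect(":memory:")
conn.executescript(
    "CREATE TABLE Table_A (Key TEXT PRIMARY KEY);"
    "CREATE TABLE Table_B (Key TEXT PRIMARY KEY);")
conn.executemany("INSERT INTO Table_A VALUES (?)",
                 [("k%02d" % i,) for i in range(10)])
conn.executemany("INSERT INTO Table_B VALUES (?)",
                 [("k%02d" % i,) for i in range(0, 10, 2)])
print(list(missing_on_b(conn, batch_size=3)))
# ['k01', 'k03', 'k05', 'k07', 'k09']
```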
> >
> > #B More interestingly, I can use change propagation in the database
> > operation to avoid redundant operations, by only selecting files
> > updated since the last operation, or since the last "nirvana" when
> > the synchronisation daemon reinitiates its check. In my case the
> > usage of change propagation brings my memory footprint down to a few
> > kB (which I could run even on my old phone).
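The incremental selection in #B could be sketched like this, assuming a hypothetical changes table fed by inotify; only rows newer than the last successful sync are ever read, so the working set stays tiny.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE changes (path TEXT, mtime REAL)")

def changed_since(conn, last_sync):
    """Return paths touched after last_sync; nothing older is ever loaded."""
    return [p for (p,) in conn.execute(
        "SELECT path FROM changes WHERE mtime > ? ORDER BY mtime",
        (last_sync,))]

conn.executemany("INSERT INTO changes VALUES (?, ?)", [
    ("old.txt", 100.0),   # changed before the last sync
    ("new.txt", 200.0),   # changed after it
])
print(changed_since(conn, last_sync=150.0))  # ['new.txt']
```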
> >
> > #C I can measure the entropy (frequency of update, based on list #3)
> > per time interval as an indicator of the age of my files, so that
> > high-entropy files (typical browser and cache stuff) rarely get
> > synchronised, and ultra-low-entropy stuff can be moved to slower
> > drives. In practice this would permit all my drives to quickly have
> > the latest files I'm working on and then synchronise everything else
> > later (if there is space).
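One way to read #C as code (the thresholds, table and file names are illustrative assumptions): count updates per file over a time window and bucket files into rarely-synchronised, normal-sync, and archive-to-slow-storage classes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE changes (path TEXT, mtime REAL)")

def classify(conn, window_start, hot=100, cold=1):
    """Bucket files by how often they changed inside the window."""
    buckets = {"skip": [], "sync": [], "archive": []}
    for path, n in conn.execute(
            "SELECT path, COUNT(*) FROM changes "
            "WHERE mtime >= ? GROUP BY path", (window_start,)):
        if n >= hot:
            buckets["skip"].append(path)     # churny cache-like files
        elif n <= cold:
            buckets["archive"].append(path)  # nearly static: slow storage
        else:
            buckets["sync"].append(path)     # actively used: sync first
    return buckets

# Illustrative change log: a cache file updated 150 times, notes 10, photo 1.
rows = ([("cache.db", 1.0)] * 150 + [("notes.txt", 1.0)] * 10
        + [("photo.jpg", 1.0)])
conn.executemany("INSERT INTO changes VALUES (?, ?)", rows)
print(classify(conn, window_start=0.0))
```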
>
> I don't think sqlite has been written for file synchronization. I
> don't know what you really want to achieve, but 2-way file
> synchronization isn't done by sqlite, and rsync is only a one-way file
> synchronizer.
>
>
>        -- andreas
>
> --
> Andreas Schneider                   GPG-ID: F33E3FC6
> www.cryptomilk.org                asn at cryptomilk.org
>
> _______________________________________________
> Owncloud mailing list
> Owncloud at kde.org
> https://mail.kde.org/mailman/listinfo/owncloud
>



-- 
Bjorn Madsen
*Researcher Complex Systems Research*
Ph.: (+44) 0 7792 030 720 Ph.2: (+44) 0 1767 220 828
bjorn.madsen at operationsresearchgroup.com