[okular] [Bug 436738] docdata duplicated each time pdf is edited

bugzilla_noreply at kde.org bugzilla_noreply at kde.org
Mon May 17 15:42:57 BST 2021


https://bugs.kde.org/show_bug.cgi?id=436738

--- Comment #15 from pbs3141 at googlemail.com ---
> My idea would be to store docdata (maybe including thumbnails) hashed by the file name/path/content, and encrypted with a hash of the file content, so they can only be read with read access to the document file (or a copy of it).

So you're suggesting encryption to address the privacy issue mentioned in
that thread. Would it not be simpler to make docdata readable only by its owner?
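For reference, the owner-only alternative is just a permissions change on the directory; a minimal sketch (the docdata path here is a stand-in, not Okular's actual location):

```python
import os
import stat
import tempfile

# Stand-in for Okular's docdata directory; the real path differs.
docdata = tempfile.mkdtemp(prefix="docdata-")

# Restrict the directory to its owner (rwx------), so other local
# users cannot list or read any stored viewing metadata.
os.chmod(docdata, stat.S_IRWXU)  # i.e. mode 0o700

assert os.stat(docdata).st_mode & 0o777 == 0o700
```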

> I fear that what you're suggesting would create too much I/O. Each time i open a PDF i have never opened before i would have to read all the filenames in the docdata folder in case some of them has a matching sha.
>
> Doesn't sound like it would work fine at scale.

No, I only suggested testing for the existence of docdata/$HASH and
docdata/$FULLPATH, which takes constant I/O. (I think David already said the
same thing in the next comment.)
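A sketch of what I mean, with illustrative key names (the real naming and hashing scheme would differ): opening a document probes exactly two paths, no matter how many entries docdata/ holds.

```python
import hashlib
import os

def docdata_keys(docdata_dir, pdf_path, chunk=4096):
    """Return the two candidate docdata entries for a document: one keyed
    by a partial content hash, one keyed by the full path. Probing both
    is constant I/O, independent of the size of the docdata directory."""
    h = hashlib.sha256(str(os.path.getsize(pdf_path)).encode())
    with open(pdf_path, "rb") as f:
        h.update(f.read(chunk))  # hash only a prefix, not the whole file
    by_hash = os.path.join(docdata_dir, h.hexdigest())
    by_path = os.path.join(docdata_dir, pdf_path.replace(os.sep, "%"))
    return by_hash, by_path

# Lookup is just two existence tests:
# hits = [p for p in docdata_keys(d, "doc.pdf") if os.path.exists(p)]
```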

The only potential source of large I/O in my suggestion was the amortised
deletion of stale files. The difficulty is in randomly selecting k files from a
directory containing N files, where say k ~ 5 and N ~ 5000. I think this can
still be done quickly, in O(k) file reads rather than O(N), by walking the
directory stream returned by opendir but only reading a random selection of k
files. But I'll need to benchmark / read up on disk formats to be sure.
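The selection step can be done with reservoir sampling over the directory stream: the walk still touches every directory entry, but reading names is cheap, and only the k sampled files are ever opened or statted. A sketch, assuming the actual staleness check happens elsewhere:

```python
import os
import random

def sample_docdata_entries(dirpath, k, rng=random):
    """Reservoir-sample k entry names from a directory in one pass.
    Reading names from the directory stream is cheap; the expensive
    per-file I/O (open/stat/parse) is then limited to the k samples."""
    sample = []
    with os.scandir(dirpath) as entries:
        for i, entry in enumerate(entries):
            if len(sample) < k:
                sample.append(entry.name)
            else:
                # keep each of the i+1 entries seen so far with prob k/(i+1)
                j = rng.randrange(i + 1)
                if j < k:
                    sample[j] = entry.name
    return sample
```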

> I don't like the idea of identifying the docdata exclusively by hash.

It's good enough for git! Surely it should be good enough here?

> Hashes by definition will have collisions, so will have filenames+filesize, but it's much easier to explain that two documents "share" their docdata because of that (and if the user actually has two files with the same filename and size and are not the same, she can rename one of the files) than the fact that if they share the hash of the first N bytes, which is something that no one "normal" can really understand and if even they understand they can't fix it.

I don't like hashing the whole file. Users may want to open some pretty large
PDFs; I've personally needed to view a slideshow PDF, full of large pictures,
that was over 1 GB. I shouldn't have to hash the whole lot just to view a
small part of it.

For PDF, a hash of the file size plus a couple of 4 kB chunks taken from
throughout the file would surely be good enough. For some formats I can imagine
users wanting to change small parts of the file in a way this can't detect, but
PDF isn't one of them.
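Something along these lines (the chunk count and placement are arbitrary choices for illustration, not a worked-out scheme):

```python
import hashlib
import os

def partial_hash(path, chunk=4096, n_chunks=3):
    """Hash the file size plus a few 4 kB chunks spread through the file,
    rather than the full content. For a 1 GB PDF this reads ~12 kB
    instead of the whole file. Chunk placement here is illustrative."""
    size = os.path.getsize(path)
    h = hashlib.sha256(str(size).encode())
    with open(path, "rb") as f:
        for i in range(n_chunks):
            # evenly spaced offsets: start, middle, near the end
            offset = size * i // n_chunks
            f.seek(min(offset, max(size - chunk, 0)))
            h.update(f.read(chunk))
    return h.hexdigest()
```

Including the size up front means the common in-place edits that grow or shrink the file already change the key, even before any chunk differs.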

> Ok, now I understand. pbs3141 and me suggested mostly the same, just that my suggestion does not use any filepaths, and so does not need to process docdata files to decide whether to delete them.

The drawback of an implementation that doesn't use filepaths is that if you
overwrite a PDF in place, you lose the viewing data unless the document is
currently open in Okular. (I encounter this problem frequently when using
LyX.)

-- 
You are receiving this mail because:
You are the assignee for the bug.


More information about the Okular-devel mailing list