Telemetry Policy

Mon Aug 21 10:08:03 BST 2017

On Sunday, 20 August 2017 22:29:28 CEST Jaroslaw Staniek wrote:
> On 19 August 2017 at 11:39, Volker Krause <vkrause at kde.org> wrote:
> > On Friday, 18 August 2017 11:23:49 CEST Jaroslaw Staniek wrote:
> > > On 17 August 2017 at 16:19, Volker Krause <vkrause at kde.org> wrote:
> > > > On Wednesday, 16 August 2017 20:35:59 CEST Jaroslaw Staniek wrote:
> > > My assumption when started with telemetry was having adequate level of
> > > precision. Assuming no logs are fabricated as fake interesting questions
> > > are for example: how many users actually run supported software and how
> > > many run outdated one? Not how many executions per given period of time
> > > because it may be that old software is executed by a few users very
> > > frequently for some reason. e.g. because 3 years old software crashes on
> > > old OS every minute and restart was needed :)
> > > 
> > > How to know that without unique (anonymous) identification?
> > > Using extra fields such as OS+Desktop type/version would be indeed a
> > > form of cheap UID.
> > > But I would say disclosing OS+Desktop type/version for that discloses
> > > more than the anonymous random UID represents.
> > > In bugzilla and mailing list we're asking for all this information too
> > > anyway and (at least I) do not like supporting anonymous users since I
> > > am not anonymous.
> > 
> > The implementation in KUserFeedback addresses this by fixed interval data
> > submission. If you then aggregate the received data by the same interval,
> > you can see e.g. how ratios of application versions develop over time.
> > 
> > This does have limits of course, you can't distinguish between the same
> > person using the application every sampling interval, or two people using 
> > it every other interval for example. With a sufficiently long sampling 
> > interval the result should nevertheless be sufficiently accurate I think.
> 
> Volker, thanks for sharing this. I don't see how this is an approximation.
> Do you probe in given time intervals and/or measure time spent with the
> application? How do you handle time zones (e.g. zero usage of version X
> that is used only in the USA for some reason)?
> 
> KEXI sends the feedback data on startup only. I have no idea if this is
> compatible with any other approach but this helps to ignore different usage
> patterns, e.g. these two basic and typical to KEXI and many apps:
> 
> - user starts the app and keeps it open for half of the day
> - user frequently starts the app multiple times (for any reason) and has
> multiple instances open
> 
> If I remember correctly we're not measuring how long the app is used, this
> can be perceived as quite private information, by the way. Interesting data
> but so far not collected.
> 
> Moreover based on my specific experience giving up the IDs softens the data
> any more complex than app version: Alice can use module M of the app
> primarily and Bob can use module N mostly. Without IDs we have a set of
> mixed probes that include usage of both modules in no particular order
> (maybe per locale or timezone or other factor but this is not worth
> guessing IMHO). We don't even know if there are module-based preferences
> among the users.

Let's look at a concrete example:

{
    "applicationVersion": {  "value": "2.8.50"  },
    "compiler": { "type": "GCC",  "version": "7.1" },
    "opengl": {
        "glslVersion": "1.30",
        "renderer": "Haswell Mobile ",
        "type": "GL",
        "vendor": "Intel",
        "vendorVersion": "Mesa 17.1.4",
        "version": "3.0"
    },
    "platform": {
        "os": "linux",
        "version": "opensuse-tumbleweed"
    },
    "qtVersion": { "value": "5.9.2" },
    "startCount": { "value": 34 },
    "toolRatio": {
        "objectinspector": { "property": 0.7619047619047619 },
        "quickinspector": { "property": 0.23809523809523808 }
    },
    "usageTime": { "value": 12113  }
}

This is what a local GammaRay instance on this machine would currently send
once a week if I enable the maximum telemetry level. "Once a week" is the
approximation I was referring to, as that of course assumes the application is
actually running. If it isn't, it sends at the next opportunity after that.
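
To illustrate the "once a week, or at the next opportunity" rule, here is a
minimal Python sketch. It is not KUserFeedback's actual code: the function
name, the interval constant and the idea of checking while the application
is running are assumptions made for illustration only.

```python
from datetime import datetime, timedelta

# Assumed interval, matching the "once a week" in the text.
SUBMISSION_INTERVAL = timedelta(days=7)


def is_submission_due(last_submission, now, interval=SUBMISSION_INTERVAL):
    """Return True once at least one full interval has passed.

    The check can only run while the application is running, so an
    installation that was not started for three weeks submits once at its
    next start, not three times.
    """
    return now - last_submission >= interval
```

Since at most one sample is produced per interval, a long pause delays a
submission but never produces extra ones.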

However, this means that when looking at all samples received in one week, I 
can be reasonably sure to only have included each installation at most once.
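
As a rough sketch of that per-week aggregation (again not the actual
analytics tool): the code below assumes each sample is stored together with
the date it was received, which is not part of the JSON above, and groups
samples by ISO calendar week. The function name and the input layout are
hypothetical.

```python
from collections import Counter, defaultdict
from datetime import date


def version_ratios_by_week(samples):
    """Compute the share of each application version per ISO week.

    samples: iterable of (received_date, sample_dict) pairs, where
    sample_dict has the layout of the JSON example. The received date
    is an assumed piece of server-side bookkeeping.
    Returns {(iso_year, iso_week): {version: ratio}}.
    """
    counts = defaultdict(Counter)
    for received, sample in samples:
        iso = received.isocalendar()
        week = (iso[0], iso[1])
        counts[week][sample["applicationVersion"]["value"]] += 1

    ratios = {}
    for week, versions in counts.items():
        total = sum(versions.values())
        ratios[week] = {v: n / total for v, n in versions.items()}
    return ratios
```

With a weekly submission interval, each installation contributes at most
one sample per bucket, so the ratios approximate shares of installations
rather than shares of application starts.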

The data includes the number of application starts, so you can distinguish
between users with frequent short sessions and users who keep it running all
the time, as you mentioned. Neither usage pattern changes the general
statistics though, as both submit the same amount of data.

The raw data stays available, aggregation only happens in the analytics tool. 
So you can of course still correlate various values (feature usage depending 
on OS or locale, for example). This also means you can see if features A and B 
are used in equal parts by all users, or half the users use primarily A and 
the other half primarily B.
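
One way to tell those two cases apart from the raw samples is to look at the
dominant tool per sample instead of the overall average. This is only a
sketch: the function name and the 0.75 threshold are arbitrary assumptions,
and the only thing taken from the example above is the "toolRatio" layout.

```python
from collections import Counter


def count_dominant_tools(samples, threshold=0.75):
    """Count samples by their dominant tool.

    A sample counts towards a tool if that tool's ratio is at least
    `threshold` (an arbitrary cut-off chosen for this sketch); otherwise
    it is counted as "mixed". Samples without tool data are skipped.
    """
    counts = Counter()
    for sample in samples:
        ratios = {
            tool: entry["property"]
            for tool, entry in sample.get("toolRatio", {}).items()
        }
        if not ratios:
            continue
        tool, share = max(ratios.items(), key=lambda item: item[1])
        counts[tool if share >= threshold else "mixed"] += 1
    return counts
```

If everybody uses both tools in equal parts, nearly all samples end up as
"mixed". If the user base is split, the counts concentrate on the individual
tools, even though the overall average ratio is the same in both cases.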

GammaRay does track the usage time, so we can properly weight the feature 
usage ratios. This is however up to the individual application, the framework 
doesn't force a specific set of values to collect.
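
A minimal sketch of such a weighting, assuming the "usageTime" and
"toolRatio" fields from the example above. The function name is made up for
illustration, and this is not necessarily how the analytics tool computes
it.

```python
from collections import defaultdict


def weighted_tool_ratios(samples):
    """Average the per-sample tool ratios, weighted by usage time.

    Each sample's ratios are multiplied by its "usageTime" value, so an
    installation used for many hours counts for more than one that was
    only started briefly. Returns {tool: weighted_ratio}, or an empty
    dict if no usage time was recorded at all.
    """
    weighted = defaultdict(float)
    total_time = 0
    for sample in samples:
        usage_time = sample["usageTime"]["value"]
        total_time += usage_time
        for tool, entry in sample["toolRatio"].items():
            weighted[tool] += entry["property"] * usage_time
    if total_time == 0:
        return {}
    return {tool: value / total_time for tool, value in weighted.items()}
```

Without the weighting, a sample from a five-minute session would influence
the overall feature ratio as much as one from a full working week.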

Plotting this over time allows us to see trends (e.g. how quickly new versions
are actually deployed). What we can't do however is see how individual users
develop over time, e.g. whether users of feature A are likely to start using
feature B after a few months. At least not based on the above data. If we want
to find such correlations, this needs a custom data source that does the
necessary data collection locally and submits the result. Theoretically
possible, but obviously a lot more effort, and it involves a very high latency
until we get results.

> I know you're well aware of all that given how long you spent to work on
> the topic. I am not pushing for obligation of all app projects to offer IDs
> (and especially with opt-out) but disallowing it in some manifest would
> bring negative results and alienate someone (also stays away from *GPL as
> stated above). So realism is needed here.

Just to be clear, I didn't write this policy to forbid Kexi what it's doing :) 
I wrote it because at Akademy there was consensus we should have such a policy 
before deploying KUserFeedback.

> I've not heard privacy concerns from KEXI's user base but heard concerns
> about us not knowing the user patterns enough. YMMV.
> 
> The key for me is to know users' expectations, so here I would learn what's
> their perception on privacy too. No generalization. Otherwise there are
> comic situations such as when I encounter a post from someone who
> generalizes and calls to take a very strict privacy policy in general and
> make it a KDE's differentiator, BUT the post is all signed "Sent from
> iPhone". Or freedom warriors that happen to use Facebook. Evangelists would
> better start from themselves and offer consulting to projects they know
> from the inside.

Yep, the current KMail user survey therefore includes questions about how 
users feel about telemetry.

[...]
> I am also against ideas to openly share raw telemetry data, I've heard
> about them in this or sibling threads for the first time. All telemetry I
> worked on was based on the trust for given organization and only the
> organization processes the raw data being very careful what results are
> published.

Right, publishing the raw data is mutually exclusive with including unique
identification in my view. And even then there seem to be more people against
it than for it so far.

> PS2: Trivial, if there is any voting planned (?) it's important how do we
> ask. It's already hard enough that mostly the old generation votes...

How we ultimately decide the remaining contentious issues is a good question
indeed. So far it seems to me we can get general agreement on the vast
majority of issues; the only remaining questions where there seems to be
disagreement are data publishing and allowing optional unique identification.
Other open questions like licensing, retention limits and revocation support
mainly follow from deciding the former issues, I think.

> PS3: Organizations that support IDs can have two nice things: offer the
> users ability to review and remove telemetry data upon request. Hard to do
> that without IDs, right?

We can do that without unique identification too I think, see my reply to 
Martin in this thread on a possible approach.

Regards,
Volker