Telemetry Policy

Volker Krause vkrause at kde.org
Tue Aug 15 10:26:14 BST 2017


On Monday, 14 August 2017 22:26:36 CEST Thomas Pfeiffer wrote:
> On Sonntag, 13. August 2017 11:47:28 CEST Volker Krause wrote:
> > ## Minimalism
> > 
> > We only track the bare minimum of data necessary to answer specific
> > questions, we do not collect data preemptively or for exploratory
> > research.
> > In particular, this means:
> > - collected data must have a clear purpose
> 
> While from a privacy perspective this certainly makes sense, with my user
> researcher hat on I'm worried that this might severely limit the usefulness
> of the whole operation, at least if changes to what is being tracked can
> only be made with each new major release of an application.
> 
> Psychologists usually collect more information in their studies than they
> would strictly need to test their hypotheses. We don't do that because we
> just want to amass data, or to sell them, or anything like that.
> No, we collect them because in reality, things are hardly ever as clear-cut
> as we had hypothesized. Our hypotheses are often based on correlations
> between two variables, but in reality, more often than not there is some
> other variable which we had not thought of before that affects one or both
> of the variables we're interested in, and thereby distorts the data.
> 
> Now if we only collected the data that we had a-priori hypotheses about,
> that would mean that after every study, we'd have to go back to the drawing
> board and define which variables to collect next time. This would make
> research both slow and very expensive. By collecting additional data,
> however, we have the chance to run additional exploratory tests after the
> fact, and uncover new hypotheses that we can then test in the next study.
> 
> In the case of KUserFeedback, fortunately cost is not really an issue
> because we don't pay our users for providing the data. Time, on the other
> hand, _is_ an issue. If we only collect data about which a hypothesis
> already exists, that means the following:
> 
> T0: The day of a KDE Applications release, I have a hypothesis about a
> causal link between two variables regarding the usage of KAlgebra.
> 
> T+1 day: I use my incredibly charming skills to coerce Aleix into
> implementing triggers for collecting data about these two variables.
> 
> T+4 months: The next release ships these collection triggers, data comes in.
> 
> T+5 months: After one month's worth of data are collected, I analyze them.
> The numbers look weird; something is odd. Damn, seems like some other
> variable is in play there. I have a few candidates in mind, some more
> likely to be the culprit than others.
> 
> T+6 months: I convince Aleix to implement triggers for all the candidates.
> He's reluctant because that seems to go against the minimalism rule, but I
> convince him that I'm really unsure and don't want to risk another release
> cycle only to find out we had tested the wrong variables.
> 
> T+8 months: The release with the new variables is out.
> 
> T+9 months: After a month's worth of data, I run my analysis again. Eureka!
> I've finally found my causal link!
> 
> T+10 months: We come up with an improvement to KAlgebra based on the link
> we've found, and it gets implemented.
> 
> T+12 months: A year after I formulated my first hypothesis, the fruits of
> the whole endeavor get into users' hands.
> 
> And this scenario does not even take into account that it may take months
> until our software reaches the big chunk of users who are on "stable
> distros".
> 
> So, long story short: While I agree that we should not just wildly collect
> everything we can, being able to start measuring variables only on the next
> release after a concrete hypothesis has been formulated about them could
> really slow us down.
> 
> Is there any possible way to mitigate this issue?

The latency is indeed a very valid concern, and we can't even estimate it 
properly yet (deployment latency is one of the first things to measure with 
telemetry, IMHO). Expecting anything below several months is, I think, way 
too optimistic.

More aggressive preemptive tracking might avoid one cycle in your above 
example, but only if you actually manage to think about everything you will 
need in the end.

So, to get the complete picture: what data would you want to collect if the 
policy didn't restrict you to purpose-bound minimalism? Having a few 
examples would make it easier to tweak the balance here, I think.

Also note that if we published the raw data under a free license, 
exploratory research on it would still be possible, even if that wasn't the 
original purpose of the data collection.

Technically there are of course ways to address all this, for example data 
collection scripts provided by the server and executed by an application-side 
KUserFeedback runtime. That's actually how this started, based on Björn's 
initial wishlist, but I think it's clear why we didn't end up there :)

Regards,
Volker