Telemetry Policy
Thomas Pfeiffer
thomas.pfeiffer at kde.org
Mon Aug 14 21:26:36 BST 2017
On Sunday, 13 August 2017 11:47:28 CEST Volker Krause wrote:
> ## Minimalism
>
> We only track the bare minimum of data necessary to answer specific
> questions; we do not collect data preemptively or for exploratory research.
> In particular, this means:
> - collected data must have a clear purpose
While from a privacy perspective this certainly makes sense, with my user
researcher hat on I'm worried that this might severely limit the usefulness of
the whole operation, at least if changes to what is being tracked can only be
made with each new major release of an application.
Psychologists usually collect more information in their studies than they
strictly need to test their hypotheses. We don't do that because we want to
hoard data or sell it or anything like that.
No, we collect it because in reality, things are hardly ever as clear-cut as
we had hypothesized. Our hypotheses are often based on correlations between
two variables, but more often than not there is some other variable, one we
had not thought of beforehand, that affects one or both of the variables
we're interested in and thereby distorts the data.
Now if we only collected the data we had a priori hypotheses about, that
would mean that after every study, we'd have to go back to the drawing board
and define which variables to collect next time. This would make research both
slow and very expensive. By collecting additional data, however, we have the
chance to run additional exploratory tests after the fact and uncover new
hypotheses that we can then test in the next study.
In the case of KUserFeedback, cost is fortunately not really an issue, because
we don't pay our users for providing the data. Time, on the other hand, _is_
an issue. If we only collect data for which a hypothesis already exists, the
process looks like this:
T0: On the day of a KDE Applications release, I have a hypothesis about a
causal link between two variables regarding the usage of KAlgebra.
T+1 day: I use my considerable charm to coerce Aleix into implementing
triggers for collecting data about these two variables (see the sketch after
the timeline for roughly what such a collection hook looks like).
T+4 months: The next release ships these collection triggers, data comes in.
T+5 months: After one month's worth of data has been collected, I analyze it.
The numbers look weird; something is off. Damn, it seems some other variable
is in play there. I have a few candidates in mind, some more likely to be the
culprit than others.
T+6 months: I convince Aleix to implement triggers for all the candidates.
He's reluctant because that seems to go against the minimalism rule, but I
persuade him that I'm really unsure and don't want to risk another release
cycle only to find out we had tested the wrong variables.
T+8 months: The release with the new variables is out.
T+9 months: After a month's worth of data, I run my analysis again. Eureka!
I've finally found my causal link!
T+10 months: We come up with an improvement to KAlgebra based on the link
we've found, and it gets implemented.
T+12 months: A year after I formulated my first hypothesis, the fruits of the
whole endeavor get into users' hands.
And this scenario does not even account for the months it may take until our
software reaches the large share of users who are on "stable distros".
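(For concreteness: the "triggers" above map to KUserFeedback data sources.
Below is a minimal sketch of one, assuming the AbstractDataSource/Provider
API as I understand it; the class name, source id, and product identifier
are made up for illustration, and the exact base-class constructor signature
may differ between KUserFeedback versions.)

    #include <KUserFeedback/AbstractDataSource>
    #include <KUserFeedback/Provider>

    #include <QVariant>

    // Hypothetical source counting how often one KAlgebra feature is used.
    class FeatureUseSource : public KUserFeedback::AbstractDataSource
    {
    public:
        FeatureUseSource()
            : KUserFeedback::AbstractDataSource(QStringLiteral("featureUse"))
        {}

        QString description() const override
        {
            return QStringLiteral("How often feature X was used.");
        }

        // Called by the provider when assembling a telemetry submission.
        QVariant data() override
        {
            QVariantMap m;
            m.insert(QStringLiteral("count"), m_count);
            return m;
        }

        void featureUsed() { ++m_count; } // the actual "trigger"

    private:
        int m_count = 0;
    };

    // Registered once in the application, e.g.:
    //   auto provider = new KUserFeedback::Provider(&app);
    //   provider->setProductIdentifier(QStringLiteral("org.kde.kalgebra"));
    //   provider->addDataSource(new FeatureUseSource);

The point being: each new variable needs a change like this in the
application code, which is exactly why it is tied to the release cycle.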
So, long story short: while I agree that we should not just wildly collect
everything we can, only being able to start measuring variables in the first
release after a concrete hypothesis has been formulated about them could
really slow us down.
Is there any possible way to mitigate this issue?
Cheers,
Thomas