Telemetry Policy
Thomas Pfeiffer
thomas.pfeiffer at kde.org
Mon Aug 14 21:26:36 BST 2017
On Sunday, 13 August 2017 11:47:28 CEST Volker Krause wrote:
> ## Minimalism
>
> We only track the bare minimum of data necessary to answer specific
> questions; we do not collect data preemptively or for exploratory research.
> In particular, this means:
> - collected data must have a clear purpose
While from a privacy perspective this certainly makes sense, with my user
researcher hat on I'm worried that this might severely limit the usefulness of
the whole operation, at least if changes to what is being tracked can only be
made with each new major release of an application.
Psychologists usually collect more information in their studies than they
strictly need to test their hypotheses. We don't do that because we want to
hoard data or sell it or anything like that.
No, we collect it because in reality, things are hardly ever as clear-cut as
we had hypothesized. Our hypotheses are often based on correlations between
two variables, but more often than not there is some other variable, one we
had not thought of beforehand, that affects one or both of the variables
we're interested in and thereby distorts the data.
Now if we only collected the data we had a priori hypotheses about, that
would mean that after every study, we'd have to go back to the drawing board
and define which variables to collect next time. This would make research both
slow and very expensive. By collecting additional data, however, we have the
chance to run additional exploratory tests after the fact and uncover new
hypotheses that we can then test in the next study.
In the case of KUserFeedback, cost is fortunately not really an issue, because
we don't pay our users for providing the data. Time, on the other hand, _is_
an issue. If we only collect data for which a hypothesis already exists, the
process looks like this:
T0: On the day of a KDE Applications release, I have a hypothesis about a
causal link between two variables regarding the usage of KAlgebra.
T+1 day: I use my considerable charm to coerce Aleix into implementing
triggers for collecting data about these two variables (see the sketch after
the timeline for roughly what such a collection hook looks like).
T+4 months: The next release ships these collection triggers, data comes in.
T+5 months: After one month's worth of data has been collected, I analyze it.
The numbers look weird; something is off. Damn, it seems some other variable
is in play there. I have a few candidates in mind, some more likely to be the
culprit than others.
T+6 months: I convince Aleix to implement triggers for all the candidates.
He's reluctant because that seems to go against the minimalism rule, but I
persuade him that I'm really unsure and don't want to risk another release
cycle only to find out we had tested the wrong variables.
T+8 months: The release with the new variables is out.
T+9 months: After a month's worth of data, I run my analysis again. Eureka!
I've finally found my causal link!
T+10 months: We come up with an improvement to KAlgebra based on the link
we've found, and it gets implemented.
T+12 months: A year after I formulated my first hypothesis, the fruits of the
whole endeavor get into users' hands.
And this scenario does not even account for the months it may take until our
software reaches the large share of users who are on "stable distros".
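(For concreteness: the "triggers" above map to KUserFeedback data sources.
Below is a minimal sketch of one, assuming the AbstractDataSource/Provider
API as I understand it; the class name, source id, and product identifier
are made up for illustration, and the exact base-class constructor signature
may differ between KUserFeedback versions.)

    #include <KUserFeedback/AbstractDataSource>
    #include <KUserFeedback/Provider>

    #include <QVariant>

    // Hypothetical source counting how often one KAlgebra feature is used.
    class FeatureUseSource : public KUserFeedback::AbstractDataSource
    {
    public:
        FeatureUseSource()
            : KUserFeedback::AbstractDataSource(QStringLiteral("featureUse"))
        {}

        QString description() const override
        {
            return QStringLiteral("How often feature X was used.");
        }

        // Called by the provider when assembling a telemetry submission.
        QVariant data() override
        {
            QVariantMap m;
            m.insert(QStringLiteral("count"), m_count);
            return m;
        }

        void featureUsed() { ++m_count; } // the actual "trigger"

    private:
        int m_count = 0;
    };

    // Registered once in the application, e.g.:
    //   auto provider = new KUserFeedback::Provider(&app);
    //   provider->setProductIdentifier(QStringLiteral("org.kde.kalgebra"));
    //   provider->addDataSource(new FeatureUseSource);

The point being: each new variable needs a change like this in the
application code, which is exactly why it is tied to the release cycle.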
So, long story short: while I agree that we should not just wildly collect
everything we can, only being able to start measuring variables in the first
release after a concrete hypothesis has been formulated about them could
really slow us down.
Is there any possible way to mitigate this issue?
Cheers,
Thomas