Application usage statistics and targeted user surveys

Tue May 2 18:58:05 BST 2017

Thanks for the review!

On Tuesday, 2 May 2017 00:07:43 CEST Albert Astals Cid wrote:
> On Sunday, 23 April 2017 12:52:57 CEST, Volker Krause wrote:
> > Wanting this for GammaRay I attempted to implement a generic framework for
> > this, with the goal to make this fully transparent, and give the user full
> > control over what data is shared, and how often they want to participate
> > in
> > surveys, ie. make this solid enough on the privacy side that even I would
> > enable it myself. You'll find the code in Git (kde:kuserfeedback).
> 
> Why the weird values in StatisticsCollectionMode ?

Extensibility, so we can add more modes later if needed, while still keeping 
the order based on how much data is submitted.
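
As a rough illustration of that idea, here is a minimal sketch. The enum name, the enumerator names and the numeric values are assumptions made up for this example, not copied from the library. It only shows why gaps between the values help: a new mode can later be slotted in between two existing ones (e.g. at 0x28) without renumbering, and a plain comparison still reflects "shares at least this much".

```cpp
#include <cassert>

// Hypothetical enum for illustration only; names and values are assumptions.
// Values are spaced out so that new modes can be inserted later while the
// numeric order still reflects how much data is submitted.
enum CollectionMode {
    NoStatistics = 0x00,
    BasicSystemInformation = 0x10,
    BasicUsageStatistics = 0x20,
    DetailedSystemInformation = 0x30,
    DetailedUsageStatistics = 0x40
};

// A data source is active if the user allowed at least the mode it requires.
inline bool isSourceEnabled(CollectionMode userChoice, CollectionMode required)
{
    return userChoice >= required;
}

// Usage example: a user who picked "basic usage statistics".
inline void collectionModeExample()
{
    const CollectionMode userChoice = BasicUsageStatistics;
    assert(isSourceEnabled(userChoice, BasicSystemInformation));
    assert(!isSourceEnabled(userChoice, DetailedUsageStatistics));
}
```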

> Should submissionInterval and encouragementInterval also be a property in
> Provider?

I only added properties needed for a QML configuration user interface so far, 
but if someone wants to do the entire setup in QML it probably makes sense to 
expose the entire API indeed.

(What data you want to share (statisticsCollectionMode) and how often you want 
to be bothered by surveys (surveyInterval) are the only two values meant for 
user configuration, the rest is supposed to be configured by the application 
developer.)

> Also would be nice to specify the default values for submissionInterval,
> encouragementInterval, surveyInterval

Done.

> Do I gather correctly that as an app developer the only things I'm actually
> interested in are Provider and FeedbackConfigWidget/Dialog? Would be nice to
> have some docu saying so

Those are the main integration points, yes. You'll also need to add data 
sources for Provider to actually report telemetry though, either a built-in 
one, or implementing a custom one based on AbstractDataSource.

Added a high-level integration overview to Mainpage.dox.

> > Feature-wise it so far contains:
> > - a set of built-in data sources (app version, Qt version, platform,
> > application usage time, screen setup, etc) that applications can choose to
> > enable
> > - generic data sources for tracking the time ratio a Q_PROPERTY has a
> > specific value, allowing to track e.g. which application view is used how
> > much
> > - the ability to add custom/application-specific data sources
> > - reference widgets for customizing what data you want to share, and
> > showing exactly what that means, in human readable translated text and if
> > you insist also all the way down to the raw JSON sent to the server.
> > - survey targeting using simple C++/JS-like expressions that can access
> > all
> > the data sources (ie. you can target e.g. only users with high DPI multi-
> > screen setups)
> > - configurable encouragement of users to contribute (ie. after X starts
> > and/or Y hours of usage, repeated after Z months, suggest the user to
> > participate if they aren't already doing so).
> > - a management and analytic tool that allows you to manage products and
> > survey campaigns, and view recorded data using configurable aggregations
> > - the entire thing works without unique user ids. Fingerprinting can still
> > be an issue on too small user sets and/or when using too much detail in
> > the data.
> > - by default all of this is opt-in of course, although technically
> > the API doesn't prevent applications to change this
> > - it can deal with multiple products, each product can have different data
> > sources and survey campaigns
> 
> Haven't read much of the code yet, so I'll ask some stuff.
> 
> Is there a way for the user to see (locally) the data he has sent to the
> servers?

The default configuration dialog shows you a list of what would be sent at the 
time of looking at it, but there is no local logging of the submitted data at 
this point.

> Is there a way for the user to remove the data he has sent to the servers?
> Guess not since otherwise we would be able to do a 1:1 mapping

No. But it's not impossible to achieve I think, without giving up the "no 
unique user identification" requirement. The server could generate a unique 
random key for each submitted record and send that back to the client. The 
client would store these and if desired can request deletion for the 
corresponding records.
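
A minimal in-memory sketch of that scheme, just to make the idea concrete. None of this exists in the current code: the class name RecordStore, its methods and the 32-character hex key format are assumptions for illustration, and a real server would have to use a cryptographically secure random source and persistent storage.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <random>
#include <string>

// Hypothetical server-side store: every submitted record gets its own
// random key, and that key is the only handle that allows deleting it.
class RecordStore {
public:
    // Stores the record and returns the deletion key to hand back to the client.
    std::string submit(const std::string &json)
    {
        std::string key = makeKey();
        while (m_records.count(key) > 0) // retry in the unlikely case of a collision
            key = makeKey();
        m_records[key] = json;
        return key;
    }

    // Deletes the record belonging to the key; returns false for unknown keys.
    bool remove(const std::string &key) { return m_records.erase(key) > 0; }

    std::size_t size() const { return m_records.size(); }

private:
    // 32 random hex digits. std::random_device is used for brevity only.
    std::string makeKey()
    {
        static const char hex[] = "0123456789abcdef";
        std::uniform_int_distribution<int> dist(0, 15);
        std::string key;
        for (int i = 0; i < 32; ++i)
            key += hex[dist(m_rng)];
        return key;
    }

    std::random_device m_rng;
    std::map<std::string, std::string> m_records;
};

// Usage example: the client keeps the returned keys locally and can later
// ask for deletion of exactly those records.
inline void deletionExample()
{
    RecordStore server;
    const std::string key = server.submit("{\"screens\": 2}");
    assert(server.size() == 1);
    assert(server.remove(key));
    assert(server.size() == 0);
}
```

The keys are per record rather than per user, so the server never stores which records belong together. One caveat with this sketch: a client asking to delete several keys in a single request would reveal that link at that moment.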

Both good points. How important do you think they are for acceptance of this?

> Do we have some way in the server to protect us from people trying to inject
> "fake/wrong" data?

No. And that could indeed be a problem. We can do some sanity checking, but if 
someone insists on vandalizing this they can easily make it entirely useless 
by submitting tons of plausible/"valid" data. Blocking IP addresses/ranges on 
the web server level is possible, but rather crude and manual, and 
that's as far as my ideas on dealing with this go unfortunately.

> I see you protected the data on the server with a user/password.

It's protecting both read access on the data and write access on product 
configuration and survey campaigns, yes. It would probably make sense to 
separate those two interfaces, and thus also enable different access control 
for data analysis and product/campaign management.

> If the data is really anonymous do we really need user/password ?

Good point, I would also argue that for building trust in such a system the 
data must be public. However, there are two reasons that still made me protect 
it:
(1) if it's world-readable the fact that it is essentially world-writable (see 
above problem with submitting wrong data) makes this easily exploitable for 
spreading links to illegal content, same as e.g. our pastebin was abused.
(2) we have no operational experience with this and no existing data sets, so 
there is a residual risk of fingerprinting if we end up tracking too much.

What might work is to make parts of the data that are certainly not 
problematic (e.g. just numbers, no free strings) publicly available live, and 
have everything else go through human review first.
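
A small sketch of what such a split could look like. The Field/Record types and the function name are hypothetical, since the real submissions are JSON documents and nothing like this is implemented yet. Numeric fields go to the part that could be published live; anything containing free text is held back for review.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified record layout for illustration only.
struct Field {
    std::string name;
    bool isNumeric;   // true: 'number' is valid, false: 'text' is valid
    double number;
    std::string text;
};
typedef std::vector<Field> Record;

struct SplitRecord {
    Record publicPart; // numbers only, could be published right away
    Record reviewPart; // free strings, held back for human review
};

// Splits a record into the directly publishable and the to-be-reviewed part.
inline SplitRecord splitForPublication(const Record &record)
{
    SplitRecord result;
    for (std::size_t i = 0; i < record.size(); ++i) {
        if (record[i].isNumeric)
            result.publicPart.push_back(record[i]);
        else
            result.reviewPart.push_back(record[i]);
    }
    return result;
}

// Usage example with made-up field names.
inline void publicationExample()
{
    Record record;
    record.push_back(Field{"screenCount", true, 2.0, ""});
    record.push_back(Field{"styleName", false, 0.0, "some free text"});
    const SplitRecord split = splitForPublication(record);
    assert(split.publicPart.size() == 1);
    assert(split.reviewPart.size() == 1);
}
```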

> And if we actually do need user/password is there a way to restrict
> which data can a user see (i.e. configure that I can see Okular's data but
> not Krita's?).

Assuming this would be connected to identity.kde.org, I think it would be 
fine to give all people with commit access read access to the data too, or do 
you think we really need to control this per product?

I do see why we might want more control on the product/campaign management 
side, so I don't accidentally destroy Okular's data due to not knowing how to 
use the tool. It would be much easier if we don't need to restrict this per 
product though, but rather just to a group of people who know what they are 
doing.

Regards,
Volker

> Thanks for working on this :)
>
> Cheers,
>   Albert
> 
> > Technically, this consists of the following parts:
> > - a library that goes into the target application, backward compatible all
> > the way to Qt4.8/MSVC2010 (needed for my GammaRay use-case), depending
> > only
> > on QtGui
> > - a library with the reference widgets, also with extended backward
> > compatibility
> > - the server, written in PHP5 and supporting sqlite/mysql/postgresql. Not
> > the most fun technology, but that stuff is available almost anywhere, and
> > easy to deploy and maintain
> > - the management tool, recent Qt5/recent C++, using QtCharts for the data
> > analysis
> > - a command line tool for data import/export, useful for eg. automated
> > backups
> > 
> > All of this is LGPLv2+ licensed.
> > 
> > Feedback obviously very welcome, in particular around privacy concerns, or
> > reasons that would make you enable/disable such a feature.
> > 
> > Regards,
> > Volker