Application usage statistics and targeted user surveys

Thu May 11 23:05:59 BST 2017

On Tuesday, 2 May 2017, at 19:58:05 CEST, Volker Krause wrote:
> Thanks for the review!
> 
> On Tuesday, 2 May 2017 00:07:43 CEST Albert Astals Cid wrote:
> > On Sunday, 23 April 2017, at 12:52:57 CEST, Volker Krause wrote:
> > 
> > > Wanting this for GammaRay I attempted to implement a generic framework
> > > for this, with the goal to make this fully transparent, and give the
> > > user full control over what data is shared, and how often they want to
> > > participate in surveys, i.e. make this solid enough on the privacy side
> > > that even I would enable it myself. You'll find the code in Git
> > > (kde:kuserfeedback).
> > 
> > Why the weird values in StatisticsCollectionMode ?
> 
> Extensibility, so we can add more modes later if needed, while still keeping
> the order based on how much data is submitted.
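
A minimal sketch of the idea of spacing out the enum values, so that a later mode can be slotted in between two existing ones while a plain numeric comparison still orders modes by how much data they submit. The enum name, the enumerator names, the numeric values and the helper `isAllowed` are all illustrative assumptions, not copied from kuserfeedback.

```cpp
#include <cstdint>

// Hypothetical mode enum: names and values are made up for illustration.
// The gaps of 0x10 leave room for future modes without renumbering.
enum class CollectionMode : std::uint8_t {
    NoStatistics    = 0x00,
    BasicSystemInfo = 0x10,
    BasicUsageStats = 0x20,
    DetailedSystem  = 0x30,
    DetailedUsage   = 0x40
};

// A data source is submitted only if the mode chosen by the user is at
// least as permissive as the mode that the data source requires.
constexpr bool isAllowed(CollectionMode userChoice, CollectionMode required)
{
    return static_cast<std::uint8_t>(userChoice)
        >= static_cast<std::uint8_t>(required);
}
```

With this layout a new mode could take, say, 0x18 and would sort between the two basic modes without touching any stored setting.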
> 
> > Should submissionInterval and encouragementInterval also be a property in
> > Provider?
> 
> I only added properties needed for a QML configuration user interface so
> far, but if someone wants to do the entire setup in QML it probably makes
> sense to expose the entire API indeed.

+1, I think we should start thinking more in terms of "which are the
qproperties that make sense to expose" instead of "which are the ones that
I actually need".

Though I guess adding new qproperties is ABI and API compatible, it's always
nice if someone who has other needs doesn't have to add the qproperty at a
later stage.

> 
> (What data you want to share (statisticsCollectionMode) and how often you
> want to be bothered by surveys (surveyInterval) are the only two values
> meant for user configuration, the rest is supposed to be configured by the
> application developer.)
> 
> > Also would be nice to specify the default values for submissionInterval,
> > encouragementInterval, surveyInterval
> 
> done
> 
> > Do I gather correctly that as an app developer the only things I'm
> > actually interested in are Provider and FeedbackConfigWidget/Dialog?
> > Would be nice to have some docu saying so
> 
> Those are the main integration points, yes. You'll also need to add data
> sources for Provider to actually report telemetry though, either a built-in
> one, or implementing a custom one based on AbstractDataSource.
> 
> Added a high-level integration overview to Mainpage.dox.

looks good :)

> 
> > > Feature-wise it so far contains:
> > > - a set of built-in data sources (app version, Qt version, platform,
> > > application usage time, screen setup, etc) that applications can choose
> > > to enable
> > > - generic data sources for tracking the time ratio a Q_PROPERTY has a
> > > specific value, allowing to track e.g. which application view is used
> > > how much
> > > - the ability to add custom/application-specific data sources
> > > - reference widgets for customizing what data you want to share, and
> > > showing exactly what that means, in human readable translated text and
> > > if you insist also all the way down to the raw JSON sent to the server.
> > > - survey targeting using simple C++/JS-like expressions that can access
> > > all the data sources (i.e. you can target e.g. only users with high DPI
> > > multi-screen setups)
> > > - configurable encouragement of users to contribute (i.e. after X starts
> > > and/or Y hours of usage, repeated after Z months, suggest the user to
> > > participate if they aren't already doing so).
> > > - a management and analytic tool that allows you to manage products and
> > > survey campaigns, and view recorded data using configurable aggregations
> > > - the entire thing works without unique user ids. Fingerprinting can
> > > still be an issue on too small user sets and/or when using too much
> > > detail in the data.
> > > - by default all of this is opt-in of course, although technically
> > > the API doesn't prevent applications from changing this
> > > - it can deal with multiple products, each product can have different
> > > data sources and survey campaigns
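
The encouragement rule from the feature list above (X starts and/or Y hours of usage, repeated after Z months) could be sketched roughly as follows. All struct and function names are hypothetical. The combination rule is also an assumption: here a negative threshold disables that condition and every enabled condition must be met, which may not match what the library actually does.

```cpp
// Hypothetical settings chosen by the application developer.
// A negative threshold disables that particular condition.
struct EncouragementConfig {
    int startsUntilEncouragement;   // "X" application starts
    double hoursUntilEncouragement; // "Y" hours of usage
    int repeatAfterDays;            // "Z", expressed in days here
};

// Hypothetical locally stored usage state.
struct UsageState {
    bool participating;      // user already shares data
    int startCount;          // application starts so far
    double usageHours;       // accumulated usage time
    int daysSinceLastPrompt; // negative: never prompted before
};

bool shouldEncourage(const EncouragementConfig &cfg, const UsageState &s)
{
    if (s.participating)
        return false; // nothing to encourage
    if (cfg.startsUntilEncouragement < 0 && cfg.hoursUntilEncouragement < 0)
        return false; // encouragement disabled entirely
    if (cfg.startsUntilEncouragement >= 0
        && s.startCount < cfg.startsUntilEncouragement)
        return false; // not enough starts yet
    if (cfg.hoursUntilEncouragement >= 0
        && s.usageHours < cfg.hoursUntilEncouragement)
        return false; // not enough usage time yet
    if (s.daysSinceLastPrompt < 0)
        return true;  // thresholds met and never prompted before
    // Prompted before: only repeat once the repeat interval has elapsed.
    return cfg.repeatAfterDays >= 0
        && s.daysSinceLastPrompt >= cfg.repeatAfterDays;
}
```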
> > 
> > Haven't read much of the code yet, so I'll ask some stuff.
> > 
> > Is there a way for the user to see (locally) the data he has sent to the
> > servers?
> 
> The default configuration dialog shows you a list of what would be sent at
> the time of looking at it, but there is no local logging of the submitted
> data at this point.

Ok, I guess this would be enough. I mean, the user has to trust us anyway,
since even if we showed a log it might not include all the data we sent.

> 
> > Is there a way for the user to remove the data he has sent to the servers?
> > Guess not since otherwise we would be able to do a 1:1 mapping
> 
> No. But it's not impossible to achieve I think, without giving up the "no
> unique user identification" requirement. The server could generate a unique
> random key for each submitted record and send that back to the client. The
> client would store these and if desired can request deletion for the
> corresponding records.

Right, sounds doable.
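
A rough sketch of the per-record deletion key idea described above, with an in-memory map standing in for the server database. `RecordStore` and `Client` are hypothetical names, and the record payload is treated as an opaque string.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical server side: every record is keyed by a random token, and
// nothing stored links two records that came from the same client.
class RecordStore {
public:
    // Stores a record and returns the token needed to delete it later.
    std::uint64_t submit(const std::string &json)
    {
        std::uint64_t token;
        do {
            token = m_dist(m_rng);
        } while (m_records.count(token) != 0); // retry on collision
        m_records.emplace(token, json);
        return token;
    }

    // Deletes the record for a token; returns false for unknown tokens.
    bool remove(std::uint64_t token) { return m_records.erase(token) == 1; }

    std::size_t size() const { return m_records.size(); }

private:
    // std::mt19937_64 keeps the sketch short. A real server would want a
    // cryptographically secure generator so that tokens cannot be guessed.
    std::mt19937_64 m_rng{std::random_device{}()};
    std::uniform_int_distribution<std::uint64_t> m_dist;
    std::unordered_map<std::uint64_t, std::string> m_records;
};

// Hypothetical client side: remembers the tokens it was handed back.
class Client {
public:
    void submit(RecordStore &server, const std::string &json)
    {
        m_tokens.push_back(server.submit(json));
    }

    // Requests deletion of every record this client has submitted.
    void deleteAll(RecordStore &server)
    {
        for (std::uint64_t token : m_tokens)
            server.remove(token);
        m_tokens.clear();
    }

private:
    std::vector<std::uint64_t> m_tokens;
};
```

One caveat with this sketch: a client that sends all its tokens in one go lets the server see, at deletion time, which records belonged together.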

> Both good points, how important do you think they are for acceptance of
> this?

Don't know, as I said, in both cases the user has to trust that what we're 
showing is true, since e.g. we could tell them "yes we've deleted the data" 
and not really do it.

So maybe these are nice-to-haves rather than mandatory for a first version?

> > Do we have some way in the server to protect us from people trying to
> > inject "fake/wrong" data?
> 
> No. And that could indeed be a problem. We can do some sanity checking, but
> if someone insists on vandalizing this you can easily make this entirely
> useless by submitting tons of plausible/"valid" data. You can block IP
> addresses/ranges at the web server level, but that is rather crude and
> manual, and that's as far as my ideas on dealing with this go,
> unfortunately.

I have a *very vague memory* of finding how Firefox did this, but can't find 
it right now :/

I've just asked on their IRC and will lurk there for a while to see if i get 
lucky.

> 
> > I see you protected the data on the server with a user/password.
> 
> It's protecting both read access on the data and write access on product
> configuration and survey campaigns, yes. It would probably make sense to
> separate those two interfaces, which would also enable different access
> control for data analysis and product/campaign management.

+1, I'd like at least a "read" and an "admin" privilege separation, if I
understand correctly that we plan to run this as a "KDE-wide service".

> 
> > If the data is really anonymous do we really need user/password ?
> 
> Good point, I would also argue that for building trust in such a system the
> data must be public. However, there are two reasons that still made me
> protect it:
> (1) if it's world-readable the fact that it is essentially world-writable
> (see above problem with submitting wrong data) makes this easily
> exploitable for spreading links to illegal content, same as e.g. our
> pastebin was abused.

Apply the same solution we made for pastebin? i.e. i think you need an 
identity account now?

> (2) we have no operational experience with this and no
> existing data sets, and there is the residual risk of fingerprinting if we
> track too much due to that.

true, starting "small" may be a better idea.

> What might work is to make parts of the data that are certainly not
> problematic (e.g. just numbers, no free strings) publicly available live,
> and have everything else go through human review first.
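
The split suggested above (numbers go public right away, free strings wait for human review) could look roughly like this. `Value`, `Record`, `SplitResult` and `splitForPublication` are hypothetical, and a flat name-to-value map is assumed in place of the real JSON structure.

```cpp
#include <map>
#include <string>

// Hypothetical flattened field value: either a number or free-form text.
struct Value {
    bool isNumber;
    double number;
    std::string text;
};

// One submitted record, keyed by field name.
using Record = std::map<std::string, Value>;

struct SplitResult {
    Record publishable; // numeric fields, safe to show live
    Record needsReview; // free strings, held back for a human to check
};

// Splits a record so that only numeric fields are published immediately.
SplitResult splitForPublication(const Record &record)
{
    SplitResult result;
    for (const auto &entry : record) {
        if (entry.second.isNumber)
            result.publishable.insert(entry);
        else
            result.needsReview.insert(entry);
    }
    return result;
}
```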
> 
> > And if we actually do need user/password is there a way to restrict
> > which data a user can see (i.e. configure that I can see Okular's data but
> > not Krita's?).
> 
> Assuming this would be connected to identity.kde.org, I think it would be
> fine to give all people with commit access read access to the data too, or
> do you think we really need to control this per product?

Probably not?

> I do see why we might want more control on the product/campaign management
> side, so I don't accidentally destroy Okular's data due to not knowing how
> to use the tool. It would be much easier if we don't need to restrict this
> per product though, but rather just to a group of people who know what they
> are doing.

Makes sense.

Cheers,
  Albert

> 
> Regards,
> Volker
> 
> > Thanks for working on this :)
> > 
> > Cheers,
> > 
> >   Albert
> >   
> > > Technically, this consists of the following parts:
> > > - a library that goes into the target application, backward compatible
> > > all the way to Qt4.8/MSVC2010 (needed for my GammaRay use-case),
> > > depending only on QtGui
> > > - a library with the reference widgets, also with extended backward
> > > compatibility
> > > - the server, written in PHP5 and supporting sqlite/mysql/postgresql.
> > > Not the most fun technology, but that stuff is available almost
> > > anywhere, and easy to deploy and maintain
> > > - the management tool, recent Qt5/recent C++, using QtCharts for the
> > > data analysis
> > > - a command line tool for data import/export, useful for e.g. automated
> > > backups
> > > 
> > > All of this is LGPLv2+ licensed.
> > > 
> > > Feedback obviously very welcome, in particular around privacy concerns,
> > > or reasons that would make you enable/disable such a feature.
> > > 
> > > Regards,
> > > Volker