Fuzzy searching against KDE repositories

Sun May 6 16:04:24 BST 2018

Hello everyone!

Let me introduce you to a project that I am currently working on.

The goal of the project is to provide an easy way to search the KDE code 
and translation repositories. I believe that such an infrastructure 
would help newcomers easily obtain valuable information about the work 
of the community. For example:

    - which projects exist

    - which ones are the most active

    - how developers describe their work on them

    - which developers currently work on them

To make a long story short, I thought that a Google-like search engine 
would facilitate the onboarding of newcomers to KDE.

So, I ended up with a solution that:

1. Fetches git and svn commit messages from the kde-commits mailing list

2. Parses each message and creates a JSON file that contains the 
following information:

- commit subject

- commit message

- author

- project

- commit date

- isrevision (does a related Phabricator task exist?)

- istranslation (is it a translation commit?)

- fixesbug (whether the commit is bug-related)

The related code can be found here: 
https://github.com/dimkard/kde-commits-solr

3. Loads the JSON record set into an Apache Solr instance

4. On top of Apache Solr, Banana (a port of Kibana for Solr) has been 
added. A custom search panel has been created to provide fuzzy 
searches against KDE repositories.
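
Steps 2 and 3 above can be sketched roughly as follows. This is only an 
illustration, not the code from the repository linked above: the sample 
mail, its simplified layout, the detection heuristics ("BUG:" and 
"Differential Revision:" lines, an "l10n" project prefix), the Solr URL 
and the core name "kde-commits" are all assumptions made for the example. 
The request is built but never sent.

```python
import json
import re
import urllib.request
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime

# Hypothetical commit mail in a simplified layout; real kde-commits
# messages may look different. Bug number and URL are placeholders.
SAMPLE_MAIL = """\
From: Jane Doe <jane@example.org>
Date: Sun, 06 May 2018 16:04:24 +0100
Subject: [dolphin] src: Fix preview size in icons view

Fix preview size in icons view

BUG: 000000
Differential Revision: https://example.org/D0
"""


def parse_commit_mail(raw):
    """Turn one raw commit mail into a flat record with the fields listed above."""
    msg = message_from_string(raw)
    subject = msg["Subject"] or ""
    # Assumption: the project name is given in square brackets in the subject.
    match = re.match(r"\[([^\]]+)\]\s*(.*)", subject)
    project = match.group(1) if match else ""
    body = msg.get_payload()
    author_name, _address = parseaddr(msg["From"] or "")
    return {
        "subject": match.group(2) if match else subject,
        "message": body.strip(),
        "author": author_name,
        "project": project,
        # Raises if the Date header is missing; acceptable for a sketch.
        "date": parsedate_to_datetime(msg["Date"]).isoformat(),
        # Assumed heuristics, for illustration only.
        "isrevision": "Differential Revision:" in body,
        "istranslation": project.startswith("l10n"),
        "fixesbug": re.search(r"^BUG:\s*\d+", body, re.MULTILINE) is not None,
    }


def build_solr_update_request(records, base_url, core):
    """Build (but do not send) a POST that adds a JSON array of documents to Solr."""
    body = json.dumps(records).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/{core}/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


record = parse_commit_mail(SAMPLE_MAIL)
request = build_solr_update_request(
    [record], "http://localhost:8983/solr", "kde-commits"
)
print(json.dumps(record, indent=2))
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen() would send it to a 
running Solr instance; batching many records into one request is usually 
preferable to posting them one by one.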

Moreover, it could also be useful for KDE writers/promoters to get a 
clear view of current development, on either code or translations: 
the new features, the bug-fixing work, etc.

To better illustrate the tool, let's simulate the creation of a post 
like 
https://pointieststick.wordpress.com/2018/04/29/this-week-in-usability-productivity-part-16-everything-else/ 
using the functionality offered by this solution.

First, the promoter wants to get more information about, and add 
references to, the Open/Save dialog project improvements:

/Open/Save dialog project/

/The dialogs now display previews for the same assortment of file types 
as Dolphin does (Alex Nemeth)/

/Grid Spacing in icons view has been tightened up to match Dolphin, 
allowing more to be shown in the window (Alex Nemeth)/


If the writer remembers the name of the committer and knows that a 
related bug report exists, the facet on the left can be used and the 
relevant time period set (top-left):

https://framapic.org/z4PtCZxEul5K/L3tJZ8visR4I.png

https://framapic.org/DnveENis7bEa/BsKkske4RVPz.png

The records returned are:

https://framapic.org/8fv0crGCijf6/cPsbeZWO1CJH.png

so the commit in question has been successfully found.

If no committer name is available, the writer may search for 
something like:

https://framapic.org/wYn063MH0jPY/4zEdV8ngIidS.png

Then, following the search suggestion,

https://framapic.org/owrsQQ2a5HVW/gWUiTNz7xWTB.png

the relevant commit will be returned as the top result:

https://framapic.org/bNDgq0cbR7J7/xVd3HJqQ40Kv.png

The same applies to the second search:

https://framapic.org/qpddu38zvRJF/ssQO4MBEWt6s.png

since the relevant commit is returned as well:

https://framapic.org/g31g6x5mIOxR/5PNUInPMNxYw.png
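
For readers who would like to try such a search without the panel, here 
is a minimal sketch of how a fuzzy query could be sent to Solr directly. 
It relies on the "~" fuzzy operator of Solr's standard query parser; the 
field name "message", the core name "kde-commits" and the local URL are 
assumptions, and the Banana panel may well build its queries differently.

```python
from urllib.parse import urlencode


def fuzzy_query(text, field="message", max_edits=1):
    """Turn free text into a Solr query in which every word is a fuzzy term."""
    if not 0 <= max_edits <= 2:
        raise ValueError("Solr fuzzy terms accept an edit distance of at most 2")
    # Keep only alphanumeric characters so query-syntax characters are dropped.
    words = ["".join(ch for ch in word if ch.isalnum()) for word in text.split()]
    terms = [f"{field}:{word.lower()}~{max_edits}" for word in words if word]
    return " AND ".join(terms)


def select_url(base_url, core, query, rows=10):
    """Build the URL of a select request; nothing is sent here."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base_url}/{core}/select?{params}"


# A misspelled search ("previws") should still match "previews".
q = fuzzy_query("dialog previws")
url = select_url("http://localhost:8983/solr", "kde-commits", q)
print(q)
print(url)
```

An edit distance of 1 tolerates a single typo per word; raising it to 2 
catches more misspellings at the cost of noisier results.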

Moreover, although this is not its primary role, the solution provides 
some useful interactive visualization tools. For example, searching for 
work on projects like plasma-phone-components, plasma-settings, 
plasma-mobile and kirigami provides useful information about work on 
Plasma Mobile. So, a related promo article could be accompanied by 
some useful statistics and references to real Plasma Mobile commits, 
like this:

https://framapic.org/2RW8LlxCjYkh/LbkUnVQZTyZV.png

In the future, such a solution could be extended to index Bugzilla 
data as well. As a result, reports about possible duplicates could be 
generated automatically and, why not, a fuzzy search engine could be 
offered to bug reporters, enhancing the reporting experience and 
avoiding duplicates and frustration about irrelevant results.

Nevertheless, there is a set of factors that should be considered as 
well. First, the number of commits on a project is just one indicator, 
among many others, of its activity. A lot of work may happen behind the 
scenes, in terms of communication, design, testing, etc., and this work 
may land as a single commit or just a few. So, treating all commits as 
equal is a trap. In addition, since the tool counts the commits made by 
each developer, we should think twice about the psychological effects 
that such a tool may have on contributors.

Do you think that such a tool could help the KDE community? I look 
forward to hearing your thoughts, since I am still not convinced that 
working on this would really help the KDE ecosystem.

PS: We may look at alternatives to the technologies involved. I’ve 
opted for the aforementioned ones since I have already worked with 
them in the past.

PS1: If similar projects that I am not aware of already exist in KDE, 
we may consider using them instead of this approach (or join efforts if 
they are compatible). My intention is just to start a discussion about 
how big data, indexing and fuzzy searching may improve onboarding and 
"promotion" work.

Dimitris
