Fuzzy searching against KDE repositories

Dimitris Kardarakos dimkard at gmail.com
Sun May 6 15:04:24 UTC 2018

Hello everyone!

Let me introduce you to a project that I am currently working on.

The scope of the project is to provide an easy way to search the KDE code 
and translation repositories. I believe that such an infrastructure would 
help newcomers easily obtain valuable information about the work of the 
community. For example:

    - which projects exist

    - which ones are the most active

    - how developers describe their work on them

    - which developers currently work on them

To make a long story short, I thought that a Google-like search 
engine would facilitate the onboarding of newcomers to KDE.

So, I ended up with a solution that:

1. Fetches git and svn commit messages from the kde-commits mailing list

2. Parses each message and creates a JSON file that contains the following fields:

- commit subject

- commit message

- author

- project

- commit date

- isrevision (does a related Phabricator task exist?)

- istranslation (is it a translation commit?)

- fixesbug (whether the commit is bug-related)

The related code can be found here (link not preserved in the archive).

3. Loads the JSON record set into an Apache Solr instance

4. On top of Apache Solr, Banana (a port of Kibana for Solr) has been 
added. A custom search panel has been created to provide fuzzy 
searches against KDE repositories.
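
Steps 2 and 3 above could be sketched roughly as follows. This is only an 
illustration, not the project's actual code: the key names, the 
"Differential Revision:" and "BUG:" markers, and the "l10n" project prefix 
are my assumptions about how the three flags might be derived, and the 
sample commit is invented.

```python
import json
import re

# Assumed markers (hypothetical; the real parser may use different rules).
BUG_RE = re.compile(r"^BUG:\s*\d+", re.MULTILINE)
REVISION_RE = re.compile(r"^Differential Revision:", re.MULTILINE)


def build_record(subject, message, author, project, commit_date):
    """Build one JSON-ready record with the fields listed in step 2."""
    return {
        "subject": subject.strip(),
        "message": message.strip(),
        "author": author,
        "project": project,
        "commit_date": commit_date,
        # Flags derived from the assumed markers above.
        "isrevision": bool(REVISION_RE.search(message)),
        "istranslation": project.startswith("l10n"),
        "fixesbug": bool(BUG_RE.search(message)),
    }


def to_json_recordset(records):
    """Serialize the records as a JSON array, one object per commit."""
    return json.dumps(records, ensure_ascii=False, indent=2)


# Invented sample commit, for illustration only.
sample = build_record(
    subject="Tighten grid spacing in icons view",
    message="Match the spacing used elsewhere.\n\nBUG: 123456\n"
            "Differential Revision: D0000",
    author="Jane Doe",
    project="kio",
    commit_date="2018-05-06T15:04:24Z",
)
recordset = to_json_recordset([sample])
```

A JSON array like `recordset` is the kind of document that Solr's JSON 
update handler accepts, so it could be posted to the instance as is.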

Moreover, it could also be useful for KDE writers/promoters to get a 
clear view of current development, on either code or translations: 
the new features, the bug-fixing work, etc.

To better illustrate the tool, let's simulate the creation of a post, 
leveraging the functionality offered by this solution.

First, the promoter wants to get more information about, and add 
references to, the improvements to the open/save dialogs:

/Open/Save dialog project/

/The dialogs now display previews for the same assortment of file types 
as Dolphin does (Alex Nemeth)/

/Grid Spacing in icons view has been tightened up to match Dolphin, 
allowing more to be shown in the window (Alex Nemeth)/


If the writer remembers the name of the committer and knows that a 
related bug report exists, the facets on the left can be used, and the 
relevant time period can be set as well (top-left):

[screenshot]

The records returned are:

[screenshot]

so the commit in question has been successfully found.

If no committer name is available, the writer may search for 
something like:

[screenshot]

Then, following the search suggestion

[screenshot]

the related commit is returned as the top result:

[screenshot]

The same applies to the second search:

[screenshot]

since the related commit is returned as well:

[screenshot]
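
For those curious how such a fuzzy search might be expressed, here is a 
minimal sketch. It relies on the `~` fuzzy operator of Solr's standard 
query parser; the field names `subject` and `message`, the edit distance 
of 1, and the helper names are my assumptions, not necessarily what the 
Banana panel sends.

```python
import re
from urllib.parse import urlencode


def fuzzy_query(text, fields=("subject", "message"), max_edits=1):
    """Turn free text into a Solr query where every word is matched fuzzily.

    Each word may appear in any of the (assumed) fields, and all words
    must match. Only alphanumeric words are kept, so no escaping of
    query-syntax characters is needed.
    """
    words = re.findall(r"[A-Za-z0-9]+", text.lower())
    clauses = []
    for word in words:
        per_field = " OR ".join(
            f"{field}:{word}~{max_edits}" for field in fields
        )
        clauses.append(f"({per_field})")
    return " AND ".join(clauses)


def select_params(text, rows=10):
    """URL-encode the parameters for a Solr select request."""
    return urlencode({"q": fuzzy_query(text), "rows": rows, "wt": "json"})


query = fuzzy_query("Grid spacing")
# query == "(subject:grid~1 OR message:grid~1) AND "
#          "(subject:spacing~1 OR message:spacing~1)"
```

With an edit distance of 1, a slightly misspelled word such as "spacng" 
would still match commits mentioning "spacing".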


Moreover, although this is not its primary role, the solution provides 
some useful interactive visualization tools. For example, when searching 
for work on projects like plasma-phone-components, plasma-settings, 
plasma-mobile and kirigami, the tool provides useful information about 
the work on Plasma Mobile. So, a related promo article could be 
accompanied by some useful statistics and references to real Plasma 
Mobile commits, like this:

[screenshot]


In the future, such a solution could be extended further by indexing 
Bugzilla data as well. As a result, reports about possible duplicates 
could be generated automatically and, why not, a fuzzy search engine 
could be offered to bug reporters, enhancing the reporting experience 
and avoiding duplicates and frustration about irrelevant results.

Nevertheless, there is a set of factors that should be considered as 
well. First, the number of commits on a project is just one indicator, 
among many others, of its activity. A lot of work may happen behind the 
scenes, in terms of communication, design, testing, etc., and this work 
may land as a single commit or just a few. So, considering all commits 
as equal is a trap. In addition, since the tool measures the number of 
commits by each developer, we should think twice about the psychological 
effects that such a tool may have on contributors.

Do you think that such a tool could help the KDE community? I look 
forward to hearing your thoughts, since I am still not convinced that 
working on this would really help the KDE ecosystem.

PS: We may look at alternatives for the technologies involved. I opted 
for the ones mentioned above since I have already worked with them in 
the past.

PS1: If similar projects that I am not aware of already exist in KDE, 
we may consider using them instead of this approach (or joining efforts 
if they are compatible). My intention is just to start a discussion about 
how big data, indexing and fuzzy searching may improve onboarding and 
"promotion" work.

