[Nepomuk] RFC: The grammar of the new Nepomuk query parser

Sun Jun 2 07:48:55 UTC 2013

After having read the comments here on the mailing list and suggestions 
from Vishesh Handa, I have thought of some modifications of the grammar.

The first goal of the parser is to be as human-friendly as possible. 
That means that users should not be forced to learn a complex grammar 
before being able to use the parser. Ideally, the grammar should be able 
to understand natural language, even if its understanding is incomplete 
or inexact.

The second goal is to have a grammar formal enough to be able to 
implement syntax highlighting and auto-completion.

When dealing with natural language, one possible parsing algorithm is 
simply to dig into the query and extract as much information as 
possible. For "gsoc proposal, tagged as Nepomuk", an informal parser 
could recognize the "tagged as X" pattern, then treat "gsoc proposal", 
which doesn't match anything, as a plain text search.
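
To make this concrete, here is a minimal sketch of such an informal 
parser in Python. It is only an illustration, not the Nepomuk code: the 
pattern table, the "tag" property name and the informal_parse function 
are assumptions made for this example.

```python
import re

# Hypothetical pattern table: each regular expression is paired with the
# name of the property it extracts. Only "tagged as X" comes from the text.
PATTERNS = [
    (re.compile(r"\btagged as (\w+)", re.IGNORECASE), "tag"),
]


def informal_parse(query):
    """Return (properties, plain_text) found by digging through the query."""
    properties = []
    remaining = query
    for regex, name in PATTERNS:
        match = regex.search(remaining)
        if match:
            properties.append((name, match.group(1)))
            # Cut the recognized pattern out of the query.
            remaining = remaining[:match.start()] + remaining[match.end():]
    # Whatever is left matched nothing and becomes a plain text search.
    plain_text = remaining.strip(" ,")
    return properties, plain_text


print(informal_parse("gsoc proposal, tagged as Nepomuk"))
# prints ([('tag', 'Nepomuk')], 'gsoc proposal')
```

The function only reacts to patterns that are already complete in the 
input, so it has no notion of what could come next.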

The problem with this solution is that it is impossible to offer 
auto-completion with it. It is possible to syntax-highlight the input 
(each recognized pattern is highlighted in a different color), but the 
parser is unable to predict any input. It is a completely passive one. 
Furthermore, the parser can easily become a bit hackish.

The new Nepomuk parser needs to meet these two goals. The first one 
requires a simple human-friendly grammar, the second one requires that 
this grammar is formal and can be disambiguated. Currently, my proposed 
grammar is formal, but not user-friendly.

An example of a query I would like to be able to parse using my parser is 
"mails sent by Bill last week".

Using an informal parser that digs for information, "sent by X" can be 
recognized, then "last week", which is a date, and possibly also "mails", 
which is recognized as a document type (the query needs to list e-mails).

Using a more formal approach, three things can be considered:

* If the lexer allows property names to contain spaces, it can lex "sent 
by" as a property name (the lexer has a list of valid property names).
* In my previous proposal, a property name had to be followed by an 
operator (=, >=, <, etc). The operator was in fact used to detect that 
what comes before is a property name. Here, if the lexer has a list of 
property names, it doesn't need operators to detect them. A default 
operator can therefore be used for each property. For "sent by", the 
default operator is "=".
* "Bill" comes right after a property and its default operator, so it is 
a value. As the lexer doesn't know whether the user wanted a value with 
spaces or not, the value of "sent by" can be either "Bill" or "Bill last 
week", so the parse is ambiguous. In the branch where only "Bill" is the 
value, "last week" remains to be parsed. No property name matches it, so 
it is parsed as a value, "last week", which is detected to be a 
date-time. The default properties of a date-time are "created", 
"received", etc.
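
The three points above can be sketched as follows. This is a rough 
Python illustration under several assumptions: the DEFAULT_OPERATORS 
and DATE_PHRASES tables, the tuple layout of the terms and both 
function names are hypothetical, and a real implementation would use a 
proper date parser. The sketch tries every possible value length, so 
besides "Bill" and "Bill last week" it also produces a "Bill last" 
branch.

```python
# Hypothetical tables; only "sent by" and its "=" default come from the text.
DEFAULT_OPERATORS = {"sent by": "="}
DATE_PHRASES = {"last week", "yesterday"}  # stand-in for a real date parser
MAX_NAME_WORDS = max(len(name.split()) for name in DEFAULT_OPERATORS)


def read_bare_value(words):
    """Read a value that is not preceded by a property name."""
    # Prefer the longest date phrase starting at this position.
    for length in (2, 1):
        phrase = " ".join(words[:length])
        if length <= len(words) and phrase in DATE_PHRASES:
            return ("date", phrase), length
    return ("text", words[0]), 1


def parse_branches(words):
    """Return every reading of the words, each one being a list of terms."""
    if not words:
        return [[]]
    branches = []
    # Try to lex a property name (possibly containing spaces) here.
    for name_len in range(1, MAX_NAME_WORDS + 1):
        name = " ".join(words[:name_len])
        if name_len >= len(words) or name not in DEFAULT_OPERATORS:
            continue
        # The value may cover one or more of the following words.
        for end in range(name_len + 1, len(words) + 1):
            value = " ".join(words[name_len:end])
            term = ("property", name, DEFAULT_OPERATORS[name], value)
            branches += [[term] + rest for rest in parse_branches(words[end:])]
    if branches:
        return branches
    # No property name starts here: read a bare value instead.
    term, consumed = read_bare_value(words)
    return [[term] + rest for rest in parse_branches(words[consumed:])]


for branch in parse_branches("mails sent by Bill last week".split()):
    print(branch)
```

Choosing between the branches (for example, preferring the one where 
"last week" is recognized as a date) is left out of this sketch.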

With this simple change of allowing spaces in property names and keeping 
a list of known properties (each with its possible translations in the 
user's language), the parser can handle queries that contain no special 
characters, which makes it more human-friendly.
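
A list of known properties is also what makes auto-completion possible. 
As a small sketch (in Python, with a hypothetical KNOWN_PROPERTIES list 
and a hypothetical complete function), the end of the input can be 
matched against the beginning of each property name:

```python
# Hypothetical list; a real one would include translated property names.
KNOWN_PROPERTIES = ["sent by", "sent to", "tagged as"]


def complete(query):
    """Suggest property names that could continue the end of the query."""
    words = query.lower().split()
    suggestions = []
    # Try every suffix of the query, longest first, as a partial name.
    for start in range(len(words)):
        partial = " ".join(words[start:])
        for name in KNOWN_PROPERTIES:
            if (name.startswith(partial) and name != partial
                    and name not in suggestions):
                suggestions.append(name)
    return suggestions


print(complete("mails sent"))
# prints ['sent by', 'sent to']
```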

I think the different operators and the complete grammar need to be 
kept, as only the full grammar is unambiguous. Power users may want to 
be able to use complex and exact queries, and even non-technical users 
may like to be able to use "date<=last week", if someone tells them that 
it is possible and that it would greatly improve the results returned by 
the parser.

What do you think about this update to the proposed grammar? Do you 
think I am on the right track?

Denis Steckelmacher.