[Nepomuk] RFC: The grammar of the new Nepomuk query parser
Denis Steckelmacher
steckdenis at yahoo.fr
Sun Jun 2 07:48:55 UTC 2013
After having read the comments here on the mailing list and suggestions
from Vishesh Handa, I have thought of some modifications to the grammar.
The first goal of the parser is to be as human-friendly as possible.
That means that users should not be forced to learn a complex grammar
before being able to use the parser. Ideally, the grammar should be able
to understand natural language, even if its understanding is incomplete
or inexact.
The second goal is to have a grammar formal enough to be able to
implement syntax highlighting and auto-completion.
When dealing with natural language, one possible parsing algorithm is
simply to dig into the query and extract as much information as
possible. For "gsoc proposal, tagged as Nepomuk", an informal parser
could recognize the "tagged as X" pattern, then treat "gsoc proposal",
which doesn't match anything, as a plain text search.
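As a rough illustration of this "digging" approach, here is a minimal
Python sketch. It is not the actual Nepomuk code: the PATTERNS table, the
"tag" property name and the dig() function are all hypothetical names
chosen for the example.

```python
import re

# Hypothetical pattern table: each regex maps to an assumed property name.
PATTERNS = [
    (re.compile(r"\btagged as (\w+)", re.IGNORECASE), "tag"),
]

def dig(query):
    """Pull out every recognized pattern; the remainder is plain text."""
    found = []
    rest = query
    for regex, prop in PATTERNS:
        for match in regex.finditer(rest):
            found.append((prop, match.group(1)))
        # Remove the recognized parts so only unmatched text remains.
        rest = regex.sub("", rest)
    # Drop leftover commas and collapse whitespace.
    text = " ".join(rest.replace(",", " ").split())
    return found, text

print(dig("gsoc proposal, tagged as Nepomuk"))
```

Note that the function only reacts to patterns that are already complete
in the input, which is why this style of parser cannot suggest what the
user may type next.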
The problem with this solution is that it is impossible to offer
auto-completion with it. It is possible to syntax-highlight the input
(each recognized pattern is highlighted in a different color), but the
parser is unable to predict any input. It is a completely passive one.
Furthermore, the parser can easily become a bit hackish.
The new Nepomuk parser needs to meet these two goals. The first one
requires a simple human-friendly grammar, the second one requires that
this grammar is formal and can be disambiguated. Currently, my proposed
grammar is formal, but not user-friendly.
An example of a query I would like to be able to parse with my parser is
"mails sent by Bill last week".
Using an informal parser that digs for information, "sent by X" can be
recognized, then "last week", which is a date, and possibly also "mails",
which is recognized as a document type (the query needs to list e-mails).
Using a more formal approach, three things can be considered:
* If the lexer allows property names to contain spaces, it can lex "sent
by" as a property name (the lexer has a list of valid property names).
* In my previous proposal, a property name had to be followed by an
operator (=, >=, <, etc). The operator was in fact used to detect that
what comes before is a property name. Here, if the lexer has a list of
property names, it doesn't need operators to detect them. A default
operator can therefore be used for each property. For "sent by", the
default operator is "=".
* "Bill" comes right after a property and its default operator, so it is
a value. As the lexer doesn't know whether the user wanted a value with
spaces or not, the value of "sent by" is either "Bill" or "Bill last
week". In the ambiguous branch where only "Bill" is the value, "last
week" remains to be parsed. No property name matches this, so it is
parsed as a value, "last week", that is detected to be a date-time. The
default properties of a date-time are "created", "received", etc.
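The three points above can be sketched in Python as follows. This is only
an illustration under assumptions, not the proposed implementation: the
PROPERTIES and DATE_PHRASES tables, the tuple layout of the terms and the
parse() function are hypothetical. The sketch enumerates every possible
value length after a property, so besides the two branches discussed
above it also produces a third one where the value is "Bill last".

```python
# Hypothetical tables; a real parser would load them per language.
PROPERTIES = {"sent by": "="}            # property name -> default operator
DATE_PHRASES = {"last week", "yesterday"}

def parse(words):
    """Return every reading of the word list as a list of terms."""
    if not words:
        return [[]]
    # 1. Property names may contain spaces: try the longest prefix first.
    for n in range(len(words), 0, -1):
        name = " ".join(words[:n])
        rest = words[n:]
        if name in PROPERTIES and rest:
            op = PROPERTIES[name]
            readings = []
            # The value may span one or more words: one branch per length.
            for k in range(1, len(rest) + 1):
                value = " ".join(rest[:k])
                for tail in parse(rest[k:]):
                    readings.append([("property", name, op, value)] + tail)
            return readings
    # 2. No property name here: look for a known date phrase.
    for n in range(len(words), 0, -1):
        phrase = " ".join(words[:n])
        if phrase in DATE_PHRASES:
            return [[("date", phrase)] + tail for tail in parse(words[n:])]
    # 3. Otherwise the word is a plain-text value.
    return [[("text", words[0])] + tail for tail in parse(words[1:])]

for reading in parse("mails sent by Bill last week".split()):
    print(reading)
```

Here "mails" is kept as plain text; recognizing it as a document type
would need one more table. Ranking or pruning the branches is left out on
purpose.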
With this simple change of allowing spaces in properties and having a
list of known properties (possibly with several translations in the
user's language), the parser is now able to parse queries without any
special character, which is more human-friendly.
I think the different operators and the complete grammar need to be
kept, as only the full grammar is unambiguous. Power users may want to
be able to use complex and exact queries, and even non-technical users
may like to use "date<=last week" if someone tells them that it is
possible and would greatly improve the results returned by the parser.
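For the explicit form, a term such as "date<=last week" could be split
into property, operator and value roughly like this. Again a sketch under
assumptions: the operator list and the split_term() helper are
hypothetical and only cover the simple comparison operators.

```python
import re

# Two-character operators come first so "<=" is not read as "<".
OPERATOR = re.compile(r"\s*(<=|>=|=|<|>)\s*")

def split_term(term):
    """Split 'name OP value' at the first operator; None if there is none."""
    parts = OPERATOR.split(term, maxsplit=1)
    if len(parts) == 3:
        name, op, value = parts
        return name, op, value
    return None

print(split_term("date<=last week"))
```

When split_term() returns None, the term has no explicit operator and the
default operator of the property would apply instead.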
What do you think about this update of the proposed grammar? Do you
think I am on the right track?
Denis Steckelmacher.