[Nepomuk] RFC: The grammar of the new Nepomuk query parser

Wed May 29 12:06:37 UTC 2013

Hi,

As my GSoC proposal, « A new query parser and auto-completed input field 
for Nepomuk » was accepted, I think is is time to discuss the grammar 
that will be used by this parser.

This mail proposes a grammar that is somewhat complex but should be very 
human-friendly. I tried to keep it implementable, and I hope it is, but 
future reflexion may remove some difficult parts of this grammar to make 
it more computer-friendly.

(this is a very long mail, sorry)

General overview
----------------

The aim for this grammar is to be the most simple possible for users, 
while producing fast and correct queries. I also think that the grammar 
should be very adaptable to locales all around the world if we want many 
users to enjoy using the Nepomuk search abilities.

The current parser works fairly well for small queries I enter into the 
nepomukquerytester tool. It is able to recognize filters 
("fileSize>=2000" produces a query that filters the items based on their 
nfo:fileSize and nid3:audiofileSize properties), tag filters 
("hasTag:some" gets all the items having "some" as one of their tags), 
and simple words (that are matched against the content and properties of 
the items, if I understand correctly the query).

The different filters and words can be mixed in the same query. If they 
are separated by spaces or by AND, they are ANDed. OR is also recognized 
and produces the expected result.

The goal of the new parser is to be able to parse everything the current 
one is able to parse, plus other features that are missing. The second 
goal is to provide enough information to enable the implementation of a 
syntax-highlighted and auto-completed input field.

Basic grammar
-------------

The grammar uses space-separated words. The space character is 
locale-dependent (ASCII 32 for languages based on European alphabets, I 
don't know if every other language use the same space).

At the top-most level, the grammar is a suite of rules, separated by any 
rule-separators (characters or keywords that represent an AND or OR 
relation). The current parser uses the space, OR and AND as separator, 
but the new one also needs the comma because it will disambiguate some 
queries, while remaining human-friendly.

statement ::= <rule> ( (<AND> | <OR>) <rule> )* ;
AND ::= " " | "," | i18n("AND") ;
OR ::= i18n("OR") ;

Each rule has the form "property OP value", where everything except the 
value is optional. The property is for instance "fileSize", and OP can 
be an equal sign, a comparison operation, etc. The value can be an 
immediate (a string, a file size, a plain integer or a date-time) or a 
nested query.

rule ::=  [<property> <OP>] <value> | "(" <statement> ")" ;
property ::= (any character not a space nor an operator)*;
OP ::= ">" | ">=" | "<" | "<=" | "=" | ":" | "<>" | "!=" ;
value ::= "(" <statement> ")" | <immediate> ;

"=" and ":" have the same meaning, I use the two because some people 
prefer one over the other. "hasTag:tag" is equivalent to "hasTag=one" 
even if I prefer the colon, and "activity=School" seems clearer to me 
than "activity:School".

Ambiguous strings
-----------------

I have not represented the immediate in the EBNF snippet above because 
it is the part that is not completely unambiguous. For the lexer, an 
immediate is always a string. The trick is that this string may not be 
surrounded by quotes. If the string begins with a quote, it's simple, 
the lexer advances until it finds another quote.

If the string does not begin with a quote, some work needs to be done. 
The lexer is helped by the parser here, as the parser will provide it a 
list of token types it expects.

For instance, if the user has entered "fileSize>=12M", the parser knows 
that the string will be transformed into a file size, and hence that 
only one word has to be parsed. When an integer is expected, again only 
one word needs to be parsed.

Some immediate types may span multiple words. It is the case of a 
general string ("contains") or date-times. In this case, the parser will 
try two strategies to close the string. The first one is to close the 
string at a comma, so that in "hasTag: a tag, size>12M", "a tag" will be 
recognized as a string.

The second strategy is to advance until a "property OP" is encountered. 
Here, we dive into ambiguity : how do you parse "hasTag:my dog size>12M" 
? Do you consider "my dog" as a string, or only "my" and "dog" is a 
separate rule ?

There is no exact response as, in my example, as an user, I wanted "my 
dog" to be the tag. But you had no idea of that by simply looking at the 
input.

Here, the lexer returns two results, it is a fork of the lexing state. 
The parser also forks its state and will try to parse two inputs : one 
that will yield "hasTag:"my" AND contains:"dog" AND size>12M" and one 
that will yield "hasTag:"my dog" AND size>12M".

Ambiguous properties
--------------------

Another source of ambiguity and forks is that properties are optional. 
If I write "dog", I may want to find my documents containing the word 
"dog", or my documents being tagged as "dog" (I don't think the majority 
of users will read the documentation and will type "hasTag:a_tag", they 
just want to find what they are looking for)

When the parser encounters a word that is not followed by a comparison 
operator (=, :, >=, etc), it is a value without a property. The property 
has to be guessed from the value.

In order to do that, the parser will try its different "value parsers". 
One of them is the parser for file sizes, that recognizes that "12.3K" 
is 12300 bytes and that "12.3KiB" is 12588 bytes. Another one is 
KHumanDateTime (https://github.com/steckdenis/khumandatetime), a parser 
that changes "last month", "yesterday" or "first Monday of May" to the 
corresponding date-time. Finally, the tag name parser uses an in-memory 
cache of the known tag names and only matches existing tag names.

If a parser returns a valid value (the parser says whether the value is 
valid or not), the corresponding default property can be used as the 
property of the value. For instance, if the file-size parser returns 
something, it means that the implicit property has a great chance of 
being "fileSize". If the date-time parser returns something and 
estimates that the input string was a date-time, then properties related 
to date-time may be used. "hasTag" is used if the string names an 
existing tag, and "contains" is always tried.

Note: The date-time parser consumes the string. For instance, if I pass 
it "Monday in two weeks", it will begin by consuming "in two weeks". The 
string now internally becomes "Monday", that is also consumed. It may 
happen that the string is not completely empty after the parsing, for 
instance if there were small useless words in it ("The Monday that is in 
two weeks"). This string is compared with the original full string, and 
if it is less than half its size, the parser estimates that more than 
half the string was a date-time, and hence that the string likely was 
meant to be a date-time.

Results of ambiguous queries
----------------------------

At the end of the day, the parser produces a list of possible ASTs for 
its input. By the way, this can be a list or a tree, every node in the 
AST therefore having a list of possible successors.

(OR
     (AND
         (hasTag "my dog")
         (> fileSize 12582912)
     (AND
         (hasTag "my")
         (contains "dog")
         (> fileSize 12582912)
     )
)

Here, the ambiguity is solved by using an OR : the results for all the 
ambiguous branches are shown to the user. I think it is better to show 
more results than expected than less. Note that the fileSize filter is 
distributed in the two branches of the OR, because when the parser 
parses "hasTag:my dog fileSize>12M", the parser first sees the ambiguous 
hasTag, and then the fileSize filter, that has to be added to the two 
hasTag forks.

Ambiguity and syntax coloring
-----------------------------

There is no way to syntax-highlight a text when it can have more than 
one meaning, except if the text can be colored, underlined in other 
colors, then rendered in an italic font, etc.

One idea could be to show an arbitrary highlighting, for example the one 
with the biggest tokens (so that the user sees that he or she has to put 
commas to end strings at the correct places). Another possibility is to 
show ambiguous parts grayed-out, with a tool-tip showing the ambiguous 
ASTs and how the user can disambiguate the query ("put quotes around the 
string" for instance).

Auto-completion
---------------

At each step of the parsing, the parser has a list of what can come 
after the last token. If I type "fileSize>=", the parser knows that a 
file size has to be entered. If I type "hasTag:", a tag name is expected.

This kind of information can be used to provide an auto-completion box. 
If the parsing fails because a value is lacking, the auto-completion box 
can show the possible value types (see 
https://steckdenis.files.wordpress.com/2013/04/out.png). Each value type 
can be accompanied by a list of examples (most used tag names, example 
of date-times understood by the date-time parser, etc).

The auto-completion can also be used to ask the user to disambiguate its 
entry. If "hasTag:foo " (notice the space) is entered, the 
auto-completion box can show "put a comma after the tag name" with 
"hasTag:foo, " as an example, and "put quotes around foo" with 
"hasTag:"foo" as an example). My dream would be that clicking on these 
entries will fix the input.

Performance
-----------

The performance of the parser must be great. Even if this mail is 
complex, the proposed grammar is not terribly difficult to parse, 
because there is a small number of rules. So, the parsing itself will 
not be very resource-intensive.

The main performance bottleneck is the value parsers. For instance, it 
is of great importance for the tag-name parser to never perform Nepomuk 
queries. It would kill the performance if a dozen of tag-name queries 
have to be performed before the query can be syntax highlighted.

One solution is to have each value parser be a class that is 
instantiated when the query parser is. In their constructor, they cache 
everything they need. The tag value parser can fetch all the tag names 
from Nepomuk, and store them. The date-time parses its XML rule list in 
its constructor. This way, the heavy work is done only one time, and 
fast methods using cached content are used during the actual parsing.

Here ends this big mail. I hope it is clear enough, and I welcome any 
comment. When we reach a consensus about the syntax and features of the 
parser, I will describe everything on my blog (that can be read directly 
from Planet KDE) and will begin to implement a simple parser.

Denis Steckelmacher.