[Nepomuk] RFC: The grammar of the new Nepomuk query parser
Denis Steckelmacher
steckdenis at yahoo.fr
Wed May 29 12:06:37 UTC 2013
Hi,
As my GSoC proposal, "A new query parser and auto-completed input field
for Nepomuk", was accepted, I think it is time to discuss the grammar
that will be used by this parser.
This mail proposes a grammar that is somewhat complex but should be very
human-friendly. I tried to keep it implementable, and I hope it is, but
future reflection may remove some difficult parts of this grammar to make
it more computer-friendly.
(this is a very long mail, sorry)
General overview
----------------
The aim of this grammar is to be as simple as possible for users,
while producing fast and correct queries. I also think that the grammar
should be very adaptable to locales all around the world if we want many
users to enjoy using the Nepomuk search abilities.
The current parser works fairly well for small queries I enter into the
nepomukquerytester tool. It is able to recognize filters
("fileSize>=2000" produces a query that filters the items based on their
nfo:fileSize and nid3:audiofileSize properties), tag filters
("hasTag:some" gets all the items having "some" as one of their tags),
and simple words (that are matched against the content and properties of
the items, if I understand the query correctly).
The different filters and words can be mixed in the same query. If they
are separated by spaces or by AND, they are ANDed. OR is also recognized
and produces the expected result.
The goal of the new parser is to be able to parse everything the current
one is able to parse, plus other features that are missing. The second
goal is to provide enough information to enable the implementation of a
syntax-highlighted and auto-completed input field.
Basic grammar
-------------
The grammar uses space-separated words. The space character is
locale-dependent (ASCII 32 for languages based on European alphabets; I
don't know whether every other language uses the same space).
At the top-most level, the grammar is a sequence of rules, separated by
rule separators (characters or keywords that represent an AND or OR
relation). The current parser uses the space, OR and AND as separators,
but the new one also needs the comma, because it will disambiguate some
queries while remaining human-friendly.
statement ::= <rule> ( (<AND> | <OR>) <rule> )* ;
AND ::= " " | "," | i18n("AND") ;
OR ::= i18n("OR") ;
Each rule has the form "property OP value", where everything except the
value is optional. The property is for instance "fileSize", and OP can
be an equal sign, a comparison operation, etc. The value can be an
immediate (a string, a file size, a plain integer or a date-time) or a
nested query.
rule ::= [<property> <OP>] <value> | "(" <statement> ")" ;
property ::= (any character not a space nor an operator)*;
OP ::= ">" | ">=" | "<" | "<=" | "=" | ":" | "<>" | "!=" ;
value ::= "(" <statement> ")" | <immediate> ;
"=" and ":" have the same meaning; I use both because some people
prefer one over the other. "hasTag:tag" is equivalent to "hasTag=tag",
even if I prefer the colon there, and "activity=School" seems clearer to
me than "activity:School".
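As a rough illustration, the statement and rule productions above can be sketched in a few lines of Python. This is only a sketch under simplifying assumptions: quoted strings, nested parentheses and locale-dependent keywords are left out, and spaces and commas are both treated as AND, as in the EBNF.

```python
import re

# Operators from the OP production, longest first so ">=" wins over ">".
OPS = [">=", "<=", "<>", "!=", ">", "<", "=", ":"]

def parse_rule(word):
    """Split one word into (property, op, value); property and op
    are None when the word is a bare value."""
    for op in OPS:
        if op in word:
            prop, _, value = word.partition(op)
            return (prop, op, value)
    return (None, None, word)

def parse_statement(query):
    """Parse a statement as a list of (connective, rule) pairs.
    Spaces and commas count as AND, the keyword OR as OR."""
    result = []
    connective = None
    for token in re.split(r"[,\s]+", query.strip()):
        if token in ("AND", "OR"):
            connective = token
            continue
        result.append((connective or "AND", parse_rule(token)))
        connective = None
    return result
```

For instance, parse_statement("fileSize>=2000 OR hasTag:some") produces one ANDed rule and one ORed rule, each split into property, operator and value.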
Ambiguous strings
-----------------
I have not represented the immediate in the EBNF snippet above because
it is the one part that is not completely unambiguous. For the lexer, an
immediate is always a string. The trick is that this string may not be
surrounded by quotes. If the string begins with a quote, the case is
simple: the lexer advances until it finds the closing quote.
If the string does not begin with a quote, some work needs to be done.
The lexer is helped by the parser here, as the parser provides it with a
list of the token types it expects.
For instance, if the user has entered "fileSize>=12M", the parser knows
that the string will be transformed into a file size, and hence that
only one word has to be parsed. When an integer is expected, again only
one word needs to be parsed.
Some immediate types may span multiple words. This is the case for
general strings ("contains") and date-times. In this case, the parser
tries two strategies to close the string. The first one is to close the
string at a comma, so that in "hasTag: a tag, size>12M", "a tag" will be
recognized as a string.
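The comma strategy amounts to consuming words until one of them ends the string. A minimal sketch, with an invented helper name and the query already split into words:

```python
def close_at_comma(words):
    """Strategy 1: after 'property OP', consume words until a trailing
    comma closes the unquoted string value.  Returns the string and the
    remaining words.  Purely illustrative."""
    value_words = []
    rest = list(words)
    while rest:
        word = rest.pop(0)
        if word.endswith(","):
            value_words.append(word[:-1])  # drop the closing comma
            break
        value_words.append(word)
    return " ".join(value_words), rest
```

With the example above, close_at_comma(["a", "tag,", "size>12M"]) yields the string "a tag" and leaves "size>12M" for the next rule.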
The second strategy is to advance until a "property OP" is encountered.
Here, we dive into ambiguity: how do you parse "hasTag:my dog size>12M"?
Do you consider "my dog" as a string, or only "my", with "dog" as a
separate rule?
There is no exact answer: in my example, as a user, I wanted "my dog" to
be the tag, but there was no way to know that by simply looking at the
input.
Here, the lexer returns two results: it forks its lexing state. The
parser also forks its state and tries to parse two inputs: one that
will yield "hasTag:"my" AND contains:"dog" AND size>12M" and one that
will yield "hasTag:"my dog" AND size>12M".
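The fork can be illustrated by enumerating every possible end of the unquoted string: each split point gives one candidate interpretation. The function name and the flattened query format below are invented; the real parser would fork its lexer and parser state instead of building strings.

```python
def fork_queries(tag_words, trailing_rule):
    """Yield one candidate query per possible end of the unquoted tag.
    Words left over after the tag become contains: rules."""
    for n in range(1, len(tag_words) + 1):
        rules = ['hasTag:"%s"' % " ".join(tag_words[:n])]
        rules += ['contains:"%s"' % word for word in tag_words[n:]]
        rules.append(trailing_rule)
        yield " AND ".join(rules)
```

For "hasTag:my dog size>12M", fork_queries(["my", "dog"], "size>12M") yields exactly the two interpretations above.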
Ambiguous properties
--------------------
Another source of ambiguity and forks is that properties are optional.
If I write "dog", I may want to find my documents containing the word
"dog", or my documents tagged as "dog" (I don't think the majority of
users will read the documentation and type "hasTag:a_tag"; they just
want to find what they are looking for).
When the parser encounters a word that is not followed by a comparison
operator (=, :, >=, etc), it is a value without a property. The property
has to be guessed from the value.
In order to do that, the parser will try its different "value parsers".
One of them is the parser for file sizes, which recognizes that "12.3K"
is 12300 bytes and that "12.3KiB" is 12595 bytes. Another one is
KHumanDateTime (https://github.com/steckdenis/khumandatetime), a parser
that changes "last month", "yesterday" or "first Monday of May" to the
corresponding date-time. Finally, the tag name parser uses an in-memory
cache of the known tag names and only matches existing tag names.
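A file-size value parser along these lines might look as follows. The exact set of suffixes the real parser accepts is an assumption; note that 12.3 × 1024 = 12595.2, so "12.3KiB" rounds down to 12595 bytes.

```python
import re

# Decimal (SI) and binary (IEC) suffixes; an assumed, minimal set.
SI = {"K": 10**3, "M": 10**6, "G": 10**9}
IEC = {"KiB": 2**10, "MiB": 2**20, "GiB": 2**30}

def parse_file_size(word):
    """Return the size in bytes, or None if the word is not a file
    size.  Returning None tells the caller the value is invalid."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(KiB|MiB|GiB|K|M|G)", word)
    if not match:
        return None
    number, suffix = float(match.group(1)), match.group(2)
    return int(number * (IEC.get(suffix) or SI[suffix]))
```

A word that does not match at all ("dog") returns None, which is what lets the parser fall back to the tag-name or contains interpretations.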
If a parser returns a valid value (the parser says whether the value is
valid or not), the corresponding default property can be used as the
property of the value. For instance, if the file-size parser returns
something, it means that the implicit property has a great chance of
being "fileSize". If the date-time parser returns something and
estimates that the input string was a date-time, then properties related
to date-time may be used. "hasTag" is used if the string names an
existing tag, and "contains" is always tried.
Note: The date-time parser consumes the string. For instance, if I pass
it "Monday in two weeks", it begins by consuming "in two weeks". The
string internally becomes "Monday", which is also consumed. It may
happen that the string is not completely empty after the parsing, for
instance if it contained small useless words ("The Monday that is in
two weeks"). The remaining string is compared with the original full
string, and if it is less than half its size, the parser estimates that
more than half the string was a date-time, and hence that the string was
likely meant to be a date-time.
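The "more than half consumed" heuristic is tiny to express; a sketch, where `leftover` is whatever the date-time parser could not consume:

```python
def looks_like_datetime(original, leftover):
    """Heuristic from the mail: if the date-time parser consumed more
    than half of the input, treat the whole string as a date-time."""
    return len(leftover) < len(original) / 2
```

So "Monday in two weeks" (fully consumed) counts as a date-time, while "my dog" (nothing consumed) does not.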
Results of ambiguous queries
----------------------------
At the end of the day, the parser produces a list of possible ASTs for
its input. In fact, this can be a list or a tree, every node in the
AST then having a list of possible successors.
(OR
    (AND
        (hasTag "my dog")
        (> fileSize 12582912)
    )
    (AND
        (hasTag "my")
        (contains "dog")
        (> fileSize 12582912)
    )
)
Here, the ambiguity is solved by using an OR: the results for all the
ambiguous branches are shown to the user. I think it is better to show
more results than expected rather than fewer. Note that the fileSize
filter is distributed into the two branches of the OR: when the parser
parses "hasTag:my dog fileSize>12M", it first sees the ambiguous
hasTag, and then the fileSize filter, which has to be added to both
hasTag forks.
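Distributing a later, unambiguous rule into every existing fork is then a one-liner; a sketch with invented names, where each fork is a list of already-parsed rules:

```python
def add_rule_to_forks(forks, rule):
    """AND a later, unambiguous rule into every existing fork, as
    happens to fileSize>12M in the example above."""
    return [fork + [rule] for fork in forks]

# The two forks produced by "hasTag:my dog", before fileSize is seen:
forks = [['hasTag:"my"', 'contains:"dog"'], ['hasTag:"my dog"']]
```

After add_rule_to_forks(forks, 'fileSize>12582912'), both branches carry the size filter, matching the OR-of-ANDs AST shown above.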
Ambiguity and syntax coloring
-----------------------------
There is no way to syntax-highlight a text when it can have more than
one meaning, short of layering the interpretations (one shown by color,
another by underlining in other colors, another by an italic font, etc.).
One idea could be to show an arbitrary highlighting, for example the one
with the biggest tokens (so that the user sees that he or she has to put
commas to end strings at the correct places). Another possibility is to
show ambiguous parts grayed-out, with a tool-tip showing the ambiguous
ASTs and how the user can disambiguate the query ("put quotes around the
string" for instance).
Auto-completion
---------------
At each step of the parsing, the parser has a list of what can come
after the last token. If I type "fileSize>=", the parser knows that a
file size has to be entered. If I type "hasTag:", a tag name is expected.
This kind of information can be used to provide an auto-completion box.
If the parsing fails because a value is lacking, the auto-completion box
can show the possible value types (see
https://steckdenis.files.wordpress.com/2013/04/out.png). Each value type
can be accompanied by a list of examples (most used tag names, example
of date-times understood by the date-time parser, etc).
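A sketch of how the expected token types could feed the completion box; the type names and the examples shown are invented for illustration:

```python
def completions(expected_types, known_tags):
    """Return (type, examples) pairs for an auto-completion box.
    expected_types comes from the parser's list of what may follow
    the last token; known_tags from the cached tag list."""
    examples = {
        "filesize": ["2M", "500K", "1GiB"],
        "tagname": sorted(known_tags)[:3],  # most-used would be better
        "datetime": ["yesterday", "last month", "first Monday of May"],
    }
    return [(t, examples[t]) for t in expected_types if t in examples]
```

After "hasTag:", the parser would report ["tagname"], and the box would list a few known tags; after a bare word, several types would be offered at once.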
The auto-completion can also be used to ask users to disambiguate their
input. If "hasTag:foo " (notice the space) is entered, the
auto-completion box can show "put a comma after the tag name", with
"hasTag:foo, " as an example, and "put quotes around foo", with
"hasTag:"foo"" as an example. My dream would be that clicking on these
entries will fix the input.
Performance
-----------
The performance of the parser must be great. Even if this mail is
complex, the proposed grammar is not terribly difficult to parse,
because there is a small number of rules. So, the parsing itself will
not be very resource-intensive.
The main performance bottleneck is the value parsers. For instance, it
is of great importance that the tag-name parser never performs Nepomuk
queries. It would kill the performance if a dozen tag-name queries had
to be performed before the query could be syntax-highlighted.
One solution is to have each value parser be a class that is
instantiated when the query parser is. In their constructor, they cache
everything they need. The tag value parser can fetch all the tag names
from Nepomuk and store them. The date-time parser loads its XML rule
list in its constructor. This way, the heavy work is done only once, and
fast methods using cached content are used during the actual parsing.
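Such a caching value parser could be sketched like this; fetch_tags is an invented parameter standing in for the real Nepomuk query, injected so the cache is filled exactly once:

```python
class TagValueParser:
    """Sketch of a value parser that caches the tag list at
    construction time, so parse() never touches the store."""

    def __init__(self, fetch_tags):
        # The one and only (potentially slow) query, done up front.
        self._tags = set(fetch_tags())

    def parse(self, word):
        """Return the tag name if it is a known tag, None otherwise;
        only a fast in-memory membership test at parse time."""
        return word if word in self._tags else None
```

The query parser would instantiate one such object per value type when it is itself created, and reuse them for every keystroke.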
Here ends this big mail. I hope it is clear enough, and I welcome any
comment. When we reach a consensus about the syntax and features of the
parser, I will describe everything on my blog (that can be read directly
from Planet KDE) and will begin to implement a simple parser.
Denis Steckelmacher.