[Nepomuk] RFC: The grammar of the new Nepomuk query parser

Thu May 30 16:07:31 UTC 2013

Hey Denis

On Wed, May 29, 2013 at 5:36 PM, Denis Steckelmacher <steckdenis at yahoo.fr>wrote:

> Hi,
>
> As my GSoC proposal, « A new query parser and auto-completed input field
> for Nepomuk » was accepted, I think is is time to discuss the grammar that
> will be used by this parser.
>
> This mail proposes a grammar that is somewhat complex but should be very
> human-friendly. I tried to keep it implementable, and I hope it is, but
> future reflexion may remove some difficult parts of this grammar to make it
> more computer-friendly.
>
> (this is a very long mail, sorry)
>

I'm getting used to them.

>
> General overview
> ----------------
>
> The aim for this grammar is to be the most simple possible for users,
> while producing fast and correct queries. I also think that the grammar
> should be very adaptable to locales all around the world if we want many
> users to enjoy using the Nepomuk search abilities.
>
> The current parser works fairly well for small queries I enter into the
> nepomukquerytester tool. It is able to recognize filters ("fileSize>=2000"
> produces a query that filters the items based on their nfo:fileSize and
> nid3:audiofileSize properties), tag filters ("hasTag:some" gets all the
> items having "some" as one of their tags), and simple words (that are
> matched against the content and properties of the items, if I understand
> correctly the query).
>
> The different filters and words can be mixed in the same query. If they
> are separated by spaces or by AND, they are ANDed. OR is also recognized
> and produces the expected result.
>
> The goal of the new parser is to be able to parse everything the current
> one is able to parse, plus other features that are missing. The second goal
> is to provide enough information to enable the implementation of a
> syntax-highlighted and auto-completed input field.
>
> Basic grammar
> -------------
>
> The grammar uses space-separated words. The space character is
> locale-dependent (ASCII 32 for languages based on European alphabets, I
> don't know if every other language use the same space).
>
> At the top-most level, the grammar is a suite of rules, separated by any
> rule-separators (characters or keywords that represent an AND or OR
> relation). The current parser uses the space, OR and AND as separator, but
> the new one also needs the comma because it will disambiguate some queries,
> while remaining human-friendly.
>
> statement ::= <rule> ( (<AND> | <OR>) <rule> )* ;
> AND ::= " " | "," | i18n("AND") ;
> OR ::= i18n("OR") ;
>
> Each rule has the form "property OP value", where everything except the
> value is optional. The property is for instance "fileSize", and OP can be
> an equal sign, a comparison operation, etc. The value can be an immediate
> (a string, a file size, a plain integer or a date-time) or a nested query.
>
> rule ::=  [<property> <OP>] <value> | "(" <statement> ")" ;
> property ::= (any character not a space nor an operator)*;
> OP ::= ">" | ">=" | "<" | "<=" | "=" | ":" | "<>" | "!=" ;
> value ::= "(" <statement> ")" | <immediate> ;
>
> "=" and ":" have the same meaning, I use the two because some people
> prefer one over the other. "hasTag:tag" is equivalent to "hasTag=one" even
> if I prefer the colon, and "activity=School" seems clearer to me than
> "activity:School".
>
> Ambiguous strings
> -----------------
>
> I have not represented the immediate in the EBNF snippet above because it
> is the part that is not completely unambiguous. For the lexer, an immediate
> is always a string. The trick is that this string may not be surrounded by
> quotes. If the string begins with a quote, it's simple, the lexer advances
> until it finds another quote.
>
> If the string does not begin with a quote, some work needs to be done. The
> lexer is helped by the parser here, as the parser will provide it a list of
> token types it expects.
>
> For instance, if the user has entered "fileSize>=12M", the parser knows
> that the string will be transformed into a file size, and hence that only
> one word has to be parsed. When an integer is expected, again only one word
> needs to be parsed.
>
> Some immediate types may span multiple words. It is the case of a general
> string ("contains") or date-times. In this case, the parser will try two
> strategies to close the string. The first one is to close the string at a
> comma, so that in "hasTag: a tag, size>12M", "a tag" will be recognized as
> a string.
>
> The second strategy is to advance until a "property OP" is encountered.
> Here, we dive into ambiguity : how do you parse "hasTag:my dog size>12M" ?
> Do you consider "my dog" as a string, or only "my" and "dog" is a separate
> rule ?
>
> There is no exact response as, in my example, as an user, I wanted "my
> dog" to be the tag. But you had no idea of that by simply looking at the
> input.
>
> Here, the lexer returns two results, it is a fork of the lexing state. The
> parser also forks its state and will try to parse two inputs : one that
> will yield "hasTag:"my" AND contains:"dog" AND size>12M" and one that will
> yield "hasTag:"my dog" AND size>12M".
>

Well, this all assumes that "my dog" actually exists as a tag. If it
doesn't and only "my" exists as a tag, then you obviously only want all
files tagged with "my" and contains "dog".

If none of the tags exist then maybe we just want to do a literal search
for "my" and "dog".

>
> Ambiguous properties
> --------------------
>
> Another source of ambiguity and forks is that properties are optional. If
> I write "dog", I may want to find my documents containing the word "dog",
> or my documents being tagged as "dog" (I don't think the majority of users
> will read the documentation and will type "hasTag:a_tag", they just want to
> find what they are looking for)
>
> When the parser encounters a word that is not followed by a comparison
> operator (=, :, >=, etc), it is a value without a property. The property
> has to be guessed from the value.
>
> In order to do that, the parser will try its different "value parsers".
> One of them is the parser for file sizes, that recognizes that "12.3K" is
> 12300 bytes and that "12.3KiB" is 12588 bytes. Another one is
> KHumanDateTime (https://github.com/**steckdenis/khumandatetime<https://github.com/steckdenis/khumandatetime>),
> a parser that changes "last month", "yesterday" or "first Monday of May" to
> the corresponding date-time. Finally, the tag name parser uses an in-memory
> cache of the known tag names and only matches existing tag names.
>
> If a parser returns a valid value (the parser says whether the value is
> valid or not), the corresponding default property can be used as the
> property of the value. For instance, if the file-size parser returns
> something, it means that the implicit property has a great chance of being
> "fileSize". If the date-time parser returns something and estimates that
> the input string was a date-time, then properties related to date-time may
> be used. "hasTag" is used if the string names an existing tag, and
> "contains" is always tried.
>
> Note: The date-time parser consumes the string. For instance, if I pass it
> "Monday in two weeks", it will begin by consuming "in two weeks". The
> string now internally becomes "Monday", that is also consumed. It may
> happen that the string is not completely empty after the parsing, for
> instance if there were small useless words in it ("The Monday that is in
> two weeks"). This string is compared with the original full string, and if
> it is less than half its size, the parser estimates that more than half the
> string was a date-time, and hence that the string likely was meant to be a
> date-time.
>

uhhh .. okay

>
> Results of ambiguous queries
> ----------------------------
>
> At the end of the day, the parser produces a list of possible ASTs for its
> input. By the way, this can be a list or a tree, every node in the AST
> therefore having a list of possible successors.
>
> (OR
>     (AND
>         (hasTag "my dog")
>         (> fileSize 12582912)
>     (AND
>         (hasTag "my")
>         (contains "dog")
>         (> fileSize 12582912)
>     )
> )
>
> Here, the ambiguity is solved by using an OR : the results for all the
> ambiguous branches are shown to the user. I think it is better to show more
> results than expected than less. Note that the fileSize filter is
> distributed in the two branches of the OR, because when the parser parses
> "hasTag:my dog fileSize>12M", the parser first sees the ambiguous hasTag,
> and then the fileSize filter, that has to be added to the two hasTag forks.
>
> Ambiguity and syntax coloring
> -----------------------------
>
> There is no way to syntax-highlight a text when it can have more than one
> meaning, except if the text can be colored, underlined in other colors,
> then rendered in an italic font, etc.
>
> One idea could be to show an arbitrary highlighting, for example the one
> with the biggest tokens (so that the user sees that he or she has to put
> commas to end strings at the correct places). Another possibility is to
> show ambiguous parts grayed-out, with a tool-tip showing the ambiguous ASTs
> and how the user can disambiguate the query ("put quotes around the string"
> for instance).
>
> Auto-completion
> ---------------
>
> At each step of the parsing, the parser has a list of what can come after
> the last token. If I type "fileSize>=", the parser knows that a file size
> has to be entered. If I type "hasTag:", a tag name is expected.
>
> This kind of information can be used to provide an auto-completion box. If
> the parsing fails because a value is lacking, the auto-completion box can
> show the possible value types (see https://steckdenis.files.**
> wordpress.com/2013/04/out.png<https://steckdenis.files.wordpress.com/2013/04/out.png>
> )**. Each value type can be accompanied by a list of examples (most used
> tag names, example of date-times understood by the date-time parser, etc).
>
> The auto-completion can also be used to ask the user to disambiguate its
> entry. If "hasTag:foo " (notice the space) is entered, the auto-completion
> box can show "put a comma after the tag name" with "hasTag:foo, " as an
> example, and "put quotes around foo" with "hasTag:"foo" as an example). My
> dream would be that clicking on these entries will fix the input.
>
> Performance
> -----------
>
> The performance of the parser must be great. Even if this mail is complex,
> the proposed grammar is not terribly difficult to parse, because there is a
> small number of rules. So, the parsing itself will not be very
> resource-intensive.
>
> The main performance bottleneck is the value parsers. For instance, it is
> of great importance for the tag-name parser to never perform Nepomuk
> queries. It would kill the performance if a dozen of tag-name queries have
> to be performed before the query can be syntax highlighted.
>
> One solution is to have each value parser be a class that is instantiated
> when the query parser is. In their constructor, they cache everything they
> need. The tag value parser can fetch all the tag names from Nepomuk, and
> store them. The date-time parses its XML rule list in its constructor. This
> way, the heavy work is done only one time, and fast methods using cached
> content are used during the actual parsing.
>
> Here ends this big mail. I hope it is clear enough, and I welcome any
> comment. When we reach a consensus about the syntax and features of the
> parser, I will describe everything on my blog (that can be read directly
> from Planet KDE) and will begin to implement a simple parser.

Great work so far.

However, I would like to take a step back. I feel this discussion might be
getting too technical.

Our main use case is normal users who do not know that much about the
syntax and then power users who do know about the syntax. If we have to
prioritize one over the other, it should be the "normal users".

This effectively means that most users will not type things such as
"property>=value". They will just type strings, and we might have to infer
the meaning from there.

If you want some real world examples, then have a look at "Google Search"
or even better the "Facebook Graph Search".

Ivan also applied for the same project, and he had created some very
interesting mockups for the query parser which focused more on the user
side and less on the technical. Perhaps you could look at this project? [1]
I'm not saying we should implement something exactly like that, but that
seems like a good place to start instead of focusing more on the grammar.

[1]
http://www.google-melange.com/gsoc/proposal/review/google/gsoc2013/ivan/20001

-- 
Vishesh Handa
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.kde.org/pipermail/nepomuk/attachments/20130530/9b270cc5/attachment-0001.html>