[Nepomuk] Status report for the Nepomuk query parser (Week 2)

Mon Jun 24 11:08:24 UTC 2013

Hi,

My experimental parser made good progress last week, so here is a new
status report. If you think one status report per week is too much,
don't hesitate to tell me.

My experimental parser still lives on GitHub[1], but I plan to get a KDE
Git account and push it to a new branch of nepomuk-core, if you think
that is a good idea. The parser only depends on Nepomuk Core and the
KDE core libraries (for KCalendarSystem and localization). Its code base
is fairly self-contained, and the parser itself can be made
binary-compatible with the current QueryParser class.

During development, I regularly post progress reports on my blog[2].
This mail presents last week's two big new features: nested queries
and date-times.

Nested queries allow queries like "mails related to files tagged as
Work" to work. Here, "files tagged as Work" is used to build an AndTerm,
which is then used as the sub-term of the related_to ComparisonTerm. The
nested query ends at the end of the input or at a terminating
character. For "related to ... ,", the terminating character is the
comma. This allows queries like "files related to mails of Jimmy, having
a size > 2M" to be parsed correctly.
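
To illustrate the terminating-character rule, here is a minimal sketch
in plain C++. It is not the actual parser code: splitNestedQuery is a
hypothetical helper, and it works on raw text, whereas the real parser
works on terms.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical helper, not part of the actual parser: given the position
// just after a nesting keyword such as "related to", returns the nested
// query text and the text that remains after the terminating character.
std::pair<std::string, std::string> splitNestedQuery(const std::string &input,
                                                     std::size_t start,
                                                     char terminator = ',')
{
    assert(start <= input.size());

    const std::size_t end = input.find(terminator, start);
    if (end == std::string::npos) {
        // No terminator: the nested query runs to the end of the input.
        return std::make_pair(input.substr(start), std::string());
    }
    // The terminator itself belongs to neither part.
    return std::make_pair(input.substr(start, end - start),
                          input.substr(end + 1));
}
```

With "files related to mails of Jimmy, having a size > 2M" and a start
position just after "related to ", the nested part is "mails of Jimmy"
and the remainder is left for the outer query.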

Parsing date-times is not really mandatory, as the user can be taught to 
use standard date-time formats that Qt can parse. But as the philosophy 
of the parser is to be as human-friendly as possible, and to stress-test 
it a bit, I implemented the parsing of natural date-times.

Date-times are parsed using the same infrastructure as the rest of the 
parser. Every simple piece of date or time is recognized by the pattern 
matcher (using patterns like "last %1" for "last month" or "last week") 
and transformed into ComparisonTerms.
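
As a rough illustration of how a pattern like "last %1" can be matched,
here is a minimal sketch in plain C++. It is an assumption about the
general technique, not the actual pattern matcher: splitWords and
matchPattern are hypothetical names, and each placeholder captures
exactly one word here.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Splits a text into whitespace-separated words.
std::vector<std::string> splitWords(const std::string &text)
{
    std::istringstream stream(text);
    std::vector<std::string> words;
    std::string word;
    while (stream >> word) {
        words.push_back(word);
    }
    return words;
}

// Hypothetical matcher: tries to match `pattern` against `words`, starting
// at index `start`. Parts such as "%1" capture one word, every other part
// must be equal to the input word. `captures` is only modified on success.
bool matchPattern(const std::string &pattern,
                  const std::vector<std::string> &words,
                  std::size_t start,
                  std::vector<std::string> &captures)
{
    const std::vector<std::string> parts = splitWords(pattern);
    if (start > words.size() || parts.size() > words.size() - start) {
        return false;   // Not enough input left for the whole pattern
    }

    std::vector<std::string> found;
    for (std::size_t i = 0; i < parts.size(); ++i) {
        const std::string &part = parts[i];
        const std::string &word = words[start + i];
        if (part.size() >= 2 && part[0] == '%') {
            found.push_back(word);      // Placeholder: capture the word
        } else if (part != word) {
            return false;               // Literal part that does not match
        }
    }
    captures = found;
    return true;
}
```

Matching "last %1" at the third word of "mails sent last week" succeeds
and captures "week", which can then be turned into a comparison.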

The property of these comparison terms is an internal URI, like
"date://month/value" (for absolute values, as in "in January") or
"date://month/offset" (for relative values, as in "2 months ago"). The
value of the comparison is an integer literal term corresponding to the
month number, day number, etc.

The parser thus produces a list of comparisons. Normal comparisons
(hasTag, author, etc.) are kept, and date-time comparisons are fused into
literal date-time values. Fusing consists of replacing a set of adjacent
date-time comparisons with a single comparison against a date-time
literal. So, "second week of June" is parsed as "month=June, week=2nd",
which is then replaced with "2013-06-10 00:00:00".
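
The fusing step can be sketched as follows in plain C++. This is a
simplified illustration, not the actual pass: the Comparison, Date and
Fused types are hypothetical stand-ins for the Nepomuk terms, only the
absolute year, month and day URIs are handled, and the
"date://year/value" and "date://day/value" URIs are assumed by analogy
with "date://month/value". Weeks, offsets and times need a calendar
system and are left out.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Simplified stand-in for a ComparisonTerm: a property and an integer value.
struct Comparison {
    std::string property;
    int value;
};

struct Date {
    int year;
    int month;
    int day;
};

// One entry of the fused list: either a date literal or a kept comparison.
struct Fused {
    bool isDate;            // true: `date` is meaningful, else `comparison`
    Comparison comparison;
    Date date;
};

bool isDatePart(const Comparison &c)
{
    return c.property.compare(0, 7, "date://") == 0;
}

// Replaces every run of adjacent date comparisons with one date literal.
// Fields the user did not give default to the current year and to 1.
std::vector<Fused> fuseDateParts(const std::vector<Comparison> &terms,
                                 int currentYear)
{
    std::vector<Fused> result;
    std::size_t i = 0;

    while (i < terms.size()) {
        if (!isDatePart(terms[i])) {
            // Normal comparison: kept as-is
            result.push_back(Fused{false, terms[i], Date{0, 0, 0}});
            ++i;
            continue;
        }

        Date date{currentYear, 1, 1};
        for (; i < terms.size() && isDatePart(terms[i]); ++i) {
            const std::string &property = terms[i].property;
            if (property == "date://year/value") {
                date.year = terms[i].value;
            } else if (property == "date://month/value") {
                date.month = terms[i].value;
            } else if (property == "date://day/value") {
                date.day = terms[i].value;
            }
            // Other date URIs are ignored in this sketch.
        }
        result.push_back(Fused{true, Comparison{"", 0}, date});
    }
    return result;
}
```

For example, the list "hasTag, month=6, day=10, author" with 2013 as
the current year becomes three entries: hasTag, the date 2013-06-10,
and author.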

These literal date-times can finally be used by other properties, like 
"sent on", "created at", etc. With ordinal operators, queries like 
"mails sent before last week" are correctly parsed.

The last parsing step that took a bit of time to get right is the 
handling of intervals. If I say "mails sent last week", I want to get a 
list of mails sent between the first and last days of last week.

Here, my solution is a bit hackish, I admit. The length of the period 
depends on the shortest part of the date-time given by the user, so 
"last week" is one week long, "in 2011" is one year long, "tomorrow" is 
one day long, and finally "at 3:40 pm" is one minute long.

This length is represented by an enumeration (PassDatePeriods::Period).
This enumeration is converted to an integer and stored in the
millisecond part of the date-time. This is crazy, but a date-time that
is not exact to the millisecond remains good enough, and I had to store
the information somewhere, as neither LiteralTerm nor ComparisonTerm
allows me to store this kind of extra information.
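
The millisecond trick can be sketched like this in plain C++. It is
only an illustration under assumptions: the date-time is represented as
milliseconds since the epoch instead of a real date-time class, the
enumerators below are hypothetical (the actual members of
PassDatePeriods::Period are not listed in this mail), and withPeriod
and periodOf are made-up names.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical enumerators, standing in for PassDatePeriods::Period.
enum class Period { Year = 0, Month, Week, Day, Hour, Minute };

// Overwrites the millisecond part of a timestamp with the period.
// The original milliseconds are lost, the seconds are preserved.
std::int64_t withPeriod(std::int64_t msecsSinceEpoch, Period period)
{
    assert(msecsSinceEpoch >= 0);
    return (msecsSinceEpoch / 1000) * 1000 + static_cast<std::int64_t>(period);
}

// Reads the period back from the millisecond part of a timestamp.
// Only valid for timestamps produced by withPeriod().
Period periodOf(std::int64_t msecsSinceEpoch)
{
    assert(msecsSinceEpoch >= 0);
    return static_cast<Period>(msecsSinceEpoch % 1000);
}
```

The obvious cost is that any code reading the date-time afterwards
must know that its millisecond part is not a real time value.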

When an equality comparison against a date-time is encountered, it is 
replaced by a comparison against an interval. So, 
"nmo:receivedDate=2013-06-23" is replaced with 
"nmo:receivedDate>=2013-06-23 AND nmo:receivedDate<2013-06-24" for a 
period that is one day long.
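
Computing the end of the interval from the start and the period can be
sketched as follows in plain C++. This is not the actual pass: DateTime
and Period are hypothetical stand-ins, the calendar is hard-coded to
the Gregorian one (the real parser can rely on KCalendarSystem), and
the start is assumed to be aligned to the period (for instance, the
first day of the month for a one-month period).

```cpp
#include <utility>

struct DateTime {
    int year;
    int month;
    int day;
    int hour;
    int minute;
};

// Hypothetical enumerators, standing in for PassDatePeriods::Period.
enum class Period { Year, Month, Week, Day, Hour, Minute };

int daysInMonth(int year, int month)
{
    static const int days[] = {31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};
    const bool leap = (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
    return (month == 2 && leap) ? 29 : days[month - 1];
}

// Carries overflowing fields upwards (minute 60 becomes the next hour, etc.)
DateTime normalized(DateTime dt)
{
    dt.hour += dt.minute / 60;
    dt.minute %= 60;
    dt.day += dt.hour / 24;
    dt.hour %= 24;
    dt.year += (dt.month - 1) / 12;
    dt.month = (dt.month - 1) % 12 + 1;

    while (dt.day > daysInMonth(dt.year, dt.month)) {
        dt.day -= daysInMonth(dt.year, dt.month);
        if (++dt.month > 12) {
            dt.month = 1;
            ++dt.year;
        }
    }
    return dt;
}

// Returns [begin, end) so that "property = start" can be rewritten as
// "property >= begin AND property < end".
std::pair<DateTime, DateTime> equalityToInterval(DateTime start, Period period)
{
    DateTime end = start;
    switch (period) {
    case Period::Year:   ++end.year;    break;
    case Period::Month:  ++end.month;   break;
    case Period::Week:   end.day += 7;  break;
    case Period::Day:    ++end.day;     break;
    case Period::Hour:   ++end.hour;    break;
    case Period::Minute: ++end.minute;  break;
    }
    return std::make_pair(start, normalized(end));
}
```

With 2013-06-23 and a one-day period, the end is 2013-06-24, which
gives the two comparisons shown above.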

The parser is currently made of 15 C++ files (11 parsing passes, the
parser, the pattern matcher, utility methods and a main.cpp to test
everything), totaling 1,174 SLOC. This is fairly big but remains
reasonable, and nearly every parsing feature is here.

This week, I plan to start looking at syntax highlighting and
auto-completion. Do you think it is a good idea to add positional
information to the Term class? With this information (a start position
and a length), syntax highlighting would be as simple as highlighting
every comparison term and its subterms in a different color, and using
a bold face for the comparison property.
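
To make the proposal more concrete, here is a minimal sketch in plain
C++ of what positional information could look like. Nothing here exists
in Nepomuk: PositionedTerm, Span and collectSpans are hypothetical
names, and the nesting depth is used as a stand-in for the choice of
color.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical term carrying its position in the user's input.
struct PositionedTerm {
    std::size_t start;      // Index of the first character of the term
    std::size_t length;     // Number of characters covered by the term
    std::vector<PositionedTerm> subTerms;
};

// A range of the input to highlight; one color could be used per depth.
struct Span {
    std::size_t start;
    std::size_t length;
    int depth;
};

// Walks a term and its subterms, producing one span per term.
void collectSpans(const PositionedTerm &term, int depth, std::vector<Span> &out)
{
    out.push_back(Span{term.start, term.length, depth});
    for (std::size_t i = 0; i < term.subTerms.size(); ++i) {
        collectSpans(term.subTerms[i], depth + 1, out);
    }
}
```

A highlighter would then only have to walk the resulting spans and
apply a format to each range of the input.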

To end this mail that I have failed to keep short: I will be on vacation
from July 3 to July 13 inclusive. I will have an internet connection, so
I will be able to respond to mails, but I don't think I will have time
to implement new features (I may have time to fix bugs, though). After
my vacation, I will be able to work on the parser without interruption.

Denis Steckelmacher.

[1]: https://github.com/steckdenis/nepomukqueryparser
[2]: http://steckdenis.be/category-nepomuk.html