[Nepomuk] Status report for the Nepomuk query parser (Week 2)
Denis Steckelmacher
steckdenis at yahoo.fr
Mon Jun 24 11:08:24 UTC 2013
Hi,
As my experimental parser has advanced well during the last week, here
is a new status report. If you think one status report per week is too
much, don't hesitate to tell me.
My experimental parser still lives on GitHub[1], but I plan to get a KDE
Git account and to push it to a new branch of nepomuk-core, if you think
that it is a good idea. The parser only depends on Nepomuk Core and the
KDE core libraries (for KCalendarSystem and localization). Its code-base
is fairly self-contained and the parser itself can be made binary
compatible with the current QueryParser class.
During my development, I regularly post progress reports on my blog[2].
This mail presents the two new big features of last week: nested queries
and date-times.
Nested queries allow queries like "mails related to files tagged as
Work" to work. Here, "files tagged as Work" is used to build an AndTerm,
which is then used as the sub-term of the related_to ComparisonTerm. The
nested query is ended at the end of the input or at a terminating
character. For "related to ... ,", the terminating character is the
comma. This allows queries like "files related to mails of Jimmy, having
a size > 2M" to be parsed correctly.
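To illustrate the terminating-character rule, here is a minimal sketch in
plain C++. It is not the code of the parser: NestedSpan and
readNestedQuery are hypothetical names, and the input is assumed to be
already split into words, with the comma still attached to the word it
follows.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result type: the words of the nested query and the index of
// the first word that belongs to the outer query again.
struct NestedSpan {
    std::vector<std::string> words;
    std::size_t next;
};

// Collects words starting at `start` until the input ends or a word ends
// with the terminating character (which is stripped from that word).
NestedSpan readNestedQuery(const std::vector<std::string>& words,
                           std::size_t start, char terminator = ',')
{
    assert(start <= words.size());
    NestedSpan span{{}, start};
    while (span.next < words.size()) {
        std::string word = words[span.next++];
        const bool isLast = !word.empty() && word.back() == terminator;
        if (isLast)
            word.pop_back();            // drop the terminator itself
        if (!word.empty())
            span.words.push_back(word);
        if (isLast)
            break;                      // the outer query resumes here
    }
    return span;
}
```

For "files related to mails of Jimmy, having a size > 2M", starting right
after "related to", the sketch collects "mails of Jimmy" and leaves
"having a size > 2M" to the outer query.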
Parsing date-times is not really mandatory, as the user can be taught to
use standard date-time formats that Qt can parse. But as the philosophy
of the parser is to be as human-friendly as possible, and to stress-test
it a bit, I implemented the parsing of natural date-times.
Date-times are parsed using the same infrastructure as the rest of the
parser. Every simple piece of date or time is recognized by the pattern
matcher (using patterns like "last %1" for "last month" or "last week")
and transformed into ComparisonTerms.
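As a rough illustration of what matching a pattern like "last %1" could
look like, here is a simplified sketch in plain C++. It is an assumption
about the general idea, not the pattern matcher of the parser: splitWords
and matchPattern are hypothetical names, placeholders are captured in
order of appearance (their number is ignored), and each placeholder
matches exactly one word.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Splits a string on whitespace.
std::vector<std::string> splitWords(const std::string& text)
{
    std::istringstream in(text);
    std::vector<std::string> words;
    std::string word;
    while (in >> word)
        words.push_back(word);
    return words;
}

// Tries to match `pattern` against `words` beginning at index `start`.
// Literal pattern words must be equal to the input words; "%N"
// placeholders match any single word, which is appended to `captures`.
bool matchPattern(const std::string& pattern,
                  const std::vector<std::string>& words,
                  std::size_t start,
                  std::vector<std::string>& captures)
{
    assert(start <= words.size());
    const std::vector<std::string> parts = splitWords(pattern);
    if (parts.size() > words.size() - start)
        return false;                   // not enough input left

    std::vector<std::string> found;
    for (std::size_t i = 0; i < parts.size(); ++i) {
        const std::string& part = parts[i];
        const std::string& word = words[start + i];
        if (part.size() >= 2 && part[0] == '%')
            found.push_back(word);      // placeholder: capture the word
        else if (part != word)
            return false;               // literal mismatch
    }
    captures = found;                   // only updated on success
    return true;
}
```

With the input "mails sent last week", matching "last %1" at index 2
succeeds and captures "week"; a later step would then turn that capture
into a comparison term.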
The property of these comparison terms is an internal URI, like
"date://month/value" (for absolute values, like in "in January") or
"date://month/offset" (for relative values, like in "2 months ago"). The
value of the comparison is an integer literal term corresponding to the
month number, day number, etc.
The parser thus produces a list of comparisons. Normal comparisons
(hasTag, author, etc) are kept and date-time comparisons are fused to
literal date-time values. Fusing consists of replacing a set of adjacent
date-time comparisons with a single comparison against a date-time
literal. So, "second week of June" is parsed as "month=June, week=2nd",
which is then replaced with "2013-06-10 00:00:00".
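The fusing step can be sketched as follows, in plain C++ and under
strong simplifying assumptions. Only absolute values are handled, the
URIs "date://year/value" and "date://day/value" are assumed by analogy
with "date://month/value", and week or offset handling is left out
because it needs real calendar arithmetic. Term, Date, Fused and
fuseDateParts are hypothetical names, not the classes of the parser.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Term {                 // simplified stand-in for a comparison
    std::string property;     // e.g. "hasTag" or "date://month/value"
    std::string value;        // literal, kept as text in this sketch
};

struct Date { int year, month, day; };

struct Fused {                // either a plain term or a fused date
    bool isDate;
    Term term;                // meaningful when !isDate
    Date date;                // meaningful when isDate
};

bool isDatePart(const Term& t)
{
    return t.property.compare(0, 7, "date://") == 0;
}

// Keeps normal comparisons and replaces every run of adjacent date parts
// with one date; fields the user did not give come from `base`.
std::vector<Fused> fuseDateParts(const std::vector<Term>& terms, Date base)
{
    assert(base.month >= 1 && base.month <= 12);
    std::vector<Fused> out;
    std::size_t i = 0;
    while (i < terms.size()) {
        if (!isDatePart(terms[i])) {
            Fused plain{false, terms[i], Date{0, 0, 0}};
            out.push_back(plain);
            ++i;
            continue;
        }
        Date d = base;
        for (; i < terms.size() && isDatePart(terms[i]); ++i) {
            const int v = std::stoi(terms[i].value);
            if (terms[i].property == "date://year/value")       d.year = v;
            else if (terms[i].property == "date://month/value") d.month = v;
            else if (terms[i].property == "date://day/value")   d.day = v;
            // other date parts are ignored in this sketch
        }
        Fused fused{true, Term{}, d};
        out.push_back(fused);
    }
    return out;
}
```

Given "hasTag=Work, month=6, day=10, author=Jimmy" and a base date in
2013, the two adjacent date parts collapse into the single date
2013-06-10 while the other two comparisons pass through unchanged.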
These literal date-times can finally be used by other properties, like
"sent on", "created at", etc. With ordinal operators, queries like
"mails sent before last week" are correctly parsed.
The last parsing step that took a bit of time to get right is the
handling of intervals. If I say "mails sent last week", I want to get a
list of mails sent between the first and last days of last week.
Here, my solution is a bit hackish, I admit. The length of the period
depends on the shortest part of the date-time given by the user, so
"last week" is one week long, "in 2011" is one year long, "tomorrow" is
one day long, and finally "at 3:40 pm" is one minute long.
This length is represented by an enumeration (PassDatePeriods::Period).
This enumeration is converted to an integer and stored into the
millisecond part of the date-time. This is crazy, but a date-time that
is not exact to the millisecond remains good enough, and I had to store
the information somewhere, as neither LiteralTerm nor ComparisonTerm
allows me to store this kind of extra information.
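The trick can be sketched like this, in plain C++ instead of Qt. The
date-time is assumed to be a count of milliseconds since the epoch, and
the members of the Period enumeration below are hypothetical (only the
name PassDatePeriods::Period is real); tagWithPeriod, periodOf and
withoutPeriod are made-up helper names.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical members; the real enumeration may differ.
enum class Period : int { Year = 0, Month, Week, Day, Hour, Minute };

// Overwrites the millisecond part (0..999) with the period value.
std::int64_t tagWithPeriod(std::int64_t msSinceEpoch, Period period)
{
    assert(msSinceEpoch >= 0);          // keeps the % below simple
    return msSinceEpoch - (msSinceEpoch % 1000) + static_cast<int>(period);
}

// Reads the period back from the millisecond part.
Period periodOf(std::int64_t tagged)
{
    assert(tagged >= 0);
    return static_cast<Period>(tagged % 1000);
}

// Returns the date-time rounded down to the second, without the tag.
std::int64_t withoutPeriod(std::int64_t tagged)
{
    assert(tagged >= 0);
    return tagged - (tagged % 1000);
}
```

The real millisecond value is lost, which is the price of the hack: the
date-time stays exact to the second only.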
When an equality comparison against a date-time is encountered, it is
replaced by a comparison against an interval. So,
"nmo:receivedDate=2013-06-23" is replaced with
"nmo:receivedDate>=2013-06-23 AND nmo:receivedDate<2013-06-24" for a
period that is one day long.
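Here is a minimal sketch of this rewriting in plain C++, for periods of
fixed length only (months and years need calendar arithmetic and are
left out). Date-times are assumed to be seconds since the epoch, and
Interval, toInterval and toQuery are hypothetical names; the output
string only mimics the notation used in this mail.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

const std::int64_t kSecondsPerMinute = 60;
const std::int64_t kSecondsPerDay = 86400;
const std::int64_t kSecondsPerWeek = 7 * kSecondsPerDay;

// Half-open interval [begin, end), in seconds since the epoch.
struct Interval {
    std::int64_t begin;
    std::int64_t end;
};

// Turns "equal to `start`" into an interval of the given length.
Interval toInterval(std::int64_t start, std::int64_t lengthInSeconds)
{
    assert(lengthInSeconds > 0);
    return Interval{start, start + lengthInSeconds};
}

// Renders the two comparisons that replace the equality.
std::string toQuery(const std::string& property, const Interval& interval)
{
    assert(interval.begin < interval.end);
    return property + ">=" + std::to_string(interval.begin)
         + " AND " + property + "<" + std::to_string(interval.end);
}
```

With start = 1371945600 (2013-06-23 00:00:00 UTC) and a length of one
day, the upper bound is 1372032000, which is 2013-06-24 00:00:00 UTC.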
The parser is currently made of 15 C++ files (11 parsing passes, the
parser, the pattern matcher, utility methods and a main.cpp to test
everything), totaling 1,174 SLOC. This is fairly big but remains
reasonable, and nearly every parsing feature is here.
This week, I plan to start looking at the syntax highlighting and
auto-completion. Do you think it is a good idea to add positional
information into the Term class? (with this information, a start
position and a length, syntax highlighting would be as simple as
highlighting every comparison term and its subterms in a different
color, and using a bold face for the comparison property).
To end this mail that I have failed to keep short: I will be on vacation
from July 3 to July 13 inclusive. I will have an internet connection, so I
will be able to respond to mails, but I don't think I will have time to
implement new features (I may have time to fix bugs, though). After my
vacation, I will be able to work on the parser without interruption.
Denis Steckelmacher.
[1]: https://github.com/steckdenis/nepomukqueryparser
[2]: http://steckdenis.be/category-nepomuk.html