Parsing a user-entered localized datetime
Denis Steckelmacher
dsteckel at ulb.ac.be
Thu Apr 11 19:19:41 BST 2013
On 11/04/2013 19:17, John Layt wrote :
>
> We do support a "FancyDate" parsing style in KLocale::readDate(), but
> it is very limited to things like "Yesterday" and "Monday". There are
> no plans to extend our fancy date support at this time as it would be
> very hard to get right in a generic way, besides which kdelibs is
> frozen until KF5. In the future (Qt5/KF5) we may move localization to
> using ICU, which doesn't offer any such feature, so we would need a new
> class for this.
>
> A new class for parsing "Relative Dates" separate from the existing
> date parsing code would make the most sense. This would just take
> strings and guess a rough time period. I do think it will be very
> hard writing generic code that works for every language that we
> support, you should talk to the translators about this, especially
> Chusselove. I know the Fuzzy Clock tried hard to find a way to output
> dates in a similar way but it ended up requiring lots of manual work
> for each new language.
>
> As Kevin mentions, we store our default locale settings in the
> entry.desktop files at
> http://quickgit.kde.org/?p=kde-runtime.git&a=tree&f=l10n [1] . You
> can have a default value for a setting that is used by all languages,
> but then also language specific versions of each setting if needed.
> Alternatively you can use the standard i18n() calls.
>
> Good luck :-)
>
> John.

I've looked at KCalendarSystem and it seems that every calendar system
is built around some sort of days, months and years. That simplifies
things a bit: it would have been difficult to handle things like "two
seasons ago" in special calendar systems.

I like your idea of a dedicated "relative date" class. In fact, I
thought about a HumanDateParser class that reads locale-specific parser
rules and uses them to parse strings. I imagined the rules stored in XML
files, as they are very easy to read using Qt and something richer than
i18nc calls is required, unless you want translators to have to
translate things like
"day(s)[1],week(s)[7],month(s)[31:(January,...)".

Yesterday, I tried to note down the strings that I think a parser should
be able to parse. If <period> is any of the words day, week, month, year
or their plural forms, and <day of week> is the name of a day of the
week, it should be feasible to parse "<number> <period> ago" (3 weeks
ago), "next <period>" (next week), "last <period>|<day of week>" (last
week, last year, last Monday), or something fancier like "first Thursday
of May". Shortcuts can be given, for instance "tomorrow". I don't know
if these rules have to be regular expressions, as some languages may
separate words differently or use complex expression rules.
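
To make the patterns above concrete, here is a minimal English-only
sketch in Python. Everything in it is an assumption made for
illustration: the names (parse_relative, PERIOD_DAYS, SHORTCUTS), the use
of regular expressions, and the rough period lengths. It is not the
proposed HumanDateParser, and "first Thursday of May" is deliberately
left unhandled.

```python
import re
from datetime import date, timedelta

# Hypothetical English rule data; a real parser would load this per locale.
PERIOD_DAYS = {"day": 1, "week": 7, "month": 30, "year": 365}  # rough lengths
WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]
SHORTCUTS = {"yesterday": -1, "today": 0, "tomorrow": 1}


def parse_relative(text, today):
    """Return a date for a few English relative expressions, else None."""
    s = text.strip().lower()

    # Shortcuts such as "tomorrow".
    if s in SHORTCUTS:
        return today + timedelta(days=SHORTCUTS[s])

    # "<number> <period> ago", e.g. "3 weeks ago".
    m = re.fullmatch(r"(\d+) (day|week|month|year)s? ago", s)
    if m:
        days = int(m.group(1)) * PERIOD_DAYS[m.group(2)]
        return today - timedelta(days=days)

    # "next <period>" and "last <period>".
    m = re.fullmatch(r"(next|last) (day|week|month|year)", s)
    if m:
        sign = 1 if m.group(1) == "next" else -1
        return today + timedelta(days=sign * PERIOD_DAYS[m.group(2)])

    # "last <day of week>": the most recent such day strictly before today.
    m = re.fullmatch(r"last (" + "|".join(WEEKDAYS) + ")", s)
    if m:
        back = (today.weekday() - WEEKDAYS.index(m.group(1))) % 7 or 7
        return today - timedelta(days=back)

    return None
```

With today set to Thursday 11 April 2013, "3 weeks ago" gives 21 March
2013 and "last Monday" gives 8 April 2013. The hard-coded word order is
exactly the part that would not carry over to other languages.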

The parser rules will list the expressions recognized for a given
language in a given calendar system, and provide parsing clues. For
instance, some phrases typically refer to a future event (next Friday,
or even "in May"), while others can be understood as referring to the
past or to the future, depending on the application's context (Dolphin
is used to search for files that exist, not files that will exist in two
weeks).
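
One way to picture such a context clue is a flag supplied by the
application. The sketch below assumes, purely for illustration, that a
bare weekday name is one of the ambiguous expressions; resolve_weekday
and prefer_past are hypothetical names, not an existing KDE API.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def resolve_weekday(name, today, prefer_past):
    """Resolve a bare weekday name using an application-supplied hint."""
    target = WEEKDAYS.index(name.lower())
    if prefer_past:
        # A file search looks backwards: most recent such day, today included.
        delta = -((today.weekday() - target) % 7)
    else:
        # A scheduler looks forwards: next such day, today included.
        delta = (target - today.weekday()) % 7
    return today + timedelta(days=delta)
```

On Thursday 11 April 2013, "Monday" resolves to 8 April with
prefer_past=True and to 15 April with prefer_past=False.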

Finally, parsing would consist of finding parts of the string that match
a rule. The first match would be taken. When a date has been found, its
matching portion of the string is removed and a time is looked for. I
hope this makes it possible to parse strings like "Last Monday on 8 pm"
without having to worry about the word "on", which every user will place
differently or replace with a comma or something else.
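
The match, remove, then look-for-a-time procedure can be sketched as
follows. This is a small Python illustration under stated assumptions:
the two regular expressions (DATE_RULE, TIME_RULE) and the function name
parse_datetime are made up for the example, only a handful of English
date expressions are covered, and a missing time defaults to midnight.

```python
import re
from datetime import date, datetime, time, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]
SHORTCUTS = {"yesterday": -1, "today": 0, "tomorrow": 1}

# Hypothetical rules: one for dates, one for 12-hour clock times.
DATE_RULE = re.compile(
    r"\b(?:last (" + "|".join(WEEKDAYS) + r")|(yesterday|today|tomorrow))\b",
    re.IGNORECASE)
TIME_RULE = re.compile(r"\b(\d{1,2})(?::([0-5]\d))? ?(am|pm)\b", re.IGNORECASE)


def parse_datetime(text, today):
    """Find a date anywhere in text, drop it, then look for a time."""
    m = DATE_RULE.search(text)  # leftmost match is taken
    if m is None:
        return None
    if m.group(1):
        back = (today.weekday() - WEEKDAYS.index(m.group(1).lower())) % 7 or 7
        day = today - timedelta(days=back)
    else:
        day = today + timedelta(days=SHORTCUTS[m.group(2).lower()])

    # Remove the matched date portion; filler words such as "on" are ignored.
    rest = text[:m.start()] + " " + text[m.end():]
    t = TIME_RULE.search(rest)
    if t is None:
        return datetime.combine(day, time(0, 0))
    hour = int(t.group(1)) % 12 + (12 if t.group(3).lower() == "pm" else 0)
    return datetime.combine(day, time(hour, int(t.group(2) or 0)))
```

With today set to 11 April 2013, both "Last Monday on 8 pm" and "Last
Monday, 8 pm" yield 8 April 2013 at 20:00, since whatever sits between
the date and the time is never inspected.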
Denis Steckelmacher.

(On a side note, I have already written a parser that matches only parts
of human-written content. It extracted quantity information from strings
like "2 bottles of 1 l of milk" and was able to guess nearly 90% of the
quantities. People tend to write valuable information in recognizable
ways, even if there are words in between. For instance, the "2" in my
example is the only number not followed by a unit, and "1 l" can only
mean "one liter". So, the algorithm found 2 x 1 liter.)
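
The heuristic described in this side note (a number followed by a unit
is a size, a number without a unit is a count) might look roughly like
the Python sketch below. It is a reconstruction for illustration only,
not the parser mentioned above; the unit table and the name
extract_quantity are assumptions.

```python
import re

# Hypothetical unit table; longer abbreviations are listed first.
UNITS = {"kg": "kilogram", "l": "liter", "g": "gram"}
NUMBER = re.compile(r"(\d+(?:\.\d+)?)(?:\s*(kg|l|g)\b)?")


def extract_quantity(text):
    """Return (count, amount, unit) guessed from free-form text."""
    count, amount, unit = 1, None, None
    for m in NUMBER.finditer(text.lower()):
        if m.group(2):
            # Number followed by a known unit: the size of one item.
            amount, unit = float(m.group(1)), UNITS[m.group(2)]
        else:
            # Number without a unit: how many items.
            count = int(float(m.group(1)))
    return count, amount, unit
```

For "2 bottles of 1 l of milk" this returns (2, 1.0, "liter"); the words
between the numbers play no role.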