[dot] Interview: MarkMail Indexes KDE Mailinglist Archives

Tue Aug 5 22:03:51 CEST 2008

URL: http://dot.kde.org/1217966578/

From: Jos Poortvliet <jospoortvliet at gmail.com>
Dept: deep-diggin'
Date: Tuesday 05/Aug/2008, @13:02

Interview: MarkMail Indexes KDE Mailinglist Archives
====================================================

   Several weeks ago MarkMail [http://www.markmail.com], a project
sponsored and run by Mark Logic [http://www.marklogic.com], started
indexing the KDE mailinglist archives. After about a week of hard work,
the KDE archives are now directly searchable [http://kde.markmail.org]
from MarkMail [http://www.markmail.com]. Besides interesting analytics,
this brings some powerful search capabilities to the table. Read on for
a short interview with Jason Hunter, who was responsible for engineering
on the project.

     Hi Jason! Could you give a little introduction of yourself and Mark
Logic [http://www.marklogic.com]?

     Hi, KDE! I'm a Silicon Valley hacker. I've been working at Mark
Logic [http://www.marklogic.com] for about 5 years now, since the days
it was an early startup. We sell MarkLogic Server
[http://www.marklogic.com/product/marklogic-server.html], a
special-purpose database built for content (where "content" is the stuff
that's textual, hierarchical, irregular, and not often regularly
repeating - like books, articles, and presentations). We use XML as our
native data type instead of tables, and pride ourselves on performing
very well at high scale.

     Until about a year ago I worked with our customers to help them
write content apps. I had the idea that we could use the core server to
build a public email archive repository, using some of the product
features to push the envelope of what people had done before with email
archives. That's where MarkMail [http://www.markmail.com] came from. We
started with 4,000,000 emails from the Apache Software Foundation
[http://www.apache.org/] mailing lists.

     I've been involved with open source for a long time, leading JDOM
and participating as a member of the Apache Software Foundation, so it
felt natural to put MarkMail to work initially on the problem of getting
more value from open source mailing lists.

[Image: Konqueror showing MarkMail's search results]

     Why did you decide to grab the KDE mailinglists?

     Cornelius Schumacher [http://behindkde.org/people/cornelius/]
started the ball rolling when he asked if we could load the KDE lists.
OK, that's not quite true. We have a long list of communities whose
lists we hope to load, and KDE was actually on that list since the very
beginning. It's just that one day in April we heard from Cornelius, and
the next day received a separate request from Adriaan de Groot
[http://behindkde.org/people/ade/]. That popped KDE to the top of the
priority list.

     The KDE mailinglists aren't the largest you have at MarkMail, but
they sure aren't small. Did that pose any problems?

     Yes, KDE is Big. At current count there are 2.7 million KDE emails.
Hosting those emails isn't an issue (we're designed to scale to hundreds
of millions) but we had to work hard to gather clean historical
archives. We have one person on the MarkMail team dedicated only to this
(we like to call him an email archaeologist, though I'm not sure he's
happy about that nickname).

     Why the challenge? Well, the most authoritative archives for KDE
were the web-based Pipermail [http://mail.gnu.org/pipermail/] archives
(I'm using past tense because I'd like to think that today the most
authoritative archives are in MarkMail). Pipermail exposes a set of
"mbox" files for each archived list. Very handy. The mbox file format is
a classic storage format for email and a format from which we can
readily load. But as we found out, the mbox files aren't really mbox and
there was a lot of post-processing we had to do. Some examples:

    * Pipermail "scrubs" attachments from its mbox files. Instead of
      placing the attachment content into the message as normal, it gets
      placed at an external URL with a marker in the message dictating
      where you can find it. We had to recognize the scrubbed
      references, fetch the attachments, and then inline the contents.
      Sounds simple, doesn't it? It probably would be if the external
      links were always accurate. Sometimes we could guess and fix
      things and sometimes we couldn't - bonus points go to anyone who
      finds an email in MarkMail mentioning an attachment that doesn't
      really exist. Extra bonus points if you know our search syntax
      well enough to write a query that directly lists those emails.
    * Then there's the problem with character encodings in old emails.
      If you look at an mbox file it seems like ASCII, but in fact it's
      a binary file. That's because each message may have a different
      character encoding for its body (or even portion of the body). The
      Pipermail list archiver didn't always realize this, and fixing
      that was non-trivial and imperfect.
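
     As a rough illustration of the first step in the list above
(recognizing scrubbed references), here is a minimal Python sketch. It
is not MarkMail's code. The marker layout it assumes (a line ending in
"was scrubbed..." followed by Name/Type/Size/Desc/URL fields) and the
function name find_scrubbed_attachments are assumptions made for this
example; real archives may deviate from that layout.

```python
import re

# Assumed field layout of a scrubbed-attachment notice; real archives may differ.
FIELD_RE = re.compile(r"^(Name|Type|Size|Desc|URL):\s*(.*)$")


def find_scrubbed_attachments(body: str) -> list:
    """Return one dict of marker fields per scrubbed-attachment notice."""
    found = []
    current = None
    for line in body.splitlines():
        if line.rstrip().endswith("was scrubbed..."):
            # Start of a new notice: collect the fields that follow it.
            current = {}
            found.append(current)
            continue
        if current is not None:
            match = FIELD_RE.match(line)
            if match:
                key, value = match.groups()
                # URLs are assumed to be wrapped in angle brackets.
                current[key.lower()] = value.strip().strip("<>")
            else:
                current = None  # the notice has ended
    return found
```

     Fetching each URL and inlining the content would follow. As the
interview notes, the links are not always accurate, so a real pipeline
has to cope with fetches that fail.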

     There are more examples, but I don't need to bore you. I should
make clear it's nothing special with KDE or even with Pipermail. Turns
out if you load a couple million emails you'll see at least one example
of almost every problem that's ever existed. It's the same for every
community, just with different challenges.
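
     The character-encoding problem described above can be sketched in
Python as well. This is only an illustration under stated assumptions,
not the approach MarkMail took: the helper names decode_part and
message_text are hypothetical, and the fallback order (declared charset,
then UTF-8, then Latin-1) is one plausible choice among many.

```python
import email
from typing import Optional


def decode_part(payload: bytes, declared_charset: Optional[str]) -> str:
    """Decode raw bytes, trusting the declared charset first."""
    for charset in (declared_charset, "utf-8"):
        if charset:
            try:
                return payload.decode(charset)
            except (LookupError, UnicodeDecodeError):
                pass  # unknown or wrong charset: try the next candidate
    # Latin-1 maps every byte to a character, so this never fails,
    # but the result may not be what the sender wrote.
    return payload.decode("latin-1")


def message_text(raw: bytes) -> str:
    """Join the text parts of one raw message, decoding each separately."""
    msg = email.message_from_bytes(raw)
    chunks = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True)  # undoes base64 / quoted-printable
        if payload is not None:
            chunks.append(decode_part(payload, part.get_content_charset()))
    return "\n".join(chunks)
```

     Each part is decoded on its own because, as noted above, one
message can mix several encodings. The final fallback is why such a
repair stays imperfect.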

[Image: Graphically drilling down to a specific date]

     You mentioned pushing the envelope. Can you give an example of
that?

     Sure, here's a good example: When you do a search, besides getting
the top 10 most relevant emails, you see lots of analytics. You see a
histogram chart showing the number of messages matching your query each
month across time. With it you can watch trends for lists, people,
ideas, or any combination. Every query also shows the top senders,
lists, attachment types, and message types for the messages matching the
query. You can learn who's an expert on a topic, on what lists something
is being discussed, which people are most involved on lists, and so on.
By dragging across bars on the graph you can limit the view to just a
particular time period. You can also click on any person's name or list
name to limit the search. It's convenient to start with a simple query
and refine interactively.
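
     To make the analytics above concrete, here is a small in-memory
Python sketch of the same kinds of counts: messages per month, top
senders, top lists, and a restriction to a time period. It says nothing
about how MarkLogic Server computes them at scale. The record fields
("sender", "list", "date") and the function names are assumptions made
for the example.

```python
from collections import Counter
from datetime import datetime


def facet_counts(matches, top_n=3):
    """Summarize the messages matching a query (hypothetical record shape)."""
    matches = list(matches)
    per_month = Counter(m["date"].strftime("%Y-%m") for m in matches)
    return {
        # Histogram data: number of matching messages per month, in order.
        "per_month": dict(sorted(per_month.items())),
        "top_senders": Counter(m["sender"] for m in matches).most_common(top_n),
        "top_lists": Counter(m["list"] for m in matches).most_common(top_n),
    }


def restrict_to_period(matches, start: datetime, end: datetime):
    """Keep only messages dated in [start, end), like dragging across bars."""
    return [m for m in matches if start <= m["date"] < end]
```

     Refining interactively then amounts to filtering the matches and
recomputing the same counts on the smaller set.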

     We've also strived to make the site easy to navigate. You can hit
"n" and "p" to go to the next and previous search results. To move up
and down the thread view you hit "j" and "k" (a homage to vim users). If
you find an attachment (search for ext:pdf
[http://kde.markmail.org/search/?q=ext%3Apdf]) you can view it inline in
your browser.

     Oh, and here's a little-known tip. If your screen is sufficiently
wide, we give you all three panes (analytics, results, messages) at
once. If not, you get the "slide".

     Do you have any tips for the KDE community to take advantage of the
available capabilities in MarkMail?

     The first thing to remember is that you can limit your view to
KDE-related mails by going to kde.markmail.org
[http://kde.markmail.org]. The use of a subdomain adds an implicit
constraint to all your queries.

     Another is that you can do negations. For example, KDE has a huge
number of automated emails generated by bug reports and code check-ins.
You can search without those by adding -type:bugs -type:checkins to your
query [http://kde.markmail.org/search/?q=-type%3Abugs+-type%3Acheckins].
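
     The search links quoted in this interview all share one shape: a
community subdomain plus a URL-encoded q parameter. The Python sketch
below builds such links. The helper name search_url and its parameters
are hypothetical, and only the URL pattern visible in the links above is
assumed.

```python
from urllib.parse import urlencode


def search_url(terms=(), exclude_types=(), subdomain="kde"):
    """Build a search link following the URL pattern quoted in the interview."""
    # Negations are written as -type:NAME, as described in the text.
    parts = list(terms) + [f"-type:{name}" for name in exclude_types]
    query = urlencode({"q": " ".join(parts)})
    return f"http://{subdomain}.markmail.org/search/?{query}"


print(search_url(exclude_types=["bugs", "checkins"]))
print(search_url(terms=["ext:pdf"]))
```

     The two printed links are the same ones that appear in the
interview text: the colon is encoded as %3A and the space between terms
as a plus sign.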

     Lastly, if there are any other lists people want to see, let us know
at our feedback page [http://markmail.org/docs/feedback.xqy]. You can
track what we're up to at our blog [http://markmail.blogspot.com].

     Thank you very much, Jason, both for the work you and Mark Logic
[http://www.marklogic.com] have been doing, and of course for granting
this interview!