[Kde-pim] CRM114 antispam score display (was: Re: branches/KDE/3.5/kdepim/kmail)

Sun Jul 8 21:33:32 BST 2007

[Let's continue the thread on kmail-devel only.]

On Sunday 08 July 2007 19:07, Martin Steigerwald wrote:
> On Sunday 03 June 2007, Ingo Klöcker wrote:
> > On Saturday 02 June 2007 01:18, Martin Steigerwald wrote:
> > > On Monday 28 May 2007, Ingo Klöcker wrote:
> > > Headers look like this:
> > >
> > > ---------------------------------------------------------------------
> > > martin at shambala:~/Mail> grep -ir "X-CRM114-Status:" * | cut -d":"
> > > -f3,4 | grep SPAM
> > > X-CRM114-Status: SPAM  ( -43.62  )
> > > X-CRM114-Status: SPAM  ( -17.78  )
> > > X-CRM114-Status: SPAM  ( -61.96  )
> > > X-CRM114-Status: SPAM  ( -15.03  )
> > >
> > > martin at shambala:~/Mail> grep -ir "X-CRM114-Status:" * | cut -d":"
> > > -f3,4 | grep GOOD | head -10
> > > X-CRM114-Status: GOOD (  11.09  )
> > > X-CRM114-Status: GOOD ( 304.35  )
> > > X-CRM114-Status: GOOD (  81.34  )
>
> [...]
>
> > > martin at shambala:~/Mail> grep -ir "X-CRM114-Status:" * | cut -d":"
> > > -f3,4 | grep UNSURE | head -10
> > > X-CRM114-Status: UNSURE (  -1.80  )
> > > X-CRM114-Status: UNSURE (   3.46  )
> > > X-CRM114-Status: UNSURE (   3.68  )
> > > X-CRM114-Status: UNSURE (   9.94  )
>
> [...]
>
> > It seems we have to introduce yet another score type since with
> > CRM114 spam has large negative scores while ham has large positive
> > scores.
>
> Well yes. Maybe something general where you can specify the complete
> score range and the necessary thresholds would be suitable.
>
> ScoreRange=-400,400
> ScoreUnsureThreshold=-10
> ScoreGoodThreshold=10
>
> Or just one range for each of those?

Actually, I would prefer a function mapping the scores to the interval
[0, 1] where 0 or below means "most likely not spam" and 1 or above
means "most likely spam". Multiplying the score by a factor and then
adding a value (i.e. an affine transformation) will hopefully suffice
in all cases. So we would calculate

  double normalizedScore( double score )
  {
    return ScoreScalingFactor * score + ScoreTranslation;
  }

Assuming that for CRM114 anything below -10 means "spam" and anything 
above 10 means "ham" we would use

  ScoreScalingFactor=-0.05 // == - 1/20
  ScoreTranslation=0.5

and we would get

  normalizedScore(-10) == -0.05 * (-10) + 0.5 == 1
  normalizedScore(0) == 0.5
  normalizedScore(10) == -0.05 * 10 + 0.5 == 0

So, as desired, -10 (spam) would map to 1 (spam) and 10 (ham) would map 
to 0 (ham).
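
The two parameters need not be guessed; they follow from the two thresholds. A minimal sketch, assuming the ham threshold should map to 0 and the spam threshold to 1 (the names AffineMap and affineMapFromThresholds are made up for illustration and do not exist in KMail):

```cpp
#include <cassert>

// Hypothetical helper type: the two parameters of the affine map
//   normalized = factor * score + translation
struct AffineMap {
  double factor;       // corresponds to ScoreScalingFactor
  double translation;  // corresponds to ScoreTranslation

  double apply( double score ) const
  {
    return factor * score + translation;
  }
};

// Computes the map that sends hamThreshold to 0 and spamThreshold to 1.
// Solving  factor * ham  + translation == 0
//     and  factor * spam + translation == 1
// for factor and translation gives the two formulas below.
AffineMap affineMapFromThresholds( double spamThreshold, double hamThreshold )
{
  assert( spamThreshold != hamThreshold ); // otherwise the map is undefined
  const double factor = 1.0 / ( spamThreshold - hamThreshold );
  return AffineMap{ factor, -hamThreshold * factor };
}
```

With the CRM114 thresholds assumed above, affineMapFromThresholds( -10, 10 ) yields a factor of -0.05 and a translation of 0.5, i.e. the values given above. The same helper would also cover tools where higher scores mean spam, since the sign of the factor follows from the order of the thresholds.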

> > > From what I understand, I at least need to know the exact threshold
> > > at which CRM114 classifies a mail as SPAM?
> >
> > Yes.
>
> I will ask on the crm114-general mailing list about that one. CRM114
> does not seem to specify the threshold in its headers, and depending on
> the classifier one uses the thresholds may vary. Maybe it would be
> good if CRM114 put the thresholds for SPAM and UNSURE into the headers
> somehow.
>
> > > Ingo, Andreas what about mails classified as UNSURE? Does spam
> > > score display in KMail support those?
> >
> > Well, I guess for scores corresponding to UNSURE the color bar
> > should be partially filled. For ham the color bar should be empty
> > and for spam it should be completely filled.
>
> Actually I do not quite understand the spam score display
> completely...

Assuming normalized scores, for ham (0 or below) the color bar is 
completely empty, for spam (1 or above) the color bar is completely 
filled and for anything between 0 and 1 the color bar is partially 
filled.
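
In code this is just a clamp of the normalized score to [0, 1]. A minimal sketch (barFillFraction is a hypothetical name, not an existing KMail function):

```cpp
#include <algorithm>

// Hypothetical helper: fraction of the color bar to fill.
// 0.0 means an empty bar (ham), 1.0 a completely filled bar (spam),
// anything in between a partially filled bar.
double barFillFraction( double normalizedScore )
{
  return std::min( 1.0, std::max( 0.0, normalizedScore ) );
}
```

So a normalized score of -0.3 gives an empty bar, 1.7 a full bar, and 0.5 a half-filled bar.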

> > > I have holidays in the next two weeks. I will have limited
> > > internet access next week, but after that I would really like to
> > > take the time to put together suitable spam score display
> > > configuration statements for KMail, to finally complete the
> > > CRM114 configuration for KMail...
>
> ... well, after facing the difference between theory and practice I
> managed to do at least a minimal spam score display for CRM114. I
> just put in a boolean filter for now[1]:
>
> ScoreName=CRM114
> ScoreHeader=X-CRM114-Status
> ScoreType=Bool
> ScoreValueRegexp=SPAM
> ScoreThresholdRegexp=
>
> But as far as I understand, that's the best that works out of the box
> for now. At least KMail distinguishes between spam and
> ham/unsure mails in the spam score display.

Good.

> But I do not yet get this: when a mail is spam I get a color gradient
> from green through yellow to red displayed. Is that correct behaviour?

Yes.

> I wonder why there is green in there at all when it's spam. When a
> mail is unsure or ham I get a blank box. I would have expected
> something green here ;-).

See above. For unsure you should get a partial gradient, e.g. from green 
to yellow.

> For some mails that were flagged by SpamAssassin I got a partially
> filled box with a partial color gradient, for example the gradient up
> to yellow. I would have interpreted this as UNSURE.

A partially filled box is supposed to signify UNSURE.

> So how does this actually work? Maybe it should be rethought a bit; I
> do not think it's very intuitive. I would use the following:
>
> - a red (SPAM) / yellow (UNSURE) / green (HAM) box for a boolean /
> triplean ;-)
>
> - a red (SPAM) / yellow (UNSURE) / green (HAM) bar that displays the
> amount of spamicity, unsurecity or hamicity. Hmmm, but this might be
> confusing as well. Need to think about this a bit more.

I think at least for continuous scores (i.e. floating point/non-boolean)
it's okay as it is. A bit of green means "more likely ham than spam".
If the gradient extends to yellow it means "might be ham or spam". And
if the gradient extends to red it means "more likely spam than ham".
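
To illustrate how the gradient reads: the following sketch computes the color at a given position along the bar, assuming a plain linear green-yellow-red gradient in RGB. The exact colors and the interpolation are assumptions made for illustration, not taken from the KMail source; Rgb and gradientColorAt are made-up names.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical color type with channels in the range 0..255.
struct Rgb {
  int r, g, b;
};

// Color of the gradient at position t along the bar, t in [0, 1]:
// 0 -> green, 0.5 -> yellow, 1 -> red, linearly interpolated in between.
// Positions outside [0, 1] are clamped.
Rgb gradientColorAt( double t )
{
  t = std::min( 1.0, std::max( 0.0, t ) );
  if ( t <= 0.5 ) {
    // green (0,255,0) to yellow (255,255,0): the red channel rises
    return Rgb{ static_cast<int>( std::lround( 510.0 * t ) ), 255, 0 };
  }
  // yellow (255,255,0) to red (255,0,0): the green channel falls
  return Rgb{ 255, static_cast<int>( std::lround( 510.0 * ( 1.0 - t ) ) ), 0 };
}
```

A bar filled up to a normalized score of 0.5 would then end in yellow, and only a completely filled bar would reach pure red at its right end.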

> Anyway, to support unsure mails in the spam score display some C++
> code needs to be touched, as well as to support a new score type
> for the CRM114 score range. I have not dug into this yet. My last C
> programming experience was years ago, and that wasn't C++, although it
> used an object-oriented GUI framework nonetheless. Well, let's
> see. If I manage to take some more time for this, I will have a look
> at the source code of the antispam stuff and see whether I can make
> sense of it.
>
> If someone wants to help with the C++ part, I would gladly appreciate it.
> And if I have questions when looking at the source I will find
> someone to ask those ;-).

It shouldn't be that difficult to change the code from the current 
implementation to an implementation using the normalization as above. I 
guess this would even simplify the code a bit because after the 
normalization all continuous (i.e. floating point) scores could be 
treated exactly the same.
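
As a rough sketch of what that could look like for CRM114: extract the number from the header value, then apply the affine map. This uses std::regex instead of whatever KMail actually uses for ScoreValueRegexp, and parseCrm114Score is a made-up name; the regular expression is only an assumption based on the header samples quoted earlier in this thread.

```cpp
#include <regex>
#include <string>

// Values from the example above: CRM114 with thresholds at -10 and 10.
const double ScoreScalingFactor = -0.05;
const double ScoreTranslation = 0.5;

double normalizedScore( double score )
{
  return ScoreScalingFactor * score + ScoreTranslation;
}

// Hypothetical helper: extracts the number between the parentheses of a
// header value such as "SPAM  ( -43.62  )". Returns false and leaves
// score untouched if no such number is found.
bool parseCrm114Score( const std::string &headerValue, double &score )
{
  // "(", optional spaces, a decimal number, optional spaces, ")"
  static const std::regex scoreRegex( R"re(\(\s*(-?\d+(?:\.\d+)?)\s*\))re" );
  std::smatch match;
  if ( !std::regex_search( headerValue, match, scoreRegex ) )
    return false;
  score = std::stod( match[1].str() );
  return true;
}
```

With the sample headers from this thread, "SPAM  ( -43.62  )" normalizes to a value above 1, "GOOD ( 304.35  )" to a value below 0, and "UNSURE (   3.46  )" to a value between 0 and 1, so all three cases fall out of the same code path.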

Regards,
Ingo