Return-Path: tim.one@comcast.net
Delivery-Date: Mon Sep 9 04:36:00 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 08 Sep 2002 23:36:00 -0400
Subject: [Spambayes] testing results
In-Reply-To: <20020909012051.GD27510@glacier.arctrix.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCCEAMBDAB.tim.one@comcast.net>

[Neil Schemenauer]
> Woops. I didn't have the summary files so I regenerated them using a
> slightly different set of data. Here are the results of enabling the
> "received" header processing:
>
> false positive percentages
>     0.707  0.530  won    -25.04%
>     0.873  0.524  won    -39.98%
>     0.301  0.301  tied
>     1.047  1.047  tied
>     0.602  0.452  won    -24.92%
>     0.353  0.177  won    -49.86%
>
> won   4 times
> tied  2 times
> lost  0 times
>
> total unique fp went from 17 to 14 won -17.65%
>
> false negative percentages
>     2.167  1.238  won    -42.87%
>     0.969  0.969  tied
>     1.887  1.372  won    -27.29%
>     1.616  1.292  won    -20.05%
>     1.029  0.858  won    -16.62%
>     1.548  1.548  tied
>
> won   4 times
> tied  2 times
> lost  0 times
>
> total unique fn went from 50 to 38 won -24.00%
>
> My test set is different than Tim's in that all the email was received
> by the same account. Also, my set contains email sent to me, not to
> mailing lists (I use different addresses for mailing lists).

Enabling the Received headers works even better for me <wink>; here's the
f-n section from a quick run on 500-element subsets:

    0.600  0.200  won    -66.67%
    0.200  0.200  tied
    0.200  0.000  won   -100.00%
    0.800  0.400  won    -50.00%
    0.400  0.200  won    -50.00%
    0.400  0.000  won   -100.00%
    0.200  0.000  won   -100.00%
    1.000  0.400  won    -60.00%
    0.800  0.200  won    -75.00%
    1.200  0.600  won    -50.00%
    0.400  0.200  won    -50.00%
    2.000  0.800  won    -60.00%
    0.400  0.400  tied
    1.200  0.600  won    -50.00%
    0.400  0.000  won   -100.00%
    2.000  1.000  won    -50.00%
    0.400  0.000  won   -100.00%
    0.800  0.000  won   -100.00%
    0.000  0.200  lost   +(was 0)
    0.400  0.000  won   -100.00%

won  17 times
tied  2 times
lost  1 times

total unique fn went from 38 to 15 won -60.53%

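For anyone decoding the comparison lines above, here is a minimal sketch of the arithmetic they appear to follow: the change is (after - before) / before expressed as a percentage, with a special case when the old rate was 0. The function name `compare` and the `rows` list are illustrative assumptions, not the project's actual comparison script; the pairs themselves are the f-n rates from the 500-element run above. If each rate is taken over 500 messages, every 0.200 step corresponds to a single message.

```python
from collections import Counter


def compare(before, after):
    """Return (verdict, change) for two error rates; lower is better."""
    if after == before:
        return "tied", ""
    verdict = "won" if after < before else "lost"
    if before == 0:
        # No percentage can be computed relative to a zero baseline.
        return verdict, "+(was 0)"
    change = (after - before) * 100.0 / before
    return verdict, f"{change:+.2f}%"


# (before, after) f-n rates copied from the 500-element run above.
rows = [
    (0.600, 0.200), (0.200, 0.200), (0.200, 0.000), (0.800, 0.400),
    (0.400, 0.200), (0.400, 0.000), (0.200, 0.000), (1.000, 0.400),
    (0.800, 0.200), (1.200, 0.600), (0.400, 0.200), (2.000, 0.800),
    (0.400, 0.400), (1.200, 0.600), (0.400, 0.000), (2.000, 1.000),
    (0.400, 0.000), (0.800, 0.000), (0.000, 0.200), (0.400, 0.000),
]

# Count how many subsets won, tied, or lost.
tally = Counter(compare(before, after)[0] for before, after in rows)

for before, after in rows:
    verdict, change = compare(before, after)
    print(f"{before:.3f} {after:.3f} {verdict} {change}")
print(dict(tally))
```

The same function applied to the totals, `compare(38, 15)`, gives the summary figure for unique false negatives.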
A huge improvement, but for wrong reasons ... except not entirely! The most
powerful discriminator in the whole database on one training set became:

    'received:unknown' 881 0.99

That's got nothing to do with BruceG, right?

    'received:bfsmedia.com'

was also a strong spam indicator across all training sets. I'm jealous.

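As a rough illustration of where tokens such as 'received:unknown' could come from, the sketch below pulls the host name that follows "from" or "by" in each Received header and prefixes it with "received:". This is a stand-in under stated assumptions, not the tokenizer the project actually uses: the regular expression, the function name `received_tokens`, and the sample message (with example.com and example.org hosts) are all made up for illustration.

```python
import email
import re

# Assumed pattern: the word right after "from" or "by" is treated as a host.
HOST_RE = re.compile(r"\b(?:from|by)\s+([\w.-]+)", re.IGNORECASE)


def received_tokens(msg):
    """Yield one 'received:<host>' token per host found in Received headers."""
    for header in msg.get_all("received", []):
        for host in HOST_RE.findall(header):
            yield "received:" + host.lower()


# Hypothetical message; the hosts are placeholders, not real senders.
raw = (
    "Received: from unknown (HELO mail.example.com) (192.0.2.1)\n"
    "\tby mx.example.org with SMTP; Sun, 08 Sep 2002 23:36:00 -0400\n"
    "Subject: test\n"
    "\n"
    "body\n"
)
msg = email.message_from_string(raw)
tokens = list(received_tokens(msg))
print(tokens)
```

A relay that fails to identify itself shows up as "unknown" in the header, so a token like this can end up shared by a lot of mail from badly configured senders.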
> If people cook up more ideas I will be happy to test them.

Neil, are you using your own tokenizer now, or the
tokenizer.Tokenizer.tokenize generator? Whichever, someone who's not afraid
of their headers should try adding mboxtest.MyTokenizer.tokenize_headers
into the mix, once in lieu of tokenizer.Tokenizer.tokenize_headers(), and
again in addition to it. Jeremy reported on just the former.
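The two experiments suggested above ("in lieu of" versus "in addition to") differ only in which header token streams get fed to the classifier. A minimal sketch of that switch follows; `stock_header_tokens` and `alt_header_tokens` are hypothetical stand-ins that do not reproduce either real tokenizer, and the `mode` names are invented here.

```python
from itertools import chain


# Hypothetical stand-ins: each just yields a recognizable token per header.
def stock_header_tokens(headers):
    for name, _value in headers:
        yield f"stock:{name.lower()}"


def alt_header_tokens(headers):
    for name, value in headers:
        yield f"alt:{name.lower()}:{value.lower()}"


def header_tokens(headers, mode):
    """Select header tokens for one run: 'stock', 'in_lieu', or 'in_addition'."""
    if mode == "stock":
        return stock_header_tokens(headers)
    if mode == "in_lieu":
        # Replace the stock header tokens entirely.
        return alt_header_tokens(headers)
    if mode == "in_addition":
        # Feed both token streams to the classifier.
        return chain(stock_header_tokens(headers), alt_header_tokens(headers))
    raise ValueError(f"unknown mode: {mode!r}")


headers = [("Subject", "testing results")]
for mode in ("stock", "in_lieu", "in_addition"):
    print(mode, list(header_tokens(headers, mode)))
```

Running each mode over the same training and test sets keeps the comparison fair, since only the header tokens change between runs.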