Return-Path: tim.one@comcast.net
Delivery-Date: Mon Sep 9 04:36:00 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 08 Sep 2002 23:36:00 -0400
Subject: [Spambayes] testing results
In-Reply-To: <20020909012051.GD27510@glacier.arctrix.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCCEAMBDAB.tim.one@comcast.net>
[Neil Schemenauer]
> Woops. I didn't have the summary files so I regenerated them using a
> slightly different set of data. Here are the results of enabling the
> "received" header processing:
>
> false positive percentages
> 0.707 0.530 won -25.04%
> 0.873 0.524 won -39.98%
> 0.301 0.301 tied
> 1.047 1.047 tied
> 0.602 0.452 won -24.92%
> 0.353 0.177 won -49.86%
>
> won 4 times
> tied 2 times
> lost 0 times
>
> total unique fp went from 17 to 14 won -17.65%
>
> false negative percentages
> 2.167 1.238 won -42.87%
> 0.969 0.969 tied
> 1.887 1.372 won -27.29%
> 1.616 1.292 won -20.05%
> 1.029 0.858 won -16.62%
> 1.548 1.548 tied
>
> won 4 times
> tied 2 times
> lost 0 times
>
> total unique fn went from 50 to 38 won -24.00%
>
> My test set is different than Tim's in that all the email was received
> by the same account. Also, my set contains email sent to me, not to
> mailing lists (I use different addresses for mailing lists).
Enabling the Received headers works even better for me <wink>; here's the
f-n section from a quick run on 500-element subsets:
0.600 0.200 won -66.67%
0.200 0.200 tied
0.200 0.000 won -100.00%
0.800 0.400 won -50.00%
0.400 0.200 won -50.00%
0.400 0.000 won -100.00%
0.200 0.000 won -100.00%
1.000 0.400 won -60.00%
0.800 0.200 won -75.00%
1.200 0.600 won -50.00%
0.400 0.200 won -50.00%
2.000 0.800 won -60.00%
0.400 0.400 tied
1.200 0.600 won -50.00%
0.400 0.000 won -100.00%
2.000 1.000 won -50.00%
0.400 0.000 won -100.00%
0.800 0.000 won -100.00%
0.000 0.200 lost +(was 0)
0.400 0.000 won -100.00%
won 17 times
tied 2 times
lost 1 times
total unique fn went from 38 to 15 won -60.53%
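For anyone who wants to check the arithmetic in these listings, here is a minimal sketch of how each "won/tied/lost" line can be recomputed from a before/after pair. The names compare and FN_PAIRS are made up for illustration; this is not the comparison script that produced the output above, just the same relative-change calculation.

```python
from collections import Counter

# Before/after false-negative percentages copied from the 20 runs above.
FN_PAIRS = [
    (0.600, 0.200), (0.200, 0.200), (0.200, 0.000), (0.800, 0.400),
    (0.400, 0.200), (0.400, 0.000), (0.200, 0.000), (1.000, 0.400),
    (0.800, 0.200), (1.200, 0.600), (0.400, 0.200), (2.000, 0.800),
    (0.400, 0.400), (1.200, 0.600), (0.400, 0.000), (2.000, 1.000),
    (0.400, 0.000), (0.800, 0.000), (0.000, 0.200), (0.400, 0.000),
]


def compare(before, after):
    """Label one before/after pair the way the listing reads (hypothetical helper)."""
    if after == before:
        return "tied"
    tag = "won" if after < before else "lost"
    if before == 0:
        # A relative change from zero is undefined, so just flag it.
        return f"{tag} +(was 0)"
    change = (after - before) / before * 100.0
    return f"{tag} {change:+.2f}%"


# Tally outcomes by the first word of each label.
tally = Counter(compare(b, a).split()[0] for b, a in FN_PAIRS)
```

The same function reproduces the summary line: compare(38, 15) gives "won -60.53%".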
A huge improvement, but for wrong reasons ... except not entirely! The most
powerful discriminator in the whole database on one training set became:
'received:unknown' 881 0.99
That's got nothing to do with BruceG, right?
'received:bfsmedia.com'
was also a strong spam indicator across all training sets. I'm jealous.
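To make it concrete how a token like 'received:unknown' can come about, here is a rough sketch that pulls host names out of Received headers using only the standard library. The function received_tokens, the HOST_RE pattern, and the sample message are all assumptions for illustration; the real tokenizer's rules may well differ.

```python
import re
from email import message_from_string

# Assumption: a crude pattern for the host name following "from" or "by".
HOST_RE = re.compile(r"\b(?:from|by)\s+([A-Za-z0-9.-]+)", re.IGNORECASE)

# Made-up sample message; the addresses are reserved documentation values.
SAMPLE = (
    "Received: from unknown (HELO mail.example.com) (192.0.2.1)\n"
    "\tby mx.example.org with SMTP; Mon, 9 Sep 2002 04:36:00 -0400\n"
    "Subject: hi\n"
    "\n"
    "body\n"
)


def received_tokens(raw_message):
    """Yield one 'received:<host>' token per host found in Received headers."""
    msg = message_from_string(raw_message)
    for header in msg.get_all("Received", []):
        for host in HOST_RE.findall(header):
            yield "received:" + host.lower()


tokens = list(received_tokens(SAMPLE))
```

On the sample, tokens comes out as ['received:unknown', 'received:mx.example.org']. A relay that fails reverse DNS shows up as "unknown", which is presumably why that token is such a strong clue.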
> If people cook up more ideas I will be happy to test them.
Neil, are you using your own tokenizer now, or the tokenizer.Tokenizer.tokenize
generator? Whichever, someone who's not afraid of their headers should try
adding mboxtest.MyTokenizer.tokenize_headers into the mix, once in lieu of
tokenizer.Tokenizer.tokenize_headers(), and again in addition to it. Jeremy
reported on just the former.
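The two experiments (one header tokenizer in lieu of the other, then both together) could be wired up roughly as below. Both token generators here are stand-ins invented for the sketch, not the actual tokenizer.Tokenizer.tokenize_headers or mboxtest.MyTokenizer.tokenize_headers code, and the mode names are likewise made up.

```python
from itertools import chain


def base_header_tokens(headers):
    """Stand-in for the stock header tokenizer: Subject words only."""
    for name, value in headers:
        if name.lower() == "subject":
            for word in value.lower().split():
                yield "subject:" + word


def extra_header_tokens(headers):
    """Stand-in for the alternative tokenizer: one token per header name."""
    for name, _value in headers:
        yield "header:" + name.lower()


def tokenize_headers(headers, mode):
    """Run the alternative tokenizer in lieu of, or in addition to, the stock one."""
    if mode == "replace":
        return list(extra_header_tokens(headers))
    if mode == "combine":
        return list(chain(base_header_tokens(headers),
                          extra_header_tokens(headers)))
    raise ValueError("mode must be 'replace' or 'combine'")


# Made-up header list for trying both modes.
headers = [("Subject", "testing results"), ("Received", "from unknown")]
replaced = tokenize_headers(headers, "replace")
combined = tokenize_headers(headers, "combine")
```

Running the same test suite once per mode keeps the two results directly comparable.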