GeronBook/Ch3/datasets/spam/easy_ham/01719.a401ddc61fc3d89fbaee7...

49 lines
1.5 KiB
Plaintext

Return-Path: tim.one@comcast.net
Delivery-Date: Sun Sep 8 19:28:02 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 08 Sep 2002 14:28:02 -0400
Subject: [Spambayes] testing results
In-Reply-To: <20020908172113.GA26741@glacier.arctrix.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCOEPHBCAB.tim.one@comcast.net>
[Neil Schemenauer]
> These results are from timtest.py. I've got three sets of spam and ham
> with about 500 messages in each set. Here's what happens when I enable
> my latest "received" header code:
If you've still got the summary files, please cvs up and try running cmp.py
again -- in the process of generalizing cmp.py, you managed to make it skip
half the lines <wink>. That is, if you've got N sets, you *should* get
N**2-N pairs for each error rate. You have 3 sets, so you should get 6
pairs of f-n rates and 6 pairs of f-p rates.
> false positive percentages
> 0.187 0.187 tied
> 0.749 0.562 won -24.97%
> 0.780 0.585 won -25.00%
>
> won 2 times
> tied 1 times
> lost 0 times
>
> total unique fp went from 19 to 17
>
> false negative percentages
> 2.072 1.318 won -36.39%
> 2.448 1.318 won -46.16%
> 0.574 0.765 lost +33.28%
>
> won 2 times
> tied 0 times
> lost 1 times
>
> total unique fn went from 43 to 28
Looks promising! Getting 6 lines of output for each block would give a
clearer picture, of course.
> Anthony's header counting code does not seem to help.
It helps my test data too much <wink/sigh>.