Return-Path: anthony@interlink.com.au
Delivery-Date: Tue Sep 10 10:29:19 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Tue, 10 Sep 2002 19:29:19 +1000
Subject: [Spambayes] Current histograms
In-Reply-To: <LNBBLJKPBEHFEDALKOLCGEFBBDAB.tim.one@comcast.net>
Message-ID: <200209100929.g8A9TJV27347@localhost.localdomain>

>>> Tim Peters wrote
> We've not only reduced the f-p and f-n rates in my test runs, we've also
> made the score distributions substantially sharper. This is bad news for
> Greg, because the non-existent "middle ground" is becoming even less
> existent <wink>:

Well, I've finally got around to pulling down the SF code. Starting
with it, and absolutely zero local modifications, I see the following:

Ham distribution for all runs:
* = 589 items
  0.00 35292 ************************************************************
  2.50    36 *
  5.00    21 *
  7.50    12 *
 10.00     6 *
 12.50     9 *
 15.00     6 *
 17.50     3 *
 20.00     8 *
 22.50     5 *
 25.00     3 *
 27.50    18 *
 30.00     9 *
 32.50     1 *
 35.00     4 *
 37.50     3 *
 40.00     0
 42.50     3 *
 45.00     3 *
 47.50     4 *
 50.00     9 *
 52.50     5 *
 55.00     5 *
 57.50     3 *
 60.00     4 *
 62.50     2 *
 65.00     2 *
 67.50     6 *
 70.00     1 *
 72.50     3 *
 75.00     2 *
 77.50     4 *
 80.00     3 *
 82.50     3 *
 85.00     6 *
 87.50     8 *
 90.00     4 *
 92.50     8 *
 95.00    15 *
 97.50   441 *

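The scale line and the star counts in a listing like the one above are consistent with a simple rule: take the largest bucket count, divide by 60 and round up to get the unit, then give every bucket its count divided by that unit, rounded up. What follows is a minimal sketch of that rule, not the project's actual histogram code. The names `WIDTH`, `star_unit`, `bucketize` and `render_histogram`, and the layout of 40 equal buckets over scores from 0 to 1, are assumptions made for illustration.

```python
WIDTH = 60  # assumed maximum number of stars on one row


def star_unit(biggest, width=WIDTH):
    """Items represented by one star: biggest / width rounded up, at least 1."""
    return max(1, (biggest + width - 1) // width)


def bucketize(scores, nbuckets=40):
    """Count scores in [0.0, 1.0] into nbuckets equal-width buckets."""
    buckets = [0] * nbuckets
    for score in scores:
        # A score of exactly 1.0 is folded into the last bucket.
        index = min(int(score * nbuckets), nbuckets - 1)
        buckets[index] += 1
    return buckets


def render_histogram(buckets):
    """Return the text rows (scale line first) for a list of bucket counts."""
    biggest = max(buckets)
    unit = star_unit(biggest)
    ndigits = len(str(biggest))
    rows = ["* = %d items" % unit]
    for i, count in enumerate(buckets):
        label = 100.0 * i / len(buckets)        # lower bucket edge, in percent
        stars = "*" * ((count + unit - 1) // unit)  # zero count gives no stars
        row = "%6.2f %*d %s" % (label, ndigits, count, stars)
        rows.append(row.rstrip())
    return rows
```

Applied to the ham counts above, the largest bucket is 35292, so the unit is 589 (35292 / 60 = 588.2, rounded up), which matches the `* = 589 items` line. Rounding the star count up is why even a bucket holding a single message still shows one star, while an empty bucket shows none.
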
Spam distribution for all runs:
* = 504 items
  0.00   393 *
  2.50    17 *
  5.00    18 *
  7.50    12 *
 10.00     4 *
 12.50     6 *
 15.00    11 *
 17.50    10 *
 20.00    10 *
 22.50     5 *
 25.00     3 *
 27.50    19 *
 30.00     8 *
 32.50     2 *
 35.00     0
 37.50     1 *
 40.00     5 *
 42.50     5 *
 45.00     7 *
 47.50     2 *
 50.00     5 *
 52.50     1 *
 55.00     9 *
 57.50    11 *
 60.00     6 *
 62.50     4 *
 65.00     3 *
 67.50     5 *
 70.00     7 *
 72.50     9 *
 75.00     2 *
 77.50    13 *
 80.00     3 *
 82.50     7 *
 85.00    15 *
 87.50    16 *
 90.00    11 *
 92.50    16 *
 95.00    45 *
 97.50 30226 ************************************************************

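One way to read a pair of listings like these is to pick a cutoff and count what lands on the wrong side of it: ham at or above the cutoff would be false positives, spam below it false negatives. The following is a minimal sketch of that bookkeeping. It assumes both histograms use the same equal-width buckets over 0 to 100 percent and that the cutoff sits on a bucket edge; the function name `error_rates` and any particular cutoff are illustrative and not taken from this thread.

```python
def error_rates(ham_buckets, spam_buckets, cutoff_percent):
    """Return (f-p rate, f-n rate) for a cutoff given in percent.

    Bucket i covers scores from i * width up to (i + 1) * width percent,
    so the cutoff should sit on a bucket edge to avoid splitting a bucket.
    """
    nbuckets = len(ham_buckets)
    if len(spam_buckets) != nbuckets:
        raise ValueError("histograms must have the same number of buckets")
    width = 100.0 / nbuckets
    # Index of the first bucket that counts as "spam" under this cutoff.
    first_spam_bucket = int(round(cutoff_percent / width))
    false_pos = sum(ham_buckets[first_spam_bucket:])   # ham scored as spam
    false_neg = sum(spam_buckets[:first_spam_bucket])  # spam scored as ham
    return false_pos / sum(ham_buckets), false_neg / sum(spam_buckets)
```

For example, on made-up four-bucket counts, `error_rates([90, 5, 3, 2], [1, 2, 7, 90], 50.0)` returns `(0.05, 0.03)`. The sketch does not guard against an empty histogram, where the division would fail.
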
My next (current) task is to complete my corpus. It currently has about
9000 ham, 7800 spam, and about 9200 messages still unsorted. I'm tossing
up between hammie and SpamAssassin for the initial sort. Previously I've
used various forms of 'grep' for keywords and a little GUI thing that pops
a message up and lets me say 'spam/ham', but that's just getting too, too
tedious.

I can't make the corpus available en masse, but I will look at finding
some of the more 'interesting' uglies. One thing I've seen (consider this
'anecdotal' for now) is that the 'skip' tokens end up in a _lot_ of the
f-ps.

Anthony