Return-Path: tim.one@comcast.net
Delivery-Date: Thu Sep 12 01:13:06 2002
From: tim.one@comcast.net (Tim Peters)
Date: Wed, 11 Sep 2002 20:13:06 -0400
Subject: [Spambayes] Current histograms
In-Reply-To: <200209110423.g8B4Ndh07749@localhost.localdomain>
Message-ID: <LNBBLJKPBEHFEDALKOLCKEJBBDAB.tim.one@comcast.net>

[Anthony Baxter]
> 5 sets, each of 1800ham/1550spam, just ran the once (it matched all 5 to
> each other...)
>
> rates.py sez:
>
> Training on Data/Ham/Set1 & Data/Spam/Set1 ... 1798 hams & 1548 spams
>     0.445  0.388
>     0.445  0.323
>     2.108  4.072
>     0.556  1.097
> Training on Data/Ham/Set2 & Data/Spam/Set2 ... 1798 hams & 1546 spams
>     2.113  0.517
>     1.335  0.194
>     3.106  5.365
>     2.113  2.903
> Training on Data/Ham/Set3 & Data/Spam/Set3 ... 1798 hams & 1547 spams
>     2.447  0.646
>     0.945  0.388
>     2.884  3.426
>     2.058  1.097
> Training on Data/Ham/Set4 & Data/Spam/Set4 ... 1803 hams & 1547 spams
>     1.057  2.584
>     0.723  1.682
>     0.890  1.164
>     0.445  0.452
> Training on Data/Ham/Set5 & Data/Spam/Set5 ... 1798 hams & 1550 spams
>     0.779  4.328
>     0.501  3.299
>     0.667  3.361
>     0.388  4.977
> total false pos 273 3.03501945525
> total false neg 367 4.74282760403

How were these msgs broken up into the 5 sets?  Set4 in particular is
giving the other sets severe problems, and Set5 blows the f-n rate on
everything it's predicting -- when the rates across runs within a training
set vary by as much as a factor of 25, it suggests there was systematic
bias in the way the sets were chosen.  For example, perhaps they were
broken into sets by arrival time.  If that's what you did, you should go
back and break them into sets randomly instead.  If you did partition them
randomly, the wild variance across runs is mondo mysterious.

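Random partitioning is only a few lines of Python, e.g. (a sketch, not
anything in the codebase -- the function name and set count are made up):

```python
import random

def partition(msgs, n=5, seed=None):
    """Shuffle msgs, then deal them round-robin into n nearly equal sets."""
    msgs = list(msgs)
    random.Random(seed).shuffle(msgs)
    return [msgs[i::n] for i in range(n)]

# e.g. deal ~9000 message ids into Set1..Set5
sets = partition(range(9000), n=5, seed=20020911)
```

Each set comes out within one message of the same size, and any
time-of-arrival clumping gets spread across all five sets instead of
concentrating in one.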
>> I expect hammie will do a much better job on this already than hand
>> grepping.  Be sure to stare at the false positives and get the
>> spam out of there.

> Yah, but there's a chicken-and-egg problem there - I want stuff that's
> _known_ to be right to test this stuff,

Then you have to look at every message by eyeball -- any scheme has
non-zero error rates of both kinds.

> so using the spambayes code to tell me whether it's spam is not
> going to help.

Trust me <wink> -- it helps a *lot*.  I expect everyone who has done any
testing here has discovered spam in their ham, and vice versa.  Results
improve as you improve the categorization.  Once the gross mistakes are
straightened out, it's much less tedious to scan the rest by eyeball.

[on skip tokens]
> Yep, it shows up in a lot of spam, but also in different forms in hams.
> But the hams each manage to pick a different variant of
> ~~~~~~~~~~~~~~~~~~~~~~
> or whatever - so they don't end up counteracting the various bits in the
> spam.
>
> Looking further, a _lot_ of the bad skip rubbish is coming from
> uuencoded viruses &c in the spam-set.

For whatever reason, there appear to be few of those in BruceG's spam
collection.  I added code to strip uuencoded sections, and pump out
uuencode summary tokens instead.  I'll check it in.  It didn't make a
significant difference on my usual test run (a single spam in my Set4 is
now judged as ham by the other 4 sets; nothing else changed).  It does
shrink the database size here by a few percent.  Let us know whether it
helps you!

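The gist of the stripping is just a begin/end scan.  A rough sketch only --
this is not the checked-in code, and the real token names and bucketing
differ:

```python
import re

# a uuencoded section opens with e.g. "begin 644 virus.exe"
BEGIN = re.compile(r"^begin\s+[0-7]{3,4}\s+(.+)$")

def strip_uuencoded(text):
    """Drop uuencoded bodies, keeping short summary tokens instead."""
    kept, tokens = [], []
    lines = iter(text.splitlines())
    for line in lines:
        m = BEGIN.match(line)
        if m is None:
            kept.append(line)
            continue
        fname = m.group(1)
        nlines = 0
        for body in lines:          # consume through the matching "end"
            if body.strip() == "end":
                break
            nlines += 1
        # summary tokens: the file extension and a crude size bucket
        # (hypothetical token spellings)
        tokens.append("uuencode:ext:" + fname.rsplit(".", 1)[-1].lower())
        tokens.append("uuencode:lines:%d" % (nlines // 10 * 10))
    return "\n".join(kept), tokens
```

So a few hundred lines of high-entropy uuencode rubbish collapse into a
couple of well-behaved tokens instead of a pile of unique skip: junk.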
Before and after stripping uuencoded sections:

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.050  0.050  tied
    0.000  0.000  tied
    0.025  0.025  tied
    0.000  0.000  tied
    0.075  0.075  tied
    0.025  0.025  tied
    0.025  0.025  tied
    0.000  0.000  tied
    0.050  0.050  tied
    0.000  0.000  tied
    0.025  0.025  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.050  0.050  tied
    0.025  0.025  tied
    0.000  0.000  tied
    0.025  0.025  tied
    0.050  0.050  tied

won   0 times
tied 20 times
lost  0 times

total unique fp went from 8 to 8 tied

false negative percentages
    0.255  0.255  tied
    0.364  0.364  tied
    0.254  0.291  lost  +14.57%
    0.509  0.509  tied
    0.436  0.436  tied
    0.218  0.218  tied
    0.182  0.218  lost  +19.78%
    0.582  0.582  tied
    0.327  0.327  tied
    0.255  0.255  tied
    0.254  0.291  lost  +14.57%
    0.582  0.582  tied
    0.545  0.545  tied
    0.255  0.255  tied
    0.291  0.291  tied
    0.400  0.400  tied
    0.291  0.291  tied
    0.218  0.218  tied
    0.218  0.218  tied
    0.145  0.182  lost  +25.52%

won   0 times
tied 16 times
lost  4 times

total unique fn went from 89 to 90 lost +1.12%