Return-Path: tim.one@comcast.net
Delivery-Date: Sun Sep 8 20:48:13 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 08 Sep 2002 15:48:13 -0400
Subject: [Spambayes] testing results
In-Reply-To: <20020908172113.GA26741@glacier.arctrix.com>
Message-ID:

Neil trained a classifier using 3 sets with about 500 ham and spam in
each.  We're missing half his test-run results due to a cmp.py bug (since
fixed); the "before custom fiddling" figures on the 3 reported runs were:

false positive percentages
    0.187
    0.749
    0.780
total unique fp 19

false negative percentages
    2.072
    2.448
    0.574
total unique fn 43

The "total unique" figures count all 6 runs; it's just the individual-run
fp and fn percentages we're missing for 3 of the runs.

Jeremy reported these "before custom fiddling" figures on 4 sets with
about 600 ham and spam in each:

false positive percentages
    0.000
    1.398
    1.398
    0.000
    1.242
    1.242
    1.398
    1.398
    0.000
    1.553
    1.553
    0.000
total unique fp 139

false negative percentages
    10.413
     6.104
     5.027
     8.259
     2.873
     5.745
     5.206
     4.488
     9.336
     5.206
     5.027
     9.874
total unique fn 970

So things are clearly working much better for Neil.

Both reported significant improvements in both f-n and f-p rates from
folding in more header lines.  Neil added Received analysis to the base
tokenizer's header analysis, while Jeremy skipped the base tokenizer's
header analysis completely but added tokenization like the base
tokenizer's Subject-line handling, but case-folded, for almost all header
lines (excepting only Received, Date, X-From_, and, I *suspect*, all
those starting with 'x-vm').

When I try 5 random pairs of 500-ham + 500-spam subsets of my test data,
I see:

false positive percentages
    0.000
    0.000
    0.200
    0.000
    0.200
    0.000
    0.200
    0.000
    0.000
    0.200
    0.400
    0.000
    0.200
    0.000
    0.200
    0.400
    0.000
    0.400
    0.200
    0.600
total unique fp 10

false negative percentages
    0.800
    0.400
    0.200
    0.600
    1.000
    0.000
    0.600
    1.200
    1.200
    0.800
    0.400
    0.800
    1.800
    0.800
    0.400
    1.000
    1.000
    0.400
    0.000
    0.600
total unique fn 36

This is much closer to what Neil saw, but still looks better.  Another
run on a disjoint set of 5 random pairs looked much the same: total
unique fp rose to 12 and fn fell to 27.  A third run, on yet another
disjoint set of 5 random pairs, was likewise, with fp 12 and fn 40.  So
I'm pretty confident it's not going to matter which random subsets of
500 I take from my data.

It's hard to conclude anything given Jeremy's much worse results.  If
they were in line with Neil's, I'd suspect that I've over-tuned the
algorithm to statistical quirks in my corpora.
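
For anyone who wants to experiment with the same idea, the kind of
case-folded header tokenization Jeremy describes amounts to roughly the
sketch below.  The function name, the word-splitting regexp, and the
exact skip set are my guesses, not his actual code:

    import re

    # Headers left untouched, per Jeremy's description; the 'x-vm'
    # prefix test is my guess at what "all those starting with 'x-vm'"
    # means in practice.
    SKIP_HEADERS = {'received', 'date', 'x-from_'}

    def tokenize_headers(msg):
        # msg is an email.message.Message object.  Yield "name:word"
        # tokens for most header lines, case-folded like the base
        # tokenizer's Subject handling.
        for name, value in msg.items():
            name = name.lower()
            if name in SKIP_HEADERS or name.startswith('x-vm'):
                continue
            for word in re.findall(r"[\w$.%-]+", value.lower()):
                yield "%s:%s" % (name, word)

Feeding these tokens to the classifier alongside the body tokens is all
that "folding in more header lines" means here.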
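
The "5 random pairs of 500-ham + 500-spam subsets" selection is nothing
fancier than shuffling and slicing, roughly as below (the function and
argument names are made up for illustration).  Each of the 5 sets is
then used in turn to train a classifier that scores the other 4, which
is where the 5*4 = 20 fp and 20 fn percentages per grouping come from;
the second and third groupings are built from messages not used in the
earlier ones, which is all "disjoint" means above:

    import random

    def random_pairs(ham, spam, npairs=5, size=500):
        # Shuffle copies of the full ham and spam corpora, then carve
        # off npairs disjoint slices of `size` messages from each; the
        # i'th ham slice is paired with the i'th spam slice.
        ham = list(ham)
        spam = list(spam)
        random.shuffle(ham)
        random.shuffle(spam)
        return [(ham[i * size:(i + 1) * size],
                 spam[i * size:(i + 1) * size])
                for i in range(npairs)]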
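
To be explicit about the difference between the per-run percentages and
the "total unique" lines: the latter just count the union of
misclassified messages across all runs, so a message misclassified by
several of the trained classifiers is counted once.  Something like the
following, where the per-run result representation is invented for
illustration:

    def total_unique(runs):
        # runs is a list of per-run results, each a set (or list) of
        # ids of messages misclassified in that run.
        unique = set()
        for misclassified in runs:
            unique.update(misclassified)
        return len(unique)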