GeronBook/Ch3/datasets/spam/easy_ham/01716.8e1547062c811ff9b6218...

95 lines
2.9 KiB
Plaintext

Return-Path: tim.one@comcast.net
Delivery-Date: Sun Sep 8 08:48:28 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 08 Sep 2002 03:48:28 -0400
Subject: [Spambayes] test sets?
In-Reply-To: <LNBBLJKPBEHFEDALKOLCGEOIBCAB.tim.one@comcast.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCEEOJBCAB.tim.one@comcast.net>
[Tim]
> ...
> I'd prefer to strip HTML tags from everything, but last time I
> tried that it still had bad effects on the error rates in my
> corpora (the full test results with and without HTML tag stripping
> is included in the "What about HTML?" comment block). But as the
> comment block also says,
>
> # XXX So, if another way is found to slash the f-n rate, the decision here
> # XXX not to strip HTML from HTML-only msgs should be revisited.
>
> and we've since done several things that gave significant f-n rate
> reductions. I should test that again now.
I did so. Alas, stripping HTML tags from all text still hurts the f-n rate
in my test data:
false positive percentages
0.000 0.000 tied
0.000 0.000 tied
0.050 0.075 lost +50.00%
0.025 0.025 tied
0.075 0.025 won -66.67%
0.000 0.000 tied
0.100 0.100 tied
0.050 0.075 lost +50.00%
0.025 0.025 tied
0.025 0.000 won -100.00%
0.050 0.075 lost +50.00%
0.050 0.050 tied
0.050 0.025 won -50.00%
0.000 0.000 tied
0.000 0.000 tied
0.075 0.075 tied
0.025 0.025 tied
0.000 0.000 tied
0.025 0.025 tied
0.050 0.050 tied
won 3 times
tied 14 times
lost 3 times
total unique fp went from 13 to 11
false negative percentages
0.327 0.400 lost +22.32%
0.400 0.400 tied
0.327 0.473 lost +44.65%
0.691 0.654 won -5.35%
0.545 0.473 won -13.21%
0.291 0.364 lost +25.09%
0.218 0.291 lost +33.49%
0.654 0.654 tied
0.364 0.473 lost +29.95%
0.291 0.327 lost +12.37%
0.327 0.291 won -11.01%
0.691 0.654 won -5.35%
0.582 0.655 lost +12.54%
0.291 0.400 lost +37.46%
0.364 0.436 lost +19.78%
0.436 0.582 lost +33.49%
0.436 0.364 won -16.51%
0.218 0.291 lost +33.49%
0.291 0.400 lost +37.46%
0.254 0.327 lost +28.74%
won 5 times
tied 2 times
lost 13 times
total unique fn went from 106 to 122
Last time I tried this (see tokenizer.py comments), the f-n rate after
stripping tags ranged from 0.982% to 1.781%, with a median of about 1.34%,
so we've made tons of progress on the f-n rate since then. But the mere
presence of HTML tags still remains a significant clue for c.l.py traffic,
so I'm left with the same comment:
> # XXX So, if another way is found to slash the f-n rate, the decision here
> # XXX not to strip HTML from HTML-only msgs should be revisited.
If we want to take the focus of this away from c.l.py traffic, I can't say
what effect HTML stripping would have (I don't have suitable test data to
measure that on).