Return-Path: jeremy@alum.mit.edu Delivery-Date: Sat Sep 7 18:18:17 2002 From: jeremy@alum.mit.edu (Jeremy Hylton) Date: Sat, 7 Sep 2002 13:18:17 -0400 Subject: [Spambayes] understanding high false negative rate In-Reply-To: References: <15737.16782.542869.368986@slothrop.zope.com> Message-ID: <15738.13529.407748.635725@slothrop.zope.com> >>>>> "TP" == Tim Peters writes: TP> [Jeremy Hylton[ >> The total collections are 1100 messages. I trained with 1100/5 >> messages. TP> I'm reading this now as that you trained on about 220 spam and TP> about 220 ham. That's less than 10% of the sizes of the TP> training sets I've been using. Please try an experiment: train TP> on 550 of each, and test once against the other 550 of each. This helps a lot! Here are results with the stock tokenizer: Training on & ... 644 hams & 557 spams 0.000 10.413 1.398 6.104 1.398 5.027 Training on & ... 644 hams & 557 spams 0.000 8.259 1.242 2.873 1.242 5.745 Training on & ... 644 hams & 557 spams 1.398 5.206 1.398 4.488 0.000 9.336 Training on & ... 644 hams & 557 spams 1.553 5.206 1.553 5.027 0.000 9.874 total false pos 139 5.39596273292 total false neg 970 43.5368043088 And results from the tokenizer that looks at all headers except Date, Received, and X-From_: Training on & ... 644 hams & 557 spams 0.000 7.540 0.932 4.847 0.932 3.232 Training on & ... 644 hams & 557 spams 0.000 7.181 0.621 2.873 0.621 4.847 Training on & ... 644 hams & 557 spams 1.087 4.129 1.087 3.052 0.000 6.822 Training on & ... 644 hams & 557 spams 0.776 3.411 0.776 3.411 0.000 6.463 total false pos 97 3.76552795031 total false neg 738 33.1238779174 Is it safe to conclude that avoiding any cleverness with headers is a good thing? TP> Do that a few times making a random split each time (it won't be TP> long until you discover why directories of individual files are TP> a lot easier to work -- e.g., random.shuffle() makes this kind TP> of thing trivial for me). You haven't looked at mboxtest.py, have you? I'm using random.shuffle(), too. You don't need to have many files to make an arbitrary selection of messages from an mbox. I'll report some more results when they're done. Jeremy