GeronBook/Ch3/datasets/spam/easy_ham/01682.aabc3014dc8e7bbf3748d...

21 lines
894 B
Plaintext

Return-Path: tim.one@comcast.net
Delivery-Date: Sat Sep 7 01:32:26 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 06 Sep 2002 20:32:26 -0400
Subject: [Spambayes] understanding high false negative rate
In-Reply-To: <15737.16782.542869.368986@slothrop.zope.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCCEKNBCAB.tim.one@comcast.net>
[Jeremy Hylton[
> The total collections are 1100 messages. I trained with 1100/5
> messages.
I'm reading this now as that you trained on about 220 spam and about 220
ham. That's less than 10% of the sizes of the training sets I've been
using. Please try an experiment: train on 550 of each, and test once
against the other 550 of each. Do that a few times making a random split
each time (it won't be long until you discover why directories of individual
files are a lot easier to work -- e.g., random.shuffle() makes this kind of
thing trivial for me).