GeronBook/Ch3/datasets/spam/easy_ham/01674.6bb054bf786bfbfeacc78...

23 lines
947 B
Plaintext

Return-Path: tim.one@comcast.net
Delivery-Date: Fri Sep 6 22:32:21 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 06 Sep 2002 17:32:21 -0400
Subject: [Spambayes] Deployment
In-Reply-To: <LNBBLJKPBEHFEDALKOLCGEIPBCAB.tim.one@comcast.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCGEJOBCAB.tim.one@comcast.net>
[Tim]
> My tests train on about 7,000 msgs, and a binary pickle of the database is
> approaching 10 million bytes.
That shrinks to under 2 million bytes, though, if I delete all the WordInfo
records with spamprob exactly equal to UNKNOWN_SPAMPROB. Such records
aren't needed when scoring (an unknown word gets a made-up probability of
UNKNOWN_SPAMPROB). Such records are only needed for training; I've noted
before that a scoring-only database can be leaner.
In part the bloat is due to character 5-gram'ing, part due to that the
database is brand new so has never been cleaned via clearjunk(), and part
due to plain evil gremlins.