Return-Path: tim.one@comcast.net Delivery-Date: Fri Sep 6 22:32:21 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 06 Sep 2002 17:32:21 -0400 Subject: [Spambayes] Deployment In-Reply-To: Message-ID: [Tim] > My tests train on about 7,000 msgs, and a binary pickle of the database is > approaching 10 million bytes. That shrinks to under 2 million bytes, though, if I delete all the WordInfo records with spamprob exactly equal to UNKNOWN_SPAMPROB. Such records aren't needed when scoring (an unknown word gets a made-up probability of UNKNOWN_SPAMPROB). Such records are only needed for training; I've noted before that a scoring-only database can be leaner. In part the bloat is due to character 5-gram'ing, part due to that the database is brand new so has never been cleaned via clearjunk(), and part due to plain evil gremlins.