59 lines
2.0 KiB
Plaintext
59 lines
2.0 KiB
Plaintext
Return-Path: neale@woozle.org
|
|
Delivery-Date: Fri Sep 6 20:58:33 2002
|
|
From: neale@woozle.org (Neale Pickett)
|
|
Date: 06 Sep 2002 12:58:33 -0700
|
|
Subject: [Spambayes] Deployment
|
|
In-Reply-To: <LNBBLJKPBEHFEDALKOLCGEIPBCAB.tim.one@comcast.net>
|
|
References: <LNBBLJKPBEHFEDALKOLCGEIPBCAB.tim.one@comcast.net>
|
|
Message-ID: <w53znuuc2l2.fsf@woozle.org>
|
|
|
|
So then, Tim Peters <tim.one@comcast.net> is all like:
|
|
|
|
> [Guido]
|
|
> > ...
|
|
> > I don't know how big that pickle would be, maybe loading it each time
|
|
> > is fine. Or maybe marshalling.)
|
|
>
|
|
> My tests train on about 7,000 msgs, and a binary pickle of the database is
|
|
> approaching 10 million bytes.
|
|
|
|
My paltry 3000-message training set makes a 6.3MB (where 1MB=1e6 bytes)
|
|
pickle. hammie.py, which I just checked in, will optionally let you
|
|
write stuff out to a dbm file. With that same message base, the dbm
|
|
file weighs in at a hefty 21.4MB. It also takes longer to write:
|
|
|
|
Using a database:
|
|
real 8m24.741s
|
|
user 6m19.410s
|
|
sys 1m33.650s
|
|
|
|
Using a pickle:
|
|
real 1m39.824s
|
|
user 1m36.400s
|
|
sys 0m2.160s
|
|
|
|
This is on a PIII at 551.257MHz (I don't know what it's *supposed* to
|
|
be, 551.257 is what /proc/cpuinfo says).
|
|
|
|
For comparison, SpamOracle (currently the gold standard in my mind, at
|
|
least for speed) on the same data blazes along:
|
|
|
|
real 0m29.592s
|
|
user 0m28.050s
|
|
sys 0m1.180s
|
|
|
|
Its data file, which appears to be a marshalled hash, is 448KB.
|
|
However, it's compiled O'Caml and it uses a much simpler tokenizing
|
|
algorithm written with a lexical analyzer (ocamllex), so we'll never be
|
|
able to outperform it. It's something to keep in mind, though.
|
|
|
|
I don't have statistics yet for scanning unknown messages. (Actually, I
|
|
do, and the database blows the pickle out of the water, but it scores
|
|
every word with 0.00, so I'm not sure that's a fair test. ;) In any
|
|
case, 21MB per user is probably too large, and 10MB is questionable.
|
|
|
|
On the other hand, my pickle compressed very well with gzip, shrinking
|
|
down to 1.8MB.
|
|
|
|
Neale
|