Return-Path: neale@woozle.org Delivery-Date: Fri Sep 6 20:58:33 2002 From: neale@woozle.org (Neale Pickett) Date: 06 Sep 2002 12:58:33 -0700 Subject: [Spambayes] Deployment In-Reply-To: References: Message-ID: So then, Tim Peters is all like: > [Guido] > > ... > > I don't know how big that pickle would be, maybe loading it each time > > is fine. Or maybe marshalling.) > > My tests train on about 7,000 msgs, and a binary pickle of the database is > approaching 10 million bytes. My paltry 3000-message training set makes a 6.3MB (where 1MB=1e6 bytes) pickle. hammie.py, which I just checked in, will optionally let you write stuff out to a dbm file. With that same message base, the dbm file weighs in at a hefty 21.4MB. It also takes longer to write: Using a database: real 8m24.741s user 6m19.410s sys 1m33.650s Using a pickle: real 1m39.824s user 1m36.400s sys 0m2.160s This is on a PIII at 551.257MHz (I don't know what it's *supposed* to be, 551.257 is what /proc/cpuinfo says). For comparison, SpamOracle (currently the gold standard in my mind, at least for speed) on the same data blazes along: real 0m29.592s user 0m28.050s sys 0m1.180s Its data file, which appears to be a marshalled hash, is 448KB. However, it's compiled O'Caml and it uses a much simpler tokenizing algorithm written with a lexical analyzer (ocamllex), so we'll never be able to outperform it. It's something to keep in mind, though. I don't have statistics yet for scanning unknown messages. (Actually, I do, and the database blows the pickle out of the water, but it scores every word with 0.00, so I'm not sure that's a fair test. ;) In any case, 21MB per user is probably too large, and 10MB is questionable. On the other hand, my pickle compressed very well with gzip, shrinking down to 1.8MB. Neale