Return-Path: tim.one@comcast.net
Delivery-Date: Fri Sep 6 15:59:38 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 06 Sep 2002 10:59:38 -0400
Subject: [Spambayes] test sets?
In-Reply-To: <200209060759.g867xcV03853@localhost.localdomain>
Message-ID:

[Anthony Baxter]
> I've got a test set here that's the last 3 and a bit years' email to
> info@ekit.com and info@ekno.com - it's a really ugly set of 20,000+
> messages, currently broken into 7,000 spam, 9,000 ham, and 9,000
> currently unclassified. These addresses are all over the 70-some
> different ekit/ekno/ISIConnect websites, so they get a LOT of spam.
>
> As well as the usual spam, it also has customers complaining about
> credit card charges, people interested in the service and asking
> questions about long distance rates, &c &c &c. Lots and lots of
> "commercial" speech, in other words. Stuff that SA gets pretty
> badly wrong.

Can this corpus be shared? I suppose not.

> I'm currently mangling it by feeding all parts (text, html, whatever
> else :) into the filters, along with a selected number of headers
> (to, from, content-type, x-mailer) and a list of (header,
> count_of_header) pairs. This is showing up some nice stuff - e.g. the
> X-UIDL header that stoopid spammers blindly copy into their messages.

If we ever have a shared corpus, an easy refactoring of timtest should
allow plugging in different tokenizers. I've made only three changes to
Graham's algorithm so far (well, I've made dozens -- only three
survived testing as proven winners); all the rest has been refining the
tokenization to provide better clues.

> I did have Received in there, but it's out for the moment, as it
> causes rates to drop.

That's ambiguous: accuracy rates or error rates? Ham or spam rates?

> I'm also stripping out HTML tags, except for href="" and src="" -
> there's so much goodness in them (note that I'm only keeping the
> contents of the attributes).

Mining embedded http/https/ftp thingies cut the false negative rate in
half in my tests (not keying off href, just scanning for anything that
"looked like" one); that was the single biggest f-n improvement I've
seen. It didn't change the false positive rate. Do you know whether
src added additional power, or did you do both at once?
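
To make sure we're picturing the same thing, here's roughly how I read
the (header, count_of_header) idea -- a sketch only, with made-up
names, using the stdlib email package:

    import email

    def header_count_tokens(msg):
        # msg is an email.message.Message; msg.keys() returns every
        # header name including duplicates, so a doubled X-UIDL or a
        # pile of Received lines shows up as its own clue.
        counts = {}
        for name in msg.keys():
            name = name.lower()
            counts[name] = counts.get(name, 0) + 1
        for name, n in sorted(counts.items()):
            yield "header:%s:%d" % (name, n)

    msg = email.message_from_string(
        "Received: a\nReceived: b\nX-UIDL: 123\n\nbody\n")
    print(list(header_count_tokens(msg)))
    # ['header:received:2', 'header:x-uidl:1']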
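
The refactoring I have in mind is shallow: pass the tokenizer in as an
argument instead of hard-wiring it. Hypothetical names throughout --
this isn't what timtest looks like today:

    def train(classifier, msgs, is_spam, tokenize):
        # tokenize is any callable mapping a message to an iterable of
        # token strings; trying a different tokenization scheme is
        # then a one-argument change, not a rewrite.
        for msg in msgs:
            classifier.learn(tokenize(msg), is_spam)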
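
And for concreteness, "scanning for anything that looked like one"
amounts to something like this (an illustrative regexp, not the one my
tokenizer actually uses):

    import re

    url_re = re.compile(r"(https?|ftp)://([^\s<>'\"]+)", re.IGNORECASE)

    def url_tokens(text):
        # Scan the whole body for anything URL-shaped -- no HTML
        # parsing, and no keying off href= or src= attributes.
        for m in url_re.finditer(text):
            yield "proto:" + m.group(1).lower()
            # Splitting the host and path on punctuation turns the
            # words buried in a spammy URL into individual clues.
            for chunk in re.split(r"[/?&=+;:.,]", m.group(2).lower()):
                if chunk:
                    yield "url:" + chunk

    print(list(url_tokens("See http://WIN-BIG.example.com/free?cash=now")))
    # ['proto:http', 'url:win-big', 'url:example', 'url:com',
    #  'url:free', 'url:cash', 'url:now']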