Return-Path: anthony@interlink.com.au Delivery-Date: Fri Sep 6 08:59:38 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Fri, 06 Sep 2002 17:59:38 +1000 Subject: [Spambayes] test sets? Message-ID: <200209060759.g867xcV03853@localhost.localdomain> I've got a test set here that's the last 3 and a bit years email to info@ekit.com and info@ekno.com - it's a really ugly set of 20,000+ messages, currently broken into 7,000 spam, 9,000 ham, 9,000 currently unclassified. These addresses are all over the 70-some different ekit/ekno/ISIConnect websites, so they get a LOT of spam. As well as the usual spam, it also has customers complaining about credit card charges, it has people interested in the service and asking questions about long distance rates, &c &c &c. Lots and lots of "commercial" speech, in other words. Stuff that SA gets pretty badly wrong. I'm currently mangling it by feeding all parts (text, html, whatever else :) into the filters, as well as both a selected number of headers (to, from, content-type, x-mailer), and also a list of (header,count_of_header). This is showing up some nice stuff - e.g. the X-uidl that stoopid spammers blindly copy into their messages. I did have Received in there, but it's out for the moment, as it causes rates to drop. I'm also stripping out HTML tags, except for href="" and src="" - there's so so much goodness in them (note that I'm only keeping the contents of the attributes). -- Anthony Baxter It's never too late to have a happy childhood.