Return-Path: anthony@interlink.com.au
Delivery-Date: Fri Sep 6 08:59:38 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Fri, 06 Sep 2002 17:59:38 +1000
Subject: [Spambayes] test sets?
Message-ID: <200209060759.g867xcV03853@localhost.localdomain>

I've got a test set here that's the last 3 and a bit years' email to
info@ekit.com and info@ekno.com - it's a really ugly set of 20,000+
messages, currently broken into 7,000 spam, 9,000 ham, and 9,000
unclassified. These addresses are all over the 70-some different
ekit/ekno/ISIConnect websites, so they get a LOT of spam.

As well as the usual spam, it also has customers complaining about
credit card charges, people interested in the service and asking
questions about long distance rates, &c &c &c. Lots and lots of
"commercial" speech, in other words. Stuff that SA (SpamAssassin) gets
pretty badly wrong.

I'm currently mangling it by feeding all parts (text, html, whatever
else :) into the filters, along with both a selected set of headers
(to, from, content-type, x-mailer) and a list of
(header, count_of_header) pairs. This is showing up some nice stuff -
e.g. the X-uidl that stoopid spammers blindly copy into their messages.

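A minimal sketch of the header handling described above, assuming
Python's standard email package. The function name header_tokens and
the token formats ("name:word", "header:name:count") are hypothetical
illustrations, not taken from the actual filter code, and the body
parts are left out to keep it short.

```python
from collections import Counter
from email import message_from_string

# Headers whose values are fed to the filter (the list named in the message).
SELECTED_HEADERS = ("to", "from", "content-type", "x-mailer")


def header_tokens(raw_message):
    """Yield tokens for selected header values and for per-header counts.

    Hypothetical helper: the token spellings are an assumption.
    """
    msg = message_from_string(raw_message)

    # Words from the selected headers, tagged with the header name.
    for name in SELECTED_HEADERS:
        for value in msg.get_all(name, []):
            for word in value.lower().split():
                yield f"{name}:{word}"

    # One (header, count_of_header) token per distinct header name.
    counts = Counter(name.lower() for name in msg.keys())
    for name, count in sorted(counts.items()):
        yield f"header:{name}:{count}"
```

With tokens like these, a message carrying a copied X-UIDL header
produces a "header:x-uidl:1" token that the filter can learn from.
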
I did have Received in there, but it's out for the moment, as it causes
rates to drop.

I'm also stripping out HTML tags, except for href="" and src="" - there's
so so much goodness in them (note that I'm only keeping the contents of
the attributes).

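The tag stripping described above could look roughly like this, using
Python's standard html.parser module. The class name TagStripper and
the helper strip_html are hypothetical, and this is only a sketch of
the idea (drop the tags, keep the text and the href/src values), not
the code actually used.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect text outside tags plus the values of href and src attributes."""

    KEPT_ATTRS = ("href", "src")  # only the attribute contents are kept

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        for name, value in attrs:
            if name in self.KEPT_ATTRS and value:
                self.pieces.append(value)

    def handle_data(self, data):
        # Keep non-empty text found between tags.
        if data.strip():
            self.pieces.append(data.strip())


def strip_html(html):
    """Return the text and href/src values of html, joined by spaces."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return " ".join(parser.pieces)
```

For example, strip_html('<a href="http://example.com/x">now</a>')
returns "http://example.com/x now": the tag is gone but the link target
survives as a token source.
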
--
Anthony Baxter <anthony@interlink.com.au>
It's never too late to have a happy childhood.