GeronBook/Ch3/datasets/spam/easy_ham/01422.cbcae309928553721fdf4...

Return-Path: anthony@interlink.com.au
Delivery-Date: Fri Sep 6 08:59:38 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Fri, 06 Sep 2002 17:59:38 +1000
Subject: [Spambayes] test sets?
Message-ID: <200209060759.g867xcV03853@localhost.localdomain>

I've got a test set here that's the last three and a bit years of email to
info@ekit.com and info@ekno.com - it's a really ugly set of 20,000+
messages, currently broken into 7,000 spam, 9,000 ham, and 9,000 still
unclassified. These addresses are all over the 70-some different
ekit/ekno/ISIConnect websites, so they get a LOT of spam.

As well as the usual spam, it also has customers complaining about
credit card charges, it has people interested in the service and
asking questions about long distance rates, &c &c &c. Lots and lots
of "commercial" speech, in other words. Stuff that SA gets pretty
badly wrong.

I'm currently mangling it by feeding all parts (text, html, whatever
else :) into the filters, along with a selected set of headers
(to, from, content-type, x-mailer) and a list of
(header, count_of_header) pairs. This is showing up some nice stuff - e.g. the
X-uidl that stoopid spammers blindly copy into their messages.
I did have Received in there, but it's out for the moment, as it causes
rates to drop.
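
A minimal sketch of the header handling described above, assuming Python's standard email package. The function name header_tokens and the "name:word" / "header:name:count" token formats are made up for illustration; they are not what the actual filter code emits.

```python
from collections import Counter
from email import message_from_string

# The four headers named in the message; their values become tokens.
SELECTED_HEADERS = ("to", "from", "content-type", "x-mailer")


def header_tokens(raw_message):
    """Return tokens for selected header values plus (header, count) pairs.

    Hypothetical helper: the token formats are illustrative assumptions.
    """
    msg = message_from_string(raw_message)
    tokens = []

    # Selected headers: tag each whitespace-separated word with its header name.
    for name in SELECTED_HEADERS:
        for value in msg.get_all(name, []):
            for word in str(value).lower().split():
                tokens.append(f"{name}:{word}")

    # (header, count_of_header): how many times each header name appears.
    counts = Counter(name.lower() for name in msg.keys())
    for name, count in sorted(counts.items()):
        if name == "received":
            # Left out for the moment, as in the message above.
            continue
        tokens.append(f"header:{name}:{count}")

    return tokens
```

With this scheme a copied X-uidl header shows up as the single token "header:x-uidl:1", so its mere presence can count as evidence whatever its value is.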
I'm also stripping out HTML tags, except for href="" and src="" - there's
so so much goodness in them (note that I'm only keeping the contents of
the attributes).
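
The tag stripping can be sketched roughly like this, assuming Python's standard html.parser module. The class name LinkKeepingStripper and the helper strip_html are hypothetical, and the real code may well do it differently (with regexes, say).

```python
from html.parser import HTMLParser


class LinkKeepingStripper(HTMLParser):
    """Drop HTML tags but keep text and the contents of href/src attributes.

    Hypothetical helper for illustration only.
    """

    KEPT_ATTRS = ("href", "src")

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        # The tag itself is discarded; only the attribute values are kept.
        for name, value in attrs:
            if name in self.KEPT_ATTRS and value:
                self.pieces.append(value)

    def handle_data(self, data):
        # Text between tags is passed through unchanged.
        self.pieces.append(data)


def strip_html(text):
    """Return text with tags removed and href/src values kept inline."""
    parser = LinkKeepingStripper()
    parser.feed(text)
    parser.close()
    # Collapse runs of whitespace left behind by the removed tags.
    return " ".join(" ".join(parser.pieces).split())
```

For example, strip_html('<a href="http://example.com/buy">here</a>') keeps both the URL and the link text, so the filters still see where the link points.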
--
Anthony Baxter <anthony@interlink.com.au>
It's never too late to have a happy childhood.