GeronBook/Ch3/datasets/spam/easy_ham/01741.2a15d667c53727befded9...


Return-Path: gward@python.net
Delivery-Date: Mon Sep 9 20:25:42 2002
From: gward@python.net (Greg Ward)
Date: Mon, 9 Sep 2002 15:25:42 -0400
Subject: [Spambayes] deleting "duplicate" spam before training? good idea or bad?
In-Reply-To: <LNBBLJKPBEHFEDALKOLCIECKBDAB.tim.one@comcast.net>
References: <15740.52432.861148.597750@12-248-11-90.client.attbi.com>
<LNBBLJKPBEHFEDALKOLCIECKBDAB.tim.one@comcast.net>
Message-ID: <20020909192542.GB2002@cthulhu.gerg.ca>
On 09 September 2002, Tim Peters said:
> > Would people be interested in the script? I'd be happy to extricate
> > it from my local modules and check it into CVS.
>
> Sure! I think it's relevant, but maybe for another purpose. Paul Svensson
> is thinking harder about real people <wink> than the rest of us, and he may
> be able to get use out of approaches that identify closely related spam.
> For example, some amount of spam is going to end up in the ham training data
> in real life use, and any sort of similarity score to a piece of known spam
> may be an aid in finding and purging it.
OTOH, look into DCC (Distributed Checksum Clearinghouse,
http://www.rhyolite.com/anti-spam/dcc/), which uses fuzzy checksums.
It's quite likely that DCC's checksumming scheme is better than
something any of us would throw together for personal use (no offense,
Skip!). But I have no personal experience of it.
Greg
--
Greg Ward <gward@python.net> http://www.gerg.ca/
If it can't be expressed in figures, it is not science--it is opinion.