Return-Path: gward@python.net
Delivery-Date: Mon Sep 9 20:25:42 2002
From: gward@python.net (Greg Ward)
Date: Mon, 9 Sep 2002 15:25:42 -0400
Subject: [Spambayes] deleting "duplicate" spam before training? good idea
 or bad?
In-Reply-To: <LNBBLJKPBEHFEDALKOLCIECKBDAB.tim.one@comcast.net>
References: <15740.52432.861148.597750@12-248-11-90.client.attbi.com>
 <LNBBLJKPBEHFEDALKOLCIECKBDAB.tim.one@comcast.net>
Message-ID: <20020909192542.GB2002@cthulhu.gerg.ca>

On 09 September 2002, Tim Peters said:
> > Would people be interested in the script? I'd be happy to extricate
> > it from my local modules and check it into CVS.
>
> Sure! I think it's relevant, but maybe for another purpose. Paul Svensson
> is thinking harder about real people <wink> than the rest of us, and he may
> be able to get use out of approaches that identify closely related spam.
> For example, some amount of spam is going to end up in the ham training data
> in real life use, and any sort of similarity score to a piece of known spam
> may be an aid in finding and purging it.

OTOH, look into DCC (Distributed Checksum Clearinghouse,
http://www.rhyolite.com/anti-spam/dcc/), which uses fuzzy checksums.
It's quite likely that DCC's checksumming scheme is better than
something any of us would throw together for personal use (no offense,
Skip!). But I have no personal experience of it.

Greg
--
Greg Ward <gward@python.net> http://www.gerg.ca/
If it can't be expressed in figures, it is not science--it is opinion.