Return-Path: skip@pobox.com
Delivery-Date: Mon Sep 9 17:31:12 2002
From: skip@pobox.com (Skip Montanaro)
Date: Mon, 9 Sep 2002 11:31:12 -0500
Subject: [Spambayes] deleting "duplicate" spam before training? good idea or bad?
Message-ID: <15740.52432.861148.597750@12-248-11-90.client.attbi.com>
Because I get mail through several different email addresses, I frequently
get duplicates (or triplicates or more-plicates) of various spam messages.
In saving spam for later analysis I haven't always been careful to avoid
saving such duplicates.
I wrote a script some time ago to try to minimize the duplicates I see by
calculating a loose checksum, but I still have some duplicates. Should I
delete the duplicates before training or not? Would people be interested in
the script? I'd be happy to extricate it from my local modules and check it
into CVS.
Skip
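
For illustration only, here is a minimal sketch of what deduplicating by a
"loose checksum" could look like. It is not the script mentioned above, which
is not shown in this message. The function names loose_checksum and
drop_duplicates are hypothetical, and the normalization rules (lowercasing,
dropping digits, collapsing whitespace) are assumptions about what "loose"
might mean.

```python
import hashlib
import re


def loose_checksum(body: str) -> str:
    """Return a digest that ignores case, digits, and whitespace differences.

    The normalization steps are assumptions chosen for illustration.
    """
    text = body.lower()
    text = re.sub(r"\d+", "", text)            # drop digit runs (IDs, counters)
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace runs
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def drop_duplicates(bodies):
    """Keep the first message seen for each loose checksum, preserving order."""
    seen = set()
    unique = []
    for body in bodies:
        key = loose_checksum(body)
        if key not in seen:
            seen.add(key)
            unique.append(body)
    return unique
```

Two copies of a message that differ only in case, spacing, or an embedded
number would hash to the same value here, so only the first copy is kept.
Whether removing such copies before training helps or hurts is the question
the message asks; the sketch does not answer it.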