From jm@jmason.org Thu Oct 10 13:14:29 2002
Return-Path: <yyyy@spamassassin.taint.org>
Delivered-To: yyyy@spamassassin.taint.org
Received: by spamassassin.taint.org (Postfix, from userid 500)
    id 0610616F17; Thu, 10 Oct 2002 13:14:29 +0100 (IST)
Received: from spamassassin.taint.org (localhost [127.0.0.1])
    by jmason.org (Postfix) with ESMTP
    id 033BAF7DA; Thu, 10 Oct 2002 13:14:29 +0100 (IST)
To: Daniel Quinlan <quinlan@pathname.com>
Cc: yyyy@spamassassin.taint.org (Justin Mason),
    SpamAssassin-talk@lists.sourceforge.net,
    SpamAssassin-devel@lists.sourceforge.net
Subject: Re: [SAdev] fully-public corpus of mail available
In-Reply-To: Message from Daniel Quinlan <quinlan@pathname.com>
    of "09 Oct 2002 21:47:02 PDT." <yf2r8eygat5.fsf@proton.pathname.com>
From: yyyy@spamassassin.taint.org (Justin Mason)
X-GPG-Key-Fingerprint: 0A48 2D8B 0B52 A87D 0E8A 6ADD 4137 1B50 6E58 EF0A
X-Habeas-Swe-1: winter into spring
X-Habeas-Swe-2: brightly anticipated
X-Habeas-Swe-3: like Habeas SWE (tm)
X-Habeas-Swe-4: Copyright 2002 Habeas (tm)
X-Habeas-Swe-5: Sender Warranted Email (SWE) (tm). The sender of this
X-Habeas-Swe-6: email in exchange for a license for this Habeas
X-Habeas-Swe-7: warrant mark warrants that this is a Habeas Compliant
X-Habeas-Swe-8: Message (HCM) and not spam. Please report use of this
X-Habeas-Swe-9: mark in spam to <http://www.habeas.com/report/>.
Date: Thu, 10 Oct 2002 13:14:23 +0100
Sender: yyyy@spamassassin.taint.org
Message-Id: <20021010121429.0610616F17@spamassassin.taint.org>

(trimmed cc list)
Daniel Quinlan said:
> 1. These messages could end up being falsely (or incorrectly) reported
> to Razor, DCC, Pyzor, etc. Certain RBLs too. I don't think the
> results for these distributed tests can be trusted in any way,
> shape, or form when running over a public corpus.
I'll note that in the README.
> 2. These messages could also be submitted (more than once) to projects
> like SpamAssassin that rely on filtering results submission for GA
> tuning and development.
> The second problem could be alleviated somewhat by adding a Nilsimsa
> signature (or similar) to the mass-check file (the results format used
> by SpamAssassin) and giving the message files unique names (MD5 or
> SHA-1 of each file).
OK; maybe rewriting the message-ids will help here; that should allow
us to pick them out. I'll do that.
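(A minimal sketch of the file-naming idea in point 2, assuming Python and
its standard hashlib module. The helper names hashed_name and
rename_to_hash are made up for illustration; they are not part of
SpamAssassin or of the corpus tooling.)

```python
import hashlib
from pathlib import Path


def hashed_name(path: Path, algorithm: str = "sha1") -> str:
    """Return the hex digest of the file's raw bytes (hypothetical helper)."""
    digest = hashlib.new(algorithm)  # "sha1" or "md5", as suggested in point 2
    with path.open("rb") as fh:
        # Read in chunks so large messages are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def rename_to_hash(path: Path, algorithm: str = "sha1") -> Path:
    """Rename a message file to the digest of its contents (hypothetical helper)."""
    target = path.with_name(hashed_name(path, algorithm))
    # If a file with this digest already exists, the content is a duplicate;
    # leave the original untouched rather than overwrite anything.
    if target != path and not target.exists():
        path.rename(target)
    return target
```

Two files with identical bytes map to the same name, so duplicate
submissions would be easy to spot; any change to the message (including
a rewritten Message-Id) gives a different name.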
> 3. Spammers could adopt elements of the good messages to throw off
> filters. And, of course, there's always progression in technology
> (by both spammers and non-spammers).
> The third problem doesn't really worry me.
nah, me neither.
> These problems (and perhaps others I have not identified) are unique
> to spam filtering. Compression corpuses and other performance-related
> corpuses have their own set of problems, of course.
>
> In other words, I don't think there's any replacement for having
> multiple independent corpuses. Finding better ways to distribute
> testing and collate results seems like a more viable long-term solution
> (and I'm glad we're working on exactly that for SpamAssassin). If
> you're going to seriously work on filter development, building a corpus
> of 10000-50000 messages (half spam/half non-spam) is not really that
> much work. If you don't get enough spam, creating multi-technique
> spamtraps (web, usenet, replying to spam) is pretty easy. And who
> doesn't get thousands of non-spam every week? ;-)
Yep. The primary reason I released this was to provide a good, big
corpus for academic testing of filter systems; it allows results to
be compared between filters using a known corpus.
For SpamAssassin development, everyone has to maintain their own corpus.
--j.