From jm@jmason.org Thu Oct 10 13:14:29 2002 Return-Path: Delivered-To: yyyy@example.com Received: by example.com (Postfix, from userid 500) id 0610616F17; Thu, 10 Oct 2002 13:14:29 +0100 (IST) Received: from example.com (localhost [127.0.0.1]) by jmason.org (Postfix) with ESMTP id 033BAF7DA; Thu, 10 Oct 2002 13:14:29 +0100 (IST) To: Daniel Quinlan Cc: yyyy@example.com (Justin Mason), SpamAssassin-talk@lists.sourceforge.net, SpamAssassin-devel@lists.sourceforge.net Subject: Re: [SAdev] fully-public corpus of mail available In-Reply-To: Message from Daniel Quinlan of "09 Oct 2002 21:47:02 PDT." From: yyyy@example.com (Justin Mason) X-GPG-Key-Fingerprint: 0A48 2D8B 0B52 A87D 0E8A 6ADD 4137 1B50 6E58 EF0A X-Habeas-Swe-1: winter into spring X-Habeas-Swe-2: brightly anticipated X-Habeas-Swe-3: like Habeas SWE (tm) X-Habeas-Swe-4: Copyright 2002 Habeas (tm) X-Habeas-Swe-5: Sender Warranted Email (SWE) (tm). The sender of this X-Habeas-Swe-6: email in exchange for a license for this Habeas X-Habeas-Swe-7: warrant mark warrants that this is a Habeas Compliant X-Habeas-Swe-8: Message (HCM) and not spam. Please report use of this X-Habeas-Swe-9: mark in spam to . Date: Thu, 10 Oct 2002 13:14:23 +0100 Sender: yyyy@example.com Message-Id: <20021010121429.0610616F17@example.com> X-Spam-Status: No, hits=-27.8 required=5.0 tests=AWL,HABEAS_SWE,IN_REP_TO,QUOTED_EMAIL_TEXT, T_NONSENSE_FROM_00_10,T_NONSENSE_FROM_10_20, T_NONSENSE_FROM_20_30,T_NONSENSE_FROM_30_40, T_NONSENSE_FROM_40_50,T_NONSENSE_FROM_50_60, T_NONSENSE_FROM_60_70,T_NONSENSE_FROM_70_80, T_NONSENSE_FROM_80_90,T_NONSENSE_FROM_90_91, T_NONSENSE_FROM_91_92,T_NONSENSE_FROM_92_93, T_NONSENSE_FROM_93_94,T_NONSENSE_FROM_94_95, T_NONSENSE_FROM_95_96,T_NONSENSE_FROM_96_97, T_NONSENSE_FROM_97_98,T_NONSENSE_FROM_98_99, T_NONSENSE_FROM_99_100,T_QUOTED_EMAIL_TEXT version=2.50-cvs X-Spam-Level: (trimmed cc list) Daniel Quinlan said: > 1. These messages could end up being falsely (or incorrectly) reported > to Razor, DCC, Pyzor, etc. Certain RBLs too. I don't think the > results for these distributed tests can be trusted in any way, > shape, or form when running over a public corpus. I'll note that in the README. > 2. These messages could also be submitted (more than once) to projects > like SpamAssassin that rely on filtering results submission for GA > tuning and development. > The second problem could be alleviated somewhat by adding a Nilsimsa > signature (or similar) to the mass-check file (the results format used > by SpamAssassin) and giving the message files unique names (MD5 or > SHA-1 of each file). OK; maybe rewriting the message-ids will help here, that should allow us to pick them out. I'll do that. > 3. Spammers could adopt elements of the good messages to throw off > filters. And, of course, there's always progression in technology > (by both spammers and non-spammers). > The third problem doesn't really worry me. nah, me neither. > These problems (and perhaps others I have not identified) are unique > to spam filtering. Compression corpuses and other performance-related > corpuses have their own set of problems, of course. > > In other words, I don't think there's any replacement for having > multiple independent corpuses. Finding better ways to distribute > testing and collate results seems like a more viable long-term solution > (and I'm glad we're working on exactly that for SpamAssassin). If > you're going to seriously work on filter development, building a corpus > of 10000-50000 messages (half spam/half non-spam) is not really that > much work. If you don't get enough spam, creating multi-technique > spamtraps (web, usenet, replying to spam) is pretty easy. And who > doesn't get thousands of non-spam every week? ;-) Yep. The primary reason I released this, was to provide a good, big corpus for academic testing of filter systems; it allows results to be compared between filters using a known corpus. For SpamAssassin development, everyone has to maintain their own corpus. --j.