StanfordMLOctave/machine-learning-ex6/ex6/easy_ham/1488.6694aa49b517656806a631...

From craig@hughes-family.org  Mon Sep  2 13:12:49 2002
Return-Path: <craig@hughes-family.org>
Delivered-To: yyyy@localhost.netnoteinc.com
Received: from localhost (localhost [127.0.0.1])
	by phobos.labs.netnoteinc.com (Postfix) with ESMTP id 7600944161
	for <jm@localhost>; Mon,  2 Sep 2002 07:38:06 -0400 (EDT)
Received: from phobos [127.0.0.1]
	by localhost with IMAP (fetchmail-5.9.0)
	for jm@localhost (single-drop); Mon, 02 Sep 2002 12:38:06 +0100 (IST)
Received: from blount.mail.mindspring.net (blount.mail.mindspring.net
    [207.69.200.226]) by dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id
    g81B83Z21386 for <jm@jmason.org>; Sun, 1 Sep 2002 12:08:03 +0100
Received: from user-1120fqe.dsl.mindspring.com ([66.32.63.78]
    helo=belphegore.hughes-family.org) by blount.mail.mindspring.net with
    esmtp (Exim 3.33 #1) id 17lSaV-0007tD-00; Sun, 01 Sep 2002 07:08:16 -0400
Received: from belphegore.hughes-family.org (belphegore.hughes-family.org
    [10.0.240.200]) by belphegore.hughes-family.org (Postfix) with ESMTP id
    BEB2D3C545; Sun,  1 Sep 2002 04:08:14 -0700 (PDT)
Date: Sun, 1 Sep 2002 04:08:14 -0700 (PDT)
From: Craig R Hughes <craig@hughes-family.org>
Reply-To: craig@stanfordalumni.org
To: Daniel Quinlan <quinlan@pathname.com>
Cc: "Craig R.Hughes" <craig@deersoft.com>,
	Justin Mason <jm@jmason.org>,
	SpamAssassin Developers <spamassassin-devel@lists.sourceforge.net>
Subject: Re: [SAdev] results of scorer evaluation
In-Reply-To: <yf27ki6cco2.fsf@proton.pathname.com>
Message-Id: <Pine.LNX.4.44.0209010357090.5726-100000@belphegore.hughes-family.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Pyzor: Reported 0 times.
X-Spam-Status: No, hits=-1.7 required=7.0
	tests=EMAIL_ATTRIBUTION,FORGED_RCVD_TRAIL,IN_REP_TO,
	      SPAM_PHRASE_00_01,USER_AGENT_PINE
	version=2.40-cvs
X-Spam-Level:

Daniel Quinlan wrote:

DQ> Before we release, it'd be great if someone could test a few
DQ> additional score ranges.  Maybe we can lower FPs a bit more.  :-)

I don't think there's much more room for lowering FPs left which the GA can
achieve.  Remember, also, that the AWL will reduce FPs, but its effects aren't
factored in to the GA scores.

The work currently being done on the GA, and comparing different methods of
doing the score-setting, is very worthwhile, and extremely useful; however, we
really ought to get a release out, since 2.31 is getting decreasingly useful as
time goes on.

The FP/FN rate of 2.40 with pretty well *any* score-setting mechanism will be
better than 2.31 -- we can continue with adjusting how the scores are set on the
2.41 or 2.50 branches.

DQ> Something like:
DQ>
DQ>   for (low = -12; low <= -4; low += 2)
DQ>     for (high = 2; high <= 6; high += 2)
DQ>       evolve

You could just allow low and high to be evolved by the GA (within ranges); I'd
be enormously surprised if it didn't end up with low=-12 and high=+6, since
that'd give the GA the broadest lattitude in setting individual scores.  The
issue with fixing low and high is not one of optimization, but rather one of
human-based concern that individual scores larger than about +4 are dangerous
and liable to generate FPs, and individual scores less than -8 are dangerous and
liable to be forged to generate FNs.

DQ> Maybe even add a nybias loop.

Adding an nybias loop is not worthwhile -- changing nybias scores will just
alter the evaluation function's idea of what the FP:FN ratio should be.

DQ> > AFAIK there's nothing major hanging out waiting to be checked in
DQ> > on b2_4_0 is there?
DQ>
DQ> Nope.

Great!

DQ> > I'll be on IM most of today, tomorrow, and monday while cranking
DQ> > on the next Deersoft product release (should be a fun one).  Hit
DQ> > me at:
DQ> >
DQ> > AIM: hugh3scr
DQ> > ICQ: 1130120
DQ> > MSN: craig@stanfordalumni.org
DQ> > YIM: hughescr
DQ>
DQ> We've been hanging out on IRC at irc.rhizomatic.net on #spamassassin
DQ> (the timezone difference gets in the way, though).

I've been searching for that, but I guess the details of where the channel was
got lost in the shuffle.

C