StanfordMLOctave/machine-learning-ex6/ex6/easy_ham/1829.1c598ff775a4de81c391eb...

Return-Path: tim.one@comcast.net
Delivery-Date: Sun Sep 8 21:56:59 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 08 Sep 2002 16:56:59 -0400
Subject: [Spambayes] All Cap or Cap Word Subjects
In-Reply-To: <3D7B7F11.22376.29256B69@localhost>
Message-ID: <LNBBLJKPBEHFEDALKOLCCEPPBCAB.tim.one@comcast.net>

[Brad Clements]
> Just curious if subject line capitalization can be used as an indicator.
>
> Either the percentage of characters that are caps..
>
> Or, percentage starting with a capital letter (if number of words > xx)
Supply a mod to tokenizer.py and I'll test it (eventually <wink>). Note
that the tokenizer already *preserves* case in subject-line words, because
experiment showed that this was better than folding case away in this
specific context (but experiment also showed, against my expectations,
that preserving case everywhere didn't make a significant difference to
either error rate; the subject line is a special case for this).
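
For anyone who wants to try Brad's idea, here is a rough sketch of the two proposed indicators. It is not taken from tokenizer.py: the function name, the token strings, the `min_words` cutoff (standing in for the unspecified "xx"), and the bucketing into steps of 10 are all assumptions made for illustration.

```python
def subject_caps_features(subject, min_words=3):
    """Return hypothetical capitalization tokens for a subject line.

    All names and token formats here are illustrative assumptions,
    not part of the real Spambayes tokenizer.
    """
    # Only letters count toward the "percentage of caps" measure.
    letters = [c for c in subject if c.isalpha()]
    # Only words that begin with a letter can "start with a capital".
    words = [w for w in subject.split() if w[0].isalpha()]

    tokens = []
    if letters:
        pct_caps = 100 * sum(c.isupper() for c in letters) // len(letters)
        # Round down to a multiple of 10 so the classifier sees a
        # handful of distinct tokens rather than 101 of them.
        tokens.append("subject:caps-pct:%d" % (pct_caps // 10 * 10))

    # Brad's second measure applies only when there are enough words.
    if len(words) > min_words:
        pct_cap_words = 100 * sum(w[0].isupper() for w in words) // len(words)
        tokens.append("subject:capword-pct:%d" % (pct_cap_words // 10 * 10))

    return tokens


if __name__ == "__main__":
    print(subject_caps_features("MAKE MONEY FAST NOW"))
    print(subject_caps_features("Re: All Cap or Cap Word Subjects"))
```

Whether tokens like these help either error rate would still have to be settled by experiment, as with the case-preservation results above.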