Return-Path: guido@python.org
Delivery-Date: Mon Sep 9 15:10:00 2002
From: guido@python.org (Guido van Rossum)
Date: Mon, 09 Sep 2002 10:10:00 -0400
Subject: [Spambayes] test sets?
In-Reply-To: Your message of "Sun, 08 Sep 2002 03:18:49 EDT."
    <LNBBLJKPBEHFEDALKOLCGEOIBCAB.tim.one@comcast.net>
References: <LNBBLJKPBEHFEDALKOLCGEOIBCAB.tim.one@comcast.net>
Message-ID: <200209091410.g89EA0428928@pcp02138704pcs.reston01.va.comcast.net>

> I'd prefer to strip HTML tags from everything, but last time I tried
> that it still had bad effects on the error rates in my corpora

Your corpora are biased in this respect though -- newsgroups have a
strong social taboo on posting HTML, but in many people's personal
inboxes it is quite abundant.

Getting a good ham corpus may prove to be a bigger hurdle than I
thought! My own saved mail doesn't reflect what I receive, since I
save and throw away selectively (much more so than in the past :-).

> > multipart/mixed
> >     text/plain (brief text plus URL(s))
> >     text/html (long HTML copied from website)
>
> Ah! That explains why the HTML tags didn't get stripped. I'd again
> offer to add an optional argument to tokenize() so that they'd get
> stripped here too, but if it gets glossed over a third time that
> would feel too much like a loss <wink>.

I'll bite. Sounds like a good idea to strip the HTML in this case;
I'd like to see how this improves the f-p rate on this corpus.

--Guido van Rossum (home page: http://www.python.org/~guido/)
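
The idea discussed above, an optional argument to tokenize() that strips HTML tags, applied to every text part of a multipart/mixed message, can be sketched roughly as follows. This is not the Spambayes code: the strip_html parameter name, the tokenize_message() helper, the regex-based tag removal, and the whitespace tokenization are all assumptions made for illustration. Only the standard-library email package calls are real.

```python
import re
from email import message_from_string

# Crude tag matcher (an assumption for this sketch; a real tokenizer
# would likely need something more robust than a regex).
TAG_RE = re.compile(r"<[^>]+>")


def tokenize(text, strip_html=False):
    """Return lowercase whitespace-separated tokens.

    strip_html is a hypothetical optional argument: when true, anything
    that looks like an HTML tag is replaced by a space before splitting.
    """
    if strip_html:
        text = TAG_RE.sub(" ", text)
    return text.lower().split()


def tokenize_message(raw, strip_html=False):
    """Tokenize every text/* part of a (possibly multipart) message."""
    msg = message_from_string(raw)
    tokens = []
    for part in msg.walk():
        # Skip multipart containers and non-text attachments.
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "latin-1"
        text = payload.decode(charset, errors="replace")
        tokens.extend(tokenize(text, strip_html=strip_html))
    return tokens
```

Because walk() visits the text/html part as well as the text/plain part, passing strip_html=True removes the tags from both, which is the case the quoted message describes. Comparing the false-positive rate with and without the flag on the same corpus would show whether stripping helps.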