Return-Path: guido@python.org
Delivery-Date: Sun Sep 8 04:38:47 2002
From: guido@python.org (Guido van Rossum)
Date: Sat, 07 Sep 2002 23:38:47 -0400
Subject: [Spambayes] test sets?
In-Reply-To: Your message of "Sat, 07 Sep 2002 22:55:12 EDT."
    <LNBBLJKPBEHFEDALKOLCOEOBBCAB.tim.one@comcast.net>
References: <LNBBLJKPBEHFEDALKOLCOEOBBCAB.tim.one@comcast.net>
Message-ID: <200209080338.g883clw17223@pcp02138704pcs.reston01.va.comcast.net>

> > But it also identified as spam everything in my inbox that had any
> > MIME structure or HTML parts, and several messages in my saved 'zope
> > geeks' list that happened to be using MIME and/or HTML.
>
> Do you know why? The strangest implied claim there is that it hates MIME
> independent of HTML. For example, the spamprob of 'content-type:text/plain'
> in that pickle is under 0.21. 'content-type:multipart/alternative' gets
> 0.93, but that's not a killer clue, and one bit of good content will more
> than cancel it out.

I reran the experiment (with the new SpamHam1.pik, but it doesn't seem
to make a difference). Here are the clues for the two spams in my
inbox (in hammie.py's output format, which sorts the clues by
probability; the first two numbers are the message number and overall
probability; then line-folded):

66 1.00 S 'facility': 0.01; 'speaker': 0.01; 'stretch': 0.01;
'thursday': 0.01; 'young,': 0.01; 'mistakes': 0.12; 'growth':
0.85; '>content-type:text/plain': 0.85; 'please': 0.85; 'capital':
0.92; 'series': 0.92; 'subject:Don': 0.94; 'companies': 0.96;
'>content-type:text/html': 0.96; 'fee': 0.96; 'money': 0.96;
'8:00am': 0.99; '9:00am': 0.99; '>content-type:image/gif': 0.99;
'>content-type:multipart/alternative': 0.99; 'attend': 0.99;
'companies,': 0.99; 'content-type/type:multipart/alternative':
0.99; 'content-type:multipart/related': 0.99; 'economy': 0.99;
'economy"': 0.99

This has 6 content-types as spam clues, only one of which is related
to HTML, despite there being an HTML alternative (and 12 other spam
clues, vs. only 6 ham clues). This was an announcement of a public
event by our building owners, with a text part that was the same as
the HTML (AFAICT). Its language may be spammish, but the content-type
clues didn't help. (BTW, it makes me wonder about the wisdom of
keeping punctuation -- 'economy' and 'economy"' to me don't seem to
deserve to be counted as two clues.)
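To see why a handful of 0.01 ham clues cannot rescue a message like
this one, here is a minimal sketch. It assumes the per-clue
probabilities are combined Graham-style (product of p over product of p
plus product of 1-p) across every clue listed; the classifier that
produced the numbers above may select or combine clues differently, and
combined_spamprob is a name made up for this illustration.

```python
from math import prod


def combined_spamprob(clue_probs):
    """Combine per-clue spam probabilities, Graham-style (assumed rule)."""
    spam = prod(clue_probs)                  # evidence for spam
    ham = prod(1.0 - p for p in clue_probs)  # evidence for ham
    return spam / (spam + ham)


# Clue probabilities copied from the message 66 listing.
ham_clues = [0.01] * 5 + [0.12]
spam_clues = [0.85] * 3 + [0.92] * 2 + [0.94] + [0.96] * 4 + [0.99] * 10

score = combined_spamprob(ham_clues + spam_clues)
print(f"{score:.2f}")
```

Under the same assumption, combined_spamprob([0.93, 0.01]) comes out
near 0.12, which matches the quoted remark that one bit of good content
more than cancels a 0.93 clue. The trouble in message 66 is the sheer
number of 0.99 clues, not any single one.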

76 1.00 S '(near': 0.01; 'alexandria': 0.01; 'conn': 0.01;
'from:adam': 0.01; 'from:email addr:panix': 0.01; 'poked': 0.01;
'thorugh': 0.01; 'though': 0.03; "i'm": 0.03; 'reflect': 0.05;
"i've": 0.06; 'wednesday': 0.07; 'content-disposition:inline':
0.10; 'contacting': 0.93; 'sold': 0.96; 'financially': 0.98;
'prices': 0.98; 'rates': 0.99; 'discount.': 0.99; 'hotel': 0.99;
'hotels': 0.99; 'hotels.': 0.99; 'nights,': 0.99; 'plaza': 0.99;
'rates,': 0.99; 'rates.': 0.99; 'rooms': 0.99; 'season': 0.99;
'stations': 0.99; 'subject:Hotel': 0.99

Here is the full message (Received: headers stripped), with apologies
to Ziggy and David:

"""
Date: Fri, 06 Sep 2002 17:17:13 -0400
From: Adam Turoff <ziggy@panix.com>
Subject: Hotel information
To: guido@python.org, davida@activestate.com
Message-id: <20020906211713.GK7451@panix.com>
MIME-version: 1.0
Content-type: text/plain; charset=us-ascii
Content-disposition: inline
User-Agent: Mutt/1.4i

I've been looking into hotels. I poked around expedia for availability
from March 26 to 29 (4 nights, wednesday thorugh saturday).

I've also started contacting hotels for group rates; some of the group
rates are no better than the regular rates, and they require signing a
contract with a minimum number of rooms sold (with someone financially
responsible for unbooked rooms). Most hotels are less than responsive...

Radission - Barcelo Hotel (DuPont Circle)
$125/night, $99/weekend

State Plaza hotel (Foggy Bottom; near GWU)
$119/night, $99/weekend

Hilton Silver Spring (Near Metro, in suburban MD)
$99/hight, $74/weekend

Windsor Park Hotel
Conn Ave, between DuPont Circle/Woodley Park Metro stations
$95/night; needs a car

Econo Lodge Alexandria (Near Metro, in suburban VA)
$95/night

This is a hand picked list; I ignored anything over $125/night, even
though there are some really well situated hotels nearby at higher rates.
Also, I'm not sure how much these prices reflect an expedia-only
discount. I can't vouch for any of these hotels, either.

I also found out that the down season for DC Hotels are mid-june through
mid-september, and mid-november through mid-january.

Z.
"""

This one has no MIME structure nor HTML! It even has a
Content-disposition which is counted as a non-spam clue. It got
f-p'ed because of the many hospitality-related and money-related
terms. I'm surprised $125/night and similar aren't clues too. (And
again, several spam clues are duplicated with different variations:
'hotel', 'hotels', 'hotels.', 'subject:Hotel', 'rates,', 'rates.'.)
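A rough sketch of how trimming punctuation would collapse those
variants. The normalize() helper is hypothetical, not part of the
tokenizer that produced the clues above, and it deliberately leaves
anything containing a colon alone so that header-tagged tokens such as
'subject:Hotel' stay distinct (that crude test also skips tokens like
'8:00am').

```python
import string


def normalize(token):
    """Trim leading/trailing punctuation and lowercase (hypothetical helper)."""
    # Crude assumption: a colon marks a header-tagged token; keep it as-is.
    if ":" in token:
        return token
    return token.strip(string.punctuation).lower()


# Variants taken from the two clue listings.
clues = ["hotel", "hotels", "hotels.", "subject:Hotel",
         "rates", "rates,", "rates.", "economy", 'economy"']

distinct = sorted({normalize(t) for t in clues})
print(len(clues), "->", len(distinct))
print(distinct)
```

Nine clues shrink to five here. 'hotel' and 'hotels' still count
separately, since nothing in this sketch does stemming.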

> WRT hating HTML, possibilities include:
>
> 1. It really had to do with something other than MIME/HTML.
>
> 2. These are pure HTML (not multipart/alternative with a text/plain part),
>    so that the tags aren't getting stripped. The pickled classifier
>    despises all hints of HTML due to its c.l.py heritage.
>
> 3. These are multipart/alternative with a text/plain part, but the
>    latter doesn't contain the same text as the text/html part (for
>    example, as Anthony reported, perhaps the text/plain part just
>    says something like "This is an HTML message.").
>
> If it's #2, it would be easy to add an optional bool argument to tokenize()
> meaning "even if it is pure HTML, strip the tags anyway". In fact, I'd like
> to do that and default it to True. The extreme hatred of HTML on tech lists
> strikes me as, umm, extreme <wink>.

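The optional-argument idea in the quoted text could look roughly like
the sketch below. It is only an illustration: this tokenize(), its
strip_html parameter name, and the regex-based tag removal are all
assumptions, and the real tokenizer does far more (header tagging, URL
cracking, 'skip:' tokens).

```python
import re

# Naive tag matcher: anything between '<' and the next '>'.
TAG_RE = re.compile(r"<[^>]*>")


def tokenize(text, strip_html=True):
    """Split text on whitespace, optionally dropping HTML tags first.

    Hypothetical stand-in for the tokenizer discussed in this thread.
    """
    if strip_html:
        text = TAG_RE.sub(" ", text)
    return text.lower().split()


sample = '<tr><td><font face="arial">Hotel rates</font><br>for Thursday</td></tr>'
print(tokenize(sample))
print(tokenize(sample, strip_html=False))
```

With stripping on, only the words survive; with it off, the tokens come
out glued to markup, which is where clues of the '<tr><td' and
'for<br>' kind come from.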
I also looked in more detail at some f-p's in my geeks traffic. The
first one's a doozie (that's the term, right? :-). It has lots of
HTML clues that are apparently ignored. It was a multipart/mixed with
two parts: a brief text/plain part containing one or two sentences, a
mondo weird URL:

http://x60.deja.com/[ST_rn=ps]/getdoc.xp?AN=687715863&CONTEXT=973121507.1408827441&hitnum=23

and some employer-generated spammish boilerplate; the second part was
the HTML taken directly from the above URL. Clues:

43 1.00 S '"main"': 0.01; '(later': 0.01; '(lots': 0.01; '--paul':
0.01; '1995-2000': 0.01; 'adopt': 0.01; 'apps': 0.01; 'commands':
0.01; 'deja.com': 0.01; 'dejanews,': 0.01; 'discipline': 0.01;
'duct': 0.01; 'email addr:digicool': 0.01; 'email name:paul':
0.01; 'everitt': 0.01; 'exist,': 0.01; 'forwards': 0.01;
'framework': 0.01; 'from:email addr:digicool': 0.01; 'from:email
name:<paul': 0.01; 'from:paul': 0.01; 'height': 0.01;
'hodge-podge': 0.01; 'http0:deja': 0.01; 'http0:zope': 0.01;
'http1:[st_rn': 0.01; 'http1:comp': 0.01; 'http1:getdoc': 0.01;
'http1:ps]': 0.01; 'http>1:22': 0.01; 'http>1:24': 0.01;
'http>1:57': 0.01; 'http>1:an': 0.01; 'http>1:author': 0.01;
'http>1:fmt': 0.01; 'http>1:getdoc': 0.01; 'http>1:pr': 0.01;
'http>1:products': 0.01; 'http>1:query': 0.01; 'http>1:search':
0.01; 'http>1:viewthread': 0.01; 'http>1:xp': 0.01; 'http>1:zope':
0.01; 'inventing': 0.01; 'jsp': 0.01; 'jsp.': 0.01; 'logic': 0.01;
'maps': 0.01; 'neo': 0.01; 'newsgroup,': 0.01; 'object': 0.01;
'popup': 0.01; 'probable': 0.01; 'query': 0.01; 'query,': 0.01;
'resizes': 0.01; 'servlet': 0.01; 'skip:? 20': 0.01; 'stems':
0.01; 'subject:JSP': 0.01; 'sucks!': 0.01; 'templating': 0.01;
'tempted': 0.01; 'url.': 0.01; 'usenet': 0.01; 'usenet,': 0.01;
'wrote': 0.01; 'x-mailer:mozilla 4.74 [en] (windows nt 5.0; u)':
0.01; 'zope': 0.01; '#000000;': 0.99; '#cc0000;': 0.99;
'#ff3300;': 0.99; '#ff6600;': 0.99; '#ffffff;': 0.99; '&copy;':
0.99; '&gt;': 0.99; '&nbsp;': 0.99; '&quot;no': 0.99;
'.med': 0.99; '.small': 0.99; '0pt;': 0.99; '0px;': 0.99; '10px;':
0.99; '11pt;': 0.99; '12px;': 0.99; '18pt;': 0.99; '18px;': 0.99;
'1pt;': 0.99; '2px;': 0.99; '640;': 0.99; '8pt;': 0.99; '<!--':
0.99; '</b>': 0.99; '</body>': 0.99; '</head>': 0.99; '</html>':
0.99; '</script>': 0.99; '</select>': 0.99; '</span>': 0.99;
'</style>': 0.99; '</table>': 0.99; '</td>': 0.99; '</td></tr>':
0.99; '</tr>': 0.99; '</tr><tr': 0.99; '<b><a': 0.99; '<base':
0.99; '<body': 0.99; '<br>': 0.99; '<br>&nbsp;': 0.99; '<br><a':
0.99; '<br><span': 0.99; '<font': 0.99; '<form': 0.99; '<head>':
0.99; '<html>': 0.99; '<img': 0.99; '<input': 0.99; '<meta': 0.99;
'<option': 0.99; '<p>': 0.99; '<p>a': 0.99; '<script>': 0.99;
'<select': 0.99; '<span': 0.99; '<style>': 0.99; '<table': 0.99;
'<td': 0.99; '<td>': 0.99; '<td></td>': 0.99; '<td><img': 0.99;
'<tr': 0.99; '<tr>': 0.99; '<tr><td': 0.99; '<tr><td><img': 0.99;
'absolute;': 0.99; 'align="left"': 0.99; 'align=center': 0.99;
'align=left': 0.99; 'align=middle': 0.99; 'align=right': 0.99;
'align=right>': 0.99; 'alt=""': 0.99; 'bold;': 0.99; 'border=0':
0.99; 'border=0>': 0.99; 'color:': 0.99; 'colspan=2': 0.99;
'colspan=2>': 0.99; 'colspan=4': 0.99; 'face="arial"': 0.99;
'font-family:': 0.99; 'font-size:': 0.99; 'font-weight:': 0.99;
'footer': 0.99; 'for<br>': 0.99; 'fucking<br>': 0.99;
'height="1"': 0.99; 'height="16"': 0.99; 'height=1': 0.99;
'height=12': 0.99; 'height=125': 0.99; 'height=17': 0.99;
'height=18': 0.99; 'height=21': 0.99; 'height=4': 0.99;
'height=57': 0.99; 'height=60': 0.99; 'height=8': 0.99;
'hspace=0': 0.99; 'http0:g': 0.99; 'http0:web2': 0.99; 'http1:0':
0.99; 'http1:ads': 0.99; 'http1:d': 0.99; 'http1:page': 0.99;
'http1:site': 0.99; 'http>1:article': 0.99; 'http>1:back': 0.99;
'http>1:com': 0.99; 'http>1:d': 0.99; 'http>1:gif': 0.99;
'http>1:go': 0.99; 'http>1:group': 0.99; 'http>1:http': 0.99;
'http>1:post': 0.99; 'http>1:ps': 0.99; 'http>1:site': 0.99;
'http>1:st': 0.99; 'http>1:title': 0.99; 'http>1:yahoo': 0.99;
'inc.</a>': 0.99; 'jobs!': 0.99; 'normal;': 0.99; 'nowrap': 0.99;
'nowrap>': 0.99; 'nowrap><font': 0.99; 'padding:': 0.99;
'rowspan=2': 0.99; 'rowspan=3': 0.99; 'servlets,': 0.99;
'size=15': 0.99; 'size=35': 0.99; 'skip:< 10': 0.99; 'skip:b 60':
0.99; 'skip:h 110': 0.99; 'skip:h 170': 0.99; 'skip:h 200': 0.99;
'skip:h 240': 0.99; 'skip:h 250': 0.99; 'skip:h 290': 0.99;
'skip:v 40': 0.99; 'solid;': 0.99; 'text=#000000': 0.99; 'to<br>':
0.99; 'type="image"': 0.99; 'type="text"': 0.99; 'type=hidden':
0.99; 'type=image': 0.99; 'type=radio': 0.99; 'type=submit': 0.99;
'type=text': 0.99; 'valign=top': 0.99; 'valign=top>': 0.99;
'value="">': 0.99; 'visibility:': 0.99; 'width:': 0.99;
'width="33"': 0.99; 'width=1': 0.99; 'width=100%': 0.99;
'width=100%>': 0.99; 'width=12': 0.99; 'width=125': 0.99;
'width=130': 0.99; 'width=137': 0.99; 'width=2': 0.99; 'width=20':
0.99; 'width=25': 0.99; 'width=4': 0.99; 'width=468': 0.99;
'width=6': 0.99; 'width=72': 0.99; 'works<br>': 0.99

The second f-p had the same structure (and sender :-); the third f-p
had the same structure and a different sender. Ditto the fifth and
sixth. (Not posting clues, for brevity.)

The fourth was different: plaintext with one very short sentence and a
URL. Clues:

300 1.00 S 'from:email addr:digicool': 0.01; 'http1:news': 0.24;
'from:email addr:com>': 0.32; 'from:tres': 0.50; 'http>1:1114digi':
0.50; 'proto:http': 0.50; 'subject:Geeks': 0.50; 'x-mailer:mozilla
4.75 [en] (x11; u; linux 2.2.14-5.0smp i686)': 0.50; 'take': 0.54;
'bool:noorg': 0.61; 'http0:com': 0.66; 'skip:h 50': 0.83;
'http>1:htm': 0.90; 'subject:Software': 0.96; 'http>1:business':
0.99; 'http>1:local': 0.99; 'subject:firm': 0.99; 'us:': 0.99

The seventh was similar.

I scanned a bunch more until I got bored, and most of them were either
of the first form (brief text with URL followed by quoted HTML from
website) or the second (brief text with one or more URLs).

It's up to you to decide what to call this, but I think these are none
of your #1, #2 or #3 (they're close to #3, but all are multipart/mixed
rather than multipart/alternative).
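For anyone wanting to check which of these shapes a given message has,
here is a small sketch using the standard email package. The sample
message is invented to mimic the first form (short text part plus an
HTML part inside multipart/mixed), and content_types() is a helper
named here for illustration only.

```python
from email import message_from_string
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def content_types(msg):
    """List the content type of the message and each of its subparts."""
    return [part.get_content_type() for part in msg.walk()]


# Made-up message shaped like the first kind of false positive.
outer = MIMEMultipart("mixed")
outer.attach(MIMEText("Short note with a URL: http://example.com/", "plain"))
outer.attach(MIMEText("<html><body><p>Quoted page</p></body></html>", "html"))

parsed = message_from_string(outer.as_string())
print(content_types(parsed))
print(parsed.get_content_subtype() == "alternative")
```

A tokenizer that treats the text/html part as redundant only under
multipart/alternative would see nothing to skip in a message like this,
because the container is multipart/mixed.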

> > So I guess I'll have to retrain it (yes, you told me so :-).
>
> That would be a different experiment. I'm certainly curious to see whether
> Jeremy's much-worse-than-mine error rates are typical or aberrant.

It's possible that the corpus you've trained on is more homogeneous
than you thought.

--Guido van Rossum (home page: http://www.python.org/~guido/)