Return-Path: gward@python.net
Delivery-Date: Mon Sep 9 22:01:10 2002
From: gward@python.net (Greg Ward)
Date: Mon, 9 Sep 2002 17:01:10 -0400
Subject: [Spambayes] python.org email harvesting ready to roll
Message-ID: <20020909210110.GA2224@cthulhu.gerg.ca>
[followups to spambayes@python.org please, unless you're specifically
concerned about some particular bit of email policy for python.org]
OK, after much fiddling with and tweaking of /etc/exim/exim4.conf and
/etc/exim/local_scan.py on mail.python.org, I am fairly confident that
I can start harvesting all incoming email at a moment's notice. For the
record, here's how it all works:
* exim4.conf works almost exactly the same as before if the file
/etc/exim/harvest does not exist. That is, any "junk mail
condition" that can be detected by Exim ACLs (access control lists)
is handled entirely in exim4.conf: the message is rejected before it
ever gets to local_scan.py. This covers such diverse cases as
"message from known spammer" (reject after every RCPT TO command),
"no message-id header", and "8-bit chars in subject" (both rejected
after the message headers/body are read).
The main things I have changed in the absence of /etc/exim/harvest
are:
- don't check for 8-bit chars in "From" header -- the vast
majority of hits for this test were bounces from some
Asian ISP; the remaining hits should be handled by SpamAssassin
- do header sender verification (i.e. ensure that there's a
verifiable email address in at least one of "From", "Reply-to",
and "Sender") as late as possible, because it requires DNS
lookups which can be slow (and can also make messages that
should have been rejected merely be deferred, if those DNS
lookups time out)
* if /etc/exim/harvest exists, then the behaviour of all of those
ACLs in exim4.conf suddenly changes: instead of rejecting recipients
or messages, they add an X-reject header to the message. This
header is purely for internal use; it records the name of the folder
to which the rejected message should be saved, and also gives the
SMTP error message which should ultimately be used to reject
the message.
Thus, those messages will now be seen by local_scan.py, which now
looks for the X-reject header. If found, it uses the folder name
specified there to save the message, and then rejects it with the
SMTP error message also given in X-reject. (Currently X-reject is
retained in saved messages.)
If a message was not tagged with X-reject, then local_scan.py
runs the usual virus and spam checks. (Namely, my homebrew
scan for attachments with filenames that look like Windows
executables, and a run through SpamAssassin.) The logic is
basically this:
    if virus:
        folder = "virus"
    else:
        run through SpamAssassin
        if score >= 10.0:
            folder = "rejected-spam"
        elif score >= 5.0:
            folder = "caught-spam"
Finally, local_scan.py writes the message to the designated folder.
By far the biggest folder will be "accepted" -- the server handles
2000-5000 incoming messages per day, of which maybe 100-500 are junk
mail. (Oops, just realized I haven't written the code that actually
saves the message -- d'ohh! Also haven't written anything to
discriminate personal email, which I must do. Sigh.)
* finally, the big catch: waiting until after you've read the message
headers and body to actually reject the message is problematic,
because certain broken MTAs (including those used by some spammers)
don't consider a 5xx after DATA as a permanent error, but keep
retrying. D'ohh. This is a minor annoyance currently, where a fair
amount of stuff is rejected at RCPT TO time. But in harvest mode,
*everything* (with the exception of people probing for open relays)
will be rejected at DATA time. So I have cooked up something called
the ASBL, or automated sender blacklist. This is just a Berkeley DB
file that maps (sender_ip, sender_address) to an expiry time. When
local_scan() rejects a message from (sender_ip, sender_address) --
for whatever reason, including finding an X-reject header added by
an ACL in exim4.conf -- it adds a record to the ASBL, with an expiry
time 3 hours in the future. Meanwhile, there's an ACL in exim4.conf
that checks for records in the ASBL; if there's a record for the
current (sender_ip, sender_address) that hasn't expired yet, we
reject all recipients without ever looking at the message headers or
body.
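For readers who want the per-message decision from the list above in one place, here is a rough sketch. It is not the actual local_scan.py. The function name, its parameters, and the idea of passing the spam check in as a callable are all made up for illustration, and the fall-through to "accepted" is an assumption based on the folder name mentioned above. The X-reject header format isn't spelled out in this message, so the sketch takes the folder name as already extracted.

```python
from typing import Callable, Optional

# Thresholds as given in the message; everything else here is hypothetical.
REJECT_SCORE = 10.0
CAUGHT_SCORE = 5.0


def choose_folder(
    x_reject_folder: Optional[str],
    has_virus: bool,
    spam_score: Callable[[], float],
) -> str:
    """Pick the folder a message would be saved to.

    x_reject_folder: folder name taken from an X-reject header, or None
        if no ACL tagged the message.
    has_virus: result of the attachment-filename scan.
    spam_score: called only when needed, standing in for the
        SpamAssassin run.
    """
    # A message tagged by an ACL skips the virus and spam checks.
    if x_reject_folder is not None:
        return x_reject_folder
    if has_virus:
        return "virus"
    score = spam_score()
    if score >= REJECT_SCORE:
        return "rejected-spam"
    if score >= CAUGHT_SCORE:
        return "caught-spam"
    # Assumed default: nothing flagged the message.
    return "accepted"
```

For example, choose_folder(None, False, lambda: 7.2) picks "caught-spam", while any message carrying an X-reject folder goes straight to that folder without being scored.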
The downside of this from the point of view of corpus collection is
that if some jerk is busily spamming *@python.org, one SMTP
connection per address, we will most likely only get one copy. This
is a win if you're just thinking about reducing server load and
bandwidth, but I'm not sure if it's helpful for training spam
detectors. Tim?
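To make the ASBL behaviour and its one-copy side effect concrete, here is a minimal sketch. It assumes an in-memory dict in place of the Berkeley DB file, and the class and method names are invented for illustration. Only the (sender_ip, sender_address) key and the 3-hour expiry come from the description above.

```python
import time
from typing import Dict, Optional, Tuple

ASBL_TTL_SECONDS = 3 * 60 * 60  # "3 hours in the future", per the message

Key = Tuple[str, str]  # (sender_ip, sender_address)


class SenderBlacklist:
    """In-memory stand-in for the Berkeley DB file (an assumption)."""

    def __init__(self, ttl: float = ASBL_TTL_SECONDS) -> None:
        self.ttl = ttl
        self._expiry: Dict[Key, float] = {}

    def record_rejection(
        self, sender_ip: str, sender_address: str, now: Optional[float] = None
    ) -> None:
        """Add or refresh a record when a message is rejected at DATA time."""
        now = time.time() if now is None else now
        self._expiry[(sender_ip, sender_address)] = now + self.ttl

    def is_listed(
        self, sender_ip: str, sender_address: str, now: Optional[float] = None
    ) -> bool:
        """The check an ACL would make before accepting any recipient."""
        now = time.time() if now is None else now
        key = (sender_ip, sender_address)
        expiry = self._expiry.get(key)
        if expiry is None:
            return False
        if expiry <= now:
            # Record has expired: drop it and let the sender through again.
            del self._expiry[key]
            return False
        return True
```

With this in place, the first message from a given (sender_ip, sender_address) pair is read, saved and rejected after DATA. Every later connection from the same pair inside the 3-hour window is turned away at RCPT TO, so its message body is never seen or saved.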
Happy harvesting --
Greg
--
Greg Ward <gward@python.net> http://www.gerg.ca/
Budget's in the red? Let's tax religion!
-- Dead Kennedys