detect spam from random domain names in header, in .procmailrc

Question

Spam from a well-known organization that sells large lots of knockoffs of everything from drones to roombas uses random domains in the From:, Message-ID:, and Reply-To: fields, all different, without even a common trailing xyz.com.
(Nonspam mail tends to share domain names for at least Message-ID: and Reply-To:.)

Can a recipe in ~/.procmailrc detect such spam, to then forward it to a spam folder?

Fancy regexes with named capturing groups?
Something about chaining actions with 'A' or 'a'?
Call procmail recursively, as suggested by its manpage?
A 'filter' to pass the mail's header to a script written in a language with better string processing?

An example: namebrandwigs.com, mysuburbankitchen.com, aliyun.com.

From soumedyfenkoa@namebrandwigs.com  Wed Mar 17 09:27:54 2021
Return-Path: <soumedyfenkoa@namebrandwigs.com>
X-Original-To: ---
Delivered-To: ---
Received: from mysuburbankitchen.com (unknown [5.253.84.113])
        by --- (Postfix) with ESMTP id 332025E236
        for <--->; Wed, 17 Mar 2021 09:27:53 -0500 (CDT)
To: ---
Subject: drone with new features
Message-ID: <75d167a6b7be6548dcb16af2cf729811@sweetwater.com>
Date: Wed, 17 Mar 2021 08:13:03 +0100
From: "Jake Allen" <soumeayfenkoa@namebrandwigs.com>
Reply-To: teogingsilklo@aliyun.com
MIME-Version: 1.0
X-Mailer-Sent-By: 1
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: 8bit
Status: RO
<html>
<head>
</head>
<body>
<span style="display: block; text-align: left;"><span style="display:
block; text-align: left;">Hi,<br /><br /><span>Hope all is well.</span><br
/>We are contacting you today to let you know that we have got the
following new drone, ready to ship to worldwide customers.<span><span><br
/><br /></span></span></span></span><span style="display: block;
text-align: left;"><span style="display: block; text-align:
left;"></span></span><span style="display: block; text-align: left;"><span
style="display: block; text-align: left;"><span style="display: block;
text-align: left;"></span></span></span><span style="display: block;
text-align: left;">Explore new places and amp up your videography skills
with the our F9 4K HD camera drone.
...

Other examples, similarly formatted, but lacking flaggable text in the message body:

a leather massage chair from momentumwatch.com / musicalley.com
a roomba from mtndewkid.com / myhondafitev.com / constructiongear.com
a video projector from hairrehablondon.com / hairocean.com / hotmail.com

tripleee · Answer 1 · 2021-03-19T09:23:10.760

Based on a few limited examples, it's hard to come up with anything specific which would work today and continue to work tomorrow. If your actual question is really "how can I prevent spam from Procmail" the obvious, simple, and well-documented answer to that is "run a full-spectrum spam filter like SpamAssassin and examine its result". Even then, your accuracy will probably never be 100%; but SpamAssassin does a decent job for a tool you just basically configure and forget. It relies extensively on external services which provide dynamic reputation information for IP addresses, URLs, and other network resources used by spammers, so there is in fact a fair amount of action going on behind the scenes.

UsedViaProcmail on the SpamAssassin wiki has more instructions. In brief, once you have installed and configured SpamAssassin, try something like

:0fw
* < 512000
| spamassassin

:0:
* ^X-Spam-Level: \*\*\*\*\*\*\*\*\*\*\*\*\*\*\*
almost-certainly-spam

:0:
* ^X-Spam-Status: Yes
probably-spam

The second colon in :0: is only correct if you are delivering to a mailbox which requires locking (such as an mbox file, but definitely not a Maildir directory; but based on the sample in your question, you seem to be on mbox). If you regularly receive big spam messages, maybe take out the size condition * < 512000 or adjust the number. The SpamAssassin standard Procmail boilerplate includes a lock file which is unnecessary on your personal system and perhaps dubious on shared hosts, and some weird cargo cult voodoo around broken From lines which I believe was never actually correct.
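For comparison, here is a minimal sketch of the last recipe for a Maildir setup, assuming you had a Maildir folder named probably-spam (the folder name is simply reused from above). The trailing slash is what tells Procmail the destination is a Maildir directory, and the lock flag is dropped.

```procmail
# Maildir variant: no second colon, because Maildir delivery
# does not need a lock file.
:0
* ^X-Spam-Status: Yes
probably-spam/
```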

If you want advice which is specific to the samples you provided, please understand that even deeply researched and absolutely truthful facts which could let you discard these specific messages with full confidence that there will be no false positives will be all but useless for handling any other messages, and/or obsolete tomorrow or next week.

tripleee · Accepted Answer · 2021-03-19T09:27:21.640

Here is a Procmail recipe which implements what I think you may be asking.

It uses scoring, which is a slightly obscure but occasionally useful feature. Briefly, we assign a score of 1 if there is a From: header with a domain name (as there would always be), then subtract one from the score for each of the Reply-To: and Message-Id: headers which has the same string after the @. The recipe delivers to suspicious only if the final score is positive, i.e. if neither header matched the From: domain.

:0:
*    1^0 ^From:.*@\/[^@<>   ]+
* $ -1^0 ^Message-Id:.*@$\MATCH\>
* $ -1^0 ^Reply-To:.*@$\MATCH\>
suspicious

I predict that this will have a fairly high rate of false positives, but perhaps it can provide value for you if you get a lot of spam with this particular pattern in it, especially if you can combine it with a whitelist.
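One way the whitelist idea could look is sketched below. It assumes a hypothetical file ~/.procmail-whitelist with one trusted address or domain per line, and it would have to come before the suspicious recipe in your .procmailrc. The file name and the WHITELIST variable are illustrative, not anything Procmail defines.

```procmail
# Hypothetical whitelist file: one trusted address or domain per line.
WHITELIST=$HOME/.procmail-whitelist

# Extract the From: header with formail and look for any whitelist
# entry in it as a fixed, case-insensitive substring. If grep finds
# one, deliver to the default mailbox so that later recipes never
# see the message.
:0:
* ? formail -zxFrom: | grep -Fiq -f "$WHITELIST"
$DEFAULT
```

Because this is a plain substring match, short whitelist entries can match more than intended; full addresses are the safer choice.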

I would still recommend that you check the suspicious folder regularly, and fish out any false positives back to your regular inbox.

Here's a demo run with the sample you provided, with delivery to /dev/null instead of a real folder, just for the demo.

bash$ procmail -m VERBOSE=yes /tmp/procmailrc </tmp/sample 
procmail: [16] Fri Mar 19 09:06:29 2021
procmail: Rcfile: "/tmp/procmailrc"
procmail: Assigning "MAILDIR=/home/tripleee"
procmail: Assigning "MATCH="
procmail: Matched "namebrandwigs.com"
procmail: Score:       1       1 "^From:.*@\/[^@<>  ]+"
procmail: Score:       0       1 "^Message-Id:.*@()namebrandwigs\.com\>"
procmail: Score:       0       1 "^Reply-To:.*@()namebrandwigs\.com\>"
procmail: Assigning "LASTFOLDER=/dev/null"
procmail: Opening "/dev/null"
 Subject: drone with new features
  Folder: /dev/null                            1373

A specific complication is that this does not allow for subdomain hits; it would not be too hard to allow Message-id: <randomjunk@subdomain.example.com> for a sender From: real name <address@example.com> but the opposite scenario is much trickier, because you can't really know in the general case of From: sender <address@many.labels.here> whether the domain name is labels.here (like in the .com and .fr TLDs for example) or many.labels.here (like in the .co.uk, .com.au etc TLDs) or even longer (as the case may be for k12.place.name.us and some prefectures in Japan, etc).
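The easier direction (the other headers may use a subdomain of the From: domain) could be sketched roughly as follows. This is an untested variation on the recipe above, not something from the demo run; it only adds an optional group of extra labels between the @ and the captured domain.

```procmail
# Same scoring as before, but also accept subdomains of the From:
# domain in Message-Id: and Reply-To:.
# The bracket expressions exclude @, <, > and a space; add a literal
# tab inside them if your headers may contain tabs.
:0:
*    1^0 ^From:.*@\/[^@<> ]+
* $ -1^0 ^Message-Id:.*@([^@<> ]*\.)?$\MATCH\>
* $ -1^0 ^Reply-To:.*@([^@<> ]*\.)?$\MATCH\>
suspicious
```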

In some more detail, 1^0 assigns a 1 score for the first hit on the first condition line, and no additional score for additional hits. The \/ token captures the string after it from the matched string, i.e. everything after the last @ sign in the header. The MATCH variable is then used in the following condition lines to refer back to this captured string; the syntax $\MATCH produces a regex-escaped pattern which matches the literal string. The subsequent conditions have a $ flag to tell Procmail to interpolate any variables (i.e. $MATCH) into the condition, and a -1^0 scoring instruction to subtract one for the first hit on the condition, and then nothing if it matches again.
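If the scoring syntax feels too opaque, the same logic can probably be spelled out with plain negated conditions. The sketch below is untested and uses a hypothetical variable name, FROMDOMAIN, to hold the captured domain, since MATCH is overwritten by the next \/ capture.

```procmail
# Capture the From: domain and keep it under our own variable name.
:0
* ^From:.*@\/[^@<> ]+
{ FROMDOMAIN=$MATCH }

# Fire only if a domain was captured and neither Message-Id: nor
# Reply-To: contains it after the @. This is the case the scoring
# version reaches by ending with a positive score.
:0:
* FROMDOMAIN ?? .
* $ ! ^Message-Id:.*@$\FROMDOMAIN\>
* $ ! ^Reply-To:.*@$\FROMDOMAIN\>
suspicious
```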

The man pages explain all these constructs, but can be rather dense; perhaps see also the Procmail quick reference which is even denser, but perhaps also then quicker to read and understand.

I posted this as a separate answer so as not to mix up the content here with my other answer, which basically tries to dissuade you from creating spam filters of your own using just Procmail.
