I am having trouble setting up my personal SpamAssassin rules. My problem: I get a lot of Russian spam with Cyrillic letters, much of it in UTF-8, so filtering on a charset alone is not sufficient. Instead, I want to match a few typical Russian letters, e.g. (д|ж|з|и|й).
I tried the pattern /(д|ж|з|и|й)/i as well as /(\xd0\xb4|\xd0\xb6|\xd0\xb7|\xd0\xb8|\xd0\xb9)/i (these two patterns should match the same thing, right?) in a Subject search:
header CYRILLIC_LETTER_PRESENT Subject =~/(д|ж|з|и|й)/i
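As a sanity check on whether the two patterns really describe the same bytes, here is a minimal Python sketch (Python is used only for illustration and is not part of my SpamAssassin setup). It encodes each letter as UTF-8 and compares the result with the escaped byte sequences from the second pattern. The helper name `utf8_bytes_match` is made up for this example. Whether SpamAssassin matches the rule against raw bytes or against decoded characters is a separate question that this check does not answer.

```python
# Letters from the first pattern and the byte escapes from the second one.
letters = ["д", "ж", "з", "и", "й"]
escaped = [b"\xd0\xb4", b"\xd0\xb6", b"\xd0\xb7", b"\xd0\xb8", b"\xd0\xb9"]


def utf8_bytes_match(chars, byte_seqs):
    """Return True if every character's UTF-8 encoding equals the given bytes."""
    return all(c.encode("utf-8") == b for c, b in zip(chars, byte_seqs))


# Prints whether the literal letters and the \x escapes are byte-identical.
print(utf8_bytes_match(letters, escaped))
```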
Result: the UTF-8 spam is still coming through. I analyzed the messages that got through, and all of them have a similar structure. The relevant part of the source of one example spam mail looks as follows:
Subject: =?UTF-8?B?0KLQtdCx0LUg0L/QvtC90YDQsNCy0LjRgtGM0YHRjyEg0J/QvtC60LDQt9GL?= =?UTF-8?B?0LLQsNGOINC+0YLQu9C40YfQvdGL0Lkg0LLQsNGA0LjQsNC90YIg0L/QvtC7?= =?UTF-8?B?0YPRh9C10L3QuNGPINC00L7RhdC+0LTQsCEg0J/RgNC+0YHRgtC+0Lkg0Lgg?= =?UTF-8?B?0YDQtdC30YPQu9GM0YLQsNGC0LjQstC90YvQueKAiyE=?=
MIME-Version: 1.0
Date: Wed, 8 Mar 2017 06:57:11 +0100
From: =?UTF-8?B?0KDQsNC00LjQuSDQn9C40YjRgg==?= <radiypisht140@zarabotokfm8.ru>
Sender: radiypisht140@zarabotokfm8.ru
Message-ID: <904499458.39893@zarabotokfm8.ru>
X-Priority: 3
List-Unsubscribe: <http://ie8qrshyns.zarabotokfm8.ru/uns/tFRyGZzisv/58dhKEk2im53c/DBetz>
Content-Type: multipart/alternative;
boundary="291e4fd846a7aa548d279e9eb1f199e9_1"
--291e4fd846a7aa548d279e9eb1f199e9_1
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: base64
....encoded....body....
--291e4fd846a7aa548d279e9eb1f199e9_1
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: base64
....2nd(?)....encoded....body....
--291e4fd846a7aa548d279e9eb1f199e9_1--
I searched and found only one somewhat useful resource: http://shallowsky.com/blog/programming/decoding-email-headers.html
So this subject uses RFC 2047 encoded words: =?UTF-8?B?msg_subject?= =?UTF-8?B?msg_subject2?= [...]. This tells us that the subject uses the UTF-8 charset and base64 encoding (compare http://www.ietf.org/rfc/rfc2047.txt).
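To illustrate what the RFC 2047 encoding does to the subject, here is a small Python sketch using the standard library's `email.header.decode_header`. It takes the first two encoded words of the example subject, decodes them, and runs a character-class version of my pattern against both the raw and the decoded form. The helper name `decode_rfc2047` is made up for this example, and this only shows the effect of the encoding; it says nothing about what SpamAssassin does internally.

```python
import re
from email.header import decode_header


def decode_rfc2047(raw):
    """Decode RFC 2047 encoded words in a header value into a plain string."""
    parts = []
    for payload, charset in decode_header(raw):
        if isinstance(payload, bytes):
            # Encoded words come back as bytes plus their declared charset.
            parts.append(payload.decode(charset or "ascii", errors="replace"))
        else:
            parts.append(payload)
    return "".join(parts)


# First two encoded words of the example Subject header.
raw_subject = (
    "=?UTF-8?B?0KLQtdCx0LUg0L/QvtC90YDQsNCy0LjRgtGM0YHRjyEg0J/QvtC60LDQt9GL?= "
    "=?UTF-8?B?0LLQsNGOINC+0YLQu9C40YfQvdGL0Lkg0LLQsNGA0LjQsNC90YIg0L/QvtC7?="
)
pattern = re.compile("[джзий]")

decoded = decode_rfc2047(raw_subject)
print("raw matches:    ", bool(pattern.search(raw_subject)))
print("decoded matches:", bool(pattern.search(decoded)))
```

The raw header consists only of ASCII characters, so a pattern looking for Cyrillic letters (or their UTF-8 bytes) can only match after the encoded words have been decoded.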
Apparently, SpamAssassin is not decoding this (properly), and I have not found any way to get this working. I also found this site: https://dropbear.xyz/2007/08/07/filtering-base64-encoded-spam/
But it does not help me, as it only describes how to filter base64-encoded strings that are long enough. Since I am looking for single characters, I cannot use this approach.
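The reason single characters are hard to find in base64 text can be sketched in a few lines of Python: base64 works on groups of three bytes, so the same letter produces different output depending on how many bytes precede it. The sample strings and the helper `b64` are made up for this illustration.

```python
import base64


def b64(text):
    """Base64-encode the UTF-8 bytes of a string and return ASCII text."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


# The same Cyrillic letter preceded by 0, 1, or 2 ASCII bytes.
samples = ["д", "aд", "aaд"]
encodings = [b64(s) for s in samples]

for sample, encoded in zip(samples, encodings):
    print(repr(sample), "->", encoded)
```

Because the letter's two bytes fall at a different offset within the three-byte groups each time, there is no single base64 substring that reliably stands for "д".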
Am I missing something? Thanks for your help!
edit: I also tried a rawbody rule, because according to the docs it should be applied after the base64 encoding has been decoded:
rawbody CYRILLIC_LETTER_PRESENT /(д|ж|з|и|й)/i
This did not work for me either, although it should search the whole body, which is full of Cyrillic letters.
edit2: I tried to investigate the problem further. When I test textcat with spamassassin -D textcat -t spamtest, it tells me that it "can't determine language uniquely enough".
Moreover, I get the following result in the end:
X-Spam-Flag: YES
X-Spam-Level: *******
X-Spam-Status: Yes, score=7.3 required=3.0 tests=HTML_FONT_LOW_CONTRAST,
HTML_MESSAGE,LOCAL_CYRILLIC,RDNS_NONE,SPF_SOFTFAIL,T_DKIM_INVALID
autolearn=no autolearn_force=no version=3.4.0
So it looks like it works: my rule, here called LOCAL_CYRILLIC, fires as intended. BUT the problem is that this very mail originally went through without being recognized as spam, even though the same rule was already present in the config file. I then forwarded the same mail to myself again, and the source of the forwarded email looks like this:
X-Spam-Level: **
X-Spam-Status: No, score=2.7 required=3.0 tests=LOCAL_CYRILLIC,
RCVD_IN_DNSWL_MED autolearn=no autolearn_force=no version=3.4.0
So there seems to be a difference between running the test locally on a file and scanning an actual incoming email. Why? I always restart SpamAssassin with systemctl restart spamassassin. I checked with systemctl status spamassassin and everything looks fine; spamd is restarted as well, as it should be. The status output also shows the following for the forwarded email:
spamd: clean message (2.7/3.0) for spamd:5555 in 6.0 seconds, 8371 bytes.
spamd: result: . 2 - LOCAL_CYRILLIC,RCVD_IN_DNSWL_MED scantime=6.0,size=8371,user=spamd,uid=5555,required_score=3.0,[...]