Low Scoring Spams

From Roaring Penguin
Revision as of 19:50, 13 January 2017 by JohnMertz (talk | contribs)


As of the time of writing (Jan. 2017), we have noticed a large increase in the number of low-scoring spam messages. We are always working to make adjustments to address the latest spam trends, but these messages appear to present a special case. The vast majority of them are placed in Pending, which we do see as the filter doing its job, but it is understandably frustrating for users to see a high volume of "Pending" content. The reasons why they don't score higher, and what can be done about it, are covered here.

Function of the Pending trap

To begin the discussion of the Pending message load, it is best to first describe how users should think about these notifications. The vast majority of messages in the Pending Quarantine are expected to be spam. Messages are placed there when the system has determined that they are probably spam, but is not confident enough to reject them automatically, as discussed below.
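As a rough illustration only (this is not CanIt's code, and the threshold values and names are invented for the example), the role of the Pending trap can be sketched as a three-way decision on a message's total score:

```python
# Hypothetical thresholds for illustration; real values are configurable.
PENDING_THRESHOLD = 5.0       # at or above: held in the Pending quarantine
AUTO_REJECT_THRESHOLD = 20.0  # at or above: rejected without user review


def disposition(score):
    """Return where a message with the given total score would end up."""
    if score >= AUTO_REJECT_THRESHOLD:
        return "reject"
    if score >= PENDING_THRESHOLD:
        return "pending"
    return "accept"


for s in (1.2, 7.5, 25.0):
    print(s, disposition(s))
```

The messages discussed in this article fall into the middle band: high enough to be held, not high enough to be rejected outright.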

Users actively rejecting these messages is absolutely NOT essential. When a message is rejected we use that information to train the Bayesian scanner for content analysis. Given that these messages are essentially all already hitting the maximum Bayes score, training them further does not have a significant impact except to reinforce that determination.

Permanently blocking senders or domains is also not typically very effective, given that spammers will rarely use the same address twice. The sources which do use a consistent sending address are generally things like mailing lists and retailers, which have an unsubscribe process. Block rules may seem easier than unsubscribing, but they are not best practice: new mailing lists are compromised every day, so it is better to have an address actually removed from a list than to block the sender.

Users should focus on using these notifications as a way to recover false-positives, i.e. if they are anticipating a reply or a new message from a contact, they can quickly scan through the list for it.

Increase in overall spam volumes

This increase in spam is correlated with an actual increase in overall mail flow. In other words, it is not the case that messages which would have previously been auto-rejected are now being put in pending; these are a new class of messages which are much more difficult to filter out. Many of these contain content like fake news or promotional material, but nothing that is outright malicious from the scanner's perspective.

Why these messages are not automatically rejected

Simply put, from the perspective of the scanner, there is not much that is actually "spammy" about these messages.

As mentioned, the Bayesian content analysis is hitting almost every single one of these messages, meaning that the system does recognize the actual content of these messages as being spammy. This is represented by the percentage next to the score for any given incident. That percentage is the level of confidence that the content is spam, and we apply a score based on it, as defined in Rules->Bayes Settings. However, while Bayesian analysis can very reliably notice spammy content, it is also prone to false-positives, and thus we do not give it the power to auto-reject any messages on its own.
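To make the idea concrete, here is a minimal sketch of a confidence-to-score mapping. The bands, scores and auto-reject threshold are all hypothetical and are not the values found in Rules->Bayes Settings; the point is only that even the highest Bayes band stays below the auto-reject threshold.

```python
# Hypothetical (minimum confidence, score) bands, highest band first.
BAYES_BANDS = [
    (0.99, 4.5),
    (0.95, 3.5),
    (0.80, 2.0),
    (0.50, 0.5),
]
AUTO_REJECT_THRESHOLD = 20.0  # hypothetical


def bayes_score(probability):
    """Map a Bayes spam confidence (0.0 to 1.0) to the score it contributes."""
    for minimum, score in BAYES_BANDS:
        if probability >= minimum:
            return score
    return 0.0


# The largest score Bayes can contribute on its own.
max_bayes = max(score for _, score in BAYES_BANDS)

print(bayes_score(0.997))
print(max_bayes < AUTO_REJECT_THRESHOLD)
```

With values like these, a message that hits nothing but the Bayes test can reach Pending but can never be auto-rejected.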

To human eyes a particular incident may seem obviously spammy, but this determination is extremely subjective for this type of content. For example, while one user may not be interested in the latest fashion sales or an article about improving their SEO, many users are. This can be seen by reviewing the list of accepted (Quarantine->Non-Spam) messages, the content of which is in many cases practically indistinguishable from items that other users have rejected (Quarantine->Spam).

Many, many items are hitting the Bayes analysis exclusively. Each incident in the CanIt WebUI can be inspected by clicking on the link in the Date column from the Quarantine page, or by clicking the "See Incident Details" link at the top of the message contents. We are extremely transparent about how we filter and include a comprehensive list of all tests that were hit when the messages came in at the bottom of the incident report.

Aside from spammy content, we have a wide array of other scans for bad senders, IPs, domains, links, attachments, keywords, meta-data, reputation lists, legitimacy checks and so on. In most cases, these new messages hit few or none of these tests. They are being sent from addresses at domains that are owned by the sender, from properly configured mail servers, with no malicious content, no bad links, no obviously profane or otherwise objectionable content, and no information that is easily identifiable as misleading. As previously stated, spammers constantly change sending addresses, so their reputation is generally fine.

What can be done to automatically reject these messages

Additional rules can be created in order to push specific messages over the auto-reject threshold. However, any additional rule (excluding those with a negative score) will increase the chance of false-positives. Roaring Penguin provides a reliable set of rules which effectively quarantine or reject mail for a general use-case while making false-positives extremely rare. Because we cannot determine the needs of each individual user or organization, we provide tools that allow users and administrators to adjust those settings to a level of strictness that meets their individual needs, and we are happy to support them in doing so.
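The trade-off can be sketched with a toy additive score. Everything here is hypothetical (the rule names, the scores and the threshold), and it assumes a message's total is simply the sum of the rules it hits:

```python
AUTO_REJECT_THRESHOLD = 20.0  # hypothetical


def total_score(hits):
    """Sum the scores of every rule a message hit."""
    return sum(hits.values())


# Hypothetical rule hits for an unwanted message and a wanted newsletter.
spam = {"bayes": 4.5, "other_test": 2.0}
wanted_newsletter = {"bayes": 4.5}

# A broad, heavily weighted custom rule that both messages happen to match.
extra_rule = {"custom_subject_rule": 16.0}

for name, hits in (("spam", spam), ("newsletter", wanted_newsletter)):
    before = total_score(hits)
    after = total_score({**hits, **extra_rule})
    print(name, before, after, after >= AUTO_REJECT_THRESHOLD)
```

In this example the new rule pushes the spam over the threshold, but it does the same to the newsletter that another user wanted, which is exactly the false-positive risk described above.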

A lot of these messages originate from countries which likely don't represent legitimate senders for a given organization, but the default configuration for CanIt cannot make assumptions in this regard. We have clients all over the world who themselves have a diverse range of contacts. A simple action to take is to use our Rules->Countries feature to block messages from countries that you do not anticipate any legitimate mail from. However, this should be used with caution, as the nature of the internet means that many messages will not geolocate where you expect. It's also noteworthy that historically up to 70% of all spam originated from the US. That percentage appears to be much lower for these specific messages but is still significant. Alternatively, you can block Top Level Domains (TLDs) by using Rules->Domains and entering the TLD with a dot prefix, e.g. .ru
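The dot prefix can be read as a suffix match on the sender's domain. The following sketch shows that reading; it is an assumption made for illustration, with a hypothetical function name, and does not describe how CanIt evaluates Rules->Domains internally.

```python
def matches_domain_rule(sender_domain, rule):
    """Return True if sender_domain is covered by a domain rule.

    Assumed semantics: a rule starting with a dot (e.g. ".ru") matches any
    domain ending in that suffix; any other rule must match exactly.
    """
    domain = sender_domain.lower().rstrip(".")
    rule = rule.lower()
    if rule.startswith("."):
        return domain.endswith(rule)
    return domain == rule


print(matches_domain_rule("mail.example.ru", ".ru"))   # True
print(matches_domain_rule("example.run", ".ru"))       # False
print(matches_domain_rule("example.com", "example.com"))  # True
```

Keeping the leading dot in the comparison is what stops a rule for .ru from also catching unrelated domains such as example.run.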

Aside from this, we provide a powerful suite of other tools to allow you to block messages broadly or very specifically. If you see any common factors among these messages, tools such as Rules->Custom Rules can be used to increase the score of all future messages which meet those criteria, including content in the Subject, Body, Headers, and so on (or any combination of these). We do create and monitor these rules when we notice such identifiable behaviour, but are also happy to address reports from our users.
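Conceptually, a custom rule pairs a message field with a pattern and a score to add. The sketch below uses Python's standard email and re modules to show the idea; the rule list, patterns and scores are hypothetical, and this is not the format used by Rules->Custom Rules.

```python
import re
from email.message import EmailMessage

# Hypothetical rules: (header name, pattern, score added on a match).
RULES = [
    ("Subject", re.compile(r"limited[- ]time offer", re.IGNORECASE), 3.0),
    ("From", re.compile(r"@.*\.example$", re.IGNORECASE), 2.0),
]


def custom_rule_score(msg):
    """Add up the scores of every hypothetical rule the message matches."""
    score = 0.0
    for header, pattern, points in RULES:
        value = str(msg.get(header, ""))
        if pattern.search(value):
            score += points
    return score


msg = EmailMessage()
msg["Subject"] = "Limited-time offer on SEO services"
msg["From"] = "promo@news.example"
print(custom_rule_score(msg))
```

Narrow patterns that combine several fields are generally safer than a single broad one, since each added rule also applies to mail that some users want.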

Finally, the thresholds used to determine what messages are allowed into pending are flexible on a per-user basis. This article describes this in more depth. Given that some users genuinely do want many of the messages described here (see again: Quarantine->Non-Spam for messages that users have released), it is not recommended that you be too strict as a default setting. If specific users would like to be very strict, this can be done from their stream directly. As with rule creation, lowering the threshold does increase the chance of false-positives being automatically rejected. The messages can be viewed from Quarantine->Spam and recovered by an administrator if necessary.