The big squeeze: closing down the junk e-mail pipe.
Organizations deploy anti-spam technology in a variety of locations and forms. Spam filters come in both client-based and server-based flavors: client-based filters run on end-user machines, while server-based filters can run anywhere from an outside Internet service, to the firewall, to the e-mail server. Spam filter approaches fall into four major categories, with many filters combining the technologies: whitelisting/blacklisting, pattern matching, signature filtering, and natural language processing.
Whitelisting and blacklisting, the most basic level of filtering, work on lists of good e-mail addresses (whitelist) or spammer e-mail addresses or domains (blacklist). Blacklist filters reject any messages originating from or routed through blacklisted addresses or domains, while whitelist filters accept only messages from an address or domain on a user-approved list. Some filtering applications use one or the other, but many combine them.
Whitelisting, or positive filtering, checks incoming e-mail against a list of approved addresses. If the e-mail sender is not on the list, the filter can delete the message, move it to a quarantine folder, or send a challenge e-mail back to the sender. If the sender personally replies to the challenge, the filter assumes there is a real person at the other end and adds the address to the approved list. This option is extremely selective about incoming e-mail, but challenge responses can seriously annoy legitimate senders. It is also susceptible to sophisticated address forging.
E-mail users should be able to add addresses to the whitelist. Most whitelist filters start by building themselves from e-mail addresses found in the user's existing mailbox and address book. Whitelists won't catch spammers who have hijacked known-good addresses, but will catch spammers who haven't. They will also catch e-mail from your mother if she isn't on your whitelist, so users should check the list periodically.
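The whitelist-plus-challenge flow described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the `Whitelist` class, its method names, and the example addresses are hypothetical, and a real filter would also have to send the challenge e-mail and hold the original message.

```python
from dataclasses import dataclass, field


def _norm(address: str) -> str:
    """Compare addresses case-insensitively and ignore stray whitespace."""
    return address.strip().lower()


@dataclass
class Whitelist:
    """Toy whitelist; names here are illustrative, not from any product."""
    approved: set = field(default_factory=set)
    pending: set = field(default_factory=set)  # senders awaiting a challenge reply

    @classmethod
    def from_address_book(cls, addresses):
        # Seed the list from addresses the user already corresponds with.
        return cls(approved={_norm(a) for a in addresses})

    def check(self, sender: str) -> str:
        sender = _norm(sender)
        if sender in self.approved:
            return "deliver"
        # Unknown sender: hold the message and ask for a personal reply.
        self.pending.add(sender)
        return "challenge"

    def challenge_answered(self, sender: str) -> None:
        sender = _norm(sender)
        if sender in self.pending:
            self.pending.remove(sender)
            self.approved.add(sender)


wl = Whitelist.from_address_book(["Boss@Example.com"])
print(wl.check("boss@example.com"))   # deliver
print(wl.check("mom@example.net"))    # challenge
wl.challenge_answered("mom@example.net")
print(wl.check("mom@example.net"))    # deliver
```

Note that the sketch trusts the sender address as given, which is exactly why forged addresses defeat this kind of filter.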
Blacklisting, or negative filtering, compares incoming addresses, subject lines and messages to a blacklist. It intercepts any offending messages and deletes them or moves them into a quarantine folder. For example, common filters include rules for blocking mail with "free" or "cash" in the subject line, as well as shady words we won't mention. Filters can also block certain ISPs or specific addresses. Blacklisting used to be simpler, but must now cope with the ridiculous punctuation spammers put in subject lines. Blacklisting also requires a large number of filters and a good deal of CPU time, and often returns false positives, identifying an innocent message as spam. In fact, blacklists are better at blocking known viruses than spam: e-mail administrators can use them to deny attachments with common virus extensions such as .exe, .bat and .vbs. Blacklisting needs carefully maintained lists to work, since spam programmers are flexible, creative and can turn on a dime. Companies using blacklists can keep a database in-house, though this is labor intensive. Many sign up with a third-party service provider that constantly updates its blacklists for client companies.
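As a rough sketch of such rules, the Python below blocks by sender domain, by subject-line words, and by attachment extension. The blocked domain and the function name are made up for illustration; the subject words and extensions are the ones mentioned above.

```python
import re

# Hypothetical lists; a real deployment would load these from a maintained database.
BLOCKED_DOMAINS = {"spam.example"}
BLOCKED_SUBJECT_WORDS = {"free", "cash"}
BLOCKED_EXTENSIONS = (".exe", ".bat", ".vbs")


def is_blacklisted(sender, subject, attachments=()):
    """Return True if any blacklist rule matches the message."""
    domain = sender.rsplit("@", 1)[-1].lower()
    if domain in BLOCKED_DOMAINS:
        return True
    # Split the subject into plain lowercase words and look for listed ones.
    words = set(re.findall(r"[a-z]+", subject.lower()))
    if words & BLOCKED_SUBJECT_WORDS:
        return True
    # str.endswith accepts a tuple of suffixes.
    return any(name.lower().endswith(BLOCKED_EXTENSIONS) for name in attachments)


print(is_blacklisted("a@spam.example", "hello"))                         # True
print(is_blacklisted("friend@example.com", "Free for lunch on Friday?")) # True (false positive)
print(is_blacklisted("x@example.com", "F.R.E.E c@sh"))                   # False (evades the word list)
print(is_blacklisted("x@example.com", "report", ["invoice.EXE"]))        # True
```

The second and third calls show both weaknesses at once: an innocent subject line is blocked, while a subject disguised with punctuation slips through.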
Pattern matching defines a set of criteria that classify messages as spam. Characteristics include items such as all-capitalized subject lines, frequent spam phrases, and suspicious header lines. Administrators and users can assign point values to individual characteristics (for example, a high value for porn and a lower one for business offers). The filter then marks any message scoring at or above a set threshold as spam. Some systems allow the user to train the software to recognize spam or to exempt messages from spam blocking.
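A point-scoring filter of this kind might look like the following minimal sketch. The rules, point values, and threshold are invented for illustration, and the message is assumed to be a plain dictionary with `subject`, `body`, and `headers` keys.

```python
import re

# (description, points, test) triples; values are illustrative assumptions.
RULES = [
    ("all-caps subject", 2.0, lambda m: m["subject"].isupper()),
    ("frequent spam phrase", 3.0,
     lambda m: re.search(r"act now|limited time", m["body"], re.I) is not None),
    ("missing Date header", 1.5, lambda m: "Date" not in m["headers"]),
]
THRESHOLD = 4.0


def score(message):
    """Add up the points of every rule the message triggers."""
    return sum(points for _, points, test in RULES if test(message))


def is_spam(message):
    return score(message) >= THRESHOLD


spammy = {"subject": "ACT NOW", "body": "Limited time offer", "headers": {"From": "x"}}
shouty = {"subject": "LUNCH?", "body": "Are you around?", "headers": {"From": "y"}}
normal = {"subject": "Quarterly report", "body": "See attached.",
          "headers": {"Date": "Mon, 1 Dec 2003"}}

print(score(spammy), is_spam(spammy))  # 6.5 True
print(score(shouty), is_spam(shouty))  # 3.5 False
print(score(normal), is_spam(normal))  # 0 False
```

Because no single rule reaches the threshold on its own, a message has to look suspicious in more than one way before it is marked, which is the main advantage over a flat blacklist.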
Pattern matching filters often use whitelist/blacklist techniques as well, but depend on more sophisticated technologies like content pattern recognition and flexible content filtering. Typical approaches include:
* Identifying invalid HTML tags: Spammers try to disguise HTML-enabled spam by inserting meaningless content within specific HTML tags.
* Making case-sensitive checks: Another common spammer technique is displaying subject lines exclusively in upper case.
* Practicing intelligent word recognition: To avoid blacklists, spammers will deliberately alter the subject line by adding or removing punctuation, adding nonsense phrases, misspelling words or compressing spaces.
* Blocking MIME content types: Suspicious types include HTML, a perennial spam favorite, and the specific MIME types that some viruses present as.
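Three of the checks in this list (upper-case subjects, disguised words, and MIME types) can be sketched with Python's standard `email` module. The look-alike character map, the watched words, and the function names are assumptions made for the example, and the invalid-HTML check is left out for brevity.

```python
import email
import re

# Hypothetical tables; real filters use much larger, tuned lists.
LOOKALIKES = str.maketrans({"@": "a", "0": "o", "1": "i", "3": "e", "$": "s"})
WATCHED_WORDS = ("free", "cash")
SUSPECT_MIME_TYPES = {"text/html"}


def normalize(subject):
    """Undo common disguises: map look-alike characters, drop everything but letters."""
    text = subject.lower().translate(LOOKALIKES)
    return re.sub(r"[^a-z]", "", text)


def findings(raw_message):
    """Return the list of suspicious traits found in a raw RFC 822 message."""
    msg = email.message_from_string(raw_message)
    subject = msg.get("Subject", "")
    found = []
    letters = [c for c in subject if c.isalpha()]
    if letters and all(c.isupper() for c in letters):
        found.append("all-caps subject")
    squeezed = normalize(subject)
    if any(word in squeezed for word in WATCHED_WORDS):
        found.append("disguised watched word")
    if any(part.get_content_type() in SUSPECT_MIME_TYPES for part in msg.walk()):
        found.append("suspect MIME type")
    return found


spam = "Subject: F.R.E.E C@SH\nContent-Type: text/html\n\n<b>hi</b>\n"
ham = "Subject: Lunch on Friday\n\nSee you then.\n"
print(findings(spam))  # ['all-caps subject', 'disguised watched word', 'suspect MIME type']
print(findings(ham))   # []
```

Matching on the squeezed string is deliberately crude: it also fires on innocent words that merely contain a watched word, so in practice such a finding would feed a score rather than block the message outright.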
Bayesian filters, a type of pattern-matching filter, don't require whitelists or blacklists. They learn from the user's own classification: the user first runs a new Bayesian filter against two folders, one containing wanted mail and the other containing mail the user considers spam. The more messages there are, the better the filter will work. This is just the beginning, since Bayesian spam filters are trainable (auto-adaptive) and will adjust their matches according to subsequent user actions. Bayesian filters examine characteristics such as words in the body of the message, headers, HTML code, word pairs, phrases and meta information. For example, if you are a business owner you may get a good amount of legitimate mail with the word "client." The filter will identify this word as overwhelmingly belonging to your good e-mail. But if you also receive a good deal of spam with "mortgage" in it, the filter will treat that word as a probable sign of spam while still weighing mitigating factors elsewhere in the message. This way, if you really are buying a house and your lender sends you an e-mail, the Bayesian filter won't automatically relegate the message to the spam folder.
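To make the "client" and "mortgage" example concrete, here is a minimal naive Bayes sketch with add-one smoothing. It is a simplified stand-in rather than the algorithm of any particular product: it looks only at single words, and the class name and the six training messages are invented.

```python
import math
import re
from collections import Counter


def tokens(text):
    return re.findall(r"[a-z']+", text.lower())


class BayesFilter:
    """Word-level naive Bayes; a teaching sketch, not a production filter."""

    def __init__(self):
        self.counts = {"ham": Counter(), "spam": Counter()}
        self.messages = {"ham": 0, "spam": 0}

    def train(self, text, label):
        self.counts[label].update(tokens(text))
        self.messages[label] += 1

    def spam_probability(self, text):
        vocab = len(set(self.counts["ham"]) | set(self.counts["spam"]))
        total_msgs = sum(self.messages.values())
        logp = {}
        for label in ("ham", "spam"):
            total_words = sum(self.counts[label].values())
            # Smoothed prior, then one smoothed likelihood term per word.
            lp = math.log((self.messages[label] + 1) / (total_msgs + 2))
            for word in tokens(text):
                lp += math.log((self.counts[label][word] + 1) / (total_words + vocab + 1))
            logp[label] = lp
        # Clamp the difference so math.exp cannot overflow on long messages.
        diff = max(-700.0, min(700.0, logp["ham"] - logp["spam"]))
        return 1.0 / (1.0 + math.exp(diff))


f = BayesFilter()
for text in ("client meeting notes attached",
             "the client approved the proposal",
             "lender documents for your house closing"):
    f.train(text, "ham")
for text in ("lowest mortgage rates act now",
             "refinance your mortgage today",
             "cheap mortgage offer act now"):
    f.train(text, "spam")

for text in ("mortgage rates act now",
             "client meeting notes",
             "mortgage documents from your lender for the house closing"):
    print(f"{f.spam_probability(text):.2f}  {text}")
```

With this toy training set the first message scores well above 0.5 and the other two well below it. The last one contains "mortgage" but is outweighed by words the filter has seen only in wanted mail, which is the mitigating effect described above.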
Similar to blacklisting but more flexible, signature filtering depends on algorithms. An e-mail's signature is built from several different characteristics such as address, content, subject and domain. Signature filters use algorithms to reduce these characteristics to a short character string that uniquely identifies the e-mail. The signature filter captures incoming messages, compares their resulting strings to a database of suspect signatures, and blocks messages whose signatures match known spam. Users can submit new spam addresses directly to the database, and third-party lists include regular updates on changing spam addresses. Database administrators use several validity-checking techniques to guard against false positives, including a requirement that multiple users submit a possible spam message before the database adds a signature.
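The idea can be sketched with a cryptographic hash standing in for the signature algorithm. The choice of SHA-256, the fields that go into the signature, the three-reporter rule, and all names below are assumptions for illustration; real services use their own algorithms and thresholds.

```python
import hashlib
import re
from collections import defaultdict

MIN_REPORTERS = 3  # hypothetical threshold before a signature counts as spam


def signature(sender_domain, subject, body):
    """Reduce a message to a short string, ignoring case and whitespace differences."""
    joined = f"{sender_domain}\n{subject}\n{body}".lower()
    normalized = re.sub(r"\s+", " ", joined).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]


class SignatureDatabase:
    def __init__(self):
        self.reports = defaultdict(set)  # signature -> set of distinct reporters

    def report(self, sig, reporter):
        self.reports[sig].add(reporter)

    def is_known_spam(self, sig):
        return len(self.reports.get(sig, ())) >= MIN_REPORTERS


db = SignatureDatabase()
sig = signature("spam.example", "Cheap loans", "Click  here\nnow")
for user in ("ann", "bob", "bob"):      # a repeat report from bob counts once
    db.report(sig, user)
print(db.is_known_spam(sig))            # False
db.report(sig, "cat")
print(db.is_known_spam(sig))            # True

incoming = signature("SPAM.example", "cheap loans", "Click here now")
print(incoming == sig, db.is_known_spam(incoming))  # True True
```

An exact hash like this one changes completely if the spammer alters a single word, so it illustrates the lookup and the multiple-submitter safeguard rather than the flexibility that practical signature schemes aim for.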
Natural Language Processing (NLP) tries to replicate intuitive human understanding of written information. NLP-based spam filters work by recognizing all probable forms of single words, which means that if a spammer substitutes "mor@gage" for "mortgage," NLP will recognize it anyway. NLP filters also identify phrase and sentence structures and relationships, assign dictionary definitions from context, and can process common sense information. Spam filtering NLP is trainable, which gives it a measure of artificial intelligence.
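Full NLP is well beyond a short example, but the narrow point about recognizing disguised forms of a word can be illustrated with fuzzy string matching from Python's standard `difflib` module. This is a crude stand-in, not NLP: the watched-word list, the 0.8 cutoff, and the function name are assumptions.

```python
import difflib

WATCHED_WORDS = ["mortgage", "refinance"]  # hypothetical vocabulary


def canonical(word, cutoff=0.8):
    """Return the watched word this token most resembles, or None."""
    matches = difflib.get_close_matches(word.lower(), WATCHED_WORDS, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(canonical("mor@gage"))  # mortgage
print(canonical("M0rtgage"))  # mortgage
print(canonical("message"))   # None
```

Similarity on spelling alone says nothing about phrase structure, context or intent, which is where the NLP approach described above goes further.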
For example, Corvigo's intent-based filtering technology uses NLP to analyze an e-mail's intent: (a) the sender wants to sell you something, i.e., commercial e-mail; or (b) the sender doesn't want to sell you something, i.e., non-commercial e-mail such as a message from a boss, a client, or uncle David. It further classifies commercial e-mail as either unwanted junk e-mail or a bulk mailing from a legitimate advertiser. It then sends the desired e-mail on to user inboxes and files the two types of commercial e-mail into separate folders. Users can train the system by rejecting messages from legitimate advertisers or non-commercial senders, and can correct the spam classification of a message if it's something they want to receive.
Most spam filtering technologies include a variety of these techniques. The challenge in building anti-spam features is that there's a lot of money in spamming, and heavyweight spammers employ good programmers to constantly beat the system. Filtering technologies that auto-adapt to spam challenges have the best chance of staying two steps ahead of the spammer's threatening game.
Table 1

| Technology | What it Does |
|---|---|
| Whitelisting/Blacklisting | Checks incoming e-mail against lists of approved users and/or lists of suspected spammers and suspicious domains. Users and IT administrators can keep lists in-house or rely on third-party database services. |
| Pattern matching | Catches spam by using content pattern recognition and content filtering. Bayesian filters use algorithms to assign spam probabilities to incoming content. |
| Signature filtering | Compares known spam elements against subjects, messages, sending ISP, headers and restricted sender names. Often works in conjunction with blacklists. |
| Natural language processing (NLP) | NLP reproduces human interpretation of language. NLP-based spam filters analyze words and phrases to determine message intent, much as a human reader would. |
Jeff Ready is CEO at Corvigo, Inc. (Mountain View, CA)
|Publication:||Computer Technology Review|
|Date:||Dec 1, 2003|