GBUdb

Evaluation Ranges

IP statistics in GBUdb are evaluated in two dimensions. This is usually represented graphically with the probability figure on the x axis (horizontally left to right from -1 to +1) and the confidence figure on the y axis (vertically from top to bottom from 0.0 to 1.0).

The envelope for each evaluation range can then be drawn as a collection of points. These ranges are evaluated in a priority sequence so that overlapping ranges can easily be resolved. The priority is (from highest priority to lowest) White, Black, Caution, Undefined. A higher priority range always overrides a lower priority range.
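
The priority rule can be sketched in a few lines of Python. This is only an illustration of "first matching range wins". The rectangular envelopes below are hypothetical and are not the GBUdb defaults, and the names RANGES_BY_PRIORITY and evaluate are invented for the example.

```python
from typing import Callable, List, Tuple

# Each range is a name plus a test over (probability, confidence).
# ASSUMPTION: these rectangles are made up for illustration only;
# the real envelopes come from the SNF configuration.
Range = Tuple[str, Callable[[float, float], bool]]

RANGES_BY_PRIORITY: List[Range] = [
    ("White",   lambda p, c: p <= -0.8 and c >= 0.4),
    ("Black",   lambda p, c: p >= 0.8 and c >= 0.2),
    ("Caution", lambda p, c: p >= 0.4 and c <= 0.4),
]


def evaluate(probability: float, confidence: float) -> str:
    """Return the highest-priority range that contains the point."""
    for name, contains in RANGES_BY_PRIORITY:
        if contains(probability, confidence):
            return name
    return "Undefined"  # no range matched
```

With these made-up envelopes the point (0.9, 0.3) lies inside both Black and Caution, and evaluate(0.9, 0.3) resolves the overlap to Black because Black is checked first.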

Below is an ASCII-art representation of the default GBUdb Range Map. SNFServer writes this map to the <licenseid>_snf_engine_cfg.log file as a debugging aid whenever it interprets a new configuration. The configuration log can be compared with the snf_engine.xml file to locate discrepancies.

Range Map - [W]hite [B]lack [C]aution [  ]Normal


    |-9876543210123456789+|
    |               CCCCCC|0
    |               CCCCCC|0.1
    |                CCCBB|0.2
    |                 CCBB|0.3
    |W                 CBB|0.4
    |W                  BB|0.5
    |W                  BB|0.6
    |WW                 BB|0.7
    |WW                 BB|0.8
    |WW                 BB|0.9
    |WWW                BB|1
    |---------------------|
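
When checking a configuration it can help to read the map programmatically. The following Python sketch parses a map in the layout shown above and looks up the symbol for a given probability and confidence. The function names are invented for this example, and snapping each value to the nearest 0.1 step is an assumption about how the grid should be read, not a statement about how SNF evaluates ranges internally.

```python
RANGE_MAP = """\
|-9876543210123456789+|
|               CCCCCC|0
|               CCCCCC|0.1
|                CCCBB|0.2
|                 CCBB|0.3
|W                 CBB|0.4
|W                  BB|0.5
|W                  BB|0.6
|WW                 BB|0.7
|WW                 BB|0.8
|WW                 BB|0.9
|WWW                BB|1
|---------------------|
"""

SYMBOLS = {"W": "White", "B": "Black", "C": "Caution", " ": "Normal"}


def parse_range_map(text):
    """Return the map rows (confidence 0 to 1), without the border pipes."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells, _, label = line[1:].rpartition("|")
        if label == "":
            continue  # header and footer carry no confidence label
        rows.append(cells)
    return rows


def lookup(rows, probability, confidence):
    """Look up the range name for one (probability, confidence) point."""
    # ASSUMPTION: snap each value to the nearest 0.1 grid step.
    col = round(probability * 10) + 10  # -1.0 -> 0, 0.0 -> 10, +1.0 -> 20
    row = round(confidence * 10)        # 0.0 -> 0, 1.0 -> 10
    return SYMBOLS[rows[row][col]]
```

For example, lookup(parse_range_map(RANGE_MAP), 0.9, 0.5) lands on a "B" cell, while the same probability at confidence 0.0 lands on a "C" cell.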
			

White

IPs that fall in the white range consistently produce good messages. Normally, if the source IP of a message falls in this range, the GBUdb will override any pattern matching rules so that the message will not be tagged as spam. Learning will continue, however, so if a good IP turns bad it will eventually be pushed out of this range and lose that privilege.

Caution

IPs that fall in the caution range are likely to be spam producers, but there is not yet enough confidence to treat them as bad sources (depending upon your system policy). It could be that the first few messages from this IP are unlucky spam from a mixed source that will later produce mostly ham.

In testing it is *almost* always true that if one of the first dozen or so messages from a new IP is spam, then the source is a bad source and any messages that did not match were simply so new that no patterns for them were in the rulebase yet. Early on, our default for the caution range extended all the way to a probability of -0.9, so that if any of the first few messages turned out to be spam the system was highly prejudiced against the source. Unfortunately, this did cause a few false positives in early training. The current default settings are very conservative in order to avoid as many false positives as we can.

Some systems may find that they can re-tune this range to be extremely prejudiced against new IPs with great success. Others will most likely leave this range mapped as it is rather than risk an occasional false positive from a new mixed source.

By default, if a message comes through with an IP source in this range and no pattern match is found, then SNF will produce a 40 result code. This is a unique code associated with the caution range. Filtering systems that translate SNF result codes to weighting schemes may want to choose an alternate weight for messages that are tagged with this code.

Black

IPs that fall in this range consistently produce bad messages. It is extremely unlikely that any legitimate source will fall in this range. By default, if a message comes through with an IP source in this range and no pattern match is found then SNF will produce a 63 result code. This result code is typically associated with IP black rules. If the message does match a pattern rule (white or black) then the pattern rule will determine the result code.
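
The default behavior described so far for the White, Caution and Black ranges can be summarized in a short Python sketch. The codes 40 and 63 come from the text above. The function name, the use of None for "no pattern match", and the value 0 as a stand-in for "not tagged as spam" are assumptions made for this example.

```python
from typing import Optional

CAUTION_CODE = 40  # from the text: caution range, no pattern match
BLACK_CODE = 63    # from the text: black range, no pattern match
CLEAN_CODE = 0     # ASSUMPTION: placeholder for "not tagged as spam"


def result_code(gbudb_range: str, pattern_code: Optional[int]) -> int:
    """Pick a result code from the GBUdb range and the pattern scan.

    pattern_code is the code of the matching pattern rule, or None
    when no pattern rule matched (a convention of this sketch).
    """
    if gbudb_range == "White":
        return CLEAN_CODE        # White overrides pattern matching
    if pattern_code is not None:
        return pattern_code      # a matching pattern rule decides
    if gbudb_range == "Black":
        return BLACK_CODE
    if gbudb_range == "Caution":
        return CAUTION_CODE
    return CLEAN_CODE            # nothing matched at all
```

Under these assumptions result_code("Black", None) yields 63, while result_code("Black", 52) yields 52 because the pattern rule takes over.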

Truncate

How much more black could it be? The answer is none. None more black. - or - These go to eleven.

IPs that fall in this range are "blacker than black". That is, they fall in the black range but in addition to that their probability figure is sufficiently high that we are willing to cut the scanning process short and base the scan result solely on the GBUdb result. This saves CPU cycles and increases throughput at the expense of some detail about the message contents.

By default, if a message comes through with an IP source in this range the message is truncated as soon as the source IP is identified and SNF will produce a 20 result code. This result code is unique to this mode. Filtering systems may want to treat messages differently when SNF tags them with this code either by translating the code to a different (probably higher) weight, or by disabling some later tests. All of these choices are, of course, a matter of system policy.
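
A filtering system that turns SNF result codes into weights might keep a small translation table like the Python sketch below. The three codes are the ones named in this document. Every weight, the default weight, and the treatment of code 0 as "no spam result" are hypothetical values chosen for illustration, since the right numbers are a matter of system policy.

```python
# ASSUMPTION: all weights below are invented for illustration.
HYPOTHETICAL_WEIGHTS = {
    20: 10.0,  # truncate: "blacker than black"
    63: 8.0,   # black range, no pattern match
    40: 3.0,   # caution range, no pattern match
}
DEFAULT_SPAM_WEIGHT = 5.0  # any other nonzero result code


def weight_for(result_code: int) -> float:
    """Translate an SNF result code into a filter weight."""
    if result_code == 0:  # ASSUMPTION: 0 stands for "no spam result"
        return 0.0
    return HYPOTHETICAL_WEIGHTS.get(result_code, DEFAULT_SPAM_WEIGHT)
```

Giving code 20 the highest weight reflects the suggestion above that truncated messages probably deserve a higher weight than ordinary black results.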

The Blindness Paradox (and how to get out of it)

As previously stated, messages from IPs in the other ranges continue to be scanned by SNF's pattern matching engine. Messages in the truncate range, however, are not scanned. This can create what is known as the blindness paradox.

The blindness paradox says that a spam filtering system may become so good at filtering out spam that it can no longer see what spam looks like.

In order to prevent this, the truncate mode also has a "peek" setting that allows some fraction of truncated messages to be scanned in the normal way. This allows the pattern matching engine to "see" what kinds of messages are coming from the IP source and retrain the GBUdb, albeit at a slower rate than normal.

If an IP source in the truncate range suddenly becomes a source of good messages, then the combination of re-training through the "peek" mechanism and regular GBUdb "condensation" will eventually force the IP back into the ordinary black range, where all of its messages will be evaluated by the SNF pattern matching engine.

If the system administrator notices the change before the GBUdb does, they can always use the SNFClient utility (or an SNF_XCI transaction) to immediately update their system.
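
The "peek" idea can be sketched in Python as a simple random sample. This is an illustration only: the function and parameter names are invented, the peek fraction is expressed here as a number between 0 and 1, and returning the full scan's own result for a peeked message is an assumption of the sketch.

```python
import random

TRUNCATE_CODE = 20  # from the text: result code for truncated messages


def handle_truncated_source(scan_message, peek_rate, rng=random.random):
    """Handle one message whose source IP is in the truncate range.

    scan_message: zero-argument callable that runs the full pattern
                  scan and returns its result code (hypothetical hook).
    peek_rate:    fraction of messages (0.0 to 1.0) to scan anyway.
    rng:          source of uniform random numbers in [0.0, 1.0).
    """
    if rng() < peek_rate:
        # Peek: scan normally so the engine still sees this traffic.
        return scan_message()
    # Otherwise cut the scan short and report the truncate code.
    return TRUNCATE_CODE
```

With peek_rate set to 0.0 every message is truncated, which is exactly the situation that produces the blindness paradox.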

Virtual Spam Traps

On the dark side of the blindness paradox it is possible that new kinds of spam may be coming from known bad message sources. We might otherwise never see these messages until they start coming from new, as yet unknown IPs in the form of leakage.

This is actually an opportunity in disguise. Since we have known bad message sources and a high confidence in that assessment, we can randomly sample messages from these sources. If they are new to us (they do not match SNF pattern rules), then we can send those samples to special (virtual) spam traps for evaluation. This has many benefits:

For security reasons some systems may choose not to participate in the virtual spam trap program. For this reason it can easily be turned off without compromising the "peek" functionality that prevents the blindness paradox.
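
The sampling decision for virtual spam traps could look roughly like the Python sketch below. All names and the sample rate parameter are hypothetical. The sketch only restates the conditions given above: the source is known to be bad, the message matched no pattern rule, the message is picked at random, and the feature can be switched off.

```python
import random


def should_send_to_trap(source_is_known_bad, matched_pattern,
                        sample_rate, traps_enabled=True,
                        rng=random.random):
    """Decide whether to forward a message to a virtual spam trap.

    sample_rate is the fraction (0.0 to 1.0) of eligible messages to
    forward, and rng returns uniform numbers in [0.0, 1.0).
    """
    if not traps_enabled:
        return False  # the site has opted out of the program
    if not source_is_known_bad:
        return False  # only sample from known bad sources
    if matched_pattern:
        return False  # already covered by a pattern rule
    return rng() < sample_rate
```

Because traps_enabled is checked on its own, switching it off in this sketch has no effect on any separate peek sampling.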