This article is intended as a basic starting point for identifying performance issues and their possible sources. Unfortunately, because of the number of potential sources of performance issues and the complexity of their interaction with both hardware and software, it is not possible to provide a comprehensive step-by-step recipe for every eventuality.
Average Scan Times
One of the best indicators of performance is the actual speed at which messages are being scanned. In the output of the following command, look at the "Avg ms/Scan" column:

md-mx-ctrl load
This indicates how many milliseconds the average scan has taken to complete within the last 10 seconds, 1 minute, 5 minutes, and 10 minutes, so that you can track short-term load. 'load' can be replaced with 'hload' for hourly averages over 1, 4, 12 and 24 hours. A healthy, well-provisioned machine should complete scans in a few seconds (a few thousand milliseconds). If scans are taking significantly longer, you may need to provide better hardware, or some part of the scanning process may be struggling.
Alternatively, you may be able to provide extra scanning processes if MIMEDefang is not currently taking advantage of all of the available memory (see below). To see whether this is necessary, check how many scanning workers are currently busy:
If all or almost all of the workers are busy, you may require more concurrent connections.
Note that scan times also depend on factors other than the performance of the CanIt system. For instance, part of the scanning process involves checking RBLs and phishing lists, which depends on proper DNS resolution and prompt responses from the servers that maintain those lists. If scan times are long but the system isn't otherwise loaded, you can time a DNS lookup, or ping the relevant hostname(s), to see whether they respond promptly:
time host combined.bl.rptn.ca
ping combined.bl.rptn.ca
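When several lookup hosts are involved, the same check can be wrapped in a small loop. This is only a sketch: it uses 'getent' (glibc's resolver interface) and GNU date's nanosecond format for the timing, and the hostname list is just the example from above; substitute the hosts your system actually queries.

```shell
# Rough DNS responsiveness check: time one lookup per hostname.
# The hostname list is illustrative; adjust it to your own setup.
for h in combined.bl.rptn.ca; do
  start=$(date +%s%N)                      # nanoseconds since epoch (GNU date)
  if getent hosts "$h" > /dev/null 2>&1; then
    status=ok
  else
    status=FAILED                          # no answer, or lookup error
  fi
  end=$(date +%s%N)
  printf '%s: %s in %d ms\n' "$h" "$status" $(( (end - start) / 1000000 ))
done
```

Lookups that consistently take more than a few hundred milliseconds, or that fail outright, point at a DNS problem rather than a scanning problem.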
A brief summary of total memory usage can be found with the command:

free -h
The key thing to look at here is the 'Swap' usage. Swap is disk-backed virtual memory: pages of memory that are pushed to the hard drive when the system is running low on actual RAM. This allows the operating system to use more memory than the system actually has, but disk read/write times are incredibly slow compared to RAM, so if scanning processes start having to use swap, system performance will drop significantly. If swap usage is nowhere near as high as 10-20%, there is likely no need for concern; otherwise, you will want to look into increasing the amount of RAM in the machine.
The -h parameter produces more human-readable output (e.g. 16G); omit it if you prefer raw figures.
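If you want a single number rather than the full table, the swap figure can be reduced to a percentage. This is a minimal sketch that reads /proc/meminfo (the same source 'free' uses), so it works even where 'free' is unavailable:

```shell
# Print swap usage as a percentage of total swap, or note that no
# swap is configured. SwapTotal/SwapFree are reported in kilobytes.
awk '/^SwapTotal:/ {t = $2}
     /^SwapFree:/  {f = $2}
     END {
       if (t == 0) print "no swap configured"
       else        printf "swap used: %.1f%%\n", 100 * (t - f) / t
     }' /proc/meminfo
```

A figure that keeps creeping upward over time is a sign the machine is short on RAM.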
Alternatively, decreasing the number of concurrent connections may actually speed up scanning if it prevents the system from eating into swap. See the same article as above, but reduce the connections instead of increasing them. You can also get a more thorough and active view of RAM usage per process with the 'top' command, as discussed below, to see if any errant processes are eating up too much memory (press Shift+M to sort by memory usage; press q to exit top).
Speaking of swap usage, the following command is useful:

vmstat 1

There are many columns of output. For swap usage, look at the two columns under the 'swap' heading: 'si' (swap-in) and 'so' (swap-out). The first line of output is an overall summary since boot; subsequent lines are printed once per second. If you see anything other than 0s in these columns (aside from that first summary line), you know the machine is actively making use of swap space.
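The same check can be scripted for monitoring. This sketch reads the kernel's cumulative pswpin/pswpout counters from /proc/vmstat (the counters that vmstat's si/so figures are derived from) and compares two samples taken a second apart:

```shell
# Sum the cumulative swap-in/swap-out page counters, wait a second,
# and sum them again; any increase means active swapping right now.
s1=$(awk '/^pswpin|^pswpout/ {sum += $2} END {print sum + 0}' /proc/vmstat)
sleep 1
s2=$(awk '/^pswpin|^pswpout/ {sum += $2} END {print sum + 0}' /proc/vmstat)
if [ "$s2" -gt "$s1" ]; then
  echo "swap activity detected"
else
  echo "no swap activity"
fi
```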
If a CanIt server is making any use of swap space at all, its performance will be terrible. On "normal" systems, swap space allows a machine without enough memory to keep functioning by swapping the least-used bits of RAM to disk. On a CanIt server, the processes that use memory are all using it constantly, so swapping thrashes the disk horribly. And because the disk is busy thrashing, disk IO becomes poor for everything else; the database and other processes rely on fast disk, so the performance problems snowball.
While you're looking at the 'vmstat' output, notice the 'cpu' columns on the right. One of the values is 'wa', which is CPU IO-wait. If the CPU spends a lot of time waiting for IO (disk IO, network IO, whatever IO), that's a performance problem. Values above, say, 70% iowait ('wa') are an indication of poor IO (often disk IO, less often network IO). See 'iostat' below for more about disk IO.
Linux can function extremely well with consistently high CPU load, so checking CPU stats is not likely to provide much insight into performance problems unless you know specifically what you are looking for. The
top -b -n 1
command will provide static output similar to what you would expect from Windows Task Manager. Being static allows you to pipe it to other commands if you know what you are looking for. Use just `top`, without any arguments, to get the interactive view, which automatically sorts by CPU usage (press q to exit).
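For example, a one-shot snapshot trimmed to the summary plus the first few processes might look like this (the line count of 12 is arbitrary):

```shell
# One batch-mode snapshot; batch output is plain text, so it pipes
# cleanly to head, grep, awk, etc.
top -b -n 1 | head -n 12
```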
You can also search running processes using the following:
ps aux | grep process_substring
For instance, searching for 'canit' in place of process_substring will return all CanIt processes. This can reveal whether a particular process has hung unexpectedly, which would not necessarily float to the top of the list in `top` (a hung process typically uses little or no CPU) and which can result in System Check anomalies and stale data.
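A small helper along these lines can make a hung process easier to spot by including each match's elapsed run time. The function name is just for illustration:

```shell
# find_proc: list PID, elapsed time (etime) and command line for every
# process whose command line matches a substring, case-insensitively.
find_proc() {
  # '|| true': no matches is a normal outcome, not an error
  ps -eo pid,etime,args | grep -i -- "$1" | grep -v grep || true
}

find_proc canit    # on a CanIt host, shows each canit process and its age
```

A task that should finish in seconds but shows an elapsed time of hours is a likely candidate for a hang.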
Some CanIt processes are expected to run for a very long time. The master 'canitd' process, for instance, starts whenever the machine boots and is not expected to exit until it is forced to. However, individual tasks run by 'canitd' should not run for very long.
A common limiting factor for processing speed, especially for virtual machines, is disk read/write speed. The following will give you some idea of that performance:
iostat -x 1
If a significant amount of time is being spent waiting (%iowait), processing speed is being limited by the speed at which the system can read and write data to disk. If %iowait is low and %idle is high, your disks are easily keeping up with the read/write demand.
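If 'iostat' is not installed, a rough iowait figure can be derived directly from /proc/stat. This sketch samples the first 'cpu' line twice over one second; it ignores the irq/steal fields, so treat the result as approximate:

```shell
# /proc/stat's "cpu" line: user nice system idle iowait irq softirq ...
# Sample it twice and report iowait as a share of the elapsed jiffies.
read -r _ u1 n1 s1 i1 w1 _ < /proc/stat
sleep 1
read -r _ u2 n2 s2 i2 w2 _ < /proc/stat
total=$(( (u2 - u1) + (n2 - n1) + (s2 - s1) + (i2 - i1) + (w2 - w1) ))
if [ "$total" -gt 0 ]; then
  echo "iowait: $(( 100 * (w2 - w1) / total ))%"
fi
```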
Performance issues may also be due to a known problem that will be documented in the mail logs; for instance, database connections could be failing. These logs will also provide information about timeouts connecting to other machines and so on.

If you have the Log Indexer installed, as per Chapter 6.1 in the Installation Guide, mail logs will be available from the WebUI using Administration->Search Logs. These lag a little behind, so if you need to see immediate logs, or you want to search for information that is not indexed, you can find the logs in /var/log/mail-daily (current.log is a symlink to today's log, updated automatically each night). If you don't use the indexer, there will be a /var/log/maillog file instead. These can be searched and filtered with any number of UNIX commands, including grep, less, cat, and others. To view the most recent X lines, use:
tail -n X /var/log/mail-daily/current.log
To see live logs as they come in, use the following (you can pipe this to grep to display specific content):
tail -f /var/log/mail-daily/current.log
To view the entire log in its current state (this may be fairly slow if the system is loaded), use:

less /var/log/mail-daily/current.log
`less` is a paging application that lets you view the entire file, with navigation similar to Vi/Vim. You can jump to a given line, for example line 123, with '123g'. You can jump to the bottom with 'G' and to the top with 'g'. You can search for strings with '/string', and use PgUp and PgDn to move a screen at a time.
To search for a specific string, use (-i searches case-insensitively):
grep -i "specific_string" /var/log/mail-daily/current.log