Performance Troubleshooting

From Roaring Penguin

This article is intended as a basic starting point for identifying performance issues and their possible sources. Because performance issues have many potential sources, and because those sources interact with both hardware and software in complex ways, it is not possible to provide a comprehensive step-by-step recipe for every eventuality.

Average Scan Times

One of the best indicators of performance is the actual speed at which messages are being scanned. In the output of the following command, look at the "Avg ms/Scan" column:

   md-mx-ctrl load

This indicates how many milliseconds the average scan took to complete over the last 10 seconds, 1 minute, 5 minutes, and 10 minutes, so that you can track short-term load. Replace 'load' with 'hload' to get averages over the last 1, 4, 12, and 24 hours. A healthy, well-provisioned machine should complete scans in a few seconds (a few thousand milliseconds). If scans are taking significantly longer, you may need better hardware, or some part of the scanning process may be struggling.

Alternatively, you may be able to provide extra scanning processes if MIMEDefang is not currently taking advantage of all of the available memory (see below). To see if this is necessary, you can use:

   md-mx-ctrl status

If all or almost all of the slaves are busy, you may require more concurrent connections.

Note that scan times also depend on factors other than the performance of the CanIt system itself. For instance, part of the scanning process includes checking RBL lists and phishing lists, which depends on proper DNS resolution and prompt responses from the servers that maintain those lists. If scan times are long but the system isn't otherwise loaded, you can time a DNS lookup or ping the relevant hostname(s) to see whether they respond promptly:

   time host combined.bl.rptn.ca
   ping combined.bl.rptn.ca
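
To compare lookup times across several checks, it can help to have the elapsed time as a plain number. The following is a minimal sketch, not part of CanIt: the helper name time_ms and the 500 ms threshold in the usage comment are assumptions for illustration, and it relies on GNU date (for %N nanoseconds), as found on typical Linux systems.

```shell
# Run a command and print its elapsed wall-clock time in milliseconds.
# Assumes GNU date; the command's own output is discarded.
time_ms() {
    local start end
    start=$(date +%s%N)
    "$@" >/dev/null 2>&1
    end=$(date +%s%N)
    echo $(( (end - start) / 1000000 ))
}

# Hypothetical usage: flag a lookup slower than 500 ms (arbitrary threshold).
# ms=$(time_ms host combined.bl.rptn.ca)
# [ "$ms" -gt 500 ] && echo "slow DNS lookup: ${ms} ms"
```

Running the check a few times in a row helps to separate a consistently slow resolver from a single cold-cache lookup.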

Memory Usage

A brief summary of total memory usage can be found with the command:

   free

The key thing to look at here is the 'Swap' usage. Swap is disk-backed virtual memory: pages of memory that are pushed to the hard drive when the system is running low on physical RAM. This allows the operating system to use more memory than the system actually has, but disk reads and writes are far slower than RAM, so if scanning processes start having to use swap, system performance will drop significantly. If swap usage is no higher than 10-20%, there is likely no need for concern; otherwise, you will want to look into increasing the amount of RAM for the machine.

Alternatively, decreasing the number of concurrent connections may actually speed up scanning if it prevents the system from eating into swap (see the link above, but decrease values instead).

You can also get a more thorough and active view of RAM usage per-process with the 'top' command, as discussed below, to see if any errant processes are eating up too much memory (press Shift+M to sort by memory usage; press q to exit top).
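
The swap check described above can be reduced to a single percentage. This is a rough sketch only: the function name swap_used_pct is made up for illustration, and it assumes the English-locale layout of free, where the 'Swap:' row lists total, used, and free in that order.

```shell
# Read `free` output on stdin and print swap usage as a whole-number
# percentage (rounded down). Prints 0 when no swap is configured.
swap_used_pct() {
    awk '/^Swap:/ {
        if ($2 > 0) printf "%d\n", ($3 * 100) / $2
        else print 0
    }'
}

# Usage: free | swap_used_pct
```

A result well above the 10-20% range mentioned earlier suggests the machine is short on RAM for its current number of concurrent scans.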

CPU Usage

Linux can function extremely well with consistently high CPU load, so checking CPU stats is not likely to provide much insight into performance problems unless you know specifically what you are looking for. The

   top -b -n 1

command will provide static output similar to what you would expect from Windows Task Manager. Being static allows you to pipe it to other commands if you know what you are looking for.

You can also search running processes using the following:

   ps aux | grep process_substring

For instance, searching 'canit' in place of process_substring will return all canit processes. This can indicate whether a certain process has hung; such a process would not necessarily float to the top of the list in 'top', since it would be using little or no CPU.
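
A plain grep over ps output also matches the grep process itself, as well as lines where only the user name contains the substring. The sketch below is one way around that, under stated assumptions: the function name match_procs is hypothetical, and it assumes the usual 'ps aux' column layout, where the command starts at the 11th field. The substring is passed through the environment so that it does not appear on the filter's own command line.

```shell
# Filter `ps aux` output (on stdin) to rows whose COMMAND column contains
# the given literal substring. The header row is always kept.
match_procs() {
    PATTERN="$1" awk '
        NR == 1 { print; next }
        {
            cmd = ""
            for (i = 11; i <= NF; i++) cmd = cmd " " $i
            if (index(cmd, ENVIRON["PATTERN"]) > 0) print
        }'
}

# Usage: ps aux | match_procs canit
```

The STAT column in the result shows each process state, which is a quick first hint when a process appears stuck.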

Disk Performance

A common limiting factor for processing speed, especially for virtual machines, is the disk read/write speed. The following will give you some idea of that performance:

   iostat 1 -x 1

If a significant amount of time is being spent waiting (%iowait), processing is being limited by the speed at which the system can read and write data to the disk. If %idle is high and %iowait is low, your disks are easily keeping up with the read/write demand.
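
To watch just the %iowait figure, the value can be pulled out of the CPU summary that iostat prints. This is a sketch that assumes the sysstat output layout, with an 'avg-cpu:' header line followed by a line of values; the function name iowait_pct is made up for illustration.

```shell
# Read `iostat` output on stdin and print the %iowait value from each
# CPU report it contains.
iowait_pct() {
    awk '
        /^avg-cpu:/ {
            # Find the %iowait column by name. The values line has no
            # leading "avg-cpu:" label, so its field index is one lower.
            for (i = 2; i <= NF; i++) if ($i == "%iowait") col = i - 1
            want = 1
            next
        }
        want && NF { print $col; want = 0 }
    '
}

# Usage: iostat -x 1 2 | iowait_pct | tail -n 1
```

The first report iostat prints covers the time since boot, so the usage line asks for two reports and keeps the last value to get a current reading.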

Mail Logs

It is possible that a performance issue is due to a known problem that will be documented in the mail logs. For instance, database connections could be failing. These logs will also provide information about timeouts connecting to other machines and similar errors.

If you have the Log Indexer installed, as per Chapter 6.1 in the Installation Guide, mail logs will be available from the WebUI using Administration->Search Logs. These lag a little behind, so if you need to see immediate logs, or you want to search for information that is not indexed, you can find these logs in /var/log/mail-daily (current.log is a symlink to today's log, and is updated automatically each night). If you don't use the indexer, there will be a /var/log/maillog file instead.

These logs can be searched and filtered with any number of UNIX commands, including grep, less, and cat. To view the most recent X lines, use:

   tail -n X /var/log/mail-daily/current.log

To see live logs as they come in, use (you can pipe this to grep to display specific content):

   tail -f /var/log/mail-daily/current.log

To search for a specific string, use (-i searches case-insensitively):

   grep -i "specific_string" /var/log/mail-daily/current.log
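
When a problem such as a timeout shows up in the log, it is often useful to know when it started. The sketch below counts matching lines per hour. It assumes that each log line begins with a traditional syslog timestamp (for example 'Jan  5 10:01:02') and that the log is in chronological order; the function name count_by_hour is hypothetical.

```shell
# Read a syslog-format log on stdin and print, for each hour, how many
# lines match the given string (case-insensitively).
count_by_hour() {
    grep -i -- "$1" |
        awk '{ print $1, $2, substr($3, 1, 2) ":00" }' |
        uniq -c
}

# Usage: count_by_hour timeout < /var/log/mail-daily/current.log
```

Each output line is a count followed by the month, day, and hour, so a sudden jump in the count points to the hour worth reading in detail.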

Opening the file with less will allow you to page through (and search with /) the entire log file. This may be fairly slow if the system is loaded.