Performance Troubleshooting

From Roaring Penguin

Latest revision as of 15:16, 28 August 2018

This article is intended as a basic starting point for identifying performance issues and their possible sources. Unfortunately, because of the number of potential sources of performance issues and the complexity of their interaction with both hardware and software it is not possible to provide a comprehensive step-by-step recipe for all possible eventualities.

Average Scan Times

One of the best indicators of performance is the actual speed at which messages are being scanned. In the following command, look at the "Avg ms/Scan" column:

   md-mx-ctrl load

This indicates how many milliseconds the average scan has taken to complete within the last 10 seconds, 1 minute, 5 minutes, and 10 minutes, so that you can track short-term load. Replace 'load' with 'hload' to get hourly averages over 1, 4, 12 and 24 hours. A healthy, well-provisioned machine should complete scans in a few seconds (a few thousand milliseconds). If scans are taking significantly longer, you may need better hardware, or some part of the scanning process may be struggling.

Alternatively, you may be able to provide extra scanning processes if MIMEDefang is not currently taking advantage of all of the available memory (see below). To see if this is necessary you can use:

   md-mx-ctrl status

If all or almost all of the workers are busy, you may require more concurrent connections.

Note that scan times also depend on factors other than the performance of the CanIt system. For instance, part of the scanning process includes checking RBLs and phishing lists, which depends on proper DNS resolution and prompt responses from the servers that maintain those lists. If scan times are long but the system isn't otherwise loaded, you can time a DNS lookup or ping the relevant hostname(s) to see whether they respond promptly:

   time host combined.bl.rptn.ca
   ping combined.bl.rptn.ca

Memory Usage

A brief summary of total memory usage can be found with the command:

   free -h

The key thing to look at here is the 'Swap' usage. Swap is disk-backed virtual memory: pages of memory that are pushed to the hard drive when the system is running low on actual RAM. This allows the operating system to use more memory than the system actually has, but read/write times to disk are incredibly slow compared to RAM, so if scanning processes start having to use swap, system performance will drop significantly. If swap usage is no higher than roughly 10-20%, there is likely no need for concern; otherwise you will want to look into increasing the amount of RAM for the machine.

(The -h parameter produces more human-readable output, e.g. 16G. Omit if you want...)

Alternatively, decreasing the number of concurrent connections may actually speed up scanning if it prevents the system from eating into swap. See the No Free Workers Error article (https://www.roaringpenguin.com/wiki/index.php/No_Free_Workers_Error), but reduce the connections instead of increasing them.

You can also get a more thorough, live view of RAM usage per process with the 'top' command, as discussed below, to see if any errant processes are eating up too much memory (press Shift+M to sort by memory usage; press q to exit top).
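The 10-20% rule of thumb above can be turned into a quick number. The following is a minimal sketch, not part of CanIt: `swap_pct` is a hypothetical helper name, and it assumes the usual Linux `free` layout in which the `Swap:` row lists total and then used. It expects plain `free` output (without -h) so that the values are bare numbers.

```shell
# swap_pct: read the output of `free` on stdin and print swap usage as a
# percentage of total swap. Hypothetical helper; assumes the "Swap:" row
# has the total in column 2 and the used amount in column 3.
swap_pct() {
    awk '/^Swap:/ {
        if ($2 > 0) printf "%.1f\n", $3 * 100 / $2
        else        print "0.0"       # no swap configured at all
    }'
}

# Typical use:
#   free | swap_pct
```

A system with no swap configured reports 0.0, which should not be read as "healthy", only as "nothing to measure".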

VMStat

Speaking of swap usage, the following command:

   vmstat 1

... is useful. There are many columns of output. For swap usage, look at the swap column group, which has two values, si and so, for swap-in and swap-out. The first line of output reports averages since the machine booted; subsequent lines are printed once per second. If you see anything other than 0s in these columns (aside from that first summary line), then you know this machine is actively making use of swap space.
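Watching for non-zero values by eye gets tedious over a longer run. As a rough sketch (the function name `swap_active_samples` is made up for illustration), the check can be automated by locating the si and so columns from the vmstat header row and skipping the first data line, since that one holds the since-boot averages:

```shell
# swap_active_samples: read vmstat output on stdin and print how many
# one-second samples showed swap-in or swap-out activity.
# Hypothetical helper; column positions are taken from the header row.
swap_active_samples() {
    awk '
        $1 == "r" {                               # header row naming columns
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        !("si" in col) || $1 !~ /^[0-9]+$/ { next }   # banner or junk line
        ++seen == 1 { next }                      # first sample: since boot
        $(col["si"]) > 0 || $(col["so"]) > 0 { n++ }
        END { print n + 0 }
    '
}

# Typical use (ten samples, one per second):
#   vmstat 1 10 | swap_active_samples
```

Any result other than 0 means the machine swapped during the sampling window.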

If a CanIt server is making any use of swap space at all, its performance will be terrible. On "normal" systems, swap space allows a system without enough memory to keep functioning by swapping the least-used bits of RAM to disk. On a CanIt server, the processes that use memory are all constantly using it, so it thrashes the disk horribly. In addition, if the disk is busy thrashing due to swapping, disk IO becomes poor, and because the database and other processes rely on a fast disk, the slowdown snowballs.

While you're looking at the 'vmstat' output, notice the cpu column group on the right. One of its values is wa, the percentage of time the CPU spends waiting for IO. If the CPU spends a lot of time waiting for IO (disk IO, network IO, whatever IO), that's a performance problem. Values above, say, 70% iowait ("wa") are an indication of poor IO (often disk IO, less often network IO). See iostat below for more about disk IO.
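A single wa reading can be misleading, so it may help to average it over a short window before comparing against the rough 70% figure above. This is only a sketch: `avg_iowait` is a hypothetical name, and it finds the wa column from the vmstat header row rather than assuming a fixed position.

```shell
# avg_iowait: read vmstat output on stdin and print the mean of the "wa"
# column, ignoring the first data line (averages since boot).
# Hypothetical helper; prints "n/a" if no usable samples were seen.
avg_iowait() {
    awk '
        $1 == "r" {                               # header row naming columns
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        !("wa" in col) || $1 !~ /^[0-9]+$/ { next }   # banner or junk line
        ++seen == 1 { next }                      # skip since-boot averages
        { sum += $(col["wa"]); n++ }
        END { if (n) printf "%.1f\n", sum / n; else print "n/a" }
    '
}

# Typical use (thirty samples, one per second):
#   vmstat 1 30 | avg_iowait
```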


CPU Usage

Linux can function extremely well with consistently high CPU load, so checking CPU stats is not likely to provide much insight into performance problems unless you know specifically what you are looking for. The

   top -b -n 1

command will provide static output similar to what you would expect from Windows Task Manager. Being static allows you to pipe it to other commands if you know what you are looking for. Use just `top`, without any arguments, to get the interactive view, which automatically sorts by CPU usage (q to exit).

You can also search running processes using the following:

   ps -aux | grep process_substring

For instance, searching 'canit' in place of process_substring will return all canit processes. This can indicate whether a certain process has hung; a hung process would not necessarily float to the top of the list in `top`, given that it would be using little or no CPU.

Hung Processes

Using the instructions above for `ps -aux` and searching for 'canit' lets you see whether a particular process has hung unexpectedly, which can result in System Check anomalies and old data.

Some CanIt processes are expected to be running for a very long time. The master 'canitd' process, for instance, starts whenever the machine boots up and is not expected to exit until it is forced to. However, tasks run by 'canitd' should not run for very long.
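To spot tasks that have been running suspiciously long, the elapsed time of each process is more useful than its CPU usage. The sketch below assumes a procps-ng `ps` that supports the `etimes` (elapsed seconds) output field; the function name `long_running` and the one-hour threshold in the usage line are illustrative choices, not CanIt recommendations.

```shell
# long_running PATTERN MIN_SECONDS
# Read `ps -eo pid,etimes,args` output on stdin and print the lines that
# contain PATTERN and whose elapsed time (column 2) exceeds MIN_SECONDS.
# Hypothetical helper for illustration.
long_running() {
    pattern=$1
    min_seconds=$2
    awk -v pat="$pattern" -v min="$min_seconds" '
        NR == 1 { next }                          # skip the ps header row
        $2 + 0 > min + 0 && index($0, pat) { print }
    '
}

# Typical use (canit processes older than one hour):
#   ps -eo pid,etimes,args | long_running canit 3600
```

The master 'canitd' process will always appear in this list, as described above; it is the other entries that deserve a closer look.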

Disk Performance

A common limiting factor for processing speed, especially for virtual machines, is the disk read/write speed. The following will give you some idea of that performance:

   iostat 1 -x 1

If a significant amount of time is being spent waiting (%iowait), this is an indication that processing speed is being limited by the speed at which the system is able to read and write data to the disk. If %idle is high and %iowait is low, your disks are easily keeping up with the read/write demand.
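For logging or scripting, the %iowait figure can be pulled out of the iostat report. The sketch below assumes the sysstat layout in which an `avg-cpu:` header line is followed by a line of values; `last_iowait` is a made-up name. Running iostat with `LC_ALL=C` is suggested so that decimals are printed with a dot.

```shell
# last_iowait: read iostat output on stdin and print the %iowait value of
# the last CPU report. Hypothetical helper; the column is located from the
# "avg-cpu:" header line, whose values appear on the following line.
last_iowait() {
    awk '
        $1 == "avg-cpu:" {
            for (i = 2; i <= NF; i++)
                if ($i == "%iowait") want = i - 1   # header has one extra field
            getline                                 # move to the values line
            if (want) value = $want
        }
        END { if (value != "") print value }
    '
}

# Typical use (two reports one second apart; the last one is the live sample):
#   LC_ALL=C iostat -x 1 2 | last_iowait
```

The first iostat report covers the time since boot, which is why the usage line asks for two reports and keeps only the last.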

Mail Logs

It is possible that a performance problem is due to a known issue that will be documented in the mail logs. For instance, database connections could be failing. These logs will also provide information about timeouts connecting to other machines and so on.

If you have the Log Indexer installed, as per Chapter 6.1 in the Installation Guide, mail logs will be available from the WebUI using Administration->Search Logs. The indexed logs lag a little behind, so if you need to see immediate logs, or you want to search for information that is not indexed, you can find the logs in /var/log/mail-daily (current.log is a symlink to today's log and is updated automatically every night). If you don't use the indexer, there will be a /var/log/maillog file instead.

These files can be searched and filtered with any number of UNIX commands, including grep, less, cat, and others. To view the most recent X lines, use:

   tail -n X /var/log/mail-daily/current.log

To see live logs as they come in, use (you can pipe this to grep to display specific content):

   tail -f /var/log/mail-daily/current.log

To view the entire log in its current state (this may be fairly slow if the system is loaded), use:

   less /var/log/mail-daily/current.log

`less` is a paging application that lets you view the entire file. It uses common Vi/Vim-style bindings for navigation. For instance, you can jump to a given line, for example line 123, with '123g'. You can jump to the bottom with 'G' and to the top with 'g'. You can search for strings with '/string'. You can use PgUp and PgDown to move a screen height at a time.

To search for a specific string, use (-i searches case-insensitively):

   grep -i "specific_string" /var/log/mail-daily/current.log
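The `tail -n` and `grep -i` commands above combine naturally when you only care about recent activity. As a small sketch (the function name `recent_matches` and the search word in the usage line are illustrative only), the following counts case-insensitive matches within the last N lines of a log file:

```shell
# recent_matches FILE LINES PATTERN
# Print how many of the last LINES lines of FILE match PATTERN,
# ignoring case. Hypothetical helper for illustration.
recent_matches() {
    tail -n "$2" "$1" | grep -i -c -- "$3"
}

# Typical use (matches for "timeout" in the last 1000 log lines):
#   recent_matches /var/log/mail-daily/current.log 1000 timeout
```

Running it a few minutes apart gives a rough sense of whether a given message is still being logged.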