Saturday 22 June 2013

Setting up your dump device in AIX

Setting up your AIX dump device correctly is very useful when your system crashes or restarts unexpectedly, because it helps IBM support or application owners investigate the root cause of the crash. A system dump is a snapshot of your system's memory contents. It can also be initiated manually by users with root authority, and programmers can analyze its contents when debugging new applications.
If at the time of installation of AIX the LPAR has less than 4GB of memory, the default dump device settings (for AIX 6.1 and AIX 5.3) are:
primary              /dev/hd6
secondary            /dev/sysdumpnull
copy directory       /var/adm/ras
forced copy flag     TRUE
always allow dump    FALSE
dump compression     ON
type of dump         traditional
NOTE: If your system has memory above 4GB at installation time, /dev/lg_dumplv is configured as the primary dump device.
If you remember that your system has always had more than 4GB of memory, but you still see hd6 as a dump device, the reason might be that your system was cloned from another system's mksysb backup or alt_disk_copy.
The first thing that should be changed is the option "always allow dump". If this option remains at its default setting and the system is not managed by an HMC (Hardware Management Console), no dump will be captured. When an HMC is used, the hypervisor tends to ignore this setting and will write out a dump to the dump device. A quick way to change this option is with the command: "sysdumpdev -K"
There are two problems associated with keeping /dev/hd6 as a primary dump device!
  1. At the time of system crash, it may turn out to be too small to store the entire dump image. Because of this we will have a partial dump. The partial dump can be of some use but it may not provide enough information to find the reason for a system crash.
  2. Having the default paging space as a dump device presents AIX with the problem of having to copy the dump image to a file before hd6 is reused as a paging device when the LPAR is restarted (after the dump is complete). This takes place during phase 2 of the boot process: rc.boot temporarily mounts /var and invokes the copycore command. The copycore command copies the dump image from hd6 to /var/adm/ras, which is the default copy directory.
Usually /var does not have enough space to hold a fairly large system dump image. If /var does not have enough space to hold the compressed dump image, AIX will stop at progress code 0c33 and display a message on the system console. The message gives the user a choice: either select another device to copy the dump image to, or continue and effectively overwrite the dump that is on hd6.
Not having the system console open in front of you at that particular moment may cause a lot of confusion.
NOTE:  You can always avoid the problem of having to wait for console input by changing the "forced copy flag" to FALSE. But then you run the risk of not having a valid system dump to send to IBM for analysis, and you would have to wait for the problem to recur in order to get a good dump image.
It is a very good practice to change the primary dump device to be a logical volume other than hd6!
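The two settings discussed above can be checked in one pass. The following is only a minimal sketch: it assumes "sysdumpdev -l" prints the same field labels as the listing earlier in this post, and the function name check_dump_settings is made up for illustration.

```shell
# Sketch: read `sysdumpdev -l` output on stdin and flag the two risky defaults.
# Assumes the field labels shown in the default settings listing above.
check_dump_settings() {
    awk '
        # Primary dump device still pointing at the default paging space
        $1 == "primary" && $2 == "/dev/hd6" {
            print "WARN: primary dump device is paging space /dev/hd6"
        }
        # "always allow dump" left at FALSE
        /^always allow dump/ && $NF == "FALSE" {
            print "WARN: always allow dump is FALSE (run sysdumpdev -K)"
        }
    '
}

# On an AIX host, as root:
#   sysdumpdev -l | check_dump_settings
```

No output means neither of the two conditions was found; the sketch does not change any setting.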

PROS & CONS for not using hd6 as a dump device

PROS:

The dump image will be stored on its own logical volume.
When "snap -ac" is invoked the dump will be copied to a file in the directory used by the snap program and not in /var/adm/ras. The snap command does not make any changes to the actual dump on the device. So the dump remains intact until it is overwritten by another system dump and can be collected as many times as needed by "snap -ac".
When the dump is stored on a device other than hd6, phase 2 of the boot process will still mount /var and invoke "copycore". But the copy will not occur, because the dump is stored on a separate logical volume. Therefore there is no need to make /var unnecessarily large or to monitor the system console.

CONS:

The primary dump device must always be in the root volume group. The only exception to this rule is if you want to temporarily set the primary dump device to an LV in a user volume group.
This is done by omitting the -P flag, but the settings will not persist when the system is restarted.
Therefore, to permanently change the settings of the primary dump device, you need enough free PPs in rootvg to accommodate a large enough logical volume.
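Before creating a dump LV it helps to confirm that rootvg really has the free PPs. The sketch below assumes the usual "lsvg rootvg" layout, in which one line contains "FREE PPs:" followed by the number of free partitions; the function names are hypothetical.

```shell
# Sketch: read `lsvg rootvg` output on stdin and print the FREE PPs count.
# Assumes a line containing "FREE PPs:" followed by the free partition count.
rootvg_free_pps() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i == "FREE" && $(i + 1) == "PPs:") print $(i + 2)
    }'
}

# Hypothetical helper: succeed only if at least $1 PPs are free.
has_free_pps() {
    needed=$1
    free=$(rootvg_free_pps)
    [ -n "$free" ] && [ "$free" -ge "$needed" ]
}

# On an AIX host:
#   lsvg rootvg | has_free_pps 32 && echo "enough free PPs for two 16-PP dump LVs"
```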

GUIDELINES ON HOW TO ESTIMATE DUMP SIZE (information is from our practical experience)

For a system that has been working for a long time, the estimated dump size given by "sysdumpdev -e" should be more or less accurate.
So you can take the output of "sysdumpdev -e", round it up to the next gigabyte and use this for the size of your dump logical volume.
For a system that you expect to be busier in the future but is not at that point yet, the output of "sysdumpdev -e" is not of much use. As a minimum (if you are short on space in rootvg), set the dump logical volume size to 1GB.
For production systems, it is a good practice to set the size of the dump logical volume to 2GB and above. 4GB is a good starting size for a busy RDBMS or Java system.
If the system is running memory-intensive applications such as Java, Oracle, DB2 and so on, the size of the dump can be very large. It is not rare for dumps to be 5GB in size when compressed.
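The "round up to the next gigabyte" rule translates into a number of PPs with simple arithmetic. This is a sketch only: the function name dump_lv_pps and its two arguments (estimate in bytes, PP size in MB) are illustrative, and it assumes you have already taken the byte count from "sysdumpdev -e".

```shell
# Sketch: round an estimated dump size up to the next whole gigabyte and
# express the result as a number of physical partitions (PPs).
#   $1 = estimated dump size in bytes (for example, taken from `sysdumpdev -e`)
#   $2 = PP size of rootvg in MB
dump_lv_pps() {
    est_bytes=$1
    pp_mb=$2
    gb=$((1024 * 1024 * 1024))
    size_gb=$(( (est_bytes + gb - 1) / gb ))    # ceiling to whole GB
    size_mb=$(( size_gb * 1024 ))
    echo $(( (size_mb + pp_mb - 1) / pp_mb ))   # ceiling to whole PPs
}

# A 1.5GB estimate with 128MB PPs rounds up to 2GB, i.e. 16 PPs
dump_lv_pps 1610612736 128
```

The result is only a floor derived from the current estimate; the minimum sizes suggested above (1GB, 2GB or 4GB depending on the workload) still apply when the estimate is smaller.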

HOW TO CREATE DUMP LVs AND SET THEM AS PRI/SEC DUMP DEVICES

The following assumes that rootvg is mirrored between hdisk0 and hdisk1 and that the PP size is 128MB. Mirroring of the dump device (if the dump device is not a paging space) is not supported, so each dump LV is created on a single disk. With 128MB PPs, the 16 PPs used below give each LV a size of 2GB.
First, do not forget to run "sysdumpdev -K" to enable the system to generate dumps. Otherwise, when an ABEND/ASSERT or kernel panic occurs, the system will only restart itself without creating a dump image.
# mklv -y lg_dumplv -t sysdump rootvg 16 hdisk0
# mklv -y lg_dumplv2 -t sysdump rootvg 16 hdisk1
# sysdumpdev -P -p /dev/lg_dumplv
# sysdumpdev -P -s /dev/lg_dumplv2

OTHER USEFUL INFORMATION

In cases when an LPAR becomes unresponsive or hangs and you want to initiate a system dump to send to IBM, but you cannot log on to AIX to initiate the dump, do it from the HMC command line or GUI.
HMC command line:
# chsysstate -m MANAGED_SYSTEM -r lpar -o dumprestart -n LPAR_NAME
EXAMPLE:
# chsysstate -m managed_system -r lpar -o dumprestart -n lpar_name
HMC GUI:
Select the LPAR
      -> Operations
                  ->Restart
                              ->Dump (and click OK)

For more information about AIX system dumps, see: http://www.unixmantra.com/2013/04/system-dump-aix.html
