Sunday 9 March 2014

Troubleshooting GPFS Issues

In this article, we are going to discuss the most common ways to troubleshoot GPFS issues.

What to do when you hit a GPFS issue?

Got a problem? Don’t panic! 
Check for possible basic problems: 
  • Is the network OK? 
  • Check the status of the cluster: “mmgetstate -a” 
  • Check the status of the NSDs: “mmlsdisk fsname” (see the sketch below) 
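A minimal sketch of these first checks; “fsname” stands for your actual file system name and “um-gpfs2” is just an illustrative node name:

ping -c 3 um-gpfs2    # basic network reachability of another cluster node
mmgetstate -a         # every node should report the state "active"
mmlsdisk fsname       # every NSD should show status "ready" and availability "up"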
Take a 5-minute break 
  • In most cases GPFS will recover by itself, without any intervention from the administrator 
If it has not recovered 
  • Make sure that you are the only person working on the issue! 
  • Check the GPFS logs (first on the cluster manager, then on the file system manager, then on the NSD servers), as sketched after this list 
  • Check syslog (/var/log/messages) for any errors 
  • Check disk availability (mmlsdisk fsname) 
  • Consult the “Problem Determination Guide” 
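A sketch of these checks; the log path below is the standard GPFS location on Linux, and “fsname” again stands for your file system name:

mmlsmgr                                   # shows the file system managers and the cluster manager
tail -n 100 /var/adm/ras/mmfs.log.latest  # GPFS log (cluster manager first, then FS manager, then NSD servers)
grep -i mmfs /var/log/messages            # syslog entries related to GPFS
mmlsdisk fsname -e                        # lists only the disks that are not "ready"/"up"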

Some useful commands:

  • “mmfsadm dump waiters” helps to find long-lasting waiters (stuck operations) 
  • “mmdiag --network | grep pending” helps to identify a non-responsive node 
  • “mmdiag --iohist” lists the last 512 I/O operations performed by GPFS on the current node (helps to find a malfunctioning disk) 
  • “gpfs.snap” gathers all logs and configuration from all nodes in the cluster 
  • its output is the first thing to send to IBM support when opening a service request 
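A sketch of how these commands are typically combined during a diagnosis; they are read-only and are usually run as root on the cluster manager:

mmfsadm dump waiters               # long-lasting waiters point at stuck operations or nodes
mmdiag --network | grep pending    # pending messages point at a non-responsive node
mmdiag --iohist                    # recent I/O history on this node; look for slow or failing disks
gpfs.snap                          # gathers logs and configuration from all nodes into an archive for IBM support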

GPFS V3.4 Problem Determination Guide:

NFS stale file handle:

When a GPFS mount point is in the "NFS stale file handle" state, for example: 
[root@um-gpfs1 root]# df 
Filesystem 1K-blocks Used Available Use% Mounted on 
/dev/gpfs_um1 8125032448 8023801088 101231360 99% /storage/gpfs_um
df: `/storage/gpfs_um': Stale NFS file handle 
Then check whether any NSD has availability "down": 
[root@um-gpfs1 root]# mmlsdisk gpfs_um 
disk         driver   sector failure holds    holds
name         type     size   group   metadata data  status  availability
------------ -------- ------ ------- -------- ----- ------- ------------
disk21       nsd      512    4015    yes      yes   ready   up
disk22       nsd      512    4015    yes      yes   ready   down
disk23       nsd      512    4015    yes      yes   ready   down
disk24       nsd      512    4013    yes      yes   ready   up
Restart the NSDs (important: do it for all NSDs with availability "down" in one command): 
[root@um-gpfs1 root]# mmchdisk gpfs_um start -d "disk22;disk23"
Then re-mount the file system, as sketched below.
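A sketch of the re-mount step, using the file system from the example above (gpfs_um) and assuming it should be mounted on all nodes:

mmumount gpfs_um -a     # unmount on all nodes (omit -a to act on the local node only)
mmmount gpfs_um -a      # mount it again on all nodes
mmlsdisk gpfs_um        # verify that all disks are now "ready" and "up"
df /storage/gpfs_um     # the "Stale NFS file handle" error should be gone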

Recovery of GPFS configuration:

If a node of the cluster has lost its configuration (e.g. it has been re-installed) but is still present as a member of the cluster 
(“mmgetstate” lists it in the “unknown” state), use this command to recover the node: 
/usr/lpp/mmfs/bin/mmsdrrestore -p diskserv-san-5 -R /usr/bin/scp
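A sketch of how the recovery can be verified afterwards; the node name "reinstalled-node" is illustrative, and mmstartup is only needed if the GPFS daemon is not already running on that node:

mmstartup -N reinstalled-node    # start the GPFS daemon on the recovered node
mmgetstate -N reinstalled-node   # the state should move from "unknown" to "active"
mmlscluster                      # confirm the node is listed together with the rest of the cluster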

Checking an existing NSD:

  • If you get the warning “Disk descriptor xxx system refers to an existing NSD” while creating a new NSD, 
use this command to verify whether the device is actually used in one of the file systems: 
mmfsadm test readdescraw /dev/emcpowerax
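As a cross-check, the NSD inventory can also be listed; both commands below are standard GPFS queries and do not modify anything:

mmlsnsd        # lists all NSDs and the file system (if any) each one belongs to
mmlsnsd -m     # maps NSD names to the local device names, e.g. /dev/emcpowerax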
