Introduction
The purpose of this guide is to introduce Tivoli® System Automation for Multiplatforms (TSA) and provide a quick-start, purpose-driven approach for users who need to use the software but have little or no experience with it. This guide describes the role that TSA plays within IBM’s Smart Analytics System solution and the commands that can be used to manipulate the application. It also covers some basic problem diagnosis techniques that may help with minor issues experienced during regular use.
When the Smart Analytics system is built with High Availability, TSA is automatically installed and configured by the ATK. Therefore, this guide will not describe how to install or configure a TSA cluster (domain) from scratch, but rather how to manipulate and work with an existing environment. To learn to define a cluster of servers, please refer to the References appendix for IBM courses that are available.
Terminology
It is advisable to become familiar with the following terms, since they are used throughout this guide. They will also help you understand the scope of the different components within TSA.
Table 1. Terminology
Term | Definition |
---|---|
Peer Domain: | A cluster of servers, or nodes, for which TSA is responsible |
Resource: | Hardware or software that can be monitored or controlled. These can be fixed or floating. Floating resources can move between nodes. |
Resource group: | A virtual group or collection of resources |
Relationships: | Describe how resources work together. A start-stop relationship creates a dependency (see below) on another resource. A location relationship applies when resources should be started on the same or different nodes. |
Dependency: | A limitation on a resource that restricts operation. For example, if resource A depends on resource B, then resource B must be online for resource A to be started. |
Equivalency: | A set of fixed resources of the same resource class that provide the same functionality |
Quorum: | A cluster is said to have quorum when it has the capability to form a majority among its nodes. The cluster can lose quorum when there is a communication failure and sub-clusters form with an equal number of nodes. |
Nominal State: | This can be online or offline. It is the desired state of a resource, and can be changed so that TSA will bring a resource online or shut it down. |
Tie Breaker: | Used to maintain quorum, even in a split-brain situation (as mentioned in the definition of quorum). A tie-breaker allows sub-clusters to determine which set of nodes will take control of the domain. |
Failover: | When a failure occurs (typically hardware), which causes resources to be moved from one machine to another machine, the resources are said to have “failed over” |
Getting Started
The purpose of TSA in the Smart Analytics System is to manage software and hardware resources so that, in the event of a failure, they can be restarted or moved to a backup system. TSA uses background scripts to check the status of processes and ensure that everything is working correctly. It also uses “heart-beating” between all the nodes in the domain to ensure that every server is reachable. Should a process fail the status check, or a node fail to respond to a heartbeat, TSA takes appropriate action to bring the system back to its nominal state.
Let’s start with the basics. In a Smart Analytics System, the TSA domain includes the DB2 Admin node, the Data nodes, and any Standby/backup nodes. The management server is not part of the domain, and TSA commands will not work there. Further, all TSA commands are run as the root user.
The first thing you want to do is check the status of the domain, and start it if required:
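# lsrpdomain
Name      OpState RSCTActiveVersion MixedVersions TSPort GSPort
bcudomain Online  2.5.3.3           No            12347  12348
To start the domain:
startrpdomain bcudomain
To stop the domain:
stoprpdomain bcudomain
To force the domain to stop:
stoprpdomain -f bcudomain
To unmount the GPFS file system mounted at /db2home:
/usr/lpp/mmfs/bin/mmunmount /db2home
To check the status of the nodes in the domain:
# lsrpnode
Name      OpState RSCTVersion
beluga006 Online  2.5.3.3
beluga008 Online  2.5.3.3
beluga007 Online  2.5.3.3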
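On a domain with many nodes, it can be handy to print only the nodes that are not Online. The following is a minimal sketch, not part of TSA: `offline_nodes` is a hypothetical helper name, and it assumes the three-column `lsrpnode` layout shown above (Name, OpState, RSCTVersion) with one header line.

```shell
#!/bin/sh
# offline_nodes (hypothetical helper): reads `lsrpnode` output on stdin
# and prints the name of every node whose OpState column is not "Online".
# Assumes the column layout shown in this guide.
offline_nodes() {
    awk '
        $1 == "Name" && $2 == "OpState" { next }   # skip the header line
        NF >= 2 && $2 != "Online"       { print $1 }
    '
}

# Typical use on a cluster node, as root:
#   lsrpnode | offline_nodes
```

An empty result means every node reported Online.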
Resource Groups
After you have verified that the domain is started and all your nodes are Online, you will want to check the status of your resources. TSA manages all resources through resource groups. You cannot start a resource individually through TSA; when you start a resource group, however, it starts all resources that belong to that group.
To check the status of your DB2 resources, use the hals command. This gives you a summary of all DB2 partitions in the peer domain, including their primary and backup locations, current location, and failover state.
+===============+===============+===============+==================+==================+===========+
| PARTITIONS    | PRIMARY       | SECONDARY     | CURRENT LOCATION | RESOURCE OPSTATE | HA STATUS |
+===============+===============+===============+==================+==================+===========+
| 0             | dwadmp1x      | dwhap1x       | dwadmp1x         | Online           | Normal    |
| 1,2,3,4       | dwdmp1x       | dwhap1x       | dwdmp1x          | Online           | Normal    |
| 5,6,7,8       | dwdmp2x       | dwhap1x       | dwdmp2x          | Online           | Normal    |
| 9,10,11,12    | dwdmp3x       | dwhap1x       | dwhap1x          | Online           | Failover  |
| 13,14,15,16   | dwdmp4x       | dwhap1x       | dwdmp4x          | Online           | Normal    |
+===============+===============+===============+==================+==================+===========+
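In the sample above, partitions 9 through 12 are running on the standby node. To pull such rows out of a longer listing, a small filter along the following lines may help. It is only a sketch: `failed_over_partitions` is a hypothetical name, and the parsing assumes the six-column, pipe-delimited hals layout shown above.

```shell
#!/bin/sh
# failed_over_partitions (hypothetical helper): reads hals output on stdin
# and prints "partitions primary current-location" for every row whose
# HA STATUS column is anything other than "Normal".
failed_over_partitions() {
    awk -F'|' '
        NF >= 7 {
            # Trim the padding around each of the six data columns.
            for (i = 2; i <= 7; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i)
            if ($2 == "PARTITIONS" || $7 == "Normal") next
            print $2, $3, $5
        }'
}

# Typical use on a cluster node, as root:
#   hals | failed_over_partitions
```

The border lines contain no pipe characters, so they are skipped without special handling.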
The hals command is actually a summary of the complete output. For more detailed information about each resource, use the lssam command. The following output is an example of a cluster with the following nodes:
Admin node: beluga006
Data node: beluga007
Standby node: beluga008
# lssam | grep Nominal
Online IBM.ResourceGroup:SA-nfsserver-rg Nominal=Online
Online IBM.ResourceGroup:db2_bculinux_NLG_beluga006-rg Nominal=Online
        '- Online IBM.ResourceGroup:db2_bculinux_0-rg Nominal=Online
Online IBM.ResourceGroup:db2_bculinux_NLG_beluga007-rg Nominal=Online
        |- Online IBM.ResourceGroup:db2_bculinux_1-rg Nominal=Online
        |- Online IBM.ResourceGroup:db2_bculinux_2-rg Nominal=Online
        |- Online IBM.ResourceGroup:db2_bculinux_3-rg Nominal=Online
        '- Online IBM.ResourceGroup:db2_bculinux_4-rg Nominal=Online
Let’s step through the above output:
Online IBM.ResourceGroup:SA-nfsserver-rg Nominal=Online
Online IBM.ResourceGroup:db2_bculinux_NLG_beluga006-rg Nominal=Online
        '- Online IBM.ResourceGroup:db2_bculinux_0-rg Nominal=Online
Online IBM.ResourceGroup:db2_bculinux_NLG_beluga007-rg Nominal=Online
        |- Online IBM.ResourceGroup:db2_bculinux_1-rg Nominal=Online
        |- Online IBM.ResourceGroup:db2_bculinux_2-rg Nominal=Online
        |- Online IBM.ResourceGroup:db2_bculinux_3-rg Nominal=Online
        '- Online IBM.ResourceGroup:db2_bculinux_4-rg Nominal=Online
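When the list of resource groups is long, it can help to print only the groups whose observed state differs from their nominal state. The sketch below is one way to do that. `rg_not_nominal` is a hypothetical helper name, and the parsing assumes lssam lines of the form shown above: an optional tree prefix, one or more state words, the IBM.ResourceGroup name, and a Nominal= field.

```shell
#!/bin/sh
# rg_not_nominal (hypothetical helper): reads lssam output on stdin and
# prints every resource group whose observed state differs from the
# state given in its Nominal= field.
rg_not_nominal() {
    awk '
        {
            state = ""; name = ""; nominal = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^IBM\.ResourceGroup:/) {
                    name = substr($i, length("IBM.ResourceGroup:") + 1)
                } else if ($i ~ /^Nominal=/) {
                    nominal = substr($i, 9)
                } else if (name == "" && $i ~ /^[A-Za-z]+$/) {
                    # State words come before the name, e.g. "Failed offline".
                    state = (state == "" ? $i : state " " $i)
                }
            }
            if (name != "" && nominal != "" && tolower(state) != tolower(nominal))
                print name ": " state " (nominal " nominal ")"
        }'
}

# Typical use on a cluster node, as root:
#   lssam | rg_not_nominal
```

Lines for individual resources carry no Nominal= field, so only resource groups are reported.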
Now, let us examine the full lssam output. Try to find each of the lines from the grepped output in the full output:
# lssam
Online IBM.ResourceGroup:SA-nfsserver-rg Nominal=Online
        |- Online IBM.AgFileSystem:shared_db2home
                |- Online IBM.AgFileSystem:shared_db2home:beluga006
                '- Offline IBM.AgFileSystem:shared_db2home:beluga008
        |- Online IBM.AgFileSystem:varlibnfs
                |- Online IBM.AgFileSystem:varlibnfs:beluga006
                '- Offline IBM.AgFileSystem:varlibnfs:beluga008
        |- Online IBM.Application:SA-nfsserver-server
                |- Online IBM.Application:SA-nfsserver-server:beluga006
                '- Offline IBM.Application:SA-nfsserver-server:beluga008
        '- Online IBM.ServiceIP:SA-nfsserver-ip-1
                |- Online IBM.ServiceIP:SA-nfsserver-ip-1:beluga006
                '- Offline IBM.ServiceIP:SA-nfsserver-ip-1:beluga008
Online IBM.ResourceGroup:db2_bculinux_NLG_beluga006-rg Nominal=Online
        '- Online IBM.ResourceGroup:db2_bculinux_0-rg Nominal=Online
                |- Online IBM.Application:db2_bculinux_0-rs
                        |- Online IBM.Application:db2_bculinux_0-rs:beluga006
                        '- Offline IBM.Application:db2_bculinux_0-rs:beluga008
                |- Online IBM.Application:db2mnt-db2fs_bculinux_NODE0000-rs
                        |- Online IBM.Application:db2mnt-db2fs_bculinux_NODE0000-rs:beluga006
                        '- Offline IBM.Application:db2mnt-db2fs_bculinux_NODE0000-rs:beluga008
                '- Online IBM.ServiceIP:db2ip_172_16_10_228-rs
                        |- Online IBM.ServiceIP:db2ip_172_16_10_228-rs:beluga006
                        '- Offline IBM.ServiceIP:db2ip_172_16_10_228-rs:beluga008
Online IBM.ResourceGroup:db2_bculinux_NLG_beluga007-rg Nominal=Online
        |- Online IBM.ResourceGroup:db2_bculinux_1-rg Nominal=Online
                |- Online IBM.Application:db2_bculinux_1-rs
                        |- Online IBM.Application:db2_bculinux_1-rs:beluga007
                        '- Offline IBM.Application:db2_bculinux_1-rs:beluga008
                '- Online IBM.Application:db2mnt-db2fs_bculinux_NODE0001-rs
                        |- Online IBM.Application:db2mnt-db2fs_bculinux_NODE0001-rs:beluga007
                        '- Offline IBM.Application:db2mnt-db2fs_bculinux_NODE0001-rs:beluga008
        |- Online IBM.ResourceGroup:db2_bculinux_2-rg Nominal=Online
                |- Online IBM.Application:db2_bculinux_2-rs
                        |- Online IBM.Application:db2_bculinux_2-rs:beluga007
                        '- Offline IBM.Application:db2_bculinux_2-rs:beluga008
                '- Online IBM.Application:db2mnt-db2fs_bculinux_NODE0002-rs
                        |- Online IBM.Application:db2mnt-db2fs_bculinux_NODE0002-rs:beluga007
                        '- Offline IBM.Application:db2mnt-db2fs_bculinux_NODE0002-rs:beluga008
        |- Online IBM.ResourceGroup:db2_bculinux_3-rg Nominal=Online
                |- Online IBM.Application:db2_bculinux_3-rs
                        |- Online IBM.Application:db2_bculinux_3-rs:beluga007
                        '- Offline IBM.Application:db2_bculinux_3-rs:beluga008
                '- Online IBM.Application:db2mnt-db2fs_bculinux_NODE0003-rs
                        |- Online IBM.Application:db2mnt-db2fs_bculinux_NODE0003-rs:beluga007
                        '- Offline IBM.Application:db2mnt-db2fs_bculinux_NODE0003-rs:beluga008
        '- Online IBM.ResourceGroup:db2_bculinux_4-rg Nominal=Online
                |- Online IBM.Application:db2_bculinux_4-rs
                        |- Online IBM.Application:db2_bculinux_4-rs:beluga007
                        '- Offline IBM.Application:db2_bculinux_4-rs:beluga008
                '- Online IBM.Application:db2mnt-db2fs_bculinux_NODE0004-rs
                        |- Online IBM.Application:db2mnt-db2fs_bculinux_NODE0004-rs:beluga007
                        '- Offline IBM.Application:db2mnt-db2fs_bculinux_NODE0004-rs:beluga008
Let us take a look at the NFS resource group:
Online IBM.ResourceGroup:SA-nfsserver-rg Nominal=Online
        |- Online IBM.AgFileSystem:shared_db2home
                |- Online IBM.AgFileSystem:shared_db2home:beluga006
                '- Offline IBM.AgFileSystem:shared_db2home:beluga008
Similarly, for the admin node, we can now see the individual resources:
Online IBM.ResourceGroup:db2_bculinux_NLG_beluga006-rg Nominal=Online
        '- Online IBM.ResourceGroup:db2_bculinux_0-rg Nominal=Online
                |- Online IBM.Application:db2_bculinux_0-rs
                        |- Online IBM.Application:db2_bculinux_0-rs:beluga006
                        '- Offline IBM.Application:db2_bculinux_0-rs:beluga008
The lssam command also shows equivalencies as part of its output. I will include them for the sake of completeness, but we will discuss them later on:
Online IBM.Equivalency:SA-nfsserver-nieq-1
        |- Online IBM.NetworkInterface:bond0:beluga006
        '- Online IBM.NetworkInterface:bond0:beluga008
Online IBM.Equivalency:db2_FCM_network
        |- Online IBM.NetworkInterface:bond0:beluga006
        |- Online IBM.NetworkInterface:bond0:beluga007
        '- Online IBM.NetworkInterface:bond0:beluga008
Online IBM.Equivalency:db2_bculinux_0-rg_group-equ
        |- Online IBM.PeerNode:beluga006:beluga006
        '- Online IBM.PeerNode:beluga008:beluga008
Online IBM.Equivalency:db2_bculinux_1-rg_group-equ
        |- Online IBM.PeerNode:beluga007:beluga007
        '- Online IBM.PeerNode:beluga008:beluga008
Online IBM.Equivalency:db2_bculinux_2-rg_group-equ
        |- Online IBM.PeerNode:beluga007:beluga007
        '- Online IBM.PeerNode:beluga008:beluga008
Online IBM.Equivalency:db2_bculinux_3-rg_group-equ
        |- Online IBM.PeerNode:beluga007:beluga007
        '- Online IBM.PeerNode:beluga008:beluga008
Online IBM.Equivalency:db2_bculinux_4-rg_group-equ
        |- Online IBM.PeerNode:beluga007:beluga007
        '- Online IBM.PeerNode:beluga008:beluga008
Online IBM.Equivalency:db2_bculinux_NLG_beluga006-equ
        |- Online IBM.PeerNode:beluga006:beluga006
        '- Online IBM.PeerNode:beluga008:beluga008
Online IBM.Equivalency:db2_bculinux_NLG_beluga007-equ
        |- Online IBM.PeerNode:beluga007:beluga007
        '- Online IBM.PeerNode:beluga008:beluga008
# lssam -g SA-nfsserver-rg
Online IBM.ResourceGroup:SA-nfsserver-rg Nominal=Online
        |- Online IBM.AgFileSystem:shared_db2home
                |- Online IBM.AgFileSystem:shared_db2home:beluga006
                '- Offline IBM.AgFileSystem:shared_db2home:beluga008
        |- Online IBM.AgFileSystem:varlibnfs
                |- Online IBM.AgFileSystem:varlibnfs:beluga006
                '- Offline IBM.AgFileSystem:varlibnfs:beluga008
        |- Online IBM.Application:SA-nfsserver-server
                |- Online IBM.Application:SA-nfsserver-server:beluga006
                '- Offline IBM.Application:SA-nfsserver-server:beluga008
        '- Online IBM.ServiceIP:SA-nfsserver-ip-1
                |- Online IBM.ServiceIP:SA-nfsserver-ip-1:beluga006
                '- Offline IBM.ServiceIP:SA-nfsserver-ip-1:beluga008
Table 2. Useful Commands
Command | Definition |
---|---|
hals: | shows HA status summary for all db2 partitions |
hachknode | shows the status of the node in the domain and details about the private and public networks |
hastartdb2 | start db2 partition resources |
hastopdb2 | stop db2 partition resources |
hafailback | moves partitions back to the primary machine specified in the primary_machine argument |
hafailover | moves partitions off of the primary machine specified in the primary_machine argument to its standby |
hareset | attempt to reset pending, failed, stuck resource states |
Stopping and Starting Resources
If you want to stop or start the DB2 service, you need to stop or start the respective DB2 resource groups using TSA commands. TSA will then stop or start DB2.
The command to do this is chrg. To stop a resource group named db2_bculinux_NLG_beluga007, issue the command:
chrg -o offline -s "Name == 'db2_bculinux_NLG_beluga007'"
To start the same resource group:
chrg -o online -s "Name == 'db2_bculinux_NLG_beluga007'"
To start all resource groups:
chrg -o online -s "1=1"
hastartdb2 and hastopdb2
If TSA has pre-configured rules/dependencies, they will ensure that resources are stopped and started in the correct order. For example, DB2 resources that depend on NFS will not start if the NFS share is Offline.
TSA Components
Now that you understand the basics of Tivoli System Automation, we can discuss some of the other components that it can manage.
Service IP
A service IP is a virtual, floating resource attached to a network device. Essentially, it is an IP address that can move from one machine to another in the event of a failover. Service IPs play a key role in a highly available environment. Because they move from a failed machine to a standby, they allow an application to reconnect to the new machine using the same IP address, as if the original server had simply restarted.
The following command will allow you to view what service IPs have been configured for your system.
# lsrsrc -Ab IBM.ServiceIP
Resource Persistent and Dynamic Attributes for IBM.ServiceIP
resource 1:
        Name = "db2ip_10_160_20_210-rs"
        ResourceType = 0
        AggregateResource = "0x2029 0xffff 0x414c690c 0x7cc2abfa 0x919b42d5 0xbf62ab75"
        IPAddress = "10.160.20.210"
        NetMask = "255.255.255.0"
        ProtectionMode = 1
        NetPrefix = 0
        ActivePeerDomain = "bcudomain"
        NodeNameList = {"t6udb3a"}
        OpState = 2
        ConfigChanged = 0
        ChangedAttributes = {}
resource 2:
        Name = "db2ip_10_160_20_210-rs"
        ResourceType = 0
        AggregateResource = "0x2029 0xffff 0x414c690c 0x7cc2abfa 0x919b42d5 0xbf62ab75"
        IPAddress = "10.160.20.210"
        NetMask = "255.255.255.0"
        ProtectionMode = 1
        NetPrefix = 0
        ActivePeerDomain = "bcudomain"
        NodeNameList = {"t6udb1a"}
        OpState = 1
        ConfigChanged = 0
        ChangedAttributes = {}
resource 3:
        Name = "db2ip_10_160_20_210-rs"
        ResourceType = 1
        AggregateResource = "0x3fff 0xffff 0x00000000 0x00000000 0x00000000 0x00000000"
        IPAddress = "10.160.20.210"
        NetMask = "255.255.255.0"
        ProtectionMode = 1
        NetPrefix = 0
        ActivePeerDomain = "bcudomain"
        NodeNameList = {"t6udb1a","t6udb3a"}
        OpState = 1
        ConfigChanged = 0
        ChangedAttributes = {}
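In the listing above, each service IP appears once per node as a fixed constituent (ResourceType = 0) and once as the floating aggregate (ResourceType = 1). To see at a glance which node currently holds each address, the output can be reduced to the fixed entries that are Online (OpState = 1). The following is only a sketch: `serviceip_hosts` is a hypothetical name, and it assumes the attribute names and ordering shown above, with NodeNameList printed before OpState.

```shell
#!/bin/sh
# serviceip_hosts (hypothetical helper): reads `lsrsrc -Ab IBM.ServiceIP`
# output on stdin and prints "name node" for every fixed constituent
# resource (ResourceType = 0) whose OpState is 1 (Online).
serviceip_hosts() {
    awk '
        $1 == "resource"     { name = ""; type = ""; nodes = "" }
        $1 == "Name"         { name = $3; gsub(/"/, "", name) }
        $1 == "ResourceType" { type = $3 }
        $1 == "NodeNameList" { nodes = $3; gsub(/[{}"]/, "", nodes) }
        $1 == "OpState" && type != "" && type + 0 == 0 && $3 + 0 == 1 {
            print name, nodes
        }'
}

# Typical use on a cluster node, as root:
#   lsrsrc -Ab IBM.ServiceIP | serviceip_hosts
```

For the sample output above, this would report the address as active on t6udb1a.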
Application Resources
TSA manages resources using scripts. Some scripts are built in (and part of TSA), such as those for controlling DB2. These scripts are responsible for starting, stopping and monitoring the application. Sometimes it can be useful to understand these scripts, or even edit them for problem diagnosis. To find out where they are located, we use the lsrsrc command, which provides us with the complete configuration of a particular resource.
Following is an example:
# lsrsrc -Ab IBM.Application
resource 12:
        Name = "db2_dbedw1da_8-rs"
        ResourceType = 1
        AggregateResource = "0x3fff 0xffff 0x00000000 0x00000000 0x00000000 0x00000000"
        StartCommand = "/usr/sbin/rsct/sapolicies/db2/db2V97_start.ksh dbedw1da 8"
        StopCommand = "/usr/sbin/rsct/sapolicies/db2/db2V97_stop.ksh dbedw1da 8"
        MonitorCommand = "/usr/sbin/rsct/sapolicies/db2/db2V97_monitor.ksh dbedw1da 8"
        MonitorCommandPeriod = 60
        MonitorCommandTimeout = 180
        StartCommandTimeout = 330
        StopCommandTimeout = 140
        UserName = "root"
        RunCommandsSync = 1
        ProtectionMode = 1
        HealthCommand = ""
        HealthCommandPeriod = 10
        HealthCommandTimeout = 5
        InstanceName = ""
        InstanceLocation = ""
        SetHealthState = 0
        MovePrepareCommand = ""
        MoveCompleteCommand = ""
        MoveCancelCommand = ""
        CleanupList = {}
        CleanupCommand = ""
        CleanupCommandTimeout = 10
        ProcessCommandString = ""
        ResetState = 0
        ReRegistrationPeriod = 0
        CleanupNodeList = {}
        MonitorUserName = ""
        ActivePeerDomain = "bcudomain"
        NodeNameList = {"d8udb11a","d8udb3a"}
        OpState = 1
        ConfigChanged = 0
        ChangedAttributes = {}
        HealthState = 0
        HealthMessage = ""
        MoveState = [32768,{}]
        RegisteredPID = 0
Table 3. Resource Attributes
Attribute | Definition |
---|---|
ResourceType: | Indicates whether the resource is allowed to run on multiple nodes or on a single node. A fixed resource is identified by a ResourceType value of 0, and a floating resource has a value of 1. |
StartCommand: | Specifies the command to be run when the resource is started |
StopCommand: | Specifies the command to be run when the resource is stopped |
MonitorCommand: | Specifies the command to be run when the resource is being monitored. This happens at a regular interval, and you will likely see this command often when you run the “ps -ef” command. |
UserName: | The userid that TSA will use to start this resource |
NodeNameList: | Indicates on which nodes the resource is allowed to run. This is an attribute of an RSCT resource. |
OpState: | Specifies the operational state of a resource or a resource group. The valid states are: 0 - UNKNOWN, 1 - ONLINE, 2 - OFFLINE, 3 - FAILED_OFFLINE, 4 - STUCK_ONLINE, 5 - PENDING_ONLINE, 6 - PENDING_OFFLINE |
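Because lsrsrc reports OpState as a number, a small lookup can make scripted output easier to read. The sketch below simply encodes the OpState values from Table 3; `opstate_name` is a hypothetical helper name, not a TSA command.

```shell
#!/bin/sh
# opstate_name (hypothetical helper): translates a numeric OpState value,
# as printed by lsrsrc, into the state name listed in Table 3.
opstate_name() {
    case "$1" in
        0) echo "UNKNOWN" ;;
        1) echo "ONLINE" ;;
        2) echo "OFFLINE" ;;
        3) echo "FAILED_OFFLINE" ;;
        4) echo "STUCK_ONLINE" ;;
        5) echo "PENDING_ONLINE" ;;
        6) echo "PENDING_OFFLINE" ;;
        *) echo "UNRECOGNIZED($1)" ;;   # value not listed in Table 3
    esac
}

# Example:
#   opstate_name 3
```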
Network Resources
Every machine typically has an Ethernet adapter with a configured network address. TSA is aware of these adapters, and you can see how they have been configured with the lsrsrc command. For example:
# lsrsrc -Ab IBM.NetworkInterface
resource 1:
        Name = "en0"
        DeviceName = ""
        IPAddress = "172.22.1.217"
        SubnetMask = "255.255.252.0"
        Subnet = "172.22.0.0"
        CommGroup = "CG1"
        HeartbeatActive = 1
        Aliases = {}
        DeviceSubType = 6
        LogicalID = 0
        NetworkID = 0
        NetworkID64 = 0
        PortID = 0
        HardwareAddress = "00:21:5e:a3:be:60"
        DevicePathName = ""
        IPVersion = 4
        Role = 0
        ActivePeerDomain = "bcudomain"
Log Files
It is important to be aware of the log files that TSA actively writes to:
- History file: this logs the commands that were sent to TSA
/var/ct/IBM.RecoveryRM.log2
- Error and monitor logs: these logs are simply the AIX and Linux system logs. They will show you the output of the start, stop, and monitor scripts as well as any diagnostic information coming from TSA. Although the system administrator can configure the location for these logs, they are typically located in the following locations:
AIX: /tmp/syslog.out
Linux: /var/log/messages
Command Reference
Table 4 describes the most common commands that a TSA administrator will use.
Table 4. Common TSA Commands
Command | Definition |
---|---|
hals: | Display HA configuration summary |
hastopdb2: | Stop DB2 using TSA |
hastartdb2: | Start DB2 using TSA |
mkequ: | Makes an equivalency resource |
chequ: | Changes a resource equivalency |
lsequ: | Lists equivalencies and their attributes |
rmequ: | Removes one or more resource equivalencies |
mkrg: | Makes a resource group |
chrg: | Changes persistent attribute values of a resource group (including starting and stopping a resource group) |
lsrg: | Lists persistent attribute values of a resource group or its resource group members |
rmrg: | Removes a resource group |
mkrel: | Makes a managed relationship between resources |
chrel: | Changes one or more managed relationships between resources |
lsrel: | Lists managed relationships |
rmrel: | Removes a managed relationship between resources |
samctrl: | Sets the IBM TSA control parameters |
lssamctrl: | Lists the IBM TSA controls |
addrgmbr: | Adds one or more resources to a resource group |
chrgmbr: | Changes the persistent attribute value(s) of a managed resource in a resource group |
rmrgmbr: | Removes one or more resources from the resource group |
lsrgreq: | Lists outstanding requests applied against resource groups or managed resources |
rgmbrreq: | Requests a managed resource to be started or stopped, or cancels the request |
rgreq: | Requests a resource group to be started, stopped, or moved, or cancels the request |
lssam: | Lists the defined resource groups and their members in a tree format |
Command Tips
Following are some useful commands with examples.
Show relationships/dependencies:
lsrel | sort
# lsrel -A b -s "Name = 'db2_bculinux_0-rs_DependOn_db2_bculinux_qp-rel'"
Managed Relationship 1:
        Class:Resource:Node[Source] = IBM.Application:db2_bculinux_qp
        Class:Resource:Node[Target] = {IBM.Application:db2_bculinux_0-rs}
        Relationship = DependsOn
        Conditional = NoCondition
        Name = db2_bculinux_0-rs_DependOn_db2_bculinux_qp-rel
        ActivePeerDomain = bcudomain
        ConfigValidity =
rmrel -s "Name like 'db2_bculinux_%-rs_DependsOn_db2_bculinux_0-rs-rel'"
chrsrc -s "Name=='" attribute=value
chrsrc -s "Name=='db2ip_10_160_10_27-rs'" IBM.ServiceIP NetMask='255.255.255.0'
Save the current policy to a file:
sampolicy -s /tmp/sampolicy.current.xml
Check a policy file:
sampolicy -c /tmp/sampolicy.current.xml
Activate a policy file:
sampolicy -a /tmp/sampolicy.current.xml
Troubleshooting
This section describes methods that can be used to determine the cause of a particular problem or failure. Though techniques vary depending on the type of problem, the following should be a good starting point for most issues.
Resolving FAILED OFFLINE status
A Failed Offline status prevents you from setting the nominal state to Online, so the failed resources must first be reset to Offline before they can be brought back Online. Make sure that the nominal state shows Offline before resetting them.
To resolve the Failed offline messages, use the resetrsrc command.
resetrsrc -s 'Name = "db2whse_appinstance_01.abxplatform_server1"' IBM.Application
resetrsrc -s 'Name = "db2whse_appinstance_01.adminconsole_server1"' IBM.Application
Take all TSA resources offline. The lssam output should reflect “Offline” for all resources before you attempt to bring them back online. To reset NFS resources, use:
resetrsrc -s "Name like 'SA-nfsserver-%'" IBM.Application (if necessary)
resetrsrc -s "Name like 'SA-nfsserver-%'" IBM.ServiceIP (if necessary)
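When several resources are in a failed state, typing one resetrsrc command per resource is tedious. As a rough sketch, the lssam output can be turned into a list of candidate resetrsrc commands for review. The helper name `reset_commands` is hypothetical, and the parsing assumes that lssam marks such resources with the text "Failed offline" in front of an IBM.Application or IBM.ServiceIP entry. The function only prints commands; it does not run them.

```shell
#!/bin/sh
# reset_commands (hypothetical helper): reads lssam output on stdin and
# prints one resetrsrc command per distinct IBM.Application or
# IBM.ServiceIP resource shown as "Failed offline".
# Nothing is executed; review the commands before running them as root.
reset_commands() {
    awk '
        /Failed offline/ {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^IBM\.(Application|ServiceIP):/) {
                    # Field looks like class:name or class:name:node.
                    split($i, part, ":")
                    key = part[1] SUBSEP part[2]
                    if (!(key in seen)) {
                        seen[key] = 1
                        # \047 is the octal escape for a single quote.
                        printf "resetrsrc -s \"Name == \047%s\047\" %s\n", part[2], part[1]
                    }
                }
            }
        }'
}

# Typical use on a cluster node, as root:
#   lssam | reset_commands
```

Resource groups themselves are skipped, since resetrsrc is applied to the underlying resources.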
When you are restarting DB2, you must verify that all the resources are offline before attempting to bring them online again. You must also correct the db2nodes.cfg file. Make sure you have backup copies of db2nodes.cfg and db2ha.sys.
NFS mounts stop functioning
In testing the NFS failover, we were able to move the server over successfully, but the existing NFS client mounts stopped functioning. We solved this problem by unmounting and remounting the NFS volume.
Resolving Binding=Sacrificed
To resolve this problem, you have to look at the overall cluster and how it is set up and defined. The issues that cause this state typically have a cluster-wide impact rather than affecting one specific resource.
- Check for failed relationships by listing the relationships with the following command, and then determine if one or more of the relationships relating to the failed resource group have not been satisfied:
lsrel -Ab
- Check for failed equivalencies by listing them with the following command, and then determine if one or more of the equivalencies have not been satisfied:
lsequ -Ab
- Check your resource group attributes and look for anything that may be set incorrectly. Some of the commands to use are listed as follows:
lsrg -Ab -g
lsrsrc -s 'Name="failed_resource"' -Ab IBM.
lsrg -m -g
samdiag -g <resource_group_name>
- Check for anything specific to your configuration that all of the sacrificed resources share in common, like a mount point, a database instance, or a virtual IP.
dmesg: check initialization errors
date: check server synchronization
ifconfig: check network adapters
netstat -I: check network configuration
ps -ef | grep inetd: will provide a list of the running processes, including group and PID
Resource state is unknown
Try resetting the resource using the resetrsrc command:
resetrsrc -s "Name like 'db2_db2inst2_%'" IBM.Application
resetrsrc -s "Name like 'db2_db2inst2_%'" IBM.ServiceIP
To change the monitoring interval and the command timeouts of each resource, use:
chrsrc -s 'Name like "db2_db2inst2%"' IBM.Application MonitorCommandPeriod=300
chrsrc -s 'Name like "db2_db2inst2%"' IBM.Application MonitorCommandTimeout=290
chrsrc -s 'Name like "db2_db2inst2%"' IBM.Application StartCommandTimeout=300
chrsrc -s 'Name like "db2_db2inst2%"' IBM.Application StopCommandTimeout=720
If the problem is most likely related to the automation manager, you should try recycling the automation manager (IBM.RecoveryRM) before contacting IBM support. This can be done using the following commands:
Find out on which node the RecoveryRM master daemon is running:
# lssrc -ls IBM.RecoveryRM | grep Master
# lssrc -ls IBM.RecoveryRM | grep PID
# kill -9
Resolving lssam hangs
http://www-01.ibm.com/support/docview.wss?uid=swg21293701
Move to another node in the same HA group and see if you can run the lssam command. If you can, go back to the original node to see if you can now do the lssam command. If this still does not work, then run the following commands:
lssrc -ls IBM.RecoveryRM | grep -i master
lssrc -ls IBM.GblResRM | grep -i leader
AVOID the following (DON’Ts)
- Do not use rpower –a, or rpower on more than one node in the same HA group when SAMP HA is up and running.
- Do not offline HA-NFS using a sudo command while logged in as the instance owner and while in the /db2home directory. HA-NFS will get stuck online, and the RecoveryRM daemon has to be killed on the master. If RecoveryRM will not start, reboot may be required.
- Do not use ifdown to bring down a network interface. This causes the eth (or en) device to be deleted from the equivalency membership, and you will then have to add the "eth" device (on Linux) or "en" device (on AIX) back into the network equivalency using the chequ command.
- Do not manipulate any BW resources that are under active SAMP control. Turn automation off (samctrl -M T) before manipulating these BW resources.
- Do not implement changes to the SA MP policy unless exhaustive testing of the HA test cases is completed.
- Ensure the /home and /db2home directories are always mounted before starting up a node.
- Check for process ids that may be blocking stop, start and monitor commands.
- Save backup copies of the db2nodes.cfg and db2ha.sys files.
- Save the backup copies of the current SAMP policy before and after every SAMP change. Compare the current SAMP policy to the backup SAMP policy every time there is an HA incident.
- Save backup copies of db2pd -ha output before and after every SAMP change. Compare the current db2pd outputs to the backup db2pd outputs every time there is an HA incident.
- Save backup copies of the samdiag outputs.