Based on my experience, the most common cause of DLPAR failures is a network problem. Before diving into the deep end and trying to debug RSCT, it’s always best to start with the basics.
For example, can you ping the HMC from the LPAR? Can you ping the LPAR from the HMC? If either of these tests fails, check the network configuration on both components before doing anything else.
If you have checked the network and you are happy that the LPAR and the HMC can communicate, then perhaps you need to re-initialize the RMC subsystems on the AIX LPAR, using the rmcctrl commands listed after the network checks.
On the HMC, check the network settings first, e.g.
Click on HMC Configuration and then Customize Network Settings.
– Verify that the IP address, netmask, default gateway, network routes and DNS server are all set correctly.
– Check the LPAR communications box in the HMC configuration screen for the LAN adapter that is used for HMC-to-LPAR communication.
– By the way, unlike POWER4 systems, LPARs on POWER5 and POWER6 systems do not depend on host name resolution for DLPAR operations.
Check routing on the LPAR and the HMC.
– Use ping and the HMC’s Test Network Connectivity task to verify the LPAR and the HMC can communicate with each other.
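The LPAR-side half of the ping test can be wrapped in a small function. This is only a sketch: `check_hmc_reachable` is a hypothetical helper name, the HMC address is whatever you pass in, and it assumes a `ping` that accepts `-c <count>`. The HMC-to-LPAR direction still has to be tested from the HMC itself.

```shell
#!/bin/sh
# Hypothetical helper: returns 0 if the given host answers ping,
# 1 if it does not, and 2 if no address was supplied.
check_hmc_reachable() {
    host=$1
    if [ -z "$host" ]; then
        echo "usage: check_hmc_reachable <hmc-address>" >&2
        return 2
    fi
    # -c 3 sends three echo requests and then exits.
    if ping -c 3 "$host" >/dev/null 2>&1; then
        echo "OK: $host answers ping"
        return 0
    fi
    echo "FAIL: $host does not answer ping; check IP, netmask, gateway and routes"
    return 1
}
```

For example, `check_hmc_reachable 192.168.1.244` run as root on the LPAR, where the address is the HMC's LPAR-facing interface.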
To re-initialize the RMC subsystems on the AIX LPAR, run:
# /usr/sbin/rsct/bin/rmcctrl -z
# /usr/sbin/rsct/bin/rmcctrl -A
# /usr/sbin/rsct/bin/rmcctrl -p
Wait up to 5 minutes before trying DLPAR again. If DLPAR still doesn’t work, i.e. the HMC is still reporting no values for DCaps and the IBM.DRM subsystem still won’t start, try using the recfgct command.
hscroot@hmc1:~> lspartition -dlpar
.....
<#5> LPAR:<24*9117-MMB*10284FP, , 192.168.1.15>
Active:<1>, OS:<AIX, 6.1, 6100-05-01-1016>, DCaps:<0x0>, CmdCaps:<0x0, 0x0>, PinnedMem:<768>
.....
# /usr/sbin/rsct/install/bin/recfgct
Wait 5 minutes. This should resolve your DLPAR issue. The IBM.DRM subsystem should now be active and there should be good (non-zero) values for DCaps:
# lssrc -g rsct_rm
Subsystem         Group            PID          Status
 IBM.DRM          rsct_rm          6881300      active
 IBM.CSMAgentRM   rsct_rm          7274530      active
 IBM.ServiceRM    rsct_rm          6029480      active
 IBM.AuditRM      rsct_rm          6357058      active
 IBM.ERRM         rsct_rm          4456566      active
 IBM.LPRM         rsct_rm          6946986      active
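If you want to script the "is IBM.DRM active?" check rather than eyeball it, something like the following could work. It is a sketch that assumes the lssrc output layout shown above (subsystem name in the first column, status in the last); `drm_is_active` is a hypothetical name.

```shell
#!/bin/sh
# Reads `lssrc -g rsct_rm` output on stdin.
# Exits 0 only if an IBM.DRM line is present and its last column is "active".
drm_is_active() {
    awk '$1 == "IBM.DRM" { found = 1; if ($NF == "active") ok = 1 }
         END { exit (found && ok) ? 0 : 1 }'
}

# Usage on the LPAR:
#   lssrc -g rsct_rm | drm_is_active && echo "IBM.DRM is active"
```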
hscroot@hmc1:~> lspartition -dlpar
....
<#5> LPAR:<24*9117-MMB*10284FP, , 192.168.1.15>
Active:<1>, OS:<AIX, 6.1, 6100-05-01-1016>, DCaps:<0xc5f>, CmdCaps:<0x1b, 0x1b>, PinnedMem:<994>
....
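On an HMC that manages many partitions, it can help to filter the lspartition output down to the LPARs that report no DLPAR capabilities. The sketch below assumes the two-line-per-partition layout shown above and is meant to run wherever you have captured that output (for example on a workstation, after `ssh hscroot@hmc1 lspartition -dlpar`); `find_zero_dcaps` is a hypothetical name.

```shell
#!/bin/sh
# Reads `lspartition -dlpar` output on stdin and prints the LPAR line of
# every partition whose DCaps value is 0x0.
find_zero_dcaps() {
    awk '
        /LPAR:</      { lpar = $0 }     # remember the most recent LPAR line
        /DCaps:<0x0>/ { print lpar }    # DCaps of zero: no DLPAR capabilities reported
    '
}
```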
Only run the rmcctrl and recfgct commands if you believe something has become corrupt in the RMC configuration of the LPAR. The fastest way to fix a broken configuration or to clear out the RMC ACL files after cloning (via alt_disk migration) is to use the recfgct command.
The RMC daemons should work “out of the box” and are not typically the cause of DLPAR issues. However, you can try stopping and starting them when troubleshooting DLPAR issues.
The rmcctrl -z command just stops the daemons. The rmcctrl -A command ensures that the subsystem group (rsct) and the subsystem (ctrmc) objects are added to the SRC and that an appropriate entry is added to the end of /etc/inittab, and then starts the daemons.
The rmcctrl -p command enables the daemons for remote client connections, i.e. from the HMC to the LPAR and vice versa.
If you are familiar with the System Resource Controller (SRC), you might be tempted to use the stopsrc and startsrc commands to stop and start these daemons.
Do not do it; use the rmcctrl commands instead.
If /var is full, RMC subsystems may fail to function correctly. If /var is 100% full, use chfs to expand it. If there is no more space available, examine its subdirectories and remove unnecessary files (for example, trace.*, core, and so forth).
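A quick way to check /var before touching RMC is to parse `df -P /var`. The functions below are a sketch: the names are hypothetical, the 90% default threshold is an arbitrary choice, and the parsing assumes the POSIX `df -P` layout (one header line, then one data line with the capacity in the fifth column).

```shell
#!/bin/sh
# Reads `df -P <filesystem>` output on stdin and prints the used percentage
# (without the % sign) taken from the data line.
fs_used_pct() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Reads the same input; succeeds if usage is below the threshold in $1
# (default 90, an arbitrary value chosen for this sketch).
var_has_space() {
    limit=${1:-90}
    pct=$(fs_used_pct)
    [ -n "$pct" ] && [ "$pct" -lt "$limit" ]
}

# Usage on the LPAR:
#   df -P /var | var_has_space 95 || echo "/var is nearly full"
```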
The polling interval for the RMC daemons on the LPAR to check with the HMC daemons is 5-7 minutes, so you need to wait long enough for the daemons to start up and synchronize.
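Rather than watching the clock, you could poll for the condition you care about and give up after the polling interval has passed. The helper below is a generic sketch (`wait_until` is a hypothetical name, and the 420-second and 30-second values in the usage comment are just one way to cover the 5-7 minute window).

```shell
#!/bin/sh
# Runs the given command repeatedly until it succeeds.
#   $1 = maximum time to wait, in seconds
#   $2 = pause between attempts, in seconds
#   remaining arguments = the command to run
# Returns 0 as soon as the command succeeds, 1 if the time runs out.
wait_until() {
    timeout=$1
    interval=$2
    shift 2
    elapsed=0
    while ! "$@"; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        elapsed=$(( elapsed + interval ))
    done
    return 0
}

# Usage on the LPAR (wait up to 7 minutes, checking every 30 seconds):
#   wait_until 420 30 sh -c 'lssrc -s IBM.DRM | grep -q active'
```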
The Resource Monitoring and Control (RMC) daemons are part of the Reliable, Scalable Cluster Technology (RSCT) and are controlled by the System Resource Controller (SRC).
These daemons run in all LPARs and communicate with equivalent RMC daemons running on the HMC. The daemons start automatically when the operating system starts and synchronize with the HMC RMC daemons.
The daemons in the LPARs and the daemons on the HMC must be able to communicate over the network for DLPAR operations to succeed. This is not the network connection between the managed system (FSP) and the HMC; it is the network connection between the operating system (AIX) in each LPAR and the HMC.
Note: Apart from rebooting, there is no way to stop and start the RMC daemons on the HMC.
The following links also contain some (outdated) information relating to DLPAR verification and troubleshooting. Even though it is quite old, some of it is still relevant today and is a good place to start.
The most common reasons for failures with Dynamic Logical Partitioning
Dynamic LPAR tips and checklists for RMC authentication and authorization
Setting up the HMC/LPARs hostname and network (old but interesting)
lspartition -dlpar, DCaps values (old but still applies)
The previous link (above) provides some information relating to the values for DCaps and what they mean (also outdated):
0 - DR CPU capable (can move CPUs)
1 - DR MEM capable (can move memory)
2 - DR I/O capable (can move I/O resources)
3 - DR PCI Bridge (can move PCI bridges)
4 - DR Entitlement (POWER5 can change shared entitlement)
5 - Multiple DR CPU (AIX 5.3 can move 2+ CPUs at once)
0x3f = max, and 0xf is common for AIX 5.2
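Reading the numbers 0 to 5 as bit positions (an assumption, but one that fits 0x3f being the maximum and 0xf being the first four capabilities), a DCaps value can be decoded with a few lines of shell arithmetic. `decode_dcaps` is a hypothetical helper; any bit above 5 is reported as unknown because the list above does not describe it.

```shell
#!/bin/sh
# Prints one line per set bit in a DCaps value such as 0xf or 0x3f.
# Meanings for bits 0-5 come from the list above; higher bits are
# reported as unknown rather than guessed.
decode_dcaps() {
    value=$(( $1 ))
    bit=0
    while [ "$value" -ne 0 ]; do
        if [ $(( value & 1 )) -eq 1 ]; then
            case $bit in
                0) echo "bit 0: DR CPU capable" ;;
                1) echo "bit 1: DR MEM capable" ;;
                2) echo "bit 2: DR I/O capable" ;;
                3) echo "bit 3: DR PCI Bridge" ;;
                4) echo "bit 4: DR Entitlement" ;;
                5) echo "bit 5: Multiple DR CPU" ;;
                *) echo "bit $bit: unknown (not in the list above)" ;;
            esac
        fi
        value=$(( value >> 1 ))
        bit=$(( bit + 1 ))
    done
}
```

Running `decode_dcaps 0xc5f` on the value from the lspartition output earlier in this post lists bits 0 to 4 plus three bits (6, 10 and 11) that the old list does not cover, which is a reminder of how dated that list is.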
If you are interested in how HMC and LPAR authentication works with DLPAR, then read on. Otherwise, happy DLPARing!
HMC and LPAR authentication (RSCT authentication)
The diagram below outlines how the HMC and an LPAR authenticate with each other in order for DLPAR operations to work. RSCT authentication is used to ensure the HMC is communicating with the correct LPAR.
The RSCT authorization process in detail:
1. On the HMC: DMSRM pushes down the secret key and HMC IP address to NVRAM when it detects a new CEC; this process is repeated every five minutes. Each time an HMC is rebooted or DMSRM is restarted, a new key is used.
2. On the AIX LPAR: CSMAgentRM, through RTAS (Run-time Abstraction Services), reads the key and HMC IP address out of NVRAM. It will then authenticate the HMC. This process is repeated every five minutes on an LPAR to detect new HMCs and to check whether the key has changed. An HMC with a new key is treated as a new HMC and will go through the authentication and authorization processes again.
3. On the AIX LPAR: After authenticating the HMC, CSMAgentRM will contact the DMSRM on the HMC to create a ManagedNode resource in order to identify itself as an LPAR of this HMC. CSMAgentRM then creates a compatible ManagementServer resource on AIX. This can be displayed on AIX with the lsrsrc command, e.g.
root@aix6 / # lsrsrc "IBM.ManagementServer"
Resource Persistent Attributes for IBM.ManagementServer
resource 1:
Name = "192.168.1.244"
Hostname = "192.168.1.244"
ManagerType = "HMC"
LocalHostname = "10.153.3.133"
ClusterTM = "9078-160"
ClusterSNum = ""
ActivePeerDomain = ""
NodeNameList = {"aix6"}
4. On the AIX LPAR: After the creation of the ManagedNode and ManagementServer resources on the HMC and AIX respectively, CSMAgentRM grants the HMC permission to access the necessary resource classes on the LPAR. After granting the HMC permission, CSMAgentRM will change the Status of its ManagedNode resource on the HMC to 1. (It should be noted that without proper permission on AIX, the HMC would be able to establish a session with the LPAR but would not be able to query for OS information or DLPAR capabilities, or execute DLPAR commands afterwards.)
5. On the HMC: After the ManagedNode Status is changed to 1, LparCmdRM establishes a session with the LPAR, queries for operating system information and DLPAR capabilities, notifies CIMOM about the DLPAR capabilities of the LPAR, and then waits for the DLPAR commands from users.
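To check which HMCs an LPAR has registered in step 3, the lsrsrc output can be reduced to one line per ManagementServer resource. This sketch assumes the `Attribute = "value"` layout shown in the example output above, with Name appearing before ManagerType; `list_management_servers` is a hypothetical name.

```shell
#!/bin/sh
# Reads `lsrsrc IBM.ManagementServer` output on stdin and prints
# "<Name> <ManagerType>" for each resource found.
list_management_servers() {
    awk -F'"' '
        $1 ~ /^[ \t]*Name[ \t]*=/        { name = $2 }         # e.g. the HMC IP address
        $1 ~ /^[ \t]*ManagerType[ \t]*=/ { print name, $2 }    # e.g. HMC
    '
}

# Usage on the LPAR:
#   lsrsrc IBM.ManagementServer | list_management_servers
```

If this prints nothing, the LPAR probably has no ManagementServer resource yet, which points at the authentication steps above rather than at DLPAR itself.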