Wednesday 29 May 2013

Improved control for Live Partition Mobility: choose your destination fibre channel port

Live Partition Mobility used to have a drawback: no matter how many fibre channel adapters are available in the destination frame, a moved LPAR was always mapped to the first fibre channel adapter.

If you have many LPARs in your frames this makes things worse: the fibre channel switch ports can become saturated.


With the new release of PowerVM you can now choose the destination fibre channel adapter, so LPARs can be distributed across all the fibre channel adapters.

Prerequisites

Be sure the Hardware Management Console and the Virtual I/O Servers are up to date with the required versions:
  • Hardware Management Console version has to be 7.6.0 :
# lshmc -V
"version= Version: 7
 Release: 7.6.0
 Service Pack: 1
HMC Build level 20121109.1
","base_version=V7R7.5.0
  • Source and destination Virtual I/O Servers have to be 2.2.2.1 :
# ioslevel
2.2.2.1

Mobility


Here is an example where all the virtual fibre channel adapters are mapped to the same physical fibre channel adapter: fcs0 is mapped to seven virtual fibre channel adapters and fcs1 to none. This is the result of multiple mobility operations:
# lsnports
name             physloc                        fabric tports aports swwpns  awwpns
fcs0             U5803.001.9ZZ03PZ-P1-C2-T1          1     64     57   2048    2021
fcs1             U5803.001.9ZZ03PZ-P1-C2-T2          1     64     64   2048    2048

# lsmap -all -npiv | grep "FC name"
FC name:fcs0                    FC loc code:U5803.001.9ZZ03PZ-P1-C2-T1
FC name:fcs0                    FC loc code:U5803.001.9ZZ03PZ-P1-C2-T1
FC name:fcs0                    FC loc code:U5803.001.9ZZ03PZ-P1-C2-T1
FC name:fcs0                    FC loc code:U5803.001.9ZZ03PZ-P1-C2-T1
FC name:fcs0                    FC loc code:U5803.001.9ZZ03PZ-P1-C2-T1
FC name:fcs0                    FC loc code:U5803.001.9ZZ03PZ-P1-C2-T1
FC name:fcs0                    FC loc code:U5803.001.9ZZ03PZ-P1-C2-T1
Choosing the destination fibre channel adapter can only be done from the command line (I hope a dialog box will be available with the next Hardware Management Console release). You have to choose the destination adapter on your own:
  • Before the mobility operation 64 aports are available on fcs5 on this Virtual I/O Server :
# lsnports
name             physloc                        fabric tports aports swwpns  awwpns
fcs2             U5803.001.9ZZ03PZ-P2-C2-T1          1     64     64   2048    2048
fcs3             U5803.001.9ZZ03PZ-P2-C2-T2          1     64     64   2048    2048
fcs4             U5803.001.9ZZ03PZ-P2-C3-T1          1     64     64   2048    2048
fcs5             U5803.001.9ZZ03PZ-P2-C3-T2          1     64     64   2048    2048
  • The mobility operation is launched from the command line only; as you can see, fibre channel adapter fcs5 is used for this mobility operation:
# migrlpar -o m -m P795-SRC -t P795-DST -p lpar-test -w 1 -i 'virtual_fc_mappings="10/vios1/15//fcs5,11/vios2/16//fcs5",source_msp_name=vios3,dest_msp_name=vios1,shared_proc_pool_name=shp_test'
  • After the mobility operation, 63 aports are available on fcs5 on the destination Virtual I/O Server:
# lsnports
name             physloc                        fabric tports aports swwpns  awwpns
fcs2             U5803.001.9ZZ03PZ-P2-C2-T1          1     64     64   2048    2048
fcs3             U5803.001.9ZZ03PZ-P2-C2-T2          1     64     64   2048    2048
fcs4             U5803.001.9ZZ03PZ-P2-C3-T1          1     64     64   2048    2048
fcs5             U5803.001.9ZZ03PZ-P2-C3-T2          1     64     63   2048    2045

Use case

After moving all my LPARs from one machine to another, here is the result: the LPARs' virtual fibre channel adapters are distributed across all the physical fibre channel adapters.
# lsnports
name             physloc                        fabric tports aports swwpns  awwpns
fcs0             U5803.001.9ZZ03PZ-P2-C6-T1          1     64     43   2048    1967
fcs1             U5803.001.9ZZ03PZ-P2-C6-T2          1     64     58   2048    2030
fcs4             U5803.001.9ZZ03PZ-P2-C8-T1          1     64     55   2048    2017
fcs5             U5803.001.9ZZ03PZ-P2-C8-T2          1     64     59   2048    2033
# /usr/ios/cli/ioscli lsmap -all -npiv | grep "FC name" | sort | uniq -c
  12 FC name:fcs0                    FC loc code:U5803.001.9ZZ03PZ-P2-C6-T1
   8 FC name:fcs1                    FC loc code:U5803.001.9ZZ03PZ-P2-C6-T2
   8 FC name:fcs4                    FC loc code:U5803.001.9ZZ03PZ-P2-C8-T1
   9 FC name:fcs5                    FC loc code:U5803.001.9ZZ03PZ-P2-C8-T2
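
If you move many LPARs this way, a small helper can pick the least-used destination adapter for you. Here is a minimal sketch, to run in a root shell on the destination Virtual I/O Server, assuming (as in the listings above) that column 5 of lsnports is the number of available NPIV ports; adapt and verify it before relying on it:

#!/bin/ksh
# print the fcs adapter with the most free NPIV ports (aports, column 5 of lsnports)
BEST=$(/usr/ios/cli/ioscli lsnports | tail +2 | sort -rn -k5 | head -1 | awk '{print $1}')
echo "least used NPIV capable adapter: $BEST"

The adapter name it prints can then be used in the virtual_fc_mappings argument of migrlpar, as shown above.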

Tuesday 28 May 2013

Unix Evolution

Unix (officially trademarked as UNIX, sometimes also written as Unix in small caps) is a multitasking, multi-user computer operating system originally developed in 1969 by a group of AT&T employees at Bell Labs, including Ken Thompson, Dennis Ritchie, Brian Kernighan, Douglas McIlroy, Michael Lesk and Joe Ossanna.

First developed in assembly language, by 1973 it had been almost entirely recoded in C, greatly facilitating its further development and porting to other hardware.

In 1974, UNIX was first licensed to an outside institution, the University of Illinois at Urbana Champaign, by Greg Chesson and Donald B. Gillies. Today's Unix system evolution is split into various branches, developed over time by AT&T as well as various commercial vendors, universities (such as University of California, Berkeley's BSD), and non-profit organizations.

The Open Group, an industry standards consortium, now owns the UNIX trademark. Only systems fully compliant with and certified according to the Single UNIX Specification are qualified to use the trademark; others might be called Unix system-like or Unix-like, although the Open Group disapproves of this term. However, the term Unix is often used informally to denote any operating system that closely resembles the trademarked system.

During the late 1970s and early 1980s, the influence of Unix in academic circles led to large-scale adoption of Unix (particularly of the BSD variant, originating from the University of California, Berkeley) by commercial startups, the most notable of which are Solaris, HP-UX, Sequent, and AIX, as well as Darwin, which forms the core set of components upon which Apple's OS X and iOS are based.

Today, in addition to certified Unix systems such as those already mentioned, Unix-like operating systems such as MINIX, Linux, and BSD descendants (FreeBSD, NetBSD, OpenBSD, and DragonFly BSD) are commonly encountered. The term traditional Unix may be used to describe an operating system that has the characteristics of either Version 7 Unix or UNIX System V.


Courtesy: Wikipedia

Monday 27 May 2013

Capped Mode Vs Uncapped Mode


Capped Mode

In capped mode the processing units consumed by a partition cannot exceed its assigned processing units (entitled capacity), even if there are free resources in the shared pool.


Uncapped Mode 

In uncapped mode the processing units consumed can exceed the entitled capacity of the partition if enough resources are available in the shared processor pool. At that point the uncapped weight assigned to the partition comes into play.

Understanding Micro-Partitioning

Micro-Partitioning was introduced as a feature of the POWER5 processor-based product line back in 2004, yet I still get a number of questions on a regular basis around implementing and understanding Micro-Partitioning. In this article, I'll try and paint a concise and clear picture of everything you need to know about Micro-Partitioning in the Power Systems environment and address the most frequently asked questions with regards to best practices. Every reference I'll be making throughout this article will be in the context of shared uncapped LPARs.

Understanding Entitled Capacity

The entitled capacity of an LPAR plays a crucial role. It determines two very important factors: the guaranteed CPU cycles the LPAR will get at any point in time, and the base unit of measurement for utilization statistics. One aspect of a managed system's entitled capacity is that the total entitled capacity on the system cannot exceed the number of physical processors in that system. In plain English: the system's processors cannot be over-subscribed by the total of the entitlements. As a side effect, this means every LPAR on a managed system will always be able to use its entitled capacity at any point in time. This capacity is guaranteed to be available to its LPAR within one dispatch cycle (10ms).

On the other hand, if an LPAR isn't making full use of its entitlement, these cycles are yielded back to the shared processor pool that LPAR is part of. The second crucial aspect of entitled capacity is being the basis for utilization statistics and performance reporting. Or more simply: an LPAR consuming all of its entitled CPU capacity will report 100 percent utilization. That LPAR will not necessarily be limited to 100 percent utilization: depending on its virtual-processor configuration, it'll be able to borrow unused cycles from the shared processor pool and report more than 100 percent utilization. In that case, it's important to know that any capacity used beyond an LPAR's entitled capacity isn't guaranteed (as it might be some other LPAR's entitlement). Therefore, if an LPAR is running beyond 100 percent CPU, it might be forced back down to 100 percent if another LPAR requires that borrowed capacity.
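
As a quick illustration, the lparstat command reports both the physical processors consumed (physc) and the percentage of entitlement used (%entc); the figures below are made up for an uncapped LPAR with an entitlement of 2.0 that is borrowing cycles from the pool:

# lparstat 5 3

System configuration: type=Shared mode=Uncapped smt=On lcpu=8 mem=8192MB psize=16 ent=2.00

%user  %sys  %wait  %idle physc %entc  lbusy   vcsw phint
 62.1  18.4    0.3   19.2  2.45 122.5   48.0   1201     3
 60.8  17.9    0.5   20.8  2.31 115.5   45.2   1187     2
 63.4  18.1    0.2   18.3  2.52 126.0   49.1   1220     4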

Then why is there a minimum/desired/maximum setting for entitlement? Because the entitled capacity of an LPAR can be changed dynamically. The minimum and maximum values of entitled capacity are there to set the limits to which a running LPAR's entitled capacity may be varied.
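
For example, from the HMC command line (a hedged sketch: the managed system and partition names are placeholders, and the operation fails if it would leave the LPAR outside its profile's minimum/maximum):

# chhwres -m MANAGED-SYSTEM -r proc -o a -p LPAR-NAME --procunits 0.5
# chhwres -m MANAGED-SYSTEM -r proc -o r -p LPAR-NAME --procunits 0.5

The first command adds 0.5 processing units to the running LPAR, the second removes them.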

The Role of Virtual Processors

Virtual processors are what AIX sees as actual CPUs from an OS standpoint. You have to look at them as logical entities that are backed by physical processor cycles. For each virtual processor, between 0.1 and 1.0 physical processor can be dispatched to execute tasks in that virtual processor's run queue. There are no conditions under which a single virtual processor will consume more than 1.0 physical processor. Therefore, the number of online virtual processors dictates the absolute maximum CPU consumption an LPAR can achieve (should enough capacity be available in its shared processor pool). That being said, if an LPAR has an entitlement of 2.0 processors and four virtual processors, this LPAR would be able to consume up to four physical processors, in which case it will report 200 percent CPU utilization. Keep in mind that, when configuring virtual processors on a system, it's possible to define more virtual processors than there are physical processors in the shared processor pool, so you might not be able to have an LPAR peak all the way up to its number of virtual processors.

Again, in configuring an LPAR, a minimum/desired/maximum value must be set for the number of virtual processors. These values serve strictly as boundaries for dynamic LPAR operation while varying the number of virtual processors on a running LPAR.
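
The same chhwres command handles virtual processors; again a hedged sketch with placeholder names, to be checked against your HMC level:

# chhwres -m MANAGED-SYSTEM -r proc -o a -p LPAR-NAME --procs 1

This adds one virtual processor to the running LPAR, as long as the result stays within the minimum/maximum defined in the profile.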

From Dedicated to Shared

Still today, a very large number of customers run LPARs in dedicated mode. A question I often get is: what's the best way to convert an LPAR from dedicated mode to shared uncapped mode without affecting performance? The simplest way is to just change its mode from dedicated to shared uncapped and make sure the number of virtual processors is equal to the entitled capacity. By doing so, the LPAR preserves its entitled capacity, which is its guaranteed cycles. Its entitled capacity being guaranteed ensures this LPAR can't starve and application response time isn't impacted. The immediate advantage is that any unused CPU cycles from that LPAR go from wasted (dedicated mode) to yielded back to the shared processor pool (shared uncapped) and readily available for any other LPAR that might need them. A second step is to look at the LPAR's CPU consumption and determine if it could use more CPU. If the LPAR does show signs of plateauing at 100 percent, then adjusting the number of virtual processors up (one virtual processor at a time) will allow that LPAR to borrow cycles from other LPARs that might not be using their full entitlement.

There are a few important considerations when implementing this very simple method. If your LPAR now uses more than its entitled capacity, it'll report more than 100 percent CPU utilization. That can result in issues with some performance-monitoring software and make some people uneasy. The other consideration is that users might see varying application response times at different times of day, based on the available CPU in the processor pool. If your processor pool reaches full utilization, applications using more than their entitled capacity might see their response time fall back to dedicated-like performance. Fortunately, this isn't often the case. If you haven't enabled processor pooling in your configuration, give it a try; you'll be surprised just how much free CPU you'll end up with.
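
For reference, here is a hedged sketch of such a conversion done at the profile level from the HMC command line (all names and values are examples; the LPAR must be reactivated with the modified profile, or equivalent DLPAR changes applied, for it to take effect):

# chsyscfg -r prof -m MANAGED-SYSTEM -i 'name=default,lpar_name=LPAR-NAME,proc_mode=shared,sharing_mode=uncap,uncap_weight=128,min_proc_units=1.0,desired_proc_units=2.0,max_proc_units=4.0,min_procs=2,desired_procs=2,max_procs=4'

Note that desired_procs matches desired_proc_units here, following the advice above of keeping the number of virtual processors equal to the entitled capacity.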

Optimizing Pooled Resources

Over the years, I've developed a very simple approach to getting the most out of micropartitioned environments. This approach is based on a good understanding of entitled capacity. In maximizing a system's utilization, you'll want to drive each LPAR's utilization as close as possible to 100 percent, on average. Once an LPAR has been converted from being dedicated to being shared uncapped, you'll want to gradually reduce its entitled capacity so it reports higher utilization until your LPAR's average utilization is at a level you feel comfortable with. Your LPAR's peaks, more than likely, will exceed the LPAR's entitled capacity (100 percent), and that's fine. If all of your LPARs on your managed systems run at 90 percent utilization on average, and all your entitled capacity is dispatched, then your entire managed system will be running at 90 percent utilization.

One very important factor in determining the average utilization you wish to have on your LPAR is the managed system size. The larger the system, the more LPARs on the system, the higher the utilization target can be set. This is simply a reflection of the law of large numbers in probability theory.

For more info, see the source article.

Thursday 23 May 2013

How to exit system console from HMC?

Quite often we end up entering a wrong user name when the HMC console prompt appears for an LPAR.
Instead of closing the console, use this tip to exit from the prompt.

It is the key combination ~. (tilde + dot)


Tuesday 21 May 2013

Upgrading the GPFS cluster on AIX

  • All the GPFS nodes should be upgraded at the same time.
  • Make sure the application is completely stopped.
  • Before starting the OS upgrade, all the GPFS file systems should be unmounted. If any application processes are still using the GPFS file systems, the file systems cannot be unmounted.
  • Before the OS upgrade starts, the GPFS cluster should be stopped.

1) View the cluster information

Before starting the OS upgrade, complete the steps below.
# mmlscluster                                  
Example output:
GPFS cluster information
========================
  GPFS cluster name:         HOST.test1-gpfs
  GPFS cluster id:           13882565243868289165
  GPFS UID domain:           HOST.test1-gpfs
  Remote shell command:      /usr/bin/ssh
  Remote file copy command:  /usr/bin/scp
GPFS cluster configuration servers:
-----------------------------------
  Primary server:    test1-gpfs
  Secondary server:  test2-gpfs
Node  Daemon node name            IP address       Admin node name             Designation
-----------------------------------------------------------------------------------------------
   1   test1-gpfs            192.168.199.137  test1-gpfs            quorum
   2   test3-gpfs            192.168.199.138  test3-gpfs            quorum
   3   test2-gpfs            192.168.199.139  test2-gpfs            quorum

2) View all the gpfs file systems

# mmlsfs all                                    
Example output:

File system attributes for /dev/gpfs1001:
=========================================
flag value            description
---- ---------------- -----------------------------------------------------
-f  131072           Minimum fragment size in bytes
-i  512              Inode size in bytes
-I  32768            Indirect block size in bytes
-m  1                Default number of metadata replicas
-M  2                Maximum number of metadata replicas
-r  1                Default number of data replicas
-R  2                Maximum number of data replicas
-j  cluster          Block allocation type
-D  nfs4             File locking semantics in effect
-k  all              ACL semantics in effect
-a  -1               Estimated average file size
-n  64               Estimated number of nodes that will mount file system
-B  4194304          Block size
-Q  user;group;fileset Quotas enforced
     none             Default quotas enabled
-F  1000000          Maximum number of inodes
-V  10.01 (3.2.1.5)  File system version
-u  yes              Support for large LUNs?
-z  no               Is DMAPI enabled?
-L  4194304          Logfile size
-E  yes              Exact mtime mount option
-S  no               Suppress atime mount option
-K  whenpossible     Strict replica allocation option
-P  system           Disk storage pools in file system
-d  gpfs1nsd;gpfs2nsd;gpfs3nsd;gpfs4nsd  Disks in file system
-A  yes              Automatic mount option
-o  none             Additional mount options
-T  /sasmart         Default mount point
File system attributes for /dev/gpfs1002:
=========================================
flag value            description
---- ---------------- -----------------------------------------------------
-f  131072           Minimum fragment size in bytes
-i  512              Inode size in bytes
-I  32768            Indirect block size in bytes
-m  1                Default number of metadata replicas
-M  2                Maximum number of metadata replicas
-r  1                Default number of data replicas
-R  2                Maximum number of data replicas
-j  cluster          Block allocation type
-D  nfs4             File locking semantics in effect
-k  all              ACL semantics in effect
-a  -1               Estimated average file size
-n  64               Estimated number of nodes that will mount file system
-B  4194304          Block size
-Q  user;group;fileset Quotas enforced
     none             Default quotas enabled
-F  1000000          Maximum number of inodes
-V  10.01 (3.2.1.5)  File system version
-u  yes              Support for large LUNs?
-z  no               Is DMAPI enabled?
-L  4194304          Logfile size
-E  yes              Exact mtime mount option
-S  no               Suppress atime mount option
-K  whenpossible     Strict replica allocation option
-P  system           Disk storage pools in file system
-d  gpfs5nsd       Disks in file system
-A  yes              Automatic mount option
-o  none             Additional mount options
-T  /sasplex1        Default mount point
File system attributes for /dev/gpfs1003:
=========================================
flag value            description
---- ---------------- -----------------------------------------------------
-f  131072           Minimum fragment size in bytes
-i  512              Inode size in bytes
-I  32768            Indirect block size in bytes
-m  1                Default number of metadata replicas
-M  2                Maximum number of metadata replicas
-r  1                Default number of data replicas
-R  2                Maximum number of data replicas
-j  scatter          Block allocation type
-D  nfs4             File locking semantics in effect
-k  all              ACL semantics in effect
-a  -1               Estimated average file size
-n  64               Estimated number of nodes that will mount file system
-B  4194304          Block size
-Q  user;group;fileset Quotas enforced
     none             Default quotas enabled
-F  1000000          Maximum number of inodes
-V  10.01 (3.2.1.5)  File system version
-u  yes              Support for large LUNs?
-z  no               Is DMAPI enabled?
-L  4194304          Logfile size
-E  yes              Exact mtime mount option
-S  no               Suppress atime mount option
-K  whenpossible     Strict replica allocation option
-P  system           Disk storage pools in file system
-d  gpfs6nsd;gpfs7nsd;gpfs8nsd;gpfs9nsd;gpfs10nsd;gpfs11nsd;gpfs12nsd;gpfs13nsd;gpfs14nsd;gpfs15nsd;gpfs16nsd;gpfs17nsd;gpfs18nsd;gpfs19nsd;gpfs20nsd;gpfs21nsd;gpfs22nsd;gpfs23nsd;gpfs24nsd;gpfs25nsd;gpfs26nsd;gpfs27nsd;gpfs28nsd;gpfs29nsd;gpfs30nsd;gpfs31nsd;gpfs32nsd;gpfs33nsd;gpfs34nsd;gpfs35nsd;gpfs36nsd;gpfs37nsd;gpfs38nsd;gpfs39nsd;gpfs40nsd;gpfs41nsd;gpfs42nsd;gpfs43nsd;gpfs44nsd;gpfs45nsd;gpfs46nsd;gpfs47nsd;gpfs48nsd;gpfs49nsd;gpfs50nsd;gpfs51nsd;gpfs52nsd;gpfs53nsd;gpfs54nsd;gpfs55nsd;gpfs56nsd;gpfs57nsd;gpfs58nsd;gpfs59nsd;gpfs60nsd;gpfs61nsd;gpfs62nsd;gpfs63nsd;gpfs64nsd;gpfs65nsd;gpfs66nsd;gpfs67nsd;gpfs68nsd;gpfs69nsd  Disks in file system
-A  yes              Automatic mount option
-o  none             Additional mount options
-T  /app1            Default mount point
File system attributes for /dev/gpfs1004:
=========================================
flag value            description
---- ---------------- -----------------------------------------------------
-f  131072           Minimum fragment size in bytes
-i  512              Inode size in bytes
-I  32768            Indirect block size in bytes
-m  1                Default number of metadata replicas
-M  2                Maximum number of metadata replicas
-r  1                Default number of data replicas
-R  2                Maximum number of data replicas
-j  cluster          Block allocation type
-D  nfs4             File locking semantics in effect
-k  all              ACL semantics in effect
-a  -1               Estimated average file size
-n  64               Estimated number of nodes that will mount file system
-B  4194304          Block size
-Q  user;group;fileset Quotas enforced
     none             Default quotas enabled
-F  1000000          Maximum number of inodes
-V  10.01 (3.2.1.5)  File system version
-u  yes              Support for large LUNs?
-z  no               Is DMAPI enabled?
-L  4194304          Logfile size
-E  yes              Exact mtime mount option
-S  no               Suppress atime mount option
-K  whenpossible     Strict replica allocation option
-P  system           Disk storage pools in file system
-d  gpfs70nsd      Disks in file system
-A  yes              Automatic mount option
-o  none             Additional mount options
-T  /sasuserhome     Default mount point

3) View on how many nodes each GPFS file system is mounted

# mmlsmount all                                          
Example output:
File system gpfs1001 is mounted on 3 nodes.
File system gpfs1002 is mounted on 3 nodes.
File system gpfs1003 is mounted on 3 nodes.
File system gpfs1004 is mounted on 3 nodes.

4) Check the existing gpfs cluster version

# lslpp -l |grep -i gpfs
Example output:
  gpfs.base                 3.2.1.18  APPLIED    GPFS File Manager
  gpfs.msg.en_US            3.2.1.11  APPLIED    GPFS Server Messages - U.S.
  gpfs.base                 3.2.1.18  APPLIED    GPFS File Manager
  gpfs.docs.data             3.2.1.1  APPLIED    GPFS Server Manpages and

5) Unmount all GPFS file systems

# mmumount all -N test1-gpfs,test3-gpfs,test2-gpfs
Example output:
Wed May 11 00:05:35 CDT 2011: 6027-1674 mmumount: Unmounting file systems ...

6) Verify all GPFS file systems are unmounted

# mmlsmount all                            
Example output:
File system gpfs1001 is not mounted.
File system gpfs1002 is not mounted.
File system gpfs1003 is not mounted.
File system gpfs1004 is not mounted.

7) Stop the gpfs cluster

# mmshutdown -a                      
Example output:
Wed May 11 00:08:22 CDT 2011: 6027-1341 mmshutdown: Starting force unmount of GPFS file systems
Wed May 11 00:08:27 CDT 2011: 6027-1344 mmshutdown: Shutting down GPFS daemons
test3-gpfs:  Shutting down!
test2-gpfs:  Shutting down!
test3-gpfs:  'shutdown' command about to kill process 516190
test2-gpfs:  'shutdown' command about to kill process 483444
test1-gpfs:  Shutting down!
test1-gpfs:  'shutdown' command about to kill process 524420
test1-gpfs:  Master did not clean up; attempting cleanup now
test1-gpfs:  Wed May 11 00:09:28.423 2011: GPFS: 6027-311 mmfsd64 is shutting down.
test1-gpfs:  Wed May 11 00:09:28.424 2011: Reason for shutdown: mmfsadm shutdown command timed out
test1-gpfs:  Wed May 11 00:09:28 CDT 2011: mmcommon mmfsdown invoked.  Subsystem: mmfs  Status: down
test1-gpfs:  Wed May 11 00:09:28 CDT 2011: 6027-1674 mmcommon: Unmounting file systems ...
Wed May 11 00:09:33 CDT 2011: 6027-1345 mmshutdown: Finished

8) Verify that no GPFS processes are still running

# ps -ef|grep -i gpfs                      
After the GPFS cluster is stopped, proceed with the OS patching/upgrade.
Once the OS patching/upgrade is complete, upgrade GPFS. Make sure the GPFS file systems are not mounted.

9) mount the NIM directory on /mnt

# mount wydainim010:/export/ibm_lpp /mnt

10) Change directory to the location of the new GPFS filesets

# cd /mnt/gpfs/3.3/3.3.0.12

11) Use smitty update_all to update the filesets, running a preview first

# smitty update_all
Example output:
* INPUT device / directory for software            .
* SOFTWARE to update                               _update_all
  PREVIEW only? (update operation will NOT occur)  yes        =====> Select yes for preview
  COMMIT software updates?                         no         =====> Select no for COMMIT filesets
  SAVE replaced files?                             yes        =====> Select yes here
  AUTOMATICALLY install requisite software?        yes+
  EXTEND file systems if space needed?             yes+
  VERIFY install and check file sizes?             no+
  DETAILED output?                                 no+
  Process multiple volumes?                        yes+
  ACCEPT new license agreements?                   yes        =====> Accept new license agreements
  Preview new LICENSE agreements?                  no+

If everything is fine in the PREVIEW stage, proceed with upgrading the GPFS filesets by setting PREVIEW only? back to no.

12) Now verify the GPFS filesets version

# lslpp -l |grep -i gpfs              
Example output:
  gpfs.base                  3.3.0.8  APPLIED    GPFS File Manager
  gpfs.msg.en_US             3.3.0.5  APPLIED    GPFS Server Messages - U.S.
  gpfs.base                  3.3.0.8  APPLIED    GPFS File Manager
  gpfs.docs.data             3.3.0.1  APPLIED    GPFS Server Manpages and
Then continue with the EMC upgrade, if applicable.
Once the EMC upgrade is done, make sure all the PVs are available on all nodes.
Start the GPFS cluster.

13) Start the GPFS cluster

# mmstartup -a                                            
Example output:
Wed May 11 06:09:32 CDT 2011: 6027-1642 mmstartup: Starting GPFS ...

14) Check the GPFS cluster

# mmlscluster                                
Example output:
GPFS cluster information
========================
  GPFS cluster name:         HOST.test1-gpfs
  GPFS cluster id:           13882565243868289165
  GPFS UID domain:           HOST.test1-gpfs
  Remote shell command:      /usr/bin/ssh
  Remote file copy command:  /usr/bin/scp
GPFS cluster configuration servers:
-----------------------------------
  Primary server:    test1-gpfs
  Secondary server:  test2-gpfs
Node  Daemon node name            IP address       Admin node name             Designation
-----------------------------------------------------------------------------------------------
   1   test1-gpfs            192.168.199.137  test1-gpfs            quorum
   2   test3-gpfs            192.168.199.138  test3-gpfs            quorum
   3   test2-gpfs            192.168.199.139  test2-gpfs            quorum

15) Check the GPFS cluster state on all nodes

# mmgetstate -a                                            
Example output:
Node number  Node name        GPFS state
------------------------------------------
       1      test1-gpfs active
       2      test3-gpfs active
       3      test2-gpfs active

16) Check all the filesystems

# mmlsfs all                                       

17) Mount all the GPFS file systems

# mmmount all -a                       
Wed May 11 06:13:16 CDT 2011: 6027-1623 mmmount: Mounting file systems ...

18) Check the file systems are mounted on all nodes

# mmlsmount all                        
Example output:
File system gpfs1001 is mounted on 3 nodes.
File system gpfs1002 is mounted on 3 nodes.
File system gpfs1003 is mounted on 3 nodes.
File system gpfs1004 is mounted on 3 nodes.

19) Verify the GPFS cluster configuration information

# mmlsconfig                                
Example output:
Configuration data for cluster HOST.test1-gpfs:
----------------------------------------------------------
clusterName HOST.test1-gpfs
clusterId 13882565243868289165
clusterType lc
autoload yes
minReleaseLevel 3.2.1.5
dmapiFileHandleSize 32
maxblocksize 4096K
pagepool 1024M
maxFilesToCache 5000
maxStatCache 40000
maxMBpS 3200
prefetchPct 60
seqDiscardThreshhold 10240000000
worker1Threads 400
prefetchThreads 145
adminMode allToAll
File systems in cluster HOST.test1-gpfs:
---------------------------------------------------
/dev/gpfs1001
/dev/gpfs1002
/dev/gpfs1003
/dev/gpfs1004
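
One last hedged reminder, not shown in the steps above: once every node runs the new GPFS level and you are happy with it, the cluster and the file systems can be committed to the new format level, otherwise the new functions remain disabled. Check the GPFS migration documentation for your release before running these commands, as this step cannot be reverted:

# mmchconfig release=LATEST
# mmchfs gpfs1001 -V full

Repeat the mmchfs command for each file system (gpfs1002, gpfs1003, gpfs1004 in this example).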

Monday 20 May 2013

HOWTO determine installed Technology Level

Introduction

The new IBM methodology dictates two Technology Level (TL) releases per year. The first Technology Level includes hardware features, enablement, and software services. The second includes software features in the release, which means the second release is larger and more comprehensive. Finally, there is also now support for new hardware on older Technology Levels. This page is dedicated to finding version information on AIX systems.

Determine machine type

To determine the machine type of an IBM AIX server use the uname command:

Command: determining the machine type


# uname -MuL
IBM,9133-55A IBM,0365B005G 3 65-B005G

The options:
-M: gives the machine type and model.
-u: gives the plant code and machine identifier.
-L: shows the LPAR number and name.

So in the example:

The machine type is: 9133,
The model is: 55A,
OF prefix IBM: 0365B005G,
Plant code is 3,
Sequence number: 65-B005G

Determine OS level and maintenance level

To determine the AIX OS level and maintenance level use the instfix command with the option -i:

Command: determining the operating system level

# instfix -i | grep AIX_ML
    All filesets for 5.3.0.0_AIX_ML were found.
    All filesets for 5300-01_AIX_ML were found.
    All filesets for 5300-02_AIX_ML were found.
    All filesets for 5300-03_AIX_ML were found.
    All filesets for 5300-04_AIX_ML were found.
    All filesets for 5300-05_AIX_ML were found.
    All filesets for 5300-06_AIX_ML were found.
    All filesets for 5300-07_AIX_ML were found.
    All filesets for 5300-08_AIX_ML were found.

You can also use the command oslevel to determine the current AIX version, the -r option determines the highest recommended technology level.

Command: determining the highest technology level

# oslevel -r
5300-08
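
On recent AIX 5.3 and 6.1 levels, oslevel -s also reports the service pack; the output below is only an illustration:

Command: determining the technology level and service pack

# oslevel -s
5300-08-01-0819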


HOWTO determine the MAC address of a network interface

Introduction

Sometimes it can be useful to know the physical address of a network interface (MAC address), to perform some configuration and/or troubleshooting.

Find the MAC address

On most systems this information can be retrieved using the ifconfig or netstat commands, but this is not true for IBM AIX.

To get the MAC address, use the lscfg command as root, as follows:

Command: displaying the MAC address of the interface <INTERFACE>

# lscfg -vpl <INTERFACE>

Example


To retrieve the MAC address of the interface ent0:

Command: displaying the MAC address of the interface ent0

# lscfg -vl ent0
  ent0             U787B.001.DNW7722-P1-T9  2-Port 10/100/1000 Base-TX PCI-X Adapter (14108902)
      2-Port 10/100/1000 Base-TX PCI-X Adapter:
        Network Address.............000D604DFA2A
        ROM Level.(alterable).......DV0210
        Hardware Location Code......U787B.001.DNW7722-P1-T9
  PLATFORM SPECIFIC
  Name:  ethernet
    Node:  ethernet@1
    Device Type:  network
    Physical Location: U787B.001.DNW7722-P1-T9
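
Alternatively, the same information is reported by the entstat command in its Hardware Address field (the address below is the one from the lscfg output above, shown with the colon notation entstat uses):

Command: displaying the MAC address of the interface ent0 with entstat

# entstat -d ent0 | grep "Hardware Address"
Hardware Address: 00:0d:60:4d:fa:2a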

HOWTO Recreate boot logical volume (BLV)

Introduction

If a Boot Logical Volume (BLV) is corrupted, a machine will not boot. For example, a bad block on a disk might cause a corrupted BLV. This page describes how to restore that BLV.

Recreate boot logical volume

To fix this issue, boot the server in maintenance mode from a CD-ROM, a tape, or a NIM server if available.

The bootlists are set using the bootlist command or through the System Management Services program (SMS): pressing F1 during boot will enter SMS mode. If you have an HMC, then at boot time select SMS as the boot mode in the properties of that partition.

Note that the bosboot command requires that the boot logical volume hd5 exists. To create a BLV (which may have been deleted by mistake), create a new hd5 logical volume of one PP in size in rootvg, specifying boot as the logical volume type, using the following command:
Command: creating the BLV
# mklv -y hd5 -t boot rootvg 1
Then change the bootlist for service (maintenance) mode as 1st device to the CD ROM using the following command:
Command: changing the bootlist
# bootlist -m service cd0 hdisk0 hdisk1
Then start maintenance mode for system recovery, access the volume group rootvg to start a shell, and recreate the BLV using the bosboot command as follows:
Command: recreating the BLV
# bosboot -ad /dev/hdisk0
It's important to perform a proper shutdown: all changes need to be written from memory to the disks.
Command: rebooting the system
# shutdown -Fr
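
Once the system is back up, it is worth pointing the normal-mode bootlist back at the disk(s) holding the new BLV and checking it; the output shown is illustrative:

Command: resetting and checking the normal bootlist

# bootlist -m normal hdisk0 hdisk1
# bootlist -m normal -o
hdisk0 blv=hd5
hdisk1 blv=hd5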

Get the HMC IP address from LPAR

Sometimes you may need to access the HMC of an AIX system, but if you don't remember its IP address and don't have up-to-date documentation, this information can be lost. This page gives a small tip to retrieve the HMC IP address directly from the AIX system itself.
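
A hedged example, assuming RSCT/RMC is running and the LPAR is managed by an HMC (on recent AIX releases the resource class may be IBM.MCP instead of IBM.ManagementServer); the values shown are made up:

# lsrsrc IBM.ManagementServer
Resource Persistent Attributes for IBM.ManagementServer
resource 1:
        Name             = "192.168.100.10"
        Hostname         = "192.168.100.10"
        ManagerType      = "HMC"
        LocalHostname    = "lpar01"
        ClusterTM        = "9078-160"
        ClusterSNum      = ""
        ActivePeerDomain = ""
        NodeNameList     = {"lpar01"}

The Name and Hostname fields contain the HMC IP address (or host name).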

Saturday 18 May 2013

Show which hdisk(s) each of your filesystems reside on for AIX

Here is a small script to map out which hdisk(s) each of your filesystems reside on for AIX: 
#!/bin/ksh
for vg in `lsvg -o`; do
        for fs in `lsvgfs $vg`; do
                printf "%-22s" $fs;
                for disk in `lsvg -p $vg | tail +3 | awk '{print $1}'`; do
                        lspv -l $disk | grep -q " ${fs}$" && printf "%-8s" $disk;
                done;
                echo
        done;
done

Here is what the output looks like:
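
(The listing below is only an illustration with made-up file systems and disks; your mount points and hdisk names will differ.)

/                     hdisk0
/usr                  hdisk0
/var                  hdisk0
/tmp                  hdisk0
/home                 hdisk0
/data01               hdisk2  hdisk3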


System Hangs at c33 with an LFT - AIX

Question

System hangs at c33 with an LFT.

Answer

System Hangs at c33 with an LFT

c33 means the system is configuring the console as a tty. Typically this occurs because the console is configured as a tty but no tty is attached. In this situation, when booting into Service mode and running lscons, the console will show up as an lft. To verify that this is the case, enter smitty chcons: PATHNAME of console will be set to /dev/tty0. Change the pathname to /dev/lft0 and reboot your system.
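
The same change can also be made directly from the command line, a hedged equivalent of the smitty panel above (the reboot is still required afterwards):

# chcons /dev/lft0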

rpc.mountd Will Not Start

Technote (FAQ)

rpc.mountd Will Not Start

Answer

Environment

AIX 4.3 and higher

Problem

rpc.mountd will not start. The following error is found in the errpt output:

Solution

Fix loopback/localhost name resolution as follows:
  1. Edit the /etc/hosts file:
         127.0.0.1 loopback localhost
    
    -> Save the file.
  2. Edit the /etc/netsvc.conf file:
         hosts=local,bind4
    
    -> Save the file.
  3. Test as follows, enter:
         host localhost
    
    -> Your output should be similar to the following:
         loopback is 127.0.0.1, aliases: localhost
    
  4. Enter the following commands:
         startsrc -s rpc.mountd
         lssrc -s rpc.mountd
    
    -> rpc.mountd should now be listed as active.

Using AIX Tools to Debug Network Problems

Question

Using AIX Tools to Debug Network Problems

Answer

This document discusses some standard AIX commands that can check for network connectivity or performance problems.

From time to time users may be unable to access servers via their client applications, or they may experience performance problems. When application and system checks do not indicate the problem, the system administrator may need to check the network or the system's network settings to find the problem. Using standard AIX tools, you can quickly determine if a server is experiencing a network problem due to configuration or network issues. These tools include the netstat and tcpdump commands, which can help you isolate problems, from loss of connectivity to more complex network performance problems.

  • Basic tools and the OSI-RM
  • Using the netstat command
  • Using the tcpdump command

Basic tools and the OSI-RM

The AIX commands you can use for a quick checkup include the lsdev, errpt, netstat, and tcpdump commands. With these tools, you can assess the lower layers of your system's network configuration within the model known as the Open Systems Interconnection (OSI) Reference Model (RM) (see Table 1). Using the OSI-RM allows you to check common points of failure, without spending too much time looking at elusive errors that might be caused by loss of network access within an application.

Open Systems Interconnection Reference Model

 Model Layer           Function                         Assessment Tools
             
7. Application Layer  Consists of application          . 
                      programs that use the network.
6. Presentation Layer Standardizes data presentation 
                      to the applications.
5. Session Layer      Manages sessions between 
                      applications.
4. Transport Layer    Organizes datagrams into         netstat -s 
                      segments and reliably delivers   iptrace 
                      them to upper layers.            tcpdump
3. Network Layer      Manages connections across the   netstat -in, -rn, -s, -D
                      network for the upper layers.    topas
                                                       iptrace
                                                       tcpdump
2. Data Link Layer    Provides reliable data delivery  netstat -v, -D
                      across the physical link.        iptrace
                                                       tcpdump
1.  Physical Layer    Defines the physical             netstat -v, -D 
                      characteristics of the           lsdev -C
                      network media.                   errpt
                                                       iptrace
                                                       tcpdump

Using the netstat command

One of the netstat tools, the netstat -v command, can help you decide if corrective action needs to be taken on the server or elsewhere in the network. Output from this command is the same as the entstat, tokstat, fddistat, and atmstat commands combined. The netstat -v command assesses the physical and data link layers of the OSI-RM. Thus, it is one of the first commands you should use, after determining that there is no hardware availability problem. (The errpt and lsdev -C commands can help determine availability.) The netstat -v output can indicate whether you need to adjust configuration of a network adapter (to reestablish or improve communications) or tune an adapter for better data throughput.

Sample scenario

A simple scenario illustrates how the netstat -v command helps determine why a system is not communicating on its network.

The scenario assumes a system with the following characteristics:
  • An IBM 4-Port 10/100 Mbps Ethernet PCI Adapter (ent0 - ent3)
  • An onboard IBM 10/100 Mbps Ethernet PCI Adapter (ent4)
  • A single cable connected to one of the ports on the four-port adapters
  • A single IP address configured, on en0, which also maps to one of the logical devices (ent0) on the 4-Port card
The problem: Since TCP/IP was configured on en0, the system has been unable to ping any system on the network.
Example 1
  1. The lsdev -C and errpt commands were used to verify the availability of the adapter and interface.'

  2. The netstat -in command (interface configuration) and the netstat -rn (route configuration) command were used to check the IP configuration.

  3. After the first two preliminary steps, the next step is to use the netstat -v command to review specific statistics for adapter operations. Without a filter, the netstat -v command produces at least 10 screens of data, so this example uses the netstat -v ent0 command to limit the output as follows:

    netstat -v ent0 | grep -p "Specific Statistics"

    The RJ45 Port Link Status line in the sample output indicates whether or not the adapter has a link to the network. In this example, the RJ45 Port Link Status is down:
    IBM 4-Port 10/100 Base-TX Ethernet PCI Adapter Specific Statistics:
    ------------------------------------------------
    Chip Version: 26
    RJ45 Port Link Status : down
    Media Speed Selected: Auto negotiation
    Media Speed Running: 100 Mbps Full Duplex
    Receive Pool Buffer Size: 384
    Free Receive Pool Buffers: 128
    No Receive Pool Buffer Errors: 0
    Inter Packet Gap: 96
    Adapter Restarts due to IOCTL commands: 1
  4. Running netstat -v a second time without a filter allows you to check the port link status for every adapter. For example, enter:

    netstat -v | more

    and then use /Specific as the search string for the more command. In this example, such a search shows that ent3, not ent0, shows a port link status of up. This information indicates that the cable is in the wrong port on the 4-Port Adapter, and that moving the cable to the correct (that is, configured) port fixes the problem.
Example 2
Interpreting the portion of the netstat -v output that shows adapter resource configuration can help isolate a system configuration problem. When setting up servers that provide network backup (such as TSM or SysBack), administrators commonly do some preliminary testing and achieve good results. Then, as more remote servers are added to the backup schedule, performance can decrease. Where network throughput was once good but has since decreased, netstat -v can uncover potential problems with adapter resources.

Many modern adapters have tunable buffers that allow you to adjust the resources a device can obtain. When a backup server requires extensive resources to handle data reception, looking at the output of netstat -v for Receive Statistics and for Adapter Specific Statistics can help isolate potential network performance bottlenecks. It is not uncommon to see errors in the Adapter Specific section of the 10/100 Mbps adapter reported under "No Receive Pool Buffer Errors". In Example 2 the netstat -v command is run twice, 30 seconds apart, while the server is handling several backup jobs. The output shows the default setting of 384 on the receive pool buffer needs to be adjusted higher. As long as no other errors suggesting additional problems show up in the output, you can safely assume that performance will improve when the receive pool buffer on ent4 is adjusted. A hedged tuning sketch is shown after the example steps below.
  1. Run the following command to see specific statistics for en4:

    netstat -v ent4 | grep -p "Specific Statistics"

    Command output is similar to the following:
    IBM 4-Port 10/100 Base-TX Ethernet PCI Adapter Specific Statistics:
    ------------------------------------------------
    Chip Version: 26
    RJ45 Port Link Status : up
    Media Speed Selected: Auto negotiation
    Media Speed Running: 100 Mbps Full Duplex
    Receive Pool Buffer Size: 384
    Free Receive Pool Buffers: 128
    No Receive Pool Buffer Errors: 999875
    Inter Packet Gap: 96
    Adapter Restarts due to IOCTL commands: 1
    
  2. Run the following commands to check the No Receive Pool Buffer Errors after 30 seconds:

    sleep 30 ; netstat -v ent4 | grep "Receive Pool Buffer Errors"

    Output is similar to the following:
    No Receive Pool Buffer Errors: 1005761
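
Here is the hedged tuning sketch mentioned above. The exact attribute name depends on the adapter driver, so list the tunables first; rxbuf_pool_sz and the value 1024 are only examples, not a recommendation for your adapter:

# lsattr -El ent4 | grep -i buf
# chdev -l ent4 -a rxbuf_pool_sz=1024 -P

The -P flag defers the change until the next reboot (or until the device is reconfigured), which is usually needed because the adapter is in use.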

Using the tcpdump command

The netstat tools (netstat -in, netstat -rn, and netstat -v) cannot always determine the nature of a connection problem.
Example 3
Suppose your server has four separate network adapters configured and attached to separate network segments. Two are working fine (VLAN A and B) while no connections can be established to your server on the other two segments (VLAN C and D). The output of netstat -v shows that data is coming in on all four adapters and no errors are being logged, indicating that the configuration at the physical and data link layers is working. In such a case, you need to examine the inbound data itself. You can use the tcpdump tool to examine the data online to help you determine the connection problem.

The tcpdump command provides a lot of data, but for quick analysis only some basic pieces of its output (the IP addresses) are needed.
You also want to consider the logical configuration you have set up for your interfaces (netstat -in). In this example, en2 was configured with address 9.3.6.225 and is in VLAN C (IP network 9.3.6.224, netmask 255.255.255.240); en3 was configured with address 9.3.6.243 and is in VLAN D (IP network 9.3.6.240, netmask 255.255.255.240).

  1. Run the following command to check traffic on en2:

    tcpdump -i en2 -I -n

    Output similar to the following is displayed:
    -TIME STAMP-    -SOURCE IP-    -DESTINATION IP-   -FLAG   -ADDITION INFO- 
    09:04:27.313527323 9.3.6.244.23 > 9.3.6.241.38160: P 7:9(2) ack 8 win 
    65535
    09:04:27.402377282 9.3.6.245.45017 > 9.53.168.52.23: . ack 24 win 
    17520 (DF) [tos 0x10]
    09:04:27.418818536 9.3.6.241.38160 > 9.3.6.244.23: . ack 9 win 65535 
    [tos 0x10
    09:04:27.419054751 9.3.6.244.23 > 9.3.6.241.38160: P 9:49(40) ack 8 
    win 65535
    09:04:27.524512144 9.3.6.245.45017 > 9.53.168.52.23: P 4:5(1) ack 24 
    win 17520 (DF) [tos 0x10]
    09:04:27.526159054 9.53.168.52.23 > 9.3.6.245.45017: P 24:25(1) ack 5 
    win 2482 (DF)
    09:04:27.602600775 9.3.6.245.45017 > 9.53.168.52.23: . ack 25 win 
    17520 (DF) [tos 0x10]
    09:04:27.628488745 9.3.6.241.38160 > 9.3.6.244.23: . ack 49 win 65535 
    [tos 0x1
  2. Press Ctrl-C to stop the output display:

    ^C
    38 packets received by filter
    0 packets dropped by kernel
Useful data can be gained from the tcpdump output simply by recognizing the source IP addresses in the traffic (shown in the sample output). Thus, the sample output shows that ent2 is physically attached to the wrong network segment. The source IP addresses should be in the 9.3.6.22x range, not the 9.3.6.24x range. It is possible that swapping the cables for ent2 and ent3 may solve the problem. If not, you may need to ask your network administrator to reconfigure switch ports to pass the correct traffic. With the information you gain from using the netstat -v and tcpdump tools, you can better decide which action is most appropriate.

AIX provides many tools for querying TCP/IP status on AIX servers. However, the netstat and tcpdump commands do provide some methods for quick problem determination. For example, these tools can help determine if you own the problem or if it needs to be addressed by a network administrator.

For additional information, please refer to AIX Online Documents at the following URL: Link

How to Set Up sar AIX

How to Set Up sar

Error messages

Error messages regarding the sar command and data files include the following:
  • The following errors indicate that the data collection programs have not been set up to collect the data that sar reports. If you see either of the following errors, follow the procedure in the "How to set up sar data files" section in this document.
  sar:0551-201 cannot open /usr/adm/sa/sa12 
  or
  sar:0551-213 try running /usr/lib/sa/sa1  
  • If on bootup (or when running the sar command) you see the following error, go to Step 4 of the "How to set up sar data files" procedure and follow Steps 4 and 5.
  sar: 0511-211 specify a positive integer for the time change 

How to set up sar data files

1) Log in as root and enter su - adm.
2) Enter crontab -e.
3) Uncomment the following lines by removing the # sign from the front of each line:
 # 0 8-17 * * 1-5 /usr/lib/sa/sa1 1200 3 & 
 # 0 * * * 0,6 /usr/lib/sa/sa1 & 
 # 0 18-7 * * 1-5 /usr/lib/sa/sa1 & 
 # 5 18 * * 1-5 /usr/lib/sa/sa2 -s 8:00 -e 18:01 -i 3600 -ubcwyaqvm & 
4) Uncomment the following line in the /etc/rc file:
# /bin/su - adm -c /usr/lib/sa/sadc /usr/adm/sa/sa`date +%d`
Reboot the system. This will turn on the data collection programs the sar command uses for displaying data.
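
Once data collection is enabled, sar can be used interactively or against the daily files; the file name depends on the day of the month (sa15 holds the data for the 15th):

# sar -u 2 5
# sar -f /usr/adm/sa/sa15

The first command displays CPU utilization live (5 samples at 2-second intervals); the second reads the statistics collected on the 15th.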

Repairing File Systems with fsck in AIX V5 (LED 517 or 518)

Technote (FAQ)

Repairing File Systems with fsck in AIX V5 or V6 (LED 517 or 518)

Answer

This document covers the use of the fsck (file system check) command in Maintenance mode to repair inconsistencies in file systems. The procedure described is useful when file system corruption in the primary root file systems is suspected or, in many cases, to correct an IPL hang at LED value 517, 518, or LED value 555.
This document applies to AIX version 5.x, 6.x, and VIOS LPAR.

Recovery procedure

  1. Boot your system into a limited function maintenance shell (Service, or Maintenance mode) from AIX bootable media to perform file system checks on your root file systems.
  2. Please refer to your system user's or installation and service guide for specific IPL procedures related to your type and model of hardware. You can also refer to the document titled "Booting in Service Mode," available at http://techsupport.services.ibm.com/server/aix.srchBroker.
  3. With bootable media of the same version and level as the system, boot the system. If this is a VIOS LPAR, use the correct VIOS media. The bootable media can be any ONE of the following:
    • Bootable CD-ROM
    • NON_AUTOINSTALL mksysb
    • Bootable Install Tape
    Follow the screen prompts to the following menu:
       Welcome to Base Operating System 
       Installation and Maintenance 
    
  4. Choose Start Maintenance Mode for System Recovery (Option 3).
    The next screen displays the Maintenance menu.
  5. Choose Access a Root Volume Group (Option 1).
    The next screen displays a warning that indicates you will not be able to return to the Base OS menu without rebooting.
  6. Choose 0 continue.
    The next screen displays information about all volume groups on the system.
  7. Select the root volume group by number.
  8. Choose Access this volume group and start a shell before mounting file systems (Option 2).
    If you get errors from the preceding option, do not continue with the rest of this procedure. Correct the problem causing the error. If you need assistance correcting the problem causing the error, contact one of the following:
    • local branch office
    • your point of sale
    • your AIX support center
    If no errors occur, proceed with the following steps.
  9. Run the following commands to check and repair file systems.
    NOTE: The -y option gives fsck permission to repair file system corruption when necessary. This flag can be used to avoid having to manually answer multiple confirmation prompts, however, use of this flag can cause permanent, unnecessary data loss in some situations.
     fsck /dev/hd4 
     fsck /dev/hd2 
     fsck /dev/hd3 
     fsck /dev/hd9var 
     fsck /dev/hd1 
    
  10. To format the default jfslog for the rootvg Journaled File System (JFS) file systems, run the following command:
     /usr/sbin/logform /dev/hd8 
    
    Answer yes when asked if you want to destroy the log.
  11. If your system is hanging at LED 517 or 518 during a Normal mode boot, it is possible the /etc/filesystems file is corrupt or missing. To temporarily replace the disk-based /etc/filesystems file, run the following commands:
     mount /dev/hd4 /mnt
     mv /mnt/etc/filesystems /mnt/etc/filesystems.[MMDDYY]
     cp /etc/filesystems /mnt/etc/filesystems
     umount /mnt
    
    MMDDYY represents the current two-digit representation of the Month, Day and Year, respectively.
  12. Type exit to exit from the shell. The file systems should automatically mount after you type exit. If you receive error messages, reboot into a limited function maintenance shell again to attempt to address the failure causes.
  13. If you have user-created file systems in the rootvg volume group, run fsck on them now. Enter:
     fsck /dev/[LVname] 
    
    LVname is the name of your user-defined logical volume.
  14. If you used the preceding procedure to temporarily replace the /etc/filesystems file, and you have user-created file systems in the rootvg volume group, you must also run the following command:
     imfs -l /dev/[LVname]
    
  15. If you used the preceding procedure to temporarily replace the /etc/filesystems file, also run the following command:
     imfs [VGname]
    
    The preceding commands can be repeated for each user-defined volume group on the system.
  16. If your system was hanging at LED 517 or 518 and you are unable to activate non-rootvg volume groups in Service mode, you can manually edit the/etc/filesystems file and add the appropriate entries.
    The file /etc/filesystems.MMDDYY saved in the preceding steps may be used as a reference if it is readable. However, the imfs method is preferred since it uses information stored in the logical volume control block to re-populate the /etc/filesystems file.
  17. If your system has a mode select key, turn it to the Normal position.
  18. Reboot the system into Normal mode using the following command:
     sync;sync;sync;reboot 
    
If you followed all of the preceding steps and the system still stops at an LED 517 or 518 during a reboot in Normal mode, you may want to consider re-installing your system from a recent backup. Isolating the cause of the hang could be excessively time-consuming and may not be cost-effective in your operating environment. To isolate the possible cause of the hang, would require a debug boot of the system. Instructions for doing this are included in the document "Capturing Boot Debug", available at IBM Technical Help Database for AIX. It is still possible, in the end, that isolation of the problem may indicate a restore or reinstall of AIX is necessary to correct it.

Recovery from an LED 553 in AIX

Question

Recovery from an LED 553 in AIX

Answer

This document describes a procedure to attempt to recover from an IPL hang at LED 553.

  1. About LED 553
  2. Recovery procedure
  3. Sample /etc/inittab file for AIX Versions 5 and 6
  4. Sample /etc/environment file for AIX.

1) About LED 553:

An LED value of 553 is a checkpoint code displayed to indicate the system transition to phase 3 of IPL. A halt or hang at LED 553 is often the result of a corrupted or missing /etc/inittab file. It can also be caused by full / (root) or /tmp file systems, inconsistencies in either startup configuration files, Object Data Manager (ODM) object class databases, or system library files. Additionally, a number of other issues involving file permissions, invalid hard links in the root file system, etc. have been observed to cause a hang at LED 553.

Summary of the recovery procedure:

To attempt to isolate the cause for an LED 553 hang, start by checking the root file systems with the fsck command. Then check /dev/hd3 and /dev/hd4 for space problems, and erase files if necessary. Check the /etc/inittab file for corruption, and fix it if necessary. If the inittab file was not corrupted, you will need to check the shell profile and environment files, the /bin/bsh file, as well as other system configuration files. A check of the consistency of all installed files within the installed fileset base and an update of the boot image should be done. To conclude, run the configuration manager to find out if there is a hang during device configuration.

2) Recovery procedure:

    1. Boot your system into a limited function maintenance shell (Service or Maintenance mode) from AIX bootable media.
      Please refer to your system user's or installation and service guide for specific IPL procedures related to your type and model of hardware. You can also refer to the document titled "Booting in Service Mode", available at http://techsupport.services.ibm.com/server/aix.srchBroker for more information.
    2. With bootable media of the same version and level as the system, boot the system. The bootable media can be any ONE of the following:
      • Bootable CD-ROM
      • mksysb
      • Bootable Install Tape
      Follow the screen prompts to the Welcome to Base OS menu.
    3. Choose Start Maintenance Mode for System Recovery (Option 3). The next screen contains prompts for the Maintenance menu.
      1. Choose Access a Root Volume Group (Option 1).
        The next screen displays a warning that indicates you will not be able to return to Base OS menu without rebooting.
      2. Choose 0 to continue.
        The next screen displays information about all volume groups on the system.
      3. Select the root volume group by number. The logical volumes in rootvg will be displayed with two options below.
      4. Choose Access this volume group and start a shell before mounting file systems (Option 2).
      If you get errors from the preceding option, do not continue with the rest of this procedure. Correct the problem causing the error. If you need assistance correcting the problem causing the error, contact one of the following:
      • local branch office
      • your point of sale
      • your AIX support center
      If no errors occur, proceed with the following steps.
    4. Run the following series of commands to check and repair file systems.
       fsck -p /dev/hd4
       fsck -p /dev/hd2
       fsck -p /dev/hd3
       fsck -p /dev/hd9var
       fsck -p /dev/hd1
      
      NOTE: The -p flag repairs minor file system problems without prompting for confirmation. Alternatively, the -y option gives the fsck command permission to repair file system corruption when necessary; it avoids having to answer multiple confirmation prompts manually, but its use can cause permanent data loss in some situations.
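      If you prefer to run the checks in one pass, a short loop such as the following (a sketch; adjust the list of logical volumes to match your rootvg) flags any file system that fsck could not clean:
         for lv in hd4 hd2 hd3 hd9var hd1
         do
            fsck -p /dev/$lv || echo "fsck reported problems on /dev/$lv"
         done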
    5. To format the default jfslog for the rootvg Journaled File Systems (JFS), run the following command:
       /usr/sbin/logform /dev/hd8
      
      Answer yes when asked if you want to destroy the log.
    6. Type exit to exit from the shell. The file systems should automatically mount after you type exit. If you receive error messages at this point, reboot into a limited function maintenance shell again to attempt to address the failure causes.
    7. Use the df command to check for free space in /dev/hd3 and /dev/hd4.
         df  /dev/hd3
         df  /dev/hd4
      
    8. If the output from the df command shows that either file system is out of space, erase some files from that file system. Three files you may want to erase are /smit.log, /smit.script, and /.sh_history.
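      For example (a sketch; verify that each file is safe to remove before deleting it):
         rm /smit.log /smit.script /.sh_history
         du -ak /tmp | sort -rn | head     # identify the largest files in /tmp
         df /dev/hd3 /dev/hd4              # confirm that space was freed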
    9. Next, check the /etc/inittab file for corruption. It may be empty or missing, or it may have an incorrect entry. For comparison, see the section "Sample /etc/inittab file" at the end of this document.
    10. If the inittab file is corrupt, set your terminal type in preparation for editing the file. (xxx stands for a terminal type, such as lft, ibm3151, or vt100.)
         TERM=xxx
         export TERM
      
      Now use an editor to create the /etc/inittab file. For an example, see the section "Sample /etc/inittab file" in this document. If your /etc/inittab file was corrupt and you recreated it, the following steps may not be necessary.
      There are only three entries which must be in the /etc/inittab file to successfully boot the system. If your /etc/inittab file is missing or corrupted AND you are unable to use an editor while in Service mode, do the following to create a minimal inittab file to boot the machine into run level 2 (Normal mode).
        mv /etc/inittab /etc/inittab.MMDDYY
        touch /etc/inittab
        chmod 544 /etc/inittab
        chown root:system /etc/inittab
        echo 'init:2:initdefault:' >> /etc/inittab
        echo 'brc::sysinit:/sbin/rc.boot 3 >/dev/console 2>&1' >> /etc/inittab
        echo 'cons:0123456789:respawn:/usr/sbin/getty /dev/console' >> /etc/inittab
      
      MMDDYY represents the current two-digit representation of the Month, Day and Year respectively.
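      To confirm that the minimal file was written correctly (a quick check, assuming the three echo commands above were used), display it and compare it against the three entries shown:
         cat /etc/inittab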
    11. Use the following command to check for any modifications or problems with permissions on shell startup files.
      NOTE: The /.kshrc and /.profile files are not necessary for the system to boot into run level 2 (Normal mode) and, in fact, may not exist on your system.
         ls -al /.kshrc /.profile /etc/environment /etc/profile
      
      Sample output:
      -rw-r--r--  1 root  system   71 Dec 14 1993  /.kshrc
      -rw-r--r--  1 root  system  158 Dec 14 1993  /.profile
      -rw-rw-r--  1 root  system 1389 Oct 26 1993  /etc/environment
      -rw-r-xr-x  1 bin   bin    1214 Jan 22 1993  /etc/profile
      
      /etc/profile or /.profile may contain a command that is valid only in the Korn shell. Change the command to something that is also valid in the Bourne shell. For example, change the following:
         export PATH=/bin:/usr/bin/:/etc:/usr/ucb:.
      
      to the following:
         PATH=/bin:/usr/bin/:/etc:/usr/ucb:.
         export PATH
      
      /etc/environment is a special case. The only commands it may contain are simple variable assignments, such as statements of the form (varname)=(value). Check this file with an editor to verify the format. See the section "Sample /etc/environment file" at the end of this document.
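      A quick way to spot anything that is not a plain assignment (a sketch; it treats lines starting with # as comments and flags every other non-blank line that does not begin with name=) is:
         grep -v '^#' /etc/environment | grep -v '^[A-Za-z_][A-Za-z0-9_]*=' | grep -v '^$'
      Any line printed by this pipeline should be reviewed and corrected or removed.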
    12. Check for missing or moved files, or changed ownership/permissions with the following command:
         ls -al /bin /bin/bsh /bin/sh /lib /unix /u
      
      Sample output:
      lrwxrwxrwx 1 bin  bin       8 Aug 5 1994 /bin -> /usr/bin
       -r-xr-xr-x 3 bin  bin 256224 Jun 4 1993 /bin/bsh
       -r-xr-xr-x 3 bin  bin 256224 Jun 4 1993 /bin/sh
      lrwxrwxrwx 1 bin  bin       8 Aug 5 1994 /lib -> /usr/lib
      lrwxrwxrwx 1 bin  bin       5 Aug 5 1994 /u -> /home 
      lrwxrwxrwx 1 root system    18 Aug 5 1994 /unix -> /usr/lib/boot/unix
      
      If any of these files are missing, the problem may be a missing symbolic link. Use the commands from the following list that correspond to the missing links.
         ln -s /usr/bin /bin
         ln -s /usr/lib/boot/unix /unix
         ln -s /usr/lib /lib
         ln -s /home /u
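      If you only want to recreate the links that are actually missing, a conditional form of the same commands (a sketch) is:
         [ -e /bin ]  || ln -s /usr/bin /bin
         [ -e /unix ] || ln -s /usr/lib/boot/unix /unix
         [ -e /lib ]  || ln -s /usr/lib /lib
         [ -e /u ]    || ln -s /home /u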
      
    13. Use the following command to make sure that rc.boot is not missing or corrupt.
         ls -l /sbin/rc.boot
      
      Sample output:
      -rwxrwxr-- 1 root system 33760 Aug 30 1993 /sbin/rc.boot
      
    14. Make sure the /etc/inittab file is for AIX Version 5 or 6. For these versions, the line that begins with brc is
         brc::sysinit:/sbin/rc.boot 3 >/dev/console 2>&1
      
      See the section "Sample /etc/inittab file" in this document for an example.
    15. If you have not found any obvious problems, try substituting ksh for bsh with the following series of commands. (The first command saves a copy of your original bsh before you overwrite it.)
         cp /bin/bsh /bin/bsh.orig
         cp /bin/ksh /bin/bsh
      
      If you can then reboot successfully, this indicates that one of the profiles was causing problems for bsh. Check the profiles again by running the following:
         /bin/bsh.orig /.profile
         /bin/bsh.orig /etc/profile
         /bin/bsh.orig /etc/environment
      
      If you receive errors with any of the preceding commands, this indicates that there is a command in that profile that bsh cannot handle.
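      When you have finished testing, restore the original Bourne shell from the copy saved earlier (a sketch of the reverse copy):
         cp /bin/bsh.orig /bin/bsh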
    16. To run a checksum validation of all files in the installed fileset base and a consistency check of the fileset installation, run the following commands:
       lppchk -c
       lppchk -v
       lppchk -l
      
      NOTE: These commands should not produce any output. If they do, examine the messages to assess whether they indicate a potential cause of the hang.
    17. Determine the boot drive and update the boot image with the following command:
       lslv -m hd5
      
      Sample output:
         hd5:N/A
         LP    PP1  PV1               PP2  PV2               PP3  PV3
         0001  0001 hdisk0
      
      The disk shown under the PV1 column is the disk name you should use when running the following two commands:
       bosboot -ad /dev/hdisk0
       bootlist -m normal hdisk0
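       # Optional (a sketch): confirm the boot list that was just set; output varies by configuration
       bootlist -m normal -o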
      
    18. To check the device configuration routines, the following command should identify any problems associated with configuration routines:
       cfgmgr -vp 2
      
      If the cfgmgr command hangs, this is likely the cause of the system hang. You may be able to stop the command by pressing Ctrl-C; however, a reboot is often required to get back into Service mode and continue troubleshooting the problem.
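      Capturing the verbose output in a file (a sketch; /tmp/cfgmgr.out is an arbitrary name) makes it easier to see which device configuration method was running when the hang occurred:
         cfgmgr -vp 2 2>&1 | tee /tmp/cfgmgr.out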
    19. If your model has a mode select key, turn it to the Normal position.
    20. Attempt to reboot the system into Normal mode by running the following command:
       sync;sync;sync;reboot
      
    If you followed all of the preceding steps and the system still stops at an LED 553 during a reboot in Normal mode, you may want to consider reinstalling your system from a recent backup. Isolating the cause of the hang could be excessively time-consuming and may not be cost-effective in your operating environment; it would require a debug boot of the system. Instructions for doing this are included in the document "Capturing Boot Debug", available at http://techsupport.services.ibm.com/server/aix.srchBroker. Even then, isolating the problem may show that a restore or reinstall of AIX is necessary to correct it.
    If you wish, you may pursue further system recovery assistance from one of the following:
    • local branch office
    • your point of sale
    • your AIX support center

Sample /etc/inittab file for AIX Versions 5 and 6

:  US Government Users Restricted Rights - Use, duplication or
:  disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
:
: Note - initdefault and sysinit should be the first and second entry.
:
init:2:initdefault:
brc::sysinit:/sbin/rc.boot 3 >/dev/console 2>&1 # Phase 3 of system boot
powerfail::powerfail:/etc/rc.powerfail 2>&1 | alog -tboot > /dev/console # Power  Failure Detection
load64bit:2:once:/etc/methods/cfg64 >/dev/console 2>&1 # Enable 64-bit execs
rc:2:wait:/etc/rc 2>&1 | alog -tboot > /dev/console # Multi-User checks
fbcheck:2:wait:/usr/sbin/fbcheck 2>&1 | alog -tboot > /dev/console # run /etc/firstboot
srcmstr:2:respawn:/usr/sbin/srcmstr # System Resource Controller
rctcpip:2:wait:/etc/rc.tcpip > /dev/console 2>&1 # Start TCP/IP daemons
rcnfs:2:wait:/etc/rc.nfs > /dev/console 2>&1 # Start NFS Daemons
cron:2:respawn:/usr/sbin/cron
piobe:2:wait:/usr/lib/lpd/pio/etc/pioinit >/dev/null 2>&1  # pb cleanup
uprintfd:2:respawn:/usr/sbin/uprintfd
logsymp:2:once:/usr/lib/ras/logsymptom # for system dumps
pmd:2:wait:/usr/bin/pmd > /dev/console 2>&1 # Start PM daemon
diagd:2:once:/usr/lpp/diagnostics/bin/diagd >/dev/console 2>&1
dt:2:wait:/etc/rc.dt
cons:0123456789:respawn:/usr/sbin/getty /dev/console

Sample /etc/environment file for AIX Versions 5 and 6

# @(#)18        1.21  src/bos/etc/environment/environment, cmdsh, bos430, ...
PATH=/usr/bin:/etc:/usr/sbin:/usr/ucb:/usr/bin/X11:/sbin:/local/netscape:/usr/local/bin
TZ=CST6CDT
LANG=en_US
LOCPATH=/usr/lib/nls/loc
MOZILLA_HOME=/local/netscape
export MOZILLA_HOME
NLSPATH=/usr/lib/nls/msg/%L/%N:/usr/lib/nls/msg/%L/%N.cat
LC__FASTMSG=true
PS1='MYSYSTEM $PWD=>'
set -o vi
# ODM routines use ODMDIR to determine which objects to operate on
# the default is /etc/objrepos - this is where the device objects
# reside, which are required for hardware configuration
ODMDIR=/etc/objrepos