IBM AIX Admin Best Practices

When I started my career as an AIX admin, it was all Greek and Latin to me. I was scared and did not know where or how to start.

In this article, I will walk you through best practices that make life easier as an AIX admin.

These also apply to other flavors of Unix administration (such as Solaris, Linux and HP-UX); the only difference is that some commands differ from flavor to flavor.

Rule 1: Learn the Process

If you get this area right, life gets much easier going forward.

I would especially like to emphasize the points below.

Follow the ITIL processes, which most companies have adopted.
Get to know the SLAs (Service Level Agreements).
Always acknowledge tickets within the SLA and update them at regular intervals.
If it is a P1, log in to the bridge call and voice your findings promptly and wisely.
Always perform approved changes within the prescribed change window.

The table below briefly describes the different ITIL processes.

Process Name	Definition	Priorities	Tools
Incident/Ticket Management	An unplanned interruption to an IT service or a reduction in the quality of an IT service. Failure of a configuration item that has not yet impacted service is also an incident.	P1, P2, P3 & P4	Maximo, BMC Remedy, HP Service Manager, Peregrine
Change Management	A process to control and coordinate all changes to an IT production environment.	Normal, Urgent, Emergency & Expedite	Maximo, BMC Remedy, HP Service Manager, Peregrine
Service Request/Task Management	Monitoring and reporting of the agreed Key Performance Indicators (KPIs) to demonstrate compliance to the customer and management.		Maximo, BMC Remedy, HP Service Manager, Peregrine
Problem Management	The activities required to diagnose the root cause of incidents identified through the Incident Management process, and to determine the resolution to those problems.		Maximo, BMC Remedy, HP Service Manager, Peregrine
Release Management	The process of managing software releases from the development stage to release.

Rule 2: Get to Know Your Coverage Area & Contacts

Make an inventory of the servers you support: if possible, build the sheet so that it includes environment, jump server, application, datacentre location and console information.
Get access to the servers.
Collect vendor contact information, including phone numbers.
Keep application team contact information handy as well.

Rule 3: Day to day operations

Backups:

Take a system backup (mksysb) regularly, at least once a week, and keep it on another server, preferably the NIM server.
Verify the mksysb with "lsmksysb -l -f /mksysbimg" (check the size).
Check /etc/exclude.rootvg to see if any important filesystem or directory is excluded from the mksysb backup.
Ensure non-rootvg file systems are backed up with your company's backup software (e.g. TSM or NetBackup).
Take a system snap every week (make a cron entry) and keep the log file on a different server, or make a copy on your desktop.
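The exclude-file check in the list above can be scripted. This is only a minimal sketch: it assumes the exclude file holds grep-style patterns that are matched against paths written relative to / (for example ./tmp/scratch). The function name is_excluded and the path /opt/app/config are made up for illustration.

```shell
#!/bin/sh
# is_excluded EXCLUDE_FILE PATH
# Succeeds (exit 0) if PATH matches a pattern in EXCLUDE_FILE.
# Assumption: patterns are grep-style and paths are compared in the
# "./dir/file" form, i.e. relative to /.
is_excluded() {
    exclude_file=$1
    path=$2
    # A missing or empty exclude file excludes nothing.
    [ -s "$exclude_file" ] || return 1
    printf '.%s\n' "$path" | grep -E -f "$exclude_file" >/dev/null 2>&1
}

# Example: warn if an important directory would be left out of the backup.
# /opt/app/config is a hypothetical path.
if is_excluded /etc/exclude.rootvg /opt/app/config; then
    echo "WARNING: /opt/app/config is excluded from mksysb"
fi
```

Treat the result as a hint only and still read the exclude file yourself, since a blank line in a pattern file can match everything.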

System Consistency Checks:

Ensure the current OS level is consistent: "oslevel -s;oslevel -r;instfix -i|grep ML;instfix -i|grep SP;lppchk -v" [If the OS is inconsistent, first bring it to a consistent state and then proceed with the change].
Proactively remediate compliance issues.
Check your firm's policy on server uptime and arrange reboots accordingly; some organizations require a reboot within every 90 or 180 days.
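The OS level commands above can be wrapped in a small script so that the result is a clear yes or no. This is a sketch under two assumptions: that instfix reports an incomplete level with the text "Not all filesets", and that "lppchk -v" exits non-zero when it finds a problem. The function name incomplete_levels is made up for this example.

```shell
#!/bin/sh
# incomplete_levels: read "instfix -i" output on stdin, print any lines
# that report an incomplete level, and fail (exit 1) if there are any.
# Assumption: incomplete levels are reported with the text
# "Not all filesets".
incomplete_levels() {
    if grep 'Not all filesets'; then
        return 1
    fi
    return 0
}

# Run the checks only where the AIX commands exist.
if command -v instfix >/dev/null 2>&1; then
    oslevel -s
    oslevel -r
    # Assumption: lppchk -v exits non-zero when it finds a problem.
    if instfix -i | grep -E 'ML|SP' | incomplete_levels && lppchk -v; then
        echo "OS level looks consistent"
    else
        echo "OS level is inconsistent: fix this before the change"
    fi
fi
```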

Troubleshooting issues:

Don't fix issues without a proper incident record.
Engage the relevant parties while working on the issue.
Always try to get information about the issue from the user (requestor) with questions like "what, when, where".
Look at errpt first.
Check 'alog -t console -o' to see if it is a boot issue.
The log files mentioned in "/etc/syslog.conf" may also give more information for the investigation.
Check backups if you are looking into configuration change issues.
If you are running out of time, involve your next-level team and managers.
Take help from vendors like IBM, EMC or Symantec if necessary.
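To find the log files that /etc/syslog.conf points at, a helper like the one below may save some time. It is a sketch that assumes each active line has the form "selector destination [options]" and that file destinations start with "/". The function name syslog_targets is made up for this example.

```shell
#!/bin/sh
# syslog_targets [CONF]
# Print the log files named in a syslog configuration file, one per line.
# Assumption: each active line has the form "selector destination [options]"
# and file destinations start with "/".
syslog_targets() {
    conf=${1:-/etc/syslog.conf}
    awk '$1 !~ /^#/ && $2 ~ /^\// { print $2 }' "$conf" | sort -u
}

# Example: show the last few lines of every configured log file.
if [ -r /etc/syslog.conf ]; then
    for f in $(syslog_targets); do
        [ -r "$f" ] && { echo "== $f"; tail -5 "$f"; }
    done
fi
```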

P1 issues:

If it is a priority 1 (P1) issue, you may need to consider a few additional points apart from the above.

On sev1 issues, update the SDM (Service Delivery Manager) in the ST/Communicator multi chat at regular intervals.
If you are verbally asked over the conference voice call (bridge call) to perform a change, get the confirmation in writing in the ST multi chat.
Update the incident record (IR) at regular intervals.
Update your team on the issue status (via mail).
Document any new learnings (from issues/changes) and share them with the team.

Working on a Change:

Rule of thumb: a change should move in sequence through the DEV ==> UAT/QA ==> PROD environment servers.

Make sure the change record is fully approved; otherwise don't start any of your tasks.
Ensure a properly validated CR procedure is in place: Precheck -> Installation -> Backout -> Post-Verification.
Suppress alerts if needed.
Remember that the Application/Database teams are responsible for their own Application/Database backup/restore and stop/start, so alert the application teams.
Check the history of the servers (CRs or IRs) to see if there were any issues or change failures on them.
EXPECT THE UNEXPECTED: ensure you have a proper back-out plan in place.
Ensure you are on the right server ('uname -n'/'hostname') before you perform the change.
Make sure your ID as well as the root ID is not expired and is working.
Ensure no one else from your team is working on the same task, so that one change is not performed by multiple SAs. It is better to verify with the 'who -u' command whether any SAs are already working on the server.
Remember: one change at a time. Multiple changes could cause problems and can complicate troubleshooting.
Ensure there are no conflicting changes from other departments such as SAN, network, firewall or application, which could interfere with your change.
Record the commands run and the console output in a notepad file (named after the change).
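Two of the points above, checking that you are on the right server and recording what you ran, are easy to script. The following is a rough sketch rather than a standard tool: confirm_host and log_cmd are made-up helper names, and the host name and change number in the example are placeholders.

```shell
#!/bin/sh
# confirm_host EXPECTED
# Succeeds only if this machine's short host name equals EXPECTED.
confirm_host() {
    expected=${1%%.*}
    actual=$(uname -n)
    actual=${actual%%.*}
    if [ "$actual" != "$expected" ]; then
        echo "Refusing to continue: on '$actual', expected '$expected'" >&2
        return 1
    fi
}

# log_cmd LOGFILE COMMAND [ARGS...]
# Run COMMAND, appending a timestamped header and its output to LOGFILE
# while still showing the output on the terminal.
# Note: the return status is that of tee, not of COMMAND.
log_cmd() {
    logfile=$1
    shift
    {
        echo "### $(date '+%Y-%m-%d %H:%M:%S') $*"
        "$@" 2>&1
    } | tee -a "$logfile"
}

# Example (aixprod01 and CHG0012345 are placeholder names):
#   confirm_host aixprod01 || exit 1
#   log_cmd /tmp/CHG0012345.log who -u
```

Keeping the log in a file named after the change makes it simple to attach to the change record afterwards.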

If it is a configuration change:

Take a backup of the pre and post values and document them.
Take screenshots if you are comfortable doing so.
If you are updating a configuration file, first take a copy: # cp -p <filename> <filename>_`date +"%m_%d_%Y_%H:%M"`
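The timestamped copy can be wrapped in a small function so that the same naming is used every time. This is a minimal sketch; backup_file is a made-up name, and the date format simply follows the one used in this article.

```shell
#!/bin/sh
# backup_file FILE
# Copy FILE to FILE_<month>_<day>_<year>_<hour>:<minute>, preserving
# ownership, mode and timestamps (cp -p), and print the name of the copy.
backup_file() {
    src=$1
    dest="${src}_$(date +%m_%d_%Y_%H:%M)"
    cp -p "$src" "$dest" && echo "$dest"
}

# Example: back up a file before editing it.
#   backup_file /etc/hosts
```

Printing the name of the copy makes it easy to paste into the change record as evidence of the pre-change state.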

If it is a change to reboot or update software:

Check if the server is part of a cluster (HACMP/PowerHA); if so, you have to follow a different procedure.
Always remember that three essential things must be in place before you perform any change: "backup (mksysb); system information; console".
Take the system configuration information (sysinfo script).
Check LV/filesystem consistency with "df -k" (df should not hang); all LVs should be in sync state: "lsvg -o|lsvg -il".
Check errpt and 'alog -t console -o' to see if there are any errors.
Ensure the latest mksysb (OS image backup) is kept on the relevant NIM server.
Ensure a backup of the non-rootvg file systems has been taken.
Verify the boot list and boot device: "bootlist -m normal -o" "ipl_varyon -i".
Log in to the HMC console.
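The "all LVs should be in sync state" check can be automated by scanning the lsvg listing for stale volumes. The sketch below assumes the usual "lsvg -l" layout in which the sixth column is the LV STATE (for example open/syncd or open/stale). The function name stale_lvs is made up for this example.

```shell
#!/bin/sh
# stale_lvs: read "lsvg -l" style output on stdin and print the names of
# logical volumes whose state mentions "stale"; fail if any are found.
# Assumption: the sixth column is the LV STATE (e.g. open/syncd).
stale_lvs() {
    awk '$6 ~ /stale/ { print $1; found = 1 } END { exit found + 0 }'
}

# Run the check only where lsvg exists.
if command -v lsvg >/dev/null 2>&1; then
    if lsvg -o | lsvg -il | stale_lvs; then
        echo "All logical volumes are in sync"
    else
        echo "Stale logical volumes found (listed above): resolve before the change"
    fi
fi
```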

Additional points for reboot:

Put the servers in maintenance mode (stop alerts) to avoid unnecessary incident alerts.
Check the filesystem count with "df -g|wc -l"; verify the count after the migration or reboot.
Ensure there are no scheduled reboots in crontab. If there are any, comment them out before you proceed with the change.
If the system has not been rebooted for a long time (> 100 days), perform 'bosboot' and then reboot the machine (verify the fs/appfs after the reboot), and only then commence the migration/upgrade. [Don't reboot the machine if the bosboot fails!]
Read the log messages carefully; don't ignore warnings.
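Comparing only the filesystem count tells you that something is missing, but not what. A variation is to save the list of mount points before the reboot and compare it afterwards. This is a sketch: list_mounts and missing_mounts are made-up names, the /var/tmp paths are placeholders, and "df -k" is used instead of "df -g" only because it exists on most Unix flavors.

```shell
#!/bin/sh
# list_mounts: print the mount point of every mounted filesystem, sorted.
# The mount point is taken to be the last column of each df -k line.
list_mounts() {
    df -k | awk 'NR > 1 { print $NF }' | sort
}

# missing_mounts BEFORE AFTER
# Print mount points listed in BEFORE but not in AFTER; fail if any exist.
# Both files must be sorted, as produced by list_mounts.
missing_mounts() {
    missing=$(comm -23 "$1" "$2")
    if [ -n "$missing" ]; then
        echo "$missing"
        return 1
    fi
}

# Example (the /var/tmp paths are placeholders):
#   list_mounts > /var/tmp/mounts.before     # before the reboot
#   list_mounts > /var/tmp/mounts.after      # after the reboot
#   missing_mounts /var/tmp/mounts.before /var/tmp/mounts.after
```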

Additional points for OS & S/W upgrades:

Ensure hd5 (boot LV) is 32MB (contiguous PPs) [very important for migration].
For OS updates, initiate the change on the console. If there is a network disconnection during the change, you can reconnect to the console and get the menus back.
If the situation demands, ensure there is enough free filesystem space (/usr, /var, /) for the change.
Have the patches/filesets pre-downloaded and verified.
Check/verify the repositories on the NIM/central server; check whether these repositories were tested or used earlier.
If there are two disks in rootvg, perform an alt disk clone to one of them. This is the fastest and safest back-out method in case of failure. Even if you perform an alt disk clone, make sure you take a mksysb as well.
For a migration change, check whether SAN storage (IBM/EMC..) is used; if so, you have to follow the procedure of exporting the VGs and uninstalling the sdd* filesets, and after the migration reinstalling the sdd* filesets, reimporting the VGs, etc.
Perform a preview (TL/SP upgrade) before the actual change; see if any errors are reported in the preview (look for the keywords 'error' / 'fail') and read the tail/summary of the messages.
Even if the preview reports 'ok' at the header, you still have to look through the messages and read the tail/summary of the preview.
If the preview reports any missing dependency/requisite filesets, have those downloaded as well.
Ensure you have enough free space in rootvg: a minimum of 1-2 GB should be free for a TL upgrade or OS migration.
Ensure the application team has tested their application on the new TL/SP/OS to which you are upgrading the system.
If you have multiple PuTTY sessions open, name the sessions accordingly [Putty -> under behaviour -> window title]; this helps you get to the right session quickly. Alternatively, use PuttyCM (Putty Connection Manager).
For TL upgrades, go TL by TL; jumping directly to the target TL can sometimes cause problems.
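The keyword search on the preview output can be scripted once the preview has been saved to a file. The helper below is only a sketch: scan_preview is a made-up name, and it assumes you have redirected the preview output to a log file yourself. A keyword scan supplements reading the tail/summary; it does not replace it.

```shell
#!/bin/sh
# scan_preview LOGFILE
# Print, with line numbers, every line of a saved preview log that mentions
# "error" or "fail" (case-insensitive); fail if any such line exists.
scan_preview() {
    if grep -n -i -E 'error|fail' "$1"; then
        return 1
    fi
    echo "No 'error' or 'fail' lines found in $1; still read the summary at the end."
}

# Example (the log path is a placeholder):
#   scan_preview /tmp/tl_preview.log || echo "Review the lines above first"
```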

What if you are crossing the change window?

Inform the relevant application teams and SDMs, and get an extension with proper approvals.
Raise an incident record to support the issue.

What if the change fails?

Inform the relevant application teams and SDMs.
Close the record with the facts.
Attend the change review calls for the failed changes.

Successful Change:

If possible, send the success status to the relevant parties with artifacts.
Update the change request with the relevant artifacts and close it.

Last but not least:

Don't hesitate to ask your teammates or vendor support for help when an issue is taking more time.
Inform your managers if the issue is in an escalation situation (if it is a P1, you need to inform them beforehand).
Always perform changes with proper approvals in place.
Take backups and make your life easy.

Happy Unixing!