Friday 21 February 2014

IBM AIX- Admin Best Practices

When I started my career as an AIX admin, it was all Greek and Latin to me. I was scared and did not know where or how to start.

In this article, I will walk you through best practices for an AIX admin that will make your life easier.

These also apply to other flavors of Unix administration (such as Solaris, Linux and HP-UX); the only difference is that some commands differ from flavor to flavor.

Rule 1: Learn the Process

If you get this area right, life becomes much easier going forward.

I would like to emphasize the points below in particular.

  • Follow the ITIL processes, which most companies have adopted.
  • Get to know the SLAs (Service Level Agreements).
  • Always acknowledge tickets within the SLA and update them at regular intervals.
  • If it's a P1, log in to the bridge call and voice your findings promptly and wisely.
  • Always perform approved changes within the prescribed change window.

The table below briefly describes the different ITIL processes.

| Process Name | Definition | Priorities | Tools |
| --- | --- | --- | --- |
| Incident/Ticket Management | An unplanned interruption to an IT Service or a reduction in the Quality of an IT Service. Failure of a Configuration Item that has not yet impacted Service is also an Incident. | P1, P2, P3 & P4 | Maximo, BMC Remedy, HP Service Manager, Peregrine |
| Change Management | A process to control and coordinate all changes to an IT production environment. | Normal, Urgent, Emergency & Expedite | Maximo, BMC Remedy, HP Service Manager, Peregrine |
| Service Request/Task Management | Monitoring and reporting of the agreed Key Performance Indicators (KPIs) relating to compliance with the customer and management. | | Maximo, BMC Remedy, HP Service Manager, Peregrine |
| Problem Management | Problem Management includes the activities required to diagnose the root cause of incidents identified through the Incident Management process, and to determine the resolution to those problems. | | Maximo, BMC Remedy, HP Service Manager, Peregrine |
| Release Management | Process of managing software releases from development stage to software release. | | |


Rule 2: Get to Know Your Coverage Area and Contacts

  • Make an inventory of supported servers: if possible, build the sheet so that it includes environment, jump server, application, datacentre location and console information.
  • Get access to the servers.
  • Collect vendor contact information, including phone numbers.
  • Also keep application team contact information handy.

Rule 3: Day to day operations

Backups:

  • Take a system backup (mksysb) regularly, at least once a week, and keep it on another server, preferably the NIM server.
  • Verify the mksysb with "lsmksysb -l -f /mksysbimg" (check the size).
  • Check /etc/exclude.rootvg to see if any important filesystem or directory is excluded from the mksysb backup.
  • Ensure non-rootvg file systems are backed up with your company's backup software (e.g. TSM or NetBackup).
  • Take a system snap every week (make a cron entry) and keep the log file on a different server, or keep a copy on your desktop.
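
The /etc/exclude.rootvg check above can be partly automated. The sketch below assumes each line of the exclude file is a grep pattern matched against paths written relative to / with a leading "./" (for example ^./tmp/); the function name excluded_paths is made up for this illustration and is not part of AIX.

```shell
# Print every given path that an exclude file would drop from a mksysb -e backup.
# Assumption: exclude entries are grep patterns applied to "./"-relative paths.
# Pass directories with a trailing slash, e.g. /home/app/.
excluded_paths() (
    exclude_file=$1
    shift
    # No exclude file, or an empty one, means nothing is excluded.
    [ -s "$exclude_file" ] || exit 0
    for p in "$@"; do
        rel="./${p#/}"                      # /home/app/conf -> ./home/app/conf
        if printf '%s\n' "$rel" | grep -q -f "$exclude_file"; then
            printf '%s\n' "$p"
        fi
    done
)
```

On a server you might call it as: excluded_paths /etc/exclude.rootvg /home/ /opt/ /var/adm/ and treat any output as a prompt to review the exclude file before the next mksysb run.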

System Consistency Checks:

  • Ensure the current OS level is consistent: "oslevel -s;oslevel -r;instfix -i|grep ML;instfix -i|grep SP;lppchk -v" [If the OS is inconsistent, first bring it to a consistent state and then proceed with the change].
  • Proactively remediate compliance issues.
  • Check your firm's policy on server uptime and arrange reboots accordingly; organizations generally require uptime to stay under 90 or 180 days.
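
When recording the OS level in a change record, it can help to split the "oslevel -s" string into its parts. This is a minimal sketch, assuming the usual four-field form such as 7100-04-02-1614 (base level, technology level, service pack, build year/week); parse_oslevel is a hypothetical helper name.

```shell
# Split an "oslevel -s" style string into labelled fields.
# Assumes the form BASE-TL-SP-YYWW, e.g. 7100-04-02-1614.
parse_oslevel() (
    IFS=-
    set -- $1                               # word-split the argument on "-"
    if [ $# -ne 4 ]; then
        echo "unexpected oslevel format" >&2
        exit 1
    fi
    printf 'base=%s tl=%s sp=%s build=%s\n' "$1" "$2" "$3" "$4"
)
```

On AIX this would typically be called as: parse_oslevel "$(oslevel -s)". Comparing the tl and sp fields across servers is a quick way to spot machines that have drifted from the rest of the estate.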

Troubleshooting issues:

  1. Don't fix issues without a proper incident record.
  2. Engage the relevant parties while working on the issue.
  3. Always try to get information about the issue from the user (requestor) with questions like "what, when, where".
  4. Look at errpt first.
  5. Check 'alog -t console -o' to see if it's a boot issue.
  6. The log files listed in "/etc/syslog.conf" may also give more information for the investigation.
  7. Check backups if you're looking into configuration change issues.
  8. If you're running out of time, involve your next-level team and managers.
  9. Take help from vendors like IBM, EMC or Symantec if necessary.
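
Since errpt is the first place to look, a small filter can pull the most urgent entries out of a long report. The sketch below assumes the default errpt summary layout, where the third column (T) is the error type and the fourth (C) is the class; permanent_hw_errors is a made-up name for illustration.

```shell
# Keep only permanent (T = P) hardware (C = H) entries from errpt summary output.
# Reads the report on stdin so it can also be run against a saved copy.
permanent_hw_errors() {
    awk '$3 == "P" && $4 == "H"'
}
```

A typical call would be: errpt | permanent_hw_errors. Any identifier it prints can then be examined in detail with errpt -a -j followed by that identifier.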

P1 issues:

If it's a priority 1 (P1) issue, you may need to consider a few additional points apart from the above.
  1. On sev1 issues, update the SDM (Service Delivery Manager) in the ST/Communicator multi-chat at regular intervals.
  2. If you are verbally asked on the conference voice call (bridge call) to perform any change, get the confirmation in writing in the ST multi-chat.
  3. Update the incident record (IR) at regular intervals.
  4. Update your team on the issue status (via mail).
  5. Document any new learnings (from issues/changes) and share them with the team.

Working on a Change:

Rule of thumb: a change should move in sequence through the DEV ==> UAT/QA ==> PROD environment servers.
  1. Make sure the change record is fully approved; otherwise don't start any of your tasks.
  2. Ensure a properly validated CR procedure is in place: Precheck -> Installation -> Backout -> Post-Verification.
  3. Suppress alerts if needed.
  4. Remember that the application/database teams are responsible for their own backup/restore and stop/start, so alert the application teams.
  5. Check the history of the servers (CRs or IRs) to see if there were any issues or change failures on them.
  6. EXPECT THE UNEXPECTED: ensure you have a proper back-out plan in place.
  7. Ensure you are on the right server ('uname -n' / 'hostname') before you perform the change.
  8. Make sure your ID and the root ID have not expired and are working.
  9. Ensure no one else from your team is working on the same task, to avoid one change being performed by multiple SAs. It's better to verify with the 'who -u' command whether any SAs are already working on the server.
  10. Remember: one change at a time. Multiple changes could cause problems and complicate troubleshooting.
  11. Ensure there are no conflicting changes from other departments such as SAN, network, firewall or application that could hamper your change.
  12. Record the commands you run and the console output in a notepad file (named after the change).
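
The "right server" check in point 7 is easy to build into any change script so that it refuses to run on the wrong host. A minimal sketch follows; require_host is a hypothetical helper, and the expected hostname is whatever the change record names.

```shell
# Abort unless the current host matches the one named in the change record.
require_host() (
    expected=$1
    actual=$(uname -n)
    if [ "$actual" != "$expected" ]; then
        echo "ABORT: logged in to $actual, but the change is for $expected" >&2
        exit 1
    fi
)
```

Placing a line such as: require_host aixprod01 || exit 1 at the top of a script (aixprod01 being a placeholder name) stops it before any command has touched the wrong machine.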

If it's a configuration change:

  • Take a backup of the pre- and post-change values and document them.
  • Take screenshots if you are comfortable doing so.
  • If you are updating a configuration file, take a timestamped copy first:
    # cp -p <filename> <filename>_`date +"%m_%d_%Y_%H:%M"`
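
The timestamped copy can be wrapped in a small function so the naming stays consistent across changes. This is only a sketch built on the same cp -p and date format as above; backup_file is a made-up name, and the refusal to overwrite an existing copy is an added safety assumption rather than anything AIX requires.

```shell
# Copy a file to <name>_<MM_DD_YYYY_HH:MM>, preserving mode, owner and times,
# and print the name of the copy so it can be pasted into the change record.
backup_file() (
    src=$1
    dest="${src}_$(date +%m_%d_%Y_%H:%M)"
    if [ -e "$dest" ]; then
        echo "backup $dest already exists, not overwriting" >&2
        exit 1
    fi
    cp -p "$src" "$dest" && printf '%s\n' "$dest"
)
```

For example, backup_file /etc/inittab prints the path of the copy it made, which doubles as the back-out reference for that file.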

If it's a change to reboot or update software:

  1. Check whether the server is part of a cluster (HACMP/PowerHA); if so, you have to follow a different procedure.
  2. Always remember that three essential things must be in place before you perform any change: "backup (mksysb); system information; console".
  3. Take the system configuration information (sysinfo script).
  4. Check LV/filesystem consistency: "df -k" (df should not hang); all LVs should be in sync state: "lsvg -o|lsvg -il".
  5. Check errpt and 'alog -t console -o' to see if there are any errors.
  6. Ensure the latest mksysb (OS image backup) is kept on the relevant NIM server.
  7. Ensure a backup of the non-rootvg file systems has been taken.
  8. Verify the boot list and boot device: "bootlist -m normal -o" "ipl_varyon -i".
  9. Log in to the HMC console.
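
The sync-state check in point 4 can be scripted rather than read by eye. The sketch below assumes the usual "lsvg -l" row layout, where the sixth whitespace-separated field is the LV state (for example open/syncd or open/stale); stale_lvs is a hypothetical name.

```shell
# Print the names of logical volumes whose LV STATE contains "stale".
# Reads "lsvg -o | lsvg -il" style output on stdin.
stale_lvs() {
    awk '$6 ~ /stale/ { print $1 }'
}
```

On the server this would be run as: lsvg -o | lsvg -il | stale_lvs. Empty output means every listed LV reported a synced state; any name printed should be resolved before a reboot or upgrade goes ahead.
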
Additional points for reboot:
  1. Put the servers in maintenance mode (stop alerts) to avoid unnecessary incident alerts.
  2. Check the filesystem count with "df -g|wc -l"; verify the count after the migration or reboot.
  3. Ensure there are no scheduled reboots in crontab. If there are any, comment them out before you proceed with the change.
  4. If the system has not been rebooted for a long time (> 100 days), run 'bosboot' and then reboot the machine (verify the fs/appfs after reboot), and only then start the migration/upgrade. [Don't reboot the machine if bosboot fails!]
  5. Read the log messages carefully; don't ignore warnings.
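
A count tells you that a filesystem is missing after the reboot but not which one. A small extension of point 2 is to save the list of mount points beforehand and compare it afterwards. This sketch assumes two plain text files with one mount point per line; missing_mounts and the file names in the usage note are made up for illustration.

```shell
# Print mount points listed in the "before" file that are absent from the "after" file.
# $1 = list saved before the reboot, $2 = list taken after it.
missing_mounts() {
    # grep exits 1 when nothing is missing, which is the good case here,
    # so only a real grep failure (exit status 2 or higher) is passed on.
    grep -v -x -F -f "$2" "$1" || [ $? -eq 1 ]
}
```

One way to produce the lists, assuming the mount point is the last column of df output, is: df -g | awk 'NR > 1 { print $NF }' > /tmp/fs_before.txt before the reboot, the same command into /tmp/fs_after.txt afterwards, and then missing_mounts /tmp/fs_before.txt /tmp/fs_after.txt.
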
Additional points for OS and software upgrades:
  1. Ensure hd5 (bootlv) is 32 MB (contiguous PPs) [very important for migration].
  2. For OS updates, initiate the change on the console. If there is a network disconnection during the change, you can reconnect to the console and get the menus back.
  3. If the situation demands it, ensure there is enough free filesystem space (/usr, /var, /) for the change.
  4. Have the patches/filesets pre-downloaded and verified.
  5. Check and verify the repositories on the NIM/central server; check whether these repositories were tested or used earlier.
  6. If there are two disks in rootvg, perform an alt disk clone to one disk. This is the fastest and safest back-out method in case of failure. Even if you perform an alt disk clone, make sure you take a mksysb as well.
  7. For a migration change, check whether SAN storage (IBM/EMC..) is in use; if so, you have to follow the procedure of exporting the VGs and uninstalling the sdd* filesets, and after the migration reinstalling the sdd* filesets, reimporting the VGs, etc.
  8. Perform a preview (TL/SP upgrade) before the actual change; check whether the preview reports any errors (look for the keywords 'error' / 'fail') and read the tail/summary of the messages.
  9. Even if the preview reports 'ok' in the header, you still have to go through the messages and read the tail/summary of the preview.
  10. If the preview reports any missing dependency/requisite filesets, download those as well.
  11. Ensure you have enough free space in rootvg: a minimum of 1-2 GB should be free (TL upgrade/OS migration).
  12. Ensure the application team has tested their application on the new TL/SP/OS to which you are upgrading the system.
  13. If you have multiple PuTTY sessions open, name the sessions accordingly [PuTTY -> under Behaviour -> Window title]; this helps you get to the right session quickly. Alternatively, use PuttyCM (PuTTY Connection Manager).
  14. For TL upgrades, go TL by TL; jumping directly to the target TL can sometimes cause problems.
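
The keyword search from points 8 and 9 can be run over a saved copy of the preview output. This is a rough sketch that assumes the preview messages were captured to a file; scan_preview is a hypothetical helper, and it complements reading the summary rather than replacing it.

```shell
# List lines of a saved preview log that mention "error" or "fail" (any case).
# Returns 0 if none were found, 1 if some were, 2 if the log cannot be read.
scan_preview() {
    if [ ! -r "$1" ]; then
        echo "cannot read $1" >&2
        return 2
    fi
    if grep -n -i -E 'error|fail' "$1"; then
        return 1
    fi
    return 0
}
```

The line numbers printed by grep -n make it easy to jump to the surrounding context in the log. A clean result only means the two keywords did not appear, so the tail of the preview should still be read as described above.
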
What if you are crossing the change window?
  • Inform the relevant application teams and SDMs, and get an extension with proper approvals.
  • Raise an incident record in support of the issue.
What if the change fails?
  • Inform the relevant application teams and SDMs.
  • Close the record with the facts.
  • Attend the change review calls for the failed changes.
Successful change:
  • If possible, send the success status to the relevant parties with artifacts.
  • Update the change request with the relevant artifacts and close it.

Last but not least:

  • Don't hesitate to ask your teammates or vendor support for help when an issue is taking too long.
  • Inform your managers if the issue is heading for escalation (if it's a P1, you need to inform them beforehand).
  • Always perform changes with proper approvals in place.
  • Take backups and make your life easy.
Happy Unixing!
