When I started my career as an AIX admin, everything looked like Greek and Latin to me. I was scared and did not know where or how to start.
In this article, I will walk you through best practices that make life easier for an AIX admin.
They also apply to other flavors of Unix administration (such as Solaris, Linux and HP-UX); the only difference is that some commands vary from flavor to flavor.
Rule 1: Learn the Process
If you get this area right, life becomes much easier going forward.
I would like to emphasize the points below in particular.
- Follow the ITIL processes, which most companies have adopted
- Get to know the SLAs (Service Level Agreements)
- Always acknowledge tickets within the SLA and update them at regular intervals
- If it's a P1, log in to the bridge call and voice your findings promptly and wisely
- Always perform approved changes within the prescribed change window
| Process Name | Definition | Priorities | Tools |
|---|---|---|---|
| Incident/Ticket Management | An unplanned interruption to an IT Service or a reduction in the Quality of an IT Service. Failure of a Configuration Item that has not yet impacted Service is also an Incident. | P1, P2, P3 & P4 | Maximo, BMC Remedy, HP-Service Manager, Peregrine |
| Change Management | A process to control and coordinate all changes to an IT production environment. | Normal, Urgent, Emergency & Expedite | Maximo, BMC Remedy, HP-Service Manager, Peregrine |
| Service Request/Task Management | Monitoring and reporting of the agreed Key Performance Indicators (KPI) corresponding to the compliance with customer and management. | | Maximo, BMC Remedy, HP-Service Manager, Peregrine |
| Problem Management | Problem Management includes the activities required to diagnose the root cause of incidents identified through the Incident Management process, and to determine the resolution to those problems. | | Maximo, BMC Remedy, HP-Service Manager, Peregrine |
| Release Management | Process of managing software releases from development stage to software release. | | |
Rule 2: Get to Know Your Coverage Area & Contacts
- Make an inventory of the servers you support: if possible, build the sheet so that it includes environment, jump server, application, datacentre location and console information
- Get access to the servers
- Collect vendor contact information, including phone numbers
- Keep application team contact information handy as well
Rule 3: Day-to-day operations
Backups:
- Take a system backup (mksysb) regularly, at least once a week, and keep it on another server, preferably the NIM server
- Verify the mksysb with “lsmksysb -l -f /mksysbimg” (check the size)
- Check /etc/exclude.rootvg to see whether any important filesystem or directory is excluded from the mksysb backup
- Ensure non-rootvg file systems are backed up by your company's backup software (e.g. TSM or NetBackup)
- Take a system snap every week (add a cron entry) and keep the output on a different server or copy it to your desktop
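As a quick way to confirm that a fresh image really exists, something like the sketch below can help. It only checks file age with find; the directory layout and the "*.mksysb" naming pattern are my assumptions, not a standard, so adjust them to your site.

```shell
# Return 0 if <dir> holds at least one image modified within the last <days> days.
# The "*.mksysb" naming pattern is an assumption; adjust it to your site's convention.
has_recent_image() {
    dir=$1
    days=${2:-7}
    [ -d "$dir" ] || return 2
    # -mtime -N matches files modified less than N days ago.
    find "$dir" -type f -name '*.mksysb' -mtime "-$days" | grep -q .
}
```

For example, `has_recent_image /export/mksysb 7 || echo "no mksysb from the last week"` (the path is hypothetical) could be run from a daily health check.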
System Consistency Checks:
- Ensure the current OS level is consistent: “oslevel -s;oslevel -r;instfix -i|grep ML;instfix -i|grep SP;lppchk -v” [If the OS is inconsistent, first bring it to a consistent state and then proceed with the change].
- Proactively remediate compliance issues.
- Check your firm's policy on server uptime and arrange reboots accordingly; organizations commonly cap uptime at 90 or 180 days.
Troubleshooting issues:
- Don't fix an issue without a proper incident record.
- Engage the relevant parties while working on the issue
- Always try to get information about the issue from the user (requestor) with questions like "what, when, where"
- Look at errpt first
- Check ‘alog -t console -o’ to see if it's a boot issue
- Also look at the log files mentioned in "/etc/syslog.conf"; they may give more information for the investigation.
- Check backups if you are looking into configuration change issues
- If you are running out of time, involve your next-level team and managers
- Take help from vendors like IBM, EMC or Symantec if necessary
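To find the log files worth reading, the destinations can be pulled out of syslog.conf. This is a minimal sketch that assumes the usual "selector destination" layout, where the second field is an absolute path for file targets; the function name is my own.

```shell
# Print the unique log file paths configured in a syslog.conf-style file.
# Usage: syslog_targets [conf-file]   (defaults to /etc/syslog.conf)
syslog_targets() {
    conf=${1:-/etc/syslog.conf}
    [ -r "$conf" ] || { echo "cannot read $conf" >&2; return 1; }
    # Skip comment lines; keep the second field when it is an absolute path.
    awk '$1 !~ /^#/ && $2 ~ /^\// { print $2 }' "$conf" | sort -u
}
```

Running `syslog_targets` gives a short list of files to check alongside errpt and the console log.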
P1 issues:
If it's a priority 1 (P1) issue, consider a few additional points on top of the above.
- On sev1 issues, update the SDM (Service Delivery Manager) in the ST/Communicator multi chat at regular intervals.
- If you are verbally asked over the bridge call to perform a change, get the confirmation in writing in the ST multi chat.
- Update the incident record (IR) at regular intervals.
- Update your team on the issue status (via mail).
- Document any new learnings (from issues/changes) and share them with the team.
Working on a Change:
Rule of thumb: a change should move in sequence through DEV ==> UAT/QA ==> PROD environment servers.
- Make sure the change record is fully approved; otherwise don't start any of your tasks
- Ensure a properly validated CR procedure is in place: Precheck -> Installation -> Backout -> Post-Verification
- Suppress alerts if needed
- Remember that Application/Database teams are responsible for their own backup/restore and stop/start, so alert the application teams.
- Check the history of the servers (CRs or IRs) to see if there were any issues or change failures for these servers.
- EXPECT THE UNEXPECTED: ensure you have a proper back-out plan in place.
- Ensure you are on the right server ('uname -n'/'hostname') before you perform the change.
- Make sure your id as well as the root id are working and not expired.
- Ensure nobody else from your team is working on the same task, so that one change is not performed by multiple SAs. It's best to verify with the ‘who -u’ command whether any SAs are already working on the server.
- Remember: one change at a time; multiple changes can cause problems and complicate troubleshooting.
- Ensure there are no conflicting changes from other departments such as SAN, network, firewall or application that could interfere with your change.
- Record the commands you run and the console output in a notepad file (named after the change).
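The "right server" check is easy to skip when many sessions are open, so it can be scripted as a guard. A minimal sketch, assuming the expected hostname is copied from the change record; the function name is hypothetical.

```shell
# Stop unless the current host matches the one named in the change record.
# Usage: confirm_host <expected-hostname>
confirm_host() {
    expected=$1
    actual=$(uname -n)
    if [ "$actual" != "$expected" ]; then
        echo "STOP: on '$actual', change is for '$expected'" >&2
        return 1
    fi
    echo "OK: on $actual"
}
```

Used as `confirm_host myserver01 || exit 1` at the top of a change script (the hostname is only an example), it refuses to continue on the wrong box.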
If it's a configuration change:
- Take a backup of the pre and post values and document them
- Take screenshots if you are comfortable doing so
- If you are updating a configuration file, first take a copy: #cp -p <filename> <filename>_`date +"%m_%d_%Y_%H:%M"`
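The copy step can be wrapped in a small helper so the naming stays the same every time. This is a sketch of the idea; the function name is my own, and I have left the colon out of the time part to keep the file name simple.

```shell
# Copy <file> to <file>_MM_DD_YYYY_HHMM, preserving mode, owner and timestamps.
# Prints the name of the copy on success.
backup_file() {
    src=$1
    [ -f "$src" ] || { echo "no such file: $src" >&2; return 1; }
    dest="${src}_$(date +%m_%d_%Y_%H%M)"
    cp -p "$src" "$dest" && echo "$dest"
}
```

For example, `backup_file /etc/hosts` leaves a dated copy next to the original before you edit it.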
If it's a change to reboot or update s/w:
- Check whether the server is part of a cluster (HACMP/PowerHA); if so, you have to follow a different procedure.
- Always make sure three essentials are in place before you perform any change: “backup (mksysb); system information; console”
- Take system configuration information (sysinfo script).
- Check lv/filesystem consistency with “df -k” (df should not hang); all lvs should be in sync state: “lsvg -o|lsvg -il”.
- Check errpt & ‘alog -t console -o’ for errors.
- Ensure the latest mksysb (OS image backup) is kept on the relevant NIM server
- Ensure non-rootvg file system backups have been taken
- Verify boot list & boot device: “bootlist -m normal -o” “ipl_varyon -i”
- Log in to the HMC console
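Several of the checks above only print information, so their output can be saved in one go before the change and again afterwards. The sketch below is not the sysinfo script mentioned above, just a small stand-in of my own; the command list is an assumption to adapt, and AIX-only commands are skipped on systems that do not have them.

```shell
# Capture a pre- or post-change snapshot of basic system state.
# Usage: capture_state <output-dir>
capture_state() {
    outdir=$1
    mkdir -p "$outdir" || return 1
    # One command per line below; each gets its own output file.
    while IFS= read -r cmd; do
        [ -n "$cmd" ] || continue
        tool=${cmd%% *}
        name=$(printf '%s' "$cmd" | tr ' ' '_')
        if command -v "$tool" >/dev/null 2>&1; then
            sh -c "$cmd" >"$outdir/$name.out" 2>&1 </dev/null
        else
            echo "skipped: $tool not found" >"$outdir/$name.out"
        fi
    done <<'EOF'
uname -a
df -k
oslevel -s
lsvg -o
bootlist -m normal -o
errpt
EOF
    return 0
}
```

Running it into a "pre" directory before the change and a "post" directory after it lets you compare the two with `diff -r`.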
Additional points for reboot:
- Put the servers in maintenance mode (stop alerts) to avoid unnecessary incident alerts.
- Check the filesystem count with “df -g|wc -l”; verify the count after the migration or reboot.
- Ensure there are no scheduled reboots in crontab. If there are any, comment them out before you proceed with the change.
- If the system has not been rebooted for a long time (> 100 days), run ‘bosboot’ and reboot the machine (verify the fs/appfs after reboot) before you commence the migration/upgrade. [Don't reboot the machine if the bosboot fails!]
- Read the log messages carefully; don't ignore warnings.
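The filesystem count check can be made less error-prone by saving the number to a file before the reboot and comparing afterwards. A minimal sketch with hypothetical function names; it uses df -k so it also runs on other Unix flavors, and on AIX the line count matches that of df -g.

```shell
# Count the lines of df output (header included, so compare like with like).
fs_count() {
    df -k | wc -l | tr -d ' '
}

# Save the current count to <file> before the reboot.
save_fs_count() {
    fs_count >"$1"
}

# Compare the saved count in <file> with the current one after the reboot.
check_fs_count() {
    before=$(cat "$1") || return 2
    after=$(fs_count)
    if [ "$before" -eq "$after" ]; then
        echo "OK: $after lines in df output"
    else
        echo "MISMATCH: before=$before after=$after" >&2
        return 1
    fi
}
```

Keep the saved file somewhere that survives the reboot, for example your home directory rather than /tmp.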
Additional points for OS & S/W upgrades:
- Ensure hd5 (bootlv) is 32MB (contiguous PPs) [very important for migration]
- For OS updates, initiate the change from the console. If the network disconnects during the change, you can reconnect to the console and get the menus back.
- Where the change requires it, ensure there is enough free filesystem space (/usr, /var, /).
- Have the patches/filesets pre-downloaded and verified.
- Check/verify the repositories on the NIM/central server; check whether these repositories were tested/used earlier.
- If there are two disks in rootvg, perform an alt disk clone to one of them. This is the fastest and safest back-out method in case of failure. Even if you perform an alt disk clone, take a mksysb as well.
- For a migration change, check whether SAN storage (IBM/EMC..) is used; if so, follow the procedure of exporting the vgs and uninstalling the sdd* fileset, then after migration reinstalling the sdd* fileset, reimporting the vgs, etc.
- Run a preview (TL/SP upgrade) before the actual change; check whether the preview reports any errors (look for the keywords ‘error’ / ‘fail’) and read the tail/summary of the messages.
- Even if the preview reports ‘ok’ in the header, you still have to go through the messages and read the tail/summary of the preview.
- If the preview reports missing dependency/requisite filesets, download those as well.
- Ensure you have enough free space in rootvg: a minimum of 1-2 GB free (TL upgrade/OS migration).
- Ensure the application team has tested their application on the new TL/SP/OS to which you are upgrading.
- If you have multiple Putty sessions open, name the sessions accordingly [Putty -> under behaviour -> window title]; this helps you get to the right session quickly. Alternatively, use PuttyCM (Putty Connection Manager).
- For TL upgrades, go TL by TL; jumping directly to the target TL can sometimes cause problems.
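Two of the points above, scanning the preview output and checking free space, lend themselves to small helpers. This is only a sketch: it assumes you saved the preview output to a file yourself, the function names are hypothetical, and a keyword match is a prompt to read the log, not a replacement for reading the summary.

```shell
# Print lines mentioning errors or failures in a saved preview log.
# Returns 0 when no such lines are found, 1 when some are, 2 if unreadable.
scan_preview() {
    log=$1
    [ -r "$log" ] || { echo "cannot read $log" >&2; return 2; }
    if grep -i -n -E 'error|fail' "$log"; then
        return 1
    fi
    return 0
}

# Return 0 if the filesystem holding <path> has at least <min_kb> KB available.
has_free_kb() {
    # -P gives one line per filesystem; field 4 is the available space in KB.
    avail=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ -n "$avail" ] && [ "$avail" -ge "$2" ]
}
```

For example, `has_free_kb /usr 1048576 || echo "/usr has less than 1 GB free"` checks one of the filesystems named above; the threshold is yours to choose.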
What if you are crossing the change window?
- Inform the relevant application teams and SDMs and get an extension with proper approvals
- Raise an incident record in support of the issue.
What if the change fails?
- Inform the relevant application teams and SDMs
- Close the record with the facts
- Attend the change review calls for the failed changes
Successful Change:
- If possible, send the success status to the relevant parties with artifacts
- Update the change request with the relevant artifacts and close it
Last but not least:
- Don't hesitate to ask your teammates or vendor support for help when an issue is taking too long
- Inform your managers if the issue is in an escalation situation (if it's a P1, inform them beforehand).
- Always perform changes with proper approvals in place
- Take backups and make your life easy