Tuesday, 23 April 2013

Troubleshooting process hangs caused by JFS2 inode

Troubleshooting process hangs caused by JFS2 inode with an empty extended attribute entry

Summary:  In this article, we reveal a special kind of inode that cannot be detected by the fsck command, and explain why it can cause process hangs and drop system performance. We also introduce some approaches to tackle such an issue.

Introduction

It is not surprising that file system corruption often causes process (or system) hangs.But sometimes, a file system that looks clean might still cause process hangs.

In this article, we introduce methods to create an access control list (ACL)-enabled enhanced journaled file system (JFS2) inode that has an empty extended attribute (EA) entry, and investigate how it cause process hangs. This article also outlines approaches to tackle process hangs caused by such JFS2 inodes.

Setting extended permissions for AIXC ACL

The AIXC ACL includes base permissions and extended permissions. Base permissions are the traditional file-access modes assigned to the file owner, file group, and other users. Extended permissions modify the base file permissions (of the owner, group, or others) by permitting, denying, or specifying access modes for specific individuals, groups, or user and group combinations.

To enable ACL for the /testfs/foo file and add specific permissions to a user, we can use the acledit command:
# export EDITOR=/usr/bin/vi; acledit /testfs/foo 
Then, we can use the vi commands to change the extended permissions to enabled and add the specific permissions:
 *
 * ACL_type   AIXC
 *
 attributes:
 base permissions
     owner(root):  rw-
     group(system):  r--
     others:  r--
 extended permissions
     enabled
 permit   rwx     u:bin
 
 Should the modified ACL be applied? (yes) or (no) yes
We can run aclget to check the ACL of a file.

Extended attribute descriptor in inode

Many applications require the ability to associate variable length control information with a file system object. These objects are called extended attributes or EA. The EA descriptor in the inode, di_ea, describes if the inode has any EAs stored for it. If the inode does have EAs associated with it, the EA descriptor will also describe the inode extension associated with the inode. The EA descriptor is defined in j2_types.head_t:
 typedef struct {
         uint8   flag;      /* 1: flags */
         uint8   nEntry;    /* 1: */
         uint8   len;       /* 1: length in unit of fsblksize */
         uint8   addr1;     /* 1: address in unit of fsblksize */
         uint32  addr2;     /* 4: address in unit of fsblksize */
         uint16  type;      /* 2: ea type */
         int16   nblocks;   /* 2: nBlocks for outline pages */
         int32   rsrvd;     /* 4: */
 } ead_t; /* 16 */

Create an ACL-enabled JFS2 inode with an empty EA entry

When ACL is enabled, the di_ea.nEntry of inode normally will not be zero. For test purpose we need to intentionally set di_ea.nEntry to 0 for the ACL enabled inode. We have several methods to accomplish it.
Using fsdb to modify inode
First, we create a test file:
# echo "a test file" > /testfs/foo
# sync
Then, we get the inode number:
# ls -i /testfs/foo
3 /testfs/foo
The fsdb can be used to modify the on-disk inode after umount file system:
# umount /testfs
# fsdb /testfs

File System:                    /testfs
File System Size:               2620952 (512 byte blocks)
Aggregate Block Size:           4096
Allocation Group Size:          8192   (aggregate blocks)
> i 3
Inode 3 at block 33, offset 0x600:
……
[5] di_nlink:  1                     [22] di_ixpxd.addr2:    0x00000021
[6] di_mode:  0x000081a4                   di_ixpxd.address:  33
               0100644 -rw-r--r--    [24] di_uid:            0
……
[13] di_ea.flag:   0x00              [30] di_ea.len:          0
     EAv1                             [31] di_ea.addr1:        0x00
[15] di_ea.nEntry: 0x00              [32] di_ea.addr2:       0x00000000
[16] di_ea.type:   0x0000                  di_ea.address:     0
                                      [34] di_ea.nblocks:     0
change_inode: [m]odify, [e]a, [t]ree, or e[x]it > m 6 0x020081a4
change_inode: [m]odify, [e]a, [t]ree, or e[x]it > m 16 2
Inode 3 at block 33, offset 0x600:
……
Inode 3 at block 33, offset 0x600:
……
[5] di_nlink:  1                     [22] di_ixpxd.addr2:    0x00000021
[6] di_mode:  0x020081a4                   di_ixpxd.address:  33
               0100644 -rw-r--r--    [24] di_uid:            0
……
[13] di_ea.flag:   0x00              [30] di_ea.len:          0
     EAv1                             [31] di_ea.addr1:        0x00
[15] di_ea.nEntry: 0x00              [32] di_ea.addr2:       0x00000000
[16] di_ea.type:   0x0002                 di_ea.address:     0
                                      [34] di_ea.nblocks:     0
 
change_inode: [m]odify, [e]a, [t]ree, or e[x]it > x
> q
Now, we run the fscck command to check /testfs:
# fsck -yvv /testfs

The current volume is: /dev/fslv00
Primary superblock is valid.
        Superblock s_state = 0x0 mode = 0x3
*** Phase 1 - Initial inode scan
*** Phase 2 - Process remaining directories
*** Phase 3 - Process remaining files
*** Phase 4 - Check and repair inode allocation map
*** Phase 5 - Check and repair block allocation map
File system is clean.
The output shows that the fsck command cannot detect the ACL-enabled JFS2 inode that has an empty EA entry.

Using kdb to modify inode

In IBM® AIX® 6.1 and AIX 7.1, you can use the kdb command to modify the inode in memory while the file system is mounted.
(0)> i2 -i 3
ADDRESS           DEVICE           I_NUM/FS    COUNT   TYPE    FLAG
F1000A00588F6880 8000000A0000000C       3/16    00000   VREG    
……
On-disk Persistent Inode @ 0xF1000A00588F6A20:
uid..........0x00000000  gid..........0x00000000
mode......0x000081A4  mode.........-rw-r--r---  mode.........0100644
……
ea.flag......0x00000000  ea.address...0x0000000000000000  
ea.addr1.....0x00000000  ea.addr2.....0x00000000  
ea.nEntry....0x00000000  ea.len.........0x00000000  
ea.type.......0x00000000

(0)> mw 0xF1000A00588F6A20+0x3C
F1000A00588F6A5C:  000081A4  = 020081a4
F1000A00588F6A60:  00000000  = .

(0)> mw 0xF1000A00588F6A20+0x88
F1000A00588F6AA8:  00000000  = 00020000
F1000A00588F6AAC:  00000000  = .
Check inode again to verify the changes:
(0)> i2 F1000A00588F6880
……
On-disk Persistent Inode @ 0xF1000A00588F6A20:
uid..........0x00000000  gid..........0x00000000
mode......0x020081A4  mode.........-rw-r--r---  mode.........0100644
……
ea.flag......0x00000000   ea.address...0x0000000000000000  
ea.addr1.....0x00000000  ea.addr2.....0x00000000  
ea.nEntry....0x00000000  ea.len.........0x00000000  
ea.type.......0x00000002
typeNames....
AIXACL

Troubleshooting process hang

The process might hang when it opens the inode created using the above methods:
# more foo
# ps ax|grep more
 25100524  pts/3 A     0:25 more foo
The processor system time will increase drastically when more processes try to open that file, and the hang process could not be killed:
# kill -9 25100524
# ps ax|grep more
 25100524  pts/3 A     0:57 more foo
Signals are delivered only when a thread returns from the kernel mode. So, we need to examine what function the thread is currently running in the kernel mode and why it cannot return to user mode. Trace and kdb are the commands that are very useful commands in such situation.

Using kdb to examine the hang process

# kdb

(0)> th * |grep more
pvthread+006200   98*more     RUN   062027 06D    0         0  
pvthread+00EB00  235!more     RUN   0EB039 078    2         0  
pvthread+015800  344!more     RUN   158047 06C    7         0  

(0)> f 344
pvthread+015800 STACK:
Use current context [F00000002FF47600] of cpu 7
[00009518].simple_lock+000018 ()
[00274808]iPut+000054 (??, ??)
[20646678]20646678 ()
[DEADBEEB]DEADBEEB ()
[0044FE64]openpnp+000788 (??, ??, ??, ??, ??)
[0044FF9C]openpath+0000BC (??, ??, ??, ??, ??, ??, ??)
[0045027C]copen+000218 (??, ??, ??, ??, ??)
[0044F678]kopen+00001C (??, ??, ??)

(0)> f 344
pvthread+015800 STACK:
Use current context [F00000002FF47600] of cpu 7
[0000B598].fetch_and_addlp+000018 ()
[003DAEEC]lookuppn+0000D8 (??, ??, ??, ??, ??, ??, ??, ??)
[0044FE64]openpnp+000788 (??, ??, ??, ??, ??)
[0044FF9C]openpath+0000BC (??, ??, ??, ??, ??, ??, ??)
[0045027C]copen+000218 (??, ??, ??, ??, ??)
[0044F678]kopen+00001C (??, ??, ??)
The kdb command shows that the hang thread's kernel stack is different in different time, but keeps calling the openpath() andopenpnp() functions.

It means the hang thread is not sleeping for some resource, but is getting in to an infinite looping between the openpath() andopenpnp() kernel functions. This also explains why the hang process is consuming so much processor system time.

Trace log and report

Trace can be used to get more details about the hang process.
# trace -anl -C all -T20M -L40M -o trace.raw; sleep 5; trcstop
# ps ax|grep more
 286908  pts/5 A    13:07 more foo
 323740  pts/8 A    12:26 more foo

# trcrpt -C all -O 'exec=on,pid=on,tid=on,cpu=on' -o trace.log -p 323740 trace.raw
The trace.log repeated with following block:
107 more  3  323740 1401077  4.450242983 0  lookuppn: foo
4DF more  3  323740 1401077  4.450244140  1 JFS2 iget: vp = F100010038159FE8, 
count = 0017, ino = 0002, dev = 000000A0000000B 

4DB more  3  323740 1401077  4.450244521  0  vnop_hold(vp = F100010038159FE8,
getcaller = 3DB1B8) = 0000

4DF more  3  323740 1401077  4.450246527   2  JFS2 iget: vp = F100010038169FE8,
count = 0006, ino = 0003, dev = 000000A0000000B

107 more  3  323740 1401077  4.450247078    0  
vnop_lookup(dvp = F100010038159FE8, flag = 000A) = 0000, *vpp = 100010038169FE8

107 more  3  323740 1401077  4.450247558    0  
lookuppn exit: 'foo' = vnode F100010038169FE8

15B more  3  323740 1401077  4.450248828    1 
vnop_open(vp = F100010038169FE8, flags = 4000001, ext = 0000) = 0002
It is obvious that the thread first called the lookuppn() function and found the foo file exit, but vnop_open() returned the 02 (ENOENT) error b when it tried to open foo. Then, it repeated calling thelookuppn()andvnop_open() functions constantly, thus resulting in process hang.

How to tackle the problem

We usually have to restart the system when hang process cannot be killed. But reboot is not preferred for important production systems. When we know the root cause of the hang, sometimes, we can fix it without a restart. In this specific example, we already know that the hang is caused by empty EA entry. So, we can exit the infinite loop by disabling ACL.

Disable ACL using kdb

In AIX 6.1 and 7.1, the inode in memory can be modified using kdb to disable ACL.
(0)> i2 F1000A0034496880
   
On-disk Persistent Inode @ 0xF1000A0034496A20:
uid..........0x00000000   gid..........0x00000000
mode...... 0x020081A4  mode.........-rw-r--r---  mode.........0100644
……
ea.flag....... 0x00000000   ea.address...0x0000000000000000  
ea.addr1.....0x00000000   ea.addr2.....0x00000000  
ea.nEntry....0x00000000   ea.len.........0x00000000  
ea.type........0x00000002
typeNames....
AIXACL  

(0)> mw 0xF1000A0034496A20+0x3c
F1000A0034496A5C:  020081A4  = 000081A4
F1000A0034496A60:  00000000  = .

(0)> mw 0xF1000A0034496A20+0x88
F1000A0034496AA8:  00020000  = 0
F1000A0034496AAC:  00000000  = .

(0)> q
Now all hang processes will continue to run or would be killed if kill signal was pending before.

Writing a program to disable ACL

In AIX 5.3, we write a fix program because kdb cannot modify the memory. (See the Download section.)
The fix program also works for AIX 6.1 and AIX 7.1.

Firstly, we get the memory address of on-disk persistent inode using kdb. Then we read, modify and write back ea.typeand di_mode through the /dev/kmem interface:
 /* on-disk inode */
 dinode_t dip;
 
 /* on-disk inode address got from kdb */
 unsigned long long ip=0xF1000A005D84F620;
 
 open("/dev/kmem", O_RDWR, 0);
 kread(ip, (char *)&dip, sizeof(dinode_t));
 
 /* clear S_IXACL */
 if(dip.di_mode == 0x20081a4)
      dip.di_mode = 0x81a4;  
 
 /* clear ea.type  */  
 if(dip.di_ea.type == 0x02)
      dip.di_ea.type = 0;  
 
 kwrite(ip, (char *)&dip, sizeof(dinode_t));
After running the fix program, all hang process will continue to run or would be killed if the kill signal was pending before.

Conclusion

Processes will hang when they open an ACL-enabled file with an empty EA entry. IBM recently has provided an authorized program analyst report (APAR) IV34969 that can avoid this problem. If your AIX system unfortunately encounters this problem before applying for IV34969 and you do not want to restart your system, you can apply the approaches introduced in this article as a workaround.

Download

DescriptionNameSizeDownload method
Pogram to disable ACLfix.c3 KBHTTP

0 blogger-disqus:

Post a Comment