read() on closed pipe doesn't return EOF

From: Urie, Todd (TUrie@trueposition.com)
Date: Mon Sep 16 2002 - 14:56:06 EDT


I have been trying to figure out why Apache hangs when starting from the
/etc/init.d/apache script. I used 'truss' to figure out where it was
hanging and discovered that apparently the problem is an attempt to read
from a closed 'pipe.' I executed 'truss -f -o out_file ./apache start' from
/etc/init.d. Here is what I got:

1062: execve("/sbin/sh", 0xFFBEFC94, 0xFFBEFCA4) argc = 3
1062: getuid() = 0 [0]
1062: getuid() = 0 [0]
1062: getgid() = 1 [1]
1062: getgid() = 1 [1]
1062: getpid() = 1062 [1061]
1062: getpgid(1062) = 1061
1062: getsid(1062) = 421
1062: brk(0x000698F8) = 0
1062: sysconfig(_CONFIG_SIGRT_MIN) = 38
1062: sysconfig(_CONFIG_SIGRT_MAX) = 45
1062: sigaltstack(0xFFBEFB84, 0x00000000) = 0
1062: sigaction(SIGHUP, 0x00000000, 0xFFBEFB00) = 0
1062: sigaction(SIGHUP, 0xFFBEFA60, 0xFFBEFAE0) = 0
1062: sigaction(SIGINT, 0x00000000, 0xFFBEFB00) = 0
1062: sigaction(SIGINT, 0xFFBEFA60, 0xFFBEFAE0) = 0
1062: sigaction(SIGQUIT, 0x00000000, 0xFFBEFB00) = 0
1062: sigaction(SIGQUIT, 0xFFBEFA60, 0xFFBEFAE0) = 0
1062: sigaction(SIGILL, 0x00000000, 0xFFBEFB00) = 0

[ CUT LOTS OF STUFF ]

1062: getgid() = 1 [1]
1062: open64("./apache", O_RDONLY) = 3
1062: close(19) Err#9 EBADF
1062: fcntl(3, F_DUPFD, 0x00000013) = 19
1062: close(3) = 0
1062: fcntl(19, F_SETFD, 0x00000001) = 0
1062: ioctl(2, TCGETA, 0xFFBEFAAC) = 0
1062: ioctl(19, TCGETA, 0xFFBEFAAC) Err#25 ENOTTY
1062: read(19, " # ! / s b i n / s h\n #".., 128) = 128
1062: read(19, " A C H E _ H O M E = / u".., 128) = 128

PID 1062 is reading the './apache' file and has dupped the file descriptor
to '19'.

1062: write(1, " h t t p d s t a r t i".., 16) = 16
1062: read(19, " > & 1 `\n\n\n i f [ ".., 128) = 77
1062: pipe() = 3 [4]
1062: fork() = 1065
1065: fork() (returning as child ...) = 1062
1065: getpid() = 1065 [1062]
1062: close(4) = 0
1065: close(3) = 0
1065: fcntl(1, F_GETFD, 0x00000000) = 0
1065: close(1) = 0
1065: fcntl(4, F_DUPFD, 0x00000001) = 1
1065: close(4) = 0
1065: dup(1) = 3
1065: fcntl(2, F_GETFD, 0x00000000) = 0
1065: close(2) = 0
1065: fcntl(3, F_DUPFD, 0x00000002) = 2
1065: close(3) = 0
1065: brk(0x0006A388) = 0
1065: execve("/usr/apache/bin/apachectl", 0x00069C1C, 0x00069C2C) argc =
3
1065: mmap(0x00000000, 8192, PROT_READ|PROT_WRITE|PROT_EXEC,
MAP_PRIVATE|MAP_ANON, -1, 0) = 0xFF3A0000
1065: resolvepath("/usr/lib/ld.so.1", "/usr/lib/ld.so.1", 1023) = 16
1065: open("/var/ld/ld.config", O_RDONLY) Err#2 ENOENT

After reading 'status=`${APACHE_HOME}/bin/apachectl $1 2>&1` from the
./apache file, dups the file descriptor that is used to read the ./apache
file. It now has fd's 3 & 4 that refer to the ./apache file. Then the
parent (1062) forks 1065 to execute the command found in the backticks
(apachectl). The parent closes fd 4 and the new child closes fd 3. The
child (1065) redirects the fd to STDOUT. The net result is that now PID
1065's STDOUT will be read on fd 3 by PID 1062 (the parent).

So far, it all makes sense. The new child now goes of and executes
{$APACHE_HOME}/bin/apachectl and PID 1062 awaits the results.

[ CUT LOTS OF STUFF ]

1065: waitid(P_PID, 1067, 0xFFBEF560, WEXITED|WTRAPPED|WNOWAIT) = 0
1065: ioctl(0, TIOCGPGRP, 0xFFBEF51C) = 0
1065: ioctl(0, TCGETS, 0x000391B8) = 0
1065: waitid(P_PID, 1067, 0xFFBEF560, WEXITED|WTRAPPED) = 0
1075: munmap(0xFED80000, 4932) = 0
1065: write(1, " / u s r / a p a c h e /".., 47) = 47
1075: munmap(0xFED70000, 4197) = 0
1062: read(3, " / u s r / a p a c h e /".., 128) = 47
1065: unlink("/tmp/sh10650") = 0
1075: munmap(0xFEDA0000, 3112) = 0
1065: brk(0x0003CB98) = 0
1075: munmap(0xFED90000, 2607) = 0
1065: brk(0x0003C998) = 0
1075: munmap(0xFEDD0000, 5396) = 0
1075: munmap(0xFEDC0000, 4415) = 0
1065: llseek(0, 0, SEEK_CUR) = 34988
1065: _exit(0)

PID 1065 is now done, writes it result to the pipe, which the parent 1062
reads. PID 1065 then exits.

Here is where the problem comes in:

1062: read(3, 0xFFBEF7E0, 128) (sleeping...)

The parent now attempts to read from the pipe that was connected to PID
1065, but 1065 is dead. 1062 never gets an EOF, it just remains in the read
indefinitely. PID 1062 never gets interrupted by the SIGCHLD from 1065
because 1062 has not changed the default behavious of SIGCHLD from 'ignore.'
Therefore, it seems that the only way that 1062 will know that 1065 is dead
is when the read returns EOF. However, this doesn't appear to be happening.

In addition to the above, the following information should be helpful:

1. This apache install is the standard install. It was installed along with
the rest of the OS, which is Solaris 8.
2. I don't see this problem on any other systems that I have. The
'/etc/init.d/apache' startup script is the same one that runs on other
systems.

Given that this install is (theoretically anyway) the same as is on other
machines, why doesn't the problem show up elsewhere? Also, all evidence
except that shown by 'truss' points to a system configuration or other
system specific problem. But the output from 'truss' doesn't appear to
support that. What am I missing? Am I wrong in assuming that the attempted
read from the pipe will result in EOF? The 1065 PID still exists but is
listed as 'defunct', which is exactly what one would expect. Is this part
of the problem? If so, why doesn't it appear on other machines? Am I
making this excessively difficult because there is something simple that I
don't know about?

I sent this to the 'sunmanagers' list for 2 reasons:
1. There are some exceedingly knowledgable people on this list;
2. Based upon the output from 'truss', I'm not sure that it's an Apache
problem.

Thanks for any help that you can give,

Todd Urie
_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers



This archive was generated by hypermail 2.1.7 : Wed Apr 09 2008 - 23:24:56 EDT