Obscure bad file descriptor error on FreeBSD
mal content
artifact.one at googlemail.com
Thu Jun 7 10:06:07 CEST 2007
Hello.
I've been trying to track down an incredibly obsure problem that's been
causing my client to use 100% of the CPU time when apparently doing
nothing. I tracked the problem down to an invalid file descriptor being
passed to poll() in a loop, resulting in a bad file descriptor error and
in effect creating a busy loop. My OS is FreeBSD 6.2, silc toolkit 1.1.
As can be seen in the output below, my program is reponsible for file
descriptor 0, but the silc client library seems to be under the
impression that file descriptor 0 belongs to it:
$ ./s706 TEST_CONF1 TEST_CONTROL1
request: enter private key password: ********
confdir_chdir opened 5
confdir_restoredir closed 5
confdir_chdir opened 0
confdir_restoredir closed 0
confdir_chdir opened 0
confdir_restoredir closed 0
confdir_chdir opened 0
confdir_restoredir closed 0
info: client running
info: pid 38065
info: user mc
info: connect 192.168.3.33 10706
confdir_chdir opened 9
confdir_restoredir closed 9
info: found cached server key
info: verified server key 9D48B5CA79D3CFE4A99974E97BD1124C069D9650
Assertion failed: (fcntl(fd, 3, 0) != -1), function silc_poll, file
silcunixschedule.c, line 147.
Abort trap (core dumped)
I've gone over this repeatedly and I'm pretty much certain the error is
not my fault (the confdir_* functions above are the only functions in my
client that call open() and close() and as can be seen, the
confdir_chdir function receives file descriptor 0 as the result of an
open() call and is therefore the owner of this descriptor).
I believe the silc client library is incorrectly caching a file
descriptor that it no longer owns, and as a result is passing this
descriptor to poll() and causing repeated 'bad file descriptor' errors.
This may alternatively be down to the way that the FreeBSD thread
library behaves - the fork() call looks suspect to me - that can
obviously mean that two different entities have references to the same
table of file descriptors.
I added an assertion in silcunixschedule.c, just before the call to
poll() to check for bad file descriptors. The assertion is essentially:
assert(fcntl(fd, F_GETFL, 0) == -1);
...which is just a call that can check if a file descriptor is valid or
not.
Here is a system call trace:
Note that the silc library reads from file descriptor 0 here, early in
the execution of the program:
37529 s706 RET write 752/0x2f0
37529 s706 CALL fcntl(0,0x3,0)
37529 s706 RET fcntl 2
37529 s706 CALL fcntl(0x4,0x3,0)
37529 s706 RET fcntl 6
37529 s706 CALL fcntl(0x7,0x3,0)
37529 s706 RET fcntl 6
37529 s706 CALL poll(0x806a000,0x3,0x7464)
37529 s706 RET poll 1
37529 s706 CALL read(0,0xbfbfe92f,0x1)
37529 s706 GIO fd 0 read 1 byte
"!"
37529 s706 RET read 1
37529 s706 CALL gettimeofday(0xbfbfe960,0)
37529 s706 RET gettimeofday 0
37529 s706 CALL fcntl(0,0x3,0)
37529 s706 RET fcntl 2
37529 s706 CALL fcntl(0x4,0x3,0)
37529 s706 RET fcntl 6
37529 s706 CALL fcntl(0x7,0x3,0)
37529 s706 RET fcntl 6
Then in the next system call:
37529 s706 CALL poll(0x806a000,0x3,0x73b6)
37529 s706 RET fork 0
37529 s706 CALL kse_release(0x805bf40)
37529 s706 RET kse_release 0
37529 s706 CALL read(0x7,0x8083000,0x10400)
37529 s706 GIO fd 7 read 784 bytes
The process fork()ed - this is most likely due to the FreeBSD thread
library, as hinted at by the presence of kse_release() straight after.
A few instructions later:
37529 s706 RET open 9
37529 s706 CALL fchdir(0x3)
37529 s706 RET fchdir 0
37529 s706 CALL open(0x804e5c6,0x104,0)
37529 s706 NAMI "server_key"
37529 s706 RET open -1 errno 2 No such file or directory
37529 s706 CALL fchdir(0x9)
37529 s706 RET fchdir 0
37529 s706 CALL close(0)
37529 s706 RET close 0
37529 s706 CALL write(0x2,0x80536c0,0x42)
37529 s706 GIO fd 2 wrote 66 bytes
"warn: unknown server key 9D48B5CA79D3CFE4A99974E97BD1124C069D9650
"
37529 s706 RET write 66/0x42
37529 s706 CALL open(0x804eb5d,0x4,0)
37529 s706 NAMI "."
37529 s706 RET open 0
37529 s706 CALL fchdir(0x3)
37529 s706 RET fchdir 0
37529 s706 CALL open(0x804e5eb,0x104,0)
37529 s706 NAMI "server_trust_initially"
37529 s706 RET open 10/0xa
37529 s706 CALL fstat(0xa,0xbfbfe5b0)
Note that my process called open() and got file descriptor 0 returned
to it. A bit further on:
37529 s706 RET read 1
37529 s706 CALL write(0x7,0x8065800,0x30)
37529 s706 GIO fd 7 wrote 48 bytes
0x0000 0026 0002 0a00 1008 0200 0000 0000 0000 |.&..............|
0x0010 0000 0000 0000 0000 0001 c0a8 0321 d229 |.............!.)|
0x0020 c4eb 2e70 c30e e2a6 4f1f da46 0000 0000 |...p....O..F....|
37529 s706 RET write 48/0x30
37529 s706 CALL fcntl(0,0x3,0)
^^^^^^^^^^^^^^
This is where I added an assertion in
silcunixschedule.c
37529 s706 RET fcntl -1 errno 9 Bad file descriptor
37529 s706 CALL write(0x2,0xbfbfe2c0,0x62)
37529 s706 GIO fd 2 wrote 98 bytes
"Assertion failed: (fcntl(fd, 3, 0) != -1), function silc_poll, file si\
lcunixschedule.c, line 147.
"
37529 s706 RET write 98/0x62
37529 s706 CALL kse_thr_interrupt(0,0x4,0x6)
37529 s706 NAMI "s706.core"
I believe that the silc client library thread still believes file
descriptor 0 to be the socket connected to the server when in fact
isn't. This could have happened because when the process called fork(),
my client modified the file descriptor table by closing a file and
obviously the silc client library couldn't be aware of this.
Unfortunately, I've still not been able to track down why the file
descriptor becomes invalid. I can connect to the server, run commands,
do all of the usual essentials but when the client is doing nothing,
it's endlessly looping on a failed poll() call...
Any assistance or ideas would be much appreciated.
MC
More information about the silc-devel
mailing list