SUMMARY: SIGFPE and siginfo_t

From: Eiler, James A. (James.Eiler@alcoa.com)
Date: Mon Jun 30 2003 - 17:29:41 EDT


Wow! What a list! The answer was waiting for me when I came in to
work this morning. (Original message at the end.)

Thanks to:

James Sainsbury
Joerg Bruehe

Joerg suggested:

"You could have a signal handler that does a "fork()", and then in the
child process send the same signal to yourself. Using the old "signal()"
semantics, this should result in the default action and produce a core.
You then need to backtrace beyond the handler function and arrive at
the point where the signal was originally generated."

While the concept sounds like it would work, our process sometimes
runs on a 20 ms timer and doing a fork() can be expensive.

James' solution is a better fit for our application. He suggests
using the third parameter to the signal handler, which points to a
ucontext_t structure holding the saved machine context (see
sys/context_t.h and machine/context.h for more info).

The code for my test program now looks as follows:

> cat doit.c
#include <unistd.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
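void catchit(int signo, siginfo_t *info, void *extra);

int main(void)
{
   int ii;
   float fNum;
   struct sigaction action;

   /* No extra signals are blocked while the handler runs. */
   sigemptyset(&action.sa_mask);

   /* SA_SIGINFO selects the three-argument handler form. */
   action.sa_flags = SA_SIGINFO;
   action.sa_sigaction = catchit;

   if (sigaction(SIGFPE, &action, NULL) == -1) {
       perror("sigusr: sigaction");
       _exit(1);
   }

   /* The last pass divides by zero and raises SIGFPE. */
   for( ii = 3; ii > -1; ii--) {
      fNum = 42.0 / (float) ii;
   }
   return 0;
}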

void catchit(int signo, siginfo_t *info, void *extra)
{
   ucontext_t* uc;

   uc = extra;
   printf("Signal %d received. Generated by instruction at addr 0x%lx\n",
           signo, uc->uc_mcontext.__sc_pc);

   fflush( stdout );
    _exit(0);
}

Compile and run it as follows:

> cc -D_XOPEN_SOURCE_EXTENDED doit.c -g2 -o doit
> doit
Signal 8 received. Generated by instruction at addr 0x120001344
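
A side note on the handler above: printf() is not on the POSIX list of async-signal-safe functions, so calling it from a handler is risky in a long-running program. A minimal sketch of reporting the address with write() instead is shown below. The helper names format_hex and report_fault_pc are hypothetical and not part of any library.

```c
#include <stddef.h>
#include <unistd.h>

/*
 * Format value as "0x..." in lowercase hex, without using stdio.
 * buf must hold at least 2 + 2 * sizeof value + 1 bytes.
 * Returns the length of the string, not counting the terminating NUL.
 */
size_t format_hex(unsigned long value, char *buf)
{
    static const char digits[] = "0123456789abcdef";
    char tmp[2 * sizeof value];
    size_t n = 0, len = 0;

    /* Collect digits from least to most significant. */
    do {
        tmp[n++] = digits[value & 0xf];
        value >>= 4;
    } while (value != 0);

    buf[len++] = '0';
    buf[len++] = 'x';
    while (n > 0)
        buf[len++] = tmp[--n];
    buf[len] = '\0';
    return len;
}

/* Write "SIGFPE at pc 0x...\n" to stderr using only write(). */
void report_fault_pc(unsigned long pc)
{
    static const char prefix[] = "SIGFPE at pc ";
    char buf[2 + 2 * sizeof pc + 2];
    size_t len = format_hex(pc, buf);

    buf[len++] = '\n';   /* replaces the NUL; buf has room for it */
    (void) write(STDERR_FILENO, prefix, sizeof prefix - 1);
    (void) write(STDERR_FILENO, buf, len);
}
```

In the test program this would be called from catchit as report_fault_pc((unsigned long) uc->uc_mcontext.__sc_pc), in place of the printf and fflush calls.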

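I can then use the nm command to find out which function generated
the floating point exception. The address 0x120001344 lies between
the start of "main" (0x120001280) and the start of "catchit"
(0x120001374), so in this simple test case it occurred (obviously)
in "main":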

> nm -v -t x doit | grep "T"

Name                            | Value            | Type | Size
_ftext                          | 0x000001200010d0 | T    | 0x00000000000008
.text                           | 0x000001200010d0 | T    | 0x00000000000000
__start                         | 0x00000120001190 | T    | 0x00000000000008
_mcount                         | 0x00000120001260 | T    | 0x00000000000008
__eprol                         | 0x00000120001270 | T    | 0x00000000000008
eprol                           | 0x00000120001270 | T    | 0x00000000000008
main                            | 0x00000120001280 | T    | 0x00000000000008
catchit                         | 0x00000120001374 | T    | 0x00000000000008
__INIT_00_add_pc_range_table    | 0x00000120001410 | T    | 0x00000000000008
__FINI_00_remove_pc_range_table | 0x00000120001460 | T    | 0x00000000000008
__INIT_00_add_gp_range          | 0x000001200014a0 | T    | 0x00000000000008
__FINI_00_remove_gp_range       | 0x00000120001580 | T    | 0x00000000000008
_etext                          | 0x000001200016e0 | T    | 0x00000000000008
etext                           | 0x000001200016e0 | T    | 0x00000000000008
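
The lookup done by eye against the nm listing can also be written as a small routine. The sketch below is only an illustration: the table is a hand-copied subset of the listing above, and the names struct sym, text_syms and find_symbol are hypothetical. It returns the last symbol whose start address is not above the given address.

```c
#include <stddef.h>
#include <stdint.h>

struct sym {
    const char *name;
    uint64_t    addr;   /* start address of the symbol */
};

/* Subset of the nm listing, in ascending address order. */
static const struct sym text_syms[] = {
    { "__start", 0x120001190ULL },
    { "_mcount", 0x120001260ULL },
    { "__eprol", 0x120001270ULL },
    { "main",    0x120001280ULL },
    { "catchit", 0x120001374ULL },
    { "__INIT_00_add_pc_range_table", 0x120001410ULL },
};

/*
 * Return the name of the symbol that contains pc, i.e. the last entry
 * whose start address is <= pc, or NULL if pc lies below the first entry.
 */
const char *find_symbol(uint64_t pc)
{
    const char *best = NULL;
    size_t i;

    for (i = 0; i < sizeof text_syms / sizeof text_syms[0]; i++) {
        if (text_syms[i].addr > pc)
            break;
        best = text_syms[i].name;
    }
    return best;
}
```

Called with the address printed by the test program, find_symbol(0x120001344ULL) selects the "main" entry, which matches the manual reading of the listing.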

It would be nice to get the exact line number (similar to what
dbx and ladebug generate)...but that's probably asking too much!

Thanks again to James and Joerg!

Jim

-----Original Message-----
From: Eiler, James A. [mailto:James.Eiler@alcoa.com]
Sent: Friday, June 27, 2003 11:51 PM
To: tru64-unix-managers@ornl.gov (E-mail)
Subject: SIGFPE and siginfo_t

Hi All,

I apologize, but this is a bit long....

I'm running Tru64 UNIX, several versions (4.0G, 5.1A, 5.1B), with various
Patch Kits.

I've got a C program that runs on all of these. This program is
timer-driven and reads analog input data and processes it. This program is
fairly large and was written by numerous folks over a period of years.

Occasionally, one or more of the analog input signals is zero. And when the
program does a divide by zero, a SIGFPE is generated.

We need this program to keep running, so I've put in a signal handler to
process the SIGFPE signal. I know I'm doing a bad thing in the SIGFPE
signal handler in that I'm doing an fprintf to stderr indicating that
the program has done a divide by zero - but at least we know when the
SIGFPE has been generated. But, we'd like to know which line of code is
generating the error.

If I took out the signal handler, I could very easily determine the exact
line of code from the core file. But, like I said, we need to keep this
program running....

I'm trying to use the siginfo_t structure that's described in Section 5.4,
"Realtime Signal Handling", of the Guide to Realtime Programming.
As I read it, when a SIGFPE is generated, siginfo_t should contain the
address of the bad instruction in member si_addr.

I've modified the example code from the section of the manual:

> cat doit.c
#include <unistd.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>

main()
{
   pid_t pid;
   sigset_t newmask;
   int ii, jj;
   float fNum;

   struct sigaction action;
   void catchit();

   sigemptyset(&newmask);
   sigaddset(&newmask, SIGFPE);
 
   action.sa_flags = SA_SIGINFO;
   action.sa_sigaction = catchit;
 
   if (sigaction(SIGFPE, &action, NULL) == -1) {
       perror("sigusr: sigaction");
       _exit(1);
   }

   for( ii = 3; ii > -1; ii--) {
      fNum = 42.0 / (float) ii;
   }
}
void catchit(int signo, siginfo_t *info, void *extra)
{
       int int_val = info->si_value.sival_int;
       printf("Signal %d, value %d received\n", signo, int_val);
       printf("si_addr = %d\n", info->si_addr );
       fflush( stdout );
       _exit(0);
}

I compile it, link it, and run it:

> cc -g2 doit.c -lrt -o doit
> ./doit
Signal 8, value 0 received
si_addr = 0

I would have thought si_addr should be something other than 0.

If I step through the signal handler, I can see part of the
address (see the ^^^^^ below):

(ladebug) where
>0 0x120001458 in catchit(signo=0x8, info=0x11fffbc98, extra=0x11fffbcf8)
   "doit.c":34
#1 0x3ff800d5af0 in __sigtramp(...) in /usr/shlib/libc.so
#2 0x120001404 in main() "doit.c":29
   ^^^^^^^^^^^
#3 0x1200012e8 in __start(...) in doit

(ladebug) p *info
struct siginfo {
  si_signo = 0x8;
  si_errno = 0x0;
  si_code = 0x3;
  _sifields = union {
    _sipad = [0] = 0x0,[1] = 0x0,[2] = 0x4,[3] = 0x0,[4] = 0x10,[5] = 0x0,
             [6] = 0x1,[7] = 0x0,[8] = 0xc0004900,[9] = 0x3ff,[10] = 0x0,
             [11] = 0x0,[12] = 0x2,[13] = 0x0,[14] = 0xe8ec3b2,[15] = 0x0,
             [16] = 0x1000,[17] = 0x0,[18] = 0x801fc548,[19] = 0x3ff,
             [20] = 0x0,[21] = 0x0,[22] = 0x0,[23] = 0x0,
             [24] = 0x20001404,[25] = 0x1,[26] = 0x8,[27] = 0x0;
                    ^^^^^^^^^^
More (n if no)?

Any help on this will be much appreciated!

Thanks,

Jim


