
Example 1

Nakagome Tomoyuki edited this page Sep 13, 2016 · 6 revisions

Memory Allocation Failure

This is a stripped-down version of the code that repeatedly generated core files in a lab system at a customer site.

int main(int argc, char **argv) {
    try {
        task();
    } catch (...) {
        log("uncaught exception. aborting.");
        abort();
    }
    return 0;
}

When the program threw an exception deep in the call chain and nothing else handled it, the exception propagated up to the catch (...) handler in main(). The handler logged a message about the uncaught exception and then aborted, generating a core file.

The support team raised a ticket and sent the core file to the dev team for analysis. However, gdb did not show the origin of the exception in the backtrace. Instead, it showed something like this:

(gdb) bt
#0  0xf77a8430 in __kernel_vsyscall ()
#1  0xf749c667 in raise () from /lib/libc.so.6
#2  0xf749dea3 in abort () from /lib/libc.so.6
#3  0x08048e98 in main (argc=1, argv=0xffc1d684) at Main.cpp:91

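This call stack is useless for finding the root cause: by the time the handler in main() called abort(), the stack had already been unwound, so the frames where the exception originated were gone.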
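Independently of libexray, a catch-all handler can at least recover the type and message of the exception by rethrowing it inside the handler. The following is a minimal sketch under that assumption. The names describe_current_exception and run_and_describe are hypothetical and not part of the original program. Note that this still does not reveal where the exception was thrown.

```cpp
#include <exception>
#include <new>
#include <stdexcept>
#include <string>

// Hypothetical helper (not from the original code). It must only be called
// from inside a catch block: it rethrows the exception currently being
// handled and classifies it by type. Calling it with no active exception
// would end in std::terminate.
std::string describe_current_exception() {
    try {
        throw;  // rethrow the active exception
    } catch (const std::bad_alloc& e) {
        return std::string("std::bad_alloc: ") + e.what();
    } catch (const std::exception& e) {
        return std::string("std::exception: ") + e.what();
    } catch (...) {
        return "unknown exception type";
    }
}

// Hypothetical stand-in for the try/catch in main(): runs the given task
// and returns a description of whatever escaped from it.
template <typename Task>
std::string run_and_describe(Task task) {
    try {
        task();
        return "no exception";
    } catch (...) {
        return describe_current_exception();
    }
}
```

In a handler like the one in main(), the returned string could be passed to log() before abort() is called.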

libexray to the rescue

By preloading libexray, we could identify the source of the exception, which cannot be seen in the core file.

31141[31141] ------------------------------------------------------------
31141[31141] Origin of Dump: __cxa_throw
31141[31141] Exception Time: 14:03:42.596848
31141[31141] Exception Type: std::bad_alloc
31141[31141] Stack Frames
31141[31141] #1: ../../libexray.so(__cxa_throw+0x66) [0xf77a145a]
31141[31141] #2: /lib/libstdc++.so.6(_Znwj+0x86) [0xf76d3b86]
...
31141[31141] #43: ./Server() [0x8048d62]
31141[31141] #44: ./Server() [0x8048da2]
31141[31141] #45: ./Server() [0x8048e84]
31141[31141] #46: /lib/libc.so.6(__libc_start_main+0xf3) [0xf7487943]
31141[31141] #47: ./Server() [0x8048b91]
31141[31141] ------------------------------------------------------------
31141[31141] Origin of Dump: __cxa_begin_catch
31141[31141] Exception Time: 14:03:42.603695
31141[31141] Stack Frames
31141[31141] #1: ../../libexray.so(__cxa_begin_catch+0x58) [0xf77a14db]
31141[31141] #2: ./Server() [0x8048e93]
31141[31141] #3: /lib/libc.so.6(__libc_start_main+0xf3) [0xf7487943]
31141[31141] #4: ./Server() [0x8048b91]
Aborted (core dumped)

The exception is thrown from the second frame of the first stack trace:

31141[31141] #2: /lib/libstdc++.so.6(_Znwj+0x86) [0xf76d3b86]

The function name is mangled as "_Znwj". Demangling it shows which function in libstdc++.so.6 failed.

$ c++filt _Znwj
operator new(unsigned int)

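The exception came from operator new, which matches the std::bad_alloc exception type reported in the first trace. This is a memory allocation failure.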
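The same demangling can also be done in code when c++filt is not at hand. The sketch below assumes a toolchain that ships <cxxabi.h> (GCC or Clang) and uses abi::__cxa_demangle. The wrapper name demangle is hypothetical.

```cpp
#include <cxxabi.h>

#include <cstdlib>
#include <string>

// Hypothetical wrapper around abi::__cxa_demangle. Returns the demangled
// name, or the input unchanged if it is not a valid mangled name.
std::string demangle(const char* mangled) {
    int status = 0;
    // With a null output buffer, __cxa_demangle allocates the result with
    // malloc, so the caller must release it with free.
    char* out = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    if (status != 0 || out == nullptr) {
        return mangled;
    }
    std::string result(out);
    std::free(out);
    return result;
}
```

For example, demangle("_Znwj") yields "operator new(unsigned int)", the same result c++filt printed.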

Root Cause Analysis

It turned out that the customer had configured 2,000 threads within the program, which consumed 2 GB of memory for thread stacks. The application also had an in-memory database that allocated 1 GB of memory. The program was a 32-bit application with a 4 GB address space, so 3 GB of the 4 GB was already reserved for these purposes as soon as the process started. The code segment and static data also required a few hundred MB, so heap space was constrained, and memory allocation eventually failed. The suggestion was to reduce the number of threads and instead run a few more instances of the same program to spread the load.
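The address-space budget above can be checked with simple arithmetic. The sketch below works in MiB and assumes 1 MiB of stack per thread, a figure inferred from 2 GB spread over 2,000 threads and not stated by the customer. The function name remaining_mib is hypothetical.

```cpp
#include <cstdint>

// Hypothetical helper: address space (in MiB) left after reserving the
// thread stacks and the in-memory database. Code and static data still
// have to fit into whatever remains, alongside the heap.
constexpr std::int64_t remaining_mib(std::int64_t address_space_mib,
                                     std::int64_t threads,
                                     std::int64_t stack_mib_per_thread,
                                     std::int64_t database_mib) {
    return address_space_mib - threads * stack_mib_per_thread - database_mib;
}
```

Under these assumptions, remaining_mib(4096, 2000, 1, 1024) gives 1072 MiB, from which the code segment and static data still have to be subtracted. Dropping to an illustrative 500 threads per instance, remaining_mib(4096, 500, 1, 1024) gives 2572 MiB.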
