Archive for April, 2011

Diagnosing Stack/Heap Collision on AIX

April 29, 2011

I was recently confronted with a program that mysteriously aborted (Trace/BPT trap) at run time on AIX 7.1 (but not on AIX 6.1). Usually. But not on all systems or on all build settings.

This program is the ACE Message_Queue_Test; in particular, the stress test I added to it to ensure that blocks are counted properly when enqueues and dequeues are happening in different combinations from different threads. It ended up not being particular to ACE, but I did add a change to the test’s build settings to account for this issue. But I’m getting ahead of myself…

The symptoms were that after the queue writer threads had been running a while and the reader threads started to exit, a writer thread would hit a Trace/BPT trap. The ACE_Task in this thread had its members all zeroed out, including the message queue pointer, leading to the trap. I tried setting debug watches on the task content but still no real clues.

Yes, the all-zeroes contents of the wiped stack should have tipped me off, but hind-sight is always 20-20.

The other confusion was that the same program built on AIX 6.1 would run fine. But copy it over to AIX 7.1, and crash! So, I opened a support case with IBM AIX support to report the brokenness of the binary compatibility from AIX 6.1 to 7.1. “There. That’s off to IBM’s hands,” I thought. “I hope it isn’t a total pain to get a fix from them. Let’s see what Big Blue can do.”

If you’ve been reading this blog for a while you may recall another support experience I related here, from a different support-providing company that wears hats of a different color than Big Blue. As you may recall, I was less than impressed.

Within hours I got a response that IBM had reproduced the problem. Although they could crash my program on AIX 7.1 and 6.1. They wanted a test case, preprocessed source, to get more info. I responded that they could download the whole 12 MB ACE source kit – the source is in there. Meanwhile I set off to narrow down the code into a small test case, imagining the whole AIX support team laughing hysterically about this joker who wanted them to download a 12 MB tarball to help diagnose a case.

I came back from lunch yesterday gearing up to get my test case together. There was email from IBM support. “Is this where they remind me that they want a small test case?” I wondered.

Nope. The email contained the dbg steps they used to diagnose the problem (which was mine), the 3 choices of ways to resolve the problem, and pointers to the AIX docs that explained all the background.


AIX support rocks. I mean, I very often help customers diagnose problems under ACE support that end up being problems in the customer’s code. But I’ve never experienced that from any other company. Really. Outstanding.

So what was the problem in the end? The segment 2 memory area, which holds both the heap and the process stacks, was overflowing. The program was allocating enough memory to cause the heap to run over the stacks. (Remember the zeroed-out stack content? The newly allocated memory was being cleared.)

This is how the diagnosis went:

(dbx) run

Trace/BPT trap in
ACE_Task<ACE_MT_SYNCH>::putq(ACE_Message_Block*,ACE_Time_Value*) at line  39 in file "Task_T.inl" ($t12)
39     return this->msg_queue_->enqueue_tail (mb, tv);
(dbx) list 36,42
36   ACE_Task<ACE_SYNCH_USE>::putq (ACE_Message_Block *mb,
ACE_Time_Value *tv)
37   {
38     ACE_TRACE ("ACE_Task<ACE_SYNCH_USE>::putq");
39     return this->msg_queue_->enqueue_tail (mb, tv);
40   }
42   template <ACE_SYNCH_DECL> ACE_INLINE int

(dbx) 0x10000f20/12 i
7c0802a6        mflr   r0
9421ffc0        stwu   r1,-64(r1)
90010048         stw   r0,0x48(r1)
90610058         stw   r3,0x58(r1)
9081005c         stw   r4,0x5c(r1)
90a10060         stw   r5,0x60(r1)
80610058         lwz   r3,0x58(r1)
0c430200      twllti   r3,0x200
80610058         lwz   r3,0x58(r1)
806300a4         lwz   r3,0xa4(r3)
0c430200      twllti   r3,0x200
80630000         lwz   r3,0x0(r3)

(dbx) 0x2FF2289C/4 x
0x2ff2289c:  0000 0000 0000 0000

(dbx) malloc
The following options are enabled:

Implementation Algorithm........ Default Allocator (Yorktown)

Statistical Report on the Malloc Subsystem:
Heap 0
heap lock held by................ pthread ID 0x200248e8
bytes acquired from sbrk().......    267402864 <***!!!
bytes in the freespace tree......        15488
bytes held by the user...........    267387376
allocations currently active.....      4535796
allocations since process start..      9085824

The Process Heap
Initial process brk value........ 0x2001e460
current process brk value........ 0x2ff222d0 <***!!!
sbrk()s called by malloc.........       4071

*** Heap has reached the upper limit of segment 0x2 and
collided with the initial thread's stack.
Changing the executable to a 'large address model' 32bit
exe should resolve the problem (in other words give
it more heap space).

# ldedit -b maxdata:0x20000000 MessageQueueTest
ldedit:  File MessageQueueTest updated.
# dump -ov MessageQueueTest


***Object Module Header***
# Sections      Symbol Ptr      # Symbols       Opt Hdr Len     Flags
6      0x004cde82         142781                72     0x1002
Timestamp = "Apr 23 14:51:24 2011"
Magic = 0x1df  (32-bit XCOFF)

***Optional Header***
Tsize        Dsize       Bsize       Tstart      Dstart
0x001b7244  0x0001d8ec  0x000007b8  0x10000178  0x200003bc

SNloader     SNentry     SNtext      SNtoc       SNdata
0x0004      0x0002      0x0001      0x0002      0x0002

TXTalign     DATAalign   TOC         vstamp      entry
0x0007      0x0003      0x2001cc40  0x0001      0x20017f7c

maxSTACK     maxDATA     SNbss       magic       modtype
0x00000000  0x20000000  0x0003      0x010b        1L
# ./MessageQueueTest
#                     <-- NO CRASH!

Summary: Increasing the default heap space from 256M(approx.) to 512M resolved the problem.  IBM gave me three ways to resolve this:

  1. Edit the executable as above with ldedit
  2. Relink the executable with -bmaxdata:0x20000000
  3. Set environment variable LDR_CNTRL=MAXDATA=0x20000000

I ended up changing the Message_Queue_Test’s MPC description to add -bmaxdata to the build. That was the easiest way to always get it correct and make it easier for the regression test suite to execute the program.

Lastly, here’s the link IBM gave me for the ‘large address model’:

Bottom line – the test is running, the project is done, I have a sunny afternoon to write this blog entry and enjoy the nice New England spring day – instead of narrowing down a test case. Thanks, IBM!