I was recently confronted with a program that mysteriously aborted (Trace/BPT trap) at run time on AIX 7.1 (but not on AIX 6.1). Usually. But not on all systems or on all build settings.
This program is the ACE Message_Queue_Test; in particular, the stress test I added to it to ensure that blocks are counted properly when enqueues and dequeues are happening in different combinations from different threads. It ended up not being particular to ACE, but I did add a change to the test’s build settings to account for this issue. But I’m getting ahead of myself…
The symptoms were that after the queue writer threads had been running a while and the reader threads started to exit, a writer thread would hit a Trace/BPT trap. The ACE_Task in this thread had its members all zeroed out, including the message queue pointer, leading to the trap. I tried setting debug watches on the task content but still no real clues.
Yes, the all-zeroes contents of the wiped stack should have tipped me off, but hind-sight is always 20-20.
The other confusion was that the same program built on AIX 6.1 would run fine. But copy it over to AIX 7.1, and crash! So, I opened a support case with IBM AIX support to report the brokenness of the binary compatibility from AIX 6.1 to 7.1. “There. That’s off to IBM’s hands,” I thought. “I hope it isn’t a total pain to get a fix from them. Let’s see what Big Blue can do.”
If you’ve been reading this blog for a while you may recall another support experience I related here, from a different support-providing company that wears hats of a different color than Big Blue. As you may recall, I was less than impressed.
Within hours I got a response that IBM had reproduced the problem. Although they could crash my program on AIX 7.1 and 6.1. They wanted a test case, preprocessed source, to get more info. I responded that they could download the whole 12 MB ACE source kit – the source is in there. Meanwhile I set off to narrow down the code into a small test case, imagining the whole AIX support team laughing hysterically about this joker who wanted them to download a 12 MB tarball to help diagnose a case.
I came back from lunch yesterday gearing up to get my test case together. There was email from IBM support. “Is this where they remind me that they want a small test case?” I wondered.
Nope. The email contained the dbg steps they used to diagnose the problem (which was mine), the 3 choices of ways to resolve the problem, and pointers to the AIX docs that explained all the background.
Wow.
AIX support rocks. I mean, I very often help customers diagnose problems under ACE support that end up being problems in the customer’s code. But I’ve never experienced that from any other company. Really. Outstanding.
So what was the problem in the end? The segment 2 memory area, which holds both the heap and the process stacks, was overflowing. The program was allocating enough memory to cause the heap to run over the stacks. (Remember the zeroed-out stack content? The newly allocated memory was being cleared.)
This is how the diagnosis went:
(dbx) run
Trace/BPT trap in
ACE_Task<ACE_MT_SYNCH>::putq(ACE_Message_Block*,ACE_Time_Value*) at line 39 in file "Task_T.inl" ($t12)
39 return this->msg_queue_->enqueue_tail (mb, tv);
(dbx) list 36,42
36 ACE_Task<ACE_SYNCH_USE>::putq (ACE_Message_Block *mb,
ACE_Time_Value *tv)
37 {
38 ACE_TRACE ("ACE_Task<ACE_SYNCH_USE>::putq");
39 return this->msg_queue_->enqueue_tail (mb, tv);
40 }
41
42 template <ACE_SYNCH_DECL> ACE_INLINE int
(dbx) 0x10000f20/12 i
0x10000f20
(ACE_Task<ACE_MT_SYNCH>::putq(ACE_Message_Block*,ACE_Time_Value*))
7c0802a6 mflr r0
0x10000f24
(ACE_Task<ACE_MT_SYNCH>::putq(ACE_Message_Block*,ACE_Time_Value*)+0x4)
9421ffc0 stwu r1,-64(r1)
0x10000f28
(ACE_Task<ACE_MT_SYNCH>::putq(ACE_Message_Block*,ACE_Time_Value*)+0x8)
90010048 stw r0,0x48(r1)
0x10000f2c
(ACE_Task<ACE_MT_SYNCH>::putq(ACE_Message_Block*,ACE_Time_Value*)+0xc)
90610058 stw r3,0x58(r1)
0x10000f30
(ACE_Task<ACE_MT_SYNCH>::putq(ACE_Message_Block*,ACE_Time_Value*)+0x10)
9081005c stw r4,0x5c(r1)
0x10000f34
(ACE_Task<ACE_MT_SYNCH>::putq(ACE_Message_Block*,ACE_Time_Value*)+0x14)
90a10060 stw r5,0x60(r1)
0x10000f38
(ACE_Task<ACE_MT_SYNCH>::putq(ACE_Message_Block*,ACE_Time_Value*)+0x18)
80610058 lwz r3,0x58(r1)
0x10000f3c
(ACE_Task<ACE_MT_SYNCH>::putq(ACE_Message_Block*,ACE_Time_Value*)+0x1c)
0c430200 twllti r3,0x200
0x10000f40
(ACE_Task<ACE_MT_SYNCH>::putq(ACE_Message_Block*,ACE_Time_Value*)+0x20)
80610058 lwz r3,0x58(r1)
0x10000f44
(ACE_Task<ACE_MT_SYNCH>::putq(ACE_Message_Block*,ACE_Time_Value*)+0x24)
806300a4 lwz r3,0xa4(r3)
0x10000f48
(ACE_Task<ACE_MT_SYNCH>::putq(ACE_Message_Block*,ACE_Time_Value*)+0x28)
0c430200 twllti r3,0x200
0x10000f4c
(ACE_Task<ACE_MT_SYNCH>::putq(ACE_Message_Block*,ACE_Time_Value*)+0x2c)
80630000 lwz r3,0x0(r3)
(dbx) 0x2FF2289C/4 x
0x2ff2289c: 0000 0000 0000 0000
(dbx) malloc
The following options are enabled:
Implementation Algorithm........ Default Allocator (Yorktown)
Statistical Report on the Malloc Subsystem:
Heap 0
heap lock held by................ pthread ID 0x200248e8
bytes acquired from sbrk()....... 267402864 <***!!!
bytes in the freespace tree...... 15488
bytes held by the user........... 267387376
allocations currently active..... 4535796
allocations since process start.. 9085824
The Process Heap
Initial process brk value........ 0x2001e460
current process brk value........ 0x2ff222d0 <***!!!
sbrk()s called by malloc......... 4071
*** Heap has reached the upper limit of segment 0x2 and
collided with the initial thread's stack.
Changing the executable to a 'large address model' 32bit
exe should resolve the problem (in other words give
it more heap space).
# ldedit -b maxdata:0x20000000 MessageQueueTest
ldedit: File MessageQueueTest updated.
# dump -ov MessageQueueTest
MessageQueueTest:
***Object Module Header***
# Sections Symbol Ptr # Symbols Opt Hdr Len Flags
6 0x004cde82 142781 72 0x1002
Flags=( EXEC DYNLOAD DEP_SYSTEM )
Timestamp = "Apr 23 14:51:24 2011"
Magic = 0x1df (32-bit XCOFF)
***Optional Header***
Tsize Dsize Bsize Tstart Dstart
0x001b7244 0x0001d8ec 0x000007b8 0x10000178 0x200003bc
SNloader SNentry SNtext SNtoc SNdata
0x0004 0x0002 0x0001 0x0002 0x0002
TXTalign DATAalign TOC vstamp entry
0x0007 0x0003 0x2001cc40 0x0001 0x20017f7c
maxSTACK maxDATA SNbss magic modtype
0x00000000 0x20000000 0x0003 0x010b 1L
# ./MessageQueueTest
# <-- NO CRASH!
Summary: Increasing the default heap space from 256M(approx.) to 512M resolved the problem. IBM gave me three ways to resolve this:
- Edit the executable as above with ldedit
- Relink the executable with -bmaxdata:0x20000000
- Set environment variable LDR_CNTRL=MAXDATA=0x20000000
I ended up changing the Message_Queue_Test’s MPC description to add -bmaxdata to the build. That was the easiest way to always get it correct and make it easier for the regression test suite to execute the program.
Lastly, here’s the link IBM gave me for the ‘large address model’:
http://publib.boulder.ibm.com/infocenter/aix/v6r1/index.jsp?topic=/com.ibm.aix.genprogc/doc/genprogc/lrg_prg_support.htm
Bottom line – the test is running, the project is done, I have a sunny afternoon to write this blog entry and enjoy the nice New England spring day – instead of narrowing down a test case. Thanks, IBM!