Archive for January, 2009

Analysis of ACE_Proactor Shortcomings on UNIX

January 22, 2009

I’ve been looking into two related issues in the ACE development stream:

  1. SSL_Asynch_Stream_Test times out on HP-UX (I recently made a bunch of fixes to the test itself, so it now runs as well as can be expected on Linux, but it still times out on HP-UX)
  2. Proactor_Test shows a stray, intermittent diagnostic on HP-UX: EINVAL returned from aio_suspend()

Although I’ve previously discussed use of ACE_Proactor on Linux (https://stevehuston.wordpress.com/2008/11/25/when-is-it-ok-to-use-ace-proactor-on-linux/), the issues on HP-UX are of a different sort. If the Linux aio issues I discussed there are ever resolved inside Linux, the same problem I’m seeing on HP-UX may arise there as well; for now, Linux doesn’t get that far. I also suspect that the issues these tests hit on Solaris are of the same nature, though the symptoms are a bit different.

The symptoms are that the proactor event loop either fails to detect completions, or it gets random errors that smell like the aiocb list is damaged. I believe I’ve got a decent idea of what’s going on, and it’s basically two issues:

  1. If all of the completion dispatch threads are already blocked waiting for completions when new I/O is initiated, the new operations are not taken into account by the waiting threads. This is basically what happens in the SSL_Asynch_Stream_Test timeout on HP-UX – all the completion-detecting threads are running before any I/O is initiated, so no completions are ever detected.
  2. The completion and initiation activities modify the aiocb list used to detect completions directly, without interlocks, and without considering what effect the changes may have on the threads waiting for completions.
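
To see why waiting threads miss newly initiated operations, it helps to look at the primitive underneath. Here is a minimal sketch of my own (illustrative names, not ACE’s code) showing that aio_suspend() waits on exactly the aiocb array it is handed; an operation initiated after the call is in no waiter’s array, so nobody wakes up for it.

```cpp
// A sketch of the POSIX AIO wait at the heart of the problem: aio_suspend()
// blocks on exactly the aiocb array it is given, a fixed "snapshot" of the
// outstanding operations. Operations started later are invisible to it.
#include <aio.h>
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

// Start one asynchronous read and wait for that single operation to finish.
ssize_t read_one_async(const char *path, char *buf, size_t len) {
    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = open(path, O_RDONLY);
    if (cb.aio_fildes < 0) return -1;
    cb.aio_buf = buf;
    cb.aio_nbytes = len;
    if (aio_read(&cb) != 0) { close(cb.aio_fildes); return -1; }

    const struct aiocb *list[1] = { &cb };  // the waiter's fixed snapshot
    while (aio_error(&cb) == EINPROGRESS)
        aio_suspend(list, 1, NULL);         // waits only on this array
    ssize_t n = aio_return(&cb);
    close(cb.aio_fildes);
    return n;
}
```

A real proactor hands aio_suspend() an array of every outstanding aiocb; the trouble comes when that array changes after the threads have already blocked.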

The ACE_Reactor framework uses internal notifications to unblock waiting demultiplexing threads so they can re-examine the handle set as needed; the ACE_Proactor needs something similar to remedy issue #1 above. There is a notification pipe facility in the proactor code, but I need to see whether it can be used in this case. I hope so…
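
The underlying idea is the classic self-pipe trick. This is my own minimal sketch of it (not ACE’s notification pipe code): a demultiplexer that also watches the read end of an internal pipe can be woken at any moment by writing one byte to the write end.

```cpp
// Self-pipe trick in miniature: the event loop watches an internal pipe's
// read end alongside the real handles; an initiating thread writes a byte
// to force the loop around so it re-scans its operation list.
#include <cassert>
#include <sys/select.h>
#include <unistd.h>

// True if fd has data to read right now (zero-timeout select poll).
bool is_readable(int fd) {
    fd_set readset;
    FD_ZERO(&readset);
    FD_SET(fd, &readset);
    struct timeval poll_now = {0, 0};   // do not block
    return select(fd + 1, &readset, NULL, NULL, &poll_now) == 1;
}

// Called by an initiating thread: one byte is enough to wake the waiter,
// which drains the pipe and rebuilds its wait list.
void notify(int pipe_write_fd) {
    char wakeup = 'w';
    write(pipe_write_fd, &wakeup, 1);
}
```

In a real event loop the pipe’s read end sits in the same set the loop blocks on, so a notify() after initiating new I/O guarantees the waiter comes back for a fresh look.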

The other issue, concurrent access to the aiocb list by threads that are both waiting for completions and modifying the list, is a much larger problem. Solving it requires a more fundamental change to the innards of the POSIX Proactor implementation.

Note that there are a number of POSIX Proactor flavors inside ACE (section 8.5 in C++NPv2 describes most of them). The particular shortcomings I’ve noted here only affect the ACE_POSIX_AIOCB_Proactor and ACE_POSIX_SIG_Proactor, which is largely based on the ACE_POSIX_AIOCB_Proactor. The newest one, ACE_POSIX_CB_Proactor, is much less affected, but is not as widely available.

So, the Proactor situation on UNIX platforms is generally not too good for demanding applications. Again, Proactor on Windows is very good, and recommended for high-performance, highly scalable networked applications. On Linux, stick to ACE_Reactor using the ACE_Dev_Poll_Reactor implementation; on other systems, stick with ACE_Reactor and ACE_Select_Reactor or ACE_TP_Reactor depending on your need for multithreaded dispatching.

My Experiences Porting Apache Qpid C++ to Windows

January 9, 2009

I recently finished (modulo some capabilities that should be added) porting Apache Qpid’s C++ implementation to Microsoft Windows. Apache Qpid also sports a Java broker and client as well as Python, Ruby, C#, and .NET clients. For my customer’s project I needed C++, which had, to date, been developed and used primarily on Linux. What I thought would be a slam-dunk 20-40 hour piece of work took about four months and hundreds of hours. Fortunately, the customer projects waiting on this work were themselves delayed, and my customer was very accommodating. Still, since I goofed the estimate so wildly, I billed the customer for only a fraction of the hours I worked. Fortunately, I rarely goof estimates that badly. This post-project review takes a look at what went wrong and why it ended up a very good thing.

When I first looked through the Qpid code base, I got some initial impressions:

  • It’s nicely layered, which will make life easy
  • It’s hard to find one’s way around it
  • The I/O layer (at the bottom of the stack) is easily modified for what I need

The first two impressions held; the third did not. Most of the troubles and false starts had to do with the I/O layer at the bottom of the stack. Most of the rest of the code ported over with relative ease. The original authors did a very nice job isolating code that was likely to need varying implementations. Those areas generally use the Bridge pattern to offer a uniform API that’s implemented differently as needed.

The general areas I had to work on for the port are described below.

Synchronization

Qpid uses multiple threads – no big surprise for a high-performance networked system. So there’s of course a need for synchronization objects (mutexes, condition variables, etc.). The existing C++ code had nice wrapper classes and a Pthreads implementation. The options for completing the Windows implementation were:

  • Use native Windows system calls
  • ACE (relatively obvious for me)
  • Apache Portable Runtime (it’s an Apache project after all)
  • Boost (Qpid already made use of Boost in many other areas)

Windows system calls were ruled out fairly quickly: they don’t offer everything that was needed on XP (condition variables, in particular), and the interaction between the date-time parts of the existing threading/synch classes and Windows system time was very clunky.

I was hesitant to add ACE as an outside requirement just for the Windows port. I was also sensitive to the fact that as a newbie on this particular project I could be seen as simply trying to wedge ACE in for my own sinister purposes (which is definitely not the case!). So scratch that one.

After a brief but unsuccessful attempt at APR (and being told that some previous APR use was abandoned) I settled on Boost. This was my first project using Boost and it took some getting used to, but overall was pretty smooth.
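
For flavor, here is the kind of wrapper such a synchronization layer provides, sketched with std::mutex and std::condition_variable – the C++11 primitives that were standardized from the Boost.Thread API and mirror it closely. The class is my illustration, not Qpid’s actual code.

```cpp
// A monitor-style blocking queue: a mutex guards the data, and a condition
// variable lets consumers sleep until a producer signals that work arrived.
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <utility>

template <typename T>
class MessageQueue {
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push(std::move(value));
        }
        cond_.notify_one();             // wake one blocked consumer
    }

    T pop() {                           // blocks until an item is available
        std::unique_lock<std::mutex> lock(mutex_);
        cond_.wait(lock, [this] { return !items_.empty(); });
        T front = std::move(items_.front());
        items_.pop();
        return front;
    }

private:
    std::mutex mutex_;
    std::condition_variable cond_;
    std::queue<T> items_;
};
```

The predicate form of wait() handles spurious wakeups automatically, one of the details these wrapper classes exist to get right once rather than at every call site.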

Thread Management

The code that actually spawned and managed threads was easily implemented using native Windows system calls. Straightforward and easy.

I/O

This is where all the action is. The existing code comments (there aren’t many, but what is there is descriptive) talked about “Asynch I/O.” This was welcome, since I planned to use overlapped I/O to get high throughput; Windows’ implementation of asynchronous (they call it overlapped) I/O is very good, scaling and performing well. The interface to the I/O layer from the upper levels in Qpid looked good for asynchronous I/O, and I got a little overconfident. In retrospect, the name of the event dispatcher class (Poller) should have tipped me off that I had some difficulty ahead.

The Linux code’s Poller implementation uses Linux epoll to get high performance and remain very scalable. The code is solid, well designed, and well implemented. However, it is event-driven, synchronous I/O, and that tends to show through a bit more than perhaps intended. Handles need to be registered with the Poller, for example – something that’s not done with overlapped I/O.
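
The readiness model shows up right in the API. Here is a minimal sketch of my own (not Qpid’s Poller) of the register-then-wait cycle epoll imposes; with overlapped I/O there is no registration step, because the transfer itself is what completes.

```cpp
// Register-then-wait: the readiness cycle a Linux epoll Poller is built
// around. Illustrative only; Qpid's Poller wraps this in far more machinery.
#include <cassert>
#include <sys/epoll.h>
#include <unistd.h>

// Create an epoll instance watching fd for readability; -1 on error.
int make_poller(int fd) {
    int ep = epoll_create1(0);
    if (ep < 0) return -1;
    struct epoll_event ev;
    ev.events = EPOLLIN;
    ev.data.fd = fd;                // registration: no overlapped-I/O analogue
    if (epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev) != 0) {
        close(ep);
        return -1;
    }
    return ep;
}

// Wait up to timeout_ms; returns the ready descriptor, or -1 if none.
// The caller then performs the actual (synchronous) read or write.
int wait_ready(int ep, int timeout_ms) {
    struct epoll_event ev;
    return epoll_wait(ep, &ev, 1, timeout_ms) == 1 ? ev.data.fd : -1;
}
```

Note that the read or write still happens after the wait returns; under overlapped I/O the operation is started first and the wait reports its completion, which is why the two models don’t slide under one interface as easily as I’d hoped.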

My first attempt at the Windows version of a Poller implementation was large and disruptive. Fortunately, once I offered it up for review I received a huge amount of help from the I/O layer expert on the project. He and I sat down for a morning to review the design, the code I came up with, and best ways to go forward. The people I’ve worked with on Apache Qpid are consummate professionals and I’m very thankful for their input and guidance.

My second design for the I/O layer went much better. It doesn’t interfere with the Linux code, and slides in nicely with very little code duplication. I think that after another port or two are done where more of these implementations need to be designed, it may be possible to refactor some of the I/O layer to make things a bit cleaner, but that’s very minor at this point – the code works very well and integrates without disruption.

Lessons Learned

So what did I learn from this?

  1. It pays to spend a little extra time reading the existing code before designing extensions – even when it looks pretty straightforward, and even if you have to write some design documentation and run it by the original author(s).
  2. Forge good relationships with the other people on the team. This is an easy one when you all work in the same group, even in the same building. It’s more often assumed to be difficult at best when the group is spread around the world and across (sometimes competing) companies. It’s worth the effort.

So although the project took far longer than I originally estimated, the result is a good implementation that fits with the rest of the system and performs well. I could have wedged in my original bad design in far less time, but someone would have had to pick up the pieces later. The design constraints and rules that were previously unwritten are now at least partly written down (in the Windows code, anyway). If I do another port, it’ll be much smoother next time.

Where to Go From Here?

There are a few difficulties remaining for the Windows port and a few capabilities that should be added:

  • Keep project files synched with generated code. The Qpid project’s build process generates a lot of code from XML protocol specifications. This is very nice, but runs into trouble keeping the Visual Studio project files up to date as the set of generated files changes. I’ve been using the MPC tool to generate Visual Studio projects and MPC can pick up names by wildcard, but that still leaves an extra step: generate C++ code, regenerate project files. This need has caused a couple of hiccups during the Qpid M4 release process where I had to regenerate the project files. It would be nice if Visual Studio’s C++ build could deal with wildcards, or if the C++ project file format allowed inclusion of an outside file listing sources (which could be generated along with the C++ code).
  • Add SSL support. The Linux code uses OpenSSL. I’d rather not pull in another prerequisite when Windows knows how to do SSL already. At least I assume it does, and in a way that doesn’t require an HTTP connection to use. I’ve yet to figure this out…
  • Persistent storage for messages. AMQP (and Qpid) allows for persistent message store, guaranteeing message delivery in the face of failures. There’s not yet a store for Qpid on Windows, and it should be added.
  • Add the needed “declspec”s to build the Qpid libraries as DLLs; currently they’re built as static libraries.
  • Minor tweaks making Qpid integrate better with Windows, such as a message definition for storing diagnostics in the event log and being able to set the broker up as a Windows Service.

There’s No Substitute for Experience with Threads

January 5, 2009

When your system performance is not all you had hoped it would be, are you tempted to think that adding more threads will speed things up? When your customers complain that they upgraded to the latest multicore processor but your application doesn’t run any faster, what answers do you have for them?

Even if race conditions, synchronization bottlenecks, and atomicity are part of your normal vocabulary, multithreaded programming is a world where you must really understand what’s below the surface of the API to get the full picture of why some piece of code is, or is not, working. I was reminded of this – and of just how deep that understanding must go – while catching up on some reading this week.

I was introduced to threads (DECthreads, the precursor to Pthreads, for you history buffs) in the early 1990s. Neat! I can do multiple things at the same time! Fortunately, I spent a fair amount of time in my programming formative years working on an operating system for the Control Data 3600 (anyone remember the TV show “The Bionic Man”? The large console in the bionics lab was a CDC-3600). I learned the hard way that the world can change in odd ways between instructions. So I wasn’t completely fooled by the notion of magically being able to do multiple things at the same time, but threading libraries make the whole area of threads much more approachable. But with power comes responsibility – the responsibility to know what you’re doing with that power tool.

I’ve been working on multithreaded code for many years now and find multithreading a powerful tool for building high-performance networked applications. So I was eager to read the Dr. Dobb’s article “Lock-Free Code: A False Sense of Security” by Herb Sutter (his blog entry related to the article is here). His “Effective Concurrency” column is a regular favorite of mine. The article is a diagnosis of an earlier article describing a lock-free queue implementation. I had previously read the article being diagnosed and, although I only skimmed it since I had no immediate need for a single-writer, single-reader queue at the time, I didn’t catch anything really wrong with it. So I was anxious to see what I missed.

Boy, I missed a few things. Now that I see them explained, it’s like “ah, of course,” but I probably wouldn’t have thought about those issues until I was trying to figure out what was wrong at runtime. Some may say the issues are sort of esoteric and machine-specific, and I may agree, but it doesn’t matter – it’s a case of understanding your environment and tools, and another situation where experience makes all the difference between banging your head on the wall and getting the job done.
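
To make the subject concrete, here is a single-writer, single-reader ring buffer sketched with C++11 atomics, which postdate both articles; the explicit acquire/release pairing is exactly the kind of ordering detail Sutter’s diagnosis turns on. This is my own illustration, not the queue from either article.

```cpp
// Single-producer/single-consumer ring buffer. Each index is written by
// exactly one thread; the release store on one side pairs with the acquire
// load on the other, so a slot's contents are visible before the index move.
#include <atomic>
#include <cassert>
#include <cstddef>

template <typename T, std::size_t N>
class SpscQueue {
public:
    bool push(const T &value) {         // producer thread only
        std::size_t tail = tail_.load(std::memory_order_relaxed);
        std::size_t next = (tail + 1) % N;
        if (next == head_.load(std::memory_order_acquire))
            return false;               // full (one slot kept unused)
        buffer_[tail] = value;
        tail_.store(next, std::memory_order_release);  // publish the slot
        return true;
    }

    bool pop(T &out) {                  // consumer thread only
        std::size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire))
            return false;               // empty
        out = buffer_[head];
        head_.store((head + 1) % N, std::memory_order_release);
        return true;
    }

private:
    T buffer_[N];
    std::atomic<std::size_t> head_{0};  // consumer's index
    std::atomic<std::size_t> tail_{0};  // producer's index
};
```

Written with plain loads and stores instead of the atomics, this same shape compiles fine and usually even works – which is precisely the false sense of security the article’s title warns about.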

I’m thankful that I can get more understanding by reading the works of smart people who’ve trodden before me. I’m sure that knowledge will save me some time at some point when debugging some odd race condition. And that’s what it’s all about – learn, experience, save time. Thanks Herb.