Archive for the ‘Uncategorized’ Category

Book Review: “Boost.ASIO C++ Network Programming” by John Torjo

April 20, 2013

Boost.ASIO C++ Network Programming

I make a living doing network programming, so I was very interested to review a new book, Boost.ASIO C++ Network Programming by John Torjo, published by Packt Publishing. On a scale of 1-10 I give it a solid 6. The good news is that I haven’t seen anything written about Boost.ASIO that’s better, so if you are interested in learning more about Boost.ASIO, I recommend you buy this book. I’ll explain further.

My first impression when starting to read this book is that it’s poorly edited. It feels like it was written in a rush and produced in an even bigger rush. There are spelling errors, grammar errors, and it just feels second rate. As I made my way through the book it didn’t get any better. Don’t let this dissuade you from learning from the real content, though.

Technically speaking Torjo attempts to do a few big things:

  • Explain use of Boost.ASIO
  • Explain network programming
  • Explain synchronous vs. asynchronous programming paradigms

If you are not already fluent in the general network programming concepts, I recommend you purchase other books (Stevens, Schmidt, Schmidt [disclaimer: I co-authored this]) as a prerequisite to Boost.ASIO. Torjo succeeds fairly well at the first point, though, which is the primary purpose of the book. You should also have a passing familiarity with other Boost classes. While Boost.ASIO is not template-heavy like some other Boost areas, shared pointers and function binding are used well in the examples.

The book explains the use of the io_service, resolver, endpoint, and socket classes fairly well. Torjo hand-waves a bit about the relationship of io_service and socket, but once you’ve become accustomed to programming with Boost.ASIO, that becomes second nature. The book includes many well-explained examples, which I really appreciate. If you follow the examples you will end up with working code.

The book at one point claims to be useful as a reference to return to over and over for details. While it is certainly not a Boost.ASIO reference manual (you should bookmark the Boost docs for that), it is a very useful book for explaining how to make good use of this very flexible class library for network programming.

Diagnosing Stack/Heap Collision on AIX

April 29, 2011

I was recently confronted with a program that mysteriously aborted (Trace/BPT trap) at run time on AIX 7.1 (but not on AIX 6.1). Usually, but not on all systems or with all build settings.

This program is the ACE Message_Queue_Test; in particular, the stress test I added to it to ensure that blocks are counted properly when enqueues and dequeues are happening in different combinations from different threads. It ended up not being particular to ACE, but I did add a change to the test’s build settings to account for this issue. But I’m getting ahead of myself…

The symptoms were that after the queue writer threads had been running a while and the reader threads started to exit, a writer thread would hit a Trace/BPT trap. The ACE_Task in this thread had its members all zeroed out, including the message queue pointer, leading to the trap. I tried setting debug watches on the task content but still no real clues.

Yes, the all-zeroes contents of the wiped stack should have tipped me off, but hindsight is always 20/20.

The other confusion was that the same program built on AIX 6.1 would run fine. But copy it over to AIX 7.1, and crash! So, I opened a support case with IBM AIX support to report the broken binary compatibility between AIX 6.1 and 7.1. “There. That’s in IBM’s hands now,” I thought. “I hope it isn’t a total pain to get a fix from them. Let’s see what Big Blue can do.”

If you’ve been reading this blog for a while you may recall another support experience I related here, from a different support-providing company that wears hats of a different color than Big Blue. As you may recall, I was less than impressed.

Within hours I got a response: IBM had reproduced the problem, and they could crash my program on both AIX 7.1 and 6.1. They wanted a test case, preprocessed source, to get more info. I responded that they could download the whole 12 MB ACE source kit – the source is in there. Meanwhile I set off to narrow the code down into a small test case, imagining the whole AIX support team laughing hysterically about this joker who wanted them to download a 12 MB tarball to help diagnose a case.

I came back from lunch yesterday gearing up to get my test case together. There was email from IBM support. “Is this where they remind me that they want a small test case?” I wondered.

Nope. The email contained the dbx steps they used to diagnose the problem (which was mine), three ways to resolve it, and pointers to the AIX docs that explained all the background.

Wow.

AIX support rocks. I mean, I very often help customers diagnose problems under ACE support that end up being problems in the customer’s code. But I’ve never experienced that from any other company. Really. Outstanding.

So what was the problem in the end? The segment 2 memory area, which holds both the heap and the process stacks, was overflowing. The program was allocating enough memory to cause the heap to run over the stacks. (Remember the zeroed-out stack content? The newly allocated memory was being cleared.)

This is how the diagnosis went:

(dbx) run

Trace/BPT trap in
ACE_Task<ACE_MT_SYNCH>::putq(ACE_Message_Block*,ACE_Time_Value*) at line  39 in file "Task_T.inl" ($t12)
39     return this->msg_queue_->enqueue_tail (mb, tv);
(dbx) list 36,42
36   ACE_Task<ACE_SYNCH_USE>::putq (ACE_Message_Block *mb,
ACE_Time_Value *tv)
37   {
38     ACE_TRACE ("ACE_Task<ACE_SYNCH_USE>::putq");
39     return this->msg_queue_->enqueue_tail (mb, tv);
40   }
41
42   template <ACE_SYNCH_DECL> ACE_INLINE int

(dbx) 0x10000f20/12 i
0x10000f20
(ACE_Task<ACE_MT_SYNCH>::putq(ACE_Message_Block*,ACE_Time_Value*))
7c0802a6        mflr   r0
0x10000f24
(ACE_Task<ACE_MT_SYNCH>::putq(ACE_Message_Block*,ACE_Time_Value*)+0x4)
9421ffc0        stwu   r1,-64(r1)
0x10000f28
(ACE_Task<ACE_MT_SYNCH>::putq(ACE_Message_Block*,ACE_Time_Value*)+0x8)
90010048         stw   r0,0x48(r1)
0x10000f2c
(ACE_Task<ACE_MT_SYNCH>::putq(ACE_Message_Block*,ACE_Time_Value*)+0xc)
90610058         stw   r3,0x58(r1)
0x10000f30
(ACE_Task<ACE_MT_SYNCH>::putq(ACE_Message_Block*,ACE_Time_Value*)+0x10)
9081005c         stw   r4,0x5c(r1)
0x10000f34
(ACE_Task<ACE_MT_SYNCH>::putq(ACE_Message_Block*,ACE_Time_Value*)+0x14)
90a10060         stw   r5,0x60(r1)
0x10000f38
(ACE_Task<ACE_MT_SYNCH>::putq(ACE_Message_Block*,ACE_Time_Value*)+0x18)
80610058         lwz   r3,0x58(r1)
0x10000f3c
(ACE_Task<ACE_MT_SYNCH>::putq(ACE_Message_Block*,ACE_Time_Value*)+0x1c)
0c430200      twllti   r3,0x200
0x10000f40
(ACE_Task<ACE_MT_SYNCH>::putq(ACE_Message_Block*,ACE_Time_Value*)+0x20)
80610058         lwz   r3,0x58(r1)
0x10000f44
(ACE_Task<ACE_MT_SYNCH>::putq(ACE_Message_Block*,ACE_Time_Value*)+0x24)
806300a4         lwz   r3,0xa4(r3)
0x10000f48
(ACE_Task<ACE_MT_SYNCH>::putq(ACE_Message_Block*,ACE_Time_Value*)+0x28)
0c430200      twllti   r3,0x200
0x10000f4c
(ACE_Task<ACE_MT_SYNCH>::putq(ACE_Message_Block*,ACE_Time_Value*)+0x2c)
80630000         lwz   r3,0x0(r3)

(dbx) 0x2FF2289C/4 x
0x2ff2289c:  0000 0000 0000 0000

(dbx) malloc
The following options are enabled:

Implementation Algorithm........ Default Allocator (Yorktown)

Statistical Report on the Malloc Subsystem:
Heap 0
heap lock held by................ pthread ID 0x200248e8
bytes acquired from sbrk().......    267402864 <***!!!
bytes in the freespace tree......        15488
bytes held by the user...........    267387376
allocations currently active.....      4535796
allocations since process start..      9085824

The Process Heap
Initial process brk value........ 0x2001e460
current process brk value........ 0x2ff222d0 <***!!!
sbrk()s called by malloc.........       4071

*** Heap has reached the upper limit of segment 0x2 and
collided with the initial thread's stack.
Changing the executable to a 'large address model' 32bit
exe should resolve the problem (in other words give
it more heap space).

# ldedit -b maxdata:0x20000000 MessageQueueTest
ldedit:  File MessageQueueTest updated.
# dump -ov MessageQueueTest

MessageQueueTest:

***Object Module Header***
# Sections      Symbol Ptr      # Symbols       Opt Hdr Len     Flags
6      0x004cde82         142781                72     0x1002
Flags=( EXEC DYNLOAD DEP_SYSTEM )
Timestamp = "Apr 23 14:51:24 2011"
Magic = 0x1df  (32-bit XCOFF)

***Optional Header***
Tsize        Dsize       Bsize       Tstart      Dstart
0x001b7244  0x0001d8ec  0x000007b8  0x10000178  0x200003bc

SNloader     SNentry     SNtext      SNtoc       SNdata
0x0004      0x0002      0x0001      0x0002      0x0002

TXTalign     DATAalign   TOC         vstamp      entry
0x0007      0x0003      0x2001cc40  0x0001      0x20017f7c

maxSTACK     maxDATA     SNbss       magic       modtype
0x00000000  0x20000000  0x0003      0x010b        1L
# ./MessageQueueTest
#                     <-- NO CRASH!

Summary: Increasing the default heap space from approximately 256 MB to 512 MB resolved the problem. IBM gave me three ways to do it:

  1. Edit the executable as above with ldedit
  2. Relink the executable with -bmaxdata:0x20000000
  3. Set environment variable LDR_CNTRL=MAXDATA=0x20000000

I ended up changing the Message_Queue_Test’s MPC description to add -bmaxdata to the build. That was the easiest way to always get it correct and make it easier for the regression test suite to execute the program.

Lastly, here’s the link IBM gave me for the ‘large address model’:

http://publib.boulder.ibm.com/infocenter/aix/v6r1/index.jsp?topic=/com.ibm.aix.genprogc/doc/genprogc/lrg_prg_support.htm

Bottom line – the test is running, the project is done, I have a sunny afternoon to write this blog entry and enjoy the nice New England spring day – instead of narrowing down a test case. Thanks, IBM!

Python Distutils and RPMs Targeting /opt

February 21, 2011

I think I’ve mentioned before that I’ve been writing more Python code lately. Python is very powerful and the modules available for it make lots of things very easy, including networked application programming.

Today, though, I want to share a little tidbit about creating Linux RPMs to distribute Python scripts/modules that install to a root other than /usr. In particular, a customer had a request to install under /opt. I’ll use /opt/riverace for this example.

Like many Python tasks, there’s already an easy way to generate an RPM. The Python distutils are very cool, and it’s very easy to take a distutils-using setup and use Python to generate the RPM. It’s as easy as:

python setup.py bdist_rpm

There are lots of blogs, doc pages, etc. that explain how to create your setup.py. The thing I couldn’t find, though, is how to get the RPM to install into /opt.

It turned out to be simple. Just add this to your setup.py:

import sys
sys.prefix = '/opt/riverace'   # must be a string literal
setup(name='MyStuff',
...

Voila!

Read and Follow All Directions Carefully. And, Firefox SSL Settings for Accessing IBM pSeries ASMI via HTTPS

January 13, 2011

This is a public service announcement for those with IBM pSeries servers who muck up the ASMI setup. And for those who don’t, but find it doesn’t work anyway.

When I’m in a hurry I don’t always follow the directions to the letter, especially if I’m confident I know what’s going on.

Never do that. Especially when setting up new hardware. The people who write the directions spell those steps out for a reason.

A while back I installed a new IBM pSeries server. I’m no sysadmin guru, but I thought hey, I’ve hooked up more than a few new computers in my time. How hard could it be?

I don’t have an HMC, so I needed to cable up my ethernet LAN to the HMC port to access the ASMI via a web browser. (In hindsight, I should have known I was wandering into shark-infested waters with all those new acronyms.) The installation manual has a rather lengthy description of how to do this: configure a PC or laptop ethernet interface in a particular way, wire that directly to the HMC port, type a specific URL into a web browser, log in, then reconfigure the IP address etc. for the local LAN, move the cable to the LAN, and off you go. Easy.

Well I thought I could take a few shortcuts. I am, after all, a network programming guy. I’ve implemented IP. Multiple times. And I’m in a hurry.

Well, it may have been the install manual’s fault (it is a bit confusing and seems to contradict itself), but probably not. In any event, I somehow wedged the HMC ethernet port into an unusable state. Somehow I did manage to get the server up, and things hummed along nicely.

Until it happened. The server crashed and hung on reboot. What a lovely paperweight. Without access to the ASMI I was stuck. As far as I could tell, I was going to have to reset the service processor to factory defaults and start over, following the directions carefully this time. Now how to reset it?

After a frantic call to IBM, I got a very helpful person on the phone. After explaining my bungling the HMC ethernet setup and why I needed to reset the SP, he asked “Why don’t you just use the serial port and reset the network parameters to what you need?”

Oh.

That went pretty quickly. Network parameters now set to the correct values, port connected to LAN, here we go… get Firefox up, give it the magic URL, and…

“Cannot communicate securely with peer: no common encryption algorithm(s).

(Error code: ssl_error_no_cypher_overlap)”

My friendly IBM fellow had no advice for this one.

So I got wireshark going and watched the exchange between Firefox and the server that produced this error. Short and sweet – one SSL exchange and connection reset.

I wondered if maybe the server needed to speak SSL2, so I enabled that. Wireshark reported that the server really didn’t like that either – SSL2 start, SSL3 reset. So, it wants SSL3, but what else?

I poked around in the Firefox about:config page for SSL-related items and found a bunch that are disabled by default – less-secure options that are normally not used. Except for talking secure HTTP to pSeries ASMI, that is.

Long story short, if you need to use Firefox to access one of these IBM ASMI via web, the option that worked for me was to enable:

security.ssl3.rsa_rc4_40_md5

I’m guessing that this is because it’s a low-strength cipher that can be easily exported. In any event, that was the last piece of the puzzle I needed to get management access to this box. Maybe it will save someone a few days’ work.

How to Use Schannel for SSL Sockets on Windows

December 29, 2009

Did you know that Windows offers a native SSL sockets facility? You’d figure that since IE has SSL support, and MS SQL has SSL support, and .NET has SSL support, there would be some lower-level SSL support available in Windows. It’s there. Really. But you’ll have a hard time finding any clear, explanatory documentation on it.

I spent most of the past month adding SSL support to Apache Qpid on Windows, both client and broker components. I used a disproportionate amount of that time struggling with making sense of the Schannel API, as it is poorly documented. Some information (such as a nice overview of how to make use of it) is missing, and I’ll cover that here. Other information is flat out wrong in the MSDN docs; I’ll cover some of that in a subsequent post.

I pretty quickly located some info in MSDN with the promising title “Establishing a Secure Connection with Authentication”. I read it and really just didn’t get it. (Of course, now, in hindsight, it looks pretty clear.) Part of my trouble may have been a paradigm expectation. Both OpenSSL and NSS pretty much wrap all of the SSL operations into their own API, which takes the place of the plain socket calls. Functions such as connect(), send(), and recv() have their SSL-enabled counterparts in OpenSSL and NSS; adding SSL capability to an existing system ends up copying the socket-level code and replacing plain socket calls with the equivalent SSL calls (yes, there are some other details to take care of, but model-wise, that’s pretty much how it goes).

In Schannel the plain Windows Sockets calls are still used for establishing a connection and transferring data. The SSL support is, conceptually, added as a layer between the Winsock calls and the application’s data handling. The SSL/Schannel layer acts as an intermediary between the application data and the socket, encrypting/decrypting and handling SSL negotiations as needed. The data sent/received on the socket is opaque data either handed to Windows for decrypting or given by Windows after encrypting the normal application-level data. Similarly, SSL negotiation involves passing opaque data to the security context functions in Windows and obeying what those functions say to do: send some bytes to the peer, wait for more bytes from the peer, or both. So to add SSL support to an existing TCP-based application is more like adding a shim that takes care of negotiating the SSL session and encrypting/decrypting data as it passes through the shim.
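My outline of that client-side shim, in pseudocode: the real functions (AcquireCredentialsHandle, InitializeSecurityContext, EncryptMessage, DecryptMessage) and status codes are in the MSDN Schannel reference, and their actual parameter lists are considerably busier than shown here:

```
AcquireCredentialsHandle(...)                      // obtain Schannel credentials
in = empty
loop:
    status, out = InitializeSecurityContext(in)    // drive the SSL negotiation
    if out is non-empty: send(socket, out)         // plain Winsock send
    if status == SEC_E_OK: break                   // handshake complete
    if status == SEC_I_CONTINUE_NEEDED:
        in = recv(socket)                          // wait for the peer's next leg
        continue
    otherwise: fail

// After the handshake, the shim sits between application data and the socket:
//   application data -> EncryptMessage -> send(socket, ...)
//   recv(socket, ...) -> DecryptMessage -> application data
```

A server does the same dance with AcceptSecurityContext in place of InitializeSecurityContext.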

The shim approach is pretty much how I added SSL support to the C++ broker and client for Apache Qpid on Windows. Once I got my head around the new paradigm, it wasn’t too hard. Except for the errors and omissions in the encrypt/decrypt API documentation… I’ll cover that shortly.

The SSL support did not get into Qpid 0.6, unfortunately. But it will be in the development stream shortly after 0.6 is released and part of the next release for sure.

My Initial Impressions of Zircomp

November 13, 2009

I checked out a webinar yesterday describing Zircomp by Zircon Computing. I’m always interested in new tools for developing distributed systems that make more effective use of networks and available systems, and that’s just how the description billed Zircomp. I have no relationship with Zircon Computing or Zircomp, but ACE’s inventor and my co-author of C++ Network Programming, Dr. Douglas C. Schmidt, is Zircon’s CTO and Zircomp uses ACE, so I was immediately interested.

I have to admit, I expected this to be another CORBA-esque heavyweight conglomeration of fancy-acronymed services. I was delighted to find no CORBA mentioned anywhere – the framework described is all layered on ACE, and my take-away summary was that this is like RPC on steroids, then simplified so humans can understand it and distributed across an incredibly scalable set of compute nodes. I was duly impressed.

There is, as I understand it, a series of webinars planned to address the various aspects of Zircomp’s range of uses. This first webinar focused on distributing a parallel set of actions across a set of compute resources. It allows one to take the code for an action, wrap it for distribution, and replace it with a proxy that finds available servers and forwards the call. After having programmed RPC and CORBA, as well as having hand-crafted RPC-ish services, the simplicity of this new framework really blew me away. A lot of thought went into making this easy to use for the average programmer while allowing power users to really push the envelope.

I’ll be saving this one away as another tool in my bag of tricks for helping customers get the most value from both their engineering budgets and their computing equipment. Congrats, Zircon!

How We Converted the Apache Qpid C++ Build to CMake

June 1, 2009

A previous post covered why the Apache Qpid C++ build switched to CMake; this post describes how it was done.

The project was generously funded by Microsoft. We started the conversion in February 2009. At this point, the builds have been running well for a while; the test executions are not quite done. So, it took about 3 months to get the builds running on both Linux and Windows. We’re working on the testing aspects now. We have not really addressed the installation steps yet. There were only two aspects of the Qpid build conversion that weren’t completely straightforward:

  1. The build processes XML versions of the AMQP specification and the Qpid Management Framework specification to generate a lot of the code. The names of the generated files are not known a priori. The generator scripts produce a list of the generated files in addition to the files themselves. This list of files obviously needs to be plugged into the appropriate places when generating the makefiles.
  2. There are a number of optional features to build into Qpid. In addition to explicitly enabling or disabling the features, the autoconf scheme checked for the requisite capabilities and enabled the features when the user didn’t specify. It built as much as it could if the user didn’t specify what to build (or not to build).

To start, one person on the team (Cliff Jansen of Interop Systems) ran the existing automake through the KDE conversion steps to get a base set of CMakeLists.txt files and did some initial prototyping for the code generation step. The original autoconf build ran the code generator at make time if the source XML specifications were available at configure time (in a release kit, the generated sources are already there, and the specs are not in the kit). The Makefile.am file then included the generated lists of sources to generate the Makefile from which the product was built. Where to place the code generating step in the CMake scheme was a big question. We considered two options:

  • Do the code generation in the generated Makefile (or Visual Studio project). This had the advantage of being able to leverage the build system’s dependency evaluation and regenerate the code as needed. However, once generated, the Makefile (or Visual Studio project) would need to be recreated by CMake. Recall that the code generation generates a list of source files that must be in the Makefile. We couldn’t get this to be as seamless as desired.
  • Do the code generation in the CMake configuration step. This puts the dependency evaluation in the CMakeLists.txt file, and had to be coded by hand since we wouldn’t have the build system’s dependency evaluation available. However, once the code was generated, the list of generated source files was readily available for inclusion in the Makefile (and Visual Studio project) file generation and the build could proceed smoothly.

We elected the second approach for ease of use. The CMakeLists code for generating the AMQP specification-based code looks like this (note this code is covered by the Apache license):

# rubygen subdir is excluded from stable distributions
# If the main AMQP spec is present, then check if ruby and python are
# present, and if any sources have changed, forcing a re-gen of source code.
set(AMQP_SPEC_DIR ${qpidc_SOURCE_DIR}/../specs)
set(AMQP_SPEC ${AMQP_SPEC_DIR}/amqp.0-10-qpid-errata.xml)
if (EXISTS ${AMQP_SPEC})
  include(FindRuby)
  include(FindPythonInterp)
  if (NOT RUBY_EXECUTABLE)
    message(FATAL_ERROR "Can't locate ruby, needed to generate source files.")
  endif (NOT RUBY_EXECUTABLE)
  if (NOT PYTHON_EXECUTABLE)
    message(FATAL_ERROR "Can't locate python, needed to generate source files.")
  endif (NOT PYTHON_EXECUTABLE)

  set(specs ${AMQP_SPEC} ${qpidc_SOURCE_DIR}/xml/cluster.xml)
  set(regen_amqp OFF)
  set(rgen_dir ${qpidc_SOURCE_DIR}/rubygen)
  file(GLOB_RECURSE rgen_progs ${rgen_dir}/*.rb)
  # If any of the specs, or any of the sources used to generate code, change
  # then regenerate the sources.
  foreach (spec_file ${specs} ${rgen_progs})
    if (${spec_file} IS_NEWER_THAN ${CMAKE_CURRENT_SOURCE_DIR}/rubygen.cmake)
      set(regen_amqp ON)
    endif (${spec_file} IS_NEWER_THAN ${CMAKE_CURRENT_SOURCE_DIR}/rubygen.cmake)
  endforeach (spec_file)
  if (regen_amqp)
    message(STATUS "Regenerating AMQP protocol sources")
    execute_process(COMMAND ${RUBY_EXECUTABLE} -I ${rgen_dir} ${rgen_dir}/generate gen
                           ${specs} all ${CMAKE_CURRENT_SOURCE_DIR}/rubygen.cmake
                           WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR})
  else (regen_amqp)
    message(STATUS "No need to generate AMQP protocol sources")
  endif (regen_amqp)
else (EXISTS ${AMQP_SPEC})
  message(STATUS "No AMQP spec... won't generate sources")
endif (EXISTS ${AMQP_SPEC})

# Pull in the names of the generated files, i.e. ${rgen_framing_srcs}
include (rubygen.cmake)

With the code generation issue resolved, I was able to get the rest of the project building on both Linux and Windows without much trouble. The cmake@cmake.org email list was very helpful when questions came up.

The remaining area that was not real clear for a newbie was how best to handle building optional features. Where the original autoconf script tried to build as much as possible without the user specifying, I put in simpler CMake language to let the user select options, try the configure, and adjust settings if a feature (such as SSL libraries) was not available. This took away a convenient feature for building as much as possible without user intervention, though with CMake’s ability to very easily adjust the settings and re-run the configure, I didn’t think this was much of a loss.

Shortly after I got the first set of CMakeLists.txt files checked into the Qpid subversion repository, other team members started iterating on the initial CMake-based build. Andrew Stitcher from Red Hat quickly zeroed in on the removed capability to build as much as possible without user intervention. He developed a creative approach to setting the CMake defaults in the cache based on some initial system checks. For example, this is the code that sets up the SSL-enabling default based on whether or not the required capability is available on the build system (note this code is covered by the Apache license):

# Optional SSL/TLS support. Requires Netscape Portable Runtime on Linux.

include(FindPkgConfig)

# According to some cmake docs this is not a reliable way to detect
# pkg-configed libraries, but it's no worse than what we did under
# autotools
pkg_check_modules(NSS nss)

set (ssl_default ${ssl_force})
if (CMAKE_SYSTEM_NAME STREQUAL Windows)
else (CMAKE_SYSTEM_NAME STREQUAL Windows)
  if (NSS_FOUND)
    set (ssl_default ON)
  endif (NSS_FOUND)
endif (CMAKE_SYSTEM_NAME STREQUAL Windows)

option(BUILD_SSL "Build with support for SSL" ${ssl_default})
if (BUILD_SSL)

  if (NOT NSS_FOUND)
    message(FATAL_ERROR "nss/nspr not found, required for ssl support")
  endif (NOT NSS_FOUND)

  foreach(f ${NSS_CFLAGS})
    set (NSS_COMPILE_FLAGS "${NSS_COMPILE_FLAGS} ${f}")
  endforeach(f)

  foreach(f ${NSS_LDFLAGS})
    set (NSS_LINK_FLAGS "${NSS_LINK_FLAGS} ${f}")
  endforeach(f)

  # ... continue to set up the sources and targets to build.
endif (BUILD_SSL)

With that, the Apache Qpid build is going strong with CMake.

During the process I developed a pattern for naming CMake variables that play a part in user configuration and, later, in the code. There are two basic prefixes for cache variables:

  • BUILD_* variables control optional features that the user can build. For example, the SSL section shown above uses BUILD_SSL. Using a common prefix, especially one that collates near the front of the alphabet, puts options that users change most often right at the top of the list, and together.
  • QPID_HAS_* variables note variances about the build system that affect code but not users. For example, is a header file present, or a particular system call.

Future efforts in this area will complete the transition of the test suite to CMake/CTest, which will have the side effect of making it much easier to script the regression tests on Windows. The last area to be addressed will be how downstream packagers make use of the new CMake/CPack system for building RPMs, Windows installers, etc. Stay tuned…

USB pass-thru in RHEL 5 Xen VM doesn’t work; why do I buy support?

February 23, 2009

As part of my efforts to maintain ACE+TAO on LabVIEW RT (with the Pharlap ETS kernel) I have a setup to run the test suite on a National Instruments chassis, driven by the build system on Windows. This arrangement is easily handled by the ACE+TAO build environment, including a mechanism to reboot the NI box when things go wrong. The reboot is triggered by a USB-connected NI USB-6009 device that trips the reset signal on the NI box. It’s very slick and keeps me from having to cycle power. The hitch is that it requires a USB 2.0 connection from the Windows machine.

In the past I’d used a VMware virtual machine hosted on Linux (RHEL 4) running a Windows guest OS to host this test environment. The VMware software passed the USB device through to the Windows VM without a hitch. However, over the winter I got a new machine set up with a great deal more capacity and decided to move the LabVIEW RT test environment to the new machine which runs RHEL 5 and Xen.

And that’s when the trouble started…

First I had to search quite a bit to find out how to configure the Xen VM to pass the USB device to the guest OS. After a bit of googling and reading, I found the magic configuration lines to add. I also found another blog entry (http://www.olivetalks.com/2008/02/03/usb-forwarding-on-xen-it-just-does-not-work/) saying it wouldn’t work right. But I forged on, confident that even if it didn’t work “out of the box” I had purchased support from Red Hat and could get any help I needed.

Well, long story short, the USB device didn’t pass through correctly from Xen. On December 9, 2008 I opened a support case with Red Hat to have them do whatever was needed to make it work. After twelve (12) exchanges over 22 days, I requested escalation to someone who could do more to help than quote manual sections that were not applicable to what I needed.

After 11 more exchanges with 3 more support engineers over another 49 days I got the long-awaited answer: “It doesn’t work.”

Well, I wasn’t totally surprised since I had no success and had already seen a blog posting saying it won’t work. But I was still clinging to hope that my support contract would come through and Red Hat would make it work. Nope. Sorry. It don’t work. End of story.

So why do I buy support? Sure, I get all the updates, but I paid extra for someone to actually work on problems for me and all I get is “It doesn’t work.”? When my customers raise issues about ACE not working, they get fixes. Solutions. You know, like they paid for.

Apparently, solutions are optional for other providers.

So what happened in the end? I went back to running the Windows VM in a VMware environment, where it’s happily chugging along.

My Experiences Porting Apache Qpid C++ to Windows

January 9, 2009

I recently finished (modulo some capabilities that should be added) porting Apache Qpid’s C++ implementation to Microsoft Windows. Apache Qpid also sports a Java broker and client as well as Python, Ruby, C#, and .NET clients. For my customer’s project I needed C++, which had, to date, been developed and used primarily on Linux. What I thought would be a slam-dunk 20-40 hour piece of work took about 4 months and hundreds of hours. Fortunately, my customer’s projects waiting for this completion were also delayed and my customer was very accommodating. Still, since I goofed the estimate so wildly I only billed the customer a fraction of the hours I worked. Fortunately, I rarely goof estimates that badly. This post-project review takes a look at what went wrong and why it ended up a very good thing.

When I first looked through the Qpid code base, I got some initial impressions:

  • It’s nicely layered, which will make life easy
  • It’s hard to find one’s way around it
  • The I/O layer (at the bottom of the stack) is easily modified for what I need

The first two impressions held; the third did not. Most of the troubles and false starts had to do with the I/O layer at the bottom of the stack. Most of the rest of the code ported over with relative ease. The original authors did a very nice job isolating code that was likely to need varying implementations. Those areas generally use the Bridge pattern to offer a uniform API that’s implemented differently as needed.

The general areas I had to work on for the port are described below.

Synchronization

Qpid uses multiple threads, no big surprise for a high-performance networked system, so there's of course a need for synchronization objects (mutexes, condition variables, etc.). The existing C++ code had nice wrapper classes and a Pthreads implementation. The options for completing the Windows implementation were:

  • Use native Windows system calls
  • ACE (relatively obvious for me)
  • Apache Portable Runtime (it’s an Apache project after all)
  • Boost (Qpid already made use of Boost in many other areas)

Windows system calls were ruled out fairly quickly: they don't offer everything that was needed on XP (condition variables, in particular), and the interaction between the date-time handling in the existing threading/synchronization classes and Windows system time was very clunky.

I was hesitant to add ACE as an outside requirement just for the Windows port. I was also sensitive to the fact that as a newbie on this particular project I could be seen as simply trying to wedge ACE in for my own sinister purposes (which is definitely not the case!). So scratch that one.

After a brief but unsuccessful attempt at APR (and being told that some previous APR use had been abandoned), I settled on Boost. This was my first project using Boost and it took some getting used to, but overall it went pretty smoothly.

Thread Management

The code that actually spawned and managed threads was easily implemented using native Windows system calls. Straightforward and easy.

I/O

This is where all the action is. The existing code comments (there aren't many, but what's there is descriptive) talked about “Asynch I/O.” This was welcome, since I planned to use overlapped I/O to get high throughput; Windows' implementation of asynchronous I/O (overlapped, in Windows parlance) is very good, scaling and performing well. The interface to the I/O layer from the upper level in Qpid looked suitable for asynchronous I/O, and I got a little overconfident. In retrospect, the name of the event dispatcher class (Poller) should have tipped me off that some difficulty lay ahead.

The Linux code's Poller implementation uses Linux epoll to get high performance and remain very scalable. The code is solid, well designed, and well implemented. However, it is event-driven, synchronous I/O, and that model shows through the interface a bit more than perhaps intended. Handles need to be registered with the Poller, for example, something that's not done with overlapped I/O.

My first attempt at the Windows version of a Poller implementation was large and disruptive. Fortunately, once I offered it up for review I received a huge amount of help from the I/O layer expert on the project. He and I sat down for a morning to review the design, the code I had come up with, and the best way forward. The people I've worked with on Apache Qpid are consummate professionals and I'm very thankful for their input and guidance.

My second design for the I/O layer went much better. It doesn't interfere with the Linux code and slides in nicely with very little code duplication. After another port or two where more of these implementations need to be designed, it may be possible to refactor some of the I/O layer to make things a bit cleaner, but that's very minor at this point: the code works very well and integrates without disruption.

Lessons Learned

So what did I learn from this?

  1. It pays to spend a little extra time reading the existing code before designing extensions. Even when it looks pretty straightforward. Even if you have to write some design documentation and run it by the original author(s).
  2. Forge good relationships with the other people on the team. This is an easy one when you all work in the same group, even in the same building. It’s more often assumed to be difficult at best when the group is spread around the world and across (sometimes competing) companies. It’s worth the effort.

So although the project took far longer than I originally estimated, the result is a good implementation that fits with the rest of the system and performs well. I could have wedged in my original bad design in far less time, but someone would have had to pick up the pieces later. The design constraints and rules that previously went unwritten are now at least partially written down (in the Windows code, anyway). If I do another port, it'll be much smoother next time.

Where to Go From Here?

There are a few difficulties remaining for the Windows port and a few capabilities that should be added:

  • Keep project files synched with generated code. The Qpid project's build process generates a lot of code from XML protocol specifications. This is very nice, but it runs into trouble keeping the Visual Studio project files up to date as the set of generated files changes. I've been using the MPC tool to generate Visual Studio projects, and MPC can pick up names by wildcard, but that still leaves an extra step: generate C++ code, then regenerate the project files. This has caused a couple of hiccups during the Qpid M4 release process where I had to regenerate the project files. It would be nice if Visual Studio's C++ build could deal with wildcards, or if the C++ project file format allowed inclusion of an outside file listing sources (which could be generated along with the C++ code).
  • Add SSL support. The Linux code uses OpenSSL. I’d rather not pull in another prerequisite when Windows knows how to do SSL already. At least I assume it does, and in a way that doesn’t require an HTTP connection to use. I’ve yet to figure this out…
  • Persistent storage for messages. AMQP (and Qpid) allows for persistent message store, guaranteeing message delivery in the face of failures. There’s not yet a store for Qpid on Windows, and it should be added.
  • Add the needed “declspec”s to build the Qpid libraries as DLLs; currently they’re built as static libraries.
  • Minor tweaks making Qpid integrate better with Windows, such as a message definition for storing diagnostics in the event log and being able to set the broker up as a Windows Service.

Apache Qpid graduates incubator; now a top-level project

December 11, 2008

The Apache Qpid project has been in incubation at the Apache Software Foundation for quite a while now, having delivered at least 3 releases of Apache Qpid. Recently the Apache Software Foundation board of directors voted to graduate the project from the incubator as a new top-level project (TLP) at Apache. This is a major milestone for Qpid and is based on:

  • A proven ability to manage and coordinate development and to release a product
  • A community of developers with sufficient diversity

I joined the Apache Qpid project this past summer, primarily to lead the port to Windows. I’ve been impressed with the development team’s professionalism, experience, and commitment to quality.

Congratulations to the Apache Qpid team on this great accomplishment!