Archive for the ‘tcp/ip’ Category

Book Review: “Boost.ASIO C++ Network Programming” by John Torjo

April 20, 2013


I make a living doing network programming, so I was very interested to review a new book, Boost.ASIO C++ Network Programming by John Torjo, published by Packt Publishing. On a scale of 1-10 I give it a solid 6. The good news is that I haven’t seen anything written about Boost.ASIO that’s better, so if you are interested in learning more about Boost.ASIO, I recommend you buy this book. I’ll explain further.

My first impression when starting to read this book is that it’s poorly edited. It feels like it was written in a rush and produced in an even bigger rush. There are spelling errors, grammar errors, and it just feels second rate. As I made my way through the book it didn’t get any better. Don’t let this dissuade you from learning from the real content, though.

Technically speaking Torjo attempts to do a few big things:

  • Explain use of Boost.ASIO
  • Explain network programming
  • Explain synchronous vs. asynchronous programming paradigms

If you are not already fluent in the general network programming concepts, I recommend you purchase other books (Stevens, Schmidt, Schmidt [disclaimer: I co-authored this]) as a prerequisite to Boost.ASIO. Torjo succeeds fairly well at the first point, though, which is the primary purpose of the book. You should also have a passing familiarity with other Boost classes. While Boost.ASIO is not template-heavy like some other Boost areas, shared pointers and function binding are used well in the examples.

The book explains the use of the io_service, resolver, endpoint, and socket classes fairly well. Torjo hand-waves a bit about the relationship of io_service and socket, but once you’ve become accustomed to programming with Boost.ASIO, that becomes second nature. The book includes many well-explained examples, which I really appreciate. If you follow the examples you will end up with working code.

The book at one point claims to be useful as a reference to return to over and over for details. While it is certainly not a Boost.ASIO reference manual (you should bookmark the Boost docs for that), it is a very useful book for explaining how to make good use of this very flexible and useful class library for network programming.

Trouble with ACE and IPv6? Make Sure Your Config is Consistent

July 2, 2010

I just spent about 5 hours over the week debugging a stack corruption problem. The MSVC debugger was politely telling me the stack was corrupted in the area of an ACE_INET_Addr object instantiated on the stack. But all I did was create it then return from the method. So the problem had to be localized pretty well. Also, I was seeing the problem but none of the students in the class I was working this example for saw the problem. So it was somehow specific to my ACE build.

I stepped through the ACE_INET_Addr constructor and watched it clear the contents of the internal sockaddr to zeroes. Fine. I noted it was clearing 28 bytes and setting address family 23. “IPv6. Ok,” I thought. But I knew the stack was being scribbled on outside the bounds of that ACE_INET_Addr object. I checked to see if ACE somehow had a bad definition of sockaddr_in6. After rummaging around ACE and Windows SDK headers I was pretty sure that wasn’t it. But there was definitely some confusion on the size of what needed to be cleared.

If you haven’t looked at the ACE_INET_Addr internals (and, really, why would you?), when ACE is built with IPv6 support (the ACE_HAS_IPV6 setting) the internal sockaddr is a union of sockaddr_in and sockaddr_in6 so both IPv4 and IPv6 can be supported. The debugger inside the ACE_INET_Addr constructor was showing me both v4 and v6 members of the union. But as I stepped out of the ACE_INET_Addr constructor back to the example application, the debugger changed to only showing the IPv4 part. Hmmm… why is that? The object back in the example is suddenly looking smaller (since the sockaddr_in6 structure is larger than the sockaddr_in structure, the union gets smaller when you leave out the sockaddr_in6). Ok, so now I know why the stack is getting smashed… I’m passing an IPv4-only ACE_INET_Addr object to a method that thinks it’s getting an IPv4-or-IPv6 object, which is larger. But why?

I checked my $ACE_ROOT/ace/config.h since that’s where ACE config settings usually are. No ACE_HAS_IPV6 setting there. Did the ACE-supplied Windows configs add it in somewhere sneakily? Nope. I checked the ACE.vcproj file ACE was built with. Ah-ha… in the compile preprocessor settings there it is – ACE_HAS_IPV6.

AAAAARRRRRGGGGGGG!!!!! Now I remember where it came from. IPv6 support is turned on/off in the MPC-generated Visual Studio projects using an MPC feature setting, ipv6=1 (this is because some parts of ACE and tests aren’t included without the ipv6 feature). When I generated the ACE projects that setting was used, but when I generated the example program’s projects it wasn’t. So the uses of ACE_INET_Addr in the example had only the IPv4 support, but were passed to an ACE build that was expecting both IPv4 and IPv6 support – a larger object.

Solution? Regenerate the example’s projects with the same MPC feature file ACE’s projects were generated with. That made all the settings consistent between ACE and my example programs. No more stack scribbling.
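
The size mismatch is easy to reproduce in miniature. The sketch below is not ACE code: the two stand-in structs only mimic the 16-byte and 28-byte sizes of the IPv4 and IPv6 socket address structures, and the two namespaces stand in for two builds compiled with and without the IPv6 setting. All names here are made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Stand-ins with the sizes discussed in the post: 16 bytes for an IPv4
// socket address, 28 bytes for an IPv6 one. These are NOT the real
// sockaddr_in / sockaddr_in6 definitions.
struct FakeSockaddrIn  { std::uint8_t bytes[16]; };
struct FakeSockaddrIn6 { std::uint8_t bytes[28]; };

// Layout seen by code compiled WITHOUT the IPv6 setting (the example program).
namespace built_without_ipv6 {
union AddrStorage {
    FakeSockaddrIn in4;
};
}  // namespace built_without_ipv6

// Layout seen by code compiled WITH the IPv6 setting (the library).
namespace built_with_ipv6 {
union AddrStorage {
    FakeSockaddrIn  in4;
    FakeSockaddrIn6 in6;
};
}  // namespace built_with_ipv6

// Number of bytes the library would write past the end of an object that
// the caller allocated using the smaller layout.
constexpr std::size_t overrun_bytes() {
    return sizeof(built_with_ipv6::AddrStorage) -
           sizeof(built_without_ipv6::AddrStorage);
}

// Compile-time confirmation that the two "builds" disagree about the size.
static_assert(sizeof(built_without_ipv6::AddrStorage) == 16, "v4-only layout");
static_assert(sizeof(built_with_ipv6::AddrStorage) == 28, "v4-plus-v6 layout");
```

With these stand-ins, a constructor compiled against the larger layout would clear 12 bytes beyond what the caller reserved on its stack, which is the kind of scribbling the debugger was complaining about.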

Resolving the CPU-bound ACE_Dev_Poll_Reactor Problem, and more

February 5, 2010

I previously wrote about improvements to ACE_Dev_Poll_Reactor I made for ACE 5.7. The improvements were important for large-scale uses of ACE_Dev_Poll_Reactor, but introduced a problem where some applications went CPU bound, particularly on CentOS. I have made further improvements in ACE_Dev_Poll_Reactor to resolve the CPU-bound issue as well as to further improve performance. These changes will be in the ACE 5.7.7 micro release; the customer that funded the improvements is running load and performance tests on them now.

Here’s what was changed to improve the performance:

  • Change the notify handler so it’s not suspended/resumed around callbacks like normal event handlers are.
  • Delay resuming an auto-suspended handle until the next call to epoll_wait().

I’ll describe more about each point separately.

Don’t Suspend/Resume the Notify Handler

All of the Reactor implementations in ACE have an event handler that responds to reactor notifications. Most of the implementations (such as ACE_Select_Reactor and ACE_TP_Reactor) pay special attention to the notify handler when dispatching events because notifications are always dispatched before I/O events. However, the ACE_Dev_Poll_Reactor does not make the same effort to dispatch notifications before I/O; they’re intermixed as the epoll facility dequeues events in response to epoll_wait() calls. Thus, there was little special-cased code for the notify handler when event dispatching happened. When event handler dispatching was changed to automatically suspend and resume handlers around upcalls, the notify handler was also suspended and resumed. This is actually where the CPU-bound issues came in – when the dispatched callback returned to the reactor, the dispatching thread needed to reacquire the reactor token so it could change the internal reactor state required to verify the handler and resume it. Acquiring the reactor token can involve a reactor notification if another thread is currently executing the event dispatching loop. (Can you see it coming?) It was possible for the notify handler to be resumed, which caused a notify, which dispatched the notify handler, which required another resume, which caused a notify, which… ad infinitum.

The way I resolved this was to simply not suspend/resume the notify handler. This removed the source of the infinite notifications and CPU times came back down quickly.

Delay Resuming an Auto-Suspended Handle

Before beginning the performance improvement work, I wrote a new test, Reactor_Fairness_Test. This test uses a number of threads to run the reactor event loop and drives traffic at a set of TCP sockets as fast as possible for a fixed period of time. At the end of the time period, the number of data chunks received at each socket is compared; the counts should all be pretty close. I ran this test with ACE_Select_Reactor (one dispatching thread), ACE_TP_Reactor, and ACE_Dev_Poll_Reactor initially. This was important because the initial customer issue I was working on was related to fairness in dispatching events. ACE_Dev_Poll_Reactor’s fairness is very good but the performance needed to go up.

With the notify changes from above, the ACE_Dev_Poll_Reactor performance went up, to slightly better than ACE_TP_Reactor (and the test uses a relatively small number of sockets). However, while examining strace output for the test run I noticed that there were still many notifies causing a lot of event dispatching that was slowing the test down.

As I described above, when the reactor needs to resume a handler after its callback completes, it must acquire the reactor token (the token is released during the event callback to the handler). This often requires a notify, but even when it doesn’t, the dispatching thread needs to wait for the token just to change some state, then release the token, then go around the event processing loop again which requires it to wait for the token again – a lot of token thrashing that would be great to remove.

The plan I settled on was to keep a list of handlers that needed to be resumed; instead of immediately resuming the handler upon return from the upcall, add the handler to the to-be-resumed list. This only requires a mutex instead of the reactor token, so there’s no possibility of triggering another notify, and there’s little contention for the mutex in other parts of the code. The dispatching thread could quickly add the entry to the list and get back in line for dispatching more events.

The second part of the to-be-resumed list is that a thread that is about to call epoll_wait() to get the next event will first (while holding the reactor token it already had in order to get to epoll_wait()) walk the to-be-resumed list and resume any handlers in the list that are still valid (they may have been canceled or explicitly resumed by the application in the meantime).
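
To make the two halves of the to-be-resumed list concrete, here is a minimal sketch of the idea. It is not the ACE_Dev_Poll_Reactor code: the class name, the Handle type, and the "still suspended" set passed to drain() are all assumptions made for illustration. The point is only that adding an entry needs a small mutex, and that the filtering happens later in the thread that is about to wait for events.

```cpp
#include <cassert>
#include <mutex>
#include <set>
#include <vector>

using Handle = int;  // stand-in for an I/O handle

class DeferredResumeList {
public:
    // Called by a dispatching thread when an upcall returns. Only the small
    // list mutex is taken here, never the reactor token.
    void defer(Handle h) {
        assert(h >= 0);  // negative values are treated as invalid handles
        std::lock_guard<std::mutex> guard(lock_);
        pending_.push_back(h);
    }

    // Called by the thread that is about to wait for events. Returns the
    // handles that should really be resumed; entries that were removed or
    // already resumed in the meantime (not in still_suspended) are dropped.
    std::vector<Handle> drain(const std::set<Handle>& still_suspended) {
        std::vector<Handle> batch;
        {
            std::lock_guard<std::mutex> guard(lock_);
            batch.swap(pending_);  // take the whole list in one short lock
        }
        std::vector<Handle> to_resume;
        for (Handle h : batch) {
            if (still_suspended.count(h) != 0) {
                to_resume.push_back(h);
            }
        }
        return to_resume;
    }

private:
    std::mutex lock_;
    std::vector<Handle> pending_;
};
```

A dispatching thread would call defer() after each upcall and go straight back for more events; the thread holding the token would call drain() just before its wait and resume whatever comes back.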

After this improvement was made, my reactor fairness test still showed excellent fairness on the ACE_Dev_Poll_Reactor, but with about twice the throughput and about half the CPU usage. These results come from less-than-scientific measurements with a specific usage pattern – your mileage may vary. But if you’ve been scared away from ACE_Dev_Poll_Reactor by the discussions of CPU-bound applications getting poor performance, it’s time to take another look at ACE_Dev_Poll_Reactor.

How to Use Schannel for SSL Sockets on Windows

December 29, 2009

Did you know that Windows offers a native SSL sockets facility? You’d figure that since IE has SSL support, and MS SQL has SSL support, and .NET has SSL support, that there would be some lower-level SSL support available in Windows. It’s there. Really. But you’ll have a hard time finding any clear, explanatory documentation on it.

I spent most of the past month adding SSL support to Apache Qpid on Windows, both client and broker components. I used a disproportionate amount of that time struggling with making sense of the Schannel API, as it is poorly documented. Some information (such as a nice overview of how to make use of it) is missing, and I’ll cover that here. Other information is flat out wrong in the MSDN docs; I’ll cover some of that in a subsequent post.

I pretty quickly located some info in MSDN with the promising title “Establishing a Secure Connection with Authentication”. I read it and really just didn’t get it. (Of course, now, in hindsight, it looks pretty clear.) Part of my trouble may have been a paradigm expectation. Both OpenSSL and NSS pretty much wrap all of the SSL operations into their own API which takes the place of the plain socket calls. Functions such as connect(), send(), recv() have their SSL-enabled counterparts in OpenSSL and NSS; adding SSL capability to an existing system ends up copying the socket-level code and replacing plain sockets calls with the equivalent SSL calls (yes, there are some other details to take care of, but model-wise, that’s pretty much how it goes).

In Schannel the plain Windows Sockets calls are still used for establishing a connection and transferring data. The SSL support is, conceptually, added as a layer between the Winsock calls and the application’s data handling. The SSL/Schannel layer acts as an intermediary between the application data and the socket, encrypting/decrypting and handling SSL negotiations as needed. The data sent/received on the socket is opaque data either handed to Windows for decrypting or given by Windows after encrypting the normal application-level data. Similarly, SSL negotiation involves passing opaque data to the security context functions in Windows and obeying what those functions say to do: send some bytes to the peer, wait for more bytes from the peer, or both. So to add SSL support to an existing TCP-based application is more like adding a shim that takes care of negotiating the SSL session and encrypting/decrypting data as it passes through the shim.
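
The layering is easier to see in a toy. The sketch below contains no Schannel calls at all: SecureShim, Transform, and toy_xor are hypothetical names, and the XOR transform is a placeholder that is NOT encryption. A real SSL layer is also stateful and may have to buffer partial records before it can hand anything back. What the sketch does show is the shape of the shim: the application still owns the socket calls, and every buffer passes through the shim on its way to send() or on its way back from recv().

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

// Stand-in for "hand these bytes to the security layer and get bytes back".
using Transform = std::function<std::string(const std::string&)>;

class SecureShim {
public:
    SecureShim(Transform protect, Transform unprotect)
        : protect_(std::move(protect)), unprotect_(std::move(unprotect)) {
        assert(protect_ && unprotect_);  // both directions must be provided
    }

    // Application data -> opaque bytes to pass to the plain send() call.
    std::string outbound(const std::string& app_data) const {
        return protect_(app_data);
    }

    // Opaque bytes returned by the plain recv() call -> application data.
    std::string inbound(const std::string& wire_data) const {
        return unprotect_(wire_data);
    }

private:
    Transform protect_;
    Transform unprotect_;
};

// Placeholder transform so the sketch runs. This is NOT encryption.
inline std::string toy_xor(const std::string& in) {
    std::string out = in;
    for (char& c : out) {
        c = static_cast<char>(c ^ 0x5A);
    }
    return out;
}
```

The existing send and receive paths stay where they are; the only change is that they call outbound() before sending and inbound() after receiving.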

The shim approach is pretty much how I added SSL support to the C++ broker and client for Apache Qpid on Windows. Once I got my head around the new paradigm, it wasn’t too hard. Except for the errors and omissions in the encrypt/decrypt API documentation… I’ll cover that shortly.

The SSL support did not get into Qpid 0.6, unfortunately. But it will be in the development stream shortly after 0.6 is released and part of the next release for sure.

Sometimes Using Less Abstraction is Better

March 24, 2009

Last week I was working on getting the Apache Qpid unit tests running on Windows. The unit tests are arranged to take advantage of the fact that the bulk of the Qpid client and broker is built as shared/dynamic libraries. The unit tests invoke capabilities directly in the shared libraries, making it easier to test. Most of the work needed to get these tests built on Windows was taken care of by the effort to build DLLs on Windows. However, there was a small but important piece remaining that posed a challenge.

Being a networked system, Qpid tests need to be sure it correctly handles situations where the network or the network peer fails or acts in some unexpected way. The Qpid unit tests have a useful little class named SocketProxy which sits between the client and broker. SocketProxy relays network traffic in each direction but can also be told to drop pieces of traffic in one or both directions, and can be instructed to drop the socket in one or both directions. Getting this SocketProxy class to run on Windows was a challenge. SocketProxy uses the Qpid common Poller class to know when network data is available in one or both directions, then directly performs the socket recv() and send() as needed. This use of Poller, ironically, was what caused me problems. Although the Windows port includes an implementation of Poller, it doesn’t work in the same fashion as the Linux implementation.

In Qpid proper, the Poller class is designed to work in concert with the AsynchIO class; Poller detects and multiplexes events and AsynchIO performs I/O. The upper level frame handling in Qpid interacts primarily with the AsynchIO class. Below that interface there’s a bit of difference from Linux to Windows. On Linux, Poller indicates when a socket is ready, then AsynchIO performs the I/O and hands the data up to the next layer. However, the Windows port uses overlapped I/O and an I/O completion port; AsynchIO initiates I/O, Poller indicates completions (rather than I/O ready-to-start), and AsynchIO gets control to hand the resulting data to the next layer. So, the interface between the frame handling and I/O layers in Qpid is the same for all platforms, but the way that Poller and AsynchIO interact can vary between platforms as needed.

My initial plan for SocketProxy was to take it up a level, abstraction-wise. After all, abstracting away behavior is often a good way to make better use of existing, known-to-work code, and avoid complexities. So my first approach was to replace SocketProxy’s direct event-handling code and socket send/recv operations with use of the AsynchIO and Poller combination that is used in Qpid proper.

The AsynchIO-Poller arrangement’s design and place in Qpid involves some dynamic allocation and release of memory related to sockets, and a nice mechanism to do orderly cleanup of sockets regardless of which end initiates the socket close. Ironically, it is this nice cleanup arrangement which tanked its use in the SocketProxy case. Recall that SocketProxy’s usefulness is its ability to interrupt sockets in messy ways, but not be messy itself in terms of leaking handles and memory. My efforts to get AsynchIO and Poller going in SocketProxy resulted in memory leaks, sockets not getting interrupted as abruptly as needed for the test, and connections not getting closed properly. It was a mess.

The solution? Rather than go up a level of abstraction, go down. Use the least common denominator for what’s needed in a very limited use case. I used select() and fd_set. This is just what I advise customers not to do. Did I lose my mind? Sell out to time pressure? No. In this case, using less abstraction was the correct approach – I just didn’t recognize it immediately.
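
For a sense of how little code the low-level approach needs, here is a rough sketch of a two-socket relay step built on select() and fd_set. It is not the Qpid SocketProxy source: relay_once and forward_chunk are names I made up, the code assumes a POSIX system (the Windows version would use the Winsock equivalents), and it leaves out the drop-traffic and drop-connection controls the real class offers.

```cpp
#include <algorithm>
#include <cassert>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>

// Copies whatever is currently readable on `from` to `to`.
// Returns the number of bytes forwarded, 0 on EOF, or -1 on error.
inline ssize_t forward_chunk(int from, int to) {
    char buf[4096];
    ssize_t n = recv(from, buf, sizeof buf, 0);
    if (n <= 0) {
        return n;
    }
    ssize_t sent = 0;
    while (sent < n) {
        ssize_t w = send(to, buf + sent, static_cast<size_t>(n - sent), 0);
        if (w < 0) {
            return -1;
        }
        sent += w;
    }
    return sent;
}

// Waits up to timeout_ms for either socket to become readable, then relays
// one chunk in each ready direction. Returns the total bytes relayed
// (0 on timeout or EOF) or -1 on error.
inline ssize_t relay_once(int client_fd, int broker_fd, int timeout_ms) {
    assert(client_fd >= 0 && client_fd < FD_SETSIZE);
    assert(broker_fd >= 0 && broker_fd < FD_SETSIZE);

    fd_set readable;
    FD_ZERO(&readable);
    FD_SET(client_fd, &readable);
    FD_SET(broker_fd, &readable);

    timeval tv;
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    int ready = select(std::max(client_fd, broker_fd) + 1,
                       &readable, nullptr, nullptr, &tv);
    if (ready < 0) {
        return -1;
    }

    ssize_t total = 0;
    if (FD_ISSET(client_fd, &readable)) {
        ssize_t n = forward_chunk(client_fd, broker_fd);
        if (n < 0) return -1;
        total += n;
    }
    if (FD_ISSET(broker_fd, &readable)) {
        ssize_t n = forward_chunk(broker_fd, client_fd);
        if (n < 0) return -1;
        total += n;
    }
    return total;
}
```

A proxy thread would simply call relay_once() in a loop until told to stop. With only two sockets and no performance requirement, the usual objections to select() don’t carry much weight here.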

So what made this situation different from “normal”? Why was it a proper place to use less abstraction?

  • The use case is odd. Poller and AsynchIO are very well designed for running the I/O activities in Qpid, correctly handling all socket activity quickly and efficiently. They’re not designed to force failures, and that’s what was needed. It makes no sense to redesign foundational classes in order to make a test harness more elegant.
  • The use is out of the way. It’s a test harness, not the code that has to be maintained and relied on for efficient, correct performance in deployed environments.
  • Its needs are limited and isolated. SocketProxy handles only two sockets at a time. Performance is not an issue.

Sometimes less is more – it works in abstractions too. The key is to know when it really is best.

Analysis of ACE_Proactor Shortcomings on UNIX

January 22, 2009

I’ve been looking into two related issues in the ACE development stream:

  1. SSL_Asynch_Stream_Test times out on HP-UX (I recently made a bunch of fixes to the test itself so it runs as well as can be on Linux, but times out on HP-UX)
  2. Proactor_Test shows a stray, intermittent diagnostic on HP-UX: EINVAL returned from aio_suspend()

Although I’ve previously discussed use of ACE_Proactor on Linux (https://stevehuston.wordpress.com/2008/11/25/when-is-it-ok-to-use-ace-proactor-on-linux/) the issues on HP-UX are of a different sort. If the previously discussed Linux aio issues are resolved inside Linux, the same problem I’m seeing on HP-UX may also arise, but it doesn’t get that far. Also, I suspect that the issues arising from these tests’ execution on Solaris are of the same nature, though the symptoms are a bit different.

The symptoms are that the proactor event loop either fails to detect completions, or it gets random errors that smell like the aiocb list is damaged. I believe I’ve got a decent idea of what’s going on, and it’s basically two issues:

  1. If all of the completion dispatch threads are blocked waiting for completions when new I/O is initiated, the new operation(s) are not taken into account by the threads waiting for completions. This is basically the case in the SSL_Asynch_Stream_Test timeout on HP-UX – all the completion-detecting threads are already running before any I/O is initiated and no completions are ever detected.
  2. The completion and initiation activities modify the aiocb list used to detect completions directly, without interlocks, and without consideration of what effect it may have (or not) on the threads waiting for completions.

The ACE_Reactor framework uses internal notifications to handle the need to unblock waiting demultiplexing threads so they can re-examine the handle set as needed; something similar is needed for the ACE_Proactor to remedy issue #1 above. There is a notification pipe facility in the proactor code, but I need to see if it can be used in this case. I hope so…
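
For readers who haven’t seen the notification technique, here is a minimal sketch of the general idea using a pipe and poll(). It is neither the ACE_Reactor code nor a fix for the POSIX Proactor (which waits in aio_suspend(), not poll()); the function names and the Wake enum are assumptions for illustration. It only shows how a thread blocked waiting for I/O can be woken by another thread so it can re-examine what it should be waiting for.

```cpp
#include <cassert>
#include <poll.h>
#include <sys/types.h>
#include <unistd.h>

enum class Wake { Timeout, Notified, IoReady, Error };

// Blocks until io_fd is readable, a notification byte arrives on
// notify_read_fd, or timeout_ms elapses. Notifications are checked first.
inline Wake wait_for_work(int io_fd, int notify_read_fd, int timeout_ms) {
    assert(io_fd >= 0 && notify_read_fd >= 0);

    pollfd fds[2];
    fds[0].fd = notify_read_fd;
    fds[0].events = POLLIN;
    fds[0].revents = 0;
    fds[1].fd = io_fd;
    fds[1].events = POLLIN;
    fds[1].revents = 0;

    int rc = poll(fds, 2, timeout_ms);
    if (rc < 0) {
        return Wake::Error;
    }
    if (rc == 0) {
        return Wake::Timeout;
    }
    if (fds[0].revents & POLLIN) {
        char byte;
        ssize_t n = read(notify_read_fd, &byte, 1);  // consume one notification
        return n == 1 ? Wake::Notified : Wake::Error;
    }
    return Wake::IoReady;
}

// Called by a thread that has changed the set of things to wait for.
inline bool notify(int notify_write_fd) {
    const char byte = 'n';
    return write(notify_write_fd, &byte, 1) == 1;
}
```

On a Notified result the waiting thread would rebuild its wait set and go back to waiting, which is the behavior the proactor’s completion threads currently lack when new I/O is started behind their backs.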

The other problem, concurrent access to the aiocb list by threads both waiting for completions and modifying the list, is a much larger one. That requires more of a fundamental change in the innards of the POSIX Proactor implementation.

Note that there are a number of POSIX Proactor flavors inside ACE (section 8.5 in C++NPv2 describes most of them). The particular shortcomings I’ve noted here only affect the ACE_POSIX_AIOCB_Proactor and ACE_POSIX_SIG_Proactor, which is largely based on the ACE_POSIX_AIOCB_Proactor. The newest one, ACE_POSIX_CB_Proactor, is much less affected, but is not as widely available.

So, the Proactor situation on UNIX platforms is generally not too good for demanding applications. Again, Proactor on Windows is very good, and recommended for high-performance, highly scalable networked applications. On Linux, stick to ACE_Reactor using the ACE_Dev_Poll_Reactor implementation; on other systems, stick with ACE_Reactor and ACE_Select_Reactor or ACE_TP_Reactor depending on your need for multithreaded dispatching.

There’s No Substitute for Experience with TCP/IP Sockets

December 31, 2008

The number of software development tools and aids available to us as we begin 2009 is staggering. IDEs, code generators, component and class libraries, design and modeling tools, high-level protocols, etc. were just speculation and dreams when I began working with TCP/IP in 1985. TCP and IP were not yet even approved MIL-STDs and the company I worked for had to get US Department of Defense permission to connect to the fledgling Internet. The “Web” was still 10 years away. If you wanted to use TCP/IP for much more than FTP, Telnet, or email you had to write the protocol and the code to run it yourself. The Sockets API was the highest level access we had at the time. That is a whole area of difficulty and complexity in and of itself, which C++ Network Programming addresses. But the API is more of a usage and programming efficiency issue – what I’m talking about today is the necessity of experience and understanding what’s going on between the API and the wire when working with TCP/IP, regardless of the toolkit or language or API you use.

A lot of software running on what many people consider “the net” piggy-backs on HTTP in one form or another. There are also many helpful libraries, such as .NET and ACE, to assist in writing networked applications at a number of levels. More specific problem areas also have very useful targeted solutions, such as message queuing systems and Apache Qpid. And, like most programming tasks, when everything’s ideal, it’s not too hard to get some code running pretty well. It’s when things don’t work as well as you planned that the way forward becomes murky. That’s when experience is needed. These are some examples of issues I’ve assisted people with lately:

  1. Streaming data transfer would periodically stop for 200 msec, then resume
  2. Character strings transferred would intermittently be bunched together or split apart
  3. Asynchronous I/O-based code stopped working when ported to Linux

The tendency when a problem such as this comes up is to find out who, or what, is to blame. In my experience, the first attempt at blame is usually laid on the most recent addition to the programming toolset – the piece trusted the least and that’s usually closest to the application being written. For ACE programs, this is usually why I get involved so early.

I’ve spent many years debugging applications and network protocol code. I spent way too much time trying to blame the layer below me, or the OS, or the hardware. The biggest lesson I learned is that when something goes wrong with code I wrote, it’s usually my problem and it’s usually a matter of some concept or facility I don’t understand enough to see the problem clearly or find the way to a solution. That’s why it’s so important to understand the features and functionality you are making use of – there’s no substitute for experience.

Helping my clients solve the three problems I mentioned above involved experience. Knowing where to target further diagnosis and gathering the right information made the difference between solving the problem that day and staring at code for days wondering what’s going on. Curious about what the problems were?

  1. Slow-start peculiarity on the receiver; disable Nagle’s algorithm on the receiving side.
  2. That’s the streaming nature of TCP. Need to mark the string boundaries and check for them on receive.
  3. Linux silently converts asynchronous socket I/O operations to synchronous and executes them in order; need to restructure order of operations in a very restricted way, or switch paradigm on Linux.
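
The second item, marking string boundaries, is the one I end up explaining most often, so here is one common way to do it: prefix each string with its length. This is a sketch of the general technique, not the client’s code; the 4-byte big-endian header and the names frame and FrameReader are my own choices for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Sender side: prefix the message with its length as 4 big-endian bytes.
inline std::string frame(const std::string& msg) {
    assert(msg.size() <= 0xFFFFFFFFu);  // length must fit in the header
    const std::uint32_t n = static_cast<std::uint32_t>(msg.size());
    std::string out;
    out.push_back(static_cast<char>((n >> 24) & 0xFF));
    out.push_back(static_cast<char>((n >> 16) & 0xFF));
    out.push_back(static_cast<char>((n >> 8) & 0xFF));
    out.push_back(static_cast<char>(n & 0xFF));
    out += msg;
    return out;
}

// Receiver side: feed it whatever each receive call returned, however the
// stream happened to chunk it, and get back the complete messages so far.
class FrameReader {
public:
    std::vector<std::string> feed(const std::string& chunk) {
        buffer_ += chunk;
        std::vector<std::string> messages;
        while (buffer_.size() >= 4) {
            const std::uint32_t n = (byte_at(0) << 24) | (byte_at(1) << 16) |
                                    (byte_at(2) << 8) | byte_at(3);
            if (buffer_.size() - 4 < n) {
                break;  // body not fully received yet; wait for more bytes
            }
            messages.push_back(buffer_.substr(4, n));
            buffer_.erase(0, 4 + static_cast<std::size_t>(n));
        }
        return messages;
    }

private:
    std::uint32_t byte_at(std::size_t i) const {
        return static_cast<std::uint32_t>(
            static_cast<unsigned char>(buffer_[i]));
    }

    std::string buffer_;
};
```

Whether two strings arrive bunched together in one read or a single string arrives split across several, the reader hands back whole strings and keeps any leftover bytes for the next call.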

Although each client initially targeted blame at the same place, the real solution was in a different layer for each case. And none involved the software initially thought to be at fault.

When you are ready to begin developing a new networked application, or you’re spending yet another night staring at code and network traces, remember: there’s a good chance you need a little more clarity on something. Take a step back, assume the tools you’re using are probably correct, and begin to challenge your assumptions about how you think it’s all supposed to work. A little more understanding and experience will make it clear.