Winsock Programmer's FAQ
Section 4: Advanced Winsock Issues

4.1 - How can I open a raw data socket?

Under Winsock 1.1, the SOCK_RAW socket type is optional. Some of the non-Microsoft stacks implemented it, but these implementations are essentially extinct. SOCK_RAW in Winsock 1.1 is also problematic because the Winsock spec's writers did not try to rigorously define what we should expect from a SOCK_RAW implementation.

The Winsock 2 spec gives more details about raw sockets, and Microsoft's Winsock 2 stacks do implement some types of raw sockets. Windows 2000 has by far the best implementation of raw sockets; details below.

On all other platforms, raw socket support is fairly sparse: Microsoft only supports raw IGMP and ICMP sockets on these platforms. The latter allows you to send "ping" packets in a standard way. These stacks do not support raw IP or "packet capturing" from the Winsock layer. (See the next two questions for information on capturing packets and changing packet headers.)

If you really must have complete raw sockets support and can't use Windows 2000, you might think about a platform change. Most flavors of Unix (including the free BSD flavors and Linux) have good raw socket support.

Available raw sockets support in Microsoft stacks:

Feature          Winsock 1.1 (all platforms)   Win9x with Winsock 2   WinNT 4.0   Windows 2000
Raw ICMP/IGMP    No                            Yes                    Yes         Yes
IP_HDRINCL       No                            No                     No          Yes
Raw TCP/UDP      No                            No                     No          No

Notice that raw TCP and UDP aren't possible directly under Winsock 2. Instead, you must use IP_HDRINCL (a.k.a. raw IP) and build your own IP and TCP or UDP headers.

Under Windows NT and Windows 2000, only members of the Administrators group can open raw sockets.

4.2 - How can I capture packets on a LAN with Winsock?

Winsock does not allow promiscuous IP packet captures. To get at raw packet data, you have to bypass Winsock and talk to the Transport Driver Interface (TDI) or Network Driver Interface Specification (NDIS) layers. The TDI layer is just above the system's NDIS (network driver) layer.

Some of the Windows packet sniffers in the FAQ's debugging resources section include source, which you could pick apart to figure out how this works. Probably the easiest one to work with is WinDump, because its capture code is separated into a free library called WinPCap. If you're familiar with the Unix libpcap mechanism, you should be able to pick up WinPCap quickly. For a second example of a program that uses WinPCap, see Ethereal.

If you want to roll your own TDI or NDIS code, PCAUSA sells a package that is supposed to make writing to these layers easier. I have not tried this product, so I can't say how well it works. They also have several FAQs that talk about various low-level network stack access methods. These FAQs also point you to various bits of sample code, most of it from Microsoft's various DDKs.

If all you want is a way to help you debug your Winsock program by showing you what is happening on the network, the Resources section mentioned above has links to many capable network sniffers and other debugging tools.

4.3 - How can I change the IP or TCP header of a packet?

Windows 2000 can do this with raw sockets; no other Microsoft stack can. The other stacks do let you set a few IP header fields with setsockopt() and/or ioctlsocket(). One such field is TTL.
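
As a minimal sketch of the setsockopt() route, the following snippet sets the TTL on an ordinary UDP socket and reads it back. It is written in Python only for brevity; Python's socket module is a thin wrapper over the platform's socket API (Winsock on Windows), so the same option names apply in C. The value 32 is an arbitrary choice for illustration.

```python
import socket

# Create an ordinary UDP socket; no raw socket or admin rights are needed.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

# Ask the stack to stamp outgoing packets from this socket with TTL 32.
# (32 is an arbitrary illustrative value, not a recommendation.)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TTL, 32)

# Read the option back to confirm the stack accepted it.
ttl = sock.getsockopt(socket.IPPROTO_IP, socket.IP_TTL)
sock.close()
```

Whether a given stack honors a particular header option is something you have to test on that stack; a failed setsockopt() call raises an error rather than silently ignoring the request.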

If that is not enough control for your application, you will have to resort to lower-level techniques. One of these is to add a layer to the network stack with Winsock 2's Layered Service Provider mechanism. That mechanism is not covered in this FAQ, but there is some useful code and documentation on the MSDN site and disks.

Another option is to do raw data I/O using the Transport Driver Interface (TDI) or the Network Driver Interface Specification (NDIS). Further information is available in PCAUSA's FAQs.

Also, don't rule out the option of building your application on a platform that does have easy access to the packet headers. Most Unix flavors (including Linux) offer copious tools for low-level network I/O. For information on raw network programming on Unixlike platforms, see Thamer Al-Herbish's Raw IP Networking FAQ.

4.4 - How can I "ping" another machine with Winsock?

The "official" method uses the IPPROTO_ICMP raw socket type defined by Winsock 2. All of Microsoft's Winsock 2 stacks support this. I suspect that this also works on some non-Microsoft Winsock 1.1 stacks, but I don't personally know of any such success. [C++ example]
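
To give a feel for what the raw socket method involves, here is a rough sketch in Python (chosen for brevity; it is not the FAQ's C++ example) that builds an ICMP echo request with a valid checksum. The function names, identifier, sequence number and payload are all made up for illustration. Actually sending the packet needs a raw ICMP socket and therefore the privileges described in question 4.1, so this sketch stops at constructing the packet.

```python
import struct

def internet_checksum(data: bytes) -> int:
    """RFC 1071 one's-complement sum over 16-bit big-endian words."""
    if len(data) % 2:
        data += b"\x00"          # pad odd-length input with a zero byte
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:           # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def build_echo_request(ident: int, seq: int, payload: bytes) -> bytes:
    """Build an ICMP echo request (type 8, code 0) with a valid checksum."""
    # The checksum is computed with the checksum field set to zero.
    header = struct.pack("!BBHHH", 8, 0, 0, ident, seq)
    csum = internet_checksum(header + payload)
    return struct.pack("!BBHHH", 8, 0, csum, ident, seq) + payload

# Hypothetical identifier, sequence number and payload.
packet = build_echo_request(ident=0x1234, seq=1, payload=b"winsock-faq")
```

A receiver checks the packet by running the same checksum over the whole thing; a correct packet yields zero. The reply to match against is an ICMP echo reply (type 0) carrying the same identifier and sequence number.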

The other method uses ICMP.DLL, which is an extension specific to Microsoft stacks. Though it works on all Windows systems as of this writing, Microsoft discourages its use in the strongest terms possible, claiming that the API will disappear as soon as a better method exists. (It hasn't actually happened yet, despite several years of threats. :) ) ICMP.DLL's main advantage is that it works under Winsock 1.1. It is, however, less flexible than the raw sockets method. [C++ example]

I want to point out that many programs misuse ping. Naturally it has good uses, but it's a sign of a broken program or protocol if you find yourself resorting to regular use of ping packets. For example, I'm always seeing people ask about pinging when what they really want is to detect dropped connections.

4.5 - How do I pass a socket from one process to another?

Winsock 2 provides support for this through the WSADuplicateSocket() facility. The spec describes this method in detail, including some example code. You should also read article Q150523 in the Microsoft Knowledge Base. It describes how socket inheritance differs between the various flavors of Windows.

Another fun feature of the Win32 API is that it allows you to give a new process different "standard handles" (stdin, stdout and stderr) when you create it. MSKB article Q190351 addresses this. Note that this feature only allows you to do this with a child process; you can't redirect your own standard I/O handles to a socket. Also, the item notes that some processes may behave strangely when you do this to them. Clearly, this functionality is not as powerful as the Unix world's dup2() system call.

Winsock 1.1 does not support passing file handles to other processes.

4.6 - Is it possible to create sockets that map to a DLL rather than an application?

Under Windows, a DLL's data is actually owned by the application that loads the DLL. If you need the DLL to own a single socket no matter how many processes load the DLL, you need to create a "helper process" which will perform all Winsock operations on behalf of the DLL. Naturally you'll need some kind of interprocess communication channel between the DLL and the helper process.

Note that this issue only matters if you're using a DLL to let multiple processes share a socket. If you only have one process using the DLL, or if it's okay for each process to remain ignorant of the other processes using the DLL, this issue won't matter to you.

4.7 - How can I get access to the {route, ARP, interface, etc.} table?

Stas Khirman and Raz Galili have written a great tutorial on the art of using the poorly-documented SNMP API. This API allows you to access many "hidden" parts of the Windows networking subsystem, including the network interface list, the route and ARP tables, the list of connected network sockets, your Ethernet cards' hardware addresses, etc.

4.8 - How do I get the MAC (a.k.a. hardware) address of the local Ethernet adapter?

This FAQ has example code for two hackish methods and one complex but reliable method.

The first method involves asking the NetBIOS API for the adapter addresses. This method will fail on systems where NetBIOS isn't present, and it sometimes gives bogus answers.

There is a second method that depends on a property of the RPC/OLE API. This property is documented but not guaranteed to do what we want, and in fact it fails in a number of situations. (Details in the example program's commentary.) As a result, I have to recommend that you give this method a miss.

The third method uses the sparsely-documented SNMP API to get MAC addresses. This method seems to work all the time, but it's far more complex than the other two methods.

There are some lower-level methods in PCAUSA's NDIS FAQ that may also be helpful to you.

4.9 - How many simultaneous sockets can I have open with Winsock?

On Win9x machines, there's a quite-low limit imposed by the kernel: 100 connections. You can increase this limit by editing the registry key HKLM\System\CurrentControlSet\Services\VxD\MSTCP\MaxConnections. On Windows 95, the key is a DWORD; on Windows 98, it's a string. I've seen some reports of instability when this value is increased to more than a few times its default value.

The rest of this discussion will cover only Windows NT and Windows 2000. These systems have much higher intrinsic capabilities, and thus allow you to use many more sockets. But, the Winsock specification does not set a particular limit, so the only sure way to tell is to try it on all the Winsock stacks you plan on supporting.

Beyond that vague advice, things get more complicated. The simplistic test is to write a program that does nothing but open sockets, to see where it fails: [C++ Example].

The above program isn't terribly realistic. I've seen it grab more than 30,000 sockets before failing on Windows NT 4.0. Anecdotal evidence from the Winsock 2 mailing list puts the real limit much lower, typically 4,000 to 16,000 sockets, even on NT systems with hundreds of megabytes of physical memory. The difference is that the example program just grabs socket handles, but does not actually create connections with them or tie up any network stack buffers.

According to people at Microsoft, the WinNT/Win2K kernel allocates sockets out of the non-paged memory pool. (That is, memory that cannot be swapped to the page file by the virtual memory subsystem.) The size of this pool is necessarily fixed, and is dependent on the amount of physical memory in the system.

On Intel x86 machines, the non-paged memory pool stops growing at 1/8 the size of physical RAM, with a hard maximum of 128 megabytes on Windows NT 4.0 and 256 megabytes on Windows 2000. Thus for NT 4, the size of the non-paged pool stops increasing once the machine has 1 GB of RAM. On Win2K, you hit the wall at 2 GB.

The amount of data associated with each socket varies depending on how that socket's used, but the minimum size is around 2 KB. Overlapped I/O buffers also eat into the non-paged pool, in blocks of 4 KB. (4 KB is the x86's memory management unit's page size.) Thus a simplistic application that's regularly sending and receiving on a socket will tie up at least 10 KB of non-pageable memory: 2 KB for the socket itself, plus one 4 KB block each for the send and receive buffers.

Assuming that simple case of 10 KB of data per connection, the theoretical maximum number of sockets on NT 4.0 is about 12,800, and on Win2K about 25,600.

I have seen reports of a 64 MB Windows NT 4.0 machine hitting the wall at 1,500 connections, a 128 MB machine at around 4,000 connections, and a 192 MB machine maxing out at 4,700 connections. It would appear that on these machines, each connection is using between 4 KB and 6 KB. The discrepancy between these numbers and the 10 KB number above is probably due to the fact that in these servers, not all connections were sending and receiving all the time. The idle connections will only be using about 2 KB each.

So, adjusting our "average" size down to 6 KB per socket, NT 4.0 could handle about 21,800 sockets and Win2K about 43,700 sockets. The largest value I've seen reported is 16,000 sockets on Windows NT 4.0.
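
The arithmetic above is easy to rerun for your own hardware. The following Python sketch simply encodes the rules quoted in this answer (pool is 1/8 of RAM up to a hard cap, divided by a per-socket cost); the function and constant names are made up for illustration. It counts a megabyte as 1,024 KB, so its results can differ by a few percent from rounded figures that count a megabyte as 1,000 KB.

```python
def nonpaged_pool_kb(ram_mb: int, cap_mb: int) -> int:
    """Pool size in KB: 1/8 of physical RAM, limited by the hard cap."""
    return min(ram_mb // 8, cap_mb) * 1024

def max_sockets(ram_mb: int, cap_mb: int, kb_per_socket: int) -> int:
    """Upper bound if the whole pool went to sockets (it never does)."""
    return nonpaged_pool_kb(ram_mb, cap_mb) // kb_per_socket

NT4_CAP_MB, WIN2K_CAP_MB = 128, 256   # hard caps quoted above

nt4_at_6kb = max_sockets(1024, NT4_CAP_MB, 6)      # 21845
win2k_at_6kb = max_sockets(2048, WIN2K_CAP_MB, 6)  # 43690

# Per-connection cost implied by the 64 MB report:
# an 8 MB pool shared by 1,500 connections is roughly 5.5 KB each.
kb_per_conn_64mb = nonpaged_pool_kb(64, NT4_CAP_MB) / 1500
```

Treat the output as a ceiling, not a prediction: the calculation assumes the entire pool is available to your sockets.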

There's one more complication to keep in mind: your server program will not be the only thing running on the machine. If nothing else, there will be core OS services running. These other programs will be competing with yours for space in the non-paged memory pool.

4.10 - What are the "64 sockets" limitations?

There are two limitations in Winsock where you're limited to 64 sockets.

The Win32 event mechanism (e.g. WaitForMultipleObjects()) can only wait on 64 event objects at a time. Winsock 2 provides the WSAEventSelect() function which lets you use Win32's event mechanism to wait for events on sockets. Because it uses Win32's event mechanism, you can only wait for events on 64 sockets at a time. If you want to wait on more than 64 Winsock event objects at a time, you need to use multiple threads, each waiting on no more than 64 of the sockets.

The select() function is also limited in certain situations to waiting on 64 sockets at a time. The FD_SETSIZE constant defined in winsock.h determines the size of the fd_set structures you pass to select(). It's defined by default to 64. You can define this constant to a higher value before you #include winsock.h, and this will override the default value. Unfortunately, at least one non-Microsoft Winsock stack and some Layered Service Providers assume the default of 64; they will ignore sockets beyond the 64th in larger fd_sets.

You can write a test program to try this on the systems you plan on supporting, to see whether they are limited. If they are, you can get around this with threads, just as you would with event objects.
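
The bookkeeping behind the threads workaround is simple: split your handles into groups of at most 64 and give each group to its own waiting thread. Here is a minimal Python sketch of just the partitioning step; the helper name is hypothetical, and the wait call each thread would make (WSAWaitForMultipleEvents() or select()) is left out.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")
MAX_WAIT = 64  # the per-call limit discussed above

def split_for_waiting(handles: Sequence[T], limit: int = MAX_WAIT) -> List[List[T]]:
    """Partition handles into groups no larger than limit, one group per thread."""
    return [list(handles[i:i + limit]) for i in range(0, len(handles), limit)]

# 150 stand-in handles end up in three groups: 64, 64 and 22.
groups = split_for_waiting(list(range(150)))
```

In a real server the groups change as connections come and go, so you would also need a way to move sockets between threads or to start and retire threads as the load changes.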

4.11 - How do I make Winsock use a specific network interface?

On a machine with multiple network interfaces (a modem for dialup Internet and a LAN card, for example), it can sometimes be useful to force Winsock to use a specific interface. Before I go into how, keep in mind that the routing layer of the stack exists to handle this for you. If your setup isn't working the way you want, maybe you just need to change the routing tables. (This is done with the "route" and "netstat" command-line programs on Microsoft stacks.)

There are two common reasons why you might want to force Winsock to use a particular network interface. The first is when you only want your server program to handle incoming connections on a particular interface. For example, if you have an NT machine set up as an Internet gateway, and it also runs a server that you only want internal LAN users to be able to access, you will want to set it to only listen on the LAN interface. The other reason is that you have two or more possible outgoing routes, and you want your client program to connect using a particular one without the routing layer getting in the way.

You can do both of these things with the bind() function. Using one of the "get my IP addresses" examples, you can present your user with a list of possible addresses. Then they can pick the appropriate address to use, which your program will use in the bind() call. Obviously, this is only feasible for programs intended for advanced users.
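
For the server case, the difference comes down to what address you hand to bind(). The Python sketch below (Python for brevity; the calls map one-to-one onto Winsock's) listens on a single address instead of the wildcard address. The helper name is made up, and the loopback address stands in for "the LAN interface" so the snippet runs on any machine.

```python
import socket

def listen_on(address: str, port: int = 0) -> socket.socket:
    """Return a TCP socket listening only on the interface that owns address."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((address, port))   # a specific address instead of INADDR_ANY
    srv.listen(5)
    return srv

# Loopback is a placeholder; a real program would use the address
# the user picked from the list of local addresses.
server = listen_on("127.0.0.1")
bound_ip, bound_port = server.getsockname()
server.close()
```

A client does the same thing in a different order: bind() to the chosen local address (with port 0) and then connect().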

Incidentally, this is how virtual hosting on the Internet works. A single server is set up with a single network card but several IP addresses. Windows NT/2000 can do this, but Win9x cannot. To set this up in NT, go into the TCP/IP area of the Network control panel, and then click the Advanced button. IIRC, you can enter up to five network addresses per interface in NT Workstation, perhaps more in NT Server.

Note that this information does not apply to the Windows 95/98 multihomed computer Dialup Networking bug. This problem cannot be fixed by bind()ing to the LAN interface in an effort to force the OS to use it exclusively. The problem is due to a bug in the OS's name resolver. See the DUN bug FAQ item for workarounds.

4.12 - What is the { SYN, ACK, FIN, RST } bit?

See the FAQ article Debugging TCP.

4.13 - Is it a bad idea to bind() to a particular port in a client program?

It's occasionally justifiable, but most of the time it's a very bad idea.

I've only heard of two good uses of this feature. The first is when your program needs to bind to a port in a particular range. Some implementations of the Berkeley "r commands" (e.g. rlogin, rsh, rcp, etc.) do this for security purposes. Because only the superuser on a Unix system can bind to a low-numbered port (1-1023), such an r command tries, sequentially, to bind to one of the ports in this range until it succeeds. This allows the remote server to surmise that if the connection is coming from a low-numbered port, the remote user must be a superuser. (This port range limit also applies to Windows NT and Windows 2000, but not to Windows 9x.)

The second justifiable example is FTP in its "active" mode: the client binds to a random port and then tells the server to connect to that port for the next data transfer (whether it is an upload, download, or a file listing). This is justifiable because it arguably cleans up the protocol, and the FTP client doesn't need to bind to any particular port, it just needs to bind to a port. (Incidentally, it does this by binding to port 0; the stack chooses an available port when you do this.) This is also justifiable because the FTP client is acting as a server in this case, so it makes sense that it has to bind to a port.
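
Here is a small Python sketch of the port 0 technique as an active-mode FTP client might use it: bind to port 0, ask the stack which port it picked, and format that for the PORT command (comma-separated address bytes followed by the high and low bytes of the port, per RFC 959). The function names are invented for the example, and the loopback address is a placeholder for the client's real address.

```python
import socket

def open_data_listener(local_ip: str) -> socket.socket:
    """Listen on a stack-chosen port, as an active-mode FTP client would."""
    lst = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    lst.bind((local_ip, 0))     # port 0: let the stack pick a free port
    lst.listen(1)
    return lst

def port_argument(ip: str, port: int) -> str:
    """Format an address the way FTP's PORT command expects (RFC 959)."""
    return "%s,%d,%d" % (ip.replace(".", ","), port >> 8, port & 0xFF)

listener = open_data_listener("127.0.0.1")
chosen_ip, chosen_port = listener.getsockname()   # ask what the stack chose
command = "PORT " + port_argument(chosen_ip, chosen_port)
listener.close()
```

The important call is getsockname(): after binding to port 0 it is the only way to learn which port you actually got.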

By contrast, it is almost always an error to bind to a particular port in a client. (Notice that both of the above examples are flexible about the ports they bind to.) To see why this is bad, consider web browsers. They often create several connections to download a single web page, one each to fetch all of the individual pieces of the page: images, applets, sound clips, etc. If they always bound to a particular local port, they could only have one connection going at a time. Also, you couldn't have a second instance of the web browser downloading another page at the same time.

That's not the biggest problem, though. When you close a TCP connection, it goes into the TIME_WAIT state for a short period (between 30 and 120 seconds, typically), during which you cannot reuse that connection's "5-tuple:" the combination of {local host, local port, remote host, remote port, transport protocol}. (This timeout period is a feature of all correctly-written TCP/IP stacks, and is covered in RFC 793 and especially RFC 1122.) In practical terms, this means that if you bind to a specific port all the time, you cannot connect to the same host using the same remote port until the TIME_WAIT period expires. I have personally seen anomalous cases where the TIME_WAIT period does not occur, but when this happens, it's a bug in the stack, not something you should count on.

For more on this matter, see the Lame List.

4.14 - What is the TCP window size?

A naive transport protocol simply sends one packet at a time: it does not send another packet until it gets an acknowledgement for the first one, or it times out waiting for the acknowledgement.

The limit of data throughput over a network link is the maximum amount of data it is possible to have in transit at once divided by the round trip time. Imagine a naive TCP/IP implementation running over a 100BaseT Ethernet. The maximum Ethernet frame (less the TCP/IP headers) is 1460 bytes, and the 100BaseT round trip time is roughly 0.3 ms. 1460 divided by 0.0003 seconds comes out to 4.8 MB/s.

If you've done any speed testing on a 100BaseT Ethernet, you know you can hit about 6 MB/s without Ethernet switches, and 9 MB/s with switched Ethernet. That's about twice the data rate we calculated above. We owe that speed jump to TCP's "sliding window".

A sliding window means that the stack allows a small amount of data to go unacknowledged before it stops and waits for the remote peer to acknowledge the first packet. In Microsoft Winsock stacks, the sliding window defaults to 8 KB. That means that if it sends 8 KB of data without receiving an acknowledgement for the first packet, the stack won't send any more data until the first packet is acknowledged or the retry timer goes off, at which point it will try to send the first packet again. As each packet at the front of the "window" gets acknowledged, the 8 KB window "slides" along the data stream, allowing more data to go out.

Dividing Microsoft's 8 KB value by 0.0003 seconds gives about 26 MB/s, which means you hit the medium's maximum data rate (about 9 MB/s) before you hit the limit imposed by the round trip time.

Some networks have long round trip times which require large TCP windows if your application needs to be able to fill the entire pipe with a single TCP stream. Satellite systems are the most common example of this: 600 ms round trip times are common on the satellite system we use at work! Some DSL systems have pretty long round trip times, too, though not nearly as bad as satellite systems. You need to run the numbers to find out what the situation is for your system.

For what it's worth, typical modem round trip times are in the 100-250 ms range. Calculating for 250 ms comes out to 32 KB/s, about five times the data rate of the fastest modem connections you're likely to see. In other words, an 8 KB window is plenty large for modems, despite the long round trip times.
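
If you want to run the numbers for your own link, the calculation is just window size divided by round trip time. The following Python sketch repeats the cases from this answer; the function names are made up, and the round trip times are the rough figures quoted above, not measurements.

```python
def max_throughput(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound in bytes/second: data in flight divided by round trip time."""
    return window_bytes / rtt_seconds

def window_needed(bytes_per_second: float, rtt_seconds: float) -> float:
    """Window size in bytes needed to keep a link of the given speed full."""
    return bytes_per_second * rtt_seconds

WINDOW = 8 * 1024   # the 8 KB default discussed above

lan = max_throughput(WINDOW, 0.0003)        # 100BaseT, about 0.3 ms
modem = max_throughput(WINDOW, 0.250)       # slow modem, 250 ms
satellite = max_throughput(WINDOW, 0.600)   # satellite link, 600 ms
```

With an 8 KB window, the 600 ms satellite case tops out at roughly 13 KB/s no matter how fast the link itself is, which is why such links need a larger window. Going the other way, window_needed() tells you how big: filling a 1,000,000 byte/s link at 600 ms takes a window of about 600,000 bytes.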

MS Knowledge Base articles Q120642 and Q158474 show how to change the TCP window size for Windows NT/2000 and Windows 95/98, respectively.

4.15 - What is the connection backlog?

When a connection request comes into a network stack, it first checks to see if any program is listening on the requested port. If so, the stack replies to the remote peer, completing the connection. The stack stores the connection information in a queue called the connection backlog. (When there are connections in the backlog, the accept() call simply causes the stack to remove the oldest connection from the connection backlog and return a socket for it.)
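
You can watch the backlog at work with a few lines of code. In this Python sketch (Python for brevity; the calls correspond directly to the Winsock ones), the client's connect() succeeds before the server has called accept() at all, because the stack completes the connection on the server's behalf and queues it. Loopback and a stack-chosen port are used so it runs on any machine.

```python
import socket

# Server side: bind and listen, but do not call accept() yet.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))          # loopback, stack-chosen port
server.listen(5)

# Client side: connect() succeeds because the stack completes the
# handshake itself and parks the connection in the backlog.
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(server.getsockname())

# Only now does the program pull the queued connection out of the backlog.
conn, peer = server.accept()
same_connection = (peer == client.getsockname())

for s in (conn, client, server):
    s.close()
```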

The purpose of the listen() call is to set the size of the connection backlog for a particular socket. When the backlog fills up, the stack begins rejecting connection attempts.

Rejecting connections is a good thing if your program is written to accept new connections as fast as it reasonably can. If the backlog fills up despite your program's best efforts, it means your server has hit its load limit. If the stack were to accept more connections, your program wouldn't be able to handle them as well as it should, so the client will think your server is hanging. At least if the connection is rejected, the client will know the server is too busy and will try again later.

The proper value for backlog depends on how many connections you expect to see in the time between accept() calls. Let's say you expect an average of 1000 connections per second, with a burst value of 3000 connections per second. [Ed. I picked these values because they're easy to manipulate, not because they're representative of the real world!] To handle the burst load with a short connection backlog, your server's time between accept() calls must be under 0.3 milliseconds. Let's say you've measured your time-to-accept under load, and it's 0.8 milliseconds: fast enough to handle the normal load, but too slow to handle your burst value. In this case, you could make backlog relatively large to let the stack queue up connections under burst conditions. Assuming that these bursts are short, your program will quickly catch up and clear out the connection backlog.
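
To put a rough number on "relatively large", you can estimate how many connections pile up during a burst: the queue grows at the burst rate minus the rate at which you accept. The Python sketch below does that arithmetic with the figures above. The function name is invented, and the burst length of 50 ms is purely a hypothetical input; you would have to measure or guess it for your own server.

```python
import math

def backlog_needed(burst_rate: float, accept_time: float, burst_seconds: float) -> int:
    """Connections that pile up in the backlog during one burst.

    burst_rate    -- incoming connections per second during the burst
    accept_time   -- measured seconds between accept() calls under load
    burst_seconds -- how long the burst lasts (an assumption you must supply)
    """
    drain_rate = 1.0 / accept_time                 # connections accepted per second
    growth = max(0.0, burst_rate - drain_rate)     # net queue growth per second
    return math.ceil(growth * burst_seconds)

# The figures from the text, with a hypothetical 50 ms burst.
needed = backlog_needed(burst_rate=3000, accept_time=0.0008, burst_seconds=0.05)
```

At 0.8 ms per accept the server drains 1,250 connections per second, so a 3,000 per second burst grows the queue by 1,750 per second, or 88 connections over 50 ms. At the normal load of 1,000 per second the queue does not grow at all.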

The traditional value for listen's backlog parameter is 5. On some stacks, that is also the maximum value: this includes Windows NT Workstation and Win9x at least. Windows NT/2000 Server's maximum connection backlog size is 200, unless the dynamic backlog feature is enabled. (More info on dynamic backlogs below.) The stack will use its maximum backlog value if you pass in a larger value. There is no standard way to find out what backlog value the stack chose to use.

If your program is quick about calling accept(), low backlog limits are not normally a problem. However, it does mean that concerted attempts to make lots of connections in a short period of time can fill the backlog queue. This makes Windows NT Workstation in particular a bad choice for a high-load server: either a legitimate load or a SYN flood attack can overload a server on such a platform. (See below for more on SYN attacks.)

There is a special constant you can use for the backlog size, SOMAXCONN. This tells the underlying service provider to set the backlog queue to the largest possible size. It is defined as 5 in winsock.h, and 0x7FFFFFFF in winsock2.h. Because the winsock.h value is only 5, a program built against that header never asks for more than 5, even on stacks that could provide a larger backlog.

There is an even better reason not to use SOMAXCONN: large backlogs make SYN flood attacks much more, shall we say, effective. When Winsock creates the backlog queue, it starts small and grows as required. Since the backlog queue is in non-pageable system memory, a SYN flood can cause the queue to eat a lot of this precious memory resource.

After the first SYN flood attacks in 1996, Microsoft added a feature to Windows NT called "dynamic backlog". (The feature is in service pack 3 and higher.) This feature is normally off for backwards compatibility, but when you turn it on, the stack can increase or decrease the size of the connection backlog in response to network conditions. (It can even increase the backlog beyond the "normal" maximum of 200, in order to soak up malicious SYNs.) The Microsoft Knowledge Base article that describes the feature also has some good practical discussion about connection backlogs.

You will note that SYN attacks are dangerous for systems with both short and very long backlog queues. The point is that a middle ground is the best course if you expect your server to withstand SYN attacks. Either use Microsoft's dynamic backlog feature, or pick a value somewhere in the 20-200 range and tune it as required.

A program can rely too much on the backlog feature. Consider a single-threaded blocking server: the design means it can only handle one connection at a time. However, it can set up a large backlog, making the stack accept and hold connections until the program gets around to handling the next one. (See this example to see the technique at work.) You should not take advantage of the feature this way unless your connection rate is very low and the connection times are very short. (Pedagogues excepted.)