A word of warning: This is the first edition of this document and there are bound to be errors. My ego isn't so fragile as to be bothered if I made a misstatement of fact when writing this. Just tell me.
My university possesses a generally excellent network, but on occasion certain dorms would grind to a halt for no apparent reason. Seeking answers, I used a Windows-based pinger to see if there were correlations between network downtimes and the presence of specific IPs on a specific subnet. We use essentially static IPs distributed from a DHCP server--a cookie seems to be assigned to a given MAC address on first request for an IP, and all future requests are handed that same IP. Nothing out of the ordinary was found using the Windows pingers, so I decided I'd automate the testing process over time using an excellent Linux tool called fping. (In another environment, I might have merely put up a sniffer, but the secure hubs and my lack of permission to modify them in any way prevented that possibility.)
Very quickly, I noticed some very strange entries in the fping logs (IPs changed):
10.0.9.42 : duplicate for [3], 84 bytes, 3.28 ms
10.0.9.73 : duplicate for [3], 84 bytes, 3.59 ms
10.0.10.33 : duplicate for [3], 84 bytes, 3.51 ms
10.0.10.99 : duplicate for [3], 84 bytes, 3.81 ms
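Log entries like these are easy to miss by eye in a long fping run. As a rough sketch (not the tooling used for the original testing), the following Python tallies "duplicate for" entries per host. The function name `count_duplicates` and the regular expression are illustrative assumptions based only on the line format shown above.

```python
import re
from collections import Counter

# Assumed line format, taken from the log excerpt above:
#   10.0.9.42 : duplicate for [3], 84 bytes, 3.28 ms
DUPLICATE_RE = re.compile(
    r"(\d{1,3}(?:\.\d{1,3}){3})\s*:\s*duplicate for \[(\d+)\]"
)


def count_duplicates(log_text):
    """Return a Counter mapping each host to its number of duplicate replies."""
    return Counter(m.group(1) for m in DUPLICATE_RE.finditer(log_text))


sample_log = """\
10.0.9.42 : duplicate for [3], 84 bytes, 3.28 ms
10.0.9.73 : duplicate for [3], 84 bytes, 3.59 ms
10.0.10.33 : duplicate for [3], 84 bytes, 3.51 ms
10.0.10.99 : duplicate for [3], 84 bytes, 3.81 ms
"""

if __name__ == "__main__":
    for host, dups in sorted(count_duplicates(sample_log).items()):
        print(f"{host}: {dups} duplicate(s)")
```

Run over logs collected across several days, a tally like this would show which hosts send duplicate replies consistently rather than as a one-off.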
I thought there might be a bug in fping, so I pinged the offending machines from Windows 98:
C:\WINDOWS>ping -f 10.0.9.42
Pinging 10.0.9.42 with 32 bytes of data:
Reply from 10.0.9.42: bytes=32 time=4ms TTL=126
Reply from 10.0.9.42: bytes=32 time=3ms TTL=126
Reply from 10.0.9.42: bytes=32 time=4ms TTL=126
Reply from 10.0.9.42: bytes=32 time=3ms TTL=126
Ping statistics for 10.0.9.42:
Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 3ms, Maximum = 4ms, Average = 3ms
Confusingly, everything seemed normal from here. Then I tried the Linux ping command.
effugas@doxpara:~> ping 10.0.9.42
PING 10.0.9.42 (10.0.9.42): 56 data bytes
64 bytes from 10.0.9.42: icmp_seq=0 ttl=127 time=3.5 ms
64 bytes from 10.0.9.42: icmp_seq=0 ttl=127 time=14.7 ms (DUP!)
64 bytes from 10.0.9.42: icmp_seq=1 ttl=127 time=6.2 ms
64 bytes from 10.0.9.42: icmp_seq=1 ttl=127 time=7.5 ms (DUP!)
64 bytes from 10.0.9.42: icmp_seq=2 ttl=127 time=3.3 ms
64 bytes from 10.0.9.42: icmp_seq=2 ttl=127 time=3.8 ms (DUP!)
64 bytes from 10.0.9.42: icmp_seq=3 ttl=127 time=15.0 ms
64 bytes from 10.0.9.42: icmp_seq=3 ttl=127 time=15.4 ms (DUP!)
--- 10.0.9.42 ping statistics ---
4 packets transmitted, 4 packets received, +4 duplicates, 0% packet loss
round-trip min/avg/max = 3.3/8.6/15.4 ms
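The duplicates can be cross-checked without relying on ping's own (DUP!) marker by counting how many replies arrive for each icmp_seq value. The sketch below assumes the Linux ping output format shown above; `replies_per_sequence` and `extra_replies` are hypothetical helper names.

```python
import re
from collections import Counter

# Every reply line in the output above carries an "icmp_seq=N" field.
SEQ_RE = re.compile(r"icmp_seq=(\d+)")


def replies_per_sequence(ping_output):
    """Count how many replies were received for each ICMP sequence number."""
    return Counter(int(seq) for seq in SEQ_RE.findall(ping_output))


def extra_replies(ping_output):
    """Return {sequence number: replies beyond the first} for duplicated pings."""
    counts = replies_per_sequence(ping_output)
    return {seq: n - 1 for seq, n in counts.items() if n > 1}


sample_output = """\
64 bytes from 10.0.9.42: icmp_seq=0 ttl=127 time=3.5 ms
64 bytes from 10.0.9.42: icmp_seq=0 ttl=127 time=14.7 ms (DUP!)
64 bytes from 10.0.9.42: icmp_seq=1 ttl=127 time=6.2 ms
64 bytes from 10.0.9.42: icmp_seq=1 ttl=127 time=7.5 ms (DUP!)
64 bytes from 10.0.9.42: icmp_seq=2 ttl=127 time=3.3 ms
64 bytes from 10.0.9.42: icmp_seq=2 ttl=127 time=3.8 ms (DUP!)
64 bytes from 10.0.9.42: icmp_seq=3 ttl=127 time=15.0 ms
64 bytes from 10.0.9.42: icmp_seq=3 ttl=127 time=15.4 ms (DUP!)
"""

if __name__ == "__main__":
    for seq, extra in sorted(extra_replies(sample_output).items()):
        print(f"icmp_seq={seq}: {extra} extra reply(ies)")
```

For the sample above, every sequence number has exactly one extra reply, which agrees with the "+4 duplicates" in ping's summary.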
This was disturbing, especially since the number of TCP-Chorusing machines on a subnet (those sending duplicate replies) correlated very strongly with that subnet experiencing high collision rates and slow networking. What was causing this? The first step was to hunt down the machines exhibiting the bug and do a little exploratory surgery. Once I got access to a few of the affected machines, it didn't take much deduction to realize that each had the same number of extra TCP/IP stacks bound to the main adapter as there were extra pings. Still, just because there were machines with extra stacks didn't mean it wasn't the NIC's fault--a bug in the NIC installer could have created the extra TCP/IP entries. And what about wiring? All of these machines were exhibiting these reactions on a rather non-standard "secure hub". Perhaps that was the cause of the stacks reacting so strangely?
Further investigation did shed some light. The automated installation routines are suspect, since they're the routines that most commonly add the stacks. All cards, though, from generic Linksys cards to an Intel 8255x 10/100 board to the entire bevy of PCI and PCMCIA cards that 3Com offers, will exhibit this behavior if additional TCP/IP stacks are merely added onto them. While I have only seen 3Com cards suffering from unintentional TCP/IP Chorusing, this is probably because of the 90%+ market share 3Com enjoys on campus and not because of a flaw in their drivers. It's quite likely that, since students and not staff install network drivers on campus, this is more of a wetware problem--the student does whatever he or she can to "just make it work like the directions say", and if adding TCP/IP multiple times happens to "Just Work", so be it.
Before I could be sure that this was the problem, though, I needed to isolate a computer from the University network. I used my dorm room's internal 100baseT network to do so. The following tcpdump output shows a single character typed from the chorusing machine into the telnet port of the Linux machine:
11:31:02.390000 10.0.6.195.1043 > 10.0.6.194.telnet: P 6:7(1) ack 171 win 7756 (DF)
[Initial Keypress]
11:31:02.390000 10.0.6.194.telnet > 10.0.6.195.1043: P 171:172(1) ack 7 win 16352 (DF)
[Pressed key is echoed from the Linux machine to be displayed on the Windows box.]
11:31:02.390000 10.0.6.195.1043 > 10.0.6.194.telnet: . ack 172 win 7755 (DF)
[Windows machine acknowledges receipt of data signifying what character it should display.]
11:31:02.390000 10.0.6.195.1043 > 10.0.6.194.telnet: . ack 172 win 7755 (DF)
[Windows machine again acknowledges receipt. This is the "chorus".]
There's most probably no limit to the number of extra ACKs--if I had ten TCP/IP stacks, I'd have nine duplicate packets, as far as I can tell.
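Spotting the repeated ACK by hand gets tedious in longer captures. A minimal sketch, assuming tcpdump's text output with a leading timestamp field as in the trace above: it strips the timestamp and reports runs of identical consecutive packets. The names `strip_timestamp` and `find_chorus` are hypothetical. Reading the run length as a stack count follows the untested guess above (N stacks produce N-1 duplicates), and ordinary retransmissions can also produce identical lines, so treat a match as a hint and not as proof.

```python
from itertools import groupby


def strip_timestamp(line):
    """Drop the leading timestamp field from a tcpdump text line."""
    parts = line.split(None, 1)
    return parts[1].strip() if len(parts) == 2 else ""


def find_chorus(lines):
    """Return (packet, copies) for each run of identical consecutive packets."""
    packets = [p for p in (strip_timestamp(line) for line in lines) if p]
    runs = []
    for packet, run in groupby(packets):
        copies = len(list(run))
        if copies > 1:
            runs.append((packet, copies))
    return runs


sample_trace = """\
11:31:02.390000 10.0.6.195.1043 > 10.0.6.194.telnet: P 6:7(1) ack 171 win 7756 (DF)
11:31:02.390000 10.0.6.194.telnet > 10.0.6.195.1043: P 171:172(1) ack 7 win 16352 (DF)
11:31:02.390000 10.0.6.195.1043 > 10.0.6.194.telnet: . ack 172 win 7755 (DF)
11:31:02.390000 10.0.6.195.1043 > 10.0.6.194.telnet: . ack 172 win 7755 (DF)
"""

if __name__ == "__main__":
    for packet, copies in find_chorus(sample_trace.splitlines()):
        # Assumption: one copy per bound TCP/IP stack, per the guess above.
        print(f"{copies} copies (possibly {copies} stacks): {packet}")
```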
A final note--I have thus far been able to locate the bug in Windows 98 and Windows 95 OSR2. The original version of Windows 95 was simply unavailable for testing, but I would appreciate an email verifying whether the bug harkens back that far.