The Mysterious Layer 2 Problem
So it looks like the RANCID server has already paid off.
Now, I am not at all well-versed in the world of VPNs and ASAs. I can do some basic config on them via the web interface, but I am by no means an expert. Knowing my limitations, the director called in the outside contractor to pre-build one side of a VPN tunnel. This was done in a bit of a vacuum: nobody in the IT department except the director knew that it had happened until RANCID alerted us that the config on the ASA had changed. The director was fairly certain that this change could not have caused the problem that we were experiencing. What was the problem? I’m glad you asked!
The ASA is connected to our main prod switch, as well as our main internal router. We had terrible problems accessing hosts on the main prod switch, which carries only one VLAN. Pings from other subnets were spotty, and worse, pings inside the same VLAN were spotty! Because we had multiple hosts on the same broadcast domain that couldn’t maintain a solid TCP connection, my first inclination was to blame the switch. Then I got the email from RANCID. Inspired by a quote from /r/networking by carollr, ‘The truth is what’s on the wire,’ I fired up a packet capture, and what I saw blew me away (I’ve changed the internal IPs).
ARP request: Who has 192.168.1.1? Tell 192.168.1.20.
ARP reply: 192.168.1.1 is at 00:00:00:00:00:01 <-- Internal router (correct)
ARP reply: 192.168.1.1 is at 00:00:00:00:00:02 <-- ASA (incorrect)
Two ARP replies with different MAC addresses for the same host! Must be an IP conflict! Off to the ASA, which should not be on 192.168.1.1. And it wasn't. Now what? The only other thing I know of that would cause a Cisco device to answer an ARP request for an address that doesn't belong to it is proxy-arp, but I didn't enable that anywhere, and neither did the outside vendor, right? Wrong.
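The symptom in that capture, one IP address claimed by two different MAC addresses, is easy to check for once the replies are pulled out of a capture. Here is a minimal Python sketch of that check. The function name and the third, well-behaved host are made up for illustration; the first two entries are the sanitized replies shown above.

```python
from collections import defaultdict


def find_arp_conflicts(replies):
    """Return {ip: sorted MACs} for every IP claimed by more than one MAC.

    `replies` is an iterable of (ip, mac) pairs taken from ARP replies.
    """
    macs_by_ip = defaultdict(set)
    for ip, mac in replies:
        macs_by_ip[ip].add(mac.lower())
    return {ip: sorted(macs) for ip, macs in macs_by_ip.items() if len(macs) > 1}


# The two replies from the capture, plus one hypothetical host with a single MAC.
replies = [
    ("192.168.1.1", "00:00:00:00:00:01"),   # internal router
    ("192.168.1.1", "00:00:00:00:00:02"),   # ASA
    ("192.168.1.20", "00:00:00:00:00:14"),  # hypothetical, for contrast
]

conflicts = find_arp_conflicts(replies)
for ip, macs in conflicts.items():
    print(f"{ip} is claimed by {len(macs)} MACs: {', '.join(macs)}")
```

Only 192.168.1.1 is reported, because it is the only address with more than one MAC behind it.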
One of the commands that they entered, as revealed by the RANCID diff, was:
ip nat (outside, inside) source static Local_subnet Local_subnet destination static remote_network remote_network
First - what does this command do? It's a network address translation command that is used in conjunction with the VPN tunnel, to tell the ASA that when traffic goes through the tunnel, we want to maintain the existing source and destination addresses - we don't want to change them. I noticed 2 differences between this particular NAT command and other NAT commands that we had configured for remote sites. I'll present an example of one of our other NAT commands here:
ip nat (inside, outside) source static Local_subnet Local_subnet destination static remote_subnet remote_subnet no-proxy-arp route-lookup
The differences are glaring. First, the inverted order of (outside, inside) told the ASA that the Local_subnet addresses were actually remote, and the last nail in the coffin was the missing no-proxy-arp keyword. The route-lookup keyword just tells the ASA to use its routing table to find a route to the remote host. So why did these differences cause our problems?
Because proxy-arp is enabled by default on that command, and the ASA thought that the Local_subnet addresses were on the other end of the VPN tunnel, it was sending ARP replies to every request for a host on the local subnet! Essentially, it was like having an IP conflict for every host connected to that switch, since everything is in the same VLAN/broadcast domain!
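To make the failure mode concrete, here is a toy Python model of the decision described above. It is not how the ASA is implemented, just a sketch of the logic: with proxy ARP on, the device answers for any address inside a subnet it believes it should proxy for. The function name is hypothetical, and the /24 prefix is an assumption, since only individual sanitized addresses appear in the capture.

```python
import ipaddress


def answers_arp_for(target_ip, proxied_subnets, proxy_arp_enabled=True):
    """Toy model: would a device proxying for these subnets reply to an ARP request?"""
    if not proxy_arp_enabled:
        return False
    target = ipaddress.ip_address(target_ip)
    return any(target in ipaddress.ip_network(net) for net in proxied_subnets)


# Assumed prefix for the local subnet; the real mask is not given in this post.
local_subnet = ["192.168.1.0/24"]

for host in ("192.168.1.1", "192.168.1.20", "10.0.0.5"):
    print(host, answers_arp_for(host, local_subnet))

# With no-proxy-arp in place, the device stays quiet for the same addresses.
print(answers_arp_for("192.168.1.1", local_subnet, proxy_arp_enabled=False))
```

Every address in the local subnet gets an answer from the model, which is exactly the "IP conflict for every host" effect.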
The director called the vendor and asked them to undo their config changes. Rather than wait for the ARP caches to age out, I manually cleared them on the router and the ASA, and within minutes our network was back to life!
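Since RANCID already emails us the diff, a small check on any new NAT line could flag this kind of mistake before it bites. The Python sketch below is only an illustration: the function name is made up, it works on the command text exactly as quoted in this post, and the rule that (inside, outside) is the expected interface order is an assumption based on our other site-to-site NAT lines.

```python
import re


def audit_nat_line(line):
    """Return a list of warnings for one NAT line, in the form quoted in this post."""
    warnings = []
    # Pull out the interface pair, e.g. "(outside, inside)".
    match = re.search(r"\(\s*(\w+)\s*,\s*(\w+)\s*\)", line)
    if match:
        pair = (match.group(1), match.group(2))
        # Assumption: our site-to-site NAT lines always use (inside, outside).
        if pair != ("inside", "outside"):
            warnings.append(f"interface order is ({pair[0]}, {pair[1]}), expected (inside, outside)")
    if "no-proxy-arp" not in line.split():
        warnings.append("no-proxy-arp is missing")
    return warnings


vendor_line = ("ip nat (outside, inside) source static Local_subnet Local_subnet "
               "destination static remote_network remote_network")
known_good_line = ("ip nat (inside, outside) source static Local_subnet Local_subnet "
                   "destination static remote_subnet remote_subnet no-proxy-arp route-lookup")

for label, line in (("vendor", vendor_line), ("known good", known_good_line)):
    print(label, audit_nat_line(line))
```

The vendor's line trips both warnings, while our existing line comes back clean.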
Mystery solved, and now everybody can get back to work.