Pings Fail Between Two Devices

Question

What initial steps should be taken to troubleshoot ping failures between two devices?

Answer

Introduction

The failure of Layer 3 communication between two devices in the network is most commonly reported as the devices being unable to ping each other. The problem could be anywhere from the physical layer through to the network layer.

The most common reasons for this problem are:

A physical link failure somewhere in the data path between the devices.
A misconfiguration at the VLAN or IP level somewhere along the data path between the devices.
Missing route information somewhere along the data path between the devices.

Ping

PING is often expanded as "Packet Internet Groper", although that expansion is a backronym. The Ping program was written by Mike Muuss in December 1983 and named after the sound of a sonar pulse. It has since become one of the most versatile and widely used diagnostic tools on the Internet.

The Ping program works much like sonar echo-location. It sends a small packet containing an ICMP ECHO_REQUEST to a specified host name or IP address, and the target then sends an ECHO_REPLY packet in return.

awplus#ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=114 time=11.9 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=114 time=11.6 ms
64 bytes from 8.8.8.8: icmp_seq=3 ttl=114 time=11.6 ms
64 bytes from 8.8.8.8: icmp_seq=4 ttl=114 time=11.6 ms
64 bytes from 8.8.8.8: icmp_seq=5 ttl=114 time=11.6 ms

--- 8.8.8.8 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4003ms
rtt min/avg/max/mdev = 11.619/11.695/11.902/0.125 ms
awplus#
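The request that ping sends can be sketched in a few lines of Python. This is only an illustration of the packet layout and checksum, not the code used by any particular ping implementation; the identifier, sequence number and all-zero payload are arbitrary values chosen for the example. It builds the packet in memory and does not send anything.

```python
import struct

ICMP_ECHO_REQUEST = 8  # ICMP type for Echo Request
ICMP_ECHO_REPLY = 0    # ICMP type for Echo Reply


def internet_checksum(data: bytes) -> int:
    """Return the 16-bit one's-complement checksum described in RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    # Fold any carry bits back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_echo_request(identifier: int, sequence: int, payload: bytes) -> bytes:
    """Build an ICMP Echo Request: type, code, checksum, id, sequence, data."""
    # The checksum is calculated with the checksum field itself set to zero.
    header = struct.pack("!BBHHH", ICMP_ECHO_REQUEST, 0, 0, identifier, sequence)
    checksum = internet_checksum(header + payload)
    header = struct.pack("!BBHHH", ICMP_ECHO_REQUEST, 0, checksum, identifier, sequence)
    return header + payload


# Arbitrary example values: 56 data bytes, as in the ping output above.
packet = build_echo_request(identifier=0x1234, sequence=1, payload=bytes(56))
print(len(packet))  # 8-byte ICMP header plus 56 data bytes
```

The 56 data bytes plus the 8-byte ICMP header account for the "64 bytes from" in each reply line, and adding a 20-byte IPv4 header gives the 84 shown in "56(84) bytes of data".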

Traceroutes

The best way to start debugging this problem is to use traceroute to narrow down the point in the data path where the problem lies.

Traceroute works by increasing the time-to-live (TTL) value of each successive batch of packets sent. The first three packets have a TTL of 1, so the first router on the path does not forward them and they travel only a single hop. The next three packets have a TTL of 2, and so on. Normally, each router that a packet passes through decrements the TTL by one and forwards the packet to the next hop. When a router receives a packet with a TTL of 1, it discards the packet and sends an ICMP Time Exceeded (type 11) message back to the sender. The traceroute utility uses these returning messages to build a list of the routers that the packets traversed en route to the destination.
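The TTL mechanism can be illustrated with a small simulation. This is a sketch only: it models the path as a plain list, sends no real packets, and uses one probe per TTL value instead of three. The function names and the addresses (taken from the documentation ranges) are assumptions made for the example.

```python
def send_probe(path, ttl):
    """Simulate one probe and return (responder, reply_kind).

    `path` is the ordered list of router addresses followed by the destination.
    """
    for router in path[:-1]:
        ttl -= 1  # each router decrements the TTL before forwarding
        if ttl == 0:
            # The router discards the probe and reports ICMP type 11.
            return router, "time-exceeded"
    return path[-1], "destination-reached"


def traceroute(path, max_hops=30):
    """Send probes with TTL 1, 2, 3, ... until the destination answers."""
    hops = []
    for ttl in range(1, max_hops + 1):
        responder, kind = send_probe(path, ttl)
        hops.append((ttl, responder))
        if kind == "destination-reached":
            break
    return hops


# Hypothetical path: two routers, then the destination.
example_path = ["192.0.2.1", "192.0.2.2", "198.51.100.10"]
for ttl, responder in traceroute(example_path):
    print(ttl, responder)
```

Each TTL value reveals exactly one more device on the path, which is why the hop numbers in real traceroute output correspond to the TTL of the probes that produced them.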

Traceroutes from each of the devices toward the other should show that traffic is forwarded correctly for a certain number of hops at each end. The points at which the traceroutes from each end show the data being lost or misdirected are the places to start looking for the problem. The traces from each end may identify one common location, or there may be problems at multiple locations.

Once the traceroute has pointed you to where to start investigating, the debugging process should work through the possible causes listed above.

awplus#traceroute 8.8.8.8
traceroute to 8.8.8.8 (8.8.8.8), 30 hops max, 38 byte packets
 1  10.52.201.1 (10.52.201.1)  1.145 ms  1.190 ms  3.840 ms
 2  10.52.253.1 (10.52.253.1)  1.872 ms  0.565 ms  0.917 ms
 3  10.52.127.249 (10.52.127.249)  0.876 ms  1.606 ms  1.930 ms
 4  *  *  *
 5  12.118.186.109 (12.118.186.109)  2.030 ms  1.622 ms  1.926 ms
 6  12.123.138.118 (12.123.138.118)  19.252 ms  12.234 ms  18.969 ms
 7  12.122.2.190 (12.122.2.190)  11.692 ms  15.498 ms  16.377 ms
 8  12.122.116.37 (12.122.116.37)  11.014 ms  11.846 ms  11.506 ms
 9  12.255.11.2 (12.255.11.2)  11.549 ms  12.255.11.0 (12.255.11.0)  11.664 ms  11.214 ms
10  108.170.240.97 (108.170.240.97)  12.078 ms  12.466 ms  12.835 ms
11  216.239.49.47 (216.239.49.47)  13.337 ms  108.170.232.19 (108.170.232.19)  13.588 ms  142.250.232.77 (142.250.232.77)  12.369 ms
12  8.8.8.8 (8.8.8.8)  13.295 ms  12.575 ms  12.208 ms
awplus#
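When comparing several traces, it can help to extract the last hop that answered. The following Python sketch assumes output in the format shown above; the sample text and function names are hypothetical. Note that a single silent hop, such as hop 4 in the trace above, is not necessarily the fault location, because some routers simply do not reply to probes. A run of silent hops that continues to the end of the trace is the more telling sign.

```python
import re

HOP_LINE = re.compile(r"^\s*(\d+)\s+(.*)$")          # "<hop number>  <rest>"
ADDRESS = re.compile(r"\((\d+\.\d+\.\d+\.\d+)\)")    # "(a.b.c.d)"


def parse_hops(output: str):
    """Return a list of (hop_number, [responding addresses]) tuples."""
    hops = []
    for line in output.splitlines():
        match = HOP_LINE.match(line)
        if not match:
            continue  # skip the banner and prompt lines
        hops.append((int(match.group(1)), ADDRESS.findall(match.group(2))))
    return hops


def last_responding_hop(hops):
    """Return the highest-numbered hop that returned at least one reply."""
    responding = [(number, addrs) for number, addrs in hops if addrs]
    return responding[-1] if responding else None


# Hypothetical trace in which nothing answers beyond hop 3.
sample = """\
 1  192.0.2.1 (192.0.2.1)  1.1 ms  1.2 ms  1.0 ms
 2  *  *  *
 3  192.0.2.9 (192.0.2.9)  2.0 ms  1.9 ms  2.1 ms
 4  *  *  *
 5  *  *  *
"""
print(last_responding_hop(parse_hops(sample)))
```

In this sample the device at hop 3, or the link beyond it, would be the place to start investigating.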

Cable Problems

Check for cabling problems (links down) in that location.

If the links are all OK, start doing pings or traceroutes from the switch(es) at the problem location towards the end-device that is not reachable from that location. Also, perform pings and traceroutes to other switches along the path towards the unreachable device. Look at where the ping and traceroute traffic is directed.

Is a misconfiguration causing the switch to send the traffic out an incorrect port?
Is the egress VLAN not configured on the correct ports?
Is the IP address on the egress VLAN incorrect?

If the configurations at each hop look correct, then check whether a cabling mistake is actually taking traffic to an incorrect destination.
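One quick way to sanity-check the IP addressing on an egress VLAN is to confirm that the neighbouring device's address falls inside the subnet configured on the interface. The sketch below uses Python's standard ipaddress module; the function name and the addresses are hypothetical examples and are not taken from any switch configuration.

```python
import ipaddress


def same_subnet(interface: str, neighbour: str) -> bool:
    """Return True if `neighbour` lies inside the subnet of `interface`.

    `interface` is an address with a prefix length, e.g. "192.0.2.1/24".
    """
    network = ipaddress.ip_interface(interface).network
    return ipaddress.ip_address(neighbour) in network


# With a /24 mask the neighbour is on-link.
print(same_subnet("192.0.2.1/24", "192.0.2.254"))
# With a mistaken /25 mask the same neighbour is outside the subnet.
print(same_subnet("192.0.2.1/25", "192.0.2.254"))
```

A mismatch of this kind typically means that the switch treats the neighbour as off-link, so traffic to it is routed elsewhere or dropped.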

Routing Problems

The problem may not be a misconfiguration of VLANs and IP interfaces at all; it may instead be a failure to transfer route information between the switches. If the switch where the problem occurs reports that it does not have a route to the destination device, then examine why it does not have this route. Was a static route left unconfigured on the switch? Is another switch failing to advertise the route to the switch in question?
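The question "does this switch have a route to the destination?" comes down to a longest-prefix-match lookup. The following Python sketch illustrates the idea with a hypothetical route table; it is not how the switch implements its lookup, and the prefixes and next hops are example values only.

```python
import ipaddress


def lookup(routes, destination):
    """Return the (prefix, next_hop) entry that best matches `destination`.

    `routes` maps prefix strings to next-hop strings. The most specific
    (longest) matching prefix wins; None means there is no route at all.
    """
    address = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(prefix), next_hop)
        for prefix, next_hop in routes.items()
        if address in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None
    network, next_hop = max(matches, key=lambda entry: entry[0].prefixlen)
    return str(network), next_hop


# Hypothetical table with no default route.
routes = {
    "192.0.2.0/24": "directly connected",
    "198.51.100.0/24": "192.0.2.2",
}
print(lookup(routes, "198.51.100.7"))  # matched by the second entry
print(lookup(routes, "203.0.113.5"))   # no matching prefix
```

A result of None corresponds to the case described above, where the switch has neither a specific route nor a default route covering the destination.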

If the problem needs to be escalated, then the most important information to capture on the switch(es) at the problem location is the output of show tech-support. It contains the full set of information needed to illustrate the state of IP routing on the switch, at both software and hardware levels.