Debugging IPSec VPNs in FortiGate

Debugging what is going wrong with a VPN setup is difficult. The IKE protocol is “chatty”, and negotiates back and forth between the two ends for several rounds. The GUI does not offer much help: the tunnel is shown as either up or down. Most of the real debugging happens in the CLI.

One problem in particular that has always bugged me is that you need access to the end machines involved to initiate traffic across the link. The network admin typically doesn’t have direct access to the computers on either side of the VPN in order to initiate that traffic. I’ll show you a method for initiating traffic from that network using the firewall itself.

Here are some basic steps to troubleshoot VPNs for FortiGate.

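In IKE/IPSec, the tunnel is established in two phases, and IKE drives both of them. Phase1 is the basic setup that gets the two ends talking: the gateways authenticate each other (with the preshared key, for example), detect whether NAT-T (NAT traversal) is needed, and set up a protected channel. Phase2 then uses that channel to negotiate the IPSec SAs that carry the actual traffic, including the keys with their periodic rotation and all the other “higher-end” parameters.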

The first troubleshooting step is to verify that your parameters are all correct and match on both ends.

For Phase1, is the remote gateway dynamic or static? FortiGate to FortiGate can use both Main and Aggressive modes for dynamic connections, but many other brands cannot. In general, if you are supporting a client end with a dynamic IP, you will have to use Aggressive mode for Phase1, so make sure that mode is set for dynamic clients. If this is a static config, you should use Main mode for Phase1, which is a bit more secure on the initial handshake.

For Phase2, are both sides set up to use PFS? Replay detection? Dead-peer detection? Most VPN setups let you list a set of encryption and hash algorithms, but the two sides don’t have to match the whole set exactly; they only need one combination in common. The reason for the set is to offer many choices. In practice, just pick one that your base client supports and go from there. Nowadays, AES256/SHA1 is probably supported across the board, and that is all I ever use.
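The “one combination in common” rule can be checked on paper before touching the firewall. Here is a minimal Python sketch of that check. The proposal lists are made-up examples and the function name is hypothetical; this is not how either gateway picks a proposal internally, just a quick way to compare what you typed into both ends.

```python
from typing import List, Optional, Tuple

Proposal = Tuple[str, str]  # (encryption algorithm, hash algorithm)


def first_common_proposal(local: List[Proposal],
                          remote: List[Proposal]) -> Optional[Proposal]:
    """Return the first local proposal that the remote side also lists, else None."""
    # Compare case-insensitively, since vendors spell algorithm names differently.
    remote_set = {(enc.lower(), digest.lower()) for enc, digest in remote}
    for enc, digest in local:
        if (enc.lower(), digest.lower()) in remote_set:
            return (enc, digest)
    return None


# Illustrative values only, copied by hand from each side's Phase2 settings.
local_side = [("AES256", "SHA1"), ("AES128", "SHA1"), ("3DES", "MD5")]
remote_side = [("3DES", "SHA1"), ("aes256", "sha1")]

print(first_common_proposal(local_side, remote_side))
```

If the function returns None, the two lists share no encryption/hash pair, and the negotiation has nothing to agree on.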

After that all checks out, we need to see what IKE is doing that is failing.

So SSH or console into the CLI.

If you are debugging a VDOM (like in this case), you may have to switch into the root VDOM if you are the system admin of the firewall as opposed to a VDOM admin, as the diag commands are only available in the individual VDOMs or from the root VDOM for the system admin.

fgt300C-fw # config vdom
fgt300C-fw (vdom) # edit root
current vf=root:0

fgt300C-fw (root) #

To enable debug logging on the console (should be default) do

fgt300C-fw (root) # diagnose debug console

To enable debugging output

fgt300C-fw (root) # diagnose debug enable

Phase1 debugging isn’t too useful. IKE/Phase2 debugging is where the problem almost always is. Let’s turn on full debugging logs there.

fgt300C-fw (root) # diagnose debug application ike -1

Now, the problem I’ve always run up against is getting the tunnel to trigger to open up with traffic running on the link. You either have to conference in somebody with access to help you, or use this nifty trick…

Open another SSH connection to the FW CLI. (If this is a VDOM, you’ll have to run ‘config vdom’ and then ‘edit vdom3’ to get into the VDOM context of the network you want to troubleshoot.)

Set the ping source IP address to an address the firewall holds on the inside network of the host you are trying to troubleshoot.

fgt300C-fw (vdom3) # execute ping-options source 172.30.3.254

And now, ping away from the CLI in order to bring up the tunnel interface

fgt300C-fw (vdom3) # execute ping 192.168.0.1

(assuming 192.168.0.1 is an existing host only reachable via the VPN tunnel, and the ping service is allowed through the tunnel).

fgt300C-fw (vdom3) # execute ping 192.168.0.1
PING 192.168.0.1 (192.168.0.1): 56 data bytes
64 bytes from 192.168.0.1: icmp_seq=0 ttl=64 time=46.9 ms
64 bytes from 192.168.0.1: icmp_seq=1 ttl=64 time=47.3 ms
64 bytes from 192.168.0.1: icmp_seq=2 ttl=64 time=45.5 ms
64 bytes from 192.168.0.1: icmp_seq=3 ttl=64 time=66.3 ms
64 bytes from 192.168.0.1: icmp_seq=4 ttl=64 time=45.7 ms

--- 192.168.0.1 ping statistics ---
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max = 45.5/50.3/66.3 ms

The trick here is that you are sourcing the ping from the network you are setting up, which should trigger the tunnel to come up if it isn’t up already, and you can see real live traffic. I don’t know how many times I’ve been stuck on a conference call waiting for whoever had access to get around to doing the test I asked of them.

Back in the first debug window, you should see a whole bunch of IPSec and IKE messages fly past on the screen.

You have to learn to pick out the lines that are important, and zero in on them as everything is flying by. Learn to pause the display (or do a quick ‘diag debug dis’ to stop the output). Scrolling back and zeroing in on the one error out of 100 lines is going to be your key skill here.

If all is well, you should get something about the SA being established with the SPI value (not important).

ike 3:MyVPN_GW:18690:MyVPN:49143: added IPsec SA: SPIs=939fc892/b54d030

and of course, if it is configured for SNMP, something like

ike 3:MyVPN_GW:18690:MyVPN:49143: sending SNMP tunnel UP trap

is a nice confirmation that all is well with the VPN.

If you are seeing a lot of errors repeating with Phase1, and you see messages like

ike 3:MyVPN_GW:18698: sent IKE msg (P1_RETRANSMIT): ....

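then most likely the problem is a mismatched preshared key for the VPN tunnel, as it isn’t passing out of P1 (which doesn’t have much to negotiate). Retransmits also show up when the other end simply isn’t answering, so confirm that the remote gateway is reachable and that IKE (UDP port 500) isn’t being blocked along the way.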

Also check again whether this is a dynamic client (generally requiring Aggressive mode) or a static connection, which probably should be set to Main mode but could be using Aggressive mode. Both ends have to use the same mode.

If you don’t have a common encryption alg/hash, you should see errors like

ike 3:MyVPN_GW:18707: no SA proposal chosen

This means it can’t find a matching SA proposal: the two ends have no encryption algorithm/hash combo in common to encrypt the tunnel. Fix up the encryption alg/hash so that both sides share at least one combination and everything should go better.
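If you capture the debug session to a file (most SSH clients can log a session), a small script can do the first pass of picking out those lines for you. This is only a sketch: the three patterns are the sample messages quoted in this article, the labels are my own shorthand, and other firmware versions may word their messages differently.

```python
import re
from typing import Iterable, List, Tuple

# Patterns come from the sample messages in this article; labels are my own shorthand.
PATTERNS = [
    (re.compile(r"added IPsec SA"), "OK: IPsec SA established"),
    (re.compile(r"P1_RETRANSMIT"), "CHECK: Phase1 retransmit (preshared key or mode)"),
    (re.compile(r"no SA proposal chosen"), "CHECK: no common encryption/hash proposal"),
]


def triage(lines: Iterable[str]) -> List[Tuple[int, str, str]]:
    """Return (line number, label, text) for every line matching a known pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for pattern, label in PATTERNS:
            if pattern.search(line):
                hits.append((number, label, line.strip()))
                break  # one label per line is enough
    return hits


# Sample input built from the messages shown in this article.
sample = """\
ike 3:MyVPN_GW:18698: sent IKE msg (P1_RETRANSMIT): ....
ike 3:MyVPN_GW:18707: no SA proposal chosen
ike 3:MyVPN_GW:18690:MyVPN:49143: added IPsec SA: SPIs=939fc892/b54d030
""".splitlines()

for number, label, line in triage(sample):
    print(f"{number:>4}  {label}")
    print(f"      {line}")
```

To run it against a saved session, pass an open file instead of the sample, for example triage(open("ike-debug.log")), where the file name is whatever you saved your capture as. Add more patterns as you learn which messages matter in your environment.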

The hardest problems to detect are mismatched keylife timers (you’ll just have to review the configuration on both sides to make sure your P1 and P2 keylife timers are identical). Problems with different timers show up as a VPN that works for a while, but then stops working and won’t come back up unless you bounce both sides. With the same valid timers on both sides, the VPN should stay up and key rollovers happen automatically.

Also, DPD may not always negotiate. One side may have it on and let a VPN connection stay up for a certain time until the timer kicks off and closes the connection for the lack of keep-alive packets. Make sure both sides have it on, or both sides have it off.
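Since the timer and DPD mismatches don’t announce themselves in the debug output, I find it easiest to write both sides’ settings down side by side and compare them mechanically. A minimal Python sketch of that comparison follows. The setting names and values are hypothetical, typed in by hand from each gateway’s configuration; nothing here is read from a FortiGate.

```python
from typing import Dict, List


def find_mismatches(side_a: Dict[str, object],
                    side_b: Dict[str, object]) -> List[str]:
    """List every setting whose value differs between the two sides."""
    problems = []
    for key in sorted(set(side_a) | set(side_b)):
        a = side_a.get(key, "<not set>")
        b = side_b.get(key, "<not set>")
        if a != b:
            problems.append(f"{key}: {a} != {b}")
    return problems


# Hypothetical values, copied by hand from each end of the tunnel.
fortigate = {"p1_keylife_s": 28800, "p2_keylife_s": 1800, "dpd": True, "pfs": True}
peer = {"p1_keylife_s": 28800, "p2_keylife_s": 3600, "dpd": False, "pfs": True}

for problem in find_mismatches(fortigate, peer):
    print(problem)
```

With the example values above, the P2 keylife and the DPD setting are reported as different, which are exactly the two kinds of mismatch that produce a tunnel that works for a while and then drops.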

There are a few other error conditions that may come up, but these are the more common errors.

The most important thing with low-level debugging like this is to learn to pick out the important error lines from all the rest of the junk flying by. It just takes practice. You may want to deliberately break an existing setup just to see what happens. But once you can zero in on the one error line out of 100 that is important, it will be a lot easier to troubleshoot whatever problems come at you.