Failing Over Between pfSense Boxes

At work we have several buildings. Most of our buildings use pfSense for firewalling and splitting off subnets.

Say we have two buildings, "Building-Office" and "Building-Cafe", that are physically near each other. Each of these buildings has its own Internet connection and pfSense box: pfsense-office and pfsense-cafe. Each building has a set of subnets associated with it, as follows:

Building-Office has the following subnets:

STAFF: 192.168.100.0/24
WORKSHOP: 10.10.10.0/24
LAB: 10.20.0.0/24

Building-Cafe has the following subnets:

STAFF: 192.168.100.0/24
CAFE: 172.19.19.0/24

As you can see, both buildings have a STAFF subnet with the same IP address range. A wireless bridge connects the two buildings on the STAFF subnet, so STAFF is effectively one network spanning both buildings. On the STAFF subnet, pfsense-office has an IP address of 192.168.100.254 and pfsense-cafe has an IP address of 192.168.100.253.
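Before touching pfSense, it can help to sanity-check the address plan. Here is a minimal Python sketch using only the standard library. It assumes the shared STAFF range is 192.168.100.0/24 (which is what the two pfSense addresses imply); the function name check_plan is my own and has nothing to do with pfSense itself.

```python
import ipaddress

# Address plan as described above. STAFF is assumed to be one shared range
# bridged between the two buildings.
SUBNETS = {
    "STAFF": ipaddress.ip_network("192.168.100.0/24"),
    "WORKSHOP": ipaddress.ip_network("10.10.10.0/24"),
    "LAB": ipaddress.ip_network("10.20.0.0/24"),
    "CAFE": ipaddress.ip_network("172.19.19.0/24"),
}
FIREWALLS = {
    "pfsense-office": ipaddress.ip_address("192.168.100.254"),
    "pfsense-cafe": ipaddress.ip_address("192.168.100.253"),
}


def check_plan(subnets, firewalls):
    """Return a list of problems with the address plan (empty if none)."""
    problems = []
    staff = subnets["STAFF"]
    # Both firewalls must be reachable on the bridged STAFF network.
    for name, addr in firewalls.items():
        if addr not in staff:
            problems.append(f"{name} ({addr}) is outside STAFF {staff}")
    # The firewalls must not share an address on the bridge.
    if len(set(firewalls.values())) != len(firewalls):
        problems.append("two firewalls share the same STAFF address")
    # No two subnets may overlap, or routing between them is ambiguous.
    names = sorted(subnets)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if subnets[a].overlaps(subnets[b]):
                problems.append(f"{a} overlaps {b}")
    return problems


print(check_plan(SUBNETS, FIREWALLS))
```

An empty list means the two boxes can reach each other over the bridge and none of the subnets collide, which is the precondition for everything that follows.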

I have the following goals:

When both Building-Office and Building-Cafe have working Internet, the subnets associated with each building should use the closest Internet connection.
When one Internet connection is down (for whatever reason) but that building's pfSense box is still running, I want to route all traffic from the failing building to the other building over the wireless bridge.
When failing over I do not want to touch individual client settings (for example, I do not want to change the DHCP servers to set new gateways).
Failing over should not otherwise decrease the security of isolating subnets (which is why we have different subnets in the first place).
I should be able to launch the failover relatively easily (by "flipping a switch").
I should be able to choose specific subnets to fail over if I want. For example, maybe I want to fail over LAB but not WORKSHOP. I think I always want to fail over STAFF, however.

I have a feeling that any competent network admin could set up pfSense to accomplish these goals within minutes (which is one reason I have felt intimidated about asking this question online). It took me YEARS to get something working properly, so I want to document the procedure that works for me in the hopes that other people can learn from my incompetence.

As it turns out, there is one more important consideration in our situation:

In Firewall -> NAT -> Outbound, we use Manual Outbound NAT generation, and specify a rule for each subnet going out its WAN interface. We disable automatic NAT because it seems to interfere with some of our VPN software (namely Hamachi).

Non-Solutions

Most failovers for pfSense talk about CARP, but I think that applies when you have multiple pfSense boxes monitoring the same Internet connection. We have two pfSense boxes monitoring two different Internet connections, with different associated subnets.
Similarly there is some functionality called "Virtual IPs", but I never figured out how they worked or whether they would solve my problem.
I think that gateway groups (in System -> Routing) might be useful for automatic failover but they do not solve the problem on their own.

Phase 1: Failing over STAFF

Failing over STAFF across the wireless link is relatively easy. The key is to define some new gateways under System -> Routing -> Gateways in the pfSense web interface:

On pfsense-office, make a gateway called GW_CAFE. This should use the STAFF interface, and have the gateway IP address of pfsense-cafe (in this example 192.168.100.253).
Similarly, on pfsense-cafe, make a gateway called GW_OFFICE, also on the STAFF interface, with a gateway IP of 192.168.100.254.

At this point failover across the wireless for the STAFF interface should be possible. Say that the internet connection goes down at Building-Office. Then to fail over STAFF to Building-Cafe, do the following:

On pfsense-office, in System -> Routing -> Gateways, change the default gateway from GW_WAN to GW_CAFE.
If that is not enough to make failover work, also disable GW_WAN.

If for some reason you have different sets of firewall rules for the STAFF interfaces be aware that the rules for the pfsense-cafe STAFF interface will apply during failover.

Phase 2: Failing over other subnets

This is where things get tricky. The wireless link is on the STAFF network, so we need to route other traffic via that interface. Here are the broad steps:

Set up an alias with the subnets to fail over.
Set up routes to those subnets on the failover pfSense box.
Set up rules on the STAFF interface of the failover pfSense box to allow traffic from those subnets.
Fix manual outbound NAT rules.

Let's set up the failover for the LAB and WORKSHOP subnets over pfsense-cafe.

On pfsense-cafe, set up an alias called office_subnet_failover. It should consist of two networks: WORKSHOP (10.10.10.0/24) and LAB (10.20.0.0/24).

On pfsense-cafe, go to System -> Routing -> Routes. Make a new static route with the destination network office_subnet_failover and the gateway GW_OFFICE. Make the description descriptive: "Fail over pfsense-office subnets."
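The reason for the static route is the return path: replies for LAB and WORKSHOP clients arrive at pfsense-cafe, which has no idea where those subnets live unless told. Here is a rough Python sketch of a longest-prefix-match lookup to illustrate. The routing table is a simplified model, not pfSense output, and I am assuming the WAN gateway on pfsense-cafe is also called GW_WAN.

```python
import ipaddress

# Simplified model of pfsense-cafe's routes after adding the static route.
# "GW_WAN" as the name of the default gateway on pfsense-cafe is an assumption.
ROUTES_WITH_STATIC = [
    (ipaddress.ip_network("0.0.0.0/0"), "GW_WAN"),
    (ipaddress.ip_network("10.10.10.0/24"), "GW_OFFICE"),  # WORKSHOP
    (ipaddress.ip_network("10.20.0.0/24"), "GW_OFFICE"),   # LAB
]
# The same table without the failover static route.
ROUTES_WITHOUT_STATIC = ROUTES_WITH_STATIC[:1]


def next_hop(routes, destination):
    """Pick the gateway of the most specific route containing destination."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, gw) for net, gw in routes if addr in net]
    if not matches:
        return None
    # Longest prefix wins: a /24 beats the /0 default route.
    return max(matches, key=lambda m: m[0].prefixlen)[1]


# A reply headed back to a LAB client at 10.20.0.5:
print(next_hop(ROUTES_WITH_STATIC, "10.20.0.5"))
print(next_hop(ROUTES_WITHOUT_STATIC, "10.20.0.5"))
```

With the static route, the reply goes back across the bridge via GW_OFFICE. Without it, the only match is the default route, so the reply would be sent out the WAN and never reach the client.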

On pfsense-cafe, in Firewall -> Rules -> STAFF make a firewall rule:

Action: Pass
Interface: STAFF
Protocol: any
Source: office_subnet_failover
Destination: any

Be careful! If you are using subnet isolation then you want to put this rule after your isolation rules so that LAB and WORKSHOP clients cannot access STAFF resources.
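Rule order matters because pfSense evaluates the rules on an interface from top to bottom and the first match wins. Here is a toy Python model of that behavior. The isolation rule shown is hypothetical (your own isolation rules will look different), and the rule tuples are my own simplification, not a pfSense format.

```python
import ipaddress

FAILOVER = [ipaddress.ip_network("10.10.10.0/24"),
            ipaddress.ip_network("10.20.0.0/24")]
# Networks the failed-over clients should not reach (hypothetical isolation rule).
PROTECTED = [ipaddress.ip_network("192.168.100.0/24"),
             ipaddress.ip_network("172.19.19.0/24")]

# Each rule is (action, source networks, destination networks); None means "any".
ISOLATE = ("block", FAILOVER, PROTECTED)
ALLOW_FAILOVER = ("pass", FAILOVER, None)

RULES_GOOD = [ISOLATE, ALLOW_FAILOVER]  # isolation first, then the pass rule
RULES_BAD = [ALLOW_FAILOVER, ISOLATE]   # pass rule shadows the isolation rule


def matches(address, networks):
    """True if networks is None ("any") or address falls in one of them."""
    if networks is None:
        return True
    addr = ipaddress.ip_address(address)
    return any(addr in net for net in networks)


def evaluate(rules, src, dst):
    """Return the action of the first matching rule; default is block."""
    for action, sources, destinations in rules:
        if matches(src, sources) and matches(dst, destinations):
            return action
    return "block"


print(evaluate(RULES_GOOD, "10.20.0.5", "192.168.100.10"))
print(evaluate(RULES_BAD, "10.20.0.5", "192.168.100.10"))
```

In the good ordering a LAB client is blocked from a STAFF address but can still reach the Internet. In the bad ordering the pass rule matches first and the isolation rule is never consulted.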

If you have manual NAT, go to Firewall -> NAT -> Outbound and make a NAT rule:

Interface: WAN
Protocol: any
Source: office_subnet_failover
Destination: any
Address: Interface Address
Static-port: checked

If you do not have this rule then there will be no NAT for outgoing packets on LAB and WORKSHOP, and the destinations will try to return packets to your internal subnets instead of the actual IP address of pfsense-cafe.
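To make the NAT point concrete, here is a small Python sketch of what the outbound rule does to a packet's source address. The WAN address 203.0.113.10 is made up (it comes from a documentation-only range), and the function is an illustration of source NAT with static port, not how pfSense implements it.

```python
import ipaddress

WAN_ADDRESS = "203.0.113.10"  # hypothetical WAN address of pfsense-cafe

OFFICE_SUBNET_FAILOVER = [ipaddress.ip_network("10.10.10.0/24"),
                          ipaddress.ip_network("10.20.0.0/24")]


def outbound_nat(src_ip, src_port, nat_sources):
    """Return the (source IP, source port) a packet leaves the WAN with.

    If the source matches a NAT rule, the address is rewritten to the WAN
    address. The port is left alone, mirroring the static-port setting.
    """
    addr = ipaddress.ip_address(src_ip)
    if any(addr in net for net in nat_sources):
        return (WAN_ADDRESS, src_port)
    # No matching rule: the packet leaves with its private source address.
    return (src_ip, src_port)


# With the rule, replies come back to pfsense-cafe's WAN address.
print(outbound_nat("10.20.0.5", 51000, OFFICE_SUBNET_FAILOVER))
# Without the rule, the private address leaks out and replies have nowhere to go.
print(outbound_nat("10.20.0.5", 51000, []))
```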

As far as I know, this is sufficient to get failover working (in one direction). You can set up a similar set of rules on pfsense-office to fail over the CAFE subnet.

You then "flip the switch" on the failover in the same way as in Phase 1: Make the appropriate gateway the default, and disable the other one.

If you are paranoid you can also disable the failover firewall rules until it is failover time, and then enable them to make failover work. But this adds additional steps to flipping the switch.
