Ubiquiti offers dual-WAN functionality on their professional line of UDM security gateways in either failover or load balancing mode. This can be an important feature for mission-critical networks and businesses, but upon setting up a device for failover recently, I discovered an interesting behaviour for the device.
To illustrate, take a scenario with two sites, SiteA and SiteB, each with its own internet service provider (ISP). SiteA uses its own ISP as the primary connection and, through dual-WAN failover mode, uses the internet connection shared with SiteB as the backup. This gives SiteA two networks, one per internet provider: 1.0.0.0/29 and 2.0.0.0/29 respectively.
Given the scenario above, where SiteA's primary connection is 1.0.0.2 with a backup of 2.0.0.2 in failover mode, one could assume (as I did) that the device only activates the backup gateway when the primary interface fails. Since it's called "failover", I feel this would be a safe assumption.
Under the presumption that the backup interface is only enabled upon failure of the primary, it would make sense to have records point only to the primary interface.
In this example, SiteA and SiteB each have a mail server behind their gateway, with SiteA using "Mail.SiteA" as the mail exchanger for the domain "SiteA". This host would then have an A record in DNS for the primary IP address of the Ubiquiti device.
If SiteB wants to send a message to a user on SiteA, SiteB would first query for the MX record for SiteA, to which DNS would respond with "Mail.SiteA" and subsequently the primary IP address.
The email server on SiteB would then connect to SiteA to send its EHLO handshake via the 1.0.0.0/29 network, traveling through the public WAN.
However, when SiteA sends its response to the EHLO back to SiteB, the Ubiquiti UDM with a failover connection will see that the connection came from the 2.0.0.0/29 network and, instead of following the original NAT source, will send the reply directly to the SiteB gateway via the shared connection.
This is because, despite SiteA's gateway being configured for failover-only mode, it keeps the secondary connection active at all times.
As such, the routing table of the gateway would look something similar to:
Destination    Gateway    Iface
0.0.0.0/0      1.0.0.1    pri-int
1.0.0.0/29     1.0.0.1    pri-int
2.0.0.0/29     2.0.0.1    sec-int
Default traffic still flows through the primary interface, but traffic to a more specific destination is directed to that network's gateway instead. Additionally, the NAT table does not appear to be used when a direct connection is available; this may be a bug or a feature, I'm not sure which. Either way, having both interfaces active and ignoring the NAT source interface poses a major issue: the response will come from 2.0.0.2, but SiteB expects it to come from 1.0.0.2, as that is where it originally sent the request.
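To make the "more exact destination wins" behaviour concrete, here is a minimal sketch of longest-prefix matching against the routing table above, using Python's standard ipaddress module. The address 2.0.0.3 for SiteB is an assumption for illustration only, and this shows the general routing rule, not the UDM's actual implementation.

```python
import ipaddress

# Routing table from the example above: (destination network, gateway, interface).
ROUTES = [
    (ipaddress.ip_network("0.0.0.0/0"), "1.0.0.1", "pri-int"),
    (ipaddress.ip_network("1.0.0.0/29"), "1.0.0.1", "pri-int"),
    (ipaddress.ip_network("2.0.0.0/29"), "2.0.0.1", "sec-int"),
]


def pick_route(destination):
    """Return the (network, gateway, interface) entry with the longest matching prefix."""
    addr = ipaddress.ip_address(destination)
    matches = [route for route in ROUTES if addr in route[0]]
    return max(matches, key=lambda route: route[0].prefixlen)


# 2.0.0.3 is a hypothetical address for SiteB on the shared network.
for dest in ("2.0.0.3", "8.8.8.8"):
    network, gateway, iface = pick_route(dest)
    print(f"{dest} -> {network} via {gateway} on {iface}")
```

The reply to SiteB matches 2.0.0.0/29 and leaves on sec-int, while anything outside both /29 networks falls through to the default route on pri-int.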
As such, SiteB will see this incoming response as invalid and simply drop the packet. SiteB's mail server will never receive a handshake and will (hopefully) queue the message for some time as undelivered.
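Why the packet is dropped can be sketched as a simple address and port comparison. This is a deliberately simplified model of how a TCP stack or stateful firewall matches a reply to an outstanding connection (real implementations track far more state); the SiteB address 2.0.0.3 and the source port 40000 are assumptions for illustration.

```python
from collections import namedtuple

# A simplified packet: only the fields needed for the comparison.
Packet = namedtuple("Packet", "src_ip src_port dst_ip dst_port")


def is_expected_reply(sent, received):
    """A reply is accepted only if it mirrors the addresses and ports of what was sent."""
    return (
        received.src_ip == sent.dst_ip
        and received.src_port == sent.dst_port
        and received.dst_ip == sent.src_ip
        and received.dst_port == sent.src_port
    )


# SiteB (hypothetically 2.0.0.3) opens an SMTP connection to SiteA's primary address.
outbound = Packet("2.0.0.3", 40000, "1.0.0.2", 25)

reply_via_primary = Packet("1.0.0.2", 25, "2.0.0.3", 40000)
reply_via_secondary = Packet("2.0.0.2", 25, "2.0.0.3", 40000)

print(is_expected_reply(outbound, reply_via_primary))    # True
print(is_expected_reply(outbound, reply_via_secondary))  # False
```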
To address this issue, one must disregard the "failover" terminology and presume the device operates in "load balancing" mode instead, as that is how the UDM behaves in this type of situation.
Luckily, support for redundant systems like this is a core feature of the internet and is easy to implement once you understand that "failover" isn't an all-or-nothing switch.
The DNS settings would be two A records, one for each ISP endpoint pointing to the gateway, and two MX records for the domain SiteA, with the primary hostname having a smaller number (higher preference).
Email servers trying to talk to SiteA would first try the primary interface and, if they fail to receive a response, will automatically retry the connection on the secondary interface.
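The retry behaviour can be sketched as follows. This is a rough model of how a sending mail server walks the MX list, not a real SMTP client: the preference values 10 and 20, the hostnames "Primary.SiteA" and "Secondary.SiteA", and the try_connect callback are all assumptions for illustration.

```python
# Hypothetical MX records for the domain "SiteA": (preference, hostname).
MX_RECORDS = [
    (20, "Secondary.SiteA"),
    (10, "Primary.SiteA"),
]


def deliver(mx_records, try_connect):
    """Try each mail exchanger from the lowest preference number upward.

    try_connect is a caller-supplied function that returns True on success.
    Returns the hostname that accepted the connection, or None if all failed.
    """
    for _preference, host in sorted(mx_records):
        if try_connect(host):
            return host
    return None


# Simulate the primary path being unreachable.
reachable = {"Primary.SiteA": False, "Secondary.SiteA": True}
print(deliver(MX_RECORDS, lambda host: reachable[host]))  # Secondary.SiteA
```

Sorting the tuples puts the lowest preference number first, so the primary host is always attempted before the secondary one.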
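The same technique can be applied to other protocols as well, though not with a CNAME, since a CNAME can only point to a single target. Instead, "Mail.SiteA" can be given two A records, one with the address of "Primary.SiteA" and one with the address of "Secondary.SiteA"; whether a client falls back to the second address depends on the client.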
This issue only arises on directly-attached networks. For example, 2.0.0.0/29 includes the IP addresses 2.0.0.1 through 2.0.0.6 with broadcast on 2.0.0.7. If a device sits at 2.0.0.8, the Ubiquiti gateway at SiteA will see it as an external network and will send replies via the default connection as expected.
Networks that are not directly linked in the gateway's routing table will behave correctly, with the NAT source lookup table working as expected.
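The subnet boundaries are easy to check with Python's standard ipaddress module. This short sketch only confirms the arithmetic; the candidate address 2.0.0.3 is an assumption for illustration, while 2.0.0.8 is the example from the text.

```python
import ipaddress

shared = ipaddress.ip_network("2.0.0.0/29")

# Usable host addresses and the broadcast address of the shared network.
hosts = list(shared.hosts())
print(hosts[0], "-", hosts[-1])      # 2.0.0.1 - 2.0.0.6
print(shared.broadcast_address)      # 2.0.0.7

# An address inside the /29 is directly attached; anything else uses the default route.
for candidate in ("2.0.0.3", "2.0.0.8"):
    on_link = ipaddress.ip_address(candidate) in shared
    print(candidate, "directly attached" if on_link else "routed via default")
```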