Mastering DNS Resilience: How to Assess and Ensure Network Survivability

“It’s not DNS.”
“There’s no way it’s DNS.”
“It was DNS.”

It’s as old as the internet. My age, basically.

For most people, DNS is like electricity. We don’t celebrate its very existence; we just moan and complain when it’s mysteriously absent for even a second. And yet, it seems, less and less attention is paid at the core to getting it right.

However, this is where we (ADAMnetworks) live. We protect people, and part of that protection means sparing them the frustration and disruption that happens when this stuff doesn’t work.

Let’s take a step back. In the DNS world, the original design was for resolvers to recurse right back to the root servers if and when needed.

Since most end users didn’t have recursive resolvers of their own, a standard emerged that remains in place to this day: ISPs stand up caching recursive DNS servers for their subscriber base. One dynamic we witnessed at the onset of ubiquitous connectivity and the rise of social media was nation-state-level disruption [link to Arab spring], where ISPs were required to selectively block DNS queries to social media platforms. That gave rise to Google Public DNS (8.8.8.8/8.8.4.4) [insert spray-painted 8.8.8.8 photo]. While that was a very temporary route around the blockage, it served as good marketing for fast, anycast DNS servers, and now every sysadmin defaults to one of these centralized, non-ISP DNS servers.

As the internet grew in scale, a healthy tension between not-for-profit and for-profit enterprise saw this space evolve. Take DNS encryption, for example:

Root DNS servers do not, and will not, offer DoT, DoH, or DoQ, as their official statement makes clear.

And yet, every commercially-operated “Protective DNS Resolver”, including those that are free to use, offers DNS encryption of one form or another.

Back to the purpose of this article: DNS resilience. The healthy tension described above is what has us thinking about the best approach to making sure DNS is resilient all of the time for the people we love.

Ideally, we can keep working in any of these scenarios:

  • internal networking problems with DNS
  • internal networking issues with a gateway
  • internal disruptions at any of your ISPs
  • Network transit issues within peering arrangements (including BGP disruptions)
  • Upstream Protective Resolver reachability outages

No single silver bullet addresses all of these issues, but a proven approach covers the bulk of them with these redundancies:

  • high availability of a gateway
  • high availability of endpoint-referenced DNS servers (but not by having two DNS servers used at an endpoint)
  • high availability of Active Directory DNS servers (not to be confused with the prior bullet)

A lot of these resilience efforts are within the control of the enterprise or outsourced technology partner. However, what if your DNS supply chain is attacked?

  • Public resolvers in use are down (say your ISP dropped the BGP route to your chosen resolvers)
  • Your main ISP is experiencing total failure

The approach is to ensure there isn’t a single point of failure anywhere, at any time. Let’s borrow the defining character trait of the original internet design: the ability to route around any obstacle, so that there’s redundancy everywhere. To that end, here’s a simple checklist you can use as an audit, assuming everything works correctly, including a mature, bug-free DNS resolver environment. We have taken the liberty of allocating a thoughtful weighting of points per check:

#  Audit check                                                                  Answer  Value
1  Endpoints’ assigned DNS server offers High Availability                      Yes/No   30%
2  Active Directory domain answers are available from multiple AD DNS servers   Yes/No   25%
3  Multiple resolver sets are used with DNSharmony                              Yes/No   25%
4  Multiple ISPs via separate circuits                                          Yes/No   10%
5  Resolver pairs are split between ISP connections for best resilience         Yes/No   10%
                                                                                Max:    100%

Let’s dive into each of the above with more detail and examples:

Endpoints’ assigned DNS server offers High Availability

In many networks, the convention is to use the last octet to distinguish the real hosts from the virtual host on a typical /24 network:

10.128.1.1 is NODE1
10.128.1.2 is NODE2
10.128.1.254 is the VIP (and the VIP is the designated DNS server at the endpoints)
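
How the VIP floats between the two nodes is platform-dependent (CARP on pfSense, VRRP on VyOS and Linux). As a minimal sketch, assuming VyOS with the LAN on eth2 (the group name, vrid, priority, and interface are illustrative, and syntax varies slightly by VyOS version):

set high-availability vrrp group LAN vrid 54
set high-availability vrrp group LAN interface eth2
set high-availability vrrp group LAN virtual-address 10.128.1.254/24
set high-availability vrrp group LAN priority 200

The second node carries the same configuration with a lower priority, so 10.128.1.254 is always answered by whichever node currently holds mastership.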

Using this convention, we see how the endpoint is assigned.

A *NIX terminal shows us this:

% grep nameserver /etc/resolv.conf
nameserver 10.128.1.254

Similarly in Windows, we see this:

C:\Users\David>ipconfig /all | find "DNS Servers"
   DNS Servers . . . . . . . . . . . : 10.128.1.254

Notice that the DHCP-assigned resolver is the VIP, yet all three addresses (the VIP and both real nodes) offer DNS answers from macOS/Linux:

Let’s start by doing the query to the one and only DHCP-assigned resolver:

% dig captive.apple.com +short
17.253.119.202
17.253.119.201

For good measure, let’s check directly with the real nodes behind the VIP:

% dig @10.128.1.1 captive.apple.com +short
17.253.119.202
17.253.119.201
% dig @10.128.1.2 captive.apple.com +short
17.253.119.201
17.253.119.202
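
To spot-check all three listeners in one pass, a small shell loop does the same thing:

# Query the VIP and both real nodes in turn
for ns in 10.128.1.254 10.128.1.1 10.128.1.2; do
    echo "== ${ns} =="
    dig @"${ns}" captive.apple.com +short +time=2 +tries=1
done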

And the same goes for Windows:

C:\Users\David>nslookup dns.msftncsi.com
Server:  gateway.hq.anycorp.io
Address:  10.128.1.254

Non-authoritative answer:
Name:    dns.msftncsi.com
Address:  131.107.255.255

C:\Users\David>nslookup dns.msftncsi.com 10.128.1.1
Server:  gateway1.hq.anycorp.io
Address:  10.128.1.1

Non-authoritative answer:
Name:    dns.msftncsi.com
Address:  131.107.255.255

C:\Users\David>nslookup dns.msftncsi.com 10.128.1.2
Server:  gateway2.hq.anycorp.io
Address:  10.128.1.2

Non-authoritative answer:
Name:    dns.msftncsi.com
Address:  131.107.255.255

Since all of them responded as expected, we have a 30-point score so far.

Active Directory domain answers are available from multiple AD DNS servers

An important design element in the enterprise is that endpoints never reference Active Directory DNS directly. Instead, the designated resolvers know to consult AD DNS only for domains where AD DNS is authoritative. With that understanding, here’s how this check can be validated, depending on the environment:

We use an SRV query to validate that such records exist and are answered via our single DHCP-assigned DNS server:

% dig SRV @10.128.1.254 _ldap._tcp.dc._msdcs.hq.anycorp.io

; <<>> DiG 9.18.16 <<>> SRV @10.128.1.254 _ldap._tcp.dc._msdcs.hq.anycorp.io
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 37719
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 3

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: 2f05f36be2bd44330abc73e5e534efff2afbb8c4f18200d945da8eab9aac760a (good)
;; QUESTION SECTION:
;_ldap._tcp.dc._msdcs.hq.anycorp.io. IN SRV

;; ANSWER SECTION:
_ldap._tcp.dc._msdcs.hq.anycorp.io. 600 IN SRV 0 100 389 dc1.hq.anycorp.io.
_ldap._tcp.dc._msdcs.hq.anycorp.io. 600 IN SRV 0 100 389 dc2.hq.anycorp.io.

;; ADDITIONAL SECTION:
dc1.hq.anycorp.io. 3600	IN	A	10.128.3.201
dc2.hq.anycorp.io. 3600	IN	A	10.128.3.202

;; Query time: 1 msec
;; SERVER: 10.128.1.254#53(10.128.1.254) (UDP)
;; WHEN: Sat Sep 28 14:43:36 EDT 2024
;; MSG SIZE  rcvd: 231

The important observation is whether both AD servers were consulted for the DNS query itself, which is captured in the logs of adam:ONE and/or a centralized SIEM. Notice that the query was forwarded to two (2) domain controllers:

% grep "_ldap._tcp.dc._msdcs.hq.anycorp.io" /var/log/anmuscle.log | grep "DNS>"
I 28/9 14:43:36.744308 67114 DNS> 10.128.2.100@54699 UDP4 _ldap._tcp.dc._msdcs.hq.anycorp.io SRV | 10.128.3.201@53 [timeout 5000]
I 28/9 14:43:36.744418 67114 DNS> 10.128.2.100@54699 UDP4 _ldap._tcp.dc._msdcs.hq.anycorp.io SRV | 10.128.3.202@53 [timeout 5000]
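
adam:ONE performs this conditional forwarding natively. For readers who want to validate the same design on a generic resolver, a roughly equivalent Unbound stanza (a sketch only, not the adam:ONE configuration) looks like this:

# Consult AD DNS only for the zone where it is authoritative
forward-zone:
    name: "hq.anycorp.io."
    forward-addr: 10.128.3.201
    forward-addr: 10.128.3.202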

Multiple resolver sets are used with DNSharmony

In DNSharmony, multiple resolver sets can be created and then used to harmonize answers: if any protective resolver in the set blocks an FQDN, the answer is blocked. The resolver sets are then applied to the filtering policy itself.
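
Conceptually, the harmonizing decision can be pictured like this (an illustrative sketch only, not DNSharmony’s implementation; the resolver set and the 0.0.0.0 sinkhole convention are examples):

#!/bin/sh
# If ANY protective resolver in the set blocks the name, treat it as blocked.
FQDN="$1"
for resolver in 9.9.9.9 1.1.1.2; do   # example protective resolver set
    answer=$(dig @"$resolver" "$FQDN" +short +time=2 +tries=1 | head -n1)
    # An empty answer (NXDOMAIN) or a 0.0.0.0 sinkhole counts as a block
    if [ -z "$answer" ] || [ "$answer" = "0.0.0.0" ]; then
        echo "$FQDN: blocked (per $resolver)"
        exit 1
    fi
done
echo "$FQDN: allowed"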

Multiple ISPs on separate circuits

Let’s use VyOS as an example where a load-balanced WAN setup can be verified:

vyos@vyos:~$ show wan-load-balance
Interface:  eth0
Status:  failed
Last Status Change:  Tue Sep 26 20:12:19 2024
-Test:  ping  Target:
    Last Interface Success:  55s
    Last Interface Failure:  0s
    # Interface Failure(s):  5

Interface:  eth1
Status:  active
Last Status Change:  Tue Sep 26 20:06:42 2024
+Test:  ping  Target:
    Last Interface Success:  0s
    Last Interface Failure:  6m26s
    # Interface Failure(s):  0

Using the above example, WAN1 runs over eth0 and WAN2 runs over eth1, and as long as the upstream resolver pairs are split-routed, no DNS outage is experienced.
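
For reference, a configuration along these lines produces the output above (a sketch; interface names, next-hop addresses, and ping targets are placeholders, and syntax varies by VyOS version):

set load-balancing wan interface-health eth0 nexthop 203.0.113.1
set load-balancing wan interface-health eth0 test 10 type ping
set load-balancing wan interface-health eth0 test 10 target 9.9.9.9
set load-balancing wan interface-health eth1 nexthop 198.51.100.1
set load-balancing wan interface-health eth1 test 10 type ping
set load-balancing wan interface-health eth1 test 10 target 149.112.112.112
set load-balancing wan rule 10 inbound-interface eth2
set load-balancing wan rule 10 interface eth0
set load-balancing wan rule 10 interface eth1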

Resolver pairs are split between ISP connections for best resilience

When more than one ISP is being used, the DNS resolver sets should also be distributed across them (in most cases). Here’s an example of how the two resolvers of the same set are split between two ISPs (9.9.9.9 first):

                                    My traceroute  [v0.95]
gateway1.hq.anycorp.io (10.0.0.206) -> 9.9.9.9 (9.9.9.9)         2024-10-02T15:40:38-0400
Keys:  Help   Display mode   Restart statistics   Order of fields   quit
                                                      Packets               Pings
 Host                                               Loss%   Snt   Last   Avg  Best  Wrst StDev
 1. 10.0.0.1                                         0.0%     4    1.3   1.4   1.1   1.8   0.3
 2. pool-99-242-108-1.cpe.net.cable.rogers.com       0.0%     3   16.7  16.7  16.6  16.8   0.1
 3. 8079-dgw02.lndn.rmgt.net.rogers.com              0.0%     3   14.8  15.7  14.8  17.0   1.1
 4. unallocated-static.rogers.com                    0.0%     3   16.5  17.6  16.5  18.8   1.2
 5. 209.148.235.214                                  0.0%     3   18.2  21.5  18.2  26.8   4.7
 6. ae58.bar3.Toronto1.Level3.net                    0.0%     3   34.7  33.4  31.1  34.7   2.0
 7. ae2.3601.ear1.NewYork6.net.lumen.tech           50.0%     3   31.8  31.8  31.8  31.8   0.0
 8. 42-3356-nyc.sp.lumen.tech                        0.0%     3   34.5  35.0  34.5  36.0   0.8
 9. dns9.quad9.net                                   0.0%     3   36.2  34.3  32.1  36.2   2.1

Second, let’s review the path to 149.112.112.112:

                                    My traceroute  [v0.95]
gateway1.hq1.anycorp.io (10.0.3.201) -> 149.112.112.112 (149.112.112.112)  2024-10-02T15:51:03-0400
Keys:  Help   Display mode   Restart statistics   Order of fields   quit
                                                      Packets               Pings
 Host                                               Loss%   Snt   Last   Avg  Best  Wrst StDev
 1. 10.0.3.1                                         0.0%     7    0.4   0.3   0.3   0.4   0.0
 2. 216.181.8.129                                    0.0%     7   13.2  11.5   7.9  13.9   2.3
 3. 216.181.206.66                                   0.0%     6   15.5  19.8  15.2  29.2   5.8
 4. 216.181.209.65                                   0.0%     6   12.7  16.4  11.8  20.8   3.6
 5. 216.254.130.174                                  0.0%     6   15.8  16.6  14.4  20.6   2.1
 6. xe-1-1-1-991.ipr02.mtlcnds.distributel.net       0.0%     6   15.1  14.8  11.7  16.7   2.1
 7. 206.80.255.6                                     0.0%     6   17.0  16.3  11.5  19.8   2.7
 8. equinixmetal-b.ip4.torontointernetxchange.net    0.0%     6   20.0  17.4  15.5  20.0   1.7
 9. (waiting for reply)
10. (waiting for reply)
11. paz1.qyyz2                                       0.0%     6   16.1  16.6  14.8  17.8   1.1
12. dns.quad9.net                                    0.0%     6   15.5  15.3  13.8  18.1   1.6
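
One simple way to accomplish this split, assuming VyOS and placeholder next-hop addresses, is a host route per resolver, each via a different ISP’s gateway:

# Pin each resolver of the pair to a different upstream circuit
set protocols static route 9.9.9.9/32 next-hop 203.0.113.1
set protocols static route 149.112.112.112/32 next-hop 198.51.100.1

If one circuit fails, the resolver pinned to it becomes unreachable, while its partner remains reachable over the surviving circuit.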

Standard network monitor settings

Finally, if and when such redundancies fail at any stage, there must be a monitoring instance that alerts the network engineering team to the failure. What better way to test than to repeatedly confirm that the DNS services are running on each node?

For this reason, our managed clients have a listener on localhost which can be systematically monitored by querying the domain mytools.management. It resolves to the LAN interface from which it was queried, so any non-answer, or a public-IP answer, means the DNS services are failing. This can be done by integrating tools such as Zabbix or Cronitor alerts.
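
As a sketch of what such a probe can look like (illustrative; only the localhost listener and mytools.management come from our actual setup):

#!/bin/sh
# Alert when the local listener fails to answer, or answers with anything
# other than a private (RFC 1918) LAN address.
answer=$(dig @127.0.0.1 mytools.management +short +time=2 +tries=1 | head -n1)
case "$answer" in
    10.*|192.168.*|172.1[6-9].*|172.2[0-9].*|172.3[01].*)
        exit 0 ;;   # healthy: resolved to a LAN interface as expected
    *)
        echo "DNS check failed: answer was '$answer'"
        exit 1 ;;   # no answer, or a public IP: raise an alert
esac

The exit code can drive a Cronitor heartbeat directly, or the equivalent query can be expressed as a Zabbix net.dns agent item pointed at the listener.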

Final notes

One more feature of adam:ONE, the on-premise caching resolver service, is automatic back-off for non-responsive resolvers. This allows for redundancy without sacrificing performance. Most importantly, it facilitates resilience even through typical supply-chain problems (ISPs, public resolvers, etc.).

Update

Here is a diagram that shows a sample of a fully-resilient network meeting all the requirements above:

DNSresilience.drawio

Thank you David, interesting article. One thing that I see as a potential single point of failure is the single software resolver on the gateway, which can fail without knowing it has failed, and therefore can’t trigger a CARP failover either. I would like to see redundant muscles per node that cross-check one another, or one node checking the responsiveness of another and answering the client’s query if the other node isn’t doing its job. Or a combination of all of these.

Usually even a node that isn’t properly resolving can still route traffic, so the CARP failover need not happen; but at least the client’s DNS query would be answered so the connection can occur. It would be a little engineering feat to make this happen, but I see this as one of the last big holdouts for having a truly resilient DNS system.

That is a good point, and that is what the standard network monitor catches. In the case of managed services, every minute the monitoring service executes a net.dns check which queries localhost (on pfSense we alias 127.0.0.2 for this purpose) for mytools.management and generates an alert if it fails. This runs on each node, so if and when an outage happens, the following sequence already takes place:

  • Ticket is auto-generated
  • Self-healing / service restart is attempted automatically
  • Ticket is marked resolved when name resolution is successful again

In most cases, the above process is triggered when there are network changes that impact the network interfaces of listeners.

It goes without saying that this does not and cannot cover endpoint queries should there be any layer 1 or 2 issues on an internal segment, but in principle, nothing stops a premises from deploying additional internal monitors to cover internal networking issues.

Automatic failover on DNS resolution issues is possible when running adam:ONE on VyOS, which supports health-check scripts. For example, a /config/scripts/vrrp-check.sh script could verify that anmuscle is operating correctly and, if not, force a fault state.
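
A minimal sketch of such a script, reusing the mytools.management check described above (the exact health test and listener address are assumptions):

#!/bin/sh
# /config/scripts/vrrp-check.sh (sketch): exit non-zero to force the VRRP
# group into a fault state when anmuscle stops answering locally.
answer=$(dig @127.0.0.1 mytools.management +short +time=2 +tries=1)
[ -n "$answer" ] || exit 1   # no answer: signal a fault so VRRP fails over
exit 0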
