I ran into an issue where some DNS resolvers were not getting data from our authoritative DNS server. Using a site like dnschecker.org, I saw roughly 30% of the nodes reporting SERVFAIL, and not just far-flung international nodes: AT&T (USA) consistently reported failures. The other 70% (OpenDNS, Google, etc.) returned the expected records.
Everything looked fine at the firewall: no blacklisted IPs and no filtering. BIND was healthy and answering queries. Nothing strange at the ISP level either.
The cause turned out to be that the name servers we had set at the registrar were aliases (CNAMEs).
i.e.
nameserver at registrar = ns1.domain.net
ns1.domain.net IN CNAME ns1.domain.com
ns1.domain.com IN A 1.1.1.1
Switching them to A records resolved the issue (i.e., changing ns1.domain.net to IN A 1.1.1.1).
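To make the before/after difference concrete, here is a toy Python model of the two setups. The names and the 1.1.1.1 address come from the records above; the "strict" resolver (one that refuses to chase a CNAME when looking up a name server's address) versus the "lenient" one is a hypothetical simplification for illustration, not a description of how AT&T's or anyone else's resolvers are actually implemented.

```python
from typing import Dict, Optional, Tuple

# (name, record type) -> value; a hypothetical stand-in for the zone data.
Zone = Dict[Tuple[str, str], str]

# Original setup: the registrar's NS name is a CNAME to another host.
BEFORE: Zone = {
    ("ns1.domain.net", "CNAME"): "ns1.domain.com",
    ("ns1.domain.com", "A"): "1.1.1.1",
}

# Fixed setup: the registrar's NS name has its own A record.
AFTER: Zone = {
    ("ns1.domain.net", "A"): "1.1.1.1",
    ("ns1.domain.com", "A"): "1.1.1.1",
}


def resolve_ns_address(zone: Zone, ns_name: str, follow_cname: bool,
                       max_hops: int = 8) -> Optional[str]:
    """Look up the A record for a name server name.

    Returns the address, or None to stand in for SERVFAIL.
    follow_cname=False models the hypothetical strict resolver.
    """
    name = ns_name
    for _ in range(max_hops):
        addr = zone.get((name, "A"))
        if addr is not None:
            return addr
        target = zone.get((name, "CNAME"))
        if target is None or not follow_cname:
            return None  # no usable address for this name server
        name = target  # lenient resolver: chase the alias
    return None  # CNAME chain too long or looping


if __name__ == "__main__":
    for label, zone in (("CNAME setup", BEFORE), ("A record setup", AFTER)):
        for strict in (True, False):
            addr = resolve_ns_address(zone, "ns1.domain.net", follow_cname=not strict)
            mode = "strict" if strict else "lenient"
            print(f"{label}, {mode}: {addr or 'SERVFAIL'}")
```

Under these assumptions only the CNAME setup combined with a strict resolver prints SERVFAIL; every other combination prints 1.1.1.1, which mirrors the split observed on dnschecker.org.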
My question is: is a CNAME at that level against spec, with most DNS servers simply tolerating it, or is it just some weird bug in the "problem" servers?
I'm just trying to understand the underlying issue. I'll probably never see this again, and I'll avoid CNAMEs for name servers from now on. And yeah, I know I could have just pointed the registrar to ns1.domain.com, and that's probably the more efficient way to do it. But this was a legacy domain, and it was easier to edit the DNS record than to change the registrar settings (and wait out whatever that propagation time is).