A few months ago, I registered an additional, shorter domain name for my blog, matveychuk.com, and was tweaking the DNS configuration so that users could access the site using both domains. My DNS provider supports the URL record type, so from the technical perspective, that was a piece of cake. I just created the redirect records for the bare domain name and the www subdomain, and was ready to go.
As I hadn’t expected it to be so easy and had planned to spend more time on it, I decided to take a sneak peek at DNSSEC, an extension to regular DNS that allows a DNS client to verify that the query results weren’t tampered with in transit.
Of course, there is a complex process behind the scenes for ensuring the authenticity of a DNS record, but it is not very relevant for this post.
Both DNSimple and Cloudflare have automated the configuration of DNSSEC for a domain and made it quite easy to set up, even for a customer who is not very technically proficient. So, after half an hour, the DS record was in place, and the Cloudflare control panel indicated that DNSSEC was successfully configured. The task was done, closed, and forgotten.
First signs of trouble
In the meantime, my superior (Hi Dan! 😉) reached out to me about my site being unavailable. That was quite unusual because I hadn’t heard any such complaints up to that moment. A quick check confirmed that the site was up and running ‘on my PC’ and could also be accessed by external services like Pingdom. I assumed that the issue might be temporary and specific to the particular ISP in Norway that my colleague used at his home office. As confirmation of my assumption, the affected colleague was able to reach my site when connected to the corporate network via a VPN. From the IT Ops point of view, the workaround was found, the incident was resolved, and the user could use the service again.
The problem continues
A month or so after the first occurrence of the issue, I started receiving similar complaints from some of my readers. One person from New Zealand reported that my site seemed to be down. Once again, a set of quick tests indicated that the site was running and reachable via both my landline and cellular providers. It seemed like yet another network fluctuation.
In another case, the reader who couldn’t access my site was located in NJ, USA, but that time I was able to get a bit of client diagnostic information, and it showed that the client computer couldn’t resolve my site’s DNS name. At the same time, the reader could successfully reach other services on the Internet.
That really started to worry me. I decided to go the extra mile and set up ongoing availability monitoring, along with instrumenting the site with Azure Application Insights. Still, a couple of days later, I had no leads on the root cause of the problem.
The moment of truth
Last week, my teammate, Jon, notified me about an issue with all the same symptoms I had observed before. After checking his client DNS configuration, I noticed that he was using Google Public DNS on his home router. A lookup for 'andrewmatveychuk.com' using the Google DNS server IPs returned a generic non-existing domain error. Knowing how many system administrators, as well as home users, set up Google Public DNS for their name resolution, the scale of the problem made my palms sweat. I was also confused about why my blog’s domain name was resolved successfully by my ISP’s name servers and by services like Pingdom and Azure, but not by Google DNS.
Five minutes of searching on the Internet pointed me in the direction of checking the DNSSEC configuration for my domain. Test queries on https://dns.google.com/ confirmed that the domain name resolved successfully without DNSSEC validation and failed with it.
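The same comparison can be scripted against the Google Public DNS JSON API. The sketch below is only an illustration and not the exact check I ran: it assumes the `https://dns.google/resolve` endpoint with its `cd` ("checking disabled") parameter, uses `example.com` as a placeholder domain, and the `diagnose` helper with its messages is my own naming.

```python
import json
import urllib.parse
import urllib.request

# Google Public DNS JSON API endpoint (assumed reachable from your network).
GOOGLE_DOH = "https://dns.google/resolve"


def build_url(name, rtype="A", checking_disabled=False):
    """Build a query URL; cd=1 asks the resolver to skip DNSSEC validation."""
    params = {"name": name, "type": rtype}
    if checking_disabled:
        params["cd"] = "1"
    return GOOGLE_DOH + "?" + urllib.parse.urlencode(params)


def query_status(name, checking_disabled=False, timeout=5):
    """Return the DNS response code reported in the JSON 'Status' field."""
    url = build_url(name, checking_disabled=checking_disabled)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)["Status"]


def diagnose(validated_status, unvalidated_status):
    """Hypothetical helper: interpret the two response codes (0 = NOERROR)."""
    if validated_status == 0:
        return "resolves with validation"
    if unvalidated_status == 0:
        return "likely DNSSEC validation failure"
    return "fails regardless of DNSSEC"


if __name__ == "__main__":
    domain = "example.com"  # placeholder, replace with your own domain
    with_validation = query_status(domain)
    without_validation = query_status(domain, checking_disabled=True)
    print(domain, "->", diagnose(with_validation, without_validation))
```

A name that only resolves when validation is switched off is a strong hint that the records themselves are fine and the signatures or the DS record are not.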
The DNSSEC check with specialized tools confirmed that I had a malfunction in my configuration, and the chain of trust was broken. In other words, all DNS servers that properly validate DNSSEC distrusted my domain name and were ‘not able’ to resolve it. It meant that I was seriously busted.
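To illustrate what a broken chain of trust can mean in practice: one common cause is a DS record at the parent zone that no longer matches the DNSKEY published by the zone itself. I never established that this was what happened in my case, so treat the following as a minimal sketch of that single check, based on the key tag and SHA-256 digest definitions in RFC 4034. The function names and the sample key bytes are made up for the illustration.

```python
import hashlib
import struct


def name_to_wire(name):
    """Encode a domain name in canonical (lowercase) DNS wire format."""
    stripped = name.lower().rstrip(".")
    if not stripped:
        return b"\x00"  # the root zone
    wire = b""
    for label in stripped.split("."):
        raw = label.encode("ascii")
        wire += bytes([len(raw)]) + raw
    return wire + b"\x00"


def dnskey_rdata(flags, protocol, algorithm, public_key):
    """Assemble DNSKEY RDATA: flags, protocol, algorithm, then the raw key."""
    return struct.pack("!HBB", flags, protocol, algorithm) + public_key


def key_tag(rdata):
    """Key tag as described in RFC 4034, Appendix B."""
    total = 0
    for i, byte in enumerate(rdata):
        total += byte if i & 1 else byte << 8
    total += (total >> 16) & 0xFFFF
    return total & 0xFFFF


def ds_digest_sha256(owner, rdata):
    """DS digest (digest type 2): SHA-256 over owner name plus DNSKEY RDATA."""
    return hashlib.sha256(name_to_wire(owner) + rdata).hexdigest()


def ds_matches(owner, rdata, ds_key_tag, ds_digest_hex):
    """True if the parent's DS record points at this DNSKEY."""
    return (key_tag(rdata) == ds_key_tag
            and ds_digest_sha256(owner, rdata) == ds_digest_hex.lower())


if __name__ == "__main__":
    # Made-up key material, for illustration only.
    rdata = dnskey_rdata(257, 3, 13, bytes(range(64)))
    print(key_tag(rdata), ds_digest_sha256("example.com", rdata))
```

If the DS record at the registrar was generated from a key the zone no longer serves, `ds_matches` returns False, and validating resolvers refuse to answer for the whole domain.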
As it was almost impossible to figure out what I had done wrong when configuring DNSSEC a few months earlier, I deleted the existing configuration and verified that Google DNS started resolving my domain name. After that, I went through the process of setting up DNSSEC once more, but this time I ran all the checks to ensure that it was correct and actually worked.
Considering that the content on my blog was regularly crawled by the Google search engine and the new posts were indexed according to Google Search Console and the search results, I might assume that the described issue didn’t affect the search ranking. However, from the end-user point of view, for a significant portion of my readers, the experience of not getting the information they were looking for was quite unpleasant. Correlating the life span of the issue with the decrease in active users in Google Analytics over the same period, it might have affected almost one-third of my audience, which is a lot.
Could that issue have been prevented? A good question. From the technical perspective, the issue reminded me of the importance of having end-user experience monitoring for your services. Even if all ‘server-side’ metrics and logs seem fine, the consumers might still be affected by some weird bug or configuration issue, as in my example.
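One cheap form of such monitoring is to resolve the domain through several public resolvers instead of only the one your own machine happens to use. The sketch below is not what I run in production; it is a minimal standard-library illustration that sends a plain A query over UDP to a few well-known resolver IPs and reports the response code. The resolver list, the `probe` helper, and the `example.com` placeholder are my own choices, and the code skips niceties such as matching the response ID or retrying over TCP.

```python
import socket
import struct

# Well-known public resolvers; extend the list with the ones your readers use.
RESOLVERS = {"Google": "8.8.8.8", "Cloudflare": "1.1.1.1", "Quad9": "9.9.9.9"}
RCODES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN"}


def build_query(name, query_id=0x1234, qtype=1):
    """Build a minimal DNS query packet (recursion desired, class IN)."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(p)]) + p.encode("ascii") for p in labels) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def parse_rcode(response):
    """Extract the 4-bit response code from the DNS header flags."""
    flags = struct.unpack("!H", response[2:4])[0]
    return flags & 0x000F


def probe(name, server, timeout=3.0):
    """Ask one resolver for the A record and describe the outcome."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (server, 53))
            data, _ = sock.recvfrom(512)
        except OSError:
            return "TIMEOUT/UNREACHABLE"
    if len(data) < 12:
        return "MALFORMED"
    code = parse_rcode(data)
    return RCODES.get(code, f"RCODE {code}")


if __name__ == "__main__":
    domain = "example.com"  # placeholder, replace with your own domain
    for label, ip in RESOLVERS.items():
        print(f"{label:<12}{ip:<10}{probe(domain, ip)}")
```

Run on a schedule, a check like this would flag the situation I was in: some resolvers answering normally while others return an error for the very same name.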
Currently, I’m considering options for improving in that area for my blog. So, stay tuned and subscribe to the latest updates on my blog 👇 to follow up on this postmortem!