NAS Outage #1
Incident scope: NAS user not able to access Web Interface for NAS
Incident duration: At least 30 hours, up to 72 hours
Incident resolution: Change router port
Last weekend I was not at home. Some time on Saturday afternoon, I wanted to access some of the self-hosted services I run on my NAS. I entered the URL and waited. I knew that sometimes the first request takes some time (maybe the NAS powers down the hard disks), so after waiting a while, I just hit refresh. Still nothing. I tried another service. Nothing. I tried all of them. Nothing. I tried going to the IP address of the NAS directly. I got the default page from the web server, which comes up when I don't use the right hostname.
Okay, we have a problem. My first suspicion is that the DNS of my domain is broken somehow. I do a DNS query with dig. I freak out a bit when I see two IP addresses there, but after some searching, I find that one of them belongs to my DNS provider and is needed because I have DynamicDNS. I log into my provider, Namecheap, from whom I bought the domain and who manages my DNS. Everything looks normal.
Then I try to ssh into the NAS. This works, although it seems to be pretty slow. I look around in the logs and find nothing suspicious. I restart the nginx server. I restart several other packages on the NAS that I think might have something to do with this. Nothing.
Then I think about trying to run curl to download the main page from my NAS. And lo and behold, I see the HTML of the login page to the Web Interface appear. Okay, let's try again in the browser. After several minutes, the loading stops, but nothing shows up. I check out the DOM Inspector and indeed the elements do show up there as well. What? There are no errors in the console, but the network tab does show that everything is really slow.
After several hours of investigation, I give up and enjoy my time with my in-laws. I put my SRE hat back on only after I get back home the next day. Surprisingly, I still can't connect, even from the LAN. It's actually a bit better, because some things open, but it's very flaky. And even more surprisingly, I discover that I had Monica open in my phone browser and I can navigate it there! What is going on?
I had recently moved the NAS to another room and I had tested the speed of my home network with iperf3. I decided to test it again. On the NAS I ran the following command:
sudo docker run -it --rm -p 5201:5201 networkstatic/iperf3 -s
And on my desktop I ran:
iperf3 -c 192.168.100.15 -p 5201 -t 30
The first time I tested it, in the first room, it was around 900 Mbits/sec.
After I moved it to the other room, it dropped to around 500 Mbits/sec, but I assumed that was because of the lower quality cable that had been placed in the wall. But now, when I ran it again, I got around 40-50 Mbits/sec, sometimes even 0 Mbits/sec for several seconds. I ran the test several times and then I noticed something that looked suspicious: a column called "Retr" with values like "165", "229", "66", "160" in it. "Retr" looks very much like "Retries". My hunch was confirmed when I looked in the iperf3 manual: some TCP packets are being retransmitted, which means packets are being dropped or corrupted somewhere along the way.
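To compare runs without eyeballing the whole table, the total can be pulled out of the report. This is a minimal sketch, assuming the usual iperf3 client text output, where the summary line ending in "sender" carries the total in the "Retr" column. The function name iperf3_total_retr is made up for illustration.

```shell
# Print the total TCP retransmit count from an iperf3 client report on stdin.
# Assumption: the summary line ends with "sender" and the field just before
# it is the Retr total. With parallel streams the last such line is [SUM].
iperf3_total_retr() {
    awk '$NF == "sender" { retr = $(NF-1) } END { print retr }'
}
```

With that defined, the earlier test could be run as: iperf3 -c 192.168.100.15 -p 5201 -t 30 | iperf3_total_retr. A healthy wired LAN should print 0 or something close to it.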
netstat -i confirms it: I have errors, and only among the received packets. Searching for this issue reveals that the most common cause is a bad cable. I had crimped the Ethernet cable in the other room when I moved the NAS there, so I thought that was the problem. I recrimped it, but no luck. Then I thought that maybe the other end was bad. Nope. Then I tried plugging it into the other empty LAN port on the router. Bam. No more packet retries. Speed between my desktop and NAS is back to 900 Mbits/sec. I can access all the services. Outage over. The problem was a faulty port on my ISP-provided router.
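The same counters can be watched directly while swapping cables and ports. This is a rough sketch, assuming a Linux-based NAS that exposes per-interface statistics under /sys/class/net. The function name rx_counters and the interface name eth0 are placeholders, not something from my setup.

```shell
# Print receive-side counters for one network interface from Linux sysfs.
# Usage: rx_counters IFACE [SYSFS_NET_DIR]
# Assumption: the kernel exposes /sys/class/net/IFACE/statistics/*.
rx_counters() {
    iface=$1
    dir=${2:-/sys/class/net}/$iface/statistics
    for name in rx_packets rx_errors rx_dropped rx_crc_errors; do
        printf '%s %s\n' "$name" "$(cat "$dir/$name")"
    done
}
```

Running rx_counters eth0 before and after an iperf3 test shows whether rx_errors keeps climbing. If it stays flat after a change, that change probably fixed the link.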
This was the biggest outage I have had so far with my NAS. All previous ones were several hours long at most, when the internet went out. I have a UPS, so even short electrical outages don't affect it. I didn't like the router I had from my ISP, and now I have an even bigger reason to get it replaced as soon as possible. A lesson I learned from this is that intermittent network errors can cause very weird issues, which can be partially masked because TCP has a lot of redundancy and retries built in.