The month is January. Goals and plans are warmer than ever; the weather is as cold as it can get. I cannot hyperfixate on problems endlessly during the winter: it’s cold, and it’s distracting.

Anyway, the team was on their usual sprint - not to combat the cold, just sprints on Jira. Amid it all, I got pinged about a peculiar response from nginx: 421 Misdirected Request. It was my first encounter with status code 421.

Then I searched for the definition of 421 and the conditions that produce it. RFC 9110 says a client may retry the request over a different connection. So I asked the team to check for changes that could have affected their HTTP client or their payload, and to my surprise there were none.
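As a rough illustration of that retry advice, here is a minimal Python sketch. It is an assumption-laden example, not code from the incident: `fetch_status` and `get_with_retry_on_421` are hypothetical names, and `connection_factory` exists only so the logic can be exercised without a network. Python's `http.client` opens one connection per object, so the retry here necessarily goes out on a fresh connection.

```python
import http.client

MISDIRECTED_REQUEST = 421


def fetch_status(host, path="/", connection_factory=http.client.HTTPSConnection):
    """Send one GET over a brand-new connection and return the status code."""
    conn = connection_factory(host)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        response.read()  # drain the body before closing
        return response.status
    finally:
        conn.close()


def get_with_retry_on_421(host, path="/", connection_factory=http.client.HTTPSConnection):
    """GET once; if the server answers 421, retry once on a new connection."""
    status = fetch_status(host, path, connection_factory)
    if status == MISDIRECTED_REQUEST:
        # The first connection is already closed, so this request
        # cannot land on the connection that produced the 421.
        status = fetch_status(host, path, connection_factory)
    return status
```

A browser does this kind of thing internally, if at all; the sketch only shows the shape of the rule.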

Nobody made changes

The clients (i.e. Android, iOS, and web) had made no changes, and neither had backend or infra. The 421 response was rare but not non-existent, often occurring on only a few devices. I was just as confused as everyone else. We couldn’t establish steps to reproduce it.

Converging Paths

I went looking for more details and, on MDN, came across the following statement in one of their examples.

In cases such as a wildcard certificate (*.example.com) where a connection is reused for multiple domains (abc.example.com, def.example.com), the server may respond with a 421:

Okay, noted: many of our hosts do reside on the same server. I also remembered that we have two replicated edge servers, and I thought of pinning the requests to one of them, hoping for better luck at reproducing the result. So I updated my /etc/hosts and set all the hosts to resolve to the same server.

And that still didn’t work, or did it? It seemed to me that I wasn’t making any progress, but it turned out this step was crucial, which I failed to recognize at the time.

Narrowing down

We saw the 421 response across all platforms, i.e. Android, iOS, and web - there must be something common between them, right? And yes, there was.

On Android and iOS, the error was observed specifically in WebView. I placed an empirical bet - “it has to be the browser engine”.

Playing for the bet

Now I had a clue, a direction. The only thing left was to verify it, to play for the bet I had placed.

Then I decided to load all the hosts/sites present on the same server in a random order. And just like that, I got myself the by-then infamous 421 Misdirected Request response.

I noted the steps that got me there, and they were surprisingly simple. First, I loaded payment-portal.nepalipatro.com.np; then, upon attempting to open nepalipatro.com.np, I was presented with 421 Misdirected Request.

On macOS I use Orion Browser, which uses WebKit. I expected Safari to behave the same, and it did.

I didn’t know exactly what the issue was, but it was clear to me that it had something to do with WebKit - but wait, Android saw this issue too, and it does not run on WebKit. That’s right, “it cannot be WebKit”, I said to myself.

To test my theory, I repeated the same steps in Brave Browser, which uses Chromium, except this time I did not get a 421 Misdirected Request response. I did the same thing in both browsers MULTIPLE times to reassure myself, but the result was clear - Brave Browser was working fine, but Orion/Safari kept receiving 421.

Programming vs Networking

To be honest, as absurd as it sounds, my programmer instinct kicked in and I thought of reading the source code of Chromium and WebKit to understand the behavior, but for sane reasons I chose to inspect packets with Wireshark instead.

Peeking inside

Now it was time to dive deep and monitor closely. I ran tcpdump in the background and performed the steps for reproducing 421 Misdirected Request in each of the two browsers.

$ sudo tcpdump -i en0 dst 157.10.100.57 -w test_01.pcap

There were many things to look at, but since the examples had suggested re-initiating the connection in such cases, I had a feeling that it was related to TLS connection caching/reuse.

Inspecting Requests

On WebKit, only one Client Hello was observed.

On Chromium, two Client Hellos were observed, one for each host.
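Wireshark shows the server name (SNI) carried in each Client Hello, which is how the two captures can be told apart. To make that field concrete, here is a small Python sketch that pulls the SNI out of a raw Client Hello record. Both helper names are hypothetical, and the parser assumes the entire Client Hello sits in a single TLS record, which a real capture does not guarantee; `build_client_hello` only produces a bare-bones record for trying the parser out.

```python
import struct

SNI_EXTENSION = 0  # "server_name" extension type


def build_client_hello(host):
    """Build a minimal TLS record: a ClientHello carrying only an SNI extension."""
    name = host.encode("ascii")
    entry = b"\x00" + struct.pack("!H", len(name)) + name   # name_type = host_name
    sni_data = struct.pack("!H", len(entry)) + entry        # server_name_list
    ext = struct.pack("!HH", SNI_EXTENSION, len(sni_data)) + sni_data
    body = (
        b"\x03\x03"                            # legacy_version
        + bytes(32)                            # random
        + b"\x00"                              # empty session id
        + struct.pack("!H", 2) + b"\x13\x01"   # one cipher suite
        + b"\x01\x00"                          # one compression method (null)
        + struct.pack("!H", len(ext)) + ext    # extensions block
    )
    handshake = b"\x01" + len(body).to_bytes(3, "big") + body
    return b"\x16\x03\x01" + struct.pack("!H", len(handshake)) + handshake


def client_hello_sni(record):
    """Return the SNI host name from a ClientHello record, or None."""
    if len(record) < 6 or record[0] != 0x16 or record[5] != 0x01:
        return None                    # not a handshake record with a ClientHello
    pos = 5 + 4                        # TLS record header + handshake header
    pos += 2 + 32                      # legacy_version + random
    pos += 1 + record[pos]             # session id
    (suites_len,) = struct.unpack_from("!H", record, pos)
    pos += 2 + suites_len              # cipher suites
    pos += 1 + record[pos]             # compression methods
    (ext_total,) = struct.unpack_from("!H", record, pos)
    pos += 2
    end = pos + ext_total
    while pos + 4 <= end:
        ext_type, ext_len = struct.unpack_from("!HH", record, pos)
        pos += 4
        if ext_type == SNI_EXTENSION:
            # layout: list length (2), name type (1), name length (2), name
            (name_len,) = struct.unpack_from("!H", record, pos + 3)
            return record[pos + 5:pos + 5 + name_len].decode("ascii")
        pos += ext_len
    return None
```

Running `client_hello_sni` over every Client Hello in a capture and counting the distinct names gives the same "one versus two" picture as eyeballing it in Wireshark.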

Since both hosts reside on the same server, i.e. 157.10.100.57, it seemed that WebKit used the same TLS connection for both hosts, which led nginx to respond with 421.

It cannot just be WebKit, right? Some of the Android clients that faced the issue must be doing the same.

It was clear to me that some browsers were reusing the same TLS connection for different hosts because they had the same destination. The behavior was rare, and since we have two edge servers, it led me to believe that the two hosts were being resolved to different addresses most of the time.

$ dig nepalipatro.com.np +short
157.10.100.23
157.10.100.57

$ dig payment-portal.nepalipatro.com.np +short
157.10.100.23
157.10.100.57
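The reuse condition I suspected can be modelled in a few lines of Python. This is a deliberately simplified sketch, not how any browser actually decides: `cert_covers`, `OpenConnection`, and `may_reuse` are hypothetical names, the certificate names in the usage note are assumed (I am not quoting our real certificate), and real engines apply more detailed rules.

```python
from dataclasses import dataclass


def cert_covers(pattern, host):
    """True if a certificate name (exact, or a one-label wildcard) covers host."""
    pattern, host = pattern.lower(), host.lower()
    if pattern.startswith("*."):
        # The wildcard stands for exactly one leftmost label.
        first, _, rest = host.partition(".")
        return bool(first) and rest == pattern[2:]
    return pattern == host


@dataclass(frozen=True)
class OpenConnection:
    ip: str             # address the existing TLS connection goes to
    cert_names: tuple   # names on the certificate the server presented


def may_reuse(conn, host, target_ip):
    """Simplified rule: same destination address and the certificate covers host."""
    same_destination = conn.ip == target_ip
    covered = any(cert_covers(name, host) for name in conn.cert_names)
    return same_destination and covered
```

With a connection opened for payment-portal.nepalipatro.com.np at 157.10.100.57 and an assumed certificate covering `nepalipatro.com.np` and `*.nepalipatro.com.np`, `may_reuse(conn, "nepalipatro.com.np", "157.10.100.57")` is true, while the same call with 157.10.100.23 is false. That matches why pinning every host to one address in /etc/hosts made the 421 easy to reproduce.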

Now what?

The issue was clear: the browser was reusing the same TLS connection for different hosts. Chromium, on the other hand, initiated a separate TLS connection for each host even though they reside on the same server.

I don’t have control over the browser; even if I were to submit a patch, the behaviour would persist in pre-patch builds.

- “But what if it is a misconfiguration on our side?”

I sat back, tried to visualize the flow again, viewed the packets again, and read the examples again. On re-reading about status code 421, I came across this statement.

This can be sent by a server that is not configured to produce responses for the combination of scheme and authority that are included in the request URI.

- Mozilla Developer Network

Hmmm, so a 421 can also be triggered when the scheme differs. But we terminate TLS at the same point for both hosts, so that didn’t seem to apply and I set it aside. It did, however, lead me to check our nginx configuration.

As I took a closer look at the nginx configurations, I noticed a difference: one host had http2 enabled and the other did not. I enabled http2 on both hosts and stopped receiving 421.
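A difference like this can sometimes be observed from the outside by checking which protocol each host negotiates over TLS (ALPN). The Python sketch below is one way to look, under assumptions: the function names are hypothetical, and whether the two hosts actually report different protocols depends on the nginx version and on how the listen sockets are configured, so an identical result does not prove the configurations match.

```python
import socket
import ssl


def negotiated_protocol(host, ip, port=443, timeout=5.0):
    """Connect to ip with SNI set to host and return the ALPN protocol chosen."""
    context = ssl.create_default_context()
    context.set_alpn_protocols(["h2", "http/1.1"])  # offer both, let the server pick
    with socket.create_connection((ip, port), timeout=timeout) as raw:
        with context.wrap_socket(raw, server_hostname=host) as tls:
            return tls.selected_alpn_protocol()  # "h2", "http/1.1", or None


def hosts_disagree(protocol_by_host):
    """True if the hosts did not all negotiate the same protocol."""
    return len(set(protocol_by_host.values())) > 1


# Example use (needs network access, so it is left commented out):
# hosts = ["nepalipatro.com.np", "payment-portal.nepalipatro.com.np"]
# results = {h: negotiated_protocol(h, "157.10.100.57") for h in hosts}
# print(results, hosts_disagree(results))
```

Reading the server blocks side by side, as I did, remains the more direct check; this only shows what a client sees.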

So it turns out WebKit and some other browser engines negotiated TLS and transmitted HTTP/2 data. But when that same TLS connection was reused for the host that did not have http2 enabled, nginx responded with 421.