Take The Time To Truly Understand
The e-commerce web server cluster was due for an upgrade. New hardware was
purchased. Windows and IIS were installed. Microsoft's NLB clustering was
configured and the various e-commerce websites and services were deployed to the
new two-node cluster.
And everything seemed just fine…
…and it was—for most folks, in most places.
A few weeks after the new cluster was rolled out, a customer from Singapore
contacted support to say that they were unable to purchase anything on the
store. Dutifully, a couple of developers tried to reproduce the issue the customer described:
I add an item to my cart and click the shopping cart icon to check out. When I
arrive at the 'cart' page, all of my items are there. But, when I click
'proceed' to go to the billing page, nothing happens.
The checkout path was fairly straightforward: cart -> billing -> shipping ->
confirmation -> order summary.
From looking at the logs, it was clear that the server was receiving the HTTP
request to load the billing page, but, surprisingly, was redirecting the
customer back to the cart. Based on the code, such a redirect would only be
returned if the cart was empty. So, the cart page displayed a list of items but,
once the customer clicked to proceed to the billing page, those items
disappeared. When the cart page loaded again, though, the items reappeared.
A few days later, the customer emailed to say that he had been able to complete
his order, so it was chalked up to a problem on the customer's end.
Over the next few months, a customer would email to report this problem a few
more times—usually from Singapore1. But we had plenty of customers from
Singapore and, based on their order histories, most of them were able to place
their orders without problem. The IIS server logs were reviewed to find more
occurrences of the cart -> billing -> cart redirection pattern, but nothing
really stood out and the issue was relatively infrequent and was put on the back
burner accordingly.
What if the problem is with the customer's ISP?
Approximately six months after the new cluster was deployed, a handful of the
customers—relatively big spenders from Singapore—had become quite frustrated and
escalated their concerns to the CEO. Tracking down the issue became a top
priority.
During this round of investigation, a seemingly interesting detail emerged: all
of the affected customers from Singapore were coming from a small number of IPs on a
single subnet. Further, the IP address usage overlapped. So, it appeared that
they were sharing these IPs in some kind of network address translation pool.
At this point, the team investigating it speculated that there was a
misbehaving caching proxy somewhere in Singapore. Again, they considered the
issue essentially "resolved" because the problem was believed to be with a
component beyond their control.2
If that were true, wouldn't the customers have problems like this frequently and with other sites?
I joined the team responsible for most of the company's web
applications—including the e-commerce websites—around the time the issue was
escalated. I had a bit of a chance to weigh in with my new teammates as they
worked through the issue, but I was the new guy and my background involved a
minimal amount of ASP.NET/IIS on Windows. I had quite a lot of web experience,
but felt a bit rusty having spent the past 2 years working on Windows and Mac OS
desktop apps.
Even though they had declared the problem beyond their control a second time, the customer
emails still kept coming in. The group's director was frustrated, but felt
stuck—he didn't want to tell the customer tough luck. I asked him if we could
have a few days to dig into the issue. I could bring a fresh set of eyes and it
would be a useful exercise for accelerating my learning of the new-to-me
technology stack. That, and I had a nagging sense in the back of my mind that
the problem was on our side. If this was really an issue on the customer's
end, it seemed like they'd have experienced it elsewhere, too. By all accounts,
they only experienced it on our site.
Reading The 🤬 Manual™
As noted, I was relatively new to ASP.NET/IIS. My background was building web
applications primarily with Java, PHP, Python, and, regrettably, some
ColdFusion. So, I started by reading the IIS documentation and auditing our current
configuration against the documentation's standard recommendations.
One thing became clear immediately—the documentation expected every server in
the cluster to have an identical configuration. The documentation-prescribed
way to do this was to place the configuration on a highly-available network
share and point all cluster members at that shared configuration so that they
were all using literally the same config file. Our servers were not set up
like this; they each had their own, local configuration files.
Upon inspection, the files were very, very similar, but not identical. In
particular, the id property of a given site on one server did not necessarily
match the id property for the same site on the other server. This was notable
because it was incorrect, but we didn't yet know if, how, or why it related to
the problem at hand.
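Incidentally, this kind of drift is easy to audit programmatically. Here is a
minimal sketch, assuming the Microsoft.Web.Administration assembly is available
on each node; run it on every server and diff the output (the output format and
class name are my own, not from the original system).

// Minimal sketch: dump each IIS site's name and id so the output can be
// diffed across cluster nodes. Assumes Microsoft.Web.Administration is
// referenced and the process runs with administrative rights.
using System;
using Microsoft.Web.Administration;

class SiteIdAudit
{
    static void Main()
    {
        using (var serverManager = new ServerManager())
        {
            foreach (Site site in serverManager.Sites)
            {
                // If these (id, name) pairs differ from node to node, anything
                // keyed off the site id will differ from node to node, too.
                Console.WriteLine("{0}\t{1}", site.Id, site.Name);
            }
        }
    }
}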
Reviewing Logs
With that config file discrepancy in the back of my mind, we dug into the logs
and code to see if there was anything my fresh eyes would see that had
previously gone unnoticed.
Now, the logging system was set up a bit crudely—errors were logged to a mailing
list (yes, it was terrible) and everything else just went to the local disk of
the individual servers. The lack of a unified view of the logs masked an
interesting detail for a little while, until we merged the log files.
The logs showed that most customers' browsers would submit all requests to the
same server for their entire session—product page, add to cart, cart, billing,
shipping, order summary, etc. We suspected, and then confirmed, that we had
configured "Class C" affinity on the NLB cluster. Once we merged the logs, we
noticed that the customers from
Singapore who were reporting the problem stayed on the same server for most of
their session, but there was a single request to the other server in the
cluster when loading the billing page. Given the discrepancy between site ids in
the configuration files we'd noticed above, this turned what had been a yellow
flag orange.
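The merge itself needed nothing clever. A rough sketch of the sort of thing
involved is below; the paths are hypothetical, and it simply interleaves each
node's W3C logs and tags every line with the server it came from.

// Rough sketch: merge per-server IIS W3C logs into one chronological stream,
// tagging each line with the server it came from, so a single customer's
// request trail can be followed across nodes.
using System;
using System.IO;
using System.Linq;

class LogMerge
{
    static void Main()
    {
        var logDirsByServer = new[]
        {
            new { Server = "web01", Dir = @"C:\logs\web01" },   // hypothetical paths
            new { Server = "web02", Dir = @"C:\logs\web02" },
        };

        var merged = logDirsByServer
            .SelectMany(s => Directory.GetFiles(s.Dir, "*.log")
                .SelectMany(File.ReadLines)
                .Where(line => !line.StartsWith("#"))            // skip #Fields etc.
                .Select(line => new { s.Server, Line = line }))
            // Default W3C lines begin with "date time", so an ordinal sort on
            // the raw line is good enough for a rough chronological merge.
            .OrderBy(entry => entry.Line, StringComparer.Ordinal);

        foreach (var entry in merged)
            Console.WriteLine("{0}\t{1}", entry.Server, entry.Line);
    }
}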
Spelunking Through the Code
Reviewing the code, we found two preconditions enforced by the billing page:
it had to be loaded via an HTTPS URI, and the cart had to contain at least one
item.
The HTTPS precondition's implementation was a bit awkward. At the time, the
website enforced HTTPS for "sensitive" pages like billing and account settings;
but it also enforced unsecured HTTP for everything else.3 If it received a
request via HTTP for a "sensitive" page, it would return an HTTP 302 redirect to
the same URI using HTTPS. So, when a typical user clicked the "proceed" button
from the cart page, what happened was:
1. GET http://foo.com/checkout/billing
   -> 302 Found w/ Location: https://foo.com/checkout/billing
2. GET https://foo.com/checkout/billing
   -> 200 OK
But for the affected Singapore users, it looked more like this:
1. GET http://foo.com/checkout/billing
   -> 302 Found w/ Location: https://foo.com/checkout/billing
2. GET https://foo.com/checkout/billing
   -> 302 Found w/ Location: https://foo.com/cart
3. GET https://foo.com/cart
   -> 302 Found w/ Location: http://foo.com/cart
4. GET http://foo.com/cart
   -> 200 OK
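The scheme enforcement is what produced the redirects in steps 1 and 3 above.
A minimal sketch of roughly how it behaved, assuming a classic System.Web base
page and a hypothetical RequiresSsl flag (the real code differed in its
details):

// Sketch of the scheme enforcement, assuming classic ASP.NET (System.Web).
// RequiresSsl is a hypothetical per-page flag; "sensitive" pages such as
// billing would override it to return true.
using System;
using System.Web.UI;

public class StorePage : Page
{
    protected virtual bool RequiresSsl { get { return false; } }

    protected override void OnLoad(EventArgs e)
    {
        if (RequiresSsl && !Request.IsSecureConnection)
        {
            // Sensitive page requested over HTTP: 302 to the same URI on HTTPS.
            Response.Redirect("https://" + Request.Url.Host + Request.RawUrl);
            return;
        }

        if (!RequiresSsl && Request.IsSecureConnection)
        {
            // Everything else was pushed back down to unsecured HTTP.
            Response.Redirect("http://" + Request.Url.Host + Request.RawUrl);
            return;
        }

        base.OnLoad(e);
    }
}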
It was clear from the web server logs that the billing page was being loaded
over HTTPS, so that was being enforced.
The failing precondition was the requirement that there be at least one item in
the cart. The cart page and the billing page considered different data sources
when evaluating the contents of the cart, and a single integer—quoteId—stored
in ASP.NET Session State related the two data sources. We could see from
application logs that, in the failing case, quoteId was 0—which was an
invalid value.
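In other words, the billing page's cart check boiled down to something like the
following sketch. Only quoteId comes from the original system; the class and
page names are hypothetical stand-ins.

// Sketch of the billing page's cart precondition, assuming classic ASP.NET
// session state. A missing or zero quoteId makes the cart look empty, which
// triggers the 302 back to the cart page seen in the logs.
using System;
using System.Web.UI;

public class BillingPage : Page
{
    protected override void OnLoad(EventArgs e)
    {
        base.OnLoad(e);

        // quoteId ties the cart page and the billing page to the same quote record.
        int quoteId = (Session["quoteId"] as int?) ?? 0;
        if (quoteId == 0)
        {
            Response.Redirect("/cart");
            return;
        }

        // ...load the quote's line items and render the billing form...
    }
}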
But why did it only misplace the quoteId for people from Singapore?
Back to The Documentation
It turns out that SqlSessionStateStore incorporates the IIS site id into the
session key it generates. So, if the site ids don't match across all servers in
your farm, you'll seemingly lose session data when people jump from one server
to another.
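The ingredient in question is visible from inside the application. The
following is a sketch, not something from the original investigation;
HttpRuntime.AppDomainAppId is the IIS metabase path (for example
"/LM/W3SVC/1/ROOT"), which embeds the site id.

// Sketch: surface the application identifier that out-of-process session
// state is partitioned by. If this string is not identical on every node in
// the farm, sessions written by one node will not be found by the others.
using System;
using System.Web;

public static class SessionKeyDiagnostics
{
    public static void LogAppId()
    {
        // Under IIS this looks like "/LM/W3SVC/{siteId}/ROOT", so differing
        // site ids yield differing identifiers.
        Console.WriteLine(HttpRuntime.AppDomainAppId);
    }
}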
But wasn't the NLB cluster configured to use "Class C" affinity? It turns
out—from inspecting the logs again—that requests for the affected customer
sessions were originating from one IP for unsecured HTTP and a different IP on
a separate subnet for HTTPS. If, by chance, both IPs were affined to the same
server, everything would "just work". But, if they happened to be affined to
separate servers, customers using that Singapore ISP could not proceed
through checkout because ASP.NET Session state could not be shared between
servers if the site ids did not match up across the cluster.
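To make the failure mode concrete: "Class C" affinity keys on the client's /24
network rather than the full source address, so two source IPs on different
subnets can legitimately land on different nodes. The toy model below (not
NLB's actual algorithm) is enough to see how one customer's HTTP and HTTPS
traffic can split across servers.

// Toy model of "Class C" affinity: key on the first three octets of the
// client IP and map that key onto a node. This is NOT NLB's real hashing,
// just an illustration of the failure mode.
using System;

class AffinityToy
{
    static readonly string[] Nodes = { "web01", "web02" };

    static string NodeFor(string clientIp)
    {
        string[] octets = clientIp.Split('.');
        string classCKey = octets[0] + "." + octets[1] + "." + octets[2];
        int hash = classCKey.GetHashCode() & 0x7FFFFFFF;
        return Nodes[hash % Nodes.Length];
    }

    static void Main()
    {
        // Hypothetical ISP egress addresses: plain HTTP leaves via one subnet,
        // HTTPS via another. Whenever the two keys map to different nodes, the
        // two halves of one customer's session end up on different servers.
        Console.WriteLine(NodeFor("203.0.113.10"));    // HTTP requests
        Console.WriteLine(NodeFor("198.51.100.25"));   // HTTPS requests
    }
}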
Once the site ids were aligned across both servers, the problem was actually
solved. The remaining problems—relying on affinity, making the entire site use
HTTPS, eliminating (or, at least, reducing) usage of ASP.NET Sessions, etc—were
addressed separately.
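For completeness, the realignment itself is small. A hedged sketch using
Microsoft.Web.Administration is below; the site name and target id are
hypothetical, and the same change can be made through the IIS manager or by
editing applicationHost.config directly.

// Sketch: set a site's id so it matches the id used on every other node.
// Run once per server; changing a site id can affect other id-derived
// settings (log folder names, for instance), so do it deliberately.
using Microsoft.Web.Administration;

class AlignSiteId
{
    static void Main()
    {
        using (var serverManager = new ServerManager())
        {
            Site site = serverManager.Sites["store.example.com"];   // hypothetical name
            site.Id = 3;                     // must match the other node(s)
            serverManager.CommitChanges();   // persists to applicationHost.config
        }
    }
}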
Takeaways
Take the time to truly understand the way the software you're using works
This is my favorite takeaway because it's the one I've seen lead to the most
wasted time and effort. Often, we settle on an explanation based on an untested
speculation rather than analysis. In this case, some hypothesized that a
mysteriously misbehaving proxy or middlebox was at fault, but that hypothesis
was never really validated. Mere plausibility is not an adequate threshold to
stop investigating.
Other times, it's an unwarranted confidence that a software vendor (open source
or commercial) wouldn't do something so seemingly foolish (yes, they
would).
In other cases, it's more subtle—a bias toward looking for the source of a
problem in the areas where the practitioner/investigator is most expert or
stigmatizing technology that is foreign, unfamiliar, or has a bad reputation in
general.
Instead, we should read the documentation and use tools like profilers, tracing
frameworks, decompilers, and, if available, source code, to inspect and
understand the software we are running.
Log Everything (or, at least a lot more than you do now)
I've begun to take centralized logging for granted, but not long ago, I heard
from a friend and former co-worker who started at a new company where there was
little, if any, logging set up for their production systems.
I rarely encounter people who are logging too much. In this case, the only
logs we had were the IIS server logs and some really thin application logs.
Had this issue come up a couple of years later, after we'd deployed centralized
logging and application monitoring, it would have been solved much more easily.
Be willing to audit your code and config and even revisit past decisions
Be willing to audit even seemingly basic things like the web server
configuration to make sure things are actually configured properly.
Always be willing to revise a decision made in the past that seemed reasonable
given the context and information available at the time but is no longer
appropriate given new information or changed context.