Thursday, February 13, 2014

Be tenacious

A Release Management tale


Last night we performed a Release that affected a production database and a website. It was a fairly simple release, and we had done the same steps before. So with a little effort, we prepared the Release Plan, which contained the steps to be performed, and executed them after hours. Within 15 minutes, all backups, snapshots, and the like were done. Within a few more minutes, all the new code had been successfully pushed out. Testing ensued and the Release was labeled a success.

We all finished up our tasks, our compares, our post-release snapshots, our documentation, and so on. Emails were sent, and we logged off. All was well.

Until the morning.

The users, darn them, started using the website and noticed some issues. They complained. Those complaints reached our ears early in the morning, before most of us had made it in to the office. So from comfy home office chairs, we logged in and started looking around. Email was the initial form of communication, but waiting on responses became burdensome, and a chat room was opened up in our internal IM product so we could talk more freely.

Initially, there were members of the troubleshooting team who wanted action. Something was broken, and it's only natural to want to fix it as quickly as possible. Especially since users were in the system and seeing the issues. It's different at night when no one is online. Less pressure then. But now, in the morning, people were anxious, and that anxiety transferred rather quickly to the rest of us.

I had to say no. We are not just going to roll back. Just be patient.

Once we all gathered and started troubleshooting, we could dig into the why. What was happening. What we thought was happening. Reading logs. Watching processes. Watching memory. And so on. At one point we even said that it was not the database, and it was suggested that I could go back to my normal tasks. But I stuck around. I didn't feel confident that we knew what was going on, and even though I could show that the database was not under any duress, I stuck it out. I kept working on it. I helped, we all helped. Others were brought into the mix and their ideas were considered.

Fast forward. We still do not know what is happening, except that the IIS server comes under a lot of memory pressure, the site ceases to function, and once it all blows up, things start over and the site seems to work. We see this over and over. Users are in there. We are in there. All of us contributing, but there is still no smoking gun.

So I open Profiler and limit it to the particular database and web server that are having the issue. We capture everything that is happening on the db, which is a lot, and just cross our fingers. After a few more iterations of the memory consumption and release, I notice a repeating query in the Profiler output, just as all hell breaks loose. It's the last statement, seemingly, that was trying to execute. I grab it as is and attempt to run it raw. It gives a divide by zero error.
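
(As an aside: if you would rather script that kind of capture than click through the Profiler GUI, an Extended Events session filtered to a single database does roughly the same job. This is only a minimal sketch, and the session name, database name, and file path are placeholders, not our actual environment.)

    -- Minimal sketch only: capture completed batches and RPC calls for one
    -- database, similar to a Profiler trace filtered to that database.
    -- Session name, database name, and file path are placeholders.
    CREATE EVENT SESSION [web_db_capture] ON SERVER
    ADD EVENT sqlserver.rpc_completed (
        ACTION (sqlserver.sql_text, sqlserver.client_hostname)
        WHERE sqlserver.database_name = N'WebAppDb'),
    ADD EVENT sqlserver.sql_batch_completed (
        ACTION (sqlserver.sql_text, sqlserver.client_hostname)
        WHERE sqlserver.database_name = N'WebAppDb')
    ADD TARGET package0.event_file (SET filename = N'C:\traces\web_db_capture.xel');
    GO
    ALTER EVENT SESSION [web_db_capture] ON SERVER STATE = START;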

Divide by Zero!

What is this query doing? Does anyone recognize it? Does it have anything to do with what we pushed last night? Is the data goofed? And other relevant questions were asked. After digging a bit, sure enough, deep in a view that was altered last night, a field was being used as a divisor, and on occasion it could be zero.
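
To illustrate the pattern, here is a hypothetical reconstruction; the view and column names are made up, not our actual schema. The usual guard is to let NULLIF turn a zero divisor into NULL, so the division returns NULL instead of raising an error.

    -- Hypothetical reconstruction; dbo.vwOrderSummary and its columns are
    -- placeholder names, not our actual schema.
    ALTER VIEW dbo.vwOrderSummary
    AS
    SELECT
        OrderId,
        TotalAmount,
        -- Before: TotalAmount / ItemCount  (fails when ItemCount = 0)
        -- After:  NULLIF turns a zero ItemCount into NULL, so the division
        --         returns NULL instead of raising a divide-by-zero error.
        TotalAmount / NULLIF(ItemCount, 0) AS AveragePerItem
    FROM dbo.Orders;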

I hear a muffled 'Oops' escape the developer standing behind me. 'How did that get past testing?' he asks no one in particular. We discuss for a bit, come up with a solution, and make an alteration on the fly in production that fixes this little issue. After that, the query, run raw, was able to complete. And as soon as we made the change, we noticed the memory consumption and explosion slow down.

It didn't cease, but it did slow.

This gave us more time. More time to look deeper. We continued to watch the Profiler results. We continued to perform tests, and we continued to see the web server work for a bit, then struggle, then use all its memory, then flush everything and carry on as if it had a goldfish-sized memory. 'All's well now, let's go,' seemingly forgetting that mere seconds ago it had used and flushed all its memory.

Another query started showing up as the last query executed just prior to the spike in memory usage. As I captured and executed this one manually, it too gave us an error. Something about a field it couldn't find in the db. Some field that looked like a valid field, yet it didn't exist. When I pointed it out to the developer, he incredulously stammered something like 'where did that come from?'. It turns out that the staging environment had an extra field. This field was baked into the middleware code that had been generated, and was now trying to do its thing against production, where no such field exists.

And the web server simply crashed.

Instead of throwing a helpful error, or logging that it got a bad result or no result or some error, it simply kept attempting the query, letting its memory consumption expand to biblical proportions, and then came crashing down. Only to try again in a few minutes, as if it had no memory of the preceding few minutes.
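
As a side note, this kind of staging-versus-production schema drift can be caught before a release by diffing column metadata. A hedged sketch follows; the database, table, and column names are placeholders, and it assumes both databases are visible from the same instance (otherwise, compare exported column lists).

    -- Hypothetical sketch: columns that exist in staging but not in production
    -- for the same table, using the standard INFORMATION_SCHEMA views.
    -- StagingDb, ProductionDb, and 'Orders' are placeholder names.
    SELECT s.TABLE_NAME, s.COLUMN_NAME
    FROM StagingDb.INFORMATION_SCHEMA.COLUMNS AS s
    LEFT JOIN ProductionDb.INFORMATION_SCHEMA.COLUMNS AS p
           ON  p.TABLE_SCHEMA = s.TABLE_SCHEMA
           AND p.TABLE_NAME   = s.TABLE_NAME
           AND p.COLUMN_NAME  = s.COLUMN_NAME
    WHERE s.TABLE_NAME = 'Orders'
      AND p.COLUMN_NAME IS NULL;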

So now we fix it.

Now we know what is causing it. And the quickest route to fixing it is to roll back. Roll back all the changes and the site should work like it did yesterday. Not like an Alzheimer's patient continually performing the same unsuccessful task. Roll back the code.

The point here is that, more than half this story ago, you will recall that rolling back was the suggestion. Roll it back. But that suggestion was made in the heat of the moment. Something was broken. We changed something. Roll it back. If we had done this, then the two pieces of broken code hiding well within would never have been found or fixed. Dev would have refactored the release, we would have performed it again on another day, probably tested a lot better, and found the same results. Something not working right, and we would have had no idea what.

So it took us a few hours. So it was frustrating. So the users were unable to use the site for a bit. With sufficient communication, we let the users know, and they were placated. With some time, we dug and dug and discussed, and tried, and ultimately found the culprits. Silly little things, bugs are. Scampering here and there, just out of the corners of your eyes. But havoc is caused until they are eradicated.

I am happy that the end result was more knowledge, time spent in the forge of troubleshooting, and an actual cause for the problem, instead of a quick rollback that would have reverted us to a known state while ultimately hiding the problem.

It's the unknown that kills you. Or, if it doesn't kill you, it at least puts a damper on your day.

Be patient. Be thorough. Be smart. Be tenacious.

1 comment:

Unknown said...

Interesting take TJay. I've tried to lean towards rolling back right away and to resist the urge to troubleshoot live - an hour can turn into three all too easily. It's good to remember that sometimes it's the only way to find the problem though, for better or worse.