Monday, December 02, 2013

How a patch for IE11 filled up my transaction log

The story

Our developers and QA group noticed a bug in our systems with regards to IE11. When the website was used with an IE11 browser, things didn't quite work as expected. After some research and testing, they determined that a patch could be applied to our web server that would fix IE11. It introduced some new code and some changes to the .Net framework on our web server.

So far, nothing affecting the database, right? Right! After all, they did some testing and saw that the effects of applying the patch were negligible to the test system. So, onward ho, we go!

The patch was applied one Saturday night a few weeks ago. Around 6:30 pm. I was at a basketball game with my family at the time, and noticed an alert from RedGate's SQLMonitor about a failed job. Odd I thought. (I had no idea that the patch had been applied at this point). Since it was a single job failure, I chose to continue with the family activity. An hour later, another one appeared. Another job failure. Odd. This is a simple job that is doing some simple validation of errors received from our web site services. Its never failed before.

Again, the game is almost over, and since it a once an hour job, I pushed it off a bit. When the game was over, instead of heading home, we headed to the office. My entire family. They love it, for about 5 minutes. But they bear it when it happens. In this case, we only spent an hour or so at the office. As I dig into the failed job, i realize that its a powershell script that is failing. It seems to be failing on the send email portion of the script. Duh. How lame. I could have swore that it had been tested and functioning at one point, but alas, it is failing. With some tweeking (not twerking) I get it to work, and execute it again manually. This time it sends mail. Yeah! Problem solved? No.

The email it sent me went on to describe that the errors received from the web site services had surpassed a threshold of acceptability. Oops. So a few hours ago, let's lie and say 2, I received an alert that should have told me that a threshold had been exceeded. Something bad was going on. But since the monitor couldn't email me, I didn't know the true depth of the issue.

A bit more digging finds me locating some 7k emails indicating an error I've never seen before. A short phone call later to a developer helps me see that this is a new error, and it might be associated with the patch that was applied to the web site (news i uncovered during my research of the situation). We do a quick cycle of the app pool and viola, the error stops happening in the frequency it was just bombarding us. Yeah!. Problem solved? No.

The Next Day

The next day several folks hit this issue and dig and dig and determine that the patch did something odd to the web site services, which in turn caused this new error to arise when our engines communicated from the outside world to our central services, attempting to synchronize their data with us. So, each one of the attempts failed, causing them to be told to go away and try again another day. In the meantime, the error didn't go away. It was just suppressed for a bit. It would raise again, causing the app pool to crash, and when the app pool was restarted, it would sometimes be good for a while, and other times start or continue spewing the error message email. Still, this seemed isolated from the database. Not affecting it, only in that engines were unable to connect. Some engines. Some of the time.

After more research and several days of looking, it was determined that we should roll the patch back from the system. Maybe that would revert us to the state we were in before. Maybe. How to test for this? Not a lot of confidence was given to several ideas, but the one that was picked was to increase the verbose logging on some of the services so that we could track down what was going wrong with said services as they attempted to connect and process. Yeah!. Problem solved? No.

This increased logging was applied to the web server, which is a VM, which has a small footprint. The increased logging caused some issues as it tried to write out the massive amount of logs for all the connections made that night. This caused the response to the engines to be something like "Hey, I know you tried to contact me, your central server, but I am borked right now. So, instead of accepting your data, why don't you reset your data next time we try to talk." Of course this message was editorialized, but that is the gist. Each of the engines that tried to connect and errored was told to 'reset' next time. A reset means, purge all data on central (which can be like a gig or 2 gig. maybe 5) and resend all said data. This reset process occurs all the time, from a lot of engines. It is a process that shouldn't be a bear on the topology. Usually. So as the web services did their job this night, with more verbose logging, we should have seen a fix, no? No!. Problem solved? No.

The next day was when we realized that the verbose logging caused bottlenecks on writing out all the logging, causing the engine's requests to time out, and send back the lovely error requesting a reset. This is the day before Thanksgiving. Which I had taken off as a vacation day. So I find out this information while off of work. I was informed that tons of resets may occur tonight, but everyone thinks it will be OK. No worries. Yeah!. Problem solved? No.

Thanksgiving Eve

That night, around 130am, after we had finished watching several movies as a family, I was headed to bed, and looked at my emails. Tons of emails. Tons of alerts. Tons of something hit the fan. I log into work to find that a database log file had filled up, grown to capacity on the drive, filled the drive and was basically sitting in a bad state. Other processes were still working. For example, replication was still sending data from this db server to the reporting system. Other requests were being serviced. But this db was in a bad state. Poor SQLMonitor was freaking out trying to keep up with the errors in the log about the tlog fulled up. It turned out that some 65 engines were performing resets. Right now. All at the same time. Some had finished, but others were still trying. Data was flying in and out of the system at a breakneck speed. Delete here, insert there. Tons of changes. Since it all needed to be replicated, it stuck around in the tlog longer and larger than it had ever before done. After some research, calming my beating heart, and maybe a little cussing under my breath, I add another tlog to the system on another drive, and see some pressure released, and processes continue to flow. I watch it and monitor it for an hour or so, until after 3am, and finally call it good. Things were working as expected, just a bit backed up. Several emails explaining the situation were sent out to the team, mainly to document what was going on, what I had found, and making sure others knew what was happening in realtime. I then decided to go to bed. Yeah!. Problem solved? No.

Around 5 am I awake and check my mail only to find out that just before that time, the replication database log file had filled up as well. It didn't fill the drive, but was 100% full, and things were halted. Fearing that I would have to rebuild replication in a few days, or worse, over the holiday, I got up, and drug my self downstairs to see what I could do with my magic DBA pixie dust. After reviewing things, looking at the state of several pieces of the puzzle, I created another log file on another drive for the distribution database. I double checked the status of the other log file, to find out that it was doing fine. Replication was backed up, and latent. But it started flowing again. It was like finding those pesky beavers had built yet again another dam across my river of data. Once the river was diverted, flow occurred. Yeah!. Problem solved? At this point, its 6am the morning of a major holiday, and I have no idea. It looked good at the moment. But would take a while to catch up. After watching a bit, I went to bed.

I slept in until after 10am on Thanksgiving morning. When I awoke, I grabbed my phone and looked, and viola, all seemed well. There was even an apology from the developer that had spearheaded the movement to rollback the patch, increase the logging, and see what happens. Well, we now know what happens. We can potentially affect systems and services to the point that they get blocked up way upstream, causing untold blockage down the river of our data, and inadvertently cause database problems.

What did I learn?

This was a perfect storm situation that I had not planned for. What can i take from this experience? What can I learn from it and do better?

  • I had sized the log files and drives appropriately, at least as far as normal processing occurred and the baseline of normality allowed me to measure. But in this situation, more log file size, and a bit of a larger disk, would have been the preferred configuration. Something I need to look into. 
  • I was happy that I had planned on having extra luns attached that were unused for Data and Log, in the case of an emergency. Since these already existed, and had space, I could easily add files to dbs and sit them on these drives, which are raring to go. This saved a lot of heartache and pressure. The solution was easily reached in both cases, and when things settled down to normal, the extra log files were removed, returning the drives to their emergency waiting state again. 
  • I also realize that for some types of alerts, it would be better if a more noisy alert occurred. When the db log file first filled up, it was around 11pm. I noticed it at 130am. I could have fixed it a lot earlier, had i been properly notified. I need to fix this. 
  • We all need to be more careful about touching our production system. Something as simple as an IE11 patch to a webserver was ultimately responsible for the log file filling up. If we could have tested this much better, in conjunction with the rest of the services, and not just assuming that this patch would affect a website, that it could reach into services that could cause errors to spew, and pile up too many errors and logging which could ultimately affect data coming in, etc. etc. If, If, If. We have to be better at seeing and planning for the If situations, and not be overly cautious that it causes us to freeze in our tracks and not let any change occur.
  • As smart as we all are individually and collectively, there is still much to be learned, even about our system that we have created and believe to know intimately.