So today, I thought it was going to be a fairly slack day. I noticed the bus didn't have a lot of people on it; people probably decided to start their long weekends early. A busy week was finally winding down. I had finished all my readings for my courses, and the night before I gave a fairly good presentation in class about my research. At work, I had finished putting together a website for a demonstration for prospective buyers, so what better way to end a week, huh? When I got into work, I even joked that I should have taken today off so that I could enjoy a five-day long weekend.
I was happily working on my primary project until my co-worker asked for help with a section of her code. I started debugging it, and ran a few tests on it. Shortly after 11:55am, our server started slowing down drastically. I did a few checks and oh my goodness, something had drained ALL of our system resources, and the server went down. This was great because my boss was at an all-day meeting today. I quickly called our system administrator, and he jumped into action.
12:10pm, my boss comes in because she had a lunch break from her meeting. She asks how things were going. I gave her a weird smile and a nervous laugh.... uhhh.... the server is down. Shortly after, the server came back up, and the system administrator calls back and says all is fine now. Less than 10 minutes later, the server shuts down again. Same problem, something ate up all the resources on the system. I'm panicking a bit at this point.... are we being hacked? Are we under a denial of service attack?
My boss cancels the rest of her meeting because our system is in trouble. We're able to reboot the server, and if nobody's allowed to log in, the system stays up. The second we throw the switch on to let people use the system again, the thing dies. I put up a page to inform our users that the system is having problems. Then I just remembered....... crap, those potential clients were supposed to be trying out our demonstration website today, and the system is down! Thankfully, my boss told me that the clients had rescheduled for Tuesday anyways.... so it didn't affect them. Phew.
While we were reading through log files for clues, my boss told me to eat lunch as it was already 2:00pm. I told her.... I know the drill.... always eat first before you code. At the first sign of trouble at 12:20pm, I quickly ate my lunch because I knew this was going to be a long day. I'm glad I ate first because it was a long time before things stabilized. It's funny, when I got hired for this job, they gave a similar interview question. They asked, it's lunch time, and there's a serious problem with the system and you're hungry.... what do you do? I naively answered, oh I would put all my effort into fixing the system. They then asked, then when would you eat? And I answered, after the system was fixed. They replied, well... what if it took 8 hours to fix? Apparently the correct answer was to eat first, then fix the problem... never code on an empty stomach. So, I had learned my lesson from the interview.
I think we got the system up, and then it crashed, about six times. This was getting frustrating because we had no idea what was going on. Our system administrator had reinstalled a bunch of stuff, and still no luck. I came up with the idea of logging what users were sending us because every time users logged into the system, the system would die. So I went on our secondary system and started building a logging tool for our software. I tried to run the logging tool, and it didn't work. Oh, oops, my code wasn't up to date, so I updated my code from the main system. I ran my code again, and now the secondary system was showing the same symptoms as our downed server. It slowed down, and all the resources were eaten up, and then it froze up.
I had found the link. The code from the main system was killing our secondary server as well. It had to be because when I ran that code, our secondary system went down in the same fashion. I connected the dots and realized that when I was looking at my co-worker's branch of the code, and ran it, that's when the system went down. I was installing the logging tool on the same code that my co-worker had been working on. Now we knew what was causing the system to crash... and it wasn't because we were under attack, phew.
The branch of code she was working on is part of an experimental, bleeding-edge branch of our software. It is the next version of our software with tons of improvements and upgrades, but at the same time, very very unstable.
Anyways, now that we had figured out what was going on, we brought the system back up at 4:30pm. We stayed an hour late trying to figure out what had changed in the experimental code that caused the system to die. The day before, this experimental code was working perfectly fine.
In any case, people should thank us as we helped start their long weekends early by disabling the system for the entire afternoon. I'm exhausted. But now you know why I don't take vacation days.