We had a slight problem yesterday – access to the Internet became a bit flaky. People kept calling to say that access was denied or sites were taking a long time to open. At first, we thought that this was just people being impatient, but we quickly realised that there was a problem.
The firewall device seemed to be struggling a bit – the connection log showed a very high level of packet transmission. That wasn't too unusual, as our current connection gets maxed out on a regular basis, and we have seen it much worse. A few tweaks and it all seemed OK – so we thought no more of it.
Later in the afternoon, we had the same problem; tweak again, all OK – but then it happened again very quickly afterwards. After a brief discussion, we decided to reboot the device to clear any cached data that might be causing the problem. One quick reboot and everything was hunky dory.
When I got into the office this morning, I was a bit surprised to find 20 or so emails that were exact duplicates of ones I had received yesterday. I asked around, and a number of other people had the same problem. I did a few checks but couldn't see any problems. There seemed to be the usual level of network activity – nothing that would indicate any issues – so I put it down to the previous day's problems.
Over the next couple of hours, I worked on various items, including a few support issues. During that time, I received several more emails, some internal, some external. Around mid-morning, I thought about it and realised that I had actually received no new external mail; they had all been duplicates. I did a quick check using an external mail service and realised that there was no incoming or outgoing mail at all.
The guys and I did some tests and quickly realised that something was seriously wrong with the firewall – it was running like a three-legged dog, and several pages of the control menu just would not open at all. We called the support team at the mail service; they checked and confirmed that mail was coming in to them, so we called the firewall vendor. The vendor also found that the device was running slow, so they escalated the problem to the manufacturer.
About an hour later, we got a call from the vendor – the support guys at the manufacturer had found a lot of emails in the cache of the device, about 1,000 or so. They said they would run a script to clear the cache and expected that this would fix the problem. About 20 minutes later, they phoned back – it wasn't a thousand, but one hundred thousand, with more coming in by the second.
Eventually, they cleared the cache and the email started to move, and our spam mailbox suddenly started to groan under the weight of the mail. It was all from one IP address in Japan, to one mailbox, with one subject line. A quick calculation showed over 10,000 incoming mails every hour. To deal with it, we set up a PC logged on with the user account for the spam mailbox, then set a rule within Outlook to delete incoming mail from the specific sender. Once this was running, we could see the incoming mail arrive and immediately be deleted – it was really cool to watch.
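For anyone curious, the Outlook rule was just a single sender-match condition. The same idea can be sketched as a few lines of code; the snippet below is a minimal illustration only (we used a built-in Outlook rule, not a script), and the sender address shown is made up – the real one isn't reproduced here.

```python
from email.message import EmailMessage

# Hypothetical address for illustration; the real spam sender is not shown.
BLOCKED_SENDER = "spammer@example.jp"

def should_delete(msg: EmailMessage) -> bool:
    """Mimic the Outlook rule: delete anything from the blocked sender."""
    return msg.get("From", "").strip().lower() == BLOCKED_SENDER

def filter_inbox(messages):
    """Split a batch of messages into those kept and those deleted."""
    kept, deleted = [], []
    for msg in messages:
        (deleted if should_delete(msg) else kept).append(msg)
    return kept, deleted

def make_msg(sender, subject):
    m = EmailMessage()
    m["From"] = sender
    m["Subject"] = subject
    return m

# Simulate a small batch of incoming mail: two spam, one legitimate.
inbox = [
    make_msg(BLOCKED_SENDER, "same subject line"),
    make_msg("colleague@example.com", "project update"),
    make_msg(BLOCKED_SENDER, "same subject line"),
]
kept, deleted = filter_inbox(inbox)
print(len(kept), len(deleted))  # 1 2
```

In practice the rule ran server-side inside Outlook as each message arrived, rather than over a batch, but the matching logic is the same.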
All in all, we feel pretty good about it; once the problem was identified, we had a solution really quickly. Yes, we did have a couple of hours with no email, but no-one actually noticed. One of the directors did have an issue trying to send an important mail to a potential client, but I was able to do that for him using a backup external mail facility set up for exactly that purpose.
After identifying the problem, we had outgoing mail within about 15 minutes – incoming mail took slightly longer because of the backlog of garbage, but still under 30 minutes. The staff were all kept informed – though it later seemed most of them hadn't even realised there was a problem until they got the email from me telling them about it.
I'm going to sit down with the guys in the next few days – we will draw up a brief outline of what happened and use it to see if there was anything else we could have done to (1) prevent it, (2) detect it, and (3) prepare for it happening again. This will be added to our Business Continuity / Disaster Recovery Plans.