We had a very rough week. There were several issues affecting access to Targetprocess servers, images and attachments. We are extremely sorry about what happened. Even though technically it happened because of the infrastructure problems at Softlayer, our hosting provider, the fault is still ours.
We would like to explain what happened.
First issue, May 12-13.
It was related to the file servers that hold uploads, mashups and some configuration files required for application access. We used to have a standard geographically distributed backup system that served us well for quite some time. We have data centers in the USA and Europe, and each of them has its own file servers. Data is replicated from one data center to the other to make sure nothing is lost even in a disaster that takes a whole data center down for good. The chances of both servers going down simultaneously are very low, but guess what happened…
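A minimal sketch of this kind of one-way replication, assuming directories stand in for the two data centers (the function names and checksum scheme are illustrative, not our production code):

```python
import hashlib
import shutil
from pathlib import Path


def _digest(path: Path) -> str:
    """Content hash used to detect files that changed on the primary."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def replicate(primary: Path, replica: Path) -> list[str]:
    """Copy new or changed files from the primary to the replica.

    Returns the relative paths that were copied. A real geo-replicated
    setup streams changes over the network and handles deletions and
    conflicts; this only shows the basic sync loop.
    """
    copied = []
    for src in primary.rglob("*"):
        if not src.is_file():
            continue
        rel = src.relative_to(primary)
        dst = replica / rel
        if not dst.exists() or _digest(src) != _digest(dst):
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)  # preserves timestamps as well
            copied.append(str(rel))
    return copied
```

Running `replicate` periodically in each direction is the simplest form of the cross-data-center mirroring described above; it protects against losing a whole site, but, as the incident shows, not against both copies failing at once.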
It all started with the EU server developing disk problems. Right before that was discovered, the US server went down completely and we had an automatic failover to the EU server. The failover put far more load on the EU server than it could handle, and as a result no one could access the application. We had to set up a brand new server and move tons of data from Europe back to the USA, so recovery took quite some time.
To prevent this from happening again, we have added another independent server that is used for data replication only, minimizing the chances of similar issues.
We also plan some improvements to configuration file caching. Caching would prevent situations where access to an account is lost because all three file servers are unavailable at once.
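To illustrate the idea, a cached fallback can be a simple "last known good" copy kept next to the application; the cache path and the `fetch_remote` callback below are hypothetical, a sketch rather than our actual design:

```python
import json
from pathlib import Path


def load_config(fetch_remote, cache: Path) -> dict:
    """Fetch configuration from the file servers, falling back to a
    locally cached copy if every file server is unreachable.

    `fetch_remote` is any callable that returns the config dict or
    raises OSError when the file servers are down (hypothetical hook).
    """
    try:
        config = fetch_remote()                # may raise on an outage
        cache.write_text(json.dumps(config))   # refresh last known good
        return config
    except OSError:
        if cache.exists():                     # serve the cached copy
            return json.loads(cache.read_text())
        raise                                  # no cache yet: outage shows
```

With a scheme like this, a file server outage only blocks uploads and attachments, not access to the account itself.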
Second issue, May 19.
After a break of several days we had another emergency: a power outage affecting all our servers in the European data center (customers in the USA were not affected). Of course, there was supposed to be a backup power system, but for some reason it did not work. We have been trying to get an explanation from Softlayer, with no success so far.
Once the power was restored, our European accounts went online again.
It’s pretty hard to do anything when the whole environment is unavailable. The data itself is safe, of course; the issue is the time needed to restore it. If just one server is involved, you probably won’t even notice, but restoring multiple servers is far trickier and takes several hours.
Anyway, here is what we plan to improve to reduce the chances and impact:
Find out why Softlayer’s backup power system did not work and make sure this cannot happen again.
Have some backup and reserve servers in different sectors/floors of the same data center. This might help if the data center is only partially offline.
Stand up a read-only copy in a different data center. Access would be read-only to prevent any data loss; switching over would take about an hour or two.
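The read-only failover in the last item can be sketched as a small routing rule; the function and its return shape are purely illustrative, not an actual API:

```python
def route_request(primary_ok: bool, request: dict) -> dict:
    """Route a request to the primary, or to the read-only replica
    when the primary's data center is offline.

    Writes are rejected in read-only mode, so the replica can never
    diverge from the primary while it is unreachable; that is what
    keeps the failover safe from data loss.
    """
    if primary_ok:
        return {"served_by": "primary", "allowed": True}
    if request["method"] == "GET":                       # reads still work
        return {"served_by": "replica", "allowed": True}
    return {"served_by": "replica", "allowed": False}    # writes blocked
```

The trade-off is deliberate: during a data center outage users keep read access to their projects, at the cost of temporarily blocked edits.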
We’d really like to say that this will never happen again, but we can’t. We have already tried several data centers, and in our experience there are no perfect ones. There is no way to completely exclude the possibility of hardware, network or human error. However, we will do everything possible to prevent incidents like these, improve restoration time and ensure no data is lost.