We had a very rough week. There were several issues affecting access to Targetprocess servers, images and attachments. We are extremely sorry about what happened. Even though technically it happened because of the infrastructure problems at Softlayer, our hosting provider, the fault is still ours.
We would like to explain what happened.
First issue, May 12-13.
It was related to File Servers that have uploads, mashups and some configuration files required for application access. We used to have a standard geographically distributed backup system that served us well for quite some time. We have data centers in the USA and Europe and each of them has its file servers. So, the data is replicated from one data center to another to make sure no data is lost even in case of disasters when the whole data center goes down forever. The chances that both servers go down simultaneously are very low, but guess what happened…
It all started with the EU server having disk problems. Right before that was discovered, the US server went down completely and we had an automatic failover to the EU server. The failover was far more than the EU server could handle, and as a result no one could access the application. We had to set up a brand new server and move tons of data from Europe back to the USA, so it took quite some time.
So, in order to prevent this from happening again, we added another independent server that will be used for data replication only, minimizing the chances of similar issues.
We also plan some improvements related to configuration file caching. This would help prevent situations where account access is lost because all three file servers are unavailable.
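To illustrate the idea, a configuration fallback of this kind could look roughly like the sketch below. This is a hypothetical Python sketch, not our actual implementation: the cache path and the `fetch_remote` hook are assumptions.

```python
import json
import os
import tempfile

# Hypothetical cache location; the real paths and storage may differ.
CACHE_PATH = os.path.join(tempfile.gettempdir(), "tp_config_cache.json")

def load_config(fetch_remote):
    """Try the file server first; fall back to the last cached copy."""
    try:
        config = fetch_remote()          # e.g. a call to a file server
        with open(CACHE_PATH, "w") as f:
            json.dump(config, f)         # refresh the cache on every success
        return config
    except OSError:
        # All file servers are unreachable: serve the last known-good config.
        with open(CACHE_PATH) as f:
            return json.load(f)
```

With a scheme like this, a file-server outage degrades to serving a slightly stale configuration instead of blocking account access entirely.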
Second issue, May 19
After a break of several days we had another emergency: a power outage affecting all our servers in the European data center (customers in the USA were not affected). Of course, there was supposed to be a reserve power system, but for some reason it did not work. We are trying to get more information from Softlayer, but with no success so far.
Once the power was restored, our European accounts went online again.
It’s pretty hard to do anything when the whole environment is unavailable. The data is safe, of course; the question is how long it takes to restore. If it were just one server, you probably wouldn’t even notice, but restoring multiple servers is much trickier and takes several hours.
Anyway, here is what we plan to improve to reduce the chances and impact:
Find out why Softlayer’s reserve power system did not work and prevent this from happening again.
Have some backup and reserve servers in different sectors/floors of the same data center. This might help if the data center is only partially offline.
Start a read-only copy in a different data center. Read-only access is required to prevent any data loss. Switching over would take about an hour or two.
We’d really like to say that this will never happen again, but we can’t. Based on our experience with several data centers, we can say there are no perfect ones. There is no way to completely exclude the possibility of hardware, network or human error. However, we’ll do everything possible to prevent this from happening in the future, provide better restoration times and ensure no data is lost.
Unfortunately, we have problems with one of our file servers and have to temporarily switch to the reserve one. As a result, you may experience difficulties logging in to Targetprocess. We are working on the issue and will restore full application functionality as soon as possible.
Please check http://status.tpondemand.com for updates or follow our Twitter account https://twitter.com/targetprocess
UPDATE: Accounts are operational again, all attachments are recovered.
This post is to let you know that all Targetprocess On-Demand accounts now use an HTTPS connection only. This change will go unnoticed by most users, since any required redirects are performed automatically, but we wanted to explain why we are making it.
Certainly, using SSL/HTTPS makes the connection more secure. However, this is not the only reason we decided to make this change. Some important infrastructure changes are included in the 3.2 release, such as switching to WebSockets and enabling the SPDY protocol. These changes require a reliable HTTPS connection.
For you, this means a better connection, faster load times (especially the first load) and the ability to have more than 6 tabs of Targetprocess 3 open.
For us, switching to a more modern framework means we can deliver cool features more easily.
We know that some of you use custom plugins, reporting services, mashups, integrations, etc. These should not be affected by this change, since we still allow HTTP requests for the REST and SOAP APIs. However, we do recommend updating them to use HTTPS as well. If something goes wrong, please contact our support team.
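As an illustration, an integration that builds API URLs could be migrated with a tiny helper like the one below. This is a Python sketch; the `to_https` helper and the account URL are examples we made up for illustration, not part of the Targetprocess API.

```python
from urllib.parse import urlsplit, urlunsplit

def to_https(url: str) -> str:
    """Rewrite a plain-HTTP endpoint to HTTPS; leave other URLs untouched."""
    parts = urlsplit(url)
    if parts.scheme == "http":
        parts = parts._replace(scheme="https")
    return urlunsplit(parts)

# Example: a hypothetical account URL stored in an integration's config.
api_url = to_https("http://myaccount.tpondemand.com/api/v1/UserStories")
# api_url is now "https://myaccount.tpondemand.com/api/v1/UserStories"
```

Running every stored endpoint through a helper like this once, at integration startup, is usually enough to avoid the extra redirect on each request.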
The changes do not affect local installations (aka On-Site). We do not force HTTPS with On-Site, but it might be a good idea to turn it on as well.
The TargetProcess v.3 Lists Feature Team is waiting for Nadia (who is enjoying her vacation in NYC)
It really is our real Kanban Board snapshot. Note that it is enhanced by two mashups: Team Load (by @alexsane) and Classes of Services (by EugeneKha).
Here’s what we received from our hosting provider:
Our data center, which is owned by Verizon Canada, is having a global outage and they are completely in the dark right now. None of their backup power flipped over. We have a backup generator and UPS that should all automatically pick up.
They will be releasing a statement apologizing for the issue once all is said and done that you can extend to your clients. We can also talk about compensation at that time.
For the time being we are anxiously waiting on power to be restored, we will update you as soon as we have more from Verizon.
We will update you about the status.
UPDATE: 6 AM EST
Power is restored, our datacenter is having trouble with the network but the problems should be resolved shortly.
They have 24 hours of UPS and 24 hours of diesel fuel on hand. The issue was that the flip did not work properly and did not switch over to an alternate power source. Power has been restored for quite some time, but they are having trouble getting their networking equipment going.
We use the top datacenter in Toronto, most of the banks use the same one as us.
We have had constant communication with them every 10 minutes; this issue is affecting multiple other large clients of Verizon as well. They won’t issue an ETA but say it could be any minute.
All of the servers are on, we are just waiting for them to fix connectivity. I will make sure you have something to give to your clients that explains the issue in detail.
UPDATE: 8:12 EST
Some of Verizon’s (datacenter) routers were damaged in the power surge. They are reloading the configuration on those devices. They won’t issue an ETA but are working on it with all hands on deck.
We have numerous staff at the datacenter and keep phoning for updates frequently.
I don’t have a firm ETA, but given the amount of manpower that has been working on this for the past few hours, it should be fixed any time.
UPDATE: 9:20 EST
We received an ETA from the datacenter of 1 hour. We were in the process of moving the equipment to our 2nd datacenter, but given the new ETA, we will give it the hour before moving.
UPDATE: 10:47 AM EST
Servers are back! Please let us know if you have any problems.
UPDATE: 1:15 PM EST. We received an update from our hosting provider with some explanations:
At 11:30 PM on Monday, January 9th the Verizon datacenter lost power completely. Normally this would not be a problem, as they should fail over to Verizon’s UPS or generator. For reasons that have not yet been explained to us, none of the alternate power sources picked up. This caused the entire datacenter to lose all power and lights. We had someone onsite by 12 AM on Tuesday, January 10th to investigate the issue.
At this time numerous other large customers of Verizon also arrived, demanding answers about the situation. We knew very little until the power was restored at roughly 2AM. We were then able to power all servers and equipment and ensure everything was in a healthy condition. Verizon announced at this time they were having “internal network problems” that they were trying to resolve and did not have an ETA as to when it would be fixed.
We escalated the case along with the other companies using the facility. Verizon was unable to provide any ETA or information on what was wrong until roughly 10 AM EST. The issue turned out to be 4 cards in a Cisco gateway that were damaged as a result of a short circuit within Verizon. We were told the 4 cards were being rushed to the facility and we would be back online in roughly 1 hour. All systems came back online at 10:45 AM EST.
It was very difficult for us to provide much information to our clients during the outage because the issue was not with our equipment or infrastructure directly.
Verizon will be making an announcement and providing more detail in the coming days.
UPDATE: Jan 13
We’ve received an official email from Verizon:
Dear Verizon Canada customer:
On Monday, January 09 at 11:02 PM EST Verizon’s network connections in the Toronto data center experienced a service interruption. This incident was triggered by numerous utility voltage fluctuations followed by a total loss of utility power. Verizon backup power systems came online as designed, but a failure occurred during the transfer to generator power.
At approximately 11:20 PM EST the power in the Verizon colocation facility failed. By this time Verizon field engineers were on site to investigate the power incidents. Utility power was restored at approximately 11:31 PM EST, which also restored colocation power. Residual problems with the UPS systems were observed. Verizon field engineers engaged vendor and third-party assistance to restore full functionality. At approximately 12:00 AM EST the UPS came back online and the power systems were fully restored throughout the facility.
At about 12:30 AM commercial power returned and the site was transferred back to it. Verizon observed problems within the network infrastructure and began troubleshooting. At approximately 03:00 AM EST it was determined that there were multiple hardware failures in the switching infrastructure. At 06:49 AM EST network field technicians were dispatched to replace the failed hardware. By 08:49 AM EST the incidents were escalated to level 3 engineers. At 09:37 AM EST it was determined that additional hardware was required from another Verizon site within Toronto. At 10:32 AM EST all faulty hardware was replaced. Network services were fully restored at 10:41 AM EST.
Corrective actions were planned and an emergency maintenance was performed on January 12th, 2012, at 12:00 AM EST to bypass the power equipment which failed during the transfer to generator power. Verizon procedures have been adjusted to improve the speed of escalation for network infrastructure devices at this facility.
Here is what people say about our new Teams Board area released in TargetProcess v.2.21.
- The relatively simple overview is just great.
- The simple fact that this really is a Kanban board with assignment swimlanes. Been waiting for this!
- The overall feature is very good, well thought out. Kudos.
- The board is very nice as a tool for standups
- Using it for daily stand-up, is great view into what people are doing!
- Overview of all projects in one place, I’ve been struggling with this in TP for over a year!
- Overall, that’s a major improvement. We are using scrum and I was hoping to have a good task board as this one.
- Great use of drag and drop, very intuitive.
- Great feature to be able to zoom in and out for detail or for high-level scope on stories.
- I really like this idea. Very helpful from a project management point of view. Great Job with this!
- Nice tool to set priority on product backlog items, presenting many projects with bugs and features all in the same place with intuitive drag and drop: Wow!
- Overall, I love it! This is a great way to view everything at a glance, and the ability to drill in is perfect. More feedback as I use it.
We implemented a cumulative flow diagram in TargetProcess v 2.15 as a tracking/reporting chart that gives a quick overview of user stories in the To Do, In Progress and Completed states:
Why do we have only 3 fixed states for the cumulative flow diagram while all the states for User Stories and other entities are customizable? We could have shown the count of User Stories in every customized state, be it 6, 7 or 8 states, but the diagram would have been too cluttered that way. We would have had to either replicate all the states and counts meticulously and lose the visual clarity of the chart, or reduce it to 3 basic states such as To Do, In Progress and Completed (all the customized states are, one way or another, a variation of those 3) and retain good information design.
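To make the idea concrete, the collapse from custom workflow states down to the 3 diagram states can be sketched like this. The Python below is illustrative only: the state names and the mapping are made-up examples, not an actual Targetprocess workflow.

```python
from collections import Counter

# Hypothetical custom workflow mapped onto the 3 fixed diagram states.
STATE_BUCKETS = {
    "Open": "To Do",
    "Planned": "To Do",
    "In Dev": "In Progress",
    "Testing": "In Progress",
    "Done": "Completed",
}

def bucket_counts(story_states):
    """Collapse per-story custom states into the 3 fixed diagram states."""
    return Counter(STATE_BUCKETS[state] for state in story_states)

counts = bucket_counts(["Open", "In Dev", "Testing", "Done", "Planned"])
# counts == Counter({"To Do": 2, "In Progress": 2, "Completed": 1})
```

Plotting these 3 counts per day is what keeps the chart readable no matter how many custom states a project defines.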
Some people asked us to enable more detailed views in the cumulative flow diagram, and we will consider implementing this in the future.
You’re welcome to submit your requests on cumulative flow diagram and other features to TargetProcess HelpDesk or to TargetProcess UX Group.