At 12:26am Eastern Time this morning, Newstex Emergency Support Staff began receiving automated alerts indicating that several of our key systems were encountering errors. We quickly jumped into action and verified that several of our FTP servers and delivery servers had been affected by a temporary outage. We traced the root problem to an issue with Amazon SimpleDB.
We continued to monitor the situation and follow the AWS Status Page, keeping our servers operating while we prepared to migrate to a different region. During this time, no files or stories were lost, and we were able to continue delivering some stories to clients. Many of our non-essential services, such as our Client Search System, were disabled during this period because they are low priority during a crisis. Our staff did not need to switch regions, thanks to Amazon’s timely response. By 3:30am Eastern Time, services were restored and we began bringing all non-essential services back up.
Newstex’s disaster recovery plan includes several responses based on the time of day, volume of traffic, and severity of the outage. During this outage, we were able to continue accepting story files and delivering high-priority story files. Additionally, we were operating during a low-volume period, so we did not see a need to roll over to a new region at the time. We reassessed the situation every 15 minutes and determined that if the outage went past 4am Eastern Time, we would migrate the services over to the us-west-1 region in California. Fortunately, Amazon was able to fix the issue before we had to make that decision.
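As a rough illustration of the rule described above, here is a minimal Python sketch of a reassess-or-migrate decision. It is not our production code: the function names, the action labels, and the choice to treat exactly 4am as already past the deadline are all assumptions made for the example. Only the 15-minute interval, the 4am cutoff, and the us-west-1 target come from the plan itself.

```python
from datetime import datetime, timedelta

REASSESS_INTERVAL = timedelta(minutes=15)  # from the post: reassessed every 15 minutes
FAILOVER_REGION = "us-west-1"              # from the post: the planned target region


def next_action(now, failover_deadline, outage_ongoing):
    """Return a hypothetical action label for a single reassessment."""
    if not outage_ongoing:
        # The upstream issue is fixed, so low-priority services can come back.
        return "restore-non-essential-services"
    if now >= failover_deadline:
        # Assumption: reaching the deadline counts as going past it.
        return "migrate-to-" + FAILOVER_REGION
    return "hold-and-reassess"


def reassessment_times(start, failover_deadline):
    """List the check times from start up to and including the deadline."""
    times = []
    current = start
    while current <= failover_deadline:
        times.append(current)
        current += REASSESS_INTERVAL
    return times


if __name__ == "__main__":
    day = datetime(2000, 1, 1)  # placeholder date, not the date of the outage
    deadline = day.replace(hour=4)
    for check in reassessment_times(day.replace(hour=3), deadline):
        print(check.time(), next_action(check, deadline, outage_ongoing=True))
```

Keeping the deadline as an explicit input reflects the fact that the plan varies with time of day, traffic volume, and severity, so a different outage could call for a different cutoff.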
Newstex takes all outages very seriously, and we strive to minimize customer impact in any situation, even one that is out of our control. We stand by our decision to use Amazon Web Services and SimpleDB for all of our systems, as it still has the best uptime of any solution available to us. In this instance, Amazon was also able to resolve the root cause more quickly than we could have, which further reinforces our commitment to using them.
For more information, see my Blog Post on Best Practices for handling Crisis situations in the Amazon Cloud.