Thursday, February 26, 2009

Oregon Gmail February 24 Outage Postmortem

If you had difficulty accessing Oregon Gmail between 3 AM and 6 AM on Tuesday, you weren't alone. Below is Google's explanation of the outage.


Begin forwarded message:

From: The Google Apps Team <apps-notify@google.com>
Date: February 26, 2009 12:38:47 AM CST
To: jon.tanner@gmail.com
Subject: Google Apps Update: February 24 Outage Postmortem
Reply-To: apps-notify@google.com

Dear Google Apps customer,

From approximately 9AM to 12PM GMT / 1AM to 4AM PST on Tuesday, February 24, 2009, some Google Apps mail users were unable to access their accounts. The actual outage period varied by user because the recovery process was executed in stages. No data was lost during this time. The root cause of the problem was a software bug that caused an unexpected service disruption during the course of a routine maintenance event. This bug has been found and fixed.

Additional Details

A few months ago, new software was implemented to optimize data center functionality to make more efficient use of Google's computing resources, as well as to achieve faster system performance for users.

Google's software is designed to allow maintenance work to be done in data centers without affecting users. User traffic that could potentially be impacted by a maintenance event is directed towards another instance of the service. On Tuesday, February 24, 2009, an unexpected service disruption occurred during a routine maintenance event in a data center. In this particular case, users were directed towards an alternate data center in preparation for the maintenance tasks, but the new software that optimizes the location of user data had the unexpected side effect of triggering a latent bug in the Google Mail code. The bug caused the destination data center to become overloaded when users were directed to it, which in turn caused multiple downstream overload conditions as user traffic was automatically shifted in response to the failures. Google engineers acted quickly to re-balance load across data centers to restore users' access. This process took some time to complete.
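
[Editor's illustration, not part of Google's message.] The cascade described above can be sketched as a toy simulation. This is not Google's code, and it does not model the actual bug. The site names, load and capacity numbers, and the naive "send everything to the site with the most headroom" policy are all assumptions made for illustration. It only shows how moving traffic off one site can overload the next, and the next, when every site is already running close to its limit.

```python
def simulate_drain(load, capacity, drained):
    """Drain one site and follow the displaced traffic (hypothetical model).

    load, capacity: dicts mapping site name -> traffic units.
    Returns (sites that overloaded, in order; traffic left with nowhere to go).
    """
    load = dict(load)                  # work on a copy, leave the input alone
    healthy = set(load) - {drained}
    displaced = load[drained]
    load[drained] = 0
    overloaded = []

    while displaced > 0 and healthy:
        # Assumed policy: all displaced traffic goes to the single healthy
        # site with the most spare capacity (ties broken by name).
        target = max(sorted(healthy), key=lambda dc: capacity[dc] - load[dc])
        load[target] += displaced
        displaced = 0
        if load[target] > capacity[target]:
            # The target overloads, drops out, and its whole load moves on.
            overloaded.append(target)
            healthy.remove(target)
            displaced = load[target]
            load[target] = 0

    return overloaded, displaced


capacity = {"A": 100, "B": 100, "C": 100, "D": 100}

# Lightly loaded sites absorb the drained traffic without trouble.
print(simulate_drain({"A": 30, "B": 30, "C": 30, "D": 30}, capacity, "A"))

# Sites running at 60% cannot: each shift overloads the next site in turn.
print(simulate_drain({"A": 60, "B": 60, "C": 60, "D": 60}, capacity, "A"))
```

In the second call, every remaining site overloads in sequence and all 240 units of traffic end up unserved, which is why recovery in a situation like this tends to mean re-balancing load gradually rather than redirecting it all at once.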

The recently launched Apps Status Dashboard includes greater detail on this February 24th incident, including actions we are taking to continually improve performance.  For a direct link to this Incident Report, visit http://www.google.com/appsstatus/ir/1nsexcr2jnrj1d6.pdf (English only).

For ongoing service performance information, please access the Apps Status Dashboard at http://www.google.com/appsstatus (English only).

We are very sorry for the inconvenience that this incident has caused. We understand that system problems are inconvenient and frustrating for customers who have come to rely on our products to do many different things. One of Google's core values is to focus on the user, so we are working very hard to make improvements to our technology and operational processes so as to prevent service disruptions. We are confident that we will achieve continuous improvements quickly and persistently.

Once again, we apologize for the impact that this incident has caused. Thank you very much for your continued support.

Sincerely,

The Google Apps Team

Email preferences: You have received this mandatory email service announcement to update you about important changes to your Google Apps product or account.

Google, Inc.
1600 Amphitheatre Parkway
Mountain View, CA 94043

1 comment:

cd said...

Sounds like the kind of tech smoke screen I used to write about chemicals! My bigger problem is that I routinely can't access any of my calendars from home; I have to clear cache and cookies and then reboot. All things considered, Google's still better than the MSN page that carries the same story for months. Thanks for the update, JT.