Hi. We're building a bug and issue tracker and you can watch.

Friday's Unscheduled Maintenance and Data Loss

On Friday morning we made a mistake that led to the loss of everything created during a ten-hour window. Naturally, this isn't something that we take lightly, and we want to be entirely transparent about how it happened and what our response is. Note: We updated the length of the window to be more accurate.

Was I affected?

Any data entered into Sifter between 3:59 UTC and 16:52 UTC on Friday, March 18, 2011 has been lost. Note: We updated the times to be more accurate. 16:52 UTC is when Sifter came back online, so this time window is larger than the window of lost data, because web access was disabled for the last hour and a half to two hours of it.
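If your notification emails or time-tracking entries carry local timestamps, a small script can tell you whether a given moment falls inside the stated window. This is only a minimal sketch built from the times above; it is not part of Sifter, and the function name `was_in_affected_window` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Window boundaries as stated in this post (UTC).
WINDOW_START = datetime(2011, 3, 18, 3, 59, tzinfo=timezone.utc)
WINDOW_END = datetime(2011, 3, 18, 16, 52, tzinfo=timezone.utc)


def was_in_affected_window(created_at: datetime) -> bool:
    """Return True if a timezone-aware timestamp falls inside the stated window."""
    if created_at.tzinfo is None:
        # Naive timestamps are ambiguous, so refuse to guess the zone.
        raise ValueError("created_at must be timezone-aware")
    return WINDOW_START <= created_at.astimezone(timezone.utc) <= WINDOW_END


# Example: timestamps from a team working at UTC-4 (an assumed offset).
local = timezone(timedelta(hours=-4))
print(was_in_affected_window(datetime(2011, 3, 18, 9, 15, tzinfo=local)))
print(was_in_affected_window(datetime(2011, 3, 17, 23, 30, tzinfo=local)))
```

Keep in mind that the end of the window is when Sifter came back online, so anything flagged in the last hour and a half to two hours likely reflects downtime rather than lost data.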

What if I was affected?

While some of the data in the database has been lost, teams should collectively have a record of it in their email, thanks to Sifter's email notifications. It isn't a perfect solution, but the emails should help minimize the chance that anything slips through the cracks.

What should I do now?

If you were affected by the data loss or downtime, please contact us via our support request form. There's no way for us to put a value on the inconvenience to you and your team, but if you've been affected, let us know, and we will credit your account for the month. If you have any questions, or would like clarification, don't hesitate to ask.

What happened?

Around 1:30 AM UTC on Friday, March 18, 2011, we began a slice resize to boost performance in the short term while we made long-term plans for significant improvements to our production environment. Everything checked out and ran smoothly into the evening. This morning, after reviewing and double-checking everything, we concluded that the resize wasn't making a difference and decided against keeping it.

At that point, we had a decision to make. We could either roll back the slice with about a minute of downtime, or confirm the resize and then resize downwards at a later time with 20-30 minutes more downtime. Unfortunately, we decided to try to minimize the downtime and just roll back the slice. As a result, all data that had been entered on the new slice was lost when the new slice was deleted in the rollback.

We realized this almost immediately after the resize and turned Sifter off right away while we researched our options. We verified with Slicehost that the data was indeed lost. We run offsite backups every night at midnight our time, so we knew that we could recover all of the data prior to that time.

With Sifter disabled, we restored from our most current backup. Unfortunately, the backup didn't include any data created after 4:00 AM Friday, March 18, 2011 UTC. So anything created after that time has been lost. We know that apologies don't go very far, but it should go without saying that we are truly sorry for the problems that this has inevitably caused.

What steps are you taking to prevent this in the future?

First and foremost, this is one of those lessons that you learn from and absolutely never forget. So, we won't make a mistake like this again. Of course, that's not enough. Prior to this incident, we had already started exploring our options for improving our architecture. Our main priority was improving performance, but we also plan on adding more layers of redundancy and backups. We'll be better than ever as a result of this, but we know that doesn't bring back the lost data.

We don't have full technical details yet because we're still evaluating our next steps, but rest assured we'll be making significant updates to our architecture and backup system.

A Sincere Apology

Words always seem kind of empty when something like this happens, but anyone who's ever contacted us should know how passionate we are about taking incredible care of our customers. We're taking this hard, and we'll be working even harder to make amends. We sincerely apologize for our mistake and look forward to making this up to you.

Comments

It's refreshing to see a company owning up to a mistake they made instead of trying to place the blame elsewhere. While data loss is never a good thing at least you are open and honest about the causes and what you'll do to prevent this in the future.

Yes, it's good to hear what really happened, but: I spent the whole morning recovering our work from Friday. We are using the ticket IDs in our time tracking as well, so we had to sync the data there, and so on. This should never happen again!!

@peterlih - That goes without saying. In fact, it simply never should have happened in the first place. I assure you that we couldn't be more concerned about the mistake and resulting inconvenience. We're actively working at this very moment to improve our systems so that this not only doesn't happen again, but can't happen again.
