Dropbox Takes Blame For Cloud Outage

Post-mortem analysis says Friday’s cloud service outage was caused by a bad script in a routine maintenance update.


Dropbox says it knows how to avoid the kind of outage that struck its service Friday: In the future, it will check the state of its servers before making changes to their code.

That will prevent an operating system update from being applied to a running production server, a process that typically leads to the server shutting down. Enterprise users who rely on Dropbox for offsite storage already understand the problem. Some may be wondering whether Dropbox has the operational smarts to be relied on for the long run.
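In operational terms, the missing safeguard is a pre-flight check. Here is a minimal sketch of the idea in Python, assuming a hypothetical service-registry lookup; Dropbox has not published its maintenance tooling, so every name below is illustrative:

```python
# A minimal sketch of the pre-flight check Dropbox says it will add: verify a
# machine is out of production rotation before touching it. All names here are
# hypothetical; Dropbox has not published its maintenance tooling.

ACTIVE_HOSTS = {"db-0042", "db-0043"}  # stand-in for a service-registry query

def is_serving_traffic(host: str) -> bool:
    """Stand-in for a real check against a load balancer or service registry."""
    return host in ACTIVE_HOSTS

def reinstall_os(host: str) -> None:
    """Stand-in for the actual OS reinstall command."""
    print(f"reinstalling {host}")

def safe_reinstall(host: str) -> None:
    # The guard Friday's script lacked: check server state before acting.
    if is_serving_traffic(host):
        raise RuntimeError(f"{host} is live; refusing to reinstall")
    reinstall_os(host)

safe_reinstall("db-0044")    # proceeds: host is out of rotation
# safe_reinstall("db-0042")  # would raise: host is still serving traffic
```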

Dropbox was attempting to upgrade some servers’ operating systems Friday evening in a routine maintenance operation when a buggy script caused the updates to be applied to live production servers, a move that made the maintenance effort anything but routine. Dropbox customers experienced a 2.5-hour loss of access to the service, with some services out for much of the weekend.

Dropbox uses thousands of database servers to store pictures, documents, and other complex user data. Each database system consists of a master database server and two slaves, an approach that leaves two copies of the data intact in case of a server hardware failure. The maintenance script appears to have launched operating system reinstalls on running database servers.
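To make the arithmetic of that layout concrete, here is a rough model of a three-copy master/slave system; the host names are illustrative, not Dropbox’s:

```python
# A rough model of the layout described above: one master plus two slaves per
# database system, so any single hardware failure leaves two copies intact.
# Host names are illustrative.

from dataclasses import dataclass

@dataclass
class DatabaseSystem:
    master: str
    slaves: list[str]

    def surviving_copies(self, failed: set[str]) -> int:
        """Count replicas still holding the data after the given hosts fail."""
        return sum(1 for h in [self.master, *self.slaves] if h not in failed)

system = DatabaseSystem(master="db-1a", slaves=["db-1b", "db-1c"])
print(system.surviving_copies({"db-1a"}))           # 2 -> one box dies, data is safe
print(system.surviving_copies({"db-1a", "db-1b"}))  # 1 -> Friday's scenario: a
                                                    #      master/slave pair hit at once
```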

“A subtle bug in the script caused the command to reinstall a small number of active machines. Unfortunately, some master-slave pairs were impacted which resulted in the site going down,” wrote Akhil Gupta, head of infrastructure, in a post-mortem blog Sunday.

[Some cloud customers have grown tired of outages. See Amazon Cloud Outage Causes Customer To Leave.]

Dropbox went off the air abruptly Friday between 5:30 and 6:00 p.m. Pacific time. For two hours, Dropbox’s site remained dark, then reappeared at 8:00 p.m., according to user observations posted to Twitter and other sites. Users were able to log in again starting about 8:30 p.m. PT.

It wasn’t clear from Gupta’s post mortem how many servers were directly affected; at one point he spoke of “a handful.” Gupta assured customers, “Your files were never at risk during the outage. These databases do not contain file data. We use them to provide some of our features (for example, photo album sharing, camera uploads, and some API features).”

On the other hand, operation of some of the paired master/slave database systems seems to have been entirely lost, something a cloud operation always tries to prevent. Normally, if a master system is lost, its operations are taken offline long enough for the two slaves to create a third copy of the data and to appoint one of the three as the new master.
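A sketch of that normal failover path, purely illustrative since Dropbox’s actual procedure is not public: rebuild a third copy from a surviving slave onto a spare host, then promote one replica to master.

```python
# Illustrative failover sketch for the recovery path outlined above.
# Dropbox's actual procedure is not public; host names are hypothetical.

def fail_over(slaves: list[str], spare: str) -> tuple[str, list[str]]:
    """Recover a system whose master was lost; returns (new_master, new_slaves)."""
    if not slaves:
        raise RuntimeError("unrecoverable: no surviving replica to copy from")
    # 1. Writes are paused while the system is degraded (not modeled here).
    # 2. Clone a surviving slave onto a spare host to restore three copies.
    print(f"cloning {slaves[0]} -> {spare}")
    replicas = [*slaves, spare]
    # 3. Promote one of the three copies to be the new master.
    return replicas[0], replicas[1:]

print(fail_over(["db-1b", "db-1c"], spare="db-9z"))
# cloning db-1b -> db-9z
# ('db-1b', ['db-1c', 'db-9z'])
```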

Gupta explained in the blog, “To restore service as fast as possible, we performed the recovery from our backups.” That means Dropbox had to load stored copies of its databases to get its production systems running again.

“We were able to restore most functionality within 3 hours,” Gupta wrote, “but the large size of some of our databases slowed recovery, and it took until 4:40 p.m. PT [Sunday] for core service to fully return.” That was at least 46 hours and 40 minutes after the outage began. Dropbox’s Photo Lab service was still being worked on after 48 hours.

Two-and-a-half hours into the outage, Dropbox responded to rumors and denied that its site had been hacked. At 8:30 p.m. Friday, the company tweeted: “Dropbox site is back up! Claims of leaked user info are a hoax. The outage was caused during internal maintenance. Thanks for your patience!”

One Twitter user agreed: “Dropbox not hacked, just stupid.”

Gupta’s post mortem took forthright responsibility for the outage, admitting that Dropbox caused it with the faulty operating system upgrade script. It reassured users about their data, while explaining why it took so long to bring all services back online. But the fact that master/slave systems appear to have gone down together in routine maintenance is not fully explained. If the operating system upgrade had been staged so that only one of each system’s three database servers was changed at a time, two copies would have remained intact and recovery would have been faster.
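That staged approach is easy to express in code. The sketch below is the article’s suggestion, not Dropbox’s published fix, and every name in it is hypothetical: upgrade one replica per system per round, confirming each host rejoins before moving on, so every system keeps two untouched copies throughout.

```python
# Sketch of a staged rollout: round r touches only the r-th replica of each
# system, leaving two intact copies per system at all times. Hypothetical names;
# this illustrates the article's suggestion, not Dropbox's actual tooling.

SYSTEMS = {
    "system-1": ["db-1a", "db-1b", "db-1c"],
    "system-2": ["db-2a", "db-2b", "db-2c"],
}

def upgrade(host: str) -> None:
    print(f"upgrading {host}")  # stand-in for the real OS upgrade

def healthy(host: str) -> bool:
    return True  # stand-in for a replication-lag / health check

def staged_rollout(systems: dict[str, list[str]]) -> None:
    for r in range(3):  # one round per replica position
        for name, replicas in systems.items():
            upgrade(replicas[r])
            if not healthy(replicas[r]):
                raise RuntimeError(f"halting rollout: {replicas[r]} failed to rejoin")

staged_rollout(SYSTEMS)
```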
