The sky didn't fall, but the cloud was dark over the weekend as Dropbox faced service disruptions that angered many users. The company reported that its online storage service went down Friday evening during scheduled maintenance and that most functionality was back about three hours later, with core service fully restored by 4:40 p.m. PT on Sunday.
So what happened? And what can we learn from the outage? Akhil Gupta, head of infrastructure at Dropbox, offered his insights in a blog post Sunday.
Gupta said Dropbox relies on thousands of databases to run -- and each database has one master and two slave machines for redundancy. The company performs full and incremental data backups and stores them in a separate environment. The trouble came during an operating system upgrade to some of Dropbox's machines.
What Really Happened?
"During this process, the upgrade script checks to make sure there is no active data on the machine before installing the new OS," Gupta said. "A subtle bug in the script caused the command to reinstall a small number of active machines. Unfortunately, some master-slave pairs were impacted, which resulted in the site going down."
Gupta assured users that their files were never at risk during the outage. These databases do not contain file data, he said, but are used to provide some Dropbox features, like photo album sharing, camera uploads, and some API features.
To restore service as fast as possible, Dropbox performed the recovery from its backups. Gupta said the company was able to restore most functionality within three hours, but the large size of some of the Dropbox databases slowed recovery, and complete restoration took several more hours.
What Dropbox Learned
In response to the incident, Dropbox has added a layer of checks that requires machines to locally verify their state before executing incoming commands. This, Gupta said, enables machines that self-identify as running critical processes to refuse potentially destructive operations.
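To make the idea concrete, here is a minimal Python sketch of a machine that checks its own state before acting on a remote command. It is an illustration only, not Dropbox's code: the Machine class, the serving_live_data flag, and the command names are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical set of commands treated as destructive in this sketch.
DESTRUCTIVE_COMMANDS = {"reinstall_os", "wipe_disk"}


@dataclass
class Machine:
    name: str
    serving_live_data: bool  # assumed local flag: is this host running a critical process?

    def handle(self, command: str) -> str:
        """Decide locally whether to run an incoming command."""
        # The machine trusts its own state, not the caller's claim that it is idle.
        if command in DESTRUCTIVE_COMMANDS and self.serving_live_data:
            return "refused"
        return "executed"


if __name__ == "__main__":
    master = Machine("db-master-1", serving_live_data=True)
    spare = Machine("db-spare-7", serving_live_data=False)
    print(master.name, master.handle("reinstall_os"))
    print(spare.name, spare.handle("reinstall_os"))
```

The point of the design is that the final safety check lives on the target machine, so a bug in the script that sends the command cannot by itself wipe an active database.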
"When running infrastructure at large scale, the standard practice of running multiple slaves provides redundancy. However, should those slaves fail, the only option is to restore from backup. The standard tool used to recover MySQL data from backups is slow when dealing with large data sets," he said. "To speed up our recovery, we developed a tool that parallelizes the replay of binary logs. This enables much faster recovery from large MySQL backups. We plan to open-source this tool so others can benefit from what we've learned."
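Gupta did not describe how the tool splits up the work. One plausible approach, sketched below in Python purely as an illustration, is to group logged changes by the database they touch and replay each group in order on its own worker. The sketch assumes changes to different databases are independent of one another; the replay_parallel function, the (schema, statement) event format, and the apply_event callback are hypothetical and are not part of MySQL or of Dropbox's tool.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def replay_parallel(events, apply_event, max_workers=4):
    """Replay (schema, statement) events, one worker per schema.

    Order is preserved within each schema. Order across schemas is not,
    which is only safe under the assumption that schemas are independent.
    """
    by_schema = defaultdict(list)
    for schema, statement in events:
        by_schema[schema].append(statement)

    def replay_one(schema):
        # Apply this schema's statements strictly in logged order.
        for statement in by_schema[schema]:
            apply_event(schema, statement)
        return schema, len(by_schema[schema])

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Returns a mapping of schema name to number of statements replayed.
        return dict(pool.map(replay_one, list(by_schema)))


if __name__ == "__main__":
    log = [("photos", "INSERT 1"), ("api", "INSERT 2"), ("photos", "UPDATE 3")]
    applied = {"photos": [], "api": []}
    counts = replay_parallel(log, lambda s, stmt: applied[s].append(stmt))
    print(counts, applied)
```

A serial replay would process the three statements one after another; here the two "photos" statements and the one "api" statement can proceed at the same time, which is where the speedup on large backups would come from.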
What It All Means
So what does all this mean for cloud-based service users? We asked Charles Weaver, CEO of the International Association of Cloud and Managed Service Providers, for his take on the deeper meaning. He told us the Dropbox outage draws attention to the inherent risks and issues with public cloud services.
"Not just regarding security and privacy, but also with respect to transparency. When private cloud providers have outages, their customers usually have a better sense of accountability about what their cloud provider is doing and who is managing their data. Not so with public cloud," Weaver said.
"The important thing for businesses to realize is that cloud computing can come in many different flavors. There are consumer-grade and business-grade cloud providers, and it is important for organizations to assess their needs prior to selecting a cloud platform. This includes both data privacy and security requirements, which impact the type of cloud provider you choose."