At Dropbox, we use monitoring and alerting to gain insight into what our server applications are doing, how they’re performing, and to notify engineers when systems aren’t behaving as expected. Our monitoring systems operate across 1,000 machines, with 60,000 alerts evaluated continuously. In 2018, we reinvented our monitoring and alerting processes, obviating the need for manual recovery and repair. We boosted query speed by up to 10000x in some cases by caching more than 200,000 queries. This work has improved both user experience and trust in the platform, allowing our engineers to monitor, debug, and ship improvements with higher velocity and efficiency.
Layer-7 load balancing (LB) is a core building block to scale today’s web services. In an earlier post, we introduced Bandaid, the service proxy we built in house at Dropbox that load balances the vast majority of our user requests to backend services. More balanced load distribution among backend servers is desirable because it improves the performance and reliability of our services which benefits our users. Our Traffic/Runtime team recently explored leveraging real-time load information from backend servers to make better load balancing decisions in Bandaid. In this post, we will share the experiences and results, as well as discussing load balancing in general.
Open source is not just for software. The same benefits of rapid innovation and community validation apply to hardware specifications as well. That’s why I’m happy to write that the v1.0 of the RunBMC hardware spec has been contributed to Open Compute Project (OCP). Before I get into what BMCs (baseboard management controllers) are and why modern data centers are dependent on them, let’s zoom out to what companies operating at cloud scale have learned.
Cloud software companies like Dropbox have millions, and in some cases, billions of users. When these cloud companies started building out their own data centers,
A year ago, we became the first major tech company to adopt high-density SMR (Shingled Magnetic Recording) technology for our storage drives. At the time, we faced a challenge: while SMR offers major cost savings over conventional PMR (Perpendicular Magnetic Recording) drives, the technology is slower to write than conventional drives. We set out on a journey to reap the cost-saving benefit of SMR without giving up on performance. One year later, here’s the story of how we achieved just that.
The Best Surprise Is No Surprise
When the first production machines started arriving in September,
At Dropbox, we run more than 35,000 builds and millions of automated tests every day. With so many tests, a few are bound to fail non-deterministically or “flake.” Some new code submissions are bound to break the build, which prevents developers from cutting a new release. At this scale, it’s critical we minimize the manual intervention necessary to temporarily disable flaky tests, revert build-breaking commits, and notify test owners of these issues. We built a system called Athena to manage build health and automatically keep the build green.
What we used to do
To ensure basic correctness,
Ever since we launched Magic Pocket, our in-house multi-exabyte storage system, we’ve been continuously looking for opportunities to improve efficiency, while maintaining our high standards for reliability. Last year, we pushed the limits of storage density by being the first major tech company to adopt SMR storage. In this post, we’ll discuss another advance in storage technology at Dropbox: a new cold storage tier that’s optimized for less frequently accessed data. This storage runs on the same SMR disks as our more active data, and through the same internal network.
The Lifetime of a file
The access characteristics of a file at Dropbox varies heavily over time.