The Dropbox Traffic team is charged with innovating our application networking stack to improve the experience for every one of our users—over half a billion of them. This article describes our work with NS1 to optimize our intelligent DNS-based global load balancing for corner cases that we uncovered while improving our point of presence (PoP) selection automation for our edge network. By co-developing the platform capabilities with NS1 to handle these outliers, we deliver positive Dropbox experiences to more users, more consistently.
Spoiler alert: BBRv2 is slower than BBRv1 but that’s a good thing.
BBRv1 Congestion Control
Three years have passed since “Bottleneck Bandwidth and Round-trip” (BBR) congestion control was released. Nowadays, it is considered production-ready and added to Linux, FreeBSD, and Chrome (as part of QUIC.) In our blogpost from 2017, “Optimizing web servers for high throughput and low latency,” we evaluated BBRv1 congestion control on our edge network and it showed awesome results:
Since then, BBRv1 has been deployed to Dropbox Edge Network and we got accustomed to some of its downsides.
Dropbox server-side software lives in a large monorepo. One lesson we’ve learned scaling the monorepo is to minimize the number of global operations that operate on the repository as a whole. Years ago, it was reasonable to run our entire test corpus on every commit to the repository. This scheme became untenable as we added more tests. One obvious inefficiency is the pointless and wasteful execution of tests that can’t possibly be affected by a particular change.
We addressed this problem with the help of our build system. Code in our monorepo is built and tested exclusively with Bazel.
At Dropbox, we use monitoring and alerting to gain insight into what our server applications are doing, how they’re performing, and to notify engineers when systems aren’t behaving as expected. Our monitoring systems operate across 1,000 machines, with 60,000 alerts evaluated continuously. In 2018, we reinvented our monitoring and alerting processes, obviating the need for manual recovery and repair. We boosted query speed by up to 10000x in some cases by caching more than 200,000 queries. This work has improved both user experience and trust in the platform, allowing our engineers to monitor, debug, and ship improvements with higher velocity and efficiency.
Layer-7 load balancing (LB) is a core building block to scale today’s web services. In an earlier post, we introduced Bandaid, the service proxy we built in house at Dropbox that load balances the vast majority of our user requests to backend services. More balanced load distribution among backend servers is desirable because it improves the performance and reliability of our services which benefits our users. Our Traffic/Runtime team recently explored leveraging real-time load information from backend servers to make better load balancing decisions in Bandaid. In this post, we will share the experiences and results, as well as discussing load balancing in general.
Open source is not just for software. The same benefits of rapid innovation and community validation apply to hardware specifications as well. That’s why I’m happy to write that the v1.0 of the RunBMC hardware spec has been contributed to Open Compute Project (OCP). Before I get into what BMCs (baseboard management controllers) are and why modern data centers are dependent on them, let’s zoom out to what companies operating at cloud scale have learned.
Cloud software companies like Dropbox have millions, and in some cases, billions of users. When these cloud companies started building out their own data centers,