The Dropbox Traffic team is charged with innovating our application networking stack to improve the experience for every one of our users—over half a billion of them. This article describes our work with NS1 to optimize our intelligent DNS-based global load balancing for corner cases that we uncovered while improving our point of presence (PoP) selection automation for our edge network. By co-developing the platform capabilities with NS1 to handle these outliers, we deliver positive Dropbox experiences to more users, more consistently.
Spoiler alert: BBRv2 is slower than BBRv1 but that’s a good thing.
BBRv1 Congestion Control
Three years have passed since “Bottleneck Bandwidth and Round-trip” (BBR) congestion control was released. Nowadays, it is considered production-ready and added to Linux, FreeBSD, and Chrome (as part of QUIC.) In our blogpost from 2017, “Optimizing web servers for high throughput and low latency,” we evaluated BBRv1 congestion control on our edge network and it showed awesome results:
Since then, BBRv1 has been deployed to Dropbox Edge Network and we got accustomed to some of its downsides.
Dropbox server-side software lives in a large monorepo. One lesson we’ve learned scaling the monorepo is to minimize the number of global operations that operate on the repository as a whole. Years ago, it was reasonable to run our entire test corpus on every commit to the repository. This scheme became untenable as we added more tests. One obvious inefficiency is the pointless and wasteful execution of tests that can’t possibly be affected by a particular change.
We addressed this problem with the help of our build system. Code in our monorepo is built and tested exclusively with Bazel.
At Dropbox, we use monitoring and alerting to gain insight into what our server applications are doing, how they’re performing, and to notify engineers when systems aren’t behaving as expected. Our monitoring systems operate across 1,000 machines, with 60,000 alerts evaluated continuously. In 2018, we reinvented our monitoring and alerting processes, obviating the need for manual recovery and repair. We boosted query speed by up to 10000x in some cases by caching more than 200,000 queries. This work has improved both user experience and trust in the platform, allowing our engineers to monitor, debug, and ship improvements with higher velocity and efficiency.
Dropbox is a big user of Python. It’s our most widely used language both for backend services and the desktop client app (we are also heavy users of Go, TypeScript, and Rust). At our scale—millions of lines of Python—the dynamic typing in Python made code needlessly hard to understand and started to seriously impact productivity. To mitigate this, we have been gradually migrating our code to static type checking using mypy, likely the most popular standalone type checker for Python. (Mypy is an open source project, and the core team is employed by Dropbox.)
Dropbox has been one of the first companies to adopt Python static type checking at this scale.