The Dropbox Traffic team is charged with innovating our application networking stack to improve the experience for every one of our users—over half a billion of them. This article describes our work with NS1 to optimize our intelligent DNS-based global load balancing for corner cases that we uncovered while improving our point of presence (PoP) selection automation for our edge network. By co-developing the platform capabilities with NS1 to handle these outliers, we deliver positive Dropbox experiences to more users, more consistently.
Spoiler alert: BBRv2 is slower than BBRv1, but that’s a good thing.
BBRv1 Congestion Control
Three years have passed since “Bottleneck Bandwidth and Round-trip” (BBR) congestion control was released. Nowadays, it is considered production-ready and has been added to Linux, FreeBSD, and Chrome (as part of QUIC). In our 2017 blog post, “Optimizing web servers for high throughput and low latency,” we evaluated BBRv1 congestion control on our edge network, and it showed awesome results.
Since then, BBRv1 has been deployed to the Dropbox Edge Network, and we have grown accustomed to some of its downsides.
This is an expanded version of my talk at NginxConf 2017 on September 6, 2017. As an SRE on the Dropbox Traffic Team, I’m responsible for our Edge network: its reliability, performance, and efficiency. The Dropbox edge network is an nginx-based proxy tier designed to handle both latency-sensitive metadata transactions and high-throughput data transfers. In a system that handles tens of gigabits per second while simultaneously processing tens of thousands of latency-sensitive transactions, there are efficiency and performance optimizations throughout the proxy stack: from drivers and interrupts, through TCP/IP and the kernel, to library- and application-level tunings.
Large-scale networks are complex, dynamic systems with many parts, managed by many different teams. Each team has tools it uses to monitor its part of the system, but these tools measure very different things. Before we built our own infrastructure, Magic Pocket, we didn’t have a global view of our production network, and we didn’t have a way to look at the interactions between different parts in real time. Most of the logs from our production network are in semi-structured or unstructured formats, which makes it very difficult to track a large amount of log data in real time.
At Dropbox, our traffic team recently upgraded the front-end Nginx servers to enable HTTP/2 for our web services. In this article, we would like to share our experiences and findings during the HTTP/2 transition. The overall upgrade was smooth for us, although we ran into a couple of caveats that may be useful for others to know about.
Background: HTTP/2 and Dropbox web service infrastructure
HTTP/2 (RFC 7540) is the new major version of the HTTP protocol. It is based on SPDY and provides several performance optimizations compared to HTTP/1.1. These optimizations include more efficient header compression,