This is the first in a series of posts that Dropbox Principal Engineer James Cowling has published on his personal Medium blog about technical leadership. Being a strong tech lead is very different from being a strong engineer and we thought the readers of our tech blog would find his experiences relevant and interesting.
Back when I was a first-time tech lead at Dropbox I had the misfortune of juggling two intimidating responsibilities at the same time:
- Build a multi-exabyte distributed storage system and migrate our data off Amazon S3,
The Dropbox desktop client is relied on by millions of users across the world to save their most important files and keep them in sync across their devices. Weighing in at over 1 million lines of Python logic, we had a massive surface area for potential issues in our migration from Python 2 to Python 3. In this process, we knew that we had to be worthy of the trust that users place in Dropbox and keep their information safe.
Over the last few months, we’ve explored why and how we rolled out our Python 3 migration,
Apache Kafka is a popular solution for distributed streaming and queuing for large amounts of data. It is widely adopted in the technology industry, and Dropbox is no exception. Kafka plays an important role in the data fabric of many of our critical distributed systems: data analytics, machine learning, monitoring, search, and stream processing (Cape), to name a few.
At Dropbox, Kafka clusters are managed by the Jetstream team, whose primary responsibility is to provide high quality Kafka services. Understanding Kafka’s throughput limit in Dropbox infrastructure is crucial in making proper provisioning decision for different use cases,
Dropbox needs its underlying network infrastructure to be reliable, high-performing, cost-effective, and truly scalable. In previous posts we described how the edge network was designed to improve user performance, and how the supporting multi-terabit backbone network spans continents to interconnect edge PoPs and multiple data centers.
In this post we describe how we evolved the Dropbox data center network from the legacy chassis based four-post architecture to a scalable multi-tier, quad-plane fabric. Also, we successfully deployed our first fabric at our newest data center in California earlier this year!
Dropbox network physical footprint
We currently have global network presence and multiple data centers in California,
As a company that manages our own infrastructure we need to be able to rapidly install new server capacity and ensure that the equipment entering our production environment is highly reliable. Prior to the creation and implementation of the Pirlo system, engineering personnel at Dropbox manually intervened in most aspects of server/switch provisioning and validation.
Pirlo was designed to eliminate and automate many of these manual processes. In this post we will describe Pirlo, a flexible system designed to validate and configure network switches and to ensure the reliability of servers before they enter production.
Dropbox runs hundreds of services, written in different languages, which exchange millions of requests per second. At the core of our Service Oriented Architecture is Courier, our gRPC-based Remote Procedure Call (RPC) framework. While developing Courier, we learned a lot about extending gRPC, optimizing performance for scale, and providing a bridge from our legacy RPC system.
Note: this post shows code generation examples in Python and Go. We also support Rust and Java.
The road to gRPC
Courier is not Dropbox’s first RPC framework. Even before we started to break our Python monolith into services in earnest,