As a company that manages our own infrastructure we need to be able to rapidly install new server capacity and ensure that the equipment entering our production environment is highly reliable. Prior to the creation and implementation of the Pirlo system, engineering personnel at Dropbox manually intervened in most aspects of server/switch provisioning and validation.
Pirlo was designed to eliminate and automate many of these manual processes. In this post we will describe Pirlo, a flexible system designed to validate and configure network switches and to ensure the reliability of servers before they enter production.
Dropbox runs hundreds of services, written in different languages, which exchange millions of requests per second. At the core of our Service Oriented Architecture is Courier, our gRPC-based Remote Procedure Call (RPC) framework. While developing Courier, we learned a lot about extending gRPC, optimizing performance for scale, and providing a bridge from our legacy RPC system.
Note: this post shows code generation examples in Python and Go. We also support Rust and Java.
The road to gRPC
Courier is not Dropbox’s first RPC framework. Even before we started to break our Python monolith into services in earnest,
We introduced Cape in a previous post. In a nutshell, Cape is a framework for enabling real-time asynchronous event processing at a large scale with strong guarantees. It has been over a year since the system was launched. Today Cape is a critical component for Dropbox infrastructure. It operates with both high performance and reliability at a very large scale. Here are a few key metrics, Cape is:
- running on thousands of servers across the continent
- subscribing to over 30 different event domains at a rate of 30K/s
- processing jobs of various sizes at rate of 150K/s
- delivering 95% of events under 1 second after they are created.
Dropbox invests heavily in our security program. We have lots of teams dedicated to securing Dropbox, each working on exciting things. Some recent examples covered on our tech blog include:
- Our Product Security team rolled out support for WebAuthn to boost user adoption of two-step verification and upleveled our industry-leading public bug bounty program
- Because security is everyone’s responsibility, our Security Culture team helps our employees make consistently secure and informed decisions that protect Dropbox, our users, and our employees
- Our Detection and Response Team (DART) implementation of extensive instrumentation throughout our infrastructure to catch any indications of compromise.
Dropbox stores petabytes of metadata to support user-facing features and to power our production infrastructure. The primary system we use to store this metadata is named Edgestore and is described in a previous blog post, (Re)Introducing Edgestore. In simple terms, Edgestore is a service and abstraction over thousands of MySQL nodes that provides users with strongly consistent, transactional reads and writes at low latency.
Edgestore hides details of physical sharding from the application layer to allow developers to scale out their metadata storage needs without thinking about complexities of data placement and distribution.
One of the greatest challenges associated with maintaining a complex desktop application like Dropbox is that with hundreds of millions of installs, even the smallest bugs can end up affecting a very large number of users. Bugs inevitably will strike, and while most of them allow the application to recover, some cause the application to terminate. These terminations, or “crashes,” are highly disruptive events: when Dropbox stops, synchronization stops. To ensure uninterrupted sync for our users we automatically detect and report all crashes and take steps to restart our application when they occur.
In 2016, faced with our impending transition to Python 3,