SMR: What we learned in our first year

A year ago, we became the first major tech company to adopt high-density SMR (Shingled Magnetic Recording) technology for our storage drives. At the time, we faced a challenge: while SMR offers major cost savings over conventional PMR (Perpendicular Magnetic Recording) drives, the technology is slower to write than conventional drives. We set out on a journey to reap the cost-saving benefit of SMR without giving up on performance. One year later, here’s the story of how we achieved just that.

The Best Surprise Is No Surprise

When the first production machines started arriving in September,

Read more

Athena: Our automated build health management system

At Dropbox, we run more than 35,000 builds and millions of automated tests every day. With so many tests, a few are bound to fail non-deterministically or “flake.” Some new code submissions are bound to break the build, which prevents developers from cutting a new release. At this scale, it’s critical we minimize the manual intervention necessary to temporarily disable flaky tests, revert build-breaking commits, and notify test owners of these issues. We built a system called Athena to manage build health and automatically keep the build green.

What we used to do

To ensure basic correctness,

Read more

How we optimized Magic Pocket for cold storage

Ever since we launched Magic Pocket, our in-house multi-exabyte storage system, we’ve been continuously looking for opportunities to improve efficiency, while maintaining our high standards for reliability. Last year, we pushed the limits of storage density by being the first major tech company to adopt SMR storage. In this post, we’ll discuss another advance in storage technology at Dropbox: a new cold storage tier that’s optimized for less frequently accessed data. This storage runs on the same SMR disks as our more active data, and through the same internal network.

The Lifetime of a file

The access characteristics of a file at Dropbox varies heavily over time.

Read more

Finding Kafka’s throughput limit in Dropbox infrastructure

Apache Kafka is a popular solution for distributed streaming and queuing for large amounts of data. It is widely adopted in the technology industry, and Dropbox is no exception. Kafka plays an important role in the data fabric of many of our critical distributed systems: data analytics, machine learning, monitoring, search, and stream processing (Cape), to name a few.

At Dropbox, Kafka clusters are managed by the Jetstream team, whose primary responsibility is to provide high quality Kafka services. Understanding Kafka’s throughput limit in Dropbox infrastructure is crucial in making proper provisioning decision for different use cases,

Read more

The scalable fabric behind our growing data center network

Dropbox needs its underlying network infrastructure to be reliable, high-performing, cost-effective, and truly scalable. In previous posts we described how the edge network was designed to improve user performance, and how the supporting multi-terabit backbone network spans continents to interconnect edge PoPs and multiple data centers.

In this post we describe how we evolved the Dropbox data center network from the legacy chassis based four-post architecture to a scalable multi-tier, quad-plane fabric. Also, we successfully deployed our first fabric at our newest data center in California earlier this year!

Dropbox network physical footprint

We currently have global network presence and multiple data centers in California,

Read more

Automating Datacenter Operations at Dropbox

 

Introduction

As a company that manages our own infrastructure we need to be able to rapidly install new server capacity and ensure that the equipment entering our production environment is highly reliable. Prior to the creation and implementation of the Pirlo system, engineering personnel at Dropbox manually intervened in most aspects of server/switch provisioning and validation.

Pirlo was designed to eliminate and automate many of these manual processes. In this post we will describe Pirlo, a flexible system designed to validate and configure network switches and to ensure the reliability of servers before they enter production.

Read more