Enhancing Bandaid load balancing at Dropbox by leveraging real-time backend server load information

Layer-7 load balancing (LB) is a core building block to scale today’s web services. In an earlier post, we introduced Bandaid, the service proxy we built in house at Dropbox that load balances the vast majority of our user requests to backend services. More balanced load distribution among backend servers is desirable because it improves the performance and reliability of our services which benefits our users. Our Traffic/Runtime team recently explored leveraging real-time load information from backend servers to make better load balancing decisions in Bandaid. In this post, we will share the experiences and results, as well as discussing load balancing in general.

Read more

RunBMC: OCP hardware spec solves data center BMC pain points

Open source is not just for software. The same benefits of rapid innovation and community validation apply to hardware specifications as well. That’s why I’m happy to write that the v1.0 of the RunBMC hardware spec has been contributed to Open Compute Project (OCP). Before I get into what BMCs (baseboard management controllers) are and why modern data centers are dependent on them, let’s zoom out to what companies operating at cloud scale have learned.

Cloud software companies like Dropbox have millions, and in some cases, billions of users. When these cloud companies started building out their own data centers,

Read more

SMR: What we learned in our first year

A year ago, we became the first major tech company to adopt high-density SMR (Shingled Magnetic Recording) technology for our storage drives. At the time, we faced a challenge: while SMR offers major cost savings over conventional PMR (Perpendicular Magnetic Recording) drives, the technology is slower to write than conventional drives. We set out on a journey to reap the cost-saving benefit of SMR without giving up on performance. One year later, here’s the story of how we achieved just that.

The Best Surprise Is No Surprise

When the first production machines started arriving in September,

Read more

Athena: Our automated build health management system

At Dropbox, we run more than 35,000 builds and millions of automated tests every day. With so many tests, a few are bound to fail non-deterministically or “flake.” Some new code submissions are bound to break the build, which prevents developers from cutting a new release. At this scale, it’s critical we minimize the manual intervention necessary to temporarily disable flaky tests, revert build-breaking commits, and notify test owners of these issues. We built a system called Athena to manage build health and automatically keep the build green.

What we used to do

To ensure basic correctness,

Read more

How we optimized Magic Pocket for cold storage

Ever since we launched Magic Pocket, our in-house multi-exabyte storage system, we’ve been continuously looking for opportunities to improve efficiency, while maintaining our high standards for reliability. Last year, we pushed the limits of storage density by being the first major tech company to adopt SMR storage. In this post, we’ll discuss another advance in storage technology at Dropbox: a new cold storage tier that’s optimized for less frequently accessed data. This storage runs on the same SMR disks as our more active data, and through the same internal network.

The Lifetime of a file

The access characteristics of a file at Dropbox varies heavily over time.

Read more

Finding Kafka’s throughput limit in Dropbox infrastructure

Apache Kafka is a popular solution for distributed streaming and queuing for large amounts of data. It is widely adopted in the technology industry, and Dropbox is no exception. Kafka plays an important role in the data fabric of many of our critical distributed systems: data analytics, machine learning, monitoring, search, and stream processing (Cape), to name a few.

At Dropbox, Kafka clusters are managed by the Jetstream team, whose primary responsibility is to provide high quality Kafka services. Understanding Kafka’s throughput limit in Dropbox infrastructure is crucial in making proper provisioning decision for different use cases,

Read more