At Dropbox, we use monitoring and alerting to gain insight into what our server applications are doing, how they’re performing, and to notify engineers when systems aren’t behaving as expected. Our monitoring systems operate across 1,000 machines, with 60,000 alerts evaluated continuously. In 2018, we reinvented our monitoring and alerting processes, obviating the need for manual recovery and repair. We boosted query speed by up to 10000x in some cases by caching more than 200,000 queries. This work has improved both user experience and trust in the platform, allowing our engineers to monitor, debug, and ship improvements with higher velocity and efficiency.
One of the biggest challenges of the mobile developer community at Dropbox in 2018 was our custom build system. Our build system was slow, hard to use, and didn’t support some use cases which were out of scope of the original design. After 4 months of work by our Mobile Platform team, we were able to remove our unicorn implementation for something much more modern and easy to maintain.
In our new build system, we wanted to improve on a couple of things that our current build system was hindering:
- Make it easy to create new modules
- Allow developers to easily modify the build files
- Improve on build times for local development
- Industry standard approaches and tooling,
What does an engineer do after planning and decision on an approach? More planning! This time, the planning was laser-focused on what features to build into the new Android build system. Our team created a doc which outlined the milestones of the project and shared it to our internal customers for feedback. Subsequently, we agreed on the project scope and the key ideas for the migration.
Foundation for migration
Our goal in implementing a Gradle-only solution was to introduce flexibility, but also keep some of the great features BMBF had: guaranteed layered dependency order and reduced boilerplate in build files.
The Dropbox Detection and Response Team (DART) detects and mitigates information security threats to our employees, infrastructure, and customer data. DART ingests security-relevant logs for building detection, threat hunting and responding to potential incidents. Our log volume is huge, averaging tens of terabytes a day.
The problem we’re solving
Apart from building detections to track suspicious behavior and triaging incidents, we also spend large chunks of our time triaging false positive alerts and building context around individual alerts. This was time not spent hunting for attackers. As a result, any way to automate or improve triage process efficiency was appealing.
At Dropbox, we are building smart features that use machine intelligence to help reduce people’s busywork. Since introducing content suggestions, which we described in our previous blog post, we have been improving the underlying infrastructure and machine learning algorithms that power content suggestions.
One new challenge we faced during this iteration of content suggestions was the disparate types of content we wanted to support. In Dropbox, we have various kinds of content—files, folders, Google Docs, Microsoft Office documents, and our own Dropbox Paper.
Layer-7 load balancing (LB) is a core building block to scale today’s web services. In an earlier post, we introduced Bandaid, the service proxy we built in house at Dropbox that load balances the vast majority of our user requests to backend services. More balanced load distribution among backend servers is desirable because it improves the performance and reliability of our services which benefits our users. Our Traffic/Runtime team recently explored leveraging real-time load information from backend servers to make better load balancing decisions in Bandaid. In this post, we will share the experiences and results, as well as discussing load balancing in general.
Dropbox is a big user of Python. It’s our most widely used language both for backend services and the desktop client app (we are also heavy users of Go, TypeScript, and Rust). At our scale—millions of lines of Python—the dynamic typing in Python made code needlessly hard to understand and started to seriously impact productivity. To mitigate this, we have been gradually migrating our code to static type checking using mypy, likely the most popular standalone type checker for Python. (Mypy is an open source project, and the core team is employed by Dropbox.)
Dropbox has been one of the first companies to adopt Python static type checking at this scale.