In our previous blog posts, we talked about how we updated the Dropbox search engine to add intelligence into our users’ workflow, and how we built our optical character recognition (OCR) pipeline. One of the most impactful benefits that users will see from these changes is that users on Dropbox Professional and Dropbox Business Advanced and Enterprise plans can search for English text within images and PDFs using a system we’re describing as automatic image text recognition.
The potential benefit of automatically recognizing text in images (including PDFs containing images) is tremendous.
In our previous post, we discussed the architecture of our new search engine, named Nautilus, and its use of machine intelligence to scale our search–ranking and content–understanding models. Along with best–in–class performance, scalability, and reliability, we also provided a foundation for implementing intelligent document ranking and retrieval features. This flexible system allows our engineers to easily customize the document–indexing and query–processing pipelines while maintaining strong safeguards to preserve the privacy of our users’ data.
In this post, we will discuss the process that we undertook to ensure optimal performance and reliability.
Each of the hundreds of our search leaves runs our retrieval engine,
Over the last few months, the Search Infrastructure engineering team at Dropbox has been busy releasing a new full-text search engine called Nautilus, as a replacement for our previous search engine.
Search presents a unique challenge when it comes to Dropbox due to our massive scale—with hundreds of billions of pieces of content—and also due to the need for providing a personalized search experience to each of our 500M+ registered users. It’s personalized in multiple ways: not only does each user have access to a different set of documents, but users also have different preferences and behaviors in how they search.
Compressing your files is a good way to save space on your hard drive. At Dropbox’s scale, it’s not just a good idea; it is essential. Even a 1% improvement in compression efficiency can make a huge difference. That’s why we conduct research into lossless compression algorithms that are highly tuned for certain classes of files and storage, like Lepton for jpeg images, and Pied-Piper-esque lossless video encoding. For other file types, Dropbox currently uses the zlib compression format, which saves almost 8% of disk storage.
We introduce DivANS,
Magic Pocket, the exabyte scale custom infrastructure we built to drive efficiency and performance for all Dropbox products, is an ongoing platform for innovation. We continually look for opportunities to increase storage density, reduce latency, improve reliability, and lower costs. The next step in this evolution is our new deployment of specially configured servers filled to capacity with high-density SMR (Shingled Magnetic Recording) drives.
Dropbox is the first major tech company to adopt SMR technology, and we’re currently adding hundreds of petabytes of new capacity with these high-density servers at a significant cost savings over conventional PMR (Perpendicular Magnetic Recording) drives.
With this post we begin a series of articles about our Service Oriented Architecture components at Dropbox, and the approaches we took in designing them. Bandaid, our service proxy, is one of these components. Follow along as we discuss Bandaid’s internal design and the approaches we chose for the implementation.
Bandaid started as a reverse proxy that compensated for inefficiencies in our server-side services. Later we developed it into a service proxy that accelerated adoption of Service Oriented Architecture at Dropbox.
A reverse proxy is a device or service that forwards requests from multiple clients to servers (i.e.