Aug 01, 2023

How We Built It: Block Storage for AI/ML Workloads, Powered by Lightbits


As part of our "How We Built It" series, we're exploring the technologies and partnerships that help make Crusoe Cloud a reality. In this blog post, we'll dive into how Lightbits has become an integral part of Crusoe Cloud, and how it enables Crusoe Cloud's customers to run high-performance AI/ML workloads at petabyte scale.

What is Crusoe Cloud?

Crusoe Cloud is an Infrastructure-as-a-Service (IaaS) platform designed specifically to tackle the challenges of energy-intensive computing. By leveraging wasted, stranded, and clean energy sources and optimizing our systems for energy efficiency, Crusoe Cloud provides a sustainable, cost-effective solution for high-performance computing workloads, such as Large Language Model (LLM) training, inference, and generative AI.

Why did Crusoe Cloud integrate Lightbits?

Crusoe Cloud wanted to expand its IaaS offering by including a robust, high-performance block storage solution. Historically, Crusoe Cloud only offered storage that was local to the instance, as depicted in the diagram below:

There were several problems with this system:

  • Since the VM's operating system lived on local disks, we weren't able to move (and therefore resize) stopped VMs

  • Customers weren't able to add additional storage or easily share stored data between VMs

We knew that we needed to move from a world where servers had local, persistent block storage to a world where all local storage was ephemeral (and tied to the lifecycle of the hardware) and persistent storage was handled by a remote block storage solution.

We also knew that we needed a solution that was:

  • Highly performant, to meet the demanding latency and throughput requirements of performance-sensitive AI/ML workloads

  • Software defined and easy to operate by a small team, composed mostly of software engineers

  • Economically aligned, with a cost model that scaled (at worst) linearly with capacity

  • Backed by high quality, developer-focused support

Lightbits wins on performance

After considering several potential options, including other vendors and open source software projects such as Ceph, we compared capabilities through extensive performance testing. Our findings revealed that Lightbits was superior in terms of IOPS, with notably lower latencies, even when operating on the same hardware.

As shown in the figures below, in a test against the leading OSS option, Lightbits demonstrates up to a 4x advantage in bandwidth, most notably at smaller IO sizes. Additionally, Lightbits consistently keeps latencies under 500 microseconds, while the competition's latencies climb well above that, especially for random accesses.

Lightbits scales IOPS with increased load while maintaining latency under 0.5ms. The competition faces challenges in scaling IOPS, leading to latency exceeding 2.5ms under random access.

Increasing the block size from 4KB to 8KB shows similar behavior: again, the competition struggles to scale IOPS, with latency exceeding 2.5ms under random access, while Lightbits scales and maintains consistently low latency.
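To make the link between latency and IOPS concrete, here is a minimal back-of-the-envelope sketch using Little's Law (outstanding I/Os = IOPS x latency). It is not how we ran our benchmarks, and the queue depth of 32 is an assumption chosen purely for illustration; only the 0.5ms and 2.5ms latencies and the 4KB and 8KB block sizes come from the results above.

```python
def iops_ceiling(queue_depth: int, latency_us: int) -> float:
    """Little's Law: with `queue_depth` I/Os in flight and each taking
    `latency_us` microseconds, at most queue_depth / latency complete per second."""
    if queue_depth <= 0 or latency_us <= 0:
        raise ValueError("queue_depth and latency_us must be positive")
    return queue_depth * 1_000_000 / latency_us


def bandwidth_mib_s(iops: float, block_size_bytes: int) -> float:
    """Bandwidth is simply IOPS multiplied by the size of each I/O."""
    return iops * block_size_bytes / (1024 * 1024)


# Assumption for illustration only: 32 outstanding I/Os per client.
QUEUE_DEPTH = 32

for latency_us in (500, 2500):        # 0.5ms and 2.5ms, as in the results above
    for block_size in (4096, 8192):   # 4KB and 8KB blocks
        iops = iops_ceiling(QUEUE_DEPTH, latency_us)
        mib_s = bandwidth_mib_s(iops, block_size)
        print(f"{latency_us} us, {block_size} B: {iops:,.0f} IOPS, {mib_s:,.0f} MiB/s")
```

At a fixed queue depth, a 5x increase in latency means a 5x lower ceiling on IOPS, which is why holding latency under 0.5ms as load increases matters so much for scaling.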

From data preprocessing to real-time inference, the advantages of lower and more consistent latency, higher throughput, and linear scalability make Lightbits-backed block storage an excellent offering for Crusoe Cloud customers to optimize their AI workflows. Lightbits fills performance and operational gaps that other block storage solutions struggle to address for these workloads, and does so while providing a comprehensive set of enterprise functionality to help cloud builders like ourselves operate the system at scale.

How does Lightbits enable such high performance?

Lightbits uses NVMe/TCP, a protocol pioneered by Lightbits and now part of the NVMe-oF standard, which enables direct access to NVMe storage over standard Ethernet networks. This approach significantly reduces latency and maximizes throughput, making it ideal for demanding AI and ML workloads.

Furthermore, Lightbits’ clustering architecture enables linear scalability of performance and storage capacity, while maintaining sub-ms tail latency. This clustering architecture provides up to 3 replicas per volume across multiple availability zones for high availability.

What does our system look like now?

Implementing Lightbits within Crusoe Cloud enabled us to level up our cloud offering by completing the transition from local-only block storage to a combination of local, ephemeral NVMe devices and remote, persistent block storage.

Customer OSes are now stored in the shared cluster, enabling us to free up stopped VMs, as well as enabling customers to resize their VMs. Additionally, customers can now consume persistent and high performance storage from the Lightbits cluster, in the form of persistent disks.

What are we building going forward?

Beyond how we're currently using Lightbits, we're building out a few more use cases on top of our block storage offering.

Curated/custom images:

Building on top of our existing OS images (which are based on Ubuntu 20.04, plus NVIDIA's CUDA drivers and toolkit), we're able to leverage HashiCorp Packer templates to build OS images, which are then stored within Lightbits and served to customers on demand. In the near future, customers will be able to leverage the same pipelines to generate their own workload-specific images, e.g. LLM training with JAX or generative AI with Stable Diffusion. This flexibility allows our users to choose the most optimized environment for their research and development or production workload needs. This functionality is a relatively minor lift, due in large part to Lightbits' robust API surface area.

Snapshots/backups:

Lightbits' advanced snapshot and backup capabilities provide crucial data protection and recovery options for our users. With the ability to take storage-efficient snapshots and perform rapid data restoration, Crusoe Cloud users will soon be able to periodically snapshot their data to safeguard against unexpected or unintended loss.

Summary

Crusoe Cloud's close collaboration with Lightbits has leveled up our GPU cloud platform, enabling us to provide unparalleled performance and scalability to meet the needs of AI and ML, scientific computing, and graphics customers.

Lightbits' high-performance block storage solution has addressed our storage challenges, unlocking the full potential of Crusoe Cloud's infrastructure and empowering users to pursue innovative research and development in the fields of climate science and AI. Together, Crusoe Cloud and Lightbits are driving the future of climate-focused computing towards a more sustainable and efficient tomorrow.

To learn more about Crusoe Cloud and experience the power of Lightbits, visit crusoecloud.com and lightbitslabs.com, respectively.

