Fungible Data Centers: Delivering the Promise of Composable Infrastructure at Scale

Fungible Data Centers are the culmination of our vision to revolutionize the agility, security, performance, and economics of scale-out data centers. They promise to improve the economics of hyper-scale data centers by over 2x and that of Enterprise data centers by over 8x, while also making substantial improvements to security, reliability, and performance.

The greatest advance in computing was the invention of general-purpose computers in 1945. Prior to this, computing was done by physically wiring electronic components to perform calculations. Needless to say, this was manual, slow, expensive, and prone to error. What this concept did above all else was to introduce the idea of programming as a formal way to describe computations: It abstracted the logical view of a computation from the physical in a way that delivered large improvements in productivity and computational performance. The principal inventor, John von Neumann, clearly understood the awesome power of this concept, but it is likely that even he did not foresee the incredible advances that would be unleashed over the next 75 years: an increase of roughly fifteen decimal orders of magnitude in the total computational power available to the world.

This increase was driven by the insatiable demand for computing and came from a combination of improvements in general-purpose computer architecture and advances in silicon technology. By the early 2000s, both sources of improvement had slowed sufficiently that the industry turned to “scale-out architectures”: the idea that data centers should be built by connecting large numbers of general-purpose servers to each other over a standards-based IP/Ethernet local-area network. In particular, a data center should be thought of as a single warehouse-sized computer shared by many users and capable of performing a broad range of computations. This has now become the standard way of building modern data centers.

Throughout the evolution of computing, the industry has endeavored to improve both agility, defined as the speed with which new applications can be deployed on existing hardware, and the runtime performance of those applications. Unfortunately, there appears to be an unavoidable tradeoff between the two: improving agility comes at a cost in performance, and conversely, improving performance involves compromises to agility. This tradeoff is familiar to most system designers, but end-users will also recognize it as the performance tax imposed by virtualization. The inescapable conclusion is that the only way to break this tradeoff is through architectural innovation, as was first done in 1945.

At its core, the architecture of a computing system is about making two distinct choices: a choice of the functionality implemented by elemental building blocks and a choice of how to interconnect them. We assert that the computing industry’s central problem is that both aspects of data center architecture have stagnated for over a decade!

The functionality of the elemental building blocks (general-purpose CPUs, specialized GPUs, DRAM, SSDs, HDDs), and particularly the way they are assembled into a server, has not changed significantly in decades, even though their individual performance has improved greatly. Recently, this rate of improvement has also slowed due to technology scaling limits. As a result, there have been calls for “domain-specific” silicon; two examples are the emergence of specialized silicon for AI and the development of SmartNICs. However, the idea of domain-specific silicon runs headlong into silicon economics: building silicon is economical only when volumes are large, so the “domain” being specialized needs to be ubiquitous.

The situation for server interconnection at data center scale is similar: the architecture of networks has stagnated even as the performance of the network building blocks has improved dramatically. Fortunately, the industry settled on IP/Ethernet as the de facto network standard early on, displacing niche technologies like InfiniBand and Fibre Channel. The primary reason was that IP/Ethernet was correctly seen as the only available technology capable of meeting the scale requirements of data centers. Aside from the adoption of Clos networks as the connection topology, incremental improvements to TCP, and significant hype around “software-defined networks,” the interconnection aspect of data centers has seen little innovation.

Consequently, data centers across the board face severe challenges: hyper-scalers have huge power bills, need dozens of server SKUs to cover the range of computations they must support, and encounter security and reliability problems that are increasingly difficult to resolve. At the other end of the scale, Enterprise data centers have abysmally low utilization (below 8%), cannot deploy new applications quickly, and also face a multitude of security and reliability challenges.

It is in this context that Fungible asked a few key questions at its inception: was there a new elemental building block that would (a) dramatically improve scale-out data centers along relevant dimensions; (b) be used pervasively enough to justify building silicon; and (c) facilitate the deployment of infrastructure as code? The answer to these questions was a resounding yes, and it led to the invention of the Fungible Data Processing Unit (DPU). The functions implemented in the DPU address the top three problems in data centers:

  • Data-centric computations are ubiquitous but are performed poorly by general-purpose CPUs because they are fundamentally at odds with the techniques used to provide high performance for user applications. Examples of data-centric computations are the network stack, the storage stack, the virtualization stack, and the security stack.
  • Interactions between servers in a data center are highly inefficient. Network utilization is low (typically less than 30%); latency and jitter are unacceptably high; and CPUs waste a significant fraction of their resources in pushing packets. Data centers face a difficult choice: either they have to pay for inefficient interactions if they want resource pooling, or they have to give up on resource pooling entirely.
  • The configuration, control, and telemetry interfaces to servers are inconsistent and ad-hoc (this is true for Enterprise, less so for hyper-scale). This results in major complexity for data center operators and is a significant barrier to agility, reliability, and security.

The Fungible DPU provides comprehensive, clean-sheet solutions to all three problems: It performs data-centric computations at least 20x more efficiently than general-purpose CPUs, enabling these computations to be offloaded from CPUs and freeing those CPUs to run applications more efficiently. It implements TrueFabric on top of standard IP/Ethernet to enable the efficient disaggregation and pooling of virtually all data center resources. And it provides standardized APIs to orchestration software for configuration, control, and telemetry. The Fungible DPU also significantly improves the reliability and security of data centers. Finally, and perhaps most significantly, the DPU is fully programmable in a high-level language to provide agility.
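
As a purely illustrative sketch of what standardized configuration, control, and telemetry APIs could look like to orchestration software, the Python fragment below polls a DPU for telemetry and requests the attachment of a disaggregated volume over a REST-style interface. The endpoint paths, field names, and payloads here are assumptions made for this example, not the actual Fungible DPU API.

    # Hypothetical orchestration client for a DPU-powered server.
    # All endpoints and payload fields below are illustrative assumptions.
    import requests

    DPU_ENDPOINT = "https://dpu-0.rack-12.example.net/api/v1"  # hypothetical address

    def read_telemetry() -> dict:
        """Fetch per-DPU counters (e.g., fabric throughput, storage IOPS)."""
        resp = requests.get(f"{DPU_ENDPOINT}/telemetry", timeout=5)
        resp.raise_for_status()
        return resp.json()

    def attach_volume(server_id: str, volume_id: str) -> None:
        """Ask the DPU to attach a disaggregated NVMe volume to a bare-metal server."""
        resp = requests.post(
            f"{DPU_ENDPOINT}/storage/attachments",
            json={"server": server_id, "volume": volume_id},
            timeout=5,
        )
        resp.raise_for_status()

    if __name__ == "__main__":
        counters = read_telemetry()
        print("fabric utilization:", counters.get("fabric_utilization"))
        attach_volume(server_id="srv-42", volume_id="vol-7f3a")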

To complement the resource disaggregation enabled by the Fungible DPU, we have developed the Fungible Data Center Composer, which assembles complete virtualized “bare-metal” data centers in minutes, starting from an underlay infrastructure of DPU-powered servers connected by a standard high-performance IP/Ethernet network. We define a Fungible Data Center as a data center built from three complementary pieces:

  • A standard high-performance IP/Ethernet network
  • DPU-powered server instances chosen from a small set of server types
  • A logically centralized Data Center Composer implemented on standard x86 servers

The Composer makes full use of the configuration, control, and telemetry interfaces supported by the DPU. It treats an entire virtualized data center as code: executing this code creates a new instance of the virtual data center it describes. Treating a virtual data center this way yields a highly agile, reliable process for creating a ready-to-use data center, time and time again; it also permits templatization of commonly occurring patterns, so that creating a new data center similar to a previous one is straightforward. In fact, the lessons learned from developing agile software at scale apply directly to delivering infrastructure as a service. Finally, the ability to create and destroy virtual data centers in minutes opens the door to maximizing the utilization of the underlying resources by multiplexing them across multiple tenants.
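
The following is a minimal sketch, assuming a hypothetical Python SDK, of what “data center as code” might look like: the entire virtual data center is captured as a declarative description that the Composer could instantiate and later re-apply as a template for a similar tenant. The class names, fields, and the notion of an apply step are illustrative assumptions, not the Composer's actual interface.

    # A virtual data center described as data ("data center as code").
    # Class names and fields are hypothetical; shown only to illustrate the idea.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class ServerPool:
        server_type: str          # one of a small set of DPU-powered server types
        count: int

    @dataclass
    class Volume:
        name: str
        size_gib: int
        replicas: int = 2         # durability policy enforced by the DPUs

    @dataclass
    class VirtualDataCenter:
        tenant: str
        pools: List[ServerPool] = field(default_factory=list)
        volumes: List[Volume] = field(default_factory=list)

    # Executing this description (e.g., via a hypothetical composer.apply(vdc))
    # would instantiate the virtual data center on the shared underlay; the same
    # template can be re-applied, with edits, for the next tenant.
    vdc = VirtualDataCenter(
        tenant="analytics-team",
        pools=[ServerPool("compute-dense", 16), ServerPool("storage-dense", 4)],
        volumes=[Volume("scratch", size_gib=4096), Volume("results", size_gib=1024, replicas=3)],
    )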

Fungible Data Centers break the traditional tradeoff imposed by current architectures, allowing us to improve both performance and agility for data centers small and large. They will deliver on the promise of composable infrastructure by providing unprecedented levels of infrastructure efficiency combined with high performance across a broad range of applications. Specifically, Fungible Data Centers will improve the economics, both CAPEX and OPEX, of hyper-scale data centers by over 2x and that of Enterprise data centers by over 8x. Finally, they will also measurably improve the security and reliability of data centers.