The Fungible DPU: Putting the FAST in the Datapath

The Fungible DPU Enables High-Performance Single Server Nodes

In a perfect world, every application running within a data center should be able to take advantage of every storage or compute resource at its maximum possible performance and utilization.

In the real world, however, bottlenecks abound. While the performance of individual components like SSDs and GPUs is improving rapidly, applications are unable to take full advantage of these improvements because I/O-intensive workloads stall at the CPU, stranding these expensive resources.

The Fungible Data Processing Unit (DPU) brings us closer to the ideal state of data centers by executing data-centric computations an order of magnitude more efficiently than CPUs.

In this post, we’ll explore how the Fungible DPU improves single-node performance for a wide range of use cases.

Superior Performance

Whether handling virtualized or bare metal workloads, security services, storage datapaths, networking datapaths, or layer 4 to layer 7 services, the Fungible DPU improves single-node performance substantially compared to existing solutions.



For general-purpose, warehouse-scale application workloads, the Fungible DPU improves overall performance by more than 2X. It achieves this by offloading more than 50% of data-centric computations from the CPU, freeing the CPU to focus exclusively on the demands of the application.
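One way to see how those two figures could relate is a back-of-the-envelope calculation. The sketch below is not a Fungible formula. It assumes the CPU is the bottleneck, that offloaded work costs the CPU nothing afterwards, and that application throughput scales with the CPU cycles left for it. The function name app_speedup is hypothetical.

```python
def app_speedup(offloaded_fraction):
    """Speedup of application work when a fraction of CPU cycles is offloaded.

    Assumption: before offload, the CPU spends `offloaded_fraction` of its
    cycles on data-centric work and the rest on the application.
    """
    if not 0.0 <= offloaded_fraction < 1.0:
        raise ValueError("offloaded_fraction must be in [0, 1)")
    # The application previously got (1 - f) of the cycles; now it gets all.
    return 1.0 / (1.0 - offloaded_fraction)

# Offloading half of the CPU's cycles doubles what is left for the application.
speedup_half = app_speedup(0.5)
print(speedup_half)
```

Under these assumptions, any offloaded fraction above one half yields more than 2X.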

Clean Slate Architecture

The reasons behind these huge performance gains lie in the Fungible DPU’s distinctive design and innovative architecture.

The Fungible DPU comprises a large number of multi-threaded general-purpose cores tightly coupled with carefully selected hardware accelerators. Data processed at the DPU seamlessly traverses the embedded cores and the hardware accelerators, which are optimized for the highest performance and highest flexibility.

The Fungible DPU’s hardware accelerators are implemented as programmable primitives, making them suitable for a wide spectrum of data transformations and computations, including compression and deduplication, security in motion and at rest, data durability, filtering, data integrity, and data movement.

The hardware accelerators offer line rate throughput without compromising other key attributes. For example, the Fungible DPU:

  • Offers standards-based compression with a compression ratio 20% higher than gzip, while delivering throughput more than two orders of magnitude higher than software-only algorithms running on high-end CPUs.
  • Provides line rate filtering while matching a large number of complex PCRE expressions deep in the payload.
  • Supports network-based inline erasure coding at line rate without any impact on read/write latency. It offers the same level of protection as replication, but with lower storage overhead. Additionally, it lets you choose the level of data durability based on the needs of each application.
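To make the storage overhead comparison concrete, here is a minimal sketch of the arithmetic. The 4+2 layout and the three-copy replication are illustrative values, not Fungible's configuration, and the function names are hypothetical. The sketch assumes an erasure code in which any k of the k+m fragments are enough to rebuild the data, as with Reed-Solomon codes.

```python
def replication_overhead(copies):
    """Raw bytes stored per byte of user data when keeping full copies."""
    return float(copies)

def erasure_overhead(k, m):
    """Raw bytes stored per user byte with k data and m parity fragments."""
    return (k + m) / k

def losses_tolerated_replication(copies):
    # Data survives as long as one copy remains.
    return copies - 1

def losses_tolerated_erasure(m):
    # Any k of the k + m fragments suffice, so up to m may be lost.
    return m

# Both layouts below survive the loss of any two devices.
three_way = replication_overhead(3)
ec_4_2 = erasure_overhead(4, 2)
print(three_way, ec_4_2)
```

Raising m relative to k buys more durability at the cost of more raw capacity, which is the kind of per-application choice the last bullet describes.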

Efficient Software<->Hardware Interface

Featuring a custom-designed datapath operating system (FunOS), the Fungible DPU natively enables an asynchronous programming model that uses CPU cycles more efficiently. At the same time, it retains the ease of synchronous programming models.

In a traditional synchronous approach, the software invokes a hardware accelerator and then waits for the job to complete before it does anything else.

The Fungible DPU’s native programming model implements a generalization of the procedure call. The software uses a procedure call to invoke a job. The job can be run on embedded processors or on hardware accelerators. Instead of waiting for the job to complete, the software context-switches to another job, and the procedure call continues on another core, enabling a truly asynchronous programming model.
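The difference between waiting on a job and handing it off can be sketched in a few lines. This is a single-threaded toy, not FunOS code: the Scheduler class, the job names, and the use of zlib as a stand-in for an accelerator are all assumptions made for illustration, and the sketch does not model multiple cores.

```python
import zlib
from collections import deque
from functools import partial

class Scheduler:
    """Toy work scheduler: a FIFO of pending (job, args) work items."""

    def __init__(self):
        self.pending = deque()
        self.log = []

    def call(self, job, *args):
        # Generalized procedure call: queue the job and return at once.
        self.pending.append((job, args))

    def run(self):
        # Dispatch queued jobs until none remain.
        while self.pending:
            job, args = self.pending.popleft()
            job(self, *args)

def compress_job(sched, payload, continuation):
    # Stand-in for an accelerator; zlib is used only so the sketch runs.
    sched.call(continuation, zlib.compress(payload))

def finish_packet(sched, compressed, packet_id):
    sched.log.append(f"done {packet_id}")

def handle_packet(sched, packet_id, payload):
    sched.log.append(f"start {packet_id}")
    # Hand off the job and name what should run once it completes.
    on_done = partial(finish_packet, packet_id=packet_id)
    sched.call(compress_job, payload, on_done)
    # Control returns here without waiting for the compressed bytes.

def handle_packet_sync(log, packet_id, payload):
    log.append(f"start {packet_id}")
    zlib.compress(payload)  # the caller waits here until the job completes
    log.append(f"done {packet_id}")

# Synchronous: each packet finishes before the next one starts.
sync_log = []
for pid in (1, 2):
    handle_packet_sync(sync_log, pid, b"a" * 64)

# Asynchronous: packet 2 starts while packet 1's job is still outstanding.
sched = Scheduler()
for pid in (1, 2):
    sched.call(handle_packet, pid, b"a" * 64)
sched.run()
async_log = sched.log
print(sync_log)
print(async_log)
```

In the asynchronous log, the second packet starts before the first one is done, because the handler returns as soon as it has queued the job instead of holding the processor while it waits.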

Other Unique Architectural Highlights

The Fungible DPU includes a number of architectural enhancements that provide the right balance of performance and programmability. These include:

  • Flexible work scheduler: A hardware-based, scalable and programmable scheduler removes the burden of work scheduling from software, preserving precious CPU cycles. The scheduler provides lockless datapath processing, enabling the DPU to handle hundreds of simultaneous work requests at very high speeds.
  • Lightweight context switching: The Fungible DPU provides a mechanism for lightweight software context switching while keeping a high rate of instructions per cycle (IPC).
  • Cache/memory hierarchy suitable for datapath: Data-centric computations have two kinds of memory accesses interleaved together: one exhibits both spatial and temporal locality while the other has only spatial locality. The latter, if not handled carefully, can pollute caches, which in turn degrades overall performance. The Fungible DPU resolves this conflict at its root, resulting in performance that is consistent and predictable.
  • An enhanced internal fabric: With a very high-throughput, low-latency network on chip, the Fungible DPU provides uniform memory access for all resources, including cores and accelerators.
  • Programmable in high-level languages: The Fungible DPU is programmable in high-level languages like ANSI C using unmodified standard GNU or LLVM toolchains, making it suitable for a variety of use cases.
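The cache pollution problem mentioned in the list above can be shown with a small simulation. The sketch models a generic LRU cache, not the Fungible DPU's memory hierarchy, and the trace shape, cache size, and function names are assumptions. Bypassing the cache for streaming data is used here only as one well-known mitigation. The source does not say which mechanism the DPU uses.

```python
from collections import OrderedDict

def hot_hit_rate(trace, capacity, bypass_streaming):
    """Hit rate of the reusable ("hot") accesses in an LRU cache."""
    cache = OrderedDict()
    hits = hot_accesses = 0
    for kind, addr in trace:
        if kind == "stream" and bypass_streaming:
            continue  # streaming data is never allocated in the cache
        if kind == "hot":
            hot_accesses += 1
        if addr in cache:
            cache.move_to_end(addr)  # mark as most recently used
            if kind == "hot":
                hits += 1
        else:
            cache[addr] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used
    return hits / hot_accesses

def build_trace(rounds, hot_lines, streams_per_hot):
    """Interleave a small reused working set with never-reused stream data."""
    trace, next_stream = [], 0
    for _round in range(rounds):
        for hot in range(hot_lines):
            trace.append(("hot", ("hot", hot)))
            for _s in range(streams_per_hot):
                trace.append(("stream", ("stream", next_stream)))
                next_stream += 1
    return trace

trace = build_trace(rounds=100, hot_lines=4, streams_per_hot=2)
polluted = hot_hit_rate(trace, capacity=8, bypass_streaming=False)
protected = hot_hit_rate(trace, capacity=8, bypass_streaming=True)
print(polluted, protected)
```

Although the four hot lines fit comfortably in an eight-line cache, the interleaved streaming data evicts them before they are reused. Keeping streaming data out of the cache restores the hit rate of the working set.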

Key Benefits of the Fungible DPU

  • Deliver robust performance: Traditional approaches that are hardwired for a specific use case suffer from sharp performance drops for two reasons. The first reason is that slight variations in the datapath can shift processing from the hardware to embedded cores. The second reason is that the working data set does not fit into on-chip SRAM or caches. By contrast, the Fungible DPU’s advanced architecture eliminates these pitfalls and provides predictable and reliable performance.
  • Support multiple functions at line rate performance: Alternative approaches can only provide limited functions without impacting performance. The Fungible DPU can apply multiple data-centric services simultaneously for every packet without compromising throughput, latency, or quality.
  • Democratize premium services: Now for the first time, compute-intensive services like network-based erasure coding, in-line compression, security in motion and at rest, and more can be realized without trade-offs or compromises.

An Important Building Block For Hyperdisaggregated Infrastructure

At Fungible, armed with a deep understanding of, and empathy for, the challenges scale-out data centers are facing, we recognized that a fundamentally new approach was needed. That’s why we designed the Fungible DPU using a clean-sheet approach based on first principles.

In this post we explored how the Fungible DPU’s versatile performance capabilities and distinctive technologies deliver accelerated computing in a single node. These highly efficient nodes become important building blocks for scale-out hyperdisaggregated architectures.

Our next post will delve into how Fungible resolves another fundamental problem in the data center: the need to scale to thousands of devices without compromising performance or efficiency.
