
DxPU: Large-scale Disaggregated GPU Pools in the Datacenter

Published: 14 December 2023

Abstract

    The rapid adoption of AI and the convenience offered by cloud services have resulted in growing demand for GPUs in the cloud. Generally, GPUs are physically attached to host servers as PCIe devices. However, the fixed assembly combination of host servers and GPUs is extremely inefficient in resource utilization, upgrade, and maintenance. Due to these issues, the GPU disaggregation technique has been proposed to decouple GPUs from host servers. It aggregates GPUs into a pool and allocates GPU node(s) according to user demands. However, existing GPU disaggregation systems have flaws in software-hardware compatibility, disaggregation scope, and capacity.
    In this article, we present a new implementation of datacenter-scale GPU disaggregation, named DxPU. DxPU efficiently solves the above problems and can flexibly allocate as many GPU node(s) as users demand. To understand the performance overhead incurred by DxPU, we build up a performance model for AI-specific workloads. With the guidance of the modeling results, we develop a prototype system, which has been deployed into the datacenter of a leading cloud provider for a test run. We also conduct detailed experiments to evaluate the performance overhead caused by our system. The results show that the overhead of DxPU is less than 10% in most user scenarios, compared with native GPU servers.

    1 Introduction

    The great success of Artificial Intelligence (AI) has boosted demands for Graphics Processing Units (GPUs). In pursuit of scalability, elasticity, and cost efficiency, users tend to rent GPUs in the cloud to accelerate their AI workloads. Hence, cloud providers leverage datacenter-scale GPU clusters to satisfy different user requirements.
    For GPU clusters in the cloud, GPUs are attached to the host server as PCIe devices with fixed proportions (e.g., 2, 4, 6, or 8 GPUs). With the help of I/O virtualization [39], CPUs and GPUs can be split and allocated to several virtual machines (VMs) on the same host server. However, the virtualization techniques fail to support flexible GPU provisioning between host servers. For instance, even if one or more GPUs on one host server are free, they cannot be used by VMs on a different host server. In this server-centric architecture [33, 36], the fixed coupling of host servers and GPUs brings the following practical limitations for cloud providers:
    CPU and GPU fragments are common in the datacenter. We collect the distribution of GPU instances with different configurations for NVIDIA V100 and T4 in the datacenter of a leading cloud provider and plot it in Figure 1. Obviously, in the cloud, user requests are diverse in resource requirements. Considering the diverse and dynamically changing resource requirements of different workloads, it is difficult to statically equip each host server with the right balance of hardware resources. As a result, either CPU or GPU fragments arise in server clusters, bringing additional cost to cloud providers.
    Fig. 1.
    Fig. 1. Distribution of GPU instances with different hardware configurations in the cloud. User requests are diverse in hardware requirements.
    The cost of hardware maintenance and upgrades remains high. On the host server, any single point of failure makes all hardware resources unavailable. Also, different hardware (e.g., CPU, GPU, and memory) exhibits distinct upgrade trends [36]. Under the fixed coupling scheme, the upgrade of any hardware requires shutting down the server and influences the availability of other resources. As a result, it takes excessive effort to keep the cloud infrastructure both available and up-to-date [30].
    To address these limitations, the GPU disaggregation [38] technique has been proposed, which aims to physically decouple GPUs from host servers and group them into pools. As a result, GPU workloads from applications on the host server are forwarded to the allocated node(s) in the pool, referred to as disaggregated GPUs. With such a design, GPU provisioning is no longer restricted to the same host server. Confronted with varying requests, cloud providers can dynamically allocate or reclaim GPUs in the pool without worrying about resource fragments. Meanwhile, CPUs and GPUs can be managed, scaled, and upgraded independently, reducing infrastructure expenses.
    Throughout the software and hardware stack, GPU disaggregation can be implemented at the user-mode library, kernel-mode driver, and PCIe level. Yet, systems at the library and driver level have three inherent limitations. First, the applicable software and hardware are restricted. For example, CUDA-based disaggregation techniques [27, 34] cannot be used for applications built on OpenGL. Second, it is a continuous effort for developers to keep the wrapper libraries and drivers up-to-date. Last, reducing the communication overhead and optimizing the performance are troublesome and time-consuming. By contrast, exploring forwarding strategies at the PCIe level inherently brings the benefits of full-scenario support, software transparency, and good performance.
    However, existing solutions at the PCIe level fall short of the expected disaggregation scope and capacity. Specifically, they are all built on the PCIe fabric [17, 38, 42], where servers and switches are connected via PCIe cables. The scope of disaggregation is limited to a very small number of racks, failing to reach the datacenter scale. What is more, to reduce the bit error rate and enable high-speed peer-to-peer communications, the PCIe fabric can only accommodate dozens of GPUs, which is far from enough.
    To solve the above problems, we propose a new design of GPU disaggregation at the PCIe level, named DxPU. In DxPU, a proxy (DxPU_PROXY) is added on both the host server and the GPU side. The proxies are interconnected via a network fabric and act as the medium for the interaction between the host server and GPUs. Specifically, DxPU_PROXY is responsible for the conversion between PCIe TLPs and network packets. From the host server’s perspective, GPUs are still accessed locally over PCIe links. Thus, DxPU is transparent to the user and supports GPU virtualization techniques. Moreover, GPUs are dynamically selected and hot-plugged into the host server, meaning they are dedicated to the host server during use. Compared with the PCIe fabric-based method, this allows the management of GPU pools on the datacenter scale. In summary, DxPU takes the first step towards exploring datacenter-scale GPU disaggregation at the PCIe level.
    To turn such a new design into a ready-to-deploy product in the cloud, the following question needs to be answered before a prototype is implemented: what is the performance impact of DxPU on GPU workloads in the cloud? To this end, we introduce a performance model in software. Specifically, the performance overhead introduced by DxPU mainly comes from the longer latency of the interaction between the host server and GPUs. We choose the AI scenario, the dominant consumer of GPUs in the cloud, and build up a performance model via API hooking to estimate the performance impact of DxPU. With this model, we can demonstrate the performance impact of DxPU on GPU workloads in the cloud.
    Guided by the performance model, we build an implementation system of DxPU. Our evaluation in basic, AI, and graphics rendering scenarios illustrates that the performance of DxPU relative to the native GPU server is better than 90% in most cases, which satisfies cloud providers’ requirements. What is more, the pool can hold up to 512 GPU nodes. Besides, we point out the major factors in the current software stack that undermine performance and explore the optimization space of DxPU from the view of software-hardware co-design. Last but not least, we believe that DxPU can be extended to realize disaggregation of other hardware resources in the cloud, which is instructive for both academia and industry.
    In summary, contributions of our study are as follows:
    We propose a new design of datacenter-scale GPU disaggregation system at the PCIe level (Section 3), named DxPU. It is excellent in full-scenario support, software transparency, datacenter-scale disaggregation, large capacity, and good performance.
    We build up a performance model to estimate the performance impact of DxPU for GPU workloads in the cloud (Section 3.4) and verify its correctness with the implementation of DxPU (Section 3.5).
    We make detailed evaluations of DxPU with multiple benchmarks to demonstrate its practicality. The performance of DxPU relative to the native one is better than 90% in most cases, which satisfies cloud providers’ requirements (Section 4). Moreover, we present insights to optimize DxPU from the software-hardware co-design perspective (Section 5).

    2 Background and Motivation

    2.1 Resource Disaggregation

    Traditionally, all hardware resources (e.g., CPU, GPU, and memory) are tightly coupled with host servers. However, as different hardware exhibits diverse iteration cycles, this server-centric architecture makes it hard to upgrade hardware resources independently. Recently, a paradigm shift towards resource disaggregation has emerged in the datacenter [3, 31, 33]. By decoupling these resources and grouping them into corresponding pools, it is easier to maintain and upgrade them and to deploy cutting-edge hardware technologies in time. For example, based on the abstraction of a Transport layer, NVMe over Fabrics (NVMe-oF) [9] enables storage operations over various kinds of interconnects (e.g., Fibre Channel, RDMA, or TCP). With the help of NVMe-oF, the storage resource can be maintained and upgraded independently.

    2.2 GPU Disaggregation

    As mentioned above, the tight coupling of host servers and GPUs causes resource fragments and restricts the independent maintenance and upgrade of hardware resources. To overcome these problems, GPU disaggregation is proposed. Specifically, GPU workloads on the host server are encapsulated and redirected to disaggregated GPUs. For example, DGSF [35] makes use of API remoting to realize GPU disaggregation for the serverless environment and supports live migration at the same time. In general, GPU disaggregation can be implemented at the software level (user-mode library, kernel-mode driver) and the hardware level (PCIe level). As the software solutions are based on API hooking, the applicable scenarios and GPU performance are greatly affected. By contrast, exploring forwarding strategies at the PCIe level overcomes these problems easily. However, existing solutions at the PCIe level fall short of the expected disaggregation scope and capacity. In the following paragraphs, we first explain why we implement GPU disaggregation at the PCIe level (Section 2.2.1). Then, we summarize why existing solutions at the PCIe level fall short (Section 2.2.2), which motivates the design of DxPU. Table 1 comprehensively compares existing GPU disaggregation systems.
    Table 1.
    Implementation Level | Representatives | Software-hardware Compatibility | Scope of Disaggregation | Capacity of the GPU Pool | Overall Performance \(^{3}\)
    User-mode Library | rCUDA [34], bitfusion [27] | Limited | Datacenter-scale \(^{1}\) | Large | 80%–95%
    Kernel-mode Driver | cGPU [7] | Limited | Datacenter-scale | Large | 80%–99%
    PCIe | Liqid AI Platform [17] | Unlimited | Rack-scale \(^{2}\) | Small | 99%–100%
    PCIe | DxPU (our solution) | Unlimited | Datacenter-scale | Large | mostly 90%–99%
    Table 1. Comparison of Existing GPU Disaggregation Systems at Different Levels
    \(^{1}\) Datacenter-scale means GPU nodes can be allocated to host servers from anywhere in the datacenter. \(^{2}\) Rack-scale means host servers and disaggregated GPUs must reside in a very small number of racks. \(^{3}\) Performance denotes GPU compute capability in the specific implementation over the native one.

    2.2.1 Limitations of Solutions at the Software Level.

    As NVMe [8] specifies how software interacts with non-volatile memory, it is not difficult to extract a Transport Layer that permits communication between host servers and storage devices over diverse interconnects. However, as there is no uniform specification of the GPU architecture and driver, it is impossible to build such a Transport Layer for GPU disaggregation. Hence, we can only analyze the invoking sequences throughout the software and hardware stack to determine the implementation levels. Yet, GPU disaggregation systems at the library and driver level have their own limitations that cannot be bypassed.
    Limited Application Scenarios. For example, rCUDA [34] and VMware bitfusion [27] implement GPU disaggregation by replacing the CUDA library with a wrapper, which confines their applicability. In other words, they cannot be used for other applications (e.g., OpenGL-based applications). In addition, as CUDA is developed by NVIDIA, users cannot utilize AMD or Intel GPUs. In contrast, GPU disaggregation implemented at the driver level can fulfill user requests for general-purpose computing and graphics rendering (e.g., Alibaba cGPU [7]). However, as GPU drivers for different models are distinct, not all of them are suitable for GPU workload redirecting, meaning GPU models are still restricted. By comparison, building GPU disaggregation systems at the PCIe level easily circumvents these compatibility restrictions.
    Sensitivity to Frequent Software Updates. To improve the products, software development companies tend to release a new version every few months. Obviously, it is a continuous task to keep the corresponding wrapper software up-to-date all the time.
    Excessive Efforts in Performance Optimization. During workload forwarding, parameter encapsulation, transfer, and extraction introduce considerable latency. Encapsulation and extraction at the software level, in particular, cost much more time than those at the PCIe level. Developers therefore have to invest heavily in communication optimization. In contrast, the performance of DxPU is better than 90% in most cases without any optimization, which is a huge advantage.
    By comparison, exploring forwarding strategies at the PCIe level, which touches low-level hardware, can overcome these limitations easily.

    2.2.2 Existing Solutions at the PCIe Level Fall Short.

    Existing solutions of GPU disaggregation at the PCIe level utilize the PCIe fabric as their interconnect [2, 16, 17, 42]. Specifically, the PCIe fabric is composed of several specialized PCIe switches. Meanwhile, host servers and GPUs are directly connected to the fabric via PCIe cables. Unlike common PCIe switches in the PCIe specification [22], the specialized ones need to isolate different host servers’ PCIe domains and map GPU node(s) to their memory space [38].
    Although GPU disaggregation on the basis of the PCIe fabric achieves almost native performance, it suffers from two problems that cannot be solved easily.
    Limited Scope of Disaggregation. In the PCIe fabric, switches and servers are connected through PCIe cables. As the length of PCIe cables is limited, the scope of disaggregation is restricted to a very small number of racks [36]. Host servers and GPUs are still coupled with each other on the rack scale. In other words, it lacks the flexibility to manage, scale, and upgrade CPUs or GPUs independently. As a result, it fails to support GPU disaggregation on the datacenter scale.
    Limited Capacity of the GPU Pool. To enable high-speed communications between different devices, the tolerable error rate in PCIe links is extremely low. Statistically speaking, the bit error rate (BER) must be less than \({10^{-12}}\) [1, 4, 5]. However, if the scale of the PCIe fabric is too large, then the BER will exceed this limit, and fault location and isolation will become particularly difficult. Hence, although delicate circuit design can decrease the BER, there still exists an upper limit on the number of disaggregated GPUs in the PCIe fabric. For example, in the Liqid Powered GPU Composable AI Platform [17], the maximum number of accommodable GPUs is 24. Confronted with datacenter-scale user requests, the capacity of the PCIe fabric is far from enough. Setting up too many GPU pools will lead to the resource fragmentation problems mentioned in Section 1 again.
    Compared with solutions mentioned above, DxPU is excellent in full-scenario support, software transparency, datacenter-scale disaggregation, large capacity, and good performance.

    3 Design

    To begin with, we present our design goals for DxPU.
    G1: Scope. The scope of disaggregation should not be restricted to racks. Instead, in a datacenter-scale GPU disaggregation system, GPU nodes can be allocated to host servers from anywhere in the datacenter.
    G2: Capacity. The fabric should accommodate at least hundreds of GPU nodes (e.g., 128, 256, or 512).
    G3: Performance. DxPU should not incur high performance overhead.
    In the following paragraphs, we first introduce the architecture of DxPU briefly (Section 3.1). Next, we describe the design details of GPU Boxes and DxPU_PROXYs (Sections 3.2 and 3.3). Then, we build up a performance model to estimate the effects of network latency on GPU compute capability in DxPU (Section 3.4), which guides the implementation of DxPU. Last, with the assistance of the performance model, we build the implementation system of DxPU. The performance of DxPU confirms the accuracy of our model in turn (Section 3.5).

    3.1 Architecture

    Figure 2(a) shows the overall architecture of DxPU. Compared with the non-disaggregated architecture, which attaches GPUs to the host server via PCIe endpoints, GPUs are physically decoupled from host servers and put into separate GPU Boxes. Correspondingly, each GPU Box contains several GPUs. To enable remote access to GPUs from the host server, a proxy, named DxPU_PROXY, is added on both the server and the GPU side. It acts as the medium for the interactions between the host server and GPUs. The DxPU_PROXY is responsible for the conversion between PCIe TLPs and network packets. On the one hand, DxPU_PROXY provides PCIe interfaces to connect with the host server or GPUs. On the other hand, DxPU_PROXYs are interconnected via a network fabric protocol. Specifically, when receiving PCIe TLP packets, DxPU_PROXY encapsulates them into network packets and sends them out. Correspondingly, on the other side, DxPU_PROXY extracts the original TLP packets from the received network packets. Therefore, the conversion between PCIe TLPs and network packets is transparent to both the host server and GPUs. From the perspective of the host server, GPUs are still attached locally via the PCIe link. The combination of DxPU_PROXYs on both sides works as a PCIe switch (Figure 2(b)). Moreover, DxPU_PROXY provides configuration interfaces to enable the dynamic distribution of GPUs by the cloud orchestration system.
    Fig. 2.
    Fig. 2. Physical and logical architecture of DxPU.
    The key observation that motivates the adoption of the network fabric is that nowadays network cables can provide the same (or even faster) data transmission speed compared with PCIe links. For example, a PCIe Gen 3 interface offers about 7.87 Gbps per lane at the physical layer [43], and the PCIe interface provided by a GPU has no more than 16 lanes, which can easily be supported with two 100 GbE Ethernet cables. What is more, the adoption of the network fabric perfectly solves the scope and capacity problems mentioned above.
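    As a quick sanity check with the numbers above: \(16 \text{ lanes} \times 7.87 \text{ Gbps/lane} \approx 125.9 \text{ Gbps}\), which is comfortably below the \(2 \times 100 \text{ Gbps} = 200 \text{ Gbps}\) offered by two 100 GbE cables.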
    Supporting Datacenter-scale GPU Disaggregation. In DxPU, servers and GPUs are connected to the network fabric through network cables, whose maximum length can be up to 100 meters [12]. Hence, the scope of GPU disaggregation is no longer limited to the rack scale. GPU nodes can be allocated to host servers from anywhere in the datacenter, enabling fine-grained and efficient GPU provisioning [36].
    Supporting Large-capacity GPU Pools. Driven by the development of network technologies, the number of ports [10, 15] in a datacenter-level network is nowadays large enough for DxPU to hold more GPU nodes. Specifically, compared with the PCIe fabric, data transfer in the network fabric is more reliable thanks to error checking (e.g., receive errors, send errors, and collisions) and packet retransmission. So, the network fabric is more tolerant of transmission errors and more expandable: we can simply cascade switches to build a larger network. With its help, a GPU pool in DxPU can accommodate hundreds of nodes.
    It is to be noted that, in DxPU, GPUs are dedicated to the host server during use. The binding relationships between GPUs and the host server will not be changed unless users free them.

    3.2 GPU Box

    With respect to GPU Boxes, there are two kinds of design schemes. The first one learns from NVIDIA DGX Systems [19], which specialize in GPGPU acceleration. Under this scheme, GPUs and the motherboard containing NVSWITCHes are welded together, leaving PCIe interfaces for DxPU_PROXYs. Thus, NVLINKs can be utilized in DxPU to support more efficient GPU communications. The second one is the simple PCIe switch scheme that we have mentioned above. All kinds of GPUs that connect via PCIe interfaces can be deployed and provisioned in this design. The former design is suitable for users requesting multiple GPUs and high compute capability, where users should be allocated nodes connected by NVLINKs, while the latter is appropriate for single-GPU request scenarios.

    3.3 DxPU_PROXY

    In DxPU, DxPU_PROXY acts as a PCIe virtual switch for the interactions between the host server and GPUs. On the host server and GPU sides, DxPU_PROXYs are inserted into the PCIe slots of the host server and the GPU Box, respectively, and are interconnected via a network fabric protocol. During the boot stage of the host server, the BIOS only enumerates the PCIe virtual switch exposed by DxPU_PROXY, so it reserves enough memory space for the disaggregated GPUs, which will be hot-plugged into the host server later.

    3.3.1 Mapping Tables in DxPU_PROXY.

    As can be seen from Table 2 and Table 3, two types of mapping tables are stored in DxPU_PROXY on the host server and GPU Box side, respectively. These tables record the mapping relationships between host servers and GPUs. Here, we introduce DxPU_MANAGER, which is responsible for GPU provisioning and reclamation in the fabric. During the initialization stage of the network fabric, the node IDs of the DxPU_PROXYs on each host server and GPU Box are allocated automatically by DxPU_MANAGER.
    Table 2.
    Table Entry ID
    Used (whether the bus has been used)
    Bus ID (allocated by OS during enumeration)
    Device ID
    Memory Base (allocated by OS during enumeration)
    Memory Limit (allocated by OS during enumeration)
    GPU Box ID (allocated by DxPU_MANAGER)
    Slot ID (allocated by DxPU_MANAGER)
    Path ID (allocated by DxPU_MANAGER)
    Table 2. Content of Mapping Tables in DxPU_PROXY (on the Host Server Side)
    Table 3.
    Table Entry ID
    Valid (whether the GPU is in place)
    Used (whether the GPU has been used)
    Slot ID
    Host Node ID (allocated by DxPU_MANAGER)
    Path ID (allocated by DxPU_MANAGER)
    Table 3. Content of Mapping Tables in DxPU_PROXY (on the GPU Box Side)
    Mapping Tables on the Host Server Side. From the perspective of the host server, the disaggregated GPU is a PCI device on the PCIe virtual switch exposed by the DxPU_PROXY. So, the mapping table must record the relationship between the PCI device on the host server (Bus ID, Device ID, Memory Base, Memory Limit) and the GPU (GPU Box ID, Slot ID). When a host server or VM in the fabric sends a GPU request to DxPU_MANAGER, DxPU_MANAGER searches for free GPUs and allocates them. On the host server side, a free PCI bus is looked up in the mapping table; the GPU Box ID, Slot ID, and Path ID are written, and the used bit is set. After detecting the new PCI device, the OS of the host server enumerates the PCI bus tree again, allocating memory space and I/O resources for the GPUs. DxPU_PROXY then rewrites the values of Memory Base and Memory Limit in the mapping table based on the configuration TLPs.
    Mapping Tables on the GPU Box Side. Since each GPU is inserted into one slot of a GPU Box, the mapping table on the GPU Box side mainly records the slot ID of each GPU and its host server. During the allocation of GPUs, DxPU_MANAGER searches for a free GPU in the fabric, writes the host node ID and path ID, and then sets the used bit.
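    To make the two tables concrete, the following C-style sketch mirrors the fields listed in Tables 2 and 3. The field widths and struct layout are illustrative assumptions for exposition, not the actual DxPU_PROXY hardware format.

```c
/* Illustrative layout of the DxPU_PROXY mapping tables.  Field widths are
 * assumptions for this sketch; the real hardware format is not specified. */
#include <stdint.h>
#include <stdbool.h>

/* One entry per reserved PCI bus on the host-server side (Table 2). */
struct host_map_entry {
    bool     used;          /* whether the bus has been used            */
    uint8_t  bus_id;        /* allocated by the OS during enumeration   */
    uint8_t  device_id;
    uint64_t mem_base;      /* allocated by the OS during enumeration   */
    uint64_t mem_limit;     /* allocated by the OS during enumeration   */
    uint16_t gpu_box_id;    /* allocated by DxPU_MANAGER                */
    uint8_t  slot_id;       /* allocated by DxPU_MANAGER                */
    uint16_t path_id;       /* allocated by DxPU_MANAGER                */
};

/* One entry per GPU slot on the GPU Box side (Table 3). */
struct box_map_entry {
    bool     valid;         /* whether a GPU is in place                */
    bool     used;          /* whether the GPU has been allocated       */
    uint8_t  slot_id;
    uint16_t host_node_id;  /* allocated by DxPU_MANAGER                */
    uint16_t path_id;       /* allocated by DxPU_MANAGER                */
};
```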

    3.3.2 Formats of Network Packets.

    Since the DxPU_PROXY is responsible for the conversion between PCIe TLPs and network packets, we describe the basic formats of network packets to show the conversion process. Generally, DxPU_PROXY splits the header and data of the original PCIe packets and encapsulates them into network header and data packets, correspondingly.
    Formats of the Header Packets. The content of a network header packet consists of the network route information, the PCIe TLP header, the Cyclic Redundancy Check (CRC) code of the TLP, and the CRC code of the network packet. In addition, the network route information includes source information (source node and slot ID), destination information (destination node and slot ID), and the path ID.
    Formats of the Data Packets. Network data packets mainly contain the original TLP data and the CRC code of the network packet.
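    A possible wire layout consistent with this description is sketched below. The field ordering and sizes are assumptions made here for illustration, since the proprietary protocol format is not published.

```c
/* Sketch of a DxPU network header packet (Section 3.3.2).  Field sizes and
 * ordering are assumptions; the proprietary wire format is not public. */
#include <stdint.h>

struct dxpu_header_packet {
    /* network route information */
    uint16_t src_node_id;
    uint8_t  src_slot_id;
    uint16_t dst_node_id;
    uint8_t  dst_slot_id;
    uint16_t path_id;
    /* original PCIe TLP header (3 or 4 DWs) plus its CRC */
    uint32_t tlp_header[4];
    uint32_t tlp_crc;
    /* CRC protecting the network packet itself */
    uint32_t net_crc;
};
```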

    3.3.3 Routing for Different Types of TLPs.

    On the host server side, DxPU_PROXY needs to consider the routing of different types of PCIe TLPs. For address-based routing transactions (e.g., Memory and I/O Transactions), DxPU_PROXY retrieves the destination GPU Box ID and Slot ID by comparing the address in the TLP with (Memory Base, Memory Limit) in the mapping table. For ID-based routing transactions (e.g., Configuration Transactions), DxPU_PROXY obtains the target tuple via (Bus ID, Device ID). For implicit routing transactions, DxPU_PROXY first identifies the message routing value in the TLPs and categorizes them into address-based routing, ID-based routing, or other local transactions. On the GPU side, as a GPU is dedicated to one host server during use, DxPU_PROXY just encapsulates PCIe packets into network packets directly.
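    The lookup logic described above can be summarized in the following sketch, which reuses the hypothetical struct host_map_entry from the sketch in Section 3.3.1. The helper names are illustrative; the real routing is performed in DxPU_PROXY hardware.

```c
/* Sketch of host-side DxPU_PROXY routing (Section 3.3.3); illustrative only. */
#include <stddef.h>
#include <stdint.h>

enum tlp_route { ROUTE_BY_ADDRESS, ROUTE_BY_ID, ROUTE_IMPLICIT };

struct route_target { uint16_t gpu_box_id; uint8_t slot_id; uint16_t path_id; };

static int route_tlp(const struct host_map_entry *table, size_t n,
                     enum tlp_route kind, uint64_t addr,
                     uint8_t bus_id, uint8_t dev_id,
                     struct route_target *out)
{
    for (size_t i = 0; i < n; i++) {
        const struct host_map_entry *e = &table[i];
        if (!e->used)
            continue;
        int hit = 0;
        if (kind == ROUTE_BY_ADDRESS)      /* Memory and I/O transactions   */
            hit = addr >= e->mem_base && addr <= e->mem_limit;
        else if (kind == ROUTE_BY_ID)      /* Configuration transactions    */
            hit = bus_id == e->bus_id && dev_id == e->device_id;
        if (hit) {
            out->gpu_box_id = e->gpu_box_id;
            out->slot_id    = e->slot_id;
            out->path_id    = e->path_id;
            return 0;
        }
    }
    /* Implicit-routing TLPs are first reclassified as address-based,
     * ID-based, or local transactions, then routed accordingly. */
    return -1;
}
```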

    3.4 Performance Model

    In this section, we start by explaining how network latency incurs performance overhead. To measure this impact and guide the implementation of DxPU, we build up a performance model, which is based on AI workloads. It is to be noted that, here, the performance denotes GPU compute capability in DxPU compared with the native one, which is different from the model prediction accuracy in AI scenarios.
    Despite the comparable data transmission rate with a network cable, DxPU still introduces non-negligible latency in the PCIe TLP transfer between the host server and GPUs. Furthermore, the introduced latency may affect the throughput of the data transfer between them for some PCIe TLP transactions.
    PCIe TLP transactions are categorized into non-posted (e.g., Memory Reads) and posted transactions (e.g., Memory Writes), depending on whether the Requester (i.e., the issuer of the transaction) expects completion TLPs in a transaction. In non-posted transactions, the Requester does not finish the request until it receives one or more completion TLPs from the Completer (i.e., the receiver of the transaction request). Therefore, the Requester needs to trace the status of all in-flight non-posted transactions. To this end, the Requester assigns each non-posted transaction an identifier (named a tag) and is responsible for ensuring the uniqueness of the tag for each in-flight non-posted transaction to distinguish them from each other. When there are enormous data transfers, the number of tags supported, which is device-specific, is generally enough, and new non-posted transactions will not be blocked (Figure 3(a)). However, as shown in Figure 3(b), the latency introduced by DxPU causes the tags to be used up quickly. New transactions are postponed until a free tag is available. As a result, the throughput of data transfer based on non-posted TLP transactions (e.g., DMA reads issued by GPUs, which are implemented via PCIe Memory Reads) decreases significantly.
    Fig. 3.
    Fig. 3. Comparison of PCIe non-posted transaction TLP flows. Assuming that the Requester maintains 5 tags, (a) shows that 5 tags are enough to keep the whole TLP flow working in a pipeline, while (b) shows that new transactions get blocked due to the lack of free tags. This prevents the round-trip latency from being hidden and decreases the data transfer throughput.
    As we have discussed, DxPU brings longer latency for PCIe TLP transfers ❶ and lower throughput for non-posted PCIe TLP transactions ❷. The next question is: what is the performance impact of DxPU on GPU workloads? To answer this question, we build up a performance model via API hooking techniques to simulate the latency introduced by DxPU (❶❷). Moreover, our performance model targets AI workloads, since they cover most GPU application scenarios in the cloud.

    3.4.1 Design of Performance Model.

    We analyze the main functions of AI workloads that happen during interactions between the host server and GPUs, and categorize them into three aspects:
    Memcpy(HtoD). It means the data transfer from host memory to GPU memory. It is executed by GPUs issuing DMA read operations, which generate PCIe Memory Read transaction(s) correspondingly. Memcpy(HtoD) is mainly used to transfer input data for GPU kernels to consume.
    Memcpy(DtoH). It means the data transfer from GPU memory to host memory. It is executed by GPUs issuing DMA write operations. Correspondingly, PCIe Memory Write transaction(s) are generated. Memcpy(DtoH) is mainly used to transfer the GPU kernel’s results back.
    Kernel Execution. GPU kernels are functions executed in GPUs. When a GPU kernel finishes execution, it needs to access host memory, which would generate PCIe Memory Read transaction(s) correspondingly.
    We then summarize how ❶ and ❷ affect the above aspects. For convenience, we introduce some notations. \(RTT_{DxPU}\) represents the round-trip latency of a PCIe memory read transaction with DxPU, while \(RdTP_{DxPU}\) represents the throughput of PCIe memory read transactions with DxPU. Similar definitions apply to \(RTT_{ori}\) and \(RdTP_{ori}\) for the non-disaggregated architecture. Moreover, \(RTT_{delta}\) denotes \(RTT_{DxPU}\) minus \(RTT_{ori}\).
    Memcpy(HtoD). There are two scenarios, since it is based on non-posted transactions. First, if the data being transferred does not consume all tags, then only \(RTT_{delta}\) overhead is introduced ❶. Second, if a large amount of data is being transferred, then the tags are quickly depleted and new transactions get blocked. As a result, the data is transferred at a rate of \(RdTP_{DxPU}\) ❷.
    Memcpy(DtoH). Since it is realized with posted transactions, only ❶ takes effect, and its bandwidth is not affected by ❷. Moreover, considering that no completion TLP is needed, the introduced latency overhead can be estimated to be about 0.5 \(RTT_{delta}\).
    Kernel Execution. Only ❶ takes effect, introducing \(RTT_{delta}\) overhead.
    In fact, the quantitative relationship between \(RdTP_{DxPU}\) and \(RTT_{DxPU}\) can be estimated using the following formula:
    \(\begin{equation} RdTP_{DxPU} \; = \; \#tags \: * \: MRS \: / \: RTT_{DxPU} , \end{equation}\)
    (1)
    where #tags denotes the number of tags supported by a GPU, and MRS represents Max_Read_Request_Size, which determines the maximal data size that the Requester can request in a single PCIe read request.
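    Plugging in the values reported later in Sections 3.4.3 and 3.5.1 (about 140 effective tags, an MRS of 128 bytes, and \(RTT_{DxPU}\) of 6.8 us or 4.9 us), a tiny C program reproduces the throughput estimates used for the verification in Section 3.5.1.

```c
/* Back-of-the-envelope evaluation of Equation (1) with the values reported
 * in Sections 3.4.3 and 3.5.1 (#tags = 140, MRS = 128 B, RTT = 6.8/4.9 us). */
#include <stdio.h>

static double rdtp_gbps(int tags, int mrs_bytes, double rtt_us)
{
    /* bytes in flight divided by the round trip, converted to GB/s */
    return (double)tags * mrs_bytes / (rtt_us * 1e-6) / 1e9;
}

int main(void)
{
    printf("RTT 6.8 us -> %.2f GB/s\n", rdtp_gbps(140, 128, 6.8)); /* ~2.64 */
    printf("RTT 4.9 us -> %.2f GB/s\n", rdtp_gbps(140, 128, 4.9)); /* ~3.66 */
    return 0;
}
```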

    3.4.2 Implementation of Performance Model.

    We implement our performance model by hooking the CUDA Driver API to inject the effects of ❶ and ❷, respectively.
    Memcpy(HtoD). We hook all related CUDA Driver APIs for Memcpy(HtoD), including synchronous (e.g., cuMemcpyHtoD) and asynchronous (e.g., cuMemcpyHtoDAsync) ones. For small data, we insert one more Memcpy(HtoD) operation that copies a fixed-size buffer, where the time spent is \(RTT_{delta}\). For large data transfers, we insert additional Memcpy(HtoD) operations to increase the data copy time by a factor of \(RdTP_{ori}\) / \(RdTP_{DxPU}\). In this way, we can simulate the time effect of ❷.
    Memcpy(DtoH). All related CUDA Driver APIs for (synchronous or asynchronous) Memcpy(DtoH) are hooked, such as cuMemcpyDtoH and cuMemcpyDtoHAsync. In the hook, we insert one more Memcpy(DtoH) operation, which copies a fixed-size buffer whose transfer takes half of \(RTT_{delta}\).
    Kernel Execution. We hook the related CUDA Driver APIs for kernel launch (e.g., cuLaunchKernel) and launch a dummy GPU kernel before the original GPU kernel. The execution time of the dummy GPU kernel is equal to \(RTT_{delta}\). Moreover, since the execution of memset shares similar behavior with kernel execution, we hook memset-related CUDA Driver APIs and insert the dummy GPU kernel there as well.
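    To illustrate the injection logic for Memcpy(HtoD), the simplified sketch below wraps the real driver call and adds delay on the host. It is not the paper's actual hooking harness: the paper inserts extra device copies and dummy kernels rather than host-side busy-waits, and the wrapper function, constants (taken from Sections 3.4.3, 4.2, and Table 7), and threshold are illustrative assumptions.

```c
/* Simplified sketch of the Memcpy(HtoD) latency injection (Section 3.4.2).
 * Assumptions: the hooked_* wrapper replaces direct cuMemcpyHtoD calls, and
 * host-side busy-waiting stands in for the paper's extra device copies.    */
#include <cuda.h>
#include <time.h>

#define RTT_DELTA_US   5.6               /* added round-trip latency (us)   */
#define TAGS           140               /* effective in-flight tags        */
#define MRS            128               /* Max_Read_Request_Size (bytes)   */
#define RDTP_RATIO     (11.2 / 2.7)      /* example RdTP_ori / RdTP_DxPU    */

static void spin_us(double us)           /* crude host-side delay           */
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    do {
        clock_gettime(CLOCK_MONOTONIC, &t1);
    } while ((t1.tv_sec - t0.tv_sec) * 1e6 +
             (t1.tv_nsec - t0.tv_nsec) / 1e3 < us);
}

/* Call sites use this wrapper instead of calling cuMemcpyHtoD directly.    */
CUresult hooked_cuMemcpyHtoD(CUdeviceptr dst, const void *src, size_t n)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    CUresult rc = cuMemcpyHtoD(dst, src, n);        /* real synchronous copy */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double spent_us = (t1.tv_sec - t0.tv_sec) * 1e6 +
                      (t1.tv_nsec - t0.tv_nsec) / 1e3;

    if (n <= (size_t)(TAGS * MRS))
        spin_us(RTT_DELTA_US);                      /* ❶ one extra RTT_delta */
    else
        spin_us(spent_us * (RDTP_RATIO - 1.0));     /* ❷ throughput slowdown */
    return rc;
}
```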

    3.4.3 Results of Performance Model.

    We run our performance model on a server with a single GPU (NVIDIA Tesla V100 SXM2 16GB). ResNet-50 (AI framework: TensorFlow 1.15) is chosen as our target AI workload, considering its representativeness and popularity. In addition, our experiment shows that the number of effective in-flight tags in a GPU is about 140, while the MRS is 128 bytes.
    Figure 4 demonstrates the relationship between \(RTT_{DxPU}\) and the AI workload performance. We can see that the performance of the AI workload decreases as \(RTT_{DxPU}\) increases. Specifically, when \(RTT_{DxPU}\) is about 19 us, the performance decreases to 80%. Meanwhile, to ensure that the performance stays above 90%, \(RTT_{DxPU}\) needs to be within 8 us.
    Fig. 4.
    Fig. 4. Results of Performance Model for ResNet-50. To ensure that the performance is above 90%, \(RTT_{DxPU}\) needs to be within 8 us.

    3.4.4 Limitations of Performance Model.

    Since our performance model is based on the round-trip latency between the host server and GPUs, it cannot be directly applied to communications between GPUs. Additionally, in our cases, there are three types of GPU-to-GPU interactions, which involve NVLinks, PCIe links, and DxPU_PROXYs. Thus, it is difficult to cover all of them and model multi-GPU scenarios accurately. Moreover, model setup via API hooking brings fluctuations and cannot cover all interaction functions between the host server and GPUs. We consider multi-GPU support and more accurate performance modeling (e.g., via a cycle-level GPU simulator such as GPGPU-Sim) as our future work.

    3.5 Implementation of DxPU

    According to the results of the performance model, the training performance of the typical ResNet-50 model can remain above 90% with an \(RTT_{DxPU}\) of about 8 us. This motivates us to further prototype the design of DxPU. There are two straightforward options for building the implementation system of DxPU. The first option is leveraging off-the-shelf PCIe-over-TCP commercial products. However, these products target low-speed devices and thus are not suitable, given the performance requirement mentioned above. The second option is building it by ourselves. Obviously, building such a product to product-ready status imposes a huge development burden and hinders our prototyping process. Instead, we presented the above design to third-party vendors and sought a satisfactory product to act as DxPU_PROXY.
    As a result, we build two customization-level implementation systems of DxPU, whose \(RTT_{DxPU}s\) are 6.8 us and 4.9 us, respectively. According to our performance model, such \(RTT_{DxPU}s\) keep the performance of AI workloads in GPUs above 90%. Our systems implement a proprietary protocol for the conversion between PCIe TLPs and network packets, which is neither PCIe over TCP nor RoCE (RDMA over Converged Ethernet). On the host server and GPU sides, DxPU_PROXYs are inserted into the PCIe slots of the host server and the GPU Box, and DxPU_PROXYs are interconnected via the protocol. The PCIe interface for a GPU is PCIe Gen 3 \(\times\) 16, while two 100 GbE network cables are used for each DxPU_PROXY to interconnect. Note that the interconnection between DxPU_PROXYs is not limited to network cables in the design. For example, a different implementation could build the network fabric based on InfiniBand RDMA technology. Furthermore, the implementation systems demonstrate the practicality of DxPU and verify the accuracy of our performance model.

    3.5.1 Verification of Our Performance Model.

    We evaluate the throughput of our implementation systems with the bandwidthTest tool provided as part of the CUDA code samples; the measured throughputs turn out to be 2.7 GB/s (for 6.8 us) and 3.9 GB/s (for 4.9 us). According to Equation (1), the estimated \(RdTP_{DxPU}s\) are 2.64 GB/s and 3.66 GB/s, respectively. The accuracy of Equation (1) is thus verified.
    Accuracy of Performance Model. As shown in Table 4, the training performance of ResNet-50 under our implementation systems is about 89.56% (for 6.8 us) and 91.5% (for 4.9 us). Correspondingly, the estimated results of our performance model are 91.4% (for 6.8 us) and 92.56% (for 4.9 us). The accuracy of our performance model is thus verified. Moreover, we can see that our estimated results are a little higher than the measured ones. We think this is because our performance model only targets the main interaction functions, while other interaction functions, such as event synchronization, are ignored.
    Table 4.
    \(\mathbf {RTT_{\mathbf {DxPU}}}\) | Result of Performance Model | Performance of Implementation System
    4.9 us | 92.56% | 91.50%
    6.8 us | 91.40% | 89.56%
    Table 4. Performance Model Result Validation in ResNet-50 Training
    The estimated results are close to performance of the implementation system, verifying the correctness of the performance model.

    4 Evaluation

    As we have mentioned above, compared with solutions at the software layer, DxPU is excellent in full-scenario support, software transparency, and good performance. Compared with existing PCIe fabric-based solutions, DxPU supports both datacenter-scale GPU disaggregation and large-capacity GPU pools. Hence, in the evaluation, we mainly analyze the performance overhead incurred by DxPU and seek to answer the following three research questions (RQs):
    RQ1
    What causes the performance overhead in DxPU?
    RQ2
    How do different parameter configurations affect the performance overhead?
    RQ3
    What causes the extra performance overhead in the multi-GPU environment?
    To begin with, we analyze the components of \(RTT_{DxPU}\) and show its impact on the bandwidth between the host server and GPUs (Section 4.2). Then, we conduct detailed experiments to analyze the performance overhead (Section 4.3) with typical user cases in the cloud and present the limitations of DxPU (Section 4.4). The results show that the performance overhead is no more than 10% in most cases and prove the feasibility of DxPU. Note that, although the conversion between PCIe TLPs and network packets is applicable to other PCIe devices, the performance model and implementation systems are currently designed for GPU workloads. We consider the disaggregation of other PCIe devices in this way as future work.

    4.1 Experiment Setup

    We set up a disaggregated GPU pool consisting of 32 NVIDIA GPUs (16 Tesla V100 GPUs in GPU Boxes of the first type and 16 RTX 6000 GPUs in GPU Boxes of the second type), coupled with four host servers. In addition, the \(RTT_{DxPU}\) is 6.8 us. The detailed configuration is shown in Table 5.
    Table 5.
    CPU | Intel Xeon Platinum 8163
    GPU | Basic and AI Workloads: NVIDIA Tesla V100 SXM2 16 GB (supporting NVLINKs)
        | Graphics Rendering Workloads: NVIDIA GeForce RTX 6000 24 GB
    Memory | 768 GB
    Storage | 1,200 GB
    Network (DxPU) | 200 Gbit/s Fabric Network, Single-hop Switches
    Table 5. Experiment Configuration
    The native environment is the server-centric architecture. GPUs in DxPU are selected from the resource pool. Other configurations are the same.

    4.2 Components of \(RTT_{DxPU}\)

    To measure the components of \(RTT_{DxPU}\) , we request a single Tesla V100 GPU from the pool and perform repeated read-write operations in the GPU memory space. During these operations, we calculate the time length of different parts in \(RTT_{DxPU}\) . Generally, \(RTT_{DxPU}\) can be split into three parts: original time latency, network transmission, and packet conversion. As can be seen from Table 6, they make up 17.7%, 27.9%, and 54.4% of \(RTT_{DxPU}\) , respectively. The original time latency pertains to the duration taken to transmit PCIe packets without DxPU, where GPUs are PCIe devices on the host server. The network transmission time denotes the time required to transmit network packets between DxPU_PROXYs. The packet conversion time indicates the duration required to convert PCIe packets to network packets and any additional time spent due to the design of the network protocol (such as error control and redundancy check).
    Table 6.
    Component | Time Latency | Proportion
    Original Time Latency | 1.2 us | 17.7%
    Network Transmission | 1.9 us | 27.9%
    Packet Conversion | 3.7 us | 54.4%
    Table 6. Source of \(RTT_{DxPU}\)
    Time latency incurred by DxPU is around 5.6 us, increasing communication latency between the host server and GPUs.
    To demonstrate the impact of \(RTT_{DxPU}\) on the bandwidth between the host server and GPUs, we record their values in Table 7. The bandwidth from the host server to GPUs experiences a rapid drop to only 24.1% due to non-posted transactions, whereas the bandwidth from GPUs to the host server remains largely unaffected.
    Table 7.
     | DxPU | Native | Proportion
    Host Server to GPU | 2.7 GB/s | 11.2 GB/s | 24.1%
    GPU to Host Server | 11.6 GB/s | 12.5 GB/s | 92.8%
    Table 7. Bandwidth between GPUs and Host Servers Based on CUDA Bandwidth Test
    Restricted by non-posted transactions, the bandwidth between the host server and GPU drops rapidly.

    4.3 Performance of DxPU

    From the GPU pool, we request a single RTX 6000 GPU for basic workloads, eight Tesla V100 GPUs (supporting NVLINKs) for AI workloads, and a single RTX 6000 GPU for graphics rendering workloads. We compare the evaluation results in DxPU with those in the native environment, whose architecture is server-centric. Other configurations of DxPU and the native environment are the same. It is to be noted that, here, performance denotes GPU compute capability in DxPU compared with the native one, which is different from model prediction accuracy in AI scenarios.
    We aim at measuring the performance impact brought by DxPU on the typical user cases in the cloud. To this end, we adopt the following standard benchmarks:
    Basic Workloads: NVIDIA CUDA Samples. CUDA Samples [18] are NVIDIA official benchmarks that can be used to measure GPU performance. Here, we mainly use them to test DxPU for several basic workloads (e.g., add, copy, read, and write) and have a preliminary understanding of its performance.
    AI Workloads: ResNet-50, BERT, SSD320, and NCF. ResNet-50 (image classification), BERT (language modeling), SSD320 (object detection), and NCF (recommendation) are popular benchmarks in AI scenarios. We use tools from DeepLearningExamples [11] and tf_cnn_benchmarks [25]. The TensorFlow 21.02-tf2-py3 NGC container is the running environment [24].
    Graphics Rendering Workloads: valley, heaven, and glmark2. valley [26], heaven [14], and glmark2 [13] are well-known benchmarks that simulate real-world graphics rendering applications.

    4.3.1 Performance in Basic Workloads.

    According to the NVIDIA official benchmark documents, we select commonly used operations and record the performance of DxPU on these basic operations. As can be seen from Table 8, the test cases cover general matrix multiplication, fast Fourier transform, and stream operations (i.e., copy, scale, add, triad, read, and write). The performance overhead is no more than 4% in all basic test cases.
    Table 8.
    General Matrix Multiplication: FP16 97.2% | FP32 99.5% | FP64 99.1%
    Fast Fourier Transform: FP16 96.3% | FP32 97.5% | FP64 98.3%
    Stream Operations: Copy 98.8% | Scale 99.1% | Add 99.3% | Triad 99.3% | Read 97.3% | Write 97.5%
    Table 8. Performance of DxPU in Basic Workloads with a Single GPU
    The performance is better than 96% in these common operations.

    4.3.2 Performance in AI Workloads.

    To answer all above research questions in AI scenarios, we show and analyze the performance overhead of DxPU in AI workloads for single GPU and multi-GPU environments.
    To begin with, we conduct experiments on ResNet-50 utilizing a single GPU with diverse parameter configurations, where the performance overheads differ. Table 9 shows that the best performance can reach 99.6%. The default parameter values are: batch size = 64, local parameter device = GPU, xla = off, mode = training, dataset = synthetic. It can also be seen that setting a smaller batch size or using the CPU as the local parameter device gradually increases the overhead. In the following paragraphs, we first figure out the components of the performance overhead in the single GPU environment to answer RQ1.
    Table 9.
    Batch Size | Local Parameter Device | Synthetic Training (XLA OFF \(^{1}\) / ON) | Synthetic Inference (XLA OFF / ON) | ImageNet Training (XLA OFF / ON) | ImageNet Inference (XLA OFF / ON)
    32 | GPU | 85.2% / 93.3% | 95.4% / 94.6% | 83.8% / 92.4% | 94.1% / 93.2%
    64 | GPU | 92.7% / 97.5% | 98.6% / 98.1% | 89.4% / 96.8% | 97.2% / 97.1%
    128 | GPU | 95.5% / 97.9% | 99.5% / 99.6% | 94.3% / 97.3% | 97.7% / 97.6%
       | CPU \(^{2}\) | 90.9% / 88.1% | 83.4% / 74.0% | 90.5% / 88.0% | 84.1% / 73.5%
    Table 9. Performance of DxPU in ResNet-50 with a Single GPU
    The performance is better than 90% in most cases with different parameters.
    \(^{1}\) Accelerated linear algebra (xla) can be utilized for operation fusion.
    \(^{2}\) In DxPU, setting the CPU as the local parameter device greatly increases the network communication overhead. Thus, this parameter configuration is not recommended, and the related statistics are not covered in the summary of performance. So, we mark them in gray.
    Single GPU Performance Analysis. Considering that GPU disaggregation in DxPU is implemented at the PCIe level, it is transparent to the OS and applications. Therefore, the differences between DxPU and the native environment are the bandwidth and command latency, which stem from the network latency between the host server and GPUs (mentioned in Section 3.4). We utilize NVIDIA Nsight Systems [21] to conduct a detailed investigation. All detailed data is collected from experiments on the native GPU servers.
    For bandwidth, we collect the total transferred data between the host server and GPUs in each training step, which is only around 0.01 MB and 40 MB for the synthetic and ImageNet datasets, respectively. Thus, bandwidth is not the critical cause of performance overhead in single GPU training or inference for DxPU. Therefore, we pay more attention to the effect of command latency in the following analysis.
    Most GPU workloads are made up of GPU kernel operations and memory load/store operations. With the default parameter configuration, GPU kernel execution makes up most of the GPU hardware time (around 99%). So, we mainly analyze GPU kernel execution in these cases. For simplicity, we categorize GPU kernels into short-duration kernels (smaller than or equal to 10 us) and long-duration kernels (longer than 10 us). The collected data shows that short-duration kernels account for about 60% of all kernels. Obviously, the proportion of \(RTT_{delta}\) is larger when the kernel duration is smaller, especially in our cases where the percentage of short-duration kernels is larger than 50%. Therefore, in our cases, the command latency is the major factor that causes the performance decline in the single GPU environment. What is more, this illustrates that users of DxPU should consider reducing the proportion of short-duration kernels in their applications.
    Moreover, it can be inferred from Table 9 that different parameters influence the performance of DxPU greatly. Therefore, we conduct a detailed analysis and explain the causes to answer RQ2.
    Analysis of batch size: We conduct experiments with different batch sizes based on ResNet-50 synthetic dataset training. Similar to the above findings, the proportions of memory operations are still no more than 1% for each batch size, revealing that the batch size mainly affects kernel duration. From the above analysis, we know that more than half of GPU kernels are short-duration kernels. If the percentage of short-duration kernels increased, then the percentage of \(RTT_{delta}\) would increase correspondingly, further worsening the performance. Thus, in Figure 5(a), we plot the Cumulative Distribution Function (CDF) of kernel count over kernel duration. However, the proportions of short-duration kernels are 59.3%, 58.9%, and 58.3% when the batch sizes are 32, 64, and 128, respectively. Therefore, the batch size does not have significant effects on the proportion of short-duration kernels.
    Fig. 5.
    Fig. 5. CDF of kernel number and time proportions with duration in ResNet.
    Yet, further analysis shows that the long-duration kernels are the key factor influencing the performance. We display the CDF of the total time spent on kernels of each duration in Figure 5(b). It can be concluded that the long-duration kernels greatly influence the average kernel duration. For example, kernels ranging from 200–800 us account for 58.9%, 68.8%, and 53.6% of the total kernel duration when the batch sizes are 32, 64, and 128, respectively. In addition, the average kernel durations are 56.0 us, 102.3 us, and 193.0 us, correspondingly. So, in ResNet-50, a larger batch size increases the average kernel duration, further reducing the percentage of \(RTT_{delta}\) and improving the performance.
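    As a rough sanity check of this trend (an approximation we make here that charges each kernel launch one \(RTT_{delta}\) of about 5.6 us from Table 6 and ignores memory operations and other interactions), the expected performance is roughly \(\bar{d} / (\bar{d} + RTT_{delta})\), where \(\bar{d}\) is the average kernel duration. This gives about 91%, 95%, and 97% for \(\bar{d}\) = 56.0 us, 102.3 us, and 193.0 us, following the measured trend of 85.2%, 92.7%, and 95.5% in Table 9; the gap is consistent with the model ignoring secondary interactions, as noted in Section 3.5.1.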
    Analysis of local parameter device: If the parameter device is set to CPU, then more parameter-related operations are executed between the host server and GPUs. Correspondingly, the proportion of \(RTT_{delta}\) increases and the performance in Table 9 decreases. To demonstrate this statistically, we collect the proportions of memory operations with different local parameter devices in Table 10. The value increases rapidly when changing the parameter device to CPU, no matter which dataset we use.
    Table 10.
    Dataset | Mode | GPU as Local Parameter Device | CPU as Local Parameter Device
    Synthetic | Training | 0.2% | 7.5%
    Synthetic | Inference | 0.2% | 15.6%
    ImageNet | Training | 2.8% | 9.9%
    ImageNet | Inference | 9.0% | 22.3%
    Table 10. Proportions of Memory Operations in GPU Hardware Workloads with Different Local Parameter Devices (batch size = 128)
    The value increases rapidly when changing the parameter device to CPU.
    Analysis of xla: Accelerated linear algebra (xla) is a domain-specific compiler that realizes the kernel fusion in the computational graph optimization. Accordingly, the collected data shows that the average kernel duration in the default situation increases from 102.3 us to 131.0 us when xla is on, thus improving the overall performance.
    However, in Table 9, turning on xla worsens the performance when the local parameter device is CPU or the mode is inference. So, we do a further analysis of these cases and record the changes. For the former situation, we observe that the proportion of memory operations increases from 14.9% to 16.7% and their average duration decreases from 35.1 us to 11.3 us, both incurring more performance overhead. For the latter, we find that, in every case, turning xla on increases the number and duration of memory operations. In training cases, this increase can be ignored when compared with the total GPU kernels. Yet, in inference cases, the number of kernels is far smaller than that in training, and the proportion of memory operations increases from 0.2% to 7.9%, showing that it cannot be overlooked.
    Analysis of mode and dataset: With regard to mode and dataset, we discover that the proportions of kernels and memory operations are generally the same in training and inference. However, the average kernel duration increases in inference. Statistically speaking, the average value increases by 48.2%, 55.5%, and 55.1%, compared with the training cases, when the batch sizes are 32, 64, and 128, respectively. Therefore, the performance is better in inference. As the ImageNet dataset is larger than the synthetic one in size, more transmission time between the host server and GPUs causes the proportion of \(RTT_{delta}\) to increase. Consequently, the performance overhead incurred by \(RTT_{delta}\) increases a little for the ImageNet dataset.
    Analysis in Object Detection and Recommendation Scenarios: Table 11 demonstrates the performance of DxPU in object detection and recommendation scenarios. In the test cases of NCF, the performance is above 96% with different batch sizes. However, when it comes to SSD320, the performance is around 83%. We also plot the CDFs based on the number and time of kernels. As can be seen from Figure 6, the distributions of kernel durations with different batch sizes are similar. Statistically, the average kernel durations for these cases are 10.7 us, 8.2 us, 7.9 us, and 8.1 us, respectively. So, the performance does not change much with different batch sizes. Different from ResNet-50, the proportion of short-duration kernels is more than 90% for all batch sizes, and this explains why the performance is only around 83%.
    Table 11.
    SSD320 v1.2 | Batch Size | 8 | 16 | 32 | 64
                | Performance | 81.53% | 84.62% | 84.47% | 83.64%
    NCF | Batch Size | 16384 | 65536 | 262144 | 1048576
        | Performance | 96.78% | 98.04% | 97.55% | 98.25%
    Table 11. DxPU’s Performance in Object Detection and Recommendation Scenarios
    Fig. 6.
    Fig. 6. CDF of kernel number and time proportions with duration in object detection scenarios.
    Multi-GPU Performance Analysis. Besides ResNet-50, we also conduct multi-GPU experiments on BERT (fine-tuning training for SQuAD v1.1). We set the number of GPU(s) to 1, 4, and 8 and measure the performance overhead in multi-GPU scenarios to answer RQ3. Since more GPUs incur more interactions (e.g., data, parameter, and synchronization operations) between the host server and GPUs, the performance of DxPU declines when more GPUs are allocated. Statistically, the performance of DxPU in BERT and ResNet-50 is 94.6%, 93.8%, and 93.4% and 92.7%, 87.5%, and 82.4%, respectively, for 1, 4, and 8 GPUs.
    Analysis of Host-server-to-GPU and GPU-to-Host-server bandwidth: In multi-GPU scenarios, besides command latency, the bandwidth between the host server and GPUs may also affect the performance. Since the number of network packets that a DxPU_PROXY can process at the same time is limited, a single DxPU_PROXY is not enough when more GPUs are allocated. To measure this impact statistically, we change the number of GPUs and record the bandwidth in Table 12. When the number is no more than 4, the bandwidth increases linearly. However, when it switches from 4 to 8, the bandwidth no longer scales linearly, revealing that the communication bottleneck is triggered. Thus, in multi-GPU scenarios, users should take this effect into consideration and set up more DxPU_PROXYs according to the communication bandwidth. Additionally, if multiple host servers access GPUs in the same GPU Box, then the bandwidth will increase quickly and the communication bottleneck will be triggered, too. In this case, users also need to deploy more DxPU_PROXYs according to the communication bandwidth.
    Table 12.
     | Batch Size 64: Host to GPU(s) | GPU(s) to Host | Performance | Batch Size 128: Host to GPU(s) | GPU(s) to Host | Performance
    1 GPU | 1.5 GB/s | 0.8 GB/s | 92.7% | 1.4 GB/s | 0.7 GB/s | 96.5%
    2 GPUs | 2.6 GB/s | 1.3 GB/s | 90.3% | 2.6 GB/s | 1.3 GB/s | 94.7%
    4 GPUs | 4.9 GB/s | 2.3 GB/s | 87.5% | 5.0 GB/s | 2.3 GB/s | 92.4%
    8 GPUs | 8.4 GB/s | 3.6 GB/s | 82.4% | 8.3 GB/s | 3.6 GB/s | 90.7%
    Table 12. Bandwidth between the Host Server and GPU(s) in DxPU
    The increase of bandwidth does not follow the linear law when there are more than four GPUs, showing that the communication bottleneck is triggered.
    Analysis of GPU-to-GPU bandwidth: To demonstrate the bandwidth between GPUs across different DxPU_PROXYs, we test the corresponding PCIe read and write bandwidth. As can be seen from Figure 7, the bandwidth between GPUs across different DxPU_PROXYs is around 74% of that across a single PCIe bridge. Obviously, in multi-GPU cases, it is recommended that GPUs be allocated under the same DxPU_PROXY to avoid this extra expense. As we have mentioned, in our experiments, GPUs are connected via NVLINKs, so the bandwidth between GPUs is not affected by DxPU_PROXY in these cases.
Fig. 7. Bandwidth between GPUs based on the CUDA p2pBandwidthLatencyTest (P2P enabled by default). C1 denotes DxPU_PROXY, C2 the native environment, C3 GPUs connected across one NVLINK, and C4 GPUs connected across a bonded set of two NVLINKs. GPUs in the native environment communicate via a single PCIe bridge. In DxPU, GPUs under the same DxPU_PROXY communicate through either PCIe bridges or NVLINKs, depending on the design of the GPU Boxes.
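For readers who want to reproduce a comparison in the spirit of Figure 7, the following is a minimal sketch along the lines of the CUDA p2pBandwidthLatencyTest sample, not a replacement for it: the device indices, buffer size, and iteration count are illustrative assumptions. It enables peer access between two GPUs and times repeated cudaMemcpyPeerAsync transfers.

```cpp
// Minimal sketch of a GPU-to-GPU (P2P) bandwidth test between device 0 and 1.
// Buffer size, iteration count, and device indices are illustrative assumptions.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256ull << 20;   // 256 MiB (assumption)
    const int iters = 20;
    const int src = 0, dst = 1;          // device indices (assumption)

    int can = 0;
    cudaDeviceCanAccessPeer(&can, dst, src);
    if (!can) { printf("P2P not supported between %d and %d\n", src, dst); return 1; }

    void *bsrc = nullptr, *bdst = nullptr;
    cudaSetDevice(src); cudaMalloc(&bsrc, bytes);
    cudaSetDevice(dst); cudaMalloc(&bdst, bytes);
    cudaDeviceEnablePeerAccess(src, 0);  // current device (dst) may access src's memory
    cudaSetDevice(src);
    cudaDeviceEnablePeerAccess(dst, 0);  // current device (src) may access dst's memory

    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cudaMemcpyPeerAsync(bdst, dst, bsrc, src, bytes);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("P2P %d->%d: %.2f GB/s\n", src, dst, (double)bytes * iters / (ms * 1e6));
    return 0;
}
```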

4.3.3 Performance in Graphics Rendering Workloads.

    In this section, we test and analyze the performance of DxPU to answer RQ1 and RQ2 in graphics rendering scenarios.
According to Table 13, the performance overheads in valley and heaven are 2.6% and 11.3%, respectively. Therefore, the performance of DxPU is acceptable in real graphics rendering applications.
Table 13. DxPU’s Performance in Graphics Rendering Scenarios

Benchmark     valley      heaven
DxPU FPS      2,121.05    1,870.29
Native FPS    2,177.67    2,108.11
Performance   97.4%       88.7%
The overhead of DxPU in real graphics rendering applications is acceptable (around 10% or less).
Moreover, we evaluate the performance of DxPU on specific rendering tasks. In glmark2, we select three targets (jellyfish, effect2d, ideas) and three resolutions (1,920 \(\times\) 1,080, 3,840 \(\times\) 2,160, and 7,680 \(\times\) 4,320) for rendering, and we find that performance improves at higher resolutions. To explain this phenomenon, we collect the durations of GPU workloads. Typically, the average GPU workload duration of an OpenGL application grows with the resolution. Taking ideas in Table 14 as an example, the average durations at these resolutions are 65.6 us, 122.8 us, and 221.6 us, respectively, so the relative performance of DxPU improves as the resolution increases.
Table 14. Average GPU Workload Duration and Performance in Glmark2 with Different Resolutions (object = ideas)

Resolution               Average GPU Workload Duration   Performance
1,920 \(\times\) 1,080   65.6 us                         87.9%
3,840 \(\times\) 2,160   122.8 us                        91.0%
7,680 \(\times\) 4,320   221.6 us                        93.0%
To summarize, the answers to the research questions are as follows:
RQ1: In our cases, command latency is the major cause of the performance overhead incurred by DxPU.
RQ2: The effects of parameter configurations can be attributed to changes in kernel durations and in the proportion of memory operations.
RQ3: In multi-GPU scenarios, the bandwidth between the host server and GPUs greatly affects the performance overhead.

    4.4 Limitations of DxPU

As mentioned above, command latency in DxPU is longer than in the native environment. Hence, DxPU is not well suited to GPU workloads dominated by short-duration kernels and memory operations. Reducing the command latency is left as future work.
For users of DxPU, applications should reduce the proportion of short-duration kernels and memory operations. In multi-GPU scenarios, the number of DxPU_PROXYs should be provisioned according to the required communication bandwidth.

    5 Discussion and Future Work

    5.1 Software-hardware Optimization Space

Command Latency Overhead Mitigation. The effect of command latency can be mitigated with two strategies: interaction reduction and latency hiding. The former reduces the number of interactions between the host server and GPUs; for example, kernel fusion can be applied to reduce the number of kernels launched. The latter leverages concurrency and asynchrony (such as CUDA streams) to perform multiple operations simultaneously and in a pipeline.
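As a hedged illustration of the latency-hiding idea (not DxPU's implementation), the sketch below splits a workload into chunks and overlaps host-to-device copies with kernel execution across two CUDA streams, so that command latency on one stream can be hidden behind work on the other. The placeholder kernel, chunk count, and stream count are assumptions chosen only for illustration.

```cpp
// Minimal latency-hiding sketch: overlap HtoD copies and kernel execution
// using two CUDA streams. scale() is a placeholder kernel (assumption).
#include <cuda_runtime.h>

__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 24;              // total elements (assumption)
    const int chunks = 8;               // pipeline depth (assumption)
    const int chunk = n / chunks;

    float *h = nullptr, *d = nullptr;
    cudaMallocHost((void**)&h, n * sizeof(float));   // pinned memory enables async copies
    cudaMalloc((void**)&d, n * sizeof(float));

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    for (int c = 0; c < chunks; ++c) {
        cudaStream_t st = s[c % 2];
        float* hp = h + (size_t)c * chunk;
        float* dp = d + (size_t)c * chunk;
        // The copy for chunk c can overlap with the kernel of the previous
        // chunk, which was issued to the other stream.
        cudaMemcpyAsync(dp, hp, chunk * sizeof(float), cudaMemcpyHostToDevice, st);
        scale<<<(chunk + 255) / 256, 256, 0, st>>>(dp, chunk, 2.0f);
    }
    cudaDeviceSynchronize();

    cudaStreamDestroy(s[0]); cudaStreamDestroy(s[1]);
    cudaFree(d); cudaFreeHost(h);
    return 0;
}
```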
Adoption of PCIe Gen 4. Currently, DxPU is built on PCIe Gen 3. Since PCIe Gen 4 offers double the throughput of PCIe Gen 3, the number of in-flight tags in GPUs would increase; correspondingly, the throughput of PCIe memory read transactions and memcpy(HtoD) would increase as well, and the communication efficiency and multi-GPU performance of DxPU would also improve. In the future, we will deploy DxPU on PCIe Gen 4.
PCIe Read Transaction Avoidance. Since memcpy(HtoD) is built on PCIe memory read transactions, its throughput in DxPU decreases dramatically (refer to Equation (1)). To avoid this effect, an alternative memcpy(HtoD) implementation can be adopted: replacing PCIe memory read transactions with PCIe memory write transactions. This can be achieved in two ways by leveraging hardware features of the CPU. First, the CPU can execute data-move instructions to transfer data from host memory to GPU memory directly, which can be accelerated with SIMD instructions. Second, a DMA engine can be used to perform the transfer, which issues PCIe memory write transactions. We implemented a prototype of the first approach, and the result shows that \(RdTP_{DxPU}\) increases from 2.7 GB/s to 9.44 GB/s.
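To make the first approach concrete, the sketch below shows a write-only copy loop with AVX non-temporal (streaming) stores, which is one way a CPU can push data so that the transfer appears on the link as PCIe memory write transactions. The pointer gpu_bar_dst stands for GPU memory already mapped into the host address space; obtaining such a mapping is platform- and driver-specific and outside the scope of this sketch, so the example should be read as an illustrative assumption rather than DxPU's actual implementation.

```cpp
// Illustrative sketch: copy host data with AVX non-temporal stores so that the
// writes reach a (hypothetically) host-mapped GPU buffer as PCIe memory write
// transactions. gpu_bar_dst is assumed to be 32-byte aligned and to point at
// GPU memory mapped into the host address space. Compile with AVX enabled.
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

void stream_copy_to_gpu(void* gpu_bar_dst, const void* host_src, size_t bytes) {
    auto* dst = static_cast<uint8_t*>(gpu_bar_dst);
    const auto* src = static_cast<const uint8_t*>(host_src);

    size_t i = 0;
    for (; i + 32 <= bytes; i += 32) {
        // Load 32 bytes from host memory and write them with a non-temporal
        // (streaming) store, bypassing the cache hierarchy.
        __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(src + i));
        _mm256_stream_si256(reinterpret_cast<__m256i*>(dst + i), v);
    }
    for (; i < bytes; ++i)            // tail bytes, if any
        dst[i] = src[i];
    _mm_sfence();                     // make streaming stores globally visible
}
```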

    5.2 GPU Distribution Scheme Design

Although GPU disaggregation benefits cloud providers in resource utilization and elasticity, the distribution scheme should also be carefully designed. For example, GPUs in the same pool can be grouped and indexed in advance so that user requests can be satisfied quickly. In addition, spare GPUs should be kept in proportion to the equipment failure rate, so that broken GPUs can be replaced promptly and user experience is preserved.

    5.3 End-to-end TLP Encryption

As more and more users store their private data in the cloud, data privacy has become increasingly important, and much research has targeted data privacy protection [23, 51]. In AI training and inference, the model and data are considered user privacy and are carried in TLPs during transmission. To prevent privacy leakage over the network, end-to-end encryption of TLPs can be applied, which we leave to future work.

    6 Related Work

    6.1 Resource Disaggregation

Resource disaggregation has attracted extensive attention from academia and industry due to its efficiency in resource management, with examples including Intel Rack Scale Design [3], the IBM Composable System [33], and dReDBox [31]. However, although some studies try to provide software support for these designs, adopting them in practice remains challenging due to compatibility and performance issues.

    6.2 GPU Virtualization

With increasing demand for GPUs in the cloud, GPU virtualization was proposed to split the compute capability of a single GPU server and allocate it more flexibly. VOCL [49], rCUDA [34], vCUDA [45], and DS-CUDA [44] realize GPU virtualization via API remoting without significant performance penalty. To ease software maintenance of high-level libraries, LoGV [37], GPUvm [46], and gScale [50] implement para- and full virtualization at the GPU driver level. Among hardware virtualization methods [47], Intel VT-d [28] and AMD-Vi [6] do not support sharing a single GPU between multiple VMs, whereas NVIDIA GRID [20] overcomes this limitation and supports multiplexing a single GPU among VMs. However, by design, GPU virtualization cannot provide more compute capability than a single GPU server.

    6.3 Distributed Machine Learning

As the era of big data arrives, Machine Learning systems are growing in both training data and model parameters. Since increasing the compute capability of a single server requires massive systematic effort [32, 48], distributed Machine Learning has been proposed to overcome these challenges. However, such distributed systems require more knowledge of parallel programming and multi-machine communication, which becomes a bottleneck for researchers; programmers must pay extra attention to reducing network latency and optimizing the whole system [40, 41].

    6.4 Network Processor

    Network processors, like Intel IXP Series [29], are specialized integrated circuits (ICs) designed to efficiently handle and process network-related tasks, such as data packet routing, switching, security, and quality of service (QoS) management. They play a crucial role in enabling the high-speed data communication that underpins modern computer networks. In contrast, DxPU_PROXY serves the purpose of converting between PCIe and network packets, as described in Section 3.3.

    7 Conclusion

In this article, we present a new implementation of datacenter-scale GPU disaggregation at the PCIe level, named DxPU. DxPU provides full-scenario support, software transparency, datacenter-scale disaggregation, large capacity, and good performance. To estimate the performance overhead incurred by DxPU, we build a performance model. Guided by the modeling results, we develop a prototype of DxPU and conduct detailed experiments. The results show that the overhead of DxPU is acceptable, below 10% in most cases. In addition, we perform further experiments to identify the causes of the performance overhead and propose optimizations, which we regard as future work.

    Acknowledgments

    We thank the anonymous reviewers for their comments that greatly helped improve the presentation of this article. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of funding agencies.

    Footnotes

    1
    vCPU is the abbreviation for virtual CPU.
    1
    There can be hundreds of GPU servers in the datacenter.
    2
    For host server, this value should be 0.

    References

    [2]
    2018. Intel Rack Scale Design. Retrieved from https://www.kernel.org/doc/Documentation/ntb.txt
    [4]
    2019. The Impact of Bit Errors in PCI Express® Links. Retrieved from https://www.asteralabs.com/insights/impact-of-bit-errors-in-pci-express-links/
    [5]
    2019. PCI Express® 5.0 Architecture Channel Insertion Loss Budget. Retrieved from https://pcisig.com/pci-express%C2%AE-50-architecture-channel-insertion-loss-budget-0
    [6]
    2021. AMD I/O Virtualization Technology (IOMMU) Specification. Retrieved from https://www.amd.com/system/files/TechDocs/48882_IOMMU.pdf
    [11]
    2022. DeepLearningExamples. Retrieved from https://github.com/NVIDIA/DeepLearningExamples
    [12]
    2022. Ethernet Cables Explained. Retrieved from https://www.tripplite.com/products/ethernet-cable-types
    [13]
    [17]
    2022. Liqid Powered GPU Composable AI Platform. Retrieved from https://www.liqid.com/solutions/ai/liqid-composable-ai-platform
    [18]
    2022. NVIDIA CUDA Samples. Retrieved from https://github.com/NVIDIA/cuda-samples
    [19]
    [21]
    2022. NVIDIA Nsight System. Retrieved from https://developer.nvidia.com/nsight-systems
    [22]
    2022. PCI Express Base Specification Revision 6.0, Version 1.0. Retrieved from https://members.pcisig.com/wg/PCI-SIG/document/16609
    [23]
    2022. SGXLock: Towards efficiently establishing mutual distrust between host application and enclave for SGX. In 31st USENIX Security Symposium (USENIX Security’22). USENIX Association. Retrieved from https://www.usenix.org/conference/usenixsecurity22/presentation/chen-yuan
    [25]
    2022. tf_cnn_benchmarks. Retrieved from https://github.com/tensorflow/benchmarks
    [27]
    [28]
    Darren Abramson, Jeff Jackson, Sridhar Muthrasanallur, Gil Neiger, Greg Regnier, Rajesh Sankaran, Ioannis Schoinas, Rich Uhlig, Balaji Vembu, and John Wiegert. 2006. Intel virtualization technology for directed I/O. Intel Technol. J. 10, 3 (2006).
    [29]
    Matthew Adiletta, Mark Rosenbluth, Debra Bernstein, Gilbert Wolrich, and Hugh Wilkinson. 2002. The next generation of Intel IXP network processors. INTEL Technol. J. 6, 3 (2002), 6–18.
    [30]
    Luiz André Barroso and Urs Hölzle. 2007. The case for energy-proportional computing. Computer 40, 12 (2007), 33–37.
    [31]
    Maciej Bielski, Ilias Syrigos, Kostas Katrinis, Dimitris Syrivelis, Andrea Reale, Dimitris Theodoropoulos, Nikolaos Alachiotis, Dionisios Pnevmatikatos, E. H. Pap, George Zervas, and others. 2018. dReDBox: Materializing a full-stack rack-scale system prototype of a next-generation disaggregated datacenter. In 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), IEEE, 1093–1098.
    [32]
    Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. 2014. Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. ACM SIGARCH Comput. Archit. News 42, 1 (2014), 269–284.
    [33]
I-Hsin Chung, Bulent Abali, and Paul Crumley. 2018. Towards a composable computer system. In International Conference on High Performance Computing in Asia-Pacific Region (HPC Asia’18). Association for Computing Machinery, New York, NY, 137–147.
    [34]
José Duato, Antonio J. Peña, Federico Silla, Rafael Mayo, and Enrique S. Quintana-Ortí. 2010. rCUDA: Reducing the number of GPU-based accelerators in high performance clusters. In International Conference on High Performance Computing & Simulation. 224–231.
    [35]
    Henrique Fingler, Zhiting Zhu, Esther Yoon, Zhipeng Jia, Emmett Witchel, and Christopher J. Rossbach. 2022. DGSF: Disaggregated GPUs for serverless functions. In IEEE International Parallel and Distributed Processing Symposium (IPDPS’22). IEEE, 739–750.
    [36]
    Peter X. Gao, Akshay Narayan, Sagar Karandikar, Joao Carreira, Sangjin Han, Rachit Agarwal, Sylvia Ratnasamy, and Scott Shenker. 2016. Network requirements for resource disaggregation. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI’16). 249–264.
    [37]
    Mathias Gottschlag, Marius Hillenbrand, Jens Kehne, Jan Stoess, and Frank Bellosa. 2013. LoGV: Low-overhead GPGPU virtualization. In IEEE 10th International Conference on High Performance Computing and Communications and the IEEE International Conference on Embedded and Ubiquitous Computing. IEEE, 1721–1726.
    [38]
    Anubhav Guleria, J. Lakshmi, and Chakri Padala. 2019. EMF: Disaggregated GPUs in datacenters for efficiency, modularity and flexibility. In IEEE International Conference on Cloud Computing in Emerging Markets (CCEM’19). IEEE, 1–8.
    [39]
    Cheol-Ho Hong, Ivor Spence, and Dimitrios S. Nikolopoulos. 2017. GPU virtualization and scheduling methods: A comprehensive survey. ACM Comput. Surv. 50, 3 (2017), 1–37.
    [40]
    Kevin Hsieh, Aaron Harlap, Nandita Vijaykumar, Dimitris Konomis, Gregory R. Ganger, Phillip B. Gibbons, and Onur Mutlu. 2017. Gaia: Geo-distributed machine learning approaching LAN speeds. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI’17). USENIX Association, 629–647. Retrieved from https://www.usenix.org/conference/nsdi17/technical-sessions/presentation/hsieh
    [41]
    Kangwook Lee, Maximilian Lam, Ramtin Pedarsani, Dimitris Papailiopoulos, and Kannan Ramchandran. 2017. Speeding up distributed machine learning using codes. IEEE Trans. Inf. Theor. 64, 3 (2017), 1514–1529.
    [42]
    Jonas Markussen, Lars Bjørlykke Kristiansen, Pål Halvorsen, Halvor Kielland-Gyrud, Håkon Kvale Stensland, and Carsten Griwodz. 2021. SmartIO: Zero-overhead device sharing through PCIe networking. ACM Trans. Comput. Syst. 38, 1-2 (2021), 1–78.
    [43]
    Rolf Neugebauer, Gianni Antichi, José Fernando Zazo, Yury Audzevich, Sergio López-Buedo, and Andrew W. Moore. 2018. Understanding PCIe performance for end host networking. In Conference of the ACM Special Interest Group on Data Communication. 327–341.
    [44]
    Minoru Oikawa, Atsushi Kawai, Kentaro Nomura, Kenji Yasuoka, Kazuyuki Yoshikawa, and Tetsu Narumi. 2012. DS-CUDA: A middleware to use many GPUs in the cloud environment. In Conference on High Performance Computing, Networking Storage and Analysis. IEEE, 1207–1214.
    [45]
    Lin Shi, Hao Chen, Jianhua Sun, and Kenli Li. 2011. vCUDA: GPU-accelerated high-performance computing in virtual machines. IEEE Trans. Comput. 61, 6 (2011), 804–816.
    [46]
Yusuke Suzuki, Shinpei Kato, Hiroshi Yamada, and Kenji Kono. 2014. GPUvm: Why not virtualizing GPUs at the hypervisor? In USENIX Annual Technical Conference (USENIX ATC’14). 109–120. Retrieved from https://www.usenix.org/conference/atc14/technical-sessions/presentation/suzuki
    [47]
    Leendert Van Doorn. 2006. Hardware virtualization trends. In ACM/Usenix International Conference on Virtual Execution Environments, Vol. 14. 45–45.
    [48]
    Joost Verbraeken, Matthijs Wolting, Jonathan Katzy, Jeroen Kloppenburg, Tim Verbelen, and Jan S. Rellermeyer. 2020. A survey on distributed machine learning. ACM Comput. Surv. 53, 2 (2020), 1–33.
    [49]
    Shucai Xiao, Pavan Balaji, Qian Zhu, Rajeev Thakur, Susan Coghlan, Heshan Lin, Gaojin Wen, Jue Hong, and Wu-chun Feng. 2012. VOCL: An optimized environment for transparent virtualization of graphics processing units. In Innovative Parallel Computing (InPar’12). IEEE, 1–12.
    [50]
    Mochi Xue, Kun Tian, Yaozu Dong, Jiacheng Ma, Jiajun Wang, Zhengwei Qi, Bingsheng He, and Haibing Guan. 2016. gScale: Scaling up GPU virtualization with dynamic sharing of graphics memory space. In USENIX Annual Technical Conference (USENIX ATC’16). USENIX Association, 579–590. Retrieved from https://www.usenix.org/conference/atc16/technical-sessions/presentation/xue
    [51]
    Bingsheng Zhang, Yuan Chen, Jiaqi Li, Yajin Zhou, Phuc Thai, Hong-Sheng Zhou, and Kui Ren. 2021. Succinct scriptable NIZK via trusted hardware. In European Symposium on Research in Computer Security. Springer, 430–451.




