From the GPU pool, we request a single RTX 6000 GPU for basic workloads, eight Tesla V100 GPUs (with NVLink support) for AI workloads, and a single RTX 6000 GPU for graphics rendering workloads. We compare the evaluation results in DxPU with those in the native, server-centric environment; all other configurations of DxPU and the native environment are identical. Note that, here, performance denotes the GPU compute capability of DxPU relative to the native environment, which is different from model prediction accuracy in AI scenarios.
We aim to measure the performance impact of DxPU on typical user cases in the cloud. To this end, we adopt the following standard benchmarks:
4.3.2 Performance in AI Workloads.
To answer the above research questions in AI scenarios, we present and analyze the performance overhead of DxPU in AI workloads for both single-GPU and multi-GPU environments.
To begin with, we conduct experiments on ResNet-50 utilizing a single GPU under diverse parameter configurations, which yield different performance overheads. Table 9 shows that the best performance reaches 99.6%. The default parameter values are: batch size = 64, local parameter device = GPU, xla = off, mode = training, dataset = synthetic. It can also be observed that a smaller batch size, or setting the CPU as the local parameter device, gradually increases the overhead. In the following paragraphs, we first identify the components of the performance overhead in the single-GPU environment to answer RQ1.
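For concreteness, the sweep over these parameters can be scripted. The sketch below assumes the ResNet-50 benchmark is TensorFlow's tf_cnn_benchmarks (whose flag names match the parameters in Table 9); the script path and exact flags should be adapted to the actual setup.

```python
# Hypothetical parameter sweep mirroring Table 9, assuming the benchmark is
# TensorFlow's tf_cnn_benchmarks (github.com/tensorflow/benchmarks); the
# script path below is an assumption.
import itertools
import subprocess

BENCH = "tf_cnn_benchmarks.py"

for batch, device, xla, inference in itertools.product(
        [32, 64, 128], ["gpu", "cpu"], [False, True], [False, True]):
    subprocess.run(
        ["python", BENCH,
         "--model=resnet50",
         "--num_gpus=1",
         f"--batch_size={batch}",
         f"--local_parameter_device={device}",         # parameter device: GPU vs. CPU
         f"--xla={str(xla).lower()}",                  # kernel fusion on/off
         f"--forward_only={str(inference).lower()}"],  # inference vs. training
        check=True)
    # Synthetic data is the default; pass --data_dir/--data_name for ImageNet.
```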
Single GPU Performance Analysis. Considering that GPU disaggregation in DxPU is implemented at the PCIe level, it is transparent to the OS and applications. Therefore, the differences between DxPU and the native environment are the bandwidth and command latency introduced by the network latency between the host server and GPUs (mentioned in Section 3.4). We utilize NVIDIA Nsight Systems [21] to conduct a detailed investigation. All detailed data is collected from experiments on the native GPU servers.
For bandwidth, we collect the total data transferred between the host server and GPUs in each training step, which amounts to only around 0.01 MB and 40 MB for the synthetic and ImageNet datasets, respectively. Thus, bandwidth is not the critical cause of performance overhead in single-GPU training or inference for DxPU, and we focus on the effect of command latency in the following analysis.
Most GPU workloads consist of GPU kernel operations and memory load/store operations. With the default parameter configuration, GPU kernel execution accounts for most of the GPU hardware time (around 99%), so we mainly analyze GPU kernel execution in these cases. For simplicity, we categorize GPU kernels into short-duration kernels (no longer than 10 us) and long-duration kernels (longer than 10 us). The collected data shows that short-duration kernels account for about 60% of all kernels. Obviously, the proportion of \(RTT_{delta}\) is larger when the kernel duration is smaller, especially in our cases, where short-duration kernels exceed 50% of all kernels. Therefore, in our cases, command latency is the major factor behind the performance decline in the single-GPU environment. This also suggests that users of DxPU should consider reducing the proportion of short-duration kernels in their applications.
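As an illustration, the short-/long-duration breakdown can be reproduced from an Nsight Systems trace exported to SQLite. The table and column names below follow the CUPTI activity schema and may differ across nsys versions:

```python
# Sketch: short-/long-duration kernel breakdown, assuming
# `nsys export --type sqlite report.nsys-rep` produced report.sqlite;
# timestamps in the CUPTI activity tables are in nanoseconds.
import sqlite3

conn = sqlite3.connect("report.sqlite")
rows = conn.execute("SELECT start, end FROM CUPTI_ACTIVITY_KIND_KERNEL").fetchall()

durations_us = [(end - start) / 1e3 for start, end in rows]  # ns -> us
short = sum(1 for d in durations_us if d <= 10.0)

print(f"kernels: {len(durations_us)}")
print(f"short-duration share: {short / len(durations_us):.1%}")
print(f"average duration: {sum(durations_us) / len(durations_us):.1f} us")
```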
Moreover, it can be inferred from Table 9 that different parameters greatly influence the performance of DxPU. Therefore, we conduct a detailed analysis and explain the causes to answer RQ2.
Analysis of batch size: We run experiments with different batch sizes on ResNet-50 synthetic dataset training. Similar to the above findings, the proportion of memory operations remains no more than 1% for each batch size, revealing that the batch size mainly affects kernel duration. From the above analysis, we know that more than half of the GPU kernels are short-duration kernels. If the percentage of short-duration kernels increased, the percentage of \(RTT_{delta}\) would increase correspondingly, further worsening the performance. Thus, in Figure 5(a), we plot the Cumulative Distribution Function (CDF) of kernel durations weighted by kernel count. However, the proportions of short-duration kernels are 59.3%, 58.9%, and 58.3% when the batch sizes are 32, 64, and 128, respectively. Therefore, the batch size does not significantly affect the proportion of short-duration kernels.
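For reference, the count-weighted CDF in Figure 5(a) can be regenerated from the same SQLite export (with the same schema caveat as above):

```python
# Sketch: CDF of kernel durations by kernel count, as in Figure 5(a).
import sqlite3
import numpy as np
import matplotlib.pyplot as plt

conn = sqlite3.connect("report.sqlite")
rows = conn.execute("SELECT start, end FROM CUPTI_ACTIVITY_KIND_KERNEL").fetchall()
durations = np.sort([(e - s) / 1e3 for s, e in rows])  # ns -> us

cdf = np.arange(1, len(durations) + 1) / len(durations)
plt.semilogx(durations, cdf)
plt.axvline(10.0, linestyle="--")  # short-/long-duration boundary
plt.xlabel("kernel duration (us)")
plt.ylabel("fraction of kernels")
plt.savefig("kernel_count_cdf.png")
```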
Yet, further analysis shows that the long-duration kernels are the key factor influencing the performance. We display the CDF of kernel durations weighted by total execution time in Figure 5(b). It can be concluded that the long-duration kernels greatly influence the average kernel duration. For example, kernels ranging from 200 to 800 us account for 58.9%, 68.8%, and 53.6% of the total kernel duration when the batch sizes are 32, 64, and 128, respectively. In addition, the average kernel durations are 56.0 us, 102.3 us, and 193.0 us, correspondingly. So, in ResNet-50, a larger batch size increases the average kernel duration, further reducing the percentage of \(RTT_{delta}\) and improving the performance.
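To make this relation concrete, consider a deliberately simplified model (our own illustration, not a measurement): if every kernel launch were serialized behind a fixed extra round trip \(RTT_{delta}\), the attainable fraction of native throughput would be roughly the average kernel duration divided by that duration plus \(RTT_{delta}\). Real launches are pipelined, so this overstates the penalty, but it captures why a larger average kernel duration shrinks the share of \(RTT_{delta}\):

```python
# Toy model: fraction of native throughput if each kernel launch were
# serialized behind a fixed extra round trip RTT_delta (the 5 us value is
# assumed purely for illustration; real launches pipeline).
def estimated_performance(avg_kernel_us: float, rtt_delta_us: float = 5.0) -> float:
    return avg_kernel_us / (avg_kernel_us + rtt_delta_us)

# Average kernel durations measured for ResNet-50 (from the text).
for batch, avg_us in [(32, 56.0), (64, 102.3), (128, 193.0)]:
    print(f"batch {batch:>3}: avg {avg_us:6.1f} us -> "
          f"~{estimated_performance(avg_us):.1%} of native")
```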
Analysis of local parameter device: If the local parameter device is set to the CPU, more parameter-related operations are executed between the host server and GPUs. Correspondingly, the proportion of \(RTT_{delta}\) increases and the performance in Table 9 decreases. To demonstrate this statistically, we collect the proportions of memory operations under different local parameter devices in Table 10. The value increases rapidly when the parameter device is changed to the CPU, no matter which dataset we use.
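The memory-operation share reported in Table 10 can be estimated from the same SQLite export; a minimal sketch, again subject to the nsys schema caveat:

```python
# Sketch: share of GPU hardware time spent on memory operations
# (memcpy + memset) versus kernel execution.
import sqlite3

conn = sqlite3.connect("report.sqlite")

def total_time_us(table: str) -> float:
    (total,) = conn.execute(
        f"SELECT COALESCE(SUM(end - start), 0) FROM {table}").fetchone()
    return total / 1e3

kernel_us = total_time_us("CUPTI_ACTIVITY_KIND_KERNEL")
memcpy_us = total_time_us("CUPTI_ACTIVITY_KIND_MEMCPY")
memset_us = total_time_us("CUPTI_ACTIVITY_KIND_MEMSET")

share = (memcpy_us + memset_us) / (kernel_us + memcpy_us + memset_us)
print(f"memory-operation share of GPU time: {share:.1%}")
```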
Analysis of xla: Accelerated linear algebra (xla) is a domain-specific compiler that performs kernel fusion during computational graph optimization. Accordingly, the collected data shows that the average kernel duration in the default configuration increases from 102.3 us to 131.0 us when xla is on, thus improving the overall performance.
However, in Table 9, turning on xla worsens the performance when the local parameter device is the CPU or the mode is inference, so we analyze these cases further and record the changes. For the former, we observe that the proportion of memory operations increases from 14.9% to 16.7%, while their average duration decreases from 35.1 us to 11.3 us, both incurring more performance overhead. For the latter, we find that turning xla on increases the number and duration of memory operations in every case. In training, this increase is negligible compared with the total GPU kernels. Yet, in inference, the number of kernels is far smaller than in training, and the proportion of memory operations increases from 0.2% to 7.9%, which cannot be overlooked.
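As a minimal illustration of the fusion effect (TensorFlow 2.x syntax, not the benchmark's actual code), XLA can compile a chain of element-wise operations into fewer, longer-running kernels, which lowers the share of \(RTT_{delta}\):

```python
# With jit_compile=True, XLA can fuse the element-wise ops below into
# fewer, longer-running kernels; drop the flag to compare against the
# unfused run.
import tensorflow as tf

@tf.function(jit_compile=True)
def fused_ops(x):
    y = tf.nn.relu(x)
    y = y * 2.0 + 1.0
    return tf.reduce_sum(y)

x = tf.random.normal([1024, 1024])
print(fused_ops(x).numpy())
```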
Analysis of mode and dataset: With regard to mode and dataset, we discover that the proportions of kernels and memory operations are generally the same in training and inference. However, the average kernel duration increases in inference: statistically, it increases by 48.2%, 55.5%, and 55.1% compared with training when the batch sizes are 32, 64, and 128, respectively. Therefore, the performance is better in inference. As the ImageNet dataset is larger than the synthetic one, the additional transmission time between the host server and GPUs causes the proportion of \(RTT_{delta}\) to increase. Consequently, the performance overhead incurred by \(RTT_{delta}\) increases slightly for the ImageNet dataset.
Analysis in Object Detection and Recommendation Scenarios: Table 11 shows the performance of DxPU in object detection and recommendation scenarios. In the NCF test cases, the performance is above 96% across different batch sizes. However, for SSD320, the performance is only around 83%. We also plot the CDFs based on the number and total time of kernels. As can be seen from Figure 6, the distributions of kernel durations are similar across batch sizes; statistically, the average kernel durations for these cases are 10.7 us, 8.2 us, 7.9 us, and 8.1 us, respectively, so the performance does not change much with the batch size. Different from ResNet-50, the proportion of short-duration kernels exceeds 90% for every batch size, which explains why the performance is only around 83%.
Multi-GPU Performance Analysis. Besides ResNet-50, we also conduct multi-GPU experiments on BERT (fine-tuning training for SQuAD v1.1). We set the number of GPUs to 1, 4, and 8 and measure the performance overhead in multi-GPU scenarios to answer RQ3. Since more GPUs incur more interactions (e.g., data, parameter, and synchronization operations) between the host server and GPUs, the performance of DxPU declines as more GPUs are allocated. Statistically, the performance of DxPU in BERT and ResNet is 94.6%, 93.8%, 93.4% and 92.7%, 87.5%, 82.4%, respectively.
Analysis of Host-server-to-GPU and GPU-to-Host-server bandwidth: In multi-GPU scenarios, besides command latency, the bandwidth between the host server and GPUs may also affect the performance. Since the number of network packets that a DxPU_PROXY can process at the same time is limited, a single DxPU_PROXY is not enough when more GPUs are allocated. To measure this impact statistically, we vary the number of GPUs and record the bandwidth in Table 12. When the number is no more than 4, the bandwidth increases linearly. However, when it goes from 4 to 8, the bandwidth no longer scales linearly, revealing that the communication bottleneck has been triggered. Thus, in multi-GPU scenarios, users should take this effect into consideration and set up more DxPU_PROXYs according to the communication bandwidth. Additionally, if multiple host servers access GPUs in the same GPU Boxes, the aggregate bandwidth grows quickly and the communication bottleneck is triggered as well; in this case, users also need to deploy more DxPU_PROXYs according to the communication bandwidth.
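A rough host-to-GPU bandwidth probe can be written with CuPy (our own sketch; for rigorous measurements, tools such as NVIDIA's bandwidthTest are preferable):

```python
# Rough host<->GPU bandwidth probe with CuPy.
import time
import numpy as np
import cupy as cp

size = 256 * 1024 * 1024  # 256 MiB
host = np.ones(size, dtype=np.uint8)
dev = cp.asarray(host)    # warm-up allocation and copy
cp.cuda.Stream.null.synchronize()

t0 = time.perf_counter()
for _ in range(10):
    dev.set(host)         # host -> GPU
cp.cuda.Stream.null.synchronize()
h2d = 10 * size / (time.perf_counter() - t0) / 1e9

t0 = time.perf_counter()
for _ in range(10):
    out = dev.get()       # GPU -> host (synchronizes implicitly)
d2h = 10 * size / (time.perf_counter() - t0) / 1e9

print(f"H2D: {h2d:.1f} GB/s, D2H: {d2h:.1f} GB/s")
```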
Analysis of GPU-to-GPU bandwidth: To characterize the bandwidth between GPUs attached to different DxPU_PROXYs, we test the corresponding PCIe read and write bandwidth. As can be seen from Figure 7, the bandwidth between GPUs across different DxPU_PROXYs is around 74% of that across a single PCIe bridge. Hence, in multi-GPU cases, it is recommended to allocate GPUs under the same DxPU_PROXY to avoid this extra cost. As mentioned above, in our experiments the GPUs are connected via NVLink, so the bandwidth between GPUs is not affected by DxPU_PROXY in these cases.
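Similarly, a GPU-to-GPU copy bandwidth probe can be sketched with CuPy; the path taken (a single PCIe bridge, NVLink, or across DxPU_PROXYs) depends on the topology:

```python
# Sketch: GPU0 -> GPU1 copy bandwidth. Peer access may be unsupported on
# some device pairs, in which case deviceEnablePeerAccess raises.
import time
import cupy as cp

size = 256 * 1024 * 1024  # 256 MiB
with cp.cuda.Device(0):
    src = cp.ones(size, dtype=cp.uint8)
with cp.cuda.Device(1):
    dst = cp.empty(size, dtype=cp.uint8)
    cp.cuda.runtime.deviceEnablePeerAccess(0)
    t0 = time.perf_counter()
    for _ in range(10):
        cp.cuda.runtime.memcpyPeer(dst.data.ptr, 1, src.data.ptr, 0, size)
    cp.cuda.Device(1).synchronize()

print(f"GPU0 -> GPU1: {10 * size / (time.perf_counter() - t0) / 1e9:.1f} GB/s")
```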
7, the bandwidth between GPUs across different DxPU_PROXYs is around 74% of that across a single PCIe bridge. Obviously, in multi-GPU cases, it is recommended that GPUs are allocated under the same DxPU_PROXY to avoid extra expense. As we have mentioned, in our experiments, GPUs are connected via NVLINKs. So, bandwidth between GPUs are not affected by DxPU_PROXY in these cases.