Performance Benchmark
Nightly benchmark
The goal of this benchmark is two-fold:
- Performance clarity: show which engine (vllm, tensorrt-llm, lmdeploy, or tgi) leads in performance under which workload.
- Reproducibility: anyone can run the exact same set of benchmarking commands inside the exact same Docker containers by following the instructions in reproduce.md.
Docker images
We benchmark vllm, tensorrt-llm, lmdeploy and tgi using the following docker images:
- vllm/vllm-openai:v0.5.0.post1
- nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
- openmmlab/lmdeploy:v0.5.0
- ghcr.io/huggingface/text-generation-inference:2.1
Hardware
One AWS node with 8x NVIDIA A100 GPUs.
Workload description
We benchmark vllm, tensorrt-llm, lmdeploy and tgi using the following workload:
- Input length: randomly sample 500 prompts from the ShareGPT dataset (with a fixed random seed).
- Output length: the output lengths corresponding to these 500 prompts in the dataset.
- Models: llama-3 8B, llama-3 70B, mixtral 8x7B.
- Average QPS (queries per second): 4 for the small model (llama-3 8B) and 2 for the other two models. For each QPS, the arrival time of each query is drawn from a Poisson process (with a fixed random seed).
- Evaluation metrics: Throughput (higher is better), TTFT (time to first token; lower is better), ITL (inter-token latency; lower is better).
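The Poisson arrival process described above can be sketched as follows. This is an illustrative sketch, not the benchmark's actual code; the function name, seed value, and QPS argument are assumptions.

```python
import random

def poisson_arrival_times(num_requests: int, qps: float, seed: int = 0) -> list[float]:
    """Generate arrival timestamps (in seconds) for a Poisson process.

    Inter-arrival gaps of a Poisson process with rate `qps` are
    exponentially distributed with mean 1/qps; the cumulative sum of
    the gaps gives the arrival times. A fixed seed makes the schedule
    reproducible across runs, as the benchmark description requires.
    """
    rng = random.Random(seed)  # fixed seed (illustrative value)
    t, times = 0.0, []
    for _ in range(num_requests):
        t += rng.expovariate(qps)  # exponential gap with mean 1/qps
        times.append(t)
    return times

# e.g. 500 requests at an average of 4 QPS (the llama-3 8B setting)
arrivals = poisson_arrival_times(500, qps=4.0, seed=42)
```

With 500 requests at 4 QPS, the schedule spans roughly two minutes of simulated arrivals.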
Plots
In the following plots, the dot shows the mean and the error bar shows the standard error of the mean. A value of 0 means the corresponding benchmark crashed.
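The standard error of the mean used for the error bars can be computed as below. This is a generic sketch; the sample ITL values are made up for illustration.

```python
import math
import statistics

def sem(samples: list[float]) -> float:
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return statistics.stdev(samples) / math.sqrt(len(samples))

# made-up per-request ITL measurements in ms (illustrative only)
itl_ms = [15.2, 16.8, 14.9, 17.3, 16.1]
mean = statistics.mean(itl_ms)  # plotted as the dot
err = sem(itl_ms)               # plotted as the error bar
```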
Results
Test name | GPU | Successful req. | Tput (req/s) | Mean TTFT (ms) | Std TTFT (ms) | Mean ITL (ms) | Std ITL (ms) | Input Tput (tok/s) | Output Tput (tok/s) | Engine |
---|---|---|---|---|---|---|---|---|---|---|
tgi_llama8B_tp1_qps_4 | A100-SXM4-80GB | 500 | 3.74425 | 108.705 | 100.847 | 16.6054 | 8.11108 | 755.553 | 487.988 | tgi |
tgi_llama70B_tp4_qps_2 | A100-SXM4-80GB | 500 | 1.88792 | 471.147 | 299.21 | 45.5061 | 26.9567 | 380.963 | 335.166 | tgi |
tgi_mixtral8x7B_tp2_qps_2 | A100-SXM4-80GB | 500 | 1.83876 | 442.111 | 431.699 | 38.7051 | 52.0669 | 417.204 | 396.172 | tgi |
trt_llama70B_tp4_qps_2 | A100-SXM4-80GB | 500 | 1.88808 | 135.072 | 52.7508 | 34.9815 | 11.7329 | 380.996 | 284.877 | trt |
trt_llama8B_tp1_qps_4 | A100-SXM4-80GB | 500 | 3.77042 | 51.2959 | 16.967 | 13.6497 | 4.22742 | 760.832 | 575.75 | trt |
trt_mixtral8x7B_tp2_qps_2 | A100-SXM4-80GB | 500 | 1.8578 | 103.746 | 23.6912 | 32.9532 | 6.10077 | 421.524 | 482.858 | trt |
vllm_llama8B_tp1_qps_4 | A100-SXM4-80GB | 500 | 3.74554 | 20.585 | 18.9562 | 16.2997 | 8.53774 | 755.813 | 499.251 | vllm |
vllm_llama70B_tp4_qps_2 | A100-SXM4-80GB | 500 | 1.89946 | 67.8226 | 54.8692 | 43.266 | 27.2318 | 383.293 | 333.633 | vllm |
vllm_mixtral8x7B_tp2_qps_2 | A100-SXM4-80GB | 500 | 1.83413 | 114.966 | 268.457 | 39.1624 | 55.9786 | 416.154 | 395.997 | vllm |
lmdeploy_llama8B_tp1_qps_4 | A100-SXM4-80GB | 500 | 3.79197 | 20.9606 | 20.25 | 12.5817 | 4.86262 | 765.181 | 494.943 | lmdeploy |
lmdeploy_llama70B_tp4_qps_2 | A100-SXM4-80GB | 500 | 1.91245 | 64.8334 | 53.9591 | 34.2332 | 14.7409 | 385.913 | 341.49 | lmdeploy |
lmdeploy_mixtral8x7B_tp2_qps_2 | A100-SXM4-80GB | 500 | 1.74858 | 1457.06 | 477.169 | 60.1654 | 110.52 | 396.743 | 389.27 | lmdeploy |
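A row of the results table can also be parsed programmatically, e.g. to compare mean TTFT across engines. The two rows embedded below are copied verbatim from the table; the fixed column layout and helper names are assumptions of this sketch.

```python
# Two rows copied from the results table (llama-3 8B at QPS 4).
ROWS = """\
tgi_llama8B_tp1_qps_4 | A100-SXM4-80GB | 500 | 3.74425 | 108.705 | 100.847 | 16.6054 | 8.11108 | 755.553 | 487.988 | tgi
vllm_llama8B_tp1_qps_4 | A100-SXM4-80GB | 500 | 3.74554 | 20.585 | 18.9562 | 16.2997 | 8.53774 | 755.813 | 499.251 | vllm
"""

def parse_row(line: str) -> dict:
    """Split a pipe-delimited table row into named columns."""
    cols = [c.strip() for c in line.split("|")]
    return {
        "test": cols[0],
        "gpu": cols[1],
        "successful_req": int(cols[2]),
        "tput_req_s": float(cols[3]),
        "mean_ttft_ms": float(cols[4]),
        "mean_itl_ms": float(cols[6]),
        "engine": cols[10],
    }

rows = [parse_row(line) for line in ROWS.strip().splitlines()]
# Engine with the lowest mean TTFT among the embedded rows.
best = min(rows, key=lambda r: r["mean_ttft_ms"])["engine"]
```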