Performance Benchmark
Nightly benchmark (passed in 6h 1m)
The main goals of this benchmark are two-fold:
- Performance clarity: clarify which engine (vllm, tensorrt-llm, lmdeploy, or tgi) leads in performance under which workload.
- Reproducibility: anyone can run the exact same set of benchmarking commands inside the exact same docker image by following the instructions in reproduce.md.
Versions
We benchmark vllm, tensorrt-llm, lmdeploy and tgi using the following docker images:
- vllm/vllm-openai:v0.5.0.post1
- nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
- openmmlab/lmdeploy:v0.5.0
- ghcr.io/huggingface/text-generation-inference:2.1
Check nightly-pipeline.yaml artifact for more details.
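As an illustration of how one of these images is launched (the model name, port, and flags below are assumptions for the sketch; the exact serving commands are in reproduce.md):

```shell
# Illustrative only -- see reproduce.md for the exact benchmark commands.
docker run --gpus all -p 8000:8000 \
    vllm/vllm-openai:v0.5.0.post1 \
    --model meta-llama/Meta-Llama-3-8B-Instruct
```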
Workload description
We benchmark vllm, tensorrt-llm, lmdeploy and tgi using the following workload:
- Input length: 1000 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
- Output length: the corresponding output lengths of these 1000 prompts.
- Batch size: dynamically determined by vllm and the arrival pattern of the requests.
- Average QPS (queries per second): 4 for the 8B model and 2 for larger models. For each QPS, the arrival time of each query is drawn from a Poisson process (with a fixed random seed).
- Models: llama-3 8B, llama-3 70B, mixtral 8x7B.
- Evaluation metrics: throughput, TTFT (time to first token, mean and std), ITL (inter-token latency, mean and std).
Check nightly-tests.json artifact for more details.
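The Poisson arrival pattern described above can be sketched as follows. In a Poisson process with rate `qps`, the gaps between consecutive requests are exponentially distributed, and a fixed seed makes the schedule reproducible across engines. The function name `poisson_arrival_times` is illustrative, not the actual harness code:

```python
import random

def poisson_arrival_times(num_requests: int, qps: float, seed: int = 0) -> list[float]:
    """Generate request arrival times (in seconds) from a Poisson process.

    Inter-arrival gaps are exponential with mean 1/qps; fixing the seed
    makes the same schedule replayable against every engine.
    """
    rng = random.Random(seed)
    arrivals, t = [], 0.0
    for _ in range(num_requests):
        t += rng.expovariate(qps)  # exponential gap with mean 1/qps seconds
        arrivals.append(t)
    return arrivals

# 1000 requests at an average of 4 QPS, as in the 8B-model workload.
times = poisson_arrival_times(1000, qps=4.0, seed=0)
```

Because the seed is fixed, every engine sees requests at identical timestamps, so throughput and latency differences come from the engine rather than the arrival schedule.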
Known issues
- TGI v2.1 crashes when running the mixtral model; see tgi issue #2122.
- The transformers library is pinned to 4.41.2 to avoid lmdeploy's missing cache_position error; see lmdeploy issue #1885.
Plots
In the following plots, the error bar shows the standard error of the mean. A value of 0 means that the corresponding benchmark crashed.
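For reference, the standard error of the mean (SEM) shown by the error bars is the sample standard deviation divided by the square root of the sample count. A minimal sketch (the helper name and the sample values are illustrative):

```python
import math

def mean_std_sem(samples: list[float]) -> tuple[float, float, float]:
    """Return (mean, sample std, standard error of the mean)."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / (n - 1)  # sample variance
    std = math.sqrt(var)
    return mean, std, std / math.sqrt(n)  # SEM = std / sqrt(n)

# e.g. hypothetical per-request TTFT measurements in milliseconds
mean, std, sem = mean_std_sem([110.0, 95.0, 130.0, 105.0])
```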
Results
Test name | GPU | Successful req. | Tput (req/s) | Mean TTFT (ms) | Std TTFT (ms) | Mean ITL (ms) | Std ITL (ms) | Engine |
---|---|---|---|---|---|---|---|---|
tgi_llama8B_tp1_qps_4 | A100-SXM4-80GB | 500 | 3.74167 | 112.025 | 106.495 | 16.94 | 8.38153 | tgi |
tgi_llama70B_tp4_qps_2 | A100-SXM4-80GB | 500 | 1.85297 | 467.191 | 300.037 | 45.642 | 27.0579 | tgi |
tgi_mixtral8x7B_tp2_qps_2 | A100-SXM4-80GB | 500 | 1.83849 | 453.897 | 431.474 | 38.7525 | 51.2062 | tgi |
vllm_llama8B_tp1_qps_4 | A100-SXM4-80GB | 500 | 3.74688 | 20.8147 | 22.5057 | 16.2072 | 8.63551 | vllm |
vllm_llama70B_tp4_qps_2 | A100-SXM4-80GB | 500 | 1.89608 | 69.0084 | 54.7202 | 44.2541 | 28.4806 | vllm |
vllm_mixtral8x7B_tp2_qps_2 | A100-SXM4-80GB | 500 | 1.83253 | 120.103 | 274.588 | 40.9029 | 60.4394 | vllm |
trt_mixtral8x7B_tp2_qps_2 | A100-SXM4-80GB | 500 | 1.8629 | 103.729 | 24.0213 | 32.9333 | 6.1068 | trt |
trt_llama8B_tp1_qps_4 | A100-SXM4-80GB | 500 | 3.77129 | 51.1976 | 16.9842 | 13.5824 | 4.16044 | trt |
trt_llama70B_tp4_qps_2 | A100-SXM4-80GB | 500 | 1.88798 | 135.329 | 54.4486 | 34.9689 | 11.856 | trt |
lmdeploy_llama8B_tp1_qps_4 | A100-SXM4-80GB | 500 | 3.78979 | 20.9192 | 20.6185 | 12.5697 | 4.85207 | lmdeploy |
lmdeploy_llama70B_tp4_qps_2 | A100-SXM4-80GB | 500 | 1.91192 | 65.2403 | 53.6743 | 34.208 | 14.7309 | lmdeploy |
lmdeploy_mixtral8x7B_tp2_qps_2 | A100-SXM4-80GB | 500 | 1.7112 | 1482.03 | 500.372 | 60.8857 | 112.457 | lmdeploy |
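A quick way to read the table above is to compare throughput per workload. The snippet below transcribes the Tput (req/s) column and picks the leading engine for each model; the dictionary layout is just one possible representation:

```python
# Throughput (req/s) per (model, engine), transcribed from the results table.
tput = {
    ("llama8B",     "tgi"):      3.74167,
    ("llama8B",     "vllm"):     3.74688,
    ("llama8B",     "trt"):      3.77129,
    ("llama8B",     "lmdeploy"): 3.78979,
    ("llama70B",    "tgi"):      1.85297,
    ("llama70B",    "vllm"):     1.89608,
    ("llama70B",    "trt"):      1.88798,
    ("llama70B",    "lmdeploy"): 1.91192,
    ("mixtral8x7B", "tgi"):      1.83849,
    ("mixtral8x7B", "vllm"):     1.83253,
    ("mixtral8x7B", "trt"):      1.8629,
    ("mixtral8x7B", "lmdeploy"): 1.7112,
}

def best_engine(model: str) -> str:
    """Engine with the highest throughput for the given model."""
    return max(
        (e for m, e in tput if m == model),
        key=lambda e: tput[(model, e)],
    )

winners = {m: best_engine(m) for m in {m for m, _ in tput}}
```

Note that throughput alone does not tell the whole story: lmdeploy leads on throughput for the llama models but has by far the worst mean TTFT on mixtral, matching its known issue above.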
Bootstrap
curl -sSL https://raw.githubusercontent.com/vllm-project/buildkite-ci/main/scripts/kickoff-benchmark.sh | bash
Total Job Run Time: 2h 57m