
Nightly benchmark

This nightly benchmark has two main goals:

  • Performance clarity: show which engine (vllm, tensorrt-llm, lmdeploy, or tgi) leads in performance under which workload.
  • Reproducibility: anyone can rerun the exact same set of benchmarking commands inside the exact same docker images by following the instructions in reproduce.md.

Versions

We benchmark vllm, tensorrt-llm, lmdeploy and tgi using the following docker images:

  • vllm/vllm-openai:v0.5.0.post1
  • nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
  • openmmlab/lmdeploy:v0.5.0
  • ghcr.io/huggingface/text-generation-inference:2.1

Check the nightly-pipeline.yaml artifact for more details.

Workload description

We benchmark vllm, tensorrt-llm, lmdeploy and tgi using the following workload:

  • Input length: 1000 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
  • Output length: the output length corresponding to each of these 1000 prompts.
  • Batch size: dynamically determined by the serving engine and the arrival pattern of the requests.
  • Average QPS (queries per second): 4 for the 8B model and 2 for the larger models. For each QPS, the arrival time of each query is drawn from a Poisson process (with a fixed random seed); see the sketch below.
  • Models: llama-3 8B, llama-3 70B, mixtral 8x7B.
  • Evaluation metrics: throughput, TTFT (time to first token, with mean and std), ITL (inter-token latency, with mean and std).

Check the nightly-tests.json artifact for more details.
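
As an illustration of the arrival pattern described above, here is a minimal sketch of how Poisson-distributed arrival times with a fixed seed could be generated. The function name and the use of numpy are assumptions for illustration, not the benchmark's actual implementation.

```python
import numpy as np

def poisson_arrival_times(num_requests: int, qps: float, seed: int = 0) -> list[float]:
    """Illustrative sketch: in a Poisson process the gaps between arrivals
    are exponentially distributed with mean 1/qps; fixing the seed makes
    the schedule reproducible across runs and engines."""
    rng = np.random.default_rng(seed)
    gaps = rng.exponential(scale=1.0 / qps, size=num_requests)
    return np.cumsum(gaps).tolist()

# Example: 1000 requests at an average of 4 QPS (the 8B-model setting).
arrival_times = poisson_arrival_times(num_requests=1000, qps=4.0, seed=0)
```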

Known issues

  • TGI v2.1 crashes when serving the mixtral model; see tgi issue #2122.
  • The transformers library is pinned to 4.41.2 to avoid lmdeploy's missing cache_position error; see lmdeploy issue #1885.

Plots

In the following plots, each error bar shows the standard error of the mean. A value of 0 means that the corresponding benchmark run crashed.
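
For reference, the standard error of the mean is the sample standard deviation divided by the square root of the number of samples. A minimal sketch of this computation (numpy assumed; not taken from the benchmark code):

```python
import numpy as np

def mean_and_sem(samples: list[float]) -> tuple[float, float]:
    """Return (mean, standard error of the mean) for one metric,
    e.g. the per-request TTFTs collected in a single benchmark run."""
    x = np.asarray(samples, dtype=float)
    sem = x.std(ddof=1) / np.sqrt(len(x))  # sample std / sqrt(n)
    return float(x.mean()), float(sem)
```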

Benchmarking results

| Test name | GPU | Successful req. | Throughput (req/s) | Mean TTFT (ms) | Std TTFT (ms) | Mean ITL (ms) | Std ITL (ms) | Engine |
|---|---|---|---|---|---|---|---|---|
| tgi_llama8B_tp1_qps_4 | A100-SXM4-80GB | 500 | 3.74167 | 112.025 | 106.495 | 16.94 | 8.38153 | tgi |
| tgi_llama70B_tp4_qps_2 | A100-SXM4-80GB | 500 | 1.85297 | 467.191 | 300.037 | 45.642 | 27.0579 | tgi |
| tgi_mixtral8x7B_tp2_qps_2 | A100-SXM4-80GB | 500 | 1.83849 | 453.897 | 431.474 | 38.7525 | 51.2062 | tgi |
| vllm_llama8B_tp1_qps_4 | A100-SXM4-80GB | 500 | 3.74688 | 20.8147 | 22.5057 | 16.2072 | 8.63551 | vllm |
| vllm_llama70B_tp4_qps_2 | A100-SXM4-80GB | 500 | 1.89608 | 69.0084 | 54.7202 | 44.2541 | 28.4806 | vllm |
| vllm_mixtral8x7B_tp2_qps_2 | A100-SXM4-80GB | 500 | 1.83253 | 120.103 | 274.588 | 40.9029 | 60.4394 | vllm |
| trt_mixtral8x7B_tp2_qps_2 | A100-SXM4-80GB | 500 | 1.8629 | 103.729 | 24.0213 | 32.9333 | 6.1068 | trt |
| trt_llama8B_tp1_qps_4 | A100-SXM4-80GB | 500 | 3.77129 | 51.1976 | 16.9842 | 13.5824 | 4.16044 | trt |
| trt_llama70B_tp4_qps_2 | A100-SXM4-80GB | 500 | 1.88798 | 135.329 | 54.4486 | 34.9689 | 11.856 | trt |
| lmdeploy_llama8B_tp1_qps_4 | A100-SXM4-80GB | 500 | 3.78979 | 20.9192 | 20.6185 | 12.5697 | 4.85207 | lmdeploy |
| lmdeploy_llama70B_tp4_qps_2 | A100-SXM4-80GB | 500 | 1.91192 | 65.2403 | 53.6743 | 34.208 | 14.7309 | lmdeploy |
| lmdeploy_mixtral8x7B_tp2_qps_2 | A100-SXM4-80GB | 500 | 1.7112 | 1482.03 | 500.372 | 60.8857 | 112.457 | lmdeploy |
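
To help interpret the TTFT and ITL columns: TTFT is measured from the moment a request is sent until its first streamed token arrives, and ITL is the gap between consecutive streamed tokens after that. A hedged sketch of how these values could be derived from per-token receive timestamps (the timestamp collection itself, and the conversion to the milliseconds shown above, are assumptions for illustration):

```python
def ttft_and_itls(request_sent_at: float, token_times: list[float]) -> tuple[float, list[float]]:
    """token_times: wall-clock times (seconds) at which each streamed token
    of one response arrived. Returns (TTFT, list of inter-token latencies),
    both in seconds; multiply by 1000 to compare with the table above."""
    ttft = token_times[0] - request_sent_at
    itls = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return ttft, itls
```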