
Nightly benchmark

This benchmark has two main goals:

  • Performance clarity: show which engine (vllm, tensorrt-llm, lmdeploy, or tgi) leads in performance under which workload.
  • Reproducibility: anyone can run the exact same set of benchmarking commands inside the exact same docker image by following the instructions in reproduce.md.

Docker images

We benchmark vllm, tensorrt-llm, lmdeploy and tgi using the following docker images:

  • vllm/vllm-openai:v0.5.0.post1
  • nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
  • openmmlab/lmdeploy:v0.5.0
  • ghcr.io/huggingface/text-generation-inference:2.1

Hardware

One AWS node with 8x NVIDIA A100 GPUs.

Workload description

We benchmark vllm, tensorrt-llm, lmdeploy and tgi using the following workload:

  • Input length: randomly sample 500 prompts from the ShareGPT dataset (with a fixed random seed).
  • Output length: the corresponding output length of these 500 prompts.
  • Models: llama-3 8B, llama-3 70B, mixtral 8x7B.
  • Average QPS (queries per second): 4 for the small model (llama-3 8B) and 2 for the other two models. For each QPS, the arrival time of each query is drawn from a Poisson process (with a fixed random seed).
  • Evaluation metrics: throughput (higher is better), TTFT (time to first token, lower is better), ITL (inter-token latency, lower is better).
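The Poisson arrival schedule described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual script; the function name and seed value are made up. Inter-arrival gaps of a Poisson process with rate `qps` are exponentially distributed with mean 1/qps, and fixing the seed makes the schedule reproducible across engines.

```python
import random

def poisson_arrival_times(num_requests: int, qps: float, seed: int = 42) -> list[float]:
    """Generate request arrival times (in seconds) for a Poisson process
    with average rate `qps`, using a fixed seed for reproducibility."""
    rng = random.Random(seed)
    t = 0.0
    arrivals = []
    for _ in range(num_requests):
        # Exponentially distributed gap with mean 1/qps.
        t += rng.expovariate(qps)
        arrivals.append(t)
    return arrivals

# 500 requests at an average of 4 QPS, as in the llama-3 8B workload.
arrivals = poisson_arrival_times(500, qps=4.0)
```

Because the seed is fixed, every engine sees the identical arrival schedule, so differences in TTFT and ITL reflect the engine rather than the load pattern.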

Plots

In the following plots, each dot shows the mean and the error bar shows the standard error of the mean. A value of 0 means that the corresponding benchmark crashed.
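The plotted statistics can be computed as below. This is a small stdlib-only sketch (the real pipeline likely uses numpy); the function name and sample values are illustrative only.

```python
import math

def mean_and_sem(samples: list[float]) -> tuple[float, float]:
    """Return (mean, standard error of the mean) for a list of measurements.

    SEM = sample standard deviation / sqrt(n), using the n-1 (Bessel)
    correction for the sample variance.
    """
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / (n - 1)
    return mean, math.sqrt(var / n)

# Example: hypothetical per-run mean ITL measurements in ms.
m, sem = mean_and_sem([16.2, 16.8, 16.4, 16.9, 16.5])
```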

Benchmarking results

| Test name | GPU | Successful req. | Tput (req/s) | Mean TTFT (ms) | Std TTFT (ms) | Mean ITL (ms) | Std ITL (ms) | Input Tput (tok/s) | Output Tput (tok/s) | Engine |
|---|---|---|---|---|---|---|---|---|---|---|
| tgi_llama8B_tp1_qps_4 | A100-SXM4-80GB | 500 | 3.74425 | 108.705 | 100.847 | 16.6054 | 8.11108 | 755.553 | 487.988 | tgi |
| tgi_llama70B_tp4_qps_2 | A100-SXM4-80GB | 500 | 1.88792 | 471.147 | 299.21 | 45.5061 | 26.9567 | 380.963 | 335.166 | tgi |
| tgi_mixtral8x7B_tp2_qps_2 | A100-SXM4-80GB | 500 | 1.83876 | 442.111 | 431.699 | 38.7051 | 52.0669 | 417.204 | 396.172 | tgi |
| trt_llama70B_tp4_qps_2 | A100-SXM4-80GB | 500 | 1.88808 | 135.072 | 52.7508 | 34.9815 | 11.7329 | 380.996 | 284.877 | trt |
| trt_llama8B_tp1_qps_4 | A100-SXM4-80GB | 500 | 3.77042 | 51.2959 | 16.967 | 13.6497 | 4.22742 | 760.832 | 575.75 | trt |
| trt_mixtral8x7B_tp2_qps_2 | A100-SXM4-80GB | 500 | 1.8578 | 103.746 | 23.6912 | 32.9532 | 6.10077 | 421.524 | 482.858 | trt |
| vllm_llama8B_tp1_qps_4 | A100-SXM4-80GB | 500 | 3.74554 | 20.585 | 18.9562 | 16.2997 | 8.53774 | 755.813 | 499.251 | vllm |
| vllm_llama70B_tp4_qps_2 | A100-SXM4-80GB | 500 | 1.89946 | 67.8226 | 54.8692 | 43.266 | 27.2318 | 383.293 | 333.633 | vllm |
| vllm_mixtral8x7B_tp2_qps_2 | A100-SXM4-80GB | 500 | 1.83413 | 114.966 | 268.457 | 39.1624 | 55.9786 | 416.154 | 395.997 | vllm |
| lmdeploy_llama8B_tp1_qps_4 | A100-SXM4-80GB | 500 | 3.79197 | 20.9606 | 20.25 | 12.5817 | 4.86262 | 765.181 | 494.943 | lmdeploy |
| lmdeploy_llama70B_tp4_qps_2 | A100-SXM4-80GB | 500 | 1.91245 | 64.8334 | 53.9591 | 34.2332 | 14.7409 | 385.913 | 341.49 | lmdeploy |
| lmdeploy_mixtral8x7B_tp2_qps_2 | A100-SXM4-80GB | 500 | 1.74858 | 1457.06 | 477.169 | 60.1654 | 110.52 | 396.743 | 389.27 | lmdeploy |
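The TTFT and ITL figures in the table can be derived from per-token completion timestamps recorded for each request. The sketch below shows one plausible way to do this; the function name and timestamps are illustrative, not taken from the benchmark harness.

```python
def request_metrics(send_time: float, token_times: list[float]) -> tuple[float, float]:
    """Compute (TTFT, mean ITL), both in milliseconds, for one request.

    `send_time` is when the request was issued; `token_times` are the
    timestamps (seconds) at which each output token completed.
    TTFT is the delay to the first token; ITL is the gap between
    consecutive tokens.
    """
    ttft_ms = (token_times[0] - send_time) * 1000.0
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    mean_itl_ms = (sum(gaps) / len(gaps)) * 1000.0 if gaps else 0.0
    return ttft_ms, mean_itl_ms

# Hypothetical request: sent at t=0, tokens arriving every 20 ms after a 50 ms TTFT.
ttft, itl = request_metrics(0.0, [0.05, 0.07, 0.09])
```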
The benchmark run is bootstrapped with:

  curl -sSL https://raw.githubusercontent.com/vllm-project/buildkite-ci/main/scripts/kickoff-benchmark.sh | bash