Performance Benchmark
Nightly benchmark (passed in 6h 1m)
The main goals of this benchmark are two-fold:
- Performance clarity: clarify which engine (vllm, tensorrt-llm, lmdeploy, or tgi) leads in performance under which workload.
- Reproducibility: anyone can run the exact same set of benchmarking commands inside the exact same docker image by following the instructions in reproduce.md.
Versions
We benchmark vllm, tensorrt-llm, lmdeploy and tgi using the following docker images:
- vllm/vllm-openai:v0.5.0.post1
- nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
- openmmlab/lmdeploy:v0.5.0
- ghcr.io/huggingface/text-generation-inference:2.1
Check nightly-pipeline.yaml artifact for more details.
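As an illustration of how one of these images is launched (the model name, port, and flags below are assumptions for the sketch; the exact serving commands are in reproduce.md):

```shell
# Illustrative only -- see reproduce.md for the exact benchmark commands.
docker run --gpus all -p 8000:8000 \
    vllm/vllm-openai:v0.5.0.post1 \
    --model meta-llama/Meta-Llama-3-8B-Instruct
```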
Workload description
We benchmark vllm, tensorrt-llm, lmdeploy and tgi using the following workload:
- Input length: 1000 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
- Output length: the corresponding output lengths of these 1000 prompts.
- Batch size: dynamically determined by vllm and the arrival pattern of the requests.
- Average QPS (queries per second): 4 for the 8B model and 2 for larger models. For each QPS, the arrival time of each query is drawn from a Poisson process (with a fixed random seed).
- Models: llama-3 8B, llama-3 70B, mixtral 8x7B.
- Evaluation metrics: throughput, TTFT (time to first token, mean and std), ITL (inter-token latency, mean and std).
Check nightly-tests.json artifact for more details.
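The Poisson arrival pattern described above can be sketched as follows. In a Poisson process with rate `qps`, the gaps between consecutive requests are exponentially distributed, and a fixed seed makes the schedule reproducible across engines. The function name `poisson_arrival_times` is illustrative, not the actual harness code:

```python
import random

def poisson_arrival_times(num_requests: int, qps: float, seed: int = 0) -> list[float]:
    """Generate request arrival times (in seconds) from a Poisson process.

    Inter-arrival gaps are exponential with mean 1/qps; fixing the seed
    makes the same schedule replayable against every engine.
    """
    rng = random.Random(seed)
    arrivals, t = [], 0.0
    for _ in range(num_requests):
        t += rng.expovariate(qps)  # exponential gap with mean 1/qps seconds
        arrivals.append(t)
    return arrivals

# 1000 requests at an average of 4 QPS, as in the 8B-model workload.
times = poisson_arrival_times(1000, qps=4.0, seed=0)
```

Because the seed is fixed, every engine sees requests at identical timestamps, so throughput and latency differences come from the engine rather than the arrival schedule.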
Known issues
- TGI v2.1 crashes when running the mixtral model; see tgi issue #2122.
- The transformers library is pinned to 4.41.2 to avoid lmdeploy's missing cache_position error; see lmdeploy issue #1885.
Plots
In the following plots, the error bar shows the standard error of the mean. A value of 0 means that the corresponding benchmark crashed.
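For reference, the standard error of the mean (SEM) shown by the error bars is the sample standard deviation divided by the square root of the sample count. A minimal sketch (the helper name and the sample values are illustrative):

```python
import math

def mean_std_sem(samples: list[float]) -> tuple[float, float, float]:
    """Return (mean, sample std, standard error of the mean)."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / (n - 1)  # sample variance
    std = math.sqrt(var)
    return mean, std, std / math.sqrt(n)  # SEM = std / sqrt(n)

# e.g. hypothetical per-request TTFT measurements in milliseconds
mean, std, sem = mean_std_sem([110.0, 95.0, 130.0, 105.0])
```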
Results
Test name | GPU | Successful req. | Tput (req/s) | Mean TTFT (ms) | Std TTFT (ms) | Mean ITL (ms) | Std ITL (ms) | Engine |
---|---|---|---|---|---|---|---|---|
tgi_llama8B_tp1_qps_4 | A100-SXM4-80GB | 500 | 3.74167 | 112.025 | 106.495 | 16.94 | 8.38153 | tgi |
tgi_llama70B_tp4_qps_2 | A100-SXM4-80GB | 500 | 1.85297 | 467.191 | 300.037 | 45.642 | 27.0579 | tgi |
tgi_mixtral8x7B_tp2_qps_2 | A100-SXM4-80GB | 500 | 1.83849 | 453.897 | 431.474 | 38.7525 | 51.2062 | tgi |
vllm_llama8B_tp1_qps_4 | A100-SXM4-80GB | 500 | 3.74688 | 20.8147 | 22.5057 | 16.2072 | 8.63551 | vllm |
vllm_llama70B_tp4_qps_2 | A100-SXM4-80GB | 500 | 1.89608 | 69.0084 | 54.7202 | 44.2541 | 28.4806 | vllm |
vllm_mixtral8x7B_tp2_qps_2 | A100-SXM4-80GB | 500 | 1.83253 | 120.103 | 274.588 | 40.9029 | 60.4394 | vllm |
trt_mixtral8x7B_tp2_qps_2 | A100-SXM4-80GB | 500 | 1.8629 | 103.729 | 24.0213 | 32.9333 | 6.1068 | trt |
trt_llama8B_tp1_qps_4 | A100-SXM4-80GB | 500 | 3.77129 | 51.1976 | 16.9842 | 13.5824 | 4.16044 | trt |
trt_llama70B_tp4_qps_2 | A100-SXM4-80GB | 500 | 1.88798 | 135.329 | 54.4486 | 34.9689 | 11.856 | trt |
lmdeploy_llama8B_tp1_qps_4 | A100-SXM4-80GB | 500 | 3.78979 | 20.9192 | 20.6185 | 12.5697 | 4.85207 | lmdeploy |
lmdeploy_llama70B_tp4_qps_2 | A100-SXM4-80GB | 500 | 1.91192 | 65.2403 | 53.6743 | 34.208 | 14.7309 | lmdeploy |
lmdeploy_mixtral8x7B_tp2_qps_2 | A100-SXM4-80GB | 500 | 1.7112 | 1482.03 | 500.372 | 60.8857 | 112.457 | lmdeploy |
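A quick way to read the table above is to compare throughput per workload. The snippet below transcribes the Tput (req/s) column and picks the leading engine for each model; the dictionary layout is just one possible representation:

```python
# Throughput (req/s) per (model, engine), transcribed from the results table.
tput = {
    ("llama8B",     "tgi"):      3.74167,
    ("llama8B",     "vllm"):     3.74688,
    ("llama8B",     "trt"):      3.77129,
    ("llama8B",     "lmdeploy"): 3.78979,
    ("llama70B",    "tgi"):      1.85297,
    ("llama70B",    "vllm"):     1.89608,
    ("llama70B",    "trt"):      1.88798,
    ("llama70B",    "lmdeploy"): 1.91192,
    ("mixtral8x7B", "tgi"):      1.83849,
    ("mixtral8x7B", "vllm"):     1.83253,
    ("mixtral8x7B", "trt"):      1.8629,
    ("mixtral8x7B", "lmdeploy"): 1.7112,
}

def best_engine(model: str) -> str:
    """Engine with the highest throughput for the given model."""
    return max(
        (e for m, e in tput if m == model),
        key=lambda e: tput[(model, e)],
    )

winners = {m: best_engine(m) for m in {m for m, _ in tput}}
```

Note that throughput alone does not tell the whole story: lmdeploy leads on throughput for the llama models but has by far the worst mean TTFT on mixtral, matching its known issue above.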
Bootstrap
curl -sSL https://raw.githubusercontent.com/vllm-project/buildkite-ci/main/scripts/kickoff-benchmark.sh | bash
Total Job Run Time: 2h 57m