
Nightly benchmark

This benchmark has two main goals:

  • Performance clarity: show which engine (vllm, tensorrt-llm, lmdeploy, or tgi) leads in performance under which workload.
  • Reproducibility: anyone can run the exact same set of benchmarking commands inside the exact same docker image by following the instructions in reproduce.md.

Docker images

We benchmark vllm, tensorrt-llm, lmdeploy and tgi using the following docker images:

  • vllm/vllm-openai:v0.5.0.post1
  • nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
  • openmmlab/lmdeploy:v0.5.0
  • ghcr.io/huggingface/text-generation-inference:2.1

Hardware

One AWS node with 8x NVIDIA A100 GPUs.

Workload description

We benchmark vllm, tensorrt-llm, lmdeploy and tgi using the following workload:

  • Input length: randomly sample 500 prompts from the ShareGPT dataset (with a fixed random seed).
  • Output length: the corresponding output length of these 500 prompts.
  • Models: llama-3 8B, llama-3 70B, mixtral 8x7B.
  • Average QPS (queries per second): 4 for the small model (llama-3 8B) and 2 for the other two models. For each QPS, the arrival time of each query is drawn from a Poisson process (with a fixed random seed).
  • Evaluation metrics: throughput (higher is better), TTFT (time to first token, lower is better), ITL (inter-token latency, lower is better).
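The Poisson arrival schedule described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual script; the function name and seed value are made up. Inter-arrival gaps of a Poisson process with rate `qps` are exponentially distributed with mean 1/qps, and fixing the seed makes the schedule reproducible across engines.

```python
import random

def poisson_arrival_times(num_requests: int, qps: float, seed: int = 42) -> list[float]:
    """Generate request arrival times (in seconds) for a Poisson process
    with average rate `qps`, using a fixed seed for reproducibility."""
    rng = random.Random(seed)
    t = 0.0
    arrivals = []
    for _ in range(num_requests):
        # Exponentially distributed gap with mean 1/qps.
        t += rng.expovariate(qps)
        arrivals.append(t)
    return arrivals

# 500 requests at an average of 4 QPS, as in the llama-3 8B workload.
arrivals = poisson_arrival_times(500, qps=4.0)
```

Because the seed is fixed, every engine sees the identical arrival schedule, so differences in TTFT and ITL reflect the engine rather than the load pattern.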

Plots

In the following plots, each dot shows the mean and the error bar shows the standard error of the mean. A value of 0 means that the corresponding benchmark crashed.
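The plotted statistics can be computed as below. This is a small stdlib-only sketch (the real pipeline likely uses numpy); the function name and sample values are illustrative only.

```python
import math

def mean_and_sem(samples: list[float]) -> tuple[float, float]:
    """Return (mean, standard error of the mean) for a list of measurements.

    SEM = sample standard deviation / sqrt(n), using the n-1 (Bessel)
    correction for the sample variance.
    """
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / (n - 1)
    return mean, math.sqrt(var / n)

# Example: hypothetical per-run mean ITL measurements in ms.
m, sem = mean_and_sem([16.2, 16.8, 16.4, 16.9, 16.5])
```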

Benchmarking results

| Test name | GPU | Successful req. | Tput (req/s) | Mean TTFT (ms) | Std TTFT (ms) | Mean ITL (ms) | Std ITL (ms) | Input Tput (tok/s) | Output Tput (tok/s) | Engine |
|---|---|---|---|---|---|---|---|---|---|---|
| tgi_llama8B_tp1_qps_4 | A100-SXM4-80GB | 500 | 3.74425 | 108.705 | 100.847 | 16.6054 | 8.11108 | 755.553 | 487.988 | tgi |
| tgi_llama70B_tp4_qps_2 | A100-SXM4-80GB | 500 | 1.88792 | 471.147 | 299.21 | 45.5061 | 26.9567 | 380.963 | 335.166 | tgi |
| tgi_mixtral8x7B_tp2_qps_2 | A100-SXM4-80GB | 500 | 1.83876 | 442.111 | 431.699 | 38.7051 | 52.0669 | 417.204 | 396.172 | tgi |
| trt_llama70B_tp4_qps_2 | A100-SXM4-80GB | 500 | 1.88808 | 135.072 | 52.7508 | 34.9815 | 11.7329 | 380.996 | 284.877 | trt |
| trt_llama8B_tp1_qps_4 | A100-SXM4-80GB | 500 | 3.77042 | 51.2959 | 16.967 | 13.6497 | 4.22742 | 760.832 | 575.75 | trt |
| trt_mixtral8x7B_tp2_qps_2 | A100-SXM4-80GB | 500 | 1.8578 | 103.746 | 23.6912 | 32.9532 | 6.10077 | 421.524 | 482.858 | trt |
| vllm_llama8B_tp1_qps_4 | A100-SXM4-80GB | 500 | 3.74554 | 20.585 | 18.9562 | 16.2997 | 8.53774 | 755.813 | 499.251 | vllm |
| vllm_llama70B_tp4_qps_2 | A100-SXM4-80GB | 500 | 1.89946 | 67.8226 | 54.8692 | 43.266 | 27.2318 | 383.293 | 333.633 | vllm |
| vllm_mixtral8x7B_tp2_qps_2 | A100-SXM4-80GB | 500 | 1.83413 | 114.966 | 268.457 | 39.1624 | 55.9786 | 416.154 | 395.997 | vllm |
| lmdeploy_llama8B_tp1_qps_4 | A100-SXM4-80GB | 500 | 3.79197 | 20.9606 | 20.25 | 12.5817 | 4.86262 | 765.181 | 494.943 | lmdeploy |
| lmdeploy_llama70B_tp4_qps_2 | A100-SXM4-80GB | 500 | 1.91245 | 64.8334 | 53.9591 | 34.2332 | 14.7409 | 385.913 | 341.49 | lmdeploy |
| lmdeploy_mixtral8x7B_tp2_qps_2 | A100-SXM4-80GB | 500 | 1.74858 | 1457.06 | 477.169 | 60.1654 | 110.52 | 396.743 | 389.27 | lmdeploy |
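The TTFT and ITL figures in the table can be derived from per-token completion timestamps recorded for each request. The sketch below shows one plausible way to do this; the function name and timestamps are illustrative, not taken from the benchmark harness.

```python
def request_metrics(send_time: float, token_times: list[float]) -> tuple[float, float]:
    """Compute (TTFT, mean ITL), both in milliseconds, for one request.

    `send_time` is when the request was issued; `token_times` are the
    timestamps (seconds) at which each output token completed.
    TTFT is the delay to the first token; ITL is the gap between
    consecutive tokens.
    """
    ttft_ms = (token_times[0] - send_time) * 1000.0
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    mean_itl_ms = (sum(gaps) / len(gaps)) * 1000.0 if gaps else 0.0
    return ttft_ms, mean_itl_ms

# Hypothetical request: sent at t=0, tokens arriving every 20 ms after a 50 ms TTFT.
ttft, itl = request_metrics(0.0, [0.05, 0.07, 0.09])
```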
The benchmark run is bootstrapped with:

  curl -sSL https://raw.githubusercontent.com/vllm-project/buildkite-ci/main/scripts/kickoff-benchmark.sh | bash