
Update metric collection to incorporate the latest benchmark_serving.py

Passed in 6h 15m and blocked

Nightly benchmark

The goal of this benchmark is two-fold:

  • Performance clarity: show which engine (vllm, tensorrt-llm, lmdeploy, or tgi) leads in performance under which workload.
  • Reproducibility: anyone can rerun the exact same set of benchmarking commands inside the exact same docker image by following the reproduction instructions in reproduce.md.

Workload description

We benchmark vllm, tensorrt-llm, lmdeploy and tgi using the following workload:

  • Input length: 1000 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
  • Output length: the corresponding output lengths of these 1000 prompts.
  • Batch size: dynamically determined by vllm and the arrival pattern of the requests.
  • Average QPS (queries per second): 4 and 8 for the 8B model, 1 and 4 for the larger models. For each QPS, the arrival time of each query is drawn from a Poisson process (with a fixed random seed); see the sketch after this list.
  • Models: llama-3 8B, llama-3 70B, mixtral 8x7B.
  • Evaluation metrics: throughput, TTFT (time to first token; mean, median, and p99), ITL (inter-token latency; mean, median, and p99).
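
Since the arrival pattern and the latency statistics drive the whole comparison, here is a minimal Python sketch of both pieces, assuming only numpy. The function names and the fake gamma-distributed TTFT samples are illustrative, not taken from benchmark_serving.py.

```python
# Minimal sketch (not the benchmark harness itself) of the two mechanisms the
# workload relies on: Poisson-process request arrivals at a target average QPS
# with a fixed random seed, and mean/median/p99 aggregation of per-request
# latencies such as TTFT and ITL. All names here are illustrative.
import numpy as np


def poisson_arrival_times(num_requests: int, qps: float, seed: int = 0) -> np.ndarray:
    """Return absolute arrival times (seconds) for a Poisson process.

    Inter-arrival gaps of a Poisson process with rate `qps` are exponentially
    distributed with mean 1/qps; fixing the seed makes the schedule reproducible.
    """
    rng = np.random.default_rng(seed)
    gaps = rng.exponential(scale=1.0 / qps, size=num_requests)
    return np.cumsum(gaps)


def summarize_latencies(latencies_ms: list[float]) -> dict[str, float]:
    """Compute the mean / median / p99 statistics reported for TTFT and ITL."""
    arr = np.asarray(latencies_ms)
    return {
        "mean_ms": float(arr.mean()),
        "median_ms": float(np.percentile(arr, 50)),
        "p99_ms": float(np.percentile(arr, 99)),
    }


if __name__ == "__main__":
    # 1000 prompts at an average of 4 QPS, as in the 8B-model workload.
    arrivals = poisson_arrival_times(num_requests=1000, qps=4.0, seed=0)
    print(f"last request arrives at ~{arrivals[-1]:.1f}s")

    # Fake TTFT samples, only to show the shape of the reported statistics.
    fake_ttft_ms = np.random.default_rng(0).gamma(shape=2.0, scale=50.0, size=1000)
    print(summarize_latencies(fake_ttft_ms.tolist()))
```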

Results

Test name | GPU | Successful req. | Tput (req/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean ITL (ms) | Median ITL (ms) | P99 ITL (ms) | Engine
Pipeline steps and timings:

  • bootstrap: curl -sSL https://raw.githubusercontent.com/vllm-project/buildkite-ci/main/scripts/kickoff-benchmark.sh | bash (waited 53s, ran in 10s)
  • Kuntai Du unblocked 🚀 "Ready for comparing vllm against alternatives? This will take 4 hours."
  • A100 vllm step 10 (waited 1h 39m, ran in 1h 8m)
  • A100 sglang benchmark (waited 6m 37s, ran in 1h 29m)
  • A100 lmdeploy benchmark (waited 4h 55m, ran in 20m 24s)
  • A100 trt llama-8B (waited 2h 55m, ran in 31m 48s)
  • A100 trt llama-70B (waited 3h 27m, ran in 1h 14m)
  • Collect the results (waited 58m 54s, ran in 18s; see the sketch below)
  • Wait for container to be ready (A100)

Total Job Run Time: 4h 45m
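
The "Collect the results" step is where the per-engine numbers get merged into the table above. Below is a purely hypothetical Python sketch of that kind of aggregation; the results/*.json layout, the key names, and the collect() helper are assumptions for illustration, not the actual pipeline code.

```python
# Hypothetical sketch of a "Collect the results" step: merge per-engine JSON
# result files into rows matching the results table above. The directory layout
# ("results/*.json") and the key names are assumptions, not the real pipeline.
import glob
import json

COLUMNS = [
    "test_name", "gpu", "completed", "request_throughput",
    "mean_ttft_ms", "median_ttft_ms", "p99_ttft_ms",
    "mean_itl_ms", "median_itl_ms", "p99_itl_ms", "engine",
]


def collect(results_glob: str = "results/*.json") -> str:
    """Render one pipe-separated row per result file, in column order."""
    rows = []
    for path in sorted(glob.glob(results_glob)):
        with open(path) as f:
            record = json.load(f)
        rows.append(" | ".join(str(record.get(col, "n/a")) for col in COLUMNS))
    return "\n".join(rows)


if __name__ == "__main__":
    print(" | ".join(COLUMNS))
    print(collect())
```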