Performance Benchmark
Update metric collection to incorporate the latest benchmark_serving.py
Pipeline steps:
- bootstrap
- 🚀 Ready to compare vllm against alternatives? This will take 4 hours.
- A100 vllm benchmark
- A100 sglang benchmark
- A100 lmdeploy benchmark
- A100 trt llama-8B
- A100 trt llama-70B
- Collect the results
- 🚀 Check the results!
- Wait for container to be ready (A100)
Nightly benchmark
The main goal of this benchmarking is two-fold:
- Performance clarity: show which engine (vllm, tensorrt-llm, lmdeploy, or tgi) leads in performance under which workload.
- Reproducibility: anyone can run the exact same set of benchmarking commands inside the exact same Docker container by following the reproduction instructions in reproduce.md.
Workload description
We benchmark vllm, tensorrt-llm, lmdeploy and tgi using the following workload:
- Input length: randomly sample 1000 prompts from the ShareGPT dataset (with a fixed random seed).
- Output length: the corresponding output length of these 1000 prompts.
- Batch size: dynamically determined by vllm and the arrival pattern of the requests.
- Average QPS (queries per second): 4 and 8 for the 8B model, 1 and 4 for the larger models. For each QPS, the arrival time of each query is drawn from a Poisson process (with a fixed random seed); see the sketch after this list.
- Models: llama-3 8B, llama-3 70B, mixtral 8x7B.
- Evaluation metrics: throughput, TTFT (time to first token; mean, median, and p99), and ITL (inter-token latency; mean, median, and p99).
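
The Poisson arrival pattern mentioned above can be realized by sampling exponentially distributed inter-arrival gaps. Below is a minimal, hypothetical sketch of that idea; the function name and seed value are illustrative and are not taken from benchmark_serving.py.

```python
import numpy as np

def sample_arrival_times(num_requests: int, qps: float, seed: int = 0) -> np.ndarray:
    """Draw absolute arrival times (seconds) for a Poisson process with average `qps`.

    Inter-arrival gaps of a Poisson process are exponentially distributed with
    mean 1 / qps; their cumulative sum gives the arrival time of each request.
    """
    rng = np.random.default_rng(seed)  # fixed seed -> reproducible arrival pattern
    gaps = rng.exponential(scale=1.0 / qps, size=num_requests)
    return np.cumsum(gaps)

# Example: 1000 requests at an average of 4 QPS should span roughly 250 seconds.
arrivals = sample_arrival_times(num_requests=1000, qps=4.0, seed=0)
print(f"first arrival: {arrivals[0]:.3f}s, last arrival: {arrivals[-1]:.1f}s")
```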
Results
| Test name | GPU | Successful req. | Tput (req/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean ITL (ms) | Median ITL (ms) | P99 ITL (ms) | Engine |
|---|---|---|---|---|---|---|---|---|---|---|
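
For reference, the TTFT and ITL columns above could be aggregated from per-request measurements roughly as in the sketch below. The variable names are hypothetical and this is not the actual aggregation code of benchmark_serving.py; it only assumes per-request first-token latencies and pooled inter-token gaps have already been collected.

```python
import numpy as np

def summarize_run(ttft_s: list[float], itl_s: list[float], duration_s: float) -> dict:
    """Turn raw latency samples into the columns of the results table.

    ttft_s: time to first token per successful request, in seconds.
    itl_s: inter-token latency samples pooled across all requests, in seconds.
    duration_s: wall-clock duration of the benchmark run, in seconds.
    """
    ttft_ms = np.asarray(ttft_s) * 1000.0
    itl_ms = np.asarray(itl_s) * 1000.0
    return {
        "Successful req.": len(ttft_s),
        "Tput (req/s)": len(ttft_s) / duration_s,
        "Mean TTFT (ms)": float(ttft_ms.mean()),
        "Median TTFT (ms)": float(np.median(ttft_ms)),
        "P99 TTFT (ms)": float(np.percentile(ttft_ms, 99)),
        "Mean ITL (ms)": float(itl_ms.mean()),
        "Median ITL (ms)": float(np.median(itl_ms)),
        "P99 ITL (ms)": float(np.percentile(itl_ms, 99)),
    }
```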
bootstrap command:
curl -sSL https://raw.githubusercontent.com/vllm-project/buildkite-ci/main/scripts/kickoff-benchmark.sh | bash
