Deploy website
Deploys website (docs and shortlinks)
Nightly benchmark
This benchmark has two main goals:
- Performance clarity: show which engine (vllm, tensorrt-llm, lmdeploy, or tgi) leads in performance under which workload.
- Reproducibility: anyone can run the exact same set of benchmarking commands inside the exact same docker image by following the instructions in reproduce.md.
Versions
We benchmark vllm, tensorrt-llm, lmdeploy and tgi using the following docker images:
- vllm/vllm-openai:v0.5.0.post1
- nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
- openmmlab/lmdeploy:v0.5.0
- ghcr.io/huggingface/text-generation-inference:2.1
See the nightly-pipeline.yaml artifact for more details.
Workload description
We benchmark vllm, tensorrt-llm, lmdeploy and tgi using the following workload:
- Input length: 1000 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
- Output length: the corresponding output lengths of these 1000 prompts.
- Batch size: dynamically determined by vllm and the arrival pattern of the requests.
- Average QPS (queries per second): 4 for the 8B model and 2 for the larger models. For each QPS, the arrival time of each query is drawn from a Poisson process (with a fixed random seed).
- Models: llama-3 8B, llama-3 70B, mixtral 8x7B.
- Evaluation metrics: throughput, TTFT (time to first token, with mean and std), and ITL (inter-token latency, with mean and std).
See the nightly-tests.json artifact for more details.
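As an illustration of the arrival pattern described above, the following is a minimal sketch of how send times for a Poisson process at a given average QPS could be generated with a fixed seed. It is not taken from the benchmark scripts: the function name `poisson_arrival_times`, its parameters, and the seed value are assumptions made for this example.

```python
import random


def poisson_arrival_times(num_requests, qps, seed=0):
    """Return cumulative send times (in seconds) for a Poisson arrival process.

    Hypothetical helper for illustration; not part of the benchmark code.
    """
    rng = random.Random(seed)  # fixed seed: the same schedule on every run
    current_time = 0.0
    times = []
    for _ in range(num_requests):
        # The gaps between arrivals of a Poisson process with rate `qps`
        # are exponentially distributed with mean 1 / qps seconds.
        current_time += rng.expovariate(qps)
        times.append(current_time)
    return times


# 1000 requests at an average of 4 queries per second (the 8B setting).
schedule = poisson_arrival_times(num_requests=1000, qps=4, seed=0)
```

Because the generator is seeded, every engine can be driven with an identical request schedule, which is what makes the comparison repeatable.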
Known crashes
- TGI v2.1 crashes when running the mixtral model; see TGI PR #2122.
Results
Test name | GPU | Successful req. | Tput (req/s) | Mean TTFT (ms) | Std TTFT (ms) | Mean ITL (ms) | Std ITL (ms) | Engine |
---|---|---|---|---|---|---|---|---|
tgi_llama8B_tp1_qps_4 | A100-SXM4-80GB | 500 | 3.7438 | 106.226 | 100.277 | 16.6865 | 8.14355 | tgi |
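For readers unfamiliar with the TTFT and ITL columns, here is a small sketch of how these two metrics are conventionally derived from per-token timestamps for a single request. The helper name `ttft_and_itl` and the timestamps are hypothetical, and the benchmark's own measurement code may differ in detail.

```python
def ttft_and_itl(send_time, token_times):
    """Compute time to first token and inter-token latencies for one request.

    Hypothetical helper for illustration. All times are in seconds.
    """
    # TTFT: delay between sending the request and receiving the first token.
    ttft = token_times[0] - send_time
    # ITL: gaps between consecutive tokens after the first one.
    itl = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return ttft, itl


# Made-up timestamps: request sent at t=0, four tokens received afterwards.
ttft, itl = ttft_and_itl(send_time=0.0, token_times=[0.10, 0.12, 0.14, 0.17])
ttft_ms = ttft * 1000  # the table reports milliseconds
```

The mean and std reported in the table would then be aggregated over all successful requests.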
Plots
In the following plots, the error bars show the standard error of the mean.
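The standard error of the mean is the standard deviation divided by the square root of the sample size. The sketch below applies that formula to the TTFT figures from the results table, assuming the sample size is the number of successful requests (500); that choice of sample size, and the helper name `standard_error`, are assumptions made for this example rather than details confirmed by the benchmark.

```python
import math


def standard_error(std, n):
    """Standard error of the mean for a sample of size n with std deviation std."""
    return std / math.sqrt(n)


# Std TTFT (ms) and successful request count for tgi_llama8B_tp1_qps_4.
sem_ttft_ms = standard_error(std=100.277, n=500)
```

Under these assumptions the TTFT error bar for this row would be roughly 4.5 ms, far smaller than the 100 ms standard deviation, because it describes uncertainty in the mean rather than the spread of individual requests.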
