Build based on commit bb7475eec.


Nightly benchmark

The main goal of this benchmark is twofold:

  • Performance clarity: show which engine (vllm, tensorrt-llm, lmdeploy, or tgi) leads in performance under which workload.
  • Reproducibility: anyone can run the exact same set of benchmarking commands inside the exact same Docker image by following the reproduction instructions in reproduce.md.

Versions

We benchmark vllm, tensorrt-llm, lmdeploy and tgi using the following docker images:

  • vllm/vllm-openai:v0.5.0.post1
  • nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
  • openmmlab/lmdeploy:v0.5.0
  • ghcr.io/huggingface/text-generation-inference:2.1

Check the nightly-pipeline.yaml artifact for more details.

Workload description

We benchmark vllm, tensorrt-llm, lmdeploy and tgi using the following workload:

  • Input length: 1000 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
  • Output length: the corresponding output lengths of these 1000 prompts.
  • Batch size: dynamically determined by vllm and the arrival pattern of the requests.
  • Average QPS (queries per second): 4 for the 8B model and 2 for the larger models. For each QPS, the arrival time of each query is drawn from a Poisson process (with a fixed random seed); see the sketch after this list.
  • Models: llama-3 8B, llama-3 70B, mixtral 8x7B.
  • Evaluation metrics: throughput, TTFT (time to first token, with mean and std), and ITL (inter-token latency, with mean and std).

Check the nightly-tests.json artifact for more details.
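
To make the arrival pattern concrete, below is a minimal sketch of how Poisson arrival times with a fixed seed can be generated: in a Poisson process the gaps between consecutive requests are exponentially distributed with mean 1/QPS. The function name, seed value, and use of numpy are illustrative assumptions, not the actual harness implementation.

```python
import numpy as np

def poisson_arrival_times(num_requests: int, qps: float, seed: int = 0) -> np.ndarray:
    """Return arrival timestamps (in seconds) for a Poisson process at `qps`.

    Inter-arrival gaps are exponentially distributed with mean 1/qps;
    fixing the seed makes the schedule reproducible across runs.
    """
    rng = np.random.default_rng(seed)                      # hypothetical seed choice
    inter_arrival = rng.exponential(scale=1.0 / qps, size=num_requests)
    return np.cumsum(inter_arrival)

# Example: 1000 prompts at an average of 4 QPS (the 8B-model setting).
arrivals = poisson_arrival_times(num_requests=1000, qps=4.0, seed=0)
print(f"last request arrives at ~{arrivals[-1]:.1f} s")
```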

Known crashes

  • TGI v2.1 crashes when running the Mixtral model; see TGI PR #2122.

Results

| Test name             | GPU            | Successful req. | Tput (req/s) | Mean TTFT (ms) | Std TTFT (ms) | Mean ITL (ms) | Std ITL (ms) | Engine |
|-----------------------|----------------|-----------------|--------------|----------------|---------------|---------------|--------------|--------|
| tgi_llama8B_tp1_qps_4 | A100-SXM4-80GB | 500             | 3.7438       | 106.226        | 100.277       | 16.6865       | 8.14355      | tgi    |
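
To make the TTFT and ITL columns concrete, here is a minimal sketch of how these per-request metrics can be computed from token arrival timestamps. The helper name and the toy timestamps are hypothetical; the real harness may record timings differently.

```python
import numpy as np

def request_metrics(request_start: float, token_timestamps: list[float]):
    """TTFT and inter-token latencies (ITL) for a single request.

    `token_timestamps` are the wall-clock times at which each output token
    arrived; `request_start` is when the request was issued.
    """
    ttft = token_timestamps[0] - request_start   # time to first token
    itls = np.diff(token_timestamps)             # gaps between consecutive tokens
    return ttft, itls

# Toy example with two requests (timestamps in seconds, made up for illustration).
results = [
    (0.00, [0.11, 0.13, 0.16, 0.18]),
    (0.25, [0.34, 0.37, 0.39]),
]
ttfts, all_itls = [], []
for start, stamps in results:
    ttft, itls = request_metrics(start, stamps)
    ttfts.append(ttft)
    all_itls.extend(itls)

print(f"mean TTFT {np.mean(ttfts) * 1e3:.1f} ms, std {np.std(ttfts) * 1e3:.1f} ms")
print(f"mean ITL  {np.mean(all_itls) * 1e3:.1f} ms, std {np.std(all_itls) * 1e3:.1f} ms")
```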

Plots

In the following plots, the error bars show the standard error of the mean (SEM); a short sketch of the computation follows.
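
The standard error of the mean is the sample standard deviation divided by the square root of the number of samples. A minimal sketch, with hypothetical TTFT values:

```python
import numpy as np

def standard_error_of_mean(samples: np.ndarray) -> float:
    # SEM = sample standard deviation / sqrt(number of samples)
    return np.std(samples, ddof=1) / np.sqrt(len(samples))

ttft_ms = np.array([106.2, 98.4, 120.9, 87.5])  # hypothetical per-request TTFTs
print(f"SEM = {standard_error_of_mean(ttft_ms):.2f} ms")
```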

[Plots: benchmarking results]