
Update metric collection to incorporate the latest benchmark_serving.py

Passed in 6h 15m and blocked

Nightly benchmark

The goal of this benchmark is two-fold:

  • Performance clarity: show which engine (vllm, tensorrt-llm, lmdeploy, or tgi) leads in performance under which workload.
  • Reproducibility: anyone can rerun the exact same set of benchmarking commands inside the exact same docker image by following the reproduction instructions in reproduce.md.

Workload description

We benchmark vllm, tensorrt-llm, lmdeploy and tgi using the following workload:

  • Input length: 1000 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
  • Output length: the corresponding output lengths of these 1000 prompts.
  • Batch size: dynamically determined by vllm and the arrival pattern of the requests.
  • Average QPS (queries per second): 4 and 8 for the 8B model, 1 and 4 for the larger models. For each QPS, the arrival time of each query is drawn from a Poisson process (with a fixed random seed); see the sketch after this list.
  • Models: llama-3 8B, llama-3 70B, mixtral 8x7B.
  • Evaluation metrics: throughput, TTFT (time to first token; mean, median, and p99), ITL (inter-token latency; mean, median, and p99).
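
Since the arrival pattern and the latency statistics drive the whole comparison, here is a minimal Python sketch of both pieces, assuming only numpy. The function names and the fake gamma-distributed TTFT samples are illustrative, not taken from benchmark_serving.py.

```python
# Minimal sketch (not the benchmark harness itself) of the two mechanisms the
# workload relies on: Poisson-process request arrivals at a target average QPS
# with a fixed random seed, and mean/median/p99 aggregation of per-request
# latencies such as TTFT and ITL. All names here are illustrative.
import numpy as np


def poisson_arrival_times(num_requests: int, qps: float, seed: int = 0) -> np.ndarray:
    """Return absolute arrival times (seconds) for a Poisson process.

    Inter-arrival gaps of a Poisson process with rate `qps` are exponentially
    distributed with mean 1/qps; fixing the seed makes the schedule reproducible.
    """
    rng = np.random.default_rng(seed)
    gaps = rng.exponential(scale=1.0 / qps, size=num_requests)
    return np.cumsum(gaps)


def summarize_latencies(latencies_ms: list[float]) -> dict[str, float]:
    """Compute the mean / median / p99 statistics reported for TTFT and ITL."""
    arr = np.asarray(latencies_ms)
    return {
        "mean_ms": float(arr.mean()),
        "median_ms": float(np.percentile(arr, 50)),
        "p99_ms": float(np.percentile(arr, 99)),
    }


if __name__ == "__main__":
    # 1000 prompts at an average of 4 QPS, as in the 8B-model workload.
    arrivals = poisson_arrival_times(num_requests=1000, qps=4.0, seed=0)
    print(f"last request arrives at ~{arrivals[-1]:.1f}s")

    # Fake TTFT samples, only to show the shape of the reported statistics.
    fake_ttft_ms = np.random.default_rng(0).gamma(shape=2.0, scale=50.0, size=1000)
    print(summarize_latencies(fake_ttft_ms.tolist()))
```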

Results

Test name | GPU | Successful req. | Tput (req/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean ITL (ms) | Median ITL (ms) | P99 ITL (ms) | Engine
Pipeline steps and timings:

  • bootstrap: curl -sSL https://raw.githubusercontent.com/vllm-project/buildkite-ci/main/scripts/kickoff-benchmark.sh | bash (waited 53s, ran in 10s)
  • Kuntai Du unblocked 🚀 "Ready for comparing vllm against alternatives? This will take 4 hours."
  • A100 vllm step 10 (waited 1h 39m, ran in 1h 8m)
  • A100 sglang benchmark (waited 6m 37s, ran in 1h 29m)
  • A100 lmdeploy benchmark (waited 4h 55m, ran in 20m 24s)
  • A100 trt llama-8B (waited 2h 55m, ran in 31m 48s)
  • A100 trt llama-70B (waited 3h 27m, ran in 1h 14m)
  • Collect the results (waited 58m 54s, ran in 18s; see the sketch below)
  • Wait for container to be ready (A100)

Total Job Run Time: 4h 45m
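
The "Collect the results" step is where the per-engine numbers get merged into the table above. Below is a purely hypothetical Python sketch of that kind of aggregation; the results/*.json layout, the key names, and the collect() helper are assumptions for illustration, not the actual pipeline code.

```python
# Hypothetical sketch of a "Collect the results" step: merge per-engine JSON
# result files into rows matching the results table above. The directory layout
# ("results/*.json") and the key names are assumptions, not the real pipeline.
import glob
import json

COLUMNS = [
    "test_name", "gpu", "completed", "request_throughput",
    "mean_ttft_ms", "median_ttft_ms", "p99_ttft_ms",
    "mean_itl_ms", "median_itl_ms", "p99_itl_ms", "engine",
]


def collect(results_glob: str = "results/*.json") -> str:
    """Render one pipe-separated row per result file, in column order."""
    rows = []
    for path in sorted(glob.glob(results_glob)):
        with open(path) as f:
            record = json.load(f)
        rows.append(" | ".join(str(record.get(col, "n/a")) for col in COLUMNS))
    return "\n".join(rows)


if __name__ == "__main__":
    print(" | ".join(COLUMNS))
    print(collect())
```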