Build based on commit bb7475eec.


Nightly benchmark

The main goal of this benchmark is twofold:

  • Performance clarity: show which engine (vllm, tensorrt-llm, lmdeploy, or tgi) leads in performance under which workload.
  • Reproducibility: anyone can run the exact same set of benchmarking commands inside the exact same Docker image by following the reproduction instructions in reproduce.md.

Versions

We benchmark vllm, tensorrt-llm, lmdeploy and tgi using the following docker images:

  • vllm/vllm-openai:v0.5.0.post1
  • nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
  • openmmlab/lmdeploy:v0.5.0
  • ghcr.io/huggingface/text-generation-inference:2.1

Check the nightly-pipeline.yaml artifact for more details.

Workload description

We benchmark vllm, tensorrt-llm, lmdeploy and tgi using the following workload:

  • Input length: 1000 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
  • Output length: the corresponding output lengths of these 1000 prompts.
  • Batch size: dynamically determined by vllm and the arrival pattern of the requests.
  • Average QPS (queries per second): 4 for the 8B model and 2 for the larger models. For each QPS, the arrival time of each query is drawn from a Poisson process (with a fixed random seed); see the sketch after this list.
  • Models: llama-3 8B, llama-3 70B, mixtral 8x7B.
  • Evaluation metrics: throughput, TTFT (time to first token, with mean and std), and ITL (inter-token latency, with mean and std).

Check the nightly-tests.json artifact for more details.
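
To make the arrival pattern concrete, below is a minimal sketch of how Poisson arrival times with a fixed seed can be generated: in a Poisson process the gaps between consecutive requests are exponentially distributed with mean 1/QPS. The function name, seed value, and use of numpy are illustrative assumptions, not the actual harness implementation.

```python
import numpy as np

def poisson_arrival_times(num_requests: int, qps: float, seed: int = 0) -> np.ndarray:
    """Return arrival timestamps (in seconds) for a Poisson process at `qps`.

    Inter-arrival gaps are exponentially distributed with mean 1/qps;
    fixing the seed makes the schedule reproducible across runs.
    """
    rng = np.random.default_rng(seed)                      # hypothetical seed choice
    inter_arrival = rng.exponential(scale=1.0 / qps, size=num_requests)
    return np.cumsum(inter_arrival)

# Example: 1000 prompts at an average of 4 QPS (the 8B-model setting).
arrivals = poisson_arrival_times(num_requests=1000, qps=4.0, seed=0)
print(f"last request arrives at ~{arrivals[-1]:.1f} s")
```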

Known crashes

  • TGI v2.1 crashes when running the Mixtral model; see TGI PR #2122.

Results

| Test name             | GPU            | Successful req. | Tput (req/s) | Mean TTFT (ms) | Std TTFT (ms) | Mean ITL (ms) | Std ITL (ms) | Engine |
|-----------------------|----------------|-----------------|--------------|----------------|---------------|---------------|--------------|--------|
| tgi_llama8B_tp1_qps_4 | A100-SXM4-80GB | 500             | 3.7438       | 106.226        | 100.277       | 16.6865       | 8.14355      | tgi    |
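
To make the TTFT and ITL columns concrete, here is a minimal sketch of how these per-request metrics can be computed from token arrival timestamps. The helper name and the toy timestamps are hypothetical; the real harness may record timings differently.

```python
import numpy as np

def request_metrics(request_start: float, token_timestamps: list[float]):
    """TTFT and inter-token latencies (ITL) for a single request.

    `token_timestamps` are the wall-clock times at which each output token
    arrived; `request_start` is when the request was issued.
    """
    ttft = token_timestamps[0] - request_start   # time to first token
    itls = np.diff(token_timestamps)             # gaps between consecutive tokens
    return ttft, itls

# Toy example with two requests (timestamps in seconds, made up for illustration).
results = [
    (0.00, [0.11, 0.13, 0.16, 0.18]),
    (0.25, [0.34, 0.37, 0.39]),
]
ttfts, all_itls = [], []
for start, stamps in results:
    ttft, itls = request_metrics(start, stamps)
    ttfts.append(ttft)
    all_itls.extend(itls)

print(f"mean TTFT {np.mean(ttfts) * 1e3:.1f} ms, std {np.std(ttfts) * 1e3:.1f} ms")
print(f"mean ITL  {np.mean(all_itls) * 1e3:.1f} ms, std {np.std(all_itls) * 1e3:.1f} ms")
```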

Plots

In the following plots, the error bars show the standard error of the mean (SEM); a short sketch of the computation follows.
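
The standard error of the mean is the sample standard deviation divided by the square root of the number of samples. A minimal sketch, with hypothetical TTFT values:

```python
import numpy as np

def standard_error_of_mean(samples: np.ndarray) -> float:
    # SEM = sample standard deviation / sqrt(number of samples)
    return np.std(samples, ddof=1) / np.sqrt(len(samples))

ttft_ms = np.array([106.2, 98.4, 120.9, 87.5])  # hypothetical per-request TTFTs
print(f"SEM = {standard_error_of_mean(ttft_ms):.2f} ms")
```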

[Plots: benchmarking results]