Performance Benchmark
This run compares vLLM against several alternatives (the full pipeline takes about 4 hours) through the following steps:
- bootstrap
- A100 trt benchmark
- A100 lmdeploy benchmark
- A100 vllm benchmark
- A100 tgi benchmark
- Plot
Latency tests
- Input length: 32 tokens.
- Output length: 128 tokens.
- Batch size: fixed (8).
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- Evaluation metrics: end-to-end latency (mean, median, p99).
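The mean, median, and p99 columns in the table below can be reproduced from a list of per-iteration end-to-end latencies. A minimal sketch (the helper name is ours, not part of vLLM's benchmark scripts):

```python
import numpy as np

def latency_summary(latencies_ms):
    """Mean / median / p99 over per-iteration end-to-end latencies (in ms)."""
    lat = np.asarray(list(latencies_ms), dtype=float)
    return {
        "mean": float(lat.mean()),
        "median": float(np.median(lat)),
        # np.percentile uses linear interpolation between ranks by default
        "p99": float(np.percentile(lat, 99)),
    }

# Example with 100 synthetic latencies of 1..100 ms
stats = latency_summary(range(1, 101))
```

With the synthetic 1..100 ms data, mean and median are both 50.5 and p99 interpolates to 99.01.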
Test name | GPU | Mean latency (ms) | Median latency (ms) | P99 latency (ms) |
---|---|---|---|---|
latency_llama8B_tp1 | H200 | 834.059 | 834.014 | 834.846 |
latency_llama70B_tp4 | H200 | 2075.08 | 2075.36 | 2076.72 |
latency_mixtral8x7B_tp2 | H200 | 1921.1 | 1920.16 | 1932.79 |
Throughput tests
- Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
- Output length: the corresponding output lengths of these 200 prompts.
- Batch size: dynamically determined by vLLM to achieve maximum throughput.
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- Evaluation metrics: throughput.
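Throughput here is simply the number of completed requests divided by wall-clock time; a sketch of that measurement (the timing wrapper and handler are illustrative, not the actual harness):

```python
import time

def measure_throughput(handle_request, requests):
    """Requests per second: completed requests / elapsed wall-clock time."""
    start = time.perf_counter()
    for req in requests:
        handle_request(req)  # stand-in for submitting a request to the engine
    elapsed = time.perf_counter() - start
    return len(requests) / elapsed

# Example with a trivial stand-in handler and 200 dummy requests
tput = measure_throughput(lambda r: None, list(range(200)))
```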
Test name | GPU | Tput (req/s) |
---|---|---|
throughput_llama8B_tp1 | H200 | 21.8334 |
throughput_mixtral8x7B_tp2 | H200 | 8.04811 |
throughput_llama70B_tp4 | H200 | 9.74671 |
Serving tests
- Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
- Output length: the corresponding output lengths of these 200 prompts.
- Batch size: dynamically determined by vLLM and the arrival pattern of the requests.
- Average QPS (queries per second): 1, 4, 16, and inf. QPS = inf means all requests arrive at once; for the other QPS values, the arrival time of each query is drawn from a Poisson process (with a fixed random seed).
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- We also added a speculative decoding test for llama-3 70B at QPS 2.
- Evaluation metrics: throughput, TTFT (time to first token; mean, median, and p99), ITL (inter-token latency; mean, median, and p99).
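The Poisson arrival pattern and the TTFT/ITL metrics above can be sketched as follows (the helper names are ours; a Poisson process is generated from exponential inter-arrival gaps, and a fixed seed makes the schedule reproducible):

```python
import numpy as np

def poisson_arrival_times(num_requests, qps, seed=0):
    """Arrival timestamps: exponential inter-arrival gaps with mean 1/qps."""
    rng = np.random.default_rng(seed)
    gaps = rng.exponential(scale=1.0 / qps, size=num_requests)
    return np.cumsum(gaps)

def ttft_and_itl(request_start, token_timestamps):
    """TTFT = delay until the first token; ITL = gaps between consecutive tokens."""
    ttft = token_timestamps[0] - request_start
    itl = np.diff(token_timestamps)
    return ttft, itl

arrivals = poisson_arrival_times(200, qps=4.0)          # reproducible schedule
ttft, itl = ttft_and_itl(0.0, [0.05, 0.07, 0.10])       # toy token timestamps
```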
Test name | GPU | Tput (req/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean ITL (ms) | Median ITL (ms) | P99 ITL (ms) |
---|---|---|---|---|---|---|---|---|
serving_llama70B_tp4_sharegpt_qps_1 | H200 | 0.989745 | 65.8942 | 58.6838 | 117.731 | 16.5538 | 15.9126 | 57.2564 |
serving_mixtral8x7B_tp2_sharegpt_qps_inf | H200 | 8.30144 | 2212.01 | 2693.16 | 2705.08 | 27.5387 | 25.1811 | 49.9445 |
serving_llama70B_tp4_sharegpt_qps_16 | H200 | 7.94188 | 129.844 | 118.778 | 245.186 | 26.8641 | 20.6569 | 73.1271 |
serving_llama70B_tp4_sharegpt_qps_inf | H200 | 9.83461 | 2956.12 | 2966.79 | 5737.24 | 28.5089 | 22.5366 | 60.9162 |
serving_mixtral8x7B_tp2_sharegpt_qps_1 | H200 | 0.994574 | 46.1188 | 41.2574 | 86.9205 | 12.7744 | 12.0251 | 38.6543 |
serving_llama8B_tp1_sharegpt_qps_1 | H200 | 1.00721 | 23.7952 | 22.2514 | 42.4458 | 6.26858 | 6.18391 | 6.71771 |
serving_llama8B_tp1_sharegpt_qps_4 | H200 | 3.86665 | 26.4758 | 23.9896 | 44.6248 | 6.67286 | 6.39768 | 20.6561 |
serving_llama8B_tp1_sharegpt_qps_16 | H200 | 12.1697 | 36.8132 | 30.8492 | 103.342 | 9.13766 | 7.90394 | 22.6719 |
serving_llama8B_tp1_sharegpt_qps_inf | H200 | 22.132 | 1311.18 | 1297.12 | 2298.27 | 12.2979 | 11.3569 | 22.8169 |
serving_llama70B_tp4_sharegpt_qps_4 | H200 | 3.44712 | 77.9262 | 64.1089 | 153.266 | 19.255 | 17.1243 | 60.4741 |
serving_mixtral8x7B_tp2_sharegpt_qps_16 | H200 | 6.64302 | 68.5915 | 63.0198 | 151.21 | 30.9852 | 24.344 | 151.034 |
serving_mixtral8x7B_tp2_sharegpt_qps_4 | H200 | 3.25054 | 53.1555 | 48.8058 | 93.9642 | 23.7428 | 21.1877 | 56.6531 |
JSON version of the benchmarking tables
This section contains the data of the markdown tables above in JSON format. You can load the benchmarking tables into pandas DataFrames as follows:
```python
import json

import pandas as pd

# Paste the full JSON string (reproduced below) in place of the placeholder.
benchmarking_results_json = """The json string"""
benchmarking_results = json.loads(benchmarking_results_json)

# One DataFrame per benchmark category.
latency_results = pd.DataFrame.from_dict(benchmarking_results["latency"])
throughput_results = pd.DataFrame.from_dict(benchmarking_results["throughput"])
serving_results = pd.DataFrame.from_dict(benchmarking_results["serving"])
```
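For a self-contained check of this loading pattern, here is the same code run against a tiny stand-in JSON string (one latency row only; the real string is reproduced below):

```python
import json

import pandas as pd

# Stand-in for the full benchmarking JSON: a single latency row.
demo_json = """{"latency": {"Test name": {"0": "latency_llama8B_tp1"}, "GPU": {"0": "H200"}, "Mean latency (ms)": {"0": 834.06}}}"""
demo_results = json.loads(demo_json)
latency_df = pd.DataFrame.from_dict(demo_results["latency"])
# latency_df is a 1-row DataFrame with columns: Test name, GPU, Mean latency (ms)
```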
The json string for all benchmarking tables:
{"latency": {"Test name": {"0": "latency_llama8B_tp1", "1": "latency_llama70B_tp4", "2": "latency_mixtral8x7B_tp2"}, "GPU": {"0": "H200", "1": "H200", "2": "H200"}, "Mean latency (ms)": {"0": 834.0585705824196, "1": 2075.0775761902332, "2": 1921.0956402122974}, "Median latency (ms)": {"0": 834.0138993225992, "1": 2075.3575819544494, "2": 1920.164190698415}, "P99 latency (ms)": {"0": 834.8459033109248, "1": 2076.722107725218, "2": 1932.7935293503106}}, "throughput": {"Test name": {"0": "throughput_llama8B_tp1", "1": "throughput_mixtral8x7B_tp2", "2": "throughput_llama70B_tp4"}, "GPU": {"0": "H200", "1": "H200", "2": "H200"}, "Tput (req/s)": {"0": 21.83341524788152, "1": 8.048109997071029, "2": 9.746709727458574}}, "serving": {"Test name": {"0": "serving_llama70B_tp4_sharegpt_qps_1", "1": "serving_mixtral8x7B_tp2_sharegpt_qps_inf", "2": "serving_llama70B_tp4_sharegpt_qps_16", "3": "serving_llama70B_tp4_sharegpt_qps_inf", "4": "serving_mixtral8x7B_tp2_sharegpt_qps_1", "5": "serving_llama8B_tp1_sharegpt_qps_1", "6": "serving_llama8B_tp1_sharegpt_qps_4", "7": "serving_llama8B_tp1_sharegpt_qps_16", "8": "serving_llama8B_tp1_sharegpt_qps_inf", "9": "serving_llama70B_tp4_sharegpt_qps_4", "10": "serving_mixtral8x7B_tp2_sharegpt_qps_16", "11": "serving_mixtral8x7B_tp2_sharegpt_qps_4"}, "GPU": {"0": "H200", "1": "H200", "2": "H200", "3": "H200", "4": "H200", "5": "H200", "6": "H200", "7": "H200", "8": "H200", "9": "H200", "10": "H200", "11": "H200"}, "Tput (req/s)": {"0": 0.9897454093597071, "1": 8.301438303486135, "2": 7.941884952254488, "3": 9.83461405896792, "4": 0.9945736571908708, "5": 1.0072132823950808, "6": 3.866650776305284, "7": 12.169701952157455, "8": 22.132008765124226, "9": 3.4471185869260013, "10": 6.6430202329711845, "11": 3.25053573113392}, "Mean TTFT (ms)": {"0": 65.89417540002614, "1": 2212.0139549998567, "2": 129.8440619581379, "3": 2956.116558215581, "4": 46.11883599776775, "5": 23.79521505907178, "6": 26.47580065531656, "7": 36.813228321261704, "8": 
1311.1809824383818, "9": 77.9262473876588, "10": 68.5915441554971, "11": 53.15553463995457}, "Median TTFT (ms)": {"0": 58.68379143066704, "1": 2693.1617595255375, "2": 118.77811304293573, "3": 2966.7866479139775, "4": 41.25743662007153, "5": 22.25138060748577, "6": 23.98963994346559, "7": 30.849200673401356, "8": 1297.1230044495314, "9": 64.10894286818802, "10": 63.01983795128763, "11": 48.80578117445111}, "P99 TTFT (ms)": {"0": 117.7309931674972, "1": 2705.078350431286, "2": 245.1856034155935, "3": 5737.240736754611, "4": 86.92045607138422, "5": 42.44583906605839, "6": 44.62477448396378, "7": 103.34168700501296, "8": 2298.2681707665324, "9": 153.26619511004526, "10": 151.21046319138233, "11": 93.96423864644017}, "Mean ITL (ms)": {"0": 16.55375627379828, "1": 27.538672305496316, "2": 26.86406814457687, "3": 28.508920810925122, "4": 12.774443284429175, "5": 6.268575682927343, "6": 6.672862159281757, "7": 9.137658113079894, "8": 12.297889603084739, "9": 19.255033580634123, "10": 30.98522109735675, "11": 23.74275151011624}, "Median ITL (ms)": {"0": 15.912594506517053, "1": 25.181110948324203, "2": 20.656882552430034, "3": 22.536626551300287, "4": 12.025097850710154, "5": 6.183910416439176, "6": 6.3976801466196775, "7": 7.903937250375748, "8": 11.356941889971495, "9": 17.124322475865483, "10": 24.343972094357014, "11": 21.187677048146725}, "P99 ITL (ms)": {"0": 57.25642004050316, "1": 49.94449386373162, "2": 73.12712532002479, "3": 60.91620681807399, "4": 38.65427698940039, "5": 6.717711966484787, "6": 20.656075677834455, "7": 22.671854496002197, "8": 22.816914832219496, "9": 60.47411515843123, "10": 151.03395713493225, "11": 56.653142580762506}}}
You can also check the raw experiment data in the Artifacts tab of the Buildkite page.
bootstrap

```shell
curl -sSL https://raw.githubusercontent.com/vllm-project/buildkite-ci/main/scripts/kickoff-benchmark.sh | bash
```