🐎

Performance Benchmark

restore other changes

#9523

simon-mo:h200-bench/d88811a80(#9768)

Passed in 11h 53m

Simon Mo

Created Wed 20th Nov 2024 at 6:33 AM

Triggered from Webhook

Latency tests

Input length: 32 tokens.
Output length: 128 tokens.
Batch size: fixed (8).
Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
Evaluation metrics: end-to-end latency (mean, median, p99).
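The three latency statistics listed above can be computed from a list of per-run latencies. This is a minimal sketch, not the benchmark's own code: the function name `summarize_latencies` and the sample data are hypothetical, and the nearest-rank percentile rule is an assumption (the benchmark scripts may interpolate differently).

```python
import statistics


def summarize_latencies(latencies_ms):
    """Return mean, median and p99 of a list of latency samples in ms."""
    ordered = sorted(latencies_ms)
    # Nearest-rank p99 (assumed rule): the smallest sample such that at
    # least 99% of all samples are less than or equal to it.
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(0.99 * n)
    return {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "p99": ordered[rank - 1],
    }


# Hypothetical samples: 100 runs taking 1 ms, 2 ms, ..., 100 ms.
samples = [float(x) for x in range(1, 101)]
summary = summarize_latencies(samples)
print(summary)
```

When mean, median and p99 are nearly identical, as in the latency table, the run-to-run variance is small.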

Test name	GPU	Mean latency (ms)	Median latency (ms)	P99 latency (ms)
latency_llama70B_tp4	8x A100-SXM4-80GB	3990.59	3990.39	3991.87
latency_llama8B_tp1	8x A100-SXM4-80GB	1579.27	1579.34	1579.87
latency_mixtral8x7B_tp2	8x A100-SXM4-80GB	3643.84	3648.91	3680.16

Throughput tests

Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
Output length: the corresponding output lengths of these 200 prompts.
Batch size: dynamically determined by vLLM to achieve maximum throughput.
Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
Evaluation metrics: throughput.

Test name	GPU	Tput (req/s)
throughput_llama70B_tp4	8x A100-SXM4-80GB	4.78859
throughput_llama8B_tp1	8x A100-SXM4-80GB	11.0567
throughput_mixtral8x7B_tp2	8x A100-SXM4-80GB	5.2096

Serving tests

Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
Output length: the corresponding output lengths of these 200 prompts.
Batch size: dynamically determined by vLLM and the arrival pattern of the requests.
Average QPS (queries per second): 1, 4, 16 and inf. QPS = inf means all requests arrive at once. For other QPS values, the arrival time of each query is drawn from a Poisson process (with a fixed random seed).
Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
We also added a speculative decoding test for llama-3 70B, under QPS 2.
Evaluation metrics: throughput, TTFT (time to the first token, with mean, median and p99), ITL (inter-token latency, with mean, median and p99).
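The arrival pattern described above can be illustrated with a short sketch. It assumes a Poisson process is simulated by drawing exponentially distributed gaps between consecutive requests; the function name `arrival_times` and the default seed are hypothetical and not taken from the benchmark scripts.

```python
import random


def arrival_times(num_requests, qps, seed=0):
    """Return request send times in seconds for a target average QPS."""
    if qps == float("inf"):
        # QPS = inf: every request is sent at time zero.
        return [0.0] * num_requests
    rng = random.Random(seed)  # fixed seed makes the schedule reproducible
    times, now = [], 0.0
    for _ in range(num_requests):
        times.append(now)
        # Gaps in a Poisson process are exponential with mean 1 / qps.
        now += rng.expovariate(qps)
    return times


schedule = arrival_times(200, qps=4)
burst = arrival_times(200, qps=float("inf"))
print(f"last request sent at {schedule[-1]:.1f} s")
```

With 200 requests at an average of 4 QPS, the schedule spans roughly 50 seconds, though the exact value depends on the seed.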

Test name	GPU	Tput (req/s)	Mean TTFT (ms)	Median TTFT (ms)	P99 TTFT (ms)	Mean ITL (ms)	Median ITL (ms)	P99 ITL (ms)
serving_llama70B_tp4_sharegpt_qps_1	8x A100-SXM4-80GB	0.952471	115.311	91.0313	265.056	33.5852	30.7201	97.3908
serving_llama70B_tp4_sharegpt_qps_16	8x A100-SXM4-80GB	4.80791	739.389	763.067	1486.83	58.6988	44.9026	247.291
serving_llama70B_tp4_sharegpt_qps_4	8x A100-SXM4-80GB	2.91433	150.67	128.708	380.441	42.4353	35.2233	111.432
serving_llama70B_tp4_sharegpt_qps_inf	8x A100-SXM4-80GB	4.83096	6647.28	6483.58	13063.6	59.6783	45.1585	238.631
serving_llama70B_tp4_sharegpt_specdecode_qps_2	8x A100-SXM4-80GB	1.56388	119.466	104.25	253.872	60.8368	50.1386	240.037
serving_llama8B_tp1_sharegpt_qps_1	8x A100-SXM4-80GB	0.99686	41.2431	33.9182	86.1282	12.3817	11.9783	30.4699
serving_llama8B_tp1_sharegpt_qps_16	8x A100-SXM4-80GB	8.78881	85.7946	67.627	341.346	21.2062	18.8506	44.4968
serving_llama8B_tp1_sharegpt_qps_4	8x A100-SXM4-80GB	3.59462	44.0776	36.9963	88.3449	13.9754	12.9534	32.9243
serving_llama8B_tp1_sharegpt_qps_inf	8x A100-SXM4-80GB	11.1166	2343.4	2269.15	4494.72	26.1787	21.8317	50.6188
serving_mixtral8x7B_tp2_sharegpt_qps_1	8x A100-SXM4-80GB	0.950567	322.814	63.0845	3074.9	32.3384	28.2235	63.3446
serving_mixtral8x7B_tp2_sharegpt_qps_16	8x A100-SXM4-80GB	4.65562	100.619	96.0579	206.358	47.6891	36.4703	295.411
serving_mixtral8x7B_tp2_sharegpt_qps_4	8x A100-SXM4-80GB	2.88023	73.6468	67.235	140.869	38.7074	33.7848	110.65
serving_mixtral8x7B_tp2_sharegpt_qps_inf	8x A100-SXM4-80GB	5.47425	2974.18	2614.49	3604.41	40.8073	38.1229	254.489

JSON version of the benchmarking tables

This section contains the data of the tables above in JSON format. You can load the benchmarking tables into pandas DataFrames as follows:

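import json
import pandas as pd

# Paste the JSON string from below in place of the placeholder. The raw-string
# prefix (r) keeps the "\n" escapes in the GPU field intact for json.loads.
benchmarking_results_json = r"""The json string"""
benchmarking_results = json.loads(benchmarking_results_json)
# Each top-level key holds one table in column-oriented form.
latency_results = pd.DataFrame.from_dict(benchmarking_results["latency"])
throughput_results = pd.DataFrame.from_dict(benchmarking_results["throughput"])
serving_results = pd.DataFrame.from_dict(benchmarking_results["serving"])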

The JSON string for all benchmarking tables:

{"latency": {"Test name": {"0": "latency_llama8B_tp1", "1": "latency_llama70B_tp4", "2": "latency_mixtral8x7B_tp2"}, "GPU": {"0": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "1": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "2": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB"}, "Mean latency (ms)": {"0": 1579.2741344310343, "1": 3990.5927983112633, "2": 3643.844443745911}, "Median latency (ms)": {"0": 1579.3387778103352, "1": 3990.386940073222, "2": 3648.910060059279}, "P99 latency (ms)": {"0": 1579.8685044329613, "1": 3991.868680436164, "2": 3680.155389076099}}, "throughput": {"Test name": {"0": "throughput_llama8B_tp1", "1": "throughput_llama70B_tp4", "2": "throughput_mixtral8x7B_tp2"}, "GPU": {"0": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "1": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "2": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB"}, "Tput (req/s)": {"0": 11.056650025301797, "1": 4.788594455549321, "2": 5.209599610769483}}, "serving": {"Test name": {"0": "serving_llama8B_tp1_sharegpt_qps_1", "1": "serving_llama8B_tp1_sharegpt_qps_4", "2": "serving_llama8B_tp1_sharegpt_qps_16", "3": "serving_llama8B_tp1_sharegpt_qps_inf", "4": "serving_llama70B_tp4_sharegpt_qps_1", "5": "serving_llama70B_tp4_sharegpt_qps_4", "6": "serving_llama70B_tp4_sharegpt_qps_16", "7": "serving_llama70B_tp4_sharegpt_qps_inf", "8": "serving_mixtral8x7B_tp2_sharegpt_qps_1", "9": "serving_mixtral8x7B_tp2_sharegpt_qps_4", "10": "serving_mixtral8x7B_tp2_sharegpt_qps_16", "11": 
"serving_mixtral8x7B_tp2_sharegpt_qps_inf", "12": "serving_llama70B_tp4_sharegpt_specdecode_qps_2"}, "GPU": {"0": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "1": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "2": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "3": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "4": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "5": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "6": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "7": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "8": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "9": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "10": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "11": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "12": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB"}, "Tput (req/s)": {"0": 0.996860105131725, "1": 3.594616274181651, "2": 8.788809990815022, "3": 11.116569511102174, "4": 
0.9524712318177595, "5": 2.9143259428791333, "6": 4.807908351065189, "7": 4.830957409318412, "8": 0.9505671381170304, "9": 2.880227262737138, "10": 4.655617720115071, "11": 5.474248795427025, "12": 1.5638805341960242}, "Mean TTFT (ms)": {"0": 41.243121719453484, "1": 44.077623966149986, "2": 85.79455303959548, "3": 2343.400535401888, "4": 115.31116942409426, "5": 150.67011817125604, "6": 739.3892058706842, "7": 6647.2774839098565, "8": 322.8140198113397, "9": 73.64680503029376, "10": 100.61861591646448, "11": 2974.1790991905145, "12": 119.46559221690107}, "Median TTFT (ms)": {"0": 33.91816699877381, "1": 36.99628356844187, "2": 67.62702250853181, "3": 2269.1522729583085, "4": 91.0313033964485, "5": 128.70792299509048, "6": 763.0671088118106, "7": 6483.580605825409, "8": 63.08454996906221, "9": 67.23504257388413, "10": 96.05789557099342, "11": 2614.4928943831474, "12": 104.24970323219895}, "P99 TTFT (ms)": {"0": 86.12816051114349, "1": 88.34485004656015, "2": 341.3459550682454, "3": 4494.72184590064, "4": 265.0556325260538, "5": 380.4409689595922, "6": 1486.8260128656395, "7": 13063.584973793477, "8": 3074.8974635964246, "9": 140.8694759849459, "10": 206.35831077583092, "11": 3604.4086371082813, "12": 253.8721384108068}, "Mean ITL (ms)": {"0": 12.381713669474479, "1": 13.97535663168982, "2": 21.206160098427702, "3": 26.17872029052179, "4": 33.585217624787845, "5": 42.435285622622, "6": 58.698767704667674, "7": 59.67830900113209, "8": 32.33838742374971, "9": 38.70738570247557, "10": 47.689148928676595, "11": 40.80733728618981, "12": 60.83681516647543}, "Median ITL (ms)": {"0": 11.978331953287125, "1": 12.953380355611444, "2": 18.850606866180897, "3": 21.831671707332134, "4": 30.72014870122075, "5": 35.22330219857395, "6": 44.90255401469767, "7": 45.15848867595196, "8": 28.22347404435277, "9": 33.78476295620203, "10": 36.47025674581528, "11": 38.12288399785757, "12": 50.13856524601579}, "P99 ITL (ms)": {"0": 30.469923452474195, "1": 32.9242795240134, "2": 
44.4967800006271, "3": 50.61877420172099, "4": 97.39084912464023, "5": 111.43196043092756, "6": 247.29101412929384, "7": 238.63066593185067, "8": 63.34455063566565, "9": 110.65045634284614, "10": 295.41056528687477, "11": 254.4893297180533, "12": 240.03740200772882}}}

You can also check the raw experiment data in the Artifact tab of the Buildkite page.
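In the JSON above, each GPU field stores the GPU name repeated once per line, joined with newline characters. A minimal sketch for condensing such a field into a short label follows; the helper name `summarize_gpu_field` is hypothetical.

```python
from collections import Counter


def summarize_gpu_field(gpu_field):
    """Collapse a newline-joined list of GPU names into 'Nx name' labels."""
    counts = Counter(gpu_field.split("\n"))
    return ", ".join(f"{n}x {name}" for name, n in counts.items())


# Value copied from the "GPU" column of the JSON above.
field = "\n".join(["A100-SXM4-80GB"] * 8)
label = summarize_gpu_field(field)
print(label)
```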

Latency tests

Input length: 32 tokens.
Output length: 128 tokens.
Batch size: fixed (8).
Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
Evaluation metrics: end-to-end latency (mean, median, p99).

Test name	GPU	Mean latency (ms)	Median latency (ms)	P99 latency (ms)
latency_llama70B_tp4	7x H200	2056.3	2056.57	2057.75
latency_llama8B_tp1	7x H200	827.381	827.391	827.806
latency_mixtral8x7B_tp2	7x H200	1887.24	1887.99	1898.83

Throughput tests

Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
Output length: the corresponding output lengths of these 200 prompts.
Batch size: dynamically determined by vLLM to achieve maximum throughput.
Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
Evaluation metrics: throughput.

Test name	GPU	Tput (req/s)
throughput_llama70B_tp4	7x H200	9.93361
throughput_llama8B_tp1	7x H200	21.7857
throughput_mixtral8x7B_tp2	7x H200	8.10596

Serving tests

Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
Output length: the corresponding output lengths of these 200 prompts.
Batch size: dynamically determined by vLLM and the arrival pattern of the requests.
Average QPS (queries per second): 1, 4, 16 and inf. QPS = inf means all requests arrive at once. For other QPS values, the arrival time of each query is drawn from a Poisson process (with a fixed random seed).
Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
We also added a speculative decoding test for llama-3 70B, under QPS 2.
Evaluation metrics: throughput, TTFT (time to the first token, with mean, median and p99), ITL (inter-token latency, with mean, median and p99).
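TTFT and ITL can both be derived from the time a request is sent and the times at which its output tokens arrive. The sketch below uses hypothetical timestamps in seconds; the function name `ttft_and_itl` and its inputs are illustrative and not taken from the benchmark scripts.

```python
def ttft_and_itl(request_sent_s, token_times_s):
    """Return (TTFT in ms, list of ITLs in ms) for one request."""
    # TTFT: delay between sending the request and receiving the first token.
    ttft_ms = (token_times_s[0] - request_sent_s) * 1000.0
    # ITL: gap between each pair of consecutive output tokens.
    itl_ms = [
        (later - earlier) * 1000.0
        for earlier, later in zip(token_times_s, token_times_s[1:])
    ]
    return ttft_ms, itl_ms


# Hypothetical request: sent at t = 0 s, tokens arrive at 50, 70 and 100 ms.
ttft, itls = ttft_and_itl(0.0, [0.05, 0.07, 0.10])
print(ttft, itls)
```

The mean, median and p99 columns in the table are then aggregates of these per-request TTFT values and per-token ITL values.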

Test name	GPU	Tput (req/s)	Mean TTFT (ms)	Median TTFT (ms)	P99 TTFT (ms)	Mean ITL (ms)	Median ITL (ms)	P99 ITL (ms)
serving_llama70B_tp4_sharegpt_qps_1	7x H200	0.98987	53.2916	47.3446	106.249	16.2839	15.8502	43.3427
serving_llama70B_tp4_sharegpt_qps_16	7x H200	8.09794	94.7277	86.1779	192.687	24.0539	19.6597	58.118
serving_llama70B_tp4_sharegpt_qps_4	7x H200	3.45418	60.4077	52.0092	112.608	18.1923	16.5399	46.1188
serving_llama70B_tp4_sharegpt_qps_inf	7x H200	10.0194	2888.7	2884.22	5482.37	27.5007	22.1666	135.618
serving_llama70B_tp4_sharegpt_specdecode_qps_2	7x H200	1.64944	56.3498	51.39	102.777	29.6755	27.511	89.5286
serving_llama8B_tp1_sharegpt_qps_1	7x H200	1.00735	20.1967	18.4575	38.4603	6.22192	6.16619	6.67523
serving_llama8B_tp1_sharegpt_qps_16	7x H200	12.2262	34.0159	26.892	211.586	8.38521	7.70806	17.9849
serving_llama8B_tp1_sharegpt_qps_4	7x H200	3.8696	22.7002	20.7864	40.8221	6.55541	6.37116	16.2025
serving_llama8B_tp1_sharegpt_qps_inf	7x H200	22.8429	1110.05	1078.07	2069.75	12.1618	10.7576	22.6059
serving_mixtral8x7B_tp2_sharegpt_qps_1	7x H200	0.995005	42.7231	36.708	77.6502	12.7192	11.6729	30.5151
serving_mixtral8x7B_tp2_sharegpt_qps_16	7x H200	6.79836	65.0607	53.8123	301.992	29.9248	23.9852	165.223
serving_mixtral8x7B_tp2_sharegpt_qps_4	7x H200	3.29045	47.8036	42.7779	94.4985	23.0714	20.8724	60.4942
serving_mixtral8x7B_tp2_sharegpt_qps_inf	7x H200	8.31582	2332.3	2091.51	2799.8	27.3485	24.7137	150.142

JSON version of the benchmarking tables

This section contains the data of the tables above in JSON format. You can load the benchmarking tables into pandas DataFrames as follows:

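import json
import pandas as pd

# Paste the JSON string from below in place of the placeholder. The raw-string
# prefix (r) keeps the "\n" escapes in the GPU field intact for json.loads.
benchmarking_results_json = r"""The json string"""
benchmarking_results = json.loads(benchmarking_results_json)
# Each top-level key holds one table in column-oriented form.
latency_results = pd.DataFrame.from_dict(benchmarking_results["latency"])
throughput_results = pd.DataFrame.from_dict(benchmarking_results["throughput"])
serving_results = pd.DataFrame.from_dict(benchmarking_results["serving"])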

The JSON string for all benchmarking tables:

{"latency": {"Test name": {"0": "latency_llama8B_tp1", "1": "latency_llama70B_tp4", "2": "latency_mixtral8x7B_tp2"}, "GPU": {"0": "H200\nH200\nH200\nH200\nH200\nH200\nH200", "1": "H200\nH200\nH200\nH200\nH200\nH200\nH200", "2": "H200\nH200\nH200\nH200\nH200\nH200\nH200"}, "Mean latency (ms)": {"0": 827.3810388830801, "1": 2056.299947978308, "2": 1887.2353067621589}, "Median latency (ms)": {"0": 827.3906968533993, "1": 2056.572403293103, "2": 1887.9943192005157}, "P99 latency (ms)": {"0": 827.8058347199112, "1": 2057.7467997930944, "2": 1898.8275077100843}}, "throughput": {"Test name": {"0": "throughput_llama8B_tp1", "1": "throughput_mixtral8x7B_tp2", "2": "throughput_llama70B_tp4"}, "GPU": {"0": "H200\nH200\nH200\nH200\nH200\nH200\nH200", "1": "H200\nH200\nH200\nH200\nH200\nH200\nH200", "2": "H200\nH200\nH200\nH200\nH200\nH200\nH200"}, "Tput (req/s)": {"0": 21.785715056643895, "1": 8.105962843796984, "2": 9.93361061168471}}, "serving": {"Test name": {"0": "serving_llama70B_tp4_sharegpt_qps_1", "1": "serving_mixtral8x7B_tp2_sharegpt_qps_inf", "2": "serving_llama70B_tp4_sharegpt_qps_16", "3": "serving_llama70B_tp4_sharegpt_qps_inf", "4": "serving_mixtral8x7B_tp2_sharegpt_qps_1", "5": "serving_llama8B_tp1_sharegpt_qps_1", "6": "serving_llama8B_tp1_sharegpt_qps_4", "7": "serving_llama8B_tp1_sharegpt_qps_16", "8": "serving_llama8B_tp1_sharegpt_qps_inf", "9": "serving_llama70B_tp4_sharegpt_qps_4", "10": "serving_mixtral8x7B_tp2_sharegpt_qps_16", "11": "serving_mixtral8x7B_tp2_sharegpt_qps_4", "12": "serving_llama70B_tp4_sharegpt_specdecode_qps_2"}, "GPU": {"0": "H200\nH200\nH200\nH200\nH200\nH200\nH200", "1": "H200\nH200\nH200\nH200\nH200\nH200\nH200", "2": "H200\nH200\nH200\nH200\nH200\nH200\nH200", "3": "H200\nH200\nH200\nH200\nH200\nH200\nH200", "4": "H200\nH200\nH200\nH200\nH200\nH200\nH200", "5": "H200\nH200\nH200\nH200\nH200\nH200\nH200", "6": "H200\nH200\nH200\nH200\nH200\nH200\nH200", "7": "H200\nH200\nH200\nH200\nH200\nH200\nH200", "8": 
"H200\nH200\nH200\nH200\nH200\nH200\nH200", "9": "H200\nH200\nH200\nH200\nH200\nH200\nH200", "10": "H200\nH200\nH200\nH200\nH200\nH200\nH200", "11": "H200\nH200\nH200\nH200\nH200\nH200\nH200", "12": "H200\nH200\nH200\nH200\nH200\nH200\nH200"}, "Tput (req/s)": {"0": 0.989869544917641, "1": 8.315815061981844, "2": 8.097937467696065, "3": 10.019409125454105, "4": 0.9950054179188688, "5": 1.0073450423297847, "6": 3.869603964002038, "7": 12.22616499511483, "8": 22.84287061225969, "9": 3.4541822652822773, "10": 6.798364394315366, "11": 3.2904535822972307, "12": 1.6494358131763422}, "Mean TTFT (ms)": {"0": 53.29161213012412, "1": 2332.3012875579298, "2": 94.72772079752758, "3": 2888.69542918168, "4": 42.72309093736112, "5": 20.196737172082067, "6": 22.700216334778816, "7": 34.015931582544, "8": 1110.04889261676, "9": 60.40773471817374, "10": 65.06069403840229, "11": 47.80358461663127, "12": 56.349760617151674}, "Median TTFT (ms)": {"0": 47.3445791285485, "1": 2091.510316124186, "2": 86.17787552066147, "3": 2884.2243475373834, "4": 36.707976600155234, "5": 18.457521684467793, "6": 20.786371314898133, "7": 26.89200546592474, "8": 1078.0729707330465, "9": 52.00918857008219, "10": 53.81231079809368, "11": 42.777938302606344, "12": 51.39000457711518}, "P99 TTFT (ms)": {"0": 106.24904086813329, "1": 2799.799300944432, "2": 192.6872948743402, "3": 5482.372874026187, "4": 77.65017175581303, "5": 38.4603352844715, "6": 40.822145123966024, "7": 211.5863132150842, "8": 2069.7474440047517, "9": 112.60767419822496, "10": 301.9915889855472, "11": 94.49848417192695, "12": 102.77746587526059}, "Mean ITL (ms)": {"0": 16.283852706568087, "1": 27.348534071067597, "2": 24.053919213969603, "3": 27.500667837211765, "4": 12.719157753153393, "5": 6.221918908643021, "6": 6.555407773585138, "7": 8.385213129696478, "8": 12.161835644124064, "9": 18.192300515159907, "10": 29.924834822187712, "11": 23.071353249801486, "12": 29.6755464805204}, "Median ITL (ms)": {"0": 15.85016818717122, "1": 
24.713678751140833, "2": 19.659725483506918, "3": 22.166555281728506, "4": 11.67289912700653, "5": 6.166191538795829, "6": 6.371156312525272, "7": 7.708064513280988, "8": 10.757617419585586, "9": 16.539912670850754, "10": 23.985240142792463, "11": 20.872429944574833, "12": 27.51104603521526}, "P99 ITL (ms)": {"0": 43.342703185044236, "1": 150.1419538911432, "2": 58.11802428215748, "3": 135.61838977504522, "4": 30.515092369168997, "5": 6.675234157592062, "6": 16.202509100548923, "7": 17.984890309162438, "8": 22.605922631919384, "9": 46.118766949512064, "10": 165.22320011630654, "11": 60.49416142515838, "12": 89.52857794705778}}}

You can also check the raw experiment data in the Artifact tab of the Buildkite page.
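The A100 and H200 JSON strings list their tests in different row orders, so a comparison between the two should match rows by test name rather than by index. The following is a minimal sketch using the throughput values copied from the two JSON strings; the helper names `by_test_name` and `speedup` are hypothetical.

```python
# Throughput tables copied from the A100 and H200 JSON strings
# (GPU column omitted for brevity).
a100_throughput = {
    "Test name": {"0": "throughput_llama8B_tp1",
                  "1": "throughput_llama70B_tp4",
                  "2": "throughput_mixtral8x7B_tp2"},
    "Tput (req/s)": {"0": 11.056650025301797,
                     "1": 4.788594455549321,
                     "2": 5.209599610769483},
}
h200_throughput = {
    "Test name": {"0": "throughput_llama8B_tp1",
                  "1": "throughput_mixtral8x7B_tp2",
                  "2": "throughput_llama70B_tp4"},
    "Tput (req/s)": {"0": 21.785715056643895,
                     "1": 8.105962843796984,
                     "2": 9.93361061168471},
}


def by_test_name(table, column):
    """Re-key one column of a column-oriented table by its test name."""
    names = table["Test name"]
    return {names[idx]: table[column][idx] for idx in names}


def speedup(baseline, candidate, column):
    """Return candidate / baseline per test, matched on test name."""
    base = by_test_name(baseline, column)
    cand = by_test_name(candidate, column)
    return {name: cand[name] / base[name] for name in base if name in cand}


ratios = speedup(a100_throughput, h200_throughput, "Tput (req/s)")
for name, ratio in sorted(ratios.items()):
    print(f"{name}: {ratio:.2f}x")
```

For latency, TTFT and ITL columns, a ratio below 1 would indicate the improvement, since lower values are better there.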

bootstrap

curl -sSL https://raw.githubusercontent.com/vllm-project/buildkite-ci/main/scripts/kickoff-benchmark.sh | bash

Ran in 11s

Wait for container to be ready

Ran in 20m 17s

A100

Ran in 40m 54s

H200

Ran in 33m 40s

Total Job Run Time: 1h 35m