
Fix reporting, and try again

Passed in 1h 6m

Latency tests

  • Input length: 32 tokens.
  • Output length: 128 tokens.
  • Batch size: fixed (8).
  • Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
  • Evaluation metrics: end-to-end latency (mean, median, p99); a minimal measurement sketch follows the table below.
| Test name | GPU | Mean latency (ms) | Median latency (ms) | P99 latency (ms) |
|---|---|---|---|---|
| latency_llama70B_tp4 | 8xH100 | 2444.47 | 2444.3 | 2450.91 |
| latency_llama8B_tp1 | 8xH100 | 997.542 | 997.365 | 999.409 |
| latency_mixtral8x7B_tp2 | 8xH100 | 2326.97 | 2330.57 | 2350.54 |
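
The numbers above come from vLLM's CI benchmark suite. As a rough illustration only, here is a minimal sketch of a fixed-batch latency measurement using vLLM's offline LLM API; the model name, prompt, and iteration count are placeholders, not the CI configuration.

import time

import numpy as np
from vllm import LLM, SamplingParams

# Placeholder model; the report covers llama-3.1 8B, llama-3 70B, and mixtral 8x7B.
llm = LLM(model="meta-llama/Llama-3.1-8B", tensor_parallel_size=1)
# Force exactly 128 output tokens per request, matching the setup above.
sampling = SamplingParams(max_tokens=128, ignore_eos=True)

# A fixed batch of 8 prompts (the real benchmark uses 32-token inputs).
batch = ["Hello, my name is"] * 8

latencies_ms = []
for _ in range(30):  # repeat the batch to collect a latency distribution
    start = time.perf_counter()
    llm.generate(batch, sampling)
    latencies_ms.append((time.perf_counter() - start) * 1000)

print("mean:", np.mean(latencies_ms),
      "median:", np.median(latencies_ms),
      "p99:", np.percentile(latencies_ms, 99))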

Throughput tests

  • Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
  • Output length: the corresponding output lengths of these 200 prompts.
  • Batch size: dynamically determined by vLLM to achieve maximum throughput.
  • Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
  • Evaluation metrics: throughput (requests per second); a minimal sketch of this setup follows the table below.
| Test name | GPU | Tput (req/s) |
|---|---|---|
| throughput_llama70B_tp4 | 8xH100 | 8.86239 |
| throughput_llama8B_tp1 | 8xH100 | 19.5005 |
| throughput_mixtral8x7B_tp2 | 8xH100 | 8.15186 |
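
The throughput runs measure completed requests per second over a set of ShareGPT prompts that vLLM schedules dynamically. The sketch below illustrates that setup; the file name, ShareGPT field layout, model, and fixed output cap are assumptions, not the CI configuration.

import json
import random
import time

from vllm import LLM, SamplingParams

random.seed(0)  # fixed random seed, as described above

# Assumes a local ShareGPT dump in the common "conversations" layout.
with open("sharegpt.json") as f:
    conversations = json.load(f)
prompts = [c["conversations"][0]["value"] for c in random.sample(conversations, 200)]

llm = LLM(model="meta-llama/Llama-3.1-8B", tensor_parallel_size=1)
# The real benchmark caps each request at its reference output length;
# a single fixed cap keeps this sketch short.
sampling = SamplingParams(max_tokens=256)

start = time.perf_counter()
llm.generate(prompts, sampling)  # vLLM batches these internally for maximum throughput
elapsed = time.perf_counter() - start
print(f"Tput: {len(prompts) / elapsed:.2f} req/s")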

Serving tests

  • Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
  • Output length: the corresponding output lengths of these 200 prompts.
  • Batch size: dynamically determined by vLLM and the arrival pattern of the requests.
  • Average QPS (queries per second): 1, 4, 16, and inf. QPS = inf means all requests arrive at once. For other QPS values, the arrival time of each query is drawn from a Poisson process (with a fixed random seed); see the sketch after the table below.
  • Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
  • We also added a speculative decoding test for llama-3 70B at QPS 2.
  • Evaluation metrics: throughput, TTFT (time to first token; mean, median, and p99), ITL (inter-token latency; mean, median, and p99).
| Test name | GPU | Tput (req/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean ITL (ms) | Median ITL (ms) | P99 ITL (ms) |
|---|---|---|---|---|---|---|---|---|
| serving_llama70B_tp4_sharegpt_qps_1 | 8xH100 | 0.984501 | 62.4149 | 56.9924 | 110.695 | 19.3936 | 18.7615 | 53.3811 |
| serving_llama70B_tp4_sharegpt_qps_16 | 8xH100 | 7.29345 | 130.278 | 110.029 | 421.16 | 28.2932 | 23.7207 | 71.6121 |
| serving_llama70B_tp4_sharegpt_qps_4 | 8xH100 | 3.33967 | 72.0854 | 61.3991 | 150.311 | 22.171 | 20.3971 | 63.4759 |
| serving_llama70B_tp4_sharegpt_qps_inf | 8xH100 | 8.89715 | 2828.76 | 2791.16 | 5398.14 | 30.6683 | 25.6486 | 163.985 |
| serving_llama70B_tp4_sharegpt_specdecode_qps_2 | 8xH100 | 1.63871 | 65.0781 | 62.4165 | 113.499 | 35.0204 | 32.2292 | 100.384 |
| serving_llama8B_tp1_sharegpt_qps_1 | 8xH100 | 1.00494 | 24.6827 | 21.6912 | 42.3386 | 7.62798 | 7.55312 | 8.53468 |
| serving_llama8B_tp1_sharegpt_qps_16 | 8xH100 | 11.4896 | 38.2191 | 32.3384 | 191.642 | 10.2977 | 9.39778 | 22.4331 |
| serving_llama8B_tp1_sharegpt_qps_4 | 8xH100 | 3.80428 | 25.2503 | 22.5512 | 42.9718 | 8.09316 | 7.84624 | 20.3429 |
| serving_llama8B_tp1_sharegpt_qps_inf | 8xH100 | 19.4787 | 1186.77 | 1136.81 | 2170.02 | 14.2914 | 12.4214 | 24.5677 |
| serving_mixtral8x7B_tp2_sharegpt_qps_1 | 8xH100 | 0.987662 | 335.455 | 40.373 | 3202.61 | 18.32 | 16.0769 | 38.2721 |
| serving_mixtral8x7B_tp2_sharegpt_qps_16 | 8xH100 | 6.65328 | 234.619 | 60.2056 | 1839.69 | 30.6882 | 24.0982 | 187.731 |
| serving_mixtral8x7B_tp2_sharegpt_qps_4 | 8xH100 | 3.25427 | 47.0435 | 43.7934 | 85.4128 | 22.8735 | 20.5077 | 47.5261 |
| serving_mixtral8x7B_tp2_sharegpt_qps_inf | 8xH100 | 8.71956 | 1233.99 | 1116.38 | 1480.35 | 26.2536 | 24.4811 | 173.943 |
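
Two pieces of the serving setup are easy to misread: how arrival times are generated for a finite QPS, and how TTFT and ITL are derived from a streamed response. The sketch below illustrates both; the function names are ours, and the real benchmark drives an OpenAI-compatible vLLM server rather than computing these offline.

import numpy as np

def poisson_arrival_times(num_requests, qps, seed=0):
    """QPS = inf sends everything at t = 0; otherwise draw exponential
    inter-arrival gaps (a Poisson process) with a fixed random seed."""
    if qps == float("inf"):
        return np.zeros(num_requests)
    rng = np.random.default_rng(seed)
    return np.cumsum(rng.exponential(1.0 / qps, num_requests))

def ttft_and_itl(send_time_ms, token_times_ms):
    """TTFT is the gap from sending the request to the first streamed token;
    ITL is the gap between consecutive streamed tokens."""
    token_times_ms = np.asarray(token_times_ms)
    ttft = token_times_ms[0] - send_time_ms
    itl = np.diff(token_times_ms)
    return ttft, itl

print(poisson_arrival_times(5, qps=4.0))       # arrival times for QPS 4
print(ttft_and_itl(0.0, [62.4, 81.2, 100.1]))  # TTFT 62.4 ms, ITLs of ~19 ms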

JSON version of the benchmarking tables

This section contains the data from the tables above in JSON format. You can load it into pandas DataFrames as follows:

import json
import pandas as pd

# Paste the JSON string from the section below between the triple quotes.
benchmarking_results_json = """The json string"""
benchmarking_results = json.loads(benchmarking_results_json)

# Each top-level key ("latency", "throughput", "serving") becomes one DataFrame,
# matching the three tables above.
latency_results = pd.DataFrame.from_dict(benchmarking_results["latency"])
throughput_results = pd.DataFrame.from_dict(benchmarking_results["throughput"])
serving_results = pd.DataFrame.from_dict(benchmarking_results["serving"])
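
Once loaded, the frames can be queried directly; for example, to rank the serving configurations by throughput (column names exactly as in the tables above):

# Rank serving configurations by throughput and preview the tail latencies.
cols = ["Test name", "Tput (req/s)", "Mean TTFT (ms)", "P99 ITL (ms)"]
print(serving_results.sort_values("Tput (req/s)", ascending=False)[cols])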

The JSON string for all benchmarking tables:

{"latency": {"Test name": {"0": "latency_mixtral8x7B_tp2", "1": "latency_llama70B_tp4", "2": "latency_llama8B_tp1"}, "GPU": {"0": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "1": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "2": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100"}, "Mean latency (ms)": {"0": 2326.965676266021, "1": 2444.4721424001427, "2": 997.5417277334296}, "Median latency (ms)": {"0": 2330.5670189984085, "1": 2444.2982479995408, "2": 997.3654839996016}, "P99 latency (ms)": {"0": 2350.5369539411913, "1": 2450.910051380997, "2": 999.408646782831}}, "throughput": {"Test name": {"0": "throughput_llama8B_tp1", "1": "throughput_llama70B_tp4", "2": "throughput_mixtral8x7B_tp2"}, "GPU": {"0": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "1": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "2": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100"}, "Tput (req/s)": {"0": 19.50047012279712, "1": 8.862392360296212, "2": 8.151860797301364}}, "serving": {"Test name": {"0": "serving_llama8B_tp1_sharegpt_qps_inf", "1": "serving_mixtral8x7B_tp2_sharegpt_qps_4", "2": "serving_llama70B_tp4_sharegpt_qps_inf", "3": "serving_llama70B_tp4_sharegpt_specdecode_qps_2", "4": "serving_llama70B_tp4_sharegpt_qps_16", "5": "serving_mixtral8x7B_tp2_sharegpt_qps_16", "6": "serving_llama70B_tp4_sharegpt_qps_1", "7": "serving_mixtral8x7B_tp2_sharegpt_qps_inf", "8": "serving_llama8B_tp1_sharegpt_qps_16", "9": "serving_mixtral8x7B_tp2_sharegpt_qps_1", "10": "serving_llama70B_tp4_sharegpt_qps_4", "11": "serving_llama8B_tp1_sharegpt_qps_4", "12": "serving_llama8B_tp1_sharegpt_qps_1"}, "GPU": {"0": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "1": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "2": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "3": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "4": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "5": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "6": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "7": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "8": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "9": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "10": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "11": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "12": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100"}, "Tput (req/s)": {"0": 19.478671509224945, "1": 3.2542682320741907, "2": 8.89715088680512, "3": 1.6387134944470014, "4": 7.293452863434265, "5": 6.653282510621168, "6": 0.9845009832601267, "7": 8.719562292290481, "8": 11.489592565396304, "9": 0.9876621096812964, "10": 3.3396674973279286, "11": 3.804282447932104, "12": 1.0049441526923881}, "Mean TTFT (ms)": {"0": 1186.7715199698432, "1": 47.04349210000146, "2": 2828.755769004929, "3": 65.07806814280619, "4": 130.27841210498082, "5": 234.6192061650254, "6": 62.41492667983039, "7": 1233.985231079987, "8": 38.219072010124364, "9": 335.45491353988837, "10": 72.08538667488028, "11": 25.2503033649009, "12": 24.68272186989452}, "Median TTFT (ms)": {"0": 1136.8130074988585, "1": 43.79336549936852, "2": 2791.1591214997316, "3": 62.41652300013811, "4": 110.02869550065952, "5": 60.205612500794814, "6": 56.99239799832867, "7": 1116.382263999185, "8": 32.33839550011908, "9": 40.372964500420494, "10": 61.39909349985828, "11": 22.551166501216358, "12": 21.691177498723846}, "P99 TTFT (ms)": {"0": 2170.0158058400484, "1": 85.41280763962126, "2": 5398.144727478029, "3": 113.49940616048119, "4": 421.16045877835813, "5": 1839.6879125505068, "6": 110.69542514687767, "7": 
1480.3458080886774, "8": 191.64241939091858, "9": 3202.613635900633, "10": 150.31086032809978, "11": 42.97179792880342, "12": 42.33856607857888}, "Mean ITL (ms)": {"0": 14.291355223924628, "1": 22.87351067848165, "2": 30.66834920119073, "3": 35.02044642954789, "4": 28.293165652833537, "5": 30.688247894297504, "6": 19.39364450262744, "7": 26.253617825232315, "8": 10.297709183774685, "9": 18.320030211559104, "10": 22.17100693743933, "11": 8.093159038074383, "12": 7.627975875345532}, "Median ITL (ms)": {"0": 12.421366000126, "1": 20.50769599736668, "2": 25.64860899838095, "3": 32.22915399965132, "4": 23.720700499325176, "5": 24.098178000713233, "6": 18.761489998723846, "7": 24.481063002895098, "8": 9.397782499945606, "9": 16.07687999785412, "10": 20.397060501636588, "11": 7.846243499443517, "12": 7.553117500719964}, "P99 ITL (ms)": {"0": 24.56774540059996, "1": 47.52614814125991, "2": 163.9850535910591, "3": 100.38362414044968, "4": 71.61208892772265, "5": 187.73057302110834, "6": 53.38106291310397, "7": 173.94275460130305, "8": 22.433148449999862, "9": 38.272068400110584, "10": 63.475900410558104, "11": 20.34292935040867, "12": 8.534682251774965}}}

You can also check the raw experiment data in the Artifact tab of the Buildkite page.

Buildkite job steps

  • bootstrap: curl -sSL https://raw.githubusercontent.com/vllm-project/buildkite-ci/main/scripts/kickoff-benchmark.sh | bash (waited 39s, ran in 27s)
  • Wait for container to be ready (waited 5s, ran in 30m 18s)
  • H100 (waited 7s, ran in 36m 0s)
  • Total job run time: 1h 6m