
Fix reporting, and try again

Passed in 1h 6m

Latency tests

  • Input length: 32 tokens.
  • Output length: 128 tokens.
  • Batch size: fixed (8).
  • Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
  • Evaluation metrics: end-to-end latency (mean, median, p99); a minimal measurement sketch follows the table below.
| Test name | GPU | Mean latency (ms) | Median latency (ms) | P99 latency (ms) |
|---|---|---|---|---|
| latency_llama70B_tp4 | 8xH100 | 2444.47 | 2444.3 | 2450.91 |
| latency_llama8B_tp1 | 8xH100 | 997.542 | 997.365 | 999.409 |
| latency_mixtral8x7B_tp2 | 8xH100 | 2326.97 | 2330.57 | 2350.54 |
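
The numbers above come from vLLM's CI benchmark suite. As a rough illustration only, here is a minimal sketch of a fixed-batch latency measurement using vLLM's offline LLM API; the model name, prompt, and iteration count are placeholders, not the CI configuration.

import time

import numpy as np
from vllm import LLM, SamplingParams

# Placeholder model; the report covers llama-3.1 8B, llama-3 70B, and mixtral 8x7B.
llm = LLM(model="meta-llama/Llama-3.1-8B", tensor_parallel_size=1)
# Force exactly 128 output tokens per request, matching the setup above.
sampling = SamplingParams(max_tokens=128, ignore_eos=True)

# A fixed batch of 8 prompts (the real benchmark uses 32-token inputs).
batch = ["Hello, my name is"] * 8

latencies_ms = []
for _ in range(30):  # repeat the batch to collect a latency distribution
    start = time.perf_counter()
    llm.generate(batch, sampling)
    latencies_ms.append((time.perf_counter() - start) * 1000)

print("mean:", np.mean(latencies_ms),
      "median:", np.median(latencies_ms),
      "p99:", np.percentile(latencies_ms, 99))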

Throughput tests

  • Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
  • Output length: the corresponding output lengths of these 200 prompts.
  • Batch size: dynamically determined by vLLM to achieve maximum throughput.
  • Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
  • Evaluation metrics: throughput (requests per second); a minimal sketch of this setup follows the table below.
| Test name | GPU | Tput (req/s) |
|---|---|---|
| throughput_llama70B_tp4 | 8xH100 | 8.86239 |
| throughput_llama8B_tp1 | 8xH100 | 19.5005 |
| throughput_mixtral8x7B_tp2 | 8xH100 | 8.15186 |
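
The throughput runs measure completed requests per second over a set of ShareGPT prompts that vLLM schedules dynamically. The sketch below illustrates that setup; the file name, ShareGPT field layout, model, and fixed output cap are assumptions, not the CI configuration.

import json
import random
import time

from vllm import LLM, SamplingParams

random.seed(0)  # fixed random seed, as described above

# Assumes a local ShareGPT dump in the common "conversations" layout.
with open("sharegpt.json") as f:
    conversations = json.load(f)
prompts = [c["conversations"][0]["value"] for c in random.sample(conversations, 200)]

llm = LLM(model="meta-llama/Llama-3.1-8B", tensor_parallel_size=1)
# The real benchmark caps each request at its reference output length;
# a single fixed cap keeps this sketch short.
sampling = SamplingParams(max_tokens=256)

start = time.perf_counter()
llm.generate(prompts, sampling)  # vLLM batches these internally for maximum throughput
elapsed = time.perf_counter() - start
print(f"Tput: {len(prompts) / elapsed:.2f} req/s")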

Serving tests

  • Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
  • Output length: the corresponding output lengths of these 200 prompts.
  • Batch size: dynamically determined by vLLM and the arrival pattern of the requests.
  • Average QPS (queries per second): 1, 4, 16, and inf. QPS = inf means all requests arrive at once. For other QPS values, the arrival time of each query is drawn from a Poisson process (with a fixed random seed); see the sketch after the table below.
  • Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
  • We also added a speculative decoding test for llama-3 70B at QPS 2.
  • Evaluation metrics: throughput, TTFT (time to first token; mean, median, and p99), ITL (inter-token latency; mean, median, and p99).
| Test name | GPU | Tput (req/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean ITL (ms) | Median ITL (ms) | P99 ITL (ms) |
|---|---|---|---|---|---|---|---|---|
| serving_llama70B_tp4_sharegpt_qps_1 | 8xH100 | 0.984501 | 62.4149 | 56.9924 | 110.695 | 19.3936 | 18.7615 | 53.3811 |
| serving_llama70B_tp4_sharegpt_qps_16 | 8xH100 | 7.29345 | 130.278 | 110.029 | 421.16 | 28.2932 | 23.7207 | 71.6121 |
| serving_llama70B_tp4_sharegpt_qps_4 | 8xH100 | 3.33967 | 72.0854 | 61.3991 | 150.311 | 22.171 | 20.3971 | 63.4759 |
| serving_llama70B_tp4_sharegpt_qps_inf | 8xH100 | 8.89715 | 2828.76 | 2791.16 | 5398.14 | 30.6683 | 25.6486 | 163.985 |
| serving_llama70B_tp4_sharegpt_specdecode_qps_2 | 8xH100 | 1.63871 | 65.0781 | 62.4165 | 113.499 | 35.0204 | 32.2292 | 100.384 |
| serving_llama8B_tp1_sharegpt_qps_1 | 8xH100 | 1.00494 | 24.6827 | 21.6912 | 42.3386 | 7.62798 | 7.55312 | 8.53468 |
| serving_llama8B_tp1_sharegpt_qps_16 | 8xH100 | 11.4896 | 38.2191 | 32.3384 | 191.642 | 10.2977 | 9.39778 | 22.4331 |
| serving_llama8B_tp1_sharegpt_qps_4 | 8xH100 | 3.80428 | 25.2503 | 22.5512 | 42.9718 | 8.09316 | 7.84624 | 20.3429 |
| serving_llama8B_tp1_sharegpt_qps_inf | 8xH100 | 19.4787 | 1186.77 | 1136.81 | 2170.02 | 14.2914 | 12.4214 | 24.5677 |
| serving_mixtral8x7B_tp2_sharegpt_qps_1 | 8xH100 | 0.987662 | 335.455 | 40.373 | 3202.61 | 18.32 | 16.0769 | 38.2721 |
| serving_mixtral8x7B_tp2_sharegpt_qps_16 | 8xH100 | 6.65328 | 234.619 | 60.2056 | 1839.69 | 30.6882 | 24.0982 | 187.731 |
| serving_mixtral8x7B_tp2_sharegpt_qps_4 | 8xH100 | 3.25427 | 47.0435 | 43.7934 | 85.4128 | 22.8735 | 20.5077 | 47.5261 |
| serving_mixtral8x7B_tp2_sharegpt_qps_inf | 8xH100 | 8.71956 | 1233.99 | 1116.38 | 1480.35 | 26.2536 | 24.4811 | 173.943 |
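
Two pieces of the serving setup are easy to misread: how arrival times are generated for a finite QPS, and how TTFT and ITL are derived from a streamed response. The sketch below illustrates both; the function names are ours, and the real benchmark drives an OpenAI-compatible vLLM server rather than computing these offline.

import numpy as np

def poisson_arrival_times(num_requests, qps, seed=0):
    """QPS = inf sends everything at t = 0; otherwise draw exponential
    inter-arrival gaps (a Poisson process) with a fixed random seed."""
    if qps == float("inf"):
        return np.zeros(num_requests)
    rng = np.random.default_rng(seed)
    return np.cumsum(rng.exponential(1.0 / qps, num_requests))

def ttft_and_itl(send_time_ms, token_times_ms):
    """TTFT is the gap from sending the request to the first streamed token;
    ITL is the gap between consecutive streamed tokens."""
    token_times_ms = np.asarray(token_times_ms)
    ttft = token_times_ms[0] - send_time_ms
    itl = np.diff(token_times_ms)
    return ttft, itl

print(poisson_arrival_times(5, qps=4.0))       # arrival times for QPS 4
print(ttft_and_itl(0.0, [62.4, 81.2, 100.1]))  # TTFT 62.4 ms, ITLs of ~19 ms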

JSON version of the benchmarking tables

This section contains the data from the tables above in JSON format. You can load it into pandas DataFrames as follows:

import json
import pandas as pd

# Paste the JSON string from the section below between the triple quotes.
benchmarking_results_json = """The json string"""
benchmarking_results = json.loads(benchmarking_results_json)

# Each top-level key ("latency", "throughput", "serving") becomes one DataFrame,
# matching the three tables above.
latency_results = pd.DataFrame.from_dict(benchmarking_results["latency"])
throughput_results = pd.DataFrame.from_dict(benchmarking_results["throughput"])
serving_results = pd.DataFrame.from_dict(benchmarking_results["serving"])
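
Once loaded, the frames can be queried directly; for example, to rank the serving configurations by throughput (column names exactly as in the tables above):

# Rank serving configurations by throughput and preview the tail latencies.
cols = ["Test name", "Tput (req/s)", "Mean TTFT (ms)", "P99 ITL (ms)"]
print(serving_results.sort_values("Tput (req/s)", ascending=False)[cols])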

The JSON string for all benchmarking tables:

{"latency": {"Test name": {"0": "latency_mixtral8x7B_tp2", "1": "latency_llama70B_tp4", "2": "latency_llama8B_tp1"}, "GPU": {"0": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "1": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "2": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100"}, "Mean latency (ms)": {"0": 2326.965676266021, "1": 2444.4721424001427, "2": 997.5417277334296}, "Median latency (ms)": {"0": 2330.5670189984085, "1": 2444.2982479995408, "2": 997.3654839996016}, "P99 latency (ms)": {"0": 2350.5369539411913, "1": 2450.910051380997, "2": 999.408646782831}}, "throughput": {"Test name": {"0": "throughput_llama8B_tp1", "1": "throughput_llama70B_tp4", "2": "throughput_mixtral8x7B_tp2"}, "GPU": {"0": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "1": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "2": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100"}, "Tput (req/s)": {"0": 19.50047012279712, "1": 8.862392360296212, "2": 8.151860797301364}}, "serving": {"Test name": {"0": "serving_llama8B_tp1_sharegpt_qps_inf", "1": "serving_mixtral8x7B_tp2_sharegpt_qps_4", "2": "serving_llama70B_tp4_sharegpt_qps_inf", "3": "serving_llama70B_tp4_sharegpt_specdecode_qps_2", "4": "serving_llama70B_tp4_sharegpt_qps_16", "5": "serving_mixtral8x7B_tp2_sharegpt_qps_16", "6": "serving_llama70B_tp4_sharegpt_qps_1", "7": "serving_mixtral8x7B_tp2_sharegpt_qps_inf", "8": "serving_llama8B_tp1_sharegpt_qps_16", "9": "serving_mixtral8x7B_tp2_sharegpt_qps_1", "10": "serving_llama70B_tp4_sharegpt_qps_4", "11": "serving_llama8B_tp1_sharegpt_qps_4", "12": "serving_llama8B_tp1_sharegpt_qps_1"}, "GPU": {"0": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "1": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "2": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "3": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "4": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "5": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "6": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "7": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "8": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "9": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "10": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "11": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "12": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100"}, "Tput (req/s)": {"0": 19.478671509224945, "1": 3.2542682320741907, "2": 8.89715088680512, "3": 1.6387134944470014, "4": 7.293452863434265, "5": 6.653282510621168, "6": 0.9845009832601267, "7": 8.719562292290481, "8": 11.489592565396304, "9": 0.9876621096812964, "10": 3.3396674973279286, "11": 3.804282447932104, "12": 1.0049441526923881}, "Mean TTFT (ms)": {"0": 1186.7715199698432, "1": 47.04349210000146, "2": 2828.755769004929, "3": 65.07806814280619, "4": 130.27841210498082, "5": 234.6192061650254, "6": 62.41492667983039, "7": 1233.985231079987, "8": 38.219072010124364, "9": 335.45491353988837, "10": 72.08538667488028, "11": 25.2503033649009, "12": 24.68272186989452}, "Median TTFT (ms)": {"0": 1136.8130074988585, "1": 43.79336549936852, "2": 2791.1591214997316, "3": 62.41652300013811, "4": 110.02869550065952, "5": 60.205612500794814, "6": 56.99239799832867, "7": 1116.382263999185, "8": 32.33839550011908, "9": 40.372964500420494, "10": 61.39909349985828, "11": 22.551166501216358, "12": 21.691177498723846}, "P99 TTFT (ms)": {"0": 2170.0158058400484, "1": 85.41280763962126, "2": 5398.144727478029, "3": 113.49940616048119, "4": 421.16045877835813, "5": 1839.6879125505068, "6": 110.69542514687767, "7": 
1480.3458080886774, "8": 191.64241939091858, "9": 3202.613635900633, "10": 150.31086032809978, "11": 42.97179792880342, "12": 42.33856607857888}, "Mean ITL (ms)": {"0": 14.291355223924628, "1": 22.87351067848165, "2": 30.66834920119073, "3": 35.02044642954789, "4": 28.293165652833537, "5": 30.688247894297504, "6": 19.39364450262744, "7": 26.253617825232315, "8": 10.297709183774685, "9": 18.320030211559104, "10": 22.17100693743933, "11": 8.093159038074383, "12": 7.627975875345532}, "Median ITL (ms)": {"0": 12.421366000126, "1": 20.50769599736668, "2": 25.64860899838095, "3": 32.22915399965132, "4": 23.720700499325176, "5": 24.098178000713233, "6": 18.761489998723846, "7": 24.481063002895098, "8": 9.397782499945606, "9": 16.07687999785412, "10": 20.397060501636588, "11": 7.846243499443517, "12": 7.553117500719964}, "P99 ITL (ms)": {"0": 24.56774540059996, "1": 47.52614814125991, "2": 163.9850535910591, "3": 100.38362414044968, "4": 71.61208892772265, "5": 187.73057302110834, "6": 53.38106291310397, "7": 173.94275460130305, "8": 22.433148449999862, "9": 38.272068400110584, "10": 63.475900410558104, "11": 20.34292935040867, "12": 8.534682251774965}}}

You can also check the raw experiment data in the Artifact tab of the Buildkite page.

Buildkite job steps

  • bootstrap: curl -sSL https://raw.githubusercontent.com/vllm-project/buildkite-ci/main/scripts/kickoff-benchmark.sh | bash (waited 39s, ran in 27s)
  • Wait for container to be ready (waited 5s, ran in 30m 18s)
  • H100 (waited 7s, ran in 36m 0s)
  • Total job run time: 1h 6m