restore other changes

Passed in 11h 53m

Latency tests

  • Input length: 32 tokens.
  • Output length: 128 tokens.
  • Batch size: fixed (8).
  • Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
  • Evaluation metrics: end-to-end latency (mean, median, p99).
| Test name | GPU | Mean latency (ms) | Median latency (ms) | P99 latency (ms) |
|---|---|---|---|---|
| latency_llama70B_tp4 | A100-SXM4-80GB (×8) | 3990.59 | 3990.39 | 3991.87 |
| latency_llama8B_tp1 | A100-SXM4-80GB (×8) | 1579.27 | 1579.34 | 1579.87 |
| latency_mixtral8x7B_tp2 | A100-SXM4-80GB (×8) | 3643.84 | 3648.91 | 3680.16 |
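
The three latency columns are summary statistics over repeated timed runs. A minimal sketch of how such a summary could be computed, assuming a list of per-iteration latencies in milliseconds; the helper name `summarize_latency` and the sample values are hypothetical, and the interpolating percentile definition used here is an assumption rather than the benchmark script's confirmed method:

```python
import statistics

def summarize_latency(latencies_ms):
    """Return mean, median, and p99 of a sequence of latencies (ms)."""
    samples = sorted(latencies_ms)
    # quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile.
    # method="inclusive" interpolates linearly between the observed samples.
    p99 = statistics.quantiles(samples, n=100, method="inclusive")[98]
    return {
        "mean": statistics.fmean(samples),
        "median": statistics.median(samples),
        "p99": p99,
    }

# Hypothetical per-iteration latencies; not taken from the benchmark run.
example = [1579.1, 1579.3, 1579.4, 1579.2, 1579.9]
stats = summarize_latency(example)
print(stats)
```

When the mean and median nearly coincide and the p99 sits close to both, as in the table above, the run-to-run variation is small.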

Throughput tests

  • Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
  • Output length: the output lengths corresponding to these 200 prompts.
  • Batch size: dynamically determined by vLLM to achieve maximum throughput.
  • Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
  • Evaluation metrics: throughput.
| Test name | GPU | Tput (req/s) |
|---|---|---|
| throughput_llama70B_tp4 | A100-SXM4-80GB (×8) | 4.78859 |
| throughput_llama8B_tp1 | A100-SXM4-80GB (×8) | 11.0567 |
| throughput_mixtral8x7B_tp2 | A100-SXM4-80GB (×8) | 5.2096 |

Serving tests

  • Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
  • Output length: the output lengths corresponding to these 200 prompts.
  • Batch size: dynamically determined by vLLM and the arrival pattern of the requests.
  • Average QPS (queries per second): 1, 4, 16, and inf. QPS = inf means all requests arrive at once. For other QPS values, the arrival time of each query is drawn from a Poisson process (with a fixed random seed).
  • Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
  • A speculative decoding test for llama-3 70B at QPS 2 is also included.
  • Evaluation metrics: throughput, TTFT (time to first token; mean, median, and p99), and ITL (inter-token latency; mean, median, and p99).
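
The arrival pattern described in the list above can be sketched as follows. This is an illustration under stated assumptions, not the benchmark script's code: the function name `arrival_times` and the seed value are hypothetical. It relies on the fact that the gaps between events of a Poisson process with rate `qps` are exponentially distributed with mean `1 / qps`.

```python
import math
import random

def arrival_times(num_requests, qps, seed=0):
    """Return arrival offsets (seconds) for requests at an average rate of `qps`."""
    if math.isinf(qps):
        # QPS = inf: every request arrives at time zero.
        return [0.0] * num_requests
    rng = random.Random(seed)  # fixed seed keeps the schedule reproducible
    now = 0.0
    times = []
    for _ in range(num_requests):
        # Draw the gap to the next request from an exponential distribution.
        now += rng.expovariate(qps)
        times.append(now)
    return times

schedule = arrival_times(200, qps=4, seed=0)
burst = arrival_times(200, qps=float("inf"))
print(f"last arrival at {schedule[-1]:.1f}s for an average of 4 QPS")
```

Because the seed is fixed, every run replays the same schedule, which keeps results comparable across builds.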
| Test name | GPU | Tput (req/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean ITL (ms) | Median ITL (ms) | P99 ITL (ms) |
|---|---|---|---|---|---|---|---|---|
| serving_llama70B_tp4_sharegpt_qps_1 | A100-SXM4-80GB (×8) | 0.952471 | 115.311 | 91.0313 | 265.056 | 33.5852 | 30.7201 | 97.3908 |
| serving_llama70B_tp4_sharegpt_qps_16 | A100-SXM4-80GB (×8) | 4.80791 | 739.389 | 763.067 | 1486.83 | 58.6988 | 44.9026 | 247.291 |
| serving_llama70B_tp4_sharegpt_qps_4 | A100-SXM4-80GB (×8) | 2.91433 | 150.67 | 128.708 | 380.441 | 42.4353 | 35.2233 | 111.432 |
| serving_llama70B_tp4_sharegpt_qps_inf | A100-SXM4-80GB (×8) | 4.83096 | 6647.28 | 6483.58 | 13063.6 | 59.6783 | 45.1585 | 238.631 |
| serving_llama70B_tp4_sharegpt_specdecode_qps_2 | A100-SXM4-80GB (×8) | 1.56388 | 119.466 | 104.25 | 253.872 | 60.8368 | 50.1386 | 240.037 |
| serving_llama8B_tp1_sharegpt_qps_1 | A100-SXM4-80GB (×8) | 0.99686 | 41.2431 | 33.9182 | 86.1282 | 12.3817 | 11.9783 | 30.4699 |
| serving_llama8B_tp1_sharegpt_qps_16 | A100-SXM4-80GB (×8) | 8.78881 | 85.7946 | 67.627 | 341.346 | 21.2062 | 18.8506 | 44.4968 |
| serving_llama8B_tp1_sharegpt_qps_4 | A100-SXM4-80GB (×8) | 3.59462 | 44.0776 | 36.9963 | 88.3449 | 13.9754 | 12.9534 | 32.9243 |
| serving_llama8B_tp1_sharegpt_qps_inf | A100-SXM4-80GB (×8) | 11.1166 | 2343.4 | 2269.15 | 4494.72 | 26.1787 | 21.8317 | 50.6188 |
| serving_mixtral8x7B_tp2_sharegpt_qps_1 | A100-SXM4-80GB (×8) | 0.950567 | 322.814 | 63.0845 | 3074.9 | 32.3384 | 28.2235 | 63.3446 |
| serving_mixtral8x7B_tp2_sharegpt_qps_16 | A100-SXM4-80GB (×8) | 4.65562 | 100.619 | 96.0579 | 206.358 | 47.6891 | 36.4703 | 295.411 |
| serving_mixtral8x7B_tp2_sharegpt_qps_4 | A100-SXM4-80GB (×8) | 2.88023 | 73.6468 | 67.235 | 140.869 | 38.7074 | 33.7848 | 110.65 |
| serving_mixtral8x7B_tp2_sharegpt_qps_inf | A100-SXM4-80GB (×8) | 5.47425 | 2974.18 | 2614.49 | 3604.41 | 40.8073 | 38.1229 | 254.489 |
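
For readers unfamiliar with the TTFT and ITL columns, the sketch below shows one way the two quantities could be derived for a single request from the time it was sent and the times its output tokens arrived. The function name `ttft_and_itl` and the timestamps are hypothetical; the benchmark's own measurement code may differ in detail.

```python
def ttft_and_itl(send_time, token_times):
    """Return (TTFT, list of ITLs) in the same time unit as the inputs."""
    if not token_times:
        raise ValueError("need at least one token timestamp")
    # TTFT: delay between sending the request and receiving the first token.
    ttft = token_times[0] - send_time
    # ITL: gap between each pair of consecutive tokens.
    itl = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return ttft, itl

# Hypothetical timestamps in milliseconds, relative to the send time.
ttft, itl = ttft_and_itl(0.0, [40.0, 52.0, 64.5, 76.5])
print(ttft, itl)
```

The mean, median, and p99 in the table are then taken over these per-request TTFT values and per-token ITL values across all 200 requests.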

JSON version of the benchmarking tables

This section contains the data of the markdown tables above in JSON format. You can load the benchmarking tables into pandas dataframes as follows:

import json
import pandas as pd

# Placeholder: replace with the JSON string listed below.
benchmarking_results_json = """The json string"""
benchmarking_results = json.loads(benchmarking_results_json)
latency_results = pd.DataFrame.from_dict(benchmarking_results["latency"])
throughput_results = pd.DataFrame.from_dict(benchmarking_results["throughput"])
serving_results = pd.DataFrame.from_dict(benchmarking_results["serving"])

The JSON string for all benchmarking tables:

{"latency": {"Test name": {"0": "latency_llama8B_tp1", "1": "latency_llama70B_tp4", "2": "latency_mixtral8x7B_tp2"}, "GPU": {"0": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "1": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "2": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB"}, "Mean latency (ms)": {"0": 1579.2741344310343, "1": 3990.5927983112633, "2": 3643.844443745911}, "Median latency (ms)": {"0": 1579.3387778103352, "1": 3990.386940073222, "2": 3648.910060059279}, "P99 latency (ms)": {"0": 1579.8685044329613, "1": 3991.868680436164, "2": 3680.155389076099}}, "throughput": {"Test name": {"0": "throughput_llama8B_tp1", "1": "throughput_llama70B_tp4", "2": "throughput_mixtral8x7B_tp2"}, "GPU": {"0": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "1": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "2": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB"}, "Tput (req/s)": {"0": 11.056650025301797, "1": 4.788594455549321, "2": 5.209599610769483}}, "serving": {"Test name": {"0": "serving_llama8B_tp1_sharegpt_qps_1", "1": "serving_llama8B_tp1_sharegpt_qps_4", "2": "serving_llama8B_tp1_sharegpt_qps_16", "3": "serving_llama8B_tp1_sharegpt_qps_inf", "4": "serving_llama70B_tp4_sharegpt_qps_1", "5": "serving_llama70B_tp4_sharegpt_qps_4", "6": "serving_llama70B_tp4_sharegpt_qps_16", "7": "serving_llama70B_tp4_sharegpt_qps_inf", "8": "serving_mixtral8x7B_tp2_sharegpt_qps_1", "9": "serving_mixtral8x7B_tp2_sharegpt_qps_4", "10": "serving_mixtral8x7B_tp2_sharegpt_qps_16", "11": 
"serving_mixtral8x7B_tp2_sharegpt_qps_inf", "12": "serving_llama70B_tp4_sharegpt_specdecode_qps_2"}, "GPU": {"0": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "1": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "2": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "3": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "4": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "5": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "6": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "7": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "8": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "9": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "10": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "11": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "12": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB"}, "Tput (req/s)": {"0": 0.996860105131725, "1": 3.594616274181651, "2": 8.788809990815022, "3": 11.116569511102174, "4": 
0.9524712318177595, "5": 2.9143259428791333, "6": 4.807908351065189, "7": 4.830957409318412, "8": 0.9505671381170304, "9": 2.880227262737138, "10": 4.655617720115071, "11": 5.474248795427025, "12": 1.5638805341960242}, "Mean TTFT (ms)": {"0": 41.243121719453484, "1": 44.077623966149986, "2": 85.79455303959548, "3": 2343.400535401888, "4": 115.31116942409426, "5": 150.67011817125604, "6": 739.3892058706842, "7": 6647.2774839098565, "8": 322.8140198113397, "9": 73.64680503029376, "10": 100.61861591646448, "11": 2974.1790991905145, "12": 119.46559221690107}, "Median TTFT (ms)": {"0": 33.91816699877381, "1": 36.99628356844187, "2": 67.62702250853181, "3": 2269.1522729583085, "4": 91.0313033964485, "5": 128.70792299509048, "6": 763.0671088118106, "7": 6483.580605825409, "8": 63.08454996906221, "9": 67.23504257388413, "10": 96.05789557099342, "11": 2614.4928943831474, "12": 104.24970323219895}, "P99 TTFT (ms)": {"0": 86.12816051114349, "1": 88.34485004656015, "2": 341.3459550682454, "3": 4494.72184590064, "4": 265.0556325260538, "5": 380.4409689595922, "6": 1486.8260128656395, "7": 13063.584973793477, "8": 3074.8974635964246, "9": 140.8694759849459, "10": 206.35831077583092, "11": 3604.4086371082813, "12": 253.8721384108068}, "Mean ITL (ms)": {"0": 12.381713669474479, "1": 13.97535663168982, "2": 21.206160098427702, "3": 26.17872029052179, "4": 33.585217624787845, "5": 42.435285622622, "6": 58.698767704667674, "7": 59.67830900113209, "8": 32.33838742374971, "9": 38.70738570247557, "10": 47.689148928676595, "11": 40.80733728618981, "12": 60.83681516647543}, "Median ITL (ms)": {"0": 11.978331953287125, "1": 12.953380355611444, "2": 18.850606866180897, "3": 21.831671707332134, "4": 30.72014870122075, "5": 35.22330219857395, "6": 44.90255401469767, "7": 45.15848867595196, "8": 28.22347404435277, "9": 33.78476295620203, "10": 36.47025674581528, "11": 38.12288399785757, "12": 50.13856524601579}, "P99 ITL (ms)": {"0": 30.469923452474195, "1": 32.9242795240134, "2": 
44.4967800006271, "3": 50.61877420172099, "4": 97.39084912464023, "5": 111.43196043092756, "6": 247.29101412929384, "7": 238.63066593185067, "8": 63.34455063566565, "9": 110.65045634284614, "10": 295.41056528687477, "11": 254.4893297180533, "12": 240.03740200772882}}}
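
The JSON above is column-oriented (`{column: {row index: value}}`), and each "GPU" cell repeats the device name on several newline-separated lines. If pandas is not available, a rough sketch like the following turns one table into row records and collapses the GPU cell. The helper `to_records` and the added "GPU count" field are hypothetical, and the excerpt is shortened to two GPU entries for readability.

```python
import json

def to_records(table):
    """Turn {column: {row_id: value}} into a list of row dicts, ordered by row id."""
    # Row ids are strings such as "0", "1", ..., "12"; sort them numerically.
    row_ids = sorted(next(iter(table.values())), key=int)
    records = []
    for row_id in row_ids:
        row = {column: values[row_id] for column, values in table.items()}
        # Collapse the repeated device name into a name plus a repeat count.
        gpus = row["GPU"].split("\n")
        row["GPU"] = gpus[0]
        row["GPU count"] = len(gpus)
        records.append(row)
    return records

# Shortened excerpt of the "throughput" table (GPU cell trimmed to two entries).
excerpt_json = r'{"Test name": {"0": "throughput_llama8B_tp1"}, "GPU": {"0": "A100-SXM4-80GB\nA100-SXM4-80GB"}, "Tput (req/s)": {"0": 11.056650025301797}}'
records = to_records(json.loads(excerpt_json))
print(records)
```

Applied to the full string, the same function would be called once per table, for example `to_records(benchmarking_results["serving"])`.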

You can also check the raw experiment data in the Artifact tab of the Buildkite page.

Latency tests

  • Input length: 32 tokens.
  • Output length: 128 tokens.
  • Batch size: fixed (8).
  • Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
  • Evaluation metrics: end-to-end latency (mean, median, p99).
| Test name | GPU | Mean latency (ms) | Median latency (ms) | P99 latency (ms) |
|---|---|---|---|---|
| latency_llama70B_tp4 | H200 (×7) | 2056.3 | 2056.57 | 2057.75 |
| latency_llama8B_tp1 | H200 (×7) | 827.381 | 827.391 | 827.806 |
| latency_mixtral8x7B_tp2 | H200 (×7) | 1887.24 | 1887.99 | 1898.83 |

Throughput tests

  • Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
  • Output length: the output lengths corresponding to these 200 prompts.
  • Batch size: dynamically determined by vLLM to achieve maximum throughput.
  • Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
  • Evaluation metrics: throughput.
| Test name | GPU | Tput (req/s) |
|---|---|---|
| throughput_llama70B_tp4 | H200 (×7) | 9.93361 |
| throughput_llama8B_tp1 | H200 (×7) | 21.7857 |
| throughput_mixtral8x7B_tp2 | H200 (×7) | 8.10596 |

Serving tests

  • Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
  • Output length: the output lengths corresponding to these 200 prompts.
  • Batch size: dynamically determined by vLLM and the arrival pattern of the requests.
  • Average QPS (queries per second): 1, 4, 16, and inf. QPS = inf means all requests arrive at once. For other QPS values, the arrival time of each query is drawn from a Poisson process (with a fixed random seed).
  • Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
  • A speculative decoding test for llama-3 70B at QPS 2 is also included.
  • Evaluation metrics: throughput, TTFT (time to first token; mean, median, and p99), and ITL (inter-token latency; mean, median, and p99).
| Test name | GPU | Tput (req/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean ITL (ms) | Median ITL (ms) | P99 ITL (ms) |
|---|---|---|---|---|---|---|---|---|
| serving_llama70B_tp4_sharegpt_qps_1 | H200 (×7) | 0.98987 | 53.2916 | 47.3446 | 106.249 | 16.2839 | 15.8502 | 43.3427 |
| serving_llama70B_tp4_sharegpt_qps_16 | H200 (×7) | 8.09794 | 94.7277 | 86.1779 | 192.687 | 24.0539 | 19.6597 | 58.118 |
| serving_llama70B_tp4_sharegpt_qps_4 | H200 (×7) | 3.45418 | 60.4077 | 52.0092 | 112.608 | 18.1923 | 16.5399 | 46.1188 |
| serving_llama70B_tp4_sharegpt_qps_inf | H200 (×7) | 10.0194 | 2888.7 | 2884.22 | 5482.37 | 27.5007 | 22.1666 | 135.618 |
| serving_llama70B_tp4_sharegpt_specdecode_qps_2 | H200 (×7) | 1.64944 | 56.3498 | 51.39 | 102.777 | 29.6755 | 27.511 | 89.5286 |
| serving_llama8B_tp1_sharegpt_qps_1 | H200 (×7) | 1.00735 | 20.1967 | 18.4575 | 38.4603 | 6.22192 | 6.16619 | 6.67523 |
| serving_llama8B_tp1_sharegpt_qps_16 | H200 (×7) | 12.2262 | 34.0159 | 26.892 | 211.586 | 8.38521 | 7.70806 | 17.9849 |
| serving_llama8B_tp1_sharegpt_qps_4 | H200 (×7) | 3.8696 | 22.7002 | 20.7864 | 40.8221 | 6.55541 | 6.37116 | 16.2025 |
| serving_llama8B_tp1_sharegpt_qps_inf | H200 (×7) | 22.8429 | 1110.05 | 1078.07 | 2069.75 | 12.1618 | 10.7576 | 22.6059 |
| serving_mixtral8x7B_tp2_sharegpt_qps_1 | H200 (×7) | 0.995005 | 42.7231 | 36.708 | 77.6502 | 12.7192 | 11.6729 | 30.5151 |
| serving_mixtral8x7B_tp2_sharegpt_qps_16 | H200 (×7) | 6.79836 | 65.0607 | 53.8123 | 301.992 | 29.9248 | 23.9852 | 165.223 |
| serving_mixtral8x7B_tp2_sharegpt_qps_4 | H200 (×7) | 3.29045 | 47.8036 | 42.7779 | 94.4985 | 23.0714 | 20.8724 | 60.4942 |
| serving_mixtral8x7B_tp2_sharegpt_qps_inf | H200 (×7) | 8.31582 | 2332.3 | 2091.51 | 2799.8 | 27.3485 | 24.7137 | 150.142 |

JSON version of the benchmarking tables

This section contains the data of the markdown tables above in JSON format. You can load the benchmarking tables into pandas dataframes as follows:

import json
import pandas as pd

# Placeholder: replace with the JSON string listed below.
benchmarking_results_json = """The json string"""
benchmarking_results = json.loads(benchmarking_results_json)
latency_results = pd.DataFrame.from_dict(benchmarking_results["latency"])
throughput_results = pd.DataFrame.from_dict(benchmarking_results["throughput"])
serving_results = pd.DataFrame.from_dict(benchmarking_results["serving"])

The JSON string for all benchmarking tables:

{"latency": {"Test name": {"0": "latency_llama8B_tp1", "1": "latency_llama70B_tp4", "2": "latency_mixtral8x7B_tp2"}, "GPU": {"0": "H200\nH200\nH200\nH200\nH200\nH200\nH200", "1": "H200\nH200\nH200\nH200\nH200\nH200\nH200", "2": "H200\nH200\nH200\nH200\nH200\nH200\nH200"}, "Mean latency (ms)": {"0": 827.3810388830801, "1": 2056.299947978308, "2": 1887.2353067621589}, "Median latency (ms)": {"0": 827.3906968533993, "1": 2056.572403293103, "2": 1887.9943192005157}, "P99 latency (ms)": {"0": 827.8058347199112, "1": 2057.7467997930944, "2": 1898.8275077100843}}, "throughput": {"Test name": {"0": "throughput_llama8B_tp1", "1": "throughput_mixtral8x7B_tp2", "2": "throughput_llama70B_tp4"}, "GPU": {"0": "H200\nH200\nH200\nH200\nH200\nH200\nH200", "1": "H200\nH200\nH200\nH200\nH200\nH200\nH200", "2": "H200\nH200\nH200\nH200\nH200\nH200\nH200"}, "Tput (req/s)": {"0": 21.785715056643895, "1": 8.105962843796984, "2": 9.93361061168471}}, "serving": {"Test name": {"0": "serving_llama70B_tp4_sharegpt_qps_1", "1": "serving_mixtral8x7B_tp2_sharegpt_qps_inf", "2": "serving_llama70B_tp4_sharegpt_qps_16", "3": "serving_llama70B_tp4_sharegpt_qps_inf", "4": "serving_mixtral8x7B_tp2_sharegpt_qps_1", "5": "serving_llama8B_tp1_sharegpt_qps_1", "6": "serving_llama8B_tp1_sharegpt_qps_4", "7": "serving_llama8B_tp1_sharegpt_qps_16", "8": "serving_llama8B_tp1_sharegpt_qps_inf", "9": "serving_llama70B_tp4_sharegpt_qps_4", "10": "serving_mixtral8x7B_tp2_sharegpt_qps_16", "11": "serving_mixtral8x7B_tp2_sharegpt_qps_4", "12": "serving_llama70B_tp4_sharegpt_specdecode_qps_2"}, "GPU": {"0": "H200\nH200\nH200\nH200\nH200\nH200\nH200", "1": "H200\nH200\nH200\nH200\nH200\nH200\nH200", "2": "H200\nH200\nH200\nH200\nH200\nH200\nH200", "3": "H200\nH200\nH200\nH200\nH200\nH200\nH200", "4": "H200\nH200\nH200\nH200\nH200\nH200\nH200", "5": "H200\nH200\nH200\nH200\nH200\nH200\nH200", "6": "H200\nH200\nH200\nH200\nH200\nH200\nH200", "7": "H200\nH200\nH200\nH200\nH200\nH200\nH200", "8": 
"H200\nH200\nH200\nH200\nH200\nH200\nH200", "9": "H200\nH200\nH200\nH200\nH200\nH200\nH200", "10": "H200\nH200\nH200\nH200\nH200\nH200\nH200", "11": "H200\nH200\nH200\nH200\nH200\nH200\nH200", "12": "H200\nH200\nH200\nH200\nH200\nH200\nH200"}, "Tput (req/s)": {"0": 0.989869544917641, "1": 8.315815061981844, "2": 8.097937467696065, "3": 10.019409125454105, "4": 0.9950054179188688, "5": 1.0073450423297847, "6": 3.869603964002038, "7": 12.22616499511483, "8": 22.84287061225969, "9": 3.4541822652822773, "10": 6.798364394315366, "11": 3.2904535822972307, "12": 1.6494358131763422}, "Mean TTFT (ms)": {"0": 53.29161213012412, "1": 2332.3012875579298, "2": 94.72772079752758, "3": 2888.69542918168, "4": 42.72309093736112, "5": 20.196737172082067, "6": 22.700216334778816, "7": 34.015931582544, "8": 1110.04889261676, "9": 60.40773471817374, "10": 65.06069403840229, "11": 47.80358461663127, "12": 56.349760617151674}, "Median TTFT (ms)": {"0": 47.3445791285485, "1": 2091.510316124186, "2": 86.17787552066147, "3": 2884.2243475373834, "4": 36.707976600155234, "5": 18.457521684467793, "6": 20.786371314898133, "7": 26.89200546592474, "8": 1078.0729707330465, "9": 52.00918857008219, "10": 53.81231079809368, "11": 42.777938302606344, "12": 51.39000457711518}, "P99 TTFT (ms)": {"0": 106.24904086813329, "1": 2799.799300944432, "2": 192.6872948743402, "3": 5482.372874026187, "4": 77.65017175581303, "5": 38.4603352844715, "6": 40.822145123966024, "7": 211.5863132150842, "8": 2069.7474440047517, "9": 112.60767419822496, "10": 301.9915889855472, "11": 94.49848417192695, "12": 102.77746587526059}, "Mean ITL (ms)": {"0": 16.283852706568087, "1": 27.348534071067597, "2": 24.053919213969603, "3": 27.500667837211765, "4": 12.719157753153393, "5": 6.221918908643021, "6": 6.555407773585138, "7": 8.385213129696478, "8": 12.161835644124064, "9": 18.192300515159907, "10": 29.924834822187712, "11": 23.071353249801486, "12": 29.6755464805204}, "Median ITL (ms)": {"0": 15.85016818717122, "1": 
24.713678751140833, "2": 19.659725483506918, "3": 22.166555281728506, "4": 11.67289912700653, "5": 6.166191538795829, "6": 6.371156312525272, "7": 7.708064513280988, "8": 10.757617419585586, "9": 16.539912670850754, "10": 23.985240142792463, "11": 20.872429944574833, "12": 27.51104603521526}, "P99 ITL (ms)": {"0": 43.342703185044236, "1": 150.1419538911432, "2": 58.11802428215748, "3": 135.61838977504522, "4": 30.515092369168997, "5": 6.675234157592062, "6": 16.202509100548923, "7": 17.984890309162438, "8": 22.605922631919384, "9": 46.118766949512064, "10": 165.22320011630654, "11": 60.49416142515838, "12": 89.52857794705778}}}

You can also check the raw experiment data in the Artifact tab of the Buildkite page.
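
Since this page carries one report per GPU type, the two can be compared directly. The sketch below divides the A100 mean latency by the H200 mean latency for each test, using the rounded values from the two latency tables; the helper name `speedup` is hypothetical, and the ratios describe only this build, not the hardware in general.

```python
# Mean latency (ms), copied from the A100 and H200 latency tables on this page.
a100_mean_latency_ms = {
    "latency_llama8B_tp1": 1579.27,
    "latency_llama70B_tp4": 3990.59,
    "latency_mixtral8x7B_tp2": 3643.84,
}
h200_mean_latency_ms = {
    "latency_llama8B_tp1": 827.381,
    "latency_llama70B_tp4": 2056.3,
    "latency_mixtral8x7B_tp2": 1887.24,
}

def speedup(baseline, candidate):
    """Return baseline / candidate for every test present in both reports."""
    return {name: baseline[name] / candidate[name]
            for name in baseline if name in candidate}

ratios = speedup(a100_mean_latency_ms, h200_mean_latency_ms)
for name, ratio in sorted(ratios.items()):
    print(f"{name}: {ratio:.2f}x lower mean latency on H200")
```

The same function works for throughput if the arguments are swapped, because higher is better there.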

  • bootstrap: curl -sSL https://raw.githubusercontent.com/vllm-project/buildkite-ci/main/scripts/kickoff-benchmark.sh | bash (waited 47s, ran in 11s)
  • Wait for container to be ready (waited 31s, ran in 20m 17s)
  • A100 (waited 10h 52m, ran in 40m 54s)
  • H200 (waited 8s, ran in 33m 40s)

Total Job Run Time: 1h 35m