🐎 Performance Benchmark
Passed in 11h 53m
Latency tests
- Input length: 32 tokens.
- Output length: 128 tokens.
- Batch size: fixed (8).
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- Evaluation metrics: end-to-end latency (mean, median, p99).
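The latency columns can be derived from raw per-iteration samples; a minimal sketch using Python's standard `statistics` module (the sample values below are illustrative, not taken from this run):

```python
import statistics

# Illustrative sketch: how the mean/median/p99 columns are derived from raw
# per-iteration end-to-end latencies. The sample values are made up.
latencies_ms = [1578.8, 1579.1, 1579.3, 1579.4, 1579.9]

mean_ms = statistics.mean(latencies_ms)
median_ms = statistics.median(latencies_ms)
# 99th percentile = the 98th of 99 cut points (inclusive interpolation).
p99_ms = statistics.quantiles(latencies_ms, n=100, method="inclusive")[98]

print(f"mean={mean_ms:.2f} ms, median={median_ms:.2f} ms, p99={p99_ms:.2f} ms")
```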
Test name | GPU | Mean latency (ms) | Median latency (ms) | P99 latency (ms) |
---|---|---|---|---|
latency_llama70B_tp4 | A100-SXM4-80GB | 3990.59 | 3990.39 | 3991.87 |
latency_llama8B_tp1 | A100-SXM4-80GB | 1579.27 | 1579.34 | 1579.87 |
latency_mixtral8x7B_tp2 | A100-SXM4-80GB | 3643.84 | 3648.91 | 3680.16 |
Throughput tests
- Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
- Output length: the output length recorded for each of these 200 prompts.
- Batch size: dynamically determined by vLLM to achieve maximum throughput.
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- Evaluation metrics: throughput.
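Since all sampled prompts are submitted as one batch job, the throughput metric reduces to completed requests over wall-clock time; a minimal sketch (the elapsed time is an assumed example, chosen to roughly match the llama-3.1 8B row):

```python
# Illustrative sketch: throughput is simply completed requests divided by
# wall-clock time. The elapsed time here is an assumed example, not a
# measurement from this run.
num_requests = 200          # prompts sampled from ShareGPT
elapsed_s = 18.09           # assumed wall-clock time for the whole batch
tput_req_per_s = num_requests / elapsed_s
print(f"{tput_req_per_s:.2f} req/s")
```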
Test name | GPU | Tput (req/s) |
---|---|---|
throughput_llama70B_tp4 | A100-SXM4-80GB | 4.78859 |
throughput_llama8B_tp1 | A100-SXM4-80GB | 11.0567 |
throughput_mixtral8x7B_tp2 | A100-SXM4-80GB | 5.2096 |
Serving tests
- Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
- Output length: the output length recorded for each of these 200 prompts.
- Batch size: dynamically determined by vLLM and the arrival pattern of the requests.
- Average QPS (queries per second): 1, 4, 16 and inf. QPS = inf means all requests arrive at once; for other QPS values, the arrival time of each query is drawn from a Poisson process (with a fixed random seed).
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- We also added a speculative decoding test for llama-3 70B under QPS 2.
- Evaluation metrics: throughput, TTFT (time to first token; mean, median and p99), ITL (inter-token latency; mean, median and p99).
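A Poisson arrival process at a target QPS is equivalent to exponential inter-arrival gaps with mean 1/QPS; a hypothetical sketch of how such arrival times could be generated (function name and seeding are illustrative, not vLLM's implementation):

```python
import random

# Hypothetical sketch (not vLLM's implementation): Poisson arrivals at an
# average QPS correspond to exponential inter-arrival gaps with mean 1/QPS.
# A fixed seed makes the arrival pattern reproducible across runs.
def arrival_times(num_requests, qps, seed=0):
    if qps == float("inf"):
        # QPS = inf: every request arrives at t = 0, i.e. all at once.
        return [0.0] * num_requests
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(num_requests):
        t += rng.expovariate(qps)  # mean gap of 1/qps seconds
        times.append(t)
    return times

times = arrival_times(200, qps=4)  # ~200 requests spread over ~50 s on average
```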
Test name | GPU | Tput (req/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean ITL (ms) | Median ITL (ms) | P99 ITL (ms) |
---|---|---|---|---|---|---|---|---|
serving_llama70B_tp4_sharegpt_qps_1 | A100-SXM4-80GB | 0.952471 | 115.311 | 91.0313 | 265.056 | 33.5852 | 30.7201 | 97.3908 |
serving_llama70B_tp4_sharegpt_qps_16 | A100-SXM4-80GB | 4.80791 | 739.389 | 763.067 | 1486.83 | 58.6988 | 44.9026 | 247.291 |
serving_llama70B_tp4_sharegpt_qps_4 | A100-SXM4-80GB | 2.91433 | 150.67 | 128.708 | 380.441 | 42.4353 | 35.2233 | 111.432 |
serving_llama70B_tp4_sharegpt_qps_inf | A100-SXM4-80GB | 4.83096 | 6647.28 | 6483.58 | 13063.6 | 59.6783 | 45.1585 | 238.631 |
serving_llama70B_tp4_sharegpt_specdecode_qps_2 | A100-SXM4-80GB | 1.56388 | 119.466 | 104.25 | 253.872 | 60.8368 | 50.1386 | 240.037 |
serving_llama8B_tp1_sharegpt_qps_1 | A100-SXM4-80GB | 0.99686 | 41.2431 | 33.9182 | 86.1282 | 12.3817 | 11.9783 | 30.4699 |
serving_llama8B_tp1_sharegpt_qps_16 | A100-SXM4-80GB | 8.78881 | 85.7946 | 67.627 | 341.346 | 21.2062 | 18.8506 | 44.4968 |
serving_llama8B_tp1_sharegpt_qps_4 | A100-SXM4-80GB | 3.59462 | 44.0776 | 36.9963 | 88.3449 | 13.9754 | 12.9534 | 32.9243 |
serving_llama8B_tp1_sharegpt_qps_inf | A100-SXM4-80GB | 11.1166 | 2343.4 | 2269.15 | 4494.72 | 26.1787 | 21.8317 | 50.6188 |
serving_mixtral8x7B_tp2_sharegpt_qps_1 | A100-SXM4-80GB | 0.950567 | 322.814 | 63.0845 | 3074.9 | 32.3384 | 28.2235 | 63.3446 |
serving_mixtral8x7B_tp2_sharegpt_qps_16 | A100-SXM4-80GB | 4.65562 | 100.619 | 96.0579 | 206.358 | 47.6891 | 36.4703 | 295.411 |
serving_mixtral8x7B_tp2_sharegpt_qps_4 | A100-SXM4-80GB | 2.88023 | 73.6468 | 67.235 | 140.869 | 38.7074 | 33.7848 | 110.65 |
serving_mixtral8x7B_tp2_sharegpt_qps_inf | A100-SXM4-80GB | 5.47425 | 2974.18 | 2614.49 | 3604.41 | 40.8073 | 38.1229 | 254.489 |
JSON version of the benchmarking tables
This section contains the data of the markdown tables above in JSON format. You can load the benchmarking tables into pandas dataframes as follows:
import json
import pandas as pd
benchmarking_results_json = """The json string"""
benchmarking_results = json.loads(benchmarking_results_json)
latency_results = pd.DataFrame.from_dict(benchmarking_results["latency"])
throughput_results = pd.DataFrame.from_dict(benchmarking_results["throughput"])
serving_results = pd.DataFrame.from_dict(benchmarking_results["serving"])
The json string for all benchmarking tables:
{"latency": {"Test name": {"0": "latency_llama8B_tp1", "1": "latency_llama70B_tp4", "2": "latency_mixtral8x7B_tp2"}, "GPU": {"0": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "1": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "2": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB"}, "Mean latency (ms)": {"0": 1579.2741344310343, "1": 3990.5927983112633, "2": 3643.844443745911}, "Median latency (ms)": {"0": 1579.3387778103352, "1": 3990.386940073222, "2": 3648.910060059279}, "P99 latency (ms)": {"0": 1579.8685044329613, "1": 3991.868680436164, "2": 3680.155389076099}}, "throughput": {"Test name": {"0": "throughput_llama8B_tp1", "1": "throughput_llama70B_tp4", "2": "throughput_mixtral8x7B_tp2"}, "GPU": {"0": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "1": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "2": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB"}, "Tput (req/s)": {"0": 11.056650025301797, "1": 4.788594455549321, "2": 5.209599610769483}}, "serving": {"Test name": {"0": "serving_llama8B_tp1_sharegpt_qps_1", "1": "serving_llama8B_tp1_sharegpt_qps_4", "2": "serving_llama8B_tp1_sharegpt_qps_16", "3": "serving_llama8B_tp1_sharegpt_qps_inf", "4": "serving_llama70B_tp4_sharegpt_qps_1", "5": "serving_llama70B_tp4_sharegpt_qps_4", "6": "serving_llama70B_tp4_sharegpt_qps_16", "7": "serving_llama70B_tp4_sharegpt_qps_inf", "8": "serving_mixtral8x7B_tp2_sharegpt_qps_1", "9": "serving_mixtral8x7B_tp2_sharegpt_qps_4", "10": "serving_mixtral8x7B_tp2_sharegpt_qps_16", "11": 
"serving_mixtral8x7B_tp2_sharegpt_qps_inf", "12": "serving_llama70B_tp4_sharegpt_specdecode_qps_2"}, "GPU": {"0": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "1": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "2": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "3": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "4": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "5": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "6": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "7": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "8": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "9": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "10": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "11": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "12": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB"}, "Tput (req/s)": {"0": 0.996860105131725, "1": 3.594616274181651, "2": 8.788809990815022, "3": 11.116569511102174, "4": 
0.9524712318177595, "5": 2.9143259428791333, "6": 4.807908351065189, "7": 4.830957409318412, "8": 0.9505671381170304, "9": 2.880227262737138, "10": 4.655617720115071, "11": 5.474248795427025, "12": 1.5638805341960242}, "Mean TTFT (ms)": {"0": 41.243121719453484, "1": 44.077623966149986, "2": 85.79455303959548, "3": 2343.400535401888, "4": 115.31116942409426, "5": 150.67011817125604, "6": 739.3892058706842, "7": 6647.2774839098565, "8": 322.8140198113397, "9": 73.64680503029376, "10": 100.61861591646448, "11": 2974.1790991905145, "12": 119.46559221690107}, "Median TTFT (ms)": {"0": 33.91816699877381, "1": 36.99628356844187, "2": 67.62702250853181, "3": 2269.1522729583085, "4": 91.0313033964485, "5": 128.70792299509048, "6": 763.0671088118106, "7": 6483.580605825409, "8": 63.08454996906221, "9": 67.23504257388413, "10": 96.05789557099342, "11": 2614.4928943831474, "12": 104.24970323219895}, "P99 TTFT (ms)": {"0": 86.12816051114349, "1": 88.34485004656015, "2": 341.3459550682454, "3": 4494.72184590064, "4": 265.0556325260538, "5": 380.4409689595922, "6": 1486.8260128656395, "7": 13063.584973793477, "8": 3074.8974635964246, "9": 140.8694759849459, "10": 206.35831077583092, "11": 3604.4086371082813, "12": 253.8721384108068}, "Mean ITL (ms)": {"0": 12.381713669474479, "1": 13.97535663168982, "2": 21.206160098427702, "3": 26.17872029052179, "4": 33.585217624787845, "5": 42.435285622622, "6": 58.698767704667674, "7": 59.67830900113209, "8": 32.33838742374971, "9": 38.70738570247557, "10": 47.689148928676595, "11": 40.80733728618981, "12": 60.83681516647543}, "Median ITL (ms)": {"0": 11.978331953287125, "1": 12.953380355611444, "2": 18.850606866180897, "3": 21.831671707332134, "4": 30.72014870122075, "5": 35.22330219857395, "6": 44.90255401469767, "7": 45.15848867595196, "8": 28.22347404435277, "9": 33.78476295620203, "10": 36.47025674581528, "11": 38.12288399785757, "12": 50.13856524601579}, "P99 ITL (ms)": {"0": 30.469923452474195, "1": 32.9242795240134, "2": 
44.4967800006271, "3": 50.61877420172099, "4": 97.39084912464023, "5": 111.43196043092756, "6": 247.29101412929384, "7": 238.63066593185067, "8": 63.34455063566565, "9": 110.65045634284614, "10": 295.41056528687477, "11": 254.4893297180533, "12": 240.03740200772882}}}
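Once loaded, the dataframes can be sliced like any other; a hypothetical example that isolates the QPS = inf serving rows (a small stand-in frame with values copied from the table above keeps the snippet self-contained):

```python
import pandas as pd

# Hypothetical follow-up, assuming serving_results was loaded as shown
# earlier; a tiny stand-in frame (values copied from the serving table)
# is used here so the snippet runs on its own.
serving_results = pd.DataFrame({
    "Test name": ["serving_llama8B_tp1_sharegpt_qps_1",
                  "serving_llama8B_tp1_sharegpt_qps_inf"],
    "Tput (req/s)": [0.99686, 11.1166],
    "P99 TTFT (ms)": [86.1282, 4494.72],
})

# Isolate the QPS = inf rows to inspect behaviour under peak load.
peak = serving_results[serving_results["Test name"].str.endswith("qps_inf")]
print(peak)
```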
You can also check the raw experiment data in the Artifact tab of the Buildkite page.
Latency tests
- Input length: 32 tokens.
- Output length: 128 tokens.
- Batch size: fixed (8).
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- Evaluation metrics: end-to-end latency (mean, median, p99).
Test name | GPU | Mean latency (ms) | Median latency (ms) | P99 latency (ms) |
---|---|---|---|---|
latency_llama70B_tp4 | H200 | 2056.3 | 2056.57 | 2057.75 |
latency_llama8B_tp1 | H200 | 827.381 | 827.391 | 827.806 |
latency_mixtral8x7B_tp2 | H200 | 1887.24 | 1887.99 | 1898.83 |
Throughput tests
- Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
- Output length: the output length recorded for each of these 200 prompts.
- Batch size: dynamically determined by vLLM to achieve maximum throughput.
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- Evaluation metrics: throughput.
Test name | GPU | Tput (req/s) |
---|---|---|
throughput_llama70B_tp4 | H200 | 9.93361 |
throughput_llama8B_tp1 | H200 | 21.7857 |
throughput_mixtral8x7B_tp2 | H200 | 8.10596 |
Serving tests
- Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
- Output length: the output length recorded for each of these 200 prompts.
- Batch size: dynamically determined by vLLM and the arrival pattern of the requests.
- Average QPS (queries per second): 1, 4, 16 and inf. QPS = inf means all requests arrive at once; for other QPS values, the arrival time of each query is drawn from a Poisson process (with a fixed random seed).
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- We also added a speculative decoding test for llama-3 70B under QPS 2.
- Evaluation metrics: throughput, TTFT (time to first token; mean, median and p99), ITL (inter-token latency; mean, median and p99).
Test name | GPU | Tput (req/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean ITL (ms) | Median ITL (ms) | P99 ITL (ms) |
---|---|---|---|---|---|---|---|---|
serving_llama70B_tp4_sharegpt_qps_1 | H200 | 0.98987 | 53.2916 | 47.3446 | 106.249 | 16.2839 | 15.8502 | 43.3427 |
serving_llama70B_tp4_sharegpt_qps_16 | H200 | 8.09794 | 94.7277 | 86.1779 | 192.687 | 24.0539 | 19.6597 | 58.118 |
serving_llama70B_tp4_sharegpt_qps_4 | H200 | 3.45418 | 60.4077 | 52.0092 | 112.608 | 18.1923 | 16.5399 | 46.1188 |
serving_llama70B_tp4_sharegpt_qps_inf | H200 | 10.0194 | 2888.7 | 2884.22 | 5482.37 | 27.5007 | 22.1666 | 135.618 |
serving_llama70B_tp4_sharegpt_specdecode_qps_2 | H200 | 1.64944 | 56.3498 | 51.39 | 102.777 | 29.6755 | 27.511 | 89.5286 |
serving_llama8B_tp1_sharegpt_qps_1 | H200 | 1.00735 | 20.1967 | 18.4575 | 38.4603 | 6.22192 | 6.16619 | 6.67523 |
serving_llama8B_tp1_sharegpt_qps_16 | H200 | 12.2262 | 34.0159 | 26.892 | 211.586 | 8.38521 | 7.70806 | 17.9849 |
serving_llama8B_tp1_sharegpt_qps_4 | H200 | 3.8696 | 22.7002 | 20.7864 | 40.8221 | 6.55541 | 6.37116 | 16.2025 |
serving_llama8B_tp1_sharegpt_qps_inf | H200 | 22.8429 | 1110.05 | 1078.07 | 2069.75 | 12.1618 | 10.7576 | 22.6059 |
serving_mixtral8x7B_tp2_sharegpt_qps_1 | H200 | 0.995005 | 42.7231 | 36.708 | 77.6502 | 12.7192 | 11.6729 | 30.5151 |
serving_mixtral8x7B_tp2_sharegpt_qps_16 | H200 | 6.79836 | 65.0607 | 53.8123 | 301.992 | 29.9248 | 23.9852 | 165.223 |
serving_mixtral8x7B_tp2_sharegpt_qps_4 | H200 | 3.29045 | 47.8036 | 42.7779 | 94.4985 | 23.0714 | 20.8724 | 60.4942 |
serving_mixtral8x7B_tp2_sharegpt_qps_inf | H200 | 8.31582 | 2332.3 | 2091.51 | 2799.8 | 27.3485 | 24.7137 | 150.142 |
JSON version of the benchmarking tables
This section contains the data of the markdown tables above in JSON format. You can load the benchmarking tables into pandas dataframes as follows:
import json
import pandas as pd
benchmarking_results_json = """The json string"""
benchmarking_results = json.loads(benchmarking_results_json)
latency_results = pd.DataFrame.from_dict(benchmarking_results["latency"])
throughput_results = pd.DataFrame.from_dict(benchmarking_results["throughput"])
serving_results = pd.DataFrame.from_dict(benchmarking_results["serving"])
The json string for all benchmarking tables:
{"latency": {"Test name": {"0": "latency_llama8B_tp1", "1": "latency_llama70B_tp4", "2": "latency_mixtral8x7B_tp2"}, "GPU": {"0": "H200\nH200\nH200\nH200\nH200\nH200\nH200", "1": "H200\nH200\nH200\nH200\nH200\nH200\nH200", "2": "H200\nH200\nH200\nH200\nH200\nH200\nH200"}, "Mean latency (ms)": {"0": 827.3810388830801, "1": 2056.299947978308, "2": 1887.2353067621589}, "Median latency (ms)": {"0": 827.3906968533993, "1": 2056.572403293103, "2": 1887.9943192005157}, "P99 latency (ms)": {"0": 827.8058347199112, "1": 2057.7467997930944, "2": 1898.8275077100843}}, "throughput": {"Test name": {"0": "throughput_llama8B_tp1", "1": "throughput_mixtral8x7B_tp2", "2": "throughput_llama70B_tp4"}, "GPU": {"0": "H200\nH200\nH200\nH200\nH200\nH200\nH200", "1": "H200\nH200\nH200\nH200\nH200\nH200\nH200", "2": "H200\nH200\nH200\nH200\nH200\nH200\nH200"}, "Tput (req/s)": {"0": 21.785715056643895, "1": 8.105962843796984, "2": 9.93361061168471}}, "serving": {"Test name": {"0": "serving_llama70B_tp4_sharegpt_qps_1", "1": "serving_mixtral8x7B_tp2_sharegpt_qps_inf", "2": "serving_llama70B_tp4_sharegpt_qps_16", "3": "serving_llama70B_tp4_sharegpt_qps_inf", "4": "serving_mixtral8x7B_tp2_sharegpt_qps_1", "5": "serving_llama8B_tp1_sharegpt_qps_1", "6": "serving_llama8B_tp1_sharegpt_qps_4", "7": "serving_llama8B_tp1_sharegpt_qps_16", "8": "serving_llama8B_tp1_sharegpt_qps_inf", "9": "serving_llama70B_tp4_sharegpt_qps_4", "10": "serving_mixtral8x7B_tp2_sharegpt_qps_16", "11": "serving_mixtral8x7B_tp2_sharegpt_qps_4", "12": "serving_llama70B_tp4_sharegpt_specdecode_qps_2"}, "GPU": {"0": "H200\nH200\nH200\nH200\nH200\nH200\nH200", "1": "H200\nH200\nH200\nH200\nH200\nH200\nH200", "2": "H200\nH200\nH200\nH200\nH200\nH200\nH200", "3": "H200\nH200\nH200\nH200\nH200\nH200\nH200", "4": "H200\nH200\nH200\nH200\nH200\nH200\nH200", "5": "H200\nH200\nH200\nH200\nH200\nH200\nH200", "6": "H200\nH200\nH200\nH200\nH200\nH200\nH200", "7": "H200\nH200\nH200\nH200\nH200\nH200\nH200", "8": 
"H200\nH200\nH200\nH200\nH200\nH200\nH200", "9": "H200\nH200\nH200\nH200\nH200\nH200\nH200", "10": "H200\nH200\nH200\nH200\nH200\nH200\nH200", "11": "H200\nH200\nH200\nH200\nH200\nH200\nH200", "12": "H200\nH200\nH200\nH200\nH200\nH200\nH200"}, "Tput (req/s)": {"0": 0.989869544917641, "1": 8.315815061981844, "2": 8.097937467696065, "3": 10.019409125454105, "4": 0.9950054179188688, "5": 1.0073450423297847, "6": 3.869603964002038, "7": 12.22616499511483, "8": 22.84287061225969, "9": 3.4541822652822773, "10": 6.798364394315366, "11": 3.2904535822972307, "12": 1.6494358131763422}, "Mean TTFT (ms)": {"0": 53.29161213012412, "1": 2332.3012875579298, "2": 94.72772079752758, "3": 2888.69542918168, "4": 42.72309093736112, "5": 20.196737172082067, "6": 22.700216334778816, "7": 34.015931582544, "8": 1110.04889261676, "9": 60.40773471817374, "10": 65.06069403840229, "11": 47.80358461663127, "12": 56.349760617151674}, "Median TTFT (ms)": {"0": 47.3445791285485, "1": 2091.510316124186, "2": 86.17787552066147, "3": 2884.2243475373834, "4": 36.707976600155234, "5": 18.457521684467793, "6": 20.786371314898133, "7": 26.89200546592474, "8": 1078.0729707330465, "9": 52.00918857008219, "10": 53.81231079809368, "11": 42.777938302606344, "12": 51.39000457711518}, "P99 TTFT (ms)": {"0": 106.24904086813329, "1": 2799.799300944432, "2": 192.6872948743402, "3": 5482.372874026187, "4": 77.65017175581303, "5": 38.4603352844715, "6": 40.822145123966024, "7": 211.5863132150842, "8": 2069.7474440047517, "9": 112.60767419822496, "10": 301.9915889855472, "11": 94.49848417192695, "12": 102.77746587526059}, "Mean ITL (ms)": {"0": 16.283852706568087, "1": 27.348534071067597, "2": 24.053919213969603, "3": 27.500667837211765, "4": 12.719157753153393, "5": 6.221918908643021, "6": 6.555407773585138, "7": 8.385213129696478, "8": 12.161835644124064, "9": 18.192300515159907, "10": 29.924834822187712, "11": 23.071353249801486, "12": 29.6755464805204}, "Median ITL (ms)": {"0": 15.85016818717122, "1": 
24.713678751140833, "2": 19.659725483506918, "3": 22.166555281728506, "4": 11.67289912700653, "5": 6.166191538795829, "6": 6.371156312525272, "7": 7.708064513280988, "8": 10.757617419585586, "9": 16.539912670850754, "10": 23.985240142792463, "11": 20.872429944574833, "12": 27.51104603521526}, "P99 ITL (ms)": {"0": 43.342703185044236, "1": 150.1419538911432, "2": 58.11802428215748, "3": 135.61838977504522, "4": 30.515092369168997, "5": 6.675234157592062, "6": 16.202509100548923, "7": 17.984890309162438, "8": 22.605922631919384, "9": 46.118766949512064, "10": 165.22320011630654, "11": 60.49416142515838, "12": 89.52857794705778}}}
You can also check the raw experiment data in the Artifact tab of the Buildkite page.
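With both runs in hand, the two latency tables can be joined to quantify the H200-vs-A100 speedup; a sketch with the mean latencies inlined from the tables above (variable names are assumptions, not part of the benchmark output):

```python
import pandas as pd

# Sketch comparing the two runs; the dataframes are built inline from the
# mean-latency columns of the A100 and H200 tables above.
tests = ["latency_llama8B_tp1", "latency_llama70B_tp4", "latency_mixtral8x7B_tp2"]
a100 = pd.DataFrame({"Test name": tests,
                     "Mean latency (ms)": [1579.27, 3990.59, 3643.84]})
h200 = pd.DataFrame({"Test name": tests,
                     "Mean latency (ms)": [827.381, 2056.3, 1887.24]})

# Merge on test name and compute the per-test latency speedup.
cmp = a100.merge(h200, on="Test name", suffixes=(" A100", " H200"))
cmp["Speedup"] = cmp["Mean latency (ms) A100"] / cmp["Mean latency (ms) H200"]
print(cmp[["Test name", "Speedup"]])
```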
bootstrap
curl -sSL https://raw.githubusercontent.com/vllm-project/buildkite-ci/main/scripts/kickoff-benchmark.sh | bash
Waited 47s
Ran in 11s
Total Job Run Time: 1h 35m