Performance Benchmark
This run compares vLLM against several alternatives (the full pipeline takes about 4 hours) through the following steps:
- bootstrap
- A100 trt benchmark
- A100 lmdeploy benchmark
- A100 vllm benchmark
- A100 tgi benchmark
- Plot
Latency tests
- Input length: 32 tokens.
- Output length: 128 tokens.
- Batch size: fixed (8).
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- Evaluation metrics: end-to-end latency (mean, median, p99).
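The mean, median, and p99 columns in the table below can be reproduced from a list of per-iteration end-to-end latencies. A minimal sketch (the helper name is ours, not part of vLLM's benchmark scripts):

```python
import numpy as np

def latency_summary(latencies_ms):
    """Mean / median / p99 over per-iteration end-to-end latencies (in ms)."""
    lat = np.asarray(list(latencies_ms), dtype=float)
    return {
        "mean": float(lat.mean()),
        "median": float(np.median(lat)),
        # np.percentile uses linear interpolation between ranks by default
        "p99": float(np.percentile(lat, 99)),
    }

# Example with 100 synthetic latencies of 1..100 ms
stats = latency_summary(range(1, 101))
```

With the synthetic 1..100 ms data, mean and median are both 50.5 and p99 interpolates to 99.01.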
Test name | GPU | Mean latency (ms) | Median latency (ms) | P99 latency (ms) |
---|---|---|---|---|
latency_llama8B_tp1 | H200 | 834.059 | 834.014 | 834.846 |
latency_llama70B_tp4 | H200 | 2075.08 | 2075.36 | 2076.72 |
latency_mixtral8x7B_tp2 | H200 | 1921.1 | 1920.16 | 1932.79 |
Throughput tests
- Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
- Output length: the corresponding output lengths of these 200 prompts.
- Batch size: dynamically determined by vLLM to achieve maximum throughput.
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- Evaluation metrics: throughput.
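Throughput here is simply the number of completed requests divided by wall-clock time; a sketch of that measurement (the timing wrapper and handler are illustrative, not the actual harness):

```python
import time

def measure_throughput(handle_request, requests):
    """Requests per second: completed requests / elapsed wall-clock time."""
    start = time.perf_counter()
    for req in requests:
        handle_request(req)  # stand-in for submitting a request to the engine
    elapsed = time.perf_counter() - start
    return len(requests) / elapsed

# Example with a trivial stand-in handler and 200 dummy requests
tput = measure_throughput(lambda r: None, list(range(200)))
```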
Test name | GPU | Tput (req/s) |
---|---|---|
throughput_llama8B_tp1 | H200 | 21.8334 |
throughput_mixtral8x7B_tp2 | H200 | 8.04811 |
throughput_llama70B_tp4 | H200 | 9.74671 |
Serving tests
- Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
- Output length: the corresponding output lengths of these 200 prompts.
- Batch size: dynamically determined by vLLM and the arrival pattern of the requests.
- Average QPS (queries per second): 1, 4, 16, and inf. QPS = inf means all requests arrive at once; for the other QPS values, the arrival time of each query is drawn from a Poisson process (with a fixed random seed).
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- We also added a speculative decoding test for llama-3 70B at QPS 2.
- Evaluation metrics: throughput, TTFT (time to first token; mean, median, and p99), ITL (inter-token latency; mean, median, and p99).
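The Poisson arrival pattern and the TTFT/ITL metrics above can be sketched as follows (the helper names are ours; a Poisson process is generated from exponential inter-arrival gaps, and a fixed seed makes the schedule reproducible):

```python
import numpy as np

def poisson_arrival_times(num_requests, qps, seed=0):
    """Arrival timestamps: exponential inter-arrival gaps with mean 1/qps."""
    rng = np.random.default_rng(seed)
    gaps = rng.exponential(scale=1.0 / qps, size=num_requests)
    return np.cumsum(gaps)

def ttft_and_itl(request_start, token_timestamps):
    """TTFT = delay until the first token; ITL = gaps between consecutive tokens."""
    ttft = token_timestamps[0] - request_start
    itl = np.diff(token_timestamps)
    return ttft, itl

arrivals = poisson_arrival_times(200, qps=4.0)          # reproducible schedule
ttft, itl = ttft_and_itl(0.0, [0.05, 0.07, 0.10])       # toy token timestamps
```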
Test name | GPU | Tput (req/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean ITL (ms) | Median ITL (ms) | P99 ITL (ms) |
---|---|---|---|---|---|---|---|---|
serving_llama70B_tp4_sharegpt_qps_1 | H200 | 0.989745 | 65.8942 | 58.6838 | 117.731 | 16.5538 | 15.9126 | 57.2564 |
serving_mixtral8x7B_tp2_sharegpt_qps_inf | H200 | 8.30144 | 2212.01 | 2693.16 | 2705.08 | 27.5387 | 25.1811 | 49.9445 |
serving_llama70B_tp4_sharegpt_qps_16 | H200 | 7.94188 | 129.844 | 118.778 | 245.186 | 26.8641 | 20.6569 | 73.1271 |
serving_llama70B_tp4_sharegpt_qps_inf | H200 | 9.83461 | 2956.12 | 2966.79 | 5737.24 | 28.5089 | 22.5366 | 60.9162 |
serving_mixtral8x7B_tp2_sharegpt_qps_1 | H200 | 0.994574 | 46.1188 | 41.2574 | 86.9205 | 12.7744 | 12.0251 | 38.6543 |
serving_llama8B_tp1_sharegpt_qps_1 | H200 | 1.00721 | 23.7952 | 22.2514 | 42.4458 | 6.26858 | 6.18391 | 6.71771 |
serving_llama8B_tp1_sharegpt_qps_4 | H200 | 3.86665 | 26.4758 | 23.9896 | 44.6248 | 6.67286 | 6.39768 | 20.6561 |
serving_llama8B_tp1_sharegpt_qps_16 | H200 | 12.1697 | 36.8132 | 30.8492 | 103.342 | 9.13766 | 7.90394 | 22.6719 |
serving_llama8B_tp1_sharegpt_qps_inf | H200 | 22.132 | 1311.18 | 1297.12 | 2298.27 | 12.2979 | 11.3569 | 22.8169 |
serving_llama70B_tp4_sharegpt_qps_4 | H200 | 3.44712 | 77.9262 | 64.1089 | 153.266 | 19.255 | 17.1243 | 60.4741 |
serving_mixtral8x7B_tp2_sharegpt_qps_16 | H200 | 6.64302 | 68.5915 | 63.0198 | 151.21 | 30.9852 | 24.344 | 151.034 |
serving_mixtral8x7B_tp2_sharegpt_qps_4 | H200 | 3.25054 | 53.1555 | 48.8058 | 93.9642 | 23.7428 | 21.1877 | 56.6531 |
JSON version of the benchmarking tables
This section contains the data of the markdown tables above in JSON format. You can load the benchmarking tables into pandas DataFrames as follows:
```python
import json

import pandas as pd

# Paste the full JSON string (reproduced below) in place of the placeholder.
benchmarking_results_json = """The json string"""
benchmarking_results = json.loads(benchmarking_results_json)

# One DataFrame per benchmark category.
latency_results = pd.DataFrame.from_dict(benchmarking_results["latency"])
throughput_results = pd.DataFrame.from_dict(benchmarking_results["throughput"])
serving_results = pd.DataFrame.from_dict(benchmarking_results["serving"])
```
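For a self-contained check of this loading pattern, here is the same code run against a tiny stand-in JSON string (one latency row only; the real string is reproduced below):

```python
import json

import pandas as pd

# Stand-in for the full benchmarking JSON: a single latency row.
demo_json = """{"latency": {"Test name": {"0": "latency_llama8B_tp1"}, "GPU": {"0": "H200"}, "Mean latency (ms)": {"0": 834.06}}}"""
demo_results = json.loads(demo_json)
latency_df = pd.DataFrame.from_dict(demo_results["latency"])
# latency_df is a 1-row DataFrame with columns: Test name, GPU, Mean latency (ms)
```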
The json string for all benchmarking tables:
{"latency": {"Test name": {"0": "latency_llama8B_tp1", "1": "latency_llama70B_tp4", "2": "latency_mixtral8x7B_tp2"}, "GPU": {"0": "H200", "1": "H200", "2": "H200"}, "Mean latency (ms)": {"0": 834.0585705824196, "1": 2075.0775761902332, "2": 1921.0956402122974}, "Median latency (ms)": {"0": 834.0138993225992, "1": 2075.3575819544494, "2": 1920.164190698415}, "P99 latency (ms)": {"0": 834.8459033109248, "1": 2076.722107725218, "2": 1932.7935293503106}}, "throughput": {"Test name": {"0": "throughput_llama8B_tp1", "1": "throughput_mixtral8x7B_tp2", "2": "throughput_llama70B_tp4"}, "GPU": {"0": "H200", "1": "H200", "2": "H200"}, "Tput (req/s)": {"0": 21.83341524788152, "1": 8.048109997071029, "2": 9.746709727458574}}, "serving": {"Test name": {"0": "serving_llama70B_tp4_sharegpt_qps_1", "1": "serving_mixtral8x7B_tp2_sharegpt_qps_inf", "2": "serving_llama70B_tp4_sharegpt_qps_16", "3": "serving_llama70B_tp4_sharegpt_qps_inf", "4": "serving_mixtral8x7B_tp2_sharegpt_qps_1", "5": "serving_llama8B_tp1_sharegpt_qps_1", "6": "serving_llama8B_tp1_sharegpt_qps_4", "7": "serving_llama8B_tp1_sharegpt_qps_16", "8": "serving_llama8B_tp1_sharegpt_qps_inf", "9": "serving_llama70B_tp4_sharegpt_qps_4", "10": "serving_mixtral8x7B_tp2_sharegpt_qps_16", "11": "serving_mixtral8x7B_tp2_sharegpt_qps_4"}, "GPU": {"0": "H200", "1": "H200", "2": "H200", "3": "H200", "4": "H200", "5": "H200", "6": "H200", "7": "H200", "8": "H200", "9": "H200", "10": "H200", "11": "H200"}, "Tput (req/s)": {"0": 0.9897454093597071, "1": 8.301438303486135, "2": 7.941884952254488, "3": 9.83461405896792, "4": 0.9945736571908708, "5": 1.0072132823950808, "6": 3.866650776305284, "7": 12.169701952157455, "8": 22.132008765124226, "9": 3.4471185869260013, "10": 6.6430202329711845, "11": 3.25053573113392}, "Mean TTFT (ms)": {"0": 65.89417540002614, "1": 2212.0139549998567, "2": 129.8440619581379, "3": 2956.116558215581, "4": 46.11883599776775, "5": 23.79521505907178, "6": 26.47580065531656, "7": 36.813228321261704, "8": 
1311.1809824383818, "9": 77.9262473876588, "10": 68.5915441554971, "11": 53.15553463995457}, "Median TTFT (ms)": {"0": 58.68379143066704, "1": 2693.1617595255375, "2": 118.77811304293573, "3": 2966.7866479139775, "4": 41.25743662007153, "5": 22.25138060748577, "6": 23.98963994346559, "7": 30.849200673401356, "8": 1297.1230044495314, "9": 64.10894286818802, "10": 63.01983795128763, "11": 48.80578117445111}, "P99 TTFT (ms)": {"0": 117.7309931674972, "1": 2705.078350431286, "2": 245.1856034155935, "3": 5737.240736754611, "4": 86.92045607138422, "5": 42.44583906605839, "6": 44.62477448396378, "7": 103.34168700501296, "8": 2298.2681707665324, "9": 153.26619511004526, "10": 151.21046319138233, "11": 93.96423864644017}, "Mean ITL (ms)": {"0": 16.55375627379828, "1": 27.538672305496316, "2": 26.86406814457687, "3": 28.508920810925122, "4": 12.774443284429175, "5": 6.268575682927343, "6": 6.672862159281757, "7": 9.137658113079894, "8": 12.297889603084739, "9": 19.255033580634123, "10": 30.98522109735675, "11": 23.74275151011624}, "Median ITL (ms)": {"0": 15.912594506517053, "1": 25.181110948324203, "2": 20.656882552430034, "3": 22.536626551300287, "4": 12.025097850710154, "5": 6.183910416439176, "6": 6.3976801466196775, "7": 7.903937250375748, "8": 11.356941889971495, "9": 17.124322475865483, "10": 24.343972094357014, "11": 21.187677048146725}, "P99 ITL (ms)": {"0": 57.25642004050316, "1": 49.94449386373162, "2": 73.12712532002479, "3": 60.91620681807399, "4": 38.65427698940039, "5": 6.717711966484787, "6": 20.656075677834455, "7": 22.671854496002197, "8": 22.816914832219496, "9": 60.47411515843123, "10": 151.03395713493225, "11": 56.653142580762506}}}
You can also check the raw experiment data in the Artifacts tab of the Buildkite page.
bootstrap

```shell
curl -sSL https://raw.githubusercontent.com/vllm-project/buildkite-ci/main/scripts/kickoff-benchmark.sh | bash
```