# Performance Benchmark
## A100 Benchmark
### Latency tests
- Input length: 32 tokens.
- Output length: 128 tokens.
- Batch size: fixed (8).
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- Evaluation metrics: end-to-end latency (mean, median, p99).
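The latency metrics in the table below can be reproduced from raw per-iteration measurements. A minimal sketch (a hypothetical helper, not vLLM's own benchmark code) using a nearest-rank p99:

```python
import statistics

def summarize_latencies(latencies_ms):
    """Summarize end-to-end latencies as mean, median, and p99 (ms).

    Hypothetical helper for illustration; uses a nearest-rank
    percentile, which matches the tables only approximately.
    """
    ordered = sorted(latencies_ms)
    # Nearest-rank p99: index of the value at the 99th percentile.
    p99_index = min(len(ordered) - 1, round(0.99 * (len(ordered) - 1)))
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p99": ordered[p99_index],
    }

# Ten synthetic latency samples (ms) with one slow outlier:
samples = [830, 831, 832, 833, 833, 834, 834, 835, 836, 900]
print(summarize_latencies(samples))
```

Note how the p99 column is far more sensitive to the single outlier than the mean or median.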
Test name | GPU | Mean latency (ms) | Median latency (ms) | P99 latency (ms) |
---|---|---|---|---|
latency_llama70B_tp4 | H200 | 2077.53 | 2077.56 | 2079.95 |
latency_llama8B_tp1 | H200 | 833.421 | 833.53 | 834.167 |
latency_mixtral8x7B_tp2 | H200 | 1917.44 | 1916.58 | 1929 |
### Throughput tests
- Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
- Output length: the corresponding output lengths of these 200 prompts.
- Batch size: dynamically determined by vLLM to achieve maximum throughput.
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- Evaluation metrics: throughput.
Test name | GPU | Tput (req/s) |
---|---|---|
throughput_llama70B_tp4 | H200 | 9.83905 |
throughput_llama8B_tp1 | H200 | 22.5674 |
throughput_mixtral8x7B_tp2 | H200 | 8.06251 |
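The throughput figures above are simply completed requests divided by wall-clock time. A one-line sketch (illustrative numbers, not taken from the runs above):

```python
def throughput_req_per_s(num_requests, elapsed_s):
    # Throughput as reported above: completed requests / wall-clock seconds.
    return num_requests / elapsed_s

# For example, if all 200 sampled prompts finished in 20 s, that would be 10 req/s.
print(throughput_req_per_s(200, 20.0))
```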
### Serving tests
- Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
- Output length: the corresponding output lengths of these 200 prompts.
- Batch size: dynamically determined by vLLM and the arrival pattern of the requests.
- Average QPS (queries per second): 1, 4, 16, and inf. QPS = inf means all requests arrive at once. For other QPS values, the arrival time of each query is drawn from a Poisson process (with a fixed random seed).
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- We also added a speculative decoding test for llama-3 70B at QPS 2.
- Evaluation metrics: throughput, TTFT (time to first token; mean, median, and p99), and ITL (inter-token latency; mean, median, and p99).
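A Poisson arrival process at a given average QPS means the gaps between requests are exponentially distributed with mean 1/QPS. A minimal sketch of generating such arrival times (hypothetical helper; not vLLM's benchmark code):

```python
import random

def poisson_arrival_times(num_requests, qps, seed=0):
    """Arrival times (seconds) for a Poisson process at an average QPS.

    Inter-arrival gaps are drawn from an exponential distribution with
    rate `qps`; the fixed seed makes the schedule reproducible, as in
    the benchmark setup above.
    """
    rng = random.Random(seed)
    t, arrivals = 0.0, []
    for _ in range(num_requests):
        t += rng.expovariate(qps)  # gap ~ Exp(qps), mean 1/qps seconds
        arrivals.append(t)
    return arrivals

# QPS = inf corresponds to every arrival time being 0 (one burst).
print(poisson_arrival_times(5, qps=4.0, seed=42))
```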
Test name | GPU | Tput (req/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean ITL (ms) | Median ITL (ms) | P99 ITL (ms) |
---|---|---|---|---|---|---|---|---|
serving_llama70B_tp4_sharegpt_qps_1 | H200 | 0.989698 | 64.6695 | 57.5495 | 114.381 | 16.54 | 15.9035 | 57.0511 |
serving_llama70B_tp4_sharegpt_qps_16 | H200 | 7.93297 | 129.331 | 121.234 | 235.558 | 27.0082 | 20.8107 | 73.0208 |
serving_llama70B_tp4_sharegpt_qps_4 | H200 | 3.44656 | 77.4506 | 63.337 | 155.165 | 19.1865 | 17.085 | 59.7834 |
serving_llama70B_tp4_sharegpt_qps_inf | H200 | 9.80641 | 3002.99 | 2953.38 | 5774.49 | 28.4332 | 22.5972 | 62.2624 |
serving_llama70B_tp4_sharegpt_specdecode_qps_2 | H200 | 1.95686 | 56.382 | 61.792 | 109.222 | 30.1394 | 27.3403 | 96.1451 |
serving_llama8B_tp1_sharegpt_qps_1 | H200 | 1.00723 | 23.7387 | 22.3615 | 42.087 | 6.27439 | 6.19076 | 6.70216 |
serving_llama8B_tp1_sharegpt_qps_16 | H200 | 12.163 | 36.5353 | 31.9291 | 83.0084 | 9.17103 | 7.92751 | 22.6797 |
serving_llama8B_tp1_sharegpt_qps_4 | H200 | 3.8669 | 26.219 | 23.4337 | 43.4302 | 6.67473 | 6.39724 | 20.5362 |
serving_llama8B_tp1_sharegpt_qps_inf | H200 | 22.2192 | 1300.75 | 1275.22 | 2300.03 | 12.2474 | 11.4056 | 22.4373 |
serving_mixtral8x7B_tp2_sharegpt_qps_1 | H200 | 0.99435 | 46.4297 | 41.3002 | 88.5616 | 12.9366 | 12.0271 | 38.5728 |
serving_mixtral8x7B_tp2_sharegpt_qps_16 | H200 | 6.66174 | 72.6094 | 63.4399 | 226.987 | 31.0919 | 24.3868 | 172.603 |
serving_mixtral8x7B_tp2_sharegpt_qps_4 | H200 | 3.2501 | 52.4526 | 48.9461 | 94.6454 | 23.4618 | 21.0607 | 56.974 |
serving_mixtral8x7B_tp2_sharegpt_qps_inf | H200 | 8.28159 | 2213.17 | 2005.53 | 2715.63 | 27.7495 | 25.3022 | 51.4907 |
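TTFT and ITL in the table above can be derived from a single request's token timestamps: TTFT is the delay until the first token, and ITL is the set of gaps between consecutive tokens. A sketch (hypothetical helper; the benchmark records these during generation):

```python
def ttft_and_itl_ms(request_start_ms, token_times_ms):
    """Derive the two serving metrics from one request's token timestamps.

    TTFT = first token timestamp minus request start;
    ITL  = gaps between consecutive token timestamps.
    """
    ttft = token_times_ms[0] - request_start_ms
    itl = [b - a for a, b in zip(token_times_ms, token_times_ms[1:])]
    return ttft, itl

# A request starting at t=0 that emits tokens at 50, 66, 82, and 100 ms:
print(ttft_and_itl_ms(0.0, [50.0, 66.0, 82.0, 100.0]))
```

The per-test mean/median/p99 columns then aggregate these values over all 200 requests.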
### JSON version of the benchmarking tables
This section contains the data of the markdown tables above in JSON format. You can load the benchmarking tables into pandas dataframes as follows:
```python
import json

import pandas as pd

# Paste the JSON string from below in place of this placeholder.
benchmarking_results_json = """The json string"""
benchmarking_results = json.loads(benchmarking_results_json)

# Each table is stored in pandas' column-oriented dict layout.
latency_results = pd.DataFrame.from_dict(benchmarking_results["latency"])
throughput_results = pd.DataFrame.from_dict(benchmarking_results["throughput"])
serving_results = pd.DataFrame.from_dict(benchmarking_results["serving"])
```
The JSON string for all benchmarking tables:
```json
{"latency": {"Test name": {"0": "latency_llama8B_tp1", "1": "latency_llama70B_tp4", "2": "latency_mixtral8x7B_tp2"}, "GPU": {"0": "H200", "1": "H200", "2": "H200"}, "Mean latency (ms)": {"0": 833.4214728946488, "1": 2077.5339416228235, "2": 1917.4413826627037}, "Median latency (ms)": {"0": 833.5301280021667, "1": 2077.5569062680006, "2": 1916.5793992578983}, "P99 latency (ms)": {"0": 834.1666641458869, "1": 2079.9549554102123, "2": 1929.0014506783336}}, "throughput": {"Test name": {"0": "throughput_llama8B_tp1", "1": "throughput_mixtral8x7B_tp2", "2": "throughput_llama70B_tp4"}, "GPU": {"0": "H200", "1": "H200", "2": "H200"}, "Tput (req/s)": {"0": 22.56741098325012, "1": 8.06251347257433, "2": 9.83904802973439}}, "serving": {"Test name": {"0": "serving_llama70B_tp4_sharegpt_qps_1", "1": "serving_mixtral8x7B_tp2_sharegpt_qps_inf", "2": "serving_llama70B_tp4_sharegpt_qps_16", "3": "serving_llama70B_tp4_sharegpt_qps_inf", "4": "serving_mixtral8x7B_tp2_sharegpt_qps_1", "5": "serving_llama8B_tp1_sharegpt_qps_1", "6": "serving_llama8B_tp1_sharegpt_qps_4", "7": "serving_llama8B_tp1_sharegpt_qps_16", "8": "serving_llama8B_tp1_sharegpt_qps_inf", "9": "serving_llama70B_tp4_sharegpt_qps_4", "10": "serving_mixtral8x7B_tp2_sharegpt_qps_16", "11": "serving_mixtral8x7B_tp2_sharegpt_qps_4", "12": "serving_llama70B_tp4_sharegpt_specdecode_qps_2"}, "GPU": {"0": "H200", "1": "H200", "2": "H200", "3": "H200", "4": "H200", "5": "H200", "6": "H200", "7": "H200", "8": "H200", "9": "H200", "10": "H200", "11": "H200", "12": "H200"}, "Tput (req/s)": {"0": 0.9896977889656053, "1": 8.28159338054524, "2": 7.932968711752506, "3": 9.806409127702178, "4": 0.9943502861076632, "5": 1.0072348217112044, "6": 3.8669012030840593, "7": 12.162959386522013, "8": 22.21919766601935, "9": 3.446556870971799, "10": 6.661743742631414, "11": 3.250100042912229, "12": 1.956857415983935}, "Mean TTFT (ms)": {"0": 64.66954215429723, "1": 2213.1659540743567, "2": 129.33125539682806, "3": 3002.994736842811, "4": 46.42968596657738, "5": 23.738678817171603, "6": 26.218972683418542, "7": 36.53525110334158, "8": 1300.7527791964822, "9": 77.45059719774872, "10": 72.60941448388621, "11": 52.45257397182286, "12": 56.38197902124375}, "Median TTFT (ms)": {"0": 57.54952295683324, "1": 2005.5272534955293, "2": 121.23411428183317, "3": 2953.378487378359, "4": 41.300211334601045, "5": 22.361535346135497, "6": 23.433676920831203, "7": 31.929075717926025, "8": 1275.2170253079385, "9": 63.33703640848398, "10": 63.43989260494709, "11": 48.9460586104542, "12": 61.79200345650315}, "P99 TTFT (ms)": {"0": 114.38109113369136, "1": 2715.6348367640744, "2": 235.5583326471968, "3": 5774.493958768434, "4": 88.56162520125505, "5": 42.08704750519246, "6": 43.430245905183156, "7": 83.0083797499534, "8": 2300.0334176421165, "9": 155.16540429554877, "10": 226.986764236353, "11": 94.64538314379747, "12": 109.22247116453943}, "Mean ITL (ms)": {"0": 16.540048188017764, "1": 27.74949681771482, "2": 27.00816342756017, "3": 28.43324468155776, "4": 12.936618607449548, "5": 6.27439203658357, "6": 6.674730034577902, "7": 9.171031044295976, "8": 12.247382195842055, "9": 19.186542548971318, "10": 31.09191192388171, "11": 23.46184330725827, "12": 30.139367799210298}, "Median ITL (ms)": {"0": 15.903530409559608, "1": 25.302208960056305, "2": 20.81073261797428, "3": 22.597191389650106, "4": 12.02711882069707, "5": 6.190762389451265, "6": 6.397241959348321, "7": 7.927508093416691, "8": 11.40562235377729, "9": 17.085016472265124, "10": 24.386791978031397, "11": 21.060695871710777, "12": 27.340279892086983}, "P99 ITL (ms)": {"0": 57.05114717129618, "1": 51.49066437035799, "2": 73.0208202963695, "3": 62.2624043840915, "4": 38.57284761965275, "5": 6.70215771533549, "6": 20.536222611553967, "7": 22.679689247161157, "8": 22.43727925233543, "9": 59.78339056484401, "10": 172.60293657891452, "11": 56.97400974109764, "12": 96.14510252140462}}}
```
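The JSON above uses pandas' column-oriented layout ("columns" orient): each column maps a row index to a value. If pandas is unavailable, the rows can be reassembled with the standard library alone; a sketch using a tiny hypothetical excerpt of that layout:

```python
import json

# A tiny excerpt in the same column-oriented layout as the full string above;
# the values here are illustrative, not the full benchmark data.
excerpt = """{"latency": {"Test name": {"0": "latency_llama8B_tp1"},
                          "Mean latency (ms)": {"0": 833.42}}}"""

tables = json.loads(excerpt)
latency = tables["latency"]

# Reassemble row dicts by pairing entries that share a row index.
rows = [
    {col: values[i] for col, values in latency.items()}
    for i in latency["Test name"]
]
print(rows)
```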
You can also check the raw experiment data in the Artifact tab of the Buildkite page.