Passed in 1h 9m

## Latency tests

- Input length: 32 tokens.
- Output length: 128 tokens.
- Batch size: fixed at 8.
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- Evaluation metrics: end-to-end latency (mean, median, p99); a sketch of how these statistics can be computed follows the table.
| Test name | GPU | Mean latency (ms) | Median latency (ms) | P99 latency (ms) |
|---|---|---|---|---|
| latency_llama8B_tp1 | H200 | 834.059 | 834.014 | 834.846 |
| latency_llama70B_tp4 | H200 | 2075.08 | 2075.36 | 2076.72 |
| latency_mixtral8x7B_tp2 | H200 | 1921.1 | 1920.16 | 1932.79 |
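
As a reference, here is a minimal sketch of how the mean/median/p99 summary statistics above can be computed from per-iteration latencies with NumPy; the sample values are hypothetical, not taken from the run:

```python
import numpy as np

# Hypothetical per-iteration end-to-end latencies in milliseconds.
latencies_ms = np.array([834.1, 833.9, 834.0, 834.2, 834.8])

mean_ms = latencies_ms.mean()
median_ms = np.percentile(latencies_ms, 50)  # median is the 50th percentile
p99_ms = np.percentile(latencies_ms, 99)
print(f"mean={mean_ms:.3f} ms, median={median_ms:.3f} ms, p99={p99_ms:.3f} ms")
```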

## Throughput tests

- Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed); see the sampling sketch after the table.
- Output length: the output length recorded for each of these 200 prompts.
- Batch size: dynamically determined by vLLM to achieve maximum throughput.
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- Evaluation metrics: throughput.
| Test name | GPU | Tput (req/s) |
|---|---|---|
| throughput_llama8B_tp1 | H200 | 21.8334 |
| throughput_mixtral8x7B_tp2 | H200 | 8.04811 |
| throughput_llama70B_tp4 | H200 | 9.74671 |
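
The fixed-seed prompt sampling can be reproduced along these lines. This is a sketch under assumptions: the dataset filename and the seed value are illustrative, not necessarily the exact ones the benchmark harness uses:

```python
import json
import random

# Assumed dataset file; the benchmark samples from the ShareGPT dataset.
SHAREGPT_PATH = "ShareGPT_V3_unfiltered_cleaned_split.json"

with open(SHAREGPT_PATH) as f:
    dataset = json.load(f)

# Fixing the seed makes every run benchmark the same 200 prompts.
random.seed(0)  # assumed seed value
sampled_prompts = random.sample(dataset, 200)
```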

## Serving tests

- Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
- Output length: the output length recorded for each of these 200 prompts.
- Batch size: dynamically determined by vLLM and the arrival pattern of the requests.
- Average QPS (queries per second): 1, 4, 16, and inf. QPS = inf means all requests arrive at once; for the other QPS values, the arrival time of each query is drawn from a Poisson process with a fixed random seed (see the sketch after the table).
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- We also added a speculative decoding test for llama-3 70B under QPS 2.
- Evaluation metrics: throughput, TTFT (time to first token, with mean, median, and p99), ITL (inter-token latency, with mean, median, and p99).
| Test name | GPU | Tput (req/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean ITL (ms) | Median ITL (ms) | P99 ITL (ms) |
|---|---|---|---|---|---|---|---|---|
| serving_llama70B_tp4_sharegpt_qps_1 | H200 | 0.989745 | 65.8942 | 58.6838 | 117.731 | 16.5538 | 15.9126 | 57.2564 |
| serving_mixtral8x7B_tp2_sharegpt_qps_inf | H200 | 8.30144 | 2212.01 | 2693.16 | 2705.08 | 27.5387 | 25.1811 | 49.9445 |
| serving_llama70B_tp4_sharegpt_qps_16 | H200 | 7.94188 | 129.844 | 118.778 | 245.186 | 26.8641 | 20.6569 | 73.1271 |
| serving_llama70B_tp4_sharegpt_qps_inf | H200 | 9.83461 | 2956.12 | 2966.79 | 5737.24 | 28.5089 | 22.5366 | 60.9162 |
| serving_mixtral8x7B_tp2_sharegpt_qps_1 | H200 | 0.994574 | 46.1188 | 41.2574 | 86.9205 | 12.7744 | 12.0251 | 38.6543 |
| serving_llama8B_tp1_sharegpt_qps_1 | H200 | 1.00721 | 23.7952 | 22.2514 | 42.4458 | 6.26858 | 6.18391 | 6.71771 |
| serving_llama8B_tp1_sharegpt_qps_4 | H200 | 3.86665 | 26.4758 | 23.9896 | 44.6248 | 6.67286 | 6.39768 | 20.6561 |
| serving_llama8B_tp1_sharegpt_qps_16 | H200 | 12.1697 | 36.8132 | 30.8492 | 103.342 | 9.13766 | 7.90394 | 22.6719 |
| serving_llama8B_tp1_sharegpt_qps_inf | H200 | 22.132 | 1311.18 | 1297.12 | 2298.27 | 12.2979 | 11.3569 | 22.8169 |
| serving_llama70B_tp4_sharegpt_qps_4 | H200 | 3.44712 | 77.9262 | 64.1089 | 153.266 | 19.255 | 17.1243 | 60.4741 |
| serving_mixtral8x7B_tp2_sharegpt_qps_16 | H200 | 6.64302 | 68.5915 | 63.0198 | 151.21 | 30.9852 | 24.344 | 151.034 |
| serving_mixtral8x7B_tp2_sharegpt_qps_4 | H200 | 3.25054 | 53.1555 | 48.8058 | 93.9642 | 23.7428 | 21.1877 | 56.6531 |
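
A minimal sketch of the Poisson arrival pattern described above: inter-arrival gaps of a Poisson process are exponentially distributed with mean 1/QPS, and a fixed seed keeps the schedule reproducible. The function name and default seed are illustrative, not the harness's own:

```python
import numpy as np

def poisson_arrival_times(num_requests: int, qps: float, seed: int = 0) -> np.ndarray:
    """Arrival times (seconds) for requests arriving at an average rate of `qps`."""
    rng = np.random.default_rng(seed)
    if np.isinf(qps):
        # QPS = inf: every request arrives at t = 0, i.e. all at once.
        return np.zeros(num_requests)
    # Exponential inter-arrival gaps with mean 1/qps yield a Poisson process.
    gaps = rng.exponential(scale=1.0 / qps, size=num_requests)
    return np.cumsum(gaps)

# Example: arrival schedule for 200 requests at an average of 4 QPS.
arrivals = poisson_arrival_times(200, qps=4.0)
```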

## JSON version of the benchmarking tables

This section contains the data from the markdown tables above in JSON format. You can load the benchmarking tables into pandas DataFrames as follows:

```python
import json

import pandas as pd

# Paste the JSON string from the section below in place of the placeholder.
benchmarking_results_json = """The json string"""
benchmarking_results = json.loads(benchmarking_results_json)

# Each top-level key holds one of the markdown tables above.
latency_results = pd.DataFrame.from_dict(benchmarking_results["latency"])
throughput_results = pd.DataFrame.from_dict(benchmarking_results["throughput"])
serving_results = pd.DataFrame.from_dict(benchmarking_results["serving"])
```
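
As a quick usage example (column names as in the tables above), the serving runs can be ranked by p99 TTFT:

```python
print(
    serving_results.sort_values("P99 TTFT (ms)")[
        ["Test name", "Tput (req/s)", "P99 TTFT (ms)"]
    ]
)
```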

The JSON string for all benchmarking tables:

{"latency": {"Test name": {"0": "latency_llama8B_tp1", "1": "latency_llama70B_tp4", "2": "latency_mixtral8x7B_tp2"}, "GPU": {"0": "H200", "1": "H200", "2": "H200"}, "Mean latency (ms)": {"0": 834.0585705824196, "1": 2075.0775761902332, "2": 1921.0956402122974}, "Median latency (ms)": {"0": 834.0138993225992, "1": 2075.3575819544494, "2": 1920.164190698415}, "P99 latency (ms)": {"0": 834.8459033109248, "1": 2076.722107725218, "2": 1932.7935293503106}}, "throughput": {"Test name": {"0": "throughput_llama8B_tp1", "1": "throughput_mixtral8x7B_tp2", "2": "throughput_llama70B_tp4"}, "GPU": {"0": "H200", "1": "H200", "2": "H200"}, "Tput (req/s)": {"0": 21.83341524788152, "1": 8.048109997071029, "2": 9.746709727458574}}, "serving": {"Test name": {"0": "serving_llama70B_tp4_sharegpt_qps_1", "1": "serving_mixtral8x7B_tp2_sharegpt_qps_inf", "2": "serving_llama70B_tp4_sharegpt_qps_16", "3": "serving_llama70B_tp4_sharegpt_qps_inf", "4": "serving_mixtral8x7B_tp2_sharegpt_qps_1", "5": "serving_llama8B_tp1_sharegpt_qps_1", "6": "serving_llama8B_tp1_sharegpt_qps_4", "7": "serving_llama8B_tp1_sharegpt_qps_16", "8": "serving_llama8B_tp1_sharegpt_qps_inf", "9": "serving_llama70B_tp4_sharegpt_qps_4", "10": "serving_mixtral8x7B_tp2_sharegpt_qps_16", "11": "serving_mixtral8x7B_tp2_sharegpt_qps_4"}, "GPU": {"0": "H200", "1": "H200", "2": "H200", "3": "H200", "4": "H200", "5": "H200", "6": "H200", "7": "H200", "8": "H200", "9": "H200", "10": "H200", "11": "H200"}, "Tput (req/s)": {"0": 0.9897454093597071, "1": 8.301438303486135, "2": 7.941884952254488, "3": 9.83461405896792, "4": 0.9945736571908708, "5": 1.0072132823950808, "6": 3.866650776305284, "7": 12.169701952157455, "8": 22.132008765124226, "9": 3.4471185869260013, "10": 6.6430202329711845, "11": 3.25053573113392}, "Mean TTFT (ms)": {"0": 65.89417540002614, "1": 2212.0139549998567, "2": 129.8440619581379, "3": 2956.116558215581, "4": 46.11883599776775, "5": 23.79521505907178, "6": 26.47580065531656, "7": 36.813228321261704, "8": 1311.1809824383818, "9": 77.9262473876588, "10": 68.5915441554971, "11": 53.15553463995457}, "Median TTFT (ms)": {"0": 58.68379143066704, "1": 2693.1617595255375, "2": 118.77811304293573, "3": 2966.7866479139775, "4": 41.25743662007153, "5": 22.25138060748577, "6": 23.98963994346559, "7": 30.849200673401356, "8": 1297.1230044495314, "9": 64.10894286818802, "10": 63.01983795128763, "11": 48.80578117445111}, "P99 TTFT (ms)": {"0": 117.7309931674972, "1": 2705.078350431286, "2": 245.1856034155935, "3": 5737.240736754611, "4": 86.92045607138422, "5": 42.44583906605839, "6": 44.62477448396378, "7": 103.34168700501296, "8": 2298.2681707665324, "9": 153.26619511004526, "10": 151.21046319138233, "11": 93.96423864644017}, "Mean ITL (ms)": {"0": 16.55375627379828, "1": 27.538672305496316, "2": 26.86406814457687, "3": 28.508920810925122, "4": 12.774443284429175, "5": 6.268575682927343, "6": 6.672862159281757, "7": 9.137658113079894, "8": 12.297889603084739, "9": 19.255033580634123, "10": 30.98522109735675, "11": 23.74275151011624}, "Median ITL (ms)": {"0": 15.912594506517053, "1": 25.181110948324203, "2": 20.656882552430034, "3": 22.536626551300287, "4": 12.025097850710154, "5": 6.183910416439176, "6": 6.3976801466196775, "7": 7.903937250375748, "8": 11.356941889971495, "9": 17.124322475865483, "10": 24.343972094357014, "11": 21.187677048146725}, "P99 ITL (ms)": {"0": 57.25642004050316, "1": 49.94449386373162, "2": 73.12712532002479, "3": 60.91620681807399, "4": 38.65427698940039, "5": 6.717711966484787, "6": 20.656075677834455, "7": 
22.671854496002197, "8": 22.816914832219496, "9": 60.47411515843123, "10": 151.03395713493225, "11": 56.653142580762506}}}

You can also check the raw experiment data in the Artifact tab of the Buildkite page.
