
Latency tests

  • Input length: 32 tokens.
  • Output length: 128 tokens.
  • Batch size: fixed (8).
  • Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
  • Evaluation metrics: end-to-end latency (mean, median, p99).
| Test name | GPU | Mean latency (ms) | Median latency (ms) | P99 latency (ms) |
|---|---|---|---|---|
| latency_llama70B_tp4 | H200 | 2077.53 | 2077.56 | 2079.95 |
| latency_llama8B_tp1 | H200 | 833.421 | 833.53 | 834.167 |
| latency_mixtral8x7B_tp2 | H200 | 1917.44 | 1916.58 | 1929.00 |
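For reference, the mean/median/p99 summaries reported above can be computed from raw per-request latencies with the standard library. The sample values below are illustrative only, not taken from the tables:

```python
import statistics

# Hypothetical per-request end-to-end latencies in ms (illustrative values,
# not taken from the benchmark results above).
latencies_ms = [831.9, 832.7, 833.4, 833.5, 834.0, 834.2, 835.1, 836.3]

mean_ms = statistics.mean(latencies_ms)
median_ms = statistics.median(latencies_ms)
# p99: 99th percentile via statistics.quantiles (exclusive method by default;
# with few samples it interpolates near, or beyond, the largest observation).
p99_ms = statistics.quantiles(latencies_ms, n=100)[98]
```

With only a handful of samples the p99 is dominated by the single slowest request, which is why the real runs report it over the full request set.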

Throughput tests

  • Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
  • Output length: the output length corresponding to each sampled prompt.
  • Batch size: dynamically determined by vLLM to achieve maximum throughput.
  • Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
  • Evaluation metric: throughput (req/s).
| Test name | GPU | Tput (req/s) |
|---|---|---|
| throughput_llama70B_tp4 | H200 | 9.83905 |
| throughput_llama8B_tp1 | H200 | 22.5674 |
| throughput_mixtral8x7B_tp2 | H200 | 8.06251 |
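The throughput metric reduces to completed requests divided by wall-clock benchmark time. A minimal sketch with made-up numbers (the real runs measure the elapsed time of the benchmark itself):

```python
# Request throughput = completed requests / wall-clock duration.
# The numbers below are hypothetical, chosen only to illustrate the formula.
num_requests = 200   # prompts sampled from ShareGPT
elapsed_s = 8.9      # total benchmark duration in seconds

tput_req_per_s = num_requests / elapsed_s
```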

Serving tests

  • Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
  • Output length: the output length corresponding to each sampled prompt.
  • Batch size: dynamically determined by vLLM and the arrival pattern of the requests.
  • Average QPS (queries per second): 1, 4, 16, and inf. QPS = inf means all requests arrive at once; for the other QPS values, each query's arrival time is drawn from a Poisson process (with a fixed random seed).
  • Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
  • We also add a speculative decoding test for llama-3 70B, run at QPS 2.
  • Evaluation metrics: throughput, TTFT (time to first token; mean, median, and p99), and ITL (inter-token latency; mean, median, and p99).
| Test name | GPU | Tput (req/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean ITL (ms) | Median ITL (ms) | P99 ITL (ms) |
|---|---|---|---|---|---|---|---|---|
| serving_llama70B_tp4_sharegpt_qps_1 | H200 | 0.989698 | 64.6695 | 57.5495 | 114.381 | 16.54 | 15.9035 | 57.0511 |
| serving_llama70B_tp4_sharegpt_qps_16 | H200 | 7.93297 | 129.331 | 121.234 | 235.558 | 27.0082 | 20.8107 | 73.0208 |
| serving_llama70B_tp4_sharegpt_qps_4 | H200 | 3.44656 | 77.4506 | 63.337 | 155.165 | 19.1865 | 17.085 | 59.7834 |
| serving_llama70B_tp4_sharegpt_qps_inf | H200 | 9.80641 | 3002.99 | 2953.38 | 5774.49 | 28.4332 | 22.5972 | 62.2624 |
| serving_llama70B_tp4_sharegpt_specdecode_qps_2 | H200 | 1.95686 | 56.382 | 61.792 | 109.222 | 30.1394 | 27.3403 | 96.1451 |
| serving_llama8B_tp1_sharegpt_qps_1 | H200 | 1.00723 | 23.7387 | 22.3615 | 42.087 | 6.27439 | 6.19076 | 6.70216 |
| serving_llama8B_tp1_sharegpt_qps_16 | H200 | 12.163 | 36.5353 | 31.9291 | 83.0084 | 9.17103 | 7.92751 | 22.6797 |
| serving_llama8B_tp1_sharegpt_qps_4 | H200 | 3.8669 | 26.219 | 23.4337 | 43.4302 | 6.67473 | 6.39724 | 20.5362 |
| serving_llama8B_tp1_sharegpt_qps_inf | H200 | 22.2192 | 1300.75 | 1275.22 | 2300.03 | 12.2474 | 11.4056 | 22.4373 |
| serving_mixtral8x7B_tp2_sharegpt_qps_1 | H200 | 0.99435 | 46.4297 | 41.3002 | 88.5616 | 12.9366 | 12.0271 | 38.5728 |
| serving_mixtral8x7B_tp2_sharegpt_qps_16 | H200 | 6.66174 | 72.6094 | 63.4399 | 226.987 | 31.0919 | 24.3868 | 172.603 |
| serving_mixtral8x7B_tp2_sharegpt_qps_4 | H200 | 3.2501 | 52.4526 | 48.9461 | 94.6454 | 23.4618 | 21.0607 | 56.974 |
| serving_mixtral8x7B_tp2_sharegpt_qps_inf | H200 | 8.28159 | 2213.17 | 2005.53 | 2715.63 | 27.7495 | 25.3022 | 51.4907 |
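The seeded Poisson arrival pattern used by the serving tests can be sketched as below. The function name and signature are illustrative, not vLLM's actual benchmark code:

```python
import random

def poisson_arrival_times(num_requests: int, qps: float, seed: int = 0) -> list[float]:
    """Return arrival timestamps (in seconds) with exponentially
    distributed inter-arrival gaps, i.e. a Poisson process at rate `qps`.

    Illustrative sketch only; vLLM's benchmark has its own implementation.
    """
    rng = random.Random(seed)  # fixed seed -> reproducible arrival pattern
    t = 0.0
    times = []
    for _ in range(num_requests):
        t += rng.expovariate(qps)  # mean gap between requests = 1/qps
        times.append(t)
    return times
```

At QPS 4 the mean gap is 0.25 s; QPS = inf corresponds to skipping the gaps entirely and issuing every request at t = 0.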

JSON version of the benchmarking tables

This section contains the data of the markdown tables above in JSON format. You can load the benchmarking tables into pandas DataFrames as follows:

```python
import json

import pandas as pd

# Replace the placeholder with the JSON string given below.
benchmarking_results_json = """The json string"""
benchmarking_results = json.loads(benchmarking_results_json)

# Each top-level key holds one table in pandas' column-oriented dict layout.
latency_results = pd.DataFrame.from_dict(benchmarking_results["latency"])
throughput_results = pd.DataFrame.from_dict(benchmarking_results["throughput"])
serving_results = pd.DataFrame.from_dict(benchmarking_results["serving"])
```
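The JSON uses a column-oriented layout: each table maps a column name to a dict of row index (as a string) to value. A tiny hand-trimmed sample to show the shape (truncated to one row and two columns; the full string is given below):

```python
import json

# Truncated excerpt of the benchmarking JSON: column name -> {row index -> value}.
snippet = '''
{"latency": {"Test name": {"0": "latency_llama8B_tp1"},
             "Mean latency (ms)": {"0": 833.4214728946488}}}
'''
tables = json.loads(snippet)
mean_ms = tables["latency"]["Mean latency (ms)"]["0"]
```

This is exactly the layout `pd.DataFrame.from_dict` expects with its default `orient="columns"`, which is why the loading code above needs no extra arguments.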

The json string for all benchmarking tables:

```json
{"latency": {"Test name": {"0": "latency_llama8B_tp1", "1": "latency_llama70B_tp4", "2": "latency_mixtral8x7B_tp2"}, "GPU": {"0": "H200", "1": "H200", "2": "H200"}, "Mean latency (ms)": {"0": 833.4214728946488, "1": 2077.5339416228235, "2": 1917.4413826627037}, "Median latency (ms)": {"0": 833.5301280021667, "1": 2077.5569062680006, "2": 1916.5793992578983}, "P99 latency (ms)": {"0": 834.1666641458869, "1": 2079.9549554102123, "2": 1929.0014506783336}}, "throughput": {"Test name": {"0": "throughput_llama8B_tp1", "1": "throughput_mixtral8x7B_tp2", "2": "throughput_llama70B_tp4"}, "GPU": {"0": "H200", "1": "H200", "2": "H200"}, "Tput (req/s)": {"0": 22.56741098325012, "1": 8.06251347257433, "2": 9.83904802973439}}, "serving": {"Test name": {"0": "serving_llama70B_tp4_sharegpt_qps_1", "1": "serving_mixtral8x7B_tp2_sharegpt_qps_inf", "2": "serving_llama70B_tp4_sharegpt_qps_16", "3": "serving_llama70B_tp4_sharegpt_qps_inf", "4": "serving_mixtral8x7B_tp2_sharegpt_qps_1", "5": "serving_llama8B_tp1_sharegpt_qps_1", "6": "serving_llama8B_tp1_sharegpt_qps_4", "7": "serving_llama8B_tp1_sharegpt_qps_16", "8": "serving_llama8B_tp1_sharegpt_qps_inf", "9": "serving_llama70B_tp4_sharegpt_qps_4", "10": "serving_mixtral8x7B_tp2_sharegpt_qps_16", "11": "serving_mixtral8x7B_tp2_sharegpt_qps_4", "12": "serving_llama70B_tp4_sharegpt_specdecode_qps_2"}, "GPU": {"0": "H200", "1": "H200", "2": "H200", "3": "H200", "4": "H200", "5": "H200", "6": "H200", "7": "H200", "8": "H200", "9": "H200", "10": "H200", "11": "H200", "12": "H200"}, "Tput (req/s)": {"0": 0.9896977889656053, "1": 8.28159338054524, "2": 7.932968711752506, "3": 9.806409127702178, "4": 0.9943502861076632, "5": 1.0072348217112044, "6": 3.8669012030840593, "7": 12.162959386522013, "8": 22.21919766601935, "9": 3.446556870971799, "10": 6.661743742631414, "11": 3.250100042912229, "12": 1.956857415983935}, "Mean TTFT (ms)": {"0": 64.66954215429723, "1": 2213.1659540743567, "2": 129.33125539682806, "3": 3002.994736842811, "4": 46.42968596657738, "5": 23.738678817171603, "6": 26.218972683418542, "7": 36.53525110334158, "8": 1300.7527791964822, "9": 77.45059719774872, "10": 72.60941448388621, "11": 52.45257397182286, "12": 56.38197902124375}, "Median TTFT (ms)": {"0": 57.54952295683324, "1": 2005.5272534955293, "2": 121.23411428183317, "3": 2953.378487378359, "4": 41.300211334601045, "5": 22.361535346135497, "6": 23.433676920831203, "7": 31.929075717926025, "8": 1275.2170253079385, "9": 63.33703640848398, "10": 63.43989260494709, "11": 48.9460586104542, "12": 61.79200345650315}, "P99 TTFT (ms)": {"0": 114.38109113369136, "1": 2715.6348367640744, "2": 235.5583326471968, "3": 5774.493958768434, "4": 88.56162520125505, "5": 42.08704750519246, "6": 43.430245905183156, "7": 83.0083797499534, "8": 2300.0334176421165, "9": 155.16540429554877, "10": 226.986764236353, "11": 94.64538314379747, "12": 109.22247116453943}, "Mean ITL (ms)": {"0": 16.540048188017764, "1": 27.74949681771482, "2": 27.00816342756017, "3": 28.43324468155776, "4": 12.936618607449548, "5": 6.27439203658357, "6": 6.674730034577902, "7": 9.171031044295976, "8": 12.247382195842055, "9": 19.186542548971318, "10": 31.09191192388171, "11": 23.46184330725827, "12": 30.139367799210298}, "Median ITL (ms)": {"0": 15.903530409559608, "1": 25.302208960056305, "2": 20.81073261797428, "3": 22.597191389650106, "4": 12.02711882069707, "5": 6.190762389451265, "6": 6.397241959348321, "7": 7.927508093416691, "8": 11.40562235377729, "9": 17.085016472265124, "10": 24.386791978031397, "11": 21.060695871710777, "12": 27.340279892086983}, "P99 ITL (ms)": {"0": 57.05114717129618, "1": 51.49066437035799, "2": 73.0208202963695, "3": 62.2624043840915, "4": 38.57284761965275, "5": 6.70215771533549, "6": 20.536222611553967, "7": 22.679689247161157, "8": 22.43727925233543, "9": 59.78339056484401, "10": 172.60293657891452, "11": 56.97400974109764, "12": 96.14510252140462}}}
```

You can also check the raw experiment data in the Artifacts tab of the Buildkite page.

๐Ÿš Tests
Waited 7s
ยท
Ran in 40s
๐Ÿš Tests
Waited 4s
ยท
Ran in 38s
Total Job Run Time: 1m 40s