
[Benchmark][Doc] Update throughput benchmark and README (#15998)


Latency tests

  • Input length: 32 tokens.
  • Output length: 128 tokens.
  • Batch size: fixed (8).
  • Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
  • Evaluation metrics: end-to-end latency (mean, median, p99).
| Test name | GPU | Mean latency (ms) | Median latency (ms) | P99 latency (ms) |
|---|---|---|---|---|
| latency_llama70B_tp4 | 8xH100 | 2540.75 | 2540.22 | 2544.89 |
| latency_llama8B_tp1 | 8xH100 | 1046.91 | 1046.82 | 1048.3 |
| latency_mixtral8x7B_tp2 | 8xH100 | 2316.74 | 2320.11 | 2334.17 |
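
For readers who want to reproduce a measurement of this shape outside the CI suite, here is a minimal sketch using vLLM's offline Python API. The model name, prompt construction, and iteration count are illustrative assumptions, not the suite's exact settings:

```python
import time

import numpy as np
from vllm import LLM, SamplingParams

# Illustrative settings matching the test description: a fixed batch of 8
# prompts, roughly 32 input tokens each, 128 output tokens per request.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=1)
prompts = ["Hello " * 32] * 8                     # crude ~32-token prompts (assumption)
params = SamplingParams(temperature=0.0, max_tokens=128, ignore_eos=True)

latencies_ms = []
for _ in range(10):                               # iteration count is an assumption
    start = time.perf_counter()
    llm.generate(prompts, params)                 # one batched end-to-end generation
    latencies_ms.append((time.perf_counter() - start) * 1000)

print(f"mean   {np.mean(latencies_ms):.2f} ms")
print(f"median {np.median(latencies_ms):.2f} ms")
print(f"p99    {np.percentile(latencies_ms, 99):.2f} ms")
```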

Throughput tests

  • Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
  • Output length: the corresponding output length of each of these 200 prompts.
  • Batch size: dynamically determined by vLLM to achieve maximum throughput.
  • Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
  • Evaluation metrics: throughput.
| Test name | GPU | Tput (req/s) |
|---|---|---|
| throughput_llama70B_tp4 | 8xH100 | 9.13468 |
| throughput_llama8B_tp1 | 8xH100 | 21.6201 |
| throughput_mixtral8x7B_tp2 | 8xH100 | 9.0138 |
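
A hedged sketch of how an offline throughput number of this shape can be obtained: submit all sampled prompts to vLLM at once, let its scheduler batch them dynamically, and divide completed requests by wall-clock time. The dataset path, its field layout, and the fixed output budget below are assumptions (per the description above, the real test replays each prompt's recorded output length):

```python
import json
import random
import time

from vllm import LLM, SamplingParams

# Assumed local copy of the ShareGPT dataset and its usual field layout.
random.seed(0)
with open("ShareGPT_V3_unfiltered_cleaned_split.json") as f:
    data = json.load(f)
prompts = random.sample(
    [d["conversations"][0]["value"] for d in data if d.get("conversations")], 200
)

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=1)
# Simplification: a fixed output budget instead of each prompt's recorded length.
params = SamplingParams(temperature=0.0, max_tokens=256)

start = time.perf_counter()
llm.generate(prompts, params)   # vLLM batches all 200 requests dynamically
elapsed = time.perf_counter() - start
print(f"throughput: {len(prompts) / elapsed:.2f} req/s")
```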

Serving tests

  • Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
  • Output length: the corresponding output length of each of these 200 prompts.
  • Batch size: dynamically determined by vLLM and the arrival pattern of the requests.
  • Average QPS (queries per second): 1, 4, 16, and inf. QPS = inf means all requests arrive at once. For the other QPS values, the arrival time of each request is drawn from a Poisson process with a fixed random seed (a sketch of this sampling follows the table below).
  • Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
  • We also added a speculative decoding test for llama-3 70B at QPS 2.
  • Evaluation metrics: throughput, TTFT (time to first token: mean, median, p99), and ITL (inter-token latency: mean, median, p99).
| Test name | GPU | Tput (req/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean ITL (ms) | Median ITL (ms) | P99 ITL (ms) |
|---|---|---|---|---|---|---|---|---|
| serving_llama70B_tp4_sharegpt_qps_1 | 8xH100 | 0.982905 | 46.9257 | 41.3784 | 96.5447 | 20.4458 | 20.0713 | 39.5204 |
| serving_llama70B_tp4_sharegpt_qps_16 | 8xH100 | 7.34964 | 42.9625 | 42.2113 | 66.2351 | 24.3085 | 23.918 | 39.6371 |
| serving_llama70B_tp4_sharegpt_qps_4 | 8xH100 | 3.31252 | 38.0708 | 38.1734 | 56.8092 | 21.873 | 21.8651 | 25.0219 |
| serving_llama70B_tp4_sharegpt_qps_inf | 8xH100 | 10.6181 | 310.843 | 300.908 | 375.259 | 26.2041 | 25.6431 | 47.8632 |
| serving_llama8B_tp1_sharegpt_qps_1 | 8xH100 | 1.00455 | 18.3569 | 17.0653 | 35.8438 | 8.08728 | 8.07435 | 8.54922 |
| serving_llama8B_tp1_sharegpt_qps_16 | 8xH100 | 11.5228 | 19.5259 | 19.811 | 25.4446 | 9.40074 | 9.38206 | 10.676 |
| serving_llama8B_tp1_sharegpt_qps_4 | 8xH100 | 3.78847 | 16.1596 | 15.9192 | 22.108 | 8.41121 | 8.39612 | 8.99637 |
| serving_llama8B_tp1_sharegpt_qps_inf | 8xH100 | 24.1869 | 240.664 | 238.222 | 306.25 | 11.5137 | 11.4632 | 21.9811 |
| serving_mixtral8x7B_tp2_sharegpt_qps_1 | 8xH100 | 0.985639 | 121.694 | 34.9095 | 2259.81 | 16.9988 | 17.1107 | 24.54 |
| serving_mixtral8x7B_tp2_sharegpt_qps_16 | 8xH100 | 7.14885 | 39.6672 | 38.959 | 80.9585 | 21.6034 | 21.3628 | 25.5175 |
| serving_mixtral8x7B_tp2_sharegpt_qps_4 | 8xH100 | 3.29721 | 34.4165 | 34.6872 | 45.9823 | 19.9132 | 20.0408 | 22.6441 |
| serving_mixtral8x7B_tp2_sharegpt_qps_inf | 8xH100 | 9.59572 | 282.61 | 280.799 | 323.141 | 22.6624 | 22.3845 | 28.8505 |
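
The Poisson arrival pattern described above can be generated by drawing inter-arrival gaps from an exponential distribution with mean 1/QPS; QPS = inf degenerates to every request arriving at t = 0. A minimal sketch (the helper name is ours, not the benchmark client's):

```python
import math

import numpy as np

def arrival_times(num_requests: int, qps: float, seed: int = 0) -> np.ndarray:
    """Arrival offsets (seconds) for a Poisson request stream at the given average QPS."""
    if math.isinf(qps):
        return np.zeros(num_requests)          # QPS = inf: all requests at once
    rng = np.random.default_rng(seed)          # fixed seed => reproducible schedule
    gaps = rng.exponential(scale=1.0 / qps, size=num_requests)
    return np.cumsum(gaps)                     # cumulative gaps give arrival times

# Example: 200 requests at an average of 4 QPS (roughly 50 s of traffic).
print(arrival_times(200, qps=4.0)[:5])
print(arrival_times(200, qps=float("inf"))[:5])
```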

JSON version of the benchmarking tables

This section contains the data of the markdown tables above in JSON format. You can load the benchmarking tables into pandas DataFrames as follows:

```python
import json
import pandas as pd

# Paste the JSON string from the section below between the triple quotes.
benchmarking_results_json = """The json string"""
benchmarking_results = json.loads(benchmarking_results_json)

# Each top-level key ("latency", "throughput", "serving") holds one table.
latency_results = pd.DataFrame.from_dict(benchmarking_results["latency"])
throughput_results = pd.DataFrame.from_dict(benchmarking_results["throughput"])
serving_results = pd.DataFrame.from_dict(benchmarking_results["serving"])
```
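
One detail worth knowing before grouping or plotting: in the JSON below, the GPU column stores the device name once per GPU, joined by newlines (e.g. "H100\nH100\n..."). A small follow-up to the snippet above that collapses it into a single name plus a device count:

```python
# Collapse the newline-joined GPU column into a single name and a device count.
for df in (latency_results, throughput_results, serving_results):
    devices = df["GPU"].str.split("\n")
    df["GPU"] = devices.str[0]
    df["Num GPUs"] = devices.str.len()

print(serving_results[["Test name", "GPU", "Num GPUs", "Tput (req/s)"]])
```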

The JSON string for all benchmarking tables:

{"latency": {"Test name": {"0": "latency_mixtral8x7B_tp2", "1": "latency_llama70B_tp4", "2": "latency_llama8B_tp1"}, "GPU": {"0": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "1": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "2": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100"}, "Mean latency (ms)": {"0": 2316.744400312503, "1": 2540.7543117801347, "2": 1046.9095210234325}, "Median latency (ms)": {"0": 2320.1082749292254, "1": 2540.222811512649, "2": 1046.8183774501085}, "P99 latency (ms)": {"0": 2334.1726494580507, "1": 2544.8885366879404, "2": 1048.303036596626}}, "throughput": {"Test name": {"0": "throughput_llama8B_tp1", "1": "throughput_llama70B_tp4", "2": "throughput_mixtral8x7B_tp2"}, "GPU": {"0": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "1": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "2": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100"}, "Tput (req/s)": {"0": 21.620062820024827, "1": 9.134678104490586, "2": 9.013797973436434}}, "serving": {"Test name": {"0": "serving_llama8B_tp1_sharegpt_qps_inf", "1": "serving_mixtral8x7B_tp2_sharegpt_qps_4", "2": "serving_llama70B_tp4_sharegpt_qps_inf", "3": "serving_llama70B_tp4_sharegpt_qps_16", "4": "serving_mixtral8x7B_tp2_sharegpt_qps_16", "5": "serving_llama70B_tp4_sharegpt_qps_1", "6": "serving_mixtral8x7B_tp2_sharegpt_qps_inf", "7": "serving_llama8B_tp1_sharegpt_qps_16", "8": "serving_mixtral8x7B_tp2_sharegpt_qps_1", "9": "serving_llama70B_tp4_sharegpt_qps_4", "10": "serving_llama8B_tp1_sharegpt_qps_4", "11": "serving_llama8B_tp1_sharegpt_qps_1"}, "GPU": {"0": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "1": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "2": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "3": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "4": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "5": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "6": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "7": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "8": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "9": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "10": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "11": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100"}, "Tput (req/s)": {"0": 24.186907198204548, "1": 3.297212254852508, "2": 10.618124942561348, "3": 7.349636420935427, "4": 7.148851923019628, "5": 0.9829047178770594, "6": 9.595717904387822, "7": 11.522796431373235, "8": 0.9856386319035464, "9": 3.3125227896284843, "10": 3.7884722544447005, "11": 1.0045473046889262}, "Mean TTFT (ms)": {"0": 240.66389142535627, "1": 34.4164644414559, "2": 310.8427061699331, "3": 42.96253070700914, "4": 39.66719148680568, "5": 46.92570028826594, "6": 282.6098405988887, "7": 19.525949363596737, "8": 121.69368740171194, "9": 38.0708377296105, "10": 16.159570794552565, "11": 18.35688839200884}, "Median TTFT (ms)": {"0": 238.22169471532106, "1": 34.6872229129076, "2": 300.9081673808396, "3": 42.2112587839365, "4": 38.95895183086395, "5": 41.378372348845005, "6": 280.7990796864033, "7": 19.811025820672512, "8": 34.90947466343641, "9": 38.17338775843382, "10": 15.919176395982504, "11": 17.065261490643024}, "P99 TTFT (ms)": {"0": 306.2498870212585, "1": 45.982348108664155, "2": 375.258861342445, "3": 66.23510554432866, "4": 80.95852104946954, "5": 96.54471790418025, "6": 323.14059362746775, "7": 25.444589080289003, "8": 2259.80990773998, "9": 56.80923787876961, "10": 22.108013881370425, "11": 35.843759244307876}, "Mean ITL (ms)": {"0": 11.513655710447507, "1": 19.91316655936158, "2": 
26.204122954305266, "3": 24.30854885444774, "4": 21.60340456219261, "5": 20.445824146639126, "6": 22.66244359644529, "7": 9.400744091195646, "8": 16.998846889892548, "9": 21.872995575343754, "10": 8.411214496113805, "11": 8.087283300139706}, "Median ITL (ms)": {"0": 11.463219299912453, "1": 20.0408436357975, "2": 25.643108412623405, "3": 23.917971178889275, "4": 21.362835075706244, "5": 20.071309991180897, "6": 22.384504787623882, "7": 9.38205886632204, "8": 17.11069280281663, "9": 21.86509408056736, "10": 8.396124467253685, "11": 8.074347861111164}, "P99 ITL (ms)": {"0": 21.981134973466403, "1": 22.64406639151275, "2": 47.86318650469183, "3": 39.63710363954306, "4": 25.517483763395827, "5": 39.52038638293743, "6": 28.85052182711661, "7": 10.675965584814554, "8": 24.539980804547664, "9": 25.0218597240746, "10": 8.996365405619144, "11": 8.549217134714127}}}

You can also check the raw experiment data in the Artifact tab of the Buildkite page.

Latency tests

  • Input length: 32 tokens.
  • Output length: 128 tokens.
  • Batch size: fixed (8).
  • Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
  • Evaluation metrics: end-to-end latency (mean, median, p99).
| Test name | GPU | Mean latency (ms) | Median latency (ms) | P99 latency (ms) |
|---|---|---|---|---|
| latency_llama70B_tp4 | 8xH200 | 2118.58 | 2119.02 | 2121.19 |
| latency_llama8B_tp1 | 8xH200 | 833.272 | 833.759 | 834.682 |
| latency_mixtral8x7B_tp2 | 8xH200 | 1900.82 | 1902.82 | 1910.8 |

Throughput tests

  • Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
  • Output length: the corresponding output length of each of these 200 prompts.
  • Batch size: dynamically determined by vLLM to achieve maximum throughput.
  • Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
  • Evaluation metrics: throughput.
| Test name | GPU | Tput (req/s) |
|---|---|---|
| throughput_llama70B_tp4 | 8xH200 | 10.6882 |
| throughput_llama8B_tp1 | 8xH200 | 25.6716 |
| throughput_mixtral8x7B_tp2 | 8xH200 | 8.64563 |

Serving tests

  • Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
  • Output length: the corresponding output length of each of these 200 prompts.
  • Batch size: dynamically determined by vLLM and the arrival pattern of the requests.
  • Average QPS (queries per second): 1, 4, 16, and inf. QPS = inf means all requests arrive at once. For the other QPS values, the arrival time of each request is drawn from a Poisson process with a fixed random seed.
  • Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
  • We also added a speculative decoding test for llama-3 70B at QPS 2.
  • Evaluation metrics: throughput, TTFT (time to first token: mean, median, p99), and ITL (inter-token latency: mean, median, p99).
| Test name | GPU | Tput (req/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean ITL (ms) | Median ITL (ms) | P99 ITL (ms) |
|---|---|---|---|---|---|---|---|---|
| serving_llama70B_tp4_sharegpt_qps_1 | 8xH200 | 0.989159 | 43.4688 | 35.7182 | 96.5888 | 16.9733 | 16.6291 | 32.941 |
| serving_llama70B_tp4_sharegpt_qps_16 | 8xH200 | 8.22273 | 38.0496 | 36.9882 | 58.629 | 20.1791 | 19.7701 | 26.677 |
| serving_llama70B_tp4_sharegpt_qps_4 | 8xH200 | 3.4382 | 33.2551 | 33.0601 | 45.445 | 18.0005 | 17.72 | 20.5296 |
| serving_llama70B_tp4_sharegpt_qps_inf | 8xH200 | 12.4872 | 322.868 | 315.396 | 377.538 | 22.552 | 21.9297 | 39.7766 |
| serving_llama8B_tp1_sharegpt_qps_1 | 8xH200 | 1.00729 | 17.6736 | 15.7798 | 59.9132 | 6.45663 | 6.44447 | 6.86137 |
| serving_llama8B_tp1_sharegpt_qps_16 | 8xH200 | 12.2952 | 16.7868 | 16.6197 | 21.9581 | 7.45783 | 7.40181 | 8.80273 |
| serving_llama8B_tp1_sharegpt_qps_4 | 8xH200 | 3.86122 | 14.9627 | 14.6392 | 20.101 | 6.71829 | 6.71521 | 7.16092 |
| serving_llama8B_tp1_sharegpt_qps_inf | 8xH200 | 28.8288 | 247.482 | 247.675 | 301.188 | 9.81606 | 9.72876 | 20.4845 |
| serving_mixtral8x7B_tp2_sharegpt_qps_1 | 8xH200 | 0.994479 | 44.1851 | 34.384 | 237.285 | 12.9853 | 13.4167 | 22.7421 |
| serving_mixtral8x7B_tp2_sharegpt_qps_16 | 8xH200 | 7.19619 | 41.0541 | 40.8124 | 73.2873 | 21.9064 | 22.0236 | 24.7899 |
| serving_mixtral8x7B_tp2_sharegpt_qps_4 | 8xH200 | 3.3036 | 36.2937 | 36.6095 | 47.2847 | 20.2467 | 20.5157 | 21.6394 |
| serving_mixtral8x7B_tp2_sharegpt_qps_inf | 8xH200 | 9.63472 | 281.247 | 269.375 | 325.874 | 23.3612 | 23.1974 | 30.8719 |

JSON version of the benchmarking tables

This section contains the data of the markdown tables above in JSON format. You can load the benchmarking tables into pandas DataFrames as follows:

```python
import json
import pandas as pd

# Paste the JSON string from the section below between the triple quotes.
benchmarking_results_json = """The json string"""
benchmarking_results = json.loads(benchmarking_results_json)

# Each top-level key ("latency", "throughput", "serving") holds one table.
latency_results = pd.DataFrame.from_dict(benchmarking_results["latency"])
throughput_results = pd.DataFrame.from_dict(benchmarking_results["throughput"])
serving_results = pd.DataFrame.from_dict(benchmarking_results["serving"])
```

The JSON string for all benchmarking tables:

{"latency": {"Test name": {"0": "latency_llama8B_tp1", "1": "latency_llama70B_tp4", "2": "latency_mixtral8x7B_tp2"}, "GPU": {"0": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200", "1": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200", "2": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200"}, "Mean latency (ms)": {"0": 833.271651606386, "1": 2118.5753696598113, "2": 1900.8225788672767}, "Median latency (ms)": {"0": 833.7587309069932, "1": 2119.018518831581, "2": 1902.8190749231726}, "P99 latency (ms)": {"0": 834.6816526539624, "1": 2121.194312358275, "2": 1910.802426696755}}, "throughput": {"Test name": {"0": "throughput_llama8B_tp1", "1": "throughput_mixtral8x7B_tp2", "2": "throughput_llama70B_tp4"}, "GPU": {"0": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200", "1": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200", "2": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200"}, "Tput (req/s)": {"0": 25.67157416243356, "1": 8.645627080238913, "2": 10.688174432771335}}, "serving": {"Test name": {"0": "serving_llama70B_tp4_sharegpt_qps_1", "1": "serving_mixtral8x7B_tp2_sharegpt_qps_inf", "2": "serving_llama70B_tp4_sharegpt_qps_16", "3": "serving_llama70B_tp4_sharegpt_qps_inf", "4": "serving_mixtral8x7B_tp2_sharegpt_qps_1", "5": "serving_llama8B_tp1_sharegpt_qps_1", "6": "serving_llama8B_tp1_sharegpt_qps_4", "7": "serving_llama8B_tp1_sharegpt_qps_16", "8": "serving_llama8B_tp1_sharegpt_qps_inf", "9": "serving_llama70B_tp4_sharegpt_qps_4", "10": "serving_mixtral8x7B_tp2_sharegpt_qps_16", "11": "serving_mixtral8x7B_tp2_sharegpt_qps_4"}, "GPU": {"0": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200", "1": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200", "2": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200", "3": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200", "4": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200", "5": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200", "6": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200", "7": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200", "8": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200", "9": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200", "10": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200", "11": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200"}, "Tput (req/s)": {"0": 0.9891587219409574, "1": 9.634719099896412, "2": 8.222729387606314, "3": 12.487191082808714, "4": 0.9944789014061436, "5": 1.007293800387217, "6": 3.861221480628719, "7": 12.2952309368178, "8": 28.828767850421375, "9": 3.4382049852060788, "10": 7.196187099624867, "11": 3.303602432149024}, "Mean TTFT (ms)": {"0": 43.468809713376686, "1": 281.24746197019704, "2": 38.04963176022284, "3": 322.86752470070496, "4": 44.18505497975275, "5": 17.67359694465995, "6": 14.962725100340322, "7": 16.786847764160484, "8": 247.48155995155685, "9": 33.25512474984862, "10": 41.05411588679999, "11": 36.29365864209831}, "Median TTFT (ms)": {"0": 35.718209808692336, "1": 269.37501097563654, "2": 36.988212494179606, "3": 315.39560307282954, "4": 34.38399650622159, "5": 15.779806533828378, "6": 14.639193541370332, "7": 16.61967358086258, "8": 247.6753635564819, "9": 33.06009899824858, "10": 40.81238398794085, "11": 36.60947049502283}, "P99 TTFT (ms)": {"0": 96.58877548761551, "1": 325.87380204582587, "2": 58.62896100850774, "3": 377.5383864669129, "4": 237.28483830811228, "5": 59.91316309198732, "6": 20.100962540600257, "7": 21.958127913530916, "8": 301.18788017425686, "9": 45.44501084135842, "10": 73.28729778761043, "11": 47.28467447916046}, "Mean ITL (ms)": {"0": 16.973273727020054, "1": 23.361190750207047, "2": 
20.179083884169412, "3": 22.552021951497785, "4": 12.985279223130725, "5": 6.4566331521403795, "6": 6.718289552257886, "7": 7.457826713022044, "8": 9.816059000520152, "9": 18.000531091515306, "10": 21.9063759349119, "11": 20.246724950082804}, "Median ITL (ms)": {"0": 16.629134071990848, "1": 23.19735533092171, "2": 19.770101876929402, "3": 21.92965685389936, "4": 13.41667806264013, "5": 6.444473983719945, "6": 6.715206196531653, "7": 7.4018139857798815, "8": 9.728759061545134, "9": 17.719964031130075, "10": 22.023566532880068, "11": 20.515665062703192}, "P99 ITL (ms)": {"0": 32.94095506425944, "1": 30.871940082870413, "2": 26.676964415237308, "3": 39.776585944928215, "4": 22.742118535097646, "5": 6.861367318779231, "6": 7.160921916365624, "7": 8.802727032452824, "8": 20.48447238281375, "9": 20.529647427611053, "10": 24.78994420496747, "11": 21.639364948496222}}}

You can also check the raw experiment data in the Artifact tab of the Buildkite page.

Latency tests

  • Input length: 32 tokens.
  • Output length: 128 tokens.
  • Batch size: fixed (8).
  • Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
  • Evaluation metrics: end-to-end latency (mean, median, p99).
| Test name | GPU | Mean latency (ms) | Median latency (ms) | P99 latency (ms) |
|---|---|---|---|---|
| latency_llama70B_tp4 | 8xA100-SXM4-80GB | 4023.16 | 4022.88 | 4025.69 |
| latency_llama8B_tp1 | 8xA100-SXM4-80GB | 1563.77 | 1563.69 | 1564.5 |
| latency_mixtral8x7B_tp2 | 8xA100-SXM4-80GB | 3588.68 | 3591.41 | 3618.25 |

Throughput tests

  • Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
  • Output length: the corresponding output length of each of these 200 prompts.
  • Batch size: dynamically determined by vLLM to achieve maximum throughput.
  • Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
  • Evaluation metrics: throughput.
| Test name | GPU | Tput (req/s) |
|---|---|---|
| throughput_llama70B_tp4 | 8xA100-SXM4-80GB | 5.20635 |
| throughput_llama8B_tp1 | 8xA100-SXM4-80GB | 12.3242 |
| throughput_mixtral8x7B_tp2 | 8xA100-SXM4-80GB | 5.56502 |

Serving tests

  • Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
  • Output length: the corresponding output length of each of these 200 prompts.
  • Batch size: dynamically determined by vLLM and the arrival pattern of the requests.
  • Average QPS (queries per second): 1, 4, 16, and inf. QPS = inf means all requests arrive at once. For the other QPS values, the arrival time of each request is drawn from a Poisson process with a fixed random seed.
  • Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
  • We also added a speculative decoding test for llama-3 70B at QPS 2.
  • Evaluation metrics: throughput, TTFT (time to first token: mean, median, p99), and ITL (inter-token latency: mean, median, p99).
| Test name | GPU | Tput (req/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean ITL (ms) | Median ITL (ms) | P99 ITL (ms) |
|---|---|---|---|---|---|---|---|---|
| serving_llama70B_tp4_sharegpt_qps_1 | 8xA100-SXM4-80GB | 0.950337 | 89.657 | 71.7668 | 213.161 | 34.1761 | 31.7823 | 78.7449 |
| serving_llama70B_tp4_sharegpt_qps_16 | 8xA100-SXM4-80GB | 5.18335 | 70.2521 | 67.5836 | 107.834 | 42.5295 | 43.1636 | 85.2469 |
| serving_llama70B_tp4_sharegpt_qps_4 | 8xA100-SXM4-80GB | 2.91829 | 60.0811 | 58.9892 | 79.4673 | 37.2828 | 37.0384 | 74.5274 |
| serving_llama70B_tp4_sharegpt_qps_inf | 8xA100-SXM4-80GB | 6.31563 | 660.594 | 726.653 | 744.767 | 47.047 | 44.5985 | 88.9333 |
| serving_llama8B_tp1_sharegpt_qps_1 | 8xA100-SXM4-80GB | 0.996789 | 33.8842 | 28.4588 | 75.8084 | 12.3951 | 12.1616 | 24.3049 |
| serving_llama8B_tp1_sharegpt_qps_16 | 8xA100-SXM4-80GB | 9.25824 | 32.8746 | 32.3245 | 46.8777 | 17.0301 | 17.4238 | 23.1376 |
| serving_llama8B_tp1_sharegpt_qps_4 | 8xA100-SXM4-80GB | 3.60743 | 24.1183 | 24.3184 | 31.4861 | 13.0238 | 12.8767 | 15.4021 |
| serving_llama8B_tp1_sharegpt_qps_inf | 8xA100-SXM4-80GB | 14.243 | 506.841 | 513.762 | 573.928 | 20.6969 | 20.2554 | 36.0529 |
| serving_mixtral8x7B_tp2_sharegpt_qps_1 | 8xA100-SXM4-80GB | 0.954064 | 72.8993 | 54.941 | 288.59 | 28.6196 | 30.4361 | 42.9455 |
| serving_mixtral8x7B_tp2_sharegpt_qps_16 | 8xA100-SXM4-80GB | 5.18713 | 59.1705 | 58.6186 | 80.7185 | 34.2135 | 34.4161 | 39.0515 |
| serving_mixtral8x7B_tp2_sharegpt_qps_4 | 8xA100-SXM4-80GB | 2.91228 | 53.662 | 52.9766 | 72.7159 | 32.3259 | 32.094 | 36.247 |
| serving_mixtral8x7B_tp2_sharegpt_qps_inf | 8xA100-SXM4-80GB | 6.22647 | 514.894 | 531.726 | 585.142 | 35.6269 | 35.9529 | 41.2122 |

JSON version of the benchmarking tables

This section contains the data of the markdown tables above in JSON format. You can load the benchmarking tables into pandas DataFrames as follows:

```python
import json
import pandas as pd

# Paste the JSON string from the section below between the triple quotes.
benchmarking_results_json = """The json string"""
benchmarking_results = json.loads(benchmarking_results_json)

# Each top-level key ("latency", "throughput", "serving") holds one table.
latency_results = pd.DataFrame.from_dict(benchmarking_results["latency"])
throughput_results = pd.DataFrame.from_dict(benchmarking_results["throughput"])
serving_results = pd.DataFrame.from_dict(benchmarking_results["serving"])
```

The JSON string for all benchmarking tables:

{"latency": {"Test name": {"0": "latency_llama8B_tp1", "1": "latency_llama70B_tp4", "2": "latency_mixtral8x7B_tp2"}, "GPU": {"0": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "1": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "2": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB"}, "Mean latency (ms)": {"0": 1563.7737552945812, "1": 4023.1602351491647, "2": 3588.676120651265}, "Median latency (ms)": {"0": 1563.6915154755116, "1": 4022.8782389312983, "2": 3591.4132557809353}, "P99 latency (ms)": {"0": 1564.5017629861832, "1": 4025.694858431816, "2": 3618.2455145195127}}, "throughput": {"Test name": {"0": "throughput_llama8B_tp1", "1": "throughput_llama70B_tp4", "2": "throughput_mixtral8x7B_tp2"}, "GPU": {"0": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "1": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "2": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB"}, "Tput (req/s)": {"0": 12.324186389245893, "1": 5.206345478318473, "2": 5.5650183932356265}}, "serving": {"Test name": {"0": "serving_llama8B_tp1_sharegpt_qps_1", "1": "serving_llama8B_tp1_sharegpt_qps_4", "2": "serving_llama8B_tp1_sharegpt_qps_16", "3": "serving_llama8B_tp1_sharegpt_qps_inf", "4": "serving_llama70B_tp4_sharegpt_qps_1", "5": "serving_llama70B_tp4_sharegpt_qps_4", "6": "serving_llama70B_tp4_sharegpt_qps_16", "7": "serving_llama70B_tp4_sharegpt_qps_inf", "8": "serving_mixtral8x7B_tp2_sharegpt_qps_1", "9": "serving_mixtral8x7B_tp2_sharegpt_qps_4", "10": "serving_mixtral8x7B_tp2_sharegpt_qps_16", "11": "serving_mixtral8x7B_tp2_sharegpt_qps_inf"}, "GPU": {"0": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "1": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "2": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "3": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "4": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "5": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "6": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "7": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "8": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "9": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "10": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "11": 
"A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB"}, "Tput (req/s)": {"0": 0.9967892714078663, "1": 3.607425078737425, "2": 9.25824243335634, "3": 14.24299102466845, "4": 0.9503372682041944, "5": 2.9182876089870216, "6": 5.183348193259361, "7": 6.315625454299485, "8": 0.9540639615561795, "9": 2.912280466248237, "10": 5.187129037993653, "11": 6.226466409528192}, "Mean TTFT (ms)": {"0": 33.884167866781354, "1": 24.11834311671555, "2": 32.8746001701802, "3": 506.84092393144965, "4": 89.6570359962061, "5": 60.08108027745038, "6": 70.25212858337909, "7": 660.5936508392915, "8": 72.89926828816533, "9": 53.662010272964835, "10": 59.170533269643784, "11": 514.8943650210276}, "Median TTFT (ms)": {"0": 28.458827175199986, "1": 24.318381678313017, "2": 32.324529718607664, "3": 513.7622435577214, "4": 71.76684122532606, "5": 58.98921377956867, "6": 67.58361915126443, "7": 726.6527987085283, "8": 54.940960835665464, "9": 52.976563572883606, "10": 58.61864611506462, "11": 531.7255398258567}, "P99 TTFT (ms)": {"0": 75.80841735005379, "1": 31.486078994348645, "2": 46.87770428135988, "3": 573.9277134649456, "4": 213.16094738431252, "5": 79.46726609021425, "6": 107.83371409401296, "7": 744.7674644552171, "8": 288.5898773930859, "9": 72.7159012760967, "10": 80.7185451872645, "11": 585.14214402996}, "Mean ITL (ms)": {"0": 12.395056888603717, "1": 13.023832919228793, "2": 17.030094147154763, "3": 20.69686610245767, "4": 34.176110649384405, "5": 37.2827830768884, "6": 42.529486680323735, "7": 47.046975418218196, "8": 28.61957275412666, "9": 32.32587099025326, "10": 34.21349674631134, "11": 35.62688467661398}, "Median ITL (ms)": {"0": 12.16161623597145, "1": 12.876669876277447, "2": 17.423816956579685, "3": 20.2554352581501, "4": 31.782250851392746, "5": 37.038447335362434, "6": 43.16364694386721, "7": 44.59854308515787, "8": 30.436135828495026, "9": 32.09400922060013, "10": 34.416137263178825, "11": 35.95291264355183}, "P99 ITL (ms)": {"0": 24.30490868166089, "1": 15.40208488702774, "2": 23.1375889480114, "3": 36.05294151231648, "4": 78.74485917389393, "5": 74.52739045023918, "6": 85.24688880890606, "7": 88.93326073884967, "8": 42.94547829777049, "9": 36.246955022215865, "10": 39.05145451426507, "11": 41.21215883642435}}}

You can also check the raw experiment data in the Artifact tab of the Buildkite page.
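
Since the H100, H200, and A100-SXM4-80GB runs above share the same table schema, their serving results can be compared side by side by loading each section's JSON string and concatenating the frames. A rough sketch, assuming the three JSON strings have been saved locally as h100.json, h200.json, and a100.json (the filenames are ours):

```python
import json

import pandas as pd

frames = []
for label, path in [("H100", "h100.json"), ("H200", "h200.json"), ("A100", "a100.json")]:
    with open(path) as f:
        results = json.load(f)
    df = pd.DataFrame.from_dict(results["serving"])
    df["GPU"] = label                  # replace the newline-joined column with a clean label
    frames.append(df)

serving = pd.concat(frames, ignore_index=True)
# One row per serving test, one throughput column per GPU type.
print(serving.pivot(index="Test name", columns="GPU", values="Tput (req/s)"))
```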
