
[Benchmark][Doc] Update throughput benchmark and README (#15998)


Latency tests

  • Input length: 32 tokens.
  • Output length: 128 tokens.
  • Batch size: fixed (8).
  • Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
  • Evaluation metrics: end-to-end latency (mean, median, p99).
| Test name | GPU | Mean latency (ms) | Median latency (ms) | P99 latency (ms) |
|---|---|---|---|---|
| latency_llama70B_tp4 | 8xH100 | 2540.75 | 2540.22 | 2544.89 |
| latency_llama8B_tp1 | 8xH100 | 1046.91 | 1046.82 | 1048.3 |
| latency_mixtral8x7B_tp2 | 8xH100 | 2316.74 | 2320.11 | 2334.17 |
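
For readers who want to reproduce a measurement of this shape outside the CI suite, here is a minimal sketch using vLLM's offline Python API. The model name, prompt construction, and iteration count are illustrative assumptions, not the suite's exact settings:

```python
import time

import numpy as np
from vllm import LLM, SamplingParams

# Illustrative settings matching the test description: a fixed batch of 8
# prompts, roughly 32 input tokens each, 128 output tokens per request.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=1)
prompts = ["Hello " * 32] * 8                     # crude ~32-token prompts (assumption)
params = SamplingParams(temperature=0.0, max_tokens=128, ignore_eos=True)

latencies_ms = []
for _ in range(10):                               # iteration count is an assumption
    start = time.perf_counter()
    llm.generate(prompts, params)                 # one batched end-to-end generation
    latencies_ms.append((time.perf_counter() - start) * 1000)

print(f"mean   {np.mean(latencies_ms):.2f} ms")
print(f"median {np.median(latencies_ms):.2f} ms")
print(f"p99    {np.percentile(latencies_ms, 99):.2f} ms")
```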

Throughput tests

  • Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
  • Output length: the corresponding output length of each of these 200 prompts.
  • Batch size: dynamically determined by vLLM to achieve maximum throughput.
  • Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
  • Evaluation metrics: throughput.
| Test name | GPU | Tput (req/s) |
|---|---|---|
| throughput_llama70B_tp4 | 8xH100 | 9.13468 |
| throughput_llama8B_tp1 | 8xH100 | 21.6201 |
| throughput_mixtral8x7B_tp2 | 8xH100 | 9.0138 |
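
A hedged sketch of how an offline throughput number of this shape can be obtained: submit all sampled prompts to vLLM at once, let its scheduler batch them dynamically, and divide completed requests by wall-clock time. The dataset path, its field layout, and the fixed output budget below are assumptions (per the description above, the real test replays each prompt's recorded output length):

```python
import json
import random
import time

from vllm import LLM, SamplingParams

# Assumed local copy of the ShareGPT dataset and its usual field layout.
random.seed(0)
with open("ShareGPT_V3_unfiltered_cleaned_split.json") as f:
    data = json.load(f)
prompts = random.sample(
    [d["conversations"][0]["value"] for d in data if d.get("conversations")], 200
)

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=1)
# Simplification: a fixed output budget instead of each prompt's recorded length.
params = SamplingParams(temperature=0.0, max_tokens=256)

start = time.perf_counter()
llm.generate(prompts, params)   # vLLM batches all 200 requests dynamically
elapsed = time.perf_counter() - start
print(f"throughput: {len(prompts) / elapsed:.2f} req/s")
```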

Serving tests

  • Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
  • Output length: the corresponding output length of each of these 200 prompts.
  • Batch size: dynamically determined by vLLM and the arrival pattern of the requests.
  • Average QPS (queries per second): 1, 4, 16, and inf. QPS = inf means all requests arrive at once. For the other QPS values, the arrival time of each request is drawn from a Poisson process with a fixed random seed (a sketch of this sampling follows the table below).
  • Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
  • We also added a speculative decoding test for llama-3 70B at QPS 2.
  • Evaluation metrics: throughput, TTFT (time to first token: mean, median, p99), and ITL (inter-token latency: mean, median, p99).
| Test name | GPU | Tput (req/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean ITL (ms) | Median ITL (ms) | P99 ITL (ms) |
|---|---|---|---|---|---|---|---|---|
| serving_llama70B_tp4_sharegpt_qps_1 | 8xH100 | 0.982905 | 46.9257 | 41.3784 | 96.5447 | 20.4458 | 20.0713 | 39.5204 |
| serving_llama70B_tp4_sharegpt_qps_16 | 8xH100 | 7.34964 | 42.9625 | 42.2113 | 66.2351 | 24.3085 | 23.918 | 39.6371 |
| serving_llama70B_tp4_sharegpt_qps_4 | 8xH100 | 3.31252 | 38.0708 | 38.1734 | 56.8092 | 21.873 | 21.8651 | 25.0219 |
| serving_llama70B_tp4_sharegpt_qps_inf | 8xH100 | 10.6181 | 310.843 | 300.908 | 375.259 | 26.2041 | 25.6431 | 47.8632 |
| serving_llama8B_tp1_sharegpt_qps_1 | 8xH100 | 1.00455 | 18.3569 | 17.0653 | 35.8438 | 8.08728 | 8.07435 | 8.54922 |
| serving_llama8B_tp1_sharegpt_qps_16 | 8xH100 | 11.5228 | 19.5259 | 19.811 | 25.4446 | 9.40074 | 9.38206 | 10.676 |
| serving_llama8B_tp1_sharegpt_qps_4 | 8xH100 | 3.78847 | 16.1596 | 15.9192 | 22.108 | 8.41121 | 8.39612 | 8.99637 |
| serving_llama8B_tp1_sharegpt_qps_inf | 8xH100 | 24.1869 | 240.664 | 238.222 | 306.25 | 11.5137 | 11.4632 | 21.9811 |
| serving_mixtral8x7B_tp2_sharegpt_qps_1 | 8xH100 | 0.985639 | 121.694 | 34.9095 | 2259.81 | 16.9988 | 17.1107 | 24.54 |
| serving_mixtral8x7B_tp2_sharegpt_qps_16 | 8xH100 | 7.14885 | 39.6672 | 38.959 | 80.9585 | 21.6034 | 21.3628 | 25.5175 |
| serving_mixtral8x7B_tp2_sharegpt_qps_4 | 8xH100 | 3.29721 | 34.4165 | 34.6872 | 45.9823 | 19.9132 | 20.0408 | 22.6441 |
| serving_mixtral8x7B_tp2_sharegpt_qps_inf | 8xH100 | 9.59572 | 282.61 | 280.799 | 323.141 | 22.6624 | 22.3845 | 28.8505 |
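
The Poisson arrival pattern described above can be generated by drawing inter-arrival gaps from an exponential distribution with mean 1/QPS; QPS = inf degenerates to every request arriving at t = 0. A minimal sketch (the helper name is ours, not the benchmark client's):

```python
import math

import numpy as np

def arrival_times(num_requests: int, qps: float, seed: int = 0) -> np.ndarray:
    """Arrival offsets (seconds) for a Poisson request stream at the given average QPS."""
    if math.isinf(qps):
        return np.zeros(num_requests)          # QPS = inf: all requests at once
    rng = np.random.default_rng(seed)          # fixed seed => reproducible schedule
    gaps = rng.exponential(scale=1.0 / qps, size=num_requests)
    return np.cumsum(gaps)                     # cumulative gaps give arrival times

# Example: 200 requests at an average of 4 QPS (roughly 50 s of traffic).
print(arrival_times(200, qps=4.0)[:5])
print(arrival_times(200, qps=float("inf"))[:5])
```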

JSON version of the benchmarking tables

This section contains the data of the markdown tables above in JSON format. You can load the benchmarking tables into pandas DataFrames as follows:

```python
import json
import pandas as pd

# Paste the JSON string from the section below between the triple quotes.
benchmarking_results_json = """The json string"""
benchmarking_results = json.loads(benchmarking_results_json)

# Each top-level key ("latency", "throughput", "serving") holds one table.
latency_results = pd.DataFrame.from_dict(benchmarking_results["latency"])
throughput_results = pd.DataFrame.from_dict(benchmarking_results["throughput"])
serving_results = pd.DataFrame.from_dict(benchmarking_results["serving"])
```
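
One detail worth knowing before grouping or plotting: in the JSON below, the GPU column stores the device name once per GPU, joined by newlines (e.g. "H100\nH100\n..."). A small follow-up to the snippet above that collapses it into a single name plus a device count:

```python
# Collapse the newline-joined GPU column into a single name and a device count.
for df in (latency_results, throughput_results, serving_results):
    devices = df["GPU"].str.split("\n")
    df["GPU"] = devices.str[0]
    df["Num GPUs"] = devices.str.len()

print(serving_results[["Test name", "GPU", "Num GPUs", "Tput (req/s)"]])
```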

The JSON string for all benchmarking tables:

{"latency": {"Test name": {"0": "latency_mixtral8x7B_tp2", "1": "latency_llama70B_tp4", "2": "latency_llama8B_tp1"}, "GPU": {"0": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "1": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "2": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100"}, "Mean latency (ms)": {"0": 2316.744400312503, "1": 2540.7543117801347, "2": 1046.9095210234325}, "Median latency (ms)": {"0": 2320.1082749292254, "1": 2540.222811512649, "2": 1046.8183774501085}, "P99 latency (ms)": {"0": 2334.1726494580507, "1": 2544.8885366879404, "2": 1048.303036596626}}, "throughput": {"Test name": {"0": "throughput_llama8B_tp1", "1": "throughput_llama70B_tp4", "2": "throughput_mixtral8x7B_tp2"}, "GPU": {"0": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "1": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "2": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100"}, "Tput (req/s)": {"0": 21.620062820024827, "1": 9.134678104490586, "2": 9.013797973436434}}, "serving": {"Test name": {"0": "serving_llama8B_tp1_sharegpt_qps_inf", "1": "serving_mixtral8x7B_tp2_sharegpt_qps_4", "2": "serving_llama70B_tp4_sharegpt_qps_inf", "3": "serving_llama70B_tp4_sharegpt_qps_16", "4": "serving_mixtral8x7B_tp2_sharegpt_qps_16", "5": "serving_llama70B_tp4_sharegpt_qps_1", "6": "serving_mixtral8x7B_tp2_sharegpt_qps_inf", "7": "serving_llama8B_tp1_sharegpt_qps_16", "8": "serving_mixtral8x7B_tp2_sharegpt_qps_1", "9": "serving_llama70B_tp4_sharegpt_qps_4", "10": "serving_llama8B_tp1_sharegpt_qps_4", "11": "serving_llama8B_tp1_sharegpt_qps_1"}, "GPU": {"0": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "1": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "2": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "3": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "4": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "5": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "6": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "7": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "8": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "9": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "10": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "11": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100"}, "Tput (req/s)": {"0": 24.186907198204548, "1": 3.297212254852508, "2": 10.618124942561348, "3": 7.349636420935427, "4": 7.148851923019628, "5": 0.9829047178770594, "6": 9.595717904387822, "7": 11.522796431373235, "8": 0.9856386319035464, "9": 3.3125227896284843, "10": 3.7884722544447005, "11": 1.0045473046889262}, "Mean TTFT (ms)": {"0": 240.66389142535627, "1": 34.4164644414559, "2": 310.8427061699331, "3": 42.96253070700914, "4": 39.66719148680568, "5": 46.92570028826594, "6": 282.6098405988887, "7": 19.525949363596737, "8": 121.69368740171194, "9": 38.0708377296105, "10": 16.159570794552565, "11": 18.35688839200884}, "Median TTFT (ms)": {"0": 238.22169471532106, "1": 34.6872229129076, "2": 300.9081673808396, "3": 42.2112587839365, "4": 38.95895183086395, "5": 41.378372348845005, "6": 280.7990796864033, "7": 19.811025820672512, "8": 34.90947466343641, "9": 38.17338775843382, "10": 15.919176395982504, "11": 17.065261490643024}, "P99 TTFT (ms)": {"0": 306.2498870212585, "1": 45.982348108664155, "2": 375.258861342445, "3": 66.23510554432866, "4": 80.95852104946954, "5": 96.54471790418025, "6": 323.14059362746775, "7": 25.444589080289003, "8": 2259.80990773998, "9": 56.80923787876961, "10": 22.108013881370425, "11": 35.843759244307876}, "Mean ITL (ms)": {"0": 11.513655710447507, "1": 19.91316655936158, "2": 
26.204122954305266, "3": 24.30854885444774, "4": 21.60340456219261, "5": 20.445824146639126, "6": 22.66244359644529, "7": 9.400744091195646, "8": 16.998846889892548, "9": 21.872995575343754, "10": 8.411214496113805, "11": 8.087283300139706}, "Median ITL (ms)": {"0": 11.463219299912453, "1": 20.0408436357975, "2": 25.643108412623405, "3": 23.917971178889275, "4": 21.362835075706244, "5": 20.071309991180897, "6": 22.384504787623882, "7": 9.38205886632204, "8": 17.11069280281663, "9": 21.86509408056736, "10": 8.396124467253685, "11": 8.074347861111164}, "P99 ITL (ms)": {"0": 21.981134973466403, "1": 22.64406639151275, "2": 47.86318650469183, "3": 39.63710363954306, "4": 25.517483763395827, "5": 39.52038638293743, "6": 28.85052182711661, "7": 10.675965584814554, "8": 24.539980804547664, "9": 25.0218597240746, "10": 8.996365405619144, "11": 8.549217134714127}}}

You can also check the raw experiment data in the Artifact tab of the Buildkite page.

Latency tests

  • Input length: 32 tokens.
  • Output length: 128 tokens.
  • Batch size: fixed (8).
  • Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
  • Evaluation metrics: end-to-end latency (mean, median, p99).
| Test name | GPU | Mean latency (ms) | Median latency (ms) | P99 latency (ms) |
|---|---|---|---|---|
| latency_llama70B_tp4 | 8xH200 | 2118.58 | 2119.02 | 2121.19 |
| latency_llama8B_tp1 | 8xH200 | 833.272 | 833.759 | 834.682 |
| latency_mixtral8x7B_tp2 | 8xH200 | 1900.82 | 1902.82 | 1910.8 |

Throughput tests

  • Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
  • Output length: the corresponding output length of each of these 200 prompts.
  • Batch size: dynamically determined by vLLM to achieve maximum throughput.
  • Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
  • Evaluation metrics: throughput.
| Test name | GPU | Tput (req/s) |
|---|---|---|
| throughput_llama70B_tp4 | 8xH200 | 10.6882 |
| throughput_llama8B_tp1 | 8xH200 | 25.6716 |
| throughput_mixtral8x7B_tp2 | 8xH200 | 8.64563 |

Serving tests

  • Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
  • Output length: the corresponding output length of each of these 200 prompts.
  • Batch size: dynamically determined by vLLM and the arrival pattern of the requests.
  • Average QPS (queries per second): 1, 4, 16, and inf. QPS = inf means all requests arrive at once. For the other QPS values, the arrival time of each request is drawn from a Poisson process with a fixed random seed.
  • Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
  • We also added a speculative decoding test for llama-3 70B at QPS 2.
  • Evaluation metrics: throughput, TTFT (time to first token: mean, median, p99), and ITL (inter-token latency: mean, median, p99).
| Test name | GPU | Tput (req/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean ITL (ms) | Median ITL (ms) | P99 ITL (ms) |
|---|---|---|---|---|---|---|---|---|
| serving_llama70B_tp4_sharegpt_qps_1 | 8xH200 | 0.989159 | 43.4688 | 35.7182 | 96.5888 | 16.9733 | 16.6291 | 32.941 |
| serving_llama70B_tp4_sharegpt_qps_16 | 8xH200 | 8.22273 | 38.0496 | 36.9882 | 58.629 | 20.1791 | 19.7701 | 26.677 |
| serving_llama70B_tp4_sharegpt_qps_4 | 8xH200 | 3.4382 | 33.2551 | 33.0601 | 45.445 | 18.0005 | 17.72 | 20.5296 |
| serving_llama70B_tp4_sharegpt_qps_inf | 8xH200 | 12.4872 | 322.868 | 315.396 | 377.538 | 22.552 | 21.9297 | 39.7766 |
| serving_llama8B_tp1_sharegpt_qps_1 | 8xH200 | 1.00729 | 17.6736 | 15.7798 | 59.9132 | 6.45663 | 6.44447 | 6.86137 |
| serving_llama8B_tp1_sharegpt_qps_16 | 8xH200 | 12.2952 | 16.7868 | 16.6197 | 21.9581 | 7.45783 | 7.40181 | 8.80273 |
| serving_llama8B_tp1_sharegpt_qps_4 | 8xH200 | 3.86122 | 14.9627 | 14.6392 | 20.101 | 6.71829 | 6.71521 | 7.16092 |
| serving_llama8B_tp1_sharegpt_qps_inf | 8xH200 | 28.8288 | 247.482 | 247.675 | 301.188 | 9.81606 | 9.72876 | 20.4845 |
| serving_mixtral8x7B_tp2_sharegpt_qps_1 | 8xH200 | 0.994479 | 44.1851 | 34.384 | 237.285 | 12.9853 | 13.4167 | 22.7421 |
| serving_mixtral8x7B_tp2_sharegpt_qps_16 | 8xH200 | 7.19619 | 41.0541 | 40.8124 | 73.2873 | 21.9064 | 22.0236 | 24.7899 |
| serving_mixtral8x7B_tp2_sharegpt_qps_4 | 8xH200 | 3.3036 | 36.2937 | 36.6095 | 47.2847 | 20.2467 | 20.5157 | 21.6394 |
| serving_mixtral8x7B_tp2_sharegpt_qps_inf | 8xH200 | 9.63472 | 281.247 | 269.375 | 325.874 | 23.3612 | 23.1974 | 30.8719 |

JSON version of the benchmarking tables

This section contains the data of the markdown tables above in JSON format. You can load the benchmarking tables into pandas DataFrames as follows:

```python
import json
import pandas as pd

# Paste the JSON string from the section below between the triple quotes.
benchmarking_results_json = """The json string"""
benchmarking_results = json.loads(benchmarking_results_json)

# Each top-level key ("latency", "throughput", "serving") holds one table.
latency_results = pd.DataFrame.from_dict(benchmarking_results["latency"])
throughput_results = pd.DataFrame.from_dict(benchmarking_results["throughput"])
serving_results = pd.DataFrame.from_dict(benchmarking_results["serving"])
```

The JSON string for all benchmarking tables:

{"latency": {"Test name": {"0": "latency_llama8B_tp1", "1": "latency_llama70B_tp4", "2": "latency_mixtral8x7B_tp2"}, "GPU": {"0": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200", "1": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200", "2": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200"}, "Mean latency (ms)": {"0": 833.271651606386, "1": 2118.5753696598113, "2": 1900.8225788672767}, "Median latency (ms)": {"0": 833.7587309069932, "1": 2119.018518831581, "2": 1902.8190749231726}, "P99 latency (ms)": {"0": 834.6816526539624, "1": 2121.194312358275, "2": 1910.802426696755}}, "throughput": {"Test name": {"0": "throughput_llama8B_tp1", "1": "throughput_mixtral8x7B_tp2", "2": "throughput_llama70B_tp4"}, "GPU": {"0": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200", "1": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200", "2": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200"}, "Tput (req/s)": {"0": 25.67157416243356, "1": 8.645627080238913, "2": 10.688174432771335}}, "serving": {"Test name": {"0": "serving_llama70B_tp4_sharegpt_qps_1", "1": "serving_mixtral8x7B_tp2_sharegpt_qps_inf", "2": "serving_llama70B_tp4_sharegpt_qps_16", "3": "serving_llama70B_tp4_sharegpt_qps_inf", "4": "serving_mixtral8x7B_tp2_sharegpt_qps_1", "5": "serving_llama8B_tp1_sharegpt_qps_1", "6": "serving_llama8B_tp1_sharegpt_qps_4", "7": "serving_llama8B_tp1_sharegpt_qps_16", "8": "serving_llama8B_tp1_sharegpt_qps_inf", "9": "serving_llama70B_tp4_sharegpt_qps_4", "10": "serving_mixtral8x7B_tp2_sharegpt_qps_16", "11": "serving_mixtral8x7B_tp2_sharegpt_qps_4"}, "GPU": {"0": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200", "1": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200", "2": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200", "3": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200", "4": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200", "5": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200", "6": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200", "7": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200", "8": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200", "9": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200", "10": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200", "11": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200"}, "Tput (req/s)": {"0": 0.9891587219409574, "1": 9.634719099896412, "2": 8.222729387606314, "3": 12.487191082808714, "4": 0.9944789014061436, "5": 1.007293800387217, "6": 3.861221480628719, "7": 12.2952309368178, "8": 28.828767850421375, "9": 3.4382049852060788, "10": 7.196187099624867, "11": 3.303602432149024}, "Mean TTFT (ms)": {"0": 43.468809713376686, "1": 281.24746197019704, "2": 38.04963176022284, "3": 322.86752470070496, "4": 44.18505497975275, "5": 17.67359694465995, "6": 14.962725100340322, "7": 16.786847764160484, "8": 247.48155995155685, "9": 33.25512474984862, "10": 41.05411588679999, "11": 36.29365864209831}, "Median TTFT (ms)": {"0": 35.718209808692336, "1": 269.37501097563654, "2": 36.988212494179606, "3": 315.39560307282954, "4": 34.38399650622159, "5": 15.779806533828378, "6": 14.639193541370332, "7": 16.61967358086258, "8": 247.6753635564819, "9": 33.06009899824858, "10": 40.81238398794085, "11": 36.60947049502283}, "P99 TTFT (ms)": {"0": 96.58877548761551, "1": 325.87380204582587, "2": 58.62896100850774, "3": 377.5383864669129, "4": 237.28483830811228, "5": 59.91316309198732, "6": 20.100962540600257, "7": 21.958127913530916, "8": 301.18788017425686, "9": 45.44501084135842, "10": 73.28729778761043, "11": 47.28467447916046}, "Mean ITL (ms)": {"0": 16.973273727020054, "1": 23.361190750207047, "2": 
20.179083884169412, "3": 22.552021951497785, "4": 12.985279223130725, "5": 6.4566331521403795, "6": 6.718289552257886, "7": 7.457826713022044, "8": 9.816059000520152, "9": 18.000531091515306, "10": 21.9063759349119, "11": 20.246724950082804}, "Median ITL (ms)": {"0": 16.629134071990848, "1": 23.19735533092171, "2": 19.770101876929402, "3": 21.92965685389936, "4": 13.41667806264013, "5": 6.444473983719945, "6": 6.715206196531653, "7": 7.4018139857798815, "8": 9.728759061545134, "9": 17.719964031130075, "10": 22.023566532880068, "11": 20.515665062703192}, "P99 ITL (ms)": {"0": 32.94095506425944, "1": 30.871940082870413, "2": 26.676964415237308, "3": 39.776585944928215, "4": 22.742118535097646, "5": 6.861367318779231, "6": 7.160921916365624, "7": 8.802727032452824, "8": 20.48447238281375, "9": 20.529647427611053, "10": 24.78994420496747, "11": 21.639364948496222}}}

You can also check the raw experiment data in the Artifact tab of the Buildkite page.

Latency tests

  • Input length: 32 tokens.
  • Output length: 128 tokens.
  • Batch size: fixed (8).
  • Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
  • Evaluation metrics: end-to-end latency (mean, median, p99).
| Test name | GPU | Mean latency (ms) | Median latency (ms) | P99 latency (ms) |
|---|---|---|---|---|
| latency_llama70B_tp4 | 8xA100-SXM4-80GB | 4023.16 | 4022.88 | 4025.69 |
| latency_llama8B_tp1 | 8xA100-SXM4-80GB | 1563.77 | 1563.69 | 1564.5 |
| latency_mixtral8x7B_tp2 | 8xA100-SXM4-80GB | 3588.68 | 3591.41 | 3618.25 |

Throughput tests

  • Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
  • Output length: the corresponding output length of each of these 200 prompts.
  • Batch size: dynamically determined by vLLM to achieve maximum throughput.
  • Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
  • Evaluation metrics: throughput.
| Test name | GPU | Tput (req/s) |
|---|---|---|
| throughput_llama70B_tp4 | 8xA100-SXM4-80GB | 5.20635 |
| throughput_llama8B_tp1 | 8xA100-SXM4-80GB | 12.3242 |
| throughput_mixtral8x7B_tp2 | 8xA100-SXM4-80GB | 5.56502 |

Serving tests

  • Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
  • Output length: the corresponding output length of each of these 200 prompts.
  • Batch size: dynamically determined by vLLM and the arrival pattern of the requests.
  • Average QPS (queries per second): 1, 4, 16, and inf. QPS = inf means all requests arrive at once. For the other QPS values, the arrival time of each request is drawn from a Poisson process with a fixed random seed.
  • Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
  • We also added a speculative decoding test for llama-3 70B at QPS 2.
  • Evaluation metrics: throughput, TTFT (time to first token: mean, median, p99), and ITL (inter-token latency: mean, median, p99).
| Test name | GPU | Tput (req/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean ITL (ms) | Median ITL (ms) | P99 ITL (ms) |
|---|---|---|---|---|---|---|---|---|
| serving_llama70B_tp4_sharegpt_qps_1 | 8xA100-SXM4-80GB | 0.950337 | 89.657 | 71.7668 | 213.161 | 34.1761 | 31.7823 | 78.7449 |
| serving_llama70B_tp4_sharegpt_qps_16 | 8xA100-SXM4-80GB | 5.18335 | 70.2521 | 67.5836 | 107.834 | 42.5295 | 43.1636 | 85.2469 |
| serving_llama70B_tp4_sharegpt_qps_4 | 8xA100-SXM4-80GB | 2.91829 | 60.0811 | 58.9892 | 79.4673 | 37.2828 | 37.0384 | 74.5274 |
| serving_llama70B_tp4_sharegpt_qps_inf | 8xA100-SXM4-80GB | 6.31563 | 660.594 | 726.653 | 744.767 | 47.047 | 44.5985 | 88.9333 |
| serving_llama8B_tp1_sharegpt_qps_1 | 8xA100-SXM4-80GB | 0.996789 | 33.8842 | 28.4588 | 75.8084 | 12.3951 | 12.1616 | 24.3049 |
| serving_llama8B_tp1_sharegpt_qps_16 | 8xA100-SXM4-80GB | 9.25824 | 32.8746 | 32.3245 | 46.8777 | 17.0301 | 17.4238 | 23.1376 |
| serving_llama8B_tp1_sharegpt_qps_4 | 8xA100-SXM4-80GB | 3.60743 | 24.1183 | 24.3184 | 31.4861 | 13.0238 | 12.8767 | 15.4021 |
| serving_llama8B_tp1_sharegpt_qps_inf | 8xA100-SXM4-80GB | 14.243 | 506.841 | 513.762 | 573.928 | 20.6969 | 20.2554 | 36.0529 |
| serving_mixtral8x7B_tp2_sharegpt_qps_1 | 8xA100-SXM4-80GB | 0.954064 | 72.8993 | 54.941 | 288.59 | 28.6196 | 30.4361 | 42.9455 |
| serving_mixtral8x7B_tp2_sharegpt_qps_16 | 8xA100-SXM4-80GB | 5.18713 | 59.1705 | 58.6186 | 80.7185 | 34.2135 | 34.4161 | 39.0515 |
| serving_mixtral8x7B_tp2_sharegpt_qps_4 | 8xA100-SXM4-80GB | 2.91228 | 53.662 | 52.9766 | 72.7159 | 32.3259 | 32.094 | 36.247 |
| serving_mixtral8x7B_tp2_sharegpt_qps_inf | 8xA100-SXM4-80GB | 6.22647 | 514.894 | 531.726 | 585.142 | 35.6269 | 35.9529 | 41.2122 |

JSON version of the benchmarking tables

This section contains the data of the markdown tables above in JSON format. You can load the benchmarking tables into pandas DataFrames as follows:

```python
import json
import pandas as pd

# Paste the JSON string from the section below between the triple quotes.
benchmarking_results_json = """The json string"""
benchmarking_results = json.loads(benchmarking_results_json)

# Each top-level key ("latency", "throughput", "serving") holds one table.
latency_results = pd.DataFrame.from_dict(benchmarking_results["latency"])
throughput_results = pd.DataFrame.from_dict(benchmarking_results["throughput"])
serving_results = pd.DataFrame.from_dict(benchmarking_results["serving"])
```

The JSON string for all benchmarking tables:

{"latency": {"Test name": {"0": "latency_llama8B_tp1", "1": "latency_llama70B_tp4", "2": "latency_mixtral8x7B_tp2"}, "GPU": {"0": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "1": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "2": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB"}, "Mean latency (ms)": {"0": 1563.7737552945812, "1": 4023.1602351491647, "2": 3588.676120651265}, "Median latency (ms)": {"0": 1563.6915154755116, "1": 4022.8782389312983, "2": 3591.4132557809353}, "P99 latency (ms)": {"0": 1564.5017629861832, "1": 4025.694858431816, "2": 3618.2455145195127}}, "throughput": {"Test name": {"0": "throughput_llama8B_tp1", "1": "throughput_llama70B_tp4", "2": "throughput_mixtral8x7B_tp2"}, "GPU": {"0": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "1": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "2": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB"}, "Tput (req/s)": {"0": 12.324186389245893, "1": 5.206345478318473, "2": 5.5650183932356265}}, "serving": {"Test name": {"0": "serving_llama8B_tp1_sharegpt_qps_1", "1": "serving_llama8B_tp1_sharegpt_qps_4", "2": "serving_llama8B_tp1_sharegpt_qps_16", "3": "serving_llama8B_tp1_sharegpt_qps_inf", "4": "serving_llama70B_tp4_sharegpt_qps_1", "5": "serving_llama70B_tp4_sharegpt_qps_4", "6": "serving_llama70B_tp4_sharegpt_qps_16", "7": "serving_llama70B_tp4_sharegpt_qps_inf", "8": "serving_mixtral8x7B_tp2_sharegpt_qps_1", "9": "serving_mixtral8x7B_tp2_sharegpt_qps_4", "10": "serving_mixtral8x7B_tp2_sharegpt_qps_16", "11": "serving_mixtral8x7B_tp2_sharegpt_qps_inf"}, "GPU": {"0": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "1": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "2": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "3": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "4": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "5": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "6": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "7": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "8": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "9": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "10": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "11": 
"A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB"}, "Tput (req/s)": {"0": 0.9967892714078663, "1": 3.607425078737425, "2": 9.25824243335634, "3": 14.24299102466845, "4": 0.9503372682041944, "5": 2.9182876089870216, "6": 5.183348193259361, "7": 6.315625454299485, "8": 0.9540639615561795, "9": 2.912280466248237, "10": 5.187129037993653, "11": 6.226466409528192}, "Mean TTFT (ms)": {"0": 33.884167866781354, "1": 24.11834311671555, "2": 32.8746001701802, "3": 506.84092393144965, "4": 89.6570359962061, "5": 60.08108027745038, "6": 70.25212858337909, "7": 660.5936508392915, "8": 72.89926828816533, "9": 53.662010272964835, "10": 59.170533269643784, "11": 514.8943650210276}, "Median TTFT (ms)": {"0": 28.458827175199986, "1": 24.318381678313017, "2": 32.324529718607664, "3": 513.7622435577214, "4": 71.76684122532606, "5": 58.98921377956867, "6": 67.58361915126443, "7": 726.6527987085283, "8": 54.940960835665464, "9": 52.976563572883606, "10": 58.61864611506462, "11": 531.7255398258567}, "P99 TTFT (ms)": {"0": 75.80841735005379, "1": 31.486078994348645, "2": 46.87770428135988, "3": 573.9277134649456, "4": 213.16094738431252, "5": 79.46726609021425, "6": 107.83371409401296, "7": 744.7674644552171, "8": 288.5898773930859, "9": 72.7159012760967, "10": 80.7185451872645, "11": 585.14214402996}, "Mean ITL (ms)": {"0": 12.395056888603717, "1": 13.023832919228793, "2": 17.030094147154763, "3": 20.69686610245767, "4": 34.176110649384405, "5": 37.2827830768884, "6": 42.529486680323735, "7": 47.046975418218196, "8": 28.61957275412666, "9": 32.32587099025326, "10": 34.21349674631134, "11": 35.62688467661398}, "Median ITL (ms)": {"0": 12.16161623597145, "1": 12.876669876277447, "2": 17.423816956579685, "3": 20.2554352581501, "4": 31.782250851392746, "5": 37.038447335362434, "6": 43.16364694386721, "7": 44.59854308515787, "8": 30.436135828495026, "9": 32.09400922060013, "10": 34.416137263178825, "11": 35.95291264355183}, "P99 ITL (ms)": {"0": 24.30490868166089, "1": 15.40208488702774, "2": 23.1375889480114, "3": 36.05294151231648, "4": 78.74485917389393, "5": 74.52739045023918, "6": 85.24688880890606, "7": 88.93326073884967, "8": 42.94547829777049, "9": 36.246955022215865, "10": 39.05145451426507, "11": 41.21215883642435}}}

You can also check the raw experiment data in the Artifact tab of the Buildkite page.
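
Since the H100, H200, and A100-SXM4-80GB runs above share the same table schema, their serving results can be compared side by side by loading each section's JSON string and concatenating the frames. A rough sketch, assuming the three JSON strings have been saved locally as h100.json, h200.json, and a100.json (the filenames are ours):

```python
import json

import pandas as pd

frames = []
for label, path in [("H100", "h100.json"), ("H200", "h200.json"), ("A100", "a100.json")]:
    with open(path) as f:
        results = json.load(f)
    df = pd.DataFrame.from_dict(results["serving"])
    df["GPU"] = label                  # replace the newline-joined column with a clean label
    frames.append(df)

serving = pd.concat(frames, ignore_index=True)
# One row per serving test, one throughput column per GPU type.
print(serving.pivot(index="Test name", columns="GPU", values="Tput (req/s)"))
```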
