
Latency tests
- Input length: 32 tokens.
- Output length: 128 tokens.
- Batch size: fixed (8).
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- Evaluation metrics: end-to-end latency (mean, median, p99).
Test name | GPU | Mean latency (ms) | Median latency (ms) | P99 latency (ms) |
---|---|---|---|---|
latency_llama70B_tp4 | 8xH100 | 2540.75 | 2540.22 | 2544.89 |
latency_llama8B_tp1 | 8xH100 | 1046.91 | 1046.82 | 1048.3 |
latency_mixtral8x7B_tp2 | 8xH100 | 2316.74 | 2320.11 | 2334.17 |
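For reference, the mean/median/p99 values in this table are ordinary statistics over the per-iteration end-to-end latencies. A minimal sketch of that aggregation, assuming the raw per-iteration latencies are available as a list of milliseconds (the helper below is illustrative, not part of the benchmark scripts):

```python
import numpy as np

def summarize_latency(latencies_ms):
    """Mean/median/p99 over per-iteration end-to-end latencies (ms)."""
    samples = np.asarray(latencies_ms, dtype=float)
    return {
        "Mean latency (ms)": float(samples.mean()),
        "Median latency (ms)": float(np.median(samples)),
        "P99 latency (ms)": float(np.percentile(samples, 99)),
    }

# Hypothetical per-iteration measurements, for illustration only.
print(summarize_latency([1046.5, 1046.8, 1047.1, 1048.3]))
```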
Throughput tests
- Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
- Output length: the corresponding output length of each of these 200 prompts.
- Batch size: dynamically determined by vLLM to achieve maximum throughput.
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- Evaluation metrics: throughput.
Test name | GPU | Tput (req/s) |
---|---|---|
throughput_llama70B_tp4 | 8xH100 | 9.13468 |
throughput_llama8B_tp1 | 8xH100 | 21.6201 |
throughput_mixtral8x7B_tp2 | 8xH100 | 9.0138 |
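Request throughput in this table is simply the number of completed requests divided by the wall-clock duration of the run. A minimal sketch of that calculation, with an illustrative `run_batch` callable standing in for the actual engine:

```python
import time

def measure_request_throughput(run_batch, requests):
    """Completed requests per second for one offline run (illustrative helper)."""
    start = time.perf_counter()
    run_batch(requests)          # process the whole request list in one run
    elapsed = time.perf_counter() - start
    return len(requests) / elapsed

# Dummy workload standing in for the engine, for illustration only.
print(measure_request_throughput(lambda reqs: time.sleep(0.01), range(200)))
```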
Serving tests
- Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
- Output length: the corresponding output length of each of these 200 prompts.
- Batch size: dynamically determined by vLLM and the arrival pattern of the requests.
- Average QPS (queries per second): 1, 4, 16, and inf. QPS = inf means all requests arrive at once; for the other QPS values, the arrival time of each query is drawn from a Poisson process with a fixed random seed (see the sketch after this list).
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- We also added a speculative decoding test for llama-3 70B under QPS 2.
- Evaluation metrics: throughput, TTFT (time to first token, with mean, median, and p99), ITL (inter-token latency, with mean, median, and p99).
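A minimal sketch of the Poisson arrival schedule described above: inter-arrival gaps at an average rate of `qps` requests per second are exponentially distributed with mean `1/qps`, and a fixed seed makes the schedule reproducible (the helper name is illustrative, not the benchmark script's API):

```python
import numpy as np

def poisson_arrival_times(num_requests, qps, seed=0):
    """Arrival timestamps (seconds) for requests sent at an average rate of `qps`."""
    if qps == float("inf"):
        # QPS = inf: every request arrives at time zero.
        return np.zeros(num_requests)
    rng = np.random.default_rng(seed)  # fixed seed => reproducible schedule
    gaps = rng.exponential(scale=1.0 / qps, size=num_requests)
    return np.cumsum(gaps)

# 5 requests at an average of 4 QPS, for illustration only.
print(poisson_arrival_times(5, qps=4))
```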
Test name | GPU | Tput (req/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean ITL (ms) | Median ITL (ms) | P99 ITL (ms) |
---|---|---|---|---|---|---|---|---|
serving_llama70B_tp4_sharegpt_qps_1 | 8xH100 | 0.982905 | 46.9257 | 41.3784 | 96.5447 | 20.4458 | 20.0713 | 39.5204 |
serving_llama70B_tp4_sharegpt_qps_16 | 8xH100 | 7.34964 | 42.9625 | 42.2113 | 66.2351 | 24.3085 | 23.918 | 39.6371 |
serving_llama70B_tp4_sharegpt_qps_4 | 8xH100 | 3.31252 | 38.0708 | 38.1734 | 56.8092 | 21.873 | 21.8651 | 25.0219 |
serving_llama70B_tp4_sharegpt_qps_inf | 8xH100 | 10.6181 | 310.843 | 300.908 | 375.259 | 26.2041 | 25.6431 | 47.8632 |
serving_llama8B_tp1_sharegpt_qps_1 | 8xH100 | 1.00455 | 18.3569 | 17.0653 | 35.8438 | 8.08728 | 8.07435 | 8.54922 |
serving_llama8B_tp1_sharegpt_qps_16 | 8xH100 | 11.5228 | 19.5259 | 19.811 | 25.4446 | 9.40074 | 9.38206 | 10.676 |
serving_llama8B_tp1_sharegpt_qps_4 | 8xH100 | 3.78847 | 16.1596 | 15.9192 | 22.108 | 8.41121 | 8.39612 | 8.99637 |
serving_llama8B_tp1_sharegpt_qps_inf | 8xH100 | 24.1869 | 240.664 | 238.222 | 306.25 | 11.5137 | 11.4632 | 21.9811 |
serving_mixtral8x7B_tp2_sharegpt_qps_1 | 8xH100 | 0.985639 | 121.694 | 34.9095 | 2259.81 | 16.9988 | 17.1107 | 24.54 |
serving_mixtral8x7B_tp2_sharegpt_qps_16 | 8xH100 | 7.14885 | 39.6672 | 38.959 | 80.9585 | 21.6034 | 21.3628 | 25.5175 |
serving_mixtral8x7B_tp2_sharegpt_qps_4 | 8xH100 | 3.29721 | 34.4165 | 34.6872 | 45.9823 | 19.9132 | 20.0408 | 22.6441 |
serving_mixtral8x7B_tp2_sharegpt_qps_inf | 8xH100 | 9.59572 | 282.61 | 280.799 | 323.141 | 22.6624 | 22.3845 | 28.8505 |
JSON version of the benchmarking tables
This section contains the data from the markdown tables above in JSON format. You can load the benchmarking tables into pandas DataFrames as follows:
import json
import pandas as pd
benchmarking_results_json = """The json string"""
benchmarking_results = json.loads(benchmarking_results_json)
latency_results = pd.DataFrame.from_dict(benchmarking_results["latency"])
throughput_results = pd.DataFrame.from_dict(benchmarking_results["throughput"])
serving_results = pd.DataFrame.from_dict(benchmarking_results["serving"])
The JSON string for all benchmarking tables:
{"latency": {"Test name": {"0": "latency_mixtral8x7B_tp2", "1": "latency_llama70B_tp4", "2": "latency_llama8B_tp1"}, "GPU": {"0": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "1": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "2": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100"}, "Mean latency (ms)": {"0": 2316.744400312503, "1": 2540.7543117801347, "2": 1046.9095210234325}, "Median latency (ms)": {"0": 2320.1082749292254, "1": 2540.222811512649, "2": 1046.8183774501085}, "P99 latency (ms)": {"0": 2334.1726494580507, "1": 2544.8885366879404, "2": 1048.303036596626}}, "throughput": {"Test name": {"0": "throughput_llama8B_tp1", "1": "throughput_llama70B_tp4", "2": "throughput_mixtral8x7B_tp2"}, "GPU": {"0": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "1": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "2": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100"}, "Tput (req/s)": {"0": 21.620062820024827, "1": 9.134678104490586, "2": 9.013797973436434}}, "serving": {"Test name": {"0": "serving_llama8B_tp1_sharegpt_qps_inf", "1": "serving_mixtral8x7B_tp2_sharegpt_qps_4", "2": "serving_llama70B_tp4_sharegpt_qps_inf", "3": "serving_llama70B_tp4_sharegpt_qps_16", "4": "serving_mixtral8x7B_tp2_sharegpt_qps_16", "5": "serving_llama70B_tp4_sharegpt_qps_1", "6": "serving_mixtral8x7B_tp2_sharegpt_qps_inf", "7": "serving_llama8B_tp1_sharegpt_qps_16", "8": "serving_mixtral8x7B_tp2_sharegpt_qps_1", "9": "serving_llama70B_tp4_sharegpt_qps_4", "10": "serving_llama8B_tp1_sharegpt_qps_4", "11": "serving_llama8B_tp1_sharegpt_qps_1"}, "GPU": {"0": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "1": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "2": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "3": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "4": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "5": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "6": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "7": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "8": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "9": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "10": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "11": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100"}, "Tput (req/s)": {"0": 24.186907198204548, "1": 3.297212254852508, "2": 10.618124942561348, "3": 7.349636420935427, "4": 7.148851923019628, "5": 0.9829047178770594, "6": 9.595717904387822, "7": 11.522796431373235, "8": 0.9856386319035464, "9": 3.3125227896284843, "10": 3.7884722544447005, "11": 1.0045473046889262}, "Mean TTFT (ms)": {"0": 240.66389142535627, "1": 34.4164644414559, "2": 310.8427061699331, "3": 42.96253070700914, "4": 39.66719148680568, "5": 46.92570028826594, "6": 282.6098405988887, "7": 19.525949363596737, "8": 121.69368740171194, "9": 38.0708377296105, "10": 16.159570794552565, "11": 18.35688839200884}, "Median TTFT (ms)": {"0": 238.22169471532106, "1": 34.6872229129076, "2": 300.9081673808396, "3": 42.2112587839365, "4": 38.95895183086395, "5": 41.378372348845005, "6": 280.7990796864033, "7": 19.811025820672512, "8": 34.90947466343641, "9": 38.17338775843382, "10": 15.919176395982504, "11": 17.065261490643024}, "P99 TTFT (ms)": {"0": 306.2498870212585, "1": 45.982348108664155, "2": 375.258861342445, "3": 66.23510554432866, "4": 80.95852104946954, "5": 96.54471790418025, "6": 323.14059362746775, "7": 25.444589080289003, "8": 2259.80990773998, "9": 56.80923787876961, "10": 22.108013881370425, "11": 35.843759244307876}, "Mean ITL (ms)": {"0": 11.513655710447507, "1": 19.91316655936158, "2": 
26.204122954305266, "3": 24.30854885444774, "4": 21.60340456219261, "5": 20.445824146639126, "6": 22.66244359644529, "7": 9.400744091195646, "8": 16.998846889892548, "9": 21.872995575343754, "10": 8.411214496113805, "11": 8.087283300139706}, "Median ITL (ms)": {"0": 11.463219299912453, "1": 20.0408436357975, "2": 25.643108412623405, "3": 23.917971178889275, "4": 21.362835075706244, "5": 20.071309991180897, "6": 22.384504787623882, "7": 9.38205886632204, "8": 17.11069280281663, "9": 21.86509408056736, "10": 8.396124467253685, "11": 8.074347861111164}, "P99 ITL (ms)": {"0": 21.981134973466403, "1": 22.64406639151275, "2": 47.86318650469183, "3": 39.63710363954306, "4": 25.517483763395827, "5": 39.52038638293743, "6": 28.85052182711661, "7": 10.675965584814554, "8": 24.539980804547664, "9": 25.0218597240746, "10": 8.996365405619144, "11": 8.549217134714127}}}
You can also check the raw experiment data in the Artifact tab of the Buildkite page.
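Once loaded, each DataFrame can be rendered back into a markdown table like the ones above; a small sketch (uses `DataFrame.to_markdown`, which requires the `tabulate` package):

```python
# Assumes `latency_results` was created by the loading snippet above.
print(latency_results.to_markdown(index=False))
```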
Latency tests
- Input length: 32 tokens.
- Output length: 128 tokens.
- Batch size: fixed (8).
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- Evaluation metrics: end-to-end latency (mean, median, p99).
Test name | GPU | Mean latency (ms) | Median latency (ms) | P99 latency (ms) |
---|---|---|---|---|
latency_llama70B_tp4 | 8xH200 | 2118.58 | 2119.02 | 2121.19 |
latency_llama8B_tp1 | 8xH200 | 833.272 | 833.759 | 834.682 |
latency_mixtral8x7B_tp2 | 8xH200 | 1900.82 | 1902.82 | 1910.8 |
Throughput tests
- Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
- Output length: the corresponding output length of each of these 200 prompts.
- Batch size: dynamically determined by vLLM to achieve maximum throughput.
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- Evaluation metrics: throughput.
Test name | GPU | Tput (req/s) |
---|---|---|
throughput_llama70B_tp4 | 8xH200 | 10.6882 |
throughput_llama8B_tp1 | 8xH200 | 25.6716 |
throughput_mixtral8x7B_tp2 | 8xH200 | 8.64563 |
Serving tests
- Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
- Output length: the corresponding output length of each of these 200 prompts.
- Batch size: dynamically determined by vLLM and the arrival pattern of the requests.
- Average QPS (queries per second): 1, 4, 16, and inf. QPS = inf means all requests arrive at once; for the other QPS values, the arrival time of each query is drawn from a Poisson process with a fixed random seed.
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- We also added a speculative decoding test for llama-3 70B under QPS 2.
- Evaluation metrics: throughput, TTFT (time to first token, with mean, median, and p99), ITL (inter-token latency, with mean, median, and p99).
Test name | GPU | Tput (req/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean ITL (ms) | Median ITL (ms) | P99 ITL (ms) |
---|---|---|---|---|---|---|---|---|
serving_llama70B_tp4_sharegpt_qps_1 | 8xH200 | 0.989159 | 43.4688 | 35.7182 | 96.5888 | 16.9733 | 16.6291 | 32.941 |
serving_llama70B_tp4_sharegpt_qps_16 | 8xH200 | 8.22273 | 38.0496 | 36.9882 | 58.629 | 20.1791 | 19.7701 | 26.677 |
serving_llama70B_tp4_sharegpt_qps_4 | 8xH200 | 3.4382 | 33.2551 | 33.0601 | 45.445 | 18.0005 | 17.72 | 20.5296 |
serving_llama70B_tp4_sharegpt_qps_inf | 8xH200 | 12.4872 | 322.868 | 315.396 | 377.538 | 22.552 | 21.9297 | 39.7766 |
serving_llama8B_tp1_sharegpt_qps_1 | 8xH200 | 1.00729 | 17.6736 | 15.7798 | 59.9132 | 6.45663 | 6.44447 | 6.86137 |
serving_llama8B_tp1_sharegpt_qps_16 | 8xH200 | 12.2952 | 16.7868 | 16.6197 | 21.9581 | 7.45783 | 7.40181 | 8.80273 |
serving_llama8B_tp1_sharegpt_qps_4 | 8xH200 | 3.86122 | 14.9627 | 14.6392 | 20.101 | 6.71829 | 6.71521 | 7.16092 |
serving_llama8B_tp1_sharegpt_qps_inf | 8xH200 | 28.8288 | 247.482 | 247.675 | 301.188 | 9.81606 | 9.72876 | 20.4845 |
serving_mixtral8x7B_tp2_sharegpt_qps_1 | 8xH200 | 0.994479 | 44.1851 | 34.384 | 237.285 | 12.9853 | 13.4167 | 22.7421 |
serving_mixtral8x7B_tp2_sharegpt_qps_16 | 8xH200 | 7.19619 | 41.0541 | 40.8124 | 73.2873 | 21.9064 | 22.0236 | 24.7899 |
serving_mixtral8x7B_tp2_sharegpt_qps_4 | 8xH200 | 3.3036 | 36.2937 | 36.6095 | 47.2847 | 20.2467 | 20.5157 | 21.6394 |
serving_mixtral8x7B_tp2_sharegpt_qps_inf | 8xH200 | 9.63472 | 281.247 | 269.375 | 325.874 | 23.3612 | 23.1974 | 30.8719 |
JSON version of the benchmarking tables
This section contains the data from the markdown tables above in JSON format. You can load the benchmarking tables into pandas DataFrames as follows:
import json
import pandas as pd
benchmarking_results_json = """The json string"""
benchmarking_results = json.loads(benchmarking_results_json)
latency_results = pd.DataFrame.from_dict(benchmarking_results["latency"])
throughput_results = pd.DataFrame.from_dict(benchmarking_results["throughput"])
serving_results = pd.DataFrame.from_dict(benchmarking_results["serving"])
The JSON string for all benchmarking tables:
{"latency": {"Test name": {"0": "latency_llama8B_tp1", "1": "latency_llama70B_tp4", "2": "latency_mixtral8x7B_tp2"}, "GPU": {"0": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200", "1": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200", "2": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200"}, "Mean latency (ms)": {"0": 833.271651606386, "1": 2118.5753696598113, "2": 1900.8225788672767}, "Median latency (ms)": {"0": 833.7587309069932, "1": 2119.018518831581, "2": 1902.8190749231726}, "P99 latency (ms)": {"0": 834.6816526539624, "1": 2121.194312358275, "2": 1910.802426696755}}, "throughput": {"Test name": {"0": "throughput_llama8B_tp1", "1": "throughput_mixtral8x7B_tp2", "2": "throughput_llama70B_tp4"}, "GPU": {"0": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200", "1": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200", "2": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200"}, "Tput (req/s)": {"0": 25.67157416243356, "1": 8.645627080238913, "2": 10.688174432771335}}, "serving": {"Test name": {"0": "serving_llama70B_tp4_sharegpt_qps_1", "1": "serving_mixtral8x7B_tp2_sharegpt_qps_inf", "2": "serving_llama70B_tp4_sharegpt_qps_16", "3": "serving_llama70B_tp4_sharegpt_qps_inf", "4": "serving_mixtral8x7B_tp2_sharegpt_qps_1", "5": "serving_llama8B_tp1_sharegpt_qps_1", "6": "serving_llama8B_tp1_sharegpt_qps_4", "7": "serving_llama8B_tp1_sharegpt_qps_16", "8": "serving_llama8B_tp1_sharegpt_qps_inf", "9": "serving_llama70B_tp4_sharegpt_qps_4", "10": "serving_mixtral8x7B_tp2_sharegpt_qps_16", "11": "serving_mixtral8x7B_tp2_sharegpt_qps_4"}, "GPU": {"0": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200", "1": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200", "2": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200", "3": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200", "4": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200", "5": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200", "6": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200", "7": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200", "8": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200", "9": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200", "10": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200", "11": "H200\nH200\nH200\nH200\nH200\nH200\nH200\nH200"}, "Tput (req/s)": {"0": 0.9891587219409574, "1": 9.634719099896412, "2": 8.222729387606314, "3": 12.487191082808714, "4": 0.9944789014061436, "5": 1.007293800387217, "6": 3.861221480628719, "7": 12.2952309368178, "8": 28.828767850421375, "9": 3.4382049852060788, "10": 7.196187099624867, "11": 3.303602432149024}, "Mean TTFT (ms)": {"0": 43.468809713376686, "1": 281.24746197019704, "2": 38.04963176022284, "3": 322.86752470070496, "4": 44.18505497975275, "5": 17.67359694465995, "6": 14.962725100340322, "7": 16.786847764160484, "8": 247.48155995155685, "9": 33.25512474984862, "10": 41.05411588679999, "11": 36.29365864209831}, "Median TTFT (ms)": {"0": 35.718209808692336, "1": 269.37501097563654, "2": 36.988212494179606, "3": 315.39560307282954, "4": 34.38399650622159, "5": 15.779806533828378, "6": 14.639193541370332, "7": 16.61967358086258, "8": 247.6753635564819, "9": 33.06009899824858, "10": 40.81238398794085, "11": 36.60947049502283}, "P99 TTFT (ms)": {"0": 96.58877548761551, "1": 325.87380204582587, "2": 58.62896100850774, "3": 377.5383864669129, "4": 237.28483830811228, "5": 59.91316309198732, "6": 20.100962540600257, "7": 21.958127913530916, "8": 301.18788017425686, "9": 45.44501084135842, "10": 73.28729778761043, "11": 47.28467447916046}, "Mean ITL (ms)": {"0": 16.973273727020054, "1": 23.361190750207047, "2": 
20.179083884169412, "3": 22.552021951497785, "4": 12.985279223130725, "5": 6.4566331521403795, "6": 6.718289552257886, "7": 7.457826713022044, "8": 9.816059000520152, "9": 18.000531091515306, "10": 21.9063759349119, "11": 20.246724950082804}, "Median ITL (ms)": {"0": 16.629134071990848, "1": 23.19735533092171, "2": 19.770101876929402, "3": 21.92965685389936, "4": 13.41667806264013, "5": 6.444473983719945, "6": 6.715206196531653, "7": 7.4018139857798815, "8": 9.728759061545134, "9": 17.719964031130075, "10": 22.023566532880068, "11": 20.515665062703192}, "P99 ITL (ms)": {"0": 32.94095506425944, "1": 30.871940082870413, "2": 26.676964415237308, "3": 39.776585944928215, "4": 22.742118535097646, "5": 6.861367318779231, "6": 7.160921916365624, "7": 8.802727032452824, "8": 20.48447238281375, "9": 20.529647427611053, "10": 24.78994420496747, "11": 21.639364948496222}}}
You can also check the raw experiment data in the Artifact tab of the Buildkite page.
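Since the H100 and H200 reports above share the same test names, their serving tables can be joined for a direct comparison. A minimal sketch, assuming the two `serving_results` DataFrames produced by the loading snippets above were kept in separate variables (the names below are illustrative):

```python
import pandas as pd

def compare_serving_tput(serving_h100: pd.DataFrame, serving_h200: pd.DataFrame) -> pd.DataFrame:
    """Per-test H200/H100 throughput ratio from the two serving tables."""
    merged = serving_h100.merge(serving_h200, on="Test name", suffixes=("_h100", "_h200"))
    merged["tput_speedup"] = merged["Tput (req/s)_h200"] / merged["Tput (req/s)_h100"]
    return merged[["Test name", "Tput (req/s)_h100", "Tput (req/s)_h200", "tput_speedup"]]
```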
Latency tests
- Input length: 32 tokens.
- Output length: 128 tokens.
- Batch size: fixed (8).
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- Evaluation metrics: end-to-end latency (mean, median, p99).
Test name | GPU | Mean latency (ms) | Median latency (ms) | P99 latency (ms) |
---|---|---|---|---|
latency_llama70B_tp4 | 8xA100-SXM4-80GB | 4023.16 | 4022.88 | 4025.69 |
latency_llama8B_tp1 | 8xA100-SXM4-80GB | 1563.77 | 1563.69 | 1564.5 |
latency_mixtral8x7B_tp2 | 8xA100-SXM4-80GB | 3588.68 | 3591.41 | 3618.25 |
Throughput tests
- Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
- Output length: the corresponding output length of each of these 200 prompts.
- Batch size: dynamically determined by vLLM to achieve maximum throughput.
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- Evaluation metrics: throughput.
Test name | GPU | Tput (req/s) |
---|---|---|
throughput_llama70B_tp4 | 8xA100-SXM4-80GB | 5.20635 |
throughput_llama8B_tp1 | 8xA100-SXM4-80GB | 12.3242 |
throughput_mixtral8x7B_tp2 | 8xA100-SXM4-80GB | 5.56502 |
Serving tests
- Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
- Output length: the corresponding output length of each of these 200 prompts.
- Batch size: dynamically determined by vLLM and the arrival pattern of the requests.
- Average QPS (queries per second): 1, 4, 16, and inf. QPS = inf means all requests arrive at once; for the other QPS values, the arrival time of each query is drawn from a Poisson process with a fixed random seed.
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- We also added a speculative decoding test for llama-3 70B under QPS 2.
- Evaluation metrics: throughput, TTFT (time to first token, with mean, median, and p99), ITL (inter-token latency, with mean, median, and p99).
Test name | GPU | Tput (req/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean ITL (ms) | Median ITL (ms) | P99 ITL (ms) |
---|---|---|---|---|---|---|---|---|
serving_llama70B_tp4_sharegpt_qps_1 | 8xA100-SXM4-80GB | 0.950337 | 89.657 | 71.7668 | 213.161 | 34.1761 | 31.7823 | 78.7449 |
serving_llama70B_tp4_sharegpt_qps_16 | 8xA100-SXM4-80GB | 5.18335 | 70.2521 | 67.5836 | 107.834 | 42.5295 | 43.1636 | 85.2469 |
serving_llama70B_tp4_sharegpt_qps_4 | 8xA100-SXM4-80GB | 2.91829 | 60.0811 | 58.9892 | 79.4673 | 37.2828 | 37.0384 | 74.5274 |
serving_llama70B_tp4_sharegpt_qps_inf | 8xA100-SXM4-80GB | 6.31563 | 660.594 | 726.653 | 744.767 | 47.047 | 44.5985 | 88.9333 |
serving_llama8B_tp1_sharegpt_qps_1 | 8xA100-SXM4-80GB | 0.996789 | 33.8842 | 28.4588 | 75.8084 | 12.3951 | 12.1616 | 24.3049 |
serving_llama8B_tp1_sharegpt_qps_16 | 8xA100-SXM4-80GB | 9.25824 | 32.8746 | 32.3245 | 46.8777 | 17.0301 | 17.4238 | 23.1376 |
serving_llama8B_tp1_sharegpt_qps_4 | 8xA100-SXM4-80GB | 3.60743 | 24.1183 | 24.3184 | 31.4861 | 13.0238 | 12.8767 | 15.4021 |
serving_llama8B_tp1_sharegpt_qps_inf | 8xA100-SXM4-80GB | 14.243 | 506.841 | 513.762 | 573.928 | 20.6969 | 20.2554 | 36.0529 |
serving_mixtral8x7B_tp2_sharegpt_qps_1 | 8xA100-SXM4-80GB | 0.954064 | 72.8993 | 54.941 | 288.59 | 28.6196 | 30.4361 | 42.9455 |
serving_mixtral8x7B_tp2_sharegpt_qps_16 | 8xA100-SXM4-80GB | 5.18713 | 59.1705 | 58.6186 | 80.7185 | 34.2135 | 34.4161 | 39.0515 |
serving_mixtral8x7B_tp2_sharegpt_qps_4 | 8xA100-SXM4-80GB | 2.91228 | 53.662 | 52.9766 | 72.7159 | 32.3259 | 32.094 | 36.247 |
serving_mixtral8x7B_tp2_sharegpt_qps_inf | 8xA100-SXM4-80GB | 6.22647 | 514.894 | 531.726 | 585.142 | 35.6269 | 35.9529 | 41.2122 |
JSON version of the benchmarking tables
This section contains the data from the markdown tables above in JSON format. You can load the benchmarking tables into pandas DataFrames as follows:
import json
import pandas as pd
benchmarking_results_json = """The json string"""
benchmarking_results = json.loads(benchmarking_results_json)
latency_results = pd.DataFrame.from_dict(benchmarking_results["latency"])
throughput_results = pd.DataFrame.from_dict(benchmarking_results["throughput"])
serving_results = pd.DataFrame.from_dict(benchmarking_results["serving"])
The JSON string for all benchmarking tables:
{"latency": {"Test name": {"0": "latency_llama8B_tp1", "1": "latency_llama70B_tp4", "2": "latency_mixtral8x7B_tp2"}, "GPU": {"0": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "1": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "2": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB"}, "Mean latency (ms)": {"0": 1563.7737552945812, "1": 4023.1602351491647, "2": 3588.676120651265}, "Median latency (ms)": {"0": 1563.6915154755116, "1": 4022.8782389312983, "2": 3591.4132557809353}, "P99 latency (ms)": {"0": 1564.5017629861832, "1": 4025.694858431816, "2": 3618.2455145195127}}, "throughput": {"Test name": {"0": "throughput_llama8B_tp1", "1": "throughput_llama70B_tp4", "2": "throughput_mixtral8x7B_tp2"}, "GPU": {"0": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "1": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "2": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB"}, "Tput (req/s)": {"0": 12.324186389245893, "1": 5.206345478318473, "2": 5.5650183932356265}}, "serving": {"Test name": {"0": "serving_llama8B_tp1_sharegpt_qps_1", "1": "serving_llama8B_tp1_sharegpt_qps_4", "2": "serving_llama8B_tp1_sharegpt_qps_16", "3": "serving_llama8B_tp1_sharegpt_qps_inf", "4": "serving_llama70B_tp4_sharegpt_qps_1", "5": "serving_llama70B_tp4_sharegpt_qps_4", "6": "serving_llama70B_tp4_sharegpt_qps_16", "7": "serving_llama70B_tp4_sharegpt_qps_inf", "8": "serving_mixtral8x7B_tp2_sharegpt_qps_1", "9": "serving_mixtral8x7B_tp2_sharegpt_qps_4", "10": "serving_mixtral8x7B_tp2_sharegpt_qps_16", "11": "serving_mixtral8x7B_tp2_sharegpt_qps_inf"}, "GPU": {"0": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "1": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "2": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "3": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "4": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "5": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "6": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "7": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "8": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "9": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "10": "A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB", "11": 
"A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB"}, "Tput (req/s)": {"0": 0.9967892714078663, "1": 3.607425078737425, "2": 9.25824243335634, "3": 14.24299102466845, "4": 0.9503372682041944, "5": 2.9182876089870216, "6": 5.183348193259361, "7": 6.315625454299485, "8": 0.9540639615561795, "9": 2.912280466248237, "10": 5.187129037993653, "11": 6.226466409528192}, "Mean TTFT (ms)": {"0": 33.884167866781354, "1": 24.11834311671555, "2": 32.8746001701802, "3": 506.84092393144965, "4": 89.6570359962061, "5": 60.08108027745038, "6": 70.25212858337909, "7": 660.5936508392915, "8": 72.89926828816533, "9": 53.662010272964835, "10": 59.170533269643784, "11": 514.8943650210276}, "Median TTFT (ms)": {"0": 28.458827175199986, "1": 24.318381678313017, "2": 32.324529718607664, "3": 513.7622435577214, "4": 71.76684122532606, "5": 58.98921377956867, "6": 67.58361915126443, "7": 726.6527987085283, "8": 54.940960835665464, "9": 52.976563572883606, "10": 58.61864611506462, "11": 531.7255398258567}, "P99 TTFT (ms)": {"0": 75.80841735005379, "1": 31.486078994348645, "2": 46.87770428135988, "3": 573.9277134649456, "4": 213.16094738431252, "5": 79.46726609021425, "6": 107.83371409401296, "7": 744.7674644552171, "8": 288.5898773930859, "9": 72.7159012760967, "10": 80.7185451872645, "11": 585.14214402996}, "Mean ITL (ms)": {"0": 12.395056888603717, "1": 13.023832919228793, "2": 17.030094147154763, "3": 20.69686610245767, "4": 34.176110649384405, "5": 37.2827830768884, "6": 42.529486680323735, "7": 47.046975418218196, "8": 28.61957275412666, "9": 32.32587099025326, "10": 34.21349674631134, "11": 35.62688467661398}, "Median ITL (ms)": {"0": 12.16161623597145, "1": 12.876669876277447, "2": 17.423816956579685, "3": 20.2554352581501, "4": 31.782250851392746, "5": 37.038447335362434, "6": 43.16364694386721, "7": 44.59854308515787, "8": 30.436135828495026, "9": 32.09400922060013, "10": 34.416137263178825, "11": 35.95291264355183}, "P99 ITL (ms)": {"0": 24.30490868166089, "1": 15.40208488702774, "2": 23.1375889480114, "3": 36.05294151231648, "4": 78.74485917389393, "5": 74.52739045023918, "6": 85.24688880890606, "7": 88.93326073884967, "8": 42.94547829777049, "9": 36.246955022215865, "10": 39.05145451426507, "11": 41.21215883642435}}}
You can also check the raw experiment data in the Artifact tab of the Buildkite page.
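As a final example, the serving table loaded from the JSON above can be ranked by tail latency to spot the worst-case configurations (assumes `serving_results` was created by the loading snippet above):

```python
# Rank the A100 serving tests by P99 TTFT, worst first.
worst_ttft = serving_results.sort_values("P99 TTFT (ms)", ascending=False)
print(worst_ttft[["Test name", "Tput (req/s)", "P99 TTFT (ms)"]].head())
```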