## Latency tests
- Input length: 32 tokens.
- Output length: 128 tokens.
- Batch size: fixed (8).
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- Evaluation metrics: end-to-end latency (mean, median, p99).
Test name | GPU | Mean latency (ms) | Median latency (ms) | P99 latency (ms) |
---|---|---|---|---|
latency_llama70B_tp4 | 8xH100 | 2444.47 | 2444.3 | 2450.91 |
latency_llama8B_tp1 | 8xH100 | 997.542 | 997.365 | 999.409 |
latency_mixtral8x7B_tp2 | 8xH100 | 2326.97 | 2330.57 | 2350.54 |
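For reference, a comparable end-to-end latency measurement can be scripted against vLLM's offline `LLM` API. This is only a minimal sketch of the setup described above, not the benchmark harness itself; the model name, repeated-token prompts, and iteration count are illustrative assumptions.

```python
import statistics
import time

from vllm import LLM, SamplingParams

# Roughly mirrors the latency setup above: short prompts, 128 generated
# tokens per request, a fixed batch of 8, end-to-end timing per batch.
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B", tensor_parallel_size=1)
sampling_params = SamplingParams(temperature=0.0, max_tokens=128, ignore_eos=True)

prompts = ["Hello " * 32] * 8  # stand-in for the 32-token synthetic inputs
latencies_ms = []
for _ in range(10):  # a handful of iterations; the real harness runs more
    start = time.perf_counter()
    llm.generate(prompts, sampling_params)
    latencies_ms.append((time.perf_counter() - start) * 1000)

print(f"mean={statistics.mean(latencies_ms):.1f} ms, "
      f"median={statistics.median(latencies_ms):.1f} ms, "
      f"p99={sorted(latencies_ms)[int(0.99 * len(latencies_ms))]:.1f} ms")  # crude p99
```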
## Throughput tests
- Input length: the input lengths of 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
- Output length: the corresponding output lengths of these 200 prompts.
- Batch size: dynamically determined by vLLM to achieve maximum throughput.
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- Evaluation metrics: throughput.
Test name | GPU | Tput (req/s) |
---|---|---|
throughput_llama70B_tp4 | 8xH100 | 8.86239 |
throughput_llama8B_tp1 | 8xH100 | 19.5005 |
throughput_mixtral8x7B_tp2 | 8xH100 | 8.15186 |
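Throughput here is simply the number of completed requests divided by wall-clock time, with vLLM free to batch as aggressively as it can. A minimal sketch of that calculation using the offline API is below; the model name, placeholder prompts, and fixed `max_tokens` are assumptions, whereas the real runs use the ShareGPT samples and per-prompt output lengths described above.

```python
import time

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3.1-8B", tensor_parallel_size=1)

# Placeholder workload: the benchmark uses 200 ShareGPT prompts and their
# recorded output lengths instead of these toy values.
prompts = ["Summarize the plot of a famous novel."] * 200
sampling_params = SamplingParams(temperature=0.0, max_tokens=256, ignore_eos=True)

start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)  # batch size chosen by vLLM
elapsed = time.perf_counter() - start

print(f"Tput: {len(outputs) / elapsed:.2f} req/s")
```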
## Serving tests
- Input length: the input lengths of 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
- Output length: the corresponding output lengths of these 200 prompts.
- Batch size: dynamically determined by vLLM and the arrival pattern of the requests.
- Average QPS (queries per second): 1, 4, 16, and inf. QPS = inf means all requests arrive at once. For other QPS values, the arrival time of each query is drawn from a Poisson process (with a fixed random seed); a sketch of this arrival schedule appears after the table below.
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- We also include a speculative decoding test for llama-3 70B at QPS 2.
- Evaluation metrics: throughput, TTFT (time to first token, with mean, median, and p99), ITL (inter-token latency, with mean, median, and p99).
Test name | GPU | Tput (req/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean ITL (ms) | Median ITL (ms) | P99 ITL (ms) |
---|---|---|---|---|---|---|---|---|
serving_llama70B_tp4_sharegpt_qps_1 | 8xH100 | 0.984501 | 62.4149 | 56.9924 | 110.695 | 19.3936 | 18.7615 | 53.3811 |
serving_llama70B_tp4_sharegpt_qps_16 | 8xH100 | 7.29345 | 130.278 | 110.029 | 421.16 | 28.2932 | 23.7207 | 71.6121 |
serving_llama70B_tp4_sharegpt_qps_4 | 8xH100 | 3.33967 | 72.0854 | 61.3991 | 150.311 | 22.171 | 20.3971 | 63.4759 |
serving_llama70B_tp4_sharegpt_qps_inf | 8xH100 | 8.89715 | 2828.76 | 2791.16 | 5398.14 | 30.6683 | 25.6486 | 163.985 |
serving_llama70B_tp4_sharegpt_specdecode_qps_2 | 8xH100 | 1.63871 | 65.0781 | 62.4165 | 113.499 | 35.0204 | 32.2292 | 100.384 |
serving_llama8B_tp1_sharegpt_qps_1 | 8xH100 | 1.00494 | 24.6827 | 21.6912 | 42.3386 | 7.62798 | 7.55312 | 8.53468 |
serving_llama8B_tp1_sharegpt_qps_16 | 8xH100 | 11.4896 | 38.2191 | 32.3384 | 191.642 | 10.2977 | 9.39778 | 22.4331 |
serving_llama8B_tp1_sharegpt_qps_4 | 8xH100 | 3.80428 | 25.2503 | 22.5512 | 42.9718 | 8.09316 | 7.84624 | 20.3429 |
serving_llama8B_tp1_sharegpt_qps_inf | 8xH100 | 19.4787 | 1186.77 | 1136.81 | 2170.02 | 14.2914 | 12.4214 | 24.5677 |
serving_mixtral8x7B_tp2_sharegpt_qps_1 | 8xH100 | 0.987662 | 335.455 | 40.373 | 3202.61 | 18.32 | 16.0769 | 38.2721 |
serving_mixtral8x7B_tp2_sharegpt_qps_16 | 8xH100 | 6.65328 | 234.619 | 60.2056 | 1839.69 | 30.6882 | 24.0982 | 187.731 |
serving_mixtral8x7B_tp2_sharegpt_qps_4 | 8xH100 | 3.25427 | 47.0435 | 43.7934 | 85.4128 | 22.8735 | 20.5077 | 47.5261 |
serving_mixtral8x7B_tp2_sharegpt_qps_inf | 8xH100 | 8.71956 | 1233.99 | 1116.38 | 1480.35 | 26.2536 | 24.4811 | 173.943 |
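The request schedule is what distinguishes these serving runs: at a finite QPS, inter-arrival gaps are drawn from an exponential distribution so that arrivals form a Poisson process with the target average rate, while QPS = inf sends everything at t = 0. Below is a minimal sketch of generating such an arrival schedule; the function name, seed, and example values are illustrative, not the benchmark's actual code.

```python
import numpy as np

def poisson_arrival_times(num_requests: int, qps: float, seed: int = 0) -> np.ndarray:
    """Cumulative arrival times (in seconds) for a Poisson process at `qps`."""
    rng = np.random.default_rng(seed)
    # Inter-arrival gaps of a Poisson process are exponentially distributed
    # with mean 1/qps; their cumulative sum gives absolute arrival times.
    gaps = rng.exponential(scale=1.0 / qps, size=num_requests)
    return np.cumsum(gaps)

arrivals = poisson_arrival_times(num_requests=200, qps=4.0)
print(f"last request arrives at ~{arrivals[-1]:.1f}s (expected ~{200 / 4.0:.0f}s)")
```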
## JSON version of the benchmarking tables
This section contains the data from the markdown tables above in JSON format. You can load the benchmarking tables into pandas DataFrames as follows:
```python
import json
import pandas as pd

# Paste the JSON string from the section below in place of the placeholder.
benchmarking_results_json = """The json string"""
benchmarking_results = json.loads(benchmarking_results_json)

# Each top-level key corresponds to one of the markdown tables above.
latency_results = pd.DataFrame.from_dict(benchmarking_results["latency"])
throughput_results = pd.DataFrame.from_dict(benchmarking_results["throughput"])
serving_results = pd.DataFrame.from_dict(benchmarking_results["serving"])
```
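Once loaded, these are ordinary pandas DataFrames, so the tables can be filtered and sorted directly. For example, continuing from the snippet above, the serving runs can be ranked by throughput (column names match the serving table):

```python
# Rank serving runs by throughput, highest first.
print(
    serving_results[["Test name", "Tput (req/s)", "Mean TTFT (ms)", "Mean ITL (ms)"]]
    .sort_values("Tput (req/s)", ascending=False)
    .to_string(index=False)
)
```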
The JSON string for all benchmarking tables:

```json
{"latency": {"Test name": {"0": "latency_mixtral8x7B_tp2", "1": "latency_llama70B_tp4", "2": "latency_llama8B_tp1"}, "GPU": {"0": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "1": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "2": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100"}, "Mean latency (ms)": {"0": 2326.965676266021, "1": 2444.4721424001427, "2": 997.5417277334296}, "Median latency (ms)": {"0": 2330.5670189984085, "1": 2444.2982479995408, "2": 997.3654839996016}, "P99 latency (ms)": {"0": 2350.5369539411913, "1": 2450.910051380997, "2": 999.408646782831}}, "throughput": {"Test name": {"0": "throughput_llama8B_tp1", "1": "throughput_llama70B_tp4", "2": "throughput_mixtral8x7B_tp2"}, "GPU": {"0": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "1": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "2": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100"}, "Tput (req/s)": {"0": 19.50047012279712, "1": 8.862392360296212, "2": 8.151860797301364}}, "serving": {"Test name": {"0": "serving_llama8B_tp1_sharegpt_qps_inf", "1": "serving_mixtral8x7B_tp2_sharegpt_qps_4", "2": "serving_llama70B_tp4_sharegpt_qps_inf", "3": "serving_llama70B_tp4_sharegpt_specdecode_qps_2", "4": "serving_llama70B_tp4_sharegpt_qps_16", "5": "serving_mixtral8x7B_tp2_sharegpt_qps_16", "6": "serving_llama70B_tp4_sharegpt_qps_1", "7": "serving_mixtral8x7B_tp2_sharegpt_qps_inf", "8": "serving_llama8B_tp1_sharegpt_qps_16", "9": "serving_mixtral8x7B_tp2_sharegpt_qps_1", "10": "serving_llama70B_tp4_sharegpt_qps_4", "11": "serving_llama8B_tp1_sharegpt_qps_4", "12": "serving_llama8B_tp1_sharegpt_qps_1"}, "GPU": {"0": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "1": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "2": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "3": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "4": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "5": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "6": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "7": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "8": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "9": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "10": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "11": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "12": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100"}, "Tput (req/s)": {"0": 19.478671509224945, "1": 3.2542682320741907, "2": 8.89715088680512, "3": 1.6387134944470014, "4": 7.293452863434265, "5": 6.653282510621168, "6": 0.9845009832601267, "7": 8.719562292290481, "8": 11.489592565396304, "9": 0.9876621096812964, "10": 3.3396674973279286, "11": 3.804282447932104, "12": 1.0049441526923881}, "Mean TTFT (ms)": {"0": 1186.7715199698432, "1": 47.04349210000146, "2": 2828.755769004929, "3": 65.07806814280619, "4": 130.27841210498082, "5": 234.6192061650254, "6": 62.41492667983039, "7": 1233.985231079987, "8": 38.219072010124364, "9": 335.45491353988837, "10": 72.08538667488028, "11": 25.2503033649009, "12": 24.68272186989452}, "Median TTFT (ms)": {"0": 1136.8130074988585, "1": 43.79336549936852, "2": 2791.1591214997316, "3": 62.41652300013811, "4": 110.02869550065952, "5": 60.205612500794814, "6": 56.99239799832867, "7": 1116.382263999185, "8": 32.33839550011908, "9": 40.372964500420494, "10": 61.39909349985828, "11": 22.551166501216358, "12": 21.691177498723846}, "P99 TTFT (ms)": {"0": 2170.0158058400484, "1": 85.41280763962126, "2": 5398.144727478029, "3": 113.49940616048119, "4": 421.16045877835813, "5": 1839.6879125505068, "6": 110.69542514687767, "7": 
1480.3458080886774, "8": 191.64241939091858, "9": 3202.613635900633, "10": 150.31086032809978, "11": 42.97179792880342, "12": 42.33856607857888}, "Mean ITL (ms)": {"0": 14.291355223924628, "1": 22.87351067848165, "2": 30.66834920119073, "3": 35.02044642954789, "4": 28.293165652833537, "5": 30.688247894297504, "6": 19.39364450262744, "7": 26.253617825232315, "8": 10.297709183774685, "9": 18.320030211559104, "10": 22.17100693743933, "11": 8.093159038074383, "12": 7.627975875345532}, "Median ITL (ms)": {"0": 12.421366000126, "1": 20.50769599736668, "2": 25.64860899838095, "3": 32.22915399965132, "4": 23.720700499325176, "5": 24.098178000713233, "6": 18.761489998723846, "7": 24.481063002895098, "8": 9.397782499945606, "9": 16.07687999785412, "10": 20.397060501636588, "11": 7.846243499443517, "12": 7.553117500719964}, "P99 ITL (ms)": {"0": 24.56774540059996, "1": 47.52614814125991, "2": 163.9850535910591, "3": 100.38362414044968, "4": 71.61208892772265, "5": 187.73057302110834, "6": 53.38106291310397, "7": 173.94275460130305, "8": 22.433148449999862, "9": 38.272068400110584, "10": 63.475900410558104, "11": 20.34292935040867, "12": 8.534682251774965}}}
```

You can also check the raw experiment data in the Artifact tab of the Buildkite page.