
## Latency tests
- Input length: 32 tokens.
- Output length: 128 tokens.
- Batch size: fixed (8).
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- Evaluation metrics: end-to-end latency (mean, median, p99); a sketch of how these summary statistics are computed follows the table.
Test name | GPU | Mean latency (ms) | Median latency (ms) | P99 latency (ms) |
---|---|---|---|---|
latency_llama70B_tp4 | 8xH100 | 2444.47 | 2444.3 | 2450.91 |
latency_llama8B_tp1 | 8xH100 | 997.542 | 997.365 | 999.409 |
latency_mixtral8x7B_tp2 | 8xH100 | 2326.97 | 2330.57 | 2350.54 |
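
For reference, the mean, median, and p99 figures are plain summary statistics over the per-run end-to-end latencies. Below is a minimal sketch of that computation; the function name and the use of numpy are ours, not necessarily the harness's:

```python
import numpy as np

def summarize_latencies(latencies_ms: list[float]) -> dict[str, float]:
    """Mean / median / p99 over per-run end-to-end latencies (ms)."""
    arr = np.asarray(latencies_ms, dtype=float)
    return {
        "mean": float(arr.mean()),
        "median": float(np.median(arr)),
        "p99": float(np.percentile(arr, 99)),
    }

# Hypothetical samples from repeated runs of one fixed-shape batch:
print(summarize_latencies([2444.3, 2443.9, 2445.0, 2450.9]))
```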
## Throughput tests
- Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
- Output length: the corresponding output length of each of these 200 prompts.
- Batch size: dynamically determined by vLLM to achieve maximum throughput.
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- Evaluation metrics: throughput; a sketch of the prompt sampling and the throughput calculation follows the table.
Test name | GPU | Tput (req/s) |
---|---|---|
throughput_llama70B_tp4 | 8xH100 | 8.86239 |
throughput_llama8B_tp1 | 8xH100 | 19.5005 |
throughput_mixtral8x7B_tp2 | 8xH100 | 8.15186 |
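
To make the setup above concrete, here is a minimal sketch of reproducible prompt sampling and the throughput calculation. The file path, function name, and the assumed ShareGPT entry layout (a "conversations" list whose first turn holds the prompt) are illustrative assumptions, and the engine call is elided:

```python
import json
import random
import time

def sample_sharegpt_prompts(path: str, n: int = 200, seed: int = 0) -> list[str]:
    """Sample n prompts with a fixed seed so every run sees the same inputs."""
    with open(path) as f:
        data = json.load(f)
    prompts = [
        entry["conversations"][0]["value"]  # assumed entry layout
        for entry in data
        if entry.get("conversations")
    ]
    return random.Random(seed).sample(prompts, n)

prompts = sample_sharegpt_prompts("sharegpt.json")  # hypothetical path
start = time.perf_counter()
# ... submit all prompts to the engine and wait for completion ...
elapsed = time.perf_counter() - start
print(f"Throughput: {len(prompts) / elapsed:.2f} req/s")
```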
## Serving tests
- Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
- Output length: the corresponding output length of each of these 200 prompts.
- Batch size: dynamically determined by vLLM and the arrival pattern of the requests.
- Average QPS (queries per second): 1, 4, 16, and inf. QPS = inf means all requests arrive at once; for the other QPS values, the arrival time of each request is drawn from a Poisson process with a fixed random seed (see the sketch after this list).
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- We also added a speculative decoding test for llama-3 70B, under QPS 2.
- Evaluation metrics: throughput, TTFT (time to first token; mean, median, and p99), and ITL (inter-token latency; mean, median, and p99). A sketch of how TTFT and ITL are derived per request follows the table.
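
The QPS-controlled arrival pattern can be generated as follows: at average rate `qps`, inter-arrival times are exponentially distributed with mean `1/qps`, which yields a Poisson process. This is a sketch of the idea, not the harness's actual code:

```python
import random

def poisson_arrival_times(num_requests: int, qps: float, seed: int = 0) -> list[float]:
    """Arrival times (seconds) for a Poisson process at the given average QPS.

    qps = float("inf") degenerates to all requests arriving at t = 0,
    matching the qps_inf configurations below.
    """
    if qps == float("inf"):
        return [0.0] * num_requests
    rng = random.Random(seed)  # fixed seed keeps the pattern reproducible
    t, times = 0.0, []
    for _ in range(num_requests):
        t += rng.expovariate(qps)  # exponential inter-arrival, mean 1/qps
        times.append(t)
    return times
```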
Test name | GPU | Tput (req/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean ITL (ms) | Median ITL (ms) | P99 ITL (ms) |
---|---|---|---|---|---|---|---|---|
serving_llama70B_tp4_sharegpt_qps_1 | 8xH100 | 0.984501 | 62.4149 | 56.9924 | 110.695 | 19.3936 | 18.7615 | 53.3811 |
serving_llama70B_tp4_sharegpt_qps_16 | 8xH100 | 7.29345 | 130.278 | 110.029 | 421.16 | 28.2932 | 23.7207 | 71.6121 |
serving_llama70B_tp4_sharegpt_qps_4 | 8xH100 | 3.33967 | 72.0854 | 61.3991 | 150.311 | 22.171 | 20.3971 | 63.4759 |
serving_llama70B_tp4_sharegpt_qps_inf | 8xH100 | 8.89715 | 2828.76 | 2791.16 | 5398.14 | 30.6683 | 25.6486 | 163.985 |
serving_llama70B_tp4_sharegpt_specdecode_qps_2 | 8xH100 | 1.63871 | 65.0781 | 62.4165 | 113.499 | 35.0204 | 32.2292 | 100.384 |
serving_llama8B_tp1_sharegpt_qps_1 | 8xH100 | 1.00494 | 24.6827 | 21.6912 | 42.3386 | 7.62798 | 7.55312 | 8.53468 |
serving_llama8B_tp1_sharegpt_qps_16 | 8xH100 | 11.4896 | 38.2191 | 32.3384 | 191.642 | 10.2977 | 9.39778 | 22.4331 |
serving_llama8B_tp1_sharegpt_qps_4 | 8xH100 | 3.80428 | 25.2503 | 22.5512 | 42.9718 | 8.09316 | 7.84624 | 20.3429 |
serving_llama8B_tp1_sharegpt_qps_inf | 8xH100 | 19.4787 | 1186.77 | 1136.81 | 2170.02 | 14.2914 | 12.4214 | 24.5677 |
serving_mixtral8x7B_tp2_sharegpt_qps_1 | 8xH100 | 0.987662 | 335.455 | 40.373 | 3202.61 | 18.32 | 16.0769 | 38.2721 |
serving_mixtral8x7B_tp2_sharegpt_qps_16 | 8xH100 | 6.65328 | 234.619 | 60.2056 | 1839.69 | 30.6882 | 24.0982 | 187.731 |
serving_mixtral8x7B_tp2_sharegpt_qps_4 | 8xH100 | 3.25427 | 47.0435 | 43.7934 | 85.4128 | 22.8735 | 20.5077 | 47.5261 |
serving_mixtral8x7B_tp2_sharegpt_qps_inf | 8xH100 | 8.71956 | 1233.99 | 1116.38 | 1480.35 | 26.2536 | 24.4811 | 173.943 |
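
For clarity on the per-request metrics: TTFT is the delay from request submission to the first generated token, and ITL is the gap between consecutive tokens; the table reports mean/median/p99 of these values across all requests. A hypothetical sketch, assuming per-token wall-clock timestamps are available:

```python
def ttft_and_itl(request_start_s: float, token_times_s: list[float]) -> tuple[float, list[float]]:
    """Return (TTFT, list of ITLs), both in milliseconds."""
    ttft_ms = (token_times_s[0] - request_start_s) * 1000.0
    itl_ms = [
        (later - earlier) * 1000.0
        for earlier, later in zip(token_times_s, token_times_s[1:])
    ]
    return ttft_ms, itl_ms

# Hypothetical request: starts at t=0.0 s, tokens at 0.06 s, 0.08 s, 0.10 s.
ttft, itls = ttft_and_itl(0.0, [0.06, 0.08, 0.10])
print(ttft, itls)  # TTFT = 60.0 ms; ITLs ~ 20 ms each
```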
## JSON version of the benchmarking tables
This section contains the data from the markdown tables above in JSON format. You can load the benchmarking tables into pandas DataFrames as follows:
```python
import json
import pandas as pd

# Paste the JSON string from below in place of the placeholder.
benchmarking_results_json = """The json string"""
benchmarking_results = json.loads(benchmarking_results_json)

# One DataFrame per benchmark suite.
latency_results = pd.DataFrame.from_dict(benchmarking_results["latency"])
throughput_results = pd.DataFrame.from_dict(benchmarking_results["throughput"])
serving_results = pd.DataFrame.from_dict(benchmarking_results["serving"])
```
The JSON string for all benchmarking tables:
{"latency": {"Test name": {"0": "latency_mixtral8x7B_tp2", "1": "latency_llama70B_tp4", "2": "latency_llama8B_tp1"}, "GPU": {"0": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "1": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "2": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100"}, "Mean latency (ms)": {"0": 2326.965676266021, "1": 2444.4721424001427, "2": 997.5417277334296}, "Median latency (ms)": {"0": 2330.5670189984085, "1": 2444.2982479995408, "2": 997.3654839996016}, "P99 latency (ms)": {"0": 2350.5369539411913, "1": 2450.910051380997, "2": 999.408646782831}}, "throughput": {"Test name": {"0": "throughput_llama8B_tp1", "1": "throughput_llama70B_tp4", "2": "throughput_mixtral8x7B_tp2"}, "GPU": {"0": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "1": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "2": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100"}, "Tput (req/s)": {"0": 19.50047012279712, "1": 8.862392360296212, "2": 8.151860797301364}}, "serving": {"Test name": {"0": "serving_llama8B_tp1_sharegpt_qps_inf", "1": "serving_mixtral8x7B_tp2_sharegpt_qps_4", "2": "serving_llama70B_tp4_sharegpt_qps_inf", "3": "serving_llama70B_tp4_sharegpt_specdecode_qps_2", "4": "serving_llama70B_tp4_sharegpt_qps_16", "5": "serving_mixtral8x7B_tp2_sharegpt_qps_16", "6": "serving_llama70B_tp4_sharegpt_qps_1", "7": "serving_mixtral8x7B_tp2_sharegpt_qps_inf", "8": "serving_llama8B_tp1_sharegpt_qps_16", "9": "serving_mixtral8x7B_tp2_sharegpt_qps_1", "10": "serving_llama70B_tp4_sharegpt_qps_4", "11": "serving_llama8B_tp1_sharegpt_qps_4", "12": "serving_llama8B_tp1_sharegpt_qps_1"}, "GPU": {"0": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "1": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "2": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "3": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "4": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "5": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "6": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "7": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "8": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "9": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "10": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "11": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100", "12": "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100"}, "Tput (req/s)": {"0": 19.478671509224945, "1": 3.2542682320741907, "2": 8.89715088680512, "3": 1.6387134944470014, "4": 7.293452863434265, "5": 6.653282510621168, "6": 0.9845009832601267, "7": 8.719562292290481, "8": 11.489592565396304, "9": 0.9876621096812964, "10": 3.3396674973279286, "11": 3.804282447932104, "12": 1.0049441526923881}, "Mean TTFT (ms)": {"0": 1186.7715199698432, "1": 47.04349210000146, "2": 2828.755769004929, "3": 65.07806814280619, "4": 130.27841210498082, "5": 234.6192061650254, "6": 62.41492667983039, "7": 1233.985231079987, "8": 38.219072010124364, "9": 335.45491353988837, "10": 72.08538667488028, "11": 25.2503033649009, "12": 24.68272186989452}, "Median TTFT (ms)": {"0": 1136.8130074988585, "1": 43.79336549936852, "2": 2791.1591214997316, "3": 62.41652300013811, "4": 110.02869550065952, "5": 60.205612500794814, "6": 56.99239799832867, "7": 1116.382263999185, "8": 32.33839550011908, "9": 40.372964500420494, "10": 61.39909349985828, "11": 22.551166501216358, "12": 21.691177498723846}, "P99 TTFT (ms)": {"0": 2170.0158058400484, "1": 85.41280763962126, "2": 5398.144727478029, "3": 113.49940616048119, "4": 421.16045877835813, "5": 1839.6879125505068, "6": 110.69542514687767, "7": 
1480.3458080886774, "8": 191.64241939091858, "9": 3202.613635900633, "10": 150.31086032809978, "11": 42.97179792880342, "12": 42.33856607857888}, "Mean ITL (ms)": {"0": 14.291355223924628, "1": 22.87351067848165, "2": 30.66834920119073, "3": 35.02044642954789, "4": 28.293165652833537, "5": 30.688247894297504, "6": 19.39364450262744, "7": 26.253617825232315, "8": 10.297709183774685, "9": 18.320030211559104, "10": 22.17100693743933, "11": 8.093159038074383, "12": 7.627975875345532}, "Median ITL (ms)": {"0": 12.421366000126, "1": 20.50769599736668, "2": 25.64860899838095, "3": 32.22915399965132, "4": 23.720700499325176, "5": 24.098178000713233, "6": 18.761489998723846, "7": 24.481063002895098, "8": 9.397782499945606, "9": 16.07687999785412, "10": 20.397060501636588, "11": 7.846243499443517, "12": 7.553117500719964}, "P99 ITL (ms)": {"0": 24.56774540059996, "1": 47.52614814125991, "2": 163.9850535910591, "3": 100.38362414044968, "4": 71.61208892772265, "5": 187.73057302110834, "6": 53.38106291310397, "7": 173.94275460130305, "8": 22.433148449999862, "9": 38.272068400110584, "10": 63.475900410558104, "11": 20.34292935040867, "12": 8.534682251774965}}}
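
Once the loading snippet above has run, the DataFrames can be queried like any pandas table; for example, to rank serving configurations by throughput (column names taken from the tables above):

```python
# Continues from the loading snippet; serving_results must already exist.
cols = ["Test name", "Tput (req/s)", "Mean TTFT (ms)", "Mean ITL (ms)"]
ranked = serving_results[cols].sort_values("Tput (req/s)", ascending=False)
print(ranked.to_string(index=False))
```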
You can also check the raw experiment data in the Artifacts tab of the Buildkite page.
