
Latency tests

  • Input length: 32 tokens.
  • Output length: 128 tokens.
  • Batch size: fixed (8).
  • Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
  • Evaluation metrics: end-to-end latency (mean, median, p99).
| Test name | GPU | Mean latency (ms) | Median latency (ms) | P99 latency (ms) |
|---|---|---|---|---|
| latency_llama70B_tp4 | H200 | 2077.53 | 2077.56 | 2079.95 |
| latency_llama8B_tp1 | H200 | 833.421 | 833.53 | 834.167 |
| latency_mixtral8x7B_tp2 | H200 | 1917.44 | 1916.58 | 1929.00 |
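For reference, the mean/median/p99 summaries reported above can be computed from raw per-request latencies with the standard library. The sample values below are illustrative only, not taken from the tables:

```python
import statistics

# Hypothetical per-request end-to-end latencies in ms (illustrative values,
# not taken from the benchmark results above).
latencies_ms = [831.9, 832.7, 833.4, 833.5, 834.0, 834.2, 835.1, 836.3]

mean_ms = statistics.mean(latencies_ms)
median_ms = statistics.median(latencies_ms)
# p99: 99th percentile via statistics.quantiles (exclusive method by default;
# with few samples it interpolates near, or beyond, the largest observation).
p99_ms = statistics.quantiles(latencies_ms, n=100)[98]
```

With only a handful of samples the p99 is dominated by the single slowest request, which is why the real runs report it over the full request set.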

Throughput tests

  • Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
  • Output length: the output length corresponding to each sampled prompt.
  • Batch size: dynamically determined by vLLM to achieve maximum throughput.
  • Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
  • Evaluation metric: throughput (req/s).
| Test name | GPU | Tput (req/s) |
|---|---|---|
| throughput_llama70B_tp4 | H200 | 9.83905 |
| throughput_llama8B_tp1 | H200 | 22.5674 |
| throughput_mixtral8x7B_tp2 | H200 | 8.06251 |
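The throughput metric reduces to completed requests divided by wall-clock benchmark time. A minimal sketch with made-up numbers (the real runs measure the elapsed time of the benchmark itself):

```python
# Request throughput = completed requests / wall-clock duration.
# The numbers below are hypothetical, chosen only to illustrate the formula.
num_requests = 200   # prompts sampled from ShareGPT
elapsed_s = 8.9      # total benchmark duration in seconds

tput_req_per_s = num_requests / elapsed_s
```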

Serving tests

  • Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
  • Output length: the output length corresponding to each sampled prompt.
  • Batch size: dynamically determined by vLLM and the arrival pattern of the requests.
  • Average QPS (queries per second): 1, 4, 16, and inf. QPS = inf means all requests arrive at once; for the other QPS values, each query's arrival time is drawn from a Poisson process (with a fixed random seed).
  • Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
  • We also add a speculative decoding test for llama-3 70B, run at QPS 2.
  • Evaluation metrics: throughput, TTFT (time to first token; mean, median, and p99), and ITL (inter-token latency; mean, median, and p99).
| Test name | GPU | Tput (req/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean ITL (ms) | Median ITL (ms) | P99 ITL (ms) |
|---|---|---|---|---|---|---|---|---|
| serving_llama70B_tp4_sharegpt_qps_1 | H200 | 0.989698 | 64.6695 | 57.5495 | 114.381 | 16.54 | 15.9035 | 57.0511 |
| serving_llama70B_tp4_sharegpt_qps_16 | H200 | 7.93297 | 129.331 | 121.234 | 235.558 | 27.0082 | 20.8107 | 73.0208 |
| serving_llama70B_tp4_sharegpt_qps_4 | H200 | 3.44656 | 77.4506 | 63.337 | 155.165 | 19.1865 | 17.085 | 59.7834 |
| serving_llama70B_tp4_sharegpt_qps_inf | H200 | 9.80641 | 3002.99 | 2953.38 | 5774.49 | 28.4332 | 22.5972 | 62.2624 |
| serving_llama70B_tp4_sharegpt_specdecode_qps_2 | H200 | 1.95686 | 56.382 | 61.792 | 109.222 | 30.1394 | 27.3403 | 96.1451 |
| serving_llama8B_tp1_sharegpt_qps_1 | H200 | 1.00723 | 23.7387 | 22.3615 | 42.087 | 6.27439 | 6.19076 | 6.70216 |
| serving_llama8B_tp1_sharegpt_qps_16 | H200 | 12.163 | 36.5353 | 31.9291 | 83.0084 | 9.17103 | 7.92751 | 22.6797 |
| serving_llama8B_tp1_sharegpt_qps_4 | H200 | 3.8669 | 26.219 | 23.4337 | 43.4302 | 6.67473 | 6.39724 | 20.5362 |
| serving_llama8B_tp1_sharegpt_qps_inf | H200 | 22.2192 | 1300.75 | 1275.22 | 2300.03 | 12.2474 | 11.4056 | 22.4373 |
| serving_mixtral8x7B_tp2_sharegpt_qps_1 | H200 | 0.99435 | 46.4297 | 41.3002 | 88.5616 | 12.9366 | 12.0271 | 38.5728 |
| serving_mixtral8x7B_tp2_sharegpt_qps_16 | H200 | 6.66174 | 72.6094 | 63.4399 | 226.987 | 31.0919 | 24.3868 | 172.603 |
| serving_mixtral8x7B_tp2_sharegpt_qps_4 | H200 | 3.2501 | 52.4526 | 48.9461 | 94.6454 | 23.4618 | 21.0607 | 56.974 |
| serving_mixtral8x7B_tp2_sharegpt_qps_inf | H200 | 8.28159 | 2213.17 | 2005.53 | 2715.63 | 27.7495 | 25.3022 | 51.4907 |
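The seeded Poisson arrival pattern used by the serving tests can be sketched as below. The function name and signature are illustrative, not vLLM's actual benchmark code:

```python
import random

def poisson_arrival_times(num_requests: int, qps: float, seed: int = 0) -> list[float]:
    """Return arrival timestamps (in seconds) with exponentially
    distributed inter-arrival gaps, i.e. a Poisson process at rate `qps`.

    Illustrative sketch only; vLLM's benchmark has its own implementation.
    """
    rng = random.Random(seed)  # fixed seed -> reproducible arrival pattern
    t = 0.0
    times = []
    for _ in range(num_requests):
        t += rng.expovariate(qps)  # mean gap between requests = 1/qps
        times.append(t)
    return times
```

At QPS 4 the mean gap is 0.25 s; QPS = inf corresponds to skipping the gaps entirely and issuing every request at t = 0.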

JSON version of the benchmarking tables

This section contains the data of the markdown tables above in JSON format. You can load the benchmarking tables into pandas DataFrames as follows:

```python
import json

import pandas as pd

# Replace the placeholder with the JSON string given below.
benchmarking_results_json = """The json string"""
benchmarking_results = json.loads(benchmarking_results_json)

# Each top-level key holds one table in pandas' column-oriented dict layout.
latency_results = pd.DataFrame.from_dict(benchmarking_results["latency"])
throughput_results = pd.DataFrame.from_dict(benchmarking_results["throughput"])
serving_results = pd.DataFrame.from_dict(benchmarking_results["serving"])
```
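The JSON uses a column-oriented layout: each table maps a column name to a dict of row index (as a string) to value. A tiny hand-trimmed sample to show the shape (truncated to one row and two columns; the full string is given below):

```python
import json

# Truncated excerpt of the benchmarking JSON: column name -> {row index -> value}.
snippet = '''
{"latency": {"Test name": {"0": "latency_llama8B_tp1"},
             "Mean latency (ms)": {"0": 833.4214728946488}}}
'''
tables = json.loads(snippet)
mean_ms = tables["latency"]["Mean latency (ms)"]["0"]
```

This is exactly the layout `pd.DataFrame.from_dict` expects with its default `orient="columns"`, which is why the loading code above needs no extra arguments.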

The json string for all benchmarking tables:

```json
{"latency": {"Test name": {"0": "latency_llama8B_tp1", "1": "latency_llama70B_tp4", "2": "latency_mixtral8x7B_tp2"}, "GPU": {"0": "H200", "1": "H200", "2": "H200"}, "Mean latency (ms)": {"0": 833.4214728946488, "1": 2077.5339416228235, "2": 1917.4413826627037}, "Median latency (ms)": {"0": 833.5301280021667, "1": 2077.5569062680006, "2": 1916.5793992578983}, "P99 latency (ms)": {"0": 834.1666641458869, "1": 2079.9549554102123, "2": 1929.0014506783336}}, "throughput": {"Test name": {"0": "throughput_llama8B_tp1", "1": "throughput_mixtral8x7B_tp2", "2": "throughput_llama70B_tp4"}, "GPU": {"0": "H200", "1": "H200", "2": "H200"}, "Tput (req/s)": {"0": 22.56741098325012, "1": 8.06251347257433, "2": 9.83904802973439}}, "serving": {"Test name": {"0": "serving_llama70B_tp4_sharegpt_qps_1", "1": "serving_mixtral8x7B_tp2_sharegpt_qps_inf", "2": "serving_llama70B_tp4_sharegpt_qps_16", "3": "serving_llama70B_tp4_sharegpt_qps_inf", "4": "serving_mixtral8x7B_tp2_sharegpt_qps_1", "5": "serving_llama8B_tp1_sharegpt_qps_1", "6": "serving_llama8B_tp1_sharegpt_qps_4", "7": "serving_llama8B_tp1_sharegpt_qps_16", "8": "serving_llama8B_tp1_sharegpt_qps_inf", "9": "serving_llama70B_tp4_sharegpt_qps_4", "10": "serving_mixtral8x7B_tp2_sharegpt_qps_16", "11": "serving_mixtral8x7B_tp2_sharegpt_qps_4", "12": "serving_llama70B_tp4_sharegpt_specdecode_qps_2"}, "GPU": {"0": "H200", "1": "H200", "2": "H200", "3": "H200", "4": "H200", "5": "H200", "6": "H200", "7": "H200", "8": "H200", "9": "H200", "10": "H200", "11": "H200", "12": "H200"}, "Tput (req/s)": {"0": 0.9896977889656053, "1": 8.28159338054524, "2": 7.932968711752506, "3": 9.806409127702178, "4": 0.9943502861076632, "5": 1.0072348217112044, "6": 3.8669012030840593, "7": 12.162959386522013, "8": 22.21919766601935, "9": 3.446556870971799, "10": 6.661743742631414, "11": 3.250100042912229, "12": 1.956857415983935}, "Mean TTFT (ms)": {"0": 64.66954215429723, "1": 2213.1659540743567, "2": 129.33125539682806, "3": 3002.994736842811, "4": 46.42968596657738, "5": 23.738678817171603, "6": 26.218972683418542, "7": 36.53525110334158, "8": 1300.7527791964822, "9": 77.45059719774872, "10": 72.60941448388621, "11": 52.45257397182286, "12": 56.38197902124375}, "Median TTFT (ms)": {"0": 57.54952295683324, "1": 2005.5272534955293, "2": 121.23411428183317, "3": 2953.378487378359, "4": 41.300211334601045, "5": 22.361535346135497, "6": 23.433676920831203, "7": 31.929075717926025, "8": 1275.2170253079385, "9": 63.33703640848398, "10": 63.43989260494709, "11": 48.9460586104542, "12": 61.79200345650315}, "P99 TTFT (ms)": {"0": 114.38109113369136, "1": 2715.6348367640744, "2": 235.5583326471968, "3": 5774.493958768434, "4": 88.56162520125505, "5": 42.08704750519246, "6": 43.430245905183156, "7": 83.0083797499534, "8": 2300.0334176421165, "9": 155.16540429554877, "10": 226.986764236353, "11": 94.64538314379747, "12": 109.22247116453943}, "Mean ITL (ms)": {"0": 16.540048188017764, "1": 27.74949681771482, "2": 27.00816342756017, "3": 28.43324468155776, "4": 12.936618607449548, "5": 6.27439203658357, "6": 6.674730034577902, "7": 9.171031044295976, "8": 12.247382195842055, "9": 19.186542548971318, "10": 31.09191192388171, "11": 23.46184330725827, "12": 30.139367799210298}, "Median ITL (ms)": {"0": 15.903530409559608, "1": 25.302208960056305, "2": 20.81073261797428, "3": 22.597191389650106, "4": 12.02711882069707, "5": 6.190762389451265, "6": 6.397241959348321, "7": 7.927508093416691, "8": 11.40562235377729, "9": 17.085016472265124, "10": 24.386791978031397, "11": 21.060695871710777, "12": 27.340279892086983}, "P99 ITL (ms)": {"0": 57.05114717129618, "1": 51.49066437035799, "2": 73.0208202963695, "3": 62.2624043840915, "4": 38.57284761965275, "5": 6.70215771533549, "6": 20.536222611553967, "7": 22.679689247161157, "8": 22.43727925233543, "9": 59.78339056484401, "10": 172.60293657891452, "11": 56.97400974109764, "12": 96.14510252140462}}}
```

You can also check the raw experiment data in the Artifacts tab of the Buildkite page.

๐Ÿš Tests
Waited 7s
ยท
Ran in 40s
๐Ÿš Tests
Waited 4s
ยท
Ran in 38s
Total Job Run Time: 1m 40s