JustPIC.jl CI build — https://github.com/JuliaGeodynamics/JustPIC.jl
Commit: add MarkerChain conversion function
Status: Failed in 41m 30s

Latency tests

  • Input length: 32 tokens.
  • Output length: 128 tokens.
  • Batch size: fixed (8).
  • Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
  • Evaluation metrics: end-to-end latency (mean, median, p99).
| Test name | GPU | Mean latency (ms) | Median latency (ms) | P99 latency (ms) |
|---|---|---|---|---|
| latency_llama8B_tp1 | A100-SXM4-80GB | 1585.43 | 1585.30 | 1586.62 |
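The mean/median/p99 aggregates reported above can be computed from raw per-request latency samples. A minimal sketch using only the standard library (the helper name and the nearest-rank p99 convention are illustrative assumptions, not taken from the benchmark harness):

```python
import statistics

def summarize_latencies(samples_ms):
    """Return (mean, median, p99) of a list of latency samples in ms.

    Illustrative helper: uses the nearest-rank method for the p99,
    i.e. the smallest sample that at least 99% of samples do not exceed.
    """
    ordered = sorted(samples_ms)
    rank = max(0, int(round(0.99 * len(ordered))) - 1)
    return (
        statistics.mean(ordered),
        statistics.median(ordered),
        ordered[rank],
    )

mean_ms, median_ms, p99_ms = summarize_latencies([1585.1, 1585.3, 1585.5, 1586.7])
```

With only a handful of samples, mean, median, and p99 land very close together, which matches the tight spread in the table above.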

Throughput tests

  • Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
  • Output length: the output lengths paired with those 200 prompts.
  • Batch size: dynamically determined by vLLM to achieve maximum throughput.
  • Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
  • Evaluation metrics: throughput.
| Test name | GPU | Tput (req/s) |
|---|---|---|
| throughput_llama8B_tp1 | A100-SXM4-80GB | 11.0511 |
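Offline throughput here is simply completed requests divided by elapsed wall-clock time. A trivial sketch (the function name and the example elapsed time are illustrative, not from the benchmark code):

```python
def requests_per_second(num_completed, elapsed_s):
    """Offline throughput: completed requests over total wall-clock seconds."""
    return num_completed / elapsed_s

# Hypothetical example: 200 ShareGPT prompts finishing in ~18.1 s
# corresponds to roughly the ~11 req/s reported in the table above.
tput = requests_per_second(200, 18.1)
```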

Serving tests

  • Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
  • Output length: the output lengths paired with those 200 prompts.
  • Batch size: dynamically determined by vLLM and the arrival pattern of the requests.
  • Average QPS (queries per second): 1, 4, 16 and inf. QPS = inf means all requests arrive at once; for the other QPS values, each request's arrival time is drawn from a Poisson process (with a fixed random seed).
  • Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
  • We also added a speculative decoding test for llama-3 70B under QPS 2.
  • Evaluation metrics: throughput, TTFT (time to first token; mean, median and p99), ITL (inter-token latency; mean, median and p99).
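TTFT and ITL can be derived from per-token emission timestamps: TTFT is the gap from request submission to the first token, and ITL is the set of gaps between consecutive tokens. A hypothetical sketch (the function and timestamp names are illustrative, not vLLM's internals):

```python
def ttft_and_itl(request_start, token_times):
    """Derive serving metrics from timestamps (all in seconds).

    TTFT: delay from request submission to the first emitted token.
    ITL:  list of gaps between consecutive emitted tokens.
    """
    ttft = token_times[0] - request_start
    itl = [b - a for a, b in zip(token_times, token_times[1:])]
    return ttft, itl

# A request submitted at t=0 whose tokens arrive at 45 ms, 57 ms, 70 ms:
ttft, itl = ttft_and_itl(0.0, [0.045, 0.057, 0.070])
```

The benchmark then aggregates these per-request values into the mean/median/p99 columns shown below.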
| Test name | GPU | Tput (req/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean ITL (ms) | Median ITL (ms) | P99 ITL (ms) |
|---|---|---|---|---|---|---|---|---|
| serving_llama8B_tp1_sharegpt_qps_1 | A100-SXM4-80GB | 0.996675 | 45.2435 | 40.0117 | 84.2829 | 12.4834 | 12.0232 | 36.9104 |
| serving_llama8B_tp1_sharegpt_qps_4 | A100-SXM4-80GB | 3.5898 | 50.6603 | 43.88 | 103.504 | 14.3334 | 13.0438 | 39.3704 |
| serving_llama8B_tp1_sharegpt_qps_16 | A100-SXM4-80GB | 8.6693 | 88.7766 | 75.6271 | 290.078 | 22.8193 | 19.4976 | 47.0021 |
| serving_llama8B_tp1_sharegpt_qps_inf | A100-SXM4-80GB | 11.2301 | 2351.38 | 2311.11 | 4524.92 | 25.5895 | 22.1167 | 47.9077 |
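The Poisson arrival pattern used in these serving tests can be sketched by drawing exponential inter-arrival gaps at rate QPS; with a fixed seed the schedule is reproducible across runs. This is an illustrative sketch, not vLLM's actual request generator:

```python
import random

def poisson_arrival_times(num_requests, qps, seed=0):
    """Cumulative arrival times (s) for a Poisson process with rate `qps`.

    qps = float('inf') models the "all requests arrive at once" case.
    Illustrative helper, not taken from the benchmark code.
    """
    if qps == float("inf"):
        return [0.0] * num_requests
    rng = random.Random(seed)  # fixed seed -> reproducible schedule
    t, times = 0.0, []
    for _ in range(num_requests):
        t += rng.expovariate(qps)  # exponential gaps -> Poisson arrivals
        times.append(t)
    return times

arrivals = poisson_arrival_times(200, qps=4, seed=0)
```

At QPS 4, the 200 requests spread over roughly 50 seconds on average, which is why per-request latencies stay low until the QPS = inf case saturates the server.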

JSON version of the benchmarking tables

This section contains the data of the markdown tables above in JSON format. You can load the benchmarking tables into pandas DataFrames as follows:

```python
import json

import pandas as pd

# Paste the JSON string given below in place of the placeholder.
benchmarking_results_json = """The json string"""
benchmarking_results = json.loads(benchmarking_results_json)

# Each top-level key holds one table as a column-oriented dict,
# which DataFrame.from_dict consumes directly.
latency_results = pd.DataFrame.from_dict(benchmarking_results["latency"])
throughput_results = pd.DataFrame.from_dict(benchmarking_results["throughput"])
serving_results = pd.DataFrame.from_dict(benchmarking_results["serving"])
```

The JSON string for all benchmarking tables:

```json
{"latency": {"Test name": {"0": "latency_llama8B_tp1"}, "GPU": {"0": "A100-SXM4-80GB"}, "Mean latency (ms)": {"0": 1585.4302836582065}, "Median latency (ms)": {"0": 1585.2955740410835}, "P99 latency (ms)": {"0": 1586.6179404547438}}, "throughput": {"Test name": {"0": "throughput_llama8B_tp1"}, "GPU": {"0": "A100-SXM4-80GB"}, "Tput (req/s)": {"0": 11.051051066856987}}, "serving": {"Test name": {"0": "serving_llama8B_tp1_sharegpt_qps_1", "1": "serving_llama8B_tp1_sharegpt_qps_4", "2": "serving_llama8B_tp1_sharegpt_qps_16", "3": "serving_llama8B_tp1_sharegpt_qps_inf"}, "GPU": {"0": "A100-SXM4-80GB", "1": "A100-SXM4-80GB", "2": "A100-SXM4-80GB", "3": "A100-SXM4-80GB"}, "Tput (req/s)": {"0": 0.9966750028890998, "1": 3.589799365387123, "2": 8.669303557011263, "3": 11.230140465962256}, "Mean TTFT (ms)": {"0": 45.24345211684704, "1": 50.66031151684001, "2": 88.77658500452526, "3": 2351.383146782173}, "Median TTFT (ms)": {"0": 40.011718519963324, "1": 43.87998208403587, "2": 75.62712801154703, "3": 2311.106847017072}, "P99 TTFT (ms)": {"0": 84.28291223477572, "1": 103.50417125970122, "2": 290.0784279988145, "3": 4524.917814568616}, "Mean ITL (ms)": {"0": 12.483403929454695, "1": 14.333410171630511, "2": 22.819252189808246, "3": 25.58949835774591}, "Median ITL (ms)": {"0": 12.023237068206072, "1": 13.043780578300357, "2": 19.497641013003886, "3": 22.116736974567175}, "P99 ITL (ms)": {"0": 36.91042453050618, "1": 39.37038655159996, "2": 47.002061328385025, "3": 47.90773662971333}}}
```

You can also check the raw experiment data in the Artifact tab of the Buildkite page.

Matrix: CUDA, Julia 1.10
julia -e 'println("--- :julia: Instantiating project") && using Pkg && Pkg.develop(; path=pwd())' || exit 3 && julia -e 'println("+++ :julia: Running tests") && using Pkg && Pkg.test("JustPIC"; test_args=["--backend=CUDA"], coverage=true)'
Waited 10s · Ran in 5m 49s

Matrix: AMDGPU, Julia 1.10
julia -e 'println("--- :julia: Instantiating project") && using Pkg && Pkg.develop(; path=pwd())' || exit 3 && julia -e 'println("+++ :julia: Running tests") && using Pkg && Pkg.test("JustPIC"; test_args=["--backend=AMDGPU"], coverage=true)'
Waited 4s · Ran in 41m 24s

Total job run time: 47m 15s