## Latency tests
- Input length: 32 tokens.
- Output length: 128 tokens.
- Batch size: fixed (8).
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- Evaluation metrics: end-to-end latency (mean, median, p99); a minimal reproduction sketch follows the table below.
Test name | GPU | Mean latency (ms) | Median latency (ms) | P99 latency (ms) |
---|---|---|---|---|
latency_llama8B_tp1 | A100-SXM4-80GB | 1585.43 | 1585.3 | 1586.62 |
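
These numbers come from vLLM's benchmark suite; as a rough illustration of what the test measures, end-to-end batch latency can be approximated with vLLM's offline `LLM` API. This is a minimal sketch under stated assumptions (the model id, prompt, and iteration count are illustrative), not the benchmark harness itself:

```python
import time

import numpy as np
from vllm import LLM, SamplingParams

# Assumed model id; the prompt stands in for a fixed 32-token input.
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B")
params = SamplingParams(max_tokens=128, ignore_eos=True)  # 128 output tokens
prompts = ["Hello, my name is"] * 8  # fixed batch size of 8

latencies_ms = []
for _ in range(10):  # iteration count is an assumption
    start = time.perf_counter()
    llm.generate(prompts, params)  # one end-to-end batch generation
    latencies_ms.append((time.perf_counter() - start) * 1000)

print(f"mean={np.mean(latencies_ms):.2f} ms  "
      f"median={np.median(latencies_ms):.2f} ms  "
      f"p99={np.percentile(latencies_ms, 99):.2f} ms")
```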
## Throughput tests
- Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
- Output length: the corresponding output lengths of these 200 prompts.
- Batch size: dynamically determined by vLLM to achieve maximum throughput; a minimal offline sketch follows the table below.
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- Evaluation metrics: throughput.
Test name | GPU | Tput (req/s) |
---|---|---|
throughput_llama8B_tp1 | A100-SXM4-80GB | 11.0511 |
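
As a rough offline sketch of the setup above: the ShareGPT file name and its schema (a list of conversations whose first turn carries the prompt) are assumptions here, and a fixed `max_tokens` stands in for the per-prompt output lengths the real harness replays.

```python
import json
import random
import time

from vllm import LLM, SamplingParams

# Assumed local ShareGPT dump and schema; both are illustrative.
with open("ShareGPT_V3_unfiltered_cleaned_split.json") as f:
    dataset = json.load(f)

random.seed(0)  # fixed random seed, as in the test description
sampled = random.sample(dataset, 200)
prompts = [conv["conversations"][0]["value"] for conv in sampled]

llm = LLM(model="meta-llama/Meta-Llama-3.1-8B")  # assumed model id
start = time.perf_counter()
# vLLM batches the requests internally to maximize throughput;
# max_tokens=256 is a stand-in for the per-prompt output lengths.
llm.generate(prompts, SamplingParams(max_tokens=256))
elapsed = time.perf_counter() - start

print(f"Throughput: {len(prompts) / elapsed:.4f} req/s")
```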
## Serving tests
- Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
- Output length: the corresponding output lengths of these 200 prompts.
- Batch size: dynamically determined by vLLM and the arrival pattern of the requests.
- Average QPS (queries per second): 1, 4, 16, and inf. QPS = inf means all requests arrive at once. For the other QPS values, the arrival time of each query is drawn from a Poisson process (with a fixed random seed); a sketch of this arrival schedule follows the results table below.
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- We also added a speculative decoding test for llama-3 70B under QPS 2.
- Evaluation metrics: throughput, TTFT (time to first token; mean, median, and p99), ITL (inter-token latency; mean, median, and p99).
Test name | GPU | Tput (req/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean ITL (ms) | Median ITL (ms) | P99 ITL (ms) |
---|---|---|---|---|---|---|---|---|
serving_llama8B_tp1_sharegpt_qps_1 | A100-SXM4-80GB | 0.996675 | 45.2435 | 40.0117 | 84.2829 | 12.4834 | 12.0232 | 36.9104 |
serving_llama8B_tp1_sharegpt_qps_4 | A100-SXM4-80GB | 3.5898 | 50.6603 | 43.88 | 103.504 | 14.3334 | 13.0438 | 39.3704 |
serving_llama8B_tp1_sharegpt_qps_16 | A100-SXM4-80GB | 8.6693 | 88.7766 | 75.6271 | 290.078 | 22.8193 | 19.4976 | 47.0021 |
serving_llama8B_tp1_sharegpt_qps_inf | A100-SXM4-80GB | 11.2301 | 2351.38 | 2311.11 | 4524.92 | 25.5895 | 22.1167 | 47.9077 |
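
The Poisson arrival pattern described above can be generated by accumulating exponential inter-arrival gaps, since a Poisson process at rate QPS has i.i.d. exponential gaps with mean 1/QPS. A minimal sketch (the helper function is ours, not part of the benchmark suite):

```python
import numpy as np

def arrival_times(num_requests: int, qps: float, seed: int = 0) -> np.ndarray:
    """Arrival times in seconds for a Poisson process at rate `qps`.

    Inter-arrival gaps of a Poisson process are exponentially
    distributed with mean 1/qps; qps = inf sends everything at t = 0.
    """
    if np.isinf(qps):
        return np.zeros(num_requests)
    rng = np.random.default_rng(seed)  # fixed random seed, per the description
    gaps = rng.exponential(scale=1.0 / qps, size=num_requests)
    return np.cumsum(gaps)

# First five arrival times for 200 requests at an average of 4 QPS.
print(arrival_times(200, qps=4.0)[:5])
```

A load generator then sleeps until each arrival time before dispatching the corresponding request.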
## JSON version of the benchmarking tables
This section contains the data from the markdown tables above in JSON format. You can load the benchmarking tables into pandas dataframes as follows:

```python
import json

import pandas as pd

# Paste the JSON string from the section below in place of the placeholder.
benchmarking_results_json = """The json string"""
benchmarking_results = json.loads(benchmarking_results_json)

# One dataframe per benchmark suite, keyed as in the tables above.
latency_results = pd.DataFrame.from_dict(benchmarking_results["latency"])
throughput_results = pd.DataFrame.from_dict(benchmarking_results["throughput"])
serving_results = pd.DataFrame.from_dict(benchmarking_results["serving"])
```
The JSON string for all benchmarking tables:

```json
{"latency": {"Test name": {"0": "latency_llama8B_tp1"}, "GPU": {"0": "A100-SXM4-80GB"}, "Mean latency (ms)": {"0": 1585.4302836582065}, "Median latency (ms)": {"0": 1585.2955740410835}, "P99 latency (ms)": {"0": 1586.6179404547438}}, "throughput": {"Test name": {"0": "throughput_llama8B_tp1"}, "GPU": {"0": "A100-SXM4-80GB"}, "Tput (req/s)": {"0": 11.051051066856987}}, "serving": {"Test name": {"0": "serving_llama8B_tp1_sharegpt_qps_1", "1": "serving_llama8B_tp1_sharegpt_qps_4", "2": "serving_llama8B_tp1_sharegpt_qps_16", "3": "serving_llama8B_tp1_sharegpt_qps_inf"}, "GPU": {"0": "A100-SXM4-80GB", "1": "A100-SXM4-80GB", "2": "A100-SXM4-80GB", "3": "A100-SXM4-80GB"}, "Tput (req/s)": {"0": 0.9966750028890998, "1": 3.589799365387123, "2": 8.669303557011263, "3": 11.230140465962256}, "Mean TTFT (ms)": {"0": 45.24345211684704, "1": 50.66031151684001, "2": 88.77658500452526, "3": 2351.383146782173}, "Median TTFT (ms)": {"0": 40.011718519963324, "1": 43.87998208403587, "2": 75.62712801154703, "3": 2311.106847017072}, "P99 TTFT (ms)": {"0": 84.28291223477572, "1": 103.50417125970122, "2": 290.0784279988145, "3": 4524.917814568616}, "Mean ITL (ms)": {"0": 12.483403929454695, "1": 14.333410171630511, "2": 22.819252189808246, "3": 25.58949835774591}, "Median ITL (ms)": {"0": 12.023237068206072, "1": 13.043780578300357, "2": 19.497641013003886, "3": 22.116736974567175}, "P99 ITL (ms)": {"0": 36.91042453050618, "1": 39.37038655159996, "2": 47.002061328385025, "3": 47.90773662971333}}}
```
You can also check the raw experiment data in the Artifact tab of the Buildkite page.