JustPIC.jl CI build — https://github.com/JuliaGeodynamics/JustPIC.jl
Commit: add MarkerChain conversion function
Status: Failed in 41m 30s

Latency tests

  • Input length: 32 tokens.
  • Output length: 128 tokens.
  • Batch size: fixed (8).
  • Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
  • Evaluation metrics: end-to-end latency (mean, median, p99).
| Test name | GPU | Mean latency (ms) | Median latency (ms) | P99 latency (ms) |
|---|---|---|---|---|
| latency_llama8B_tp1 | A100-SXM4-80GB | 1585.43 | 1585.30 | 1586.62 |
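The mean/median/p99 aggregates reported above can be computed from raw per-request latency samples. A minimal sketch using only the standard library (the helper name and the nearest-rank p99 convention are illustrative assumptions, not taken from the benchmark harness):

```python
import statistics

def summarize_latencies(samples_ms):
    """Return (mean, median, p99) of a list of latency samples in ms.

    Illustrative helper: uses the nearest-rank method for the p99,
    i.e. the smallest sample that at least 99% of samples do not exceed.
    """
    ordered = sorted(samples_ms)
    rank = max(0, int(round(0.99 * len(ordered))) - 1)
    return (
        statistics.mean(ordered),
        statistics.median(ordered),
        ordered[rank],
    )

mean_ms, median_ms, p99_ms = summarize_latencies([1585.1, 1585.3, 1585.5, 1586.7])
```

With only a handful of samples, mean, median, and p99 land very close together, which matches the tight spread in the table above.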

Throughput tests

  • Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
  • Output length: the output lengths paired with those 200 prompts.
  • Batch size: dynamically determined by vLLM to achieve maximum throughput.
  • Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
  • Evaluation metrics: throughput.
| Test name | GPU | Tput (req/s) |
|---|---|---|
| throughput_llama8B_tp1 | A100-SXM4-80GB | 11.0511 |
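Offline throughput here is simply completed requests divided by elapsed wall-clock time. A trivial sketch (the function name and the example elapsed time are illustrative, not from the benchmark code):

```python
def requests_per_second(num_completed, elapsed_s):
    """Offline throughput: completed requests over total wall-clock seconds."""
    return num_completed / elapsed_s

# Hypothetical example: 200 ShareGPT prompts finishing in ~18.1 s
# corresponds to roughly the ~11 req/s reported in the table above.
tput = requests_per_second(200, 18.1)
```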

Serving tests

  • Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
  • Output length: the output lengths paired with those 200 prompts.
  • Batch size: dynamically determined by vLLM and the arrival pattern of the requests.
  • Average QPS (queries per second): 1, 4, 16 and inf. QPS = inf means all requests arrive at once; for the other QPS values, each request's arrival time is drawn from a Poisson process (with a fixed random seed).
  • Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
  • We also added a speculative decoding test for llama-3 70B under QPS 2.
  • Evaluation metrics: throughput, TTFT (time to first token; mean, median and p99), ITL (inter-token latency; mean, median and p99).
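TTFT and ITL can be derived from per-token emission timestamps: TTFT is the gap from request submission to the first token, and ITL is the set of gaps between consecutive tokens. A hypothetical sketch (the function and timestamp names are illustrative, not vLLM's internals):

```python
def ttft_and_itl(request_start, token_times):
    """Derive serving metrics from timestamps (all in seconds).

    TTFT: delay from request submission to the first emitted token.
    ITL:  list of gaps between consecutive emitted tokens.
    """
    ttft = token_times[0] - request_start
    itl = [b - a for a, b in zip(token_times, token_times[1:])]
    return ttft, itl

# A request submitted at t=0 whose tokens arrive at 45 ms, 57 ms, 70 ms:
ttft, itl = ttft_and_itl(0.0, [0.045, 0.057, 0.070])
```

The benchmark then aggregates these per-request values into the mean/median/p99 columns shown below.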
| Test name | GPU | Tput (req/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean ITL (ms) | Median ITL (ms) | P99 ITL (ms) |
|---|---|---|---|---|---|---|---|---|
| serving_llama8B_tp1_sharegpt_qps_1 | A100-SXM4-80GB | 0.996675 | 45.2435 | 40.0117 | 84.2829 | 12.4834 | 12.0232 | 36.9104 |
| serving_llama8B_tp1_sharegpt_qps_4 | A100-SXM4-80GB | 3.5898 | 50.6603 | 43.88 | 103.504 | 14.3334 | 13.0438 | 39.3704 |
| serving_llama8B_tp1_sharegpt_qps_16 | A100-SXM4-80GB | 8.6693 | 88.7766 | 75.6271 | 290.078 | 22.8193 | 19.4976 | 47.0021 |
| serving_llama8B_tp1_sharegpt_qps_inf | A100-SXM4-80GB | 11.2301 | 2351.38 | 2311.11 | 4524.92 | 25.5895 | 22.1167 | 47.9077 |
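The Poisson arrival pattern used in these serving tests can be sketched by drawing exponential inter-arrival gaps at rate QPS; with a fixed seed the schedule is reproducible across runs. This is an illustrative sketch, not vLLM's actual request generator:

```python
import random

def poisson_arrival_times(num_requests, qps, seed=0):
    """Cumulative arrival times (s) for a Poisson process with rate `qps`.

    qps = float('inf') models the "all requests arrive at once" case.
    Illustrative helper, not taken from the benchmark code.
    """
    if qps == float("inf"):
        return [0.0] * num_requests
    rng = random.Random(seed)  # fixed seed -> reproducible schedule
    t, times = 0.0, []
    for _ in range(num_requests):
        t += rng.expovariate(qps)  # exponential gaps -> Poisson arrivals
        times.append(t)
    return times

arrivals = poisson_arrival_times(200, qps=4, seed=0)
```

At QPS 4, the 200 requests spread over roughly 50 seconds on average, which is why per-request latencies stay low until the QPS = inf case saturates the server.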

JSON version of the benchmarking tables

This section contains the data of the markdown tables above in JSON format. You can load the benchmarking tables into pandas DataFrames as follows:

```python
import json

import pandas as pd

# Paste the JSON string given below in place of the placeholder.
benchmarking_results_json = """The json string"""
benchmarking_results = json.loads(benchmarking_results_json)

# Each top-level key holds one table as a column-oriented dict,
# which DataFrame.from_dict consumes directly.
latency_results = pd.DataFrame.from_dict(benchmarking_results["latency"])
throughput_results = pd.DataFrame.from_dict(benchmarking_results["throughput"])
serving_results = pd.DataFrame.from_dict(benchmarking_results["serving"])
```

The JSON string for all benchmarking tables:

```json
{"latency": {"Test name": {"0": "latency_llama8B_tp1"}, "GPU": {"0": "A100-SXM4-80GB"}, "Mean latency (ms)": {"0": 1585.4302836582065}, "Median latency (ms)": {"0": 1585.2955740410835}, "P99 latency (ms)": {"0": 1586.6179404547438}}, "throughput": {"Test name": {"0": "throughput_llama8B_tp1"}, "GPU": {"0": "A100-SXM4-80GB"}, "Tput (req/s)": {"0": 11.051051066856987}}, "serving": {"Test name": {"0": "serving_llama8B_tp1_sharegpt_qps_1", "1": "serving_llama8B_tp1_sharegpt_qps_4", "2": "serving_llama8B_tp1_sharegpt_qps_16", "3": "serving_llama8B_tp1_sharegpt_qps_inf"}, "GPU": {"0": "A100-SXM4-80GB", "1": "A100-SXM4-80GB", "2": "A100-SXM4-80GB", "3": "A100-SXM4-80GB"}, "Tput (req/s)": {"0": 0.9966750028890998, "1": 3.589799365387123, "2": 8.669303557011263, "3": 11.230140465962256}, "Mean TTFT (ms)": {"0": 45.24345211684704, "1": 50.66031151684001, "2": 88.77658500452526, "3": 2351.383146782173}, "Median TTFT (ms)": {"0": 40.011718519963324, "1": 43.87998208403587, "2": 75.62712801154703, "3": 2311.106847017072}, "P99 TTFT (ms)": {"0": 84.28291223477572, "1": 103.50417125970122, "2": 290.0784279988145, "3": 4524.917814568616}, "Mean ITL (ms)": {"0": 12.483403929454695, "1": 14.333410171630511, "2": 22.819252189808246, "3": 25.58949835774591}, "Median ITL (ms)": {"0": 12.023237068206072, "1": 13.043780578300357, "2": 19.497641013003886, "3": 22.116736974567175}, "P99 ITL (ms)": {"0": 36.91042453050618, "1": 39.37038655159996, "2": 47.002061328385025, "3": 47.90773662971333}}}
```

You can also check the raw experiment data in the Artifact tab of the Buildkite page.

Matrix: CUDA, Julia 1.10
julia -e 'println("--- :julia: Instantiating project") && using Pkg && Pkg.develop(; path=pwd())' || exit 3 && julia -e 'println("+++ :julia: Running tests") && using Pkg && Pkg.test("JustPIC"; test_args=["--backend=CUDA"], coverage=true)'
Waited 10s · Ran in 5m 49s

Matrix: AMDGPU, Julia 1.10
julia -e 'println("--- :julia: Instantiating project") && using Pkg && Pkg.develop(; path=pwd())' || exit 3 && julia -e 'println("+++ :julia: Running tests") && using Pkg && Pkg.test("JustPIC"; test_args=["--backend=AMDGPU"], coverage=true)'
Waited 4s · Ran in 41m 24s

Total job run time: 47m 15s