Propagate dtype to final benchmark result #84

linamy85 · 2026-01-23T07:01:15Z

This change propagates dtype from the benchmark function arguments to the final reported results. It affects benchmarks that were missing this metadata, including:

gemm_multiple_run
inference_add
inference_rmsnorm
inference_silu_mul
inference_sigmoid
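
As a rough illustration of the idea (not the code in this PR), propagating the dtype can be as simple as copying it from the benchmark parameters into the metrics record when the record does not already carry it. The helper names `dtype_name` and `attach_dtype`, the `step_time_ms_median` key, and the stand-in `float4_e2m1fn` class are all hypothetical; the sketch also assumes the real dtype object exposes `__name__`.

```python
from typing import Any, Dict


def dtype_name(dtype: Any) -> str:
    """Return a short, printable name for a dtype-like object."""
    # Assumption: dtype classes expose __name__; otherwise fall back to str().
    return getattr(dtype, "__name__", str(dtype))


def attach_dtype(params: Dict[str, Any], metrics: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `metrics` with the dtype from `params` added, if missing."""
    result = dict(metrics)  # do not mutate the caller's dict
    if "dtype" in params and "dtype" not in result:
        result["dtype"] = dtype_name(params["dtype"])
    return result


class float4_e2m1fn:  # hypothetical stand-in for the real dtype class
    pass


params = {"m": 16384, "k": 18432, "n": 16384, "num_runs": 100, "dtype": float4_e2m1fn}
metrics = {"step_time_ms_median": 16.36}  # hypothetical metric key
print(attach_dtype(params, metrics))
```

Returning a copy keeps the original metrics untouched, so a benchmark that already reports its own dtype is left as-is.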

chishuen

LGTM

junjieqian · 2026-01-23T19:32:26Z

Hi @linamy85, it seems the final report still does not include the data type for gemm_multiple_run. Could you check again? Thank you.

linamy85 · 2026-01-24T09:11:44Z

Hi @junjieqian, I can see the dtype in the TSV output when running the following kube config. Could you share the YAML you were using?

apiVersion: v1
kind: Pod
metadata:
  name: microbenchmark
spec:
  restartPolicy: Never
  nodeSelector:
    cloud.google.com/gke-tpu-accelerator: tpu7x
    cloud.google.com/gke-tpu-topology: 2x2x1
  containers:
  - name: tpu-job
    image: python:3.12
    ports:
    - containerPort: 8431
    securityContext:
      privileged: false
    command:
    - bash
    - -c
    - |
      set -ex

      git clone https://github.com/AI-Hypercomputer/accelerator-microbenchmarks.git
      cd accelerator-microbenchmarks
      pip install -r requirements.txt

      python3 Ironwood/src/run_benchmark.py --config=Ironwood/configs/training/gemm_multiple_run.yaml

      sleep 36000

    resources:
      requests:
        google.com/tpu: 4
      limits:
        google.com/tpu: 4
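
One way to confirm that the dtype reached the TSV after a run like this is to read the file and look for the column. This is a minimal sketch using only the standard library; the column name `dtype` and the sample header are assumptions, since the actual TSV layout is not shown in this thread.

```python
import csv
import io
from typing import List


def dtypes_in_tsv(tsv_text: str, column: str = "dtype") -> List[str]:
    """Return the values of `column` for every row, or [] if the column is absent."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    if reader.fieldnames is None or column not in reader.fieldnames:
        return []
    return [row[column] for row in reader]


# Illustrative TSV content only; the real header names may differ.
sample = "m\tk\tn\tdtype\n16384\t18432\t16384\tfloat4_e2m1fn\n"
print(dtypes_in_tsv(sample))
```

For a file on disk, the same function can be fed the result of `open(path).read()`.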

junjieqian · 2026-01-25T06:14:50Z

Hi @linamy85, thanks for checking this! We did not look at the CSV file, only at the stdout logs, which do not include the dtype.
Would you mind adding it to the printed log as well?
Thanks

linamy85 · 2026-01-26T02:58:04Z

@junjieqian To make sure we're aligned, would a [float4_e2m1fn] prefix on the following log line help?

[float4_e2m1fn] Total floating-point ops: 9895604649984, Step Time (median): 16.36, Throughput (median): 605.00 TFLOP / second / device, TotalThroughput (median): 4840.01 TFLOP / second, MFU: 52.45%

For result gathering, I feel it would be easier to rely on the final TSV file, though I know that is quite difficult at the moment due to the lack of GCS support.
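
For concreteness, the proposed prefix could be produced by a formatter along these lines. This is a sketch only: `format_summary` and its parameter names are hypothetical, and the field layout is copied from the single log line quoted above rather than from the repository's code.

```python
def format_summary(
    dtype: str,
    total_flops: int,
    step_time_median: float,
    tflops_per_device: float,
    total_tflops: float,
    mfu_percent: float,
) -> str:
    """Build the metrics summary line with a leading [dtype] tag."""
    return (
        f"[{dtype}] Total floating-point ops: {total_flops}, "
        f"Step Time (median): {step_time_median:.2f}, "
        f"Throughput (median): {tflops_per_device:.2f} TFLOP / second / device, "
        f"TotalThroughput (median): {total_tflops:.2f} TFLOP / second, "
        f"MFU: {mfu_percent:.2f}%"
    )


# Values taken from the example line in this comment.
print(format_summary("float4_e2m1fn", 9895604649984, 16.36, 605.00, 4840.01, 52.45))
```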

log for single run

==============================Starting benchmark 'gemm_multiple_run'==============================

Running benchmark: gemm_multiple_run with params: {'m': 16384, 'k': 18432, 'n': 16384, 'num_runs': 100, 'dtype': <class 'jax.numpy.float4_e2m1fn'>, 'trace_dir': '../microbenchmarks/gemm_multiple_run_fp4/benchmark_0'}
Running gemm_multiple_run benchmark 100
[gemm_multiple_run] Running iteration 0 of 100 with float4_e2m1fn_16384x16384x18432...
[gemm_multiple_run] Running iteration 10 of 100 with float4_e2m1fn_16384x16384x18432...
[gemm_multiple_run] Running iteration 20 of 100 with float4_e2m1fn_16384x16384x18432...
[gemm_multiple_run] Running iteration 30 of 100 with float4_e2m1fn_16384x16384x18432...
[gemm_multiple_run] Running iteration 40 of 100 with float4_e2m1fn_16384x16384x18432...
[gemm_multiple_run] Running iteration 50 of 100 with float4_e2m1fn_16384x16384x18432...
[gemm_multiple_run] Running iteration 60 of 100 with float4_e2m1fn_16384x16384x18432...
[gemm_multiple_run] Running iteration 70 of 100 with float4_e2m1fn_16384x16384x18432...
[gemm_multiple_run] Running iteration 80 of 100 with float4_e2m1fn_16384x16384x18432...
[gemm_multiple_run] Running iteration 90 of 100 with float4_e2m1fn_16384x16384x18432...
Unique PIDs: {3, 4, 37, 38, 20, 21, 54, 55}
Collected 100 events from trace for pid 3.
[16.357372149, 16.356966387, 16.356578631, 16.356494598, 16.356966387, 16.356685474, 16.356817527, 16.356793517, 16.356641056, 16.356123649, 16.356306122, 16.356777911, 16.356222089, 16.356584634, 16.35657503, 16.356619448, 16.356470588, 16.356492197, 16.356786315, 16.356092437, 16.356364946, 16.356726291, 16.356442977, 16.356752701, 16.356268908, 16.35607443, 16.356295318, 16.356204082, 16.356129652, 16.355857143, 16.356102041, 16.356626651, 16.355811525, 16.356255702, 16.356326531, 16.356422569, 16.35630012, 16.356361345, 16.356201681, 16.356086435, 16.356452581, 16.357060024, 16.356105642, 16.356382953, 16.35652461, 16.356328932, 16.356165666, 16.355831933, 16.356129652, 16.356297719, 16.356169268, 16.356297719, 16.356229292, 16.35630012, 16.356343337, 16.355992797, 16.356163265, 16.355931573, 16.356142857, 16.356313325, 16.356333733, 16.356342137, 16.356271309, 16.356060024, 16.35622569, 16.356398559, 16.356909964, 16.356506603, 16.356357743, 16.356567827, 16.35609964, 16.356482593, 16.356405762, 16.356771909, 16.356645858, 16.356792317, 16.35629892, 16.356187275, 16.356336134, 16.356256903, 16.356228091, 16.356009604, 16.356192077, 16.356626651, 16.356704682, 16.356114046, 16.356420168, 16.356164466, 16.356243697, 16.35644898, 16.356571429, 16.356236495, 16.356433373, 16.356410564, 16.356590636, 16.356092437, 16.356990396, 16.356142857, 16.356846339, 16.356685474]
The XLA dump is stored in ../microbenchmarks/gemm_multiple_run_fp4/hlo_graphs
Could not find replica_groups in ../microbenchmarks/gemm_multiple_run_fp4/hlo_graphs/gemm_multiple_run_m_16384_k_18432_n_16384_num_runs_100_dtype_float4.after_optimizations.txt.
[float4_e2m1fn] Total floating-point ops: 9895604649984, Step Time (median): 16.36, Throughput (median): 605.00 TFLOP / second / device, TotalThroughput (median): 4840.01 TFLOP / second, MFU: 52.45%
Writing metrics to JSONL file: ../microbenchmarks/gemm_multiple_run_fp4/metrics_report.jsonl
Metrics written to CSV at ../microbenchmarks/gemm_multiple_run_fp4/t_gemm_multiple_run_U3GEQ267RK.tsv.
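
For readers who consume the stdout log rather than the TSV, the summary line can be parsed back into fields. The regex below is an assumption derived from the single summary line in this log, not a documented format, and `parse_summary` is a hypothetical helper. The last lines also check that the reported operation count agrees with the usual 2·m·k·n count for a dense matrix multiply, using the m, k, n values printed in the log.

```python
import re
from typing import Dict, Optional

# Pattern inferred from the one summary line shown in the log above.
SUMMARY_RE = re.compile(
    r"^\[(?P<dtype>[^\]]+)\] Total floating-point ops: (?P<flops>\d+), "
    r"Step Time \(median\): (?P<step>[\d.]+), "
    r"Throughput \(median\): (?P<tput>[\d.]+) TFLOP / second / device, "
    r"TotalThroughput \(median\): (?P<total>[\d.]+) TFLOP / second, "
    r"MFU: (?P<mfu>[\d.]+)%$"
)


def parse_summary(line: str) -> Optional[Dict[str, object]]:
    """Extract dtype and metrics from a summary line, or None if it does not match."""
    match = SUMMARY_RE.match(line.strip())
    if match is None:
        return None
    return {
        "dtype": match["dtype"],
        "total_flops": int(match["flops"]),
        "step_time_median": float(match["step"]),
        "tflops_per_device": float(match["tput"]),
        "total_tflops": float(match["total"]),
        "mfu_percent": float(match["mfu"]),
    }


log_line = (
    "[float4_e2m1fn] Total floating-point ops: 9895604649984, "
    "Step Time (median): 16.36, Throughput (median): 605.00 TFLOP / second / device, "
    "TotalThroughput (median): 4840.01 TFLOP / second, MFU: 52.45%"
)
parsed = parse_summary(log_line)
print(parsed)

# Cross-check against the benchmark parameters printed in the log.
m, k, n = 16384, 18432, 16384
print(parsed["total_flops"] == 2 * m * k * n)
```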

As requested in [another PR](#84 (comment)) for easier result inspection.

linamy85 requested review from chishuen and hylin2002 January 23, 2026 07:01

Propagate dtype to final benchmark result

7df8f7d

linamy85 force-pushed the fix/propagate-dtype branch from 7f0fbb4 to 7df8f7d Compare January 23, 2026 07:20

chishuen approved these changes Jan 23, 2026

linamy85 merged commit 2c847e8 into AI-Hypercomputer:main Jan 23, 2026
2 checks passed

linamy85 deleted the fix/propagate-dtype branch January 23, 2026 14:36

linamy85 mentioned this pull request Jan 26, 2026

Print dtype in final aggregated result #89

Merged

linamy85 added a commit that referenced this pull request Jan 26, 2026

Print dtype in metrics stage (#89)

5564ea5

As requested in [another PR](#84 (comment)) for easier result inspection.
