Added MLP test for baseline and VEGETA engine. #316

Open
tandonmitul27 wants to merge 2 commits into vortexgpgpu:veg from tandonmitul27:veg

Conversation

@tandonmitul27

No description provided.

Copilot AI review requested due to automatic review settings February 4, 2026 18:41

Copilot AI left a comment


Pull request overview

Adds new regression tests for a 4-layer MLP, including a baseline implementation and a VEGETA (tile GEMM) implementation with optional sparse modes.

Changes:

  • Added tests/regression/mlp host+kernel for baseline MLP inference with CPU reference verification
  • Added tests/regression/mlp_vegeta host+kernel for batched VEGETA MLP inference, including (intended) 2:4 and 1:4 weight compression paths
  • Added per-test common.h and Makefile scaffolding for both tests

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 11 comments.

  • tests/regression/mlp/main.cpp: Baseline MLP host-side setup, device execution, CPU reference, and verification
  • tests/regression/mlp/kernel.cpp: Baseline MLP device kernel (FC layers + softmax)
  • tests/regression/mlp/common.h: Baseline MLP shared constants and argument structs
  • tests/regression/mlp/Makefile: Build rules for baseline MLP regression
  • tests/regression/mlp_vegeta/main.cpp: VEGETA MLP host-side setup, (intended) sparsity compression, device execution, CPU reference
  • tests/regression/mlp_vegeta/kernel.cpp: VEGETA tile-GEMM based MLP kernel with TGEMM/UGEMM/VGEMM paths
  • tests/regression/mlp_vegeta/common.h: VEGETA MLP shared constants (tile sizes, modes) and argument structs
  • tests/regression/mlp_vegeta/Makefile: Build rules for VEGETA MLP regression


Comment on lines +18 to +24
// Layer configuration structure
typedef struct {
  uint32_t input_dim;
  uint32_t output_dim;
  uint64_t weights_addr; // Weight matrix: output_dim x input_dim
  uint64_t bias_addr;    // Bias vector: output_dim
} layer_config_t;

Copilot AI Feb 4, 2026


This header uses uint32_t/uint64_t but doesn't include <stdint.h>/<cstdint>. Relying on transitive includes makes builds fragile; include the standard integer header directly from common.h.

Comment on lines 128 to 137
// Store metadata (simplified - in real impl would be tile-based)
int tile_in = i / TILE_SIZE;
int tile_out = j / TILE_SIZE;
int meta_offset = (tile_in * out_tiles + tile_out) * M_TILE_BYTES;
int row_in_tile = (i % TILE_SIZE) / 4;
int col_in_tile = j % TILE_SIZE;
int byte_offset = meta_offset + row_in_tile * TILE_SIZE + col_in_tile;
if (byte_offset < (int)metadata.size()) {
  metadata[byte_offset] = mask;
}

Copilot AI Feb 4, 2026


The metadata layout written here (row_in_tile * TILE_SIZE + col_in_tile) does not match the 128-byte 16×16 nibble format used by other VEGETA sparse tests (8 bytes per row, 2 nibbles per byte; see tests/regression/sgemm_tile/main.cpp:124-132). As written, only 64 bytes/tile can be addressed (row_in_tile is 0..3) and no nibble packing is performed, so the device will interpret masks incorrectly.


for (uint32_t row = 0; row < BATCH_SIZE; ++row) {
  for (uint32_t col = 0; col < TILE_SIZE; ++col) {
    output[row * out_dim + out_t * TILE_SIZE + col] =

Copilot AI Feb 4, 2026


These scalar stores write g_C_tile into output in plain row-major order. However, subsequent layers pass layer_output back into VEGETA GEMM paths where tiles are loaded via vx_lt, which expects activations laid out as contiguous 16×16 tiles. Without a repack step (or storing the result in tile-major layout), later layers will load incorrect activation tiles.

Suggested change
-    output[row * out_dim + out_t * TILE_SIZE + col] =
+    uint32_t tile_index = out_t * TILE_SIZE * TILE_SIZE + row * TILE_SIZE + col;
+    output[tile_index] =

Comment on lines 51 to 67
// Thread function: compute exp in parallel, thread 0 sums
void softmax_exp_sum_thread(softmax_args_t* __UNIFORM__ args) {
  uint32_t i = blockIdx.x;

  // All threads compute their exp value
  if (i < args->size) {
    args->data[i] = expf(args->data[i] - *(args->max_val));
  }

  // Thread 0 does the sum reduction serially
  if (i == 0) {
    TYPE sum_exp = 0.0f;
    for (uint32_t j = 0; j < args->size; ++j) {
      sum_exp += args->data[j];
    }
    *(args->sum_exp) = sum_exp;
  }

Copilot AI Feb 4, 2026


softmax_exp_sum_thread computes exp for one element per work-item (indexed by blockIdx.x) and then work-item 0 immediately sums args->data[j] without any cross-work-item synchronization. With vx_spawn_threads(..., grid_dim=&size, block_dim=nullptr), these work-items run in parallel and there is no ordering guarantee, so the sum can read uninitialized/stale values. Split into two launches (one to compute exp in parallel, then a single-thread launch to sum), or launch with a non-NULL block_dim and use __syncthreads() for proper synchronization.

for (int k = 0; k < 2; ++k) {
  int pos = vals[k].second;
  mask |= (1 << pos);
  compressed[(comp_idx / 2) * out_dim + j] = dense[(i + pos) * out_dim + j];

Copilot AI Feb 4, 2026


In compress_2_4_sparse, the compressed write index uses (comp_idx / 2) * out_dim + j while comp_idx is incremented for each kept value. This causes the two kept values from the same 4-element group to be written to the same slot (second overwrites the first), corrupting the compressed weights.

Suggested change
-  compressed[(comp_idx / 2) * out_dim + j] = dense[(i + pos) * out_dim + j];
+  compressed[comp_idx * out_dim + j] = dense[(i + pos) * out_dim + j];

Comment on lines 171 to 180
// Store metadata
uint8_t mask = (1 << max_pos);
int tile_in = i / TILE_SIZE;
int tile_out = j / TILE_SIZE;
int meta_offset = (tile_in * out_tiles + tile_out) * M_TILE_BYTES;
int row_in_tile = (i % TILE_SIZE) / 4;
int col_in_tile = j % TILE_SIZE;
int byte_offset = meta_offset + row_in_tile * TILE_SIZE + col_in_tile;
if (byte_offset < (int)metadata.size()) {
  metadata[byte_offset] = mask;

Copilot AI Feb 4, 2026


The 1:4 compression metadata is written using the same row_in_tile * TILE_SIZE + col_in_tile byte addressing scheme as 2:4. This doesn't match the expected 128-byte packed-nibble format (8 bytes/row, two 4-bit masks per byte), so the VGEMM path will consume incorrect masks.

Suggested change
-// Store metadata
-uint8_t mask = (1 << max_pos);
-int tile_in = i / TILE_SIZE;
-int tile_out = j / TILE_SIZE;
-int meta_offset = (tile_in * out_tiles + tile_out) * M_TILE_BYTES;
-int row_in_tile = (i % TILE_SIZE) / 4;
-int col_in_tile = j % TILE_SIZE;
-int byte_offset = meta_offset + row_in_tile * TILE_SIZE + col_in_tile;
-if (byte_offset < (int)metadata.size()) {
-  metadata[byte_offset] = mask;
+// Store metadata using 128-byte packed-nibble format:
+// - 8 bytes per row
+// - each byte holds two 4-bit masks (low/high nibble) for a pair of columns
+uint8_t mask = (1 << max_pos) & 0x0F;
+int tile_in = i / TILE_SIZE;
+int tile_out = j / TILE_SIZE;
+int meta_offset = (tile_in * out_tiles + tile_out) * M_TILE_BYTES;
+int row_in_tile = i % TILE_SIZE;
+int col_in_tile = j % TILE_SIZE;
+int col_pair = col_in_tile / 2;
+int byte_offset = meta_offset + row_in_tile * 8 + col_pair;
+if (byte_offset < (int)metadata.size()) {
+  uint8_t &entry = metadata[byte_offset];
+  if (col_in_tile & 1) {
+    // odd column: store in high nibble
+    entry = (uint8_t)((entry & 0x0F) | (mask << 4));
+  } else {
+    // even column: store in low nibble
+    entry = (uint8_t)((entry & 0xF0) | mask);
+  }

Comment on lines 69 to 72
for (uint32_t in_t = 0; in_t < in_tiles; ++in_t) {
  TYPE* A_tile = input + in_t * TILE_SIZE;
  TYPE* B_tile = weights + in_t * TILE_SIZE * out_dim + out_t * TILE_SIZE;
  vegeta_tgemm_tile_accumulate(A_tile, B_tile, g_C_tile);

Copilot AI Feb 4, 2026


vx_lt/vx_lu/vx_lv load a contiguous 1KB (or larger) tile from memory (see kernel/include/vx_intrinsics.h:288-293). The pointers computed here (input + in_t * TILE_SIZE, weights + in_t * TILE_SIZE * out_dim + ...) assume plain row-major matrices with strides, so each tile load will read unrelated data rather than a 16×16 packed tile. Either pack activations/weights into a tile-major contiguous layout on the host (and keep intermediates in that layout), or change the kernel to explicitly gather/scatter into contiguous tile buffers before issuing VEGETA ops.

Comment on lines 90 to 93
vx_lt(1, (size_t)A_tile, 0); // Load A to T1
vx_lm(1, (size_t)M_tile, 0); // Load metadata to M1
vx_lu(2, (size_t)B_tile, 0); // Load B (2KB) to U2
vx_ugemm(0, 1, 2); // T0 += T1 × U2 (with M1 metadata)

Copilot AI Feb 4, 2026


In vegeta_ugemm_tile_accumulate, metadata is loaded into M1 and then vx_ugemm is invoked. Per vx_intrinsics.h, UGEMM/VGEMM interpret the first operand tile register as sparse using the metadata. Here the first operand is loaded from A_tile (the activation/input), but the metadata being passed corresponds to the weights. Unless activations are also compressed in the same sparse format (they aren't in the host code), this will produce incorrect results. To use sparse weights, the sparse operand should be the weights tile (and dimensions/layout adjusted accordingly).

Suggested change
-vx_lt(1, (size_t)A_tile, 0); // Load A to T1
-vx_lm(1, (size_t)M_tile, 0); // Load metadata to M1
-vx_lu(2, (size_t)B_tile, 0); // Load B (2KB) to U2
-vx_ugemm(0, 1, 2); // T0 += T1 × U2 (with M1 metadata)
+vx_lt(1, (size_t)B_tile, 0); // Load sparse weights B to T1
+vx_lm(1, (size_t)M_tile, 0); // Load metadata for weights into M1
+vx_lu(2, (size_t)A_tile, 0); // Load dense activations A to U2
+vx_ugemm(0, 1, 2); // T0 += T1 (sparse weights) × U2 (dense activations) with M1 metadata

Comment on lines 131 to 134
vx_lt(1, (size_t)A_tile, 0); // Load A to T1
vx_lm(1, (size_t)M_tile, 0); // Load metadata to M1
vx_lv(1, (size_t)B_tile, 0); // Load B (4KB) to V1
vx_vgemm(0, 1, 1); // T0 += T1 × V1 (with M1 metadata)

Copilot AI Feb 4, 2026


Same issue as UGEMM: vx_vgemm treats the first operand (loaded from A_tile) as sparse using the metadata in M1, but A_tile is taken from activations while the metadata is generated for weights. This mismatch will corrupt the multiply unless the kernel/host are changed so the sparse operand is the compressed weights tile.

Suggested change
-vx_lt(1, (size_t)A_tile, 0); // Load A to T1
-vx_lm(1, (size_t)M_tile, 0); // Load metadata to M1
-vx_lv(1, (size_t)B_tile, 0); // Load B (4KB) to V1
-vx_vgemm(0, 1, 1); // T0 += T1 × V1 (with M1 metadata)
+vx_lt(1, (size_t)B_tile, 0); // Load compressed weights (sparse) to T1
+vx_lm(1, (size_t)M_tile, 0); // Load metadata for sparse weights to M1
+vx_lv(1, (size_t)A_tile, 0); // Load activations (dense) to V1
+vx_vgemm(0, 1, 1); // T0 += T1 × V1 (with M1 metadata applied to T1)

Comment on lines +368 to +376
// Upload weights (compressed for sparse modes, dense for TGEMM)
std::vector<TYPE>& weights_to_upload =
(mlp_mode == MLP_MODE_TGEMM) ? h_weights_dense[layer] : h_weights_compressed[layer];

size_t weight_size = weights_to_upload.size() * sizeof(TYPE);
RT_CHECK(vx_mem_alloc(device, weight_size, VX_MEM_READ, &weight_buffers[layer]));
RT_CHECK(vx_mem_address(weight_buffers[layer], &layer_configs[layer].weights_addr));
RT_CHECK(vx_copy_to_dev(weight_buffers[layer], weights_to_upload.data(), 0, weight_size));


Copilot AI Feb 4, 2026


weights_to_upload is uploaded as a flat row-major matrix (dense[i*out_dim+j] or similarly for compressed). The VEGETA kernel uses vx_lt/vx_lu/vx_lv, which load contiguous tiles, so weights (and activations) generally need to be packed into tile-major contiguous 16×16 blocks (and for U/V, packed exactly as the instruction expects). As-is, the device kernel will read incorrect tiles from these buffers.


Copilot AI left a comment


Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 8 comments.



Comment on lines +312 to +324
// Print outputs
// std::cout << "Device output: ";
// for (int i = 0; i < OUTPUT_DIM; ++i) {
// std::cout << h_output[i] << " ";
// }
// std::cout << std::endl;

// std::cout << "CPU reference: ";
// for (int i = 0; i < OUTPUT_DIM; ++i) {
// std::cout << h_ref_output[i] << " ";
// }
// std::cout << std::endl;


Copilot AI Feb 5, 2026


Commented-out debug code should be removed before merging to keep the codebase clean.

Suggested change
-// Print outputs
-// std::cout << "Device output: ";
-// for (int i = 0; i < OUTPUT_DIM; ++i) {
-//   std::cout << h_output[i] << " ";
-// }
-// std::cout << std::endl;
-
-// std::cout << "CPU reference: ";
-// for (int i = 0; i < OUTPUT_DIM; ++i) {
-//   std::cout << h_ref_output[i] << " ";
-// }
-// std::cout << std::endl;
