Added MLP test for baseline and VEGETA engine. #316
tandonmitul27 wants to merge 2 commits into vortexgpgpu:veg from
Conversation
Pull request overview
Adds new regression tests for a 4-layer MLP, including a baseline implementation and a VEGETA (tile GEMM) implementation with optional sparse modes.
Changes:
- Added tests/regression/mlp host + kernel for baseline MLP inference with CPU reference verification
- Added tests/regression/mlp_vegeta host + kernel for batched VEGETA MLP inference, including (intended) 2:4 and 1:4 weight compression paths
- Added per-test common.h and Makefile scaffolding for both tests
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| tests/regression/mlp/main.cpp | Baseline MLP host-side setup, device execution, CPU reference, and verification |
| tests/regression/mlp/kernel.cpp | Baseline MLP device kernel (FC layers + softmax) |
| tests/regression/mlp/common.h | Baseline MLP shared constants and argument structs |
| tests/regression/mlp/Makefile | Build rules for baseline MLP regression |
| tests/regression/mlp_vegeta/main.cpp | VEGETA MLP host-side setup, (intended) sparsity compression, device execution, CPU reference |
| tests/regression/mlp_vegeta/kernel.cpp | VEGETA tile-GEMM based MLP kernel with TGEMM/UGEMM/VGEMM paths |
| tests/regression/mlp_vegeta/common.h | VEGETA MLP shared constants (tile sizes, modes) and argument structs |
| tests/regression/mlp_vegeta/Makefile | Build rules for VEGETA MLP regression |
// Layer configuration structure
typedef struct {
  uint32_t input_dim;
  uint32_t output_dim;
  uint64_t weights_addr; // Weight matrix: output_dim x input_dim
  uint64_t bias_addr;    // Bias vector: output_dim
} layer_config_t;
This header uses uint32_t/uint64_t but doesn't include <stdint.h>/<cstdint>. Relying on transitive includes makes builds fragile; include the standard integer header directly from common.h.
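A minimal sketch of the fix in common.h (include guard name is illustrative; the struct is the one quoted above):

```cpp
// common.h
#ifndef _MLP_COMMON_H_
#define _MLP_COMMON_H_

#include <stdint.h>  // uint32_t / uint64_t used by layer_config_t

// Layer configuration structure
typedef struct {
  uint32_t input_dim;
  uint32_t output_dim;
  uint64_t weights_addr; // Weight matrix: output_dim x input_dim
  uint64_t bias_addr;    // Bias vector: output_dim
} layer_config_t;

#endif
```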
tests/regression/mlp_vegeta/main.cpp
Outdated
// Store metadata (simplified - in real impl would be tile-based)
int tile_in = i / TILE_SIZE;
int tile_out = j / TILE_SIZE;
int meta_offset = (tile_in * out_tiles + tile_out) * M_TILE_BYTES;
int row_in_tile = (i % TILE_SIZE) / 4;
int col_in_tile = j % TILE_SIZE;
int byte_offset = meta_offset + row_in_tile * TILE_SIZE + col_in_tile;
if (byte_offset < (int)metadata.size()) {
  metadata[byte_offset] = mask;
}
The metadata layout written here (row_in_tile * TILE_SIZE + col_in_tile) does not match the 128-byte 16×16 nibble format used by other VEGETA sparse tests (8 bytes per row, 2 nibbles per byte; see tests/regression/sgemm_tile/main.cpp:124-132). As written, only 64 bytes/tile can be addressed (row_in_tile is 0..3) and no nibble packing is performed, so the device will interpret masks incorrectly.
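For reference, a sketch of writing the 2:4 mask in the packed-nibble layout described above (it mirrors the addressing in the 1:4 suggestion further down; variable names follow the quoted snippet, and the exact nibble assignment is an assumption to be checked against the device decoder):

```cpp
// mask has the two kept positions of the 4-element group set (bits 0..3)
uint8_t nib = mask & 0x0F;
int row_in_tile = i % TILE_SIZE;                 // full 0..15 row index, not /4
int col_in_tile = j % TILE_SIZE;
int byte_offset = meta_offset + row_in_tile * 8 + (col_in_tile / 2); // 8 bytes per row
if (byte_offset < (int)metadata.size()) {
  uint8_t& entry = metadata[byte_offset];
  entry = (col_in_tile & 1) ? (uint8_t)((entry & 0x0F) | (nib << 4))  // odd column -> high nibble
                            : (uint8_t)((entry & 0xF0) | nib);        // even column -> low nibble
}
```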
for (uint32_t row = 0; row < BATCH_SIZE; ++row) {
  for (uint32_t col = 0; col < TILE_SIZE; ++col) {
    output[row * out_dim + out_t * TILE_SIZE + col] =
These scalar stores write g_C_tile into output in plain row-major order. However, subsequent layers pass layer_output back into VEGETA GEMM paths where tiles are loaded via vx_lt, which expects activations laid out as contiguous 16×16 tiles. Without a repack step (or storing the result in tile-major layout), later layers will load incorrect activation tiles.
Suggested change:
-     output[row * out_dim + out_t * TILE_SIZE + col] =
+     uint32_t tile_index = out_t * TILE_SIZE * TILE_SIZE + row * TILE_SIZE + col;
+     output[tile_index] =
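If intermediates are kept tile-major as suggested, the final layer's result also has to be unpacked back to row-major on the host before comparing against the CPU reference. A hypothetical helper, assuming cols is a multiple of TILE_SIZE and rows (BATCH_SIZE) fits within one tile, mirroring the tile_index above:

```cpp
void unpack_tiles(const TYPE* tiled, TYPE* rowmajor, uint32_t rows, uint32_t cols) {
  for (uint32_t t = 0; t < cols / TILE_SIZE; ++t) {   // one 16-wide tile column at a time
    for (uint32_t r = 0; r < rows; ++r) {
      for (uint32_t c = 0; c < TILE_SIZE; ++c) {
        rowmajor[r * cols + t * TILE_SIZE + c] =
            tiled[t * TILE_SIZE * TILE_SIZE + r * TILE_SIZE + c];
      }
    }
  }
}
```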
tests/regression/mlp/kernel.cpp
Outdated
// Thread function: compute exp in parallel, thread 0 sums
void softmax_exp_sum_thread(softmax_args_t* __UNIFORM__ args) {
  uint32_t i = blockIdx.x;

  // All threads compute their exp value
  if (i < args->size) {
    args->data[i] = expf(args->data[i] - *(args->max_val));
  }

  // Thread 0 does the sum reduction serially
  if (i == 0) {
    TYPE sum_exp = 0.0f;
    for (uint32_t j = 0; j < args->size; ++j) {
      sum_exp += args->data[j];
    }
    *(args->sum_exp) = sum_exp;
  }
softmax_exp_sum_thread computes exp for one element per work-item (indexed by blockIdx.x) and then work-item 0 immediately sums args->data[j] without any cross-work-item synchronization. With vx_spawn_threads(..., grid_dim=&size, block_dim=nullptr), these work-items run in parallel and there is no ordering guarantee, so the sum can read uninitialized/stale values. Split into two launches (one to compute exp in parallel, then a single-thread launch to sum), or launch with a non-NULL block_dim and use __syncthreads() for proper synchronization.
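A sketch of the two-launch split, reusing the names from the quoted snippet; the launch calls are shown as comments and assume the kernel's existing vx_spawn_threads-based sequencing (each launch completes before the next starts):

```cpp
// Launch 1: one work-item per element, no cross-item dependency.
void softmax_exp_thread(softmax_args_t* __UNIFORM__ args) {
  uint32_t i = blockIdx.x;
  if (i < args->size) {
    args->data[i] = expf(args->data[i] - *(args->max_val));
  }
}

// Launch 2: a single work-item sums the already-written exp values.
void softmax_sum_thread(softmax_args_t* __UNIFORM__ args) {
  TYPE sum_exp = 0.0f;
  for (uint32_t j = 0; j < args->size; ++j) {
    sum_exp += args->data[j];
  }
  *(args->sum_exp) = sum_exp;
}

// Caller side (inside the kernel's main flow), mirroring the existing launches:
//   vx_spawn_threads(1, &size, nullptr, (vx_kernel_func_cb)softmax_exp_thread, &args);
//   uint32_t one = 1;
//   vx_spawn_threads(1, &one, nullptr, (vx_kernel_func_cb)softmax_sum_thread, &args);
```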
tests/regression/mlp_vegeta/main.cpp
Outdated
for (int k = 0; k < 2; ++k) {
  int pos = vals[k].second;
  mask |= (1 << pos);
  compressed[(comp_idx / 2) * out_dim + j] = dense[(i + pos) * out_dim + j];
In compress_2_4_sparse, the compressed write index uses (comp_idx / 2) * out_dim + j while comp_idx is incremented for each kept value. This causes the two kept values from the same 4-element group to be written to the same slot (second overwrites the first), corrupting the compressed weights.
Suggested change:
-   compressed[(comp_idx / 2) * out_dim + j] = dense[(i + pos) * out_dim + j];
+   compressed[comp_idx * out_dim + j] = dense[(i + pos) * out_dim + j];
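For illustration (assuming the layout implied by the fix): the two values kept from dense rows 0–3 of a column should land in compressed rows 0 and 1, those from rows 4–7 in rows 2 and 3, and so on, giving a compressed matrix of in_dim/2 rows by out_dim columns. With the /2 in the index, both kept values of a group collapse into the same compressed row.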
tests/regression/mlp_vegeta/main.cpp
Outdated
// Store metadata
uint8_t mask = (1 << max_pos);
int tile_in = i / TILE_SIZE;
int tile_out = j / TILE_SIZE;
int meta_offset = (tile_in * out_tiles + tile_out) * M_TILE_BYTES;
int row_in_tile = (i % TILE_SIZE) / 4;
int col_in_tile = j % TILE_SIZE;
int byte_offset = meta_offset + row_in_tile * TILE_SIZE + col_in_tile;
if (byte_offset < (int)metadata.size()) {
  metadata[byte_offset] = mask;
The 1:4 compression metadata is written using the same row_in_tile * TILE_SIZE + col_in_tile byte addressing scheme as 2:4. This doesn't match the expected 128-byte packed-nibble format (8 bytes/row, two 4-bit masks per byte), so the VGEMM path will consume incorrect masks.
Suggested change:
- // Store metadata
- uint8_t mask = (1 << max_pos);
- int tile_in = i / TILE_SIZE;
- int tile_out = j / TILE_SIZE;
- int meta_offset = (tile_in * out_tiles + tile_out) * M_TILE_BYTES;
- int row_in_tile = (i % TILE_SIZE) / 4;
- int col_in_tile = j % TILE_SIZE;
- int byte_offset = meta_offset + row_in_tile * TILE_SIZE + col_in_tile;
- if (byte_offset < (int)metadata.size()) {
-   metadata[byte_offset] = mask;
+ // Store metadata using 128-byte packed-nibble format:
+ // - 8 bytes per row
+ // - each byte holds two 4-bit masks (low/high nibble) for a pair of columns
+ uint8_t mask = (1 << max_pos) & 0x0F;
+ int tile_in = i / TILE_SIZE;
+ int tile_out = j / TILE_SIZE;
+ int meta_offset = (tile_in * out_tiles + tile_out) * M_TILE_BYTES;
+ int row_in_tile = i % TILE_SIZE;
+ int col_in_tile = j % TILE_SIZE;
+ int col_pair = col_in_tile / 2;
+ int byte_offset = meta_offset + row_in_tile * 8 + col_pair;
+ if (byte_offset < (int)metadata.size()) {
+   uint8_t &entry = metadata[byte_offset];
+   if (col_in_tile & 1) {
+     // odd column: store in high nibble
+     entry = (uint8_t)((entry & 0x0F) | (mask << 4));
+   } else {
+     // even column: store in low nibble
+     entry = (uint8_t)((entry & 0xF0) | mask);
+   }
for (uint32_t in_t = 0; in_t < in_tiles; ++in_t) {
  TYPE* A_tile = input + in_t * TILE_SIZE;
  TYPE* B_tile = weights + in_t * TILE_SIZE * out_dim + out_t * TILE_SIZE;
  vegeta_tgemm_tile_accumulate(A_tile, B_tile, g_C_tile);
vx_lt/vx_lu/vx_lv load a contiguous 1KB (or larger) tile from memory (see kernel/include/vx_intrinsics.h:288-293). The pointers computed here (input + in_t * TILE_SIZE, weights + in_t * TILE_SIZE * out_dim + ...) assume plain row-major matrices with strides, so each tile load will read unrelated data rather than a 16×16 packed tile. Either pack activations/weights into a tile-major contiguous layout on the host (and keep intermediates in that layout), or change the kernel to explicitly gather/scatter into contiguous tile buffers before issuing VEGETA ops.
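A minimal sketch of the second option (kernel-side gather into a contiguous tile buffer before the tile load); the helper name and scratch buffer are hypothetical, and the destination buffer must be private to the work-item issuing the GEMM:

```cpp
// Copy a TILE_SIZE x TILE_SIZE block out of a row-major matrix into a
// contiguous buffer that vx_lt/vx_lu/vx_lv can load in one shot.
// 'ld' is the row-major leading dimension (out_dim for weights, in_dim for activations).
inline void gather_tile(const TYPE* src, uint32_t ld, TYPE* dst) {
  for (uint32_t r = 0; r < TILE_SIZE; ++r) {
    for (uint32_t c = 0; c < TILE_SIZE; ++c) {
      dst[r * TILE_SIZE + c] = src[r * ld + c]; // pack strided rows contiguously
    }
  }
}

// Usage inside the accumulate helpers, before the tile load:
//   gather_tile(B_tile, out_dim, tile_buf);
//   vx_lt(2, (size_t)tile_buf, 0);   // now reads a contiguous 16x16 tile
```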
vx_lt(1, (size_t)A_tile, 0); // Load A to T1
vx_lm(1, (size_t)M_tile, 0); // Load metadata to M1
vx_lu(2, (size_t)B_tile, 0); // Load B (2KB) to U2
vx_ugemm(0, 1, 2);           // T0 += T1 × U2 (with M1 metadata)
In vegeta_ugemm_tile_accumulate, metadata is loaded into M1 and then vx_ugemm is invoked. Per vx_intrinsics.h, UGEMM/VGEMM interpret the first operand tile register as sparse using the metadata. Here the first operand is loaded from A_tile (the activation/input), but the metadata being passed corresponds to the weights. Unless activations are also compressed in the same sparse format (they aren't in the host code), this will produce incorrect results. To use sparse weights, the sparse operand should be the weights tile (and dimensions/layout adjusted accordingly).
Suggested change:
- vx_lt(1, (size_t)A_tile, 0); // Load A to T1
- vx_lm(1, (size_t)M_tile, 0); // Load metadata to M1
- vx_lu(2, (size_t)B_tile, 0); // Load B (2KB) to U2
- vx_ugemm(0, 1, 2);           // T0 += T1 × U2 (with M1 metadata)
+ vx_lt(1, (size_t)B_tile, 0); // Load sparse weights B to T1
+ vx_lm(1, (size_t)M_tile, 0); // Load metadata for weights into M1
+ vx_lu(2, (size_t)A_tile, 0); // Load dense activations A to U2
+ vx_ugemm(0, 1, 2);           // T0 += T1 (sparse weights) × U2 (dense activations) with M1 metadata
vx_lt(1, (size_t)A_tile, 0); // Load A to T1
vx_lm(1, (size_t)M_tile, 0); // Load metadata to M1
vx_lv(1, (size_t)B_tile, 0); // Load B (4KB) to V1
vx_vgemm(0, 1, 1);           // T0 += T1 × V1 (with M1 metadata)
Same issue as UGEMM: vx_vgemm treats the first operand (loaded from A_tile) as sparse using the metadata in M1, but A_tile is taken from activations while the metadata is generated for weights. This mismatch will corrupt the multiply unless the kernel/host are changed so the sparse operand is the compressed weights tile.
Suggested change:
- vx_lt(1, (size_t)A_tile, 0); // Load A to T1
- vx_lm(1, (size_t)M_tile, 0); // Load metadata to M1
- vx_lv(1, (size_t)B_tile, 0); // Load B (4KB) to V1
- vx_vgemm(0, 1, 1);           // T0 += T1 × V1 (with M1 metadata)
+ vx_lt(1, (size_t)B_tile, 0); // Load compressed weights (sparse) to T1
+ vx_lm(1, (size_t)M_tile, 0); // Load metadata for sparse weights to M1
+ vx_lv(1, (size_t)A_tile, 0); // Load activations (dense) to V1
+ vx_vgemm(0, 1, 1);           // T0 += T1 × V1 (with M1 metadata applied to T1)
// Upload weights (compressed for sparse modes, dense for TGEMM)
std::vector<TYPE>& weights_to_upload =
    (mlp_mode == MLP_MODE_TGEMM) ? h_weights_dense[layer] : h_weights_compressed[layer];

size_t weight_size = weights_to_upload.size() * sizeof(TYPE);
RT_CHECK(vx_mem_alloc(device, weight_size, VX_MEM_READ, &weight_buffers[layer]));
RT_CHECK(vx_mem_address(weight_buffers[layer], &layer_configs[layer].weights_addr));
RT_CHECK(vx_copy_to_dev(weight_buffers[layer], weights_to_upload.data(), 0, weight_size));
weights_to_upload is uploaded as a flat row-major matrix (dense[i*out_dim+j] or similarly for compressed). The VEGETA kernel uses vx_lt/vx_lu/vx_lv, which load contiguous tiles, so weights (and activations) generally need to be packed into tile-major contiguous 16×16 blocks (and for U/V, packed exactly as the instruction expects). As-is, the device kernel will read incorrect tiles from these buffers.
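A minimal host-side sketch of packing a row-major matrix into tile-major contiguous 16×16 blocks before upload (assumes rows and cols are multiples of TILE_SIZE; the tile traversal order here is an assumption and must match the order the kernel walks tiles in):

```cpp
std::vector<TYPE> pack_tiles(const std::vector<TYPE>& rowmajor, uint32_t rows, uint32_t cols) {
  std::vector<TYPE> packed(rowmajor.size());
  uint32_t col_tiles = cols / TILE_SIZE;
  for (uint32_t tr = 0; tr < rows / TILE_SIZE; ++tr) {
    for (uint32_t tc = 0; tc < col_tiles; ++tc) {
      // each 16x16 tile becomes one contiguous 256-element block
      TYPE* dst = packed.data() + (tr * col_tiles + tc) * TILE_SIZE * TILE_SIZE;
      for (uint32_t r = 0; r < TILE_SIZE; ++r)
        for (uint32_t c = 0; c < TILE_SIZE; ++c)
          dst[r * TILE_SIZE + c] = rowmajor[(tr * TILE_SIZE + r) * cols + (tc * TILE_SIZE + c)];
    }
  }
  return packed;
}
```

The packed buffer (rather than weights_to_upload directly) would then be passed to vx_copy_to_dev, with the kernel's tile pointers changed to index whole tiles instead of strided row-major offsets.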
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 8 comments.
// Print outputs
// std::cout << "Device output: ";
// for (int i = 0; i < OUTPUT_DIM; ++i) {
//   std::cout << h_output[i] << " ";
// }
// std::cout << std::endl;

// std::cout << "CPU reference: ";
// for (int i = 0; i < OUTPUT_DIM; ++i) {
//   std::cout << h_ref_output[i] << " ";
// }
// std::cout << std::endl;
Commented-out debug code should be removed before merging to keep the codebase clean.
Suggested change (delete the commented-out block):
- // Print outputs
- // std::cout << "Device output: ";
- // for (int i = 0; i < OUTPUT_DIM; ++i) {
- //   std::cout << h_output[i] << " ";
- // }
- // std::cout << std::endl;
- // std::cout << "CPU reference: ";
- // for (int i = 0; i < OUTPUT_DIM; ++i) {
- //   std::cout << h_ref_output[i] << " ";
- // }
- // std::cout << std::endl;
No description provided.