Implement quantization support for the GroupedTensor type with FP8 per-tensor quantization. Modifications needed in the existing kernel (see the sketch below):

- handle multiple amax values, one per member tensor
- ignore padding in the allocation
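
For reference, a minimal standalone sketch of what the kernel change could look like, assuming the group is packed into one contiguous buffer described by per-tensor `offsets` and `lengths` arrays, with padding living between the end of one tensor's valid elements and the next tensor's offset. All names here (`quantize_grouped_fp8`, `offsets`, `lengths`, `scale_inv`) are hypothetical and not the actual GroupedTensor API; the sketch targets the e4m3 FP8 format:

```cuda
#include <cuda_fp8.h>
#include <cstdint>

constexpr float kFP8E4M3Max = 448.0f;  // largest finite e4m3 value

// One block per member tensor: each block reads its own amax, derives
// the per-tensor scale, and quantizes only the valid (non-padding)
// elements of that tensor. Padding past lengths[t] is never touched.
__global__ void quantize_grouped_fp8(const float* __restrict__ in,
                                     __nv_fp8_e4m3* __restrict__ out,
                                     const float* __restrict__ amax,      // one amax per tensor
                                     float* __restrict__ scale_inv,       // per-tensor dequant scale
                                     const int64_t* __restrict__ offsets, // start of each tensor
                                     const int64_t* __restrict__ lengths) // valid elems per tensor
{
  const int t = blockIdx.x;                // tensor index within the group
  const float a = fmaxf(amax[t], 1e-12f);  // guard against a zero amax
  const float scale = kFP8E4M3Max / a;

  if (threadIdx.x == 0) scale_inv[t] = 1.0f / scale;  // stored for dequantization

  const int64_t base = offsets[t];
  const int64_t n = lengths[t];
  for (int64_t i = threadIdx.x; i < n; i += blockDim.x) {
    out[base + i] = __nv_fp8_e4m3(in[base + i] * scale);
  }
}
```

Launched as `quantize_grouped_fp8<<<num_tensors, 256>>>(...)`, one block per member tensor keeps the amax read and scale computation local to each block, so no cross-tensor synchronization is needed, and skipping indices beyond `lengths[t]` covers the "ignore padding" requirement.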