Implement quantization support for the GroupedTensor type with FP8 per-tensor quantization. Modifications needed in the existing kernel (see the sketch below):

- handle multiple amax values, one per member tensor
- ignore padding in the allocation
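
For reference, a minimal standalone sketch of what the kernel change could look like, assuming the group is packed into one contiguous buffer described by per-tensor `offsets` and `lengths` arrays, with padding living between the end of one tensor's valid elements and the next tensor's offset. All names here (`quantize_grouped_fp8`, `offsets`, `lengths`, `scale_inv`) are hypothetical and not the actual GroupedTensor API; the sketch targets the e4m3 FP8 format:

```cuda
#include <cuda_fp8.h>
#include <cstdint>

constexpr float kFP8E4M3Max = 448.0f;  // largest finite e4m3 value

// One block per member tensor: each block reads its own amax, derives
// the per-tensor scale, and quantizes only the valid (non-padding)
// elements of that tensor. Padding past lengths[t] is never touched.
__global__ void quantize_grouped_fp8(const float* __restrict__ in,
                                     __nv_fp8_e4m3* __restrict__ out,
                                     const float* __restrict__ amax,      // one amax per tensor
                                     float* __restrict__ scale_inv,       // per-tensor dequant scale
                                     const int64_t* __restrict__ offsets, // start of each tensor
                                     const int64_t* __restrict__ lengths) // valid elems per tensor
{
  const int t = blockIdx.x;                // tensor index within the group
  const float a = fmaxf(amax[t], 1e-12f);  // guard against a zero amax
  const float scale = kFP8E4M3Max / a;

  if (threadIdx.x == 0) scale_inv[t] = 1.0f / scale;  // stored for dequantization

  const int64_t base = offsets[t];
  const int64_t n = lengths[t];
  for (int64_t i = threadIdx.x; i < n; i += blockDim.x) {
    out[base + i] = __nv_fp8_e4m3(in[base + i] * scale);
  }
}
```

Launched as `quantize_grouped_fp8<<<num_tensors, 256>>>(...)`, one block per member tensor keeps the amax read and scale computation local to each block, so no cross-tensor synchronization is needed, and skipping indices beyond `lengths[t]` covers the "ignore padding" requirement.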