@dzzz2001 commented Dec 26, 2025

Summary

This PR adds CUDA GPU acceleration for the snap_psibeta_half function in RT-TDDFT calculations, cutting the time spent in that routine by roughly 24x relative to the CPU-optimized v3.9.0.21 baseline (see the benchmarks below).

Changes

New Files

  • source/source_lcao/module_rt/kernels/cuda/snap_psibeta_gpu.cu - Main CUDA implementation
  • source/source_lcao/module_rt/kernels/cuda/snap_psibeta_kernel.cu - CUDA kernel implementations
  • source/source_lcao/module_rt/kernels/cuda/snap_psibeta_kernel.cuh - Kernel headers and device functions
  • source/source_lcao/module_rt/kernels/snap_psibeta_gpu.h - Public interface header

Modified Files

  • source/source_lcao/module_operator_lcao/td_nonlocal_lcao.cpp - Integration with GPU path
  • source/source_lcao/module_rt/CMakeLists.txt - Build configuration for CUDA files

Key Optimizations

  1. Atom-batch GPU kernel: Processes atoms in batches to maximize GPU utilization (see the sketch after this list)
  2. Constant memory grids: Uses CUDA constant memory for Gauss-Legendre integration grids
  3. Warp shuffle reduction: Efficient parallel reduction using warp primitives
  4. Optimized spherical harmonics: GPU-optimized implementation with atan2 and sincos
  5. Template-based kernel dispatch: Compile-time optimization paths
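
The atom-batch launch pattern referenced in item 1 can be pictured as one CUDA block per neighbor-atom work item, with the threads of the block cooperating over the quadrature points of that atom's <psi|beta> integral. The sketch below shows only this launch skeleton; all identifiers (AtomTask, snap_psibeta_batch_kernel, N_QUAD, BLOCK_SIZE) and the data layout are illustrative assumptions, not the PR's actual interface.

```cuda
#include <cuda_runtime.h>

constexpr int BLOCK_SIZE = 256;  // threads per block (assumed)
constexpr int N_QUAD     = 140;  // quadrature points per atom (assumed, cf. the "140x" note below)

// Hypothetical task descriptor; the PR's real data layout may differ.
struct AtomTask
{
    double tau[3];   // neighbor-atom position relative to the projector centre
    int    it;       // element type, selects which beta projectors apply
    int    offset;   // start of this atom's slot in the output buffer
};

__global__ void snap_psibeta_batch_kernel(const AtomTask* tasks, int n_tasks, double* out)
{
    const int task_id = blockIdx.x;            // one block per neighbor atom
    if (task_id >= n_tasks) { return; }
    const AtomTask task = tasks[task_id];

    double partial = 0.0;
    for (int q = threadIdx.x; q < N_QUAD; q += blockDim.x)
    {
        partial += 0.0;  // placeholder for the <psi|beta> quadrature contribution at point q
    }

    // The real kernel combines the partial sums with a block-level reduction
    // (see the warp shuffle reduction sketch further down); this is only the skeleton.
    if (threadIdx.x == 0) { out[task.offset] = partial; }
}

// Host side (sketch): snap_psibeta_batch_kernel<<<n_tasks, BLOCK_SIZE>>>(d_tasks, n_tasks, d_out);
```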

Performance Results

Test Environment

  • Platform: Polaris (北极星)
  • CPU: Intel(R) Xeon(R) Platinum 8358P CPU @ 2.60GHz
  • GPU: NVIDIA A800 80GB PCIe (2x GPUs used)
  • Test command: OMP_NUM_THREADS=12 mpirun -n 2 abacus

Benchmark Results

Version                      snap_psibeta Time (s)   contribute_HR Time (s)   snap Ratio
v3.9.0.20 (baseline)         306.46                  404.38                   48.32%
v3.9.0.21 (CPU optimized)    158.09                  236.51                   33.57%
This PR (GPU)                6.59                    19.90                    2.58%

Speedup

  • vs v3.9.0.20: ~46x faster for snap_psibeta
  • vs v3.9.0.21: ~24x faster for snap_psibeta
  • Overall contribute_HR: ~20x faster than v3.9.0.20, ~12x faster than v3.9.0.21

Commit History

…ory grids

- Implement GPU-accelerated snap_psibeta_neighbor_batch_kernel
- Use constant memory for the Lebedev and Gauss-Legendre integration grids (sketched below)
- Add multi-GPU support via set_device_by_rank
- Initialize/finalize GPU resources in each calculate_HR call
- Remove static global variables for cleaner resource management
- Fall back to the CPU path when GPU processing fails
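
As a rough illustration of the constant-memory grid idea (array names and grid sizes here are guesses, not the PR's actual values), the quadrature nodes and weights are uploaded once with cudaMemcpyToSymbol and then read by every thread through the constant cache instead of global memory:

```cuda
#include <cuda_runtime.h>

// Assumed grid sizes; the real kernels may use different quadrature orders.
constexpr int N_GL  = 140;   // Gauss-Legendre radial points
constexpr int N_LEB = 110;   // Lebedev angular points

__constant__ double c_gl_x[N_GL];           // radial nodes
__constant__ double c_gl_w[N_GL];           // radial weights
__constant__ double c_leb_xyz[N_LEB * 3];   // angular directions
__constant__ double c_leb_w[N_LEB];         // angular weights

// Host side: upload the grids once per run before launching the kernels.
void upload_grids(const double* gl_x, const double* gl_w,
                  const double* leb_xyz, const double* leb_w)
{
    cudaMemcpyToSymbol(c_gl_x,    gl_x,     N_GL * sizeof(double));
    cudaMemcpyToSymbol(c_gl_w,    gl_w,     N_GL * sizeof(double));
    cudaMemcpyToSymbol(c_leb_xyz, leb_xyz,  N_LEB * 3 * sizeof(double));
    cudaMemcpyToSymbol(c_leb_w,   leb_w,    N_LEB * sizeof(double));
}

// Multi-GPU (sketch): each MPI rank binds to one device, e.g.
//   cudaSetDevice(mpi_rank % n_devices);   // roughly what set_device_by_rank does
```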
…ructure

- Add ModuleBase::timer for snap_psibeta_atom_batch_gpu function
- Remove the GPU-to-CPU fallback design (it attempted to return true/false from a void function)
- Replace fallback returns with error messages and proper early exits
- Ensure timer is properly called on all exit paths
- Simplify code structure for better readability
…uring

- Move ylm0 computation outside the radial loop, where it was being recomputed ~140 times redundantly (see the sketch below)
- Hoist A_dot_leb and dR calculations outside inner loop
- Add #pragma unroll hints for radial and m0 loops

Achieves 23.3% speedup on snap_psibeta_gpu (19.27s -> 14.78s).
Numerical correctness verified: energy matches baseline (-756.053 Ry).
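
To make the restructuring concrete, the sketch below shows the pattern rather than the actual kernel code: everything that depends only on the angular direction is computed once, before the radial loop, and the small inner loops get #pragma unroll hints. The function name, argument list, and the assumed compute_ylm_gpu signature are illustrative only.

```cuda
// Sketch of the hoisting pattern only; variable names, array sizes and the
// compute_ylm_gpu signature are assumptions, not the PR's actual code.
__device__ void compute_ylm_gpu(int L, const double dir[3], double* ylm);   // assumed signature

__device__ double integrate_one_direction(int L0,
                                           const double dir[3],       // angular (Lebedev) direction
                                           const double A[3],         // vector potential term
                                           int n_radial,
                                           const double* radial_vals)
{
    // Hoisted out of the radial loop: these depend only on the direction.
    double ylm0[11];                                    // enough for L0 <= 5
    compute_ylm_gpu(L0, dir, ylm0);                     // previously recomputed ~140 times
    const double A_dot_leb = A[0]*dir[0] + A[1]*dir[1] + A[2]*dir[2];

    double sum = 0.0;
    #pragma unroll 4
    for (int ir = 0; ir < n_radial; ++ir)               // radial loop, now free of ylm work
    {
        const double radial = radial_vals[ir];
        #pragma unroll
        for (int m0 = 0; m0 < 2 * L0 + 1; ++m0)         // small m0 loop, unroll hint
        {
            sum += radial * ylm0[m0] * A_dot_leb;       // placeholder combination
        }
    }
    return sum;
}
```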
- Replace conditional atan branches with a single atan2 call (sketched below)
- Use sincos() instead of separate sin/cos calls

Achieves 8.4% additional speedup (14.78s -> 13.56s)
Combined with loop restructuring: 29.6% total from baseline
Numerical correctness verified: -756.053 Ry
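
The trigonometric change reads roughly like the sketch below, a generic illustration assuming the usual spherical-coordinate setup (the surrounding real-spherical-harmonics code is omitted): a single atan2 replaces the branchy quadrant handling for the azimuthal angle, and sincos computes both values in one call.

```cuda
// Sketch of the atan2 + sincos pattern; not the PR's exact spherical-harmonics code.
__device__ void angles_from_direction(double x, double y, double z,
                                      double* sin_phi, double* cos_phi,
                                      double* cos_theta)
{
    // Before: the azimuthal angle was assembled from atan(y/x) plus quadrant
    // branches.  atan2 handles all quadrants (and x == 0) without branching.
    const double phi = atan2(y, x);

    // sincos evaluates sin(phi) and cos(phi) together in one call.
    sincos(phi, sin_phi, cos_phi);

    const double r = sqrt(x * x + y * y + z * z);
    *cos_theta = (r > 0.0) ? z / r : 1.0;   // polar angle enters through cos(theta)
}
```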
- Convert compute_ylm_gpu to templated version with L as template param
- Use linear array for Legendre polynomials (reduces from 25 to 15 doubles)
- Add DISPATCH_YLM macro for runtime-to-template dispatch (see the sketch below)
- Add MAX_M0_SIZE constant for result array sizing
- Replace C++17 constexpr if with regular if for C++14 compatibility
- Enable compiler loop unrolling with #pragma unroll

Performance: snap_psibeta_gpu improved from 13.27s to 9.83s (1.35x speedup)
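
A minimal sketch of the runtime-to-template dispatch idea behind DISPATCH_YLM (the macro body, the supported L range, and compute_ylm_gpu_t are assumptions, not the PR's code):

```cuda
// Sketch of runtime-to-template dispatch; the real DISPATCH_YLM may differ.
template <int L>
__device__ void compute_ylm_gpu_t(const double dir[3], double* ylm)
{
    #pragma unroll
    for (int m = 0; m < 2 * L + 1; ++m)    // trip count is a compile-time constant
    {
        ylm[m] = 0.0;                       // placeholder for the actual Y_{L,m}(dir)
    }
}

// Map a runtime L onto the templated implementation.
#define DISPATCH_YLM(L, dir, ylm)                              \
    switch (L)                                                 \
    {                                                          \
    case 0: compute_ylm_gpu_t<0>(dir, ylm); break;             \
    case 1: compute_ylm_gpu_t<1>(dir, ylm); break;             \
    case 2: compute_ylm_gpu_t<2>(dir, ylm); break;             \
    case 3: compute_ylm_gpu_t<3>(dir, ylm); break;             \
    case 4: compute_ylm_gpu_t<4>(dir, ylm); break;             \
    default: break; /* L beyond the supported range */         \
    }
```

Because L becomes a template parameter, each instantiation sees a fixed 2*L+1 trip count, which is what allows the compiler to fully unroll the m loop.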
- Replace shared-memory tree reduction with warp shuffle reduction (see the sketch below)
- Use warp_reduce_sum for intra-warp reduction (register shuffles avoid shared-memory traffic)
- Reduce shared memory from BLOCK_SIZE doubles (2 KB) to NUM_WARPS doubles (64 bytes)
- Cross-warp reduction done by first warp reading from shared memory

Reduces register usage from 94 to 88, shared memory from 2KB to 64 bytes.
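
For reference, a textbook version of the reduction described above (generic code, not copied from the PR): each warp collapses its partial sums with __shfl_down_sync, only one double per warp is staged in shared memory, and the first warp finishes the cross-warp step.

```cuda
#include <cuda_runtime.h>

constexpr int BLOCK_SIZE = 256;
constexpr int NUM_WARPS  = BLOCK_SIZE / 32;

__device__ inline double warp_reduce_sum(double v)
{
    // Offset reduction within a warp using register shuffles only.
    for (int offset = 16; offset > 0; offset >>= 1)
    {
        v += __shfl_down_sync(0xffffffffu, v, offset);
    }
    return v;
}

__device__ double block_reduce_sum(double v)
{
    __shared__ double warp_sums[NUM_WARPS];      // 8 doubles = 64 bytes

    const int lane = threadIdx.x % 32;
    const int wid  = threadIdx.x / 32;

    v = warp_reduce_sum(v);                      // intra-warp reduction
    if (lane == 0) { warp_sums[wid] = v; }       // one value per warp
    __syncthreads();

    // First warp reduces the per-warp partial sums.
    v = (threadIdx.x < NUM_WARPS) ? warp_sums[lane] : 0.0;
    if (wid == 0) { v = warp_reduce_sum(v); }
    return v;                                    // valid in thread 0 of the block
}
```

With BLOCK_SIZE = 256 this needs 8 doubles (64 bytes) of shared memory instead of 256 doubles (2 KB), matching the numbers quoted above.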
…umentation

- Add comprehensive file headers explaining purpose and key features
- Organize code into logical sections with clear separators
- Add doxygen-style documentation for all functions, structs, and constants
- Fix inaccurate comments (BLOCK_SIZE requirement, direction vector normalization)
- Remove unused variables (dR, distance01)
- Remove finalize_gpu_resources() as it's not needed for constant memory
- Improve inline comments explaining algorithms and optimizations
…ction

- Add use_gpu runtime flag that checks both the __CUDA macro and PARAM.inp.device (see the sketch at the end of this list)
- GPU path is now only enabled when __CUDA is defined AND device == "gpu"
- Makes the conditional logic clearer with if/else instead of nested #ifdef
- Move CUDA_CHECK macro to shared header snap_psibeta_kernel.cuh
- Remove duplicate CUDA_CHECK definition from snap_psibeta_gpu.cu
- Remove CUDA_CHECK_KERNEL macro and replace all usages with CUDA_CHECK
- Reduces code duplication and improves consistency
- Replace local PI, FOUR_PI, SQRT2 definitions with ModuleBase:: versions
- Add include for source_base/constants.h
- Replace fprintf(stderr, ...) with ModuleBase::WARNING_QUIT
- Update CUDA_CHECK macro to use WARNING_QUIT instead of fprintf
- Add includes for tool_quit.h and string header
- Consistent error handling with ABACUS codebase conventions
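
Condensed, the device-selection logic described at the top of this list might look like the sketch below inside td_nonlocal_lcao.cpp; everything other than PARAM.inp.device, the __CUDA macro, and the two function names quoted earlier is a placeholder.

```cpp
// Sketch only: GPU path is taken when the build has CUDA support AND the
// user requested device "gpu"; otherwise the existing CPU code runs.
bool use_gpu = false;
#ifdef __CUDA
use_gpu = (PARAM.inp.device == "gpu");
#endif

if (use_gpu)
{
#ifdef __CUDA
    // snap_psibeta_atom_batch_gpu(...);   // batched GPU path (arguments omitted)
#endif
}
else
{
    // snap_psibeta_half(...);             // original CPU path (arguments omitted)
}
```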