forked from abacusmodeling/abacus-develop
perf(TDDFT): Add CUDA acceleration for snap_psibeta_half function #6808
Open: dzzz2001 wants to merge 13 commits into deepmodeling:develop from dzzz2001:cuda_tddft_claude
…ory grids
- Implement GPU-accelerated snap_psibeta_neighbor_batch_kernel
- Use constant memory for Lebedev and Gauss-Legendre integration grids
- Add multi-GPU support via set_device_by_rank
- Initialize/finalize GPU resources in each calculate_HR call
- Remove static global variables for cleaner resource management
- CPU fallback when GPU processing fails

…ructure
- Add ModuleBase::timer for snap_psibeta_atom_batch_gpu function
- Remove GPU fallback to CPU design (return true/false in void function)
- Replace fallback returns with error messages and proper early exits
- Ensure timer is properly called on all exit paths
- Simplify code structure for better readability

…uring
- Move ylm0 computation outside radial loop (saves 140x redundant calculations)
- Hoist A_dot_leb and dR calculations outside inner loop
- Add #pragma unroll hints for radial and m0 loops

Achieves 23.3% speedup on snap_psibeta_gpu (19.27s -> 14.78s). Numerical correctness verified: energy matches baseline (-756.053 Ry).

- Replace conditional atan branches with single atan2 call
- Use sincos() instead of separate sin/cos calls

Achieves 8.4% additional speedup (14.78s -> 13.56s). Combined with loop restructuring: 29.6% total from baseline. Numerical correctness verified: -756.053 Ry.

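The branch removal in this commit can be illustrated on the host. This is a minimal sketch, assuming the earlier code derived the azimuthal angle from `atan` with explicit quadrant handling; the names `phi_branchy` and `phi_atan2` are hypothetical and do not come from the PR. On the device, `sincos()` would then produce the sine and cosine of that angle in one call.

```cpp
#include <cassert>
#include <cmath>

const double kPi = 3.14159265358979323846;

// Hypothetical "before": azimuthal angle via atan plus quadrant branches.
double phi_branchy(double x, double y)
{
    if (x > 0.0) return std::atan(y / x);
    if (x < 0.0) return (y >= 0.0) ? std::atan(y / x) + kPi
                                   : std::atan(y / x) - kPi;
    if (y > 0.0) return kPi / 2;
    if (y < 0.0) return -kPi / 2;
    return 0.0; // x == 0 and y == 0
}

// "After": one atan2 call covers every quadrant and the x == 0 cases.
double phi_atan2(double x, double y)
{
    return std::atan2(y, x);
}
```

Both functions return the same angle in (-pi, pi]; the second form has no data-dependent branches, which is likely what makes it attractive inside a GPU kernel where threads of a warp can diverge.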
- Convert compute_ylm_gpu to templated version with L as template param
- Use linear array for Legendre polynomials (reduces from 25 to 15 doubles)
- Add DISPATCH_YLM macro for runtime-to-template dispatch
- Add MAX_M0_SIZE constant for result array sizing
- Replace C++17 constexpr if with regular if for C++14 compatibility
- Enable compiler loop unrolling with #pragma unroll

Performance: snap_psibeta_gpu improved from 13.27s to 9.83s (1.35x speedup)

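One plausible reading of "25 to 15 doubles" is that a 5x5 table of P_l^m for l, m <= 4 was replaced by a packed triangular array of (L+1)(L+2)/2 entries. The sketch below shows that layout together with a runtime-to-template dispatch. It is an illustration under that assumption: `plm_index`, `plm_size`, `legendre_packed`, and `DISPATCH_LEGENDRE` are hypothetical names, not the PR's `compute_ylm_gpu` or `DISPATCH_YLM`, and the sign convention (Condon-Shortley phase, no normalization) may differ from the actual kernel.

```cpp
#include <cassert>
#include <cmath>

// Packed index of P_l^m (0 <= m <= l): row l starts at l*(l+1)/2.
constexpr int plm_index(int l, int m) { return l * (l + 1) / 2 + m; }

// Entries needed for l = 0..L: (L+1)(L+2)/2, which is 15 for L = 4.
constexpr int plm_size(int L) { return (L + 1) * (L + 2) / 2; }

// Fill out[] with unnormalized associated Legendre values P_l^m(x), |x| <= 1.
// L is a template parameter, so every loop bound is a compile-time constant
// and the compiler is free to unroll the loops.
template <int L>
void legendre_packed(double x, double* out)
{
    const double s = std::sqrt(1.0 - x * x);
    out[plm_index(0, 0)] = 1.0;
    for (int m = 1; m <= L; ++m) // diagonal: P_m^m = -(2m-1) s P_{m-1}^{m-1}
        out[plm_index(m, m)] = -(2 * m - 1) * s * out[plm_index(m - 1, m - 1)];
    for (int m = 0; m < L; ++m)  // P_{m+1}^m = (2m+1) x P_m^m
        out[plm_index(m + 1, m)] = (2 * m + 1) * x * out[plm_index(m, m)];
    for (int m = 0; m <= L; ++m) // upward recurrence in l for fixed m
        for (int l = m + 2; l <= L; ++l)
            out[plm_index(l, m)] = ((2 * l - 1) * x * out[plm_index(l - 1, m)]
                                    - (l + m - 1) * out[plm_index(l - 2, m)])
                                   / (l - m);
}

// Hypothetical dispatch macro: maps a runtime L (0..4) to a template instance.
// Values outside 0..4 are ignored in this sketch.
#define DISPATCH_LEGENDRE(L_runtime, x, out)             \
    do {                                                 \
        switch (L_runtime) {                             \
        case 0: legendre_packed<0>((x), (out)); break;   \
        case 1: legendre_packed<1>((x), (out)); break;   \
        case 2: legendre_packed<2>((x), (out)); break;   \
        case 3: legendre_packed<3>((x), (out)); break;   \
        case 4: legendre_packed<4>((x), (out)); break;   \
        default: break;                                  \
        }                                                \
    } while (0)
```

A caller would allocate `double p[plm_size(4)]` and invoke `DISPATCH_LEGENDRE(L, x, p)`. The ordinary `switch` keeps the code valid in C++14, consistent with the commit's move away from `constexpr if`.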
- Replace shared memory tree reduction with warp shuffle reduction
- Use warp_reduce_sum for intra-warp reduction (faster shuffle ops)
- Reduce shared memory from BLOCK_SIZE (2KB) to NUM_WARPS (64 bytes)
- Cross-warp reduction done by first warp reading from shared memory

Reduces register usage from 94 to 88, shared memory from 2KB to 64 bytes.

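The two-level reduction described in this commit can be traced on the host. The sketch below emulates the data movement in plain C++. It assumes `BLOCK_SIZE = 256` (256 doubles are 2 KB, and the resulting 8 warps need 8 doubles, or 64 bytes, matching the figures above); the `_emulated` function names are hypothetical. In a real kernel each inner step would be `val += __shfl_down_sync(0xffffffff, val, offset);`.

```cpp
#include <cassert>

constexpr int WARP_SIZE  = 32;
constexpr int BLOCK_SIZE = 256;                    // assumption: 256 doubles = 2 KB
constexpr int NUM_WARPS  = BLOCK_SIZE / WARP_SIZE; // 8 doubles = 64 bytes

// Host emulation of a shuffle-down reduction over one warp of 32 lanes.
double warp_reduce_sum_emulated(const double* lanes)
{
    double val[WARP_SIZE];
    for (int i = 0; i < WARP_SIZE; ++i) val[i] = lanes[i];

    for (int offset = WARP_SIZE / 2; offset > 0; offset /= 2)
    {
        double next[WARP_SIZE]; // all lanes update simultaneously
        for (int lane = 0; lane < WARP_SIZE; ++lane)
        {
            // A shuffle from an out-of-range lane returns the caller's own value.
            const int src = (lane + offset < WARP_SIZE) ? lane + offset : lane;
            next[lane] = val[lane] + val[src];
        }
        for (int lane = 0; lane < WARP_SIZE; ++lane) val[lane] = next[lane];
    }
    return val[0]; // lane 0 holds the warp total
}

// Block-level sum over BLOCK_SIZE per-thread values.
double block_reduce_sum_emulated(const double* thread_vals)
{
    // Stage 1: each warp reduces its own 32 values; lane 0 writes one slot
    // of the small "shared memory" buffer.
    double warp_sums[NUM_WARPS];
    for (int w = 0; w < NUM_WARPS; ++w)
        warp_sums[w] = warp_reduce_sum_emulated(thread_vals + w * WARP_SIZE);

    // Stage 2: the first warp loads the partial sums (lanes >= NUM_WARPS
    // contribute zero) and reduces them with the same shuffle pattern.
    double first_warp[WARP_SIZE];
    for (int lane = 0; lane < WARP_SIZE; ++lane)
        first_warp[lane] = (lane < NUM_WARPS) ? warp_sums[lane] : 0.0;
    return warp_reduce_sum_emulated(first_warp);
}
```

The only buffer shared across warps is `warp_sums`, which is why the shared-memory footprint scales with the number of warps rather than with the number of threads.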
…umentation
- Add comprehensive file headers explaining purpose and key features
- Organize code into logical sections with clear separators
- Add doxygen-style documentation for all functions, structs, and constants
- Fix inaccurate comments (BLOCK_SIZE requirement, direction vector normalization)
- Remove unused variables (dR, distance01)
- Remove finalize_gpu_resources() as it's not needed for constant memory
- Improve inline comments explaining algorithms and optimizations

…ction
- Add use_gpu runtime flag that checks both __CUDA macro and PARAM.inp.device
- GPU path is now only enabled when __CUDA is defined AND device == "gpu"
- Makes the conditional logic clearer with if/else instead of nested #ifdef

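The flag logic in this commit can be sketched as follows. This is a simplified illustration, not the code in td_nonlocal_lcao.cpp: `kCudaCompiled` and `select_path` are hypothetical names, and the `device` argument stands in for `PARAM.inp.device`.

```cpp
#include <cassert>
#include <string>

// Fold the compile-time macro into a constant once, so the rest of the
// code can use an ordinary if/else.
#ifdef __CUDA
constexpr bool kCudaCompiled = true;
#else
constexpr bool kCudaCompiled = false;
#endif

// Returns which path would run: "gpu" only when the build has CUDA support
// AND the requested device is "gpu"; otherwise "cpu".
std::string select_path(const std::string& device)
{
    const bool use_gpu = kCudaCompiled && device == "gpu";
    if (use_gpu)
    {
        return "gpu";
    }
    else
    {
        return "cpu";
    }
}
```

In a real build the call to the GPU routine inside the `if` branch would still need to be guarded so that non-CUDA builds link.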
- Move CUDA_CHECK macro to shared header snap_psibeta_kernel.cuh
- Remove duplicate CUDA_CHECK definition from snap_psibeta_gpu.cu
- Remove CUDA_CHECK_KERNEL macro and replace all usages with CUDA_CHECK
- Reduces code duplication and improves consistency

- Replace local PI, FOUR_PI, SQRT2 definitions with ModuleBase:: versions
- Add include for source_base/constants.h

- Replace fprintf(stderr, ...) with ModuleBase::WARNING_QUIT
- Update CUDA_CHECK macro to use WARNING_QUIT instead of fprintf
- Add includes for tool_quit.h and string header
- Consistent error handling with ABACUS codebase conventions

Summary
This PR adds CUDA GPU acceleration for the snap_psibeta_half function in RT-TDDFT calculations, achieving significant performance improvements.

Changes
New Files
- source/source_lcao/module_rt/kernels/cuda/snap_psibeta_gpu.cu - Main CUDA implementation
- source/source_lcao/module_rt/kernels/cuda/snap_psibeta_kernel.cu - CUDA kernel implementations
- source/source_lcao/module_rt/kernels/cuda/snap_psibeta_kernel.cuh - Kernel headers and device functions
- source/source_lcao/module_rt/kernels/snap_psibeta_gpu.h - Public interface header

Modified Files
- source/source_lcao/module_operator_lcao/td_nonlocal_lcao.cpp - Integration with GPU path
- source/source_lcao/module_rt/CMakeLists.txt - Build configuration for CUDA files

Key Optimizations
- atan2 and sincos

Performance Results
Test Environment
OMP_NUM_THREADS=12 mpirun -n 2 abacus

Benchmark Results
[Benchmark table; surviving column headers: snap_psibeta Time, contribute_HR Time, snap Ratio, Speedup]
- contribute_HR: ~20x faster than v3.9.0.20, ~12x faster than v3.9.0.21