In this thesis, we present and compare three libraries for programming heterogeneous systems: Kokkos, SYCL (using the AdaptiveCpp implementation), and CUDA. The main challenge in developing high-performance applications is keeping code portable across different GPU architectures while still achieving maximum performance. Developers must therefore choose between simpler development in a universal framework, which improves portability, and the highest performance in a specialized framework that targets a single architecture. We implemented and tested matrix multiplication and bitonic sort on the Nvidia RTX 3070 and AMD Radeon VII graphics cards. The results show that Kokkos and SYCL are competitive in execution time and similar in programming style, whereas CUDA is a lower-level solution that remains the most optimized for Nvidia hardware. We also observed that choosing the right GPU architecture for a given algorithm generally has a greater impact on execution time than the choice between the universal libraries under comparison.