Abstract: The GRAPES_MESO and WRF models are used to analyse the computational characteristics of numerical models on the KUNPENG platform, in comparison with the Intel (X86) platform, in order to explore the room for improvement on the KUNPENG platform in resource utilisation, computational bottlenecks, hotspot functions, and other aspects. The results indicate that: (1) After adaptation, both models produce results on the domestic KUNPENG platform consistent with those on the X86 platform. (2) Both models exhibit good parallel scalability on both the X86 and KUNPENG platforms. With the same number of processes, the computing efficiency of the KUNPENG platform is 65% to 90% of that of the X86 platform; with the same number of nodes, however, it exceeds that of the X86 platform by 22% to 45%. (3) In terms of hardware resource utilisation, both models spend most of their time in computation, followed by communication and then I/O. CPU utilisation is high and per-node memory usage is moderate, so subsequent optimisation should focus mainly on code efficiency, algorithms, and memory access. (4) In terms of MPI communication, the communication efficiency of the GRAPES model on the KUNPENG platform is improved by optimising the collective-communication algorithms for Collective Sync, Allreduce, and Wait. (5) Top-down analysis shows that the computational bottlenecks of both models on both platforms are concentrated in the back-end core (CPU) and the back-end memory subsystem. Thanks to the multiple memory channels and the Bisheng compiler, the memory access efficiency, branch prediction rate, and cache hit rate of the GRAPES model on the KUNPENG platform are higher than on the X86 platform. The memory-subsystem metrics show that TLB misses and L1/L2 cache misses are generally low and memory access efficiency is high, so the room for memory-access optimisation is limited. The instruction-mix statistics show a relatively high proportion of memory-read and integer instructions together with a certain share of floating-point instructions, reflecting the high memory-bandwidth advantage of the KUNPENG architecture; the proportion of vector instructions is low, so vectorisation optimisation is worth considering. (6) Hotspot analysis shows that, on the KUNPENG platform, the GRAPES model can be further optimised in the GCR algorithm, the collective-communication hotspots represented by uct and ucg, the mathematical functions represented by expf and powf, and memory-operation hot functions such as malloc.
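To illustrate why the GCR solver and Allreduce appear together in points (4) and (6), the following minimal MPI sketch (illustrative only, not taken from the GRAPES source; the function global_dot and the array sizes are hypothetical) shows the global dot product that a GCR-type Krylov solver typically evaluates each iteration, which is what makes MPI_Allreduce a collective-communication hotspot on both platforms.

```c
/* Minimal illustrative sketch, not from the paper: a GCR-type iterative
 * solver computes global dot products every iteration, and each one
 * requires an MPI_Allreduce across all ranks, which is why Allreduce
 * shows up as a collective-communication hotspot. */
#include <mpi.h>
#include <stdio.h>

/* Global dot product over a domain-decomposed field: local partial sum,
 * then an Allreduce to combine contributions from every rank. */
static double global_dot(const double *a, const double *b, int n_local)
{
    double local = 0.0, global = 0.0;
    for (int i = 0; i < n_local; ++i)
        local += a[i] * b[i];
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return global;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    enum { N_LOCAL = 4 };                 /* per-rank portion of the field */
    double r[N_LOCAL] = {1.0, 2.0, 3.0, 4.0};
    double p[N_LOCAL] = {0.5, 0.5, 0.5, 0.5};

    /* In a real GCR solve this reduction runs at least once per iteration,
     * so its latency accumulates over many iterations per time step. */
    double alpha = global_dot(r, p, N_LOCAL);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("global dot product = %f\n", alpha);

    MPI_Finalize();
    return 0;
}
```

Because this reduction is latency-bound rather than bandwidth-bound, tuning the collective algorithm (as done for Collective Sync, Allreduce, and Wait in point (4)) matters more than raw interconnect throughput.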