Adding a data prefetching transformation to LLVM Polly: generating load-balanced, coarse-grained, pipelinable loop code for more task-level parallelism and data locality
by Junqi Deng for The LLVM Compiler Infrastructure
I propose adding a prefetching transformation to LLVM Polly. This transformation splits the innermost loop into three parts that can be pipelined at the task level: a head, which prefetches data; a body, which performs the computation; and a foot, which stores the results back. The loads (execution times) of the three parts are balanced. The transformed code in some sense mimics the behavior of a cache, but it goes beyond a cache because its data movement is timely, accurate, and simple. The transformation should benefit architectures that have on-chip scratch-pad memory and support task-level parallelism, such as GPUs and FPGAs. It will also work on multi-core CPUs that provide non-blocking data-cache prefetch instructions. It will therefore enable LLVM Polly to perform a much wider range of locality optimizations.
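As a rough illustration of the head/body/foot split described above, the following C sketch applies it by hand to a simple element-wise loop. It is not Polly output and not the proposed implementation: the names `head`, `body`, `foot`, `scale_add_staged`, and `scale_add_reference`, the tile size `TILE`, and the stack buffer standing in for scratch-pad memory are all assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tile size; a real pass would presumably derive it from
 * the scratch-pad capacity and the load-balancing goal. */
#define TILE 64

/* Head: prefetch one tile from main memory into the local buffer. */
static void head(const float *in, float *buf, size_t len) {
    for (size_t i = 0; i < len; ++i)
        buf[i] = in[i];
}

/* Body: compute on the local buffer only. */
static void body(float *buf, size_t len, float k) {
    for (size_t i = 0; i < len; ++i)
        buf[i] = k * buf[i] + 1.0f;
}

/* Foot: store the results of one tile back to main memory. */
static void foot(const float *buf, float *out, size_t len) {
    for (size_t i = 0; i < len; ++i)
        out[i] = buf[i];
}

/* The original, untransformed loop, kept for comparison. */
void scale_add_reference(const float *in, float *out, size_t n, float k) {
    for (size_t i = 0; i < n; ++i)
        out[i] = k * in[i] + 1.0f;
}

/* The same computation split into head, body, and foot per tile.
 * The stages run back to back here; overlapping them across tiles
 * would need at least one extra buffer (double buffering). */
void scale_add_staged(const float *in, float *out, size_t n, float k) {
    assert(in != NULL && out != NULL);
    float buf[TILE]; /* stands in for on-chip scratch-pad memory */
    for (size_t t = 0; t < n; t += TILE) {
        size_t len = (n - t < TILE) ? n - t : TILE;
        head(in + t, buf, len);
        body(buf, len, k);
        foot(buf, out + t, len);
    }
}
```

In this sketch the three stages touch disjoint resources (main-memory reads, the local buffer, main-memory writes), which is what would let a pipelined version run the head of tile t+1, the body of tile t, and the foot of tile t-1 concurrently.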