Software Reliability Enhancements for GPU Applications

Si Li, Naila Farooqui and Sudhakar Yalamanchili. “Software Reliability Enhancements for GPU Applications.” Sixth Workshop on Programmability Issues for Heterogeneous Multicores (MULTIPROG-2013), held in conjunction with the 8th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC). January 2013.


As the role of highly parallel accelerators becomes more important in high-performance computing, so does the need to ensure their reliable operation. In applications where precision and correctness are necessities, bit-level reliable operation is required. While mechanisms for error detection and correction exist, their cost-effective implementation in massively parallel accelerators is still an active area of research. In this paper we present an alternative, software-based approach for improving the reliability of massively parallel bulk-synchronous processors such as modern GPUs. Specifically, we propose a set of software reliability enhancements via transparent code patching of GPU applications. Reliability enhancements can be applied selectively at runtime, can be customized by the user, and are transparent to the application. Runtime overhead ranges from 1% to 737%, depending on the nature of the enhancement. We provide an analysis of benefits and limitations.



author={Li, Si and Farooqui, Naila and Yalamanchili, Sudhakar},
title={Software Reliability Enhancements for GPU Applications},
booktitle={Proceedings of the Sixth Workshop on Programmability Issues for Heterogeneous Multicores},
keywords={GPU, correctness checks, instrumentation, reliability, CUDA, PTX},