GPU Ocelot is an open-source dynamic JIT compilation framework for GPU compute applications targeting a range of GPU and non-GPU execution targets. Ocelot supports CUDA applications and provides an implementation of the CUDA Runtime API, enabling seamless integration. NVIDIA's PTX virtual instruction set architecture is used as a device-agnostic program representation that captures the data-parallel SIMT execution model of CUDA applications. Ocelot supports several backend execution targets – a PTX emulator, NVIDIA GPUs, AMD GPUs, and a translator to LLVM for efficient execution of GPU kernels on multicore CPUs.
GPU Ocelot facilitates research and development on several fronts. First, Ocelot improves developer productivity for GPU compute applications by providing an infrastructure for building event trace analyzers using the emulator and monitoring kernel execution. Second, as a JIT compiler infrastructure, Ocelot provides facilities for compiler research, including interfaces to an internal representation of PTX programs in support of optimization passes for massively data-parallel compute kernels. Third, Ocelot enables research in heterogeneous architectures via trace generation interfaces which can be used to drive detailed simulators. Fourth, with an open-source implementation of the CUDA runtime, Ocelot enables research in kernel scheduling, resource allocation for accelerator devices, and heterogeneity-aware operating systems.
GPU Ocelot is available from the GPU Ocelot GitHub site. It is an open-source project released under the New BSD license. Documentation of the codebase is available at http://gpuocelot.gatech.edu/doxygen
GPU Ocelot defines a succinct interface for implementing additional accelerator backends to interact with Ocelot's PTX transformation pipeline and API front ends (the CUDA Runtime and CUDA Driver APIs). The following devices are currently supported.
- NVIDIA GPU – native execution on GPUs, exploring instrumentation and optimization techniques
- AMD GPU – prototype translator from PTX to IL enables executing CUDA on AMD GPUs
- Multicore CPU – translator from PTX to LLVM enables efficient execution on multicore CPUs
- PTX Emulator – provides functional simulation of PTX kernels and detailed instruction traces
- Remote – lightweight remote procedure call implementation transparently utilizes distributed GPU devices
GPU Computing Development
GPU Ocelot's PTX emulator enables CUDA applications to be executed on a functional simulator that computes the complete architectural state of a GPU for each dynamic instruction. The emulator may be augmented with user-defined trace generators that react to dynamic instruction traces as the program executes, enabling real-time workload characterization and correctness checks. Existing trace analyzers provide support for memory access checks, race detection, an interactive debugger, and feedback for performance tuning.
GPU Ocelot facilitates research in heterogeneous and data-parallel compilation techniques by providing a parser and internal representation for NVIDIA's PTX, a virtual instruction set for data-parallel computing. Control- and data-flow analysis passes are implemented on this IR, along with an emitter and a set of validation procedures. Translators from PTX to LLVM make it possible to leverage the analysis tools available in the LLVM project. A pass manager inspired by the LLVM project orchestrates structured analysis and transformation passes, which can be applied statically or on-line during the execution of CUDA programs. Execution model translation, which maps the PTX execution model to processors other than NVIDIA GPUs, enables portability across processor architectures. This has given rise to research in which SIMT execution models target processors with scalar pipelines that are tightly coupled to vector functional units.
GPU Ocelot's PTX emulator is a functional simulator and is capable of driving externally defined timing models. Currently, trace generators interfacing to MACSIM are available to perform design space exploration of GPU architectures. We are aware of other efforts using GPU Ocelot to perform architecture research using custom timing models.
In addition to transparently supporting three commodity processor architectures (multicore CPUs, NVIDIA GPUs, and AMD GPUs), GPU Ocelot provides on-line context switching, enabling state to be migrated between processors as programs run. This contrasts with other platforms such as OpenCL, which is advertised as cross-platform yet whose vendor implementations require recompiling the host application to target different processor architectures. GPU Ocelot has been used as the execution manager and dynamic compiler for the Harmony Runtime, which performs kernel-level scheduling. Lynx is an extension of GPU Ocelot that provides software instrumentation for GPU computing, largely inspired by the Pin project.
We gratefully acknowledge the support of this research by the National Science Foundation, LogicBlox Corporation, IBM, NVIDIA, AMD, and Sandia National Laboratories, with equipment grants from NVIDIA and Intel.