On-chip memory design is critical to GPGPU performance because on-chip memory serves as a low-latency, high-throughput data communication point between the massive number of threads and the large external memory. However, the existing on-chip memory hierarchy is inherited from conventional CPU architectures and is often suboptimal for SIMT (single instruction, multiple threads) execution. In this study, we depart from the traditional memory hierarchy design and restructure the on-chip memory into an integrated architecture with a cache-emulated register file (RF) capability, tailored for high-performance GPGPU computing. With lightweight support from the ISA, the compiler, and a modified microarchitecture, this integrated architecture can dynamically emulate a variable-sized RF and a cache in a uniform way. Evaluation results demonstrate that the proposed architecture delivers better performance and energy efficiency with a smaller on-chip memory size. For example, it achieves an average performance improvement of 50% for cache-sensitive applications.