Office of Research, UC Riverside
Laxminarayan Bhuyan
Distinguished Professor Emeritus
Computer Science & Engineering
lbhuyan@ucr.edu
(951) 827-2281


SHF: Small: Efficient CPU-GPU Communication for Heterogeneous Architectures

AWARD NUMBER
006824-002
FUND NUMBER
21264
STATUS
Closed
AWARD TYPE
3-Grant
AWARD EXECUTION DATE
6/18/2014
BEGIN DATE
7/1/2014
END DATE
6/30/2017
AWARD AMOUNT
$498,976

Sponsor Information

SPONSOR AWARD NUMBER
CCF-1423108
SPONSOR
NATIONAL SCIENCE FOUNDATION
SPONSOR TYPE
Federal
FUNCTION
Organized Research
PROGRAM NAME

Proposal Information

PROPOSAL NUMBER
14070606
PROPOSAL TYPE
New
ACTIVITY TYPE
Basic Research

PI Information

PI
Bhuyan, Laxmi
PI TITLE
Other
PI DEPARTMENT
Computer Science & Engineering
PI COLLEGE/SCHOOL
Bourns College of Engineering
CO PIs

Project Information

ABSTRACT

Future chip multiprocessors (CMPs) will have the silicon space and technology to incorporate hundreds of cores. The trend is to integrate tens of cores and hardware accelerators (HAs), such as GPUs, on a single platform. The proposed heterogeneous architecture will enable future chips to operate within their power budgets while providing the high throughput per watt required for large scientific applications. Many of the Top500 supercomputers integrate thousands of CPUs with GPU accelerators to achieve the desired throughput for scientific applications. Considerable effort, however, is needed to design efficient communication mechanisms between the heterogeneous components in such a system. Currently, HAs are not fully integrated with the system architecture; offloading computation from the CPU to the HAs adds large communication overhead. This research project explores comprehensive solutions to this problem through many different techniques. The project has significant broader impact in terms of research publications, graduate student supervision, and minority education, as UCR is a minority-serving institution.

This project will develop new CPU-GPU communication techniques through static programming and run-time optimization. It will develop a divisible load theory (DLT) technique to overlap communication with computation and to optimize the timing and size of data transfers between the CPU and GPU. The research will also develop run-time techniques that monitor execution efficiency and dynamically adjust the transfer parameters based on the execution behaviors of different applications. Architectural changes will be incorporated in the GPU so that it can initiate data transfers based on task execution inside the GPU. A shared virtual memory (SVM) architecture will also be designed, in which the accelerator and system memories share a single virtual address space and the CPUs and HAs communicate through the SVM. The hardware controllers, memory management unit (MMU), GPU cache memory architectures, cache coherence protocols, and other interfaces between the GPU and CPU cores will also be designed. The project proposes suitable hybrid cache coherence protocols and efficient interconnection networks for scalable system design. Finally, run-time systems and software interfaces will be developed that can execute multiple multithreaded applications on a heterogeneous multicore architecture.
(Abstract from NSF)
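The DLT-based overlap idea in the abstract can be illustrated with a simple pipelined cost model. This is a hypothetical sketch under a linear cost model, not the project's actual formulation: splitting an offloaded workload into `k` equal chunks lets chunk i+1's host-to-GPU transfer overlap chunk i's kernel execution, while fixed per-chunk overheads (transfer setup, kernel launch) penalize very fine splits, creating a tradeoff that determines the best chunk count. All function and parameter names here are illustrative.

```python
def pipelined_makespan(n, k, xfer_rate, comp_rate,
                       xfer_ovh=0.0, launch_ovh=0.0):
    """Total time to process n items split into k equal chunks,
    where each chunk's transfer overlaps the previous chunk's compute.

    Two-stage pipeline (transfer, then compute) with k jobs of
    per-chunk times t and c has makespan t + c + (k - 1) * max(t, c).
    """
    t = (n / k) / xfer_rate + xfer_ovh    # per-chunk transfer time
    c = (n / k) / comp_rate + launch_ovh  # per-chunk compute time
    return t + c + (k - 1) * max(t, c)

def best_chunk_count(n, xfer_rate, comp_rate,
                     xfer_ovh=0.0, launch_ovh=0.0, k_max=64):
    """Pick the chunk count that minimizes the pipelined makespan."""
    return min(range(1, k_max + 1),
               key=lambda k: pipelined_makespan(
                   n, k, xfer_rate, comp_rate, xfer_ovh, launch_ovh))
```

With zero overheads, finer splits always help (the makespan approaches the slower stage's total time); with nonzero per-chunk overheads, `best_chunk_count` finds the sweet spot, which is the kind of transfer-parameter choice the project's run-time techniques would tune dynamically.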