At the 56th Annual IEEE/ACM International Symposium on Microarchitecture, researchers from the University of California, Riverside (UCR) demonstrated an approach in which all of a platform's computing components run truly concurrently. As a result, computation speed can roughly double while energy consumption is cut in half. The technique can work with any processors and accelerators, from smartphones to data center servers, but still requires further development.
“You don't need to add new processors [to speed up computing] because you already have them,” said Hung-Wei Tseng, an associate professor in the Department of Electrical and Computer Engineering at the University of California and co-author of the study. The point is to manage the hardware resources you already have intelligently, rather than using them one after another.
The platform the researchers developed, which they called simultaneous and heterogeneous multithreading (SHMT), breaks away from traditional programming models. Instead of handing data to only one of the system's computing components at a time — the central, graphics, tensor, or other processor or accelerator — SHMT parallelizes code execution across all components simultaneously.
SHMT uses a quality-aware work-stealing (QAWS) scheduling policy that requires few resources while maintaining quality control and workload balance. The runtime system divides each virtual operation (VOP) into one or more high-level operations (HLOPs) so that multiple hardware resources can be used simultaneously. The SHMT runtime then distributes these HLOPs to task queues for execution on the target hardware. Because HLOPs are hardware-independent, the runtime can reassign tasks among the platform's components as needed.
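The scheduling scheme described above can be illustrated with a simplified sketch. This is not the actual SHMT runtime; the class names, the splitting of each VOP into four HLOPs, and the "precision-sensitive" quality flag are all assumptions made for illustration. The sketch only shows the general shape of quality-aware work stealing: per-device queues, idle devices stealing work from busy peers, and a quality check that keeps precision-sensitive work on high-precision hardware.

```python
import collections

class HLOp:
    """A hardware-independent high-level operation carved out of a VOP."""
    def __init__(self, vop_id, index, precision_sensitive=False):
        self.vop_id = vop_id
        self.index = index
        # Quality-aware hint (hypothetical): precision-sensitive HLOPs must
        # run on high-precision devices (e.g. CPU/GPU rather than a TPU).
        self.precision_sensitive = precision_sensitive

class Device:
    """A compute component (CPU, GPU, TPU, ...) with its own task queue."""
    def __init__(self, name, high_precision):
        self.name = name
        self.high_precision = high_precision
        self.queue = collections.deque()
        self.completed = []

    def run_one(self):
        # Execute one HLOp from the front of our own queue, if any.
        if self.queue:
            self.completed.append(self.queue.popleft())
            return True
        return False

    def steal_from(self, others):
        # Work stealing: take an HLOp from the back of a peer's queue,
        # but only if this device can meet its quality requirement.
        for other in others:
            for hlop in reversed(other.queue):
                if self.high_precision or not hlop.precision_sensitive:
                    other.queue.remove(hlop)
                    self.queue.append(hlop)
                    return True
        return False

def schedule(num_vops, devices):
    # Split each VOP into four HLOPs (an arbitrary choice for this sketch);
    # mark some of them precision-sensitive to exercise the quality check.
    hlops = [HLOp(v, i, precision_sensitive=(i % 3 == 0))
             for v in range(num_vops) for i in range(4)]
    # Quality-aware initial distribution: round-robin over eligible devices.
    rr = 0
    for h in hlops:
        eligible = [d for d in devices
                    if d.high_precision or not h.precision_sensitive]
        eligible[rr % len(eligible)].queue.append(h)
        rr += 1
    # Run until all queues drain; idle devices steal work from busy peers.
    done, total = 0, len(hlops)
    while done < total:
        for d in devices:
            if d.run_one():
                done += 1
            else:
                d.steal_from([o for o in devices if o is not d])
    return devices
```

A usage example with a heterogeneous trio mirroring the test platform: `schedule(2, [Device("cpu", True), Device("gpu", True), Device("tpu", False)])` completes all eight HLOPs while the low-precision device only ever executes work it is allowed to take.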
Notably, the researchers demonstrated the effectiveness of the new software libraries on a test platform of their own making — a hybrid that could pass for a smartphone, a PC, or even a server. Built on a backplane board with a PCIe connector, the “computer” combined an NVIDIA Jetson Nano module — a quad-core ARM Cortex-A57 processor (CPU) and 128 Maxwell-architecture graphics cores (GPU) — with a Google Edge TPU accelerator connected through the board's M.2 Key E slot.
The system's main memory, where shared data is stored, is 4 GB of LPDDR4 running at 1600 MHz with 25.6 GB/s of bandwidth. The Edge TPU module adds 8 MB of its own memory, and Ubuntu Linux 18.04 served as the operating system.
Running the SHMT package on this improvised heterogeneous platform with standard benchmark applications showed that, under its most effective policy, the QAWS framework delivers a 1.95× speedup and a 51% reduction in energy consumption compared with the baseline method of distributing computation. Scaled up to data center use, the gains promise to be enormous — and the hardware stays exactly the same; nothing needs to be replaced. The proposed solution is not yet ready for deployment, but it is likely to attract considerable interest.