The Art of High Performance Computing for Computational Science, Vol. 1 by Masaaki Geshi

Author: Masaaki Geshi
ISBN: 9789811361944
Publisher: Springer Singapore


7.2.2 Hardware Characteristics of FLOPS-oriented Supercomputers

In the following, we outline the projected hardware specifications of exaFLOPS machines, assuming the FLOPS-oriented type, and point out their important features. A typical example of an exaFLOPS machine assumed in this chapter is shown in Fig. 7.1.

Fig. 7.1 Architecture of an exaFLOPS machine presupposed in this chapter

Parallelism of order 10^8–10^9

At present, the clock frequency of CPU cores is at most a few GHz. This situation will not change drastically in the near future, mainly due to the requirement of keeping the power consumption at an affordable level. Since each core thus executes only of order 10^9 cycles per second, on the order of 10^8–10^9 floating-point operations must be in flight every cycle to achieve exaFLOPS (10^18 FLOPS). This parallelism will be realized by a hierarchy, consisting of instruction-level, core-level, chip-level, and node-level parallelism. A rough estimate of the required scale is sketched below.
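As a sanity check, this figure follows from simple arithmetic. The C sketch below computes the required core count; the per-core parameters (2 GHz clock, 16 FLOP per cycle from SIMD and fused multiply-add) are illustrative assumptions, not figures from the text.

```c
#include <stdio.h>

int main(void) {
    /* Assumed per-core parameters (illustrative): 2 GHz clock and
       16 double-precision FLOP per cycle, e.g., 8-wide SIMD with
       fused multiply-add. */
    double clock_hz       = 2.0e9;
    double flop_per_cycle = 16.0;
    double target_flops   = 1.0e18;   /* 1 exaFLOPS */

    double flops_per_core = clock_hz * flop_per_cycle;     /* 32 GFLOPS */
    double cores_needed   = target_flops / flops_per_core;

    printf("per-core peak : %.2e FLOPS\n", flops_per_core);
    printf("cores needed  : %.2e\n", cores_needed);        /* ~3.1e7 */
    /* Counting the 16-way in-core (SIMD/FMA) parallelism as well,
       the total concurrency is about 5e8, i.e., of order 10^8-10^9. */
    return 0;
}
```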

Deep memory hierarchy

Today’s supercomputers already have a fairly deep memory hierarchy, consisting of on-chip registers, several levels of on-chip and off-chip cache, main memory within a node, and main memory in other nodes. This hierarchy will become even deeper and more complicated in exaFLOPS machines, corresponding to the hierarchical parallelism stated above.
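Although this paragraph concerns hardware, the practical consequence for software is that data reuse within the faster levels of the hierarchy becomes essential. The following is a minimal, generic sketch of loop blocking (tiling) in C, not a method prescribed by this chapter; the block size BS is an assumed tuning parameter that should be matched to the target cache level.

```c
#include <stddef.h>

/* Blocked matrix multiplication C += A * B for n x n row-major
   matrices. Within one (kk, jj) block of B, each element is reused
   up to BS times from cache instead of being refetched from main
   memory. BS is a tuning parameter (assumed value below). */
#define BS 64

void matmul_blocked(size_t n, const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < n; ii += BS)
        for (size_t kk = 0; kk < n; kk += BS)
            for (size_t jj = 0; jj < n; jj += BS)
                /* multiply one pair of BS x BS blocks */
                for (size_t i = ii; i < ii + BS && i < n; i++)
                    for (size_t k = kk; k < kk + BS && k < n; k++) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < jj + BS && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

The same idea applies recursively at each level of the hierarchy: blocking for cache, for node-local memory, and for data held on remote nodes.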

Increase in the data transfer cost

Up to now, the floating-point performance of supercomputers has been increasing more rapidly than memory access performance or internode communication performance. This has resulted in a severe discrepancy between the computation speed and the data transfer speed, which is expected to grow even larger in the future. Let us divide the data transfer performance into throughput and latency and consider them separately.

According to the prediction in [26], a FLOPS-oriented machine with a total performance of 1,000–2,000 PFLOPS will have a total memory bandwidth of 5–10 PBytes/s. Hence, the ratio of data transfer throughput to floating-point performance is 0.005 Byte/FLOP. This means that we need to perform at least 1600 operations on each double-precision word (8 bytes) fetched from memory in order to fully exploit the machine's floating-point performance. In contrast, the K computer has a total performance of 10 PFLOPS and a total memory bandwidth of 5 PBytes/s, so the ratio is 0.5 Byte/FLOP. Thus, the relative memory access cost of a FLOPS-oriented machine is 100 times higher than that of the K computer.

As for the latency, [26, Table 2-3] estimates the inter-core synchronization/communication latency as 100 ns (100 cycles) and the internode communication latency as 80–200 ns. This means that virtually no performance enhancement can be expected with respect to the latency. Considering that exaFLOPS machines will have much higher floating-point performance and a larger number of nodes than today's supercomputers, we can conclude that the effect of latency on execution time will be far more serious on exaFLOPS machines. The effect of latency will be most salient in AllReduce-type communication, such as arises in computing the inner product of two vectors. For example, an AllReduce operation among nodes using a binary tree will require several thousand cycles (Fig. 7.2). The back-of-envelope calculation below reproduces these figures.
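In the C sketch below, the bandwidth and performance figures are the lower ends of the ranges quoted from [26]; the node count (10^5), the per-hop latency (200 ns), and the 2 GHz clock are illustrative assumptions.

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    /* Throughput: Byte/FLOP ratio and operations needed per double. */
    double bw = 5.0e15;   /* 5 PBytes/s  (lower end of 5-10 PBytes/s) */
    double pf = 1.0e18;   /* 1,000 PFLOPS (lower end of 1,000-2,000)  */
    double byte_per_flop    = bw / pf;               /* 0.005 B/FLOP */
    double flops_per_double = 8.0 / byte_per_flop;   /* 1600         */

    /* Latency: an AllReduce over a binary tree takes about 2*log2(P)
       hops (reduction up the tree, then broadcast back down).
       Assumed: P = 1e5 nodes, 200 ns per hop, 2 GHz core clock. */
    double P = 1.0e5, hop_ns = 200.0, cycles_per_ns = 2.0;
    double cycles = 2.0 * log2(P) * hop_ns * cycles_per_ns;

    printf("Byte/FLOP        : %.3f\n", byte_per_flop);
    printf("FLOPs per double : %.0f\n", flops_per_double);
    printf("AllReduce cycles : %.0f\n", cycles);     /* ~13,000 */
    return 0;
}
```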

Fig. 7.2 AllReduce operation using a binary tree


