When Single Instruction Multiple Data (SIMD) architectures are commonly employed by today’s instruction set extensions, data alignment for those alignment constrained systems is becoming a new challenge. This paper focus on how to reorganize the misaligned data, thus SIMD extension can be successfully and efficiently utilized. Four main policies were proposed: Zero-Shift, Eager-Shift, Lazy-shift and Dominant-Shift. The key idea is to add shiftstream operations between load and store instructions to generate proper data format. The Data Reorganization Graph is modeled to guide the SIMD code generation.
The scheme has two disadvantages: (I) when memory alignment is not available at compile-time, only non-optimal Zero-Shift policy can be applied; (II) it doesn’t handle data size conversions, i.e. data conversions between types of different sizes.
This paper is a comprehensive report on works that have been done in compiler optimization for IBM Cell processor. Several optimization technologies have been presented including branch hint, automatic simdization and compiler-controlled software cache, etc. For automatic simdization, it used the same approach suggested by Eichenberger in [1].
This paper targets two challenges with the data reorganization: run-time alignment and data conversion. The main difficulty of run-time alignment is that shiftstream operations have different code sequence for left and right shifting, and it’s very hard to determine if a shifting is left- or right-shift or not, without knowing the offset information at compile-time. The key idea is that the right-shift operations can be transferred to left-shift operations by choosing the right “target” element. Namely, if we focus on the element, which will have offset 0 after shifting and prepend elements to original data accordingly, arbitrary stream-shift operation can be equally presented by left-shift operation.
This paper proposed algorithms to exploiting the Superword Level Parallelism (SLP). One very important issue in parallelism detection is to recognize vectorizable statements and pack them into SIMD-style operations. The algorithms in this paper first find out pairs of statements which can be parallelized (independent, profitable, etc.), then those pairs are extended to larger PackSet via use-def or def-use chains. Moreover, similar pairs can be combined into groups, which will be used to generate the SIMD code.
This scheme mainly focuses on basic block optimization. For loops, unrolling can be utilized before apply the algorithms. However, the number of iterations unrolled must be within the vector register boundary. Meanwhile, the combination algorithm mentioned in this paper may not be able to fully utilized the width of vector registers.
Since the SIMD extensions for multimedia are
similar to conventional vector processors, it is believed that traditional
vectorization can be used to compile for multimedia extensions. In this paper,
the differences between SIMD extension and vector computer are identified.
While the traditional vectorization techniques were developed at the end of the
Generally, the compiler optimization will generate the vector instructions for each vectorizable operation to exploit the SIMD-style architecture. In contrast, scalar hardware is typically targeted using ILP techniques, such as software pipelining. Targeting both scalar and vector resources present novel problems, leading to a new algorithm for automatic vectorization. This paper proposed an approach to exploit vector parallelism in software pipelined loops. Instead of vectorizing all vectorizable statements, the algorithm only perform “selective vectorization” guided by the cost analysis. Namely, all statements in the program are classified into two classes: scalar execution and vector execution. The compiler will compute the cost for each vectorizable statement, and decide whether to generate vector code or not.
Starting from Pentium® processor, Intel has provided MMX™, SSE and SSE2 technologies, which are called SIMD extensions, for its newer microprocessors. Compared with conventional vectorizing approaches developed for vector computers, the schemes for such SIMD extensions were referred as Intra-register Vectorization in this paper. They tailored the classical vectorization methods for the Intel architecture. For example, the Vector Length (VL) was introduced to represent the limitation of the vector registers. Also, elaborate automatic detection of vectorizable idioms for specific instructions, such as the saturation arithmetic, was considered. Although the misalignment store/load operations are supported by Intel architecture, the authors used loop peeling, programming techniques, and dynamic loop peeling aligment strategies to minimize the performance penalty caused by misaligned operations and cache line splits.
In this paper, a vectorizing C compiler targeting the MMX extension for Intel Pentium® microprocessors is proposed. The implementation is based on the SUIF infrastructure. Extra passes for vectorizing are embedded into the basic SUIF tools, including condition/loop distribution, strip mine, scalar expansion, and reduction processing, etc. The parallel detection is based on dependence graph analysis method. “A non-singleton strongly connected component (SCC) of a data dependence graph represents a maximum set of statements that are involved in a dependence cycle….Only a singleton SCC that is not self-dependent is a candidate for exploiting subword parallelism.”
[Classic Vectorizing Techiniques]
This is the classic paper on compiler optimizations for conventional supercomputers. The optimization for supercomputer includes two aspects: vectorization and parallelism. By vectorization, we mean that single operation can be applied to multiple data items packed into vector registers. Parallelism means multiple instructions can be executed simultaneously in multiple function units or microprocessors. The kernel issue needs to be addressed before code generation is the data dependence. Generally, three dependences are considered, including true dependence, anti dependence and output dependence. The dependences between statements in program are represented by Data Dependence Graph. If there is no circle in the DDG, the compiler will generate the vectorization/parallelism code for those statements. In order to improve the potential parallelism, several schemes, for example, induction variable recognition, wraparound variable recognition, global forward substitution and semantic analysis, etc., were used to generate more accurate DDG. Also, variable renaming, node splitting techniques were applied to eliminate the anti- and output dependence.
Based on DDG, a few optimizations for vector or concurrent computers were mentioned. Among them, Scalar Expansion, Loop Interchanging, Loop Fission/Fusion, Strip Mining and Loop Collapsing were discussed in details. Although not all of them are suitable for optimizations for microprocessors, some of them, for example scalar expansion, loop interchanging and loop fusion, etc. can still be used to exploit the parallelism for SIMD style architectures.
[Dependence analysis]