Compiler Optimization


[ Papers | People | Links | Misc ]

Papers

    [Multi-core]
  1. Mary W. Hall, Jennifer M. Anderson, Saman P. Amarasinghe, etc., "Maximizing multiprocessor performance with the SUIF compiler", Computer, Vol. 29, No. 12, pp. 84-89, December 1996. [PDF]
  2. Mary W. Hall, Saman P. Amarasinghe, Brian R. Murphy, Shih-Wei Liao, Monica S. Lam, "Detecting coarse-grain parallelism using an interprocedural parallelizing compiler", Proceedings of the IEEE/ACM SC95, 1995. [PDF]
    [SIMD]
  1. Alexandre E. Eichenberger, Peng Wu, Kevin O’Brien, “Vectorization for SIMD architectures with alignment constrains”, PLDI’04, June 2004. [PDF]
  2. When Single Instruction Multiple Data (SIMD) architectures are commonly employed by todays instruction set extensions, data alignment for those alignment constrained systems is becoming a new challenge. This paper focus on how to reorganize the misaligned data, thus SIMD extension can be successfully and efficiently utilized. Four main policies were proposed: Zero-Shift, Eager-Shift, Lazy-shift and Dominant-Shift. The key idea is to add shiftstream operations between load and store instructions to generate proper data format.  The Data Reorganization Graph is modeled to guide the SIMD code generation.

    The scheme has two disadvantages: (I) when memory alignment is not available at compile-time, only non-optimal Zero-Shift policy can be applied; (II) it doesn’t handle data size conversions, i.e. data conversions between types of different sizes.

  3. Alexandre E. Eichenberger, Kathryn O’Brien, Kevin O’Brien, etc., “Optimizing compiler for a CELL processor”, IBM T.J. Watson Research Center, 2005. [PDF]
  4. This paper is a comprehensive report on works that have been done in compiler optimization for IBM Cell processor. Several optimization technologies have been presented including branch hint, automatic simdization and compiler-controlled software cache, etc. For automatic simdization, it used the same approach suggested by Eichenberger in [1].

  5. Peng Wu, Alexandre E. Eichenberger, Amy Wang, “Efficient SIMD code generation for runtime alignment and length conversion”, CGO’05, 2005. [PDF]
  6. This paper targets two challenges with the data reorganization: run-time alignment and data conversion. The main difficulty of run-time alignment is that shiftstream operations have different code sequence for left and right shifting, and it’s very hard to determine if a shifting is left- or right-shift or not, without knowing the offset information at compile-time. The key idea is that the right-shift operations can be transferred to left-shift operations by choosing the right “target” element. Namely, if we focus on the element, which will have offset 0 after shifting and prepend elements to original data accordingly, arbitrary stream-shift operation can be equally presented by left-shift operation.

  7. Peng Wu, Alexandre E. Eichenberger, Amy Wang, Peng Zhao, “An integrated simdization framework using virtual vectors”, ICS’05, June 2005. [PDF]
  8. Samuel Larsen, Saman Amarashighe, “Exploiting superword level parallelism with multimedia instruction sets”, Proceedings of the SIGPLAN, June 2000. [PDF]

    This paper proposed algorithms to exploiting the Superword Level Parallelism (SLP). One very important issue in parallelism detection is to recognize vectorizable statements and pack them into SIMD-style operations. The algorithms in this paper first find out pairs of statements which can be parallelized (independent, profitable, etc.), then those pairs are extended to larger PackSet via use-def or def-use chains. Moreover, similar pairs can be combined into groups, which will be used to generate the SIMD code.

    This scheme mainly focuses on basic block optimization. For loops, unrolling can be utilized before apply the algorithms. However, the number of iterations unrolled must be within the vector register boundary. Meanwhile, the combination algorithm mentioned in this paper may not be able to fully utilized the width of vector registers.

  9. Gang Ren, Peng Wu, David Padua, “A preliminary study on the vectorization of multimedia applications for multimedia extensions”, LCPC’03, LNCS 2958, pp. 420-435, 2004. [PDF]
  10. Since the SIMD extensions for multimedia are similar to conventional vector processors, it is believed that traditional vectorization can be used to compile for multimedia extensions. In this paper, the differences between SIMD extension and vector computer are identified. While the traditional vectorization techniques were developed at the end of  the 1980’s and the beginning of the 1990’s , new challenges arise for vectorization of  multimedia extensions with short, fixed-length vectors, much weaker memory unit, and less uniform and general-purpose ISA.

  11. Weihua Jiang, Chao Mei, Bo Huang, etc., "Boosting the performance of multimedia applications using SIMD instructions", CC 2005, LNCS 3443, pp. 59-75, 2005. [PDF]
  12. Bik A J C, Girkar M, Grey P M, Tian X, "Automatic detection of saturation and clipping idioms", Proceedings of the 15th Int'l Workshop on Languanges and Compilers for Parallel Computers, July, 2002. [PDF].
  13. Samuel Larsen, Rodric Rabbah, Saman Amarasinghe, “Exploiting vector parallelism in software pipelined loops”, MICRO’05, July 2005. [PDF]
  14. Generally, the compiler optimization will generate the vector instructions for each vectorizable operation to exploit the SIMD-style architecture. In contrast, scalar hardware is typically targeted using ILP techniques, such as software pipelining. Targeting both scalar and vector resources present novel problems, leading to a new algorithm for automatic vectorization. This paper proposed an approach to exploit vector parallelism in software pipelined loops. Instead of vectorizing all vectorizable statements, the algorithm only perform “selective vectorization” guided by the cost analysis. Namely, all statements in the program are classified into two classes: scalar execution and vector execution. The compiler will compute the cost for each vectorizable statement, and decide whether to generate vector code or not.

  15. Aart J. C. Bik, Milind Girdar, Paul M. Grey, Xinmin Tian, “Automatic intra-register vectorization for the Intel architecture”, International Journal of Parallel Programming, Vol. 30, No. 2, April 2002. [PDF]
  16. Starting from Pentium® processor, Intel has provided MMX™, SSE and SSE2 technologies, which are called SIMD extensions, for its newer microprocessors. Compared with conventional vectorizing approaches developed for vector computers, the schemes for such SIMD extensions were referred as Intra-register Vectorization in this paper. They tailored the classical vectorization methods for the Intel architecture. For example, the Vector Length (VL) was introduced to represent the limitation of the vector registers. Also, elaborate automatic detection of vectorizable idioms for specific instructions, such as the saturation arithmetic, was considered. Although the misalignment store/load operations are supported by Intel architecture, the authors used loop peeling, programming techniques, and dynamic loop peeling aligment strategies to minimize the performance penalty caused by misaligned operations and cache line splits.

  17. N. Sreraman, R. Govindarajan, “A vectorizing compiler for multimedia extensions”, International Journal of Parallel Programming, Vol. 28, No. 4, pp. 363-400, 2000. [PDF]
  18. In this paper, a vectorizing C compiler targeting the MMX extension for Intel Pentium® microprocessors is proposed. The implementation is based on the SUIF infrastructure. Extra passes for vectorizing are embedded into the basic SUIF tools, including condition/loop distribution, strip mine, scalar expansion, and reduction processing, etc. The parallel detection is based on dependence graph analysis method. “A non-singleton strongly connected component (SCC) of a data dependence graph represents a maximum set of statements that are involved in a dependence cycle….Only a singleton SCC that is not self-dependent is a candidate for exploiting subword parallelism.”

  19. A. W. Lim, M. S. Lam, “Maximizing parallelism and minimizing synchronization with affine partitions”, Parallel Computing, Vol. 24, No. 3-4, pp. 445-475, May 1998. [PDF]
  20. [Classic Vectorizing Techiniques]

  21. David A. Padua, Michael J. Wolfe, “Advanced compiler optimizations for supercomputers”, Communications of ACM, Vol. 29, No. 12, December 1986. [PDF]
  22. This is the classic paper on compiler optimizations for conventional supercomputers. The optimization for supercomputer includes two aspects: vectorization and parallelism. By vectorization, we mean that single operation can be applied to multiple data items packed into vector registers. Parallelism means multiple instructions can be executed simultaneously in multiple function units or microprocessors. The kernel issue needs to be addressed before code generation is the data dependence. Generally, three dependences are considered, including true dependence, anti dependence and output dependence. The dependences between statements in program are represented by Data Dependence Graph. If there is no circle in the DDG, the compiler will generate the vectorization/parallelism code for those statements. In order to improve the potential parallelism, several schemes, for example, induction variable recognition, wraparound variable recognition, global forward substitution and semantic analysis, etc., were used to generate more accurate DDG. Also, variable renaming, node splitting techniques were applied to eliminate the anti- and output dependence.

    Based on DDG, a few optimizations for vector or concurrent computers were mentioned. Among them, Scalar Expansion, Loop Interchanging, Loop Fission/Fusion, Strip Mining and Loop Collapsing were discussed in details. Although not all of them are suitable for optimizations for microprocessors, some of them, for example scalar expansion, loop interchanging and loop fusion, etc. can still be used to exploit the parallelism for SIMD style architectures.

  23. Randy Allen, Ken Kennedy, “Automatic loop interchange”, Proceedings of the ACM SIGPLAN’84 Symposium on Compiler Construction, Vol. 19, No. 6, June 1984. [PDF]
  24. Randy Allen, Ken Kennedy, “Automatic translation of FORTRAN programs to vector form”, ACM Transactions on Programming Languages and Systems, Vol. 9, No. 4, October 1987. [PDF]
  25. Randy Allen, Ken Kennedy, C. Porterfield, J. Warren, "Conversion of control dependence to data dependence", Proc. of the 10th ACM SIGACT-SIGPLAN, pp. 177-189 1983. [PDF].
  26. [Dependence analysis]

  27. U. Banerjee, "Speedup of ordinary programs", Ph.D. thesis, Dept. of Computer Science, UIUC, Oct. 1979. [PDF]

People

Links

Misc