PARALLELIZATION OF THE BASIC LINEAR ALGEBRA SUBPROGRAMS (BLAS) LIBRARY ON SHARED MEMORY
V.V. Lunev, D.M. Obuvalin, I.N. Orlov, S.V. Sivolgin. VANT. Ser.: Mat. Mod. Fiz. Proc. 1997. Issue 1. P. 59.
The BLAS library (Basic Linear Algebra Subprograms) is an integral part of the well-known LAPACK package intended for numerical analysis applications. The BLAS library contains frequently used linear algebra subprograms. LAPACK is based on block algorithms, and the BLAS routines are used wherever possible as the lower-level building blocks. Optimizing (adapting) the library to a specific architecture therefore substantially increases the performance of the LAPACK package as a whole.

Shared-memory parallelism targets configurations of several processors with a shared main memory. In such systems the memory is global to all processors. The basic workstation was the Intel Altera platform, which can contain from one to four Pentium Pro processors interconnected through a shared memory bus. This configuration ensures good performance under UNIX and Windows NT.

To implement shared-memory parallelism, the SMP package was created; it is a simple and efficient tool for the development of parallel applications. Its main advantages are dynamic adjustment to the number of processors in the cluster and one-time use of the extensive system mechanisms for subtask management. Although the package was designed for the parallelization of the BLAS codes, it can also be used in any other SMP application.

BLAS subprograms are divided into three levels according to the operations they execute: the first-level routines perform vector operations, the second-level routines perform matrix-vector operations, and the third-level routines implement matrix-matrix operations. Developing parallel algorithms was meaningful only for the third level, because the routines of the second and especially of the first level involve a much smaller amount of computation. The main difficulty in implementing the parallel algorithms is the need to load all processors of the system uniformly: an imbalance results in a lower parallelization coefficient.
A specific feature, and at the same time the bottleneck, of shared-memory systems is the bus shared by all processors, which cannot serve several memory requests at a time. This factor is the basic restriction on the growth of the number of processors. When designing the parallel algorithms, we therefore chose the versions that smooth out contention in memory access. The third-level routines operate on various matrix types: general, triangular, and symmetric. For generality of the review, the algorithms for general and symmetric matrices were described. The report presents the parallelization results: the parallelization coefficient varies from 80 to 99 percent for the different routines.
