Analyzing Performance and Power of Multicore Architecture Using Multithreaded Iterative Solver

Problem statement: Scientific modeling and simulation are widely used alongside experiments and theoretical analysis in the science and engineering communities. Approach: Consequently, computational demands are growing exponentially to support large-scale modeling and simulation. Results: As a result, multicore computing architectures have been proposed and several products are already available. However, no proper study exists of the performance, power and thermal issues of real science and engineering problems because software that takes advantage of multicore architectures is not available. Conclusion/Recommendations: In this study, we explored the performance and power characteristics of scientific algorithms on multicore architectures using a multithreaded version of a sparse iterative linear solver, named mtCG, with real scientific application problems.


INTRODUCTION
Computational modeling and simulation are widely used in the science and engineering community to describe and understand complex phenomena in place of expensive or dangerous experiments, in areas such as drug design, global climate simulation, radiation simulation, crash testing, aerodynamics and combustion (Heath, 2002). These models are usually expressed as Partial Differential Equations (PDEs), whose numerical solution requires meshes and sparse matrices. Such applications cannot achieve peak performance because mesh and sparse matrix algorithms lack data reuse and locality.
At the same time, the high performance computing community has increased the number of transistors in a given area to improve performance. This approach has met physical limits and created new problems such as power consumption and thermal issues. To overcome these problems, multicore architectures have been proposed and several products are already available on the market. However, no proper study exists of performance, power and thermal issues on multicore processors because of the lack of scientific applications that benefit from multicore architectures. Several studies have characterized the performance of multicore architectures using multiprogramming or loop-level parallel benchmark programs (Jaleel et al., 2006; Li et al., 2005; Manjikian, 2001).
In this study, we profile the performance of scientific applications using a cycle-accurate simulator to better understand the characteristics of multithreaded programs on multicore architectures. We also explore scalability, power and thermal issues on multicore architectures with real scientific application codes. Finally, we provide a variant of a scientific application that benefits from multicore architectures, which could be used as a benchmark program in the computer architecture community.
This study is organized as follows. We first introduce background information on benchmark programs and related research. Then, we describe our simulation environment and the multithreaded iterative solver. Experimental results of the multithreaded iterative solver on multicore architectures follow. Finally, we give concluding remarks and future plans.

Background: SPEC (SPEC Corporation, 2000) has been widely used in the computer architecture community to measure the performance of newly developed computing architectures. Several studies have used SPEC to measure the performance of multicore architectures (Li et al., 2005; Manjikian, 2001). However, these benchmark programs support only a single thread, so several single-threaded benchmark programs are run together in experiments. Consequently, the results are closer to the characteristics of multiprogramming than to those of a multithreaded program. Even the OpenMP version of SPEC supports only loop-level parallelism, which differs from general multithreaded programs with task-level parallelism.
At the same time, NAS (Bailey et al., 1992) has been used to represent the workloads of scientific and engineering problems. NAS provides several versions of its benchmark programs, including serial, Message Passing Interface (MPI), OpenMP and High Performance Fortran (HPF) versions. However, it does not provide a multithreaded version. The reason is that NAS has been used mainly to measure the performance of clustered systems rather than single-processor machines.
Splash-2 (Woo et al., 1995) has been used to represent multithreaded workloads for Shared Memory Processor (SMP) computers. It includes real scientific kernels such as cholesky, fft and lu, and real scientific applications such as barnes-hut, fmm and water. However, these benchmark programs use synthetic data rather than real scientific application data. The behavior and memory access patterns of benchmark programs run with synthetic data differ from those of real scientific applications.
There are several other benchmarks and variants of the traditional benchmarks, such as MinneSPEC (KleinOsowski and Lilja, 2002), which was developed to reduce simulation time; BioBench (Albayraktaroglu et al., 2005), which represents bioinformatics workloads, and its parallel version using OpenMP (Jaleel et al., 2006); and MineBench (Narayanan et al., 2006), which represents data mining workloads on single and parallel machines.
On the other hand, much research has been done on characterizing the performance, power and thermal behavior of multicore architectures. Jaleel et al. (2006) characterized last-level cache performance on a Chip Multi Processor (CMP) using the OpenMP version of BioBench. Li et al. (2005) characterized the performance, energy and thermal behavior of Simultaneous Multi Threading (SMT) and CMP architectures by replicating single-threaded applications. Monchiero et al. (2006) explored the design space of multicore architectures in terms of performance, power and temperature using the Splash-2 (Woo et al., 1995) benchmark programs. However, to the best of the authors' knowledge, the present study is the first to characterize multicore architectures with real multithreaded scientific applications.
Multithreaded iterative solver: mtCG: The most common algorithm in scientific modeling and simulation is a sparse iterative solver such as Conjugate Gradient (CG). Fig. 1 contains an outline of a generic CG algorithm used in many applications. This CG scheme uses standard data structures for storing the sparse matrix A and the vectors p, q and r. Only the nonzeros of the sparse matrix A and their corresponding indices are explicitly stored, using a standard sparse format. The vectors p, q and r are stored as one-dimensional arrays in contiguous memory locations. A single iteration of CG requires one matrix-vector multiplication, two vector inner products, three vector additions and two floating point divisions. Among these operations, the matrix-vector multiplication dominates the computational cost, accounting for more than 90% of the overall execution time. Because A is sparse, the number of floating point operations per main memory access is relatively low during matrix-vector multiplication. Additionally, the access pattern of the elements of the vector p depends on the sparsity structure of A.
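The generic CG iteration described above can be sketched as follows. This is a minimal illustration, not the authors' mtCG code: it assumes the Compressed Sparse Row (CSR) format as the "standard sparse format", and the names csr_matvec, dot and conjugate_gradient, as well as the tolerance and iteration limit, are hypothetical choices made for this example.

```python
import math


def csr_matvec(values, col_idx, row_ptr, x):
    """Return y = A @ x for a matrix A stored in CSR form (assumed format)."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        s = 0.0
        # Only the stored nonzeros of row i are visited; x is accessed
        # indirectly through col_idx, which is what hurts locality.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += values[k] * x[col_idx[k]]
        y[i] = s
    return y


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(values, col_idx, row_ptr, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A; returns (x, iterations)."""
    x = [0.0] * len(b)
    r = list(b)            # residual r = b - A x, with x = 0 initially
    p = list(r)            # search direction
    rho = dot(r, r)
    for it in range(max_iter):
        if math.sqrt(rho) <= tol:
            return x, it
        q = csr_matvec(values, col_idx, row_ptr, p)      # one matrix-vector product
        alpha = rho / dot(p, q)                          # inner product + division
        x = [xi + alpha * pi for xi, pi in zip(x, p)]    # vector update 1
        r = [ri - alpha * qi for ri, qi in zip(r, q)]    # vector update 2
        rho_new = dot(r, r)                              # second inner product
        beta = rho_new / rho                             # second division
        p = [ri + beta * pi for ri, pi in zip(r, p)]     # vector update 3
        rho = rho_new
    return x, max_iter
```

Each pass through the loop performs exactly the operation mix listed in the text: one matrix-vector multiplication, two inner products, three vector updates and two divisions.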
To provide a multithreaded version of the CG algorithm, we partitioned the matrix row-wise as shown in Fig. 2. Each thread multiplies its row block of the matrix by the corresponding source vector elements and stores the results in its own destination vector elements. Since no destination vector element is shared between threads, this step is perfectly parallel. However, a sparse matrix algorithm cannot benefit from the cache as a dense matrix algorithm does, since it lacks data reuse and locality.
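The row-wise partitioning can be illustrated with a short sketch. It assumes CSR storage and uses Python's standard ThreadPoolExecutor; the names row_blocks and threaded_csr_matvec are hypothetical. Because of CPython's global interpreter lock, this sketch shows only the partitioning and the absence of write conflicts, not the speedup a native threaded implementation would obtain.

```python
from concurrent.futures import ThreadPoolExecutor


def row_blocks(n_rows, n_threads):
    """Split range(n_rows) into n_threads contiguous, near-equal row blocks."""
    base, extra = divmod(n_rows, n_threads)
    blocks, start = [], 0
    for t in range(n_threads):
        stop = start + base + (1 if t < extra else 0)
        blocks.append((start, stop))
        start = stop
    return blocks


def _matvec_block(values, col_idx, row_ptr, x, y, start, stop):
    """Compute y[start:stop] for one row block; x is read-only and shared."""
    for i in range(start, stop):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += values[k] * x[col_idx[k]]
        y[i] = s  # each thread writes only to its own rows of y


def threaded_csr_matvec(values, col_idx, row_ptr, x, n_threads=4):
    """Return y = A @ x with one task per row block."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [
            pool.submit(_matvec_block, values, col_idx, row_ptr, x, y, s, e)
            for s, e in row_blocks(n, n_threads)
        ]
        for f in futures:
            f.result()  # wait for completion and surface any exception
    return y
```

No locking is needed around the writes to y because the row blocks are disjoint, which is the property the text relies on.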
The total operation time required for sparse matrix multiplication depends on the memory access time, $T_{mem}$, which is determined by the cache architecture of the system. Assume a multicore architecture with two levels of cache, L1 and L2: the L1 cache is private to each core and the L2 cache is shared by all cores. Then, following the formula in (Jaleel et al., 2006; Hennessy and Patterson, 2003), the memory access time is expressed in terms of the L1 and L2 cache behavior, including the L2 miss penalty, $penalty_{L2}$. Computing the exact value of $penalty_{L2}$ is difficult in a real computing environment. However, even with a simple memory prefetcher, the value is negligibly small for our algorithm since it accesses memory sequentially (Malkowski et al., 2005a; 2005b).
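For reference, the standard two-level form of the average memory access time given by Hennessy and Patterson can be written as below. This is an assumed form that is consistent with the L2 penalty term used in the text, not necessarily the exact expression used by the authors; the symbols for the L1 and L2 hit times and miss rates are introduced here for illustration only.

```latex
\[
T_{mem} = \mathrm{hit}_{L1}
        + \mathrm{missrate}_{L1} \times
          \left( \mathrm{hit}_{L2} + \mathrm{missrate}_{L2} \times \mathrm{penalty}_{L2} \right)
\]
```

Under this form, a negligible L2 penalty leaves the memory access time dominated by the L1 hit time plus the L1 miss rate times the L2 hit time.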
Since sparse matrix-vector multiplication is an embarrassingly parallel algorithm, we can define the speedup as $S = T_s / T_p$, where $T_s$ is the single-processor execution time and $T_p$ is the execution time with p processors. Considering that $T_p$ for sparse matrix-vector multiplication is $T_s / N_{thread}$, we can achieve a speedup of $N_{thread}$ in theory (Grama et al., 2002).
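The speedup relation can be checked numerically with a small sketch. The function names are hypothetical, the timing values are made-up inputs rather than measurements from this study, and parallel efficiency (speedup divided by thread count) is a standard companion metric that the text itself does not use.

```python
def speedup(t_serial, t_parallel):
    """Speedup = single-processor time / time with p processors."""
    return t_serial / t_parallel


def ideal_parallel_time(t_serial, n_threads):
    """Execution time under perfect scaling: T_p = T_s / N_thread."""
    return t_serial / n_threads


def efficiency(t_serial, t_parallel, n_threads):
    """Parallel efficiency: 1.0 means the ideal N_thread speedup is reached."""
    return speedup(t_serial, t_parallel) / n_threads


if __name__ == "__main__":
    t_s = 8.0  # illustrative serial time in seconds, not measured data
    for n in (1, 2, 4, 8):
        t_p = ideal_parallel_time(t_s, n)
        print(n, speedup(t_s, t_p), efficiency(t_s, t_p, n))
```

With perfectly scaling times the speedup equals the thread count, which is the theoretical bound stated in the text; measured times can be passed to the same functions to see how far a real run falls short of it.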

MATERIALS AND METHODS
We used SESC (Renau et al., 2005), a cycle-accurate architecture simulator that supports Chip Multi Processor (CMP) and Simultaneous Multi Threading (SMT) architectures. Each core is an out-of-order superscalar processor with private L1 caches (separate instruction and data caches) and a shared L2 cache (unified instruction and data cache). The parameters we used for the SESC simulator are listed in Table 1. We used Wattch (Brooks et al., 2000) to measure the power usage of the processor cores and Orion (Wang et al., 2002) to measure the power usage of the shared bus. We then applied Hotspot (Skadron et al., 2003) to obtain thermal characteristics from the power consumption trace produced by the SESC simulator. Since the memory access pattern of an artificially generated data set differs from that of a real application, we used bcsstk16 from MatrixMarket (Boisvert, 1997), a sparse matrix generated from a real structural analysis application and widely used in the scientific computing community.
Figure 3 shows our experimental results for the mtCG benchmark program using bcsstk16 as the input matrix. Since the input matrix has a regular nonzero pattern, mtCG achieves good load balance between cores by assigning an equal chunk of rows to each thread. As a result, the Instructions Per Cycle (IPC) values are similar across different numbers of cores. In addition, the speedup increases linearly with the number of cores. In particular, the L2 cache miss rate decreases as cores are added, up to four cores. We conjecture that the number of L2 accesses decreases dramatically with eight cores because the data per core is then small enough to fit in the L1 cache. Consequently, each core uses a similar amount of power, as shown in Fig. 3.
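Whether equal row blocks give balanced work depends on how the nonzeros are distributed over the rows. A minimal sketch of such a check is given below, assuming CSR storage in which row_ptr holds the row offsets; the function names nnz_per_block and imbalance are hypothetical and the example data are synthetic, not taken from bcsstk16.

```python
def nnz_per_block(row_ptr, n_threads):
    """Count the nonzeros each thread receives under equal row blocks."""
    n_rows = len(row_ptr) - 1
    base, extra = divmod(n_rows, n_threads)
    counts, start = [], 0
    for t in range(n_threads):
        stop = start + base + (1 if t < extra else 0)
        # In CSR, the nonzeros of rows [start, stop) are row_ptr[stop] - row_ptr[start].
        counts.append(row_ptr[stop] - row_ptr[start])
        start = stop
    return counts


def imbalance(counts):
    """Ratio of the heaviest block to the mean; 1.0 means perfect balance.

    Assumes the matrix has at least one nonzero.
    """
    mean = sum(counts) / len(counts)
    return max(counts) / mean


if __name__ == "__main__":
    regular = list(range(0, 25, 3))   # 8 rows with 3 nonzeros each
    skewed = [0, 1, 2, 3, 10]         # last row much denser than the others
    print(nnz_per_block(regular, 4), imbalance(nnz_per_block(regular, 4)))
    print(nnz_per_block(skewed, 2), imbalance(nnz_per_block(skewed, 2)))
```

A ratio close to 1.0 corresponds to the regular nonzero pattern described in the text; a matrix with a few dense rows would need blocks sized by nonzero count rather than by row count.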

RESULTS
In addition to performance and power, temperature has become an important factor in advanced computing architectures. To better understand the thermal characteristics, we traced the temperature changes during benchmark program execution using Hotspot 3.0 (Skadron et al., 2003) with the floor plans shown in Fig. 4a-c. The detailed thermal parameters are given in (Renau et al., 2005). We investigated three multicore floor plans. The first layout spreads the hot areas toward the corners and keeps the L2 cache at the center. The second layout lines up the cores so that the functional units sit at the center. The last layout clusters the functional units at the center to improve performance by keeping functional units close to each other. Starting from the floor plan in (Renau et al., 2005), we scaled down four cores to fit the area of a single processor, so every unit is at 1/4 scale of the original floor plan. In addition, we placed the shared bus at the center to maintain cache coherence using the MESI protocol.
Figure 5 shows the temperature differences among the floor plans of the four-core architecture: Spread (Fig. 5a), Lineup (Fig. 5b) and Centered (Fig. 5c). All floor plans have hotspots on the issue-related units and the floating point units. The hottest unit with this benchmark program is the load/store queue. The multithreaded algorithm mtCG has several synchronization points between threads during execution, which raise the temperature of the load/store queues. In addition, the store queue suffers from corner effects: the interior of the chip has plenty of area over which to spread heat, but a corner area cannot dissipate heat as well as the center. This makes the right side of a core hotter than the left side in our lineup layout. The centered layout exhibits a thermal coupling effect, in which two hotspots keep each other at a high temperature.
The temperature difference among the spread, lineup and centered layouts is not noticeable in our experiment; the centered layout has a slightly higher temperature over the whole chip area.

DISCUSSION
In our experiments, the sparse iterative solver shows performance that increases linearly with the number of cores. Since sparse matrix-vector multiplication, which is scalable, dominates the cost of the sparse iterative solver, this observation is not surprising. Likewise, each core uses a similar amount of power during the computation, since each core executes approximately the same number of floating point operations in our scheme. Consequently, the theoretical speedup of N thread with N thread threads, as in our analysis, is achievable on a multicore computing architecture if the algorithm is designed to benefit from it.
The multithreaded sparse iterative solver raises the temperature of the issue-related units and the floating point units, since the algorithm requires synchronization between threads and also executes a huge number of floating point multiplications and additions; the affected units include the load/store queue and the floating point units. To lower the temperature of these units, we could place them far apart or surround the hot units with the L2 cache, using it as a coolant.

CONCLUSION
In this study, we explored performance, power and thermal issues on modern computing architectures using real scientific applications. We investigated multicore architectures with the multithreaded benchmark program mtCG and real scientific application data. Multicore architectures can provide substantial speedup within given power and thermal constraints as long as the algorithms are scalable. Finally, we provide a multithreaded version of the CG benchmark program, named mtCG, which can be used to measure the performance of multicore architectures.
We plan to develop more multithreaded benchmark programs based on real scientific and engineering applications. In the near future, we could provide multithreaded NAS benchmark programs to evaluate multicore architectures. In addition, we are studying further computer architecture issues, such as cache sharing policies and thread scheduling, to lower temperature.