TO A PERFORMANCE EVALUATION OF PARALLEL ALGORITHMS IN NOW

passing interfaces) are often used to provide an extra layer of abstraction. In this paper we discuss a new performance evaluation method on the example of the multidimensional DFFT (discrete fast Fourier transform) in a NOW based on Intel personal computers.


Introduction
There has been increasing interest in the use of networks of workstations (clusters) connected together by high-speed networks for solving large computation-intensive problems [1,6,14,15,19]. This trend is mainly driven by the cost effectiveness of such systems compared to large multiprocessor systems with tightly coupled processors and memories. Parallel computing on a cluster of workstations connected by high-speed networks has given rise to a range of hardware- and network-related issues on any given platform. Performance prediction and evaluation, load balancing, inter-processor communication and transport protocols for such machines are being widely studied. With the availability of cheap personal computers, workstations and networking devices, the recent trend is to connect a number of such workstations in order to solve computation-intensive tasks in parallel on such clusters.
The workstations can be connected using different network technologies, ranging from off-the-shelf devices such as Ethernet to specialised networks. Such networks and the associated software and protocols introduce latency and throughput limitations, thereby increasing the execution time of cluster-based computation. Researchers are engaged in designing algorithms and protocols to minimise the effect of this latency [16,18].
The network of workstations (Fig. 1) has become a widely accepted form of high-performance parallel computing [1,6,14,15,19]. As in conventional multicomputers, parallel programs running on such a platform are often written in SPMD (single program, multiple data) form to exploit data parallelism. Each workstation in a NOW is treated similarly to a processing element in a multicomputer system. However, workstations are far more powerful and flexible than the processing elements in conventional multicomputers. We can also exploit the advantages of the new SIMD (single instruction, multiple data) instructions in modern personal computers.

Effective parallel algorithms
The duty of the programmer is to develop an effective parallel algorithm for the given parallel system and the given application problem. This task is more complicated in those cases in which we first have to create the conditions for parallel activity, that is, divide the sequential algorithm into many mutually independent parts, which are named processes (the decomposition strategy). In principle, the development of a parallel algorithm includes the following activities [7,16]:
• Decomposition - the division of the application into a set of parallel processes and data
• Mapping - the way in which processes and data are distributed among the computing elements of the parallel system used
• Inter-process communication - the way in which the individual processes cooperate and synchronise
• Tuning - performance optimisation of the developed parallel algorithm.
The most important step is to choose the best decomposition method for a given application [7,16]. To do this it is necessary to understand the concrete application problem, the data domain, the algorithm used and the functional flow of activities in the application.

Decomposition strategies
When designing a parallel program, the description of the high-level algorithm must include, in addition to the design of a sequential program, the method you intend to use to break the application into processes and distribute the data to different nodes - the decomposition strategy. The chosen decomposition method drives the rest of program development. This is true both when developing a new application and when porting serial code. The decomposition method tells you how to structure the code and data and defines the communication topology. To choose the best decomposition method for a given application, it is necessary to understand the concrete application problem, the data domain, the algorithm used and the flow of control in the application. According to the concrete application we can use one of the following decomposition models:
• Perfectly parallel decomposition
• Domain decomposition
• Control decomposition
• Object-oriented programming, OOP (the latest modern programming technology)

Perfect parallel decomposition
Certain applications fall naturally into the perfectly parallel category. Perfectly parallel applications can be divided into a set of processes that require little or no communication with one another. Applications of this kind are usually the easiest to decompose. To this class belong, for example, all numerical integration algorithms. The obvious way to implement perfect parallelism is simply to run equivalent sequential programs on the various nodes, each with a different data set. On a single-processor system, this would require running each case sequentially.
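As a minimal illustrative sketch (not the authors' code), the numerical integration example above can be decomposed into independent per-node tasks; the function names and the four-way split are assumptions chosen for illustration:

```python
import math

def midpoint_integrate(f, a, b, n):
    """Composite midpoint rule on [a, b] with n subintervals."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

def decompose(a, b, n, workers):
    """Split the integration range into independent per-node tasks."""
    h = (b - a) / workers
    return [(a + i * h, a + (i + 1) * h, n // workers) for i in range(workers)]

# Each tuple could be shipped to a different workstation; because the
# subproblems are perfectly parallel, the manager only has to sum the
# partial results at the end.
tasks = decompose(0.0, math.pi, 4096, 4)
total = sum(midpoint_integrate(math.sin, lo, hi, m) for lo, hi, m in tasks)
```

Here the per-task loop stands in for the equivalent sequential programs that would run on the various nodes.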

Domain decomposition
Another decomposition technique is called domain decomposition. Problems subject to domain decomposition are usually characterised by a large, discrete, static data structure. Decomposing the "domain" of the computation, that is the fundamental data structures, provides the road map for writing the program.
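A hedged sketch of what such a decomposition of the data domain might look like: partitioning the rows of a matrix into contiguous blocks, one per node (the function name and block scheme are illustrative assumptions, not taken from the paper):

```python
def row_blocks(nrows, nodes):
    """Assign a contiguous row range of a static data structure to each
    node, distributing any remainder one extra row at a time."""
    base, extra = divmod(nrows, nodes)
    ranges, start = [], 0
    for i in range(nodes):
        size = base + (1 if i < extra else 0)  # first `extra` nodes get one more row
        ranges.append((start, start + size))
        start += size
    return ranges
```

The resulting row ranges define which part of the domain each workstation owns and, implicitly, the communication topology along block boundaries.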

Control decomposition
Another major decomposition strategy is called control decomposition. When there is no static structure or fixed determination of the number of objects or calculations to be performed, domain decomposition is not appropriate. Instead, you can focus on the flow of control in the application. As development progresses you will also distribute the data structures, but the guideline for development remains the flow of control.

Object-oriented programming
Object-oriented programmers view applications as a set of abstract data structures, or objects. Tasks are associated with these objects, so there is no confusion about which parts of the code and data affect other parts.

The theoretical part - the discrete Fourier transform
The discrete Fourier transform (DFT) has played an important role in the evolution of digital signal processing techniques. It has opened up new signal processing techniques in the frequency domain which are not easily realisable in the analogue domain. The discrete Fourier transform (DFT) is defined as [3,8,17]

  X_m = Σ_{n=0}^{N−1} x_n · w^{mn},  m = 0, 1, …, N−1

and the inverse discrete Fourier transform (IDFT) as

  x_n = (1/N) Σ_{m=0}^{N−1} X_m · w^{−mn},  n = 0, 1, …, N−1,

in which w is the N-th root of unity, that is w = e^{−i(2π/N)}, for generally complex numbers. In principle the above equations are linear transforms. Direct computation of the DFT or the IDFT according to the definitions requires N² complex arithmetic operations.
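The definition above can be evaluated directly; a minimal Python sketch (illustrative only, not the authors' implementation) of the O(N²) computation:

```python
import cmath

def dft(x):
    """Direct O(N^2) evaluation of the DFT: X_m = sum_n x_n * w^(m*n),
    with w the N-th root of unity e^(-i*2*pi/N)."""
    n = len(x)
    w = cmath.exp(-2j * cmath.pi / n)
    return [sum(x[k] * w ** (m * k) for k in range(n)) for m in range(n)]
```

For example, the DFT of the constant sequence [1, 1, 1, 1] concentrates all energy in the zero-frequency bin, which is a quick sanity check of the N² double loop.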
Counting operations in this way takes into account only the calculation times and not also the overhead times caused by the parallel way in which an algorithm is implemented.
Cooley and Tukey [3,7] introduced the fast Fourier transform (FFT), which reduces this complexity to the order of N log₂ N operations. The basic form of the parallel DFFT is the one-dimensional (1-D), unordered, radix-2 form (a use of the divide-and-conquer strategy according to the principle in Fig. 2). Effective parallel computing of the DFFT tends towards computing one-dimensional FFTs with radix equal to or greater than two and computing multidimensional FFTs using polynomial transform methods. In the practical part of this article we computed the 2DFFT (two-dimensional DFFT). In general, a radix-q DFFT is computed by splitting the input sequence of size n into q sequences of size n/q each, computing the q smaller DFFTs, and then combining the results. For example, in a radix-4 FFT each step computes four outputs from four inputs, and the total number of iterations is log₄ n rather than log₂ n. The input length should, of course, be a power of four. Parallel formulations of higher-radix (e.g. radix-3 and radix-5) 1-D or multidimensional DFFTs are similar to the basic form because the underlying ideas behind all sequential DFFTs are the same. An ordered DFFT is obtained by performing bit reversal (a permutation) on the output sequence of an unordered DFFT. Bit reversal does not affect the overall complexity of a parallel implementation.
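The radix-2 divide-and-conquer principle of Fig. 2 can be sketched as a short recursive routine (again an illustrative Python sketch, not the measured WNT programs); the two half-size subproblems are exactly the parts that could be assigned to different network nodes:

```python
import cmath

def fft(x):
    """Radix-2 divide-and-conquer DFFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])   # the two half-size subproblems are independent,
    odd = fft(x[1::2])    # so they could run on different workstations
    out = [0] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t            # butterfly: combine step of the
        out[k + n // 2] = even[k] - t   # divide-and-conquer recursion
    return out
```

This is the unordered form; an ordered transform would additionally apply the bit-reversal permutation mentioned above.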

Performance evaluation
For the performance evaluation of parallel algorithms we can use analytical approaches to derive, under given constraints, relations such as the well-known theorem of Munro and Paterson [7], Amdahl's law [14,15], Gustafson's law [14,15] etc. But all these relations have been derived in an idealised way, without considering architecture and communication complexity. That means the complexity C_p is a function only of the parallel algorithm's calculation. Such an assumption could be realistic in some centralised multiprocessor systems, but not in NOWs (networks of workstations based on personal computers).
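For reference, the two idealised laws mentioned above can be written as one-line formulas; the sketch below is a standard textbook statement of them (with an assumed serial-fraction parameter), not a result from this paper:

```python
def amdahl_speedup(serial_fraction, p):
    """Amdahl's law: speed-up on p processors when a fixed fraction of
    the work is inherently serial (no overheads considered)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)

def gustafson_speedup(serial_fraction, p):
    """Gustafson's law: scaled speed-up when the parallel part of the
    problem grows with the number of processors."""
    return p - serial_fraction * (p - 1)
```

Both expressions ignore architecture and communication complexity, which is exactly the limitation the text points out for NOWs.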
In such a parallel system we have to take into account all complexity elements according to the relation C_p = f(architecture, communication, calculation). In such a case we can use the following solution methods to obtain the complexity:
• Direct measurement - real experimental measurement of C_p and its components for a concrete developed parallel algorithm on a concrete parallel system [7,8]
• Analytical - finding C_p on the basis of closed analytical expressions or statistical distributions for the overheads [9,10,11,12,13]
• Simulation [2,4,5] - modelling C_p and its components for a concrete developed parallel algorithm on a NOW.

The results
For the direct measurement of the complex performance evaluation in a NOW we used the structure according to Fig. 3.

Fig. 3 Illustration of the measurement in a NOW (Ethernet network)
The developed parallel algorithms were divided into two logical parts - a manager and servers. All programs were written on the WNT (Windows New Technology) platform. The manager controls the computers, starting services, making connections and starting the remote functions in a parallel way. At the end it sums up the partial results. Every server waits for the calculation to start and then computes its partial results. At the end of the calculation it returns to the manager the calculated results and the calculation time. The results are not only the computed values but also the computation, communication and synchronisation times (all overhead components for a given parallel algorithm). To measure these times we used the QueryPerformanceCounter function, which measures calculation times in ms. The calibration power results of our experiments for the 2DFFT are shown in Fig. 4. The achieved results for the 2DFFT algorithm document an increase of both the computation and the communication parts in a geometrical way with a quotient value of nearly four (doubling the matrix dimension means doing twice as much computation on the columns and twice as much on the rows). Therefore, for better illustration, we used dependencies on the relative input load. In these experiments we used the computers listed in Table 1.
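The per-phase timing that the servers report back to the manager can be sketched as follows; this uses Python's `time.perf_counter` as a portable stand-in for the Win32 QueryPerformanceCounter call mentioned in the text, and the helper name is an assumption for illustration:

```python
import time

def timed(fn, *args):
    """Measure one phase (computation or communication) and report the
    elapsed time in ms, the way each server returns its calculation
    time to the manager alongside the computed result."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms

# e.g. separating a calculation time from the other overhead components:
result, calc_ms = timed(sum, range(100000))
```

Timing each phase separately is what allows the computation, communication and synchronisation parts to be reported individually.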
Table 1. Parameters of the used personal computers
The results in the Ethernet NOW are graphically illustrated for the 2DFFT in Fig. 5. For a better graphical illustration we limited the measured values to the WS1 network node. The achieved results for the 2DFFT algorithm document an increase of both the computation and the communication parts in a geometrical way with a quotient value of nearly four. The influence of the matrix dimension on the network load is illustrated in Fig. 6.
The percentile amounts of the individual parts (computation, overheads - network load, initialisation) of the 2DFFT execution time for a 1024 × 1024 matrix are illustrated in Fig. 7. The high network loads are caused by the matrix transpositions needed during the 2DFFT computation.
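Why the transpositions dominate the network load can be seen from the row-column structure of the 2DFFT: row transforms, a transposition, row transforms again. A minimal sketch of this structure (illustrative Python, assuming power-of-two dimensions; not the measured implementation):

```python
import cmath

def fft1(x):
    """Radix-2 1-D FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft1(x[0::2]), fft1(x[1::2])
    out = [0] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out

def fft2(matrix):
    """2-D DFFT: FFT every row, transpose, FFT the rows again.
    When the rows live on different workstations, each transposition
    (the zip(*...) steps) becomes an all-to-all network exchange -
    the source of the high network load discussed above."""
    rows = [fft1(row) for row in matrix]
    cols = [fft1(col) for col in zip(*rows)]
    return [list(r) for r in zip(*cols)]
```

On a single machine the transposition is just an index swap; distributed over a NOW it is the all-to-all communication that the measurements attribute most of the overhead to.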

Conclusions
Distributed computing was reborn as a kind of "lazy parallelism": a network of computers can team up to solve many problems at once rather than one problem at higher speed. To get the most out of a distributed parallel system, designers and software developers must understand the interaction between the hardware and software parts of the system. It is obvious that the use of a computer network based on personal computers will in principle be less effective than the typical massively parallel architectures used in the world, because of the higher communication overheads; but a network of workstations based on powerful personal computers is a very cheap, flexible and promising asynchronous parallel system. We can see this trend in the recent dynamic growth of parallel architectures based on networks of workstations as a cheaper and more flexible architecture in comparison to conventional multiprocessors and supercomputers. Secondly, the principles used in multiprocessors realised around the world are now being implemented in new symmetric multiprocessor (SMP) systems built on a single motherboard. Unifying both of these approaches opens new possibilities for HPC computing in our country.
The next steps in the evolution of distributed parallel computing will take place on both fronts: inside and outside the box. Inside, parallelism will continue to be used by hardware designers to increase performance; Intel's new SIMD (single instruction, multiple data) or MMX (multimedia extensions) instructions are one example of this trend. In relation to the results achieved at our faculty (Faculty of Control and Informatics, Zilina), we are able to do better load balancing among the network nodes (performance optimisation of a parallel algorithm). For these purposes we can use the calibration results of the network nodes in order to apportion the input load according to the measured performance power of the network nodes used. Secondly, we can balance the load between network nodes based on modern SMP parallel systems and network nodes with only single processors. Generally we can say that parallel algorithms with higher communication overheads (similar to the analysed 2DFFT algorithm) will have better speed-up values on a modern SMP parallel system than in a parallel implementation in a NOW. For algorithms with small or constant communication overheads (similar to the computations in [7,14,15]) we can prefer to use the other network nodes based on single processors.
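Apportioning the input load according to measured node performance might look like the following sketch (the function name and the per-node power figures are illustrative assumptions; the paper's calibration data are in Fig. 4):

```python
def balance_load(total_rows, powers):
    """Split total_rows among network nodes in proportion to their
    measured calibration power; `powers` is a hypothetical list of
    per-node performance figures. Leftover rows (from integer
    division) go to the fastest nodes first."""
    weight = sum(powers)
    shares = [total_rows * p // weight for p in powers]
    leftover = total_rows - sum(shares)
    for i in sorted(range(len(powers)), key=lambda i: -powers[i])[:leftover]:
        shares[i] += 1
    return shares
```

A node measured as twice as powerful thus receives twice as many rows, which is the performance optimisation of the parallel algorithm referred to above.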
Queueing network and Petri net models, simulation, experimental measurements and hybrid modelling have been successfully used for the evaluation of system components. In the form of experimental measurement we illustrated the use of this technique for the complex performance evaluation of parallel algorithms; in this context we presented the first part of the achieved results. We would like to continue these experiments in order to derive more precise and general formulae (generalisations of the Amdahl and Gustafson laws used) and to develop suitable synthetic parallel tests (SMP, NOW) to predict performance in a NOW for some typical parallel algorithms from linear algebra and other application-oriented algorithms. We will report on these results in the future.