MULTI-THREADED ANT COLONY OPTIMIZATION WITH ASYNCHRONOUS COMMUNICATIONS FOR THE VEHICLE ROUTING PROBLEM



Introduction
In the field of combinatorial problems, the Vehicle Routing Problem (VRP), introduced by [1], is one of the most challenging. This optimization problem and its variants have many applications in telecommunications, transportation and logistics. Unfortunately, most of these problems are NP-hard, so no polynomial-time exact algorithm is known and, in the worst case, exponential time is required to find an optimal solution. The VRP asks for a set of vehicle routes starting and ending at the depot such that each customer is visited exactly once, the demand of each customer is satisfied, and neither the maximum tour length nor the vehicle capacity is violated. The objective of the VRP is to minimize the total travel costs. Exact algorithms can be used only for relatively small instances. In practice, all known solutions of larger instances come from heuristic or metaheuristic algorithms. Metaheuristic methods in particular appear to produce good-quality solutions in relatively short calculation time.
The Ant Colony Optimization (ACO) method developed by [3] has become very successful for solving the VRP. The method is inspired by the behaviour of real ants: each ant deposits pheromone on the ground as a signal that other ants should follow it, and deposited pheromone evaporates over time. When ants succeed in finding food, the corresponding path is travelled more often and more pheromone is deposited on it, so other ants become more likely to choose the same path. In computer science this behaviour is modelled iteratively by repeatedly calling procedures that create solutions by exploring the fully connected graph of customers. At every vertex the artificial ant decides how to continue. This decision is problem-specific and is influenced by two factors: common memory (the pheromone information) and heuristic information. Once solutions have been created, the best ones are used to update the common memory according to their quality. This update is done after all artificial ants have finished their search. The whole procedure is repeated as many times as required.
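The decision and update steps described above can be sketched generically. This is a minimal illustration and not the Savings based algorithm of [4]: it assumes the common ACO rule in which a move is chosen with probability proportional to pheromone^alpha times heuristic^beta, followed by an update that first scales every trail by a persistence factor and then reinforces the edges of the best solution. The names `choice_probabilities`, `update_pheromone`, `persistence` and `deposit` are hypothetical.

```python
def choice_probabilities(pheromone, heuristic, alpha=1.0, beta=1.0):
    """Probability of each candidate move, proportional to tau^alpha * eta^beta."""
    weights = [(t ** alpha) * (h ** beta) for t, h in zip(pheromone, heuristic)]
    total = sum(weights)
    return [w / total for w in weights]


def update_pheromone(tau, best_edges, persistence=0.95, deposit=1.0):
    """Evaporate every trail, then reinforce the edges of the best solution."""
    new_tau = {edge: persistence * value for edge, value in tau.items()}
    for edge in best_edges:
        new_tau[edge] = new_tau.get(edge, 0.0) + deposit
    return new_tau


# Two edges leave node "a"; only ("a", "b") belongs to the best solution.
tau = {("a", "b"): 1.0, ("a", "c"): 1.0}
tau = update_pheromone(tau, best_edges=[("a", "b")])

# With equal heuristic values, the reinforced edge is now more likely.
probs = choice_probabilities(
    pheromone=[tau[("a", "b")], tau[("a", "c")]],
    heuristic=[1.0, 1.0],
)
```

After one update the reinforced edge carries more pheromone than the other, so an ant standing at "a" is more likely to follow it in the next iteration.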
In our work we have rewritten the parallel Savings based ACO algorithm for the VRP described in [4], which uses a synchronous communication model with the Message Passing Interface, into a thread-based asynchronous model. We study the characteristics of the algorithm under decentralized asynchronous communication on four larger instances: C5 published by [2] and G18, G19, G20 published by [6]. Optimal solutions for these instances have not yet been computed exactly. This paper is organized as follows. In the following two sections we briefly describe the Vehicle Routing Problem and the ACO parallelization strategies, and propose the asynchronous algorithm. Section 4 presents the computational results: the dependence of speedup and efficiency on the number of threads used, together with solution quality, execution time and communication time. The last section concludes with several remarks and an outlook on future work.

Formulation of the Vehicle Routing Problem
According to [2], the Vehicle Routing Problem (VRP) can be described as follows. Let $G = (V, E, c)$ be a complete graph with $N + 1$ nodes $(v_0, \dots, v_N)$, corresponding to the customers $i = 1, \dots, N$ and the depot $i = 0$, and the edge set $E$. With each edge $(v_i, v_j) \in E$ is associated a non-negative weight $c_{ij}$, which refers to the travel costs between nodes $v_i$ and $v_j$, and a non-negative weight $t_{ij}$, which refers to the distance between the nodes.

In this paper we study the behaviour of an Ant Colony Optimization algorithm for solving the Vehicle Routing Problem, implemented with POSIX Threads in a parallel cluster environment. The algorithm is based on a fine-grained parallelism strategy which uses asynchronous communication for cooperation in finding solutions. Our aim is to analyze the effect of the proposed method on speedup, execution time and communication time with respect to the quality of the solution.
Keywords: Ant Colony Optimization, Parallel Metaheuristic, Vehicle Routing Problem, POSIX threads.
Furthermore, with each node $v_i$, $i = 1, \dots, N$, is associated a non-negative demand $d_i$, which has to be satisfied, as well as a service time $\delta_i$. The service time at the depot is set to $\delta_0 = 0$. At the depot a fleet of size $K$ is available, where each vehicle has a capacity of $Q_k$ and the maximum driving time for each vehicle is $T_k$.
Let $x_{ij}^k$ denote the binary decision variable which equals 1 if vehicle $k$ visits node $v_j$ immediately after node $v_i$ and 0 otherwise.
The objective is written as (1), subject to the restrictions (2) to (7). The objective (1) is to minimize the total travel costs. Constraints (2) ensure that no vehicle can be overloaded. Constraints (3) require that for each vehicle the maximum driving time is respected. Constraints (4) ensure that a vehicle which visits a customer also leaves the customer. Constraints (5) require that all customers are visited exactly once, and that the depot is left $K$ times. Through constraints (6) sub-tour elimination is ensured. Constraints (7) are the usual binary constraints.
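One standard way to write a model that matches the verbal descriptions of (1) to (7) is sketched below. It is an assumption built from the symbols defined in this section ($c_{ij}$, $t_{ij}$, $d_i$, $\delta_i$, $Q_k$, $T_k$, $x_{ij}^k$) and may differ in detail from the formulation in [2]. In particular, it assumes that $t_{ij}$ and $\delta_i$ are measured in the same units as $T_k$, and that all sums run over $i \neq j$.

```latex
\begin{align}
\min \quad & \sum_{k=1}^{K} \sum_{i=0}^{N} \sum_{j=0}^{N} c_{ij}\, x_{ij}^{k} \tag{1} \\
\text{s.t.} \quad
& \sum_{i=0}^{N} \sum_{j=1}^{N} d_{j}\, x_{ij}^{k} \le Q_{k}
  && k = 1, \dots, K \tag{2} \\
& \sum_{i=0}^{N} \sum_{j=0}^{N} \left( t_{ij} + \delta_{i} \right) x_{ij}^{k} \le T_{k}
  && k = 1, \dots, K \tag{3} \\
& \sum_{i=0}^{N} x_{ij}^{k} = \sum_{i=0}^{N} x_{ji}^{k}
  && j = 0, \dots, N;\ k = 1, \dots, K \tag{4} \\
& \sum_{k=1}^{K} \sum_{i=0}^{N} x_{ij}^{k} = 1 \quad (j = 1, \dots, N),
  \qquad \sum_{k=1}^{K} \sum_{j=1}^{N} x_{0j}^{k} = K \tag{5} \\
& \sum_{v_i, v_j \in S} x_{ij}^{k} \le |S| - 1
  && S \subseteq V \setminus \{v_0\},\ |S| \ge 2;\ k = 1, \dots, K \tag{6} \\
& x_{ij}^{k} \in \{0, 1\}
  && i, j = 0, \dots, N;\ k = 1, \dots, K \tag{7}
\end{align}
```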

Parallelization of ACO
An interesting feature of ACO is how well it lends itself to parallelization. Each ant performs a relatively simple and independent task. The only dependence arises from sharing the same pheromone matrix and the same best solution found so far: the former is required for the decision-making process and the latter for measuring the quality of a generated solution. Parallelization of ACO can serve several goals, such as reducing calculation time, increasing solution quality or speeding up convergence. To reduce calculation time we can split the colony of ants among processors and let each processor calculate its part of the colony. This approach is known as functional decomposition [10]. Alternatively, we can apply domain decomposition by dividing the customers into subsets and letting processors calculate solutions of the sub-problems. In either case, communication between processors can be synchronous or asynchronous. A classification of ACO parallelizations for the VRP can be found in [4] and [5]. In short, we can use fine-grained, coarse-grained and mixed parallelization. Fine-grained parallelization splits the ants of one colony among processors, which communicate often to update the pheromone matrix and exchange the best solutions. Coarse-grained parallelization schemes run several colonies of ants in parallel, with each colony calculated on one processor; information is exchanged between processors at certain time intervals and concerns only specified parts of the information. The mixed approach is a combination of the first two.
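Functional decomposition, as described above, amounts to assigning each worker a share of the colony. The sketch below shows one simple way to do this so that the shares differ by at most one ant. The function name `split_colony` and the numbers in the example are hypothetical and are not taken from the experiments.

```python
def split_colony(n_ants, n_workers):
    """Return how many ants each worker simulates; shares differ by at most one."""
    base, extra = divmod(n_ants, n_workers)
    # The first `extra` workers take one additional ant each.
    return [base + 1 if w < extra else base for w in range(n_workers)]


# Example: 10 ants spread over 4 workers.
shares = split_colony(n_ants=10, n_workers=4)
```

Balanced shares matter in a fine-grained scheme because a worker with noticeably more ants than the others finishes its iteration later and delays the information it contributes.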
In our approach we used a fine-grained parallelization strategy with a decentralized approach, whereby one ant colony is divided proportionally among several computing threads. A thread of execution [9] is one of several tasks into which a computer program is split and which can run concurrently. Threads share memory and other resources, but they run independently. Because threads in the shared-memory computational model do not need to change the address space, communication between threads is faster than communication between processes.
We suppose that all threads in our implementation have the same behaviour and calculate equally sized parts of the colony of ants. The number of threads is given by the number of processor cores used, and each core runs exactly one thread. Every thread computes its own pheromone update using information received from other threads. Where possible, inter-thread communication uses shared memory; otherwise the network is used. Concurrent access to the shared memory is protected by critical sections implemented with PThread mutexes [9]. Instead of the shared files proposed in [7], we used the User Datagram Protocol as the communication layer. If a thread has found a better solution and has to use the network layer, the datagram is sent only once per cooperating node. All received packets are stored in system buffers and are processed by the first thread which reads them. After the packet is processed, that thread publishes the best solution obtained to all threads in its group through shared memory. This approach does not require a separate thread dedicated to communication. In outline, each thread repeatedly constructs solutions for its share of the ants, publishes any improved solution, merges the solutions received from other threads, and then performs its own pheromone update.

The speedup is used to measure the quality of the parallelization and is defined by the following formula:

$$S_p = \frac{T_1}{T_p},$$

where $p$ is the number of processors, $T_1$ is the execution time of the sequential algorithm and $T_p$ is the execution time of the parallel algorithm with $p$ processors. Because our experiments use a dual-core architecture, we take $p$ to be the number of cores used rather than the number of processors. Similarly, the efficiency is a performance metric defined by the following formula:

$$E_p = \frac{S_p}{p} = \frac{T_1}{p \, T_p}.$$

This value is typically between zero and one and estimates how well the processors are utilized in solving the problem, compared with how much effort is wasted in synchronization and communication.
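The role of the mutex-protected critical section can be illustrated with a small Python sketch. It is an analogy under stated assumptions and not the PThread implementation studied here: `threading.Lock` stands in for a PThread mutex, ant construction is replaced by fixed lists of hypothetical solution costs, and the point where a datagram would be sent is only marked by a comment. The names `SharedBest`, `offer`, `worker` and `run_demo` are hypothetical.

```python
import threading


class SharedBest:
    """Best solution cost seen so far, shared by all threads of one node."""

    def __init__(self):
        self._lock = threading.Lock()
        self.cost = float("inf")
        self.owner = None

    def offer(self, cost, owner):
        """Publish a candidate; return True only if it improved the shared best."""
        with self._lock:  # critical section, analogous to a PThread mutex
            if cost < self.cost:
                self.cost, self.owner = cost, owner
                return True
            return False


def worker(thread_id, candidate_costs, shared, improvements):
    # Stand-in for ant construction: each thread walks a fixed list of costs.
    for cost in candidate_costs:
        if shared.offer(cost, thread_id):
            # In a networked setting, this is where other nodes would be notified.
            improvements.append((thread_id, cost))


def run_demo():
    shared = SharedBest()
    improvements = []
    costs_per_thread = [[50.0, 42.0], [47.0, 45.0], [60.0, 41.5], [44.0, 43.0]]
    threads = [
        threading.Thread(target=worker, args=(i, costs, shared, improvements))
        for i, costs in enumerate(costs_per_thread)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared, improvements
```

Calling `run_demo()` always leaves the lowest cost in the shared record, whatever the interleaving of the threads, because the comparison and the assignment happen inside one critical section. No thread waits for the others between offers, which is the asynchronous aspect of the scheme.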

Computational results
In our experiments we used a cluster consisting of 72 SUN X4100 nodes, each with two 64-bit dual-core processors. Therefore we could use at most 4 threads per node working over the common shared memory. Reported results are averages over 15 independent runs for all instances. The number of customers is denoted by $N$, and the configuration described below is used. Although we observed better solution quality with other configurations, we used the same parameter settings as proposed in [8] and used in [4] to keep the results comparable. We used $N$ artificial ants for each instance, $\alpha = \beta = 5$, $\sigma = 6$ elitist ants, the evaporation rate $\rho = 0.95$, and the neighborhood size $\lfloor N/2 \rfloor$. We ran the algorithm for $2N$ iterations for each instance. The algorithm did not send the whole pheromone matrix between cores. Only the best $\sigma$ solutions were chosen, compared with the best solutions found so far, and spread between nodes every time a better solution was found.
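The exchange of the best $\sigma$ solutions can be sketched as a merge of two short lists. This is a minimal illustration under assumptions: solutions are represented as hypothetical `(cost, route_id)` pairs, lower cost is better, and the function `merge_best` and its return convention are invented for the example.

```python
import heapq


def merge_best(current_best, candidates, sigma=6):
    """Merge candidates into the list of the sigma lowest-cost solutions.

    Returns the merged list (sorted by cost) and a flag telling whether the
    overall best cost improved, i.e. whether other nodes should be notified.
    """
    merged = heapq.nsmallest(sigma, set(current_best) | set(candidates))
    old_best = min(current_best)[0] if current_best else float("inf")
    improved = bool(merged) and merged[0][0] < old_best
    return merged, improved


# Hypothetical costs: one candidate beats the best solution known so far.
known = [(100.0, "r1"), (105.0, "r2")]
found = [(98.5, "r3"), (110.0, "r4")]
merged, improved = merge_best(known, found, sigma=3)
```

Sending only a few solutions keeps each message small compared with sending the full pheromone matrix, whose size grows quadratically with the number of customers.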
In Table 1 we can see that the time spent on communication does not increase linearly as more cluster nodes are used. We can also see that efficiency decreases as the number of threads increases. The achieved efficiency for 32 cores is 0.59 for instance C5 and 0.73 for instance G19. These values are better than those published in [4], where the efficiency is 0.37 and 0.39, respectively. We can therefore conclude that asynchronous communication is more suitable for ACO than synchronous communication, especially for larger instances. The lower speedup of instance C5 compared with G19 is caused by the shorter time required to create a solution, which makes the ratio between communication and calculation higher. Efficiency greater than 1 on the C5, G18 and G20 instances is caused by a caching effect, where several threads run on separate cores of one processor. We can see that solution quality decreases with more threads: the loss is about 4% on C5 and 3% on G19 when 32 threads are used.
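Since efficiency is speedup divided by the number of cores, the reported efficiencies can be converted back into speedups. The helper names below are hypothetical; the efficiency values are the ones quoted above for 32 cores.

```python
def speedup(t_sequential, t_parallel):
    """Speedup: sequential time divided by parallel time."""
    return t_sequential / t_parallel


def efficiency(t_sequential, t_parallel, cores):
    """Efficiency: speedup divided by the number of cores used."""
    return speedup(t_sequential, t_parallel) / cores


def speedup_from_efficiency(eff, cores):
    """Invert the efficiency definition to recover the speedup."""
    return eff * cores


CORES = 32
asynchronous = {"C5": 0.59, "G19": 0.73}   # efficiencies reported here
synchronous = {"C5": 0.37, "G19": 0.39}    # efficiencies published in [4]

implied_async = {k: speedup_from_efficiency(e, CORES) for k, e in asynchronous.items()}
implied_sync = {k: speedup_from_efficiency(e, CORES) for k, e in synchronous.items()}
```

On 32 cores these efficiencies correspond to speedups of roughly 18.9 (C5) and 23.4 (G19) for the asynchronous version, against roughly 11.8 and 12.5 for the synchronous one.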

Conclusions
We presented a parallel, POSIX Threads based implementation of the ACO method that uses an asynchronous cooperative approach for solving the VRP. We measured its speedup and efficiency in comparison with the synchronous approach published in [4]. We showed that asynchronous communication increases the efficiency of the ACO algorithm.
In our future work we would like to apply the presented asynchronous approach to the mixed, multi-colony ACO parallelization, with a focus on increasing the quality of the solution. We would like to test the algorithm with different configurations. We also plan to test how efficiency and solution quality depend on the instance size in parallel asynchronous ACO algorithms.
Calculated average results according to the number