IMPROVING INITIAL POPULATION FOR GENETIC ALGORITHM USING THE MULTI LINEAR REGRESSION BASED TECHNIQUE (MLRBT)

Resume
Genetic algorithms (GAs) are powerful heuristic search techniques that have been used successfully to solve problems in many different applications. Seeding the initial population is the first step of a GA. In this work, a new method for initial population seeding is proposed, called the Multi Linear Regression Based Technique (MLRBT). The method divides a given large-scale Travelling Salesman Problem (TSP) into smaller sub-problems and is applied repeatedly until the sub-problem size is very small, four cities or fewer. Experiments were carried out on well-known TSP instances, and they showed promising results in improving the GA's performance in solving the TSP.
Esra'a Alkafaween1,*, Ahmad B. A. Hassanat1,2,3, Sakher Tarawneh1


ISSN 1335-4205 (print version) ISSN 2585-7878 (online version)
operator and selection strategy [12]. In this paper, a new enhanced initial population seeding technique is proposed to increase the GA performance.
The proposed technique differs from previous techniques in that it divides the problem into sub-problems using the regression line, which indicates the relationship between the points in the xy coordinates. Each sub-problem results from the intersection between the regression line and the rotated line at the center point. Then, the initial population is determined by reconnecting the sub-problems. The experimental results show that the proposed technique is highly efficient in several respects: error rate, average convergence and convergence diversity. The contribution of this work is a new enhanced approach for generating the initial population to be used in solving decision-making problems, especially the TSP.
Each individual in the population is called a chromosome and represents a solution [13]. Sampling this initial population generates an intermediate population, to which reproduction, crossover and mutation can then be applied [14]. This process is repeated until the desired number of generations is reached or the convergence criterion adopted in the design is met. Hence, initializing the population is necessary in order to start the process of evolution in the GA [15].

Introduction
The Genetic Algorithm (GA) is one of the effective and robust machine learning algorithms [1]. Many studies have examined GAs and exploited their capabilities in designing smart systems and solving problems [2][3][4]. In general, GAs are concerned with how to produce new chromosomes (individuals) that possess certain features through recombination (crossover) and mutation operators [5][6], so that individuals with appropriate characteristics have the strongest chance of survival [7]. Typically, a GA starts with a number of random solutions (the initial population); this is the first phase of the GA. This phase generates a set of possible solutions either randomly or by heuristic initialization. Although the initial population seeding phase is executed only once, it plays an important role in improving the GA performance.
The GA aims to produce many candidate solutions to a specific problem, such as the TSP, which is a common problem in the area of Artificial Intelligence [8][9]. Several previous studies examined the GA procedure and how it can be exploited to solve sophisticated problems in finance, medicine, mathematics and engineering [10][11].
Efficiency of the GA is based on many factors, such as initial population, crossover operator, mutation
Bank) is the j-th closest city to city i. Therefore, the i-th row in the Gene Bank includes the C closest cities for city i. For each solution, the initial city i is chosen randomly and the construction starts from row i of the Gene Bank. "After that, the method selects city j, where j is the nearby one in the unvisited elements of the i-th row. Then, city k is selected from the j-th row of gene bank as the next city. If all the city codes of the j-th row have been selected, then next city is chosen randomly from the set of unvisited cities" as mentioned by [18]. This process is repeated until a solution of size N has been generated.
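The gene bank construction and seeding described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of [18]: cities are 0-indexed (x, y) tuples, distances are Euclidean, the bank is stored with one row per city, and the names `build_gene_bank` and `gene_bank_individual` are hypothetical.

```python
import math
import random

def build_gene_bank(coords, c):
    """Row i lists the c closest cities to city i, nearest first."""
    n = len(coords)
    if not c < n - 1:
        raise ValueError("C should be less than N-1")
    bank = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: math.dist(coords[i], coords[j]))
        bank.append(others[:c])
    return bank

def gene_bank_individual(bank, rng):
    """Build one tour by following gene bank rows, with a random fallback."""
    n = len(bank)
    current = rng.randrange(n)          # random initial city i
    tour = [current]
    visited = {current}
    while len(tour) < n:
        # nearest unvisited city in the row of the current city
        candidates = [j for j in bank[current] if j not in visited]
        if candidates:
            nxt = candidates[0]
        else:
            # whole row already used: pick any unvisited city at random
            nxt = rng.choice([j for j in range(n) if j not in visited])
        tour.append(nxt)
        visited.add(nxt)
        current = nxt
    return tour
```

Calling `gene_bank_individual` repeatedly with the same bank yields different individuals because both the starting city and the fallback choices are random.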

C. Nearest neighbor initialization technique
The nearest neighbor (NN) technique is among the most common initial population seeding techniques. NN can be used as an efficient method for generating initial population solutions, especially when the TSP is to be solved with a GA. The process of generating each individual begins by randomly selecting a city to be the starting city; then, the city nearest to the starting city is added and becomes the new current city. The nearest city that has not yet been added is then appended to the individual, and this is repeated until all the cities are included. Because each individual is built by repeatedly moving to the city nearest to the current one, the generated individuals are expected to enhance the evolving search process [19].
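A minimal sketch of the nearest neighbor construction described above is given here for illustration. It assumes Euclidean distances between 0-indexed (x, y) cities; the function name `nearest_neighbor_tour` and its optional `start` argument are hypothetical conveniences, not part of [19].

```python
import math
import random

def nearest_neighbor_tour(coords, start=None, rng=random):
    """Greedy tour: always move to the nearest city not yet visited."""
    n = len(coords)
    current = rng.randrange(n) if start is None else start
    tour = [current]
    unvisited = set(range(n)) - {current}
    while unvisited:
        # nearest unvisited city to the current city
        nxt = min(unvisited,
                  key=lambda j: math.dist(coords[current], coords[j]))
        unvisited.remove(nxt)
        tour.append(nxt)
        current = nxt
    return tour
```

Since the only random element is the starting city, at most as many distinct individuals as there are cities can be produced this way.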

D. K-means initial population (KIP)
To improve the process of initializing the population of the GA, several studies used the K-means clustering algorithm, especially in the case of the TSP, such as [20][21][22]. These studies used K-means clustering to split a large-scale TSP into K small groups, where K = [√N + 0.5] and N denotes the number of cities. Then, the KIP is applied to the GA to find the local optimal path for each group and a global optimal path that connects the local optimal solutions.

E. Initialization mechanism based on the regression techniques
A new initialization technique has been designed to improve the GA for solving the well-known TSP. It is based on the regression line and the perpendicular line that crosses the regression line at the center point, which together divide a large-scale TSP problem into small sub-problems. The resulting sub-problems are repeatedly classified to fit

Related work
Various initialization techniques have been introduced since the GA concepts appeared, such as the random technique, the nearest neighbor technique, the Gene Bank (GB) technique, the K-means clustering technique and the Initialization Mechanism Based on Regression Techniques. The random technique is considered one of the most suitable and most widely used techniques for seeding the initial population. However, it may produce solutions of poor fitness, which reduces the possibility of finding the optimal solution.
A brief background review of several initial population seeding techniques used for GAs is presented here.

A. Random initialization technique
This technique is widely used in machine learning algorithms, and in the GA in particular, because it is the simplest way to seed the initial population. In addition, researchers prefer this technique especially when little prior information about the expected optimal solution is available. As shown on the left-hand side of Figure 1, the successive cities of the initial solutions are chosen randomly, whereas the right-hand side shows initialization using K-means [16]. The figure illustrates the difference between random initialization and other initialization techniques such as K-means. Most researchers use the phrase "generate an initial population" to indicate that they use the random initialization technique. In the case of the TSP, random initialization selects the cities randomly by generating random numbers from 1 up to n, one for each city. As stated in [17], "if the current individual already contains the generated number, then it generates a new number. Otherwise, the generated number is added to the current individuals". This process is repeated until the desired individual size (n) is reached.
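The rejection-based procedure quoted from [17] can be sketched as follows. This is an illustrative reading of the description, with cities labelled 1 to n; the function names are hypothetical.

```python
import random

def random_individual(n, rng):
    """Build one random tour over cities 1..n by redrawing duplicates."""
    individual = []
    seen = set()
    while len(individual) < n:
        city = rng.randint(1, n)    # random number from 1 up to n
        if city in seen:            # already in the individual: draw again
            continue
        seen.add(city)
        individual.append(city)
    return individual

def random_population(n, pop_size, seed=None):
    """Generate pop_size independent random individuals."""
    rng = random.Random(seed)
    return [random_individual(n, rng) for _ in range(pop_size)]
```

In practice, shuffling the list of cities (for example with `random.shuffle`) gives the same uniform distribution over tours without repeated redraws; the loop above simply mirrors the quoted description.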

B. Gene bank initialization technique
The Gene Bank is a database of candidate solutions for the initial population, built with regard to quality and diversity [18]. In the case of the TSP, where N is the number of cities that the salesman travels to, the cities are permuted and assembled to build the gene bank. For each city, the C nearest cities are encoded, where C should be less than (N-1). As a result, the Gene Bank is given as a matrix A of size (C × N). For example, A[i][j] is an element in the matrix (Gene

• Initially, the large-scale TSP problem is divided into four small sub-problems by using the regression line on X and the regression line on Y, and the method classifies the resulting points into four categories. Each category is then recursively divided into four new sub-categories using the regression line on X and the regression line on Y. This process continues until the target category is reached, which includes a small number of (x, y) points: at most four cities are assigned to each category, and they are considered as the initial population for the TSP sub-problem. The process ends when the local optimal solution has been obtained for each category.
• Secondly, all the local optimal solutions are reconnected in order to rebuild the initial population seeding. Finally, to obtain N solutions, the method mutates the initial population N times, where N denotes the population size.
To show how regression population seeding works, Figure 2 shows the main steps using one of the TSP instances (a280).
The following steps illustrate the design of the MLRBT:
Step 1: the points are divided into two categories using the regression line equation (y = a + bx). As shown in Figure 3, each section includes an almost equal number of nodes.
Step 2: the points are divided into two sections using the regression line equation (x = a + by). The diagram is thereby divided into four sections, each including an approximately equal number of cities, as shown in Figure 4.
Step 3: the two regression lines are treated as axes, and the diagram is divided into four sections as shown in Figure 5:
• Section 1: (A is all the points above the positive x-axis and to the right of the positive y-axis).
• Section 2: (B is all the points above the negative x-axis and to the left of the positive y-axis).
• Section 3: (C is all the points below the negative x-axis and to the left of the negative y-axis).
• Section 4: (D is all the points below the positive x-axis and to the right of the negative y-axis).
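Steps 1 to 3 can be sketched as follows. This is a minimal illustration under stated assumptions: both lines are fitted by ordinary least squares, "above" is tested against the line y = a + bx, "right" is tested against the line x = a + by, and points lying exactly on a line are counted as below or left. The names `fit_line` and `classify_four` are hypothetical.

```python
def fit_line(us, vs):
    """Least-squares fit of v = a + b*u; returns (a, b)."""
    n = len(us)
    mean_u, mean_v = sum(us) / n, sum(vs) / n
    s_uu = sum((u - mean_u) ** 2 for u in us)
    s_uv = sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))
    b = s_uv / s_uu if s_uu else 0.0   # flat line if all u are equal
    return mean_v - b * mean_u, b

def classify_four(points):
    """Split (x, y) points into sections A, B, C, D of Step 3."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    a1, b1 = fit_line(xs, ys)          # regression line y = a1 + b1*x
    a2, b2 = fit_line(ys, xs)          # regression line x = a2 + b2*y
    sections = {"A": [], "B": [], "C": [], "D": []}
    for x, y in points:
        above = y > a1 + b1 * x
        right = x > a2 + b2 * y
        if above:
            key = "A" if right else "B"
        else:
            key = "D" if right else "C"
        sections[key].append((x, y))
    return sections
```

Both fitted lines pass through the centroid of the points, so the four sections always meet at that point.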
into the four categories to obtain local optimal solutions [23].
The procedure of that method starts by dividing the large-scale TSP into four small sub-problems using the regression line and the perpendicular line, and then classifying the points into four categories. Each category is divided into four new categories recursively, again using the regression line and the perpendicular line. The process carries on until the target category is reached, which contains a small number of (x, y) points: at most four cities are assigned to each category, and they are considered as the initial population for the TSP sub-problem. The process ends when the local optimal solution has been obtained for each category.
Second, the initial population seeding is rebuilt by reconnecting all the local optimal solutions. Finally, the initial population is mutated N times to obtain N solutions, where N is the population size.
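The reconnection and mutation stages can be sketched as follows. The text does not specify how the local solutions are joined or which mutation operator is used, so this sketch makes two labeled assumptions: local paths are chained greedily by appending the group whose first city is nearest to the current end of the tour, and mutation swaps two randomly chosen positions. All function names are hypothetical.

```python
import math
import random

def reconnect(groups):
    """Chain local paths of (x, y) cities into one seed tour (assumed rule)."""
    remaining = [list(g) for g in groups if g]
    if not remaining:
        return []
    tour = remaining.pop(0)
    while remaining:
        end = tour[-1]
        # assumed rule: next group is the one starting nearest to the tour end
        k = min(range(len(remaining)),
                key=lambda i: math.dist(end, remaining[i][0]))
        tour.extend(remaining.pop(k))
    return tour

def swap_mutation(tour, rng):
    """Return a copy of the tour with two positions exchanged (assumed operator)."""
    i, j = rng.sample(range(len(tour)), 2)
    child = list(tour)
    child[i], child[j] = child[j], child[i]
    return child

def seed_population(seed_tour, pop_size, rng):
    """Mutate the seed tour pop_size times to obtain pop_size solutions."""
    return [swap_mutation(seed_tour, rng) for _ in range(pop_size)]
```

Each mutated individual remains a valid permutation of the cities, so the population can be passed directly to the GA.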
The technique presented in this paper is compared to that technique [23], because the latter has already outperformed two other population seeding methods: random and nearest neighbor.

The proposed method
This paper proposes a new initialization technique, designed to enhance the GA for solving the TSP. The proposed technique is called the Multi Linear Regression Based Technique (MLRBT). It divides the TSP problem into small sub-problems using the regression line on X and the regression line on Y; the two lines cross at a point that is not necessarily the center. The same process is then repeated on the resulting sub-problems, which are classified to fit into the four categories in order to obtain local optimal solutions.
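The recursive division can be sketched as follows. This is a minimal illustration, not the authors' implementation: both regression lines are fitted by least squares, points on a line count as below or left, and a guard (an addition of this sketch) stops the recursion when a split fails to separate the points, for example when all points are collinear. The names `fit_line`, `split_four` and `divide` are hypothetical.

```python
def fit_line(us, vs):
    """Least-squares fit of v = a + b*u; returns (a, b)."""
    n = len(us)
    mean_u, mean_v = sum(us) / n, sum(vs) / n
    s_uu = sum((u - mean_u) ** 2 for u in us)
    s_uv = sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))
    b = s_uv / s_uu if s_uu else 0.0
    return mean_v - b * mean_u, b

def split_four(points):
    """Split points into four categories using both regression lines."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    a1, b1 = fit_line(xs, ys)      # y = a1 + b1*x (regression on X)
    a2, b2 = fit_line(ys, xs)      # x = a2 + b2*y (regression on Y)
    parts = [[], [], [], []]       # A, B, C, D
    for x, y in points:
        above = y > a1 + b1 * x
        right = x > a2 + b2 * y
        index = (0 if right else 1) if above else (3 if right else 2)
        parts[index].append((x, y))
    return parts

def divide(points, max_size=4):
    """Recursively divide until every group holds at most max_size cities."""
    if len(points) <= max_size:
        return [points] if points else []
    parts = split_four(points)
    if any(len(p) == len(points) for p in parts):
        return [points]            # degenerate split: stop to avoid looping
    groups = []
    for part in parts:
        groups.extend(divide(part, max_size))
    return groups
```

For example, `divide` applied to a 6 × 6 grid of integer points returns groups of between one and four neighboring cities each.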
The main procedures of the proposed method include two main steps:
Figure 2 The (x, y) scatter for the TSP city (a280)
Figure 3 The regression line
Step 6: Select a random city to be the starting city, and then repeatedly add the nearest city as the new current city until all the cities in the category are connected in a local path. The group in each category is connected with the nearest group in the other categories until all the groups are connected in a global path. This method gives only one solution; to derive a population of a specific size, the mutation process is used. The seed solution is mutated (n-1) times in order to derive the other solutions, where n is the size of the population.
Step 4: the method is repeated recursively on each of the four categories (A), (B), (C) and (D), as shown in Figure 6.
Step 5: the recursive call terminates if the number of points (cities) is less than or equal to four. Since the algorithm uses the regression line, the small groups of cities that it ends with are likely to be neighbors, closer to each other than to the other cities. Therefore, connecting them with each other is better than connecting any of them with more distant cities, as these local links are short and short local links contribute to finding a shorter global route.
Ten TSP instances were selected for the experiments: KroA100, eil51, pr76, KroA200, lin318, pr144, att532, rat783, d2103 and fnl4461, and the experiments were repeated ten times for each instance. The results showed that the MLRBT was more beneficial than the Regression Based Technique [21] across all the instance categories (small, medium and large) using the same parameters, which were fixed according to the previous research, i.e. the Regression Based Technique [21]. The RBT was previously compared to other techniques, the random and the NN, in different studies and was found superior. In addition, Table 2 shows that the MLRBT has an advantage for all the instances in terms of the best solution. It follows that the MLRBT is also more beneficial than the random and the NN techniques, and the performance achieved by the proposed solution was close to the optimal solution.
When investigating several initialization techniques, it is necessary to consider the performance factors that have been identified as measurements: error rate, average convergence and convergence diversity.

Figures 7 and 8 show a comparison of the two population initialization methods, the Initialization Mechanism Based on Regression Techniques [23] and the MLRBT, applied to two TSP instances: a280 and rat195.
As shown in Figures 7 and 8, the GA continues to enhance and optimize the solutions over the generations.

Results and analysis
To evaluate the proposed method, experiments on different TSP problems were conducted. The experiments include implementing the Multi Linear Regression Based Technique (MLRBT) together with the Regression Based Technique of [23], which was found superior to both the Random and the NN techniques. Table 1 shows the selected GA parameters.
Each technique was applied 10 times to each of the TSP instances, and the average over all executions was computed for the purpose of experimental analysis. The experiments were conducted using Microsoft Visual Studio 2008 and the TSP benchmark datasets obtained from TSPLIB [24]. The results of the experiments are shown in Table 2. Hence, the results are divided into

the individuals that were produced by each of the NN, Random and RBT techniques. This difference occurs because of the way the proposed technique divides the problem, which is an improvement over the RBT. Table 3 also shows the experimental results of "the initial population techniques w.r.t. error rate for the best individuals and the worst individuals in the initial population for each technique". The results of Table 3 are extracted from Table 2, which includes the error rate of the MLRBT and the RBT [23] for each instance; the ratios in the table show a clear superiority in favor of the MLRBT.
The average error rates obtained from both initial population techniques (RBT and MLRBT) for different problems are given in Table 4.
Table 5 shows the selected TSP examples, which were classified into three classes according to their problem size, i.e. the number of cities.
As shown in Figure 9, the error rates of several initial population techniques are given for different classes of a problem example by Class A and Class B.
Average Convergence (%) is the convergence rate of the solutions in the initial population, and it is computed as defined in [25].
The error rate denotes the percentage difference between the known optimal solution and the fitness value of the solution [19]. This factor measures the quality of the generated population by finding the effect of applying an initial population technique on the GA's performance in obtaining a near-optimal solution. The error rates are classified into two types, depending on the fitness values in the given population: the high error rate is that of the individual with the worst fitness value in the initial population, and the low error rate is that of the individual with the best fitness. The experimental results of the error rate for Random, NN, RBT and MLRBT are given in Table 3.
From Table 3 it follows that the MLRBT achieved the minimum error rate compared to the Random, NN and RBT techniques. Hence, the individuals produced by the MLRBT for GAs better fit the quality measures than
As shown in Figure 10, the Multi Linear Regression Based Technique (MLRBT) achieved larger convergence than the Regression Based Technique. Hence, the MLRBT was also found to be greater than Random and NN, given that the RBT [23] is better than the Random and NN techniques.
Furthermore, the final solution error rate denotes the difference between the known optimal solution and the final solution that results from applying the GA to the TSP instances using one of the initial population techniques. In addition, this factor aims to assess the quality of the produced population, and this can be conducted by determining
In the average convergence formula, Optimal fitness denotes the known optimal value of the instance, and Average fitness denotes the average fitness value of the initial population.
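The two measures can be sketched in code as follows. The exact formulas are assumptions of this sketch, chosen to agree with the verbal definitions above: the error rate is taken as 100 × (fitness − optimal) / optimal, and the average convergence as 100 × (1 − (average fitness − optimal fitness) / optimal fitness). Fitness is the tour length, so lower is better; the function names are hypothetical.

```python
def error_rate(fitness, optimal):
    """Percentage by which a tour length exceeds the known optimum (assumed form)."""
    return 100.0 * (fitness - optimal) / optimal

def average_convergence(population_fitness, optimal):
    """Convergence (%) of the average population fitness (assumed form)."""
    average = sum(population_fitness) / len(population_fitness)
    return 100.0 * (1.0 - (average - optimal) / optimal)

def best_and_worst_error(population_fitness, optimal):
    """Low and high error rates: best and worst individual of the population."""
    return (error_rate(min(population_fitness), optimal),
            error_rate(max(population_fitness), optimal))
```

Applying `error_rate` to the final tour length returned by the GA gives the final solution error rate under the same assumption.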
From the results, it was found that the MLRBT for the GA's population initialization had a higher average convergence rate than the RBT. In addition, it was noted that the MLRBT works better especially in the case of large problems. Hence, the individuals produced by the MLRBT are the closest to the optimal solution. Moreover, Table 6 includes the experimental results of the initial population techniques w.r.t. average convergence (%).
Class 3: 500 < Size <= 1000: att532, rat783, fnl4461, d2103
solve the problem of the TSP and this technique was called the Multi Linear Regression Based Technique for GAs population initialization. The proposed technique has been implemented and analyzed, and to test its efficiency, it has been compared to three other population initialization techniques: Random, Nearest Neighbor (NN) and the Regression Based Technique.

Figure 9 Performance of initialization techniques
In this context, a set of performance criteria was considered in order to compute the performance factors for the proposed technique and the other seeding techniques, including the convergence diversity, error rate and average convergence.
To conduct the experiments, the study took several TSP examples from the standard TSPLIB. The results indicated that the proposed Multi Linear Regression Based Technique for the GA's population initialization achieved higher efficiency than the other initial population techniques and can be relied upon in developing GA-based applications. Hence, it can be concluded that the Multi Linear Regression Based Technique for the GA's population initialization produces high-quality, fit individuals, which enable the GA to enhance the solution using the best-fit individuals.
the impact of applying initial population technique on the GAs performance for obtaining a solution, which is close to the optimal one. The error rate of the final solutions indicated that the MLRBT for the GA's population initialization achieved a lower error rate than the RBT. In other words, the individuals produced by the MLRBT are of higher quality than the individuals produced by the RBT. This is evaluated in Table 7. Figure 11 shows the final solution error rate obtained from several initial population techniques used for different classes of problem instances. The Multi Linear Regression Based Technique (MLRBT) achieved a lower final solution error rate than the Regression Based Technique [23].
The performance of the MLRBT and RBT [23] techniques was observed on the instances kroA100 and eil51, as shown in Figures 12 and 13. The two figures show the initial population produced by the MLRBT and the RBT, and the final solution after 3000 generations.

Conclusion
This paper introduced a new, enhanced, initial population technique for genetic algorithm in order to