A GPU-Based Genetic Algorithm for the Multiple Allocation P-Hub Median Problem

: As the sizes of realistic hub location problems increase as time goes on (reaching thousands of nodes currently) this makes such problems difficult to solve in a reasonable time using conventional computers. This study aims to show that such problems may be solved in a short computing time and with high-quality solutions using the computational power of the GPU (actually available in most personal computers). So, we present a GPU-based approach for the uncapacitated multiple allocations p-hub median problems. Our method identifies the nodes that are likely to be hubs in the optimal solution and improves them via a parallel genetic algorithm. The obtained GPU implementation reached within seconds the optimal or the best solutions for all the known benchmarks we had access to and solved larger instances up to 6000 nodes so far unsolved. Compared to this study, no other article dealing with hub location problems has presented results for instances as large.


Introduction
Hub location models appear in various fields such as transportation, telecommunications, etc., and have a wide range of applications in air or ground transportation networks, postal delivery systems, and so on.In these models, traffic from origin to destination is not sent directly but must be routed through a specific set of nodes called hubs.Every other non-hub node is assigned to a single hub (single allocation) or to multiple hubs (multiple allocations).Assuming (i) Open hubs are fully connected to more efficient paths, allowing a discount factor α to be applied to transport costs between hubs.(ii) The triangle inequality applies to the distance between nodes (iii).There is no direct connection between the two non-hub nodes.Thus, the traffic flow will flow through one or two nodes, and economies of scale are achieved by consolidating traffic at the hubs.More precisely, the transmission cost of flow wij from i to j via hubs k and l is given by Cijkl = (χdik +αdkl +γdlj)wij where dik, dkl, and dlj are respectively the distances from i to k, from k to l and from l to j, χ is the collection cost per unit, γ is the distribution cost per unit, where α < χ and α < γ.
In the multiple allocation hub location problems, a non-hub node is allowed to be assigned to more than one hub, depending on the destination of the flows originating from that node.The goal is to locate p hubs and map non-hubs (spook) to hubs so that the overall shipping cost is minimized.Given the number p of hubs a priori, the problem is called the p-hub median problem.Also, if the hubs have no capacity constraints, the problem is known as the Uncapacitated Multiple Allocation p-Hub Median Problem (UMApHMP).Multiple allocations increase the flexibility of the model and it is expected to have lowercost solutions compared to the single allocation case.
Since the UMApHMP belongs to the class of NP-hard problems, it is difficult for exact optimization methods to solve large benchmarks (for instance, the hub and spook network of the China Deppon logistics includes 50 hubs and 5400 nodes and these numbers increase as time goes by Wang and Huang (2016).Therefore, heuristic and parallel computing methods are promising approaches for solving real problems with large sizes (Benaini et al., 2022).So, we propose a genetic algorithm parallelized for the GPU for solving the UMApHMP.To our knowledge, this is the first GPU-based approach dedicated to this problem.Our Genetic Algorithm (GA) starts from different initial solutions and improves them via classical genetic operators.The initial solutions locate the hubs which are likely to be hubs in the optimal solution.Consequently, they greatly enhance the convergence of the GA.We present this approach and we show the efficiency of our GA under experimental results on well-known benchmarks and on large random instances of up to 6000 nodes generated by us.
Several efficient heuristics have been implemented for both single and multiple allocations p-hub median problems.Among the most promising algorithms are Tabu Search (TS), Genetic Algorithm (GA), scatter search, simulated annealing, greedy randomized adaptive search, and electromagnetism like methods.A brief presentation of the corresponding references is below.Sangsawang (2011) proposed a multiple tabu search for solving the UMApHMP based on the convex hull technique to identify the nodes that send/receive flows to multiple hubs.He uses an n × 2n array to identify the intermediate hubs between origin-destination pairs.Rabbani and Kazemi (2015) presented a GA optimization approach for the uncapacitated multiple allocation p-hub center problem.He compared the nearest hub and all pair shortest path strategies for multiple allocations on the CAB and the AP data set.Zhang et al. (2023), the authors used Variable Neighborhood Search (VNS) heuristics to solve this problem.Shobeiri (2015) presented greedy deterministic and randomized algorithms for constructing an initial feasible solution and improving it by using local improvement techniques.He studies the impact of the quality of the initial feasible solution on the local improvement phase of AP data instances.Sender and Clausen (2013) developed different versions of a local search approach for solving the capacitated version of the problem derived from a practical application in the network design of German wagonload traffic.Boland et al. (2004) observe characteristics of optimal solutions for three versions of the multiple allocation problems.Then, they used these characteristics to develop preprocessing techniques and tightening constraints and apply them to the appropriate problems.Chen (2006) gives an approach to derive an upper bound for the optimal number of hubs for the Uncapacitated Multiple Allocation Hub Location Problem (UMAHLP).He uses this upper bound with the simulated annealing method to solve this problem.Zetina et al. (2017) study cases where the demand and the transportation cost are subject to interval uncertainty.They present mixed integer programming formulations for these problems and computational results with a general-purpose solver.For a robust optimization, Benaini and Berrajaa (2020) propose a genetic algorithm to solve the robust uncapacitated single allocation and also Rouzpeykar et al. (2022) developed a genetic algorithm to solve p-hub location and revenue management problem.Yaman (2011), introduces the R-allocation p-hub median problem, where each node must be allocated to at most r hubs.This problem generalizes two versions of the p-hub median problem, single and multiple allocations (r = 1 and r = p).His computational study shows that single-allocation solutions can be significantly more expensive than multiple-allocation solutions.However, assigning one node to another hub or two can save significant routing costs, and the resulting network may be easier to manage.Kratica (2013) introduced the original electromagnetism like metaheuristic with a scaling technique.Results on several benchmarks and on large instances up to 1000 nodes showed high performance of the proposed method.
Several Genetic Algorithms (GA) are proposed to solve different variants of the hub location problems.Most of them use binary/integer arrays as encoding.A simple and powerful GA for the multiple allocation problem was proposed by Kratica et al. (2005).This GA uses binary N-string encoding and original genetic operators.The author proposed caching to reduce the computational time in the evaluation function and shows the effectiveness of the GA through tests on CAB and AP instances up to 200 nodes.We adopt the binary N-string encoding for our GA.However, the particularity of the UMApHMP is that once the set of hubs is defined, an optimal allocation of non-hubs to hubs is obtained by the shortest paths rule.So, given the N-string encoding, it is necessary to compute all pair shortest paths to fully determine the solution and to evaluate its fitness which is expensive computationally.To avoid recomputing all the paths at each iteration of the GA, we update those of the current solution after each modification of the latter.So, we must indicate how to represent all the shortest paths of a solution and how to update them after a change in the solution.
Our GA starts from an initial solution that locates hubs at the nodes that are likely to be hubs in an optimal solution, in order to accelerate the convergence to the optimal or best solution.It is therefore necessary to identify potential optimal hubs.For this purpose, we are inspired by the work of Peker et al. (2016) on the analysis of optimal hub locations for single allocation problems and by other researchers on this subject due to Demir et al., (2022);Alvarez Fernandez et al. (2022); Contreras and O'Kelly (2019); Shobeiri (2015).
Finally, it should be mentioned that very few GPU solutions for the hub location problems are proposed in the literature.Benaini et al. (2019) proposed a GPU implementation for solving the single allocation phub median problem.Lim and Ma (2013) presented a GPU-based parallel vertex substitution algorithm for the p-hub median problem and Santos et al. (2010) proposed a GPU implementation of the GRASP metaheuristic for the p-hub median problem.More recently, AlBdaiwi and AboElFotoh (2017) presented a new genetic algorithm based on a pseudo-Boolean formulation of the p-median problem that he implemented on GPU.Interesting experimental tests on hard instances of sizes up to 900 are reported.We would like to highlight that, compared to our work, no article dealing with the hub location problems, proposed results for instances as large as 6000 nodes.In view of these related works, we can affirm that the contributions of this study are in the method of locating optimal hubs and in the GPU-based approach for solving the UMApHMP.

Problem Statement and Formulation
The UMApHMP can be informally stated as follow.Given a set N of n nodes, find a sub-set of p hubs H ⊂ N that minimizes: where the decision variable xijkl = l if the flow wij from i to j goes via hubs k and l (k may be equal to l) and 0 otherwise.Given that there are no capacity constraints on the hubs or links of the network, there always exists an optimal solution in which each flow wij fully routed on a single path (Contreras and O'Kelly, 2019) which justifies the fact that the xijkl are binary variables.The general constraints imposed by the UMApHMP are the following p is the number of hubs.It is assumed that transportation costs and traffic flow are known and determined.The traffic between a pair of nodes i and j must be routed through either one or two established hubs k and l.All hubs are interconnected and none of the non-hubs are directly connected.Each non-hub can be allocated to multiple hubs.We call the hubs of optimal or best solution optimal hubs.A more precise formulation of the UMApHMP can be found in (Boland et al., 2004;Kratica et al., 2005).
Figure 1(a) shows an example of a hub network with n = 5 nodes.The number of located hubs is p = 2. Figure 1(b) presents a solution for the single allocation case with hubs located at nodes 2 and 4. The non-hubs 1 and 5 are allocated to hub 4 and non-hub 3 to hub 2. For instance, the flow from 1-3 takes path 1-4-2-3.A multiple allocation solution is presented in Fig. 1(c) with hubs located at nodes 2 and 4. The non-hub 3 is allocated to hubs 2 and 4, non-hub 1 is allocated to hubs 2 and 4, and non-hub 5 is allocated to hub 4. The flow from 1 to 3 takes the path having the minimum cost among the three paths 1-4-2-3 or 1-2-3 or 1-4-3.

Materials
This study is implemented on GPU Nvidia Quadro with 2 GB and 2 SM and 384 cores running under CUDA 11.0 environment.

Location of Initial Hubs
The authors are inspired by the work of Peker et al. (2016), that have an interesting analysis of the optimal hub locations in the case of a single allocation.Using the optimal solutions for small CAB and AP instances, they observe that the spatial distribution of nodes (betweenness and centrality) and the size of node requirements are the most influential parameters for locating optimal hubs.They proposed a CBS algorithm that partitions the nodes into clusters each centered at a node that is likely to be an optimal hub.We adopted the same methodology to conceive another simple and efficient algorithm to partition the set of nodes into p disjoint clusters each likely to contain an optimal hub for the UMApHMP.We showed the effectiveness of our approach by tests on several known benchmarks (including large random instances).The result is that the percentage of the number of clusters each containing at least one optimal hub is about 82%.Our clustering algorithm is based on the following two observations: (i) In an optimal solution of the UMApHMP, a non-hub node is necessarily allocated to the hub closest to it (ii) A node k with the greatest index I(k) = Ok +Dk is a potential optimal hub and the closest nodes to it are unlikely to be also optimal hubs, they are rather the nodes allocated to hub k.The following simple algorithm partitions the set of nodes into p disjoint clusters Ci each likely to contain a hub of an optimal or best solution.
Clustering algorithm: L = the list of nodes sorted in decreasing order according to their indexes I(); r = ⌈N/p⌉; i = 1; Two remarks on this algorithm.First, as this algorithm is dedicated to the UMApHMP, the Ci may be non-disjoint and consequently may contain different numbers of nodes instead of r nodes.Secondly, the algorithm can be improved to take into account the isolated nodes as in Peker et al. (2016) where the generated clusters (circles) have the same radius and are centered at nodes selected among the 2p most important nodes.These centers are , , Here, the clusters Ci, have a different radius and are centered at hi, and each contains the same number of nodes (except for Cp).Identifying the potential optimal hub in each Ci is a more delicate task.Indeed, the potential hub in Ci is not necessarily hi but may be a node close to hi (according to some criteria).We tested different criteria to select the optimal hub in each cluster.We observed that the node k ∈ Ci with small is the best candidate optimal hub in Ci for several benchmarks but not for all.

The Genetic Algorithm
Most of the GAs for the hub location problems proposed in the literature use binary or integer arrays to encode the solutions.The operators mostly used in genetic algorithms for the UMApHLP consist in randomly exchanging some hubs with non-hub nodes regardless of the encoding used.Indeed, once the set of hubs is defined, an optimal allocation is obtained by the shortest path algorithm and therefore, operators that change the node allocations are useless.

Solution Representation and Encoding
We adopt a simple encoding by a binary string s of length n where s(k) = 1 denotes that k is an established hub and s(k) = 0 denotes that k is a non-hub node.We represent all shortest cost path i-k-l-j of a solution encoded by s by a two-dimensional array Ts of size n × 2n; where a row represents the origin node, a column identifies the destination node, and the l-column and k-column represent the intermediate hubs.Sangsawang (2011) used this representation in a tabu search approach.
For instance, the representation T of the solution of Fig. 1(c) is given in Table 1.
In this example, T(3, 5) = (2, 4) means that the shortest cost path from node 1 to node 3 is 3-2-4-5, and T(1, 5) = (4, 4) means that the shortest cost path from node 1 to node 5 is 1-4-5, etc.Note that if the l-column of T(i, j) = j or the k-column of T(j, i) = j for all i then j is hub.This allows the identification of hubs from T. The representation Ts is associated with the encoding s and defined by Ts(i, j) = argmink,1: The genetic operators operate on the N-string s and the representation Ts, which must be updated after the change of hubs of s, gives the fitness ∑i,jCijT(i,j) of s.

Genetic Operators
The following classical operators have been implemented in our GA.
• Given an individual s and its representation Ts, generate new individual s ′ from s by k exchange() consists in randomly choosing k hubs of s and exchanging them with k non-hubs then update Ts to get Ts′ • A population of R individuals is obtained from an initial individual s by applying R times kexchange() to s • The crossover with single random point crossover exchanges the two parts of parents p1 and p2 to generate two springs ch1 and ch2.If the number of ones in the code of chi, (i = 1, 2); is different from p then chi is corrected to feasible as follows.If the number of hubs of chi is > p then keep the p hubs with greatest I() else the missing hubs are randomly chosen from the non-hub nodes.This produces two feasible individuals c1 and c2.The representation Tc i (i = 1, 2) is computed as follows.If the number of hubs common to ci and p1 is greater than the number of hubs common to ci and p2 then update Tp 1 else update Tp 2 to get Tc i • The aim of the mutation is to enlarge the search space and to avoid the local optima.We use a mutation operator that changes the hubs for 2% of individuals randomly chosen and updates the representations T of the obtained individuals Exchanging hubs with non-hubs and consequently, updating the representation T is the most expensive operation, in computing time, of this GA.Therefore, we designed the updating process to be as efficient as possible without re-computing all pair shortest paths.

Updating the Representation T after the Change of Hubs
Let H be the set of hubs of a solution s represented by Ts.Then: Let, s′ the solution obtained by exchanging t hubs h1,•••,ht of s with t non-hubs h1',…,h't of s and Ts ′ its represented.Let H1 = {h1,…,ht}, H'1 = {h'1,…,h't} and H' = (H\H1) È H'1.Then H' becomes the set of bubs of s ′ and Ts′(i, j) = argmink,l∈H′ Cijkl.We show how to compute Ts′(i, j) for all i, j.Assume that jÎH'1 then any path (of s′) from i to j passes through at most another hub. Hence: is done in O(p) steps.Likewise, if then: is done in O(p) steps.Now, suppose that neither i nor j is a hub of s ′ and let Ts(i, j) = (k0, l0).Finally, if k0 ∈ H1 or l0 ∈ H1 then the computation of Ts′(i, j) needs O(p 2 ) steps.So, the computation of most of the Ts′(i, j) requires O(p) steps in the case where a single hub is exchanged (1-exchange() used in our implementation).Finally, it should be noted that some properties presented in Boland et al. (2004) may be used to improve these computations with little impact if all the Ts′(i, j) are computed in parallel.
The overall structure of our system is shown in Fig. 3.

GPU Implementation of the GA
Before presenting the GA, we briefly recall the main features of the GPU and CUDA.

CUDA and GPU
Graphics Processing Units (GPUs) are actually available in most personal computers.They are used to accelerate the execution of a variety of problems.CUDA is a scalable parallel programming model that runs on Nvidias GPU architecture making it possible to use the GPU as a massively parallel machine.Since GPU and CUDA are actually available in most personal computers, massively parallel computing has become a commodity technology.The smallest unit in the GPU that can execute a kernel is called a thread.All threads execute the same code (kernel) but on multiple data.Threads are grouped into blocks and blocks are grouped in the grid.CUDA enables defining grids and blocks according to the parameters of the problem.
Threads can access data in parallel from different memory locations.GPU memory is divided into three levels: (i) Global memory, which can be accessed by all threads.(ii) Shared memory, accessible to all threads of the same block, and (iii) Local memory (register), accessible by a thread and only by it.Shared memory has low latency (2 cycles) and is limited in size.Global memory has high latency (400 cycles) and is large.Each block is divided into Warps of 32 threads that run in parallel on a single Stream Multiprocessor.Programmers must control block size, number of warps, and various memory accesses.Typical CUDA programs are C programs whose functions are classified according to whether they are designed to run on the CPU or on the GPU.

Outline of the GA
GAs is preferred to be executed on parallel architectures such as the GPU since they exploit the availability of many threads performing the same code on multiple data.In the classical thread per chromosome model, one thread performs fitness evaluation as well as genetic operations (Benaini et al., 2019).Here, we use several threads to perform these operations on a single chromosome, resulting in better utilization of GPU resources.Our GA uses R blocks, each of n × 2n threads, to process a population of 2 × R chromosomes as follows.Each pair of individuals (parents) is processed by a single block.The representations T and T ′ of the two parents, the W costs matrix, and the D distances matrix are stored in the shared memory of the block (to allow all threads of the block to access these data without going through the global memory).The sub-block composed of threads (i, j) (resp.threads (i, j + n)), 0 ≤ i, j < n, updates and evaluates the fitness of one new (resp.the other new) produced individual at each iteration.The fitness of an individual represented by T is simply computed as the ∑i,jCijT(i, j).
Now, the authors explain the GPU implementation of the GA.Starting from an initial solution s constructed as described.In each blocki, 0 ≤ i < R, sets a first parent pi to s and generates a second parent p ′ i = 1-exchange(s).This constitutes a population of R × 2 chromosomes distributed per pair of two per block.Next, each blocki applies the crossover operator to pi, p'i producing two feasible children ci and c ′ i, then evaluates their fitness.The individual si with the best fitness among pi, pi', ci, ci' is reinjected in the blocki as the first parent for the next iteration if any.After N1 iterations (the same for all blocks) the si, 0 ≤ i < R, are copied in the global memory.A mutation operator is eventually applied to them and the individual s with the best fitness among them is selected and re-injected in each block as an initial solution for the next iteration if any.The process terminates after a certain number of N2 of iterations.This process is illustrated in Fig. 4 where pi, fi, and Ti denote respectively the first parent, its fitness, and its representation.
Finally, note that: • This implementation has two levels of parallelism.First, in the processing of each pair of parents (pi, pi') and secondly in the treatment of the population (pi, pi'), 0 ≤ i < R • The initial solution s is simply generated in parallel by computing the list L and sorting it on a block of n × n threads.Then determine the p clusters from which the initial hubs are located in O(n) parallel steps • Other powerful genetic operators proposed especially in Kratica et al. (2005); Naeem and Ombuki-Berman ( 2010) can be implemented in this GA without causing data exchange between the blocks and therefore without additional access to the global memory • If the shared memory is not enough to store the representations then either one stores them in the global memory or one re-computes all the shortest paths at each iteration or one uses the cache memory technique as in Kratica et al. (2005) all these possibilities induce an additional execution time

Results and Discussion
The objective of this section is to give new best results or results for instances that have not been solved before (to our knowledge).Moreover, through these tests, we highlight the relevance of our clustering algorithm et GPU implementation.So, we tested our implementation on the following benchmarks: • The Australian Post (AP) data set represents mail flows in Australia where the flows are not symmetric, there are flows between each node and itself, and χ = 3, γ = 2, and α = 0.75 • The Urand dataset is random instances of up to 400 nodes generated by Meyer et al. (2009), and instances of up to 1000 nodes proposed by Ilić et al. (2010) where the node coordinates and the flow matrix were randomly generated.We completed random instances of large sizes up to 6000 that we generated according to the same rules that those of URAND instances • The planet lab instances are node-to-node delay data for performing measurements of the legacy internet (Ilić et al. (2010).In these instances, χ = γ = α = 1, the distance matrix does not respect the triangle inequality, and the flows wij = 0 if i = j and 1 otherwise We used a modest Nvidia quadro to realize our tests.We report our results for the single and for multiple allocation cases for each tested instance.Thus, the costs of the solutions can be compared in the two cases.We compare our single allocation results to those of (Ilić et al., 2010) and the multiple allocation results to those of Kratica et al., 2005).
The following notations are used in all tables.n is the number of nodes in the instance, and p is the number of hubs.M stands for multiple allocations, and S stands for single allocation.Best sol is the best solution if known, otherwise, the dash is written.GPUsol is our solution with OPT when the solution is known optimal or best the best solution found in the literature.Clust opt is the number of clusters each containing at least one hub of the optimal or the best solution for the UMApHMP.The average of these numbers is given in bold at the end of this column.Some results for the single allocation case (USApHMP) reported here are presented in Benaini et al. (2019).For all benchmarks considered our implementation reaches the optimal or the best solutions in execution times ≤ 7s for n ≤ 419 and between 10 and 15s for n = 1000 and between 25s and 600s for n ∈ [1500,6000].
Table 2 provides the results of the AP instances with 300 and 400 nodes.We do not know other UMApHMP solutions for these instances to compare them with our results.The average of the CLUST OPT column is 87%.This means that, on average, the percentage of clusters each containing at least one optimal hub is 87% (or the optimal hubs of each instance problem are distributed on 87% of clusters).This shows the efficiency of our clustering algorithm.
In Table 3, we reported new best costs for Planet Lab instances for the single and multiple allocation cases.The old best solutions for USApHMP reported in this table are due to Ilić et al. (2010).Given that there are no other known UMApHMP results for these instances, our results become henceforth the best.The average of the CLUST OPT column is 69%.This percentage is quite low compared to the others benchmarks.We think that this is due to the nonrespect of the triangle inequality in the planet lab instances.We complete this table with the results for other values of p in Table 4.Only the results for instances 06, 07, 08, and 09 are reported (to avoid overloading the paper with tables).Note that there are no other known UMApHMP solutions for these instances with these values of p. Here, the percentage of clusters each containing at least one hub of the best solution is quite high.Ilić et al. (2010); Kratica (2013) reported the bestknown results for URAND instances for n = 100, 200, 300, 400, and (χ = 3 and 1, α = 0.75, γ 2 and 1).Our GPU implementation achieves the same results.So, in Table 5 we report only the CLUST to OPT column for these instances with (χ = 1; α = 0.75, γ = 1).We observe that the numbers in this column are not very sensitive to the variation of n.Here too, the average of the CLUST OPT column is quite high.In Table 6 we report new results (for single and multiple allocation cases) on the Urand instances with 1000 nodes with (χ = 1, α = 0.75, γ = 1) since Ilić et al. (2010); Kratica (2013) reported only the results for (χ = 3, α = 0.75, γ = 2).
Table 7 reports our results with the parameters (χ = 3, α = 0.75, γ = 1) for the large Urand instances up to 6000 nodes.To our knowledge, no results for instances of sizes larger than 1000 have been published.The efficiency of our clustering algorithm and GPU implementation is clearly established in solving these large random instances.Compared to this study, no article dealing with hub location problems has presented results for instances as large.

Fig. 1 :
Fig. 1: An example of p-HMP with solutions for single and multiple allocations the set of the r nodes closest to hi (including hi); else Ci = the remaining nodes in L; Remove Ci from L (L = L\Ci); i = i+1; end while

Table 2 :
Results on large AP instances

Table 4 :
Results on Planet Lab instances *-2005 for other values of p

Table 6 :
Results on large URAND instances