Introduction

The global optimization problem is usually defined as:

$$\begin{aligned} x^{*}=\text{ arg }\min _{x\in S}f(x), \end{aligned}$$
(1)

with S:

$$\begin{aligned} S=\left[ a_{1},b_{1}\right] \otimes \left[ a_{2},b_{2}\right] \otimes \ldots \otimes \left[ a_{n},b_{n}\right] , \end{aligned}$$

where the function f is assumed to be continuous and differentiable. A wide range of problems in the relevant literature can be treated as global optimization problems. In the area of physics, for example, Yang et al. proposed a multiobjective genetic algorithm for an accelerator lattice [1], Iuliano proposed global optimization techniques for benchmark aerodynamic cases, and Duan et al. used global optimization techniques for conceptual rainfall-runoff models [2]. In the area of chemistry, Heils and Johnston provided a review of the use of global optimization techniques on clusters [3], Shin et al. proposed software that utilizes global optimization methods for protein–ligand docking [4], and Liwo et al. proposed a global optimization algorithm for protein structure prediction based on minimizing the potential energy function [5]. In the area of economic sciences, Gaing proposed a particle swarm optimization method to solve the economic dispatch problem [6], and Maranas et al. proposed global optimization methods for tackling long-term financial planning problems [7]. Furthermore, in medicine, Lee proposed large-scale optimization-based classification models applied to medical problems [8], and Cherruault reviewed several global optimization methods [9].

Global optimization techniques can be categorized in several ways, but most researchers divide them into two broad categories: deterministic and stochastic techniques. Among the deterministic techniques, the most widely adopted is the interval method [10, 11], in which the initial set S of the objective function is repeatedly subdivided into smaller regions; regions that cannot contain the global minimum are discarded, until a very narrow interval remains that contains the global minimum of the objective function. Many related deterministic methods have been presented with various applications, such as the work of Maranas and Floudas, who proposed a deterministic method for molecular structure determination [12], the TRUST method for global optimization [13], the method of Evtushenko and Posypkin for global box-constrained optimization [14], and a method that uses smooth diagonal auxiliary functions to approximate the global minimum [15]. Kunde et al. [16] suggested the use of deterministic global optimization methods for chemical problems. Furthermore, Sergeyev et al. [17] recently presented a systematic comparison between deterministic and stochastic methods for handling the global optimization problem, and Yassami and Ashtari proposed a hybrid optimization method for optimization problems [18].

On the other hand, the category of stochastic techniques, where most of the research effort is concentrated, contains methods that search for the global minimum using stochastic procedures and therefore cannot guarantee that it will be found. In this vast research area, one encounters methods such as controlled random search [19,20,21], which can be considered a direct search approach, simulated annealing [22,23,24], differential evolution [25, 26], particle swarm optimization [27,28,29], ant colony optimization [30, 31], and genetic algorithms [32,33,34]. In addition, with the widespread availability of parallel computing hardware in recent years, many studies have appeared that exploit modern parallel processing units [35,36,37] to tackle the global optimization problem. Recent surveys of metaheuristic algorithms can be found in [38,39,40]. This paper suggests a number of directions for the efficient parallelization of particle swarm optimization (PSO) techniques.

Parallel optimization methods can be utilized in various application domains, enabling rapid resolution of complex problems. Some of these domains include:

  1. Engineering and design: optimizing parameters for improved performance in engineering systems like aircraft, automobiles, and machinery.

  2. Computer science: optimizing complex algorithms and computational tasks.

  3. Biology and pharmaceuticals: addressing problems like DNA analysis and calculating multiple protein interactions.

  4. Economics and finance: automating the optimization of intricate models.

  5. Energy optimization: optimizing energy consumption in energy networks.

  6. Climate change: optimizing complex climatic models and weather prediction.

  7. Telecommunications: designing efficient communication networks and optimizing network performance and bandwidth management.

  8. Policy decision-making: finding optimal solutions for problems like transportation routing and sustainability policies.

By applying parallel optimization techniques in these diverse fields, researchers and practitioners can efficiently tackle complex challenges and achieve optimal solutions more effectively.

The PSO method is inspired by observations made by Eberhart and Kennedy in the 1990s. They studied in detail the behavior of birds moving through space in search of food and, drawing on this experience, formulated a technique that falls within the domain of global optimization. In this technique, particles move in the search space of the problem looking for the minimum of the objective function. Each particle is characterized by two basic quantities: its current position and the velocity with which it moves in search of the global minimum. The current position is denoted by \(\overrightarrow{x}\) and the velocity by \(\overrightarrow{u}\). In addition, each particle keeps in memory the best position it has visited in the past (the one with the smallest value of the objective function), and the best position found by the whole population is also stored. The method moves the particles toward the global minimum through an iterative process, in which the position of each particle at every iteration is derived from its current position, the best position it has found in previous iterations, and the overall best position of the population.

Due to the simplicity of the method and the small number of parameters that must be set, it has been applied to many difficult problems from all areas of the sciences, such as problems that arise in physics [41, 42], chemistry [43, 44], medicine [45, 46], and economics [47]. The PSO method has also recently been applied successfully to many practical problems such as flow shop scheduling [48], the development of electric vehicle charging strategies [49], emotion recognition [50], and robotics [51]. An extensive tutorial on PSO methods can be found in the work of Marini and Walczak [52], and a recent overview of PSO variants can be found in the work of Jain et al. [53]. Since the technique has gained considerable popularity in recent years, many works have been built on it, such as combinations with mutation mechanisms [54,55,56], new techniques for initializing the velocity vector [57], hybrid techniques [58,59,60], and parallel techniques [61,62,63]. Also, since the calculation of the particle velocities is a determining factor for the effectiveness of the method, several techniques have been formulated for this calculation [64,65,66]. Particle swarm optimization has also been successfully combined with other global optimization techniques: Bogdanova et al. [67] suggested a combination of grammatical evolution with PSO [68], Pan et al. [69] suggested a method that combines PSO with a variation of simulated annealing, Mughal et al. [70] proposed a hybrid technique that effectively combines PSO and simulated annealing for solar energy problems, and Lin et al. [71] suggested a hybrid of PSO and differential evolution for some optimization problems. Additionally, a number of techniques have been presented in which local minimization methods are applied during some iterations of particle swarm optimization [72, 73], and Gao et al. suggested the use of a PSO method to reduce energy consumption [74]. Of course, these techniques significantly increase the computational time required by the method, but they also significantly improve its performance.

Since global optimization is very widely applied in many fields and requires significant computing resources, techniques that take advantage of parallel architectures have to be developed to reduce the required computing time. For example, Olenšek et al. [75] proposed an asynchronous parallel global optimization method based on simulated annealing and differential evolution, applied to a series of benchmark global optimization problems. Also, Regis and Shoemaker [76] employed radial basis functions [77] for parallel global optimization. A systematic review of parallel metaheuristics can be found in the work of Alba et al. [78], and a review of metaheuristics implemented on modern GPU architectures can be found in the work of Essaid et al. [79]. The particle swarm optimization method is an excellent candidate for parallelization, since it is based on a set of computational solutions that act partially independently of each other. For example, Koh et al. proposed a parallel asynchronous implementation of the PSO algorithm [80], and Tewolde et al. [81] proposed a multi-swarm PSO algorithm where the swarm is divided into a set of subswarms that are executed in parallel. Also, Ouyang et al. [82] proposed a parallel implementation of the PSO algorithm using the CUDA programming library to solve the 1D heat conduction equation. Furthermore, Campos et al. [83] investigated the impact of communication strategies on a parallel PSO implementation used for solving many-objective optimization problems. A survey of parallel PSO algorithms can be found in the work of Lalwani et al. [84].

In the present work, the initial population of particle swarm optimization is divided into independent populations running on different parallel computing units, which will also be called islands. In addition, a number of techniques are proposed that aim, on the one hand, at efficient communication between the independent islands and, on the other hand, at the early termination of the overall method. These techniques include a new method of calculating particle velocity, a new termination rule specifically adapted to parallel techniques, and a new way of propagating the best particles among the parallel computing units involved in the overall method. The velocity calculation adopted here follows Eberhart and Shi [85] and is based on random values. Also, the proposed stopping rule is based on the work of Charilogis and Tsoulos [86], which gave excellent results when applied to a series of well-known optimization problems; here, this termination technique is modified appropriately for parallel computing environments.

The main disadvantages of the PSO method are its tendency to converge prematurely to a local minimum for some complex or high-dimensional objective functions, and the strong dependence of its performance on the initial particle distribution, since poor initial values may lead to suboptimal solutions. The proposed method addresses these issues and aims to improve the algorithm’s performance in terms of speed and reduced consumption of computational resources. In general, most optimization methods can be parallelized, which allows the available computational resources to be exploited more efficiently and reduces the time needed for the solution process.

The rest of this article is organized as follows: in “The Proposed Method”, the proposed method and the new approaches to particle swarm optimization are discussed in detail; in “Experiments”, the test functions used as well as the experimental results are fully outlined; and finally, in “Conclusions”, some conclusions and future guidelines are listed.

The Proposed Method

This section will begin with a detailed description of the serial method to be performed on each parallel processing unit as well as the overall parallel method proposed in this work. Subsequently, the basic components of the proposed process, such as the calculation of the velocity, the proposed termination rule, and the method of propagating points between the parallel computing units will be thoroughly analyzed.

In every parallel processing unit, a PSO algorithm is executed to locate the global minimum of the objective function. The formulation of the PSO algorithm is shown in Algorithm 1. The underlying algorithm represents the problem as a swarm of particles, where each particle is a candidate solution in the Euclidean search space. During the optimization process, the particles move in the search space based on their individual experience as well as on social interactions. Using these two influences, the particles adjust their velocities and positions during the iterations of the algorithm. The parallel version proposed in this article also promotes social interactions among sub-swarms using different propagation schemes.

Algorithm 1

The base PSO algorithm, as described in Algorithm 1, computes at every iteration the new position of the particle i using the following operation:

$$\begin{aligned} x_{i}=x_{i}+u_{i}. \end{aligned}$$
(2)

In most cases, the new velocity could be a linear combination of the previously computed velocity and the best values \(p_{i}\) and \(p_{\text{ best }}\). A typical computation of the new velocity is defined as:

$$\begin{aligned} u_{i}=\omega u_{i}+r_{1}c_{1}\left( p_{i}-x_{i}\right) +r_{2}c_{2}\left( p_{\text{ best }}-x_{i}\right) , \end{aligned}$$
(3)

where

  1. The variables \(r_{1},\ r_{2}\) are random numbers defined in [0, 1].

  2. The constants \(c_{1},\ c_{2}\) are defined in [1, 2].

  3. The variable \(\omega\) is called the inertia and was suggested by Shi and Eberhart [27]. In the current article, a simple inertia calculation is used, computed as:

$$\begin{aligned} \omega _{\text{ iter }}=0.5+\frac{r}{2}. \end{aligned}$$
(4)

The variable r is a random number with \(r\in [0,1]\).
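
To make these update steps concrete, the following C++ sketch applies Eqs. (2)–(4) to every particle of a swarm. It is only an illustrative sketch under assumptions: the Swarm structure, the function-pointer objective, and the parameter names are choices made for this example and do not reproduce the OPTIMUS implementation. Here the random factors \(r_{1}, r_{2}\) are drawn once per particle; implementations often draw them once per dimension.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Per-particle data: positions x, velocities u, personal bests p with values fp,
// and the best position pbest of the whole swarm with value fbest.
struct Swarm {
    std::vector<std::vector<double>> x, u, p;
    std::vector<double> fp;
    std::vector<double> pbest;
    double fbest;
};

// One PSO iteration following Eqs. (2)-(4).
void pso_iteration(Swarm &s, double (*f)(const std::vector<double> &),
                   double c1, double c2, std::mt19937 &gen) {
    std::uniform_real_distribution<double> U(0.0, 1.0);
    const double omega = 0.5 + U(gen) / 2.0;                 // Eq. (4): random inertia
    for (std::size_t i = 0; i < s.x.size(); ++i) {
        const double r1 = U(gen), r2 = U(gen);               // r1, r2 in [0, 1]
        for (std::size_t j = 0; j < s.x[i].size(); ++j) {
            s.u[i][j] = omega * s.u[i][j]
                      + r1 * c1 * (s.p[i][j]  - s.x[i][j])
                      + r2 * c2 * (s.pbest[j] - s.x[i][j]);   // Eq. (3)
            s.x[i][j] += s.u[i][j];                           // Eq. (2)
        }
        const double fx = f(s.x[i]);
        if (fx < s.fp[i]) { s.fp[i] = fx; s.p[i] = s.x[i]; }   // update personal best
        if (fx < s.fbest) { s.fbest = fx; s.pbest = s.x[i]; }  // update swarm best
    }
}
```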

The Parallel Algorithm

The overall parallel algorithm, which runs on \(N_{I}\) independent computing units, is shown in Algorithm 2.

Algorithm 2
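
As a concrete, self-contained illustration of the island scheme of Algorithm 2, the following C++/OpenMP sketch runs \(N_{I}\) sub-swarms on separate threads, periodically exchanges information between them, and stops the whole run as soon as one island satisfies a simple stability test. It is a simplified sketch under assumptions, not the OPTIMUS implementation: the sphere objective, the exchange through a single shared best point (a crude stand-in for the propagation schemes of the next subsection), and all parameter values are chosen only for this example.

```cpp
#include <omp.h>
#include <atomic>
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

// Illustrative objective (sphere); the experiments use the functions of Table 1.
static double f(const std::vector<double> &x) {
    double s = 0.0;
    for (double xi : x) s += xi * xi;
    return s;
}

int main() {
    const int NI = 4, particles = 50, dim = 10, iters = 500, rate = 20, NM = 15;
    const double c1 = 1.5, c2 = 1.5, eps = 1e-8;
    std::vector<double> best_x(dim, 5.0);      // best point known to every island
    double best_f = f(best_x);
    std::atomic<bool> stop(false);

#pragma omp parallel num_threads(NI)
    {
        std::mt19937 gen(1234 + omp_get_thread_num());
        std::uniform_real_distribution<double> U(0.0, 1.0);
        // Independent sub-swarm for this island.
        std::vector<std::vector<double>> x(particles, std::vector<double>(dim));
        std::vector<std::vector<double>> u(particles, std::vector<double>(dim, 0.0));
        for (auto &xi : x)
            for (double &v : xi) v = -10.0 + 20.0 * U(gen);   // random start inside S
        std::vector<std::vector<double>> p = x;               // personal bests
        std::vector<double> fp(particles);
        for (int i = 0; i < particles; ++i) fp[i] = f(x[i]);
        std::vector<double> lbest = p[0];
        double lf = fp[0];
        for (int i = 1; i < particles; ++i)
            if (fp[i] < lf) { lf = fp[i]; lbest = p[i]; }
        double prev = lf;
        int stable = 0;

        for (int k = 1; k <= iters && !stop.load(); ++k) {
            const double omega = 0.5 + U(gen) / 2.0;                  // Eq. (4)
            for (int i = 0; i < particles; ++i) {
                const double r1 = U(gen), r2 = U(gen);
                for (int j = 0; j < dim; ++j) {
                    u[i][j] = omega * u[i][j] + r1 * c1 * (p[i][j] - x[i][j])
                            + r2 * c2 * (lbest[j] - x[i][j]);         // Eq. (3)
                    x[i][j] += u[i][j];                               // Eq. (2)
                }
                const double fx = f(x[i]);
                if (fx < fp[i]) { fp[i] = fx; p[i] = x[i]; }
                if (fx < lf)    { lf = fx; lbest = x[i]; }
            }
            if (k % rate == 0) {
                // Simplified exchange: all islands share a single best point.
#pragma omp critical
                {
                    if (lf < best_f) { best_f = lf; best_x = lbest; }
                    else if (best_f < lf) { lf = best_f; lbest = best_x; }
                }
            }
            // Simplified stand-in for the stopping rule of Eq. (5).
            stable = (std::fabs(lf - prev) < eps) ? stable + 1 : 0;
            prev = lf;
            if (stable >= NM) stop.store(true);   // one island ends the whole run
        }
    }
    std::printf("best value found: %.6g\n", best_f);
    return 0;
}
```

In the actual method, the exchange involves the \(N_{P}\) best particles of a unit, and the receiving units replace their worst members, as described below.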

The two main components of the proposed parallel implementation are the particle propagation mechanism between the parallel computing units and the proposed termination rule. For the first, it must be decided which units will send their best particles to other parallel units; units receiving particles replace their worst members with the ones they receive. For the second, a termination rule based on asymptotic observations is proposed for the early termination of the parallel technique: the overall algorithm is terminated as soon as an asymptotic termination criterion is satisfied on at least one parallel computing unit. These components are discussed in the following subsections. The overall algorithm is graphically outlined in Fig. 1.

Fig. 1
figure 1

The overall algorithm

Propagation Mechanism

During the execution of the parallel algorithm, the processing units periodically propagate their best particles (those with the lowest function values) to the remaining processing units. This propagation can be performed in the following ways:

  1. 1 to 1. In this propagation scheme, a randomly selected processing unit sends its \(N_{P}\) best particles to another randomly selected unit.

  2. 1 to N. In this scheme, a randomly selected unit sends its \(N_{P}\) best particles to all remaining units.

  3. N to 1. In this scheme, every processing unit sends its \(N_{P}\) best particles to a single randomly selected unit.

  4. N to N. In this scheme, every processing unit sends its \(N_{P}\) best particles to all other units.

All propagation schemes are demonstrated graphically in Fig. 2.

Fig. 2
figure 2

A graphic presentation of all propagation schemes
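
To illustrate the four schemes, the following C++ sketch shows one possible way of selecting the sender/receiver pairs; the transfer of the \(N_{P}\) best particles and the replacement of the receiver's worst members are abstracted into the hypothetical helper send_best, and the code assumes at least two processing units. This is an illustration of the schemes described above, not the code used in the experiments.

```cpp
#include <random>

enum class Scheme { OneToOne, OneToN, NToOne, NToN };

// Select sender/receiver pairs for one propagation step and delegate the actual
// particle exchange to the hypothetical helper send_best(from, to, count).
void propagate(Scheme s, int NI, int NP, std::mt19937 &gen,
               void (*send_best)(int from, int to, int count)) {
    std::uniform_int_distribution<int> pick(0, NI - 1);
    switch (s) {
    case Scheme::OneToOne: {              // one random unit sends to another random unit
        int from = pick(gen), to = pick(gen);
        while (to == from) to = pick(gen);
        send_best(from, to, NP);
        break;
    }
    case Scheme::OneToN: {                // one random unit sends to all other units
        int from = pick(gen);
        for (int to = 0; to < NI; ++to)
            if (to != from) send_best(from, to, NP);
        break;
    }
    case Scheme::NToOne: {                // every unit sends to one random unit
        int to = pick(gen);
        for (int from = 0; from < NI; ++from)
            if (from != to) send_best(from, to, NP);
        break;
    }
    case Scheme::NToN:                    // every unit sends to every other unit
        for (int from = 0; from < NI; ++from)
            for (int to = 0; to < NI; ++to)
                if (from != to) send_best(from, to, NP);
        break;
    }
}
```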

Stopping Rule

In the proposed technique, a distinct termination rule is checked on each parallel processing unit. This rule is formulated as follows:

$$\begin{aligned} \delta _{i}^{(k)}=\left| f_{i,\text{ min }}^{(k)}-f_{i,\text{ min }}^{(k-1)}\right| . \end{aligned}$$
(5)

This quantity is calculated at every iteration k. The value \(f_{i,\text{ min }}^{(k)}\) is the best function value of unit i at iteration k. If this quantity remains smaller than a predefined limit \(\epsilon\) for \(N_{M}\) consecutive iterations, the algorithm executed on this unit is terminated. In the present work, when one parallel processing unit terminates, the overall process is terminated as well.
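
A minimal C++ sketch of this rule is given below; the structure and names are assumptions made for the example. Each parallel unit would keep one such object and call update() with its current best value at the end of every iteration.

```cpp
#include <cmath>
#include <limits>

// Illustrative helper for the stopping rule of Eq. (5): a unit stops once
// |f_min^(k) - f_min^(k-1)| stays below eps for NM consecutive iterations.
struct StoppingRule {
    double eps;                                               // tolerance epsilon
    int    NM;                                                // required consecutive iterations
    double prev_best = std::numeric_limits<double>::max();    // f_{i,min}^{(k-1)}
    int    stable    = 0;                                     // iterations with delta < eps so far

    // Called once per iteration with the unit's current best value; returns true
    // when this unit (and, in this work, the whole parallel run) should stop.
    bool update(double current_best) {
        const double delta = std::fabs(current_best - prev_best);   // Eq. (5)
        prev_best = current_best;
        stable = (delta < eps) ? stable + 1 : 0;
        return stable >= NM;
    }
};
```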

Experiments

To establish the efficiency of the method in locating the global minimum of continuous functions, a series of experiments was performed using objective functions from the relevant literature [88, 89]. These objective functions have been studied extensively by many researchers in numerous publications [90,91,92,93]. In this series of experiments, beyond the ability of the method to find the global minimum, the effect of changes in a number of critical parameters, such as the propagation technique, on the behavior of the proposed method was also evaluated. In addition, experiments were conducted in which the actual execution time was measured for various objective functions, recording the reduction in execution time as the number of parallel processing units increased.

Test Functions

The test functions used in the conducted experiments are listed in Table 1.

Table 1 The objective functions used in the experiments

Experimental Results

Experiments were performed on the objective functions presented previously. Each experiment was run 30 times using different random numbers each time, and the averages of the experiments are reported in the tables below. The parallelization was implemented with the freely available OpenMP library [96]. The programming language used was ANSI C++, and the OPTIMUS optimization package, freely available from https://github.com/itsoulos/OPTIMUS (accessed on 3 March 2023), was used. The most critical parameters of the proposed method are listed in Table 2.

Table 2 The values for most critical parameters of the algorithm

In the first set of experiments, the serial version of the proposed technique, i.e., running on a single processor, was compared with the following widely used global optimization techniques:

  1. A genetic algorithm, denoted as GENETIC in the following table.

  2. A simple version of particle swarm optimization, denoted as PSO in the following table.

  3. The parallel implementation of the GaLib library [97], denoted as GALIB in the experimental results.

  4. The parallel global best–worst particle swarm optimization algorithm [98], denoted as PGBWPSO.

In all cases, the same number of particles or chromosomes and the same maximum number of generations were used. In addition, after each method was executed, the local optimization method BFGS was used to improve the global minimum found. The results for this experiment are reported in Table 3.

Table 3 Comparison of the proposed method against other widely used global optimization techniques

In the table, each cell reports the average number of function calls over 30 runs. The numbers in parentheses give the percentage of runs in which the global minimum was successfully found; when this percentage is absent, the success rate was 100%. In addition, a line has been added at the end of the table showing the total number of function calls for each method. The experimental results demonstrate that even running the proposed technique on a single computing unit can produce better results than the other techniques in terms of the required number of function calls and, by extension, the time required to find the global minimum. Moreover, although the PGBWPSO method requires a lower total number of calls than the proposed technique, the proposed technique significantly outperforms it in the reliability of finding the global minimum.

To evaluate the effect of executing the method on parallel processing units, another experiment was performed in which the number of parallel processing units was increased from one to ten; the results are presented in Table 4. A box plot for this experiment is shown in Fig. 3. Additionally, a Friedman test was conducted to determine whether there were significant differences in the number of function calls for five settings of the number of units (\(N_{I}\) = 1, 2, 4, 5, and 10), and the results are shown in Fig. 4. In addition, to give greater reliability to the experimental results, the total number of particles was kept constant at 200 in all runs, as defined in Table 2. This means, for example, that with four computing units, each unit runs a serial version of PSO with 50 particles, and with five computing units, the number of particles per unit is reduced to 40.

Table 4 Experimental results using the proposed method. The propagation scheme was set to 1-to-1 and the value of \(N_{P}\) was set to 5. The number of parallel processing units was varied from 1 to 10
Fig. 3
figure 3

Box plot for the comparison between different numbers of processing units. The propagation method was set to 1-to-1

Fig. 4
figure 4

We conducted a Friedman test to determine whether there were significant differences in the number of function calls for five settings of the number of units (\(N_{I}\) = 1, 2, 4, 5, and 10). The Friedman chi-square was 94.38 with 4 degrees of freedom, indicating a significant difference between the groups (\(p<0.0001\))

From the experimental results, it is clear that increasing the number of parallel processing nodes further reduces the total number of function calls, since more units participate in the search for the global minimum. In addition, a slight increase in the reliability of the method in finding the global minimum is also observed.

Furthermore, to determine the effect of the propagation mechanism on the reliability and speed of the method, another comparative experiment was performed, in which the number of parallel processing units was set to 5 (\(N_{I}\) = 5) and all propagation mechanisms were used. The results of this experiment are presented in Table 5.

Table 5 Comparison of different propagation schemes

This time, the experimental results did not show a clear superiority of any of the propagation methods. However, the 1-to-N propagation scheme appears to perform slightly better than the other techniques for propagating the best particles among the sub-populations.

The last experiment concerned the effect of the \(N_{P}\) parameter on the speed of the method. Five parallel computing units were used and the propagation method was set to N-to-1. The experimental results are presented in Table 6.

Table 6 The effect of the parameter \(N_{P}\) on the speed of the proposed method

As is evident, increasing the value of the \(N_{P}\) parameter significantly reduces the required number of function calls up to a value of 5; beyond that value, the total number of calls remains roughly constant. In practice, this means that sending not only the best particle, but also those with the next lowest function values, further reduces the required number of function calls.

Experiments for the Execution Time

To assess the efficiency of the method with respect to execution time, three additional series of experiments were performed. In all experiments, the total number of particles in the particle swarm optimization was fixed at 500, from 1 to 20 parallel processing units were used, and the total execution time was measured over 30 independent runs. The experiments were conducted on an AMD EPYC 7552 system equipped with 32 GB of RAM, running Ubuntu 20.04.

During the first series of additional experiments, the high-conditioned elliptic function, defined as

$$\begin{aligned} f(x)=\sum _{i=1}^{n}\left( 10^{6}\right) ^{\frac{i-1}{n-1}}x_{i}^{2}, \end{aligned}$$

was used as a test case. In the experiments, the values n=10,20,30,40,50 were used and the experimental results are displayed graphically in Fig. 5.
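
For reference, a direct C++ transcription of this objective (illustrative only, not the benchmark code actually used in the experiments) could be written as follows:

```cpp
#include <cmath>
#include <vector>

// High-conditioned elliptic function for n >= 2, transcribed from the formula above.
double elliptic(const std::vector<double> &x) {
    const int n = static_cast<int>(x.size());
    double sum = 0.0;
    for (int i = 1; i <= n; ++i)
        sum += std::pow(1.0e6, double(i - 1) / double(n - 1)) * x[i - 1] * x[i - 1];
    return sum;
}
```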

Fig. 5
figure 5

Total execution time for different number of parallel processing threads for the objective function ELP

From the experimental results, it follows that doubling the parallel processing units almost halves the required execution time in most cases.

During the second series of experiments, different versions of the GKLS function were used for dimensions from two to five. This function was also thoroughly tested in the work of Sergeyev et al. [17]. In each dimension, ten different instances of the function were used in order to obtain a reliable estimate of both the performance of the method and the reduction in execution time as the number of parallel processing units increases. The average execution time for different dimensions of the GKLS function and for different numbers of processing units is shown graphically in Fig. 6.

Fig. 6
figure 6

Average execution time for the GKLS function. The dimension of the function varies from two to five

As shown, the execution time decreases rapidly with increasing parallel processing units for all instances of the function. Furthermore, the average number of function calls was also measured for every case, and the results are plotted in Fig. 7.

Fig. 7
figure 7

Average function calls for different cases of the GKLS function

Increasing the number of parallel processing units also reduces the number of required function calls, especially for the low-dimensional cases (Gkls2 and Gkls3). However, the most interesting point of this series of experiments is the ability of the proposed technique to locate the global minimum as the number of parallel computing units increases. This capability is graphically represented in Fig. 8.

Fig. 8
figure 8

Average success rate in finding the global minimum for different cases of the Gkls function

From this graph, it can be concluded that the increase in parallel processing units significantly improves the efficiency of the technique in finding the global minimum.

In the third series of additional experiments, a more demanding problem with multiple practical applications was used: the proposed method was applied to train an artificial neural network [99, 100] with ten processing nodes on two widely used classification problems from the relevant literature, the Wine problem [101, 102] and the Parkinsons problem [103]. The results are graphically outlined in Fig. 9.
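
To illustrate how this training task is cast as a global optimization problem for the proposed method, the sketch below defines the objective as the sum of squared errors of a one-hidden-layer sigmoid network whose weights form the decision vector minimized by PSO. The weight layout, structure names, and error measure are assumptions made for this example and do not reproduce the exact network formulation of [99, 100].

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Dataset {
    std::vector<std::vector<double>> patterns;  // input vectors
    std::vector<double> targets;                // desired outputs
};

static double sigmoid(double v) { return 1.0 / (1.0 + std::exp(-v)); }

// Weight layout per hidden node i: [output weight, input weights..., bias].
// The vector w (of size H * (dim + 2)) is the point optimized by PSO.
double training_error(const std::vector<double> &w, const Dataset &d, int H) {
    const int dim = static_cast<int>(d.patterns[0].size());
    double error = 0.0;
    for (std::size_t p = 0; p < d.patterns.size(); ++p) {
        double out = 0.0;
        for (int i = 0; i < H; ++i) {
            const int base = i * (dim + 2);
            double act = w[base + dim + 1];                 // bias of hidden node i
            for (int j = 0; j < dim; ++j)
                act += w[base + 1 + j] * d.patterns[p][j];  // weighted inputs
            out += w[base] * sigmoid(act);                  // weighted hidden output
        }
        const double diff = out - d.targets[p];
        error += diff * diff;                               // sum of squared errors
    }
    return error;   // the value minimized over all weights w
}
```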

Fig. 9
figure 9

Total execution time for different number of parallel processing threads for neural network training for two classification datasets

Again, the required training time decreases drastically with increasing parallel processing units.

Conclusions

In the present work, a number of ideas were presented that lead to an efficient parallel implementation of the PSO technique. The need to parallelize the method is significant, since it is an iterative process involving repeated operations on vectors, and for large problems these operations may require considerable computational time. In the current work, a technique for propagating the best particles between computing units, as well as a termination rule for the overall process, were introduced. Regarding the propagation of the best particles, the experiments show that it is more efficient to send more than just the single best particle between the computing units; accordingly, increasing the parameter \(N_{P}\) from one to five decreased the required number of function calls. The propagation scheme itself did not have a drastic effect on the efficiency and speed of the method, although the 1-to-N scheme appeared to give slightly better results. However, the biggest gain from the method lies in increasing the number of parallel processing units. The experiments show that, as the parallel processing units increase, the total number of function calls required to find the global minimum decreases for almost every objective function, and the reliability of the method in finding the global minimum also improves to some extent. The effectiveness of the proposed method was also shown in the comparison between its serial version and other widely available techniques from the relevant literature, in which the serial version required significantly fewer function calls to find each global minimum. Overall, the main contributions of the proposed method are the velocity calculation mechanism, the propagation of particles among parallel computing units, and the termination rule, which together form an innovative technique for improving the performance of the algorithm.

On a percentage basis, the proposed technique significantly outperforms the other two compared global optimization techniques. The serial version of the proposed technique reduces the required number of function calls by 22–28% compared to the genetic algorithm and the particle swarm optimization technique, respectively. With more processing units, the reduction in the number of function calls reaches approximately 70%. Furthermore, the experiments on actual execution time indicate that each doubling of the number of parallel processing threads roughly halves the execution time required to find the global minimum.

In the future, the work done on the parallelization of the PSO technique can be continued with new termination rules suitable for execution in parallel computing environments, as well as with possible extensions of the techniques for propagating the best particles between the parallel computing units.