Comparison of Parallel Simulated Annealing on SMP and Parallel Clusters for Planning a Drone's Route for Military Image Acquisition.
Recently, drones or Unmanned Aircraft Vehicles (UAV) became very popular. This refers to their ability to undergo dangerous missions without exposing human beings' lives to any type of danger. Drones are associated with sensors and devices such as cameras, computing units, communication tools, and others. They are remotely controlled [1,2].
Drones are utilized in diverse military and civilian applications. Examples include, but are not limited to, aerial surveillance, image acquisition, remote sensing, and scientific research . In addition to saving human lives, drones complete missions quickly with minimum cost [1, 3]. On the other hand, the main restriction imposed on a drone is its limited energy; and therefore, flight time. Consequently, one of the main challenges when dealing with drones is to find an effective route plan in the minimum possible amount of time .
As drones usually follow preloaded instructions without human intervention, the route plan may be generated either online during the flight, or offline before taking off. Moreover, drones route planning becomes more challenging when there are several geographical locations to be visited that are dispersed apart; these are called waypoints . This research targets for finding a route plan that allows drones to acquire images from predefined waypoints in the least possible amount of time. Each waypoint is to be visited exactly once. Obviously, this is analogous to the well-known Travelling Salesman Problem (TSP). Finding a near-optimum route plan is necessary to minimize the drone's power consumption during the flight to cover the largest possible geographical area; and therefore, visit the largest number of targeted waypoints. In addition, achieving the mission in the minimum possible time ensures its secrecy.
However, solving TSP problem using a brute-force approach requires a significant amount of time to try every possible solution  . This approach is not suitable for the problem in-hands, since time and secrecy are both important factors in military applications. Therefore, a metaheuristics algorithm, the simulated annealing (SA) algorithm is used. SA is capable of finding an acceptable local optimum route plan . Although SA is used to solve several complex problems, but it requires significant processing time to find a suitable solution . Therefore, parallel computing is expected to positively contribute in the solution of this problem. Parallelism may minimize the execution time to fulfill the requirements of the military mission. In addition, it may increase the chance to provide a better-quality plan. However, the SA is inherently sequential as each new solution depends on the previous one. Therefore, this imposes one of the challenges associated with parallelizing SA. The improvement of the parallel computational power can overcome this challenge. In this research, we aim to study the parallel SA on SANAM supercomputer.
The performance of the parallel program is measured in terms of three metrics; namely, the speedup, the efficiency, and the scalability .
Therefore, the aim of this research is to design a parallel SA implementation with the purpose of generating a route plan for military drones emitted with the intention of acquiring images at multiple sites. Therefore, the program speedup, efficiency, and scalability are to be maximized; while the final distance should be at its minimum.
The layout of this paper is as follows: Section 2 exposes similar work in the literature. Section 3 explains the design of the asynchronous CSS. Section 4 reports the experiments' results. The paper is then concluded in Section 5 with a hint to our plan to future work.
2 RELATED WORK
The drone route planning problem in this research is analogous to the TSP. There are several algorithms that provide a solution for the TSP, such as LS, BB, EAs, ACO, and hybrid algorithms . This chapter exposes the sequential and parallel solutions to the TSP in general, and emphasizes on the drone route planning.
Many sequential algorithms are proposed to solve the TSP, some of which are based on the ACO algorithm. The proposed solution in  provides a modification of the traditional ACO method; this is known as the High Performance ACO. The traditional ACO algorithm involves a single ant randomly looking for the path; whereas the updated algorithm applies the TSP on a group of ants. The authors provide a comparison between their proposed algorithm and the ant colony system algorithm on various number of nodes. They found that the proposed algorithm completes the task in less time.
Also, Local search algorithms are widely used to solve the TSP. The research in  provides an experimental study to test the performance of the Lin-Kernighan and the Multi-Neighborhood Search. Results show that the Lin-Kernighan provides better results than the Multi-Neighborhood Search in terms of runtime.
On the other hand, several parallel solutions are proposed in the literature to solve the TSP using diverse parallel programming platforms. In , the experiment is performed on a standard multicore CPU. The reported results indicate that a gained speedup of 7.3 on 8 cores. Thus, the usage of PSO algorithm is more suitable for realtime planning for the drone. Moreover, the experiments also proved that the performance of the GA is better than the PSO. The same authors improved their results in  by proposing a parallel hybrid algorithm that exploits the advantages of both the PSO and the GA to generate a suitable path plan for fixed-wing drones. It is found that the gained speedup is 10.7 on a 12-core SMP.
In , the authors planned the drone's path using parallel ACO solution on both GPU and CUDA platforms. The generated path guides the drone in disseminating keys and collecting data from wireless sensors, which are previously deployed at minimum cost. The drone launches from a station, visits all sensors in a limited period of time, then returns back to the same station it is emitted from. In their experiment, they compared the sequential performance with the parallel implementation performance. They showed that the speedup is higher when using GPU platform.
In , the authors generate multiple route paths for several drones simultaneously using synchronous parallel SA on the GPU. Experiments' results prove that the processing time is reduced, and a better solution is acquired, as compared to the CPU implementation.
3 DESIGN AND IMPLEMENTATION
This section discusses the general design of SA. Then, the implementation of the asynchronous CSS is then discussed.
3.1 Simulated Annealing Design
Two concepts are to be defined when it comes to designing an iterative metaheuristics algorithm; namely, the solution representation and the objective function. Since SA is classified as a single-solution based metaheuristics, it requires the definition of the neighborhood. These are detailed in the following subsections .
3.1.1 Solution Representation
The solution representation of the drone route planning is a permutation of size n, where n is the number of the waypoints to be visited exactly once. Each permutation represents one solution as shown in Figure 4 as a sequence of nodes, where each node represents a waypoint and its index represents the corresponding order. The number of all permutations that represent the solution space taking into consideration the fixed point of start (ground station) is (n-1)!.
3.1.2 Neighborhood Solution
The neighborhood of a solution is found by performing a move operator which leads to a tiny perturbations to the solution S . As the drone route plan is represented by a permutation, a neighborhood is generated by the swap operator between two elements in the solution. This is illustrated in Figure 5.
3.1.3 Objective function
The objective function is used to define the goal to be achieved by the SA. The goal of the problem in-hands is looking for the shortest route plan for a drone such that it visits each waypoint exactly once. As previously mentioned, this is similar to the TSP and has a similar objective function which is shown below: [Florin](7r) in [[summation].sup.n-1.sub.i=1] [d.sub.[pi](i)[pi](i + 1)] + [d.sub.[pi](i)[pi](1) (1)
- [pi] is a permutation representing a tour of the drone;
- n is the number of waypoints.
3.2 Sequential simulated annealing
IAs previously mentioned, the SA, like other single-solution based metaheuristics, includes two main steps. The first step is to generate the initial solution, which is constructed by using a greedy heuristic, such as the nearest neighbor algorithm or randomly. In the design of the sequential algorithm of this research, the random method is used because the greedy heuristics produces a solution in local optimum, which may not be able to provide an improved local optimum solution at the end .
The second step, which is the solution improvement, the design uses the swap operator between two points to generate a neighbor solution from the current solution.
In fact, the SA algorithm imitates the process of the solid hardening, which depends on the initial temperature value and the cooling rate.
Therefore, the SA implementation consists of two main loops to provide a suitable solution. The outer loop, known as the cooling loop, is responsible for managing the temperature value. On the other hand, the inner loop, known as the equilibrium state loop, is responsible for constructing a neighbor solution from the current one, evaluating it, and computing the probability of the acceptance using the following formula:
e (- [DELTA]/T) (2)
- [DELTA] is the difference between the cost of the old and the new solution;
- T is the current temperature Accordingly, the main parameters that are to be defined during SA implementation are the initial temperature, the cooling rate, and the stopping condition. The latter might be the minimum temperature or a specific number of iterations. In this research, the stopping condition is taken based on a minimum temperature. The other parameters are determined after several experiments . The flowchart of the sequential algorithm is shown in Figure 3.
3.3 Parallel Simulated Annealing
Metaheuristic algorithms are sequential by nature; SA is no exception. Consequently, parallelizing the SA entails a challenging problem . Many approaches are proposed to parallelize SA algorithm :
- Decompose the search space into smaller parts, then assign each part to a processor to find the minimum cost and share its result with other processors.
- Apply the synchronous approach, where each processor uses the same initial solution and performs parallel improvements within the same temperature. Then at each temperature value the best solution is shared between the processors to perform parallel improvement until the end. Figure 4 illustrates the synchronous approach.
- Apply the asynchronous approach, where each processor executes SA independently. The initial solution may be the same or different across the processors. Finally, compute a reduce operation to get the best solution among them. As illustrated in Figure 5.
The synchronous parallel SA on shared-memory processor (SMP) is previously studied in . In this paper, the asynchronous parallel SA Complete Search Space (CSS) is investigated. The CSS algorithm starts with different initial solutions for the complete search space. The idea is illustrated in Figure 5.
3.3.1 Parallel SA approach on cluster
On the other hand, using the approach in  to implement SA on parallel clusters will increase the overhead of communication between nodes. This is explained by the fact that in synchronous
parallel SA, the processors frequently communicate with each other. Shared memory processor environment is suitable for such solution. However, in parallel clusters, communication is needed between processors after each inner loop. This is performed through message passing on parallel clusters. Message passing imposes an overhead on the program; and therefore, increases the speedup.
The asynchronous parallel SA CSS is therefore suggested to minimize the communication overhead. In this algorithm, each cluster node works on the complete search space (CSS) as illustrated in Figure 6. In the start, several initial solutions are generated and distributed over the nodes. Then each node applies the sequential SA. However, data parallelism is applied. At the end, all nodes send the produced route together with the final distance. The minimum distance with its corresponding route are then selected to be the best route plan.
3.4 Handling the Drone Energy Constraints
After generating the route plan using SA algorithm at the ground station, the route is evaluated in terms of the energy required to complete the planned mission. If the energy level is above a predefined threshold, then the drone is emitted according to the planned route. Otherwise, the mission is divided into multiple journeys. The drone is re-charged after the end of each trip, before starting a new one.
In fact, the drone's energy is expressed in terms of its enduring lifetime L. Therefore, the time T needed to travel the final distance D, as planned by the SA, is calculated as follows:
T = D / S (3)
- T is the time required to make the complete calculated tour, including the return trip to the ground station;
- D is the final distance as calculated by the SA algorithm;
- S is the drone's speed as specified in its hardware specifications
If the calculated time T is less than the drone's enduring lifetime L, then the drone is safely launched. Otherwise, the journey is broken into multiple trips. Figure 7 illustrates the previously mentioned steps. Therefore, the applied predefined threshold is the enduring lifetime of the emitted drone.
4 EXPERIMENTAL RESULTS
This section details the methodology used in the conducted experiments. In addition, the obtained results are discussed, analyzed, and represented graphically. The first subsection reveals the deployed environment in terms of software and hardware specifications. Subsection 4.2 explains the performance metrics used to measure the effectiveness of the developed algorithm. Subsection 4.3 details the general methodology used to collect the figures of the experiments. Finally, subsection 4.4 details the results of both the sequential and parallel CSS algorithm.
4.1 Software and Hardware Specification
This research is implemented on KACT's Saudi supercomputer SANAM. KACST is King Abdel-Aziz City for Science and Technology in Riyadh, Kingdom of Saudi Arabia. SANAM includes Intel Xeon E5-2650 CPUs, with 12 cores. The access to SANAM is available through a group of interactive login nodes, which are connected to KACST network and Internet [14, 15]. The program is coded under Linux Ubuntu 16, with JAVA using the pj2 library for threads management , and NetBeans as programming tool. In addition, Simple Linux Utility for Resource Management (SLURM) is used for Linux clusters management in SANAM. SLURM performs three main tasks: First, it is responsible for nodes allocation, management, execution, and monitoring reserved nodes. Second, it manages waiting work queues and finally, resolves conflicting resource orders .
In this research, ten nodes are used. This is the maximum allowed by KACST to the external users.
4.2 Performance Measurements
The main objectives of this research are to minimize the execution time, increase resources utilization, and increase scalability. In addition, the final distance is to be minimized. Therefore, speedup, efficiency, scalability, and final distance are used to evaluate the performance of the parallel program .
4.2.1 Speed up:
Speed up is used to measure the extent of the time reduction gained from the parallel implementation as compared to its sequential counterpart. The gained speed up is calculated as the ratio of the execution time of the sequential program Tseq to that of the parallel program Tpar :
Speed up = [[T.sub.seq]/[E.sub.par]] (4)
The efficiency (EE) is used to measure how a program is close to the
ideal speed up. In other words, it indicates the effectiveness of the parallel program to use the available resources. The ideal program efficiency is equal to 1. However, the actual efficiency is between 0 and 1. As the efficiency is closer to 1, as the program is making better use of the available hardware resources. Efficiency is computed as the ratio of the actual speed up of the parallel program Tpar to the number of processors K that are used to run the parallel program. This is expressed in the following formula  :
Efficiency = [[T.sub.par]/K] (5)
Scalability is the ability of a program to adapt to the increasing amount of problem size. In order to measure the scalability of a parallel program, the sequential version is run multiple times; each time the problem size is increased. When the program crashes, and cannot hold any more the given problem size, the last recorded size is taken. This is Nseq. The same experiment is repeated with the parallel version to get Npar. The scalability is therefore calculated according to the following formula :
Scalability = [[N.sub.par]/[N.sub.seq]] (6)
It is expected that the problem size increases with the increase of the number of processors.
The execution time of a parallel program is hardly the same when run multiple times successively. This is because the operating system is conducting its own activities at the same time as the program runs. Since these activities differ from a run to another, the resulting execution time is directly affected. The interference of the operating system always increases the resulting execution time. Therefore, to measure the execution time of a program as accurate as possible, it is run multiple times--from 7 to 10 times--and the execution time is recorded after each run. Then, the minimum of these recordings is taken since this represents the less interference from the part of the operating system.
4.4 Experimental results
The results of both versions of the SA algorithm are reported in the following subsections. The parallel platform is an SMP; with a number of threads ranging from 2 to 8. A set of experiments is conducted according to the previously detailed methodology detailed with the aim of measuring the three main performance metrics: speedup, efficiency, and scalability in addition to the quality of the solution. These are exposed in the following subsections.
The first set of experiments aims to explore the impact of the number of waypoints on the execution time. Therefore, the program is run multiple times with various number of waypoints; namely, 50, 100, 150, and 200. The experiment is done only for these four problems sizes as the increments in the execution time is linear.
Table 1 shows the minimum execution time for the sequential, SMP -as performed in -, and CSS. Note that all the parameters of SA; namely, initial temperature, the cooling rate, and the stopping condition, are fixed. Worth to mention, the parameters are chosen after several experiments to ensure the quality of the final solution. Figure 8 shows the impact of the number of waypoints on the execution time.
As seen in the graph depicted in Figure 8, the execution time of sequential and parallel SA versions is proportional to the problem size. It is noted that the execution time of the parallel cluster CSS is the highest when compared to the sequential and SMP. This is due to the fact that in CSS, each node works on the complete search space. In addition, several initial solutions are to be produced and propagated to all nodes at the beginning of the program. Therefore, the communication between the nodes imposes an overhead on the execution time.
On the other hand, the execution time of the parallel SMP is the best. This is explained to the lack of communication overhead, since data is shared in the main memory.
The gained speedup is calculated from the values recorded in Table 1 according to formula 4 The results are depicted in Figure 9. The highest speedup is achieved by the SMP version; it is equal to 6.81for 200 waypoints as compared to a speedup of 0.57 for the CSS.
4.4.2 Efficiency Measurement
The efficiency is calculated for both SMP and CSS versions according to formula 5 The results are displayed in Table 2. The corresponding graph is depicted in Figure 10. It is noticed that the efficiency is the best with parallel SMP where all threads in the nodes are utilized.
Although the efficiency is equal to 1 for the sequential program, but this does not imply that the program makes full use of the available hardware resources. However, the actual speed of the sequential program and the number of cores are both equal to one.
4.4.3 Scalability Measurement
Here, the largest number of waypoints that can be handled by each parallel SA version is divided by that handled by the sequential program. The results are reported in Table 3, with calculations deduced from formula 6 Again, the scalability in parallel SMP is too much better than that of the CSS. This is explained by the fact that all threads in the node are utilized in SMP. Thus, increasing the ability to handle larger problem sizes than sequential and parallel CSS. Moreover, the parallel CSS does not provide any improvement on the sequential SA.
4.4.4 The quality of the route plan
The final distance values are depicted in Table 1 with various SA algorithm versions and waypoints. The corresponding graph is shown in Figure 11. It is noticed that the quality of the route plan is the best with the SMP SA version as compared to the other program versions. However, the CSS SA produces very close distances as compared to the SMP SA. On the other hand, it gives better distance than the sequential. This is because the CSS SA uses more nodes working on different initial solutions; thus, increasing the chance of improving the distance.
5 CONCLUSION AND FUTURE WORK
This research targets for generating a minimum route plan distance for a single drone emitted by a military organism for image acquisition. The area in concern may be a sensitive site, a defense and/or attack war front, or an enemy's territory. Therefore, the route path should be completed in the least amount of time.
The SA algorithm is implemented to solve this problem, which is analogous to the TSP. Since metaheuristics require an extensively long time for execution, parallel computing is deployed to accelerate the SA algorithm.
Therefore, several parallel versions of the SA are developed. First, the parallel SA is previously developed . In this paper, another parallel version is implemented on SANAM supercomputer. Ten nodes are used to implement the program using Java programming language under Linux on SANAM supercomputer.
The two parallel versions are compared in terms of speedup, efficiency, scalability, and the final distance.
The reported results prove that the synchronized parallel SA on SMP outperforms the CSS for all number of waypoints in terms of gained speedup, efficiency, and scalability. On the other hand, the CSS outperforms the SMP SA in terms of quality of solution.
In the future, more methods to parallelize SA are to be investigated.
We would like to extend our gratitude to King Abdel-Aziz City for Science and Technology (KACST) in Riyadh for allowing us to use their SANAM supercomputer at all times. Not only this, but the staff was also of great support and helpful; replying to our posed questions promptly albeit their heavy duty load.
 N. Ozalp and O. K. Sahingoz, "Optimal UAV path planning in a 3D threat environment by using parallel evolutionary algorithms," in Unmanned Aircraft Systems (ICUAS), 2013 International Conference, 2013, pp. 308-317.
 M. O. UgurCekmez, "A UAV Path planning with parallel ACO algorithm on CUDA platform," presented at the IEEE Unmanned Aircraft Systems (ICUAS), FL, USA, 2014.
 M. Coeckelbergh, "Drones, information technology, and distance: mapping the moral epistemology of remote fighting," Ethics and information technology, vol. 15, pp. 87-98, Jun 2013.
 X.-f. Liu, Z.-w. Guan, Y.-q. Song, and D.-s. Chen, "An optimization model of UAV route planning for road segment surveillance," Journal of Central South University, vol. 21, pp. 2501-2510, Jun 2014.
 T. Turker, G. Yilmaz, and O. K. Sahingoz, "GPU-Accelerated Flight Route Planning for Multi-UAV Systems Using Simulated Annealing," in International Conference on Artificial Intelligence: Methodology, Systems, and Applications, 2016, pp. 279-288.
 M. Sanjabi, A. Jahanian, S. Amanollahi, and N. Miralaei, "ParSA: parallel simulated annealing placement algorithm for multi-core systems," in Computer Architecture and Digital Systems (CADS), 2012 16th CSI International Symposium, 2012, pp. 19-24.
 A. Kaminsky, "BIG CPU, BIG DATA: Solving the World's Toughest Computational Problems with Parallel Computing," 2016.
 KACST, The Saudi Supercomputer "SANAM" is the World's 2nd Leader in Energy Efficiency", KACST, 2012. [Online]. Available: https://www. kacst.edu. sa/eng/about/news/Pages/news3841117-3854.aspx.
[Accessed: 25- Apr-2017].
 V. Roberge, M. Tarbouchi, and G. Labonte, "Comparison of parallel genetic algorithm and particle swarm optimization for real-time UAV path planning," IEEE Transactions on Industrial Informatics, vol. 9, pp. 132-141, May. 2013.
 V. Roberge, M. Tarbouchi, and F. ALLAIRE, "Parallel hybrid metaheuristic on shared memory system for real-time UAV path planning," International Journal of Computational Intelligence and Applications, vol. 13, p. 1450008, Jun. 2014.
 E.-G. Talbi, Metaheuristics: from design to implementation vol. 74 ,pp 126-133: John Wiley & Sons, 2009.
 A. Ferreiro, J. Garcia, J. G. Lopez-Salas, and C. Vazquez.:An efficient implementation of parallel simulated annealing algorithm in GPUs, Journal of Global Optimization, vol. 57, pp. 863-890, Nov. 2013.
 S. Zaghloul and E. Alsafi, "Drone route planning for military image acquisition using parallel simulated annealing", International Journal of New Computer Architectures and their Applications (IJNCAA), vol. 7, no. 3, Sep, 2017.
 D. Rohr, S. Kalcher, M. Bach, A. A. Alaqeeliy,
H. M. Alzaidy, D. Eschweiler, V. Lindenstruth, S. B. Alkhereyfy, A. Alharthiy, and A. Almubaraky, "An energy-efficient multi-GPU supercomputer," in High Performance Computing and Communications, 2014 IEEE 6th Intl Symp on Cyberspace Safety and Security, 2014 IEEE 11th Intl Conf on Embedded Software and Syst (HPCC, CSS, ICESS), 2014 IEEE Intl Conf on, 2014, pp. 42-45.
 "Intel[R] ARK (Product Specs). (2017). Intel[R] Xeon[R] Processor E5-2650 v4 (30M Cache, 2.20 GHz) Product Specifications. [online] Available at: https://ark.intel.com/products/91767/Intel-Xeon-Processor-E5-2650-v4-30M-Cache-2 20-GHz
[Accessed 13 Dec. 2017].".
 Slurm.schedmd.com. (2017). Slurm Workload Manager. [online] Available at: https://slurm.schedmd.com/quickstart.html
[Accessed 14 Dec. 2017]."
 T. Rauber and G. Runger, Parallel programming: For multicore and cluster systems: Springer Science & Business Media, 2013.
 A. Kaminsky, "Building Parallel Programs: SMPs, Clusters, and Java. Cengage Course Technology (2010)," ISBN 1-4239-0198-3.
Eman Alsafi and Soha S. Zaghloul, PhD
King Saud University
Table 1 Relationship between the execution time and number of waypoints with the corresponding output distance #waypoints Sa Execution Distance algorithm time (s) version 50 Sequential 1.1711 1762 Parallel 2.113 1483 cluster CSS Parallel 0.561 1389 SMP 100 Sequential 3.276 3110 Parallel 3.644 2610 cluster CSS Parallel 0.651 2162 SMP 150 Sequential 4.741 4880 Parallel 5.059 3733 cluster CSS Parallel 0.753 3487 SMP 200 Sequential 6.072 4976 Parallel 6.544 4439 cluster CSS Parallel 0.891 4320 SMP Table 2 Efficiency of sequential, SMP, and CSS versions #way Sa algorithm Executi Speed Efficiency point version on time up s (s) Sequential 1.1711 1 1 50 Parallel 2.113 0.55 0.055 cluster CSS Parallel SMP 0.561 2.09 0.17 Sequential 3.276 1 1 100 Parallel 3.644 0.9 0.09 cluster CSS Parallel SMP 0.651 5.03 0.42 Sequential 4.741 1 1 150 Parallel 5.059 0.94 0.09 cluster CSS Parallel SMP 0.753 6.3 0.52 Sequential 6.072 1 1 200 Parallel 6.544 0.93 0.093 cluster CSS Parallel SMP 0.891 6.81 0.57 Table 3 Measure scalability of parallel SA SA algorithm Size up Scalability version Sequential 3100 -- Parallel cluster CSS 3000 0.97 Parallel SMP 23000 7.4
|Printer friendly Cite/link Email Feedback|
|Title Annotation:||shared-memory processor|
|Author:||Alsafi, Eman; Zaghloul, Soha S.|
|Publication:||International Journal of New Computer Architectures and Their Applications|
|Date:||Jan 1, 2018|
|Previous Article:||Measuring the Performance of Data Placement Structures for MapReduce-based Data Warehousing Systems.|
|Next Article:||A Finger-Mounted Haptic Device with Plane Interface.|