RP-Ring: A Heterogeneous Multi-FPGA Accelerator.
N-body simulations are widely used in scientific and engineering applications. Problems in astrophysics, semiconductor device simulation, molecular dynamics, plasma physics, and fluid mechanics require efficient N-body simulation methods. The problem can be described as follows: given the initial positions and velocities of N particles, update their positions and velocities over T time steps. However, the size N of the simulation is always limited by the available computational resources, and the increasing need for larger system simulations requires more efficient computational methods. Many researchers have therefore worked on faster algorithms for large-scale particle simulation and invented efficient algorithms such as the Barnes-Hut algorithm, which reduces the computation complexity to O(N log N), and the fast multipole method (FMM), which has a computation complexity of O(N). However, these algorithms rely on approximations and are difficult to parallelize. The direct-summation N-body algorithm computes the interactions between particles exactly and is convenient to parallelize. Moreover, direct summation is a fundamental building block for other algorithms, so many high-performance direct-summation platforms have emerged in recent years. In the modified Newtonian dynamics simulation project of Yunnan Observatories, Chinese Academy of Sciences, we aim to build such a platform that makes use of all the hardware resources available in our lab while meeting both power and performance requirements.
1.1. Background. Computational solutions for N-body simulation can be categorized as CPU, GPU, ASIC, and FPGA according to the computing unit. These technologies vary in cost, programming abstraction level, and power consumption. There has been one thorough study on x86-based and PowerPC-based tuning and several papers on GPU cluster implementations [2, 6]. The GRAPE ("GRAvity piPE") project built ASIC-based high performance computing solutions for gravitational force calculations, in which particle interactions were computed by an ASIC chip in the form of a fully pipelined hardwired processor dedicated to gravitational force calculation [7, 8]. Hamada et al. used the Bioler-3 system to implement an FPGA-based gravitational force computing accelerator. These studies have shown that CPU performance is limited and ASICs offer no clear advantage; GPUs are competitive in performance and performance per cost; and the performance-per-watt figure favours FPGAs.
1.2. Related Work and Challenge. Figure 1 shows the basic structure of a hardware-accelerated solution for N-body simulations. It consists of a host computer and an acceleration coprocessor for potential calculation. The host computer performs all other calculations, for example, position and velocity updates. More specifically, the coprocessor consists of a number of pipelines and a particle memory that stores particle information. The control/interface unit receives instructions from the host computer and directs the pipelines to fetch particle information from particle memory and calculate the potential [7, 9].
Figure 2 shows how to build a computing cluster with GRAPE. To build a cluster with 16 GRAPE boards, we need one host-interface board (HIB), five network boards (NBs), and 16 processor boards (PBs). Each NB has one uplink and four downlinks, so the 16 PBs are connected to the host computer through a two-level tree network of NBs. The NBs and HIB handle the communication between the PBs and the host computer. As the number of PBs increases, the demand for NBs grows rapidly. The interconnection overhead of building a large computing cluster thus becomes unacceptable, and interconnection becomes a challenge in building large computing clusters.
1.3. Motivation. We want to develop an accurate numerical computation method based on MOND theory. MOND (Modified Newtonian Dynamics) is an alternative to the popular Dark Matter (DM) theory; it successfully explains the distribution of force in an astronomical object from the observed distribution of baryonic matter.
MOND's numerical algorithm differs from that of traditional N-body simulation, so GRAPE is not suitable for our mission. MOND is based on potential calculation and can be described as follows: given the density distribution of baryonic matter \rho_b(x), find the final potential \Phi(x). The final potential is influenced by the distribution of two kinds of matter: \rho_b(x), the density distribution of baryonic matter, including stars and gas, and \rho_{ph}(x), the density distribution of phantom dark matter, which is the theoretical hypothesis of MOND. The final gravitational potential \Phi(x) is given by the classical Poisson equation:
\nabla^2 \Phi(x) = 4\pi G (\rho_b(x) + \rho_{ph}(x)). (1)
\rho_{ph}(x) is given by the potential \phi(x) of \rho_b(x) as in the following equation:
\rho_{ph}(x) = \frac{\nabla \cdot [\mu(|\nabla\phi|/a_0)\, \nabla\phi(x)]}{4\pi G}. (2)
Finally, the potential \phi(x) of \rho_b(x) is given by the classical Poisson equation:
\nabla^2 \phi(x) = 4\pi G \rho_b(x). (3)
Unlike the traditional direct-summation N-body algorithm, MOND requires a more time-consuming potential calculation, whose computation complexity is O(N^2).
With a limited project budget, we chose the FPGA-based direct-summation algorithm. Instead of designing new specialized boards, we reuse existing ones to reduce cost. The scale of the MOND simulation is limited by the available computational resources; a single FPGA chip does not provide enough logic resources, so a multi-FPGA solution is the only practical choice. The major contributions of our work are as follows:
(1) To accelerate direct-summation N-body simulation, we propose an extensible heterogeneous multi-FPGA solution called RP-ring (reconfigurable processor ring) that utilizes multiple existing, different FPGA boards. Experiments show that our implementation achieves a two-orders-of-magnitude speedup over a high-end CPU implementation.
(2) To prevent the slowest board from dragging down overall performance, we propose a model of how to decompose the workload among FPGAs and optimize logic resource allocation. To improve whole-system performance, this model divides the workload based on the logic resources, memory bandwidth, and communication bandwidth of each FPGA board and allocates logic resources among the potential calculation pipelines, DMA/FIFO, and other modules.
The remainder of this paper is organized as follows: Section 2 presents background on the numerical simulation algorithm of MOND theory. The following section presents the architecture of RP-ring. We then present the model that guides workload decomposition and logic resource allocation. After that, we present our hardware implementation results on heterogeneous multi-FPGA and compare them with implementations on various GPU boards and CPUs as well as ASIC implementations. These results are then discussed before conclusions are drawn.
2. Direct-Summation Algorithm
MOND numerical simulation is a variant of N-body simulation; the calculation can be described in the following five steps:
(1) With the known baryonic matter distribution \rho_b(x), calculate the gravitational potential \phi(x) according to (3).
(2) Calculate the phantom dark matter distribution \rho_{ph}(x) with (2) by finite differences.
(3) Solve the Poisson equation (1) to get the final potential \Phi(x).
(4) Calculate the acceleration and velocity from the final potential \Phi(x) by finite differences.
(5) Calculate the location of each particle in the next time step.
Steps (2), (4), and (5) have a computation complexity of O(N), so running them serially on the CPU does not affect overall performance. Steps (1) and (3) have a computation complexity of O(N^2), so we focus on accelerating them. In the direct-summation algorithm, Steps (1) and (3), which calculate the gravitational potential, use the solution of the Poisson equation:
\phi(x_i) = -G \sum_{j \ne i} \frac{m_j}{|x_i - x_j|}. (4)
Therefore, in the rest of this paper we use FPGAs to construct potential calculation pipelines and propose the RP-ring solution to build a larger multi-FPGA system. It should be pointed out that this work is not limited to MOND numerical simulation; it extends conveniently to other direct-summation N-body simulations.
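For reference, the direct summation in (4) amounts to a double loop over particle pairs. The sketch below (plain Python with a hypothetical tuple layout, G = 1 by default) is the serial computation that the hardware pipelines parallelize:

```python
import math

def direct_potential(particles, G=1.0):
    """O(N^2) direct summation: potential at each particle due to all others.

    `particles` is a list of (x, y, z, m) tuples (hypothetical layout).
    Returns one potential value per particle, following (4).
    """
    n = len(particles)
    pot = [0.0] * n
    for i in range(n):
        xi, yi, zi, _ = particles[i]
        for j in range(n):
            if i == j:
                continue  # a particle does not act on itself
            xj, yj, zj, mj = particles[j]
            r = math.sqrt((xi - xj) ** 2 + (yi - yj) ** 2 + (zi - zj) ** 2)
            pot[i] -= G * mj / r
    return pot
```

Every term of the inner sum is independent of the others, which is what makes this computation easy to spread across pipelines and boards.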
As (4) shows, the accumulations of the potential acting on each particle by all other particles in the system are mutually independent. Several particle pairs' potentials can therefore be calculated simultaneously, so existing ASIC implementations, represented by GRAPE, place several potential calculation pipelines in one chip to find the best degree of parallelism. In Section 1, we analyzed the existing work and identified its bottleneck. Now we come to RP-ring.
3.1. RP-Ring Solution. Figure 3 illustrates a ring network consisting of n FPGA boards and a host computer. Each FPGA board has on-board memory and is connected to the previous/next FPGA board by cable. Each FPGA chip contains several pipelines, a DMA engine, a memory controller, and other modules, all controlled by a protocol controller. The potential pipelines have two input ports, one connected to the Input-FIFO and the other to the DMA-FIFO. To reuse the local particle information, we fix the data from the Input-FIFO, which is received from the previous board, and traverse the local particle information, as Figure 4 shows. In the following paragraphs, we explain RP-ring's control flow and data flow.
3.1.1. Control Flow. As shown in Figure 3, each FPGA board's control flow has the following features:
(1) Obtain results from the previous FPGA board and put the data into the Input-FIFO.
(2) The DMA engine reads local particle information from on-board memory and puts it into the DMA-FIFO.
(3) The pipelines read data from the Input-FIFO and DMA-FIFO, calculate the potential, and write the result into the Output-FIFO.
(4) Read data from the Output-FIFO and send it to the next FPGA board through the output connection.
3.1.2. Data Flow. Figure 5 shows the data flow of RP-ring. The information of the i-th particle consists of its location (x_i, y_i, z_i), mass m_i, and potential p_i, which is initialized to zero. When the i-th particle's information reaches board 0, that FPGA board calculates the interactions between the particle and local work set 0 and stores the intermediate result p_0 + 0 in the potential field. When the information reaches board 1, that board calculates the interactions with local work set 1 and stores the intermediate result p_1 + p_0 + 0 in the potential field. Thus, when the i-th particle's information has flowed through the whole ring and returned to the host computer, the potential field holds the final result \sum_{k=0}^{n-1} p_k. The host computer can then continue with follow-up operations.
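This data flow can be modeled functionally. The sketch below (illustrative Python, not cycle-accurate; the equal round-robin work-set split and G = 1 are assumptions of the example) checks that accumulating per-board partial results around the ring reproduces the full direct sum:

```python
import math

def pair_potential(p, q, G=1.0):
    # Contribution of particle q = (x, y, z, m) to the potential at p.
    r = math.dist(p[:3], q[:3])
    return -G * q[3] / r

def ring_potentials(particles, n_boards):
    """Flow every particle through the ring once, accumulating partials."""
    # Each board holds one local work set (equal round-robin split here).
    work_sets = [particles[b::n_boards] for b in range(n_boards)]
    results = []
    for p in particles:
        potential = 0.0  # the packet's potential field starts at zero
        for work_set in work_sets:  # board 0, board 1, ... around the ring
            potential += sum(pair_potential(p, q)
                             for q in work_set if q is not p)
        results.append(potential)
    return results
```

After one trip around the ring, each particle's potential field equals the sum over all other particles, regardless of how the working set was split.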
The RP-ring solution we propose in this paper avoids the problem mentioned in Section 1. It implements an extensible heterogeneous multi-FPGA solution. Unlike GRAPE's tree network, RP-ring uses a ring topology. Each FPGA board's on-board memory stores a portion of the particle information. During processing, each board receives particle information from the previous board, calculates the interactions with its local particle information, and sends the result to the next board in the ring. Once a particle's information has flowed through all boards in the ring, the calculation of its interactions with all other particles is finished. The advantages of this solution are as follows:
(1) In RP-ring, when the whole working set flows through the ring network once, the calculation of interaction is completed. The amount of data that needs to be transported between FPGA boards is reduced, and the demand for communication bandwidth is also reduced.
(2) The ring network topology is simpler than the tree network in GRAPE cluster. There is no need for additional network board.
(3) The interconnection protocol is quite simple. Implementing it incurs little overhead, whether in software or hardware, so we can devote more resources to the potential calculation pipelines.
3.2. Potential Pipeline. The potential pipeline is designed based on the Poisson equation solution (4). To give each stage of the pipeline a similar latency, (4) is rewritten as
\phi(x_i) = -G \sum_{j \ne i} \frac{m_j}{\sqrt{(x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2}}. (5)
Figure 6 shows the design of the potential pipeline. According to the complexity of the different operations, the addition, subtraction, and multiplication units are set to the same latency in cycles as the division and square-root units. In our design, x, y, z, and m are represented in IEEE 754 floating-point format.
3.3. System Optimization. As noted above, the ring network may let the slowest board drag down overall performance, so it is important to balance the time T_i a particle's information spends flowing through each board. Each board's processing capacity varies, but by decomposing the workload according to each board's capacity we can adjust T_i to improve whole-system performance efficiently (see "suitable work set i" in Figure 5). The following section discusses this problem in detail with a mathematical model.
The purpose of this mathematical model is, given multiple FPGA boards with known parameters, to determine how to decompose the workload among them and how to choose the parameters of their potential calculation pipelines so that the whole system's maximum throughput is obtained.
4.1. Symbol Conventions. Assume that the scale of simulation is N, and we have n FPGA boards.
N = \sum_{i=1}^{n} N_i. (6)
N_i is the workload assigned to the i-th FPGA board. It can be seen from the RP-ring architecture that each board's processing capacity depends on the number of potential calculation pipelines and their operating frequency. Suppose that the i-th FPGA board contains s_i pipelines operating at frequency f_i; then the performance P_i of the board satisfies
P_i \propto s_i \cdot f_i. (7)
To maximize the whole system's throughput, we need only allocate the workload among the FPGA boards in proportion to their processing capacity, so the problem becomes how to choose s_i and f_i such that P = \sum_i P_i is maximized. Additionally, f_i is a function of s_i; that is,
f_i = g(s_i). (8)
Therefore, in this model, s_i is the only free variable. Finally, the problem is rewritten as

\max_{s_1, \ldots, s_n} \sum_{i=1}^{n} s_i \, g(s_i). (9)
4.2. Constraint. Furthermore, there are three constraints in this model:
(1) FPGA logic resource constraint,
(2) memory access bandwidth constraint,
(3) communication bandwidth constraint.
FPGA logic resource constraint: in each FPGA, the combined logic resource consumption of the FIFOs, DMA, memory controller, input/output interconnection, and potential pipelines must not exceed the resources the FPGA provides. Suppose LUT^i, FF^i, BRAM^i, and DSP^i are the amounts of LUT, flip-flop, BRAM, and DSP slices that the i-th FPGA provides; we use the vector R^i = (LUT^i, FF^i, BRAM^i, DSP^i) to represent them, R^i_{FIFO} is the resource consumption of the i-th board's FIFO, and so on. Then we have
R^i \ge R^i_{FIFO} + R^i_{DMA} + R^i_{MemCtrl} + R^i_{IOCtrl} + R^i_{Pipeline}. (10)
Clearly, R^i_{FIFO}, R^i_{DMA}, R^i_{Pipeline}, R^i_{MemCtrl}, and R^i_{IOCtrl} depend on the data bandwidth the pipelines require and can be seen as functions of s_i. That is,
R^i \ge h^i_{FIFO}(s_i) + h^i_{DMA}(s_i) + h^i_{MemCtrl}(s_i) + h^i_{IOCtrl}(s_i) + h^i_{Pipeline}(s_i). (11)
Memory access bandwidth constraint: the input data of the potential pipelines come from the previous board's results and the local memory's particle information. We fix the data from the previous board and traverse the local particle information, so half of the potential pipelines' input bandwidth is borne by the memory access bandwidth. That is,
BW^i_{Mem\_max} \ge \frac{1}{2} BW^i_{Pipeline}. (12)
In (12), BW^i_{Mem\_max} is the maximum memory access bandwidth of the i-th board, and BW^i_{Pipeline} is the total input data bandwidth of all the potential pipelines on the i-th board. Clearly, BW^i_{Pipeline} is proportional to the number of potential pipelines, BW^i_{Pipeline} \propto s_i, so (12) can be written as
BW^i_{Mem\_max} \ge \frac{1}{2} BW^i_{Pipeline} = \frac{1}{2} k \, s_i, (13)
where k is a constant.
Communication bandwidth constraint: as described in the memory access bandwidth constraint, each particle received from the previous board is reused against all N_i local particles, so the i-th board's communication bandwidth to the previous and next board only needs to exceed 1/N_i of half of the pipelines' input bandwidth:

BW^i_{Comm\_max} \ge \frac{1}{2 N_i} BW^i_{Pipeline} = \frac{k \, s_i}{2 N_i}. (14)
In conclusion, given the target function and the three constraints, when the parameters of the boards and the required functional relations are known, solving this optimization problem tells us how to decompose the workload among the boards and how to choose the parameters of their potential calculation pipelines.
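As an illustration only, the optimization can be solved by brute force when the number of boards is small. In the sketch below every number (resource vectors, per-pipeline costs, the frequency model g, and the bandwidth constant k) is made up for the example; real values would come from synthesis reports and board datasheets. The communication constraint (14) is omitted here because it also involves the workload split N_i.

```python
from itertools import product

# Hypothetical per-board limits: resources (LUT, FF, BRAM, DSP) and memory bandwidth.
BOARDS = [
    {"R": (203800, 407600, 445, 840), "bw_mem": 12.8e9},
    {"R": (53200, 106400, 140, 220), "bw_mem": 8.5e9},
]
PIPE_COST = (6000, 9000, 1, 23)     # resources per pipeline (assumed)
FIXED_COST = (16000, 12000, 14, 0)  # FIFO + DMA + MemCtrl + IOCtrl (assumed)
K = 32e6                            # input bytes/s per pipeline, the constant k (assumed)

def g(s):
    # Toy frequency model f = g(s): more pipelines -> harder routing -> lower clock.
    return max(100e6, 350e6 - 2e6 * s)

def feasible(board, s):
    fits = all(fixed + s * cost <= cap  # constraint (11)
               for cap, fixed, cost in zip(board["R"], FIXED_COST, PIPE_COST))
    return fits and 0.5 * K * s <= board["bw_mem"]  # constraint (13)

def best_pipeline_counts(boards, s_max=64):
    """Choose s_i per board to maximize total throughput sum_i s_i * g(s_i)."""
    best = None
    for combo in product(range(1, s_max + 1), repeat=len(boards)):
        if all(feasible(b, s) for b, s in zip(boards, combo)):
            perf = sum(s * g(s) for s in combo)
            if best is None or perf > best[1]:
                best = (combo, perf)
    return best
```

The workload would then be split in proportion to each board's s_i * g(s_i), which balances the per-board time T_i.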
In this section, we demonstrate our implementation of the RP-ring solution and its performance in our MOND numerical simulation project.

Table 1 shows our existing boards and their parameters. The experiments demonstrate that, under the RP-ring solution, we can:
(1) based on the boards' feature, select software or hardware to implement the interconnection protocol,
(2) according to the boards' resource, choose different interconnection media.
Thus, this solution has good flexibility and scalability and is compatible with heterogeneous multi-FPGA.
In Table 1, Jetson-TK1 is an NVIDIA application processor board, which is used as the host computer. Zedboard, KC705, and XUPV5 are Xilinx evaluation kits for Zynq-7000, Kintex-7, and Virtex-5, respectively. Gemini-1 is an FPGA board of our own design, used for prototyping. Figure 7 shows the top-level structure of Gemini-1. Gemini-1 carries two XC6VLX365T Virtex-6 FPGAs, connected through PCB traces; each Virtex-6 FPGA chip also has SMA connectors for transporting data.
5.1. Topology. Figure 8 shows the connections between the boards and between the chips. In the ring network, the Tegra K1 is used as the host computer, and the XC7Z020, XC7K325T, XC5VLX110T, and XC6VLX365T are connected through Ethernet, SMA cable, or PCB trace. Figure 9 is a photograph of the real system.
5.2. Data Structure. The RP-ring particle information format is shown in Figure 10. The location, mass, and potential fields store the particle's three-dimensional coordinates, its mass, and the provisional potential result, while the Tag field records which FPGA boards the particle information has passed through. Once the information passes an FPGA board, the corresponding bit in the Tag field is set.
When all of the bits are set, the potential field contains the final result.
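A software view of this packet format might look like the sketch below (the field names, the dict representation, and the four-board ring size are illustrative; the real format in Figure 10 is a fixed bit layout):

```python
N_BOARDS = 4  # number of FPGA boards in the ring (example value)

def make_packet(x, y, z, m):
    # Location, mass, provisional potential, and one Tag bit per board.
    return {"pos": (x, y, z), "m": m, "potential": 0.0, "tag": 0}

def visit_board(packet, board_id, partial_potential):
    """Accumulate one board's partial result and set its bit in the Tag field."""
    packet["potential"] += partial_potential
    packet["tag"] |= 1 << board_id
    return packet

def is_final(packet):
    # All Tag bits set => the potential field holds the final result.
    return packet["tag"] == (1 << N_BOARDS) - 1
```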
5.3.1. Software Implementation. For an FPGA with an integrated CPU, like the Zynq-7000, the interconnection protocol can be implemented in software. Figure 11 shows the system architecture: multiple pipelines are instantiated in the FPGA; the two input ports of each pipeline are connected to the Input-FIFO and the DMA-FIFO; the output port of each pipeline is connected to the Output-FIFO. The CPU creates three processes: Input Process, Computing Process, and Output Process.
Input Process. Controls the Gigabit Ethernet to receive particle information from the previous board and stores it in the Input-FIFO. It also handles retransmission, Input-FIFO fullness, and so on.
Computing Process. Controls the DMA to traverse the local particle information in DDR3 and controls the pipelines to compute the potential from the particle information in the Input-FIFO and the local information in the DMA-FIFO, then writes the result to the Output-FIFO.
Output Process. Controls the Gigabit Ethernet to send the results in the Output-FIFO to the next board. It also handles retransmission, the next board's Input-FIFO fullness, and so on.
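The three processes can be prototyped with ordinary threads and bounded queues. In the sketch below (illustrative Python; all names and sizes are assumptions) the queues stand in for the FIFOs, their blocking behavior stands in for the real "FIFO full"/retransmission handling, and plain lists stand in for the Ethernet links:

```python
import math
import queue
import threading

input_fifo = queue.Queue(maxsize=64)   # stands in for Input-FIFO
output_fifo = queue.Queue(maxsize=64)  # stands in for Output-FIFO
local_particles = [(float(i), 0.0, 0.0, 1.0) for i in range(8)]  # "DDR3" contents
SENTINEL = None  # end-of-stream marker for the sketch

def input_process(incoming):
    # Receive particle packets from the previous board (here: a plain list).
    for pkt in incoming:
        input_fifo.put(pkt)  # blocks when full, like back-pressure
    input_fifo.put(SENTINEL)

def computing_process():
    # Fix the visiting packet, traverse local particles, accumulate potential.
    while (pkt := input_fifo.get()) is not SENTINEL:
        pos, m, pot, tag = pkt
        for qx, qy, qz, qm in local_particles:
            r = math.dist(pos, (qx, qy, qz))
            if r > 0.0:
                pot -= qm / r  # G = 1 in this sketch
        output_fifo.put((pos, m, pot, tag))
    output_fifo.put(SENTINEL)

def output_process(next_board):
    # Send finished packets to the next board (here: append to a list).
    while (pkt := output_fifo.get()) is not SENTINEL:
        next_board.append(pkt)
```

Running the three functions as concurrent threads mirrors the pipelined overlap of receiving, computing, and sending that the real implementation achieves.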
5.3.2. Hardware Implementation. For an FPGA without an integrated CPU, like the XC7K325T, the interconnection protocol can be implemented in hardware, as Figure 12 shows. Because the interconnection medium can be Ethernet cable, SMA cable, or PCB trace, the input/output connection can be an Ethernet controller, an SMA controller, or a SelectIO controller. In Figure 12, a protocol FSM takes the role of the CPU: it controls the input connection to receive data from the previous board and store it in the Input-FIFO, controls the DMA to traverse the local data and the pipelines to finish the calculation, and controls the output connection to send the results in the Output-FIFO to the next board.
6. Experimental Results
6.1. Logical Resource Consumption. Table 2 shows the resource utilization of each FPGA board. In particular, the Zynq FPGA has a hard DDR3 controller and a hard Gigabit Ethernet controller, and the Virtex-5 FPGA has a hard DDR2 controller and MAC, so these modules require no extra logic resources. Virtex-5 has no existing DMA IP, so we designed a simplified version.
6.2. Communication Bandwidth Consumption. To measure the communication bandwidth consumption between boards, we added counters that record the data traffic on the interconnections. Figure 13 shows the counters in the system and the interconnection notation. The counters increment as packets pass through each path. At the end of the whole experiment, we read out the value of each counter and divide it by the calculation time. In this way, we confirm that no data is lost in communication and obtain the specific throughput of each FPGA board.
Table 3 shows the communication bandwidth consumption while calculating 131,072 particles' potentials, which takes 1297.113 ms. Because of the solution's ring topology, the communication bandwidth consumption is very low.
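The measured-bandwidth column in Table 3 is simply the recorded traffic divided by the calculation time, which is easy to reproduce:

```python
# Reproduce Table 3's measured bandwidth: counter traffic / calculation time.
rows = {  # interconnection: (data traffic in MB, time in ms), from Table 3
    "Ethernet 0": (15.733, 1297.2),
    "SMA 0": (2.911, 1297.2),
    "PCB trace": (2.903, 1297.2),
}
for name, (traffic_mb, time_ms) in rows.items():
    bandwidth = traffic_mb / (time_ms / 1000.0)  # MB/s
    print(f"{name}: {bandwidth:.3f} MB/s")
```

All three agree with the table (12.128, 2.244, and 2.238 MB/s), each far below the theoretical bandwidth of its link.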
6.3. Performance Comparison. Table 4 shows the number of potential pipelines in each FPGA board, their operating frequencies, and their performance. The conversion between GFlops and MPair/s is based on Atsushi Kawai's work. From these results, we calculate the whole system's theoretical parameters. Finally, we list the system's experimental result.
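The GFlops and MPair/s columns of Table 4 are linked by a fixed operation count per pairwise interaction; the ratio implied by the table is roughly 38 floating-point operations per pair, consistent with the counting convention in Kawai's work (treat the exact count as an assumption of this sketch):

```python
FLOPS_PER_PAIR = 38  # operations per interaction implied by Table 4 (assumed)

def gflops(mpairs_per_second):
    """Convert a pair rate in MPair/s to GFlops under the assumed op count."""
    return mpairs_per_second * FLOPS_PER_PAIR / 1000.0
```

For example, the KC705's 5977.8 MPair/s corresponds to about 227.2 GFlops, matching the table.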
6.3.1. Comparison with Software. We choose CPU and GPU solutions as the control groups for our work. Fabian's RAMSES code is a widely used method for MOND simulation, so we take it as the reference software implementation and use an Intel Xeon E5-2660 to run tests at different scales. Based on RAMSES's QUMOND method and Nitin's direct N-body kernels, we simulate the same work set on NVIDIA's Tesla K80 GPU. Figure 14 shows the time taken to calculate the potential on each platform. Thanks to the FPGA's structure, RP-ring is faster than the other solutions at small scales. At the right end of the plot, the NVIDIA results overtake RP-ring because, as the number of particles increases, our platform approaches its theoretical maximum performance (503.5 of 620.1 GFlops) while the Tesla K80 has not yet reached its limits. The CPU solution takes far longer than RP-ring and the GPU, limited by its poor parallel computing capability. When simulating a system with 131,072 particles, our work is 193 times faster than the Xeon E5-2660 CPU and achieves performance similar to the Tesla K80.
6.3.2. Comparison with Hardware. Table 5 shows a number of hardware implementations of N-body simulation. Makino's GRAPE-4 cluster uses 1692 pipelines and achieves 1.08 TFlops, and a compute cluster built with GRAPE-6 chips reaches a peak performance of 1.349 TFlops. Furthermore, GRAPE-8 achieves 960 GFlops with just one board. One reference showed a GPU cluster solution with a performance of 781 GFlops.
Some FPGA solutions are also listed in Table 5. Lienhart et al. used FPGAs to achieve 3.9 GFlops, Spurzem et al.'s solution reaches 4.3 GFlops, Hamada et al.'s work reaches 324.2 GFlops with a board carrying 5 FPGA chips, and Sozzo et al.'s work delivers 46.55 GFlops.
7. Conclusion and Discussion
In this paper, we proposed an extensible solution, RP-ring, for heterogeneous multi-FPGA-based direct-summation N-body simulation, together with a model for decomposing the workload among the FPGAs. RP-ring reuses existing FPGA boards rather than requiring newly designed specialized boards, which reduces cost. The solution can be expanded conveniently with any heterogeneous FPGA boards, and its communication bandwidth requirement is quite low, so the communication protocol can be kept simple and consume few resources. The model considers the constraints of FPGA logic resources, memory access bandwidth, and communication bandwidth to divide the workload reasonably and optimize whole-system performance. We also built a heterogeneous multi-FPGA system based on RP-ring and used it for MOND numerical simulation. The experimental results show that this low-cost multi-FPGA system is 193 times faster than a high-end CPU implementation and achieves performance similar to a high-performance GPU.
An earlier version of this work was presented as a poster at 2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM).
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This research was sponsored by the Huawei Innovation Research Program (YB2015090102); support from Huawei Technologies Co., Ltd., is gratefully acknowledged.
References
[1] S. Bhatt, M. Chen, C.-Y. Lin et al., "Abstractions for parallel N-body simulations," in Proceedings of the Scalable High Performance Computing Conference (SHPCC-92), pp. 38-45, IEEE, 1992.
[2] T. Hamada, T. Narumi, R. Yokota, K. Yasuoka, K. Nitadori, and M. Taiji, "42 TFlops hierarchical N-body simulations on GPUs with applications in both astrophysics and turbulence," in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC '09), New York, NY, USA, November 2009.
[3] S. Harfst, A. Gualandris, D. Merritt, R. Spurzem, S. P. Zwart, and P. Berczik, "Performance analysis of direct N-body algorithms on special-purpose supercomputers," New Astronomy, vol. 12, no. 5, pp. 357-377, 2007.
[4] T. Hamada, K. Benkrid, K. Nitadori, and M. Taiji, "A comparative study on ASIC, FPGAs, GPUs and general purpose processors in the O(N^2) gravitational N-body simulation," in Proceedings of the NASA/ESA Conference on Adaptive Hardware and Systems (AHS '09), pp. 447-452, San Francisco, Calif, USA, July 2009.
[5] N. Arora, A. Shringarpure, and R. W. Vuduc, "Direct n-body kernels for multicore platforms," in Proceedings of the 38th International Conference on Parallel Processing (ICPP 2009), pp. 379-387, Austria, September 2009.
[6] I. Zecena, M. Burtscher, T. Jin, and Z. Zong, "Evaluating the performance and energy efficiency of n-body codes on multicore CPUs and GPUs," in Proceedings of the 2013 IEEE 32nd International Performance Computing and Communications Conference (IPCCC 2013), USA, December 2013.
[7] J. Makino, M. Taiji, T. Ebisuzaki, and D. Sugimoto, "GRAPE-4: a massively parallel special-purpose computer for collisional N-body simulations," The Astrophysical Journal, vol. 480, no. 1, pp. 432-446, 1997.
[8] J. Makino and H. Daisaka, "GRAPE-8: an accelerator for gravitational N-body simulation with 20.5 GFlops/W performance," in Proceedings of the 24th International Conference for High Performance Computing, Networking, Storage and Analysis (SC '12), Salt Lake City, Utah, USA, November 2012.
[9] J. Makino, T. Fukushige, and M. Koga, "A 1.349 Tflops simulation of black holes in a galactic center on GRAPE-6," in Proceedings of the 2000 ACM/IEEE Conference on Supercomputing, pp. 43-43, Dallas, Tex, USA, November 2000.
[10] B. Famaey and S. S. McGaugh, "Modified Newtonian dynamics (MOND): observational phenomenology and relativistic extensions," Living Reviews in Relativity, vol. 15, article 10, 2012.
[11] F. Lughausen, B. Famaey, and P. Kroupa, "Phantom of RAMSES (POR): a new Milgromian dynamics N-body code," Canadian Journal of Physics, vol. 93, no. 2, pp. 232-241, 2014.
[12] G. Lienhart, A. Kugel, and R. Manner, "Using floating-point arithmetic on FPGAs to accelerate scientific N-body simulations," in Proceedings of the 10th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM 2002), pp. 182-194, USA, April 2002.
[13] R. Spurzem, P. Berczik, G. Marcus et al., "Accelerating astrophysical particle simulations with programmable hardware (FPGA and GPU)," Computer Science - Research and Development, vol. 23, no. 3-4, pp. 231-239, 2009.
[14] A. Castello, R. Mayo, and J. Planas, "Exploiting task-parallelism on GPU clusters via OmpSs and rCUDA virtualization," in Proceedings of the IEEE Trustcom/BigDataSE/ISPA, pp. 160-165, 2015.
[15] E. D. Sozzo, L. D. Tucci, and M. D. Santambrogio, "A highly scalable and efficient parallel design of N-body simulation on FPGA," in Proceedings of the 31st IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW 2017), pp. 241-246, USA, June 2017.
[16] A. Kawai and T. Fukushige, "$158/GFLOPS astrophysical N-body simulation with reconfigurable add-in card and hierarchical tree algorithm," in Proceedings of the ACM/IEEE Conference on Supercomputing (SC '06), 2006.
Shuaizhi Guo, Tianqi Wang, Linfeng Tao, Teng Tian, Zikun Xiang, and Xi Jin
University of Science and Technology of China, Hefei, China
Correspondence should be addressed to Xi Jin; firstname.lastname@example.org
Received 21 August 2017; Revised 5 November 2017; Accepted 17 January 2018; Published 4 April 2018
Academic Editor: Michael Hubner
Caption: Figure 1: Basic idea of GRAPE and its top-level structure.
Caption: Figure 2: GRAPE computing cluster.
Caption: Figure 3: The architecture of RP-ring.
Caption: Figure 4: The input of potential pipeline.
Caption: Figure 5: The data flow of RP-ring.
Caption: Figure 6: Potential pipeline's detail.
Caption: Figure 7: Top-level view of Gemini-1.
Caption: Figure 8: The experiment's connection.
Caption: Figure 9: Boards.
Caption: Figure 10: The data structure of particle information.
Caption: Figure 11: Protocol's software implementation.
Caption: Figure 12: Protocol's hardware implementation.
Caption: Figure 13: Counters' location and interconnection's notation.
Caption: Figure 14: Software performance comparison.
Table 1: The parameters of the boards.

Board                      | Main chip     | Hard CPU             | Logic cells  | Flip-flops   | Block RAM      | DSP slices | Memory access bandwidth (theoretical) | Connection
Zedboard                   | XC7Z020       | Cortex-A9 dual core  | 85,000       | 106,400      | 560 Kb         | 220        | 8.5 GB/s       | Ethernet
KC705                      | XC7K325T      | No                   | 326,080      | 407,600      | 4000 Kb        | 840        | 12.8 GB/s      | Ethernet and SMA
XUPV5                      | XC5VLX110T    | No                   | 46,080       | 28,800       | 1728 Kb        | 48         | 3.2 GB/s       | Ethernet and SMA
Gemini-1                   | XC6VLX365T x2 | No                   | 364,032 each | 455,040 each | 14,976 Kb each | 576 each   | 12.8 GB/s each | SMA
Jetson-TK1 (host computer) | Tegra K1      | Cortex-A15 quad core | --           | --           | --             | --         | --             | Ethernet

(i) UTP: Ethernet; (ii) SMA: subminiature version A.

Table 2: Resource consumption.

Gemini-1 | LUT    | REG    | BRAM | DSP
FIFO     | 265    | 129    | 4    | 0
DMA      | 2753   | 4078   | 8    | 0
IN/OUT   | 450    | 530    | 0    | 0
MEM      | 5889   | 5882   | 0    | 0
PIPELINE | 209493 | 301565 | 5    | 480
TOTAL    | 227520 | 455040 | 416  | 576

KC705    | LUT    | REG    | BRAM | DSP
FIFO     | 265    | 112    | 4    | 0
DMA      | 1322   | 1680   | 8    | 0
IN/OUT   | 1479   | 2082   | 0    | 0
MEM      | 14016  | 9019   | 2    | 0
PIPELINE | 185509 | 273149 | 5    | 736
TOTAL    | 203800 | 407600 | 445  | 840

Zedboard | LUT    | REG    | BRAM | DSP
FIFO     | 265    | 112    | 4    | 0
DMA      | 0      | 0      | 0    | 0
IN/OUT   | 0      | 0      | 0    | 0
MEM      | 0      | 0      | 0    | 0
PIPELINE | 45014  | 80403  | 2    | 171
TOTAL    | 53200  | 106400 | 140  | 220

XUPV5    | LUT    | REG    | BRAM | DSP
FIFO     | 47     | 57     | 4    | 0
DMA      | 777    | 516    | 0    | 0
IN/OUT   | 2226   | 2307   | 0    | 0
MEM      | 0      | 0      | 0    | 0
PIPELINE | 10552  | 15945  | 0    | 64
TOTAL    | 69120  | 69120  | 128  | 64

Table 3: Communication bandwidth consumption.

Interconnection | Data traffic (MB) | Time (ms) | Bandwidth measured (MB/s) | Bandwidth theory (MB/s)
Ethernet 0      | 15.733            | 1297.2    | 12.128                    | 1000
Ethernet 1      | 15.641            | 1297.2    | 12.057                    | 1000
Ethernet 2      | 15.847            | 1297.2    | 12.216                    | 1000
SMA 0           | 2.911             | 1297.2    | 2.244                     | 3125
SMA 1           | 2.926             | 1297.2    | 2.256                     | 3125
PCB trace       | 2.903             | 1297.2    | 2.238                     | 6400

Table 4: The number of potential pipelines.

Board              | Pipelines | Frequency | GFlops    | MPair/s
Zedboard           | 8         | 200 MHz   | 32.9      | 866.6
KC705              | 32        | 344.9 MHz | 227.2     | 5977.8
XUPV5              | 2         | 224.3 MHz | 9.234     | 242.9
Gemini-1           | 32 x 2    | 266.1 MHz | 175.3 x 2 | 4612.0 x 2
Total (theory)     | 106       | --        | 620.1     | 16311.3
Total (experiment) | 106       | --        | 503.5     | 13244.7

Table 5: Hardware performance comparison.

Implementation           | Main chip | GFlops
GRAPE-4 cluster          | ASIC      | 1080
GRAPE-6 cluster          | ASIC      | 1349
GRAPE-8 board            | ASIC      | 960
Lienhart et al.'s work   | FPGA      | 3.9
Spurzem et al.'s work    | FPGA      | 4.3
Hamada et al.'s Bioler-3 | FPGA      | 324.2
GPU cluster              | GPU       | 781
Sozzo et al.'s work      | FPGA      | 46.55
Our work                 | FPGA      | 503.5
Publication: International Journal of Reconfigurable Computing, 2018.