Utilization of graphics processing unit in sound source localization based on digital beamforming.
Key words: beamforming, GPGPU, OpenCL, FFT, parallel computing
1. INTRODUCTION
The first general-purpose computing on graphics processing units (GPGPU) applications were developed in 2006, when a new generation of graphics cores was introduced with fully programmable shader processors (shader model 3.0) able to work with floating-point numbers. At the beginning there were no standardized programming languages for parallel computing on GPUs, so these applications used the OpenGL or DirectX application programming interfaces, which are optimized mainly for 3D graphics processing. Both major GPU manufacturers recognized this problem and independently developed their own parallel computing technologies: CUDA by nVidia and ATI Stream by ATI (AMD). CUDA is compatible with graphics cards based on nVidia G80 and later cores; ATI Stream can be used with R580 and later cores. The drawback of this approach is that the resulting GPGPU applications are incompatible: an application developed with CUDA cannot run on AMD/ATI graphics cards and vice versa. Application programmers must therefore develop two versions of the same application if they want compatibility with both major GPU manufacturers. This is very time consuming, so many available applications can use GPGPU on the hardware of only one manufacturer, which potentially slowed down the development of GPGPU applications.
A solution came in 2008, when the Khronos Group consortium, whose members include AMD, nVidia, IBM and Intel, released the Open Computing Language (OpenCL). It is a framework for creating parallel computing applications for heterogeneous parallel processing platforms consisting of multi-core CPUs, GPUs and other specialized processors such as digital signal processors (DSP). Graphics accelerators supporting DirectX 10 and later can also use the DirectCompute API for general-purpose computing, but this functionality is available only in Microsoft Windows Vista and Windows 7, because Windows XP supports only the DirectX 9 API.
This paper proposes the use of current graphics processing units for floating-point-intensive tasks, in this case an acoustic source localization system based on digital beamforming. Such a system can be used in many practical applications, such as security, teleconferencing and robotic systems, and wherever information is encoded in the position of the audio source. Owing to the operating principle of the beamformer, its localization accuracy increases with the number of microphone units in the array. On the other hand, a large amount of digital sound data must then be processed in a short time. Today's high-end multi-core x86 processors used in personal computers can run up to 12 threads simultaneously and reach floating-point performance close to 100 GFLOPS, while graphics processing units can process hundreds of threads simultaneously and theoretically reach up to 2700 GFLOPS in single precision. It can therefore be very effective to move all computation-intensive tasks to the GPU and free CPU resources for other tasks related to data acquisition, visualization and archiving.
2. LOCALIZATION SYSTEM OVERVIEW
The structure of the localization system is shown in Fig. 1. It consists of a microphone array with 15 microphone units. Their analog output signals are amplified in preamplifiers to a level suitable for further processing. Each signal then passes through a 4th-order analog low-pass Bessel antialiasing filter, followed by a 16-channel external data acquisition unit, the Advantech USB-4716. After analog-to-digital conversion the data is transferred over USB to the evaluation system, a standard personal computer equipped with an AMD Athlon 64 X2 6000+ processor, 4 GiB of RAM and an AMD Radeon HD5830 graphics accelerator (Dostalek et al., 2009).
The software of the evaluation system first stores the acquired digital audio data in double-buffered internal memory buffers. The signal of each audio channel is then upsampled 64 times and filtered by a band-pass finite impulse response (FIR) filter. Processing continues in the delay-and-sum beamformer, whose weights are set successively for all examined angles of possible sound source azimuth. The angle at which the RMS value of the beamformer output reaches its maximum corresponds to the sound source angle. To maximize computation performance, the FIR filters are implemented using the Fast Fourier Transform (FFT). The FFT computation is performed twice per audio channel on the GPU device to achieve better performance.
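The angle scan described above can be illustrated by a minimal Python sketch. It is not the authors' implementation: it assumes a uniform linear array, a far-field plane wave, integer sample delays in place of general beamformer weights, and a speed of sound of 343 m/s. The function names and parameters (`steering_delays`, `delay_and_sum_rms`, `localize`, `spacing_m`) are hypothetical and do not come from the paper.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value (not given in the paper)


def steering_delays(num_mics, spacing_m, angle_deg, sample_rate_hz):
    """Integer sample delays aligning a plane wave from angle_deg.

    Assumes a uniform linear array; the delays are shifted so that the
    smallest one is zero and all are usable as non-negative offsets.
    """
    theta = math.radians(angle_deg)
    delays = [
        round(m * spacing_m * math.sin(theta) / SPEED_OF_SOUND * sample_rate_hz)
        for m in range(num_mics)
    ]
    base = min(delays)
    return [d - base for d in delays]


def delay_and_sum_rms(channels, delays):
    """RMS of the averaged channels after each is advanced by its delay."""
    length = len(channels[0]) - max(delays)
    total = 0.0
    for n in range(length):
        s = sum(ch[n + d] for ch, d in zip(channels, delays)) / len(channels)
        total += s * s
    return math.sqrt(total / length)


def localize(channels, spacing_m, sample_rate_hz, angles_deg):
    """Return the examined angle with the largest beamformer output RMS."""
    best_angle, best_rms = None, -1.0
    for angle in angles_deg:
        delays = steering_delays(len(channels), spacing_m, angle, sample_rate_hz)
        rms = delay_and_sum_rms(channels, delays)
        if rms > best_rms:
            best_angle, best_rms = angle, rms
    return best_angle
```

In this sketch a high sample rate, such as the one obtained by upsampling, gives finer delay steps and therefore a finer angular grid, because each delay is rounded to a whole number of samples.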
[FIGURE 1 OMITTED]
3. SOFTWARE IMPLEMENTATION
The software application for sound source localization using beamforming was created in Microsoft Visual Studio 2008 as a Win32 application. The following sections focus on the computational layer, which is the most important part of the software in terms of processing speed.
3.1 OpenCL brief overview
OpenCL is a framework for creating parallel computing applications for heterogeneous parallel processing platforms consisting of multi-core CPUs, GPUs and other specialized processors. It allows the programmer to write GPGPU applications that are portable between supported compute devices. The platform model consists of a host connected to one or more compute devices. Each compute device is divided into compute units, which consist of processing elements. The number of available compute units depends on the compute device hardware actually used. An OpenCL program consists of two main parts: the host program, which is executed on the host, and kernels, which are executed on the compute devices. A kernel instance is called a work-item and is uniquely identified by its global ID. Work-items are organized into a specified number of work-groups. Complete information about OpenCL functionality can be found in (***, 2010a), (***, 2010b) and (***, 2011).
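The relation between global IDs and work-groups can be illustrated by a small host-side simulation in Python. It is only a sketch of a one-dimensional index space with zero offset, not real OpenCL code: the helper names (`enumerate_work_items`, `run_kernel`, `scale_kernel`) are hypothetical, and the work-items run sequentially here, whereas a compute device runs them in parallel.

```python
def enumerate_work_items(global_size, local_size):
    """List (global_id, group_id, local_id) for a 1-D index space.

    Mirrors the OpenCL 1.x rule that the global size must be a multiple
    of the work-group size.
    """
    if global_size % local_size != 0:
        raise ValueError("global size must be a multiple of the work-group size")
    items = []
    for group_id in range(global_size // local_size):
        for local_id in range(local_size):
            global_id = group_id * local_size + local_id
            items.append((global_id, group_id, local_id))
    return items


def run_kernel(kernel, global_size, local_size, *buffers):
    """Call the kernel once per work-item (sequentially, for illustration)."""
    for global_id, _group_id, _local_id in enumerate_work_items(global_size, local_size):
        kernel(global_id, *buffers)


def scale_kernel(global_id, src, dst):
    """Example kernel body: each work-item processes one element."""
    dst[global_id] = 2.0 * src[global_id]


if __name__ == "__main__":
    src = [1.0, 2.0, 3.0, 4.0]
    dst = [0.0] * len(src)
    run_kernel(scale_kernel, len(src), 2, src, dst)
    print(dst)
```

Each work-item uses only its global ID to select the data it processes, which is what makes the same kernel independent of the number of compute units on the device.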
3.2 FFT kernel
The FFT is a typical data-parallel task well suited to computation on a GPU device. Because of the different hardware architecture, a kernel for FFT computation on a GPU device must be written differently from a single-threaded CPU implementation. Kernel performance is very sensitive to how each thread reads data from the memory buffers and how it writes the results after processing. The best memory performance is achieved when all read and write accesses are coalesced. A further performance improvement can be achieved by vectorization, which enables full utilization of the ALUs.
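The effect of coalescing can be illustrated with a simplified Python model that counts how many aligned memory segments a group of work-items touches in one access. The 128-byte segment size, the 4-byte element size and the helper name `segments_touched` are assumptions made for this illustration; real hardware behaviour is more involved and is described in the vendor programming guides.

```python
WAVEFRONT = 64  # work-items issuing an access together (AMD wavefront size)


def segments_touched(element_indices, element_bytes=4, segment_bytes=128):
    """Count distinct aligned memory segments hit by the given accesses.

    Simplified model: fewer segments means fewer memory transactions.
    Segment and element sizes are assumed values.
    """
    return len({(i * element_bytes) // segment_bytes for i in element_indices})


# Coalesced pattern: work-item i reads element i (neighbours are adjacent).
coalesced = [wi for wi in range(WAVEFRONT)]

# Strided pattern: work-item i reads element 32 * i (neighbours far apart).
strided = [wi * 32 for wi in range(WAVEFRONT)]

if __name__ == "__main__":
    print(segments_touched(coalesced))  # 2 segments under this model
    print(segments_touched(strided))    # 64 segments under this model
```

Under this model the strided pattern needs one segment per work-item, which is why the data layout and indexing of each FFT stage matter as much as the arithmetic.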
The implementation of the decimation-in-time radix-2 FFT is split into three independent kernels:
* fft_ld_r_sl: optimized computation of the first FFT stage.
* fft_ld_r: in-place computation of FFT stages 2 to n.
* twiddle: pre-computation of twiddle factors for the FFT kernels.
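The three-part split listed above can be mirrored in a sequential Python sketch of the decimation-in-time radix-2 FFT. This is an illustration of the algorithm only, not the authors' OpenCL kernels: the function names are hypothetical, the bit-reversal permutation is an assumed preprocessing step, and the loops that a GPU would distribute over work-items run here one after another.

```python
import cmath


def precompute_twiddles(n):
    """Twiddle factors W_n^k = exp(-2*pi*i*k/n) for k = 0 .. n/2 - 1."""
    return [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]


def bit_reverse_permute(data):
    """Return a copy of data reordered by bit-reversed index."""
    n = len(data)
    out = list(data)
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            out[i], out[j] = out[j], out[i]
    return out


def first_stage(data):
    """Stage 1: butterflies on adjacent pairs; the twiddle factor is 1,
    so no complex multiplication is needed."""
    for i in range(0, len(data), 2):
        a, b = data[i], data[i + 1]
        data[i], data[i + 1] = a + b, a - b


def later_stages(data, twiddles):
    """Stages 2 to n, computed in place with the precomputed twiddles."""
    n = len(data)
    half = 2
    while half < n:
        stride = n // (2 * half)  # step through the twiddle table
        for start in range(0, n, 2 * half):
            for k in range(half):
                a = data[start + k]
                b = data[start + k + half] * twiddles[k * stride]
                data[start + k] = a + b
                data[start + k + half] = a - b
        half *= 2


def fft_radix2(samples):
    """Forward FFT of a sequence whose length is a power of two (>= 2)."""
    n = len(samples)
    if n < 2 or n & (n - 1):
        raise ValueError("length must be a power of two and at least 2")
    data = bit_reverse_permute([complex(x) for x in samples])
    first_stage(data)
    later_stages(data, precompute_twiddles(n))
    return data
```

Treating the first stage separately avoids all twiddle multiplications in that stage, and computing the twiddle table once allows it to be reused for every channel with the same vector length.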
4. VERIFICATION AND RESULTS
The previous version of the software for audio source localization using digital beamforming, which used the FFTW library for FFT computations on the CPU, was modified to support FFT computations on GPU devices through the industry-standard OpenCL API. The new implementation can run FFT computations concurrently on the CPU, using the FFTW library version 3.2.2 by Matteo Frigo and Steven G. Johnson (Frigo & Johnson, 2006), and on the GPU device, using the FFT computation kernels. The program was verified on selected AMD Radeon graphics cards with different GPU hardware properties, listed in Tab. 1, where the "CU" column gives the number of compute units, "LM size" the local memory size in KiB and "GM size" the global memory size in MiB. The results of the FFT computation benchmark for different input data vector lengths on the GPUs, compared with FFTW performance on the Athlon 64 X2 6000+ (3 GHz), are shown in Fig. 2. The computation time on GPU devices includes data transfers from host memory to GPU global memory and back.
[FIGURE 2 OMITTED]
5. CONCLUSION
The paper presents a way to use modern graphics processing units for tasks not related to graphics, in this case digital signal processing. In our application the GPU was used in an audio source localization system based on digital beamforming, where it performs, together with the CPU, computationally intensive FFT calculations with large input data vectors. The results indicate that with the AMD Radeon HD5830 graphics accelerator card the time needed to process one localization event is reduced to half of that of the previous implementation, in which only the CPU was used for the calculation. The measured performance of the FFT kernel is lower than expected, so future research will focus on its optimization.
6. ACKNOWLEDGEMENTS
This work was supported by the Ministry of Education, Youth and Sports of the Czech Republic under the Research Plan No. MSM 7088352102 and by the European Regional Development Fund under the project CEBIA-Tech No. CZ.1.05/2.1.00/03.0089. This support is very gratefully acknowledged.
7. REFERENCES
Dostalek, P.; Dolinay, J. & Vasek, V. (2009). Utilization of Beamforming in Sound Source Localization Applications, Proceedings of the 20th International DAAAM Symposium, ISSN 1726-9679, ISBN 978-3-901509-70-4, Katalinic, B. (Ed.), pp. 1651-1652, DAAAM International, Vienna
Frigo, M. & Johnson, S. G. (2006). FFTW3 manual for version 3.2.2. Available from: http://www.fftw.org/#documentation Accessed: 2011-02-05
*** (2011) http://www.amd.com, Advanced Micro Devices, AMD Accelerated Parallel Processing Programming Guide, Accessed on: 2011-02-10
*** (2010a) http://www.khronos.org, Khronos Group, The OpenCL Specification Version: 1.1 Document Revision: 36, Accessed on: 2011-02-15
*** (2010b) http://www.nvidia.com, nVidia Corporation, OpenCL Programming Guide for the CUDA Architecture Version 3.2, Accessed on: 2011-02-15
Tab. 1. Selected GPU device properties compared to the CPU

GPU/CPU     Clock      CU   LM size   GM size
HD5770      850 MHz    10   32 KiB    512 MiB
HD5830      825 MHz    14   32 KiB    800 MiB
HD6870      900 MHz    14   32 KiB    800 MiB
Athlon 64   3000 MHz   2    32 KiB    2048 MiB
Author: Dostalek, Petr; Dolinay, Jan; Vasek, Vladimir
Publication: Annals of DAAAM & Proceedings
Date: Jan 1, 2011