
Indexing and retrieval of speech using perceptual linear prediction and sonogram.


The information of interest is in the form of speech from various sources. However, because information is difficult to locate in large audio archives, speech has not been valued as an archival source. The technologies incorporated in this system, and described in this paper, include speaker-independent continuous speech indexing and retrieval (John Makhoul, 2000). Speech search has received little attention because large collections of untranscribed spoken material have not been available, mostly owing to storage constraints. As storage becomes cheaper, the availability and usefulness of large collections of spoken documents is limited strictly by the lack of adequate technology to exploit them. Manually transcribing speech is expensive and sometimes outright impossible because of privacy concerns. This leads us to explore an automatic approach to searching and navigating spoken document collections (Ciprian, 2005). The growing importance of speech and multimedia data in society has necessitated the development of technologies that can index and search these media effectively (Kishan, 2005).

In this work, the temporal envelope of the signal, obtained from its RMS energy, is used to segregate individual words from continuous speech by a voice activity detection method.

1. Voice Activity Detection:

Voice Activity Detection (VAD) is used to obtain reliable speech/non-speech decisions. VAD is used in a variety of speech communication systems such as speech coding, speech recognition, hands-free telephony, audio conferencing, speech enhancement and echo cancellation (Ganesh babu, 2009). It identifies where the speech is voiced, unvoiced or sustained (Ramirez, 2007). It is the process of discriminating speech from silence or other background noise. This information helps to deactivate processing during the non-speech segments of a speech signal (Ananthi, 2014), which makes subsequent speech processing smoother. Isolated words in a speech audio clip are located using the long pauses in a dialog, as shown in Fig. 2. The VAD method assumes that speech and noise are both point sources (Ji Hun park, 2013).

In this work, the temporal envelope obtained from the RMS energy of the signal is used to separate individual words from long speeches (Venkatesha, 2002).

The RMS energy is computed over a sliding window, as shown in Eqn. 1, where l is the window length. In this work, a threshold of 0.4 is applied to the energy of each window.

$$\mathrm{RMS} = \sqrt{x^{2} \otimes 1} \qquad (1)$$

Finally, that RMS energy is used to extract the separate words in the original speech.
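The word segmentation described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes that the operator in Eqn. 1 is a moving average of the squared signal over l samples, and that the 0.4 threshold is relative to the peak of the envelope. The function names `rms_envelope` and `segment_words` are hypothetical.

```python
import numpy as np

def rms_envelope(x, win_len):
    """Sliding-window RMS: square root of the moving average of x**2."""
    kernel = np.ones(win_len) / win_len
    return np.sqrt(np.convolve(x ** 2, kernel, mode="same"))

def segment_words(x, win_len, threshold=0.4):
    """Return (start, end) sample indices of runs whose envelope exceeds the threshold."""
    env = rms_envelope(x, win_len)
    peak = env.max()
    if peak > 0:
        env = env / peak  # assumption: the threshold is relative to the envelope peak
    active = env > threshold
    # Pad with False so every active run has both a rising and a falling edge.
    padded = np.concatenate(([False], active, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    # Even-indexed edges are run starts, odd-indexed edges are (exclusive) run ends.
    return [(int(s), int(e)) for s, e in zip(edges[0::2], edges[1::2])]
```

Each returned pair marks one candidate word; in practice very short runs would probably also need to be discarded or merged, which this sketch does not attempt.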

2. Acoustic Feature Extraction:

The purpose of feature extraction is to compress the speech signal into a vector that represents the meaningful information in the speech it is trying to characterize. In this work, two acoustic features, namely Perceptual Linear Prediction (PLP) and Sonogram features, are extracted.

Perceptual Linear Prediction (PLP):

The Perceptual Linear Prediction (PLP) model was developed by Hermansky. PLP models human speech based on the psychophysics of hearing (Peter, 2012). PLP discards irrelevant information in the speech and thus improves the speech recognition rate. PLP approximates three main perceptual aspects: the critical-band resolution curves, the equal-loudness curve, and the intensity-loudness power-law relation, which is known as the cubic-root law.

The detailed steps of PLP computation are shown in Fig. 3. In the first processing step of PLP, the windowed audio signal is Fourier transformed. The power spectrum of the windowed signal is calculated as

$$P(\omega) = \mathrm{Re}[S(\omega)]^{2} + \mathrm{Im}[S(\omega)]^{2} \qquad (2)$$

The power spectrum is then warped onto a Bark scale using the approximation

$$\Omega(\omega) = 6 \ln\left(\frac{\omega}{1200\pi} + \sqrt{\left(\frac{\omega}{1200\pi}\right)^{2} + 1}\right) \qquad (3)$$
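As a rough illustration of the two formulas above, the following sketch computes the power spectrum of one frame and the Bark warping of an angular frequency. The Hamming window is an assumption (the text only says the signal is windowed), and the names `power_spectrum` and `bark_warp` are hypothetical.

```python
import numpy as np

def power_spectrum(frame):
    """Eqn. 2: squared real part plus squared imaginary part of the spectrum."""
    spec = np.fft.rfft(frame * np.hamming(len(frame)))  # assumed window type
    return spec.real ** 2 + spec.imag ** 2

def bark_warp(omega):
    """Eqn. 3 with omega in rad/s; ln(y + sqrt(y**2 + 1)) equals arcsinh(y)."""
    y = np.asarray(omega, dtype=float) / (1200.0 * np.pi)
    return 6.0 * np.arcsinh(y)
```

Writing the logarithm as `arcsinh` is only a numerically convenient form of the same expression; for example, `bark_warp(2 * np.pi * 1000)` gives the Bark value of a 1 kHz component.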

A critical band refers to the frequency bandwidth of the "auditory filter" created by the cochlea, the sense organ of hearing within the inner ear (Petr, 2003). The Bark scale corresponds to critical bands 1 to 24. The auditory warped spectrum is convolved with the power spectrum of the simulated critical-band masking curve to simulate the frequency resolution of human hearing. Equal-loudness pre-emphasis is needed to compensate for the unequal perception of loudness at different frequencies: the sampled values are weighted by an equal-loudness curve that simulates the sensitivity of human hearing at different frequencies. For the intensity-loudness power law, cubic-root amplitude compression approximates the power law of hearing, which describes the nonlinear relation between the intensity of a sound and its perceived loudness. The equalized values are transformed according to the power law by raising each intensity to the power of 0.33. Finally, the spectral samples are approximated by an all-pole model, as is usual in Linear Prediction (LP) analysis. The coefficients of the all-pole model can be used as features directly.
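The last two steps, power-law compression and all-pole modelling, can be sketched as below. This is a simplified illustration under stated assumptions: the critical-band integration and equal-loudness weighting are omitted, the autocorrelation is obtained as the inverse FFT of the compressed spectrum, and the all-pole coefficients are found by solving the Yule-Walker equations directly. The function name `all_pole_from_spectrum` is hypothetical.

```python
import numpy as np

def all_pole_from_spectrum(power, order):
    """Cube-root compress a one-sided power spectrum, then fit an all-pole model.

    Returns (a, gain), where a[0] == 1 and a[1:] are the predictor coefficients.
    """
    compressed = np.asarray(power, dtype=float) ** 0.33  # intensity-loudness power law
    # The inverse FFT of a power spectrum yields the autocorrelation lags.
    r = np.fft.irfft(compressed)[: order + 1]
    # Yule-Walker equations: R a = -r[1:], with R[i, k] = r[|i - k|].
    lag = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    a_tail = np.linalg.solve(r[lag], -r[1 : order + 1])
    a = np.concatenate(([1.0], a_tail))
    gain = r[0] + np.dot(a_tail, r[1 : order + 1])  # prediction error energy
    return a, gain
```

A Levinson-Durbin recursion would normally be used instead of a general linear solve; the direct solve is chosen here only because it is shorter to read.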


Sonogram:

In the current incarnation of the Sonogram feature set, audio at a 22 kHz sampling rate is processed directly in mono format (Xiaowen, 2008). Several improvements and code optimizations regarding processing time have been made and numerous options have been introduced. A number of the following steps, which are carried out during audio feature extraction, are now optional.

The algorithm for extracting the Sonogram is as follows:

1. Transform the audio segment into a spectrogram representation using the Fast Fourier Transform (FFT) with a Hanning window function (23 ms windows) and 50% overlap.

2. Apply the Bark scale by grouping frequency bands into 24 critical bands.

3. Apply a spreading function to account for spectral masking effects.

4. Transform the spectrum energy values on the critical bands into the decibel scale [dB] (Ausgef, 2006).

5. Calculate loudness levels by incorporating equal-loudness contours [Phon].

6. Compute the specific loudness sensation per critical band [Sone].

For each segment, the spectrogram of the audio is computed using the short-time Fast Fourier Transform (STFT).

The Bark scale, a perceptual scale which groups frequencies to critical bands according to perceptive pitch regions (Eberhard, 1999), is applied to the spectrogram, aggregating it to 24 frequency bands. A Spectral Masking spreading function is applied to the signal (Schroder, 1979), which models the occlusion of one sound by another sound. The Bark scale spectrogram is then transformed into the decibel scale. Further psychoacoustic transformations are applied: Computation of the Phon scale incorporates equal loudness curves, which account for the different perception of loudness at different frequencies. Subsequently, the values are transformed into the unit Sone, reflecting the specific loudness sensation of the human auditory system. The Sone scale relates to the Phon scale in the way that a doubling on the Sone scale sounds to the human ear like a doubling of the loudness.
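Part of this pipeline can be sketched as follows. The sketch covers the STFT, the grouping into Bark bands, the decibel conversion and the Phon-to-Sone mapping; the spreading function and the equal-loudness contours are left out, so it is not a full Sonogram extractor. The band edges are the commonly tabulated critical-band limits, the Phon-to-Sone formula is the usual approximation, and all function names are hypothetical.

```python
import numpy as np

# Upper edges (Hz) of the 24 critical bands of the Bark scale.
BARK_EDGES_HZ = np.array([100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270,
                          1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300,
                          6400, 7700, 9500, 12000, 15500], dtype=float)

def bark_spectrogram_db(x, sr, win_s=0.023):
    """Hann-windowed STFT with 50% overlap, summed into Bark bands, in dB."""
    n = int(round(win_s * sr))
    hop = n // 2
    window = np.hanning(n)
    frames = np.array([x[s:s + n] * window
                       for s in range(0, len(x) - n + 1, hop)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    freqs = np.fft.rfftfreq(n, d=1.0 / sr)
    # Band index of every FFT bin; bins above the last edge get index 24 and are ignored.
    band = np.searchsorted(BARK_EDGES_HZ, freqs, side="left")
    bands = np.zeros((len(frames), len(BARK_EDGES_HZ)))
    for b in range(len(BARK_EDGES_HZ)):
        bands[:, b] = power[:, band == b].sum(axis=1)
    return 10.0 * np.log10(np.maximum(bands, 1e-10))  # floor avoids log of zero

def phon_to_sone(phon):
    """Above 40 phon, every additional 10 phon doubles the loudness in sone."""
    phon = np.asarray(phon, dtype=float)
    low = (np.maximum(phon, 0.0) / 40.0) ** 2.642
    return np.where(phon >= 40.0, 2.0 ** ((phon - 40.0) / 10.0), low)
```

Bands whose range lies above half the sampling rate receive no FFT bins and stay at the floor value, so fewer than 24 bands carry information at lower sampling rates.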

3. Techniques For Speech Indexing:

Gaussian mixture models (GMM):

The probability distribution of feature vectors is modeled by parametric or nonparametric methods. Models that assume a shape for the probability density function are termed parametric. In nonparametric modeling, minimal or no assumptions are made regarding the probability distribution of the feature vectors (Tang, 2012). In this section, we briefly review the Gaussian mixture model (GMM) for audio classification. The basis for using a GMM is that the distribution of feature vectors extracted from a class can be modeled by a mixture of Gaussian densities.

For a D-dimensional feature vector x, the mixture density function for category s is defined as

$$p(x \mid \lambda^{s}) = \sum_{i=1}^{M} \alpha_{i}^{s} f_{i}^{s}(x) \qquad (4)$$

The mixture density function is a weighted linear combination of M component uni-modal Gaussian densities $f_{i}^{s}(\cdot)$.

Each Gaussian density function $f_{i}^{s}(\cdot)$ is parameterized by the mean vector $\mu_{i}^{s}$ and the covariance matrix $\Sigma_{i}^{s}$ using

$$f_{i}^{s}(x) = \frac{1}{\sqrt{(2\pi)^{D} \, |\Sigma_{i}^{s}|}} \exp\left(-\frac{1}{2} (x - \mu_{i}^{s})^{T} (\Sigma_{i}^{s})^{-1} (x - \mu_{i}^{s})\right) \qquad (5)$$

where $(\Sigma_{i}^{s})^{-1}$ and $|\Sigma_{i}^{s}|$ denote the inverse and determinant of the covariance matrix $\Sigma_{i}^{s}$, respectively. The mixture weights $(\alpha_{1}^{s}, \alpha_{2}^{s}, \ldots, \alpha_{M}^{s})$ satisfy the constraint $\sum_{i=1}^{M} \alpha_{i}^{s} = 1$. Collectively, the parameters of the model are denoted as $\lambda^{s} = \{\alpha_{i}^{s}, \mu_{i}^{s}, \Sigma_{i}^{s}\}$, i = 1, 2, ..., M. The number of mixture components is chosen empirically for a given data set. The parameters of the GMM are estimated using the iterative expectation-maximization algorithm (Redner, 1984).
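The mixture density of Eqns. 4 and 5 can be evaluated directly, as in the short sketch below. It assumes the parameters are already known (for instance from an EM fit, which is not shown), and the names `gaussian_pdf` and `gmm_pdf` are hypothetical.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Eqn. 5: multivariate Gaussian density at a single D-dimensional point x."""
    d = len(mean)
    diff = np.asarray(x, dtype=float) - mean
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    # solve(cov, diff) applies the inverse covariance without forming it explicitly.
    return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm

def gmm_pdf(x, weights, means, covs):
    """Eqn. 4: weighted sum of the M component densities."""
    return sum(w * gaussian_pdf(x, m, c)
               for w, m, c in zip(weights, means, covs))
```

For high-dimensional features the density values become very small, so working with log densities is usually preferable in practice.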

4. Proposed Method For Retrieving Speech Clip:

Indexing of Clips:

Creation of Index

1. Collect 100 speech clips s1, s2, ..., s100 from TV broadcast news channels.

2. Extract speech audio clips of 5-10 seconds each, every clip containing one complete sentence.

3. Using the RMS energy envelope, mark the beginning and end of each word.

4. Extract PLP and Sonogram features from each word in all 100 speech clips.

Retrieval of Clips using Index:

Retrieval of speech audio for a given query

1. For a given keyword query audio clip of 1-2 seconds duration, extract the Sonogram, PLP, PNCC and SBC features.

2. Fit a Gaussian to the features of the query keyword audio.

3. Evaluate the probability density function of the query GMM at the feature vectors of all 100 speech clips in the index database.

4. The database word whose feature vectors give the maximum pdf under the query keyword Gaussian is declared the winner.

5. Retrieve a ranked list, in descending order, of the speeches containing the matched keyword.
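The retrieval steps above can be sketched as follows. Several choices here are assumptions rather than details given in the text: a single full-covariance Gaussian is fitted to the query, each database word is scored by the average log density of its frames (the log is used only for numerical stability), and a speech is scored by its best-matching word. The index layout and all function names are hypothetical.

```python
import numpy as np

def fit_gaussian(features):
    """Mean and lightly regularised covariance of an (n_frames, D) feature matrix."""
    mean = features.mean(axis=0)
    cov = np.atleast_2d(np.cov(features, rowvar=False))
    return mean, cov + 1e-6 * np.eye(features.shape[1])

def avg_log_density(features, mean, cov):
    """Average log Gaussian density over all frames of one database word."""
    d = features.shape[1]
    diff = features - mean
    maha = np.einsum("ij,ij->i", diff, np.linalg.solve(cov, diff.T).T)
    _, logdet = np.linalg.slogdet(cov)
    return float(np.mean(-0.5 * (maha + d * np.log(2.0 * np.pi) + logdet)))

def rank_speeches(query_features, index):
    """index maps a speech id to a list of per-word feature matrices."""
    mean, cov = fit_gaussian(query_features)
    best = {sid: max(avg_log_density(w, mean, cov) for w in words)
            for sid, words in index.items()}
    # Highest score first, matching the descending ranked list of step 5.
    return sorted(best, key=best.get, reverse=True)
```

Replacing `fit_gaussian` with a multi-component mixture fitted by EM would follow the same scoring and ranking structure.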

5. Performance Measures:

Accuracy of Retrieval:

The performance of the audio indexing system is measured by the accuracy of retrieval, which is the percentage of query feature vectors that retrieve the clip corresponding to the query. That is, if the key feature vector of the clip corresponding to the test feature vector is also in the final group, then the retrieval is a success.

$$\text{Accuracy of retrieval} = \frac{\text{No. of query fv correctly retrieved}}{\text{Total no. of fv used for testing}} \times 100 \qquad (6)$$

where fv denotes the feature vectors.

Average Number of Clips Retrieved for each Query:

The performance of audio retrieval is also measured by the average number of clips retrieved for each query. When a query feature vector is given, all the clips assigned to the group having the minimum distance to the query feature vector are retrieved.

$$\text{Average no. of clips retrieved per query} = \frac{\text{Total no. of clips retrieved over all queries}}{\text{Total no. of queries}} \qquad (7)$$
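Both measures reduce to simple ratios, as the following sketch shows. The function names are hypothetical, and the second function assumes the measure is the total number of clips retrieved over all queries divided by the number of queries.

```python
def retrieval_accuracy(correct_flags):
    """Eqn. 6: percentage of query feature vectors that retrieved the right clip."""
    return 100.0 * sum(correct_flags) / len(correct_flags)

def average_clips_per_query(retrieved_per_query):
    """Eqn. 7: total clips retrieved over all queries divided by the number of queries."""
    total = sum(len(clips) for clips in retrieved_per_query)
    return total / len(retrieved_per_query)
```

An average close to 1 means that most queries return a single clip, so a lower value indicates a more selective index.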

6. Experimental Results:

For speech audio, experiments are conducted to study the performance of the retrieval algorithms in terms of the performance measures.

Database for speech audio:

The experiments are conducted for indexing speeches using television broadcast news audio data collected from Tamil channels. Speech audio of 1 hour duration is recorded from the broadcast audio using a TV tuner card. A total dataset of 100 different complete speech dialogue clips, each of 5-10 seconds duration, is extracted from the 1 hour of speech audio, which is sampled at 16 kHz and encoded with 16 bits per sample.

Acoustic feature extraction:

For each of the speech audio clips of 10 seconds duration, PLP and Sonogram features are extracted. A frame size of 20 ms and a frame shift of 10 ms are used. Thereby, 9 PLP and 23 Sone features are extracted for each speech audio clip; with a 10 ms frame shift, 1000 feature vectors are obtained for a clip of 10 seconds duration. Hence, feature matrices of size 1000 x 9 and 1000 x 23 are obtained for each clip, which results in 100 such feature vector files for the 100 clips.

Creation of index:

In our experiment, the index is created for each speech in the database. Each complete speech dialogue is separated into individual words by marking each word's segment using the RMS energy envelope. The features are then extracted from each individual word and represented in a compact form.

Retrieval of a clip using index:

The GMM algorithm is used for the retrieval of the speech clips. For retrieval, the keyword of interest is given as a query audio clip of 1-2 seconds duration. For every frame of the database words, the probability density function is computed against the query Gaussian model. The database word whose feature vectors give the maximum pdf under the query keyword Gaussian is declared the winner. A ranked list, in descending order, of the speeches containing the matched keyword is then retrieved. The above process is repeated for a set of queries covering all 100 clips.

Fig. 6 shows the performance of indexing and retrieval for different durations of query keyword clips for various ranked retrievals.

Table 1 shows the performance of indexing and retrieval for different durations of query speech clips using various feature sets.

7. Conclusion:

In this work, methods are proposed for the indexing and retrieval of speech. In speech indexing, the index is created for each speech audio clip in the database. Each complete speech dialogue is separated into individual words by marking each word's segment using the voice activity detection method. Perceptual Linear Prediction (PLP) and Sonogram features are then extracted from each individual word. Retrieval is performed for every speech query audio clip using Gaussian mixture models (GMM), based on the extracted features. The probability that an indexing feature vector belongs to the Gaussian is computed. The average probability density function is computed for each of the feature vectors in the database, and retrieval is based on the highest probability. The query feature vectors were tested and the retrieval performance was studied. The performance of the speech audio indexing system was evaluated on 100 clips; the method achieves an overall accuracy of about 90.0%, and the average number of clips retrieved for each query was also measured.


Article history:

Received 12 October 2014

Received in revised form 26 December 2014

Accepted 1 January 2015

Available online 25 February 2015


Ananthi, S., P. Dhanalakshmi, 2014. "SVM and HMM Modeling Techniques for Speech Recognition Using LPCC and MFCC Features" Proceedings of the 3rd International Conference on Frontiers of Intelligent Computing: Theory and Applications (FICTA) Advances in Intelligent Systems and Computing, 327(2015): 519-526.

Ausgeführt, 2006. "Evaluation of New Audio Features and Their Utilization in Novel Music Retrieval Applications".

Ciprian Chelba, Alex Acero, 2005. "Indexing Uncertainty for Spoken Document Search" pp. 61-64, Interspeech.

Eberhard Zwicker and Hugo Fastl, 1999. Psychoacoustics--Facts and Models, volume 22 of Springer Series of Information Sciences. Springer, Berlin.

Ganesh Babu, C. and P.T. Vanathi, 2009. "Performance Analysis of Voice Activity Detection Algorithms for Robust Speech Recognition" International Journal of Computing Science and Communication Technologies, 2-1.

Ji Hun Park and Hong Kook Kim, 2013. "Dual-microphone voice activity detection incorporating gaussian mixture models with an error correction scheme in non-stationary noise environments" International Journal of Innovative Computing, Information and Control, 9-6.

John Makhoul, Francis Kubala, Timothy Leek, Daben Liu, Long Nguyen, Richard Schwartz, and Amit Srivastava, 2000. "Speech and Language Technologies for Audio Indexing and Retrieval" Proceedings of the IEEE, 88-8.

Kishan Thambiratnam and Sridha Sridharan, 2007. "Rapid Yet Accurate Speech Indexing Using Dynamic Match Lattice Spotting" IEEE Transactions on Audio, Speech, and Language Processing, 15-1.

Peter, M., Grosche, 2012. "Signal Processing Methods for Beat Tracking, Music Segmentation, and Audio Retrieval", Saarbrücken, 9.

Petr Motlcek, 2003. "Modeling of Spectra and Temporal Trajectories in Speech Processing", DOCTORAL THESIS.

Ramirez, J., J.M. Gorriz, J.C. Segura, 2007. Voice Activity Detection- Fundamentals and Speech Recognition System Robustness. Robust Speech Recognition and Understanding, 1-22. ISBN 978-3-90261308-0

Redner, R.A. and H.F. Walker, 1984. "Mixture densities, maximum likelihood and the EM algorithm," SIAM Review, 26: 195-239.

Schroder, M.R., B.S. Atal and J.L. Hall, 1979. Optimizing digital speech coders by exploiting masking properties of the human ear. Journal of the Acoustical Society of America, 66: 1647-1652.

Tang, H., S.M. Chu, M. Hasegawa-Johnson, T.S. Huang, 2012. Partially Supervised Speaker Clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(5): 959-971.

Venkatesha Prasad, R., H.S. Abhijeet Sangwan, Jamadagni, M.C. Chiranth, Rahul Sah, Vishal Gaurav, 2002. "Comparison of Voice Activity Detection Algorithms for VoIP" Proceedings of the Seventh International Symposium on Computers and Communications.

Xiaowen Cheng, Jarod V. Hart and James S. Walker, 2008. "Time-frequency Analysis of Musical Rhythm" Notices of AMS, 56-3.

(1) R. Thiruvengatanadhan and (2) P. Dhanalakshmi

(1) Assistant Professor, Department of Computer Science and Engineering, Faculty of Engineering and Technology, Annamalai University, Annamalainagar-608002, Tamil Nadu, India.

(2) Associate Professor, Department of Computer Science and Engineering, Faculty of Engineering and Technology, Annamalai University, Annamalainagar-608002, Tamil Nadu, India.

Corresponding Author: R. Thiruvengatanadhan, Assistant Professor, Department of Computer Science and Engineering, Faculty of Engineering and Technology, Annamalai University, Annamalainagar-608002,Tamil Nadu, India.

Table 1: Average Number of Clips Retrieved for each Query using various Feature sets.

Features    Average clips retrieved per query

PLP         1.39
Sonogram    1.58

COPYRIGHT 2015 American-Eurasian Network for Scientific Information
Author:Thiruvengatanadhan, R.; Dhanalakshmi, P.
Publication:Advances in Natural and Applied Sciences
Article Type:Report
Date:Jun 1, 2015