FEATURE EXTRACTION USING PULSE-COUPLED NEURAL NETWORK IN ISOLATED SPEECH RECOGNITION

This article presents results on feature extraction for isolated speech recognition using the Pulse-Coupled Neural Network (PCNN) approach. PCNN-based feature extraction is analyzed for two kinds of input: direct pulse-code modulation (PCM) samples and fast Fourier transform (FFT) coefficients.


Introduction
The speech recognition problem may be interpreted as a speech-to-text conversion problem: a speaker wants his or her speech to be transcribed into text by a computer. Automatic speech recognition has been an active research topic for more than four decades. With the advent of digital computing and signal processing, the problem of speech recognition was clearly posed and thoroughly studied. These developments were complemented by an increased awareness of the advantages of conversational systems. The range of possible applications is wide and includes voice-controlled appliances, fully featured speech-to-text software, automation of operator-assisted services, and voice recognition aids for handicapped persons. Different approaches to speech recognition have been adopted. They can be divided mainly into two trends: hidden Markov models (HMM) and artificial neural networks (ANN).

Speech signal processing
Speech acquisition begins with a person speaking into a microphone. The speech signal is generally converted into digital form using pulse-code modulation (PCM). This representation is not well suited to pattern recognition, but the signal can be represented by a limited set of features. Several methods are available for feature extraction and dimension reduction. Dimension reduction is a transformation of the input signal space into a feature space of lower dimension. Its goal is to obtain significant features that represent a pattern uniquely. Classical methods of dimension reduction include the Karhunen-Loève transform [11], singular value decomposition (SVD), etc. [5]. Dimension reduction methods based on ANN include, for example, Kohonen self-organizing maps [6] and principal component analysis (PCA) neural networks [10,9].
Classical methods of feature extraction in digital signal processing for speech recognition include discrete Fourier transform coefficients, linear predictive coefficients (LPC), filter banks, mel-frequency cepstral coefficients (MFCC), etc. [7,8]. A method attracting growing interest for dimension reduction and feature extraction in image processing is the Pulse-Coupled Neural Network (PCNN). My work focused on feature extraction with the PCNN in isolated speech recognition.

A PCNN structure
The structure of a standard PCNN follows from the structure of the input pattern to be processed. Let us consider the input pattern to be a matrix of values representing an isolated word. The PCNN is a single-layer, two-dimensional, laterally connected network of pulse-coupled neurons, each of which is associated with one value of the input matrix. A PCNN neuron consists of an input part, a linking part and a pulse generator. The neuron receives input signals through feeding and linking inputs. The feeding input is the primary input from the neuron's receptive area, which consists of the values neighboring the corresponding value in the input matrix. The linking input is a secondary input formed by lateral connections with neighboring neurons. The difference between these inputs is that the feeding connections have a slower characteristic response time constant than the linking connections. The standard PCNN model is described iteratively by equations in which F_ij is the feeding input, L_ij is the linking input, n is the iteration step and S_ij is the value at coordinates i,j in the input matrix. W and M are the weight matrices, ∗ is the convolution operator, Y_ij is the output of the neuron at coordinates i,j, V_L and V_F are potentials, and α_L and α_F are decay constants.
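The equations themselves are not reproduced in this text. A commonly used form of the standard PCNN iteration, written with the symbols defined above, is sketched here. It is an assumption rather than a quotation of the original equations: the linking strength β, the threshold potential V_Θ and the threshold decay constant α_Θ are names that do not appear in the text, and comparing the internal activity U_ij against the threshold Θ_ij of the previous step is one of several conventions in use.

```latex
\begin{align}
F_{ij}(n) &= e^{-\alpha_F}\, F_{ij}(n-1) + V_F\, (M * Y(n-1))_{ij} + S_{ij} \\
L_{ij}(n) &= e^{-\alpha_L}\, L_{ij}(n-1) + V_L\, (W * Y(n-1))_{ij} \\
U_{ij}(n) &= F_{ij}(n)\,\bigl(1 + \beta\, L_{ij}(n)\bigr) \\
Y_{ij}(n) &= \begin{cases} 1 & \text{if } U_{ij}(n) > \Theta_{ij}(n-1) \\ 0 & \text{otherwise} \end{cases} \\
\Theta_{ij}(n) &= e^{-\alpha_\Theta}\, \Theta_{ij}(n-1) + V_\Theta\, Y_{ij}(n)
\end{align}
```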
The individual signals of the linking input are biased and then multiplied together. Next, the inputs F_ij and L_ij are modulated in the linking part of the neuron, which yields the internal activity U_ij of the neuron at the given iteration step. If the internal activity is greater than the dynamic threshold Θ_ij, the neuron generates an output pulse; otherwise, the output equals zero. The neuron output Y_ij need not be binary: with a sigmoid pulse generator, the output takes an analogue value between 0 and 1. The PCNN transforms the input matrix into a sequence of temporary binary matrices, each of the same dimension as the input matrix. The sum of all activities in a given iteration step gives one value, which represents one feature for classification; N iteration steps yield N features. The one-dimensional time signal generated from the values of the output matrix Y_ij(n) in every iteration step n can be defined as follows:
$$G(n) = \sum_{i,j} Y_{ij}(n)$$
A significant advantage of the PCNN, useful mainly in image recognition, is the invariance of the generated time signal to rotation, dilation or translation of images [4]. The PCNN is therefore suitable for feature generation in classification tasks that use conventional neural networks or other methods. Thanks to the translation invariance of the generated features, the PCNN method used in speech recognition does not rely on precise detection of word endpoints. It is evident that the PCNN is not a neural network in the classification sense. It is only a means of feature extraction for pattern classification using conventional neural network models such as the multi-layer perceptron.
Several models of the PCNN have been developed. The most widely used are, for example, the PCNN with modified feeding input [1], the fast-linking PCNN [3] and the feedback PCNN [2].
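The iteration and the time signal G(n) described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names, all parameter values, the shared 3x3 kernel used for both M and W, the zero padding at the borders, the initial threshold of 1.0 and the assumption that the input values are scaled to the range 0 to 1 are all hypothetical choices.

```python
import math


def neighbour_sum(Y, kernel):
    """Weighted sum of the 3x3 neighbourhood of every element of Y.

    The kernel is symmetric, so this equals a convolution; positions
    outside the matrix are treated as zero (an assumed border rule).
    """
    rows, cols = len(Y), len(Y[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    r, c = i + di, j + dj
                    if 0 <= r < rows and 0 <= c < cols:
                        acc += kernel[di + 1][dj + 1] * Y[r][c]
            out[i][j] = acc
    return out


def pcnn_time_signal(S, n_iter=200, beta=0.2,
                     alpha_F=0.1, alpha_L=1.0, alpha_T=0.2,
                     V_F=0.5, V_L=0.2, V_T=20.0):
    """Run a standard PCNN on input matrix S and return G(n) for n_iter steps.

    All default parameter values are illustrative assumptions.
    """
    # Assumed weights, used here for both M (feeding) and W (linking).
    kernel = [[0.5, 1.0, 0.5],
              [1.0, 0.0, 1.0],
              [0.5, 1.0, 0.5]]
    rows, cols = len(S), len(S[0])
    F = [[0.0] * cols for _ in range(rows)]   # feeding input
    L = [[0.0] * cols for _ in range(rows)]   # linking input
    Y = [[0.0] * cols for _ in range(rows)]   # binary output
    T = [[1.0] * cols for _ in range(rows)]   # dynamic threshold (assumed start)
    decay_F, decay_L, decay_T = (math.exp(-a) for a in (alpha_F, alpha_L, alpha_T))

    G = []
    for _ in range(n_iter):
        K = neighbour_sum(Y, kernel)          # uses the outputs of the previous step
        fired = 0.0
        for i in range(rows):
            for j in range(cols):
                F[i][j] = decay_F * F[i][j] + V_F * K[i][j] + S[i][j]
                L[i][j] = decay_L * L[i][j] + V_L * K[i][j]
                U = F[i][j] * (1.0 + beta * L[i][j])      # internal activity
                Y[i][j] = 1.0 if U > T[i][j] else 0.0     # pulse generator
                T[i][j] = decay_T * T[i][j] + V_T * Y[i][j]
                fired += Y[i][j]
        G.append(fired)                       # G(n): sum of all outputs in step n
    return G
```

Calling `pcnn_time_signal(S)` on a matrix given as a list of rows returns a list of 200 values, one feature per iteration step. A PCM sequence can be passed as a matrix with a single row.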

Experiments and Results
The following experiments with PCNN feature extraction were carried out on my test database of 36 isolated Slovak words, each uttered once by 23 different speakers. In the first experiment, the PCNN approach described above was applied directly to the PCM sequence of an isolated word. I used 16-bit PCM with a sampling frequency f_s = 8 kHz. The PCNN with 200 iteration steps produced 200 features for every isolated word. The 200x1 feature vectors G_ab(n), where a is the word index (1 ≤ a ≤ 36) and b is the speaker index (1 ≤ b ≤ 23), were then normalized by dividing them by their maximum values. Figure 2 shows the mean courses of the G_a(n) functions for all 36 input words. The mean course for input word a was computed as follows:
$$G_a(n) = \frac{1}{23} \sum_{b=1}^{23} G_{ab}(n)$$
In my next experiment I used Fourier transform coefficients as the PCNN input. The PCM sequence was partitioned into consecutive frames of 128 samples with an overlap of 64 samples. After applying a Hamming window, which reduces spectral leakage, the fast Fourier transform (FFT) was computed for each frame. The input matrix was formed by the first 64 FFT coefficients from every time frame (the spectrum of a real-valued signal is symmetric, so the remaining coefficients are redundant). The PCNN with 200 iteration steps produced 200 features for every isolated word. The feature vectors were again normalized by dividing them by their maximum values. Figure 3 shows the mean courses of the G_a(n) functions for all 36 input words, computed as in my first experiment. Figure 4 shows the variance of the mean courses of the G_a(n) functions for input words No. 4, 5, 6 and 7 with their 95% confidence level for the FFT coefficients input.
The experiments described were carried out in the MATLAB environment.
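The construction of the FFT input matrix described above (128-sample frames, 64-sample overlap, Hamming window, first 64 coefficients per frame) can be sketched as follows. The sketch is written in Python rather than MATLAB and rests on several assumptions: the function names are hypothetical, the coefficient magnitudes are used because the text does not say how the complex coefficients were turned into real input values, frames are stored as rows, and a direct DFT stands in for an FFT routine to keep the example free of dependencies (it yields the same coefficients, only more slowly).

```python
import cmath
import math


def hamming(N):
    """Symmetric Hamming window of length N."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * k / (N - 1)) for k in range(N)]


def dft_magnitudes(frame, n_bins):
    """Magnitudes of the first n_bins DFT coefficients of a real frame."""
    N = len(frame)
    mags = []
    for k in range(n_bins):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / N)
                  for t, x in enumerate(frame))
        mags.append(abs(acc))
    return mags


def fft_input_matrix(pcm, frame_len=128, overlap=64, n_bins=64):
    """Build the PCNN input matrix: one row of n_bins magnitudes per frame."""
    hop = frame_len - overlap                 # 64 samples between frame starts
    window = hamming(frame_len)
    matrix = []
    # Only complete frames are used; a trailing partial frame is dropped.
    for start in range(0, len(pcm) - frame_len + 1, hop):
        frame = [pcm[start + t] * window[t] for t in range(frame_len)]
        matrix.append(dft_magnitudes(frame, n_bins))
    return matrix
```

With f_s = 8 kHz, each of the 64 columns covers 62.5 Hz, so a 1 kHz tone falls into column 16. Before the matrix is passed to a PCNN, its values would typically be scaled, for example by dividing by the largest entry; that step is also an assumption.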

Conclusion
For recognition, it is very important that the feature vectors G_ab(n) be as similar as possible for the same word uttered by different speakers and, at the same time, as different as possible for different words. The difference measure can be understood, for example, as the Euclidean distance between feature vectors. As can be seen from the results above, the feature vectors for different words do not differ much. Moreover, the variance of these vectors for a given input word is much larger than the difference measure between distinct words. This clearly shows that a speech recognition system relying on the PCNN-based feature extraction introduced here will fail. The PCNN approach has been very successful in the field of image recognition, and there is still hope that some other methods of speech signal preprocessing will prove helpful in speech recognition.
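The comparison made at the start of this section, spread of the feature vectors within a word against the Euclidean distance between words, can be illustrated with a short Python sketch. It assumes a hypothetical data layout (a dictionary mapping each word to its list of per-speaker feature vectors) and hypothetical function names; it is not the evaluation procedure used to produce the figures.

```python
import math


def normalize(G):
    """Divide a feature vector by its maximum value (left unchanged if all zero)."""
    peak = max(G)
    return [g / peak for g in G] if peak > 0 else list(G)


def mean_course(vectors):
    """Element-wise mean of the feature vectors of one word over all speakers."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def separability(features):
    """Return (mean within-word distance, mean between-word distance).

    features maps each word to a list of normalized feature vectors,
    one per speaker. Within-word distance is measured from each vector
    to the mean course of its word; between-word distance is measured
    between the mean courses of every pair of distinct words.
    """
    means = {word: mean_course(vs) for word, vs in features.items()}
    within = [math.dist(v, means[word])
              for word, vs in features.items() for v in vs]
    words = list(means)
    between = [math.dist(means[p], means[q])
               for idx, p in enumerate(words) for q in words[idx + 1:]]
    return sum(within) / len(within), sum(between) / len(between)
```

Features that are useful for recognition would give a within-word distance clearly smaller than the between-word distance. The results reported here correspond to the opposite case.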