COMPARISON OF SELECTED CLASSIFICATION METHODS IN AUTOMATIC SPEAKER IDENTIFICATION

This paper presents a performance comparison of three different classifiers applied to Automatic SpeakeR Identification: the Gaussian Mixture Model (GMM), the k Nearest Neighbor algorithm (kNN) and Support Vector Machines (SVM). Each classifier represents a different approach to the classification procedure. Mel Frequency Cepstral Coefficients (MFCC) were used as feature vectors in the experiment. Classification precision for each classifier was evaluated at the frame and recording levels. Experiments were conducted on the MobilDat-SK dataset, which was recorded over a mobile telecommunication network. The experiments show promising results for the SVM classifier.


Introduction
Nowadays, speaker recognition is one of the basic tasks in various systems for Automatic SpeakeR Identification (ASRI), audio document retrieval, forensic analysis, etc. Such systems recognize "who is talking" from the speech signal. An identification system consists of various parts working together. In this paper, we deal with three different classification approaches for an ASRI system, namely the Gaussian Mixture Model (GMM), k Nearest Neighbor (kNN) and Support Vector Machines (SVM). The precision of the classifiers is experimentally evaluated by tests performed on the same dataset. We also focus on the ability of the selected classifiers to be trained from a limited amount of speech data. Such a property is crucial in applications such as speaker segmentation and matching in an audio stream, or speaker retrieval in digital audio archives using the Query-by-Example approach.
The paper is organized as follows. In Section 2, each of the used classification methods is briefly discussed. Section 3 presents the classification results as well as the database description, data preparation and the parameters of the given classifiers.

Classification techniques description
In this section, we present three different classification methods. The subsections give a short overview of the GMM, kNN and SVM classifiers. The GMM is a typical classification method, which has been successfully used in many speech-related applications. The SVM method has become very popular at present due to its great classification abilities, although it is a computationally very expensive method. Unlike model-based classification methods such as GMM and SVM, kNN represents an instance-based approach to the classification process. Among other available classification methods, HMMs or decision trees can be mentioned.

GMM classification
In GMM classification, a Gaussian mixture model is used as the statistical representation of the speaker pattern. The distribution of feature vectors extracted from the speech signal is modeled by a mixture of Gaussian density functions (Fig. 1). For a D-dimensional feature vector x, the mixture density for speaker λ_r is defined as [1]:

p(\mathbf{x} \mid \lambda_r) = \sum_{i=1}^{M} p_i^r \, b_i^r(\mathbf{x}),    (1)

where M denotes the number of components and p_i^r are the mixture weights. The density is a weighted linear combination of M unimodal Gaussian component densities b_i^r(x):

b_i^r(\mathbf{x}) = \frac{1}{(2\pi)^{D/2} \, |\Sigma_i^r|^{1/2}} \exp\!\left( -\frac{1}{2} (\mathbf{x} - \mu_i^r)^T (\Sigma_i^r)^{-1} (\mathbf{x} - \mu_i^r) \right),    (2)

each parameterized by a mean vector μ_i^r and a covariance matrix Σ_i^r. The mixture weights must satisfy the following constraint:

\sum_{i=1}^{M} p_i^r = 1.    (3)

The complete GMM is defined by its mean vectors, covariance matrices and mixture weights:

\lambda_r = \{ p_i^r, \mu_i^r, \Sigma_i^r \}, \quad i = 1, \dots, M.    (4)

Every speaker to be recognized has his own model, which is used as his representation in the identification procedure instead of the utterances themselves.
For the covariance matrices, we used diagonal covariance matrices, which usually give better recognition results than full covariance matrices. The best results in parameter estimation were achieved using the iterative Expectation-Maximization (EM) algorithm [1], [2]. In this work, we used 100 iteration steps to estimate each model.
The identification stage is a maximum likelihood classifier. The main task of the system is to decide which of the enrolled speakers, represented by their models λ_1, …, λ_r (where the index r denotes the number of speakers), the input utterance belongs to. This decision is based on computing the maximum posterior probability for the input feature vectors [1]. The NETLAB [13] implementation of the GMM classifier is applied in the experimental part.
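The per-speaker training and maximum-likelihood identification described above can be sketched as follows. This is an illustrative sketch, not the paper's NETLAB code: it uses scikit-learn's `GaussianMixture` with diagonal covariances, and the feature dimension, component count and toy data are placeholder assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

def train_speaker_models(features_per_speaker, n_components=8):
    """Fit one diagonal-covariance GMM per speaker on that speaker's frames."""
    models = {}
    for speaker, frames in features_per_speaker.items():
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag",
                              max_iter=100, random_state=0)  # EM, 100 steps
        models[speaker] = gmm.fit(frames)
    return models

def identify(models, test_frames):
    """Return the speaker whose model maximizes the total log-likelihood."""
    return max(models, key=lambda s: models[s].score(test_frames))

# Toy data: two "speakers" with well-separated feature distributions.
train = {"A": rng.normal(0.0, 1.0, (200, 4)),
         "B": rng.normal(5.0, 1.0, (200, 4))}
models = train_speaker_models(train, n_components=2)
print(identify(models, rng.normal(5.0, 1.0, (50, 4))))  # expected: B
```

In a real system the toy arrays would be replaced by the MFCC frames of each speaker's training utterances.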

kNN classification
The kNN algorithm (k Nearest Neighbor) can be classified as a nonlinear nonparametric classification method [3]. The algorithm is based on a very simple principle: similar data lie close to each other in the search (data) space. In other words, for every object from the test data, kNN finds the set of k objects in the training data that are closest to the test object (the nearest neighbors). The label assignment is usually based on the rule of majority voting, i.e., the most frequent class among the k nearest neighbors of a given test object determines the class to which this object should belong. The value of k dictates the number of closest objects from the training data that are taken into account in the label assignment. If the value is too small, the result can be sensitive to noise points (objects).
If it is too large, then the neighborhood may include too many points from other classes.
An example of the impact of the k-value on the classification result is depicted in Fig. 2, where the kNN algorithm classifies two-dimensional data into two classes. The first circle represents the region containing the three neighbors taken into account in the decision about the orange point (k = 3). In this example, the classified point is assigned to the "red" class. However, when six neighbors are considered (k = 6) in the label assignment, the classification result is the opposite and the unknown point is assigned to the "blue" class.
Besides the k-value, the distance metric is important for the kNN algorithm: it represents the measure of data similarity. The choice of a particular distance metric usually depends on the given classification problem. The Euclidean (5) or Mahalanobis (6) distance measures are commonly used [3]; the distance between a training data vector z and a testing vector x is defined as follows:

d_E(\mathbf{x}, \mathbf{z}) = \sqrt{\sum_{i=1}^{n} (x_i - z_i)^2},    (5)

d_M(\mathbf{x}, \mathbf{z}) = \sqrt{(\mathbf{x} - \mathbf{z})^T R^{-1} (\mathbf{x} - \mathbf{z})},    (6)

where n is the number of attributes (the dimension) and R is the covariance matrix.
Despite its simplicity, the kNN method is well suited to multi-modal classes, is very flexible, and belongs to the top 10 data mining algorithms (IEEE Conference on Data Mining 2007 [3]).

SVM classification
SVM is a learning procedure based on Vapnik's statistical learning theory [4], proposed in 1979. The classification task involves a training set of instance-label pairs (x_i, y_i), i = 1, 2, …, l, where x_i ∈ R^n and y_i ∈ {1, −1}. The SVM requires the solution of the following optimization problem [5]:

\min_{\mathbf{w}, b, \xi} \; \frac{1}{2} \mathbf{w}^T \mathbf{w} + C \sum_{i=1}^{l} \xi_i,

subject to:

y_i \left( \mathbf{w}^T \phi(\mathbf{x}_i) + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0.

Each instance in the training set contains the features of the observed data and a class label identifying the particular class; in our task it is the index of the speaker. The function

K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)

denotes the kernel function. Training vectors are mapped into a higher-dimensional feature space by the kernel function. An example of using the kernel function is depicted in Fig. 3: data from a two-dimensional feature space are mapped into a higher, three-dimensional feature space. There are four basic kernel functions: linear, polynomial, radial basis function (RBF) and sigmoid. The RBF kernel function, which was used in our experiment, is defined as [4]:

K(\mathbf{x}_i, \mathbf{x}_j) = \exp\!\left( -\gamma \, \| \mathbf{x}_i - \mathbf{x}_j \|^2 \right), \quad \gamma > 0.
The aim of the SVM is to find a linear separating hyperplane with the maximal margin in this higher-dimensional space. C is the penalty parameter of the error term and must satisfy the condition C > 0. Not every function can be used as a kernel, only those that comply with Mercer's conditions [6]. For the SVM classification system, every attribute of the data is scaled to the range [−1, 1]. The main advantage of scaling is that it prevents attributes with greater numeric ranges from dominating those with smaller numeric ranges [5]. The SVM classifier requires setting one or more parameters. In our experiment, we applied the C-SVM formulation included in the LIBSVM implementation [7] with the RBF kernel function; therefore, we searched for two model parameters, C and γ. We used the Particle Swarm Optimization (PSO) technique [8] for the parameter selection task.
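The C-SVM setup with an RBF kernel, attribute scaling to [−1, 1], and (C, γ) selection by 5-fold cross-validation accuracy can be sketched as below. Note the assumptions: the paper used LIBSVM with PSO for the parameter search, whereas this sketch substitutes scikit-learn's `SVC` and a plain grid search; the coarse parameter grid and the synthetic two-class data are also placeholders.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic stand-in for two speakers' feature vectors.
X = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(4, 1, (60, 2))])
y = np.array([0] * 60 + [1] * 60)

# Scale every attribute to [-1, 1], as described in the text.
X = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)

# Coarse (C, gamma) grid; criterion = 5-fold cross-validation accuracy.
grid = {"C": 2.0 ** np.arange(-5, 6), "gamma": 2.0 ** np.arange(-5, 3)}
search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

A finer grid (steps of 2^0.1, as in the experiment) or a PSO search would explore the same (C, γ) space more densely at a higher computational cost.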

Experimental results
The selection of the classifier and feature vectors is one of the crucial parts of each classification system. The classification task is to correctly identify speakers known to the system based on a previous learning procedure. This learning process can be done by various techniques based on statistical modeling, distance measures or non-probabilistic linear binary classifiers. Feature extraction is the process in which feature vectors are extracted from the speaker utterances; these vectors convey the identity information to the system better than the speech signal itself. Fig. 4 depicts a block diagram of the classification system that we used in the experiments.
In the evaluation process, we used the MobilDat-SK database [9], [10]. MobilDat-SK is a corpus of mobile telephone speech in the Slovak language recorded over a GSM telecommunication network. From the corpus of 1100 speakers, utterances of 20 speakers were randomly selected, with 3 different utterances pronounced by the same speaker stored for each of the 20 speakers. We decided to use only 20 speakers for each test because of the high computational expense, and thus the long training procedure, of the SVM classifier. In many real applications (e.g., speaker separation and indexing in audio documents), this number of speakers is usually sufficient. In the experimental part, we also investigated the classification ability of each method when only a small amount of training data is available. Each utterance has a duration of approx. 8 seconds and is stored as an uncompressed PCM WAV file with 16-bit resolution and 8 kHz sampling frequency. From the speaker utterances, 22 MFC coefficients were extracted as the speech features. Since MFC coefficients have a great ability to describe a speech signal, we decided to employ these audio features. Frames of 30 ms length with 10 ms overlap were used. Silent frames in each speaker's utterances were dropped using a short-time energy threshold and a simple GMM-based voice activity detector. We used 2 utterances (approx. 16 seconds) as training data and 1 utterance as test data for every speaker.
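The framing and energy-based silence removal step can be sketched as follows. This is a minimal sketch: the 10 ms hop, the relative energy threshold and the synthetic signal are assumptions for illustration, and the paper additionally used a GMM-based voice activity detector on top of the energy threshold.

```python
import numpy as np

def frame_signal(x, sr, frame_ms=30, hop_ms=10):
    """Split a 1-D signal into fixed-length frames with a given hop."""
    frame, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    n = 1 + max(0, (len(x) - frame) // hop)
    return np.stack([x[i * hop:i * hop + frame] for i in range(n)])

def drop_silent_frames(frames, rel_threshold=0.1):
    """Keep frames whose short-time energy exceeds a relative threshold."""
    energy = (frames ** 2).mean(axis=1)          # short-time energy
    return frames[energy > rel_threshold * energy.max()]

sr = 8000                                        # 8 kHz, as in MobilDat-SK
t = np.arange(sr) / sr
signal = np.concatenate([np.zeros(sr // 2),      # 0.5 s of "silence"
                         0.5 * np.sin(2 * np.pi * 200 * t)])  # 1 s tone
frames = frame_signal(signal, sr)
voiced = drop_silent_frames(frames)
print(len(frames), len(voiced))                  # silent frames removed
```

In the actual pipeline, MFCC extraction would then be applied to the retained frames.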
We applied two different approaches during the performance evaluation of the proposed classifiers. First, we compared classification accuracy at the frame level, where each feature vector influences the overall performance. Second, accuracy was evaluated at the recording level. We used 3 different classifiers with the following parameters:
- GMM with a probability density function (PDF) composed of 8 Gaussians and diagonal covariance matrices; the number of Gaussian components was chosen according to previous studies of training GMMs on small amounts of data [11], [12],
- SVM with the RBF kernel function; model parameter selection was performed over the ranges C = {2^-5, 2^-4.9, …, 2^19.9, 2^20} and γ = {2^-20, 2^-19.9, …, 2^4.9, 2^5}, with 5-fold cross-validation accuracy as the criterion function for model parameter selection,
- kNN with k = 7 neighbors and the Euclidean metric.
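The difference between the two evaluation levels can be sketched as below. The aggregation rule is an assumption: frame-level accuracy counts each frame decision separately, while the recording-level decision here combines per-frame labels by majority vote, which is one plausible rule the paper does not spell out.

```python
from collections import Counter

def frame_accuracy(predicted, true_speaker):
    """Fraction of frames whose predicted label matches the true speaker."""
    return sum(p == true_speaker for p in predicted) / len(predicted)

def recording_decision(predicted):
    """Assumed recording-level rule: majority vote over per-frame labels."""
    return Counter(predicted).most_common(1)[0][0]

# Hypothetical per-frame classifier outputs for one test recording.
frames = ["spk1"] * 70 + ["spk2"] * 30
print(frame_accuracy(frames, "spk1"))      # 0.7
print(recording_decision(frames))          # spk1
```

Under this rule a recording can be identified correctly even when a substantial minority of its frames is misclassified.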

Discussion of the Results and Conclusion
In this paper, we described and evaluated three different classifiers for the speaker identification task. Classification accuracy on the MobilDat-SK dataset was computed for individual frames as well as for the whole recording of each speaker, consisting of all its frames. The experimental results for all classifiers are shown in Tab. 1 and Fig. 4.

Fig. 3 Example of feature mapping using a kernel function
Tab. 1 Classification accuracy results

The best classification accuracy of 98.11 % was achieved by the SVM classifier. Thus the great discrimination properties of SVM, as well as its ability to be trained on few examples, were confirmed by our experiments. Despite its high classification accuracy, the disadvantage of SVM is its extremely high computational requirements, resulting in a very slow training procedure. It is interesting that the kNN classifier scored a comparable classification accuracy of 92.15 % despite its simplicity. The drawback of kNN is its computational complexity, which grows with the size of the database. The GMM classifier achieved a significantly worse classification accuracy (31.89 %) than the other two classifiers. The reason is the lack of training data for the GMM classifier (note that less than 16 seconds of speech data were used for training).

This publication is the result of the project implementations: Creating a new diagnostic algorithm for selected cancer diseases, ITMS 26220220022, supported by the Research & Development Operational Programme funded by the ERDF, and Centre of excellence for systems and services of intelligent transport II., ITMS 26220120050, supported by the Research & Development Operational Programme funded by the ERDF.