QUALITY OF SYNTHESIZED SPEECH: IMPACT OF THE NEWEST CODING APPROACHES



Introduction
In recent years, synthesized speech has attracted a massive increase of interest in terms of both development and utilization. The reason may be that speech is the most natural form of human communication, and there are therefore ongoing efforts to imitate human voices. Speech synthesis systems offer a wide range of uses because their level of maturity allows them to be integrated, for example, where no other form of communication can be used, or into human-computer interaction systems involving a higher number of modalities. Synthesized speech is therefore implemented in many applications of daily life, where it replaces a real human speaker. It is mainly deployed, for example, in systems providing reports containing frequently changing, routine information (weather forecasts, timetables), in systems offering different dialogue situations (games), or for reading various scripts (SMS readers, e-mail readers).
In contrast to naturally-produced speech, synthesized speech is artificially made speech, i.e. a given text utterance spoken by a computer. It is created by joining pieces of speech recorded by a speaker and stored in a speech database. Such systems are also termed speech synthesizers. They are based on a transformation technology called text-to-speech (TTS). To realize this transformation, a TTS system consists of many algorithms and modules. Fig. 1 shows a schematic representation of a text-to-speech system.
In principle, the functions of a TTS system can be divided into the following parts:
- Text analysis (normalization) - analyzes the text, which is separated into sentences; numbers, abbreviations and symbols are replaced by their word transcriptions,
- Phonetic analysis - transforms the text into phonemes,
- Prosodic analysis - applies prosodic language characteristics to the selected phonemes, such as melody, speaking rate, volume, emphasis, pauses, accent, etc.,
- Synthesis of the speech - generates the speech signal from the given sequence of prosodically-modified phonemes.
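The four stages above can be sketched as a toy pipeline. All function names, the abbreviation table and the letter-to-phoneme mapping below are purely illustrative assumptions, not taken from any real TTS system; production systems use full number grammars, pronunciation lexicons and signal-level unit concatenation.

```python
import re

# Toy expansion table for the text-analysis stage (illustrative only).
ABBREVIATIONS = {"Dr.": "doctor", "km": "kilometres"}

def normalize_text(text):
    """Text analysis: expand abbreviations and digits, split into sentences."""
    for abbr, word in ABBREVIATIONS.items():
        text = text.replace(abbr, word)
    digits = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
              "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}
    text = "".join(digits.get(ch, ch) for ch in text)
    return [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]

def to_phonemes(sentence):
    """Phonetic analysis: map each word to a phoneme string (toy: letters)."""
    return [list(word.lower()) for word in sentence.split()]

def apply_prosody(phoneme_words):
    """Prosodic analysis: attach pitch/duration targets to each phoneme."""
    return [(p, {"pitch": 1.0, "duration": 0.08})
            for word in phoneme_words for p in word]

def synthesize(prosodic_phonemes):
    """Synthesis: stand-in that reports how many units would be concatenated."""
    return f"<waveform of {len(prosodic_phonemes)} units>"

def tts(text):
    """Run the full pipeline, one synthesized 'waveform' per sentence."""
    return [synthesize(apply_prosody(to_phonemes(s)))
            for s in normalize_text(text)]
```

In a concatenative system, `synthesize` would select and join recorded units from the database instead of returning a placeholder.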
Nowadays, three different approaches are available to create this type of speech. The currently most widely used is concatenative synthesis, which is restricted to the speech signal and based on combining short speech segments to form a longer one. The output of this synthesis is the most naturally sounding synthesized speech. There are three main types of concatenative synthesis. The first, unit-selection synthesis, uses a large database of recorded pieces of speech, such as words, phrases, sentences, etc.; it produces voices that are mostly indistinguishable from naturally-produced ones. The second type is diphone synthesis, whose database consists of all the diphones found in a particular language; in contrast to unit-selection synthesis, its overall quality is generally worse. Finally, the last approach is domain-specific synthesis; its database consists of pre-recorded words and phrases, which restricts it to certain scripts.
Another approach is formant synthesis (widely deployed in the past), which is based on the fundamental frequencies of the amplitude spectrum of the voice (formants). Systems deploying this synthesis generate artificial, robotic-sounding speech (with constant quality), which cannot be confused with naturally-produced speech. Lastly, articulatory synthesis represents a new approach that deals with direct imitation of the human vocal tract, i.e. the overall speech generation process. This synthesis is focused on producing isolated sounds, phones, simple words, etc. The approach has been poorly investigated because of its complexity.
Ideally, synthesized speech should be indistinguishable from actual human speech. It should be the most faithful copy, not only in terms of quality but also in speaking style. There are efforts to ensure that synthesized speech is as natural as possible, not fatiguing, not monotonous, and does not require effort with respect to listening or comprehension [1].
For determining the subjective output quality of TTS systems (voice output devices), the application-oriented listening-only test of ITU-T Recommendation P.85 [2] is recommended. In general, ITU-T Recommendation P.85 is based on the opinions of a group of test subjects (at least 24 people), who listen to given synthesized samples and fill out questionnaires. The recommendation defines the following rating scales: overall impression, acceptance, listening effort, comprehension problems, articulation, pronunciation, speaking rate and voice pleasantness. Assessment is based on a rating called MOS (Mean Opinion Score), which represents the average of the test subjects' opinions, or of the effort needed to listen to the synthesized speech, expressed on a 5-point quality scale ranging from bad (1) to excellent (5). The speaking rate uses a 5-point scale ranging from too slow (1) to too fast (5), and acceptance uses only a 2-point scale (yes/no). Each sample is played twice to each test subject. In the first phase, subjects answer questions on the information found in the samples (e.g. a train number, the price of an item). In the second phase, subjects are asked to assess the speech quality using one or more rating scales. For assessing the quality, two types of questionnaires are used, namely type I (Intelligibility) and type Q (Quality). Although this method has been criticized for its shortcomings [3], [5], [25], it is still frequently used for overall assessment of the speech output of TTS systems; but when such output is impaired by transmission degradations, a slightly modified version of this method or a classical test according to ITU-T Recommendation P.800 [4] is mainly deployed.
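The aggregation of such questionnaire data is straightforward; the sketch below (helper names are my own, not from the recommendation) shows how the 5-point ratings are averaged into a MOS and how the yes/no acceptance scale can be summarized.

```python
def mos(ratings):
    """Mean Opinion Score: average of 5-point ratings (1 = bad ... 5 = excellent)."""
    if not all(1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie on the 5-point scale")
    return sum(ratings) / len(ratings)

def acceptance_rate(answers):
    """Acceptance is a 2-point yes/no scale; report the share of 'yes' votes."""
    return sum(1 for a in answers if a == "yes") / len(answers)
```

For example, ratings of 4, 5, 3 and 4 from four subjects yield a MOS of 4.0 for that sample.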
In general, the quality of synthesized speech is evaluated in terms of intelligibility (how well the listener understands given samples) and naturalness (overall speech quality assessment). SUS (Semantically Unpredictable Sentences) belongs to the group of well-known intelligibility tests: semantically nonsensical sentences with correct syntax are presented to subjects, whose task is to transcribe the presented sentences; each utterance is played only once. The most widespread naturalness test is the MOS test described above (ITU-T Rec. P.85). Another example of a naturalness test is the Paired Comparison (PC) test, where each sample is presented to subjects in two variants and the listener's task is to choose the one he prefers. Common to all these methods is that they are based on listeners' judgments, which makes them expensive in terms of time and money. The authors of [5], [6], [7] investigated the performance of the methods used for subjective assessment of the quality of synthesized speech, especially the accuracy and reliability of the approach defined in ITU-T Rec. P.85. In [5], the approach of ITU-T Rec. P.85 was compared with other available methods (the intelligibility test SUS and the naturalness test MOS) for the evaluation of text-to-speech systems. The aim was to investigate whether this approach performs better than the SUS and MOS tests. Results showed that the SUS test provides a more rigorous measure of which systems are more intelligible than the other tests do; moreover, the SUS test revealed more errors, which could be grouped. Overall, the ITU test is more suitable for testing the intelligibility of a specific application than as a general-purpose test. The reliability of this standard for the evaluation of text-to-speech systems was investigated in particular in [6]. The authors examined how the ranking of TTS systems changes across different text genres and listening sessions. The outputs were compared with a pair-comparison (PC) test using the above-mentioned aspects. In terms of reliability, both tests (P.85, PC) showed very similar results (from the absolute-score and ranking perspectives). In terms of selectivity, there were minor differences between the systems across genres. In [7], the authors compared naturally-produced and synthesized speech with respect to the type of speaker (male, female). Overall, the female human voice was rated as more persuasive and livelier than the synthesized voice. Moreover, synthesized speech spoken by female speakers was rated worse than the male synthesized voice. Finally, they observed gender-stereotyping effects: the results revealed that female listeners assessed male voices more favorably than vice versa.
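The outcome of a PC test reduces to preference shares per sample pair. A minimal sketch, with illustrative names of my own choosing:

```python
from collections import Counter

def pc_preference(judgments):
    """Paired-comparison tally: judgments is a list of 'A' or 'B' choices
    collected for one sample pair; returns the preference share of each
    variant as a fraction of all votes."""
    counts = Counter(judgments)
    total = len(judgments)
    return {variant: counts.get(variant, 0) / total for variant in ("A", "B")}
```

If three of four listeners prefer variant A, the tally is {'A': 0.75, 'B': 0.25}, from which a preference ranking of the compared systems can be derived.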
To make evaluating the perceived quality of synthesized speech more efficient, instrumental tools are necessary. Such tools should be able to predict the quality as it would be judged in auditory tests by test subjects. At the moment, no standardized models (tools) for the objective quality assessment of synthesized speech are available. However, there are ongoing research efforts dealing with this issue, e.g. the works presented in [8], [9], [10]. In order to design a new instrumental quality measure for text-to-speech systems (for both male and female synthesized speech), the authors try to combine different approaches. In [8], the model is based on hidden Markov models (HMMs) trained on naturally-produced speech. In [9], an HMM-based comparison of features extracted from the synthesized signal with a parametric description of the synthesized speech signal (parameters from ITU-T Rec. P.563 and parameters related to vocal expression patterns) is used. In [10], the approach presented in [9] was evaluated on auditory test databases from the Blizzard Challenges 2008 and 2009.
For instance, in [18], the intrusive model PESQ was applied to assess the quality of synthesized speech. The authors concluded that the PESQ model can be used for the evaluation of synthesized speech without the use of subjective tests. On the other hand, PESQ cannot be deployed for small-sized diphone samples. The behavior of the non-intrusive model P.563 in the assessment of synthesized speech is investigated in [8], [19]-[22], [25]. Based on the results presented in [19], P.563 is better at predicting the impact of the transmission channel on the quality of naturally-produced voice, but it has lower accuracy in predicting the overall voice quality. Furthermore, P.563 achieves low correlation with subjective quality ratings for synthesized speech (especially in the case of female synthesized voices [22]). In [20], the authors address this low correlation by proposing an optimization of feature combinations and mapping functions in order to improve the performance of the P.563 model for predicting the quality of synthesized speech. In [21], the performance of the original and modified P.563 models was also tested on synthesized speech data obtained in the Blizzard Challenges 2007 and 2008. Experimental results revealed that the algorithm with the proposed modifications attains noficeable improvements in comparison to the original one.
Finally, studies are also available that deal with the impact of various speech quality impairments (such as noisy-type degradations, low bit-rate codecs, etc.). In [23], Sebastian Moeller focused on the following issue: whether the impact of the transmission channel on the quality of synthesized speech differs from its impact on naturally-produced speech. The investigation covered, e.g., noisy-type degradations, which affected the quality of synthesized and naturally-produced speech to the same degree, and low bit-rate codecs, which had a somewhat different impact on the quality of the two kinds of speech. Noisy codecs (e.g. G.726, G.728) cause a more significant impact on the overall quality of synthesized speech than the artificially sounding codecs (e.g. G.729, IS-54). Signal-based comparative models, such as PESQ and TOSQA (Telecommunication Objective Speech Quality Assessment), have been applied to predict the quality of synthesized and naturally-produced speech impaired by low bit-rate codecs. Variances between the results of these models and the auditory tests are more considerable for synthesized than for naturally-produced speech. Basically, PESQ and TOSQA are also capable of predicting the quality of transmitted synthesized speech to a certain degree. PESQ provides a good approximation of the quality degradation to be expected from circuit noise, whereas the TOSQA model underestimates the quality at high noise levels [24]. In [25], the authors also compared the results from various auditory tests with the predictions provided by three single-ended models (P.563, Psytechnics, ANIQUE+) using naturally-produced and synthesized voices. The samples used in that study were transmitted through different telephone channels (the same impairments as used in the study published in [23]). The tests realized in [25] revealed that these models correlate differently with the results of the auditory tests in the particular experiments.
The rest of the paper is organized as follows: Section 2 describes the investigation of the impact of the newest coding approaches on speech quality for naturally-produced and synthesized speech (experimental description). In Section 3, the experimental results are presented and discussed. Finally, Section 4 concludes the paper.

Description of experiment
The signals transmitted through modern telephone networks are affected by a number of degradations. Traditional, connection-based networks (analogue or digital) are affected by noise, loss and frequency distortion. Non-linear distortions from low bit-rate coding-decoding processes, talker echoes resulting from delay, overall delay due to signal-processing equipment, or time-variant degradations linked to packet or frame loss are examples of transmission degradations in new types of networks (mobile or IP-based ones). A combination of all these impairments will be encountered when different networks are interconnected to form a transmission path from the service provider to the user. Thus, the whole path has to be taken into account when determining the overall quality of the service operated over the transmission network. As mentioned above, one of the new impairments introduced by mobile or IP-based networks is non-linear distortion from low bit-rate coding-decoding processes. Currently, this degradation is poorly investigated, especially with respect to its influence on synthesized speech [23]. This fact motivated us to investigate the impact of this distortion on speech quality. In particular, we focus here on the impact of the newest coding approaches (e.g. Speex, iLBC, EVRC-B, etc.) on the speech quality predictions provided by PESQ and P.563 for naturally-produced and synthesized speech.

Reference signals and experimental scenario
In this experiment, three sentences in the Slovak language, each 12 seconds long, were used as reference signals: two synthesized speech signals generated with two different TTS systems (male voices) and one naturally-produced signal (recorded in an anechoic environment with a non-professional male speaker). The decision to use male voices came from the previous study published in [7]. Those tests proved that the message produced by the male synthetic voice was rated more favorably (e.g. as good and more positive) and was more persuasive, in terms of persuasive appeal, than the female synthetic voice. These particular differences are perceptual in nature and most likely due to differences in synthesis quality between male and female voices.
TTS system 1 was a diphone synthesizer and TTS system 2 was a unit-selection synthesizer. Both systems have been developed at the Institute of Informatics of the Slovak Academy of Sciences. More about these synthesizers can be found in [26].
All speech samples were normalized to an active speech level of −26 dB below the overload point of the digital system, measured in accordance with ITU-T Recommendation P.56, and stored as 16-bit, 8000 Hz linear PCM; background noise was not present.
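ITU-T P.56 measures the *active* speech level, gating out silent intervals with an activity detector; the sketch below is a simplified approximation that uses plain RMS over the whole file instead. All names are my own, and the 16-bit overload point of 32768 is the only assumption carried over from the sample format above.

```python
import math

TARGET_DBOV = -26.0  # target level relative to the digital overload point

def rms_dbov(samples, overload=32768.0):
    """Crude level estimate in dB relative to the 16-bit overload point.
    P.56 additionally excludes silence; here the whole signal is used."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(rms / overload)

def normalize(samples, target_dbov=TARGET_DBOV, overload=32768.0):
    """Scale the signal so its (approximate) level hits the target in dBov."""
    gain = 10.0 ** ((target_dbov - rms_dbov(samples, overload)) / 20.0)
    return [s * gain for s in samples]
```

After `normalize`, re-measuring with `rms_dbov` returns −26 dBov; a P.56-conformant tool would converge on the active-speech portion only.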

Subjective quality assessment
As mentioned above, the predictions provided by the PESQ and P.563 models were compared with subjective assessments to determine their accuracy. The subjective listening tests were performed in accordance with ITU-T Recommendation P.800 [4]. Up to 9 listeners at a time were seated in a listening chamber with a reverberation time of less than 190 ms and background noise well below 20 dB SPL (A). Altogether, 25 listeners (11 male, 14 female, age range 21-30 years, mean 24.08 years) participated in the tests; 18 of them reported having no experience with synthesized speech. The subjects were paid for their service.
The samples were played back in random order using high-quality studio equipment. The Opinion Scores from 1 to 5 were averaged to obtain MOS-Listening Quality Subjective narrowband (MOS-LQSn) values for each sample. Altogether, 18 speech samples were used for the subjective testing of the coding impact.

Experimental results
In this section, we present and discuss the results of this investigation. As mentioned above, this study focuses on a comparison of the predictions provided by the objective models PESQ and P.563 with subjective scores for naturally-produced and synthesized speech, whereby different current codecs (ITU-T G.711, ITU-T G.729AB, GSM-FR, Speex, iLBC and EVRC-B) were applied to degrade the quality of the reference signal.
Figure 2 depicts the effect of the investigated codecs on the quality predictions provided by the two objective models (PESQ, P.563) and on the auditory test results for naturally-produced speech. We can see that the artificially sounding codecs are rated significantly worse in both models' predictions than in the auditory test, whereas for the ITU-T G.711 codec (a naturally sounding codec) the predicted quality, especially that provided by PESQ, is in better agreement with the auditory results. Furthermore, the P.563 model under-predicts the quality much more than PESQ in all cases.
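Agreement between predicted scores (MOS-LQOn) and subjective scores (MOS-LQSn) of this kind is commonly quantified with the Pearson correlation coefficient and the root-mean-square error. A self-contained sketch (the function names are mine; the source does not specify which figures of merit were computed):

```python
import math

def pearson_r(x, y):
    """Pearson correlation between model predictions and auditory MOS."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rmse(pred, subj):
    """Root-mean-square error of predicted versus subjective MOS values."""
    return math.sqrt(sum((p - s) ** 2 for p, s in zip(pred, subj)) / len(pred))
```

A model can correlate well (high `pearson_r`) while still systematically under-predicting (large `rmse`), which is exactly the pattern reported here for P.563 on naturally-produced speech.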
Figs. 3 and 4 show the results obtained for the diphone synthesizer and the unit-selection synthesizer, respectively. As can be seen from Fig. 3, the diphone voice (which sounds less natural than the unit-selection and natural voices) was particularly disliked by the test subjects. This is probably the reason for the low ratings given by the subjects. On the basis of this fact, we decided to omit the diphone voice from the further analysis of the behavior of synthesized speech under coding impairments. On the other hand, the behavior of the diphone voice can serve as an example of how higher unnaturalness of the signal can affect the opinions of the test users. Fig. 4 depicts the effect of the investigated codecs on MOS-LQSn and on the MOS-LQOn predicted by the PESQ and P.563 models for the unit-selection voice. In contrast to naturally-produced speech (see Fig. 2), the predictions of both models are in good agreement with the auditory ratings, with the exception of some predictions provided by the P.563 model, e.g. for the ITU-T G.711 codec.
Moreover, Figure 5 presents a comparison of the behavior of synthesized speech with that of naturally-produced speech from the perspective of the auditory ratings. As can be seen from Figure 5, there are some differences between the subjects' ratings for the synthesized speech generated by the unit-selection synthesizer and for naturally-produced speech. The observed differences may be due to differences in the quality dimensions perceived as degradations by the test subjects: whereas the 'artificiality' dimension introduced by the investigated 'unnatural sounding' codecs is an additional degradation for naturally-produced speech, this is not the case for synthesized speech, which already carries a certain degree of artificiality.
The results presented here are well in line with the results described in [24]. The synthesized speech is assessed slightly more pessimistically than natural speech for the ITU-T G.729 codec, as shown in Figure 5.12 (p. 225, [24]). On the other hand, the synthesized speech is rated slightly more optimistically by subjects than naturally-produced speech for the IS-54 codec and its combinations; the effect is much more dominant for the combinations. We did not investigate this codec or its combinations in this study, but the GSM-FR codec, which belongs to a similar family of codecs, was included. The same behavior as reported for IS-54 in [24] was also observed here for GSM-FR, probably because of the very similar techniques deployed in both codec families. The predictions of PESQ (see Figures 5.15-5.16 in [24]), which were also investigated in the discussed study, are more or less in line with our results, particularly for the ITU-T G.729 codec (see Figures 2 and 4). The study published in [24] is mainly focused on different types of codecs and their combinations; our study can serve as an extension of it.

Conclusion
The paper provided a brief overview of the assessment of the quality of synthesized speech. In addition, an overview of the current state of the art of research dealing with this issue was given, summarizing the experimental studies that investigate the performance, accuracy and reliability of existing approaches and models for evaluating the quality of synthesized speech (mainly models designed for evaluating the quality of naturally-produced speech, but also new models designed directly for assessing the quality of synthesized speech). Finally, the paper described an experiment dealing with the impact of current codecs (ITU-T G.711, ITU-T G.729AB, GSM-FR, Speex, iLBC and EVRC-B) on the quality predicted by two objective models (the intrusive PESQ and the non-intrusive P.563) using naturally-produced and synthesized voices as input signals. The predictions provided by both models were compared with the ratings coming from the auditory test. The experiment revealed that the investigated codecs have a different impact on the quality of naturally-produced and synthesized speech. Comparing the performance of the two objective models, the PESQ algorithm seems to be more appropriate for assessing quality affected by the newest coding approaches than the P.563 algorithm, especially in the case of naturally-produced speech. Future work will focus on the following issues. Firstly, we would like to investigate the performance of the brand-new ITU-T intrusive model for predicting speech quality, namely POLQA, under the same conditions as investigated here (as a part of the characterization phase of this model). Secondly, on the basis of the results

(15.2 kbps, 20 ms), Speex [31] (4-8 kbps, 20 ms) and Enhanced Variable Rate Codec version B (EVRC-B) [32] (9.6 kbps, 20 ms). In principle, the codecs used in this study can be divided into two groups: the first comprises the artificially (unnaturally) sounding codecs, i.e. ITU-T G.729AB, Speex, iLBC, GSM-FR and EVRC-B, whereas the ITU-T G.711 codec represents the second group, the naturally sounding codecs.

Fig. 2 Impact of the investigated codecs on MOS-LQSn and on the MOS-LQOn predicted by PESQ and P.563 in the case of naturally-produced speech

Fig. 3 Impact of the investigated codecs on MOS-LQSn and on the MOS-LQOn predicted by PESQ and P.563 in the case of synthesized speech generated by the diphone synthesizer

Fig. 5 Comparison of the auditory ratings for synthesized and naturally-produced speech