CLUSTERING OF SLOVAK SENTENCE MELODY – METHODS AND RESULTS

Modern speech synthesis systems implement prosody features to achieve more naturally sounding voices, trying to produce sounds similar to the human speech. This article deals with obtaining one of prosody characteristics (sentence melody contour) from human speech recordings. Similarities of melody contours are studied using cluster analysis. Several clustering methods are evaluated for this purpose, estimation of proper number of clusters is described and typical melodies for different types of Slovak language sentences (declarative, interrogative, exclamatory, clauses with final “,” etc.) are found and recommended for implementation in our text-to-speech system.


Introduction
Progress in speech processing enables new ways of communication between people and computers. Reading text from a computer screen can be extended or replaced by text-to-speech synthesis; typing on the keyboard or choosing commands with the mouse can be enriched by voice recognition systems. A text-to-speech (TTS) synthesizer converts text from documents into audio voice files. One such synthesizer (TTS-KIS) [10,3] is developed in the Department of Information Networks at the Faculty of Management and Informatics of the University of Zilina.
Our TTS system implements a concatenative method of speech synthesis. The method is based on concatenation of speech elements "diphones" [3]. Diphones are selected from the TTS sound database at the time of speech synthesis.
Diphones in the database are stored in a normalised form with monotonous melody. Consequently a monotonous waveform corresponding to the input text is produced, and hence a proper melody modification is needed to achieve a naturally sounding sentence.
Our TTS system applies a composed melody contour to the synthesized sentence waveform. The contour is a composition of the "sentence melody contour" (a long-term melody trend spanning the whole sentence or clause) and of "short-term melody contours" (related to shorter speech segments: words, bars, syllables). Our present work is focused on obtaining the long-term "sentence melody contour". We use the terms "word" and "bar" interchangeably to denote a word together with its neighbouring proclitics and enclitics. The terms "sentence" and "clause" are used interchangeably to denote the smaller parts of simple, complex or compound sentences separated by punctuation marks and conjunctions (see Table 2).
The TTS-KIS system analyses the text of each input sentence and chooses the melody contour corresponding to the characteristics found in the text: the sentence type (recognized by punctuation marks, interrogative words …) and the number of words (bars) in the sentence. Our aim is to find a typical "sentence (clause) melody contour" for each group of sentences characterised by the sentence type and by the number of words. Sentence types recognised by the TTS system are listed in Table 3.
The proposed method obtains the original sentence melody contours using the Praat software [11].
Then the weighted-MA method [4], defined in formula (1), is used to remove short-term melody variations and to obtain the sentence melody contour values. The smoothed values are computed in equidistant time events along the whole sentence (clause) waveform. The number of melody contour values is set proportionally to the number of words (bars) in the sentence. The number is equal for all sentences in the same sentence group, which makes it possible to compare waveforms of different lengths.
Sentence melody contours are further centred to 0 Hz mean frequency value. Contour dynamics remains unchanged.
Similar melodies are grouped into separate clusters (the best clustering method for this purpose is found). The cluster with melodies best suited to the analysed sentence type is selected, and the typical melody of this cluster is recommended for our TTS system.

Speech material
We examined sound recordings of the novel [16]. The story was narrated by a female speaker at a faster speaking rate with an emotive accent (noticeable changes in intonation, loudness and tempo).
Recordings were cut into smaller parts corresponding to the clauses. See Table 2 for the punctuation marks and conjunctions used to determine clause boundaries. (Simple sentences and parts of complex or compound sentences were taken as separate clauses.) Sound files were assigned names corresponding to the page number, the sentence number, the number of bars in the sentence and the type of the starting and ending punctuation marks. About 8000 sentence sounds (see Table 1) in PCM format (22050 Hz, 16 bit/sample, mono) were stored, occupying 500 MB of disk space.

Obtaining melody contours from voice recordings
We use the program Praat [11, 2] to obtain the glottal frequency contour F0. Praat implements a normalised autocorrelation function and best path selection algorithms. Proper parameter settings are needed to obtain real F0 values [4]. To process the large number of voice recordings, a script of Praat commands was written.

Preparing melody contours for cluster analysis
Melody contours obtained by the program Praat are composed of equidistant values located inside detected voiced intervals (see Fig. 1).
To compare melodies of sentences with different time duration and with different distribution of voiced intervals, a smoothing method "weighted moving average" is used [4]. Weighted-MA values y′(t) are computed according to formulas (1) and (2). Weights decrease with increasing distance from the time event t. Parameters were set to the values n = 12, a = −2, b = 1. The method removes short-term variations from the overall sentence melody and computes the "smoothed sentence melody contour".
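The smoothing step can be illustrated with a minimal Python sketch. Formulas (1) and (2) are not reproduced in this text, so the weight function below is a hypothetical stand-in that only shares the stated property (weights decrease with distance from t); it does not use the paper's parameters n, a and b. The function name `weighted_ma` and the parameter `width` are likewise assumptions made for illustration.

```python
def weighted_ma(times, f0, t, width=0.2):
    """Smoothed F0 at time t from voiced samples (times[i], f0[i]).

    ASSUMPTION: the weight 1 / (1 + (dt / width) ** 2) is illustrative only.
    It is not the paper's formula (1)-(2); it merely decreases with the
    distance dt between a sample and the evaluated time t.
    """
    num = 0.0
    den = 0.0
    for ti, fi in zip(times, f0):
        w = 1.0 / (1.0 + ((ti - t) / width) ** 2)
        num += w * fi   # weighted sum of F0 values
        den += w        # sum of weights, used for normalisation
    if den == 0.0:
        raise ValueError("no voiced samples available")
    return num / den
```

Because every voiced sample contributes with a non-zero weight, a value is obtained even when t falls inside an unvoiced gap, which is what allows contours with different distributions of voiced intervals to be compared.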
We compute melody values at equidistant time intervals. The number of contour points is set to eight times the number of words in the sentence. Each contour is further centred to a 0 Hz mean frequency value. The dynamics of the contour remain unchanged.
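A short Python sketch of these two preparation steps follows. The function names are hypothetical, and placing each time instant at the centre of its slot is an assumption; the text only states that the instants are equidistant and that there are eight per word.

```python
def sample_times(duration, n_words, points_per_word=8):
    """Equidistant time instants (seconds) along a clause of the given duration."""
    n_points = points_per_word * n_words
    step = duration / n_points
    # ASSUMPTION: each instant lies at the centre of one of n_points equal slots.
    return [(i + 0.5) * step for i in range(n_points)]


def centre(contour):
    """Shift a contour to a 0 Hz mean; differences between points are preserved."""
    mean = sum(contour) / len(contour)
    return [value - mean for value in contour]
```

For a 4-word clause this gives 32 points regardless of the clause duration, so all contours of one sentence group become vectors of equal length.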

Description of clustering methods
Once melody contours are represented by equal-length vectors (values determined at equidistant time intervals), cluster analysis of melody contour similarities is possible. The clustering methods available in the R software's "hclust" function [12] were taken and their suitability for melody contour clustering was examined. All the methods (ward, single, complete, average, mcquitty, median, centroid) perform hierarchical agglomerative clustering. Initially, each object (contour) is assigned to its own cluster; then, iteratively, the distances between clusters are computed and the two closest clusters are joined.
Initial distances between clusters (dissimilarity matrix) are computed by the R-software's "dist" method as the Euclidean (geometric) distance.
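As an illustration, the initial dissimilarity matrix can be sketched in Python as follows. This is not the R "dist" implementation; the function names are hypothetical and the code only shows the Euclidean distance between equal-length contour vectors.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two contours of equal length."""
    if len(u) != len(v):
        raise ValueError("contours must have equal length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dissimilarity_matrix(contours):
    """Full symmetric matrix of pairwise Euclidean distances."""
    n = len(contours)
    return [[euclidean(contours[i], contours[j]) for j in range(n)]
            for i in range(n)]
```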
In the second and later iterations, the dissimilarity matrix is recomputed by the Lance-Williams dissimilarity update formula [1,6]. The formula parameters are set to values [9] according to the method chosen (ward, single …).
Each of the mentioned methods evaluates the distance between clusters in a different way, hence clusters with different properties are produced.
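For reference, the Lance-Williams update is commonly written in the general form below. The notation is the usual textbook one and is an assumption here, since the source does not print the formula: clusters C_i and C_j are merged, C_k is any other cluster, and d denotes the dissimilarity.

```latex
\[
d(C_k, C_i \cup C_j) = \alpha_i \, d(C_k, C_i) + \alpha_j \, d(C_k, C_j)
  + \beta \, d(C_i, C_j) + \gamma \, \bigl| d(C_k, C_i) - d(C_k, C_j) \bigr|
\]
% Example parameter choices:
%   single linkage:   \alpha_i = \alpha_j = 1/2, \beta = 0, \gamma = -1/2  (yields the minimum)
%   complete linkage: \alpha_i = \alpha_j = 1/2, \beta = 0, \gamma = +1/2  (yields the maximum)
```

With the single-linkage parameters the right-hand side reduces to the smaller of the two distances, and with the complete-linkage parameters to the larger one.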
Single Linkage method. In this method, the distance between two clusters is computed as the smallest distance between two objects, where the first object belongs to the first cluster and the second object belongs to the second cluster. The resulting clusters tend to form long, straggly "chains" [13].
Complete Linkage method. The distance between two clusters is computed as the largest distance between two objects, where the first object belongs to the first cluster and the second object belongs to the second cluster. This method tends to find extremely compact clusters [5]. The method is usually suitable when the objects form naturally distinct clusters. If the clusters tend to be elongated or of a "chain" type, this method is inappropriate [13].
Average Linkage method. The distance between two clusters is computed as the average distance over all pairs of objects, where the first object belongs to the first cluster and the second object belongs to the second cluster. The method tends to find globular clusters [5]. It is very efficient when the objects form natural distinct clusters and performs equally well with elongated "chain" type clusters [13]. This method is also called "the group average linkage algorithm" or "the unweighted pair group method average" (UPGMA) [14].
McQuitty's method. In this method, when two clusters are merged into a new one, the distance from the new cluster to any other existing cluster is computed as the average of the distances from the two merged clusters to that cluster [14]. This rule corresponds to a weighted average, where objects in small clusters have a larger weight than those in large clusters. The method is also known as "the weighted average linkage algorithm" or "the weighted pair group method average" (WPGMA). It should be used (rather than the Average Linkage method) when the cluster sizes are suspected to be greatly uneven [13].
Centroid Linkage method. In this method, the centroid (average of objects) of each cluster is computed, and the distance between two clusters is determined as the distance between their centroids.
Median Linkage method. This method extends the Centroid method with weights that account for different numbers of objects in the clusters. It is preferable to the Centroid method when considerable differences in cluster size are expected [13].
Ward's minimum variance method. In this method, the distance between two clusters is evaluated as the growth of the total dispersion of objects around their respective cluster centroids. This method minimises cluster heterogeneity [8] and tends to find globular clusters [5]. In general, it is regarded as very efficient [13].
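The seven methods differ only in the coefficients of the Lance-Williams update, which the following Python sketch makes explicit. It is an illustration, not the R "hclust" code: the function names are hypothetical, the coefficients are the standard textbook ones, and applying the ward, centroid and median updates to squared Euclidean distances is an assumption, since the text does not state whether the original analysis squared the distances. Stopping at k clusters corresponds to cutting the dendrogram at that level.

```python
import math


def lw_coefficients(method, ni, nj, nk):
    """Return (alpha_i, alpha_j, beta, gamma) for merging clusters of size ni, nj
    and updating the distance to a third cluster of size nk."""
    if method == "single":
        return 0.5, 0.5, 0.0, -0.5
    if method == "complete":
        return 0.5, 0.5, 0.0, 0.5
    if method == "average":
        return ni / (ni + nj), nj / (ni + nj), 0.0, 0.0
    if method == "mcquitty":
        return 0.5, 0.5, 0.0, 0.0
    if method == "centroid":
        return ni / (ni + nj), nj / (ni + nj), -ni * nj / (ni + nj) ** 2, 0.0
    if method == "median":
        return 0.5, 0.5, -0.25, 0.0
    if method == "ward":
        n = ni + nj + nk
        return (ni + nk) / n, (nj + nk) / n, -nk / n, 0.0
    raise ValueError("unknown method: " + method)


def agglomerate(points, k, method="ward"):
    """Merge the two closest clusters until k remain; return lists of point indices."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # ASSUMPTION: these three updates are applied to squared distances.
    squared = method in ("ward", "centroid", "median")
    clusters = {i: [i] for i in range(len(points))}
    dist = {}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = math.dist(points[i], points[j])
            dist[(i, j)] = d * d if squared else d
    next_id = len(points)
    while len(clusters) > k:
        i, j = min(dist, key=dist.get)          # closest pair of clusters
        dij = dist.pop((i, j))
        ni, nj = len(clusters[i]), len(clusters[j])
        for m in clusters:
            if m in (i, j):
                continue
            dmi = dist.pop((min(m, i), max(m, i)))
            dmj = dist.pop((min(m, j), max(m, j)))
            ai, aj, b, g = lw_coefficients(method, ni, nj, len(clusters[m]))
            # Lance-Williams update of the distance to the merged cluster
            dist[(m, next_id)] = ai * dmi + aj * dmj + b * dij + g * abs(dmi - dmj)
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        next_id += 1
    return sorted(sorted(c) for c in clusters.values())
```

Calling `agglomerate(contours, 5, "ward")` on a list of equal-length contour vectors would return five groups of contour indices, which can then be drawn or listened to group by group.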

Selection of the best method for melody contour clustering
We used two criteria to select the best method for melody contour clustering:
1. Correct separation of melody contours according to their similarities (similar melodies grouped in the same cluster, different melodies separated into different clusters).
2. Minimal number of clusters needed to accomplish Criterion 1.
Seven clustering methods (ward, single, complete, average, mcquitty, median, centroid) were evaluated. 4-member (4-word) declarative and interrogative sentences (clauses) were taken and clustering was computed by each of the methods (see corresponding dendrograms in Figs. 2 and 3). For each case criteria were evaluated and the method that best met the criteria was recommended to cluster all other sentence groups.
We evaluated Criterion 1 using:
- dendrograms obtained by clustering
- drawings of melody contour clusters
- listening to the sounds grouped into clusters
Dendrograms depict inter-cluster distances and the size of the clusters formed during the computation. Figs. 2 and 3 show that Ward's method tends to create clusters of similar size. Comparing these clusters to Fig. 4, we can see a good separation of the different melody shapes corresponding to these clusters. On the other hand, the Single method separates very small clusters while retaining different melody shapes in one large cluster, so it cannot meet Criterion 1. The other methods exhibit properties between those of Ward's method and the Single method.
We investigated melody contour drawings of two, three and more clusters (starting from the top of the dendrogram). We stopped at the number of clusters at which further cluster division would not give new clusters with significantly different melody shapes. The number of clusters obtained is taken as the value of Criterion 2. We verified the proper separation of melody contours by listening to the sounds from the corresponding clusters.
Ward's method created clusters of similar size (see Figs. 2, 3 and 4) and postponed the separation of single contours or small groups of contours to later iterations. Proper melody separation was achieved at five clusters (the Criterion 2 value). The Complete method put small groups of melodies into separate clusters in earlier iterations. In the case of interrogative clauses, at the moment of five clusters, different kinds of melodies (falling and rising) were still included in the same cluster. McQuitty's method kept two large clusters while separating small clusters. The Average method separated even smaller clusters, still keeping one or two large clusters. The Median method separated objects into clusters of a very small size. The Centroid method separated melodies nearly one by one. The Single method also separated single objects, keeping dissimilar melodies in one large cluster (compare Figs. 2, 3 and 4).
Ward's method was chosen as the most suitable for clustering melody contours. It achieves proper separation of melody contours with the smallest number of clusters. All the methods

Identifying melody contours for the TTS system
We used Ward's method (the best clustering method found in the previous section) to compute the clustering for all sentence types and different numbers of words. For each clustering we determined the clusters with proper melody separation (as described in the previous section). Then we determined the proper melody for particular groups of sentences using the following criteria:
- melody features described by Slovak language scientists in [7]
- sentences uttered in a neutral way (without strong word accent)
- number of contours in the cluster
For example, analysing Slovak 4-clause determination sentences, we stopped clustering at five clusters (see Fig. 5, clusters b1-b5).
Cluster b1 exhibits a rise at the beginning of the clause followed by a steep melody fall. Listening to the sentences confirmed a very strong accent on the first word of the sentence, expressing emotional speech.
Cluster b2 contains flat melody contours, corresponding to a calm, even phlegmatic, atmosphere without conspicuous emotions.
Cluster b3 exhibits a rise in the middle part. Stress is heard on the third or the fourth word.
Cluster b4 has a tendency similar to cluster b1, with a slower decrease. The beginning of the sentence is less stressed, which adds moderate dynamics to the speech. The representative melody contour of this cluster was chosen as the "sentence melody contour" for 4-clause determination sentences in our TTS system (see Table 3).
Cluster b5 contours exhibit a slow rise in the melody at the end of the clause. Listening to them, we found that the cluster contains sentences with melodies of non-ended sentences (similar to the melody of sentences ending with a comma ","), sentences expressing the theatricality of the story, and sentences with a quiet ending (causing imprecise F0 calculation at the end of the sentence).
Letters "a, b, c, d, j, o, p, w" denote endings of the sentence with different punctuation marks (see Table 2).
The clusters with melodies suitable for TTS systems are marked by asterisk "*" (see Fig. 5). Representative contours (arithmetic mean of contours) of these clusters are recommended for implementation in our TTS-KIS system. Alternation of chosen melodies with more expressive melodies (e.g. alternation of b4 and b1, Fig. 5) can be implemented to achieve more dynamic utterance production (see description of non-satisfying non-ending sentence melodies in [7]).
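The representative contour described above is a point-wise arithmetic mean, which a few lines of Python can sketch. The function name is hypothetical; the contours of one cluster are assumed to be lists of equal length, as they are for all sentences of one sentence group.

```python
def representative_contour(contours):
    """Point-wise arithmetic mean of the equal-length contours of one cluster."""
    if not contours:
        raise ValueError("empty cluster")
    length = len(contours[0])
    if any(len(c) != length for c in contours):
        raise ValueError("contours must have equal length")
    return [sum(c[i] for c in contours) / len(contours) for i in range(length)]
```

Since every input contour is already centred to a 0 Hz mean, the mean contour is centred as well.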
Another example, the clustering of 4-member declarative clauses of different lengths, is shown in Fig. 6.

Mapping of text characteristics into melodies recommended for TTS-KIS system
The cluster analysis showed the same results (shapes of melody contours) as those described by language scientists in [7] and summarized by us in [15]. We recommend these melody contours for our TTS system and we have prepared a mapping of the input text characteristics into the recommended contours (e.g. see the 4-member clauses case in Table 3). At the time of speech synthesis the contour is stretched to the length of the sentence waveform. The stretched contour and the short-term melody contours are added to the synthesized sound.
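Stretching a stored contour to the length of a sentence waveform can be sketched as a resampling step. The text does not say which interpolation TTS-KIS uses, so the linear interpolation below is an assumption, and the function name `stretch_contour` is hypothetical.

```python
def stretch_contour(contour, n_samples):
    """Resample a contour to n_samples points by linear interpolation.

    ASSUMPTION: linear interpolation is chosen for illustration; the source
    only states that the contour is stretched to the sentence length.
    """
    if len(contour) < 2 or n_samples < 2:
        raise ValueError("need at least two contour points and two samples")
    scale = (len(contour) - 1) / (n_samples - 1)
    stretched = []
    for s in range(n_samples):
        pos = s * scale                           # position on the original grid
        lo = min(int(pos), len(contour) - 2)      # left neighbour index
        frac = pos - lo                           # distance from left neighbour
        stretched.append(contour[lo] * (1.0 - frac) + contour[lo + 1] * frac)
    return stretched
```

The first and last values of the stretched contour coincide with those of the stored contour, so the overall shape of the melody is kept while its duration changes.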

Conclusion
In this article we described a method of obtaining sentence melody contours for our TTS system. Cluster analysis was used to analyse 8000 sentence melody contours from Slovak language speech recordings. Melody contours were obtained from the recordings by the Praat software, prepared by the smoothing method and centred to a zero mean frequency. Then cluster analyses were performed. Seven clustering methods (implemented in the "hclust" function of the R software) were compared. Ward's method was chosen as the best one for the purpose of melody contour clustering. This method was used for clustering different types of sentences with different numbers of words (bars). For each group of sentences, clusters with correct separation of melody contours were found and the melodies of individual clusters were analysed. The results of the analysis were formulated as recommended melody contours for individual sentence types. Text characteristics recognised by our TTS system were mapped to the recommended sentence melodies.
We found slight differences between expectations (sentence melodies described by language scientists) and melodies obtained from real speech recordings of the book [16]. The speaker often uses a "non-satisfying non-ending" melody for sentences ended with a period, where a "satisfying ending" melody was assumed. This attracts the attention of the listener by signalling the next action of the story.
The investigation of clusters also showed a strong influence of the word melody accent even after the smoothing of the overall melody contour. The melody of shorter speech segments (words, bars, syllables) should therefore be studied.
Table 2. Sentence ending labels and corresponding punctuation marks or conjunctions
Ending punctuation mark or conjunction: conjunction "a, i, aj, ani, či, alebo" not preceded by a comma ","; "."; ","; ":"; "…"; "?"; "-"; "!"
Table 3. Mapping of text characteristics into recommended sentence melodies and melodies expected by language scientists (4-member clauses case)
Columns: Melody type; Sentence type; Characteristics recognised by the TTS system in the input text; 4-member clause melodies (Fig. 5)