BUILDING OF BROADCAST NEWS DATABASE FOR EVALUATION OF THE AUTOMATED SUBTITLING SERVICE



Introduction
The development of continuous speech recognition (CSR) systems for the Slovak language requires a large amount of different language resources to be collected [1 and 2]. First of all, a speech database needs to be built, which is also the most expensive and demanding task [3]. Building a textual database of Slovak texts for language modelling is also challenging [4 and 5]; it can be done using modern crawling technologies followed by post-processing, morphological analysis, etc. [6].
The KEMT-BN2 database campaign was carried out between 2009 and 2011. The database consists of broadcast news (BN) shows from the first channel of the Slovak public service television broadcaster (STV1 - Jednotka). The transcription was carried out mainly by temporary workers, who were trained and whose output was evaluated by a transcription specialist. The database is a successor to KEMT-BN1 [7] and to the Slovak part of the COST-278 database [8], which was built in our laboratory [9] (recorded from the TA3 news television channel).
The purpose of the specialized BN databases is to build and evaluate an automatic transcription system for BN shows [10]. This system should have special BN acoustic models for the different types of speech in BN shows (F-conditions) [11], special acoustic models for anchors or speakers who occur frequently in the news (politicians, sportsmen, artists, etc.) [12, 13 and 14], and a special language model from the BN domain [15 and 16]. To use these models, a special detection system for speakers, different types of speech, etc. should be provided [17 and 18]. This paper describes the process of recording and collecting the audio material of the BN shows. Next, the transcription process and the evaluation of the transcriptions are presented. Finally, the statistics of the annotated data collected in the database are presented, followed by conclusions and future work.

Recording the shows
The broadcast news shows were recorded from a DVB-T channel multiplex by streaming the PS (program stream) data to disk using a Technisat Airstar PCI card [19]. The source was the test broadcast on channel 25 in the Kosice region, before the official digitisation process. The MPEG2 Program Stream was captured with a time reserve and was not truncated, because a jingle detection algorithm based on Euclidean distance or DTW (dynamic time warping) is planned to be developed and evaluated on this data later.
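The jingle detector is future work and is not specified in this paper. As a rough illustration of the underlying idea only, the following sketch computes a DTW distance with a Euclidean local cost between two sequences of feature vectors. The function names and the toy two-dimensional feature vectors are assumptions made for this example; a real detector would operate on acoustic features extracted from the recordings.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_distance(template, window):
    """Accumulated DTW cost between two sequences of feature vectors."""
    n, m = len(template), len(window)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning
    # template[:i] with window[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(template[i - 1], window[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # template frame repeated
                                 cost[i][j - 1],      # window frame repeated
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]

# Hypothetical toy data: a jingle template, a time-stretched copy of it,
# and an unrelated stretch of audio features.
jingle = [(0.0, 1.0), (1.0, 1.0), (2.0, 0.0)]
stretched_jingle = [(0.0, 1.0), (0.0, 1.0), (1.0, 1.0), (2.0, 0.0), (2.0, 0.0)]
other_audio = [(5.0, 5.0), (6.0, 5.0), (7.0, 6.0)]

print(dtw_distance(jingle, stretched_jingle))
print(dtw_distance(jingle, other_audio))
```

A time-stretched copy of the template aligns with zero cost, whereas a plain frame-by-frame Euclidean comparison would require both sequences to have the same length. This is the usual reason for considering DTW when the jingle may be played at slightly different lengths.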
The audio subchannel is de-multiplexed from the stream using DGIndex from the DGMPGDec package [20], software licensed under the GPL (GNU General Public License), resulting in an .mp2 file (MPEG-1 Audio Layer 2 codec, 48 kHz, stereo, 128 kbit/s CBR, i.e. constant bitrate).
Next, the audio file needs to be converted to a format that is compatible with the transcription software and can be delivered to the annotators (so the file size also matters). The Transcriber software [21], described in the next section, has several bugs when handling the mp3 format (the time did not correlate with the wav file or the video), so the mp2 files, which are not compliant with the tool, were recompressed to the ogg format (Ogg Vorbis, 160 kbps, q5.0, mono) using the freeware foobar2000 tool [22] with the Vorbis plugin. After that, a mono PCM 16 kHz wav file was also decoded for use in the database.
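Because later processing depends on the wav files having the expected format, a simple sanity check can be useful after conversion. The sketch below uses Python's standard `wave` module; the function name is hypothetical, and the 16-bit sample width is an assumption, since the text only states mono PCM at 16 kHz.

```python
import wave

def wav_matches_target(path, rate=16000, channels=1, sample_width_bytes=2):
    """Return True if the WAV file has the expected sampling rate,
    channel count and sample width (2 bytes = 16 bit, an assumption)."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == rate
            and wav.getnchannels() == channels
            and wav.getsampwidth() == sample_width_bytes
        )
```

Running such a check over all converted files before building the database catches recordings that were exported with the wrong sampling rate or left in stereo.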
The complete video recordings are also planned to be converted to a well-supported video streaming format for use in a web application that presents the database with captions to the public (Fig. 1 shows the COST278 TA3 part of the database on the web) [23]. The video recording is also important when transcribing speaker names (from captions in the video) and topic descriptions.

Transcription of the speech and non-speech audio events
The transcription process consists of manual orthographic transcription of the whole audio recording using the Transcriber 1.5.1 tool, free software under the GPL license (Fig. 2) [21]. The annotation process follows the LDC (Linguistic Data Consortium) transcription conventions for HUB4 [24] (the DARPA-sponsored Hub4 continuous speech recognition evaluation), extended with new rules for the Slovak language and for future use in lexical and language modelling. The native file format is XML-based and uses the .trs extension.

A) STM export
After the transcriptions are completed, an .stm file is generated (a simpler text file format exported from Transcriber and used by the NIST Scoring Toolkit Sclite [25]). The .stm file is the source format for further processing of the recordings, such as segmentation and conversion to other speech database and online subtitle standards [26] that are suitable for the reference speech recognition training procedure described in [27]. For this purpose we developed a special set of Perl scripts that convert wav and stm file pairs to the standardized SpeechDat database format [28].
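The conversion scripts themselves are written in Perl and are not reproduced here. To illustrate what reading an .stm file involves, the following Python sketch parses one segment line, assuming the usual STM field order (recording, channel, speaker, start time, end time, optional label in angle brackets, transcript) and `;;` comment lines. The class and function names, as well as the sample line, are hypothetical.

```python
from typing import NamedTuple, Optional

class StmSegment(NamedTuple):
    recording: str
    channel: str
    speaker: str
    start: float          # segment start in seconds
    end: float            # segment end in seconds
    label: Optional[str]  # e.g. "<o,f0,female>", or None if absent
    text: str

def parse_stm_line(line: str) -> Optional[StmSegment]:
    """Parse one STM line; return None for blank and ';;' comment lines."""
    line = line.strip()
    if not line or line.startswith(";;"):
        return None
    fields = line.split(None, 5)  # five fixed fields plus the remainder
    if len(fields) < 5:
        raise ValueError(f"malformed STM line: {line!r}")
    recording, channel, speaker, start, end = fields[:5]
    rest = fields[5] if len(fields) > 5 else ""
    label = None
    if rest.startswith("<"):
        closing = rest.find(">")
        if closing != -1:
            label = rest[: closing + 1]
            rest = rest[closing + 1:].strip()
    return StmSegment(recording, channel, speaker,
                      float(start), float(end), label, rest)

# Hypothetical sample line, not taken from the database.
segment = parse_stm_line(
    "show_001 1 anchor_f 12.50 17.25 <o,f0,female> dobrý večer"
)
print(segment.speaker, segment.end - segment.start, segment.text)
```

Each parsed segment carries the time boundaries needed to cut the corresponding audio out of the wav file, which is the first step of any conversion to a per-utterance database format.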

B) Transcriber modifications
The Transcriber toolkit was slightly modified for these purposes. The descriptions of the noise markers were translated and extended (the annotators have to enter the noise markers/tags only through the menu, to avoid frequent typos in non-speech tags).
Next, the conversion script for the stm format export was modified to include all tags in the resulting stm file (some of them had been filtered out).
Finally, the Slovak spellchecking feature was realized using the free GPL licensed Aspell (http://aspell.net/) dictionary and by modifying the corresponding spelling Tcl/Tk script (which should send only the words to the dictionary, not the tags) which was not

C) Segmentation and foreign languages
The speech utterances in the database should not be too long, and every speaker inspiration event (breathing - tint) should be regarded as a potential breakpoint.
Regarding segmentation, silence inside a speaker turn that was shorter than 0.5 seconds was not marked at all. A breakpoint was inserted in the middle when the pause in the speech utterance lasted between 0.5 and 1.5 seconds. When the pause was longer than 1.5 seconds, a special silence segment was inserted [7].
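The pause rule above can be summarised as a small decision function. The sketch below is only an illustration: the function name and the returned labels are hypothetical, and the handling of pauses lasting exactly 0.5 or 1.5 seconds is an assumption, since the text does not say on which side the boundaries fall.

```python
def classify_pause(duration_s, short_limit=0.5, long_limit=1.5):
    """Map a pause duration (in seconds) to the annotation action.

    Assumption: pauses of exactly 0.5 s or 1.5 s are treated as
    breakpoints; the annotation rules do not specify the boundaries.
    """
    if duration_s < short_limit:
        return "unmarked"         # too short, not marked at all
    if duration_s <= long_limit:
        return "breakpoint"       # split the utterance at the pause
    return "silence_segment"      # insert a separate silence segment

for pause in (0.3, 1.0, 2.0):
    print(pause, classify_pause(pause))
```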
Foreign language utterances were marked with language event tags and should not be transcribed at all.

Database corrections
The database still includes many typos, mistakes, misspellings and strange characters, even after a second annotator has reviewed every transcription. The correction process is important because every wrong annotation could decrease the quality of the resulting acoustic or language models. The process used in our laboratory for acoustic model training (refrec, the Reference Recognizer from COST249) is very sensitive to every discontinuity in the database [27]. This process is also affected by the conversion scripts from file pairs (wav + stm) to the SpeechDat format, including the mapping of the noise markers/tags and the generation of the phonetic lexicon.
There are several crucial points in the acoustic model (AM) training procedure described in [29]:
- Generating the word-level phonetic transcription of all segments that will be used in the training procedure. Usually the script finishes with an error stating that some word was not found in the phonetic lexicon generated during the conversion of the database.
- Generating tied triphone models. This step sometimes crashes because a phoneme is missing in the decision tree or in the phoneme class definitions.
- Trying a new phoneme set. Sometimes we want to try a new phoneme set (reduced or more specific) to test the impact of precise phoneme definitions on the resulting system. During this stage the phoneme class definitions sometimes have to be changed, or the phoneme mapping is not properly defined and has to be corrected.
- Automatic forced alignment errors. When the forced alignment procedure cannot find a suitable automatic alignment for a segment and its corresponding annotation, the segment is added to the outliers list and discarded from the training procedure.
- Checking the outliers list. After reviewing the original file and the corresponding annotation, the annotation should be corrected, because it usually contains an error (sometimes a missing word, or another word with a similar meaning in place of the right one). This type of error is hard for annotators to find, because during monotonous work the brain sometimes corrects it automatically.
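The first failure mode in the list, a word missing from the phonetic lexicon, can be detected before training starts. The following sketch is a hedged illustration and not the laboratory's actual script: the function name is hypothetical, the lexicon is reduced to a set of known words, and non-speech tags are assumed to be written in square brackets, which may differ from the tag syntax used in the database.

```python
from collections import Counter

def is_bracket_tag(token):
    """Assumed tag syntax: a token enclosed in square brackets."""
    return token.startswith("[") and token.endswith("]")

def find_missing_words(transcripts, lexicon_words, is_tag=is_bracket_tag):
    """Count transcript words that have no entry in the lexicon."""
    missing = Counter()
    for text in transcripts:
        for token in text.split():
            if is_tag(token):
                continue  # tags are mapped to noise models, not looked up
            word = token.lower()
            if word not in lexicon_words:
                missing[word] += 1
    return missing

# Hypothetical toy data, not taken from the database.
transcripts = ["dobrý večer [noise]", "dobrý deň"]
lexicon_words = {"dobrý", "večer"}
print(find_missing_words(transcripts, lexicon_words))
```

Reporting all missing words at once, together with their counts, avoids restarting the training script after every single lexicon error.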

Database statistics
The database consists of 291 TV shows amounting to 210 hours of material (including the time reserve before and after each show). The transcribed part of the database contains 141 hours of annotated audio material (1'169'832 words in 131'884 speech utterances); the distribution of focus conditions is shown in Table 1.
The dictionary generated from this database consists of 95'376 Slovak words and 19'425 other entries: foreign words/names, noises, incorrectly spelled words, partial/misspelled words and abbreviations.
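A few derived figures follow directly from the numbers above. The short calculation below uses only the reported totals; the variable names are chosen for this example.

```python
recorded_hours = 210
transcribed_hours = 141
words = 1_169_832
utterances = 131_884
slovak_entries = 95_376
other_entries = 19_425

words_per_utterance = words / utterances
transcribed_share = transcribed_hours / recorded_hours
dictionary_size = slovak_entries + other_entries
slovak_share = slovak_entries / dictionary_size

print(f"words per utterance: {words_per_utterance:.2f}")
print(f"transcribed share of recorded material: {transcribed_share:.1%}")
print(f"dictionary entries: {dictionary_size}")
print(f"Slovak share of dictionary: {slovak_share:.1%}")
```

This gives roughly 8.9 words per utterance, about two thirds of the recorded material transcribed, and a dictionary of 114,801 entries of which about 83% are Slovak words.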
The phonetic transcription (pronunciation lexicon) was generated using a Perl tool developed in our laboratory, and it is a very important element of the database. The phoneme description is based on the SAMPA standard [30]. Recently we found that the word-based phonetic transcription used for SpeechDat databases is not suitable for sentences, so we decided to change the training script to also accept phonetic transcriptions of whole sentences, which capture inter-word phonetics better.

Conclusions
Collecting speech databases is a crucial problem when developing automatic speech recognition engines for different domains and conditions. The broadcast news task is very popular nowadays, because government regulation specifies the minimum amount of shows with hidden subtitles for hearing-impaired viewers.
The new KEMT-BN2 database is an important contribution to broadcast news processing, not only for speech recognition but also for jingle detection, speech detection, speaker/anchor detection (the anchor being the host of a broadcast programme), segmentation, speaker clustering and specialized noise modelling for domain-specific tasks.
The KEMT-BN2 database has three times more data in every important parameter than the previous KEMT-BN1 and the Slovak part of the COST-278 database combined [7].
In the next period, we plan to use this database to build new acoustic models for automatic continuous speech recognition of broadcast news and to compare these models with previous versions (built on KEMT-BN1 and other databases) on a new BN domain-specific test set. Together with our colleagues we have already prepared a new language model (LM) for the BN task (adapted from the large universal LM used in our previous projects [4]). We also plan to implement the sentence-level phonetic transcription process in the training script.

Fig. 1 Broadcast news shows (1/2) and their transcriptions (3) presented on the web interface (6), with the ability to send an error report (5) to the administrator with an automatic timestamp (4) of the paused video

- Generating the phone prototypes. During this stage the developer sometimes discovers that an unknown (or unwanted) phoneme is in the training data, or that some phoneme or noise model is missing.
- This error is usually caused by a non-Slovak word (which should not be annotated or should be marked with a special tag) or by a filtering script error: some tags were filtered out during training set generation as not important for AM training (lexical tags), but we then found out that another, important noise tag had been filtered out as well.
- Generating initial monophone models. Sometimes the proper label file for a specific segment is missing.
- This error is caused by an inconsistency between the two scripts for filtering and for generating index files: one script filters out a segment as not suitable for training (and does not include its labels in the master label file), but the script that generates the file list of training segments decides that the segment can be used for training. The architecture of the used training procedure should be changed in the future to use only one
The research presented in this paper was supported by the Research & Development Operational Program funded by the ERDF (ITMS 26220220155) 50% and by the 7th Framework Programme EU ICT project INDECT (FP7 - 218086) 50%.