COST278BN

Full title: 
Audio indexing of Broadcast News files in different languages
Project date: 
1 January, 2004 - 31 December, 2015

Introduction

In the context of the European COST action on Spoken Language Interaction in Telecommunication, an international collaboration in the domain of Broadcast News Transcription was started in 2004. The collaboration focused on audio indexing, but there were some activities on speech transcription as well. The creation of the COST278BN database is considered one of the major achievements of the action. The work on speaker diarization was also important (see the papers listed below).

The COST278BN corpus

The multilingual COST278BN corpus consists of complete news shows that were broadcast by different TV stations in different countries. Each partner contributed 3 hours of data, and the data are grouped per language into national data sets. At present, the corpus can be specified as follows:

  • Covered TV stations: 16
  • Covered languages: 11 (Basque, Croatian, Czech, Dutch, Galician, Greek, Hungarian, Portuguese, Slovakian, Slovenian (2 sets), Spanish)
  • Audio and video: the audio is 16 kHz, 16 bit, in wave format; video is available for only 6 data sets and is mostly in Real Media Video.
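
As an illustration of the audio specification above, the following minimal Python sketch checks whether a wave file is sampled at 16 kHz with 16-bit samples and reports its duration. It uses only the standard `wave` module. The function names are hypothetical and are not part of any tooling distributed with the corpus; the number of channels is not checked because the specification does not state it.

```python
import wave

# Expected format, taken from the corpus specification: 16 kHz, 16 bit.
EXPECTED_RATE_HZ = 16000
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bits per sample


def matches_corpus_format(path):
    """Return True if the wave file at `path` is 16 kHz with 16-bit samples."""
    with wave.open(path, "rb") as wav:
        return (wav.getframerate() == EXPECTED_RATE_HZ
                and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH_BYTES)


def duration_seconds(path):
    """Return the duration of the wave file at `path` in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()
```

A check of this kind may be useful before feeding newly recorded material to segmentation or clustering tools that assume the corpus format.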

From the specifications of the database it is clear that the national data sets are moderate in size, and therefore insufficient for training a national speech transcription system. The aim is rather to provide data for testing, for model adaptation and for research on front-end processing algorithms that are supposed to be only weakly language dependent, such as speech/nonspeech segmentation, speaker turn segmentation, background classification and speaker clustering.

The corpus is freely available to all research groups that contributed at least one national data set. The groups which contributed so far are:

and the following TV stations are acknowledged for donating their broadcast material: VRT (Belgium), Prima TV and Nova TV (Czech Republic), TVG and EiTB (Spain), ERT (Greece), HRT (Croatia), MTV1, TV2 and RTL KLUB (Hungary), RTP1 (Portugal), RTV-SLO1 (Slovenia), TA3 (Slovakia).

Interested groups can join the consortium at any time. They can contact J-P Martens (Gent) about the transcription protocol to follow, example contracts to conclude with the TV station, etc.

  

Results: 
  1. A. Vandecatseye, J.P. Martens, J. Neto, H. Meinedo, C. Garcia-Mateo, J. Dieguez, F. Mihelic, J. Zibert, J. Nouza, P. David, M. Pleva, A. Cizmar, H. Papageorgiou, C. Alexandris (2004). “The COST278 pan-European broadcast news database”, Proceedings LREC (Lisbon), 873-876.
  2. J. Zibert, F. Mihelic, J.P. Martens, H. Meinedo, J. Neto, L. Docio, C. Garcia-Mateo, P. David, J. Zdansky, M. Pleva, A. Cizmar, A. Zgank, Z. Kacic, C. Teleki, K. Vicsi (2005). “The COST278 Broadcast News Segmentation and Speaker Clustering Evaluation - Overview, Methodology, Systems, Results”, Proceedings Interspeech (Lisbon), 629-632.