Zero Cost Speech

The 2016 Zero Cost Speech Recognition Task (ex QUESST)
Register to participate in this challenge on the MediaEval 2016 registration site.

Train the best possible speech recognition system for Vietnamese using only free resources.

The Zero-cost (ex QUESST) task aims to bridge the gap between “top” labs and companies, which have enough budget to buy any data they need for speech recognition training, and “other” small players, who are limited to freely available data, tools, or even systems.

The goal of this task is to challenge teams to come up with and experiment with bootstrapping techniques that allow an initial ASR system or speech tokenizer to be trained for “free”. That is, techniques that let you train a speech recognition system on publicly available data without buying datasets (expensive ones in particular, ideally none at all).

Target group
Any speech lab working on low- or zero-resource techniques for developing automatic speech recognizers or speech tokenizers. The outcome of this task for participants is not a Vietnamese speech recognizer as such, but the knowledge and tools for building a recognizer for any language they want.

Data
An initial set of free resources (a mix of audio data and imperfect transcripts, such as videos with subtitles) is provided to participants. We expect to provide several hours of data from multiple sources. Participants who have or know about other free resources are encouraged to share them with the other participants if they want to use them in the evaluation. The training set (data provided by the organizers and by other participants) will be fixed in late spring. After that, participants will not be allowed to use any other data. This should prevent a data-gathering race and let participants focus on research. Participants can also share other resources: texts, dictionaries, feature extractors, audio, etc. The only restriction is that each resource must be freely available to everyone for research purposes.
The development and evaluation test sets are available at system training time. Participants can use them and adapt their systems to them. However, participants are not provided with reference transcripts, and they are not allowed to transcribe or manually analyze the development or evaluation data.

The following data is provided by the organizers:
• Forvo.com (Vietnamese data downloaded from the Forvo.com service. Participants are forbidden to download, use, or share any Forvo.com data on their own, to avoid accidentally mixing train/dev/eval data.)
• Rhinospike.com (Vietnamese data downloaded from the Rhinospike.com service. Participants are forbidden to download, use, or share any Rhinospike.com data on their own, to avoid accidentally mixing train/dev/eval data.)
• Proprietary prompted data
• YouTube data

Ground truth and evaluation
Systems will be evaluated primarily by word error rate (WER), the main metric. WER is computed by comparing the generated word transcript against the reference transcript. Both transcripts should be in uppercase and without punctuation. No other text normalization is done.
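As an illustration of the main metric, here is a minimal sketch of a WER computation. It is not the organizers' scoring tool. The function names are hypothetical, and the sketch assumes that punctuation is deleted rather than replaced by a space and that WER is the word-level edit distance divided by the number of reference words.

```python
import unicodedata


def normalize(text: str) -> list[str]:
    """Uppercase the text and delete punctuation; nothing else is changed.

    Assumption: "punctuation" means any character in a Unicode "P" category.
    """
    kept = "".join(
        ch for ch in text.upper()
        if not unicodedata.category(ch).startswith("P")
    )
    return kept.split()


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance over tokens (unit cost for sub/ins/del)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance divided by reference length."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")
    return edit_distance(ref, hyp) / len(ref)
```

For example, wer("a b c d", "a b c") counts one deleted word out of four reference words. Note that the value can exceed 1.0 when the hypothesis contains many inserted words.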
Phoneme error rate (or the error rate of a similar sub-word unit) is intended as a secondary metric. Here, participants provide a generated sequence of phoneme-like units (automatically derived units, phonemes, etc.). We align these sequences to a reference phoneme transcription (using a confusion model) and calculate the similarity of the submitted units to the reference ones.
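The page does not specify the confusion model or the exact similarity formula, so the following is only a sketch of one way such an alignment could work. It assumes a hypothetical confusion table that maps (reference phoneme, submitted unit) pairs to a match score between 0 and 1, uses 1 minus that score as the substitution cost, and charges a cost of 1 for each insertion or deletion.

```python
def aligned_cost(ref, hyp, confusion, indel=1.0):
    """Minimum alignment cost between reference phonemes and submitted units.

    `confusion` is a hypothetical dict {(ref_phoneme, unit): score in [0, 1]};
    pairs missing from the table are treated as a full substitution (cost 1).
    """
    prev = [j * indel for j in range(len(hyp) + 1)]
    for i, r in enumerate(ref, start=1):
        curr = [i * indel]
        for j, h in enumerate(hyp, start=1):
            sub = 1.0 - confusion.get((r, h), 0.0)
            curr.append(min(
                prev[j] + indel,      # reference phoneme left unmatched
                curr[j - 1] + indel,  # extra submitted unit
                prev[j - 1] + sub,    # soft match via the confusion table
            ))
        prev = curr
    return prev[-1]


def unit_error_rate(ref, hyp, confusion):
    """Alignment cost normalized by the reference length (an assumption)."""
    if not ref:
        raise ValueError("reference sequence is empty")
    return aligned_cost(ref, hyp, confusion) / len(ref)


# Hypothetical example: unit "u1" always maps to "a", "u2" maps to "b" half the time.
table = {("a", "u1"): 1.0, ("b", "u2"): 0.5}
rate = unit_error_rate(["a", "b"], ["u1", "u2"], table)
```

With a table like this, automatically derived units do not have to share names with the reference phonemes; only their learned correspondence matters.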
We expect to use online scoring, where participants submit their results and are scored immediately.
Participants are provided with training, development, and evaluation data at the same time. However, they do not have references for the development and evaluation data. They can use the LeaderBoard to score their systems and get development results. When the evaluation is over, we will also publish scores on the evaluation set.

Recommended reading
[1] Anguera, X., Metze, F., Buzo, A., Szőke, I., Rodriguez-Fuentes, L. J. The Spoken Web Search Task. In Proceedings of the MediaEval 2013 Multimedia Benchmark Workshop, CEUR-WS.org, 1043, Barcelona, Spain, 2013.

[2] Anguera, X., Rodriguez-Fuentes, L. J., Szőke, I., Buzo, A., Metze, F. Query-by-example Spoken Term Detection Evaluation on Low-resource Languages. In Proceedings of the 4th International Workshop on Spoken Language Technologies for Under-resourced Languages SLTU-2014. St. Petersburg, Russia. St. Petersburg: International Speech Communication Association, 2014, pp. 24-31.

[3] Fiscus, J., Ajot, J., Garofolo, J., Doddington, G. Results of the 2006 Spoken Term Detection Evaluation. In Proceedings of ACM SIGIR 2007 Workshop on Searching Spontaneous Conversational Speech. Amsterdam, Netherlands, 2007.

[4] Larson et al. (eds.) Proceedings of the MediaEval 2013 Multimedia Benchmark Workshop, CEUR-WS.org, 1043, Barcelona, Spain, 2013.

[5] Larson et al. (eds.) Working Notes Proceedings of the MediaEval 2012 Workshop, CEUR-WS.org, 927, ISSN: 1613-0073. Pisa, Italy, 2012.

[6] Metze, F., Anguera, X., Barnard, E., Davel, M., Gravier, G. Language Independent Search In MediaEval’s Spoken Web Search Task. In SLTC IEEE Speech and Language Processing Technical Committee's Newsletter, 2013.

Task organizers
Igor Szőke, BUT Speech@FIT, Czech Republic
Xavier Anguera, ELSA Corp., USA-Portugal

Task schedule
1 May 2016 (updated): Training, development and evaluation data release
30 June 2016: Participants can provide and share their own development data
12 September: Run submission deadline
15 September: Results returned
30 Sept. 2016: Working notes paper deadline
20-21 Oct. 2016: MediaEval 2016 Workshop, right after ACM MM 2016 in Amsterdam

Acknowledgments
We acknowledge the data providers, the services Forvo.com and Rhinospike.com.
