Announcement of Data Release
The task has concluded and the data has been released. Please see MediaEval Datasets.

The 2013 Spoken Web Search Task
The task involves searching FOR audio content WITHIN audio content USING an audio content query. This task is particularly interesting for speech researchers in the area of spoken term detection or low-resource speech processing.

A set of un-transcribed audio files from multiple languages and a set of queries will be provided to researchers. The task requires that each occurrence of a query within the audio content be identified: both the correct audio files and the locations of each query term within those files must be found. No transcriptions, language tags or any other metadata will be provided. The task therefore requires researchers to build a language-independent, acoustic-condition-independent audio search system.

Participants will receive two separate sets of queries, one for development and one for evaluation. Unlike in previous years, this year we will provide a single reference set over which both sets of queries will be searched. Development ground truth files will be generated following the format defined by NIST in the 2006 STD evaluation campaign [1]. These files will contain only the locations where the query terms appear within the reference data, leaving the rest of the transcription unknown. Submission of results will therefore entail running two sets of queries (development and evaluation) on the same reference data.

This year there will be a single track submission, covering both zero-resource and low-resource systems. Zero-resource systems are those that do not use any external speech resources such as speech labels, phone labels, phonetic dictionaries, etc., even when these are provided by the organizers. Low-resource systems are those that use such data, or that have been trained on external data and possibly adapted to the target sets. Participants will still be required to state in their submissions what kind of system they developed (zero-resource or low-resource) and what data (if any) they used to develop and train their systems. Our main interest in asking for this information is to be able to properly compare both types of systems and to see what effect external data might have on this year's task.

Baseline Spoken Term Detection Setup
In order to lower the barrier to entry for new research teams, this year we will provide a baseline setup as a virtual kitchen appliance, which will contain both a working baseline system and a set of features extracted from the dataset.

Target group
The target group of participants for this task includes researchers in the area of multilingual speech technology (also for under-resourced languages), spoken term detection and spoken content search.

Data
This year, the reference set will be composed of audio files from different languages, accents and acoustic conditions, so systems will need to be as generic as possible to succeed in finding queries across these multiple sources. A non-final list of languages includes heavily-accented English, Albanian, Czech, Basque, Romanian and 4 African languages (different from those used last year). The size of the dataset will be around 20 hours, and the number of queries will be around 400. The number of queries per language will be roughly proportional to the number of hours available from each language. Both queries and reference audio files will be scrambled, and no information will be given on which language is spoken in each file.

Ground truth and evaluation
Most probably, the main metric for this year will be, as in previous years, the Term-Weighted Value (TWV) as proposed by NIST in [1]. In addition, this year we will ask participants to report the running time of their system, measured as the average speedup factor achieved when automatically searching for a 1-second query term, compared with performing the search manually by listening to the whole reference data. This will be reported at the workshop as a secondary metric.
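To make the primary metric concrete, the following is a minimal sketch of a TWV computation following the definitions in the NIST 2006 STD evaluation plan [1]. The function name, input layout and the one-trial-per-second approximation of non-target trials are illustrative assumptions, not part of an official scoring tool.

```python
BETA = 999.9  # cost/prior weighting from the NIST 2006 STD evaluation plan


def twv(per_term_counts, speech_duration_s, beta=BETA):
    """Term-Weighted Value over a set of query terms.

    per_term_counts: list of (n_true, n_correct, n_spurious) tuples,
        one per query term (true occurrences, correct detections,
        false alarms).
    speech_duration_s: total duration of the reference audio in seconds.
    """
    p_miss_sum = 0.0
    p_fa_sum = 0.0
    n_terms = 0
    for n_true, n_correct, n_spurious in per_term_counts:
        if n_true == 0:
            continue  # terms with no true occurrences are excluded
        p_miss_sum += 1.0 - n_correct / n_true
        # Non-target trials are approximated as one per second of
        # speech, minus the true occurrences of the term.
        n_non_target = speech_duration_s - n_true
        p_fa_sum += n_spurious / n_non_target
        n_terms += 1
    return 1.0 - (p_miss_sum + beta * p_fa_sum) / n_terms


# Hypothetical example: 2 terms searched over 20 hours (72000 s) of audio.
counts = [(10, 8, 1), (5, 5, 0)]
print(round(twv(counts, 72000.0), 3))  # -> 0.893
```

A perfect system scores 1.0; because of the large beta weight, even a handful of false alarms pulls the score down sharply, which is why TWV rewards conservative detection thresholds.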

Recommended reading
[1] http://www.itl.nist.gov/iad/mig/tests/std/2006/docs/std06-evalplan-v10.pdf

[2] "The Spoken Web Search Task", Florian Metze, Etienne Barnard, Marelie Davel, Charl van Heerden, Xavier Anguera, Guillaume Gravier and Nitendra Rajput, in Proc. MediaEval 2012 Workshop

[3] "Results of the 2006 Spoken Term Detection Evaluation", Fiscus, J., Ajot, J., Garofolo, J., Doddington, G., in Proc. SIGIR 2007 Workshop on Searching Spontaneous Conversational Speech

[4] MediaEval 2012 Working Notes Proceedings. Available: http://ceur-ws.org/Vol-927

Task organizers
Xavier Anguera, Telefonica Research, Spain
Florian Metze, Carnegie Mellon University, USA
Andi Buzo, University Politehnica of Bucharest, Romania
Igor Szoke, Brno University of Technology, Czech Republic
Luis Javier Rodriguez, University of the Basque Country, Spain

Task auxiliary
Charl van Heerden, Meraka Institute, CSIR, South Africa

Task schedule
June 3 Release of the reference data, the development query set and development set ground truth
June 3 Virtual kitchen with a running system and acoustic features extracted from the data available for download
July 1 Test query set release
September 9 Deadline for submission of test query set results
September 16 System results returned to participants
September 28 Working notes paper deadline