The 2014 Query by Example Search on Speech (QUESST)
This task was formerly called “Spoken Web Search” (SWS).
The task involves searching FOR audio content WITHIN audio content USING an audio content query. This task is particularly interesting for speech researchers in the area of spoken term detection or low-resource speech processing.
A set of audio files from multiple languages (some resource-limited, some recorded in challenging acoustic conditions and some containing heavily accented speech) will be provided to researchers. In addition, a set of queries will be provided. The task requires that each occurrence of a query within the content be identified. This year, only the audio files containing the query must be found (there is no need to locate the query within the file). As usual, no transcriptions, language tags or any other metadata will be provided for the test corpus. The task therefore requires researchers to build a language-independent audio-to-audio search system.
The biggest novelty this year comes in the nature of the queries being proposed. The query set will include two kinds of queries, which will be identified in the development data, but not in the evaluation data. Firstly, single/multi-word queries will be defined just like in previous years, where occurrences of the query in the reference utterances should exactly match the lexical representation of the query. An example of this case is the query "white horse", which should match the utterance "My white horse is beautiful".
In addition, complex single/multi-word queries will be proposed. Complexity in the queries can be classified into three types. The first type includes queries that may differ slightly (either at the beginning or at the end of the query) from the match in the utterance. Systems will therefore need to account for small portions of audio at the beginning or the end of the query that do not match the lexical form of the reference. In all cases, the matching part of any query will exceed 5 phonemes/250 ms, and the non-matching audio will be much smaller than the matching part. An example of this type of query would be "researcher" matching an utterance containing "research" (note that the reverse would also be possible).
The second type of complex queries corresponds to cases where two or more words in the query appear in a different order in the search utterance; for example, the query "white snow" should be able to match "snow white". Teams should not expect to find silence portions between words in the query and should develop robust techniques to account for partial matching between query and reference.
The third kind of complex query is similar to the second, but allows the reference to contain some amount of 'filler' content between the matching words. For example, the query "white horse" should match the utterances "My horse is white" as well as "I have a white and beautiful horse". In addition, each word in these queries may contain slight variations, as in the first kind of complex queries. Under no circumstances will these queries have a large amount of filler content between words.
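The partial and reordered matches described above are typically handled with approximate audio-to-audio alignment rather than exact matching. One common building block in query-by-example search is subsequence DTW, which aligns the whole query against any contiguous region of a reference utterance. The sketch below is illustrative only: the feature representation (e.g. posteriorgrams or MFCCs) and the cosine local cost are assumptions, not part of the task definition.

```python
import numpy as np

def subsequence_dtw_cost(query, reference):
    """Best cost of aligning the whole query against any contiguous
    region of the reference (lower = better match).

    query, reference: 2-D arrays of per-frame feature vectors
    (rows are frames); the feature choice is up to the system.
    """
    Q, R = len(query), len(reference)
    # Frame-level cosine distance matrix (an illustrative local cost).
    qn = query / np.linalg.norm(query, axis=1, keepdims=True)
    rn = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    dist = 1.0 - qn @ rn.T  # shape (Q, R)

    D = np.full((Q, R), np.inf)
    D[0, :] = dist[0, :]  # the match may start at any reference frame
    for i in range(1, Q):
        D[i, 0] = dist[i, 0] + D[i - 1, 0]
        for j in range(1, R):
            D[i, j] = dist[i, j] + min(D[i - 1, j],
                                       D[i, j - 1],
                                       D[i - 1, j - 1])
    # ...and may end at any reference frame; normalize by query length.
    return D[-1, :].min() / Q
```

Handling reordered words or filler content would additionally require splitting the query into sub-units and combining their individual alignment costs, which is left out of this minimal sketch.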
The target group of participants for this task includes researchers in the area of multilingual speech technology (also for under-resourced languages), spoken term detection and spoken content search.
The query set will contain queries of all kinds, so participants who want to obtain high scores in the evaluation will need to account for these query types in their systems. Queries in the development set will be labelled according to query type in order to facilitate system development. This year all queries will be manually recorded in order to avoid the acoustic context problems that arise when cutting queries from a longer sentence. The queries will be recorded at a normal speaking speed and in a clear speaking style.
Like last year, we will provide a single reference set in which both sets of queries will be searched. This reduces the computational burden of earlier years, when participants needed to run systems on all combinations of dev and eval sets and queries, and it ensures that results are comparable between development and evaluation runs. Given that the search set contains files from different languages, accents and acoustic environments, systems will need to be as generic as possible to succeed in finding queries appearing in these multiple sources. The length of this year's reference corpus will not exceed that of last year (20 hours and 500 dev/eval queries), since participants noted that more data would be a problem to process on time.
Like last year, there will be a single submission track for both zero-resource and low-resource systems. Zero-resource systems are those that do not use any external speech resources, such as speech labels, phone labels, phonetic dictionaries, etc., to develop their systems. Low-resource systems are those that use external data. Participants will still be required to describe in their submission paper what kind of system they developed (zero-resource or low-resource) and what data (if any) they used to develop and train it. Our main interest in asking for this information is to be able to properly compare both types of systems and see what effect external data might have on this year's task.
Ground truth and evaluation
The primary metric this year will be the normalized cross-entropy score (Cnxe), which was used as a secondary metric last year. This metric has been used for several years in the speaker identification community and has interesting properties; experimentally, its results correlate quite well with the ATWV metric. In order to properly compute the entropy-based metric, we will require that all participants return a decision for each query-utterance pair, attaching to it a posterior probability score bounded to the range [0,1]. Alternatively, a default score can be provided for all trials not explicitly returned. The Actual Term Weighted Value (ATWV) metric, used as the primary metric until now, will continue to be computed and will be used as a secondary metric.
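As a rough sketch, a normalized cross-entropy of this kind can be computed from the submitted posterior scores along the following lines. The exact target prior and pooling details are defined by the official scoring tools; the formulation and the default prior below are assumptions based on the standard Cnxe definition, not the evaluation's precise specification.

```python
import math

def cnxe(scores, labels, p_target=0.5):
    """Normalized cross-entropy of posterior scores.

    scores: posterior probabilities in [0, 1] that each trial is a hit.
    labels: 1 for target trials (query occurs), 0 for non-targets.
    p_target: assumed prior of a target trial (hypothetical default;
    the evaluation defines the actual value).
    """
    eps = 1e-12  # guard against log(0) at the score boundaries
    clip = lambda s: min(max(s, eps), 1.0 - eps)
    tar = [clip(s) for s, l in zip(scores, labels) if l == 1]
    non = [clip(s) for s, l in zip(scores, labels) if l == 0]
    # Empirical cross-entropy, weighted by the target prior.
    cxe = (p_target * sum(-math.log2(s) for s in tar) / len(tar)
           + (1.0 - p_target) * sum(-math.log2(1.0 - s) for s in non) / len(non))
    # Cross-entropy of a trivial system that always outputs the prior.
    cxe_prior = (-p_target * math.log2(p_target)
                 - (1.0 - p_target) * math.log2(1.0 - p_target))
    return cxe / cxe_prior
```

Under this normalization, a system that simply outputs the prior scores 1.0, a well-calibrated system scores below 1.0, and a miscalibrated one can score above it, which is what makes the metric sensitive to the quality of the submitted posteriors.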
Anguera, X., Metze, F., Buzo, A., Szoke, I., Rodriguez-Fuentes, L. J. The Spoken Web Search Task. In Proceedings of the MediaEval 2013 Multimedia Benchmark Workshop, CEUR-WS.org, 1043, ISSN: 1613-0073. Barcelona, Spain, 2013.
Fiscus, J., Ajot, J., Garofolo, J., Doddington, G. Results of the 2006 Spoken Term Detection Evaluation. In Proceedings of SIGIR Special Interest Group on Information Retrieval Workshop. Amsterdam, Netherlands, 2007.
Metze, F., Anguera, X., Barnard, E., Davel, M., Gravier, G. Language Independent Search In MediaEval's Spoken Web Search Task. In SLTC IEEE Speech and Language Processing Technical Committee's Newsletter, 2013.
Proceedings of the MediaEval 2013 Multimedia Benchmark Workshop, CEUR-WS.org, 1043, ISSN: 1613-0073. Barcelona, Spain, 2013.
Working Notes Proceedings of the MediaEval 2012 Workshop, CEUR-WS.org, 927, ISSN: 1613-0073. Pisa, Italy, 2012.
Xavier Anguera, Telefonica Research, Spain
Luis Javier Rodriguez-Fuentes, University of the Basque Country, Spain
Igor Szőke, Brno University of Technology, Czech Republic
Andi Buzo, University Politehnica of Bucharest, Romania
Florian Metze, Carnegie Mellon University, USA
2 June: Release of the reference data, the development query set and development set ground truth
1 July: Test query set release
9 September: Deadline for submission of test query set results (i.e., run submission deadline)
16 September: Results returned
28 September: Working notes paper deadline