Rich Speech Retrieval

The 2011 Rich Speech Retrieval Task
The task goes beyond conventional spoken content retrieval: participants must use both the spoken content and its context to find jump-points in an audiovisual collection for a given set of queries.

Target group
The task is of interest to researchers in the areas of speech retrieval, multimedia retrieval and text IR.

Data
The task will reuse the 2010 Wild Wild Web Task data set, applying it to a different task. The data set was gathered from a range of blip.tv shows (i.e., channels). It contains ca. 350 hours of data for a total of 1974 episodes (247 development / 1727 test). The episodes were chosen from 460 different shows; shows with fewer than four episodes were not considered for inclusion in the data set. The set is predominantly English, with approximately 6 hours of non-English content divided over French, Spanish and Dutch. All videos are shared by their owners under a Creative Commons license.

Participants are provided with a video file for each episode, along with metadata (e.g., title and description) and speech recognition transcripts.

Ground truth and evaluation
Ground truth will be generated by human annotators in a process that approximates the formulation of natural language queries. The task is a known-item task and the official evaluation metric will be Mean Reciprocal Rank (MRR).
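
For readers unfamiliar with the metric, the following is a minimal sketch of how MRR can be computed for a known-item task. It is not the official scoring tool. The function names, the query and item identifiers, and the use of exact string matching between a returned item and the known item are all assumptions made for illustration; how a returned jump-point is matched against the ground truth is not specified here.

```python
from typing import Dict, List


def reciprocal_rank(ranked_items: List[str], known_item: str) -> float:
    """Return 1/rank of the known item, or 0.0 if it is not in the list."""
    for rank, item in enumerate(ranked_items, start=1):
        # Assumption: a result counts as correct only on an exact ID match.
        if item == known_item:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Dict[str, List[str]],
                         ground_truth: Dict[str, str]) -> float:
    """Average the reciprocal rank over all queries in the ground truth.

    Queries with no submitted results contribute a score of 0.0.
    """
    if not ground_truth:
        return 0.0
    total = sum(
        reciprocal_rank(runs.get(query_id, []), known_item)
        for query_id, known_item in ground_truth.items()
    )
    return total / len(ground_truth)


# Hypothetical example: three queries, each with exactly one known item.
example_truth = {"q1": "ep12@95s", "q2": "ep40@10s", "q3": "ep07@300s"}
example_runs = {
    "q1": ["ep03@20s", "ep12@95s", "ep12@400s"],  # known item at rank 2
    "q2": ["ep40@10s"],                           # known item at rank 1
    "q3": ["ep01@5s", "ep02@5s"],                 # known item not returned
}
print(mean_reciprocal_rank(example_runs, example_truth))
```

In this example the per-query scores are 1/2, 1 and 0, so the mean is 0.5. Because each query has a single correct answer, only the rank of the first correct result matters.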

Recommended reading
Oard, D., Wang, J., Jones, G., White, R., Pecina, P., Soergel, D., Huang, X., Shafran, I., 2007. Overview of the CLEF-2006 Cross-Language speech retrieval track. In: Peters, C., et al. (Eds.), Evaluation of Multilingual and Multi-modal Information Retrieval. Vol. 4730 of Springer Lecture Notes in Computer Science, pp. 744-758.

Task organizers:
Roeland Ordelman, University of Twente and Netherlands Institute for Sound & Vision, a.k.a. Beeld & Geluid
Maria Eskevich and Gareth Jones, Dublin City University

This task is organized by Axes and IISSCOS

MediaEval Benchmarking Initiative for Multimedia Evaluation

The "multi" in multimedia: speech, audio, visual content, tags, users, context