The Tagging Task (Wild Wild Web Version)

The Wild Wild Web Tagging Task required participants to automatically assign tags to videos using features derived from speech, audio, visual content or associated textual or social information. Participants could choose which features they wished to use and were not obliged to use all features.

Note that relevant tags capture some aspect of a video related to aboutness or to tagger perceptions. Relevance of tags is thus determined by the intellectual content of the video as a whole and not by the visual channel alone. In this respect, the tagging task is significantly different from visual concept detection, as practiced, e.g., at TRECVid.

Target group
Researchers in the area of multimedia retrieval, spoken content search and social media.

The MediaEval 2010 Wild Wild Web Tagging Task data set is a collection of creative-commons-licensed Internet video collected from and created by the PetaMedia Network of Excellence. Four criteria were imposed on dataset creation:
  • It should be representative of Internet video
  • The video should be freely distributable (creative commons licensed)
  • It should be associated with a social network
  • It should represent a first approach to the Wild Wild Web tag prediction problem (i.e., it shouldn't be challenging to the point of being utterly unsolvable)
The data set was gathered from a range of shows (i.e., channels). It contains ca. 350 hours of data for a total of 1974 episodes (247 development / 1727 test). The episodes were chosen from 460 different shows; shows with fewer than four episodes were not considered for inclusion in the data set. Only episodes for which the speech recognizer achieved an average word-level confidence score of > 0.7 were included in the set. The set is predominantly English, with approximately 6 hours of non-English content divided over French, Spanish and Dutch.
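The two inclusion rules described above (at least four episodes per show, average word-level confidence above 0.7) can be sketched as a simple filter. This is only an illustration: the `Episode` record, its field names and the function `select_episodes` are hypothetical, and the order in which the organizers applied the two rules is an assumption.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Episode:
    # Hypothetical record; field names are assumptions, not the task's format.
    episode_id: str
    show: str
    mean_word_confidence: float  # average word-level ASR confidence score


def select_episodes(episodes, min_episodes_per_show=4, min_confidence=0.7):
    """Keep episodes from shows with enough episodes and confident transcripts.

    Assumption: the per-show episode count is taken over all candidate
    episodes, before the confidence filter is applied.
    """
    per_show = Counter(ep.show for ep in episodes)
    return [
        ep
        for ep in episodes
        if per_show[ep.show] >= min_episodes_per_show
        and ep.mean_word_confidence > min_confidence
    ]
```

Counting episodes per show after the confidence filter instead would be an equally plausible reading and could yield a slightly different selection.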

Participants were provided with a video file for each episode along with metadata (e.g., title + description), speech recognition transcripts and social network information (gathered from Twitter, i.e., who twittered whom about which video). Note that the latter was not used by any participants in 2010.

Ground truth and Evaluation
The ground truth consists of tags that have been assigned to the videos by users. We denoised the tags by keeping only high-frequency tags, i.e., tags occurring more than 10 times in a large sample of content. The result was a list of 747 tags for the development set and 1271 tags for the test set. The two sets were not mutually exclusive. Participants could approach the task as a "closed-set" tagging task, whereby they assumed knowledge of the identity of the tags and assigned tags to video. Alternatively, they could choose to approach the task as an "open-set" tagging task in which they predicted tags without previous knowledge of the tagset.
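The frequency-based denoising step can be illustrated with a short sketch. It assumes that a tag is counted once per tagged item and that no normalization (such as lowercasing) is applied; the text above does not specify either detail, and the function name `frequent_tags` is hypothetical.

```python
from collections import Counter


def frequent_tags(tagged_items, min_count=10):
    """Return the tags that occur on more than `min_count` items.

    `tagged_items` is an iterable of tag lists, one list per item.
    Assumption: a tag repeated on the same item is counted only once.
    """
    counts = Counter(tag for tags in tagged_items for tag in set(tags))
    return {tag for tag, n in counts.items() if n > min_count}
```

The resulting set would play the role of the tag list that a "closed-set" system is allowed to assume in advance.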

2010 Results and Links to Related Working Notes Papers
Two groups crossed the finish line on this task in 2010 and both treated the problem as an information retrieval problem. Among the runs that used the speech recognition transcripts only, the best results (mean average precision, MAP, of 0.16) were achieved by Dublin City University with a method that used the language modeling retrieval framework and all words included in the speech recognition transcripts, and applied neither stemming nor stopword removal. Methods that discarded words with low confidence scores did not perform as well as methods using all words. Among the runs that made use of metadata, the best results were achieved by Queen Mary, University of London, with an approach that used vector space model similarity (tf-idf weights) to a set of Wikipedia articles automatically determined to be relevant for a tag. The approach combined this similarity with information from the filenames, achieving a MAP of 0.4.
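For readers unfamiliar with the MAP figures quoted above, the following is a minimal sketch of mean average precision. It assumes that each tag is treated as a query and that a system returns a ranked list of videos for it; this framing, and the function names, are assumptions for illustration and not a description of the official scoring script.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of relevant items.

    Assumes `ranked` contains no duplicate items.
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant)


def mean_average_precision(rankings, ground_truth):
    """Mean of the average precision over all queries in `ground_truth`.

    `rankings` maps a query (here: a tag) to a ranked list of items;
    queries with no returned ranking contribute a score of 0.
    """
    scores = [
        average_precision(rankings.get(query, []), relevant)
        for query, relevant in ground_truth.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

Under this definition, relevant items that are never retrieved lower the score, because the sum is divided by the total number of relevant items rather than by the number found.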

Gyarmati, A. and Jones, G.J.F. DCU at MediaEval 2010 -- Tagging Task Wild Wild Web.

Chandramouli, K., Kliegr, T., Piatrik, T. and Izquierdo, E. QMUL @ MediaEval 2010 Tagging Task: Semantic Query Expansion for Predicting User Tags.

Thank you to the LIMSI laboratory of the Centre National de la Recherche Scientifique and to Vocapia Research for the multilingual speech recognition transcripts.


Task coordinator: Martha Larson, Delft University of Technology
(m.a.larson at tudelft dot nl)