The 2019 Eyes and Ears Together Task: Multimodal coreference resolution in speech transcripts

Task description
The Eyes and Ears Together task investigates linking entities and associated pronouns in speech streams with objects appearing in the corresponding video streams. Participants in this task receive visual features and speech transcripts derived from a set of videos, together with a set of nouns and pronouns found in the transcripts. Using these, they are required to build a system that analyzes the video and places bounding boxes around the regions of the video frames that correspond to the nouns and pronouns.
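As an illustration, the sketch below shows the kind of input/output interface a participant system would implement. All names, types and the naive full-frame baseline are illustrative assumptions, not a prescribed submission format.

```python
# A minimal sketch of the task's input/output interface.
# Names, types and the full-frame baseline are illustrative assumptions.
from dataclasses import dataclass
from typing import List


@dataclass
class Mention:
    """A noun or pronoun found in the time-aligned speech transcript."""
    text: str          # e.g. "hammer" or "it"
    start_time: float  # start of the word in the audio, in seconds
    end_time: float    # end of the word in the audio, in seconds


@dataclass
class BoundingBox:
    """A predicted region, in pixel coordinates of a video frame."""
    frame_time: float  # timestamp of the frame the box refers to
    x: float
    y: float
    width: float
    height: float


def ground_mentions(visual_features, mentions: List[Mention],
                    frame_width: int, frame_height: int) -> List[BoundingBox]:
    """Return one bounding box per noun/pronoun mention.

    This placeholder simply returns the full frame at the midpoint of each
    mention; a real system would use the visual features and transcript
    context to localize the object being referred to.
    """
    return [BoundingBox(frame_time=(m.start_time + m.end_time) / 2.0,
                        x=0.0, y=0.0,
                        width=float(frame_width), height=float(frame_height))
            for m in mentions]
```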

The Eyes and Ears Together task is motivated by the observation that human experience of and interaction with the real world is fundamentally multimodal in nature. When observing the world, we naturally link the objects that we see with the speech that we hear about them. Such links are vital to a full understanding of the world. For this reason, automatic linking of entities with visual objects is an important topic. To date, it has not been well explored, and it still represents a formidable research challenge.

The objective of the task is to improve co-reference resolution in spoken information streams based on their association with coincident video information, and to examine the use of visual information in speech processing more generally. Such technologies have the potential to enhance the analysis of spoken multimedia data, where both audio and visual information are available.

Target group
Eyes and Ears Together is targeted at researchers who work on multimedia processing, natural language processing and computer vision. Co-reference resolution is traditionally an NLP task applied to written text, and previous work has rarely sought to exploit visual signals. Eyes and Ears Together asks researchers to develop a system that exploits visual signals to resolve references in transcriptions of naturally occurring speech. We particularly encourage participation by multidisciplinary groups bringing together complementary expertise to address this task.

Data
This task will be performed on a collection of instructional videos called “How2” [2]. The dataset consists of roughly 300 hours of video. Participants are provided with pre-computed visual features and time-aligned speech transcriptions. We will release links to the original videos, as distributing the videos themselves is not permitted due to copyright restrictions. Training data will be released with automatically detected nouns and pronouns, together with candidate visual bounding boxes. There are no explicit labels linking bounding boxes to objects or pronouns to their referents, so participants are expected to build their systems without explicit supervision.
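To make the training setting concrete, the following hypothetical record shows how a transcript segment with its automatically detected mentions and candidate boxes might look. The field names, JSON layout and values are assumptions for illustration only; the actual release format may differ.

```python
# Hypothetical training record: one transcript segment with detected mentions
# and candidate boxes. Field names and values are illustrative assumptions.
import json

example_record = {
    "video_id": "how2_example_0001",           # placeholder, not a real How2 id
    "segment": {
        "text": "now take the hammer and hit it gently",
        "start_time": 12.4,                    # seconds into the video
        "end_time": 15.1,
    },
    "mentions": [
        # candidate_boxes are given as [x, y, width, height] in pixels
        {"word": "hammer", "pos": "NOUN", "time": 13.0,
         "candidate_boxes": [[120, 80, 60, 40], [300, 150, 80, 50]]},
        {"word": "it", "pos": "PRON", "time": 14.2,
         "candidate_boxes": [[120, 80, 60, 40], [300, 150, 80, 50]]},
    ],
}

print(json.dumps(example_record, indent=2))
```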

Ground truth and evaluation
System output will be evaluated in terms of the accuracy of bounding box prediction given video frames, nouns and pronouns. For each video frame, there will be a ground-truth bounding box capturing the object that corresponds to a given noun or pronoun. Submitted systems will be ranked by bounding box prediction accuracy, but we will also compute the F1 score of co-reference resolution for pronouns. The development and test splits of the dataset will be manually annotated with the referents of pronouns and with bounding boxes for noun phrases and pronouns.
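The sketch below illustrates the two measures. It assumes a predicted box counts as correct when its intersection-over-union (IoU) with the ground-truth box exceeds a threshold (0.5 here), and scores co-reference as F1 over predicted (pronoun, antecedent) pairs; the official scoring script and threshold may differ.

```python
# Illustrative evaluation sketch: IoU-based grounding accuracy and pairwise
# co-reference F1. The 0.5 IoU threshold and pairwise F1 are assumptions.

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(predictions, ground_truth, threshold=0.5):
    """Fraction of mentions whose predicted box matches the reference box."""
    correct = sum(1 for pred, gold in zip(predictions, ground_truth)
                  if iou(pred, gold) >= threshold)
    return correct / len(ground_truth)


def coreference_f1(predicted_links, gold_links):
    """F1 over predicted (pronoun, antecedent) pairs, treated as sets."""
    predicted, gold = set(predicted_links), set(gold_links)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall > 0 else 0.0)
```

The pairwise F1 here is only one possible way to score co-reference; established coreference metrics score clusters rather than individual links, and the task may adopt such a scheme instead.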

References and recommended reading
[1] D. A. Huang, S. Buch, L. Dery, A. Garg, L. Fei-Fei, and J. C. Niebles. 2018. Finding “It”: Weakly-Supervised, Reference-Aware Visual Grounding in Instructional Videos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 5948–5957.
[2] R. Sanabria, O. Caglayan, S. Palaskar, D. Elliott, L. Barrault, L. Specia, and F. Metze. 2018. How2: A large-scale dataset for multimodal language understanding. In Proceedings of Neural Information Processing Systems (NeurIPS).
[3] Y. Moriya, R. Sanabria, F. Metze, and G. J. F. Jones. 2018. Eyes and Ears Together: New Task for Multimodal Spoken Content Analysis. In Working Notes Proceedings of the MediaEval 2018 Workshop.

Task organizers
Yasufumi Moriya, Dublin City University, Ireland, first.last@adaptcentre.ie
Ramon Sanabria, Carnegie Mellon University, USA
Florian Metze, Carnegie Mellon University, USA
Gareth J. F. Jones, Dublin City University, Ireland

Task Schedule
Data release: 31 May
Run submission: 20 September
Results returned: 23 September
Working Notes paper deadline: 30 September
MediaEval 2019 Workshop (in France, near Nice): 27-29 October 2019