Register to participate in this challenge on the MediaEval 2016 registration site.
This task requires participants to automatically select images and/or video segments which are considered to be the most interesting for a common viewer. Interestingness of the media is to be judged based on visual appearance, audio information and text accompanying the data. To solve the task, participants are strongly encouraged to deploy multimodal approaches.
Interestingness should be assessed according to the following use case scenario.
The use case scenario of the task derives from a use case at Technicolor, which involves helping professionals illustrate a Video on Demand (VOD) web site by selecting interesting frames and/or video excerpts for the movies. The frames and excerpts should help users decide whether they are interested in watching a movie.
For this first year, two different subtasks will be offered to participants, which correspond to two types of available media content, namely images and videos. Participants are encouraged to submit to both subtasks.
1) Predicting Image Interestingness
Given a set of key-frames extracted from a certain movie, the task involves automatically identifying those images for the given movie that viewers report to be interesting. To solve the task, participants can make use of visual content as well as accompanying metadata, e.g., Internet data about the movie, social media information, etc.
2) Predicting Video Interestingness
Given a video sequence from a certain movie, the task involves automatically identifying the shots that viewers report to be interesting in the given movie. To solve the task, participants can make use of visual and audio data as well as accompanying metadata, e.g., subtitles, Internet data about the movie, etc.
This task targets (but is not limited to) researchers working in the areas of image/video interestingness, memorability, or attractiveness prediction, image aesthetics, event detection, multimedia affect and perceptual analysis, multimedia content analysis, and machine learning.
The data will be extracted from approximately 100 trailers of Hollywood-like movies. These trailers are shared under Creative Commons licenses that allow their redistribution. For the video interestingness subtask, the data will consist of the movie shots (obtained after a manual segmentation of the trailers). Prediction will be carried out on a per-trailer basis. For the image interestingness subtask, the data will consist of collections of key-frames extracted from the video shots used in the previous subtask (one key-frame per shot), which will allow a comparison of results across the two subtasks. Again, prediction will be carried out on a per-trailer basis.
Ground truth and evaluation
All data is to be manually annotated in terms of interestingness by human assessors. A pair-wise comparison protocol will be used: annotators are shown a pair of images/video shots at a time and asked to tag which of the two is more interesting for them. The process is repeated until the whole dataset has been covered. To avoid an exhaustive comparison of all possible pairs, a boosting selection method will be employed (i.e., the adaptive square design method). The obtained annotations are then aggregated into the final interestingness degrees of the images/video shots. No additional external metadata (e.g., movie critics, tweets, etc.) will be provided in this first year.
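The exact aggregation is defined by the adaptive square design procedure used by the organizers; purely as a simplified, hypothetical illustration, pairwise judgments can be turned into per-item scores with a win-ratio aggregation (function and variable names here are assumptions, not part of the official protocol):

```python
from collections import defaultdict

def aggregate_pairwise(judgments):
    """Aggregate pairwise annotations into per-item interestingness scores.

    `judgments` is a list of (winner, loser) pairs, where each element is the
    id of an image/shot and `winner` was judged the more interesting of the two.
    Returns a dict mapping item id -> fraction of its comparisons it won.
    """
    wins = defaultdict(int)
    appearances = defaultdict(int)
    for winner, loser in judgments:
        wins[winner] += 1
        appearances[winner] += 1
        appearances[loser] += 1
    return {item: wins[item] / appearances[item] for item in appearances}

# Example: shot "a" beats "b" twice, "b" beats "c" once
scores = aggregate_pairwise([("a", "b"), ("a", "b"), ("b", "c")])
# "a" wins every comparison it appears in; "c" wins none
```

A more principled aggregation (e.g., a Bradley-Terry model) would fit latent interestingness strengths instead of raw win ratios, but the idea of reducing many pairwise tags to a single degree per item is the same.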
For both subtasks, the official evaluation metric will be the mean average precision (MAP) computed over all trailers, where average precision is computed on a per-trailer basis over the top N best-ranked images/video shots.
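The official scores will come from the organizers' evaluation tool; as an illustrative sketch only (the exact AP normalization and the cutoff N are assumptions here), MAP over per-trailer rankings can be computed as follows:

```python
def average_precision(ranked, relevant, n=None):
    """AP over the top-n ranked items: mean of precision@k at each relevant hit,
    normalized by the number of relevant items."""
    ranked = ranked[:n] if n else ranked
    hits, precisions = 0, []
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(per_trailer):
    """MAP: mean of the per-trailer AP values.

    `per_trailer` is a list of (ranked_item_ids, relevant_item_ids) pairs,
    one pair per trailer.
    """
    aps = [average_precision(ranked, relevant) for ranked, relevant in per_trailer]
    return sum(aps) / len(aps)

# Two hypothetical trailers: AP = 1.0 and AP = 0.5
mean_average_precision([(["a", "b"], {"a"}), (["x", "y"], {"y"})])  # -> 0.75
```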
Recommended reading
H. Katti, K. Y. Bin, C. T. Seng, M. Kankanhalli, Interestingness Discrimination in Images, 2008;
Y.-G. Jiang, Y. Wang, R. Feng, X. Xue, Y. Zheng, H. Yan, Understanding and Predicting Interestingness of Videos, Proc. AAAI, 2013;
M. Gygli, H. Grabner, H. Riemenschneider, F. Nater, L. Van Gool, The Interestingness of Images, Proc. ICCV, 2013;
A. Khosla, A. Sarma, R. Hamid, What Makes an Image Popular?, Proc. WWW, 2014;
P. Isola, J. Xiao, D. Parikh, A. Torralba, What Makes a Photograph Memorable?, IEEE TPAMI, 2014;
Y. Fu, T. M. Hospedales, T. Xiang, S. Gong, Y. Yao, Interestingness Prediction by Robust Learning to Rank, Proc. ECCV, 2014;
S. Yoon, V. Pavlovic, Sentiment Flow for Video Interestingness Prediction, Proc. WHCEU, 2015;
M. Soleymani, The Quest for Visual Interest, Proc. ACM MM, 2015;
D. E. Berlyne, Conflict, Arousal and Curiosity, McGraw-Hill, 1960;
P. J. Silvia, Appraisal Components and Emotion Traits: Examining the Appraisal Basis of Trait Curiosity, Cognition and Emotion, 2008.
Task organizers
Claire-Hélène Demarty, Ngoc Duong, Alexey Ozerov, Frédéric Lefebvre, Vincent Demoulin, Technicolor, France;
Bogdan Ionescu, University Politehnica of Bucharest, Romania;
Mats Sjöberg, University of Helsinki, Finland;
Hanli Wang, Tongji University, China;
Toan Do, Singapore University of Technology and Design, Singapore & University of Science, Vietnam;
Yu-Gang Jiang, Fudan University, Shanghai, China.
Task schedule
6 June 2016: Development data release
30 June 2016: Test data release
9 September 2016: Run submission
23 September 2016: Working notes paper deadline
20-21 October 2016: MediaEval 2016 Workshop, right after ACM MM 2016 in Amsterdam
This task is made possible through the support of the following projects and grants:
• the National Natural Science Foundation of China under Grant 61472281,
• the "Shu Guang" project of the Shanghai Municipal Education Commission and Shanghai Education Development Foundation under Grant 12SG23,
• the Program for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning (No. GZ2015005).