MediaEval 2019
MediaEval is a benchmarking initiative that offers challenges in multimedia retrieval, access and exploration. Our mission is to offer researchers working in computer science and other multimedia-related fields an opportunity to work on tasks that are related to human and social aspects of multimedia. MediaEval emphasizes the 'multi' in multimedia and seeks tasks involving multiple modalities, e.g., audio, visual, textual, and/or contextual. Our larger aim is to promote reproducible research that makes multimedia a positive force for society.

The official list of tasks that will run in MediaEval 2019 will be announced in March. However, if you are planning your participation, we have compiled a preliminary list of tasks that we are planning to offer (see below).

We are also looking forward to innovative proposals for new tasks. If you are interested in proposing a task, please contact Martha Larson m.a.larson at The official call for proposals will be posted shortly.

MediaEval 2019 Timeline

If you are interested in proposing a task:
Please see the Call for Task Proposals
1 March 2019: Task proposal deadline

If you are interested in participating in a task:
mid-March-May 2019: Registration for task participation
May-June 2019: Development data release
June-July 2019: Test data release
Run submission: mid-September 2019
Workshop: 27-29 October 2019 near Nice, France (scheduled so that participants can combine the workshop with attendance at ACM Multimedia 2019 in Nice).

Preliminary Task List

Eyes and Ears Together
Participants of this task are expected to build a system that analyzes video (visual features and speech transcripts) and creates bounding boxes around the regions of the video frames that correspond to nouns and pronouns in the speech transcript. The dataset used for the investigation is a collection of instruction videos, the How2 dataset. The output of participant systems will be evaluated in terms of the accuracy of the bounding boxes. The task is designed to encourage researchers to work on textual and visual domains simultaneously, and to advance research on multimodal processing.
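The task description does not specify how bounding-box accuracy is measured. As a minimal sketch, assuming the common intersection-over-union (IoU) measure is used to compare a predicted box with a reference box, the computation could look as follows. The `Box` type and the corner-coordinate convention are illustrative assumptions, not part of the task definition.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned box given by its corners (hypothetical convention)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def area(box: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    return max(0.0, box.x_max - box.x_min) * max(0.0, box.y_max - box.y_min)


def iou(predicted: Box, reference: Box) -> float:
    """Intersection over union of two boxes, in the range [0, 1]."""
    overlap = Box(
        max(predicted.x_min, reference.x_min),
        max(predicted.y_min, reference.y_min),
        min(predicted.x_max, reference.x_max),
        min(predicted.y_max, reference.y_max),
    )
    overlap_area = area(overlap)
    union_area = area(predicted) + area(reference) - overlap_area
    return overlap_area / union_area if union_area > 0 else 0.0
```

A prediction is often counted as correct when its IoU with the reference box exceeds a chosen threshold; the threshold itself would be set by the task organizers.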

GameStory: Video Game Analytics Challenge
In this task, participants analyze multi-view, multimedia data captured at a Counter-Strike: Global Offensive event. The data includes sources such as audio and video streams, commentaries, game data and statistics, interaction traces, and viewer-to-viewer communication. We ask participants to develop systems capable of multi-stream synchronization and event detection. Optionally, teams may (additionally or alternatively) create a summary that is an engaging and captivating story, which boils down the thrill of the game to its essence. The task opens the area of e-sports (which has over 150 million regular users) for multimedia researchers.

Multimedia Recommender Systems
Participants can choose between two tasks that investigate the use of multimedia content to improve the ability of recommender systems to predict items relevant to users’ interests. Participants analyze items and create feature sets that combine modalities (audio, visual, image, text). The first task is movie recommendation and requires participants to predict the average rating of a movie and the variance of that rating. The movie dataset includes links to the videos (YouTube URLs), precomputed state-of-the-art audio-visual features, and metadata from MovieLens. The second task is news recommendation and requires participants to predict the number of views for news articles. The news dataset is collected from a set of German publishers and spans multiple months. It includes text snippets and image URLs, along with some pre-extracted neural image features.
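To make the two prediction targets of the movie task concrete, the following minimal sketch derives a per-movie average rating and rating variance from individual user ratings. The input format (pairs of movie identifier and rating) and the choice of population variance are assumptions made for illustration; the task organizers define the actual ground truth.

```python
from collections import defaultdict
from statistics import fmean, pvariance


def rating_targets(ratings):
    """Map each movie id to (mean rating, population variance of ratings).

    `ratings` is an iterable of (movie_id, rating) pairs, a hypothetical
    input format chosen for this sketch.
    """
    by_movie = defaultdict(list)
    for movie_id, rating in ratings:
        by_movie[movie_id].append(float(rating))
    return {
        movie_id: (fmean(values), pvariance(values))
        for movie_id, values in by_movie.items()
    }
```

A system for this task would then predict these two numbers for each movie from its audio, visual, and textual features.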

No-audio Multimodal Speech Detection
Participants receive videos (top view) and sensor readings (acceleration and proximity) of people having conversations in a natural social setting and are required to detect speaking turns. No audio signal is available for use. The task encourages research on better privacy preservation during recordings made to study social interactions, and has the potential to scale to settings where recording audio may be impractical.

Pixel Privacy
Participants receive a set of images and are required to enhance them. The enhancement should achieve two goals: (1) Protection: It must prevent an automatic pixel-based algorithm from correctly predicting the setting (scene class) in which the photo was taken (i.e., prevent automatic inference) and (2) Appeal: It must make the image more beautiful or interesting from the point of view of the user (or at least not ruin the image from the user's point of view). The task extends the state of the art by looking at the positive (protective) ability of adversarial machine learning, and also exploring how people’s natural preference for appealing images can be linked to privacy protection.

Predicting Media Memorability
For the task, participants will be provided with extensive datasets of multimedia content (images and/or videos) associated with memorability annotations. Participants will be required to train computational models capable of inferring multimedia content memorability from features of their choice (some features are provided). The ground truth consists of scores reflecting how memorable (both in the short and the long term) video content is for a general-audience viewer. These scores were collected using recognition tests.

Scene Change (Brave New Task)
The task explores fun faux photos: images that fool you at first, but can be identified as an imitation on closer inspection. Task participants are provided with images of people (as a “foreground segment”) and are asked to change the background scene to Paris. Results are evaluated by user studies that measure how long general-audience viewers require to discover that the background has been switched. The task encourages the development of technology that allows people to fantasize with photos without engaging in deceptive practices.

Other potential tasks:
Medico, AcousticBrainz, flood detection (social media and satellite images), identifying triggers for viral content, identifying modified images, analyzing table tennis videos.

General Information about MediaEval

MediaEval was founded in 2008 as a track called "VideoCLEF" within the CLEF benchmark campaign. In 2010, it became an independent benchmark and in 2012 it ran for the first time as a fully "bottom-up benchmark", meaning that it is organized for the community, by the community, independently of a "parent" project or organization. The MediaEval benchmarking season culminates with the MediaEval workshop. Participants come together at the workshop to present and discuss their results, build collaborations, and develop future task editions or entirely new tasks. MediaEval co-located itself with CLEF in 2017, with ACM Multimedia in 2010, 2013, and 2016, and with the European Conference on Computer Vision in 2012. It was an official satellite event of Interspeech in 2011 and 2015. In 2019, we celebrate our ten-year anniversary with a workshop held just after ACM Multimedia 2019. Past working notes proceedings of the workshop include:

MediaEval 2015:
MediaEval 2016:
MediaEval 2017:
MediaEval 2018:

MediaEval 2019 Sponsors and supporters
Intelligent Systems, Delft University of Technology, Netherlands

If you are interested in becoming a MediaEval 2019 supporter, please contact Martha Larson at m.a.larson at

Did you know?
Over its lifetime, MediaEval teamwork and collaboration have given rise to over 700 papers, not only in the MediaEval workshop proceedings but also at conferences and in journals. Check out the MediaEval bibliography.