Datasets

MediaEval Datasets 2010-2015

Retrieving Diverse Social Images 2015
This dataset is designed to support research in areas of information retrieval that foster new technologies for improving both the relevance and the diversification of search results, with an explicit focus on the social media context. The dataset consists of Creative Commons data: 153 one-concept Flickr queries and 45,375 images for development, and 139 Flickr queries (69 one-concept and 70 multi-concept) and 41,394 images for testing, together with metadata, Wikipedia pages and content descriptors for text and visual modalities. The data is annotated for the relevance and the diversity of the photos.

Bogdan Ionescu, Alexandru Lucian Gînscă, Bogdan Boteanu, Mihai Lupu, Adrian Popescu, and Henning Müller. 2016. Div150Multi: a social image retrieval result diversification dataset with multi-topic queries. In Proceedings of the 7th International Conference on Multimedia Systems (MMSys '16). ACM, New York, NY, USA

Paper: http://imag.pub.ro/~bionescu/index_files/DivTask_MMSys2016.pdf
Data set at: http://traces.cs.umass.edu/index.php/Mmsys/Mmsys

Context of Experience Task 2015
The aim of the dataset is to support the development of recommender systems, as well as computer vision and multimedia retrieval algorithms capable of automatically predicting which videos are suitable for inflight consumption. Right Inflight consists of 318 human-annotated movies, for which we provide links to trailers, a set of pre-computed low-level visual, audio and text features as well as user ratings. The annotation was performed by crowdsourcing workers, who were asked to judge the appropriateness of movies for inflight consumption.

Michael Riegler, Martha Larson, Concetto Spampinato, Pål Halvorsen, Mathias Lux, Jonas Markussen, Konstantin Pogorelov, Carsten Griwodz, and Håkon Stensland. 2016. Right inflight?: a dataset for exploring the automatic prediction of movies suitable for a watching situation. In Proceedings of the 7th International Conference on Multimedia Systems (MMSys '16). ACM, New York, NY, USA.

Data set at: http://traces.cs.umass.edu/index.php/Mmsys/Mmsys
See also: http://gesture.chalearn.org/icpr2016_contest

Synchronizing Event Media (SEM) 2015
This data supports work in the area of synchronizing multimedia (photos, videos, audio) that was captured at the same event. The data set for this task covers four different events and includes ground truth and an evaluation script. Please download the first three events here:
http://mmlab.disi.unitn.it/MediaEvalSEM2015
and the fourth here:
https://icosole.lab.vrt.be/viewer/home

Placing 2015
Images, videos and geo-coordinates for tackling the task of multimodal location estimation of social multimedia. The data is available via the Amazon Web Services (AWS) data portal of the Multimedia Commons Initiative, see:
http://multimediacommons.wordpress.com/yli-geo-placing-task-datasets
Also check out:
Bart Thomee, Benjamin Elizalde, David A. Shamma, Karl Ni, Gerald Friedland, Douglas Poland, Damian Borth, Li-Jia Li. 2016. YFCC100M: The New Data in Multimedia Research. Communications of the ACM, Vol. 59 No. 2, Pages 64-73
http://cacm.acm.org/magazines/2016/2/197425-yfcc100m/fulltext

Affective Impact 2015
LIRIS-ACCEDE consists of video excerpts with a large content diversity annotated along affective dimensions. The data was used in the MediaEval 2015 Affective Impact task.
http://liris-accede.ec-lyon.fr

Person Discovery Task 2015
Automatically discover people speaking in a raw TV broadcast, and tag them.
http://dataset.ina.fr
(Choose: Listes des corpus -> 6 mois de 20h)

QUESST 2014
The QUESST 2014 search dataset consists of 23 hours, or around 12,500 spoken documents, in Albanian, Basque, Czech, non-native English, Romanian and Slovak (PCM encoded with 8 kHz sampling rate and 16-bit resolution). The spoken documents (6.6 seconds long on average) were extracted from longer recordings of different types: read, broadcast, lecture and conversational speech.
http://speech.fit.vutbr.cz/software/quesst-2014-multilingual-database-query-by-example-keyword-spotting
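The figures quoted for QUESST 2014 can be cross-checked with a little arithmetic. The following is a minimal sketch, not part of the dataset tooling: it assumes mono audio (the channel count is not stated above) and ignores any file headers, so the size is only a rough estimate of the raw PCM payload.

```python
# Rough consistency check of the QUESST 2014 figures quoted above.
# Assumption: mono audio and no container overhead (neither is stated in the description).

SAMPLE_RATE_HZ = 8_000      # 8 kHz sampling rate
BYTES_PER_SAMPLE = 2        # 16-bit resolution
NUM_DOCUMENTS = 12_500      # "around 12,500 spoken documents"
MEAN_DURATION_S = 6.6       # average document length in seconds


def total_hours(num_documents: int, mean_duration_s: float) -> float:
    """Total audio duration in hours for documents of a given mean length."""
    return num_documents * mean_duration_s / 3600


def raw_pcm_bytes(duration_s: float, sample_rate_hz: int,
                  bytes_per_sample: int, channels: int = 1) -> int:
    """Size of uncompressed PCM audio of the given duration, in bytes."""
    return int(round(duration_s * sample_rate_hz * bytes_per_sample * channels))


hours = total_hours(NUM_DOCUMENTS, MEAN_DURATION_S)
size_bytes = raw_pcm_bytes(NUM_DOCUMENTS * MEAN_DURATION_S,
                           SAMPLE_RATE_HZ, BYTES_PER_SAMPLE)

print(f"Estimated duration: {hours:.1f} hours")
print(f"Estimated raw PCM size: {size_bytes / 1e9:.2f} GB")
```

The estimated duration comes out just under 23 hours, which agrees with the stated total.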

Placing 2014
Jaeyoung Choi, Bart Thomee, Gerald Friedland, Liangliang Cao, Karl Ni, Damian Borth, Benjamin Elizalde, Luke Gottlieb, Carmen Carrano, Roger Pearce, and Doug Poland. 2014. The Placing Task: A Large-Scale Geo-Estimation Challenge for Social-Media Videos and Images. In Proceedings of the 3rd ACM Multimedia Workshop on Geotagging and Its Applications in Multimedia (GeoMM '14)
https://sites.google.com/a/yli-corpus.org/intro/mediaeval-2014-placing-task-dataset

Synchronizing Event Media (SEM) 2014
Data set covering the 2012 London Olympic Games and the 2010 Vancouver Winter Olympic Games that supports work on the synchronization of multimedia:
http://mmlab.disi.unitn.it/MediaEvalSEM2014

Retrieving Diverse Social Images 2014
B. Ionescu, A. Popescu, M. Lupu, A. L. Ginsca, B. Boteanu, H. Müller, Div150Cred: A Social Image Retrieval Result Diversification with User Tagging Credibility Dataset. MMSys 2015.
http://traces.cs.umass.edu/index.php/Mmsys/Mmsys

Violent Scenes Detection (2011-2014)
The data (and the information on purchasing the movies) for the Violent Scenes Detection task has been made available by the task organizers here:
http://www.technicolor.com/en/innovation/research-innovation/scientific-data-sharing/violent-scenes-dataset
This dataset brings together data from the 2011, 2012, 2013 and 2014 task editions.

Spoken Web Search 2013 (now called QUESST)
This multilingual database contains 20 hours of utterance audio (the data you search in), ~500 development and ~500 evaluation audio queries (the data you search for), scoring scripts and references.
http://speech.fit.vutbr.cz/software/sws-2013-multilingual-database-query-by-example-keyword-spotting

Placing 2013
The Placing Task requires participants to estimate the geographical coordinates (latitude and longitude) of photos. This data set contains more than 8.5 million training images and more than 250,000 test images, which were crawled from Flickr (all with Creative Commons licenses).
http://www.st.ewi.tudelft.nl/~hauff/placingTask2013Data.html

Visual Privacy 2013
PEViD: Privacy Evaluation Video Dataset. The dataset consists of 21 video clips (16 seconds each, full HD, 25 fps) and associated annotations of privacy-sensitive regions in XML format. Video clips show people performing various actions in indoor and outdoor environments during day and night time.
http://mmspg.epfl.ch/page-106274-en.html
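For planning frame-level processing of PEViD, the clip count, duration and frame rate given above imply a total frame count. The sketch below is only an estimate: it assumes every clip is exactly 16 seconds long at a constant 25 fps, which the actual files may not match precisely.

```python
# Estimate the number of frames in PEViD from the figures quoted above.
# Assumption: each clip is exactly 16 s at a constant 25 fps.

NUM_CLIPS = 21
CLIP_DURATION_S = 16
FRAMES_PER_SECOND = 25


def frames_per_clip(duration_s: int, fps: int) -> int:
    """Number of frames in one clip of the given duration and frame rate."""
    return duration_s * fps


per_clip = frames_per_clip(CLIP_DURATION_S, FRAMES_PER_SECOND)
total_frames = NUM_CLIPS * per_clip
total_minutes = NUM_CLIPS * CLIP_DURATION_S / 60

print(f"Frames per clip: {per_clip}")
print(f"Total frames: {total_frames}")
print(f"Total duration: {total_minutes:.1f} minutes")
```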

Emotion in Music 2013
The data set contains 45-second music clips extracted randomly from the full songs. The 45-second excerpts (clips) are also annotated at the level of the full-length clip using arousal and valence levels on a 9-point scale. A set of features, extracted by openSMILE, is also available with the data. The data set originally had 1000 Creative Commons licensed songs annotated continuously (dynamically) on arousal and valence dimensions. We found some redundant songs and fixed some problems, which reduced the number of songs to 744.
http://cvml.unige.ch/databases/emoMusic

Similar Segments in Social Speech 2013
This data set supports a task involving speech recordings. Input to the system is a 1-10 second audio/video region of interest, and the desired output is an ordered list of regions similar to it, matching as closely as possible the judgments of human searchers.
http://www.cs.utep.edu/nigel/ssss

The following datasets are available via MMSys 2014, where they have been published in the dataset track:

Div400: The 2013 Retrieving Diverse Social Images Dataset
(MediaEval 2013 Retrieving Diverse Social Images Task)
396 landmark locations are represented via 43,418 Flickr photos and metadata, Wikipedia pages and content descriptors for text and visual modalities. The dataset comes with associated relevance and diversity assessments performed by human annotators.
See http://traces.cs.umass.edu/index.php/Mmsys/Mmsys

Fashion 10000: An Enriched Social Image Dataset for Fashion and Clothing
(MediaEval 2013 Crowdsourcing Task)
32,000+ Flickr images uploaded with Creative Commons licenses, accompanied by crowdsourcing labels reflecting their relevance to fashion and to specific categories of clothing items or fashion accessories.
See http://traces.cs.umass.edu/index.php/Mmsys/Mmsys

ReSEED: Social Event Detection Dataset
(MediaEval 2013 Social Event Detection Task)
This set consists of about 430,000 photos from Flickr together with the underlying ground truth consisting of about 21,000 social events. All the photos are accompanied by their textual metadata. The ground truth for the event groupings has been derived from event calendars on the Web that have been created collaboratively by people.
See http://traces.cs.umass.edu/index.php/Mmsys/Mmsys

Thank you to Mathias Lux (Alpen-Adria-University Klagenfurt, Austria), the MMSys 2014 dataset chair, and all the other people who contributed to the success of the MMSys 2014 dataset track.

The following four datasets are available via MMSys 2013, where they were published in the dataset track:

The 2012 Social Event Detection Dataset
(MediaEval 2012 Social Event Detection Task)
More than 160 thousand Flickr photos and their accompanying metadata, as well as a list of 149 manually selected and annotated target events, each of which is defined as a set of relevant photos.
See http://traces.cs.umass.edu/index.php/Mmsys/Mmsys

A Professionally Annotated and Enriched Multimodal Data Set on Popular Music
(2012 MusiClef Task @ MediaEval)
A multimodal data set of professionally annotated music, including editorial metadata about songs, albums, and artists, as well as MusicBrainz identifiers to facilitate linking to other data sets.
See http://traces.cs.umass.edu/index.php/Mmsys/Mmsys

Fashion-focused Creative Commons Social dataset
(MediaEval 2013 Crowdsourcing Task)
A mix of general images as well as images that are focused on fashion (i.e., relevant to particular clothing items or fashion accessories). The dataset contains 4810 images and related metadata.
See http://traces.cs.umass.edu/index.php/Mmsys/Mmsys

Blip10000: A social Video Dataset containing SPUG Content for Tagging and Retrieval
(MediaEval Tagging Task 2010, 2011, and 2012; Rich Speech Retrieval 2011; Search and Hyperlinking 2012)
A dataset containing comprehensive semi-professional user-generated (SPUG) content, including audiovisual content, user-contributed metadata, automatic speech recognition transcripts, automatic shot boundary files, and social information for multiple 'social levels'.
See http://traces.cs.umass.edu/index.php/Mmsys/Mmsys

Thank you to Pablo Cesar (CWI, Netherlands), the MMSys 2013 dataset chair, and all the other people who contributed to the success of the MMSys 2013 dataset track.

MediaEval Benchmarking Initiative for Multimedia Evaluation

The "multi" in multimedia: speech, audio, visual content, tags, users, context