Violent Scenes Detection

The 2014 Affect in Multimedia Task: Violent Scenes Detection
In this task, participants should create systems that automatically detect physically violent segments in typical Hollywood production movies. The task is related to a use case originating at Technicolor, which involves helping users select movies suitable for young children in terms of their violent content, and it generalises to a use case scenario where video material should be filtered for an appropriate audience. To solve the task, participants are strongly encouraged to deploy multimodal approaches that make use of visual, auditory and textual modalities.

This task is a follow-up to last year's edition. Based on that experience, the following definition of violence is adopted for this year: the targeted violent segments are those “one would not let an 8 years old child see in a movie because they contain physical violence”.

In the 2014 edition of the violent scenes detection task, we are concentrating on this subjective definition of violence, which was optional last year, and thus not used by all participating groups. The organisers feel that the subjective definition is closer to the real-world use case, and lies closer to the MediaEval ideal of strong human and social relevance. Furthermore, for the groups that submitted subjective runs in 2013, the results were consistently better than with the objective definition.

This year, we will also concentrate on segment-level detection: systems are expected to return the starting and ending times of each violent segment. No shot detection will be required or provided by the organisers, which will allow better generalisation to all kinds of videos, not only edited material.
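As an illustration of what segment-level output means in practice, the following minimal sketch turns per-frame violence scores into (start, end) times in seconds. It is only one possible post-processing step, not the organisers' method: the function name scores_to_segments, the threshold of 0.5, and the assumption that a system produces one score per frame at a known frame rate are all hypothetical.

```python
from typing import List, Tuple


def scores_to_segments(
    scores: List[float],
    frame_rate: float,
    threshold: float = 0.5,  # hypothetical decision threshold, not from the task
) -> List[Tuple[float, float]]:
    """Merge consecutive above-threshold frames into (start, end) segments in seconds."""
    segments: List[Tuple[float, float]] = []
    start = None  # index of the first frame of the currently open segment
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            # A violent run begins at this frame.
            start = i
        elif score < threshold and start is not None:
            # The run ended just before this frame.
            segments.append((start / frame_rate, i / frame_rate))
            start = None
    if start is not None:
        # The video ended while a segment was still open.
        segments.append((start / frame_rate, len(scores) / frame_rate))
    return segments
```

For example, with one score per second, scores_to_segments([0.1, 0.9, 0.8, 0.2, 0.7], frame_rate=1.0) yields two segments, from 1.0 to 3.0 seconds and from 4.0 to 5.0 seconds.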

Finally, we will include a new generalisation task where participants detect violence in short web videos, using the same system as in the main task. The aim is to find systems that generalise well to other types of video content.

Target group
Researchers in areas including, but not limited to, event detection, multimedia affect, and multimedia content analysis.

Task and Data
This year, two different challenges are proposed: a main task, which addresses the detection of violence in typical Hollywood productions, and a generalisation task, which addresses short Internet videos specific to web media platforms. Participants are strongly encouraged to submit to both tasks; on request, however, we may also accept single-task submissions.

Main task
For the main task, the data set will consist of about 30 Hollywood movies that must be acquired by the participants. The training data will consist of 25 movies used in previous years' editions of this task. The movies are of different genres, from extremely violent movies to movies without violence. Any features automatically extracted from the video, including the subtitles, can be used by participants for preparing the required runs. In addition to the mandatory baseline run, optional runs will be allowed where participants can use additional external data, e.g., Internet resources.

Generalisation task
For the generalisation task, a separate test set will be used, which consists of short web videos, e.g., from YouTube or from the Internet Archive. The training set will be the same as for the main task, and a baseline run must be submitted which uses only the provided training data. Optional extra runs may use additional external data as well. The aim of this task is to explore how well the violence detectors generalise to different types of video content.

Ground truth and evaluation
Violence ground truth is created by human assessors and is provided by the task organisers. In addition to segments containing physical violence (according to the subjective definition given above), annotations include the following high-level concepts: presence of blood, fights, presence of fire, presence of guns, presence of cold arms, car chases and gory segments, for the visual modality; gunshots, explosions and screams for the audio modality. Note that participants are welcome to carry out detection of the high-level concepts. However, concept detection is not a requirement for the task since these high-level concept annotations are provided for training purposes.

The official evaluation metric will be the mean average precision (MAP), but other relevant metrics will also be provided (e.g., precision-recall and detection error trade-off curves).
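For readers unfamiliar with the metric, the following minimal sketch shows the standard textbook computation of average precision over a ranked list and its mean over several lists. It is not the official evaluation tool: the function names are hypothetical, and the sketch assumes that each returned segment has already been judged as a hit or a miss against the ground truth, a matching rule that this description does not specify.

```python
from typing import List, Optional, Sequence


def average_precision(
    ranked_hits: Sequence[bool],
    num_relevant: Optional[int] = None,
) -> float:
    """AP of one ranked list; ranked_hits[k] is True if the item at rank k+1 is relevant.

    num_relevant is the total number of relevant items; if omitted, it is
    assumed to equal the number of hits in the list.
    """
    if num_relevant is None:
        num_relevant = sum(ranked_hits)
    if num_relevant == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, is_hit in enumerate(ranked_hits, start=1):
        if is_hit:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / num_relevant


def mean_average_precision(runs: List[Sequence[bool]]) -> float:
    """Mean of the per-list AP values (e.g., one ranked list per movie)."""
    if not runs:
        return 0.0
    return sum(average_precision(r) for r in runs) / len(runs)
```

As a worked case, the list [True, False, True] has precision 1/1 at the first hit and 2/3 at the second, so its average precision is (1 + 2/3) / 2, or about 0.83.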

Recommended reading
[1] Acar, E., Hopfgartner, F., Albayrak, S. Violence Detection in Hollywood Movies by the Fusion of Visual and Mid-level Audio Cues. In Proceedings of ACM International Conference on Multimedia. ACM, Barcelona, Spain, 2013, 717-720.

[2] Bermejo Nievas, E., Deniz Suarez, O., Bueno García, R., Sukthankar, R. Violence Detection in Video using Computer Vision Techniques. In Proceedings of the 14th International Conference on Computer Analysis of Images and Patterns - Volume Part II. Seville, Spain, 2011, 332-339.

[3] Demarty, C. H., Penet, C., Schedl, M., Ionescu, B., Quang, V. L., Jiang, Y.-G. The MediaEval 2013 Affect Task: Violent Scenes Detection. In Working Notes Proceedings of the MediaEval 2013 Workshop, CEUR-WS.org, ISSN 1613-0073. Barcelona, Spain, 2013.

[4] de Souza, F. D. M., Chavez, G. C., do Valle, E. A., Araújo, A. de A. Violence Detection in Video Using Spatio-Temporal Features. In Proceedings of 23rd SIBGRAPI Conference on Graphics, Patterns and Images. Gramado, Brazil, 2010, 224-230.

[5] Giannakopoulos, T., Makris, A., Kosmopoulos, D., Perantonis, S., Theodoridis, S. Audio-visual Fusion for Detecting Violent Scenes in Videos, in Artificial Intelligence: Theories, Models and Applications. In Proceedings of SETN 6th Hellenic Conference on Artificial Intelligence. Athens, Greece, 2010, 91-100.

[6] Ionescu, B., Schlüter, J., Mironică, I., Schedl, M. A Naive Mid-level Concept-based Fusion Approach to Violence Detection in Hollywood Movies. In Proceedings of ACM ICMR International Conference on Multimedia Retrieval. ACM, Dallas, USA, 2013, 215-222.

[7] Lin, J., Wang, W. Weakly-Supervised Violence Detection in Movies with Audio and Video Based Co-training, in Advances in Multimedia Information Processing – PCM. In Proceedings of 10th Pacific Rim Conference on Multimedia. Bangkok, Thailand, 2009, 930-935.

[8] Affect Task working notes papers in the Proceedings of the MediaEval 2013 Workshop, CEUR-WS.org, ISSN 1613-0073. Barcelona, Spain, 2013.

Task organisers
Mats Sjöberg, Aalto University, Finland: mats (dot) sjoberg (at) aalto (dot) fi
Bogdan Ionescu, University Politehnica of Bucharest, Romania
Yu-Gang Jiang, Fudan University, Shanghai, China
Vu Lam Quang, Multimedia and Communications LAB, University of Information Technology, VNU-HCMC, Vietnam
Markus Schedl, Johannes Kepler University, Linz, Austria

Task auxiliaries
Claire-Helene Demarty, Technicolor, France

Task schedule
May 15: Development data release
June 16: Test data release
September 15: Run submission due
September 19: Results returned
September 28: Working notes paper deadline

Supporting projects
• Academy of Finland funding grants no. 255745 and 251170
• UEFISCDI SCOUTER (under grant no. 28DPST/30-08-2013)
• National Natural Science Foundation of China (#61201387 and #61228205)
• China's National 973 Program (#2010CB327900)
• Austrian Science Fund (FWF): P25655
• EU FP7-ICT-2011-9: project no. 601166 ("PHENICX")

MediaEval Benchmarking Initiative for Multimedia Evaluation
