The 2019 Scene Change Task (Brave New Task)

Task description
The MediaEval 2019 Scene Change Task explores fun faux photos: images that fool you at first glance, but can be identified as imitations on closer inspection. Task participants are provided with images of people (as a “foreground segment”) and are asked to change the background scene to Paris. We call this switch “scene change”.

Based on the dataset provided, participants are asked to develop a system that addresses the main task of creating a composite image:
  • Image compositing: given a foreground segment and a background image, the participant should blend the segment with the background in a manner that is appealing to the user. This is done for several popular landmarks in Paris. Only the foreground segment may be manipulated, so that the background image remains recognizable as the specific landmark. (A minimal compositing sketch is given below.)
Participants are encouraged to improve their main-task systems by developing additional sub-systems:
  • Background image retrieval: given a foreground segment, the participant should retrieve a well-fitting background image from the collection of background images taken near the same landmark. The foreground segment should then blend with the background image as in the main task, with respect to, for example, lighting conditions and perspective.
  • Foreground segmentation: both the foreground segment and the original foreground image are provided. Segmentation has seen remarkable advances recently, but remains a difficult task, for example with respect to hair. Participants are invited to refine the provided segmentations and to report the insights they gain.
Note that for this task photorealism is not a goal in and of itself. Similarly to [1], we strive for realism in the sense of acceptability, which includes enjoyability and shareability, rather than physical accuracy. Physical accuracy is not required for acceptability: it is known, for example, that in artistic work impossible lighting conditions and colors do not interfere with the viewer’s understanding of the scene and often go unnoticed [2]. We adopt the assumption that optimizing for this notion of realism suppresses distracting properties of the composited image, resulting in more appealing final images.
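To make the main task concrete, here is a minimal compositing sketch, assuming the segment is stored as an RGBA image whose alpha channel encodes the provided mask. The file names, paste position, and scale are hypothetical placeholders; a competitive system would estimate placement, scale, and color adjustment rather than hard-coding them.

    from PIL import Image

    # Load the person cut-out (alpha channel = provided mask) and a landmark photo.
    # File names are placeholders, not part of the task data specification.
    foreground = Image.open("segment.png").convert("RGBA")
    background = Image.open("landmark.jpg").convert("RGBA")

    # Naive placement: scale the segment to half the background height
    # (preserving aspect ratio) and paste it bottom-center.
    new_h = background.height // 2
    new_w = foreground.width * new_h // foreground.height
    fg = foreground.resize((new_w, new_h))
    position = ((background.width - new_w) // 2, background.height - new_h)

    composite = background.copy()
    composite.paste(fg, position, mask=fg)  # mask=fg blends via the alpha channel
    composite.convert("RGB").save("composite.jpg")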

[Figure: a scene change example]
Can you tell at first glance who was in Paris? Can you tell at second glance?

Task background and motivation
The task has multiple motivations:
  • More and more examples are arising of large groups of tourists, often taking selfies, causing harm to the environment [6,7]. Scene change could be a partial solution to this problem, relieving pressure on these popular areas.
  • As computer scientists, we develop methods that let people play with photos in a way that is not fully deceptive. Developing technologies for “shallow fakes” provides an alternative to recent work aimed at deep deception [9], in which the intent of the creator is that the fabricated image is not recognized as such. By benchmarking, we can evaluate methods and metrics for achieving and quantifying deceptiveness in multimedia. If we can find practical methods for doing so, people can enjoy new creations without being deceived into accepting fiction as fact.
  • Access to scene change functionality is currently restricted to a small group, including painters, photographers, Adobe® Photoshop® users and computer graphics experts. There is a large gap to bridge in commoditizing scene change. Giving users more control over their own photos will allow them to exercise creativity, have fun, and better protect their privacy at the same time. The relatively recent surge of creative tools (e.g. Animoji, Snapchat Lenses) suggests that people enjoy creative control over their images and videos. However, closer consideration of the functionality of these tools reveals its limitations: the creative possibilities are potentially much wider than what is currently available to users.

This year we focus on Paris, both because it is a highly popular tourist destination and because of the availability of the Paris Dataset [8]. In 2017, France was the most visited country in the world, and Paris had a total of 23.6 million hotel visits [10, 11].

Target group

The task targets (but is not limited to) people interested in art and social media, multimedia retrieval, machine learning, adversarial machine learning, privacy and computer vision.

Depending on your research interests, you might want to experiment in other directions. We have provided a recommended reading list (below) with some suggestions. You might consider a Generative-Adversarial-Network-based approach, for instance building on the work of Lin et al. 2018. You could also try an approach similar to that of Lalonde et al. 2007, who retrieve foreground segments that match certain conditions of the background.
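As a very crude stand-in for such condition matching, one could rank candidate backgrounds by color similarity to the segment. The sketch below is our own illustration (not a task baseline): it compares L1-normalized joint RGB histograms via histogram intersection.

    import numpy as np
    from PIL import Image

    def color_histogram(path, bins=8):
        """L1-normalized joint RGB histogram of an image."""
        rgb = np.asarray(Image.open(path).convert("RGB"), dtype=float).reshape(-1, 3)
        hist, _ = np.histogramdd(rgb, bins=(bins,) * 3, range=((0, 256),) * 3)
        return hist.ravel() / hist.sum()

    def rank_backgrounds(segment_path, candidate_paths):
        """Candidates sorted by histogram intersection with the segment, best first."""
        seg = color_histogram(segment_path)
        return sorted(candidate_paths,
                      key=lambda p: np.minimum(seg, color_histogram(p)).sum(),
                      reverse=True)

A real retrieval sub-system would go further, e.g. estimating lighting direction and camera perspective as Lalonde et al. do, rather than relying on global color statistics alone.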

Data
The data will be drawn from the ADE20k [5] dataset and the Paris dataset.

Ground truth and evaluation
Participants submit scene change examples for all images in the test set. The scene change is evaluated in a user study, in which study participants are randomly shown original and composited images and are asked to judge whether each image is original or not. The study is run twice: once time-restricted, similar to [4], and once unrestricted (as in the example above). A good algorithm produces shallow fakes: it achieves a high error rate in the time-restricted experiment and a low error rate in the unrestricted one. Submissions are ranked by the difference in error rates between the two experiments.
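In code, the ranking criterion reduces to a difference of two error rates. The sketch below is our own paraphrase of the description above; the response fields are illustrative, not a prescribed submission format.

    def error_rate(responses):
        """Fraction of study responses that misjudged whether an image was composited."""
        wrong = sum(1 for r in responses if r["answer"] != r["ground_truth"])
        return wrong / len(responses)

    def submission_score(restricted_responses, unrestricted_responses):
        """Higher is better: the fake fools viewers at a glance (high
        time-restricted error) yet is recognizable on inspection (low
        unrestricted error)."""
        return error_rate(restricted_responses) - error_rate(unrestricted_responses)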

References
[1] Karsch, K., Hedau, V., Forsyth, D., & Hoiem, D. (2011). Rendering synthetic objects into legacy photographs. ACM Transactions on Graphics (TOG), 30(6), 157.
[2] Cavanagh, P. (2005). The artist as neuroscientist. Nature, 434(7031), 301.
[4] Xiao, C., Zhu, J. Y., Li, B., He, W., Liu, M., & Song, D. (2018). Spatially transformed adversarial examples. In International Conference on Learning Representations (ICLR).
[5] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., & Torralba, A. (2017). Scene parsing through ade20k dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 633-641).
[6] Roy, E. A. (2018, December 06). Instacrammed: The big fib at the heart of New Zealand picture-perfect peaks. Retrieved from https://www.theguardian.com/world/2018/dec/07/instacrammed-the-big-fib-at-the-heart-of-new-zealand-picture-perfect-peaks
[7] Gammon, K. (2019, March 19). #Superbloom or #poppynightmare? Selfie chaos forces canyon closure. Retrieved from https://www.theguardian.com/environment/2019/mar/18/super-bloom-lake-elsinore-poppies-flowers
[8] Philbin, J., Chum, O., Isard, M., Sivic, J., & Zisserman, A. (2008, June). Lost in quantization: Improving particular object retrieval in large scale image databases. In 2008 IEEE conference on computer vision and pattern recognition (pp. 1-8). IEEE.
[9] Güera, D., & Delp, E. J. (2018, November). Deepfake video detection using recurrent neural networks. In 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS) (pp. 1-6). IEEE.
[10] UNWTO Tourism Highlights, 2017 Edition. (2017, August). Retrieved from http://www2.unwto.org/publication/unwto-tourism-highlights-2017
[11] Tourism in Paris - Key Figures - Paris tourist office. Retrieved from https://press.parisinfo.com/key-figures/Tourism-in-Paris-Key-Figures

Recommended reading
Karsch, K., Hedau, V., Forsyth, D., & Hoiem, D. (2011). Rendering synthetic objects into legacy photographs. ACM Transactions on Graphics (TOG), 30(6), 157.
Lalonde, J. F., Hoiem, D., Efros, A. A., Rother, C., Winn, J., & Criminisi, A. (2007). Photo clip art. ACM Transactions on Graphics (TOG), 26(3), 3.
Lin, C. H., Yumer, E., Wang, O., Shechtman, E., & Lucey, S. (2018, March). ST-GAN: Spatial Transformer Generative Adversarial Networks for Image Compositing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 9455-9464).

For more insight into the state of the art in segmentation, you could take a look at the winner of COCO 2018. The slides of the winner’s presentation can be found here: http://presentations.cocodataset.org/ECCV18/COCO18-Detect-MMDET.pdf.
Furthermore, there are also industry solutions that offer segmentation, such as https://www.remove.bg and https://online.photoscissors.com.
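As a small, hedged example of refining a provided segment (the optional foreground segmentation sub-task), one could feather the mask boundary so that hard cut-out edges, for instance around hair, blend less abruptly into a new background. This uses only standard Pillow calls; the file names and blur radius are placeholders.

    from PIL import Image, ImageFilter

    segment = Image.open("segment.png").convert("RGBA")  # placeholder file name
    alpha = segment.getchannel("A")

    # Gaussian-blurring the alpha mask softens the segment boundary.
    segment.putalpha(alpha.filter(ImageFilter.GaussianBlur(radius=2)))
    segment.save("segment_feathered.png")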

Task organizers
Simon Brugman, Radboud University, Netherlands
Martha Larson, Radboud University, Netherlands

Task schedule
Data release: 31 May
Runs due: 20 September
Results returned: 23 September
Working Notes paper due: 30 September
MediaEval 2019 Workshop (in France, near Nice): 27-29 October 2019

Acknowledgements
NWO TTW Open Mind