The 2018 Human Behavior Analysis Task: No-Audio Multimodal Speech Detection in Crowded Social Settings

Task description
An important but under-explored problem is the automated analysis of conversational dynamics in large unstructured social gatherings such as networking or mingling events. Research has shown that attending such events contributes greatly to career and personal success [1]. This task focuses on analyzing one of the most basic elements of social behavior: the detection of speaking turns.

Task participants are provided with video of individuals participating in a conversation that was captured by an overhead camera. These images are frames from the video that is provided to the participants. Note that the example has been blurred for use on this website, but participants will be provided with the original video.

Each individual is also wearing a badge-like device, recording tri-axial acceleration. This image shows the device:

The goal of the task is to automatically estimate when the person seen in the video starts speaking, and when they stop speaking using these alternative modalities. In contrast to conventional speech detection, for this task, no audio is used. Instead, the automatic estimation system must exploit the natural human movements that accompany speech (i.e., speaker gestures, as well as shifts in pose and proximity).

This task consists of two subtasks:
  • Unimodal classification: Design and implement separate speech detection algorithms exploiting each modality separately: Teams must submit separate decisions for the wearable modality and for the video modality.
  • Multimodal classification: Design and implement a speech detection approach that integrates modalities. Teams must submit a multimodal estimation decision, using some form of early, late or hybrid fusion.
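As one illustration of the late-fusion option, the sketch below combines per-second scores from two unimodal detectors with a weighted average. The function name `late_fusion`, the `video_weight` parameter, and the use of `None` to mark seconds without usable video are assumptions made for this example and are not part of the task definition. It also assumes both score sequences are already on a comparable scale (for instance, probabilities).

```python
def late_fusion(video_scores, accel_scores, video_weight=0.5):
    """Fuse two per-second score sequences by weighted averaging.

    video_scores may contain None where no video-based score exists
    (hypothetical convention); the accelerometer score is used alone there.
    """
    if len(video_scores) != len(accel_scores):
        raise ValueError("both score sequences must cover the same seconds")
    if not 0.0 <= video_weight <= 1.0:
        raise ValueError("video_weight must lie in [0, 1]")

    fused = []
    for v, a in zip(video_scores, accel_scores):
        if v is None:
            # No video evidence for this second: fall back to the wearable.
            fused.append(a)
        else:
            fused.append(video_weight * v + (1.0 - video_weight) * a)
    return fused


if __name__ == "__main__":
    print(late_fusion([0.9, None, 0.6], [0.7, 0.4, 0.1]))
```

Early fusion would instead concatenate features from both modalities before training a single classifier; the weighted average here is only the simplest late-fusion rule.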
Speaking predictions must be made for every second. However, teams are free to use a different interval length and then interpolate or extrapolate their predictions to the second level.
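A minimal sketch of mapping scores computed on a longer interval to one score per second is shown below. It assumes windows of `window_s` seconds (at least one second long) laid end to end from time zero, interpolates linearly between window centres, and holds the first and last values constant at the edges. The function name and parameters are hypothetical, and other interpolation schemes are equally valid.

```python
def to_per_second(window_scores, window_s, total_s):
    """Interpolate one score per window onto one score per second."""
    if not window_scores:
        raise ValueError("need at least one window score")

    # Time (in seconds) at the centre of each window.
    centers = [(i + 0.5) * window_s for i in range(len(window_scores))]

    per_second = []
    for sec in range(total_s):
        t = sec + 0.5  # centre of this one-second bin
        if t <= centers[0]:
            per_second.append(window_scores[0])    # extrapolate at the start
        elif t >= centers[-1]:
            per_second.append(window_scores[-1])   # extrapolate at the end
        else:
            j = int((t - centers[0]) // window_s)  # window centre to the left of t
            frac = (t - centers[j]) / window_s
            left, right = window_scores[j], window_scores[j + 1]
            per_second.append((1.0 - frac) * left + frac * right)
    return per_second


if __name__ == "__main__":
    print(to_per_second([0.0, 1.0], window_s=2, total_s=4))
```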

Task background and motivation
Previous work has shown the benefit of deriving features from speaking turns for estimating many different social constructs, such as dominance or cohesion, to name but a few. Unlike traditional tasks that have used audio to do this, the idea here is to leverage the body movements (i.e., gestures) performed during speech production, which are captured from video and/or wearable acceleration and proximity sensing. The benefit is that this enables a more privacy-preserving method of extracting socially relevant information and has the potential to scale to settings where recording audio may be impractical. The relationship between body behavior, such as gesturing, and speaking has been well documented by social scientists [2].

Some efforts have been made in recent years to estimate these behaviors from a single body-worn tri-axial accelerometer hung around the neck [3,4]. This form of sensing could be embedded into a smart ID badge for use in settings such as conferences, networking events, or organizations. In other work, video has been used to estimate speaking status [6,7]. Despite these efforts, a major challenge has been achieving estimation performance that is competitive with audio-based systems. As yet, exploiting the multimodal aspects of the problem is an under-explored area, and it will be the main focus of this challenge.

Target group
This challenge is targeted at researchers in computer vision and signal processing. The aim is to provide an entry-level task that has a clearly definable ground truth. An understanding of the nuances of human speech behavior in social settings would help to inform the design of solutions to this task. However, the task is designed to be addressable without this knowledge. By offering an entry-level task, we hope to support researchers not currently familiar with social signal processing in learning more about the problem domain, and to lay the groundwork for future tasks addressing other forms of human behavior reflected in more subtle behavioral cues (e.g., personality, attraction, conversational involvement).

The data consists of recordings of 70 people who attended one of three separate mingle events (cocktail parties). Overhead camera data as well as wearable tri-axial accelerometer data for an interval of 30 minutes is available for this challenge. Each person wore a device that recorded acceleration, hung around the neck like a conference badge. A subset of this data will be kept as a test set. All the samples of this test set will be for subjects who are not in the training set.

All the data is synchronized. The video data is mostly complete, with some segments missing because participants could leave the recording area at any time (e.g., to go to the bathroom). The video frame rate and the accelerometer sample rate are both 20 Hz. For each subject, individual videos will be provided. Here are screenshot examples for one subject. Note that due to the crowded nature of the events, there can be strong occlusions between participants in the video.
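To make the 20 Hz accelerometer stream concrete, the following sketch groups tri-axial samples into non-overlapping one-second windows and computes two simple movement statistics per window. The choice of features (mean and variance of the acceleration magnitude), the function name, and the input layout as `(x, y, z)` tuples are assumptions for illustration only; they are not the organizers' method or the format of the released data.

```python
import math
from statistics import fmean, pvariance

SAMPLE_RATE_HZ = 20  # accelerometer sample rate stated in the task description


def per_second_features(samples, rate=SAMPLE_RATE_HZ):
    """Return (mean magnitude, magnitude variance) for each full second.

    samples: sequence of (x, y, z) acceleration tuples (hypothetical layout).
    A trailing partial second is dropped.
    """
    magnitudes = [math.sqrt(x * x + y * y + z * z) for x, y, z in samples]

    features = []
    for start in range(0, len(magnitudes) - rate + 1, rate):
        window = magnitudes[start:start + rate]
        features.append((fmean(window), pvariance(window)))
    return features


if __name__ == "__main__":
    still = [(0.0, 0.0, 1.0)] * 45  # 2.25 seconds of a motionless badge
    print(per_second_features(still))
```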

Ground truth and evaluation
Manual annotations are provided for binary speaking status (speaking / non-speaking) for all people. The annotations are made for every video frame (20 FPS). As mentioned above, speaking predictions must be made for every second.
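Because annotations are per frame while predictions are per second, frame labels need to be reduced to one label per second for training or local validation. The description does not state how this reduction is done for the official ground truth, so the majority vote below (with ties counted as speaking) is purely an assumption, as is the function name.

```python
FPS = 20  # annotation rate stated in the task description


def per_second_labels(frame_labels, fps=FPS):
    """Collapse binary frame labels (0/1) to one label per full second.

    Uses a majority vote within each second (an assumed rule); a tie counts
    as speaking. A trailing partial second is dropped.
    """
    labels = []
    for start in range(0, len(frame_labels) - fps + 1, fps):
        window = frame_labels[start:start + fps]
        labels.append(1 if 2 * sum(window) >= fps else 0)
    return labels


if __name__ == "__main__":
    frames = [1] * 20 + [0] * 15 + [1] * 5 + [1] * 10 + [0] * 10
    print(per_second_labels(frames))
```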

Since the classes are severely imbalanced, we will be using the Area Under the ROC Curve (ROC-AUC) as the evaluation metric. Thus, participants should submit non-binary prediction scores (posterior probabilities, distances to the separating hyperplane, etc.).
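For reference, ROC-AUC equals the probability that a randomly chosen speaking sample receives a higher score than a randomly chosen non-speaking sample, with ties counted as one half. The sketch below computes it directly from that definition, which also shows why any monotonic score (not only a probability) is acceptable. The function name is hypothetical, and the pairwise loop is written for clarity rather than speed; an established library implementation would normally be preferred on the full data.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: binary ground truth (1 = speaking, 0 = non-speaking).
    scores: real-valued predictions, higher meaning more likely speaking.
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one sample of each class")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```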

The task will be evaluated using a subset of the data left as a test set. All the samples of this test set will be for subjects who are not present in the training set.

For evaluation, we will ask the teams to provide the following estimations for the two subtasks stated above (unimodal and multimodal):
  • Person independent: All samples are provided to the classifier together, irrespective of the subject that the samples came from. Note that the test samples we provide will be samples taken from people who are not in the training data.
  • (optional) Person specific: Only samples generated from the same subject are provided to the classifier. We therefore expect participants to train one classifier for each person and output test results per person-specific classifier. This can be a useful sanity check on the performance of the method, which should, in theory, perform better when trained on a specific person rather than on other people.
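When validating a person-independent model locally, it helps to mirror the official setup by holding out whole subjects rather than random samples. The sketch below shows one way to build such a subject-disjoint split; the function name, the `test_fraction` and `seed` parameters, and the subject identifiers are all hypothetical.

```python
import random


def split_by_subject(subject_ids, test_fraction=0.2, seed=0):
    """Split sample indices so that no subject appears on both sides.

    subject_ids: one subject identifier per sample.
    Returns (train_indices, test_indices).
    """
    subjects = sorted(set(subject_ids))
    rng = random.Random(seed)          # fixed seed for a reproducible split
    rng.shuffle(subjects)

    n_test = max(1, round(len(subjects) * test_fraction))
    held_out = set(subjects[:n_test])  # these subjects are never trained on

    train_idx = [i for i, s in enumerate(subject_ids) if s not in held_out]
    test_idx = [i for i, s in enumerate(subject_ids) if s in held_out]
    return train_idx, test_idx


if __name__ == "__main__":
    ids = ["a", "a", "b", "b", "c", "c", "d", "d", "e", "e"]
    train, test = split_by_subject(ids)
    print(train, test)
```

For the optional person-specific setting, the same per-sample subject identifiers could instead be used to group samples so that one classifier is trained and tested per subject.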

Recommended reading
[1] Wolff, H.-G., Moser, K.: Effects of networking on career success: a longitudinal study. Journal of Applied Psychology 94(1), 196 (2009)
[2] McNeill, D.: Language and Gesture, vol. 2. Cambridge University Press (2000)
[3] Hung, H., Englebienne, G., Kools, J.: Classifying social actions with a single accelerometer. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 207–210. ACM (2013)
[4] Gedik, E., Hung, H.: Personalised models for speech detection from body movements using transductive parameter transfer. Journal of Personal and Ubiquitous Computing (2017)
[5] Hung, H., Ba, S. O.: Speech/non-speech detection in meetings from automatically extracted low resolution visual features. Idiap Research Report (2010)
[6] Cristani, M., Pesarin, A., Vinciarelli, A., Crocco, M., Murino, V.: Look at who’s talking: voice activity detection by automated gesture analysis. In: Workshop on Interactive Human Behavior Analysis in Open or Public Spaces, International Joint Conference on Ambient Intelligence (2011)
[7] Cabrera-Quiros, L., Demetriou, A., Gedik, E., Van der Meij, L., Hung, H.: The MatchNMingle dataset: a novel multisensor resource for the analysis of social interactions and group dynamics in-the-wild during free-standing conversations and speed dates. Under review.

Task organizers
Hayley Hung, Delft University of Technology, Netherlands, h dot hung at tudelft dot nl
Laura Cabrera Quiros, Delft University of Technology, Netherlands, and Escuela de Ingeniería Electrónica at the Instituto Tecnológico de Costa Rica, Costa Rica, l dot c dot cabreraquiros at
Ekin Gedik, Delft University of Technology, Netherlands

Task schedule
Development data release: 1 June 2018
Test data release: 31 August 2018
Runs due: 28 September 2018
Working Notes paper due: 12 October 2018