Learning to segment human sequential behavior to detect the intention for interaction

Author: Lisa Scherf

Supervisors: Prof. Constantin Rothkopf, PhD., Susanne Trick

Submission: January 2021

Abstract:

Humans subconsciously recognize each other's intentions and react accordingly. They can easily decide if and when another person intends to engage in an interaction, but for assistive robots this remains a challenging task.

In order to achieve a more intuitive interaction between humans and robots, this study investigates how humans signal an intention-for-interaction and how these signals can be detected. A block building experiment is used to build a multimodal dataset of human-human interactions covering a variety of different interactions and the preceding intentions-for-interaction. The general task was to assemble a tower from building blocks following step-by-step instructions. The design of the experiment prompts the participants to initiate different kinds of interactions with the investigator. A malfunctioning pen or a subtask requiring help from the investigator are examples of situations leading to an intention-for-interaction.

Two supervised classifiers – a logistic regression model and a long short-term memory network – are trained on this manually labeled dataset and achieve good performance using a combination of orientation, eye-gaze, and acoustic features to predict the probability of an intention-for-interaction.
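To illustrate the first of the two classifiers, a per-frame logistic regression over multimodal features could look roughly like the following sketch. The file names, feature columns, and array shapes are assumptions for illustration, not the implementation used in the thesis.

```python
# Minimal sketch (not the thesis code): a logistic regression classifier that
# maps per-frame multimodal features to the probability of an
# intention-for-interaction. File names and feature columns are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X: one row per time frame, columns = orientation, eye-gaze, and acoustic
# features; y: 1 if the frame is labeled as an intention-for-interaction.
X_train = np.load("features_train.npy")   # hypothetical file names
y_train = np.load("labels_train.npy")
X_test = np.load("features_test.npy")     # frames of one held-out participant

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Predicted probability of an intention-for-interaction per frame, which is
# what Figure 1 plots against the ground-truth labels.
p_intention = model.predict_proba(X_test)[:, 1]
```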

Figure 1: Predicted probabilities for an intention-for-interaction and the corresponding ground-truth labels, obtained with a logistic regression model on unseen data of one participant. The model was trained on the data of all remaining participants (leave-one-participant-out).

Experimental validation shows that a multimodal feature set yields better precision and recall than any single feature alone. The results indicate that behavioral patterns signaling an intention-for-interaction generalize across different people and situational contexts.
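The kind of comparison behind this claim could be reproduced with a small per-modality evaluation loop. The sketch below uses hypothetical column indices, file names, and feature groupings; it is not the evaluation code of the thesis.

```python
# Hypothetical evaluation comparing single-modality feature sets against the
# full multimodal set; column indices and file names are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_train, y_train = np.load("features_train.npy"), np.load("labels_train.npy")
X_test, y_test = np.load("features_test.npy"), np.load("labels_test.npy")

feature_sets = {
    "orientation only": [0, 1],
    "eye-gaze only":    [2, 3],
    "acoustic only":    [4],
    "multimodal":       [0, 1, 2, 3, 4],
}
for name, cols in feature_sets.items():
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    clf.fit(X_train[:, cols], y_train)
    y_pred = clf.predict(X_test[:, cols])
    print(f"{name}: precision={precision_score(y_test, y_pred):.2f}, "
          f"recall={recall_score(y_test, y_pred):.2f}")
```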