Human-Robot Interaction Conversational User Enjoyment Scale (HRI CUES)

Bahar Irfan, Jura Miniota, Sofia Thunberg, Erik Lagerstedt, Sanna Kuoppamäki, Gabriel Skantze, André Pereira

Bahar Irfan, Jura Miniota, Gabriel Skantze, and André Pereira are with the Division of Speech, Music and Hearing at the KTH Royal Institute of Technology, 100 44 Stockholm, Sweden. E-mail: {birfan, jura, skantze, atap}@kth.se. Sofia Thunberg is with the Department of Computer and Information Science at Linköping University, 581 83 Linköping, Sweden. E-mail: sofia.thunberg@liu.se. Erik Lagerstedt is with the School of Informatics at the University of Skövde, 541 28 Skövde, Sweden. E-mail: erik.lagerstedt@his.se. Sanna Kuoppamäki is with the Division of Health Informatics and Logistics at the KTH Royal Institute of Technology, 141 57 Huddinge, Sweden. E-mail: sannaku@kth.se. This work was supported by KTH Digital Futures (Sweden) and the Swedish Research Council project 2021-05803.

Abstract

Understanding user enjoyment is crucial in human-robot interaction (HRI), as it can impact interaction quality and influence user acceptance and long-term engagement with robots, particularly in the context of conversations with social robots. However, current assessment methods rely solely on self-reported questionnaires, failing to capture interaction dynamics. This work introduces the Human-Robot Interaction Conversational User Enjoyment Scale (HRI CUES), a novel scale for assessing user enjoyment from an external perspective during conversations with a robot. Developed through rigorous evaluations and discussions of three annotators with relevant expertise, the scale provides a structured framework for assessing enjoyment in each conversation exchange (turn) alongside overall interaction levels. It aims to complement self-reported enjoyment from users and holds the potential for autonomously identifying user enjoyment in real-time HRI. The scale was validated on 25 older adults’ open-domain dialogue with a companion robot that was powered by a large language model for conversations, corresponding to 174 minutes of data, showing moderate to good alignment. The dataset (anonymized transcripts, annotation scores, and self-reported user perceptions) is available online at https://paperswithcode.com/dataset/hri-cues-dataset. Additionally, the study offers insights into understanding the nuances and challenges of assessing user enjoyment in robot interactions, and provides guidelines on applying the scale to other domains.

Index Terms:

User Enjoyment, Human-Robot Interaction, Metrics, Open-Domain Dialogue, Companion Robot, Annotation, Large Language Model, Dataset

I Introduction

User enjoyment, referring to the user’s subjective perception and experience of the enjoyment of interaction, is an important indicator of acceptance of robots and willingness to engage with them over time [1]. Particularly in the context of conversational agents or companion robots, where the primary goal often revolves around providing emotional support or companionship, enjoyment serves as a vital metric for evaluating the effectiveness of such systems. Therefore, developing reliable and efficient methods for measuring user enjoyment in Human-Robot Interaction (HRI) scenarios is essential for designing and improving future generations of robots.

Figure 1: Human-Robot Interaction Conversational User Enjoyment Scale (HRI CUES).

User enjoyment is closely linked to the intention to use robots, particularly among older adults [1]. Conversational companion robots are often developed to provide social or emotional support to older adults in a home or care home environment [2]. Prior studies in HRI explored older adults’ acceptance, use, and interaction with robots, showing that older adults often display challenges in interacting with a conversational agent, such as the lack of conversational responses and difficulties in hearing and understanding the voice interaction [3], consequently attributing a low level of social acceptance to the robot [4]. Designing robots with their individual needs and preferences in mind by involving them in the design process with participatory design research techniques, such as focus groups, interviews, and iterative developments based on their feedback, could potentially help alleviate these interaction challenges [5, 3, 6]. The recent introduction of Large Language Models (LLMs) has enabled the development of companion robots equipped with social capabilities, eliminating the need for Wizard of Oz, which was the common approach in conversational HRI studies (e.g., [3, 4]), and the inherent human influence that hinders the construction of robots capable of autonomously mitigating errors [7, 8]. Recent studies applied LLMs to conversational robots in various domains, including therapy [9], service [10], and care for older adults [11, 12], which demonstrate their potential and limitations in diverse contexts that lead to enjoyable or unpleasant experiences, further showing the importance of detecting user enjoyment during conversations with robots.

Sustaining enjoyment, especially in daily encounters such as those with companion robots, is a challenging task yet to be solved. User interest and engagement may fluctuate not only across day-to-day interactions, but also within the interaction itself, based on the robot’s performance in conversation flow, content, and contextual memory, which may affect user enjoyment. If user enjoyment can be detected autonomously in the conversation, preventive measures can be taken to improve the interaction where necessary, such as changing the conversation topic. However, despite the numerous studies investigating user engagement in HRI [13], the measurement of user enjoyment is limited to self-reports from users. Not only can self-reports be unreliable due to demand characteristics, self-presentation, or the Hawthorne effect arising from conformity to perceived norms or researcher expectations [14], but they also represent overall feedback on the interaction rather than an instantaneous measure throughout. While affect recognition systems can detect laughter and smiles [15], enjoyment is a complex feeling that can be conveyed through other multimodal cues (see Section IV-C). Even in interpersonal communication, enjoyment has been analyzed from an external perspective only within the context of marriage [16]. Thus, there is neither a scale nor an automatic system for assessing user enjoyment in conversations with a robot, the former being required to develop the latter.

This work contributes with the Human-Robot Interaction Conversational User Enjoyment Scale (HRI CUES), illustrated in Fig. 1, which is a novel scale for assessing enjoyment in conversations with robots from an external (third-party) perspective, deriving from videos of 28 older adults’ open-domain dialogue with a companion robot using an LLM. The scale is developed through rigorous annotator evaluations and discussions, and provides a structured framework for evaluating enjoyment in conversations with robots. The scale seeks to offer an additional means of assessing user enjoyment in HRI by considering fine-grained conversation exchange levels (i.e., turn-by-turn) and the overall interaction level, which complements self-reported enjoyment from users, with future potential use for autonomously identifying enjoyment in real-time HRI. In addition, by providing a detailed exploration of the instances of annotators’ concordance and divergence, based on turn-by-turn analysis of enjoyment, in addition to the underlying reasons for discrepancies between users’ self-reported enjoyment ratings based on metrics typically used in HRI studies, the study offers invaluable insights for understanding the nuances and challenges of assessing user enjoyment in interactions with robots. Deriving from these challenges, a step-by-step guideline is offered for future HRI researchers to adapt the user enjoyment scale to other application domains.

II Background and Related Work

II-A Defining enjoyment

A popular definition of enjoyment is being in the state of flow [17]. Flow is defined as the optimal experience, which provides a deep sense of enjoyment. It happens when an individual is fully engaged in a task that provides an optimal amount of challenge and engagement. Flow is characterized by a set of factors, such as a fading sense of ‘self’, a sense that duration is altered, and deep and effortless involvement in the task. The theory states that enjoyment is not obtained in a relaxed state, that it is necessary to be challenged, and links repeated experiences of flow to mastery of a skill.

Another theory derives from the flow theory to define true fun as the experience that occurs when a person is experiencing flow, playfulness, and connection all at the same time [18]. If one or two of the three components that constitute true fun are present, the experience will make a person feel joy or satisfaction, but not true fun. Similarly to what characterizes being in a state of flow, experiencing true fun is characterized by losing track of time, letting go, and being completely present in the moment, with the addition of laughter, feeling free, a sense of child-like excitement, and joy.

While both flow and true fun emphasize high engagement through optimal challenge, other theories recognize enjoyment in less intense states. For instance, flow-like states can be differentiated from the overall positive valence of an experience [19]. This relates to the circumplex model of emotion [20], which features arousal (low to high) on one axis and pleasure or valence (negative to positive) on the other axis [21]. Certain theories of enjoyment focus on high valence values that can contain lower-engagement positive emotions (e.g., content or calm), while others prioritize high-arousal states (e.g., excitement) [19]. The ‘happy’ emotion in the circumplex model reflects a balance of high arousal and positive valence. Our work aligns with the theory of true fun in seeking both higher levels of arousal and valence, while incorporating other elements for classifying lower levels of enjoyment. Casual, open-domain conversations between older adults and robots may involve aspects of both arousal and positive emotions. As such, both factors will be relevant to our holistic model of enjoyment.

II-B Assessing enjoyment

Enjoyment is generally evaluated through self-reported questionnaires tailored to the specific application domain. For instance, the Quality of Life Enjoyment and Satisfaction Questionnaire (Q-LES-Q) [22] and the Physical Activity Enjoyment Scale (PACES) [23] are used in healthcare applications. In Human-Computer Interaction (HCI) and HRI research, enjoyment is not often the primary focus for evaluating user perceptions, but is typically included as part of a more comprehensive model [24]. Enjoyment frequently appears as a self-reported measure, either as a construct within established models like the Unified Theory of Acceptance and Use of Technology (UTAUT) [25], or as single-item measures in custom questionnaires (e.g., “Did you feel fun?” [26], “Was playing with the robot enjoyable/not enjoyable?” [27]). Enjoyment was found to be highly correlated with ‘satisfying’, ‘entertaining’, ‘exciting’, ‘fun’, and ‘interesting’ in HRI [28]. The Technology Acceptance Model (TAM) [29] was also adapted to HRI by incorporating measures of affect and cognition to improve its accuracy in explaining technology adoption, and this adaptation included questions about perceived enjoyment [30].

User enjoyment has been shown to correlate with the intention to use a robot among older adults [1], highlighting its importance for long-term interactions. The Almere model [31] is an extended version of the UTAUT that is widely used in research on robots for older adults [32]. It incorporates enjoyment, social interaction, and social influence as factors mediating the acceptance and intention to use robots.

While user enjoyment is commonly measured through self-reporting in HRI, it has several limitations, such as conforming to perceived norms or researcher expectations or the (in)ability to recall the events and report correctly from memory [33, 14]. In addition, it is often desirable to estimate what a user is feeling by assessing it from an external perspective when self-reporting is not possible or the goal is to automate behavior at the dialogue exchange level during interactions in real-time. External assessments and self-reporting are not mutually exclusive and can instead complement each other. The results of one method can even be used to validate the other. While prior research used smiles and laughter for automatic classification of user enjoyment (within the context of story-telling) [15], these signals can be contradictory, ambiguous and highly context dependent [34, 35], and enjoyment is often expressed through other multimodal cues.

II-C Assessing enjoyment in conversations

User satisfaction, which correlates significantly with user enjoyment in conversations [36, 37], is typically evaluated in relation to a task [38, 39]. For instance, the Paradigm for Dialogue System Evaluation (PARADISE) [38] is a framework for evaluating user satisfaction in dialogue systems, based on self-reported satisfaction on a dialogue level and is influenced by other metrics, such as task success in travel booking and accessing emails. Similarly, interaction quality, which evaluates user satisfaction from an external perspective on the exchange level, was analyzed by three annotators in bus schedule inquiries with chatbots over phone calls from a data corpus of 200 dialogues and a lab study with 38 subjects [39]. An autonomous system was developed based on their ratings, which correlated highly with them, but not with users’ self-reported satisfaction scores, which was attributed to the subjectivity of the measure and variability in user perceptions.

User satisfaction has also been measured in the text domain with an annotation protocol similar to our study, based on a dataset of 1000 dialogues between 50 users and a chatbot on attentive listening and conversations about animals [40]. Two annotators were recruited and an annotator alignment session was conducted. The annotators were requested to go through the conversation exchanges once, without going back or looking at the history. The annotators used three metrics of user satisfaction: ‘smoothness of the conversation’, ‘closeness perceived by the user towards the system’, and ‘willingness to continue the conversation’, rated from 1 to 7. However, no agreement was found between the annotators, even after changing the granularity of the scale to two levels, low and high, showing the complexity of evaluating a subjective measure from an external perspective. The study focused on developing an algorithm to measure user satisfaction.

A similar study, also exploring user satisfaction, was conducted to develop a multimodal model based on annotated data [41]. The data corpus consisted of conversations of 60 participants with a virtual agent that was controlled through Wizard of Oz. The wizard annotated the user satisfaction from their perspective after each dialogue exchange, and the user and wizard both provided the satisfaction level of the user at the end of the conversation. In addition, five annotators were recruited to annotate the dialogues from an external perspective. All annotators gave each exchange in the conversation a score in the following metrics: topic continuance (1: ‘strongly change the topic’ to 7: ‘strongly continue the topic’), external sentiment (1: ‘the participants seemed bored with the dialogue’ to 7: ‘participants seemed to enjoy the dialogue’), and self-sentiment (1: ‘want to stop talking/confused about the system utterances’ to 7: ‘enjoy talking/satisfied with the talk’). Notably, the latter two metrics labeled the upper end of the scale as enjoyment. The annotators reached a good level of agreement on the exchange level. To annotate the dialogue level, a questionnaire was used with 18 items that were designed to represent three labels: ‘coordinateness’, ‘awkwardness’, and ‘friendliness’. Based on these criteria, the developed multimodal model outperformed the annotators in evaluating user satisfaction at the overall interaction level.

Our user enjoyment scale was developed based on the work of Reimnitz and Rauer [16], which is the only scale that specifically evaluates user enjoyment in conversations from an external perspective. The study assessed the enjoyment in conversations between 64 married couples and compared that to each spouse’s marital happiness. For measuring enjoyment, they developed a scale for observational coding that took into consideration affective signs and the tone of the interaction. The scale ranged from 1 (very low enjoyment) to 7 (very high enjoyment), with 3 as a neutral anchor. Two annotators took into account both affective signs (e.g., mutuality of the interaction, tone of voice, consistent mutual gaze, facial expressions, physical touching, body language) and the tone of the interaction (i.e., neutral, enthusiastic, and delightful) when rating enjoyment. The annotators had moderate to good agreement, measured with the intraclass correlation (ICC) on 20% of the interactions. The study found that couples who displayed high enjoyment in their conversations also reported having a happier marriage. This aligns with prior research in human relationships, which found that mutually enjoyable behavior leads to increased intimacy, trust, security, and satisfaction in long-term relationships [42], signaling that enjoyment could be highly influential in achieving long-term HRIs.
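For readers less familiar with the agreement statistic mentioned above, the following is a minimal sketch of how an intraclass correlation can be computed from a table of annotator ratings. The text does not state which ICC form was used in [16], so the two-way random-effects, absolute-agreement, single-rater form, often written ICC(2,1), is an assumption here, and the function name `icc_2_1` is hypothetical.

```python
from typing import Sequence


def icc_2_1(ratings: Sequence[Sequence[float]]) -> float:
    """ICC(2,1) for a complete table: rows are rated items, columns are raters.

    Assumption: two-way random effects, absolute agreement, single rater.
    """
    n = len(ratings)
    k = len(ratings[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a complete table with >= 2 items and >= 2 raters")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares (one observation per cell).
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    denom = ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    if denom == 0:
        raise ValueError("ratings have no variance; ICC is undefined")
    return (ms_rows - ms_err) / denom


# Illustrative (made-up) ratings: three items scored by two raters.
example = [[1, 2], [2, 3], [3, 4]]
agreement = icc_2_1(example)
```

Because this form measures absolute agreement, a rater who is consistently one point higher than another lowers the coefficient even though the rank order of the items is identical.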

Similar to this scale, our study analyzes dyadic conversations outside of task-oriented settings, but with a focus on evaluating enjoyment in conversations with autonomous robots. Although prior research described in this section evaluated autonomous conversational systems using user satisfaction metrics, which touch upon aspects of enjoyment, these metrics are often more focused on task-specific aspects. In contrast, our study emphasizes understanding enjoyment within daily (open-domain) conversations with robots, based on older adults’ interactions with a companion robot. To the best of our knowledge, no prior study measures enjoyment in open-domain conversations with robots from an external perspective that takes into account multimodal aspects of HRI. Our work seeks to bridge this gap by proposing a scale that not only emphasizes the importance and complexity of user enjoyment, but also provides a methodological framework to assess it from an external perspective in other application domains of HRI. The scale aims to provide an additional tool to evaluate enjoyment complementary to self-reported measures from users within conversation exchange and overall interaction levels, with the potential to be used for autonomous systems to adapt the conversations on the fly.

III Data Collection

This work aims to offer an additional tool for assessing user enjoyment in conversational HRI. Our research seeks to evaluate perceived enjoyment within conversation exchanges and interaction levels from an external perspective, to complement self-reported questionnaires and provide a holistic view of the interaction as it progresses. To achieve this, we embarked on developing a user enjoyment scale and validating it within the context of an HRI scenario. As outlined in Section I, achieving and maintaining user enjoyment is important for engaging users and encouraging continued interactions with robots, especially in daily encounters. This becomes particularly prominent for companion robots for older adults that aim to provide social and emotional support to mitigate loneliness in their daily lives. Having daily conversations spanning a wide range of topics, i.e., open-domain dialogue, plays a pivotal role in achieving this, as it enables more engaging and fulfilling conversations that cater to the user’s various emotional and cognitive needs. While this has been challenging to achieve in the past with conversational robots [43], recent advancements in LLMs enable these capabilities, which make them suitable architectures for companion robots. In addition, these robots should be tailored to older adults’ unique needs and preferences to meet their expectations and provide them with enjoyable conversations. Thus, it is crucial to iteratively involve older adults in the design process to both learn their needs and effectively assess the developed systems.

Thus, this work builds upon the data from our prior work on the participatory design development of an autonomous companion robot that integrates an LLM for conversations with older adults, as described in [12], to build the user enjoyment scale for conversational HRI and evaluate enjoyment. Initially, preliminary interviews were conducted with 6 Swedish-speaking older adults who talked to the robot individually for 4-13 min. Based on the feedback from the initial study, the robot architecture was improved to eliminate initial challenges. Following that, four design workshops were conducted with 28 older adults who talked to the robot for 7 min with surveys and interviews that followed. This section summarizes the robot architecture used and the study details, described in detail in [12].

III-A Robot Architecture

The Furhat robot was employed in the study, featuring a neutral-looking face that underwent user validation before interactions. The robot’s face engine incorporated smiles and eyebrow raises during conversations to enhance naturalness and provide non-verbal feedback to users without context analysis. To further refine the interactions, the robot incorporated subtle behaviors like blinking, eye shifts, and brief gaze aversion while speaking, based on silences in user input.

GPT-3.5 (text-davinci-003, OpenAI) was used for dialogue generation, as it was the most capable LLM at the time (March 2023). Prompting was used to give an empathetic persona to the robot, guiding it to ask open questions, listen actively with follow-up questions, and reflect on situations.

Initially, English was chosen as the communication language for speech recognition (Google Cloud Speech-to-Text), dialogue generation, and synthesis (Amazon Polly) due to the more extensive training data available for LLMs. However, an initial study with Swedish-speaking older adults showed the need for communicating in their native language [12]. Hence, Swedish was used in the follow-up study investigated in this paper. A USB microphone array (Seeed Studio) was used in both studies to obtain clear audio for speech recognition.

While the conversation with the robot was autonomous, a wizard interface was used to start the interaction with the user by entering the participant ID. The initial and final robot responses were pre-scripted to ensure that the interaction started and ended the same way for all participants. The robot started the interaction with “Hello! I am Furhat, the personalized companion robot. What is your name?” and ended it with “I would love to talk more another time, but for the sake of time, I need to say goodbye. Thank you for talking with me. Take care!” The rest of the interaction was fully autonomous based on the user’s responses and the responses generated by the LLM. The wizard interface was also used to end the interaction if necessary, i.e., if the participant wanted to end the conversation early or an error occurred in the system that required a restart to continue the conversation where it left off. A 7-minute timer was set (checked automatically after each user response), after which the robot would say its pre-scripted final response to ensure a fair comparison between users.
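The session flow described above (scripted greeting, autonomous exchanges, and a time limit checked after each user response) can be sketched as a simple loop. This is an illustrative sketch rather than the study's implementation: the `listen`, `respond`, and `speak` callables are hypothetical stand-ins for speech recognition, LLM dialogue generation, and speech synthesis, and the wizard's manual stop is not modeled.

```python
import time
from typing import Callable

# Pre-scripted responses quoted from the study description.
GREETING = ("Hello! I am Furhat, the personalized companion robot. "
            "What is your name?")
FAREWELL = ("I would love to talk more another time, but for the sake of time, "
            "I need to say goodbye. Thank you for talking with me. Take care!")
TIME_LIMIT_S = 7 * 60  # 7-minute limit


def run_session(
    listen: Callable[[], str],          # hypothetical: returns the user's utterance
    respond: Callable[[str], str],      # hypothetical: generates the robot's reply
    speak: Callable[[str], None],       # hypothetical: voices a robot utterance
    now: Callable[[], float] = time.monotonic,
    limit_s: float = TIME_LIMIT_S,
) -> int:
    """Run one conversation and return the number of user turns."""
    start = now()
    speak(GREETING)
    turns = 0
    while True:
        user_text = listen()
        turns += 1
        # The timer is checked only after a user response, so the robot never
        # interrupts the user mid-utterance.
        if now() - start >= limit_s:
            speak(FAREWELL)
            return turns
        speak(respond(user_text))
```

Checking the timer only at turn boundaries means sessions can run slightly past the nominal limit, which is consistent with interactions of roughly, rather than exactly, 7 minutes.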

III-B Data

To understand older adults’ perceived benefits and challenges in interaction with a conversational companion robot, and correspondingly develop a robot that meets their expectations, we conducted preliminary interviews with Swedish-speaking older adults aged 65 and over, in which each participant talked individually with the autonomous robot in English. The study had 6 (3 men, 3 women) Swedish-speaking healthy older adults, between 66 and 86 years old ($M=78.3$, $SD=8.3$). However, because this was the very first evaluation of an LLM on a social robot with older adults, the interactions had many failures, which are analyzed thoroughly in [12]. Interaction failures can lead to lower likeability and satisfaction [44] and cause negative tone and emotion in user responses [45]. Thus, to prevent biasing the enjoyment scale solely towards negative experiences, but also have an understanding of user reactions to frequent technical failures in current architectures, we chose the most successful interaction (as defined by the least number of failures and the longest length of interaction) from this study to be included as a basis of alignment for developing the enjoyment scale, as described in Sections IV-B and IV-C. The other criterion was that the interaction contained ‘highs and lows’, that is, the participant reacted positively (e.g., smile, laugh), neutrally, and negatively (e.g., frown, getting impatient) in the video to enable the annotators to understand the spectrum of responses. The corresponding subject (denoted as S0) was an 83-year-old male without prior experience of robots. The participant lived with their partner in their own home. The interaction lasted 13.5 min (53 turns). The video was recorded from a side angle, facing both the participant and the robot. The participant gave informed consent for recording, analysis of data, and anonymized (blurred and without a name) image and video sharing for publications.

Following the preliminary interviews, technical improvements were made for turn-taking, the robot’s persona, and architecture to overcome the interaction failures. Based on the feedback from the interviews, the interaction language was changed to Swedish. Subsequently, four participatory design workshops were conducted with 28 older adults having an autonomous open-domain conversation with the robot individually for approximately 7 minutes. Prior to the robot interactions, the robot’s capabilities were demonstrated through a researcher having a conversation with the robot (2 min), and focus group discussions were held using design scenarios of everyday activities to understand their expectations of companion robots. The researcher(s) were present in the room (to intervene if necessary) during the individual robot interactions. Following the interactions, the participants completed a 68-question Likert scale (1 to 5) questionnaire based on HRI and open-domain dialogue literature. Based on several studies described in Section II, user satisfaction, fun, and interestingness of the conversation were evaluated and categorized under the user enjoyment construct. To account for discomfort in the conversation [46], the strangeness of the conversation was also evaluated, similar to [47] (which used “Did you feel something strange in that dialogue with the robot?”; we adapted it slightly for the Likert scale), by reverse-coding it in analysis:

1. I was satisfied with my conversation with the robot.
2. It was fun talking to the robot.
3. The conversation with the robot was interesting.
4. It felt strange talking to the robot.
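The reverse-coding of the strangeness item can be illustrated with a short sketch. On a 1 to 5 Likert scale, a reverse-coded score is 6 minus the raw score, so that higher values consistently indicate more enjoyment. Averaging the four items into a single construct score is an assumption made for illustration, as are the item keys and function names; the text only states that the strangeness item was reverse-coded.

```python
from statistics import mean
from typing import Mapping

LIKERT_MIN, LIKERT_MAX = 1, 5

# Hypothetical keys for the four questionnaire items listed above.
ENJOYMENT_ITEMS = ("satisfied", "fun", "interesting", "strange")
REVERSED_ITEMS = {"strange"}  # negatively worded item


def reverse_code(score: int) -> int:
    """Flip a Likert score so that 1 <-> 5, 2 <-> 4, and 3 stays 3."""
    return LIKERT_MIN + LIKERT_MAX - score


def enjoyment_score(responses: Mapping[str, int]) -> float:
    """Average the four items after reverse-coding (averaging is an assumption)."""
    coded = []
    for item in ENJOYMENT_ITEMS:
        raw = responses[item]
        if not LIKERT_MIN <= raw <= LIKERT_MAX:
            raise ValueError(f"{item}: score {raw} is outside the Likert range")
        coded.append(reverse_code(raw) if item in REVERSED_ITEMS else raw)
    return mean(coded)


# Made-up example response, not taken from the dataset.
score = enjoyment_score({"satisfied": 4, "fun": 5, "interesting": 4, "strange": 2})
```

In this made-up example the strangeness rating of 2 becomes 4 after reverse-coding, so a participant who found the conversation only slightly strange contributes positively to the construct.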

All interactions with the robot were video-recorded by an external camera facing both the participant and the robot at a side angle, as well as through the robot camera to record the participant’s face. All participants gave informed consent for recording, analysis of data, and anonymized (blurred and without a name) image and video sharing for publications.

Participants were recruited by distributing the invitation through our university’s communication channels, social media, and platforms for gathering senior citizens. In total, 28 (13 men, 15 women) Swedish-speaking healthy older adults between 66 and 86 years old registered as volunteers. We divided this data for the purposes of this study.

Two of the subjects were selected for annotator alignment: a 69-year-old woman (denoted as S1) and a 75-year-old man (denoted as S2). S1 and S2 did not have any prior interaction with a robot. S1 and S2 lived with their partners in their own homes. The selection basis was to find interactions containing a range of ‘highs and lows’, as in the previous study, aiming to complement S0 with interactions with fewer failures. S1’s interaction lasted 7.5 min (27 turns) and S2’s interaction lasted 7.3 min (27 turns).

After the initial alignment of annotators with 3 videos to create HRI CUES (Section IV), the remaining participant interactions (except one due to lack of a side-video) were used for HRI CUES evaluation (Section V). The resulting data consisted of 25 participants’ (12 men, 13 women) interactions, with a mean age of 74.6 ($SD=5.8$). Twenty participants had no prior interaction with a robot, and only one had previously talked with a robot. Interaction duration was $M=7.4$ min ($SD=1.5$) with 12 to 29 turns. Each turn lasted 5 to 61 seconds ($M=17.7$, $SD=7.2$). The total duration of the videos was 174 min, corresponding to 590 turns.
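As a quick consistency check on the reported figures, the total video duration can be recovered from the number of turns and the mean turn length. This sketch assumes that the reported mean turn duration is computed over all 590 evaluation turns; the variable names are illustrative.

```python
# Reported figures for the evaluation set.
NUM_TURNS = 590
MEAN_TURN_S = 17.7
NUM_PARTICIPANTS = 25

# Total duration implied by the turn statistics, in minutes.
total_minutes = NUM_TURNS * MEAN_TURN_S / 60

# Average number of turns per participant.
turns_per_participant = NUM_TURNS / NUM_PARTICIPANTS

print(f"implied total duration: {total_minutes:.1f} min")
print(f"mean turns per participant: {turns_per_participant:.1f}")
```

The implied total rounds to the reported 174 min, and the mean number of turns per participant falls within the reported range of 12 to 29.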

We provide the dataset for HRI CUES that includes anonymized transcripts, annotation scores, and self-reported user perceptions online (HRI CUES Dataset: https://paperswithcode.com/dataset/hri-cues-dataset). Videos of the interactions are available upon request, contingent upon a signed agreement to maintain data confidentiality in accordance with the General Data Protection Regulation.

IV Assessing Enjoyment from Conversations

This work addresses the lack of user enjoyment analysis of conversation from an external perspective in HRI. We start from an existing enjoyment scale in human-human relations to develop HRI CUES, by complementing it with annotations of older adults’ interactions with a conversational companion robot. This section describes not only the scale proposed in this work, but also a final complete methodology to evaluate enjoyment from videos of conversations with robots. It also provides annotation guidelines and details the practices taken for establishing inter-rater reliability in annotations [48].

IV-A Annotator Selection

Due to the lack of a clear definition of user enjoyment and its subjectivity resulting in high variability in both user perceptions and understanding by a third party, the selection of the right experts as annotators is critical. This is important in general, especially in multidisciplinary fields like HRI, in particular when investigating complex concepts like enjoyment that have relatively different meanings in different academic traditions [49]. The annotators in this study should not only be able to detect and understand multimodal cues exhibited by the users to detect enjoyment, but also align well in their perceptions, such that this measure can be used by other researchers based on their understanding and recommendations. In addition, being well-versed in the literature of the user metric in question (user enjoyment) as well as its difference from similar metrics (e.g., user satisfaction) is necessary to ensure correlations with prior literature, as well as users’ reported perceptions of such metrics.

Familiarity with the target population (participant group) is also important in establishing a better understanding of their needs and reactions. Researchers whose backgrounds focus on HRI with the target population (e.g., older or young adults, children, people with disabilities) can put their interactions in context from social, cognitive, and ethnographic perspectives. Annotators also need to be thoroughly familiar with the socio-cultural background of the participants, as culture affects their perceptions of robots and their interactions [50]. In addition, understanding the nuances and culture-specific idioms (e.g., ‘cold turkey’) and proverbs (e.g., ‘bite the bullet’) in conversations will be easier for an annotator who is a native speaker.

While a combination of all these aspects is difficult to find in a single annotator, a group of annotators would be able to complement each other, such that during alignment and development of the scale, their horizons can be expanded by the perspectives of the others. While typically two annotators establish inter-rater reliability in qualitative analysis, employing three annotators could better suit the complexity of the task, allowing for tie-breaking and alignment across diverse backgrounds [51, 48]. Correspondingly, we selected three annotators ( $M_{age}=30$ , $SD=2.94$ ) who are researchers in the mid-late stages of their PhD, with a background in user enjoyment (Annotator 1, denoted as A1), HRI with older adults and cognitive science (A2), and multimodal HRI and cognitive science (A3). The annotators were native Swedish speakers and thoroughly familiar with Swedish culture.

IV-B Familiarizing with Data

As a starting point to familiarize annotators with the data and develop a user enjoyment scale for conversational HRI, the annotators were given a slightly adapted version of the enjoyment scale by Reimnitz and Rauer [16] on human-human conversations of married couples, in which ‘user’ was used instead of ‘couple’, and references that relate to couples (‘mutual enjoyment’, ‘affective sharing’, and ‘exuberance’) were removed. (Footnote 5, anchors of the original scale: 1 (very low): no evidence of pleasure. Pair never has fun or enjoys the interaction, although there may be joint interaction. There is no mutual enjoyment of positive affect or negative interaction. 3 (neutral anchor): there is occasional positivity that is not strong or frequently displayed and may be displayed by only one partner towards the other. Pair is doing OK together but without real joy or enthusiasm for their shared interactions. 7 (very high): the pair is very satisfied with the interaction and activity. The couple shows mutual enjoyment in their interaction marked with shared exuberance and/or delight. There is consistent visual regard coupled with affective sharing.) Instead of a 7-point Likert scale, which may be difficult to align given only the lowest, neutral anchor, and highest enjoyment values, a 5-point scale was used:

•

1 (very low): no evidence of pleasure. The user never has fun or enjoys the conversation, although there may be joint interaction.
•

3 (neutral anchor): there is occasional positivity that is not strong or frequently displayed. The user does not have real joy or enthusiasm for the conversation.
•

5 (very high): the user is very satisfied with the conversation. The user shows enjoyment in their conversation marked with enthusiasm and/or delight.

Annotators were encouraged to use the full scale (i.e., not abstaining from giving 1 or 5).

Due to the subjectivity of user enjoyment, it was necessary to establish common ground on the levels of user enjoyment before the annotators analyzed all the robot interactions individually [48]. Thus, three exemplar videos (S0, S1, and S2), as explained in Section III-B, were chosen because they contain a range of negative and positive responses from the user and a variety of technical failures.

In order to incrementally familiarize the annotators with the modalities that a front view (taken from the robot’s camera) and side view (external camera facing robot and participant) of robot interaction may introduce, the first exemplar video (S0) contained only the side view, the next one (S1) contained only the front view, and finally the third one (S2) contained both views, as shown in Fig. 2 and as used in the final annotations.

Conversational turns (exchanges) were chosen as the basis of annotations, because they were mostly similar in duration for participants, as well as for paving the way for understanding user enjoyment via autonomous systems to adapt and improve the interaction continuously. As such, the segments to be annotated were created automatically based on the turns (Robot-Participant pairs) in manually-corrected (for timing and content) transcripts. All videos started with the robot’s first phrase (greeting of the user). A turn ends (and a new turn starts) when the participant stops speaking, as that holds the potential to be detected by an automatic system for evaluating the turn that could be used to generate a new response.
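The segmentation rule above can be illustrated with a minimal sketch. It assumes the corrected transcript is available as a list of timestamped utterances. The `Utterance` and `Turn` types, the speaker labels, and the function name are hypothetical and do not describe the tooling used in this study.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    speaker: str  # "robot" or "participant" (hypothetical labels)
    start: float  # seconds from the start of the video
    end: float
    text: str


@dataclass
class Turn:
    index: int
    start: float
    end: float
    utterances: List[Utterance]


def _close_turn(index: int, utterances: List[Utterance]) -> Turn:
    # A turn spans from its first utterance to the latest end time in it.
    return Turn(index, utterances[0].start, max(u.end for u in utterances), utterances)


def segment_turns(utterances: List[Utterance]) -> List[Turn]:
    """Group utterances into Robot-Participant turns.

    A turn ends once the participant has stopped speaking, which is
    approximated here by the next robot utterance that follows
    participant speech.
    """
    turns: List[Turn] = []
    current: List[Utterance] = []
    participant_spoke = False
    for utt in sorted(utterances, key=lambda u: u.start):
        if utt.speaker == "robot" and participant_spoke:
            turns.append(_close_turn(len(turns) + 1, current))
            current, participant_spoke = [], False
        current.append(utt)
        if utt.speaker == "participant":
            participant_spoke = True
    if current:
        # Trailing utterances (e.g., a final robot phrase) form the last turn.
        turns.append(_close_turn(len(turns) + 1, current))
    return turns
```

In this sketch, consecutive participant utterances stay in the same turn, and a robot phrase with no participant reply still forms a turn of its own. Both are choices of the sketch and are not rules stated in the annotation protocol.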

Annotators were guided to apply the rating scale on a per-turn basis, assessing both the robot’s response and the participant’s subsequent input within each turn. Furthermore, they were tasked with delivering an overarching assessment of enjoyment encompassing the entire interaction, referred to as overall enjoyment. Within this context, annotators were encouraged to provide an in-depth rationale for their ratings, adopting an open-ended approach to offer comprehensive insights. They were asked to elaborate on the aspects and multimodal cues they considered in shaping their evaluations, along with the methodology they employed. Additionally, they were asked to report whether they encountered any challenges or difficulties while evaluating overall user enjoyment for the interaction and, if so, to describe them. The ratings for each turn were to be recorded within the ELAN file (Fig. 2), while a separate document was designated for annotators to record their overall interaction rating and provide open-ended responses.

IV-C Annotator Alignment

Based on their individual annotations of three robot interaction videos (S0-S2), annotators were asked to meet and align on what each level of the scale corresponds to, so that the remaining interaction videos could be analyzed more objectively and with a shared understanding. In addition, they were asked to discuss the aspects and multimodal cues used to give the corresponding scores, in a turn-by-turn fashion, as well as the overall user enjoyment.

To facilitate discussions, a list of aspects and multimodal cues from HRI [52, 53, 54, 55], HCI [54], and human-human interaction [56, 57, 58, 59, 16] literature was given to the annotators, which were previously used in affective computing, user engagement, user enjoyment, conversation, and turn-taking analysis, in addition to the principal researcher’s analysis of the challenges of applying LLMs into conversational robots [12]:

•

Facial expressions: smile, laughter, frown, rolling eyes, sigh, other expressions (e.g., smirk, squinting eyes, raising eyebrows). Emotion models [60, 20, 61] were described for further context.
•

Gaze: Mutual gaze, gaze length, gaze aversion, other gaze targets (e.g., objects, experimenter)
•

Body language: Gestures, gesture duration, gesture frequency, gesture intensity, posture, body orientation, head orientation, arm position (e.g., folded/ open), movement, physical contact, pointing, adaptors (e.g., touching hair, bouncing legs), nodding/ head shakes, proximity (distance to the robot)
•

Vocal features: Tone, pitch, pace, volume/ loudness, energy
•

Dialogue responses: Content, sentiment, length, mirroring, pauses in response, rephrasing/ clarifications, anthropomorphism, disengagement cues (responses that bring the conversation to a halt, e.g., “That is good to know”)
•

Conversation: Context, topic, topic initiation, topic closure, topic duration, tone (e.g., neutral, enthusiastic, and delightful), vocal fillers (e.g., “uh”, “erm”), conversation length, repairs (dealing with failures in interaction), referral to previous topics/parts in a conversation, willingness to talk about personal matters, asking questions about the conversation partner (robot), agreement/ disagreement
•

Turn-taking: Speaker dominance, willingness to take a turn, interruption, response time, backchannelling
•

Interacting with others: Interacting with the experimenter/ third party during the conversation with the robot

Annotators were encouraged to discuss whether they made use of these elements in their analysis, their usefulness and importance in assessing user enjoyment (even if these aspects were not present in the videos), including any other aspects/ cues they have previously used during familiarization.

Annotators were requested to systematically review the three videos, examining each conversation turn individually. They were prompted to assess various factors, including rating, cues, and aspects, while also considering the detectability of user enjoyment in each turn. Annotators were further instructed to identify contrasting and supporting arguments for their ratings, as the reasons behind the divergence between the annotators can be just as, if not more, valuable than the concordance between the annotators [62]. Upon completion of the turn analysis, annotators were guided to discuss their overall user enjoyment rating, identify which aspects, multimodal cues, and conversation turns contributed most significantly to their conclusions, and consider the relative importance (weights) assigned to these modalities in their assessments. Annotators were also encouraged to evaluate whether specific aspects or cues were observable from both the frontal view (the robot’s camera that directly captures participants) and the side view (the external camera that captures both entities) and reflect on how analyzing these different perspectives may have influenced their assessments. Annotators were advised to keep their ratings, both within ELAN and the accompanying document that justifies their scores, readily accessible on their screens as a reference point during the discussions.

Self-reported user perceptions were given to the annotators at this stage, as reported by questionnaire ratings in terms of the level of user satisfaction, fun, interestingness of the conversation, and strangeness of talking to the robot, as described in Section III-B. These metrics correlate highly with each other (Cronbach’s $\alpha=0.84$ ), with lower $\alpha$ when any of the metrics are excluded. The annotators were instructed to view the ratings after the overall user enjoyment in the interaction had been discussed for the corresponding participant. Based on the user perceptions, how these results correlate with their findings and the reasons behind discrepancies were discussed.
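For readers unfamiliar with the internal-consistency check mentioned above, the following is a minimal sketch of Cronbach's $\alpha$ and of the "$\alpha$ if item deleted" comparison. It uses the standard formula $\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_i \sigma_i^2}{\sigma_{total}^2}\right)$. The function names and data layout are assumptions made for illustration, and reversed items (such as strangeness) are assumed to be reverse-coded before the call.

```python
from statistics import variance
from typing import List, Sequence


def cronbach_alpha(items: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha; items[i][p] is the score of participant p on item i."""
    k = len(items)
    # Total score per participant, summed over all items.
    totals = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


def alpha_if_item_deleted(items: Sequence[Sequence[float]]) -> List[float]:
    """Alpha recomputed with each item left out in turn (needs at least 3 items)."""
    return [
        cronbach_alpha([scores for j, scores in enumerate(items) if j != i])
        for i in range(len(items))
    ]
```

If every value returned by `alpha_if_item_deleted` is lower than the full-scale $\alpha$, each item contributes to the consistency of the construct, which is the pattern reported above.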

Finally, the annotators were asked to develop a user enjoyment scale for conversational HRI that serves as both a guideline for their remaining annotations and a reference for future research on user enjoyment (Section IV-D).

IV-D Human-Robot Interaction Conversational User Enjoyment Scale (HRI CUES)

The discussions between the annotators (A1, A2, and A3) lasted four hours, during which they carefully went through the three example videos (S0-S2) that they had previously annotated and discussed the cases one by one. The annotators viewed every turn in the interaction several times and exchanged the reasons behind their ratings in terms of multimodal cues, until they all aligned on the rating for each turn. Simultaneously, they created a list of signs of enjoyment and dis-enjoyment that they had used during their enjoyment evaluation. Towards the end of the session, they settled on a 5-point scale based on the initially provided scale, ranging from very low enjoyment to very high enjoyment. The final user enjoyment scale, namely the Human-Robot Interaction Conversational User Enjoyment Scale (HRI CUES), is:

1

Very low enjoyment — Discomfort and/or frustration
2

Low enjoyment — Boredom or interaction failure
3

Neutral enjoyment — Politely keeping up the interaction
4

High enjoyment — Smooth and effortless interaction
5

Very high enjoyment — Immersion in the conversation and/or deeper connection with the robot

Two exchanges per rating of the scale from the alignment interactions are provided as illustrative examples in a video (HRI CUES exemplary exchanges: https://youtu.be/VmKvGM0pyec).

To rate an exchange higher on the user enjoyment scale (4 or 5), the annotators looked for signs of enjoyment, such as smirking, movement, flow of conversation (the topic is moving forward), no strain or discomfort, asking questions [to the robot], smooth turn-taking, dynamic tonality (and dynamic phrasing of sentences), being playful, sharing personal experiences [with the robot], sharing an understanding (common ground) [with the robot], and anthropomorphizing [the robot].

To rate an exchange lower on the scale (1 or 2), the annotators looked for signs of dis-enjoyment, which, for example, were low energy, sighing, tiredness, long breaths, restless movements (i.e., adaptors, such as moving in the chair from side to side or changing arm position), flat tonality, silence, awkward and negative facial expressions, flaring nostrils, disengagement cues (e.g., turning away from the robot, or responding in a way that disrupts the conversation flow, such as “That is true”), and topic closure (e.g., “Let’s talk about something else”). In addition, robot behaviors that disrupted the interaction flow, such as repeated questions, were considered to be strong causes of dis-enjoyment.

Neutral enjoyment (3) refers to a lack of these cues, in which conversation content (and context) becomes more relevant, such as having small talk or continuing the conversation without having much interest in the topic.

In cases where the exchange has cues from multiple scale levels, the annotators determined the dominant level in that interaction. This could be done by observing the intensity of the cues, the significance of the cues, or the interaction trajectory. On the other hand, if an annotator observed strong cues from two moderately or highly distinct levels (as opposed to subsequent levels), they would annotate using a level between those. For instance, as evident in this exchange (exchange with cues from multiple levels: https://youtu.be/2HA-_5B9JHs), when there is discomfort at the beginning (1), but the user continues to politely keep up the interaction (3), the exchange would be annotated as a 2, the mid-point between the levels.

There were also a few cues that were difficult to categorize as enjoyment or dis-enjoyment, and were therefore interpreted as context-dependent, such as gaze aversion, attention on the experimenter or camera, topic duration, and topic initiation. For instance, gaze aversion could be due to thinking, floor management, intimacy regulation (cf., [63]), or a reaction to something the robot said or did.

As general guidelines for annotating user enjoyment, it became clear that it was important to get acquainted with the participants, where different participants had different sets of signals. While watching the videos, the annotators learned each person’s rhythm and gestures for what was interpreted as a ‘baseline’ behavior from which the person could deviate during the interaction. This means that the same type of gesture (e.g., keeping one’s arms crossed) could be interpreted differently for different participants. Instead, an emphasis was placed on the change in behavior. It was also important to separate content from context, i.e., it is essential to be mindful of what is being said (conversation content, e.g., topic), but the focus should be more on the whole feeling of the exchange.

Interaction failure does not necessarily refer to a robot failure (e.g., incorrect response, speech recognition failure, turn-taking error, disengagement cue), since robot failures can lead to amusement, anthropomorphism, or empathy in the user, thereby increasing user enjoyment. Rather, interaction failure refers to a situation in which a failure by either the user (e.g., interrupting the robot) or the robot resulted in the interaction being disrupted, leading to low enjoyment.

When annotating the videos, the annotators assumed that in the future, robots would be able to judge the level of user enjoyment in real-time while having a conversation. Therefore, the videos were annotated segment by segment (turn by turn), with each segment being watched only once, similar to [40].

IV-E Annotation using HRI CUES

After establishing HRI CUES, all annotators independently rated the remaining 25 videos described in Section III-B. The same methodology was employed as in Section IV-B, with the only difference being the enjoyment scale, as HRI CUES was used instead of the initial scale. That is, the annotators rated 590 turns (174 min) using ELAN with both side and frontal view of the interaction (Fig. 2), viewing each turn only once, in addition to providing an overall enjoyment score per interaction. They also provided an explanation for their overall ratings and the challenges they faced during the annotation, as in Section IV-B. The annotation was conducted over 8 days. The results are reported in the next section.

V Results

Following the discussions involved in the annotator alignment that redefined the user enjoyment scale and methodology, annotators rated the remaining 25 videos of robot interactions from our participatory design workshop individually.

V-A Distribution of User Enjoyment

Fig. 3 shows how each annotator rated the interaction exchanges (turns). The interactions were mainly regarded as neutral in enjoyment ( $45.9$ %), with rare occurrences of very low ( $9.2$ %) and very high ( $13.9$ %) enjoyment, showing a near-Gaussian distribution of user enjoyment for each annotator. Fig. 8 (in the Appendix) shows the rating distributions of annotators per participant, which display a similar near-Gaussian distribution of perceived enjoyment, with some participants (e.g., P25, P27) perceived to have a more enjoyable interaction than others (e.g., P3).

V-B Rater Reliability

To evaluate the reliability of the annotators’ enjoyment ratings, we employed the Intraclass Correlation Coefficient (ICC), similar to [16], which is a statistical measure used to assess the reliability or consistency of ratings provided by multiple raters (or annotators) [51]. ICC values range from 0 to 1, with higher values indicating greater agreement among raters: an ICC value below 0.5 indicates poor reliability, between 0.5 and 0.75 moderate, between 0.75 and 0.9 good, and above 0.9 excellent reliability [51]. This study focuses on two specific forms of ICC:

•

ICC(2) - Single Random Raters: Designed for situations where each subject is rated by the same raters, and those raters are considered to be randomly selected from a larger population of possible raters.
•

ICC(2,k) - Average Random Raters: An extension of ICC(2), applied when the average ratings of k raters are considered, enhancing the reliability of the measurement.
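To make the two forms concrete, the following is a minimal sketch that computes both from a ratings matrix using the standard two-way random-effects (absolute agreement) mean-square formulas. It is an illustration under stated assumptions and not the analysis script of this study. The function names are hypothetical, rows are rated targets, columns are raters, and the handling of values that fall exactly on a threshold in `interpret` is an assumption.

```python
from typing import Sequence, Tuple


def icc2(ratings: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Return (ICC(2) for a single rater, ICC(2,k) for the average of k raters).

    ratings[i][j] is the score given to target i by rater j.
    """
    n = len(ratings)       # number of rated targets
    k = len(ratings[0])    # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares: targets, raters, and residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)                 # between-target mean square
    msc = ss_cols / (k - 1)                 # between-rater mean square
    mse = ss_error / ((n - 1) * (k - 1))    # residual mean square

    single = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    average = (msr - mse) / (msr + (msc - mse) / n)
    return single, average


def interpret(icc: float) -> str:
    """Map an ICC value to the reliability bands of [51].

    Assumption: a value exactly on a threshold is assigned to the higher band.
    """
    if icc < 0.5:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc < 0.9:
        return "good"
    return "excellent"
```

Applying `interpret` to the lower and upper bounds of a confidence interval gives range descriptions such as "poor to moderate", which is how the intervals are characterized in the results.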

Similarly to how our data was coded, we present rater reliability for each conversation exchange and the overall enjoyment score provided by each annotator for the interactions.

V-B1 Per Conversation Turn

The resulting annotations per conversation turn of 25 videos are shown in Fig. 9 in the Appendix. Treating each conversation turn as a repeated measures factor in the reliability analysis:

•

The ICC(2) was $0.47$ with 95% confidence interval ranging from $0.23$ to $0.69$ , indicating poor to moderate level of reliability. This was statistically significant ( $p<0.001$ ) with an $F$ -statistic of $3.83$ ( $df_{1}=24$ , $df_{2}=48$ ).
•

For the average ratings of all coders, the ICC(2,3) was $0.72$ with 95% confidence interval of $0.47$ to $0.87$ , suggesting a poor to good level of reliability, which was statistically significant ( $p<0.0001$ ). This was further supported by the same $F$ -statistic.

Based on the visual inspection of annotator ratings (Fig. 9), A1 was identified to diverge from A2 and A3. A1 was substantially more positive ( $M=3.31$ for turns) than the other annotators ( $A2:M=3.12$ , $A3:M=3.11$ ). To confirm this, we evaluated ICC with A1 excluded:

•

ICC(2) for single random raters rose substantially to $0.74$ , with 95% confidence interval ranging from $0.49$ to $0.88$ , indicating a much stronger reliability between A2 and A3. This result was statistically significant ( $p<0.001$ ) with an $F$ -statistic of $6.52$ ( $df_{1}=24$ , $df_{2}=24$ ).
•

When considering the average ratings of the remaining two annotators (k=2 in ICC), ICC(2,2) was an impressive $0.85$ , with 95% confidence interval of $0.66$ to $0.93$ , suggesting moderate to excellent reliability, which was statistically significant ( $p<0.0001$ ). This was further confirmed by the same $F$ -statistic.

These results confirmed our initial conclusion. In addition, removing A2 or A3 separately decreased ICC. Subsequent discussions with the annotators further confirmed a divergence in the ratings provided by A1 (Section VI-C2).

V-B2 Overall Enjoyment Score

The overall enjoyment scores per annotator are presented in Fig. 10 in the Appendix. Reliability among overall enjoyment scores was:

•

The ICC(2) for single random raters was found to be $0.48$ , with 95% confidence interval ranging from $0.24$ to $0.69$ , indicating poor to moderate level of reliability. This value was statistically significant ( $p<0.001$ ) with an $F$ -statistic of $3.74$ ( $df_{1}=24$ , $df_{2}=48$ ).
•

For the average ratings of the three annotators, ICC(2,3) was $0.73$ , with 95% confidence interval of $0.48$ to $0.87$ , suggesting a poor to good level of reliability, which was statistically significant ( $p<0.0001$ ) with $F$ -statistic of $3.74$ ( $df_{1}=24$ , $df_{2}=48$ ).

Similar to per-turn analysis, excluding the divergent annotator (A1) led to improved reliability:

•

ICC(2) increased to $0.58$ , with 95% confidence interval ranging from $0.25$ to $0.79$ , suggesting a higher consistency between two annotators. This was statistically significant ( $p<0.001$ ) with an $F$ -statistic of $3.74$ ( $df_{1}=24$ , $df_{2}=24$ ).
•

When considering the average ratings of the remaining two annotators, the ICC(2,2) for average random raters was $0.74$ , with 95% confidence interval of $0.4$ to $0.88$ , further indicating enhanced reliability. This was statistically significant ( $p<0.001$ ) by the same $F$ -statistic of $3.74$ ( $df_{1}=24$ , $df_{2}=24$ ).

These findings underscore the importance of selecting consistent raters and the benefit of averaging ratings across multiple annotators to achieve enhanced reliability in measuring user enjoyment in robot conversations.

V-C Correlations with Self-Reported User Perceptions

Users’ subjective ratings of enjoyment during the interactions were obtained from the questionnaire in the participatory design workshops after their interaction with the robot, in terms of user satisfaction, fun, interestingness, and strangeness of the conversation, as described in Section III-B. These ratings are presented in Fig. 10 (in the Appendix) along with the annotators’ overall enjoyment ratings per interaction. The items had high internal consistency (Cronbach’s alpha = $0.84$ ), with the removal of any item reducing the consistency of the construct. These self-reported scores and the average of these scores were compared against annotators’ overall enjoyment scores to evaluate how well the annotators could perceive the users’ enjoyment. Spearman correlation was used across four Likert scale items and the average of these scores, with 95% confidence interval (ranging from $0.71$ to $0.92$ ). The results were as follows:

•

Overall vs. User Average: Not statistically significant ( $p=0.08$ ) moderate positive correlation ( $r=0.36$ ),
•

Overall vs. User Satisfaction: Not statistically significant ( $p=0.06$ ) moderate positive correlation ( $r=0.39$ ),
•

Overall vs. User Fun Talking: Not statistically significant ( $p=0.21$ ) weak positive correlation ( $r=0.26$ ),
•

Overall vs. User Conversation Interesting: Not statistically significant ( $p=0.68$ ) very weak positive correlation ( $r=0.09$ ),
•

Overall vs. User Felt Strange (Reversed): Statistically significant ( $p=0.04$ ) moderate positive correlation ( $r=0.42$ ).
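As a hedged illustration of how such coefficients can be obtained and checked, the sketch below computes a Spearman correlation as the Pearson correlation of average ranks, and converts a coefficient into the usual $t$ approximation, $t = r\sqrt{(n-2)/(1-r^{2})}$. The function names are hypothetical, $n=25$ assumes that all 25 participants contributed to each correlation, and the exact significance test used in this study is not specified here.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def t_statistic(r: float, n: int) -> float:
    """t approximation for testing a correlation r from n pairs (df = n - 2)."""
    return r * math.sqrt((n - 2) / (1 - r * r))
```

Under this approximation with $n=25$, the two-sided 5% critical value for $df=23$ is about $2.07$. The coefficient $r=0.42$ gives $t\approx 2.22$ and exceeds it, while $r=0.39$ gives $t\approx 2.03$ and does not, which agrees with the significance pattern listed above.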

VI Discussion

VI-A Enhanced Reliability with Averaged Annotator Ratings

Although all annotators had a similar distribution of ratings, the reliability analysis revealed a marked distinction between ICC(2), which assesses the reliability of single random raters, and ICC(2,k), which considers the average ratings of multiple annotators. The latter consistently demonstrated higher reliability across both overall enjoyment scores and conversational turn ratings. This finding highlights the substantial benefit of collective annotator wisdom over individual assessments in assessing user enjoyment.

The superior reliability of ICC(2,k) highlights the inherent variability in subjective experiences and perceptions of enjoyment, suggesting that averaging across multiple annotators can effectively mitigate individual biases and variations in judgment, leading to a more reliable representation of true user enjoyment. This finding is critical, as it emphasizes the importance of incorporating multiple perspectives to achieve a more accurate and consistent evaluation of user enjoyment in conversational interactions with robots. Consequently, the distinction between ICC(2) and ICC(2,k) results not only validates the robustness of our user enjoyment scale but also illustrates the methodological importance of employing multiple annotators for capturing the complex and subjective nature of enjoyment in human-robot conversations.
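The relationship between the two coefficients can be made explicit. For the two-way random-effects forms used here, the average-rater value follows from the single-rater value through the Spearman-Brown relation $ICC(2,k) = \frac{k \cdot ICC(2)}{1 + (k-1) \cdot ICC(2)}$. The sketch below applies it to the rounded values reported in Section V-B. The function name is hypothetical.

```python
def average_rater_icc(icc_single: float, k: int) -> float:
    """Spearman-Brown step-up from single-rater ICC(2) to ICC(2,k)."""
    return k * icc_single / (1 + (k - 1) * icc_single)


# (single-rater ICC(2), number of raters k, reported ICC(2,k)) from Section V-B.
reported = [
    (0.47, 3, 0.72),  # per turn, all annotators
    (0.74, 2, 0.85),  # per turn, A1 excluded
    (0.48, 3, 0.73),  # overall score, all annotators
    (0.58, 2, 0.74),  # overall score, A1 excluded
]
for icc_single, k, icc_average in reported:
    print(k, icc_single, round(average_rater_icc(icc_single, k), 3), icc_average)
```

The computed values are approximately $0.73$, $0.85$, $0.73$, and $0.73$, within $0.01$ of the reported $0.72$, $0.85$, $0.73$, and $0.74$. The small differences are consistent with the inputs being rounded to two decimals. This also shows that the gain from averaging is a mathematical consequence of pooling raters, in addition to reflecting the mitigation of individual biases.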

VI-B Correlation with Self-rated Enjoyment Scale

To further validate our user enjoyment scale, we analyzed correlations between the annotators’ overall enjoyment scores and the participants’ subjective enjoyment ratings. Results revealed a statistically significant, moderate positive correlation between overall enjoyment scores and the (reversed) ‘felt strange’ item, indicating that higher enjoyment scores were associated with decreased feelings of awkwardness during the interaction. This aligns closely with our annotators’ discussions and the resulting enjoyment scale that classifies the presence of signs of discomfort as the scale’s lowest level.

However, the correlations between annotator scores and other user-reported measures (satisfaction, average enjoyment, fun in talking, and interest in conversation), although positive and ranging from very weak to moderate, did not reach statistical significance. These findings suggest a nuanced relationship between observed enjoyment and user self-reported experiences. While annotators can detect general levels of comfort and ease within interactions, capturing comprehensive internal subjective enjoyment may require additional data, such as initial expectations towards the robot, personality traits, as well as physiological responses, not accessible through direct observation. This discrepancy underscores the complexity of correlating observed behavior with subjective internal states and highlights the challenge of fully capturing user enjoyment in HRI. Despite these considerations, having an additional tool to evaluate user enjoyment in conversations is highly valuable for HRI research to assess and develop robots that are enjoyable to interact with, within all parts of the conversation, especially in daily life. This approach aligns with how humans naturally adapt their conversations in real-time based on the external cues they observe in others. The measure presented in this paper offers a valuable tool for collecting observational data to train autonomous enjoyment detection systems that can be used on robots or other agents.

VI-C Final Alignment Post Annotation

After analyzing the results, the annotators were asked to hold another set of discussions, which lasted four hours, based on Fig. 9 and 10 in the Appendix (presented without reliability or correlation scores). They were asked to pinpoint instances of rating divergence and concordance within their assessments. Particular emphasis was placed on identifying turns where annotators disagreed or agreed most strongly. Subsequently, the annotators were prompted to watch the corresponding video segments to explore the reasons behind their ratings and the rationale for their agreement or disagreement. The primary objective was to gain insight into potential major concordance (Section VI-C1) and divergence (Section VI-C2) in the way dialogue exchanges were annotated and elucidate the underlying reasons for these variations.

Following the turn-by-turn analysis, annotators were asked to identify the two most significant discrepancies (highest and lowest) between their ratings and self-reported participant perceptions (Section VI-C3). They were asked to engage in discussions exploring potential reasons underlying the disparities and similarities between their perceptions and those of the users from multiple aspects based on their expertise.

VI-C1 Concordance Between Annotators

In numerous instances across the videos, the annotators were in agreement. As an illustrative example, in the interaction of the ninth participant (P9, turn 13 in Fig. 4), the conversation exchange exhibited a seamless progression, and the participant’s enjoyment level was distinctly conveyed through expansive bodily gestures. Notably, all annotators assigned a rating of five for the exchange, since the participant threw themselves backward in the chair laughing. In the same video, the annotators all assigned a rating of two for the exchanges where the participant’s response was marked by sighing and a demeanor suggestive of resignation. This reaction occurred as a response to an unnecessary repetition initiated by the robot, specifically at turn 23.

In another example (P10, Fig. 4), the annotators assigned a rating of 1 to turn 16 to indicate that the participant openly expressed their negative thoughts due to the robot not making eye contact with the participant. During this interaction, the participant also attempted to establish contact with the experimenter. Subsequently, the participant made an effort to politely maintain a dialogue with the robot according to social norms, a behavior that garnered consensus among the annotators as being representative of a rating of 3.

In conclusion, the annotators agreed when the user enjoyment scale aligned clearly with participant behavior. However, in most cases, the interaction between the robot and the participant did not correspond as clearly or unambiguously to the user enjoyment scale. This is likely due to the complex and situational nature of the cues in the interaction, making it challenging to develop comprehensive yet precise guidelines for annotation. Instead, the general scale needs to be interpreted by the annotators for the particular use case to find anchor points that are appropriate for the specific context.

VI-C2 Divergence Between Annotators

Throughout the analysis of the 25 videos, there were instances where annotators differed substantially in their assessments. For instance, for P1 (Fig. 5), at turn 4, A1 assigned a rating of 5, while A2 scored it as 2, and A3 as 3. The participant’s laughter posed a challenge as it was perceived both as a sign of high enjoyment (by A1) and, conversely, as an expression of frustration towards the situation or the robot (by A2 and A3), rather than amusement with the robot. Furthermore, at turn 15, A1 assigned a rating of 3, while A2 rated it 1, and A3 as 2. In this context, the participant remained entirely silent, awaiting the robot to initiate further interaction. Annotators interpreted this silence differently, seeing it as politeness, boredom, or discomfort.

For another participant (P25, Fig. 5), during turn 8, A1 assigned a rating of 5, whereas A2 rated it as 1, and A3 as 2. The participant asked the robot to make more eye contact with them, which could be interpreted as a period of heightened immersion and anthropomorphism or criticism. Following this, the annotators consistently exhibited discord in their assessments until turn 17, when they reached a consensus once more. For instance, at turn 11, A1 marked it as a 5, A2 as 3, and A3 as 1. In this case, the participant expressed reservations about sharing personal information with the robot due to unfamiliarity, yet did so while smiling and posing a question to the robot in a playful tone. This complexity in the interaction exchange presented challenges for the annotators, as it encompassed a multitude of actions. While the verbal content suggested discomfort, the presence of laughter, smiling, and playful tonality indicated enjoyment. Consequently, the annotators encountered mixed signals, and the resulting ratings depended on which aspect of the interaction they prioritized.

The notable inconsistencies between A1’s ratings and those of the others led to the inclusion of reliability results for the more consistently aligned group of annotators (A2 and A3).
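Divergences of this kind can be surfaced systematically before an alignment discussion. The following is a minimal sketch and not part of the original analysis: it flags turns whose ratings span at least a chosen number of scale levels. The function name `flag_divergent_turns` and the default threshold of 3 levels are illustrative assumptions; the ratings are those reported above for P1 and P25.

```python
from typing import Dict, List, Tuple

# Ratings (A1, A2, A3) for the turns discussed above, keyed by (participant, turn).
RATINGS: Dict[Tuple[str, int], Tuple[int, int, int]] = {
    ("P1", 4): (5, 2, 3),
    ("P1", 15): (3, 1, 2),
    ("P25", 8): (5, 1, 2),
    ("P25", 11): (5, 3, 1),
}


def flag_divergent_turns(
    ratings: Dict[Tuple[str, int], Tuple[int, ...]],
    min_spread: int = 3,  # hypothetical threshold, to be agreed on by the annotators
) -> List[Tuple[Tuple[str, int], int]]:
    """Return (key, spread) pairs whose max-min rating spread is >= min_spread."""
    flagged = []
    for key, scores in ratings.items():
        spread = max(scores) - min(scores)
        if spread >= min_spread:
            flagged.append((key, spread))
    return flagged


if __name__ == "__main__":
    for (participant, turn), spread in flag_divergent_turns(RATINGS):
        print(f"{participant} turn {turn}: spread of {spread} levels")
```

Flagged turns are candidates for discussion rather than errors, since the spread only indicates that the annotators prioritized different cues.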

VI-C3 Similarities and Discrepancies Between User and Annotator Perceptions

While the annotators’ perceptions aligned well with those of a large proportion of the participants, there were substantial discrepancies for some participants. For instance, for P6 and P21 (see Fig. 6), the annotators and the participant interpreted user enjoyment in a similar way, with the overall interaction scored as 5 and 4, matching the average of the user-reported values. The conversations went well, and the participants seemed to take a playful approach in the interaction, which was reflected in both the participants’ and annotators’ scores.

The conversation context might cause a discrepancy between the annotators’ assessment and the users’ rating of their enjoyment. For instance, P3 (Fig. 7) talked about a controversial topic (UFOs) with the robot. The robot repeatedly questioned the participant (e.g., “Why do you think that?”, “Can you tell more about what gave you this insight?”) when they affirmed that they had observed the existence of UFOs. The repetitive questions could have been perceived as offensive or discomforting due to the nature of the topic and their stance towards it, even though this was a type of interaction failure (repetition of the same phrase) that occurred with other participants as well. The participant twice told the robot that they wanted to change the topic, and then turned to the experimenters to voice this desire (after 3.5 minutes), in addition to displaying cues of anxiety (e.g., playing with fingers, looking around at the camera and at the experimenters). These cues support the annotators’ belief that the participant had a negative experience, and the annotators accordingly rated the interaction low in enjoyment. The participant managed to change the topic on their own to talk about the weather, and later about the university and robots for the rest of the conversation. However, the participant gave high scores (see Fig. 7) in all aspects of enjoyment. The participant might have rated the experience as more enjoyable and interesting than the annotators due to researcher bias, i.e., that the participant felt the need to please the researcher [64]. Given the controversial topic discussed, the positive ratings can be interpreted as a strategy to avoid judgment from the researchers, as the participant frequently gazed at the experimenter during several exchanges, while displaying signs of enjoyment (e.g., smiles, thumbs up, nods) even after the topic change. Another reason could be the novelty effect, since they might have been happy to talk with a robot regardless of the negative experience.

Another participant (P13, Fig. 7) experienced the interaction as less enjoyable than the annotators interpreted. This might be because the participant experienced a high number of technical and social failures from the robot while still displaying signs of enjoyment. The participant used the robot for transactional requests rather than a casual conversation. The transactional nature of the conversation combined with failures might explain why the participant gave low enjoyment scores. The annotators gave a higher score because the participant seemed to forgive the failures, laugh them off, and continue the conversation smoothly. This is an important reminder that the annotators are not always assessing the same aspects as the participants in their self-assessment.
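Cases like P3 and P13 can be identified by comparing the annotators’ overall ratings with the self-reported scores for each participant. The sketch below is illustrative only: the function name `perception_gap` and the tolerance of one scale level are assumptions, and it presumes that both inputs are on the same 1 to 5 scale with reverse-coded questionnaire items already flipped.

```python
from statistics import mean
from typing import Sequence


def perception_gap(
    annotator_overall: Sequence[float],
    self_reported: Sequence[float],
    tolerance: float = 1.0,  # hypothetical threshold in scale levels
) -> str:
    """Compare the mean annotator rating with the mean self-reported score.

    Both inputs are assumed to share the same 1-5 scale, with any
    reverse-coded questionnaire item already flipped.
    """
    gap = mean(annotator_overall) - mean(self_reported)
    if gap > tolerance:
        return "annotators higher"  # the pattern described for P13
    if gap < -tolerance:
        return "annotators lower"   # the pattern described for P3
    return "aligned"                # the pattern described for P6 and P21
```

The returned label only points to interactions that merit a closer qualitative look; it does not indicate which perspective is correct.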

VI-D Adapting HRI CUES to Other Domains

A final round of annotator discussions (lasting 3 hours) was conducted to adapt the developed scale to other domains in HRI, extending its application beyond the context of companion robots for older adults to encompass other conversational contexts where analyzing enjoyment is crucial.

The process of using the Human-Robot Interaction Conversational User Enjoyment Scale (HRI CUES) is twofold. First, the scale is the primary supportive instrument for engaging with the dataset and reaching an agreement among annotators regarding which level of enjoyment an exchange represents. Second, because the assessment of enjoyment with the scale depends on culture and context, we recommend that the annotators reach an agreement on which multimodal cues are important in their study. In the previous section, we presented the relevant cues in our study, which was located in a Nordic Western setting, but these could differ in another setting. For example, gestures, such as thumbs up, mean different things in different parts of the world. Our proposed scale can already be directly applied in many settings where users engage with social robots (or potentially other agents) in conversations. However, in many other cases, we encourage further adapting the scale to the particular domain.

To replicate our methodology and adapt HRI CUES to another domain, the following framework should be applied:

1. Recruit three annotators with relevant and complementary backgrounds who are familiar with the specific culture and context of the study.
2. Establish the intended usage of the annotations such that the annotators can tailor their annotations to fit that use case and align their views on the practical meaning of enjoyment in that context.
3. Ask annotators to systematically annotate three example videos (from the dataset) using the HRI CUES. Example videos should exemplify different interaction outcomes from the dataset and be represented from various angles (front, side, both). Encourage the annotators to look for their own cues of enjoyment or dis-enjoyment, and describe the reasons behind their overall enjoyment scores based on those after viewing each interaction, and the corresponding challenges of detecting enjoyment.
4. Arrange a discussion between the annotators to identify contrasting and supportive arguments for multimodal cues associated with the scale, aiming to precisely determine the cues for each segment and strive for consensus for the corresponding rating. Give each segment sufficient time for discussions while avoiding getting stuck on small details. When faced with a difficult case, note what is not agreed on and move to the next segment.
5. Based on the discussion, construct an annotation schema, which should contain the cues that were agreed on for assessing the enjoyment in relation to the scale, especially emphasizing the cues that were discussed and not immediately agreed on.
6. Annotate the remaining dataset turn-by-turn using the HRI CUES and the multimodal cues. This is done by looking at each exchange once and annotating in real-time without going back in the data to not influence the evaluation of the beginning segments of the interaction by already knowing the end segments.
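The single-pass constraint in the final step can be enforced by the annotation tooling itself. The following is a minimal sketch of an append-only rating log, assuming a five-level scale (as suggested by the ratings of 1 to 5 reported in this paper); the class `SinglePassAnnotation` and its method names are hypothetical and do not describe the tool used in our study.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SinglePassAnnotation:
    """Append-only record of per-turn ratings on a 1-5 scale (hypothetical helper)."""

    participant: str
    ratings: List[int] = field(default_factory=list)

    def rate_next_turn(self, rating: int) -> int:
        """Store the rating for the next turn and return that turn's number (1-based)."""
        if rating not in (1, 2, 3, 4, 5):
            raise ValueError("rating must be an integer level from 1 to 5")
        self.ratings.append(rating)
        return len(self.ratings)

    def revise(self, turn: int, rating: int) -> None:
        """Refuse revisions so that later turns cannot influence earlier ratings."""
        raise PermissionError("earlier turns cannot be revised in a single-pass annotation")


if __name__ == "__main__":
    log = SinglePassAnnotation("P1")
    log.rate_next_turn(3)
    log.rate_next_turn(5)
    print(log.ratings)
```

Making the log append-only is one design choice among several; an alternative is to allow revisions but record them separately, so that the first-pass ratings remain available.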

In this paper, we evaluated HRI CUES through the interactions of older adults with a conversational companion robot using Furhat at the university premises. However, this was only one example of how the scale can be used; HRI CUES is generalizable to other contexts in which a user interacts with a social robot. Therefore, the second to fifth steps of the framework above are crucial for the annotators to adapt the communication cues to the context and setting of their study. HRI CUES does not require any adjustment as such, but it requires interpretation with respect to the context of the use case. The interpretation is facilitated by the six-step framework; hence, it is important to find appropriate annotators who are familiar with the particular application area.

VI-E Challenges and Limitations

In this work, we introduced a novel scale for annotators to evaluate user enjoyment in conversational HRI. However, enjoyment is a subjective measure, and thus, it is challenging to evaluate, to agree on between annotators, and to correlate with users’ perceptions, given the multitude of aspects connected to it. For instance, the context and length of the exchange affected how the enjoyment was perceived. When segments were overly brief, the exchange did not always contain sufficient information for a fair assessment. More frequently, however, excessively long segments were complex to analyze. For example, as described in Section VI-C2, the annotators often interpreted the exchange differently due to long segments that contained several cues belonging to separate levels of the scale. In these cases, a different assessment approach may be necessary as the longer segments introduced an additional factor for the annotators to consider: which part or aspect of the exchange to emphasize in the assessment. Because enjoyment is a complex concept, aligning on a definition of enjoyment that is relevant to the particular study is challenging, which highlights the necessity of pre-coding discussions among the annotators. One solution is to focus on the change of behavior within the exchange. For instance, if the robot’s response improved the user’s demeanor towards the robot, the exchange should be rated towards the level that contains higher enjoyment cues in the scale, and conversely if it had a negative impact. Other alternatives could be to use an aggregated score, the rating that corresponds to the majority of the segment, the most/least extreme rating, or the rating that corresponds to the first or last part of the segment. The choice of method for these situations should be determined during the annotation alignment process, taking into account the specific use case and domain.
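The alternative strategies for rating long exchanges can be made explicit so that annotators apply the same one. The sketch below is an illustration under stated assumptions, not a procedure used in this study: the function name `rate_long_exchange` and its method labels are hypothetical, "most/least extreme" is interpreted here as the highest or lowest level observed, and ties in the majority rule are broken towards the level that occurred first.

```python
from collections import Counter
from statistics import mean
from typing import List


def rate_long_exchange(cue_levels: List[int], method: str = "majority") -> float:
    """Collapse the scale levels of successive cues in one exchange into one rating.

    cue_levels lists, in temporal order, the scale level that each observed cue
    belongs to (e.g., [2, 2, 4, 5] for an exchange that starts low and ends high).
    """
    if not cue_levels:
        raise ValueError("at least one cue level is required")
    if method == "mean":      # aggregated score
        return mean(cue_levels)
    if method == "majority":  # level that covers most of the segment
        counts = Counter(cue_levels)
        top = max(counts.values())
        # Ties are broken towards the level that occurred first (an assumption).
        return next(level for level in cue_levels if counts[level] == top)
    if method == "highest":   # one reading of "most extreme"
        return max(cue_levels)
    if method == "lowest":    # one reading of "least extreme"
        return min(cue_levels)
    if method == "first":     # rating of the first part of the segment
        return cue_levels[0]
    if method == "last":      # rating of the last part of the segment
        return cue_levels[-1]
    raise ValueError(f"unknown method: {method}")


if __name__ == "__main__":
    example = [2, 2, 4, 5]  # made-up cue levels for illustration only
    for name in ("mean", "majority", "highest", "lowest", "first", "last"):
        print(name, rate_long_exchange(example, name))
```

For the made-up exchange above, the strategies yield ratings between 2 and 5, which illustrates why the choice should be fixed during the alignment process rather than left to each annotator.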

The differences between the annotators’ overall rating on user enjoyment and user perceptions (as discussed in Section VI-C3) might be due to the participant’s expectation of the robot’s social and technical level, while the annotators only look at the interaction itself. In addition, since this was the first interaction with a robot for most of the participants (20 out of 25), the ‘novelty effect’ might have changed their perceptions more positively or negatively, given that the duration (7 minutes) is not long enough to overcome it [65]. However, the variability observed by the annotators in user enjoyment states throughout the interaction (e.g., Fig. 5) shows that conversation context may alter the novelty effect, providing a more complete picture of the enjoyment throughout the interaction than a self-reported score at the end of the interaction. In addition, the users’ responses to the questionnaire may differ from their actual attitudes towards the robot [66]. These observations support the importance of HRI CUES as an additional tool to evaluate user enjoyment, providing means for real-time estimation of enjoyment in conversational agents.

Our scale is designed to serve as a tool for assessing perceived user enjoyment during interactions with robots, intended primarily for researchers in the field of HRI. However, given that open-domain dialogue may involve sensitive information that individuals may be hesitant to share with unfamiliar parties, it is imperative to obtain explicit consent from participants before they talk with the robot. In our studies, we ensured this consent prior to participants’ interactions with the robot, employing the use of the term ‘sentiment analysis’ in the consent form, and explained to all participants that their interactions would be analyzed in terms of their affective states (‘identification of feelings’) by both researchers and automated systems. While obtaining consent is essential for researchers utilizing our scale in future studies, it is crucial to recognize that this process may influence participants’ behavior and conversation topics, as they may be reluctant to share sensitive memories or be concerned about being judged by others. Consequently, this can result in a disconnection with the robot, posing challenges in achieving high levels of enjoyment during interactions. Nonetheless, this impact is likely to diminish over the course of the interaction or across multiple interactions, particularly in long-term settings. Additionally, researchers must exercise caution to uphold the privacy and confidentiality of participants when sharing data, ensuring that they remain unidentifiable in images and videos (as demonstrated in the examples provided for the scale), and removing any sensitive information. Moreover, researchers should remain mindful of their own biases and subjectivity, which may lead to variations in the interpretation of enjoyment compared to the participants’ experiences. 
User enjoyment is often context-specific, indicating that users’ behavioral and affective expressions are connected to specific socio-cultural contexts, including values, norms, and expectations of what is considered appropriate in certain situations [67]. These considerations underscore the significance of employing multiple annotators who are familiar with the socio-cultural background of the target population, and of using the scale to complement self-reported user perceptions to gain a deeper understanding of the interaction quality. As human-robot interactions continue to evolve, ensuring a deep understanding of user enjoyment not only elevates the quality of these interactions but also paves the way for more empathetic and meaningful connections.

VII Conclusion

Our research contributed a novel scale for measuring user enjoyment in conversations with a robot from an external perspective. The scale was developed through rigorous discussions of three annotators with complementary and relevant backgrounds to user enjoyment and the application domain. Older adults’ interactions with a companion robot were used as the basis for developing the scale, which was evaluated on 174 minutes of interactions of 25 participants. Inter-rater reliability analysis showed moderate to good alignment and underlined the importance of using multiple annotators. The disagreements arose from the complexity and subjectivity of user enjoyment, since a user can show various signs of enjoyment and dis-enjoyment within a single conversation exchange. The overall user enjoyment rated per interaction correlated significantly with users’ perceived level of strangeness of the conversation, which signifies that the (dis)comfort experienced in the interaction was correctly identified by the annotators, and shows the importance of including dis-enjoyment levels in the scale. These findings validate our user enjoyment scale and emphasize the critical role of methodological rigor in assessing subjective experiences within conversational robot interactions. Our study emphasizes the value of using multiple annotators and proposes potential scale refinements to further enhance consistency in quantifying the nuanced concept of enjoyment across application domains. The developed scale and the corresponding dataset aim to provide a tool for measuring user enjoyment from an external perspective to supplement self-reported user enjoyment responses in HRI research, with future potential application for autonomous detection of user enjoyment in real time in robots and agents, so that conversations can be adapted contingently to provide enjoyable and long-lasting interactions.

Acknowledgments

We would like to thank Aida Hosseini for manually correcting transcripts of robot interactions, and the study participants for their time and efforts.

References

[1] M. Heerink, B. Kröse, B. Wielinga, and V. Evers, “Enjoyment intention to use and actual use of a conversational robot by elderly people,” in Proceedings of the 3rd ACM/IEEE international conference on Human robot interaction, 2008, pp. 113–120.
[2] J. Abdi, A. Al-Hindawi, T. Ng, and M. P. Vizcaychipi, “Scoping review on the use of socially assistive robot technology in elderly care,” BMJ Open, vol. 8, pp. 1–20, 2018.
[3] S. Kuoppamäki, R. Jaberibraheem, M. Hellstrand, and D. McMillan, “Designing Multi-Modal Conversational Agents for the Kitchen with Older Adults: A Participatory Design Study,” International Journal of Social Robotics, Sep. 2023.
[4] S. Thunberg, M. Arnelid, and T. Ziemke, “Older adults’ perception of the furhat robot,” in Proceedings of 10th International Conference on Human-Agent Interaction (HAI22), 2022, pp. 4–12.
[5] H. R. Lee, S. Šabanović, W.-L. Chang, S. Nagata, J. Piatt, C. Bennett, and D. Hakken, “Steps Toward Participatory Design of Social Robots: Mutual Learning with Older Adults with Depression,” in Proceedings of the 2017 ACM/IEEE International Conference on Human-Robot Interaction. Vienna Austria: ACM, Mar. 2017, pp. 244–253.
[6] S. Šabanović, “Robots in society, society in robots,” International Journal of Social Robotics, vol. 2, no. 4, pp. 439–450, 2010.
[7] C. Breazeal, C. Kidd, A. Thomaz, G. Hoffman, and M. Berlin, “Effects of nonverbal communication on efficiency and robustness in human-robot teamwork,” in 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2005, pp. 708–713.
[8] L. D. Riek, “Wizard of oz studies in hri: a systematic review and new reporting guidelines,” J. Hum.-Robot Interact., vol. 1, pp. 119–136, 2012.
[9] Y. K. Lee, Y. Jung, G. Kang, and S. Hahn, “Developing social robots with empathetic non-verbal cues using large language models,” in 2023 32nd IEEE International Conference on Robot & Human Interactive Communication (RO-MAN), 2023.
[10] N. Cherakara, F. Varghese, S. Shabana, N. Nelson, A. Karukayil, R. Kulothungan, M. Farhan, B. Nesset, M. Moujahid, T. Dinkar, V. Rieser, and O. Lemon, “Furchat: An embodied conversational agent using llms, combining open and closed-domain dialogue with facial expressions,” in Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SigDIAL), 2023.
[11] W. Khoo, L.-J. Hsu, K. J. Amon, P. V. Chakilam, W.-C. Chen, Z. Kaufman, A. Lungu, H. Sato, E. Seliger, M. Swaminathan, K. M. Tsui, D. J. Crandall, and S. Šabanović, “Spill the tea: When robot conversation agents support well-being for older adults,” in Companion of the 2023 ACM/IEEE International Conference on Human-Robot Interaction. New York, NY, USA: ACM, 2023, pp. 178–182.
[12] B. Irfan, S.-M. Kuoppamäki, and G. Skantze, “Between reality and delusion: Challenges of applying large language models to companion robots for open-domain dialogues with older adults,” Autonomous Robots, 2023, preprint at https://doi.org/10.21203/rs.3.rs-2884789/v1.
[13] C. Oertel, G. Castellano, M. Chetouani, J. Nasir, M. Obaid, C. Pelachaud, and C. Peters, “Engagement in human-agent interaction: An overview,” Frontiers in Robotics and AI, vol. 7, p. 92, 2020.
[14] B. Irfan, J. Kennedy, S. Lemaignan, F. Papadopoulos, E. Senft, and T. Belpaeme, “Social psychology and human-robot interaction: An uneasy marriage,” in Companion of the 2018 ACM/IEEE International Conference on Human-Robot Interaction. ACM, 2018, pp. 13–20.
[15] F. Lingenfelser, J. Wagner, E. André, G. McKeown, and W. Curran, “An event driven fusion approach for enjoyment recognition in real-time,” in Proceedings of the 22nd ACM international conference on Multimedia, 2014, pp. 377–386.
[16] S. J. Reimnitz and A. J. Rauer, “Mutual enjoyment in older couples’ conversations and its links to marital satisfaction,” Personal Relationships, vol. 29, no. 2, pp. 332–349, 2022.
[17] M. Csikszentmihalyi, R. Larson et al., Flow and the foundations of positive psychology. Springer, 2014, vol. 10.
[18] C. Price, The Power of Fun: How to Feel Alive Again. Random House Publishing Group, 2021.
[19] E. D. Mekler, J. A. Bopp, A. N. Tuch, and K. Opwis, “A systematic review of quantitative studies on the enjoyment of digital entertainment games,” in Proceedings of the SIGCHI conference on human factors in computing systems, 2014, pp. 927–936.
[20] J. A. Russell, “A circumplex model of affect.” Journal of personality and social psychology, vol. 39, no. 6, p. 1161, 1980.
[21] P. Ekman, “An argument for basic emotions,” Cognition & emotion, vol. 6, no. 3-4, pp. 169–200, 1992.
[22] J. Endicott, J. Nee, W. Harrison, and R. Blumenthal, “Quality of life enjoyment and satisfaction questionnaire: a new measure.” Psychopharmacology bulletin, vol. 29, no. 2, pp. 321–326, 1993.
[23] D. Kendzierski and K. J. DeCarlo, “Physical activity enjoyment scale: Two validation studies.” Journal of sport & exercise psychology, vol. 13, no. 1, 1991.
[24] M. Kono and K. Araake, “Is it fun?: Understanding enjoyment in non-game hci research,” arXiv preprint arXiv:2209.02308, 2022.
[25] V. Venkatesh, J. Y. Thong, and X. Xu, “Consumer acceptance and use of information technology: extending the unified theory of acceptance and use of technology,” MIS quarterly, pp. 157–178, 2012.
[26] S. Nishimura, T. Nakamura, W. Sato, M. Kanbara, Y. Fujimoto, H. Kato, and N. Hagita, “Vocal synchrony of robots boosts positive affective empathy,” Applied Sciences, vol. 11, no. 6, p. 2502, Mar 2021.
[27] M. D. Cooney, T. Kanda, A. Alissandrakis, and H. Ishiguro, “Interaction design for an enjoyable play interaction with a small humanoid robot,” in 2011 11th IEEE-RAS International Conference on Humanoid Robots. IEEE, 2011, pp. 112–119.
[28] K. M. Lee, Y. Jung, J. Kim, and S. R. Kim, “Are physically embodied social agents better than disembodied social agents?: The effects of physical embodiment, tactile interaction, and people’s loneliness in human–robot interaction,” International Journal of Human-Computer Studies, vol. 64, no. 10, pp. 962–973, 2006.
[29] F. D. Davis, “Perceived usefulness, perceived ease of use, and user acceptance of information technology,” MIS Quarterly, vol. 13, no. 3, pp. 319–340, 1989.
[30] M. M. Van Pinxteren, R. W. Wetzels, J. Rüger, M. Pluymaekers, and M. Wetzels, “Trust in humanoid robots: implications for services marketing,” Journal of Services Marketing, vol. 33, no. 4, 2019.
[31] M. Heerink, B. Kröse, V. Evers, and B. Wielinga, “Assessing acceptance of assistive social agent technology by older adults: the almere model,” International Journal of Social Robotics, vol. 2, pp. 361–375, 2010.
[32] J. Piasek and K. Wieczorowska-Tobis, “Acceptance and long-term use of a social robot by elderly users in a domestic environment,” in 2018 11th International Conference on Human System Interaction (HSI), 2018, pp. 478–482.
[33] L. K. Fryer and D. L. Dinsmore, “The promise and pitfalls of self-report: Development, research design and analysis issues, and multiple methods,” Frontline Learning Research, vol. 8, no. 3, pp. 1–9, Mar. 2020.
[34] J. Ginzburg, E. Breitholtz, R. Cooper, J. Hough, and Y. Tian, “Understanding laughter,” in 20th Amsterdam Colloquium, 2015.
[35] M. Haakana, “Laughter and smiling: Notes on co-occurrences,” Journal of Pragmatics, vol. 42, no. 6, pp. 1499–1512, 2010.
[36] S. Lee and J. Choi, “Enhancing user experience with conversational agent for movie recommendation: Effects of self-disclosure and reciprocity,” International Journal of Human-Computer Studies, vol. 103, pp. 95–105, 2017.
[37] B. Lee and M. Y. Yi, “Understanding the empathetic reactivity of conversational agents: Measure development and validation,” International Journal of Human–Computer Interaction, vol. 0, no. 0, pp. 1–19, 2023.
[38] M. Walker, D. Litman, C. Kamm, and A. Abella, “Evaluating spoken dialogue agents with paradise: Two case studies,” Computer Speech & Language, vol. 12, no. 4, pp. 317–347, 1998.
[39] A. Schmitt and S. Ultes, “Interaction quality: Assessing the quality of ongoing spoken dialog interaction by experts—and how it relates to user satisfaction,” Speech Communication, vol. 74, pp. 12–36, 2015.
[40] R. Higashinaka, Y. Minami, K. Dohsaka, and T. Meguro, “Issues in predicting user satisfaction transitions in dialogues: Individual differences, evaluation criteria, and prediction models,” in International Workshop on Spoken Dialogue Systems Technology, 2010.
[41] W. Wei, S. Li, S. Okada, and K. Komatani, “Multimodal user satisfaction recognition for non-task oriented dialogue systems,” in Proceedings of the 2021 International Conference on Multimodal Interaction, ser. ICMI ’21. New York, NY, USA: ACM, 2021, pp. 586–594.
[42] L. Campbell, R. A. Martin, and J. R. Ward, “An observational study of humor use while resolving conflict in dating couples,” Personal Relationships, vol. 15, no. 1, pp. 41–55, 2008.
[43] K. Jokinen and G. Wilcock, “Chapter 1 - multimodal open-domain conversations with robotic platforms,” in Multimodal Behavior Analysis in the Wild, ser. Computer Vision and Pattern Recognition, X. Alameda-Pineda, E. Ricci, and N. Sebe, Eds. Academic Press, 2019, pp. 9–26.
[44] S. Honig and T. Oron-Gilad, “Understanding and resolving failures in human-robot interaction: Literature review and model development,” Frontiers in Psychology, vol. 9, 2018.
[45] D. Kontogiorgos, M. Tran, J. Gustafson, and M. Soleymani, “A systematic cross-corpus analysis of human reactions to robot conversational failures,” in Proceedings of the 2021 International Conference on Multimodal Interaction, ser. ICMI ’21. New York, NY, USA: Association for Computing Machinery, 2021, pp. 112–120.
[46] C. M. Carpinella, A. B. Wyman, M. A. Perez, and S. J. Stroessner, “The robotic social attributes scale (rosas): Development and validation,” in 2017 12th ACM/IEEE International Conference on Human-Robot Interaction (HRI), 2017, pp. 254–262.
[47] T. Iio, Y. Yoshikawa, M. Chiba, T. Asami, Y. Isoda, and H. Ishiguro, “Twin-robot dialogue system with robustness against speech recognition failure in human-robot dialogue with elderly people,” Applied Sciences, vol. 10, no. 4, 2020.
[48] C. O’Connor and H. Joffe, “Intercoder reliability in qualitative research: Debates and practical guidelines,” International Journal of Qualitative Methods, vol. 19, p. 1609406919899220, 2020.
[49] E. Lagerstedt and S. Thill, “Multiple roles of multimodality among interacting agents,” ACM Transactions on Human-Robot Interaction, vol. 12, no. 2, pp. 1–13, 2023.
[50] K. Haring, C. Mougenot, F. Ono, and K. Watanabe, “Cultural differences in perception and attitude towards robots,” International Journal of Affective Engineering, vol. 13, pp. 149–157, Oct. 2014.
[51] T. K. Koo and M. Y. Li, “A guideline of selecting and reporting intraclass correlation coefficients for reliability research,” J Chiropr Med, vol. 15, no. 2, pp. 155–163, Jun 2016.
[52] S. M. Anzalone, S. Boucenna, S. Ivaldi, and M. Chetouani, “Evaluating the engagement with social robots,” International Journal of Social Robotics, vol. 7, no. 4, pp. 465–478, 2015.
[53] M. F. Jung, “Affective grounding in human-robot interaction,” in 2017 12th ACM/IEEE International Conference on Human-Robot Interaction (HRI), 2017, pp. 263–273.
[54] G. Skantze, “Turn-taking in conversational systems and human-robot interaction: A review,” Computer Speech & Language, vol. 67, p. 101178, 2021.
[55] R. Stock-Homburg, “Survey of emotions in human–robot interactions: Perspectives from robotic psychology on 20 years of research,” International Journal of Social Robotics, vol. 14, no. 2, pp. 389–411, 2022.
[56] L. Mondada, “Challenges of multimodality: Language and the body in social interaction,” Journal of Sociolinguistics, vol. 20, no. 3, pp. 336–366, 2016.
[57] R. Clift, Conversation Analysis, ser. Cambridge Textbooks in Linguistics. Cambridge University Press, 2016.
[58] R. Shalihah, M. Rusijono, and A. Mariono, “The role of multimodal communication in language learning: Making meaning in conventional learning spaces,” in Proceedings of the International Conference on Language Phenomena in Multimodal Communication (KLUA 2018). Atlantis Press, Jul. 2018, pp. 230–233.
[59] M. Rasenberg, W. Pouw, A. Özyürek, and M. Dingemanse, “The multimodal nature of communicative efficiency in social interaction,” Scientific Reports, vol. 12, no. 1, p. 19111, 2022.
[60] A. Mehrabian, Basic Dimensions for a General Psychological Theory: Implications for Personality, Social, Environmental, and Developmental Studies. Cambridge: Oelgeschlager, Gunn & Hain, 1980.
[61] A. Ortony, G. L. Clore, and A. Collins, The Cognitive Structure of Emotions. Cambridge University Press, 1988.
[62] R. S. Barbour, “Checklists for improving rigour in qualitative research: a case of the tail wagging the dog?” BMJ, vol. 322, no. 7294, pp. 1115–1117, 2001.
[63] S. Andrist, X. Z. Tan, M. Gleicher, and B. Mutlu, “Conversational gaze aversion for humanlike robots,” in Proceedings of the 2014 ACM/IEEE International Conference on Human-Robot Interaction. New York, NY, USA: Association for Computing Machinery, 2014, pp. 25–32.
[64] H. Noble and J. Smith, “Issues of validity and reliability in qualitative research,” Evidence-based nursing, vol. 18, no. 2, pp. 34–35, 2015.
[65] C. Jost, B. Le Pévédic, T. Belpaeme, C. Bethel, D. Chrysostomou, N. Crook, M. Grandgeorge, and N. Mirnig, Eds., Human-Robot Interaction: Evaluation Methods and Their Standardization. Springer, 2020, vol. 12.
[66] M. Reimann, J. van de Graaf, N. van Gulik, S. van de Sanden, T. Verhagen, and K. Hindriks, “Social robots in the wild and the novelty effect,” in Social Robotics, A. A. Ali, J.-J. Cabibihan, N. Meskin, S. Rossi, W. Jiang, H. He, and S. S. Ge, Eds. Singapore: Springer Nature Singapore, 2024, pp. 38–48.
[67] G. A. Van Kleef, A. Cheshin, A. H. Fischer, and I. K. Schneider, “The social nature of emotions,” Frontiers in psychology, vol. 7, p. 896, 2016.

VIII Biography Section

Fig. 8 shows the distributions of the annotators’ HRI CUES ratings per participant, which show a similar trend resembling a normal distribution.

Fig. 9 shows the HRI CUES ratings of the annotators within each conversation turn per participant.

Fig. 10 shows the overall interaction ratings of the annotators per participant in comparison to the self-reported user perceptions for user satisfaction, fun, interestingness of the conversation, and strangeness of talking to the robot (which is reverse-coded).