
1 Introduction

With the practical use of conversational agents and robots on smartphones, the need for dialogue techniques that support free chatting with agents and robots is increasing. In particular, research on constructing dialogue agent systems for entertainment and counseling has been actively conducted in recent years, and attention has been paid to the development of actual services. We aim to realize a dialogue agent system that allows users to text-chat with conversational agents of existing anime characters that move naturally. Many people have one or more favorite anime characters when they are children, and many continue to dream of having a variety of daily conversations with these characters even as adults. However, to our knowledge, a dialogue system that reflects the personality of a real character has not yet been realized. We therefore propose a construction method for making such a dream possible in a realistic way.

As a first attempt to realize such a system, we develop a dialogue-agent system of an existing anime character that operates on the basis of text chat. For the system to generate the natural behaviors of a character, two elemental technologies are needed: verbal behavior (utterance) generation for responding to any user utterance and nonverbal behavior (body motion) generation for the system’s utterances. We tackled these two research problems to generate the utterances and body motions of an existing character with the system.

In the generation of verbal behavior, a major issue is generating utterance text that reflects the personality of an existing character in response to any user question. If the utterances do not properly reflect a character’s personality, the user may feel discomfort or quickly get bored without feeling that they are talking to the actual character. This could also break the personality that people expect of the character. Generally, a large amount of data that reflects the personality of a specific character must be collected. To generate perfectly correct utterances, an appropriate response would have to be created for every utterance that the user enters. However, collecting utterance data while keeping a personality consistent is costly. In addition, appropriate answers to a variety of user questions must be generated from a limited amount of data. Therefore, it is necessary to 1) efficiently collect high-quality data that accurately reflects the individuality of a character and 2) use the collected data to generate appropriate answers to various user questions.

For problem 1), we propose the use of a data collection method called “role play-based question-answering” [3], in which users play the roles of characters and answer the questions that other users ask those characters, to efficiently collect responses that accurately reflect the personality of a particular character for many questions. For problem 2), we propose a new utterance generation method that uses a neural translation model with the collected data.

Rich and natural expressions of body motion greatly enhance the appeal of agent systems. Therefore, generating agent body movements that are more human-like and expressive is an essential element in building an attractive agent system. However, not all existing anime characters move as naturally and as diversely as humans. Characters can also have unique movements, for example, a specific pose that appears with a character-specific utterance. Therefore, it is also important to reproduce unique movements to realize a system that reflects a character’s personality. It is known that there is a strong relationship between the content of an utterance and body motion, so the motion should be suited to the content. If a skilled creator created a motion that satisfies these two aspects for every utterance, the dialogue agent would probably be able to express motion that completely matches that of the character. However, in dialogue systems that generate a variety of utterances, this is not practical.

Therefore, in this context, we propose a motion generation method comprised of two parts: one that introduces movements that are more human-like and natural into dialogue agents of existing characters and one that introduces character-specific motions. First, we propose a method that can automatically generate whole-body motion from utterance text so that anime characters have human-like and natural movements. Second, in addition to these movements, we add a small amount of characteristic movement on a rule basis to reflect the character’s personality.

As a target for applying the proposed utterance and motion generation methods, we construct a dialogue system with “Ayase Aragaki,” a character in the Japanese light novel “Ore no Imoto ga Konna ni Kawaii Wake ga Nai,” which means “My Little Sister Can’t Be This Cute” in English. It is a popular light novel that has sold over five million copies in Japan and has been adapted into an anime. Ayase is not the main character, but she has an interesting personality type called “yandere.” According to Wikipedia, this means that she is mentally unstable, and once her mental state is disturbed, she acts out as an outlet for her emotions and behaves extremely violently. For this reason, she is a very popular character.

We constructed an agent text-chat dialogue system that reflects her personality by using the proposed construction method. We evaluated the usefulness of the implemented system from the viewpoint of whether her responses were good and whether her personality was reflected properly. As a result, both of our proposed methods for generating utterances and body motions were found to be useful for improving users’ impressions of the goodness of the system’s responses and of how well they reflect the anime character’s personality. This suggests that our proposed construction method will greatly contribute to realizing text-dialogue-agent systems of characters.

Fig. 1. Site screen for data collection using “role play-based question-answering”

2 Utterance Generation Method

2.1 Approach

Generally, when generating utterances that reflect personality in such a way that quality is guaranteed, the cost is high because large-scale utterance-pair data must be prepared manually in advance. In this research, we propose using a data collection method called “role play-based question-answering” [3] to efficiently construct high-quality utterance pairs reflecting the individuality of a character. In this method, multiple users participate online and either ask questions of a specific character or answer questions while playing that character, so that high-quality character-like utterance pairs can be collected efficiently. Specifically, a user has two roles. One is to ask the character a question (utterance) on any of a variety of topics; this question is shared with all users. The other is to become the character and answer such a question. Sharing the users’ questions and answering each of them makes it possible to efficiently collect utterance pairs of questions and answers. In addition, the role-playing experience itself is interesting, so no special incentives are needed to encourage users to participate [12]. By using this method, we thought that it would be possible to efficiently collect utterance-pair data that reflects an anime character.

Our proposed utterance generation method uses a neural translation model, one of the latest machine learning methods for text generation that has been attracting attention in recent years, to extract appropriate answers from the collected data.

2.2 Data Collection Using Role Play-Based Question-Answering

NICONICO Douga, a video streaming service, offers a channel service for fans of various characters. Use of a channel is limited to registered subscribers. In our research, a question-answering bulletin board linked to this channel service was built for the channel about “Ayase Aragaki”. Figure 1 shows a screenshot of the site. Users can freely ask Ayase Aragaki questions through a prepared text form. A user who wants to respond as Ayase Aragaki can freely answer a question. At the same time as answering, the user labels the emotion accompanying the utterance. There were eight classifications for the labeled emotions: normal, angry, fear, fun, sad, shy, surprise, and yandere.

To increase users’ motivation to participate, the website showed a ranking of users by number of posts. In addition, a “Like” button was placed next to each answer, which users pressed when an answer seemed Ayase Aragaki-like; in this way, the users’ evaluations were reflected in the quality of the answers. The website was opened in October 2017, and the service was operated for about 90 days. A total of 333 users participated. The collected utterance pairs exceeded 10,000 in about 20 days, and finally, 15,112 utterance pairs were obtained by the end of the service. Users participated in this response site voluntarily and were not paid. Nevertheless, the fact that we were able to collect such a large amount of data suggests that data collection using the role play-based question-answering method is useful.
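To make the form of the collected data concrete, the following is a minimal sketch of one collected record and a simple quality filter. The field names, the class UtterancePair, and the min_likes threshold are illustrative assumptions, not the actual schema of the site.

```python
from dataclasses import dataclass

# The eight emotion labels attached to each answer (listed above).
EMOTIONS = {"normal", "angry", "fear", "fun", "sad", "shy", "surprise", "yandere"}

@dataclass
class UtterancePair:
    question: str   # question asked to Ayase by a user
    answer: str     # answer written by a user playing Ayase
    emotion: str    # one of EMOTIONS, labeled by the answering user
    likes: int = 0  # number of "Like"s given by other users

def well_received(pairs, min_likes=1):
    """Keep pairs whose answers received at least min_likes 'Like's
    (the threshold is illustrative, not a value used in the paper)."""
    return [p for p in pairs if p.likes >= min_likes and p.emotion in EMOTIONS]
```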

Table 1. Results of user evaluation of experience with role play-based question-answering service

A questionnaire evaluation was conducted for participating users in order to determine their satisfaction with using the question-answering site. A total of 36 users cooperated in the evaluation and responded to the items shown in Table 1 on the basis of a five-point Likert scale (1–5 points). Table 1 shows the average of the user evaluation values.

Looking at the results, a high rating of 4.08 was obtained for the item “Did you use the website comfortably?” This suggests that the experience of using the service on our website was comfortable for users. High ratings of 4.53 and 4.56 were also obtained for the items “Did you enjoy the role play-based question-answering?” and “Do you want to experience this web service again?” These suggest that the service was attractive to the users.

Next, to evaluate the quality of the collected utterance data of Ayase, a subjective evaluation was performed by the participating users. For about 50 utterance pairs selected randomly from all 15,112 collected utterance pairs, participating users evaluated whether the responses were natural and properly reflected her personality. The mean scores of naturalness and personality were 3.61 and 3.74 on a five-point Likert scale (1 to 5 points). This indicates that the quality of the response-utterance data collected through role play-based question-answering was reasonably high. However, it was surprising that it was difficult to obtain a rating of 4.0 or more even though the response data was created by human users. In other words, this suggests that utterance generation that reflects the individuality of a particular character is a difficult task even for humans.

2.3 Proposed Utterance Generation Technique

We thus propose an utterance generation method that uses the collected utterance-pair data. Since the amount of data collected was not large enough to train an utterance generation model using neural networks [17], we took the approach of extracting optimal responses from the obtained utterance data. In other words, we addressed the problem of selecting the response of the utterance pair most relevant to a user’s utterance. In this study, a neural translation model was used to select an appropriate utterance pair. Simply using the results obtained from LUCENE, a popular open-source search engine, to match words against user utterances was not sufficient, so we developed a new method. This method builds on recent advances in cross-lingual question answering (CLQA) [8] and neural dialogue models [17]. In addition, we matched the questions at the semantic and intention levels so that appropriate answer candidates were ranked higher. The procedure is as follows.

  1.

    Given question Q as input, LUCENE searches the top N question-response pairs \((Q'_1,A'_1), \dots , (Q'_N,A'_N)\) from our dataset.

  2.

    For Q and \(Q'\), question type determination and named entity extraction are performed using Sekine’s extended named entity hierarchy [16]. The extent to which named entities of the type asked for by Q are included in \(A'\) is calculated, and this is used as the question type match score (qtype_match_score).

  3.

    Using the focus extraction module, focus (noun phrase indicating topic) is extracted from Q and \(Q'\). If the focus of Q is included in \(Q'\), the focus score (center-word_score) is set to 1; otherwise, it is set to 0.

  4.

    The translation model calculates the probability that \(A'\) is generated from Q, that is, \(p(A'|Q)\). We also calculate \(p(Q|A')\) as the reverse translation probability. Such reverse translation has been validated in CLQA [8]. The generation probability is normalized by the number of words on the output side. Since it is difficult to integrate a probability value with other scores due to differences in range, we rank the answer candidates on the basis of each probability and convert the ranks into a translation score (translation_score) and a reverse translation score (reverse_translation_score). Specifically, when the rank of a certain answer candidate is r, the score is obtained as follows.

    $$\begin{aligned} 1.0 - (r-1)/\mathrm {max\_rank} \end{aligned}$$
    (1)

    Here, max_rank is the maximum number of answer candidates. The translation model was trained by pre-training on about 500,000 general question-response pairs and then fine-tuning with the utterance pairs obtained from the role play-based question-answering. For the reverse model, the same procedure was performed with the questions and responses exchanged. For the training, the OpenNMT toolkit was used with the default settings.

  5.

    The similarity between Q and \(Q'\) is measured by a semantic similarity model using word2vec [13]. For each of Q and \(Q'\), the word vectors are averaged, and the cosine similarity between the two average vectors is calculated and used as the similarity score (\(semantic\_similarity\_score\)).

  6.

    The above scores are combined in a weighted sum to obtain a final score.

    $$\begin{aligned} final\_score =\,&w_1 \cdot search\_score + w_2 \cdot qtype\_match\_score + w_3 \cdot center\text {-}word\_score \\&+ w_4 \cdot translation\_score + w_5 \cdot reverse\_translation\_score + w_6 \cdot semantic\_similarity\_score \end{aligned}$$
    (2)

    Here, \(search\_score\) is a score obtained from the ranking of the search results by LUCENE, computed with expression (1). \(w_1, \dots , w_6\) are weights, each set to 1.0 in this study.

  7.

    On the basis of the above score, the answer candidates are ranked, and the top items are output.

    With this combination of various types of language processing, the most appropriate answer sentence for a question is selected.
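To make the above procedure concrete, the following is a minimal Python sketch of the scoring pipeline. The wrapper objects (retriever, qtype, focus, trans, rev, w2v), their method names, and the data layout are our own illustrative assumptions standing in for LUCENE, the question-type and focus extractors, the OpenNMT models, and word2vec; only the six-score combination and the rank-to-score conversion of Eq. (1) follow the description above.

```python
from dataclasses import dataclass, field

@dataclass
class Candidate:
    question: str                              # Q' from the collected utterance pairs
    answer: str                                # A' paired with Q'
    scores: dict = field(default_factory=dict)

def rank_to_score(rank, max_rank):
    """Eq. (1): convert a 1-based rank into a score in (0, 1]."""
    return 1.0 - (rank - 1) / max_rank

def rank_by(cands, key_fn):
    """Map candidate id -> 1-based rank when sorting by key_fn, descending."""
    ordered = sorted(cands, key=key_fn, reverse=True)
    return {id(c): r for r, c in enumerate(ordered, start=1)}

def select_response(q, retriever, qtype, focus, trans, rev, w2v, n=20, weights=None):
    """Return the best answer A' for user question q (hypothetical component APIs)."""
    w = weights or [1.0] * 6                   # w_1 .. w_6, all 1.0 in the paper
    cands = [Candidate(qq, aa) for qq, aa in retriever.search(q, n)]
    if not cands:
        return None
    q_focus = focus.extract(q)

    for search_rank, c in enumerate(cands, start=1):
        c.scores["search"] = rank_to_score(search_rank, len(cands))      # search_score
        c.scores["qtype"] = qtype.match_score(q, c.answer)               # qtype_match_score
        c.scores["focus"] = 1.0 if q_focus and q_focus in c.question else 0.0
        c.scores["semsim"] = w2v.avg_vector_cosine(q, c.question)        # semantic_similarity_score
        # Length-normalized generation log-probabilities from the forward and reverse models.
        c.scores["p_fwd"] = trans.logprob(src=q, tgt=c.answer) / max(len(c.answer.split()), 1)
        c.scores["p_rev"] = rev.logprob(src=c.answer, tgt=q) / max(len(q.split()), 1)

    # Probabilities are hard to mix with the other scores directly, so their
    # ranks are converted into translation scores with Eq. (1), as in step 4.
    fwd_rank = rank_by(cands, lambda c: c.scores["p_fwd"])
    rev_rank = rank_by(cands, lambda c: c.scores["p_rev"])
    for c in cands:
        c.scores["trans"] = rank_to_score(fwd_rank[id(c)], len(cands))
        c.scores["rev_trans"] = rank_to_score(rev_rank[id(c)], len(cands))
        keys = ["search", "qtype", "focus", "trans", "rev_trans", "semsim"]
        c.scores["final"] = sum(wi * c.scores[k] for wi, k in zip(w, keys))  # Eq. (2)

    return max(cands, key=lambda c: c.scores["final"]).answer
```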

3 Motion Generation Method

3.1 Approach

It has been shown that giving appropriate body movements to agents and humanoid robots not only improves their natural appearance but also promotes conversation. For example, actions accompanying utterances have the effect of enhancing the persuasiveness of the utterances and making it easier for the other party to understand their content [9]. Therefore, generating agent body movements that are more human-like and expressive is an essential element in building an attractive agent system. As mentioned above, however, not all existing anime characters move as naturally and as diversely as humans. Additionally, characters can have unique movements, for example, a special pose that appears with a character-specific utterance. Therefore, it is also important to reproduce unique movements so that the system reflects a character’s personality. Since there is a strong relationship between utterance content and body motion, motion should be suited to the content. If a skilled creator were to create a motion that satisfies these two aspects for every utterance, the dialogue agent would probably be able to express motion that perfectly expresses the character. However, for dialogue systems that generate a variety of utterances, this is not practical.

Therefore, for practical motion generation, we propose a motion generation method comprised of two parts: one that introduces movements that are more human-like and natural and one that introduces character-specific motions. First, we propose a method that can automatically generate whole-body motion from utterance text so that anime characters can make human-like and natural movements. Second, in addition, we add a small amount of characteristic movement on a rule basis to reflect personality.

The proposed motion generation method makes the motion of anime characters more natural and human-like. In a text dialogue system, linguistic information obtained from system utterances can be used as input to generate motions. In past research on motion generation using linguistic information, we mainly worked on the generation of a small number of motions from word information, such as the presence or absence of nodding and limited hand gestures [5,6,7]. In this research, we tried to generate more comprehensive whole-body movements by using various types of linguistic information. As a specific approach, we constructed a corpus containing spoken linguistic information and motion information obtained during human dialogue, learned the co-occurrence relationships between them using machine learning, and generated motion using spoken linguistic information as input. In the next section, the construction of the corpus data, the motion generation method, and its performance are described. Then, we introduce a way to add specific body motions that reflect an anime character’s personality.

3.2 Collecting Data for Motion Generation

A linguistic and non-linguistic multi-modal corpus, including spoken language and accompanying body movement data, was constructed for two-party dialogues. The participants were Japanese men and women in their 20s–50s who had never met. There were 24 participants in total (12 pairs). Participants sat facing each other. To collect a large amount of data on various actions, such as the nodding and hand gestures that accompany utterances, the participants engaged in chats, discussions, and dialogues in which animated content was explained. In the explanation dialogues, each participant watched a different episode of an animation (Tom & Jerry) and explained its content to their conversation partner. The conversation partner was free to ask the presenter questions and to converse freely. For recording utterances, a directional pin microphone attached to each participant’s chest was used. A video camera was used to record the overall dialogue situation and the participants. The video was recorded at 30 Hz.

The total time of the chats, discussions, and explanations was set to 10 min each, and in this study, the data of the first 5 min were used. For each pair, one chat dialogue, one discussion dialogue, and two explanation sessions were conducted. Therefore, we collected 20 min of conversation data for each pair, and we collected a total of 240 min of conversation data for 12 pairs.

Next, we show the acquired linguistic and non-linguistic data.

  • Utterance: The utterances were manually transcribed from the audio, their content was checked, and they were segmented into sentences. Each sentence was then divided into phrases (clauses) by using a dependency analysis engine [4]. The number of divided segments was 11,877.

  • Face direction: Using the face image processing tool OpenFace [15], three-dimensional face orientation information was extracted from video recorded from the front of each participant, and the yaw, roll, and pitch angles were obtained. Each angle was classified as micro when it was 10\(^{\circ }\) or less, small when it was 20\(^{\circ }\) or less, medium when it was 30\(^{\circ }\) or less, and large when it was 45\(^{\circ }\) or more.

  • Nodding: Sections in the video where nodding occurred were manually labeled. Continuous nods were treated as one nod event, and the number of nods was classified into five stages from 1 to 5 (or more). For the depth of a nod, OpenFace was used to calculate the difference between the head pitch angle at the start of the nod and the angle at the deepest point of rotation. The angle was classified as micro when it was 10\(^{\circ }\) or less, small when it was 20\(^{\circ }\) or less, medium when it was 30\(^{\circ }\) or less, and large when it was 45\(^{\circ }\) or more.

  • Hand gesture: Sections in the video where hand gestures occurred were manually labeled. A series of hand-gesture motions was classified into the following four states.

    • Prep: The hand is raised from the home position to begin a gesture

    • Hold: The hand is held in the air (waiting time until the stroke starts)

    • Stroke: The gesture itself is performed

    • Return: The hand returns to the home position

    However, in this study, for simplicity, a series of actions from Prep to Return was treated as one gesture event. Furthermore, the hand gestures were classified into the following eight types based on the classification of hand gestures by McNeill [10].

    • Iconic: Gestures used to describe scene descriptions and actions.

    • Metaphoric: Like Iconic, this is a pictorial and graphic gesture, but the specified content is an abstract matter or concept. For example, the flow of time.

    • Beat: A gesture in which the hand is shaken or waved in time with the utterance to adjust the rhythm of speech and emphasize it.

    • Deictic: A gesture that points directly to a direction, place, or thing, such as pointing.

    • Feedback: Gestures made in synchronization with, in agreement with, or in response to another person’s utterance or gesture, including gestures of the same shape performed by imitating the other party’s gestures.

    • Compellation: Gesture to call the other person.

    • Hesitate: Gesture that appears at the time of hesitation.

    • Others: Gestures that are unclear but seem to have some meaning.

  • Upper body posture: We observed the participants’ seated postures and found no significant change in seating position. For this reason, the front-back position of the upper body was extracted on the basis of the three-dimensional position of the head. Specifically, the difference between the front-back coordinate of the head position obtained using OpenFace and the center position was obtained. From this position information, the angle of the upper-body posture change was classified as micro when it was 10\(^{\circ }\) or less, small when it was 20\(^{\circ }\) or less, medium when it was 30\(^{\circ }\) or less, and large when it was 45\(^{\circ }\) or more (a short sketch of this binning follows this list).
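For reference, the angle binning shared by the face direction, nod depth, and upper-body posture labels above can be written as the following small sketch; the function name is ours, and the handling of angles between 30\(^{\circ }\) and 45\(^{\circ }\), which the thresholds above leave unspecified, is an assumption.

```python
def angle_label(angle_deg):
    """Map an absolute rotation angle (degrees) to the corpus label.

    Thresholds follow the description above; angles between 30 and 45 degrees
    are not covered there, so they are assigned to "medium" here (assumption).
    """
    a = abs(angle_deg)
    if a <= 10:
        return "micro"
    if a <= 20:
        return "small"
    if a <= 30:
        return "medium"
    if a >= 45:
        return "large"
    return "medium"  # 30-45 degrees: unspecified in the text (assumption)
```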

Table 2. List of generated labels for each motion part

Table 2 shows the list of parameters of the obtained corpus data. ELAN [18] was used for the manual annotation, and all the above data were integrated with a time resolution of 30 Hz.

3.3 Proposed Motion Generation Method

Table 3. Performance of the generation model and chance level. Each score shows the F-measure.

Using the constructed corpus data, we created a model that takes as input the words of an utterance, their parts of speech, thesaurus information, word positions, and the utterance act of the entire utterance, and generates one action class per clause for each of the eight motions shown in the table by using the decision tree algorithm C4.5. That is, eight labels are generated for each clause. Specifically, the language features used were as follows.

  • Number of characters: number of characters in a clause

  • Position: the position of the phrase from the beginning and end of the sentence

  • Words: Word information (Bag-of-words) in phrases extracted by the morphological analysis tool Jtag [1]

  • Part-of-speech: part-of-speech information of words in a clause extracted by Jtag [1]

  • Thesaurus: Thesaurus information of the words in a phrase, based on a Japanese lexicon

  • Utterance act: The utterance act of each sentence (33 types), estimated with a method that uses word n-grams and thesaurus information [2, 11]

The evaluation was performed by 24-fold cross-validation, in which the data of 23 of the 24 participants were used for training and the data of the remaining participant were used for evaluation. In this way, we evaluated how well actual human motion could be generated from the data of others alone. Table 3 shows the average F-measure as the performance evaluation result. The chance level indicates the performance obtained when the class with the highest number of correct answers is always output. Table 3 shows that the accuracy was significantly higher than the chance level for all generation targets (paired t-test: \(p<.05\)). These results show that the proposed method, which uses the words obtained from the spoken language, their parts of speech and thesaurus information, their positions, and the utterance act of the entire utterance, is effective in generating whole-body motions.
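The following is a minimal sketch of this training and leave-one-participant-out evaluation, assuming a corpus that maps each participant to per-clause feature dictionaries and motion labels. We use scikit-learn's DecisionTreeClassifier as a stand-in for C4.5, and the feature extraction is heavily simplified; all names and the data layout are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import f1_score
from sklearn.tree import DecisionTreeClassifier

# Illustrative names for the eight generated motion labels of Table 2.
MOTION_PARTS = ["face_yaw", "face_roll", "face_pitch", "nod_count",
                "nod_depth", "hand_gesture", "gesture_type", "posture"]

def clause_features(clause, position, n_clauses, utterance_act):
    """Per-clause features mirroring Sect. 3.3 (characters, position, words,
    POS, utterance act); used when building `corpus` (simplified)."""
    feats = {"n_chars": len(clause["surface"]),
             "pos_from_start": position,
             "pos_from_end": n_clauses - position - 1,
             "utterance_act": utterance_act}
    for word in clause["words"]:
        feats[f"w={word}"] = 1            # bag-of-words
    for pos_tag in clause["pos_tags"]:
        feats[f"pos={pos_tag}"] = 1       # part-of-speech information
    return feats

def leave_one_participant_out(corpus, part):
    """Average macro F-measure for one motion part, training on 23 participants
    and testing on the held-out one. `corpus` maps participant id to a list of
    (feature_dict, label_dict) pairs, one per clause."""
    f1s = []
    for held_out in corpus:
        train = [(f, y[part]) for pid, clauses in corpus.items()
                 if pid != held_out for f, y in clauses]
        test = [(f, y[part]) for f, y in corpus[held_out]]
        vec = DictVectorizer()
        x_train = vec.fit_transform([f for f, _ in train])
        x_test = vec.transform([f for f, _ in test])
        clf = DecisionTreeClassifier()    # stand-in for C4.5
        clf.fit(x_train, [y for _, y in train])
        f1s.append(f1_score([y for _, y in test], clf.predict(x_test), average="macro"))
    return float(np.mean(f1s))
```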

3.4 Additional Original Motion Reflecting Character’s Personality

In addition to the motion generation proposed in the previous section, motions unique to Ayase were extracted from her motions in the animation, and the four original motions shown in Table 4 were added. These motions were selected in collaboration with Ayase’s creators, who have experience in creating animation. For these original motions, words and sentences that trigger them were defined, and when these words and sentences appear in a system utterance, the original motions take precedence over the output of the motion generation model (see the sketch below). Although there were only four such movements, we could not find any other distinctive movements worth noting, so we considered this number sufficient.
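As an illustration of how these character-specific motions override the model output, here is a minimal sketch; the trigger table stands in for Table 4, and only the “Pervert!” trigger (taken from Fig. 2) is grounded in the text, while the motion names are placeholders.

```python
# Placeholder trigger table standing in for Table 4 (actual entries differ).
ORIGINAL_MOTIONS = {
    "Pervert!": "raise_arm_and_protrude_face",   # example shown in Fig. 2
    # "<trigger word or sentence>": "<original motion name>", ...
}

def motion_for_clause(clause_text, model_motion):
    """Return the original motion if the clause contains a registered trigger;
    otherwise fall back to the motion produced by the generation model."""
    for trigger, motion in ORIGINAL_MOTIONS.items():
        if trigger in clause_text:
            return motion          # rule-based original motion takes precedence
    return model_motion
```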

Fig. 2. Example of a scene in which the original motion of raising and lowering the arm and protruding the face is performed in accordance with the text display of “Pervert!” included in a system utterance.

Table 4. Additional original movements and example utterance texts that trigger them

4 Construction of Dialogue System Reflecting Anime Character’s Personality

Fig. 3. System architecture of our system

Using the proposed methods for utterance and motion generation, we constructed a dialogue system that can respond to user utterances with utterances and motion. Figure 3 shows a diagram of the system configuration.

The user enters text through the chat UI. When the dialogue manager receives it, the text is first sent to the utterance generation unit. After the system utterance is acquired from the utterance generation unit, it is transmitted to the motion generation unit, and motion information for each clause of the system utterance text is obtained.

In addition, although it is not necessary for text dialogues, uttered speech can also be obtained with the speech synthesis unit. In this system, the speech obtained from the speech synthesis unit is used to generate lip-sync motion.

The dialogue manager sends the system utterance text, the motion schedule, and the voice to the agent animation generation unit. In the agent animation unit, the utterance text is displayed in a speech bubble above the character at equal time intervals starting from the first character, and the motion of the agent is generated in sync with the display of the utterance characters according to the motion schedule. To generate the motion animation, a CG character was created in Unity, and animations corresponding to the motion lists in Tables 2 and 4 are generated in real time. The eight motion parts shown in Table 2 can operate independently, and head motion is generated by mixing all parameters of the number of nods, depth of head movement, and head directions (yaw, roll, pitch). When utterance text registered as a trigger for a specific motion in Table 4 appears, that specific motion is generated instead of the motion produced by our generation model. All motions of the agent are generated in accordance with the timing of the utterance text display. An example of the presentation screen is shown on the right side of Fig. 3.
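The turn handling described above and in Fig. 3 can be summarized by the following sketch of the dialogue manager; all component interfaces are hypothetical, and the actual text display and animation run on the Unity side rather than in Python.

```python
def handle_user_turn(user_text, utt_gen, motion_gen, tts, animator, chat_ui):
    """One turn of the dialogue manager (hypothetical component wrappers)."""
    # 1. Generate the system utterance for the user's text (Sect. 2).
    system_text = utt_gen.generate(user_text)

    # 2. Generate a motion label per clause of the system utterance (Sect. 3);
    #    character-specific motions override the model output where triggers match.
    motion_schedule = motion_gen.generate(system_text)

    # 3. Synthesize speech; although the dialogue is text based, the speech is
    #    used to drive the lip-sync motion.
    voice = tts.synthesize(system_text)

    # 4. Display the text in a speech bubble character by character and play the
    #    scheduled motions in sync with the text display (rendered in Unity).
    animator.play(system_text, motion_schedule, voice)

    # 5. Also present the system utterance in the chat UI next to the user's text.
    chat_ui.show(system_text)
```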

It is also possible to send a system utterance from the dialog management unit to the chat UI and to present the system utterance in the chat UI in addition to the user utterance shown on the left side of Fig. 3.

5 Subjective Evaluation

5.1 Evaluation Method

The effectiveness of the proposed method was evaluated in subject experiments using the constructed dialogue system. We evaluated the usefulness of the responses of the dialogue system generated by the proposed utterance and motion generation methods, paying particular attention to character reproducibility (character-likeness) in addition to the goodness of the responses.

The following three conditions were set as experimental conditions for utterance generation.

  • U-AIML: A rule-based method written in AIML, which is a general method used for utterance generation, was used. Specifically, we used a large-scale AIML database containing about 300K utterance pairs. In Japanese, sentence-end expressions are some of the most important elements that indicate character, so these expressions were converted to expressions like those used by Ayase by using a sentence-end conversion method [14].

  • U-PROP: An utterance is generated by using our proposed utterance generation method described in Sect. 2. The weight parameters \(w_1\) to \(w_6\) were experimentally set to 1.0 in Formula (2).

  • U-GOLD: An utterance is generated by using collected data from the role play-based question-answering method in Sect. 2.2. When multiple answers were given to a question, one was selected at random.

By comparing the U-AIML and U-PROP conditions, we evaluated the usefulness of the proposed utterance generation against manual (rule-based) utterance generation. We also compared the U-PROP and U-GOLD conditions to evaluate how close the proposed utterance generation method comes to human-generated responses.

The following four conditions were set as the experimental conditions for motion generation.

  • M-BASE: Generates basic character movements such as for lip sync and facial expressions. For generating facial expressions, we created animations for facial expressions corresponding to the eight emotions collected with the role play-based question-answering method in Sect. 2.2. Facial expressions under the U-PROP and U-GOLD conditions were generated by using the collected data. For the U-AIML condition, humans annotated the emotion label for each utterance manually. The labels were used to generate facial expressions.

  • M-RAND: In addition to lip sync and facial expressions, whole body movements were randomly generated.

  • M-PROP1: The proposed motion generation method trained on human data was used.

  • M-PROP2: In addition to using the motion generation method with human data, a small amount of motion unique to the character was added.

By comparing the M-BASE and M-RAND conditions, we evaluated the usefulness of motion generation for the whole body, and we compared the M-RAND and M-PROP1 conditions to evaluate the usefulness of the proposed motion generation method by learning human motion data. Also, by comparing the M-PROP1 and M-PROP2 conditions, we evaluated the usefulness of adding a small amount of unique character-specific motions in addition to the proposed generation by learning human motion data.

Twelve conditions combining these three conditions of utterance generation and four conditions of motion generation were set as experimental conditions.

Table 5. Items and questionnaire of subjective evaluation
Fig. 4. Results of subjective evaluation of impression of “goodness” of overall response.

As an experimental method, the same user utterance was set for comparison under each condition, and the utterance and motion of the system in response to the user utterance were evaluated. Specifically, ten question utterances were randomly extracted from the collected question-answer data. The subjects observed the user’s utterance text for 3 s and then watched a video showing the response of the system. At each viewing, the subjects evaluated the impression of the response of the dialogue system using a seven-point Likert scale (1 to 7 points). Specific evaluation items are shown in Table 5. Since video of 10 utterances was prepared for each of the twelve conditions, video viewing and evaluation were performed 120 times. Considering the order effect, the presentation order of the video presented for each subject was randomized.

5.2 Evaluation Result

Fig. 5. Results of subjective evaluation of impression of “character-likeness” of overall response.

An experiment was performed with seven subjects. The mean value of each subjective evaluation item for each subject under each experimental condition was calculated, and the mean values are shown in Figs. 4 and 5.

First, we performed a two-way analysis of variance to evaluate the effects of the utterance and motion condition factors on the rating of the overall goodness of responses. As a result, main effects of both factors were observed, and no interaction effect was observed (utterance condition: \(F(2,72)=3.92, p<.05\); motion condition: \(F(3,72)=3.82, p<.05\)).
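For reference, one way such an analysis could be run is sketched below: a two-way ANOVA over the utterance and motion factors, followed by the Bonferroni-corrected paired comparisons described in the following paragraphs. The column names and the data-frame layout (one mean rating per subject and condition) are our own assumptions, and this is an illustration of the analysis type, not the authors' exact script.

```python
import pandas as pd
from scipy import stats
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# df is assumed to be a pandas DataFrame with one row per subject x condition:
# columns "subject", "utterance" (U-AIML/U-PROP/U-GOLD),
# "motion" (M-BASE/M-RAND/M-PROP1/M-PROP2), and "rating" (mean of 10 videos).

def two_way_anova(df):
    """F tests for the utterance factor, the motion factor, and their interaction."""
    model = ols("rating ~ C(utterance) * C(motion)", data=df).fit()
    return anova_lm(model, typ=2)

def bonferroni_pair(df, fixed_motion, cond_a, cond_b, n_comparisons):
    """Paired t-test of two utterance conditions under one motion condition,
    with a Bonferroni-corrected p-value (pairing is by subject)."""
    sub = df[df["motion"] == fixed_motion].pivot(index="subject",
                                                 columns="utterance",
                                                 values="rating")
    t, p = stats.ttest_rel(sub[cond_a], sub[cond_b])
    return t, min(p * n_comparisons, 1.0)
```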

Next, since a main effect of the utterance factor was observed, multiple comparisons with the Bonferroni method were performed to verify which utterance conditions differed under each motion condition. Since there are many combinations of conditions and our main objective was to confirm the usefulness of the U-PROP condition, this paper mainly describes only the two differences between the U-AIML and U-PROP conditions and between the U-PROP and U-GOLD conditions.

First, a significant trend was observed between the U-AIML and U-PROP conditions only under the M-PROP1 condition (\(p<.10\)). This suggests that the proposed utterance generation method received a higher evaluation value than the general utterance generation using AIML when the proposed motion generation method (without original motion) was used. Under the M-RAND and M-PROP2 conditions, a significant difference was observed between the U-PROP and U-GOLD conditions (\(p<.05\), \(p<.01\)). This suggests that utterances made by humans received higher evaluation values than those of the proposed utterance generation method when random motion generation and the proposed motion generation method (with original motion) were used.

Next, since a main effect of the motion factor was observed, multiple comparisons were similarly performed to determine which motion conditions differed under each utterance condition. Since there are many combinations of conditions and our main objective was to confirm the usefulness of our proposed motion generation method, this paper mainly describes only the three differences between the M-BASE and M-RAND conditions, the M-RAND and M-PROP1 conditions, and the M-PROP1 and M-PROP2 conditions.

First, a significant trend was observed between the M-BASE and M-RAND conditions only under the U-AIML condition (\(p<.10\)). This suggests that when utterance generation using AIML was used, random motion generation received a lower evaluation value than generating no whole-body motion. In addition, under all utterance conditions of U-AIML, U-PROP, and U-GOLD, a significant difference or a significant trend was observed between the M-RAND and M-PROP1 conditions (\(p<.01\), \(p<.01\), \(p<.10\)). This suggests that the evaluation value of the goodness of the overall response was higher (or tended to be higher) with the proposed motion generation method (without handmade original motion) than with random motion generation, regardless of the utterance generation condition. In addition, a significant trend was observed between the M-PROP1 and M-PROP2 conditions under the U-PROP condition (\(p<.10\)). This suggests that when an utterance is generated with the proposed utterance generation method, the evaluation value tended to be higher when the original motion was not added than when it was added.

Next, the same analysis was performed for the evaluation value of the character-likeness of the overall response. As a result, main effects of the utterance and motion factors were observed, and no interaction effect was observed (utterance condition: \(F(2,72)=3.92\), \(p<.05\); motion condition: \(F(3,72)=3.82, p<.05\)).

Next, using multiple comparisons with the Bonferroni method, we verified which utterance conditions differed under each motion condition. First, under the M-RAND, M-PROP1, and M-PROP2 conditions, significant differences or significant trends between the U-AIML and U-PROP conditions were observed (\(p<.01\), \(p<.01\), \(p<.10\)). Under M-RAND and M-PROP1, significant differences were found between the U-PROP and U-GOLD conditions (\(p<.05\), \(p<.05\)). This suggests that under random motion generation and the proposed motion generation method (without original motion), human response utterances received higher evaluation values than those of the proposed utterance generation method.

Next, we verified which motion conditions differed under each utterance condition. First, only under the U-AIML condition was a significant difference observed between the M-BASE and M-RAND conditions (\(p<.05\)). This suggests that when AIML-based utterance generation was used, generating random motion resulted in a lower evaluation value than generating no whole-body motion. Under all utterance conditions of U-AIML, U-PROP, and U-GOLD, a significant difference or a significant trend was observed between the M-RAND and M-PROP1 conditions (\(p<.10\), \(p<.10\), \(p<.01\)). This suggests that the proposed motion generation method (without original motion) received a higher evaluation value (or tended to) than random motion generation, regardless of the utterance generation condition. In addition, a significant trend was observed between the M-PROP1 and M-PROP2 conditions under the U-GOLD condition (\(p<.10\)).

6 Discussion

From the evaluation results, it was confirmed that utterance generation and motion generation affected the impressions of both the goodness of the responses and character-likeness. In addition, the rating of the dialogue system constructed using the proposed construction method (U-PROP+M-PROP1) was as high as 5.83 for goodness of response and 5.98 for character-likeness in the rating range of 1–7.

Although the number of subjects was as small as seven and differences were not observed under all motion conditions, the proposed utterance generation produced better responses and greater character-likeness than utterance generation using AIML. This also suggests that the proposed utterance generation method is more useful for producing good responses that reflect the character’s personality than manual (rule-based) utterance generation. It was also suggested that the proposed motion generation (without the original motion) similarly improves the impression compared with random motion generation.

It was also found that, for the proposed motion generation method, the evaluation value was lower when the original motions were added than when they were not added. In an interview survey conducted with the subjects after the experiment, it was found that the transitions between the original motions and the other, normal motions were not smooth and that their timing did not completely match the display of the utterance text. Since the created original motions are larger overall than the normal motions, it is conceivable that the transitions to and from other motions did not go smoothly. Therefore, when adding original motions to the proposed motion generation method, it is important to design them so that smooth transitions from the preceding motions are made when an original motion is generated.

In addition, under the motion condition in which the original motions were not included in the proposed motion generation (M-PROP1), the rating was very high, just a little under 6 points under the U-PROP+M-PROP1 condition (5.83). This suggests the possibility of generating responses that give the impression of being good and character-like without inserting original movements. Our motion generation model generates the average motion of many people since it was trained with the movements of 24 people. This suggests that body motion reflecting such average movement can improve the impression of being good and character-like without inserting an original movement. Even if an average movement is given to the anime character agent, it is still possible to sense her individuality. This is a very interesting result. Of course, depending on the design and settings of the anime character, this technique may not always be effective, because it may be better for an awkward robot-like character not to behave like a human. However, if it is appropriate for a character to move like a human, our proposed motion generation method can be an effective means of enhancing the quality of responses and character-likeness. A detailed evaluation of its effectiveness using more diverse characters is one of our future tasks.

Finally, our proposed utterance and motion generation methods have made it possible to realize, in a practical way, a dialogue system of an existing anime character, which has been difficult to date. We cannot say that our proposed construction method achieves a perfect system, but we believe it is worthwhile to have demonstrated the effectiveness of a new method that can efficiently realize such a system.

We plan to carry out additional experiments with more samples and to verify in detail whether there is an interaction effect between the utterance and motion conditions. We will also improve the construction method to create dialogue-agent systems using other existing anime characters.

7 Conclusion

In this paper, we proposed a construction method for efficiently building an animated text-chat agent system for existing anime characters. We tackled two research problems in generating verbal and nonverbal behaviors. In the generation of verbal behavior, a major issue is generating utterance text that reflects the personality of an existing character in response to any user question. For this problem, we proposed the use of the role play-based question-answering method to efficiently collect high-quality paired data of user questions and system answers that reflect the personality of an anime character. We also proposed a new utterance generation method that uses a neural translation model with the collected data. Rich and natural expressions of nonverbal behavior greatly enhance the appeal of agent systems, but not all existing anime characters move as naturally and as diversely as humans. Therefore, we proposed a method that can automatically generate whole-body motion from utterance text so that anime characters can make human-like and natural movements. In addition to these movements, we added a small amount of characteristic movement on a rule basis to reflect personality. We created a text-dialogue agent system of a popular existing anime character by using our proposed generation models. A subjective evaluation of the implemented system showed that our models for generating verbal and nonverbal behavior improved the impression of the goodness of the agent’s responses and of how well the character’s personality was reflected. In addition, generating characteristic motions from a small number of heuristic rules was not effective; rather, the motions generated by our model, which reflect the average motion of many people, conveyed more personality. Therefore, our proposed generation models and construction method will greatly contribute to realizing text-dialogue-agent systems of existing characters.