
1 Introduction

Cartoon portraits are used in various contexts such as social media, blogs, games, application profiles, and avatars. In Japan, it is common to use a portrait avatar instead of a face photo to express one's identity. However, drawing a portrait requires a certain artistic sense and skill, which makes it difficult for people without such training. A system that automatically generates portraits is therefore becoming increasingly important.

In previous work, the website "photoprocessing.com" [1] creates portrait-like images by reducing color and brightness and adding lines that trace the facial contours. However, this approach has several issues: the parameters must be adjusted manually, the resulting portrait is merely a traced version of the photo, and facial pixels with differing RGB values are mistakenly treated as edges. In the system of Wu et al. [2], a neural network extracts features from a facial image and applies a nonlinear transformation to obtain drawing parameters, from which a portrait is generated with a portrait drawing tool. As a result, the generated caricatures tend to share similar characteristics. More recently, several Generative Adversarial Networks (GANs) have been proposed to convert facial images into portraits. CariGANs [3] uses unsupervised learning to convert face photos into portraits, and APDrawingGAN [4] converts each part of the face individually. CariGANs is trained on portraits drawn by various illustrators without associating the input images with the portraits serving as teachers, so it cannot learn to reflect the individuality of a single illustrator. APDrawingGAN likewise does not capture an illustrator's personal touch because its teacher data consists of traced portraits.

In this study, we propose a method for generating caricatures with a Generative Adversarial Network (GAN) trained on a small amount of data. The GAN is trained on pairs consisting of a face photo and a portrait drawn by a professional illustrator; a test image is then input into the trained GAN to generate a portrait. The training data were prepared so that gender and age were distributed evenly. We compare the previous methods, pix2pix and CycleGAN, with the proposed methods, Paired CycleGAN and Cyclepix. Assuming a small training set, we conducted evaluation experiments on the results generated with 90 and 189 training images and examined the most suitable method for generating caricatures from face photos.

2 Related Work

2.1 Generative Adversarial Nets

Goodfellow et al. [5] proposed the GAN as a method for efficiently training generative models in deep learning. A GAN trains two networks, a Generator and a Discriminator. The Generator aims to produce data resembling the training data, while the Discriminator outputs the probability that a given sample comes from the training data. The Generator is trained so that the Discriminator identifies its generated data as training data; the Discriminator, in turn, is trained to identify training data as real and generated data as fake. By training the Generator and Discriminator adversarially in this way, the two networks compete with and improve each other. A known problem is that this training process is not stable.
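
As a rough illustration of this adversarial scheme, the following sketch shows one training step in PyTorch; the `generator`, `discriminator`, and optimizer objects are hypothetical placeholders, not the networks used in this study.

```python
import torch
import torch.nn as nn

# Minimal sketch of one adversarial training step, assuming a hypothetical
# `generator` (noise -> image) and `discriminator` (image -> probability of
# being real, via a sigmoid output) together with their optimizers.
bce = nn.BCELoss()

def train_step(generator, discriminator, g_opt, d_opt, real_images, noise):
    real_labels = torch.ones(real_images.size(0), 1)
    fake_labels = torch.zeros(real_images.size(0), 1)

    # Discriminator: classify training data as real and generated data as fake.
    fake_images = generator(noise).detach()
    d_loss = bce(discriminator(real_images), real_labels) + \
             bce(discriminator(fake_images), fake_labels)
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator: try to make the Discriminator classify generated data as real.
    g_loss = bce(discriminator(generator(noise)), real_labels)
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```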

2.2 Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Network

Radford et al. [6] proposed the Deep Convolutional Generative Adversarial Network (DCGAN), which aims to improve the accuracy of image generation through unsupervised learning. Its key points include eliminating pooling from the convolutional layers, using Batch Normalization to make Generator training more efficient, removing fully connected layers, adopting the tanh function in the Generator's output layer, and using the Leaky Rectified Linear Unit (LeakyReLU) for all Discriminator activation functions. As a result, the GAN training process was stabilized. In this study, we follow the guidelines proposed for DCGAN in the network structure of the Generator and the Discriminator.

2.3 Image-to-Image Translation with Conditional Adversarial Networks

Isola et al. [7] proposed pix2pix, an image-to-image translation algorithm based on GANs. By learning the relationship between two image domains from pairs of corresponding images, the method generates a translated image from an input image. Pix2pix uses a Conditional GAN with the same basic configuration as DCGAN and employs U-Net as the Generator. The Discriminator uses PatchGAN, which focuses on local patches rather than the whole image when distinguishing training data from data generated by the Generator. However, it has been pointed out that a large amount of training data is required and that the teacher images must be prepared in pairs.

2.4 Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks

Zhu et al. [8] proposed CycleGAN, an image-to-image translation algorithm using GANs that is similar to pix2pix (Subsect. 2.3). In CycleGAN, image domains are defined, and images are collected for each domain and used as training data; the main feature is that the training data need not be paired. Let the image sets of the two domains be X and Y. Two generators are prepared, one converting X to Y and the other Y to X, together with a discriminator for each direction. Image generation without paired training data becomes possible by using the adversarial loss of the GAN and the cycle consistency loss proposed in that paper as the training criteria. However, because the learning is unsupervised, the accuracy is shown to be lower than that of pix2pix.
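
The cycle consistency term can be sketched as follows (PyTorch, with hypothetical generators `G_xy` and `G_yx` for the X→Y and Y→X mappings; the weight value is only illustrative):

```python
import torch.nn as nn

l1 = nn.L1Loss()

def cycle_consistency_loss(G_xy, G_yx, x, y, lambda_cycle=10.0):
    # Each image is translated to the other domain and back; the L1 distance
    # between the reconstruction and the original is penalized.
    loss_x = l1(G_yx(G_xy(x)), x)  # x -> Y -> X
    loss_y = l1(G_xy(G_yx(y)), y)  # y -> X -> Y
    return lambda_cycle * (loss_x + loss_y)
```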

2.5 APDrawingGAN: Generating Artistic Portrait Drawings from Face Photos with Hierarchical GAN

Yi et al. [4] proposed APDrawingGAN, a GAN for drawing portraits. The system is divided into a global net, which converts the entire image, and local nets, which convert the right eye, left eye, nose, mouth, and hair individually; the outputs are then combined by a fusion net. They also proposed a line-promoting distance transform loss. By tolerating the small deviations often seen in caricatures, they reported success in generating highly accurate portraits from a small number of training images.

3 Generative Adversarial Networks

3.1 Previous Methods (Pix2pix and CycleGAN)

As mentioned in the Related Work section, pix2pix [7] uses supervised learning and CycleGAN [8] uses unsupervised learning; these are the foundations of image-to-image translation with GANs. Pix2pix has been reported to successfully convert aerial photographs into maps and color images of bags into line drawings, and it is the basic supervised translation method. CycleGAN has converted images of horses into zebras and photographs into paintings of various styles, and it is the basic unsupervised translation method. We adopt these GANs as the previous methods for portrait generation. Figure 1 shows the pix2pix network architecture, which performs domain translation through supervised learning. In this network, pairs of images are prepared as training data, the relationship between the images is learned from these pairs, and an output image is generated from an input image based on that relationship. The Generator generates a portrait from a face photo, and the Discriminator receives the face photo paired with either the portrait generated by the Generator or the training portrait, and identifies whether the input portrait is generated or comes from the training data.

Fig. 1. Pix2pix network architecture.

Figure 2 shows the network architecture of CycleGAN. CycleGAN adds a second Discriminator to the pix2pix network and provides one generator for each Discriminator. Generator 2 outputs face photos from portraits, performing the inverse conversion of Generator 1. As shown in Fig. 2, there are two learning paths: conversion from a face photo to a portrait and back to a face photo, and conversion from a portrait to a face photo and back to a portrait. The addition of Generator 2, which converts portraits into face photos, gives this architecture its reconversion capability. Discriminator 2 discriminates between the face photos generated by Generator 2 and real face photos, and outputs the discrimination result.

Fig. 2. CycleGAN and Cyclepix network architecture.

Because the domain conversion in CycleGAN is performed without paired supervision rather than through a one-to-one mapping (as in pix2pix), overfitting is less likely to occur in CycleGAN than in pix2pix. Although high-precision generation can therefore be expected even with little training data, CycleGAN has the problem that images resembling the training data cannot always be generated after learning. For example, CycleGAN may learn to generate a portrait of a different person instead of a portrait of the same person, which can result in glasses being drawn for a person without glasses or a beard being drawn for a person without a beard. For these reasons, CycleGAN is not well suited to generating portraits that reflect the individuality of professional illustrators. In addition, pix2pix is vulnerable to the deformation found in stylized portraits and is prone to overfitting, so it is not well suited to portraits drawn with deformation.

3.2 Proposed Methods (Paired CycleGAN and Cyclepix)

In this study, we adopt supervised learning based on CycleGAN to address the poor accuracy of pix2pix and the inability of CycleGAN to learn according to the training data. We propose two methods: Paired CycleGAN and Cyclepix. The former uses pairs of the generated portrait and the photo as inputs to the Discriminator, and the latter adds the pixel error between the generated image and the teacher image to the adversarial loss. Figure 3 shows the network architecture of Paired CycleGAN, which modifies CycleGAN. Because CycleGAN is unsupervised, it cannot always learn according to the training data. Paired CycleGAN therefore performs supervised learning by changing the Discriminator input to a pair. As shown in Fig. 3, the inputs to Discriminator 1 and Discriminator 2 are pairs of face photos and the portraits generated by each Generator.

Fig. 3. Paired CycleGAN network architecture
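
A minimal sketch of how such a paired Discriminator input can be formed is shown below (PyTorch; the `discriminator` module is a placeholder and is assumed to accept a 6-channel input):

```python
import torch

def discriminate_pair(discriminator, face_photo, generated_portrait):
    # Paired CycleGAN conditions the Discriminator on the input photo by
    # concatenating photo and portrait along the channel axis (3 + 3 = 6
    # channels for RGB images), so the Discriminator can judge whether the
    # portrait corresponds to this particular photo.
    pair = torch.cat([face_photo, generated_portrait], dim=1)
    return discriminator(pair)
```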

4 GANs Used for the Experiment

4.1 Network Structure

In this study, we employed the same network architecture as CycleGAN [8]; the Generator and Discriminator have the same structure across all networks in the experiments. Figure 4 shows the structure of the Generator. The Generator consists of an encoder that extracts facial image features, a translator that translates facial feature maps into portrait feature maps, and a decoder that restores the feature maps to image form. In the encoder, feature maps are produced by three convolutional layers, and the data size is compressed using strided convolutions. The translator performs the translation using nine residual blocks (ResNet [9]). A residual block adds its input to the output of two successive convolutions via a skip connection, which helps address the vanishing gradient problem. The decoder applies two transposed convolutional layers to restore the feature map to the image size, followed by a convolutional layer that produces the portrait image. The activation function is the Rectified Linear Unit (ReLU) except for the output layer, which uses tanh. Instance Normalization is used for normalization.

Fig. 4. Generator network structure
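
A sketch of this encoder-translator-decoder layout is shown below (PyTorch); the exact kernel sizes and channel counts are assumptions based on the CycleGAN generator, not necessarily the configuration used here.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual block: two 3x3 convolutions whose output is added to the input
    via a skip connection, which helps mitigate vanishing gradients."""
    def __init__(self, channels=256):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.InstanceNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.InstanceNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)

# Encoder (three convolutions, stride-2 convolutions compress the feature map),
# translator (nine residual blocks), and decoder (two transposed convolutions
# plus a final convolution with tanh).
generator = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, padding=3), nn.InstanceNorm2d(64), nn.ReLU(inplace=True),
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.InstanceNorm2d(128), nn.ReLU(inplace=True),
    nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1), nn.InstanceNorm2d(256), nn.ReLU(inplace=True),
    *[ResidualBlock(256) for _ in range(9)],
    nn.ConvTranspose2d(256, 128, kernel_size=3, stride=2, padding=1, output_padding=1),
    nn.InstanceNorm2d(128), nn.ReLU(inplace=True),
    nn.ConvTranspose2d(128, 64, kernel_size=3, stride=2, padding=1, output_padding=1),
    nn.InstanceNorm2d(64), nn.ReLU(inplace=True),
    nn.Conv2d(64, 3, kernel_size=7, padding=3), nn.Tanh(),
)
```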

The Discriminator can take only the generated image as input, or a pair consisting of the generated image and the Generator's input face photo, or the training data and the input face photo; hence, the number of input channels varies. The structure of the Discriminator adopts the PatchGAN approach described in Subsect. 2.3. The input is compressed by three strided convolutions, and a final convolution to one channel outputs a 32 × 32 × 1 map of real/fake decisions.
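
A corresponding PatchGAN-style Discriminator sketch is shown below (PyTorch); the layer parameters are assumptions chosen so that a 256 × 256 input yields a 32 × 32 × 1 output.

```python
import torch.nn as nn

def make_discriminator(in_channels=3):
    """PatchGAN-style Discriminator sketch: three stride-2 convolutions compress
    the input, and a final single-channel convolution yields a 32x32x1 map of
    real/fake scores for local patches. `in_channels` becomes 6 when a
    (photo, portrait) pair is concatenated as input."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 64, kernel_size=4, stride=2, padding=1),
        nn.LeakyReLU(0.2, inplace=True),
        nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),
        nn.InstanceNorm2d(128), nn.LeakyReLU(0.2, inplace=True),
        nn.Conv2d(128, 256, kernel_size=4, stride=2, padding=1),
        nn.InstanceNorm2d(256), nn.LeakyReLU(0.2, inplace=True),
        nn.Conv2d(256, 1, kernel_size=3, padding=1),  # 256x256 input -> 32x32x1 output
    )
```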

4.2 Loss Function

The learning method and hyperparameters were determined based on CycleGAN [8]. The Generator's loss function differs for each network, whereas the adversarial loss is the same. Equation 1 shows the Generator's adversarial loss.

$$ L_{Adv} = - \mathbb{E}_{x \sim p_{x}(x)} \left[ \log D(G(x)) \right] $$
(1)

L1 Loss is defined by Eq. 2, where x is the input image, y is the teacher image, and z is a random noise vector.

$$ L_{L1} = \mathbb{E}_{x,y,z} \left[ \left\| y - G(x,z) \right\|_{1} \right] $$
(2)

The Generator loss of pix2pix is a weighted sum of the L1 loss, which is the pixel error between the portrait generated by the Generator and the training data, and the adversarial loss, which penalizes generated portraits that the Discriminator correctly identifies as fake.

Equation 3 shows the loss function of pix2pix. In the experiment, we set \( \lambda_{Adv} = 1.0 \) and \( \lambda_{L1} = 10.0 \).

$$ L_{G} = \lambda_{Adv} L_{Adv} + \lambda_{L1} L_{L1} $$
(3)

Paired CycleGAN uses the same loss function as CycleGAN. The cycle consistency loss, denoted by "Cycle," is the error between the face image obtained by reconverting the generated data and the input face photo. The loss function of CycleGAN is a weighted sum of the cycle consistency loss and the adversarial loss, as shown in Eq. 4. In the experiment, we set \( \lambda_{Adv} = 1.0 \) and \( \lambda_{Cycle} = 10.0 \).

$$ L_{G} = \lambda_{Adv} L_{Adv} + \lambda_{Cycle} L_{Cycle} $$
(4)

The loss of Cyclepix is calculated as a weighted sum of L1 loss, adversarial loss, and cycle consistency loss, as shown in Eq. 5. In the experiment, we used \( \lambda_{Adv} = 1.0 \), \( \lambda_{L1} = 2.5 \), and \( \lambda_{Cycle} = 10.0 \).

$$ L_{G} = \lambda_{Adv} L_{Adv} + \lambda_{L1} L_{L1} + \lambda_{Cycle} L_{Cycle} $$
(5)
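
A sketch of how the Cyclepix generator loss in Eq. 5 could be computed is shown below (PyTorch), assuming the Discriminator outputs probabilities so that binary cross entropy can serve as the adversarial term.

```python
import torch
import torch.nn as nn

l1 = nn.L1Loss()
bce = nn.BCELoss()

def cyclepix_generator_loss(d_out_fake, generated_portrait, teacher_portrait,
                            reconstructed_photo, input_photo,
                            lambda_adv=1.0, lambda_l1=2.5, lambda_cycle=10.0):
    # Weighted sum of the three terms in Eq. 5: adversarial loss on the
    # Discriminator's score for the generated portrait, L1 loss against the
    # teacher portrait, and cycle consistency loss against the input photo.
    adv = bce(d_out_fake, torch.ones_like(d_out_fake))
    l1_term = l1(generated_portrait, teacher_portrait)
    cycle = l1(reconstructed_photo, input_photo)
    return lambda_adv * adv + lambda_l1 * l1_term + lambda_cycle * cycle
```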

The Discriminator loss function is the same in all networks. Equation 6 shows the discriminator’s adversarial loss.

$$ L_{Adv} = \mathbb{E}_{x \sim p_{data}(x)} \left[ \log D(x) \right] + \mathbb{E}_{z \sim p_{z}(z)} \left[ \log \left( 1 - D(G(z)) \right) \right] $$
(6)

Adam [10] was used as the optimizer, with parameters \( \beta_{1} = 0.5 \) and \( \beta_{2} = 0.999 \). The learning rate was set to 0.0002 for both the Generator and Discriminator. In each iteration, both the Generator and Discriminator were updated once, and training was terminated when each network was judged to generate data with sufficient accuracy.
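
In PyTorch-style code, these optimizer settings correspond to the following (with `generator` and the PatchGAN factory standing for the modules sketched in Sect. 4.1):

```python
import torch

# Adam with beta1 = 0.5, beta2 = 0.999 and a learning rate of 0.0002 for both
# networks, as reported above.
discriminator = make_discriminator(in_channels=3)
g_optimizer = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
d_optimizer = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))
```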

5 Experiments

To evaluate the differences between the four networks, each network was trained using pairs of face photos and portraits as training data. Additionally, we conducted an experiment in which student subjects gave subjective evaluations of the generated portraits.

5.1 Datasets

Two datasets were prepared for this study: one consisting of portraits drawn by a student illustrator and the other of portraits drawn by a professional illustrator. We provided the same face photos to both illustrators. Figure 5 shows examples from the dataset. The student illustrator draws portraits by faithfully tracing the outline of the face without deformation, whereas the professional illustrator deforms the contours and facial features based on the face photo, resulting in portraits whose shapes differ from the photo.

In this experiment, pairs of a face photo and a portrait drawn by the professional illustrator were used as training data. In the first step, we collected a total of 100 face photos so that each of the 4 categories (females aged 0–40 and ≥41, and males aged 0–40 and ≥41) contained 25 photos. Of the 100 face photos, 90 were used as training data and 10 as test data. The test data were selected to be as balanced as possible across categories: 2 photos each from women aged 0–40 and men aged 41 and over, and 3 photos each from women aged 41 and over and men aged 0–40. Next, we added 99 images with features such as beards, glasses, and gray hair to the training data. Each face photo was scanned at 257–300 pixels and trimmed to 256 × 256 pixels, and image processing such as scaling and horizontal flipping was applied, giving a total of 189 training images. In this experiment, we report results under two conditions: 90 and 189 training images.
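
A sketch of this preprocessing and augmentation pipeline using torchvision is shown below; the resize value and flip probability are assumptions for illustration.

```python
from torchvision import transforms

# Scaling, 256x256 cropping, and horizontal flipping as described above;
# normalization to [-1, 1] matches the tanh output range of the Generator.
train_transform = transforms.Compose([
    transforms.Resize(280),                  # scanned photos are roughly 257-300 px
    transforms.RandomCrop(256),              # trim to 256 x 256
    transforms.RandomHorizontalFlip(p=0.5),  # horizontal flip augmentation
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),
])
```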

Fig. 5. Dataset example

5.2 Comparison of Generated Results

Figures 6, 7 and 8 show the results generated from the test data by each trained network. Caricatures generated from the teacher images of the student illustrator and the professional illustrator are shown in the upper two rows and the lower two rows, respectively. Within each pair of rows, the upper row uses 90 training images and the lower row uses 189. The columns correspond to the different methods: the four methods described in Sect. 3 and AutoEncoder [11] as a reference method.

AutoEncoder and pix2pix fail to compose the facial parts of the portraits, whereas Paired CycleGAN produces fairly good portraits but still fails frequently.

However, some cases in which facial features were incorrectly reflected in the portrait were found in the CycleGAN results. For example, in the second row of Fig. 7, glasses are drawn in the portrait even though the subject does not wear glasses in the photo. Cyclepix, on the other hand, produced no cases in which facial features were incorrectly reflected. The Cyclepix results in Fig. 8 show that difficult facial features such as frameless glasses are reflected in the portraits better than with the other methods. A comparison of the two illustrators shows that the student illustrator's portraits could be generated with relatively high precision using pix2pix and AutoEncoder, whereas these methods failed to compose the facial parts of the professional illustrator's portraits. With the other networks as well, portraits in the student illustrator's style were generated more accurately than those in the professional illustrator's style. Regarding the hair, a comparison of the male portraits in Fig. 6 with the female portraits in Fig. 7 shows that the hairstyle of the male portrait is close to the teacher image, while that of the female portrait is not generated well. This is because the man's hairstyle is a common one, whereas the woman ties her hair back and no photograph of a woman with tied-back hair was included in the teacher data.

Fig. 6. Portrait generation result for men in their 20s (test data)

Fig. 7. Portrait generation result for women in their 20s (test data)

Fig. 8. Portrait generation result for men in their 50s (test data)

Figure 9 shows the portraits generated from the training data by the trained networks. Pix2pix and Paired CycleGAN almost reproduced the training data, while CycleGAN did not reproduce it accurately. Combined with the poor accuracy on the test data, this indicates that overfitting occurred in pix2pix and Paired CycleGAN, while CycleGAN did not learn according to the training data. In Cyclepix, the portraits were generated almost exactly as in the training data while the test results were also appropriate, indicating that well-balanced learning was achieved. For the training data, no differences were found between the two illustrators.

Fig. 9. Portrait generation result for men in their 20s (training data)

Next, Fig. 10 shows examples in which CycleGAN did not learn according to the training data whereas Cyclepix did. The image at the top is a face photo of a man in his twenties who does not wear glasses. CycleGAN erroneously recognized the shaded area around the eyes as glasses and drew them in the portrait. In the image at the bottom, the shadow on the person's face was incorrectly converted into a mustache. We also observed that Cyclepix learned the clothing areas of the face photos according to the training data. These results show that Cyclepix can solve CycleGAN's problem of not always learning according to the training data. From the results on the training and test data, pix2pix and Paired CycleGAN were considered unsuitable for generating portraits because they overfit the small amount of training data. In contrast, CycleGAN and Cyclepix produced accurate portraits from the test data and were considered suitable networks for portrait generation.

Fig. 10. Portrait generation result (training data)

In summary, CycleGAN has the problem that it cannot accurately restore the training data when trained on a small dataset, whereas Cyclepix can restore the training data even with a small dataset and is therefore suitable for portrait generation.

5.3 An Experiment to Evaluate the Similarity of Generated Portraits

Because it is not appropriate to evaluate portraits generated by GANs directly from the loss value, we conducted an evaluation experiment with human subjects based on the similarity between the portrait drawn by the illustrator and the portraits generated by each network. In the experiment, the 8 portraits generated by the 4 networks with the two training set sizes (90 and 189) were displayed on screen together with the portrait drawn by the illustrator. Ten subjects aged 23 to 24 rated the similarity of each generated portrait on a 5-point scale (−2: dissimilar, −1: slightly dissimilar, 0: neither, 1: somewhat similar, 2: similar) and ranked the 8 portraits for each of 10 sets. The eight portraits were displayed in random order.

The results of the 5-point scale shown in Table 1 are 1.22 for CycleGAN (189), 1.04 for Cyclepix (90), 0.97 for Cyclepix (189), and 0.4 for CycleGAN (90). It can be said that the subjects judged that the portraits generated by these methods were similar to the original portraits. Paired CycleGAN and pix2pix had negative values, and were evaluated as dissimilar. According to the ranking results, CycleGAN (189) was first, followed by Cyclepix (90) and Cyclepix (189).

Table 1. Evaluation experiment results

6 Conclusion and Discussion

We presume that CycleGAN and Cyclepix were rated highly because each facial part was reproduced clearly without blurring, resulting in accurate portraits. For pix2pix and Paired CycleGAN, we consider that overfitting occurred because the Discriminator input was a pair of images, which conditions the adversarial loss. Regarding the cycle consistency loss used in CycleGAN and Cyclepix, we believe it contributed more to generation accuracy than the L1 loss, because the accuracy of Paired CycleGAN was better than that of pix2pix. Overfitting occurred in pix2pix because, when the output is deformed as in caricatures, the shapes of the contours and facial features change, so the input and output cannot be kept consistent and the reproduction accuracy decreases. It was also confirmed that, owing to unsupervised learning, CycleGAN does not always learn according to the training data, even in portrait generation. To solve this problem, supervised learning that adds the L1 loss (the pixel error between the generated caricature and the teacher's caricature), as in Cyclepix, works better than supervised learning that pairs the inputs of the Discriminator, as in Paired CycleGAN. If the Discriminator input is a pair, learning adapts excessively to the training data because the correspondence between input data is known, whereas if the Discriminator input is only the generated data, the learning generalizes better because the generated data is compared against multiple training samples. CycleGAN was rated highly in the experiment because, with unsupervised learning, the facial parts were drawn clearly without overfitting and there were few obvious failures in the experimental data. For this reason, we consider that Cyclepix, which uses supervised learning based on pixel errors, will improve in accuracy as the data increases and can reflect the characteristics of the illustrator's drawing style.

From the perspective of the difference between illustrators, we found that AutoEncoder and pix2pix can generate portraits with some accuracy when the input and output shapes match, as in the case of the student illustrator. However, when the drawn portraits are deformed and the input and output shapes do not match, as in the case of the professional illustrator, these methods fail to compose the facial parts and the generation fails. Therefore, they are not suitable for reflecting the individuality of illustrators with a small amount of training data.

Finally, two problems remain: features such as mustaches and frameless glasses cannot be fully reflected in the portraits because of the small amount of training data, and the Discriminator is too strong relative to the Generator. Future work includes increasing the training data and changing the network structure and image size so that facial features such as mustaches and frameless glasses can be extracted. We will also try to reduce the Discriminator's influence by lowering its learning rate or reducing how frequently it is updated relative to the Generator.