
Why do deep convolutional networks generalize so poorly to small image transformations? (English) Zbl 1433.68389

Summary: Convolutional Neural Networks (CNNs) are commonly assumed to be invariant to small image transformations: either because of the convolutional architecture or because they were trained using data augmentation. Recently, several authors have shown that this is not the case: small translations or rescalings of the input image can drastically change the network’s prediction. In this paper, we quantify this phenomenon and ask why neither the convolutional architecture nor data augmentation is sufficient to achieve the desired invariance. Specifically, we show that the convolutional architecture does not give invariance, since architectures ignore the classical sampling theorem, and data augmentation does not give invariance, because the CNNs learn to be invariant to transformations only for images that are very similar to typical images from the training set. We discuss two possible solutions to this problem: (1) antialiasing the intermediate representations and (2) increasing data augmentation, and show that they provide only a partial solution at best. Taken together, our results indicate that the problem of ensuring invariance to small image transformations in neural networks while preserving high accuracy remains unsolved.
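The sampling-theorem argument in the summary can be illustrated with a minimal one-dimensional sketch (not from the paper; `subsample` and `blur` are hypothetical stand-ins for a strided layer and for the antialiasing of Zhang [37]): strided subsampling of a signal that still contains high frequencies is not shift-equivariant, whereas low-pass filtering before subsampling makes the output far less sensitive to a one-sample shift.

```python
import numpy as np

def subsample(x, stride=2):
    # Plain strided subsampling, as in a strided convolution or pooling layer.
    return x[::stride]

def blur(x, kernel=(0.25, 0.5, 0.25)):
    # Binomial low-pass filter: the antialiasing step applied before subsampling.
    return np.convolve(x, kernel, mode="same")

rng = np.random.default_rng(0)
x = rng.standard_normal(64)       # a signal with plenty of high-frequency content
x_shift = np.roll(x, 1)           # the same signal translated by one sample

# Without antialiasing, the subsampled versions of x and its shift
# read out disjoint sets of samples, so they can differ arbitrarily.
d_plain = np.linalg.norm(subsample(x) - subsample(x_shift))

# With antialiasing (blur, then subsample), adjacent samples are
# correlated and the outputs of the two shifts stay much closer.
d_blur = np.linalg.norm(subsample(blur(x)) - subsample(blur(x_shift)))

print(d_plain, d_blur)
```

On this random input the blurred pipeline changes substantially less under a one-sample shift than the plain one, which is the sense in which antialiasing is only a partial fix: it reduces, but does not eliminate, the shift sensitivity.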

MSC:

68T07 Artificial neural networks and deep learning
62H30 Classification and discrimination; cluster analysis (statistical aspects)
62M45 Neural nets and related approaches to inference from stochastic processes
68T05 Learning and adaptive systems in artificial intelligence

Software:

ScatNet

References:

[1] Aharon Azulay and Yair Weiss. Why do deep convolutional networks generalize so poorly to small image transformations? arXiv preprint arXiv:1805.12177, 2018. · Zbl 1433.68389
[2] Tamara L Berg and Alexander C Berg. Finding iconic images. In Computer Vision and Pattern Recognition Workshops, 2009. CVPR Workshops 2009. IEEE Computer Society Conference on, pages 1-8. IEEE, 2009.
[3] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv preprint arXiv:1412.7062, 2014.
[4] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834-848, 2017.
[5] Gong Cheng, Peicheng Zhou, and Junwei Han. Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 54(12):7405-7415, 2016a.
[6] Gong Cheng, Peicheng Zhou, and Junwei Han. RIFD-CNN: Rotation-invariant and Fisher discriminative convolutional neural networks for object detection. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, pages 2884-2893. IEEE, 2016b.
[8] Taco Cohen and Max Welling. Group equivariant convolutional networks. In International Conference on Machine Learning, pages 2990-2999, 2016.
[9] Taco S Cohen and Max Welling. Transformation properties of learned visual representations. arXiv preprint arXiv:1412.7659, 2014.
[10] Sander Dieleman, Kyle W Willett, and Joni Dambre. Rotation-invariant convolutional neural networks for galaxy morphology prediction. Monthly Notices of the Royal Astronomical Society, 450(2):1441-1459, 2015.
[11] Sander Dieleman, Jeffrey De Fauw, and Koray Kavukcuoglu. Exploiting cyclic symmetry in convolutional neural networks. arXiv preprint arXiv:1602.02660, 2016.
[12] Logan Engstrom, Dimitris Tsipras, Ludwig Schmidt, and Aleksander Madry. A rotation and a translation suffice: Fooling CNNs with simple transformations. CoRR, abs/1712.02779, 2017a. URL http://arxiv.org/abs/1712.02779.
[13] Logan Engstrom, Dimitris Tsipras, Ludwig Schmidt, and Aleksander Madry. A rotation and a translation suffice: Fooling CNNs with simple transformations. arXiv preprint arXiv:1712.02779, 2017b.
[14] Carlos Esteves, Christine Allen-Blanchette, Xiaowei Zhou, and Kostas Daniilidis. Polar transformer networks. arXiv preprint arXiv:1709.01889, 2017.
[15] Kunihiko Fukushima and Sei Miyake. Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition. In Competition and Cooperation in Neural Nets, pages 267-285. Springer, 1982.
[16] Robert Gens and Pedro M Domingos. Deep symmetry networks. In Advances in Neural Information Processing Systems, pages 2537-2545, 2014.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026-1034, 2015.
[18] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700-4708, 2017.
[19] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541-551, 1989.
[20] Karel Lenc and Andrea Vedaldi. Understanding image representations by measuring their equivariance and equivalence. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 991-999. IEEE, 2015. · Zbl 1458.68236
[21] Elad Mezuman and Yair Weiss. Learning about canonical views from internet image collections. In Advances in Neural Information Processing Systems, pages 719-727, 2012.
[22] Rahul Raguram and Svetlana Lazebnik. Computing iconic summaries of general visual concepts. In Computer Vision and Pattern Recognition Workshops, 2008. CVPRW’08. IEEE Computer Society Conference on, pages 1-8. IEEE, 2008.
[23] Laurent Sifre and Stéphane Mallat. Rotation, scaling and deformation invariant scattering for texture discrimination. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 1233-1240. IEEE, 2013.
[24] Ian Simon, Noah Snavely, and Steven M Seitz. Scene summarization for online image collections. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages 1-8. IEEE, 2007.
[25] Eero P Simoncelli, William T Freeman, Edward H Adelson, and David J Heeger. Shiftable multiscale transforms. IEEE Transactions on Information Theory, 38(2):587-607, 1992.
[26] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[27] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
[28] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1-9, 2015.
[29] Antonio Torralba and Alexei A Efros. Unbiased look at dataset bias. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1521-1528. IEEE, 2011.
[30] Tobias Weyand and Bastian Leibe. Discovering favorite views of popular places with iconoid shift. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 1132-1139. IEEE, 2011.
[31] Daniel E Worrall, Stephan J Garbin, Daniyar Turmukhambetov, and Gabriel J Brostow. Harmonic networks: Deep translation and rotation equivariance. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), volume 2, 2017.
[32] Yichong Xu, Tianjun Xiao, Jiaxing Zhang, Kuiyuan Yang, and Zheng Zhang. Scale-invariant convolutional neural networks. arXiv preprint arXiv:1411.6369, 2014.
[33] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
[34] Fisher Yu, Vladlen Koltun, and Thomas Funkhouser. Dilated residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 472-480, 2017.
[36] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818-833. Springer, 2014.
[37] Richard Zhang. Making convolutional networks shift-invariant again. arXiv preprint arXiv:1904.11486, 2019.
[38] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586-595, 2018.
[39] Bolei Zhou, Aditya Khosla, Àgata Lapedriza, Aude Oliva, and Antonio Torralba.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases the data have been complemented or enhanced with data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible, without claiming completeness or a perfect matching.