
Towards Discrete Object Representations in Vision Transformers with Tensor Products

Published: 14 March 2024

Abstract

In this work, we explore the use of Tensor Product Representations (TPRs) in a Vision Transformer to form image representations that can later be used for symbolic manipulation in a neurosymbolic model. We propose the Tensor Product Vision Transformer (TP-ViT), an enhancement of the Vision Transformer that incorporates TPRs, an object representation methodology in which objects are encoded by binding filler vectors to role vectors. TP-ViT is the first application of TPRs to visual input, and we report qualitative and quantitative results showing that TPRs yield more targeted and diverse object representations than a standard Vision Transformer.
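The core binding operation behind TPRs can be sketched as follows. This is a minimal NumPy illustration of Smolensky-style binding and unbinding, not the paper's TP-ViT architecture; the dimensions, number of objects, and the orthonormal-role assumption are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

d_f, d_r, n = 8, 8, 3  # filler dim, role dim, number of objects

# Orthonormal role vectors: columns of a random orthogonal matrix.
roles = np.linalg.qr(rng.normal(size=(d_r, d_r)))[0][:, :n]

# One filler vector per object.
fillers = rng.normal(size=(n, d_f))

# Bind: T = sum_i  f_i (outer product) r_i  ->  a d_f x d_r matrix
# that superimposes all filler/role pairs in one representation.
T = sum(np.outer(fillers[i], roles[:, i]) for i in range(n))

# Unbind: with orthonormal roles, T @ r_j = sum_i f_i (r_i . r_j) = f_j,
# so querying with a role vector recovers that object's filler.
recovered = T @ roles[:, 1]
print(np.allclose(recovered, fillers[1]))  # True
```

Exact recovery here depends on the roles being orthonormal; with merely random (approximately orthogonal) roles in high dimensions, unbinding recovers the filler only up to crosstalk noise from the other bound pairs.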



Published In

CSAI '23: Proceedings of the 2023 7th International Conference on Computer Science and Artificial Intelligence
December 2023
563 pages
ISBN:9798400708688
DOI:10.1145/3638584
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. computer vision
  2. neurosymbolic AI
  3. object representations
  4. tensor product representations
  5. vision transformer

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • Ministry of Higher Education Malaysia

