Computer Science > Computer Vision and Pattern Recognition

arXiv:2403.12037 (cs)

[Submitted on 18 Mar 2024 (v1), last revised 19 Mar 2024 (this version, v2)]

Title: MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control

Authors: Enshen Zhou, Yiran Qin, Zhenfei Yin, Yuzhou Huang, Ruimao Zhang, Lu Sheng, Yu Qiao, Jing Shao

Abstract: Designing a generalist embodied agent that can follow diverse instructions in human-like ways is a long-standing goal. However, existing approaches often fail to follow instructions steadily because they struggle to understand abstract and sequential natural-language instructions. To this end, we introduce MineDreamer, an open-ended embodied agent built on the challenging Minecraft simulator, with an innovative paradigm that enhances instruction-following ability in low-level control signal generation. Specifically, MineDreamer builds on recent advances in Multimodal Large Language Models (MLLMs) and diffusion models, and employs a Chain-of-Imagination (CoI) mechanism that envisions the step-by-step process of executing an instruction and translates each imagination into a more precise visual prompt tailored to the current state; the agent then generates keyboard-and-mouse actions to efficiently achieve these imaginations, steadily following the instruction at each step. Extensive experiments demonstrate that MineDreamer follows single- and multi-step instructions steadily, significantly outperforming the best generalist agent baseline and nearly doubling its performance. Moreover, qualitative analysis of the agent's imaginative ability reveals its generalization to, and comprehension of, the open world.
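The abstract describes a repeated cycle: imagine the next step toward the instruction, turn that imagination into a visual prompt for the current state, then emit a low-level action. The sketch below shows only that control-loop shape under stated assumptions. It is not the authors' implementation: `chain_of_imagination`, `imagine`, `to_visual_prompt`, `policy`, and `env_step` are hypothetical names, and the toy one-dimensional "world" (a float position) stands in for Minecraft frames, the MLLM/diffusion imagination model, and the keyboard-and-mouse action space.

```python
from typing import Callable, Dict, List

# Hypothetical toy types: a real agent would use image observations and the
# Minecraft keyboard-and-mouse action space instead of these stand-ins.
Observation = float
Action = Dict[str, int]


def chain_of_imagination(
    instruction: str,
    obs: Observation,
    imagine: Callable[[str, Observation], Observation],
    to_visual_prompt: Callable[[Observation], float],
    policy: Callable[[Observation, float], Action],
    env_step: Callable[[Action], Observation],
    max_steps: int,
) -> List[Action]:
    """Run a hypothetical imagine -> prompt -> act loop for max_steps steps."""
    actions: List[Action] = []
    for _ in range(max_steps):
        # 1. Imagine the next state on the way to fulfilling the instruction.
        imagined = imagine(instruction, obs)
        # 2. Convert the imagination into a prompt tied to the current state.
        prompt = to_visual_prompt(imagined)
        # 3. Choose a low-level action that moves toward the imagined state.
        action = policy(obs, prompt)
        actions.append(action)
        # 4. Execute the action, observe the new state, and repeat.
        obs = env_step(action)
    return actions


# Toy usage: the "instruction" is to walk 3 blocks forward along one axis.
state = {"x": 0.0}


def env_step(action: Action) -> Observation:
    # Hypothetical environment: "forward" moves the agent by that many blocks.
    state["x"] += action["forward"]
    return state["x"]


acts = chain_of_imagination(
    "walk forward 3 blocks",
    state["x"],
    imagine=lambda instr, x: min(x + 1.0, 3.0),  # imagine one step ahead, capped at goal
    to_visual_prompt=lambda img: img,  # identity stand-in for a visual prompt encoder
    policy=lambda x, goal: {"forward": 1 if goal > x else 0},
    env_step=env_step,
    max_steps=5,
)
print([a["forward"] for a in acts], state["x"])
```

Re-imagining at every step, rather than planning once up front, is what lets the loop keep the prompt aligned with the current state; in the toy run the agent stops issuing forward moves once its imagination and its position coincide.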

Comments:	Project page: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2403.12037 [cs.CV]
	(or arXiv:2403.12037v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2403.12037

Submission history

From: Enshen Zhou
[v1] Mon, 18 Mar 2024 17:59:42 UTC (4,221 KB)
[v2] Tue, 19 Mar 2024 14:52:28 UTC (4,221 KB)
