Computer Science > Computer Vision and Pattern Recognition

arXiv:2407.21757 (cs)

[Submitted on 31 Jul 2024 (v1), last revised 12 Sep 2024 (this version, v2)]

Title: Learning Video Context as Interleaved Multimodal Sequences

Authors: Kevin Qinghong Lin, Pengchuan Zhang, Difei Gao, Xide Xia, Joya Chen, Ziteng Gao, Jinheng Xie, Xuhong Xiao, Mike Zheng Shou

Abstract: Narrative videos, such as movies, pose significant challenges for video understanding because of their rich contexts (characters, dialogues, storylines) and diverse demands (identifying who is involved, how they are related, and why events happen). In this paper, we introduce MovieSeq, a multimodal language model developed to address the wide range of challenges in understanding video contexts. Our core idea is to represent videos as interleaved multimodal sequences (including images, plots, videos, and subtitles), obtained either by linking external knowledge databases or by using offline models (such as Whisper for subtitles). Through instruction tuning, this approach enables the language model to interact with videos using interleaved multimodal instructions. For example, instead of relying solely on video as input, we jointly provide character photos alongside their names and dialogues, allowing the model to associate these elements and generate more comprehensive responses. To demonstrate its effectiveness, we validate MovieSeq's performance on six datasets (LVU, MAD, MovieNet, CMD, TVC, MovieQA) across five settings (video classification, audio description, video-text retrieval, video captioning, and video question answering). The code will be public at this https URL.
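
The interleaving idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Segment` type, the `build_interleaved_sequence` function, the prompt wording, and the file names are all hypothetical, chosen only to show how character photos, names, a plot summary, a video clip, and subtitles might be ordered into one sequence before being passed to a multimodal language model.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Segment:
    """One element of an interleaved sequence (hypothetical structure)."""
    kind: str      # "text", "image", or "video"
    content: str   # the text itself, or a path to the visual media


def build_interleaved_sequence(
    characters: List[Tuple[str, str]],   # (name, photo path) pairs
    plot: Optional[str],                 # plot summary, if available
    clip_path: str,                      # path to the video clip
    subtitles: List[Tuple[str, str]],    # (speaker, line) pairs
    question: str,                       # the instruction for the model
) -> List[Segment]:
    """Order the available context into a single interleaved sequence."""
    seq: List[Segment] = []
    # Pair each character photo with its name so the two can be associated.
    for name, photo_path in characters:
        seq.append(Segment("image", photo_path))
        seq.append(Segment("text", f"This is {name}."))
    # Plot text is optional context, e.g. from an external knowledge source.
    if plot:
        seq.append(Segment("text", f"Plot: {plot}"))
    seq.append(Segment("video", clip_path))
    # Subtitles could come from an offline speech recognizer.
    for speaker, line in subtitles:
        seq.append(Segment("text", f"{speaker}: {line}"))
    seq.append(Segment("text", question))
    return seq


example = build_interleaved_sequence(
    characters=[("Alice", "alice.jpg")],
    plot="Alice meets Bob.",
    clip_path="clip.mp4",
    subtitles=[("Alice", "Hello.")],
    question="Who is speaking in this clip?",
)
for segment in example:
    print(segment.kind, "|", segment.content)
```

In a real system each image and video segment would be encoded into visual tokens and each text segment into text tokens; the sketch covers only the ordering step.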

Comments:	Accepted by ECCV 2024
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Cite as:	arXiv:2407.21757 [cs.CV]
	(or arXiv:2407.21757v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2407.21757

Submission history

From: Qinghong Lin
[v1] Wed, 31 Jul 2024 17:23:57 UTC (11,515 KB)
[v2] Thu, 12 Sep 2024 14:01:56 UTC (11,495 KB)
