The aim of the present study was to predict the response of patients with non-small cell lung cancer (NSCLC) to immune checkpoint inhibitors (ICIs) from computed tomography (CT) images using deep learning. A total of 624 sequential CT images, together with clinical data, were retrospectively collected from 156 patients at Jiangsu Province Hospital. The dataset was partitioned into three groups: training (n=547), validation (n=64), and test (n=64). An external validation cohort comprised 37 CT images, with accompanying clinical data, from patients at Nanjing Pukou People's Hospital. A Video Vision Transformer (ViViT) model with global self-attention was used to analyse the CT images of ICI-treated patients and predict their response. Model performance was evaluated using a confusion matrix and the receiver operating characteristic (ROC) curve. The ViViT model predicted ICI response with areas under the ROC curve (AUC) of 0.74 (95% CI: 0.69-0.78), 0.74 (95% CI: 0.61-0.86), 0.76 (95% CI: 0.62-0.88), and 0.69 (95% CI: 0.5-0.87) in the training, validation, test, and external validation cohorts, respectively. The present study illustrates how a deep learning model can provide a non-invasive means of predicting clinical outcomes in patients with NSCLC receiving ICI therapy, potentially improving personalized treatment for these patients.
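The evaluation described above (a confusion matrix, and an AUC reported with a 95% CI) can be illustrated with a minimal sketch. The abstract does not state how the confidence intervals were obtained; the percentile bootstrap used here is an assumption, as are the function names, the 0.5 decision threshold, and the toy labels and scores, none of which are study data.

```python
import random

def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(labels, scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AUC (an assumed method, not confirmed by the study)."""
    if len(set(labels)) < 2:
        raise ValueError("both classes must be present")
    rng = random.Random(seed)
    n = len(labels)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # AUC is undefined when a resample contains a single class
        aucs.append(auc_score(ys, [scores[i] for i in idx]))
    aucs.sort()
    lower = aucs[int((alpha / 2) * n_boot)]
    upper = aucs[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper

def confusion_matrix(labels, scores, threshold=0.5):
    """Return (tp, fp, fn, tn) for predictions obtained by thresholding the scores."""
    tp = fp = fn = tn = 0
    for y, s in zip(labels, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Hypothetical toy data: 1 = responder, 0 = non-responder; scores are model probabilities.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.7, 0.1]

auc = auc_score(labels, scores)
ci_low, ci_high = bootstrap_auc_ci(labels, scores)
tp, fp, fn, tn = confusion_matrix(labels, scores)
print(f"AUC = {auc:.2f} (95% CI: {ci_low:.2f}-{ci_high:.2f})")
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
```

With small cohorts, such as the 37-image external validation set, a resampling-based interval of this kind would be expected to be wide, which is consistent with the broader interval reported for that cohort.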