A Joint Representation Using Continuous and Discrete Features for Cardiovascular Diseases Risk Prediction on Chest CT Scans

Abstract

Introduction

Cardiovascular diseases (CVD) continue to pose a significant public health concern, accounting for 45% of deaths in Europe [wilkins2017european] and 31% in the United States [mozaffarian2016heart]. Despite a notable 7% reduction in CVD mortality rates since 2005, largely attributed to advances in clinical screening and prevention strategies, heart disease remains one of the leading causes of death in the United States [d20191997]. The ongoing challenge is to identify asymptomatic patients who would benefit most from preventive measures, such as initiating statin therapy [ridker2008rosuvastatin] and others, in a convenient, accessible, accurate, and non-invasive manner. This is a crucial aspect of effective preventive cardiology patient management.

Several guidelines endorse different CVD risk prediction models. The 2010 guidelines [greenland20102010] from the American College of Cardiology / American Heart Association (ACC / AHA) suggest the use of the Framingham Risk Score (FRS) [wilson1998prediction, eichler2007prediction, d2008general], a well-established and thoroughly validated model that considers traditional risk factors such as cholesterol levels and blood pressure. In contrast, the 2016 European guidelines favor the Systematic Coronary Risk Evaluation (SCORE) model [piepoli2016guidelines]. The 2016 Chinese guidelines introduce the China-PAR model, specifically designed to estimate the 10-year risk for atherosclerotic cardiovascular disease [yang2016predicting, liu2018predicting]. However, these recommended risk prediction models rely on a limited set of predictors, which may result in suboptimal risk management performance across patient populations.

Recent studies have begun to utilize CT scans to extract quantitative imaging biomarkers, demonstrating good potential for surpassing the predictive performance of the multivariate FRS [pickhardt2020automated]. With the development of deep learning in medical imaging, previous work explored integrating CT scans with deep learning algorithms to automatically obtain CT quantitative biomarkers related to CVD risk [xu2023ai, eng2021automated, zeleznik2021deep]. These multivariate-analysis methods are designed for CVD risk prediction and offer considerable interpretability. However, they often overlook the interactions among different factors, and due to restrictive modeling assumptions and a limited number of predictors, they tend to attain only moderate predictive performance [siontis2012comparisons]. On the other hand, some studies [van2019direct, chao2021deep] have directly applied deep learning to predict CVD risk from chest CT scans. While these methods can greatly improve predictive accuracy, they provide little interpretable insight into the model’s decision-making process.

In this study, we propose a simple yet effective information fusion method, called DeepCVD, that extracts continuous deep features and calculates discrete quantitative biomarkers from CT scans. These features are subsequently integrated using a joint representation module that fosters interaction, ultimately enhancing the precision of CVD risk prediction and enabling the analysis of each feature’s contribution to the model’s decision-making process. In the feature extraction phase, discrete features are generated from image segmentation models as CT quantitative biomarkers, crafted based on current clinical knowledge to improve the model’s predictive accuracy and applicability. Meanwhile, continuous deep image features derived from a trained deep CVD classifier encompass rich higher-order semantic contextual information pertinent to CVD, albeit challenging to quantify. We propose a joint representation module to fully utilize the complementary information of these two channels of features for superior CVD risk prediction. Specifically, we employ an instance-wise feature gating mechanism (IFGM) to align the densities and dimensions of the input continuous and discrete features. We apply a Gated Residual Network (GRN) [lim2021temporal], a flexible and non-linear operation, to obtain more robust feature embeddings. Inspired by the multi-head attention mechanism [vaswani2017attention], we have designed a soft instance-wise feature interaction mechanism (SIFIM), facilitating thorough interactions among features across diverse subspaces and capturing their intricate interrelations for joint representation learning. Importantly, this strategy preserves the autonomy of each input feature, enabling physicians to understand their respective contributions to the model’s decisions easily. Furthermore, our feature interaction technique allows the model to adaptively output the contribution of each feature for various CVD, aligning closely with practical clinical settings. 
Our approach has achieved competitive performance using both public datasets and a private external validation patient cohort.

Fig. 1: Overview of the proposed DeepCVD framework and the training and testing cohorts. a. Public LDCT-NLST and external standard-dose chest CT (NERC-MBD) cohorts. b. Schematic overview of DeepCVD. It takes a chest CT as input and outputs the probability of CVD risk and the individual contribution score of each biomarker. The DeepCVD framework consists of two stages: the first stage extracts deep continuous features and discrete CT quantitative biomarkers, while the second stage performs joint feature representation followed by CVD risk prediction. c. ROC curves of CVD risk prediction on the LDCT-NLST and NERC-MBD testing cohorts.

Results

0.1 Dataset overview.

In this retrospective study, a total of two cohorts were used to develop and evaluate the performance of DeepCVD (Fig. 1a). Specifically, the National Lung Screening Trial (LDCT-NLST) cohort comprised 10,395 patients with 33,413 volumes, whereas the National Engineering Research Center for Medical Big Data (NERC-MBD) cohort included 6,393 patients with 17,207 volumes.

The LDCT-NLST cohort is a large dataset intended to evaluate the effectiveness of low-dose chest CT scans in lung cancer screening. The trial also collected CVD-related information about its subjects, making the dataset applicable to research on CVD risk prediction [chao2021deep, van2019direct]. Each subject underwent one to three CT examinations, each producing multiple CT volumes reconstructed with different kernels. The cohort has 33,413 CT volumes from 10,395 subjects, and each subject is labeled as either CVD-Positive or CVD-Negative. A subject is considered CVD-positive if any cardiovascular abnormalities are reported in their CT screening examinations, or if the subject dies of CVD. CVD-negative subjects have no history of CVD during the clinical trial, no cardiovascular abnormalities reported in any CT scans, and do not die from circulatory system diseases.

The NERC-MBD cohort comprises 17,207 standard-dose chest CT volumes from 6,393 subjects. Of these, 7,650 CT volumes are from 2,828 CVD-positive subjects, and 9,557 CT volumes are from 3,565 CVD-negative subjects. A subject labeled as CVD-positive is identified based on the diagnosis report from the Department of Cardiology. Subjects diagnosed with acute cardiovascular disease have CT scans from the year prior to diagnosis collected, while those diagnosed with chronic cardiovascular disease have CT scans from the two years prior to diagnosis collected. Subjects without any cardiovascular abnormalities in their CT examination reports or clinical conclusions are categorized as CVD-negative. It is important to note that this cohort includes subjects with 16 types of cardiovascular diseases, and detailed information is summarized in Extended Data Fig. 3.

0.2 Development of joint representation CVD risk prediction system.

We developed the continuous and discrete features joint representation CVD risk prediction system (DeepCVD) in the LDCT-NLST cohort (Fig. 1a). We adopt the three random subsets generated by the prior study [chao2021deep] entirely, with the training set accounting for 70% (7,268 subjects), the validation set 10% (1,042 subjects), and the independent test set 20% (2,085 subjects). We found that the LDCT-NLST cohort contained some data that did not meet the requirements of our model, such as incomplete field of view (FOV), missing slices, and severe cardiac artifacts. Two radiologists cleansed the dataset and removed the problematic subjects, resulting in 10,325 subjects. The training set comprises 7,227 subjects, the validation set comprises 1,033 subjects, and the test set includes 2,065 subjects. Our goal was to achieve accurate CVD risk prediction using chest CT scans while providing physicians with reliable bases for model decisions.

We designed a two-stage system. In the first stage, eighteen discrete CT quantitative biomarkers based on physician’s insights, along with continuous deep features, were extracted from the LDCT-NLST training set and validation set. The second stage focused on training the joint representation CVD risk prediction model based on the discrete quantitative biomarkers and continuous deep features. The process began with feature alignment through an instance-wise feature-gated mechanism to obtain independent embedding vectors, followed by a soft instance-wise feature interaction mechanism to conduct feature interactions at the instance level and to compute attention weights for each instance feature to achieve joint representation. Finally, the model outputted prediction outcomes and contribution scores for each instance feature (Fig. 1b). The CVD risk prediction results of our model were evaluated on the LDCT-NLST testing cohort and the NERC-MBD cohort (Fig. 1c).

Table 1: Quantitative evaluation of DeepCVD with different methods on LDCT-NLST testing cohort.

Methods	Accuracy (95% CI)	Sensitivity (95% CI)	Specificity (95% CI)	F1 Score (95% CI)	AUC (95% CI)	p-value
DeepCVD^m	0.833 (0.816-0.848)	0.616 (0.576-0.655)	0.918 (0.903-0.931)	0.675 (0.642-0.708)	0.875 (0.857-0.891)	-
Tri2D-Net^∗c	0.819 (0.802-0.837)	0.485 (0.444-0.525)	0.952 (0.940-0.962)	0.603 (0.565-0.640)	0.869 (0.850-0.886)	0.118
ResNet34^c	0.796 (0.779-0.814)	0.543 (0.502-0.583)	0.896 (0.880-0.911)	0.601 (0.563-0.634)	0.844 (0.825-0.861)	$<0.001$
nnUNet-J^c	0.824 (0.806-0.840)	0.599 (0.559-0.638)	0.913 (0.898-0.926)	0.658 (0.621-0.691)	0.874 (0.856-0.890)	0.034
ViT-B^c	0.651 (0.629-0.669)	0.560 (0.519-0.600)	0.686 (0.662-0.709)	0.475 (0.442-0.507)	0.676 (0.650-0.701)	$<0.001$
nnFormer-J^c	0.788 (0.769-0.805)	0.692 (0.653-0.728)	0.825 (0.805-0.844)	0.648 (0.616-0.678)	0.837 (0.817-0.856)	$<0.001$
Xgboost^d	0.771 (0.755-0.788)	0.264 (0.230-0.301)	0.971 (0.961-0.978)	0.394 (0.352-0.437)	0.835 (0.817-0.853)	$<0.001$

Note: “*” indicates that Tri2D-Net is directly used as an open-source model trained on the LDCT-NLST training cohort, all other models are trained from scratch based on the LDCT-NLST training cohort. “J” indicates that we have incorporated a classification head into the classic segmentation networks, thus achieving multitask joint training to enhance CVD risk prediction performance. Detailed information about comparison methods can be found in the Methods Section. “m” indicates a model that jointly represents discrete quantification features and continuous deep features. “c” represents an end-to-end deep learning model. “d” represents a model that uses only discrete quantification features.

0.3 Performance on the internal LDCT-NLST testing cohort.

The proposed DeepCVD was used to identify subjects with CVDs in the LDCT-NLST testing cohort, which consisted of 2,065 patients (583 CVD-Positive and 1,482 CVD-Negative). The quantitative performance is summarized in Table 1, and the receiver operating characteristic (ROC) curves of multiple methods are shown in Fig. 1c. DeepCVD achieved an area under the curve (AUC) of 0.875 (95% confidence interval (CI), 0.857-0.891), a sensitivity of 0.616 (95% CI, 0.576-0.655), and a specificity of 0.918 (95% CI, 0.903-0.931). Table 1 shows that models using our designed quantitative biomarkers or directly using prior knowledge, such as pericoronary calcification and epicardial fat, whether Xgboost [chen2016Xgboost], Tri2D-Net [chao2021deep], or our proposed DeepCVD, achieved higher specificity than models like ResNet34 [he2016deep] that did not use prior knowledge. However, the complex encoding of prior knowledge can impair the ability to suppress false positives (the non-encoded Xgboost had a specificity of 0.971, while the encoded Tri2D-Net and DeepCVD had specificities of 0.952 and 0.918, respectively). By combining quantitative biomarkers and continuous deep features, DeepCVD increased the sensitivity by 27.0% and the F1 score by 11.9% in relative terms compared to Tri2D-Net, at the expense of slightly reduced specificity.

To further assess the quality of the embeddings learned by different methods on the LDCT-NLST testing cohort, we visualized them with t-SNE [van2008visualizing]. In Fig. 2, the color of each node indicates whether it is CVD-Positive or CVD-Negative, which reveals the discriminative power of each method. As indicated in Fig. 2, DeepCVD and Tri2D-Net achieve more compact and better separated clusters than the other methods, which do not incorporate prior insights.

0.4 Performance on the external standard-dose chest CT testing cohort.

To assess the generalizability of our proposed DeepCVD, we directly applied the model trained on the LDCT-NLST dataset to the NERC-MBD standard-dose chest CT dataset. DeepCVD achieved an AUC of 0.843 (95% CI, 0.837-0.849), accuracy of 0.819 (95% CI, 0.813-0.824), sensitivity of 0.756 (95% CI, 0.746-0.765), and specificity of 0.870 (95% CI, 0.863-0.876). These results showed significant performance improvement (p $<0.001$ ) compared to the previous state-of-the-art approach Tri2D-Net [chao2021deep] (accuracy increased by 9.8%, sensitivity by 28.4%, F1 score by 17.1%, and AUC by 5.9%, while the specificity remained almost the same). In comparison, other methods had inferior performance that was statistically significant (all p $<0.001$ ). Detailed evaluation performance was summarized in Table 2.
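The percentage gains quoted above appear to be relative changes with respect to Tri2D-Net rather than absolute differences. The short sketch below recomputes them from the values reported in Table 2; the function name `relative_gain` is ours and chosen only for illustration.

```python
def relative_gain(new: float, baseline: float) -> float:
    """Relative change of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0

# Point estimates copied from Table 2 (NERC-MBD testing cohort).
deepcvd = {"accuracy": 0.819, "sensitivity": 0.756, "specificity": 0.870,
           "f1": 0.788, "auc": 0.843}
tri2d_net = {"accuracy": 0.746, "sensitivity": 0.589, "specificity": 0.872,
             "f1": 0.673, "auc": 0.796}

# Relative gain of DeepCVD over Tri2D-Net for every metric, rounded to one decimal.
gains = {m: round(relative_gain(deepcvd[m], tri2d_net[m]), 1) for m in deepcvd}

for metric, gain in gains.items():
    print(f"{metric}: {gain:+.1f}%")
```

Under this reading, the accuracy, sensitivity, F1 score, and AUC gains come out at 9.8%, 28.4%, 17.1%, and 5.9%, matching the text, while specificity changes by about -0.2%.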

Although the specificity of DeepCVD was not the highest, it achieved the best values for all other quantitative performance metrics. Additionally, DeepCVD maintained high performance on both the internal LDCT and the external standard-dose chest CT datasets (AUC: internal 0.875, external 0.843; accuracy: internal 0.833, external 0.819), indicating that its accuracy generalizes well to this large-scale external testing cohort.

Table 2: Quantitative evaluation of DeepCVD with different methods on NERC-MBD testing cohort.

Methods	Accuracy (95% CI)	Sensitivity (95% CI)	Specificity (95% CI)	F1 Score (95% CI)	AUC (95% CI)	p-value
DeepCVD^m	0.819 (0.813-0.824)	0.756 (0.746-0.765)	0.870 (0.863-0.876)	0.788 (0.780-0.795)	0.843 (0.837-0.849)	-
Tri2D-Net^∗c	0.746 (0.739-0.752)	0.589 (0.578-0.600)	0.872 (0.865-0.878)	0.673 (0.664-0.682)	0.796 (0.790-0.803)	$<0.001$
ResNet34^c	0.796 (0.791-0.803)	0.705 (0.695-0.715)	0.869 (0.863-0.876)	0.755 (0.747-0.763)	0.832 (0.826-0.839)	$<0.001$
nnUNet-J^c	0.708 (0.701-0.714)	0.747 (0.737-0.756)	0.676 (0.667-0.686)	0.694 (0.686-0.702)	0.780 (0.773-0.787)	$<0.001$
ViT-B^c	0.643 (0.635-0.649)	0.720 (0.710-0.730)	0.580 (0.570-0.590)	0.642 (0.633-0.650)	0.716 (0.708-0.724)	$<0.001$
nnFormer-J^c	0.726 (0.720-0.733)	0.751 (0.741-0.761)	0.706 (0.697-0.715)	0.709 (0.701-0.717)	0.795 (0.788-0.802)	$<0.001$
Xgboost^d	0.679 (0.672-0.686)	0.369 (0.358-0.380)	0.928 (0.922-0.933)	0.505 (0.494-0.517)	0.820 (0.813-0.826)	$<0.001$

0.5 Feature contribution analysis and visualization.

To interpret the contribution of each instance-wise feature in the decision-making process of CVD risk prediction, we have categorized four scenarios to demonstrate the contribution scores of the features learned by DeepCVD (Fig. 3). Detailed extraction of quantitative CT biomarkers is illustrated in the Methods Section.

(i) Evidence that is directly visible in the CT scans, can diagnose certain CVDs, and is also included in the quantitative biomarkers. The corresponding biomarkers usually drive the main decision of our model. Fig. 3a shows a CT volume with a thoracic aortic aneurysm (TAA). Within the learned contribution scores, the biomarkers describing the shape of the aorta (AMD, AMDSTD, and ATI) accounted for 27.5%, playing a decisive role, while deep features contributed only 7.1%.

(ii) Evidence that is directly visible in the CT volume and can diagnose certain CVDs, but is not included in the quantitative biomarkers. The main decision of our model is then usually made jointly by deep features and relevant biomarkers. Fig. 3b shows a case of pulmonary arterial hypertension (PAH), where the diameter of the pulmonary artery is noticeably enlarged and exceeds that of the ascending aorta, a typical imaging sign of PAH. The pulmonary artery diameter is not included directly in our quantitative biomarkers, but PAH also produces indirect signs in pulmonary texture and heart morphology. The biomarkers related to heart morphology (CHR, CLD, and CSD) accounted for 25.2%, the biomarkers related to pulmonary textures (LLR, RLR, LHR, and RHR) accounted for 16.2%, and the deep features contributed 30.8%, complementing each other well.

(iii) Evidence visible in CT scans that indirectly suggests the presence of certain CVDs. Deep features mainly drive the decision of our model, but other related biomarkers also contribute. Fig. 3c shows a CT volume from a patient who had an acute myocardial infarction (AMI) six months earlier, in which deep features accounted for 42.7% of the decision. Biomarkers related to heart morphology (CHR, CLD, and CSD) accounted for 16.4%, biomarkers related to pericardial fat (PFATV, PFATM, and PFATSTD) accounted for 16.6%, and biomarkers related to coronary calcification (CACS and CACV) accounted for 4.3%.

(iv) CT scans without imaging evidence to diagnose a certain CVD. Deep features almost entirely determine the decision of our model. Fig. 3d shows a CT volume from a healthy individual, where deep features contributed 80.5%.

To further explore whether the distribution of the learned contribution scores exhibits characteristics associated with different diseases, we present additional illustrations for various CVDs (see Extended Data Fig. 1 and Extended Data Fig. 2).

Discussion

It is essential to identify asymptomatic patients at risk of CVD who could benefit greatly from preventive measures. While significant progress has been made in this area, challenges such as suboptimal prediction performance or a lack of interpretable evidence from model outputs make it difficult to use these models in clinical settings. Our DeepCVD can simultaneously represent continuous deep features and discrete quantitative biomarkers to address these issues. This framework fully utilizes the complementary strengths of prior knowledge-based discrete quantitative biomarkers and rich, CVD-relevant continuous deep features. The joint representation of features enables comprehensive interaction and integration between features, significantly enhancing CVD risk prediction performance. Furthermore, our joint representation is carried out at an instance-wise feature level, providing adaptive contribution scores for each biomarker and deep feature in the model decision-making process, offering physicians reliable predictive results.

This study thoroughly validates the effectiveness of DeepCVD. It successfully combines the advantages of discrete and continuous features, improves CVD risk prediction performance, and demonstrates strong generalizability. Multisource feature fusion techniques are also widely used in the medical domain, as feature concatenation is a commonly adopted and effective method for joint feature representation [kiela2014learning]. In the first stage, we concatenate continuous and discrete features in this manner and then feed them into the second stage features joint representation module. We observe an increase in AUC from 0.844 to 0.862 compared to directly using a CVD risk classifier with ResNet34. However, this mode of feature interaction does not fully leverage the respective advantages of quantitative discrete biomarkers and deep continuous features, with specificity increasing from 0.896 to 0.956, while sensitivity decreased from 0.543 to 0.440 (as indicated in Table 1 and Fig. 6c). Furthermore, although our model is trained on LDCT-NLST, it demonstrates remarkable generalization to standard-dose chest CT scans as shown in Table 2. Moreover, our joint representation module outperformed models like Xgboost, even when only discrete features were used, demonstrating the benefits of effective feature interaction for enhancing model performance (shown in Fig. 6d).

We have also discovered that incorporating prior knowledge, either through direct predictions based on CT quantitative biomarkers using Xgboost or by integrating them into deep models such as Tri2D-Net, effectively improves the model’s specificity. Tri2D-Net incorporated pericardial fat and calcification as strong constraints in the prediction model. In the real world, not all CVDs are strongly associated with just two biomarkers, which reduces the model’s sensitivity and its ability to generalize to real-world scenarios. As shown in Fig. 2, although Tri2D-Net also demonstrates high-quality embeddings across the LDCT-NLST test set, its performance significantly drops on the external test set (see Table 2). This is mainly because the external test set includes many subjects with cerebral infarction (refer to Extended Data Fig. 3). Consequently, using CAM [selvaraju2017grad] for model interpretability also becomes less meaningful. In contrast, our model aligns and flexibly processes discrete and continuous features through the instance-wise feature-gated mechanism to generate more robust embeddings. Then, the soft instance-wise feature interaction mechanism achieves thorough interaction and fusion of features while maintaining relative independence. This ensures that our model optimally balances specificity and sensitivity (as seen in Table 1 and Table 2), and automatically learns the relationships between different CVDs and various biomarkers and deep features (as seen in Fig. 3). This informs physicians about the role of each feature in the decision-making process and opens up possibilities to discover new clinical biomarkers in the CVD domain.

A recent study utilized a mature and fully automated abdominal CT-based algorithm with predefined metrics to quantify aortic calcification, muscle density, the ratio of visceral to subcutaneous fat, liver fat, and bone mineral density for assessing CVD risk [pickhardt2020automated]. The research demonstrated that the multivariate combination of CT biomarkers could effectively enhance CVD risk prediction performance over traditional risk factors. For example, the combination of four CT-based quantitative biomarkers—aortic calcification, muscle density, the ratio of visceral to subcutaneous fat, and liver fat—resulted in a 2-year AUC of 0.817 (95% CI 0.768-0.866). Although this study used a different retrospective cohort and abdominal non-contrast CT, the results of the two studies both demonstrated the potential CVD risk prediction value of harnessing the rich biometric tissue data embedded within all body CT scans that typically go unused in routine practice. However, this approach requires predefined CT biomarkers, and the combination of these biomarkers to enhance CVD risk prediction performance needs to be validated through repeated experiments. Our approach goes a step further by using deep features to represent those CT biomarkers that cannot or have not yet been predefined, thereby enhancing CVD risk prediction performance. Furthermore, through a unique design, our model can output the contribution of each feature in the decision-making process. This not only provides more insights for doctors but also suppresses those CT biomarkers that do not aid in CVD risk prediction.

While DeepCVD achieved strong results on both the LDCT-NLST and NERC-MBD testing cohorts with interpretable predictions, this study has certain limitations. CVD encompasses a range of conditions affecting various organs and tissues, including the brain and heart. However, the LDCT-NLST and NERC-MBD testing cohorts cover a limited range of diseases, with an uneven number of chest CT volumes for each disease. Additionally, our model, trained on chest LDCT, focuses solely on discrete CT quantitative biomarkers and deep features within the chest region. Some CVDs may not present direct or indirect signs on chest CT, or they may not yet have affected the organs or tissues in the chest, leading to lower prediction performance. In the NERC-MBD testing cohort, we observed that the prediction performance for diseases such as cerebral infarction and occlusion of the precerebral artery is lower than for diseases related to the heart and major blood vessels, such as ischemic heart disease (Extended Data Fig. 3), indicating that there is room for improvement in our method. Lastly, while we gained insight into the role of each feature in the decision-making process through the learned contribution score (shown in Fig. 3), we have not yet established a direct link between biomarkers and specific CVDs through the contribution score, especially when deep features predominantly drive the model’s decisions.

In clinical practice, the fusion of imaging information with clinical information can result in increased accuracy, more informative clinical decision-making, and improved patient outcomes [huang2020fusion]. Our DeepCVD can be seamlessly integrated with clinical information. In future work, we aim to expand the discrete features from CT quantitative biomarkers to multimodal biomarkers, incorporating clinical and laboratory indicators closely linked to CVDs, such as blood pressure, cholesterol, and smoking history, into the model to enhance predictive performance. Additionally, we plan to establish connections between biomarkers and specific CVDs through contribution scores, providing actionable guidance for physicians in subsequent diagnoses and achieving greater clinical value.

Methods

This study aims to accurately predict CVD risk using chest CT scans and provide physicians with reliable information for model decisions. Our method is based on two findings and assumptions. First, while well-known quantitative biomarkers from CT scans, such as the coronary artery calcium score, are considered risk factors for CVD [pickhardt2020automated, iacobellis2022epicardial], we believe that many undefined or difficult-to-quantify biomarkers related to CVD have not been fully exploited within chest CT scans. Fully utilizing these biomarkers can effectively improve the performance of CVD risk prediction. Second, the relevance and contributions of biomarkers should vary across different CVDs, requiring these features to interact sufficiently to address the diversity of CVDs while also maintaining relative independence so that the model can adaptively output the contribution of each biomarker in decision-making.

We present a new pipeline for CVD risk prediction, as depicted in Fig. 1b. The pipeline consists of two main stages. In the first stage, we extract discrete CT quantitative biomarkers and continuous deep features. We generate $N$ discrete quantitative biomarkers (where $N=18$ in this study) based on clinical insights and four pre-trained segmentation models. We use deep features obtained from a pre-trained deep CVD risk classifier to comprehensively represent these undiscovered or difficult-to-quantify biomarkers, which serve as continuous deep features. The second stage primarily involves the joint representational learning of discrete quantitative biomarkers and continuous deep features. It begins with feature alignment through an instance-wise feature-gated mechanism to obtain independent embedding vectors, followed by a soft instance-wise feature interaction mechanism to conduct feature interactions at the instance level and compute attention weights for each instance feature to achieve joint representation. Finally, the model outputs prediction outcomes and contribution scores for each instance feature.
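To make the idea of per-feature contribution scores concrete, the following is a deliberately simplified sketch and not the IFGM or SIFIM modules themselves. It assumes that every instance feature has already been aligned to an embedding of the same dimension, and it uses a single hypothetical learned query vector with scaled dot-product scoring and a softmax. The names `fuse` and `query` are our own.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(embeddings, query):
    """Attention-weighted fusion of per-feature embeddings.

    embeddings: one vector per instance feature, all of equal length d
                (assumed to be the output of some alignment step).
    query:      hypothetical learned vector of length d that scores each feature.
    Returns (joint_vector, contribution_scores); the scores sum to 1.
    """
    d = len(query)
    # Scaled dot-product score of each feature embedding against the query.
    scores = [sum(q * e for q, e in zip(query, emb)) / math.sqrt(d)
              for emb in embeddings]
    weights = softmax(scores)
    # Joint representation: weighted sum of the (still separate) embeddings.
    joint = [sum(w * emb[j] for w, emb in zip(weights, embeddings))
             for j in range(d)]
    return joint, weights

# Toy example with three aligned embeddings of dimension 4 (invented values).
embeddings = [
    [1.0, 0.0, 0.0, 0.0],  # stands in for the deep feature x_1 after alignment
    [0.0, 1.0, 0.0, 0.0],  # stands in for one biomarker embedding
    [0.0, 0.0, 1.0, 0.0],  # stands in for another biomarker embedding
]
query = [2.0, 0.0, 0.0, 0.0]
joint, contributions = fuse(embeddings, query)
print([round(c, 3) for c in contributions])
```

Because each feature keeps its own embedding until the final weighted sum, the softmax weights can be read as instance-level contribution scores, which is the property the pipeline relies on for interpretability.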

0.6 Deep continuous feature and discrete biomarker extraction

0.7 Continuous deep features extraction.

The extraction of deep continuous features relies on a pre-trained CVD risk prediction model, which has been trained on the LDCT-NLST training set. The model is based on a progressive coarse-to-fine framework. In the coarse stage (heart localization), a lightweight cardiac segmentation model is applied to localize the volume of the region of interest in the heart. The input chest LDCT scan is cropped into a smaller volume to reduce the influence of irrelevant content noise and conserve computational resources. Moving to the fine stage, the cropped volume is used to train a classifier based on ResNet34 [he2016deep], tasked with distinguishing between CVD-Positive and CVD-Negative cases. Ultimately, the outputs from the layer preceding the final fully connected layer of ResNet34 are extracted. These high-dimensional embeddings are rich in features relevant to CVDs, denoted as $x_{1}$ ( $x_{1}\in\mathbb{R}^{1\times D}$ , here $D=512$ ), and serve as continuous deep features of our proposed method.

0.8 Discrete CT quantitative biomarkers extraction.

The CT quantitative biomarkers extraction process utilizes four specialized body part segmentation models, each trained on the 400 internal chest CT scans. These models are designed to segment different anatomical structures, including the heart chambers and pericardium, aortic and coronary calcium, aortic structure, and left and right lungs. The foundational architecture for these models is MedFormer [gao2022data]. These four segmentation models are applied to segment the whole LDCT-NLST dataset, with the results being reviewed by two radiologists and revised if necessary. The overall failure rate is less than 0.5 $\%$ . Fig. 4 visually showcases the results obtained from these fully automated body part segmentation models.

We have established stable quantitative measures for each tissue composition based on the automated segmentation results without additional learning or adjustment. A total of $N$ quantitative biomarkers are calculated, with certain biomarkers such as the coronary artery calcium score (CACS) and the cardiothoracic ratio (CRT) having established associations with CVD and mortality in previous research [pickhardt2020automated, iacobellis2022epicardial, girardi2021aortic, hemingway1998cardiothoracic, dey2012epicardial]. Others are characterized by physicians using the results of the four segmentation models. All of these biomarkers are scalar and are symbolized as $x_{i}$ (where $i\in[2,N+1]$ ).

(i) Based on the segmentation results of the pericardium, three quantitative biomarkers related to pericardial fat are calculated: PFATV, PFATM, and PFATSTD. PFATV quantifies the volume of pericardial fat. First, we use the result of pericardial segmentation (marked in red in Fig. 4f) to identify the location of the pericardium on the CT scan. Next, we delineate the regions of pericardial fat (as depicted in Fig. 4b) as the voxels within the Hounsfield Unit (HU) range of [-190HU, -30HU] [dey2012epicardial]. We then count the number of voxels within the 3D pericardial fat mask (illustrated in Fig. 4c) and multiply this count by the voxel volume ( $x$ resolution $\times$ $y$ resolution $\times$ $z$ resolution) to obtain PFATV, expressed in $mm^{3}$ . Finally, we compute the mean and standard deviation of the attenuation values of all voxels within the pericardial fat regions, denoted as PFATM and PFATSTD, respectively.

(ii) Based on the segmentation results of calcifications (Coronary Artery Calcification and Thoracic Aorta Calcification), four calcification-related quantitative biomarkers are calculated: CACS, CACV, ACS, and ACV. CACS and ACS are area-based measures, calculated according to the Agatston Score [agatston1990quantification] for coronary and thoracic aortic calcification, respectively. CACV and ACV are volumetric measures, quantifying the volume of coronary calcification and thoracic aortic calcification, respectively. First, the calcification segmentation results are refined: only regions with CT attenuation greater than or equal to 130HU are retained as the final calcification segmentation mask. Then, for the calcification scores, the results are uniformly reconstructed to a 3mm slice thickness before computing CACS and ACS, to minimize the impact of different CT slice thicknesses. For the calcification volumes, the number of voxels within the 3D coronary calcification mask (green areas in Fig. 4d and Fig. 4e) and within the thoracic aorta calcification mask (red regions in Fig. 4d and Fig. 4e) is counted directly and multiplied by the voxel volume ( $x$ resolution $\times$ $y$ resolution $\times$ $z$ resolution) to obtain CACV and ACV (in $mm^{3}$ ).
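The refinement and volume steps in (ii), together with the density weighting used in Agatston scoring, can be sketched as below. This is a deliberately simplified illustration under stated assumptions: all calcified pixels in a slice are treated as a single lesion instead of being split into connected lesions, the resampling to 3mm slices is omitted, and no minimum lesion area is applied. The function names and toy values are hypothetical.

```python
import numpy as np

def agatston_weight(peak_hu):
    """Agatston density factor for a lesion's peak attenuation (HU)."""
    if peak_hu < 130:
        return 0
    if peak_hu < 200:
        return 1
    if peak_hu < 300:
        return 2
    if peak_hu < 400:
        return 3
    return 4

def calcification_biomarkers(ct_hu, calc_mask, spacing_mm, hu_min=130.0):
    """Return (volume in mm^3, simplified Agatston-style score).

    ct_hu and calc_mask are indexed (z, y, x); spacing_mm is (x, y, z) in mm.
    """
    # Refinement: keep only segmented voxels at or above 130 HU.
    refined = calc_mask.astype(bool) & (ct_hu >= hu_min)
    sx, sy, sz = spacing_mm
    volume = refined.sum() * sx * sy * sz
    score = 0.0
    for z in range(refined.shape[0]):
        slice_mask = refined[z]
        if slice_mask.any():
            # Simplification: one lesion per slice (area times density factor).
            area_mm2 = slice_mask.sum() * sx * sy
            score += area_mm2 * agatston_weight(ct_hu[z][slice_mask].max())
    return float(volume), float(score)

# Toy volume with two axial slices (hypothetical values).
ct = np.array([[[150.0, 100.0], [250.0, 0.0]],
               [[450.0, 500.0], [50.0, 135.0]]])
mask = np.array([[[True, True], [True, True]],
                 [[True, False], [True, True]]])
volume, score = calcification_biomarkers(ct, mask, (0.5, 0.5, 2.0))
print(volume, score)
```

A faithful implementation would label connected lesions per slice before weighting; the single-lesion shortcut is kept here only to keep the sketch short.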

(iii) Based on the segmentation results of the thoracic aorta, three quantitative biomarkers characterizing the morphology of the thoracic aorta are computed: ATI, AMD, and AMDSTD. ATI represents the curvature of the thoracic aorta [girardi2021aortic], while AMD and AMDSTD describe the overall maximum diameter of the aorta and its variation along the entire length. To calculate ATI, we first extract the centerline of the thoracic aorta from the segmentation result, and then determine the locations of the root point and the end point (referred to as Point $R$ and Point $E$ in Fig. 4f). ATI is the ratio of the length of the centerline (in $mm$ ) to the straight-line distance between point $R$ and point $E$ (also in $mm$ ). Subsequently, we generate cross-sectional views at 1 $mm$ intervals along the centerline; Fig. 4i and Fig. 4j show the cross-sectional views generated at positions $n$ and $m$ in Fig. 4f. For each cross-sectional view, we measure the maximum diameter of the thoracic aorta, giving $D=[d_{1},...,d_{n},...,d_{m},...,d_{K}]$ , where $K$ = centerline length / interval (1 $mm$ ) + 1 is the total number of cross-sectional views. Finally, we compute AMD as the maximum value in $D$ (AMD = $max(D)$ ) and AMDSTD as the standard deviation of the diameters in $D$ (AMDSTD = $std(D)$ ).
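Given an already extracted centerline and the per-section maximum diameters, the three aortic biomarkers in (iii) reduce to a few array operations. The sketch below assumes that the centerline is an ordered list of points in $mm$ from Point $R$ to Point $E$ and uses the population standard deviation; centerline extraction and cross-section generation are not shown, and all names and values are hypothetical.

```python
import numpy as np

def aortic_biomarkers(centerline_mm, max_diameters_mm):
    """Return (ATI, AMD in mm, AMDSTD in mm).

    centerline_mm: ordered (K, 3) points from root point R to end point E.
    max_diameters_mm: maximum diameter measured on each cross-section.
    """
    pts = np.asarray(centerline_mm, dtype=float)
    # Centerline length: sum of distances between consecutive points.
    length = np.linalg.norm(np.diff(pts, axis=0), axis=1).sum()
    # Straight-line distance between R (first point) and E (last point).
    chord = np.linalg.norm(pts[-1] - pts[0])
    d = np.asarray(max_diameters_mm, dtype=float)
    return float(length / chord), float(d.max()), float(d.std())

# Toy centerline with two 5 mm segments and a 6 mm chord (hypothetical).
centerline = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (6.0, 0.0, 0.0)]
diameters = [30.0, 32.0, 34.0]
ati, amd, amdstd = aortic_biomarkers(centerline, diameters)
print(ati, amd, amdstd)
```

A perfectly straight aorta gives ATI = 1, and larger values indicate a more tortuous vessel.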

(iv) Based on the results of the heart segmentation (cardiac chambers and pericardium), four quantitative biomarkers are computed to characterize the morphology and structure of the heart: CHR, CLD, CSD, and CTR [troxler2018role]. CHR is defined as the ratio of the volume of the cardiac chambers (the green regions in Fig. 4f) to the total volume of the entire heart (the combined red and green areas in Fig. 4f). Using the segmentation results of the heart, we select the maximal four-chambers axial view of the heart and perform an elliptical fitting on this segmented image to derive the cardiac long diameter (CLD) and short diameter (CSD). In Fig. 4f, half of line $AB$ represents the CSD while half of line $CD$ represents the CLD. Based on the calculation method described in the previous study [girardi2021aortic], we determine the width of the heart, represented by line $FG$ in Fig. 4g. Similarly, utilizing the segmentation results of the lungs (as seen in Fig. 4l), we obtain the width of the entire lung, denoted by line $HI$ in Fig. 4g. Finally, CTR is calculated as the ratio of line $FG$ to line $HI$ .
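The two ratio biomarkers in (iv), CHR and CTR, can be illustrated on binary masks as follows. The width measurement is an assumption made for this sketch (the widest left-right extent of a 2D mask), since the text defers to a previous study for the exact procedure; the ellipse fitting for CLD and CSD is not shown. All names and toy masks are hypothetical.

```python
import numpy as np

def chamber_to_heart_ratio(chambers_mask, heart_mask):
    """CHR: chamber voxels over whole-heart voxels.

    Both masks share one voxel size, so the voxel-count ratio equals the
    volume ratio.
    """
    return float(chambers_mask.astype(bool).sum() / heart_mask.astype(bool).sum())

def max_width_px(mask_2d):
    """Left-right extent, in pixels, of a 2D mask (rows = y, columns = x)."""
    cols = np.flatnonzero(mask_2d.astype(bool).any(axis=0))
    return 0 if cols.size == 0 else int(cols[-1] - cols[0] + 1)

def cardiothoracic_ratio(heart_2d, lungs_2d):
    """CTR: heart width over lung width (pixel spacing cancels in the ratio)."""
    return max_width_px(heart_2d) / max_width_px(lungs_2d)

# Toy masks (hypothetical): heart spans 4 columns, lungs span 8 columns.
heart = np.zeros((4, 10), dtype=bool)
heart[1:3, 3:7] = True
lungs = np.zeros((4, 10), dtype=bool)
lungs[:, 1:9] = True
chambers = heart.copy()
chambers[1, 3:5] = False  # two heart pixels that are not chambers

chr_value = chamber_to_heart_ratio(chambers, heart)
ctr_value = cardiothoracic_ratio(heart, lungs)
print(chr_value, ctr_value)
```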

(v) The segmentation results of the lungs are used to calculate four quantitative biomarkers related to lung texture: LLR, RLR, LHR, and RHR. First, we use the segmentation results for the left and right lungs separately to identify the pulmonary regions in the CT scan. We then demarcate the regions with high attenuation intensity (as shown in Fig. 4m) using a threshold above -200HU. LHR is the ratio of the number of high attenuation intensity voxels to the total number of voxels in the left lung, and RHR is the corresponding ratio for the right lung. Similarly, we delineate the regions with low attenuation intensity (as shown in Fig. 4n) using a threshold below -950HU, and then calculate LLR and RLR for the left and right lungs, respectively.
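The thresholding in (v) amounts to counting voxels on either side of two HU cut-offs within one lung. A minimal sketch, assuming strict inequalities at the thresholds (the text does not say whether the boundary values are included) and using hypothetical names and toy values:

```python
import numpy as np

def lung_attenuation_ratios(ct_hu, lung_mask, high_hu=-200.0, low_hu=-950.0):
    """Return (high-attenuation ratio, low-attenuation ratio) for one lung.

    Apply once with the left-lung mask (LHR, LLR) and once with the
    right-lung mask (RHR, RLR).
    """
    lung_hu = ct_hu[lung_mask.astype(bool)]
    if lung_hu.size == 0:
        return float("nan"), float("nan")
    high_ratio = np.count_nonzero(lung_hu > high_hu) / lung_hu.size
    low_ratio = np.count_nonzero(lung_hu < low_hu) / lung_hu.size
    return float(high_ratio), float(low_ratio)

# Toy slice (hypothetical values); the last pixel is outside the lung.
ct = np.array([[-980.0, -960.0, -800.0],
               [-700.0, -150.0, -600.0],
               [-500.0, -900.0, 40.0]])
left_lung = np.ones(ct.shape, dtype=bool)
left_lung[2, 2] = False
lhr, llr = lung_attenuation_ratios(ct, left_lung)
print(lhr, llr)
```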

Continuous and discrete features joint representation

Recent studies have shown that methods that integrate features from various sources into one representation can offer complementary information and lead to better performance [kiela2014learning]. However, not all features contribute to the prediction of the target, and certain hand-designed quantitative imaging biomarkers may actually hinder performance. Furthermore, the relevance and specific contributions of the features to the output target are typically unknown. Therefore, we have developed a feature joint representation module (Fig. 5a) that helps identify which features are most significant for the prediction problem, reinforces the features most relevant to CVD risk prediction, and suppresses unnecessary features that could have a negative impact on performance. This module comprises an instance-wise feature-gated mechanism and a soft instance-wise feature interaction mechanism.

Instance-wise feature-gated mechanism.

The high-dimensional continuous features are derived from a pre-trained deep model, while each discrete biomarker is a scalar feature quantified based on the prior knowledge of physicians. Therefore, the initial step is to ensure that these distinct types of features are properly aligned within the feature space. The extracted continuous and discrete features are denoted as $X=[x_{1};x_{2};...;x_{N+1}]$ . To maintain the relative independence of each type of feature and understand its contribution, we do not directly encode all features $X$ . Instead, an individual encoder $F_{i}$ is applied to each $x_{i}$ as the input embedding $e_{i}$ . The encoder $F_{i}$ consists of two fully connected layers:

$$e_{i}=F_{i}(W_{i},x_{i})\tag{1}$$

where $e_{i}$ is the vector embedding of $x_{i}$ ( $e_{i}\in\mathbb{R}^{1\times L}$ , here $L=32$ ), $F_{i}$ is the $i^{th}$ encoding operation, and $W_{i}$ denotes the trainable weights of the encoder. Stacking the embeddings gives the overall encoded feature embedding matrix $E$ ( $E\in\mathbb{R}^{{(N+1)}\times L}$ ).
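The per-feature encoding of Eq. (1) can be sketched in NumPy as below. Only the two fully connected layers and $L=32$ come from the text. The hidden width, the ReLU between the layers, the dimension of the deep feature, and the random weights are assumptions made for illustration; $N=18$ matches the number of biomarkers listed in (i) to (v) but is also treated as an assumption here.

```python
import numpy as np

L = 32           # embedding length stated in the text
HIDDEN = 64      # hypothetical hidden width
N = 18           # assumed number of discrete biomarkers
DEEP_DIM = 512   # hypothetical size of the continuous deep feature x_1

def make_encoder(in_dim, rng):
    """Weights W_i of one encoder F_i (two fully connected layers)."""
    return {
        "W1": rng.normal(0.0, 0.1, (in_dim, HIDDEN)), "b1": np.zeros(HIDDEN),
        "W2": rng.normal(0.0, 0.1, (HIDDEN, L)), "b2": np.zeros(L),
    }

def encode(params, x):
    """e_i = F_i(W_i, x_i); the ReLU between the layers is an assumption."""
    h = np.maximum(0.0, x @ params["W1"] + params["b1"])
    return h @ params["W2"] + params["b2"]

rng = np.random.default_rng(0)
# x_1 is the deep feature vector; x_2..x_{N+1} are scalar biomarkers.
features = [rng.normal(size=DEEP_DIM)] + [rng.normal(size=1) for _ in range(N)]
# One independent encoder per feature, so no weights are shared.
encoders = [make_encoder(x.shape[0], rng) for x in features]
E = np.stack([encode(p, x) for p, x in zip(encoders, features)])
print(E.shape)
```

Keeping one encoder per feature is what allows the later stages to score each feature separately.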

To improve the expressive power of the model and better capture the relationship between features and the output target, we apply non-linear processing to each instance-wise feature embedding $e_{i}$ . However, determining how much non-linear processing is required remains a complex task. Some studies [pedro2000unified, hawkins2004problem] suggest that simpler models may be preferable on noisy datasets. Because our discrete quantitative biomarkers are automatically derived from the four pre-trained body part segmentation models without any additional adjustment, the input unavoidably contains noise. We therefore employ the Gated Residual Network (GRN) proposed in [lim2021temporal], a notably flexible and simple architecture (Fig. 5b), as the non-linear operation.

$$GRN(e)=LayerNorm(GLU(\eta)+e)\tag{2}$$

$$\eta=Dropout(FC(ELU(FC(e))),p)\tag{3}$$

where $FC$ is a fully connected layer, $ELU$ is the Exponential Linear Unit activation function [clevert2015fast], $GLU$ is the Gated Linear Unit [dauphin2017language], and $LayerNorm$ is standard layer normalization [ba2016layer]. Dropout is applied during training, with the dropout rate $p$ set to 0.5. Each instance-wise feature embedding $e_{i}$ is processed by its own GRN:

$$g_{i}=GRN_{i}(e_{i})\tag{4}$$
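A forward pass through one GRN, following Eqs. (2) to (4), might look like the NumPy sketch below. Several details are assumptions: the GLU is written with its own gate and value projections, layer normalization has no learnable gain or bias, all layers keep the width $L=32$, and the weights are random. Dropout is applied only when a random generator is passed in, which stands in for training mode.

```python
import numpy as np

def elu(x):
    # expm1 is evaluated only on the non-positive part to avoid overflow.
    return np.where(x > 0, x, np.expm1(np.minimum(x, 0.0)))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def init_grn(dim, rng):
    """Random weights for one GRN_i (illustrative only)."""
    def fc():
        return rng.normal(0.0, 0.1, (dim, dim)), np.zeros(dim)
    return {"fc1": fc(), "fc2": fc(), "gate": fc(), "value": fc()}

def grn(params, e, p=0.5, dropout_rng=None):
    """GRN(e) = LayerNorm(GLU(eta) + e), eta = Dropout(FC(ELU(FC(e))), p)."""
    (w1, b1), (w2, b2) = params["fc1"], params["fc2"]
    eta = elu(e @ w1 + b1) @ w2 + b2
    if dropout_rng is not None:  # training mode: inverted dropout
        keep = dropout_rng.random(eta.shape) >= p
        eta = eta * keep / (1.0 - p)
    (wg, bg), (wv, bv) = params["gate"], params["value"]
    glu = sigmoid(eta @ wg + bg) * (eta @ wv + bv)  # gated linear unit
    return layer_norm(glu + e)                      # residual connection

rng = np.random.default_rng(0)
L = 32
params = init_grn(L, rng)
e = rng.normal(size=L)          # stands in for one embedding e_i
g = grn(params, e)              # inference: dropout disabled
g_train = grn(params, e, dropout_rng=rng)
print(g.shape, g_train.shape)
```

When the gate output is close to zero, the block reduces to a normalized copy of $e_{i}$, which is how the GRN can skip non-linear processing for a given feature.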

Soft instance-wise feature interaction mechanism.

Once the continuous and discrete features are aligned in the same dimensional space, we model the interactions between features and their contributions to the prediction target. The critical issue is how to fuse features from various sources so that they offer complementary information to each other. Given the strong recent performance of multi-head self-attention networks [vaswani2017attention] in modeling complex relationships, we use a multi-head attention mechanism for the interaction and fusion of features, with different heads attending to different representation subspaces. First, we flatten the instance-wise feature embeddings $G=[g_{1};g_{2};...;g_{N+1}]$ ( $G\in\mathbb{R}^{{(N+1)}\times L}$ ) into a long-term relationship vector $f=[g_{1}^{\intercal},g_{2}^{\intercal},...,g_{N+1}^{\intercal}]^{\intercal}$ . We then feed it to the multi-head attention module for feature interaction and concatenate the outputs of all heads (in this study, the number of heads is set to 2):

$$m=MultiHeadAttention(f)\tag{5}$$

Finally, the instance-wise feature weights are generated by feeding $m$ through a fully connected layer, followed by a softmax layer:

$$s=Softmax(FC(m))\tag{6}$$

Given the instance-wise feature contribution scores $s$ ( $s\in\mathbb{R}^{1\times{(N+1)}}$ ) and the processed instance-wise feature embeddings $G$ , we obtain the final classification representation $c$ ( $c\in\mathbb{R}^{1\times L}$ ):

$$c=s\otimes G\tag{7}$$

where $\otimes$ denotes the matrix product, so that $c$ is the sum of the rows of $G$ weighted by the contribution scores in $s$ .
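The scoring and pooling steps of Eqs. (6) and (7) can be checked numerically with a small sketch. To keep it short, the multi-head attention of Eq. (5) is not implemented: the flattened vector $f$ is used directly as a stand-in for $m$, which is an assumption of this example, as are the random weights and $N=18$.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    ez = np.exp(z)
    return ez / ez.sum()

rng = np.random.default_rng(0)
N, L = 18, 32
G = rng.normal(size=(N + 1, L))      # gated embeddings g_1..g_{N+1}
f = G.reshape(-1)                    # flattened vector f
m = f                                # stand-in for MultiHeadAttention(f)

# Fully connected layer mapping m to one logit per feature (random weights).
W = rng.normal(0.0, 0.05, (m.size, N + 1))
b = np.zeros(N + 1)

s = softmax(m @ W + b)               # Eq. (6): contribution scores
c = s @ G                            # Eq. (7): weighted sum of the rows of G
print(s.shape, c.shape, s.sum())
```

Because the scores are non-negative and sum to 1, each entry of $s$ can be read as the relative contribution of one feature for the given scan.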

Ablation studies

We provide ablation analysis on the LDCT-NLST testing cohort to further investigate the effectiveness of the proposed joint representation approach.

The effectiveness of different components.

To validate the effectiveness of the components in our proposed feature joint representation approach, we conducted an ablation study and present the results on the LDCT-NLST dataset in Fig. 6a. Several key observations can be drawn from these findings. First, when only the instance-wise feature-gated mechanism is employed, incorporating the GRN module significantly improves AUC, F1 Score, and Accuracy compared to using the Encoder alone. Second, even without the GRN in the instance-wise feature-gated mechanism, the soft instance-wise feature interaction mechanism effectively boosts the model’s performance. Third, the best results are achieved when the instance-wise feature-gated mechanism and the soft instance-wise feature interaction mechanism are used together. This is primarily attributed to the efficient modeling of discrete features and deep continuous features, which leverages their respective strengths to promote the learning of the feature joint representation with the assistance of both the GRN and the soft instance-wise feature interaction mechanism. Additionally, our approach empirically confirms an intuitive phenomenon: as prior knowledge is more intricately processed within the model, its ability to enhance model specificity gradually decreases, helping the model to strike a balance between specificity and sensitivity.

The impact of different deep feature extraction models.

We studied the influence of different deep feature extraction models on CVD risk prediction performance and present the results in Fig. 6b. The choice of deep feature extraction model has a measurable impact on performance. Our approach achieves the best overall performance with ResNet34 as the deep feature extractor, especially in AUC, Accuracy, and F1 Score. We also find that our approach performs better on most metrics when a lightweight CNN architecture is used in the first stage than when a transformer architecture is used. For example, with ResNet34 instead of ViT-B, the AUC increased from 0.827 to 0.875, a relative improvement of 5.8%, Accuracy increased by 5.6%, and Sensitivity by 45.6%, while Specificity decreased by only 1.6%. Additionally, in combination with the results in Table 1, it can be seen that every deep feature extraction model benefits, to varying degrees, from being combined with the discrete quantitative biomarkers in our proposed joint representation learning approach. Specifically, ViT-B’s AUC increased from 0.676 to 0.827, a relative improvement of 22.3%, nnFormer’s AUC from 0.837 to 0.851, and ResNet34’s AUC from 0.844 to 0.875.
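The percentages quoted for AUC are consistent with relative (not absolute) improvements, which a short calculation makes explicit. The helper name is hypothetical; the AUC values are those reported in the paragraph above.

```python
def relative_gain_percent(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

auc_pairs = {
    "ViT-B -> ResNet34 (joint representation)": (0.827, 0.875),
    "ViT-B alone -> ViT-B + biomarkers": (0.676, 0.827),
    "nnFormer alone -> nnFormer + biomarkers": (0.837, 0.851),
    "ResNet34 alone -> ResNet34 + biomarkers": (0.844, 0.875),
}
for name, (before, after) in auc_pairs.items():
    print(f"{name}: {relative_gain_percent(before, after):.1f}%")
```

The first two entries reproduce the 5.8% and 22.3% figures stated in the text.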

The effectiveness of features joint representation module.

We validated the effectiveness of the feature joint representation module employed in the second stage from two perspectives: (i) Using the continuous features and discrete quantitative biomarkers obtained from the first stage as inputs, our method demonstrates significant performance gains in Accuracy, Sensitivity, F1 Score, and AUC compared with the approach that directly concatenates the features. Notably, Sensitivity improved by 40.0%, and F1 Score by 19.0%. Detailed results are shown in Fig. 6c. (ii) When only the discrete quantitative biomarkers derived from the first stage are used as inputs, our joint representation approach substantially improves Accuracy, Sensitivity, F1 Score, and AUC over XGBoost, which also uses the discrete quantitative biomarkers. The comprehensive results are presented in Fig. 6d.

Comparison methods

To comprehensively evaluate the effectiveness of our proposed method, we selected two classes of methods for comparative analysis, based on the feature representation they use: (i) Machine learning methods using discrete features, for which we chose the commonly used XGBoost [chen2016Xgboost]. (ii) Deep learning methods using continuous features, which fall into two categories. The first category comprises classification networks: the CNN architecture ResNet34 [he2016deep], the CNN-Attention architecture Tri2D-Net [chao2021deep], and the Transformer architecture ViT-B [dosovitskiy2020image]. Tri2D-Net is currently the leading method for CVD risk prediction on the LDCT-NLST dataset. The second category comprises multi-task learning approaches, since prior research [caruana1997multitask] has demonstrated that models discover more general feature representations when solving multiple tasks, thereby improving generalization to unseen data. In addition, the nnUNet framework [isensee2021nnu] has strong data preprocessing and augmentation capabilities. We therefore integrated classification heads with fully connected layers and average pooling operations into segmentation networks such as nnUNet and nnFormer [zhou2023nnformer]. These heads receive input from the deepest feature layer of the encoder. The multi-task methods output predicted cardiac segmentation masks and CVD risk classification probabilities. Except for Tri2D-Net, for which we directly use the open-source model (https://github.com/DIAL-RPI/CVD-Risk-Estimator) trained on the LDCT-NLST training set, all comparison methods are trained from scratch on the LDCT-NLST training set.

Statistical analysis

The sensitivity and specificity of DeepCVD for CVD risk prediction were evaluated by calculating the 95% confidence intervals using the Clopper-Pearson method based on 1,000 bootstrap replications of the data. In our setting, CVD risk prediction was a binary classification task, and p-values for accuracy comparisons were calculated through McNemar’s test.
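For reference, McNemar’s test compares two classifiers on the same cases using only the discordant pairs. The sketch below implements the exact (binomial) two-sided variant with the standard library; whether the study used the exact or the chi-square form is not stated, so this choice, the function name, and the example counts are assumptions.

```python
from math import comb

def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar p-value.

    b: cases method A classifies correctly and method B incorrectly.
    c: cases method B classifies correctly and method A incorrectly.
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

# Hypothetical discordant counts, for illustration only.
p_value = mcnemar_exact_p(10, 2)
print(f"p = {p_value:.4f}")
```

Cases on which both methods agree do not enter the statistic, which is why only the two discordant counts are needed.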

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability

The imaging data of this study are from a public chest low-dose CT dataset (LDCT-NLST) and a private external chest standard-dose CT dataset (NERC-MBD). The LDCT-NLST dataset is publicly available at https://biometry.nci.nih.gov/cdas/learn/nlst/images/. The NERC-MBD dataset is used under a research agreement for the current study and is not publicly available. Source data is provided with this paper.

Code availability

The code used for DeepCVD implementation depends on internal tooling and infrastructure, is under patent protection (application number: CN117274185B), and thus cannot be publicly released. All experiments and implementation details are described sufficiently in the Methods section for replication with non-proprietary libraries. The foundational architecture for the four specialized body part segmentation models of our work is available in an open source repository: https://github.com/yhygao/CBIM-Medical-Image-Segmentation. The ResNet34 continuous deep feature extraction model used in this study is implemented from: https://github.com/pytorch.

References

Acknowledgments

Author contributions

Among the three co-first authors, M.X. was responsible for data cleaning, deep learning model development, and internal evaluation, and drafted the manuscript; C.F. was responsible for collecting and preprocessing the external data and for evaluating the various methods on the external data; and Y.Z. was responsible for training the comparison methods. All three participated in the experimental design and drafted the manuscript. W.G. helped deploy all test models to evaluate the external data. J.Q. and P.L. helped collect the external testing cohort. L.L. aided in the experimental design and edited the manuscript. H.C. assisted with the evaluation of the models. M.X. and K.H. were responsible for the conception and design of the experiments, oversaw overall direction and planning, and drafted the manuscript.

Competing interests

The authors declare that they have no competing interests.