Skip to main content
Bioinformatics logoLink to Bioinformatics
. 2022 Mar 18;38(10):2749–2756. doi: 10.1093/bioinformatics/btac136

Non-negative Independent Factor Analysis disentangles discrete and continuous sources of variation in scRNA-seq data

Weiguang Mao 1,2, Maziyar Baran Pouyan 3,, Dennis Kostka 4,5,6,7, Maria Chikina 8,9,
Editor: Anthony Mathelier
PMCID: PMC9113312  PMID: 35561207

Abstract

Motivation

Single-cell RNA-seq analysis has emerged as a powerful tool for understanding inter-cellular heterogeneity. Due to the inherent noise of the data, computational techniques often rely on dimensionality reduction (DR) as both a pre-processing step and an analysis tool. Ideally, DR should preserve the biological information while discarding the noise. However, if the DR is to be used directly to gain biological insight it must also be interpretable—that is the individual dimensions of the reduction should correspond to specific biological variables such as cell-type identity or pathway activity. Maximizing biological interpretability necessitates making assumption about the data structures and the choice of the model is critical.

Results

We present a new probabilistic single-cell factor analysis model, Non-negative Independent Factor Analysis (NIFA), that incorporates different interpretability inducing assumptions into a single modeling framework. The key advantage of our NIFA model is that it simultaneously models uni- and multi-modal latent factors, and thus isolates discrete cell-type identity and continuous pathway activity into separate components. We apply our approach to a range of datasets where cell-type identity is known, and we show that NIFA-derived factors outperform results from ICA, PCA, NMF and scCoGAPS (an NMF method designed for single-cell data) in terms of disentangling biological sources of variation. Studying an immunotherapy dataset in detail, we show that NIFA is able to reproduce and refine previous findings in a single analysis framework and enables the discovery of new clinically relevant cell states.

Availability and implementation

NFIA is a R package which is freely available at GitHub (https://github.com/wgmao/NIFA). The test dataset is archived at https://zenodo.org/record/6286646.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Single-cell RNA sequencing (scRNA-seq) techniques have allowed researchers to query the complexity of transcription regulation at an unprecedented level of detail. scRNA-seq technologies have the power to reveal both distinct cell types and transcriptional heterogeneity within a defined cell population. As the number and size of single-cell datasets increases, it becomes important to develop methods that can quickly summarize the biological information embedded in a scRNA-seq dataset as a set of interpretable variables that can be used for downstream analysis. One kind of summary measure is the identity and number of cell types present in a dataset. In recent years, there has been a proliferation of clustering methods designed to address this problem (Butler et al., 2018; Kiselev et al., 2017). Clustering approaches assume that the data is well described by a discrete set of cell types, but in many cases, questions about continuous biological variation, such as developmental trajectories or levels of pathway activation are also of interest.

Such continuous variables do not conform to the assumptions of clustering algorithms but can be effectively modeled as latent factors. For example, for cycling cells variation along a continuous circle-like cell-cycle trajectory has been repeatedly discovered in single-cell data, both using sophisticated latent variable models (Buettner et al., 2015) and simple Principle Component Analysis (PCA) (Kowalczyk et al., 2015).

Of course, cell-type/cluster identity can also be thought of as a latent factor, albeit a discrete one, thus suggesting that it is possible to design a single factor analysis framework that can discover both cell types and continuous biological variables in a single analytic step. In the ideal case a factor analysis representation of a scRNA-seq dataset would identify both discrete cell types and continuous pathway effects and importantly maximally disentangle difference sources of biological variation. In other words, each factor should represent a single biological variable that could be measured directly using a different assay.

To address these goals, we propose Non-negative Independent Factor Analysis (NIFA) that can simultaneously models uni- and multi-modal factors thus combining clustering and latent variable analysis into a single framework. NIFA is conceptually related to ICA due to non-Gaussian priors on the factors, however, unlike ICA NIFA is a true factor analysis method which models the data likelihood with a factorized distribution. Our modeling framework is optimized to induce disentanglement between biological sources of variation (such as different cell types and pathways) so that individual factors are maximally aligned with specific biological variables and thus can be used in the downstream analysis. We evaluate our method on simulated and real data (see Supplementary Table S1), and we show how it can be used to glean new biological insights by detailed analysis of an immunotherapy scRNA-seq dataset (Sade-Feldman et al., 2018).

2 Materials and methods

2.1 The statistical model

Our statistical model expands on the standard data factorization framework. Given a P-by-N, matrix X (gene by cell, indexed by j and n), we assume that individual observations xjn are independent conditioned on a set of factors. Specifically, given a P-by-K, loading matrix A and a K-by-N factor matrix S, we assume that xjn has a Gaussian distribution N(AjSn,β1) with the common precision parameter β (see Fig. 1A and E).

Fig. 1.

Fig. 1.

(A) NIFA decomposes the scRNA-seq matrix X into loading matrix A and latent variable matrix S. The loading matrix A is constrained to be non-negative. (B) NIFA can disentangle the discrete and continuous variations into different latent variables. In the ideal scenario, the underlying distribution of latent variables related to pathway effects should be uni-modal while the underlying distribution of celltype-related latent variables follow bimodal distributions with clear separation between modes. (C) Both uni-modal and multi-modal variations can be jointly modeled by Gaussian mixture prior. The number of modes can be determined automatically via inference. (D) By imposing multi-modal prior, we force the latent factors to rotate and align with the directions that can best separate the cell-type identity. (E) Overview of the NIFA probabilistic graphical model. The table summarizes the probabilistic variables and the assumption of prior distributions. The details can be found in the Supplementary Methods

To maximize discovery of biologically meaningful factors, one can impose additional constraints or priors on the distributions of A and S, and the specific form of these priors can greatly influence the results. For example, the popular NMF methods constrains the factors and loadings to be non-negative. Probabilistic matrix factorization approaches such as the popular PEER model (Stegle et al., 2010) correspond to hierarchical priors (Gaussian in the case of PEER) on both A and S. Inducing loading sparsity through constraints or priors is another popular approach designed to produce interpretable representations by limiting the number of features (genes) that are influenced by each factor. Examples of such sparse models include Sparse Factor Analysis (SFA) (Engelhardt and Stephens, 2010), SFAmix (Gao et al., 2013), Non-parametric Bayesian Sparse Factor Analysis (NBSFA) (Knowles and Ghahramani, 2011) and Penalized Matrix Decomposition (Witten et al., 2009). NMF is also easily combined with sparsity inducing penalties (Lin and Boutros, 2019) and this is the approach taken by the probabilistic scRNA-seq specific method scCoGAPS (Fertig et al., 2010).

In our model, we specify priors for A using an NMF-like approach. A is assumed to come from a truncated normal distribution where a =0 and b= indicating each entry Aji falls within the interval [0,) (here the variable i indexes the factors). η and λ denotes the mean and the inverse of the variance. Φ(·) is the cumulative distribution function of the standard normal distribution.

P(Aji|ηji,λi)=(2π)12(λi)12exp(12λi(Ajiηji)2)·1(Aji0)1Φ(ηjiλi12) (1)

The prior on S is inspired by ICA and corresponds to mixture of Gaussians with M components (indexed by m), parameterized by the mean μim and variance σim (for factor i and component m). Let ϵimn (i=1,,K,m=1,,M,n=1,,N) be an indicator variable with ϵimn=1 if Sin was generated by component m and zero otherwise. Then we get for the prior of the nth column of S:

P(Sn|ϵ,μ,σ)=i=1Km=1MN(Sin|μim,σim)ϵimn (2)

Putting this all together we get the joint likelihood decomposed over columns of X (cells) P(X,A,S,ϵ,μ,σ,β) is as follows (Equation 3).

P(X,A,S,ϵ,μ,σ,β)=n=1NP(Xn|A,Sn,β)matrix factorizationP(Sn|ϵ,μ,σ)prior on Sj=1Pi=1KP(Aji|ηji,λi)prior on Ai,m,nP(ϵimn)i,mP(μim)P(σim)P(β)hierarchical priors (3)

2.2 Parameter inference

To efficiently infer the parameters, we apply variational inference technique, more specifically, mean-field approximation (Blei et al., 2017). By assuming each variational parameter is independent of each other, we formulate the joint posterior distribution Q(S,A,ϵ,μ,σ,β) (see Equation 4) for the model and minimize the KL-divergence between Equations (3) and (4) to derive the expression q(·) for each variational parameter as an approximation of single posterior distribution.

Q(S,A,ϵ,μ,σ,β)=n=1Nq(Sn)·j,iq(Aji)·i,m,nq(ϵimn)·i,mq(μim)q(σim)·q(β) (4)

2.3 Data availability

We evaluate NIFA on several ‘gold’ or ‘silver standard’ scRNA-seq datasets (Gong et al., 2018) with pre-annotated cell types. Supplementary Table S1 summarizes the datasets we use. Gold standard datasets contain relatively homogeneous cell lines and experimental conditions are well controlled, while silver standard datasets define cell types based on expert knowledge. Most data were obtained through (https://github.com/hemberg-lab/scRNA.seq.datasets), or through corresponding GEO repositories: GSE120575 (Sade-Feldman et al., 2018), GSE133656 (Liu et al., 2019) and GSE70245 (Olsson et al., 2016). We also include simulated datasets Kumar (Kumar et al., 2014), generated by Splatter (Zappia et al., 2017) and Zheng (Zheng et al., 2017) as simulation input (both datasets are provided as part of a systematic evaluation of scRNA-seq clustering methods, Duò et al. (2018).

3 Results

3.1 Non-negative Independent Factor Analysis (NIFA): theory

NIFA is a probabilistic matrix factorization model which decomposes the scRNA-seq matrix X into the loading matrix A and the sources/latent variables S (Fig. 1A). While factor analysis models are routinely used for pre-processing, if the factors themselves are to be analyzed it is important that they are interpretable, that is they should have a clear correspondence to a specific biological variable. A general popular approach to enhance interpretability is enforcing certain structures or constrains for the latent variables or their loadings such as positivity (as exemplified by the successful non-negative matrix factorization (NMF) and variants (Kotliar et al., 2019; Sherman et al., 2020; Stein-O’Brien et al., 2019; Zhu et al., 2017). NIFA maximizes interpretability by explicitly modeling the generative process of single cell data as a combination of two types of variables: cell-type identity and pathway activity.

While the semantic difference between these is not absolute, we can differentiate them in terms of their statistical properties. For the purpose of our model, we assume that cell identity corresponds to clusters, which in a generative model would correspond to discrete variables. Conversely, pathway activity corresponds to continuous variation, with cell cycle being a canonical example. NIFA is designed to optimize the interpretability by isolating these factors into separate components. Figure 1B corresponds to the ideal representation of the discrete (depicted in blue) and continuous (depicted in green) generative variables. Typically, discrete latent variables are inferred via clustering while continuous variables are inferred via matrix factorization/factor analysis.

The key contribution of NIFA is that it unifies both procedures into a single framework. To do this the NIFA model considers the continuous relaxation of clustering, where clusters correspond to multi-modal continuous instead of discrete variables (Fig. 1C, blue factors). Each factor is modeled with a mixture of Gaussians prior which encourages multi-modal factors to become concentrated around their modes, thus enhancing the separation between high density regions (Fig. 1C). The multi-modality is maximized when the factor is optimally aligned with a single cluster and thus the model encourages factors that represent individual cell types (Fig. 1D).

This procedure is similar to the popular technique of Independent Component Analysis (ICA) (Hyvärinen and Oja, 2000), a dimensionality reduction (DR) technique which instead of attempting to model all the variation in the data finds maximally non-Gaussian components. ICA is a popular single-cell DR technique because it finds directions that maximally separate different cell types (Butler et al., 2018). One important difference between ICA and NIFA is that for NIFA the exact form of non-Gaussianality is explicitly specified as Gaussian mixtures, whereas ICA more general.

A second important difference between ICA and NIFA is that NIFA does not prefer multi-modal factors but only ‘enhances’ them if it finds them. This flexibility arises from the Bayesian updates which allow the model to automatically determine the number of Gaussian mixture components for each factor. Since in the model factors are encouraged to concentrate at their modes, multi-modal components become ‘more multi-modal’ while the shape of the uni-modal components is changed relatively little (see Section 2 for more details). It is important to note that since the Gaussian mixture priors reduce to a single Gaussian in the uni-modal case the mixture assumption cannot contribute significantly to the interpretability of uni-modal factors. However, like NMF the NIFA model includes a non-negativity constraint on the factor loadings (Fig. 1E) which encourages parsimonious and biologically interpretable factors. Thus, for the case of uni-modal factors NIFA defaults to NMF-like behavior. Overall, our NIFA model combines assumption about the factors and the loadings to effectively induce disentanglement between individual cell-type identity factors as well as continuous pathway activity.

3.2 Simulation studies

We simulate artificial data considering 2000 genes (P =2000), 500 cells (N =500), six latent factors (K =6) and two mixture components per factor (M =2, see Fig. 1A and Supplementary Methods for full details). Briefly, we simulate latent factors and their mixture components independently, but the columns of the loading matrix A are correlated (details in Supplementary Methods). For the decomposition, we set K =8 (all methods) and M =3 (NIFA) to see if NIFA can robustly recover the right latent factors given larger K and M, which is often the case in practice. Figure 2 summarizes our findings. While no method recovers all latent factors correctly, NIFA is able to more accurately recover latent factors compared with other methods. Also, NIFA correctly recovers the number of mixtures per factor as two (reporting zero for one of three possible mixture components). We also explored the same setting as above but with uni-modal factors (M =1) and find that in this case NIFA performs as well as ICA (parallel), and that both NIFA and ICA (parallel) outperform other methods (data not shown).

Fig. 2.

Fig. 2.

Evaluation on a simulated dataset. Boxplot of the correlation between simulated S and those recovered by SVD, ICA (parallel), ICA (deflation), NMF and NIFA. We find that the best performance is achieved by NIFA compared with all the other common methods

3.3 Evaluation

Meaningful evaluation of factor models is not necessarily straightforward, and several different options exist. A natural metric is the reconstruction error of a factor model (Levitin et al., 2019). However, for any specific decomposition, there exist infinitely many alternative decompositions with exactly the same reconstruction error; yet, these may differ greatly with respect to the individual factors and loadings and, perhaps, biological implications.

Biological interpretability of decompositions is one of the attractions of factor models, but it is harder to evaluate and quantify than reconstruction error. Given substantial knowledge about the data is at hand, we can describe as a desirable property of an interpretable factor analysis is that there is one-to-one correspondence between inferred factors and known variables that significantly contribute to data variance. For example, factors that have one-to-one correspondence to known cell types are straightforward to interpret and can directly be used in subsequent analyses. However, if factors correspond to combinations of cell types (which could be an equally valid solution in terms of reconstruction error), interpretation is difficult and usefulness is more limited.

Therefore, our evaluations are focused on assessing how well factor analysis models are able to disentangle known sources of biological variation. Such evaluations must necessarily reference some external ground truth, and for scRNA-seq data we consider two sources: known cell-type identity (used to evaluate factors) and gene to pathway membership (used to evaluate loadings). With this approach we compare the performance of NIFA with SVD, ICA, NMF and scCoGAPS (Stein-O’Brien et al., 2019) across 14 datasets (see Supplementary Table S1). scCoGAPS is a Bayesian NMF method that uses an ensemble-based approach to subset cells and collects consensus factors from parallelized Bayesian NMF runs across subsets.

3.3.1 Cell type identification

Here, we investigate the ability of NIFA and five other methods to recover latent factors corresponding to pre-annotated cell types. In the case of scRNA-seq data the gold or silver standard of cell-type identity (see Supplementary Table S1) is one such variable. In the ideal case each cell type corresponds to a unique factor in the model. To evaluate this property, we compute maximum one-to-one correlations between factors and cell-type assignments (Fig. 3). We find that on average our NIFA model performs better than ICA, NMF and scCoGAPS at this cell type detection task. Importantly, while there can be large differences between NMF and ICA, the performance of NIFA (which combines features of both) always tracks with the best method.

Fig. 3.

Fig. 3.

Evaluation of one-to-one correspondence between factors and cell types. Given a set of factors and a set of cell-type labels, we evaluate the maximum correlation between each cell type and a factor. For clarity, we plot the mean correlation values across all cell types. To account for the possibility that different models may need different number of factors (K), we report the results at varying K. We compare NIFA with ICA, NMF (KL-loss), scCoGAPS and SVD as a baseline. We find that on average our NIFA model performs better than competing decomposition methods on average and importantly while there can be large differences among NMF, ICA, the performance of NIFA (which combines features of both) always keeps track with the best method

3.3.2 Pathway enrichment

One important feature of factor analysis models is that the factors should be interpretable even in the absence of any ground truth knowledge. In such cases, the factors are interpreted by inspecting the genes in their loadings. The expectation is that for a factor that captures a unique biological variable (which could be binary cell type or continuous pathway activation), the top loading genes are enriched for a few known functional modules. We evaluate this property by computing pathway enrichment for each factor using a hypergeometric test with the top 500 genes as foreground. This evaluation strategy allows us to evaluate existence of a biological focus in the model, independent of cell-type annotation. In this way, the model can be credited for finding factors which capture pathway or cell-type signals even if these do not correspond to an annotated cell type. The pathway databases we use are ‘canonical pathways’ from MSigDB (Liberzon et al., 2011) and a comprehensive set of cell-type markers from xCell (Aran et al., 2017). For canonical pathways, we excluded pathways that had greater than 20% overlap with ribosomal or mitochondria genesets (defined as ‘KEGG_RIBOSOME’ and ‘KEGG_OXIDATIVE_PHOSPHORYLATION’, respectively). We found that these are consistently enriched but provide little biological insight as variation in these pathways is often technical.

We then quantify the overall biological enrichment of a single loading vector as the mean fold enrichment for pathways that are significant at FDR < 0.05. Pathway enrichment metrics summarized across all factors are plotted in Figures 4 and 5 for canonical pathways and xCell respectively. We find that not surprisingly the performance of all methods is much better for real biological datasets than simulated ones (Zhengmix4eq, Zhengmix4uneq and Zhenmix8eq). We also find that among the biological datasets NIFA is a consistently top performer in both ‘canonical pathway’ and xCell evaluations, though the effect is more dramatic for xCell.

Fig. 4.

Fig. 4.

Pathway enrichment of ‘canonical pathways’ from mSigDB (Liberzon et al., 2011). Enrichment is quantified as average fold enrichment among all factor-pathway pairs where the pathway is significantly over-represented in the top 500 loading genes (hypergeometric test, FDR < 0.05). The first two rows are biological datasets. The last row (Zhengmix4eq, Zhengmix4uneq and Zhenmix8eq) corresponds to simulated datasets. All SimKumar datasets are excluded from this evaluation as they were not supplied with real gene names

Fig. 5.

Fig. 5.

Pathway enrichment for xCell cell-type signatures. This figure is generated identically to Figure 4 but using the xCell genesets

3.4 In-depth evaluation of the Sade-Feldman et al. immunotherapy dataset

Antibodies that block immune checkpoint proteins, including CTLA4, PD-1 and PD-L1 are increasingly used to treat a variety of cancers. While checkpoint inhibitor (CI) therapy can be remarkable effective, not all patients respond (Larkin et al., 2015). Determining the biological factors that facilitate or impede response to CIs remains an important and unresolved problem. In order to demonstrate how NIFA can be used to gain biological insight, we performed several in-depth analyses of the Sade-Feldman et al. immunotherapy dataset (Sade-Feldman et al., 2018). This dataset consists of 16 291 individual immune cells from 48 tumor samples of melanoma patients treated with CIs. The dataset contains both pre-treatment and post-treatment samples and the patients are classified into responders and non-responders. The dataset is supplied with human curated cell-type annotations that server as the silver gold standard for our bench-marking analysis.

We applied our NIFA model to the entire single-cell dataset using K =25 which corresponds to the k with maximal correlation with known cell-type annotations (see Fig. 3). The distributions of the inferred factors and the corresponding inferred Gaussian mixture fits are plotted in Supplementary Figure S1. Each NIFA factor that has the best correspondence to existing annotations is given the same name. NIFA finds both uni- and multi-modal factors and as expected the multi-modal factors are more likely to correspond to cell types.

To investigate which variables are associated with immunotherapy response, the resulting factors were mean aggregated to a single value for each unique patient sample. We also summarized the human-annotated cell-type indicators as their mean values, corresponding to fraction of cells in sample. The resulting summary statistics were tested for association with response using Wilcoxon rank sum test and Benjamini-Hochberg FDR adjustment (separately for NIFA factors or human annotations). Pre- and post-treatment samples were analyzed separately and the resulting variables that were significant in either the pre-treatment or post-treatment comparison at FDR < 0.2 are plotted in Figure 6A. Top loading genes corresponding to each significant NIFA factor are show in Table 1.

Fig. 6.

Fig. 6.

NIFA analysis of signatures associated with immunotherapy response. (A) NIFA-derived signatures of human-annotated cell types are mean aggregated per patient sample and the resulting summary statistics are tested for association with immunotherapy response. Variables that are significant at FDR < 0.2 are shown with their respective normalized and centered ranksum statistics [ranksum/(number-of-positives × number-of-negatives) − 0.5, equivalent to binary classification AUC-0.5]. Pre-treatment effects are on the x-axis and post-treatment effects are on the y-axis. NIFA variables most closely matched to human annotations are grouped with gray ellipses. (B) Differences in Treg (Regulatory T cells) identification between NIFA and human annotations. Heatmap of canonical Treg marker genes (FOXP3 and IL2RA) across all cells annotated as Tregs by either method and 1000 randomly sampled other T cells. Overall, NIFA identifies fewer Treg cells and has a higher correlation with FOXP3 and IL2RA expression. While the NIFA Treg factor is significantly negatively associated with response in post-treatment samples, the corresponding human annotation is not (panel A). (C) A new myeloid signature positively associated with response. Heatmap of top loading genes along with the factor values for NIFA factor 24 across all cells identified as ‘Monocytes/Macrophages’ and 1500 randomly sampled cells. NIFA identifies a subset of the Monocytes/Macrophages cells with unique gene expression pattern. While general myeloid signatures (that is Monocytes/Macrophages and Dendritic cells) were negatively associated with response, the NIFA-24 signature has the opposite pattern (see panel A)

Table 1.

Top loading genes for each factor found to be associated with immunotherapy response

LV ID Name Genes
5 Myeloid signature FCN1, LYZ, TIMP1, S100A9, S100A8, SERPINA1, VCAN, IL1B, IFI30, PLAUR
7 Regulatory T cells CTLA4, TNFRSF18, RGS1, TIGIT, CD4, BATF, PIM2, PRDM1, FOXP3, ARID5B
8 TCF7 T-cell signature DGKA, DDX17, SMG1P1, DENND2D, ARHGEF1, DOCK8, NPIPB5, NLRC5, TCF7, N4BP2L2
9 Dendritic cells GZMB, IGJ, PLAC8, NAPSB, ALOX5AP, GPR183, AC096579.7, IRF7, BCL11A, CLIC3
11 Lymphocytes exhausted/cell-cycle STMN1, RRM2, TUBA1B, TYMS, KIAA0101, TUBB, HIST1H4C, HMGB2, NUSAP1, CDK1
13 B-cells CD74, IGHM, MS4A1, CD79A, IGKC, IRF8, CD79B, CD37, BCL11A, CD52
14 HSPD1, FLNA, BIRC3, REL, HSPE1, COTL1, WARS, PSME2, HSPB1, SLC25A3
15 Monocytes/macrophages CD74, FTL, CTSB, B2M, FTH1, PSAP, S100A11, IFI30, VIM, ALDOA
16 Interferon ISG15, IFI44L, MX1, IFI6, XAF1, STAT1, ISG20, IFITM1, TRIM22, IFI44
17 Exhausted CD8 T cells GZMA, RAC2, CLIC1, NKG7, CORO1A, IL32, ARPC1B, CNN2, LCK, PSMB9
19 NFKB/AP1 T-cell signature VIM, NFKBIA, FOS, TNFAIP3, ANXA1, SLC2A3, CD52, B2M, JUNB, S100A4
20 Memory T cells EEF1B2, GAS5, TOMM7, LDHB, SELL, IL7R, EIF3E, COX7C, EIF3L, FAIM3
24 Myeloid signature MT1X, MT1F, CD14, MT1E, FN1, FBP1, MT2A, S100A4, S100A9, AGTRAP
25 C1QBP, NME1, HSP90AB1, GTF3A, NHP2, PPA1, CCT7, CNBP, CCT2, SNHG1

Note: To facilitate biological interpretability, ribosomal genes and genes that we not provided with HGNC symbols were not considered. Factors that best match the known human annotations were named accordingly and are in bold. Other factors could be clearly identified as coherent biological pathways are named based on the loadings. Factors 14 and 25 did not have a clear correspondence to any pathway or cell-type signature and are left unnamed.

Each NIFA factor that has the best correspondence to human annotations is given the same name and the results are grouped with ovals in Figure 6. We find that for these matched variables, there is an overall good correspondence between NIFA factors and human-annotated cell types. Specifically, both methods discover B cells as the variable most positively predictive of response and a CD8 T-cell/exhaustion/cell-cycle signature (termed ‘Lymphocytes exhausted/cell-cycle’ in the original study) as the most negatively predictive.

For some subtle patterns the results of NIFA and human annotations can diverge. Human annotations such as ‘Lymphocytes’ and ‘Cytotoxicity’ were not well reproduced by NIFA (correlation of 0.46 and 0.46, respectively) and the corresponding NIFA variables are not significant. On the other hand, NIFA found three different T-cell signatures (8, 19 and 20) which were all associated with the ‘memory T-cell’ human annotation and all were significantly predictive of response. One of these signatures has TCF7 as a top loading gene and thus NIFA was able to independently discover one of the key findings of the original study—that the fraction of TCF7-positive T cells is highly associated with response.

Aside from generally reproducing the main findings of the original study, NIFA was able to uncover additional patterns. For example, we find that presence of Tregs is negatively associated with response in the post-treatment samples. The corresponding human annotation is however not significant despite the fact that the two variables are highly correlated (Pearson correlation = 0.71). Human regulatory T cells are difficult to identify from transcriptional profiles. There are no genes that are unique to this cell type. The canonical transcription factor (FOXP3) and surface marker (CD25/IL2RA) can also be transiently expressed by non-Treg CD4 cells (Chen and Oppenheim, 2011); on the other hand, because of noise in scRNA-seq data the absence of these markers does not exclude Treg status. Upon closer inspection, we find that NIFA is more conservative in designating Tregs than the human annotation counterpart. Using the NIFA mixture components we can perform a hard cell-type assignment based on the probability of being in the high-expression component. Using 0.5 as cutoff, NIFA finds 1418 Tregs, in contrast to 1740 of human annotated ones. We find that this discrepancy is highly non-random and that the NIFA Tregs are more likely to express both FOXP3 and IL2RA (Fig. 6C) indicating that the NIFA Treg signature is more specific.

Overall, within this dataset a large number of the human-annotated cell types and NIFA factors are associated with response but some general patterns emerge. Specifically, the presence of myeloid cell types is negatively associated with response while presence of lymphocytes, exclusive of those with an exhaustion-like phenotype (e.g. B cells and CD4 memory cells), is positively associated with response (see Fig. 6A). The general trend that a high myeloid to lymphocyte ratio is associated with worse outcome is observed across a variety of cancers (Thorsson et al., 2018). Our NIFA-based analysis however finds a myeloid signature (NIFA latent factor 24) that corresponds to a subset of annotated ‘Monocytes/Macrophages’ cells is positively associated with response, with an effect size that is similar to the lymphocyte populations (Fig. 6A). This myeloid subset is identified by high levels of metallothionein genes (MT1X, MT1F, MT1E and MT2A) and some metabolic genes (see Fig. 6B). Metallothioneins are a family of small proteins that play important roles in metal homeostasis and protection against heavy metal toxicity, DNA damage and oxidative stress (Si and Lang, 2018). Their induction in tumor-associated macrophages (TAMs) has been noted (Ge et al., 2012) but to our knowledge this is the first report of an association with clinical outcome.

3.5 Runtime and scalability

NIFA is a Bayesian method with a large number of parameters and thus is expected to be more computationally demanding than simpler factorization methods such as NMF. However, our implementation uses closed-form variational solutions and time-intensive computations are performed in C++ backend making NIFA competitive with other approaches. Our benchmark dataset based on the 1.3 Million Brain Cell dataset (Zheng et al., 2017) from which we sampled 100K cells. Overall, we find that NIFA is considerably faster than scCoGAPS and comparable to NMF (see Fig. 7 and Supplementary Methods for full details).

Fig. 7.

Fig. 7.

Running time comparison of ICA (deflation/parallel), NIFA, NMF and scCoGAPS with different number of factors k on the benchmark dataset

4 Discussion

We propose a factor analysis model designed specifically for single-cell data. The model combines features of PCA, ICA and NMF. Specifically, our model optimizes the PCA-like matrix reconstruction objective with mixture of Gaussian priors on the factors which encourages decomposition along multi-modal directions. We also adopt truncated Gaussian priors on the loadings thus imposing an NMF-like strict non-negativity constraint (see Fig. 1E). Using a variational Bayes framework allows us to automatically fit hyperparameters such as the number of Gaussian mixtures. We evaluate our model using both known cell identity and pathway information and demonstrate that NIFA generates biologically coherent factors that align well this prior knowledge.

One additional feature of our model is that this fully Bayesian framework is readily extensible. For example, it easily supports gene-specific priors for the loadings. This makes it possible to use known biological pathways as additional constraints. We plan on developing this extension in our future work.

Funding

This work was funded by the National Institutes of Health [R01 HG009299-5, U24 DK112331-05, R01 AI043603-20 and R01 EY030546-01A1]. This work also received funding from the Defense Health Agency through the Naval Medical Research Center [9700130] and the Defense Advanced Research Projects Agency [contract number N6600119C4022 and IHMC 2019-29-01-SC5]. This research was supported in part by the University of Pittsburgh Center for Research Computing through the resources provided.

Conflict of Interest: none declared.

Supplementary Material

btac136_Supplementary_Data

Contributor Information

Weiguang Mao, Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15260, USA; Joint Carnegie Mellon-University of Pittsburgh Ph.D. Program in Computational Biology, Pittsburgh, PA 15260, USA.

Maziyar Baran Pouyan, Department of Developmental Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15260, USA.

Dennis Kostka, Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15260, USA; Joint Carnegie Mellon-University of Pittsburgh Ph.D. Program in Computational Biology, Pittsburgh, PA 15260, USA; Department of Developmental Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15260, USA; Center for Evolutionary Biology and Medicine, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15260, USA.

Maria Chikina, Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15260, USA; Joint Carnegie Mellon-University of Pittsburgh Ph.D. Program in Computational Biology, Pittsburgh, PA 15260, USA.

References

  1. Aran D.  et al. (2017) xcell: digitally portraying the tissue cellular heterogeneity landscape. Genome Biol., 18, 220. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Blei D.M.  et al. (2017) Variational inference: a review for statisticians. J. American Stat. Assoc., 112, 859–877. [Google Scholar]
  3. Buettner F.  et al. (2015) Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells. Nat. Biotechnol., 33, 155–160. [DOI] [PubMed] [Google Scholar]
  4. Butler A.  et al. (2018) Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol., 36, 411–420. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Chen X., Oppenheim J.J. (2011) Resolving the identity myth: key markers of functional cd4+ foxp3+ regulatory t cells. Int. Immunopharmacol., 11, 1489–1496. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Duò A.  et al. (2018) A systematic performance evaluation of clustering methods for single-cell rna-seq data. F1000Research, 7, 1141. [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Engelhardt B.E., Stephens M. (2010) Analysis of population structure: a unifying framework and novel methods based on sparse factor analysis. PLoS Genet., 6, e1001117. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Fertig E.J.  et al. (2010) Cogaps: an r/c++ package to identify patterns and biological process activity in transcriptomic data. Bioinformatics, 26, 2792–2793. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Gao C.  et al. (2013) A latent factor model with a mixture of sparse and dense factors to model gene expression data with confounding effects. arXiv preprint arXiv:1310.4792.
  10. Ge Y.  et al. (2012) Induction of metallothionein expression during monocyte to melanoma-associated macrophage differentiation. Front. Biol., 7, 359–367. [Google Scholar]
  11. Gong W.  et al. (2018) Drimpute: imputing dropout events in single cell RNA sequencing data. BMC Bioinformatics, 19, 220. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Hyvärinen A., Oja E. (2000) Independent component analysis: algorithms and applications. Neural Netw., 13, 411–430. [DOI] [PubMed] [Google Scholar]
  13. Kiselev V.Y.  et al. (2017) Sc3: consensus clustering of single-cell RNA-seq data. Nat. Methods, 14, 483–486. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Knowles D., Ghahramani Z. (2011) Nonparametric Bayesian sparse factor models with application to gene expression modeling. Ann. Appl. Stat., 5, 1534–1552. [Google Scholar]
  15. Kotliar D.  et al. (2019) Identifying gene expression programs of cell-type identity and cellular activity with single-cell RNA-seq. Elife, 8, e43803. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Kowalczyk M.S.  et al. (2015) Single-cell RNA-seq reveals changes in cell cycle and differentiation programs upon aging of hematopoietic stem cells. Genome Res., 25, 1860–1872. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Kumar R.M.  et al. (2014) Deconstructing transcriptional heterogeneity in pluripotent stem cells. Nature, 516, 56–61. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Larkin J.  et al. (2015) Combined nivolumab and ipilimumab or monotherapy in untreated melanoma. N. Engl. J. Med., 373, 23–34. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Levitin H.M.  et al. (2019) De novo gene signature identification from single-cell RNA-seq with hierarchical Poisson factorization. Mol. Syst. Biol., 15, e8557. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Liberzon A.  et al. (2011) Molecular signatures database (msigdb) 3.0. Bioinformatics, 27, 1739–1740. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Lin X., Boutros P.C. (2019) NNLM: Fast and Versatile Non-Negative Matrix Factorization. R package version 0.4.3. https://CRAN.R-roject.org/package=NNLM
  22. Liu C.  et al. (2019) Treg cells promote the srebp1-dependent metabolic fitness of tumor-promoting macrophages via repression of cd8+ t cell-derived interferon-γ. Immunity, 51, 381–397. [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Olsson A.  et al. (2016) Single-cell analysis of mixed-lineage states leading to a binary cell fate choice. Nature, 537, 698–702. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Sade-Feldman M.  et al. (2018) Defining t cell states associated with response to checkpoint immunotherapy in melanoma. Cell, 175, 998–1013. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Sherman T.D.  et al. (2020) Cogaps 3: Bayesian non-negative matrix factorization for single-cell analysis with asynchronous updates and sparse data structures. BMC Bioinformatics, 21, 1–6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Si M., Lang J. (2018) The roles of metallothioneins in carcinogenesis. J. Hematol. Oncol., 11, 107. [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. Stegle O.  et al. (2010) A Bayesian framework to account for complex non-genetic factors in gene expression levels greatly increases power in EQTL studies. PLoS Comput. Biol., 6, e1000770. [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Stein-O’Brien G.L.  et al. (2019) Decomposing cell identity for transfer learning across cellular measurements, platforms, tissues, and species. Cell Syst., 8, 395–411. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Thorsson V.  et al. ; Cancer Genome Atlas Research Network. (2018) The immune landscape of cancer. Immunity, 48, 812–830. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Witten D.M.  et al. (2009) A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10, 515–534. [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Zappia L.  et al. (2017) Splatter: simulation of single-cell RNA sequencing data. Genome Biol., 18, 174. [DOI] [PMC free article] [PubMed] [Google Scholar]
  32. Zheng G.X.  et al. (2017) Massively parallel digital transcriptional profiling of single cells. Nat. Commun., 8, 14049. [DOI] [PMC free article] [PubMed] [Google Scholar]
  33. Zhu X.  et al. (2017) Detecting heterogeneity in single-cell rna-seq data by non-negative matrix factorization. PeerJ, 5, e2888. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

btac136_Supplementary_Data

Articles from Bioinformatics are provided here courtesy of Oxford University Press

RESOURCES