Biostatistics. 2009 Jul;10(3):515-34.
doi: 10.1093/biostatistics/kxp008. Epub 2009 Apr 17.

A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis


Daniela M Witten et al. Biostatistics. 2009 Jul.

Abstract

We present a penalized matrix decomposition (PMD), a new framework for computing a rank-$K$ approximation for a matrix. We approximate the matrix $X$ as $\hat{X} = \sum_{k=1}^{K} d_k u_k v_k^T$, where $d_k$, $u_k$, and $v_k$ minimize the squared Frobenius norm of $X - \hat{X}$, subject to penalties on $u_k$ and $v_k$. This results in a regularized version of the singular value decomposition. Of particular interest is the use of $L_1$-penalties on $u_k$ and $v_k$, which yields a decomposition of $X$ using sparse vectors. We show that when the PMD is applied using an $L_1$-penalty on $v_k$ but not on $u_k$, a method for sparse principal components results. In fact, this yields an efficient algorithm for the "SCoTLASS" proposal (Jolliffe and others, 2003) for obtaining sparse principal components. This method is demonstrated on a publicly available gene expression data set. We also establish connections between the SCoTLASS method for sparse principal component analysis and the method of Zou and others (2006). In addition, we show that when the PMD is applied to a cross-products matrix, it results in a method for penalized canonical correlation analysis (CCA). We apply this penalized CCA method to simulated data and to a genomic data set consisting of gene expression and DNA copy number measurements on the same set of samples.
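The single-factor step of such a decomposition lends itself to a simple alternating scheme. The following is a minimal sketch, under the assumption that the rank-1 criterion is to maximize $u^T X v$ subject to $\|u\|_2 \le 1$, $\|v\|_2 \le 1$, $\|u\|_1 \le c_u$, $\|v\|_1 \le c_v$; the names soft_threshold, l1_unit, pmd_rank1, c_u, and c_v are illustrative and are not the authors' software interface.

```python
# Sketch of a single-factor PMD(L1, L1) fit via alternating updates with
# soft-thresholding; constraint parameters c_u, c_v should lie between 1
# and sqrt(dimension) for both constraints to be active.
import numpy as np

def soft_threshold(a, delta):
    """Elementwise soft-thresholding operator S(a, delta)."""
    return np.sign(a) * np.maximum(np.abs(a) - delta, 0.0)

def l1_unit(a, c, tol=1e-6):
    """Maximize u'a subject to ||u||_2 <= 1 and ||u||_1 <= c, by binary
    search over the soft-threshold level delta."""
    u = a / np.linalg.norm(a)
    if np.abs(u).sum() <= c:          # L1 constraint inactive: delta = 0
        return u
    lo, hi = 0.0, np.abs(a).max()
    while hi - lo > tol:
        delta = (lo + hi) / 2.0
        su = soft_threshold(a, delta)
        u = su / np.linalg.norm(su)
        if np.abs(u).sum() > c:
            lo = delta                # threshold too small; shrink more
        else:
            hi = delta
    return u

def pmd_rank1(X, c_u, c_v, n_iter=50):
    """Alternating updates for one sparse factor pair (d, u, v)."""
    v = np.linalg.svd(X, full_matrices=False)[2][0]   # warm start: top right singular vector
    for _ in range(n_iter):
        u = l1_unit(X @ v, c_u)
        v = l1_unit(X.T @ u, c_v)
    d = float(u @ X @ v)
    return d, u, v
```

A rank-$K$ fit, in the spirit of the sum over $k$ in the abstract, would repeat this on the residual $X - d\,u v^T$; applying the $L_1$-constraint to $v$ only corresponds to the sparse principal components case described above.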


Figures

Fig. 1.
A graphical representation of the L1- and L2-constraints on u in the PMD(L1, L1) criterion. The constraints are as follows: ‖u‖₂² ≤ 1 and ‖u‖₁ ≤ c. Here, u is two-dimensional, and the grey lines indicate the coordinate axes u1 and u2. Left: the L2-constraint is the solid circle. For both the L1- and L2-constraints to be active, c must be between 1 and √2. The constraints ‖u‖₁ = 1 and ‖u‖₁ = √2 are shown using dashed lines. Right: the L2- and L1-constraints on u are shown for some c between 1 and √2. Small circles indicate the points where both the L1- and the L2-constraints are active. The solid arcs indicate the solutions that occur when Δ1 = 0 in Algorithm 3. The figure shows that in two dimensions, the points where both the L1- and L2-constraints are active do not have either u1 or u2 equal to 0.
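As a brief check of the range for c in the caption (a worked bound, not part of the original figure legend): for any nonzero $u \in \mathbb{R}^p$,
$$\|u\|_2 \le \|u\|_1 \le \sqrt{p}\,\|u\|_2,$$
so when $\|u\|_2 = 1$, the L1-constraint $\|u\|_1 \le c$ can be simultaneously active with the L2-constraint only for $1 \le c \le \sqrt{p}$; with $p = 2$, as in the figure, this gives the range 1 to √2.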
Fig. 2.
Simulated CGH data. Top: results of PMD(L1, FL); middle: results of PMD(L1, L1); bottom: generative model. PMD(L1, FL) successfully identifies both the region of gain and the subset of samples for which that region is present.
Fig. 3.
Breast cancer gene expression data: a greater proportion of variance is explained when SPC is used to obtain the sparse principal components, rather than SPCA. Multiple SPC components were obtained as described in Algorithm 2.
Fig. 4.
The efficacy of PMD(L1, L1) is demonstrated using a simulation in which X is generated from 2 sparse latent factors, called u1 and u2, and Y is generated from 2 sparse latent factors, called v1 and v2. The PMD(L1, L1) method identifies linear combinations of these sparse factors. Details of the simulation are given in Section A of the Appendix.
Fig. 5.
PMD(L1, FL) was performed for the breast cancer data set. Left: for each chromosome, the weights of v obtained using PMD(L1, FL) are shown. All the v weights shown are positive, but the results would not be affected by flipping the signs of both v and u. On chromosome 2, v has no nonzero elements. Right: for each chromosome, u and v were computed on a training set consisting of 3/4 of the samples, and cor(Xu, Yv) is plotted, where X and Y are the training (dashed) and test (solid) data.
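As the abstract notes, applying the PMD to a cross-products matrix yields a penalized CCA, and Fig. 5 reports the resulting cor(Xu, Yv). A hedged usage sketch, reusing the illustrative pmd_rank1 helper above and random placeholder data (not the breast cancer data):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 50, 100, 80
X = rng.standard_normal((n, p))   # stands in for gene expression
Y = rng.standard_normal((n, q))   # stands in for DNA copy number
X -= X.mean(axis=0)
Y -= Y.mean(axis=0)

# Sparse canonical weight vectors from the PMD of the cross-products matrix X'Y.
d, u, v = pmd_rank1(X.T @ Y, c_u=3.0, c_v=3.0)

# The quantity plotted in Fig. 5: the correlation cor(Xu, Yv).
canonical_cor = np.corrcoef(X @ u, Y @ v)[0, 1]
print(canonical_cor)
```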

