Abstract
RNA-sample pooling is sometimes inevitable, but should be avoided in classification tasks like biomarker studies. Our simulation framework investigates a two-class classification study based on gene expression profiles to show how strongly the outcomes of single-sample designs differ from those of pooling designs. The results show how the effects of pooling depend on pool size, discriminating pattern, number of informative features and the statistical learning method used (support vector machines with linear and radial kernel, random forest (RF), linear discriminant analysis, powered partial least squares discriminant analysis (PPLS-DA) and partial least squares discriminant analysis (PLS-DA)). As measures of the pooling effect, we consider the prediction error (PE) and the coincidence of important feature sets for classification based on PLS-DA, PPLS-DA and RF. In general, PPLS-DA and PLS-DA show constant PE with increasing pool size and low PE for patterns for which the convex hull of one class is not a cover of the other class. The coincidence of important feature sets is larger for PLS-DA and PPLS-DA than for RF. RF shows the best results for patterns in which the convex hull of one class is a cover of the other class, but these results depend strongly on the pool size. We complement the PE results with experimental data which we pool artificially. The PEs of PPLS-DA and PLS-DA are again low and least influenced by pooling. Additionally, we show under which assumption the PLS-DA loading weights, as a measure of the importance of features for classification, are equal for the different designs.
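The artificial pooling of single-sample data mentioned in the abstract could be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes that pooling acts as an equal-proportion average on the intensity scale, that expression values are stored as log2 values, and that leftover samples are dropped. The function name `pool_samples` and the simulated data are hypothetical.

```python
import numpy as np

def pool_samples(log_expr, pool_size, rng):
    """Artificially pool the single samples of one class.

    log_expr: array of shape (n_samples, n_genes) with log2 expression values.
    Returns an array of shape (n_pools, n_genes) with log2 pooled values.
    Assumption: mixing happens on the intensity scale with equal proportions,
    so a pooled log value is the log2 of the mean intensity within the pool.
    """
    n_samples = log_expr.shape[0]
    n_pools = n_samples // pool_size            # leftover samples are dropped
    order = rng.permutation(n_samples)[: n_pools * pool_size]
    intensities = 2.0 ** log_expr[order]        # back to the intensity scale
    grouped = intensities.reshape(n_pools, pool_size, -1)
    return np.log2(grouped.mean(axis=1))        # average within pool, then log

rng = np.random.default_rng(0)
single = rng.normal(loc=8.0, scale=1.0, size=(12, 5))   # 12 samples, 5 genes
pooled = pool_samples(single, pool_size=3, rng=rng)     # 4 pools, 5 genes
```

Because the logarithm is concave, the log of an averaged intensity is never smaller than the average of the individual log values, so under these assumptions pooled data are not simply averages of the single-sample log data.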
Abbreviations
- \(N\): Total number of available single samples (with subscript A or B for class A or B)
- \(N_S\): Number of samples used for the single sample arrays (with subscript A or B for class A or B)
- \(N_{S_P}\): Number of samples used for the pools (with subscript A or B for class A or B)
- \(A_S\): Number of arrays for the single sample design
- \(A_P\): Number of arrays for the pools
- \(A\): Total number of arrays which can be financed
- \(m_p\): Pool size
- \(N_P\): Number of pools
- \(u_{g,i}\): Random variables on the scale of measured intensities for a microarray experiment for gene \(g\) and sample \(i\)
- \(u_{g,p_j}\): Random variables on the scale of measured intensities for a microarray experiment for gene \(g\) and pool \(p_j\)
- \(\mu_g\): Mean gene expression level of gene \(g\)
- \(\sigma_b^2\): Biological variance
- \(X_{g,i}\): Gene expression values on the log scale for gene \(g\) and sample \(i\) (with subscript A or B for class A or B)
- \(X_{g,p_j}\): Gene expression values on the log scale for gene \(g\) and pool \(p_j\) (with subscript A or B for class A or B)
- \(w_i\): Proportion of sample \(i\) in a pool
- cov: Covariance
- corr: Correlation
- ISF: Informative simulated feature(s)
- LDA: Linear discriminant analysis
- PE(s): Prediction error(s)
- PLS-DA: Partial least squares discriminant analysis
- PPLS-DA: Powered partial least squares discriminant analysis
- RF: Random forest
- sd: Standard deviation
- SVM: Support vector machines
- SVML: Support vector machines with linear kernel
- SVMR: Support vector machines with radial kernel
- \(D_{\rm sim}\): Informative simulated feature set
- \(D_{m_p}^{\rm M}\): Important features for classification with method M in a design with pool size \(m_p\)
- \(I_{1}^{\rm M}\): \(= D_{1}^{\rm M} \cap D_{\rm sim}^{\rm M}\)
- \(I_{1:m_p}^{\rm M}\): \(= I_{1}^{\rm M} \cap D_{m_p}^{\rm M}\), the important informative simulated features for method M which coincide in the single sample design and in a design with pool size \(m_p\)
- \(X^t\): Transposed matrix of \(X\)
- \(|I|\): Cardinality of \(I\)
- abs(\(a\)): Absolute value of the real number \(a\)
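The set definitions of \(I_{1}^{\rm M}\) and \(I_{1:m_p}^{\rm M}\) given above can be illustrated with a short sketch. The gene identifiers are made up, the function name `coincidence` is hypothetical, and the ratio \(|I_{1:m_p}^{\rm M}| / |I_{1}^{\rm M}|\) is only one plausible way to summarise the coincidence; it is an assumption of this example rather than a measure taken from the paper.

```python
def coincidence(d_single, d_pooled, d_sim):
    """Return I_1, I_{1:m_p} and the share |I_{1:m_p}| / |I_1|.

    d_single: features a method marks as important in the single sample design
    d_pooled: features the same method marks as important with pool size m_p
    d_sim:    informative simulated features
    The share is a hypothetical summary, not a measure from the paper.
    """
    i_1 = set(d_single) & set(d_sim)       # I_1 = D_1 intersected with D_sim
    i_1_mp = i_1 & set(d_pooled)           # I_{1:m_p} = I_1 intersected with D_{m_p}
    share = len(i_1_mp) / len(i_1) if i_1 else float("nan")
    return i_1, i_1_mp, share

# Made-up feature sets for illustration only.
d_sim = {"g1", "g2", "g3", "g4", "g5"}
d_single = {"g1", "g2", "g3", "g9"}
d_pooled = {"g1", "g3", "g7"}

i_1, i_1_mp, share = coincidence(d_single, d_pooled, d_sim)
```

In this toy case three informative features are found in the single sample design and two of them are recovered again after pooling.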
Cite this article
Telaar, A., Repsilber, D. & Nürnberg, G. Biomarker discovery: classification using pooled samples. Comput Stat 28, 67–106 (2013). https://doi.org/10.1007/s00180-011-0302-0