Ann Appl Stat. 2024 Mar;18(1):858-881.

doi: 10.1214/23-aoas1817. Epub 2024 Jan 31.

A SIMPLE AND FLEXIBLE TEST OF SAMPLE EXCHANGEABILITY WITH APPLICATIONS TO STATISTICAL GENOMICS

Alan J Aw¹, Jeffrey P Spence², Yun S Song³

Affiliations

¹ Department of Statistics, University of California, Berkeley.
² Department of Genetics, School of Medicine, Stanford University.
³ Department of Statistics and Computer Science Division, University of California, Berkeley.

PMID: 38784669
PMCID: PMC11115382
DOI: 10.1214/23-aoas1817

Alan J Aw et al. Ann Appl Stat. 2024 Mar.

Abstract

In scientific studies involving analyses of multivariate data, basic but important questions often arise for the researcher: Is the sample exchangeable, meaning that the joint distribution of the sample is invariant to the ordering of the units? Are the features independent of one another, or can the features be grouped so that the groups are mutually independent? In statistical genomics, these considerations are fundamental to downstream tasks such as demographic inference and the construction of polygenic risk scores. We propose a non-parametric approach, which we call the V test, to address these two questions, namely, a test of sample exchangeability given the dependency structure of features, and a test of feature independence given sample exchangeability. Our test is conceptually simple, yet fast and flexible. It controls the Type I error across realistic scenarios, and handles data of arbitrary dimensions by leveraging large-sample asymptotics. Through extensive simulations and a comparison against unsupervised tests of stratification based on random matrix theory, we find that our test compares favorably in various scenarios of interest. We apply the test to data from the 1000 Genomes Project, demonstrating how it can be employed to assess exchangeability of the genetic sample, or find optimal linkage disequilibrium (LD) splits for downstream analysis. For exchangeability assessment, we find that removing rare variants can substantially increase the $p$-value of the test statistic. For optimal LD splitting, the V test reports different optimal splits than previous approaches that do not rely on hypothesis testing. Software for our methods is available in R (CRAN: flintyR) and Python (PyPI: flintyPy).

Keywords: LD splitting; exchangeability; feature independence; non-parametric test; population stratification.
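
To make the idea of a permutation-based exchangeability test concrete, here is a minimal sketch in Python. It is not the flintyR or flintyPy implementation and does not use their APIs. It assumes binary features that are mutually independent, takes the variance of pairwise Hamming distances as an illustrative test statistic, and builds the null distribution by permuting each feature independently across individuals. The function names `pairwise_distance_variance` and `permutation_pvalue` and the simulated data are hypothetical.

```python
import numpy as np


def pairwise_distance_variance(X):
    """Variance of the Hamming distances over all pairs of rows of X.

    Illustrative statistic (an assumption of this sketch): population
    structure makes some pairs much closer than others, which inflates
    the spread of pairwise distances.
    """
    n = X.shape[0]
    # Number of mismatching features for every ordered pair of rows.
    diffs = (X[:, None, :] != X[None, :, :]).sum(axis=2)
    # Keep each unordered pair once (strict upper triangle).
    upper = np.triu_indices(n, k=1)
    return diffs[upper].var()


def permutation_pvalue(X, n_perm=500, seed=0):
    """One-sided permutation p-value for the statistic above.

    Assumes the features (columns) are mutually independent, so that
    under exchangeability each column can be shuffled on its own.
    """
    rng = np.random.default_rng(seed)
    observed = pairwise_distance_variance(X)
    exceed = 0
    for _ in range(n_perm):
        # Generator.permuted with axis=0 shuffles every column independently.
        X_perm = rng.permuted(X, axis=0)
        if pairwise_distance_variance(X_perm) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return (exceed + 1) / (n_perm + 1)


# Hypothetical simulated data: 40 individuals, 200 binary features.
rng = np.random.default_rng(1)
N, P = 40, 200

# Homogeneous sample: every individual shares the same feature frequencies.
homogeneous = (rng.random((N, P)) < 0.5).astype(int)

# Stratified sample: two subpopulations with very different frequencies.
freqs = np.vstack([np.full((N // 2, P), 0.1), np.full((N // 2, P), 0.9)])
stratified = (rng.random((N, P)) < freqs).astype(int)

p_homogeneous = permutation_pvalue(homogeneous, n_perm=200)
p_stratified = permutation_pvalue(stratified, n_perm=200)
print("homogeneous sample p-value:", p_homogeneous)
print("stratified sample p-value:", p_stratified)
```

In this toy setting the stratified sample should yield a small p-value, because between-population pairs are far apart while within-population pairs are close. The sketch relies on brute-force resampling and does not reproduce the large-sample approximation mentioned in the abstract.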

Figures

**FIG 1.**
Heat map of allele dosages (0, 1 or 2) across 34 approximately independent SNP markers from Chromosome 22 for a sample of $N = 205$ African individuals, who are either Yoruba in Ibadan, Nigeria $(N_{YRI} = 108)$ or Luhya in Webuye, Kenya $(N_{LWK} = 97)$. Population-specific allele frequencies of each marker are depicted in the bar plot below. The user must decide, on the basis of differences in observed allele frequencies, whether the African sample should be treated as a single panmictic population.

**FIG 2.**
Overview of our method for detecting sample non-exchangeability or dependence between features.

**FIG 3.**
**Top Row** shows AUROCs of the V test and of the TW test for pairings of a null model and a non-exchangeable model, with solid diamond points reporting the mean AUROCs for the particular test. **Bottom Row** shows ROCs generated from pairing a null model and a non-exchangeable model, both of which generate samples containing $N = 50$ observations and $P$ features. A. AUROC points are split into different distances between populations (Scenario 2, Table 1). B. AUROC points are split into different choices of sampling unevenness (Scenario 5, Table 1). C. $P = 100$ features generated. For the non-exchangeable model, individuals are drawn from $K = 2$ populations such that 5 individuals are drawn from Population 1 and the remaining 45 are drawn from Population 2; population closeness set to $\varepsilon = 0.2$. D. 25 individuals are drawn each from $K = 2$ populations, with $P = 1000$ features, only 20% of which truly discern between the two source populations; population closeness set to $\varepsilon = 0.2$. E. 25 individuals are drawn each from $K = 2$ populations, with $P = 100$ features, only 20% of which truly discern between the two source populations; population closeness set to $\varepsilon = 0.2$.

**FIG 4.**
Probability-probability plots of the permutation-based distribution, $F_{\mathrm{perm}}$, against the large-$P$ approximation. A. $N = 10$. B. $N = 100$. C. $N = 1000$.

**FIG 5.**
Exchangeability test (Hypothesis H1) $p$-values for the Utah population (CEU), Kinh population (KHV) and Yoruba population (YRI) across progressively stringent allele frequency threshold choices, $r$. Raw $p$-values are log-transformed for better visualization.

**FIG 6.**
Rotated heatmap of pairwise LD $r^{2}$ values within a 2000 b.p. region of Chromosome 22, with $r^{2}$ values less than 0.05 removed for better visualization. Superimposed on the heatmap are split points lying in the region, as identified by various LD splitting methods, including ldetect (split points for entire African sample) and snp_ldsplit (optimal split points and suboptimal split points), as described in Subsection 6.2. The split points are given by ldetect: (21419799, 22878110, 23174140, 23717987, 24488861, 25664408), optimal snp_ldsplit: (22579801, 23849683) and suboptimal snp_ldsplit: (22579801, 23849683, 25286983).

**FIG 7.**
An exchangeable but heterogeneous sample. We set $P = 2$ in the model described in Appendix B, and draw $N = 40$ vectors $x_{n} \in \mathbb{R}^{2}$. Points are shaped by the number of coordinates that lie above or below 0.

Grants and funding

R35 GM134922/GM/NIGMS NIH HHS/United States
