Ann Appl Stat. 2024 Mar;18(1):858-881.
doi: 10.1214/23-aoas1817. Epub 2024 Jan 31.

A SIMPLE AND FLEXIBLE TEST OF SAMPLE EXCHANGEABILITY WITH APPLICATIONS TO STATISTICAL GENOMICS

Alan J Aw et al. Ann Appl Stat. 2024 Mar.

Abstract

In scientific studies involving analyses of multivariate data, basic but important questions often arise for the researcher: Is the sample exchangeable, meaning that the joint distribution of the sample is invariant to the ordering of the units? Are the features independent of one another, or perhaps the features can be grouped so that the groups are mutually independent? In statistical genomics, these considerations are fundamental to downstream tasks such as demographic inference and the construction of polygenic risk scores. We propose a non-parametric approach, which we call the V test, to address these two questions, namely, a test of sample exchangeability given dependency structure of features, and a test of feature independence given sample exchangeability. Our test is conceptually simple, yet fast and flexible. It controls the Type I error across realistic scenarios, and handles data of arbitrary dimensions by leveraging large-sample asymptotics. Through extensive simulations and a comparison against unsupervised tests of stratification based on random matrix theory, we find that our test compares favorably in various scenarios of interest. We apply the test to data from the 1000 Genomes Project, demonstrating how it can be employed to assess exchangeability of the genetic sample, or find optimal linkage disequilibrium (LD) splits for downstream analysis. For exchangeability assessment, we find that removing rare variants can substantially increase the p-value of the test statistic. For optimal LD splitting, the V test reports different optimal splits than previous approaches not relying on hypothesis testing. Software for our methods is available in R (CRAN: flintyR) and Python (PyPI: flintyPy).

Keywords: LD splitting; exchangeability; feature independence; non-parametric test; population stratification.
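The abstract describes a non-parametric test of sample exchangeability under a known feature dependency structure. As a minimal sketch of that idea, and not the authors' implementation or the flintyR / flintyPy API, the Python below runs a permutation test on a binary sample. It rests on assumptions the abstract does not state: that all features are mutually independent, that the statistic is the variance of pairwise Hamming distances between individuals, and that the null distribution is obtained by permuting each feature column independently. The function names and parameters are hypothetical.

```python
import numpy as np


def pairwise_hamming_variance(X):
    """Variance of the pairwise Hamming distances between the rows of X.

    Assumption: this stands in for the test statistic; the abstract
    does not define the statistic itself.
    """
    n = X.shape[0]
    # dists[i, j] = number of features on which individuals i and j differ
    dists = (X[:, None, :] != X[None, :, :]).sum(axis=2)
    upper = np.triu_indices(n, k=1)  # each unordered pair counted once
    return dists[upper].var()


def permutation_exchangeability_test(X, n_perm=200, seed=0):
    """Permutation p-value for sample exchangeability (illustrative sketch).

    Assumption: features are mutually independent, so shuffling every
    column independently preserves the null distribution of the sample.
    """
    rng = np.random.default_rng(seed)
    observed = pairwise_hamming_variance(X)
    null_stats = np.empty(n_perm)
    for b in range(n_perm):
        # Shuffle each feature across individuals, independently per feature
        permuted = np.column_stack(
            [rng.permutation(X[:, j]) for j in range(X.shape[1])]
        )
        null_stats[b] = pairwise_hamming_variance(permuted)
    # One-sided p-value with the usual +1 correction
    p_value = (1 + np.sum(null_stats >= observed)) / (1 + n_perm)
    return observed, p_value


# Toy data: two groups of 20 individuals with very different
# per-feature frequencies (0.2 versus 0.8), i.e. a stratified sample.
demo_rng = np.random.default_rng(1)
stratified = np.vstack([
    demo_rng.binomial(1, 0.2, size=(20, 200)),
    demo_rng.binomial(1, 0.8, size=(20, 200)),
])
stat, p = permutation_exchangeability_test(stratified)
print(f"statistic = {stat:.1f}, permutation p-value = {p:.4f}")
```

In a stratified sample, between-group distances are much larger than within-group distances, so the observed variance sits far in the upper tail of the permutation distribution and the p-value is small. The abstract also mentions a large-sample approximation for data of arbitrary dimensions; this sketch covers only the resampling route.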

Figures

FIG 1.
Heat map of allele dosages (0, 1 or 2) across 34 approximately independent SNP markers from Chromosome 22 for a sample of N=205 African individuals, who are either Yoruba in Ibadan, Nigeria (N_YRI=108) or Luhya in Webuye, Kenya (N_LWK=97). Population-specific allele frequencies of each marker are depicted in the bar plot below. The user must decide, on the basis of differences in observed allele frequencies, whether the African sample should be treated as a single panmictic population.
FIG 2.
Overview of our method for detecting sample non-exchangeability or dependence between features.
FIG 3.
Top Row shows AUROCs of the V test and of the TW test for pairings of a null model and a non-exchangeable model, with solid diamond points reporting the mean AUROCs for the particular test. Bottom Row shows ROCs generated from pairing a null model and a non-exchangeable model, both of which generate samples containing N=50 observations and P features. A. AUROC points are split into different distances between populations (Scenario 2, Table 1). B. AUROC points are split into different choices of sampling unevenness (Scenario 5, Table 1). C. P=100 features generated. For the non-exchangeable model, individuals are drawn from K=2 populations such that 5 individuals are drawn from Population 1 and the remaining 45 are drawn from Population 2; population closeness set to ε=0.2. D. 25 individuals are drawn each from K=2 populations, with P=1000 features only 20% of which truly discern between the two source populations; population closeness set to ε=0.2. E. 25 individuals are drawn each from K=2 populations, with P=100 features only 20% of which truly discern between the two source populations; population closeness set to ε=0.2.
FIG 4.
Probability-probability plots of the permutation-based distribution, F_perm, against the large-P approximation. A. N=10. B. N=100. C. N=1000.
FIG 5.
Exchangeability test (Hypothesis H1) p-values for the Utah population (CEU), Kinh population (KHV) and Yoruba population (YRI) across progressively stringent allele frequency threshold choices, r. Raw p-values are log-transformed for better visualization.
FIG 6.
Rotated heatmap of pairwise LD r² values within a 2000 b.p. region of Chromosome 22, with r² values less than 0.05 removed for better visualization. Superimposed on the heatmap are split points lying in the region, as identified by various LD splitting methods, including ldetect (split points for entire African sample) and snp_ldsplit (optimal split points and suboptimal split points), as described in Subsection 6.2. The split points are given by ldetect: (21419799, 22878110, 23174140, 23717987, 24488861, 25664408), optimal snp_ldsplit: (22579801, 23849683) and suboptimal snp_ldsplit: (22579801, 23849683, 25286983).
FIG 7.
An exchangeable but heterogeneous sample. We set P=2 in the model described in Appendix B, and draw N=40 vectors xn2. Points are shaped by the number of coordinates that lie above or below 0.
