Front Immunol. 2023 Apr 11:14:1125884.
doi: 10.3389/fimmu.2023.1125884. eCollection 2023.

Adaptive immune receptor genotyping using the corecount program

Sanjana Narang et al. Front Immunol. 2023.

Abstract

We present a new Rep-Seq analysis tool, corecount, for analyzing genotypic variation in immunoglobulin (IG) and T cell receptor (TCR) genes. corecount is highly efficient at identifying V alleles, including those that are infrequently used in expressed repertoires and those that contain 3' end variation and are otherwise refractory to reliable identification during germline inference from expressed libraries. Furthermore, corecount facilitates accurate D and J gene genotyping. The output is highly reproducible and facilitates the comparison of genotypes from multiple individuals, such as those from clinical cohorts. Here, we applied corecount to the genotypic analysis of IgM libraries from 16 individuals. To demonstrate the accuracy of corecount, we Sanger sequenced all the heavy chain IG alleles (65 IGHV, 27 IGHD and 7 IGHJ) from one individual, from whom we also produced two independent IgM Rep-Seq datasets. Genomic analysis revealed that 5 known IGHV and 2 IGHJ sequences are truncated in current reference databases. This dataset of genomically validated alleles and IgM libraries from the same individual provides a useful resource for benchmarking other bioinformatic programs that involve V, D and J assignments and germline inference, and may facilitate the development of AIRR-Seq analysis tools that can benefit from the availability of more comprehensive reference databases.

Keywords: IGH; VDJ germline genes; genotyping; immune repertoires; inference.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

Figure 1
IgM library preparation and inferred genotype. (A) Peripheral blood samples were separated into polymorphonuclear and mononuclear cell fractions, with DNA and RNA extracted from each fraction respectively. IgM libraries were prepared from the RNA fraction, sequenced using the Illumina MiSeq system, and analyzed for the presence of novel alleles using the IgDiscover program. Genotyping was subsequently performed using the corecount module. DNA isolated from the polymorphonuclear fraction was used as template for genomic PCR of individual V, D and J alleles and subsequent Sanger validation. (B) Inferred haplotype analysis of case D19. IgDiscover plotallele output using IGHJ6*02 and IGHJ6*03 as haplotyping anchors. Genes associated with common structural alterations are marked with red brackets. Dendrograms from case D19 of (C) Sanger-validated IGHV alleles, with non-expressed alleles (those absent from two independent IgM libraries) denoted in red, (D) IGHD alleles, and (E) IGHJ alleles.
Figure 2
Genomic validation. (A) DNA extracted from polymorphonuclear cells from case D19 was used as template for individual PCR amplification of all IGHV, IGHD and IGHJ genes. Primers were located upstream of the leader coding sequence, and downstream of the RSS sequence in L2 V exon. Primers targeting the D genes were located upstream and downstream of the 5’ and 3’ D gene RSS segments, respectively. Primers targeting the J genes were located upstream of the RSS segment and downstream of the J gene splice site segment. (B) Sequence discrepancies of known reference alleles identified through Sanger validation. Full-length extension of the truncated alleles IGHV3-66*02, IGHV4-38-2*01, IGHV2-70*05, IGHV5-10-1*03, IGHV5-51*03, IGHJ6*02 and IGHJ6*03 was enabled by identifying the positions of framework 1, the RSS heptamer sequence or splice sites within the genomic sequence encompassing the truncated allelic sequences.
Figure 3
VDJ-associated IGHV truncation. (A) Schematic of VDJ recombination. Genomically distinct V, D and J alleles are recombined to produce a unique VDJ ‘exon’. In most cases the recombination process causes loss or replacement of several nucleotides at the 3’ end of the V gene, the 5’ end of the J gene and both the 5’ and 3’ ends of the D gene, thereby enabling the generation of high levels of CDR3 coding diversity. The sections of the V, D and J genes subject to high rates of recombination-associated change are shaded blue in the figure. The gene segments least affected by the recombination process are termed here the V, D and J core sequences. (B) Effect of recombination-based 3’ nucleotide alteration on unmutated allele counts in D19 using NGS analysis of the IgM library. The proportion of counts for V alleles containing the entire V sequence is shown in blue. (C) The proportions of counts for a series of single-nucleotide 3’ deletions are shown relative to the counts for a 6 nt truncation at the 3’ end of the V sequence for IGHV1 alleles in case D19. In all cases a plateau of counts is reached at the -4 or -5 nt point. (D) The proportion of unmutated allele sequences of the gene IGHV1-46, which contains nucleotide differences close to the IGHV 3’ end, increases rapidly when the allelic search string is shortened by 5 nucleotides at the 3’ end, owing to the presence of the single G/T SNP that distinguishes IGHV1-46*01 from IGHV1-46*03 in case D19.
Figure 4
Schematic of the corecount process. (i) The starting point for corecount analysis is a validated, comprehensive database that contains end-corrected allelic sequences. (ii) Database processing by corecount. The program processes allelic variants on a per-gene basis, truncating the database sequences in the regions affected by VDJ recombination-associated variation by trimming a default or user-defined number of nucleotides from each sequence, to leave a ‘core’ sequence specific for each allele. For V genes in which one or more allelic variants that distinguish one allele from another lie within the last three nucleotides of the 3’ end of the gene, corecount truncates the sequences of all alleles of that gene only as far as that variant nucleotide. Within a single gene, all corecount-processed alleles are truncated by the same amount, facilitating a direct comparison of the counts for each expressed allele. The corecount database truncation occurs at the 3’ end of V sequences, the 5’ end of J sequences, and both the 5’ and 3’ ends of D sequences. For D gene analysis, corecount additionally includes an alternative procedure that identifies the longest common substrings (LCS) of allele-specific sequences within the D gene sequences, which are used as alternative core sequence search strings. (iii) corecount analysis of a MiAIRR-formatted NGS library. The corecount program identifies sequence matches to the database sequences within the library and passes the raw sequence counts to the germline filter step. (iv) Filtering of raw corecount output to produce the final expressed genotype. This filter step, which can be user defined, includes an allelic ratio that compares counts of all alleles of each individual gene, a minimum count requirement, an expected frequency based on gene and allele frequency from multiple independent libraries, and two CDR3 diversity filters that detect and remove false-positive expanded clones based on either biased CDR3 length or sequence diversity amongst the set of CDR3s associated with each germline sequence. The output of the program is a genotype of the appropriate gene type (V, D or J).
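The Figure 4 steps for V genes can be illustrated with a minimal Python sketch. This is not the corecount source code. The function names, the toy allele names and sequences, the default trim of 6 nt and the filter thresholds are all assumptions made for illustration. The sketch further assumes that alleles of one gene have equal length, that a core is counted by exact substring match, and that the variant scan covers the whole trimmed window. Only the allelic ratio and minimum count filters are shown.

```python
from collections import Counter, defaultdict

def gene_of(allele_name):
    """'GENEA*01' -> 'GENEA' (hypothetical naming used only in this sketch)."""
    return allele_name.split("*")[0]

def trim_for_gene(seqs, default_trim=6):
    """Number of 3' nucleotides to remove from every allele of one gene.

    Scans from the final nucleotide inwards; if the alleles differ at some
    position inside the trim window, trimming stops just after that position.
    Assumes all sequences in `seqs` have the same length.
    """
    length = len(seqs[0])
    for offset in range(1, default_trim + 1):      # offset 1 = final nucleotide
        if len({s[length - offset] for s in seqs}) > 1:
            return offset - 1                      # keep the variant nucleotide
    return default_trim

def build_v_cores(alleles, default_trim=6):
    """Truncate all alleles of a gene by the same amount to give core strings."""
    by_gene = defaultdict(dict)
    for name, seq in alleles.items():
        by_gene[gene_of(name)][name] = seq
    cores = {}
    for members in by_gene.values():
        trim = trim_for_gene(list(members.values()), default_trim)
        for name, seq in members.items():
            cores[name] = seq[:len(seq) - trim]
    return cores

def count_cores(reads, cores):
    """Raw counts: number of reads containing each core as an exact substring."""
    counts = Counter()
    for read in reads:
        for name, core in cores.items():
            if core in read:
                counts[name] += 1
    return counts

def filter_genotype(counts, min_count=2, min_ratio=0.1):
    """Keep alleles passing a minimum count and a per-gene allelic ratio."""
    best = defaultdict(int)
    for name, n in counts.items():
        best[gene_of(name)] = max(best[gene_of(name)], n)
    return sorted(name for name, n in counts.items()
                  if n >= min_count and n / best[gene_of(name)] >= min_ratio)

# Toy data (invented sequences, not real germline alleles).
toy_alleles = {
    "GENEA*01": "GATTACAGATTACACCGTGA",
    "GENEA*02": "GATTACAGATTACACCGGGA",   # differs 3 nt from the 3' end
    "GENEB*01": "CCCTTTGGGAAACCCTTTGG",
}
toy_reads = [
    "TTGATTACAGATTACACCGTCCAAA",
    "TTGATTACAGATTACACCGTCCAAA",
    "GATTACAGATTACACCGGTTT",
    "CCCTTTGGGAAACCGGGG",
]
cores = build_v_cores(toy_alleles)
counts = count_cores(toy_reads, cores)
genotype = filter_genotype(counts, min_count=1, min_ratio=0.6)
```

In this toy case the two GENEA alleles are trimmed by only 2 nt so that the distinguishing nucleotide stays in the core, while GENEB*01 receives the full 6 nt trim. The ratio filter then drops GENEA*02 because its count is half that of GENEA*01.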
Figure 5
Illustration of the IGHD LCS analysis procedure using IGHD5-24*01. The corecount LCS procedure identifies allele-specific substrings from each sequence present in the supplied IGHD allelic database. These LCS substrings span a minimum percentage of the full-length IGHD allele, as defined by the user, in this case 60%. The program searches the MiAIRR-formatted library table and produces a set of counts for each of these substrings, with each VDJ sequence analyzed contributing a maximum count of 1 to this set. An example is shown for allele IGHD5-24*01 in case D19. In this procedure the program utilizes sequences where the remaining segment of IGHD5-24*01 in the VDJ recombinant is central to the D core, but also enables the use of VDJ recombinations that skew towards the 5’ or 3’ part of the IGHD allele. The position of the V allele and the 5’ and 3’ RSS heptamers are shown in the top row. A total of 45 unique substrings of IGHD5-24*01 are shown to be utilized, each resulting in different counts, the sum of which is the corecount total for this IGHD5-24*01 allelic sequence. This approach enables a series of ‘cores’ to be utilized for D gene analysis rather than restricting it to a single central ‘core’ of that germline.
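The substring counting described for Figure 5 can be sketched as follows. This is an illustrative approximation rather than the corecount implementation. The function names and the toy D allele sequences are invented, and crediting each read to its longest matching substring is an assumed way of enforcing the limit of one count per VDJ sequence.

```python
import math
from collections import Counter

def allele_specific_substrings(alleles, target, min_fraction=0.6):
    """Substrings of `target` of at least `min_fraction` of its length that
    occur in no other allele of the supplied database."""
    seq = alleles[target]
    min_len = math.ceil(min_fraction * len(seq))
    others = [s for name, s in alleles.items() if name != target]
    found = set()
    for length in range(min_len, len(seq) + 1):
        for start in range(len(seq) - length + 1):
            sub = seq[start:start + length]
            if not any(sub in other for other in others):
                found.add(sub)
    return found

def count_substrings(reads, substrings):
    """Count reads per substring; each read contributes at most one count,
    assigned here to the longest substring it contains (an assumption)."""
    ordered = sorted(substrings, key=len, reverse=True)
    counts = Counter()
    for read in reads:
        for sub in ordered:
            if sub in read:
                counts[sub] += 1
                break
    return counts

# Toy D alleles (invented sequences) differing at a single position.
toy_d_alleles = {"DX*01": "AAACCCGGGT", "DX*02": "AAACCCGGTT"}
toy_d_reads = [
    "TTTTCCCGGGTACAC",   # 5'-trimmed remnant of DX*01
    "GGAAACCCGGGTCA",    # full-length DX*01
    "GGAAACCCGGTTCA",    # full-length DX*02, no DX*01-specific substring
]
d_substrings = allele_specific_substrings(toy_d_alleles, "DX*01", 0.6)
d_counts = count_substrings(toy_d_reads, d_substrings)
d_total = sum(d_counts.values())   # summed total for DX*01
```

Summing the per-substring counts gives the allele total, so reads in which the D segment was trimmed asymmetrically still contribute as long as an allele-specific substring of sufficient length remains.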
Figure 6
(A) Analysis of the effect of library size on genotype output. The number of alleles found by corecount analysis in the complete IgM library (100%) was compared to the numbers found when using libraries that had been reduced to 25%, 10%, 5%, and 1%, each sampled ten times. The numbers of V, D and J alleles are shown for case D19 and (B) for case D46. (C) corecount identification of allelic end variants. Full-length IGHV3-15*07 sequences in the D19 IgM library, in which this allele is heterozygous and ends in an ‘A’, were modified computationally so that the final nucleotide was C, G or T. The corecount analysis was performed using a database supplemented with the modified IGHV3-15*07 test variants, IGHV3-15*01_C, IGHV3-15*07_G and IGHV3-15*07_T. In each case corecount correctly produced an IGHV genotype containing either the appropriate IGHV3-15*07 sequence (top result) or the modified IGHV3-15*01_C, IGHV3-15*07_G or IGHV3-15*07_T. (D) Allelic output, based on numbers of alleles identified, from corecount IGHV genotype analysis (purple) compared to IgDiscover germline inference output (green) for 16 IgM libraries.
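The library-size experiment in panels (A) and (B) amounts to repeated random subsampling followed by genotyping. A rough sketch under stated assumptions: the function names, the toy cores and reads, and the fixed random seed are hypothetical, and an allele is treated as found if its core appears in at least `min_count` sampled reads, which is far simpler than the full corecount filter.

```python
import random

def alleles_found(reads, cores, min_count=1):
    """Alleles whose core string occurs in at least `min_count` reads."""
    return {name for name, core in cores.items()
            if sum(core in read for read in reads) >= min_count}

def subsample_curve(reads, cores, fractions=(1.0, 0.25, 0.10, 0.05, 0.01),
                    repeats=10, seed=0):
    """For each fraction, draw `repeats` random subsamples without
    replacement and record how many alleles are found in each."""
    rng = random.Random(seed)
    results = {}
    for fraction in fractions:
        n = max(1, round(fraction * len(reads)))
        results[fraction] = [len(alleles_found(rng.sample(reads, n), cores))
                             for _ in range(repeats)]
    return results

# Toy library: 100 reads, one common and one rare allele (invented sequences).
toy_cores = {"GENEA*01": "ACGTACGT", "GENEA*02": "TTGGCCAA"}
toy_library = ["GGACGTACGTGG"] * 90 + ["CCTTGGCCAACC"] * 10
curve = subsample_curve(toy_library, toy_cores)
```

Comparing the spread of values at each fraction against the count at 100% indicates how deeply a library must be sequenced before rare alleles are reliably recovered.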

Grants and funding

Funding for this work was provided by a Distinguished Professor grant from the Swedish Research Council (agreement number 2017-00968) and an ERC Advanced grant (agreement number 78816) to GBKH.