×

An effective haplotype assembly algorithm based on hypergraph partitioning. (English) Zbl 1412.92193

Summary: The haplotype assembly problem has been proven to be complex. Heuristic algorithms are the main methods that are used to solve the problem. These algorithms perform well when the SNP fragments are error-free, but they are less accurate when the error rate increases. The complex relationships caused by fragment errors present a major barrier to assembling accurate haplotypes. Therefore, modeling the complex relationships is the key to solve the problem. In this study, we model the haplotype assembly problem using hypergraph partitioning formulations and propose a novel method termed HGHap (hypergraph-based haplotype assembly method). HGHap approaches the haplotype assembly problem in two phases. In the first phase, a hypergraph is constructed in which each vertex corresponds to a fragment and vertices are multiply connected to form hyperedges. In the second phase, a hypergraph partitioning algorithm is employed to obtain two groups of fragments to construct the haplotypes. The hyperedges capture higher-order relationships among fragments that facilitate the subsequent partitioning. Our results demonstrate that the method performs better than other methods in most cases, especially in cases with a high error rate.

MSC:

92D10 Genetics and epigenetics
05C90 Applications of graph theory
Full Text: DOI

References:

[1] Althaus, I. W.; Chou, J. J.; Gonzales, A. J.; Deibel, M. R.; Chou, K. C.; Kezdy, F. J.; Romero, D. L.; Aristoff, P. A.; Tarpley, W. G.; Reusser, F., Steady-state kinetic-studies with the nonnucleoside HIV-1 reverse-transcriptase inhibitor U-87201E, J. Biol. Chem., 268, 6119-6124 (1993)
[2] Althaus, I. W.; Chou, J. J.; Gonzales, A. J.; Deibel, M. R.; Chou, K. C.; Kezdy, F. J.; Romero, D. L.; Palmer, J. R.; Thomas, R. C.; Aristoff, P. A.; Tarpley, W. G.; Reusser, F., Kinetic-studies with the nonnucleoside HIV-1 reverse-transcriptase inhibitor-U-88204E, Biochemistry, 32, 6548-6554 (1993)
[3] Althaus, I. W.; Gonzales, A. J.; Chou, J. J.; Romero, D. L.; Deibel, M. R.; Chou, K. C.; Kezdy, F. J.; Resnick, L.; Busso, M. E.; So, A. G.; Downey, K. M.; Thomas, R. C.; Aristoff, P. A.; Tarpley, W. G.; Reusser, F., The quinoline U-78036 is a potent inhibitor of HIV-1 reverse-transcriptase, J. Biol. Chem., 268, 14875-14880 (1993)
[4] Andraos, J., Kinetic plasticity and the determination of product ratios for kinetic schemes leading to multiple products without rate laws—new methods based on directed graphs, Can. J. Chem.-Revue Canadienne De Chimie, 86, 342-357 (2008)
[5] Bansal, V.; Bafna, V., HapCUT: an efficient and accurate algorithm for the haplotype assembly problem, Bioinformatics, 24, I153-I159 (2008)
[6] Borgelt, C., Frequent item set mining. Wiley Interdisciplinary Reviews, Data Min. Knowl. Discovery, 2, 437-456 (2012)
[7] Cambazoglu, B. B.; Aykanat, C., Hypergraph-partitioning-based remapping models for image-space-parallel direct volume rendering of unstructured grids, IEEE Trans. Parallel Distrib. Syst., 18, 3-16 (2007)
[8] Chen, W.; Feng, P. M.; Lin, H.; Chou, K. C., iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition, Nucleic Acids Res., 41 (2013)
[9] Chen, W.; Lin, H.; Feng, P. M.; Ding, C.; Zuo, Y. C.; Chou, K. C., iNuc-PhysChem: a sequence-based predictor for identifying nucleosomes via physicochemical properties, PLoS One, 7 (2012)
[10] Chen, Z.; Fu, B.; Schweller, R.; Yang, B.; Zhao, Z.; Zhu, B., Linear time probabilistic algorithms for the singular haplotype reconstruction problem from SNP fragments, J. Comput. Biol., 15, 535-546 (2008)
[11] Chou, K. C., Graphic rules in steady and non-steady state enzyme-kinetics, J. Biol. Chem., 264, 12074-12079 (1989)
[12] Chou, K. C., Applications of graph-theory to enzyme-kinetics and protein folding kinetics—steady and non-steady-state systems, Biophys. Chem., 35, 1-24 (1990)
[13] Chou, K. C., Graphic rule for drug metabolism systems, Curr. Drug Metab., 11, 369-378 (2010)
[14] Chou, K. C., Some remarks on protein attribute prediction and pseudo amino acid composition, J. Theor. Biol., 273, 236-247 (2011) · Zbl 1405.92212
[15] Chou, K. C.; Forsen, S., Graphical rules for enzyme-catalyzed rate laws, Biochem. J., 187, 829-835 (1980)
[16] Chou, K. C.; Kézdy, F. J.; Reusser, F., Kinetics of processive nucleic acid polymerases and nucleases, Anal. Biochem., 221, 217-230 (1994)
[17] Chou, K. C.; Lin, W. Z.; Xiao, X., Wenxiang: a web-server for drawing wenxiang diagrams, Nat. Sci., 3 (2011)
[18] Chou, K. C.; Shen, H. B., Recent progress in protein subcellular location prediction, Anal. Biochem., 370, 1-16 (2007)
[19] Chou, K. C.; Shen, H. B., Signal-CF: a subsite-coupled and window-fusing approach for predicting signal peptides, Biochem. Biophys. Res. Commun., 357, 633-640 (2007)
[20] Chou, K. C.; Shen, H. B., Recent advances in developing web-servers for predicting protein attributes, Nat. Sci., 1 (2009)
[21] Consortium, T. I.H., A haplotype map of the human genome, Nature, 437, 1299-1320 (2005)
[22] Duitama, J., Huebsch, T., McEwen, G., Suk, E.K., Hoehe, M.R., 2010. ReFHap: a reliable and fast algorithm for single individual haplotyping. In: Proceedings of the First ACM International Conference on Bioinformatics and Computational Biology, ACM, Niagara Falls, New York , pp. 160-169.; Duitama, J., Huebsch, T., McEwen, G., Suk, E.K., Hoehe, M.R., 2010. ReFHap: a reliable and fast algorithm for single individual haplotyping. In: Proceedings of the First ACM International Conference on Bioinformatics and Computational Biology, ACM, Niagara Falls, New York , pp. 160-169.
[23] Ertoz, L., Steinbach, M., Kumar, V., 2003. Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data. In: Proceedings of the Third SIAM International Conference Data Min, pp. 47.; Ertoz, L., Steinbach, M., Kumar, V., 2003. Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data. In: Proceedings of the Third SIAM International Conference Data Min, pp. 47.
[24] Fan, Y. N.; Xiao, X.; Min, J. L.; Chou, K. C., iNR-Drug: predicting the interaction of drugs with nuclear receptors in cellular networking, Int. J. Mol. Sci., 15, 4915-4937 (2014)
[25] Fiduccia, C.M., Mattheyses, R.M., 1982. A linear-time heuristic for improving network partitions. In: ACM IEEE Nineteenth Design Automation Conference Proceedings, 174-181.; Fiduccia, C.M., Mattheyses, R.M., 1982. A linear-time heuristic for improving network partitions. In: ACM IEEE Nineteenth Design Automation Conference Proceedings, 174-181.
[26] Genovese, L. M.; Geraci, F.; Pellegrini, M., SpeedHap: an accurate heuristic for the single individual SNP haplotyping problem with many gaps, high reading error rate and low coverage, In: IEEE/ACM Transactions on Computational Biology and Bioinformatics, 5, 492-502 (2008)
[27] Geraci, F., A comparison of several algorithms for the single individual SNP haplotyping reconstruction problem, Bioinformatics, 26, 2217-2225 (2010)
[28] Guo, S. H.; Deng, E. Z.; Xu, L. Q.; Ding, H.; Lin, H.; Chen, W.; Chou, K. C., iNuc-PseKNC: a sequence-based predictor for predicting nucleosome positioning in genomes with pseudo k-tuple nucleotide composition, Bioinformatics (2014), (btu083)
[29] Halperin, E.; Eskin, E., Haplotype reconstruction from genotype data using imperfect phylogeny, Bioinformatics, 20, 1842-1849 (2004)
[30] Han, J.; Pei, J.; Yin, Y., Mining frequent patterns without candidate generation, ACM SIGMOD Record, vol. 29, 1-12 (2000), ACM
[31] He, D.; Choi, A.; Pipatsrisawat, K.; Darwiche, A.; Eskin, E., Optimal algorithms for haplotype assembly from whole-genome sequence data, Bioinformatics, 26, 183-190 (2010)
[32] Hu, T.; Liu, C.; Tang, Y.; Sun, J.; Xiong, H.; Sung, S., High-dimensional clustering: a clique-based hypergraph partitioning framework, Knowledge Inf. Syst., 1-28 (2013)
[33] Karypis, G.; Aggarwal, R.; Kumar, V.; Shekhar, S., Multilevel hypergraph partitioning: applications in VLSI domain, IEEE Trans. Very Large Scale Integr. VLSI Syst., 7, 69-79 (1999)
[34] Kim, S. J.; Ha, J. W.; Zhang, B. T., Constructing higher-order miRNA-mRNA interaction networks in prostate cancer via hypergraph-based learning, BMC Syst. Biol., 7, 47 (2013)
[35] Klamt, S.; Haus, U. U.; Theis, F., Hypergraphs and cellular networks, PLoS Comput. Biol., 5, e1000385 (2009)
[36] Korn, F.; Muthukrishnan, S., Influence sets based on reverse nearest neighbor queries, SIGMOD Record, 29, 201-212 (2000), ACM
[37] Koyutürk, M.; Aykanat, C., Iterative-improvement-based declustering heuristics for multi-disk databases, Inf. Syst., 30, 47-70 (2005)
[38] Kurochkina, N.; Choekyi, T., Helix-helix interfaces and ligand binding, J. Theor. Biol., 283, 92-102 (2011)
[39] Lancia, G.; Bafna, V.; Istrail, S.; Lippert, R.; Schwartz, R., SNPs problems, complexity, and algorithms, (Heide, F., Algorithms—ESA 2001, vol. 2161 (2001), Springer: Springer Berlin Heidelberg), 182-193 · Zbl 1016.92023
[40] Levy, S.; Sutton, G.; Ng, P. C.; Feuk, L.; Halpern, A. L.; Walenz, B. P.; Axelrod, N.; Huang, J.; Kirkness, E. F.; Denisov, G.; Lin, Y.; MacDonald, J. R.; Pang, A. W.C.; Shago, M.; Stockwell, T. B.; Tsiamouri, A.; Bafna, V.; Bansal, V.; Kravitz, S. A.; Busam, D. A.; Beeson, K. Y.; McIntosh, T. C.; Remington, K. A.; Abril, J. F.; Gill, J.; Borman, J.; Rogers, Y. H.; Frazier, M. E.; Scherer, S. W.; Strausberg, R. L.; Venter, J. C., The diploid genome sequence of an individual human, PLoS Biol, 5, 254 (2007)
[41] Lin, S. X.; Lapointe, J., J. Biomed. Sci. Eng., Theoretical and experimental biology in one—a symposium in honour of Professor Kuo-Chen Chou’s 50th anniversary and Professor Richard Giegé’s 40th anniversary of their scientific careers, 6, 435-442 (2013)
[42] Lippert, R.; Schwartz, R.; Lancia, G.; Istrail, S., Algorithmic strategies for the single nucleotide polymorphism haplotype assembly problem, Brief. Bioinform., 3, 23-31 (2002)
[43] Liu, B.; Zhang, D.; Xu, R.; Xu, J.; Wang, X.; Chen, Q.; Dong, Q.; Chou, K. C., Combining evolutionary information extracted from frequency profiles with sequence-based kernels for protein remote homology detection, Bioinformatics, 30, 472-479 (2014)
[44] Mei, S., Predicting plant protein subcellular multi-localization by Chou’s PseAAC formulation based multi-label homolog knowledge transfer learning, J. Theor. Biol., 310, 80-87 (2012) · Zbl 1337.92065
[45] Min, J.L., Xiao, X., Chou, K.C., 2013. iEzy-Drug: a web server for identifying the interaction between enzymes and drugs in cellular networking. Biomed. Res. Int..; Min, J.L., Xiao, X., Chou, K.C., 2013. iEzy-Drug: a web server for identifying the interaction between enzymes and drugs in cellular networking. Biomed. Res. Int..
[46] Panconesi, A.; Sozio, M., (Jonassen, I.; Kim, J., Fast Hare: A Fast Heuristic for Single Individual SNP Haplotype Reconstruction Algorithms in Bioinformatics, vol. 3240 (2004), Springer: Springer Berlin/Heidelberg), 266-277
[47] Qiu, W. R.; Xiao, X.; Chou, K. C., iRSpot-TNCPseAAC: identify recombination spots with trinucleotide composition and pseudo amino acid components, Int. J. Mol. Sci., 15, 1746-1766 (2014)
[48] Seref, O.; Brooks, J. P.; Fong, S. S., Decomposition of flux distributions into metabolic pathways, IEEE/ACM Trans. Comput. Biol. Bioinf., 10, 984-993 (2013)
[49] Stephens, J. C.; Schneider, J. A.; Tanguay, D. A.; Choi, J.; Acharya, T.; Stanley, S. E.; Jiang, R.; Messer, C. J.; Chew, A.; Han, J. H., Haplotype variation and linkage disequilibrium in 313 human genes, Science, 293, 489-493 (2001)
[50] Stephens, M.; Smith, N. J.; Donnelly, P., A new statistical method for haplotype reconstruction from population data, Am. J. Hum. Genet., 68, 978-989 (2001)
[51] Tian, Z.; Hwang, T.; Kuang, R., A hypergraph-based learning algorithm for classifying gene expression and arrayCGH data with prior knowledge, Bioinformatics, 25, 2831-2838 (2009)
[52] Venter, J. C.; Adams, M. D.; Myers, E. W.; Li, P. W.; Mural, R. J.; Sutton, G. G.; Smith, H. O.; Yandell, M.; Evans, C. A.; Holt, R. A., The sequence of the human genome, Science, 291, 1304-1351 (2001)
[53] Wang, L.; Xu, Y., Haplotype inference by maximum parsimony, Bioinformatics, 19, 1773-1780 (2003)
[54] Wang, R. S.; Wu, L. Y.; Li, Z. P.; Zhang, X. S., Haplotype reconstruction from SNP fragments by minimum error correction, Bioinformatics, 21, 2456-2462 (2005)
[55] Wang, Y.; Feng, E.; Wang, R., A clustering algorithm based on two distance functions for MEC model, Comput. Biol. Chem., 31, 148-150 (2007) · Zbl 1124.92019
[56] Wjst, M., Target SNP selection in complex disease association studies, BMC Bioinf., 5, 92 (2004)
[57] Xiao, X.; Min, J. L.; Wang, P.; Chou, K. C., iCDI-PseFpt: identify the channel-drug interaction in cellular networking with PseAAC and molecular fingerprints, J. Theor. Biol., 337, 71-79 (2013) · Zbl 1411.92115
[58] Xu, Y.; Ding, J.; Wu, L. Y.; Chou, K. C., iSNO-PseAAC: predict cysteine \(S\)-nitrosylation sites in proteins by incorporating position specific amino acid propensity into pseudo amino acid composition, PLoS One, 8 (2013)
[59] Xu, Y.; Shao, X. J.; Wu, L. Y.; Deng, N. Y.; Chou, K. C., iSNO-AAPair: incorporating amino acid pairwise coupling into PseAAC for predicting cysteine S-nitrosylation sites in proteins, PeerJ, 1, e171 (2013), (e171)
[60] Zhao, Y. Y.; Wu, L. Y.; Zhang, J. H.; Wang, R. S.; Zhang, X. S., Haplotype assembly from aligned weighted SNP fragments, Comput. Biol. Chem., 29, 281-287 (2005) · Zbl 1102.92031
[61] Zhou, G. P., The disposition of the LZCC protein residues in wenxiang diagram provides new insights into the protein-protein interaction mechanism, J. Theor. Biol., 284, 142-148 (2011) · Zbl 1397.92245
[62] Zhou, G. P., The structural determinations of the leucine zipper coiled-coil domains of the cGMP-dependent protein kinase I alpha and its interaction with the myosin binding subunit of the myosin light chains phosphase, Protein Pept. Lett., 18, 966-978 (2011)
[63] Zhou, G. P.; Deng, M. H., An extension of chou graphic rules for deriving enzyme kinetic-equations to systems involving parallel reaction pathways, Biochem. J., 222, 169-176 (1984)
[64] Zhou, G. P.; Huang, R. B., The pH-triggered conversion of the PrPc to PrPsc, Curr. Top. Med. Chem., 13, 1152-1163 (2013)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.