
A novel hybrid CNN and BiGRU-attention based deep learning model for protein function prediction. (English) Zbl 1530.92066


MSC:

92C40 Biochemistry, molecular biology
92D20 Protein sequences, DNA sequences
68T07 Artificial neural networks and deep learning

References:

[1] Asgari, E. and Mofrad, M.R.K. (2015). ProtVec: a continuous distributed representation of biological sequences for proteomics and genomics. PLoS One 10: e0141287. doi:10.1371/journal.pone.0141287. · doi:10.1371/journal.pone.0141287
[2] Ashburner, M., Ball, C., Blake, J., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., et al. (2000). Gene ontology: tool for the unification of biology. The gene ontology consortium. Nat. Genet. 25: 25-29. doi:10.1038/75556. · doi:10.1038/75556
[3] Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
[4] Barrell, D., Dimmer, E., Huntley, R.P., Binns, D., O’Donovan, C., and Apweiler, R. (2009). The GOA database in 2009-an integrated gene ontology annotation resource. Nucleic Acids Res. 37: D396-D403. doi:10.1093/nar/gkn803. · doi:10.1093/nar/gkn803
[5] Cai, Y., Wang, J., and Deng, L. (2020). SDN2GO: an integrated deep learning model for protein function prediction. Front. Bioeng. Biotechnol. 8: 391, doi:10.3389/fbioe.2020.00391. · doi:10.3389/fbioe.2020.00391
[6] Cao, R., Freitas, C., Chan, L., Sun, M., Jiang, H., and Chen, Z. (2017). ProLanGO: protein function prediction using neural machine translation based on a recurrent neural network. Molecules 22: 1732. doi:10.3390/molecules22101732. · doi:10.3390/molecules22101732
[7] Chen, H., Sun, M., Tu, C., Lin, Y., and Liu, Z. (2016). Neural sentiment classification with user and product attention. In: Proceedings of the 2016 conference on empirical methods in natural language processing, pp. 1650-1659.
[8] Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: Proceedings of the 2014 conference on empirical methods in natural language processing. Association for Computational Linguistics.
[9] Choi, K., Lee, Y., Kim, C., Yoon, M. (2021). An effective GCN-based hierarchical multilabel classification for protein function prediction. arXiv:2112.02810.
[10] Clark, W.T. and Radivojac, P. (2011a). Analysis of protein function and its prediction from amino acid sequence. Proteins Struct. Funct. Bioinf. 79: 2086-2096. doi:10.1002/prot.23029. · doi:10.1002/prot.23029
[11] Clark, W.T. and Radivojac, P. (2011b). Analysis of protein function and its prediction from amino acid sequence. Proteins Struct. Funct. Bioinf. 79: 2086-2096. doi:10.1002/prot.23029. · doi:10.1002/prot.23029
[12] UniProt Consortium (2015). UniProt: a hub for protein information. Nucleic Acids Res. 43: D204-D212. doi:10.1093/nar/gku989. · doi:10.1093/nar/gku989
[13] Dutta, P. and Saha, S. (2017). Fusion of expression values and protein interaction information using multi-objective optimization for improving gene clustering. Comput. Biol. Med. 89: 31-43. doi:10.1016/j.compbiomed.2017.07.015. · doi:10.1016/j.compbiomed.2017.07.015
[14] Dutta, P. and Saha, S. (2020). Amalgamation of protein sequence, structure and textual information for improving protein-protein interaction identification. In: Proceedings of the 58th annual meeting of the association for computational linguistics, pp. 6396-6407.
[15] Elnaggar, A., Heinzinger, M., Dallago, C., Rehawi, G., Wang, Y., Jones, L., Gibbs, T., Feher, T., Angerer, C., Steinegger, M., et al. (2021). ProtTrans: towards cracking the language of life’s code through self-supervised deep learning and high performance computing. IEEE Trans. Pattern Anal. Mach. Intell. 14: 1.
[16] Elsayed, N., Maida, A.S., and Bayoumi, M. (2019). Deep gated recurrent and convolutional network hybrid model for univariate time series classification. Int. J. Adv. Comput. Sci. Appl. 10: 654-664. doi:10.14569/ijacsa.2019.0100582. · doi:10.14569/ijacsa.2019.0100582
[17] Forslund, K. and Sonnhammer, E.L. (2008). Predicting protein function from domain content. Bioinformatics 24: 1681-1687. doi:10.1093/bioinformatics/btn447. · doi:10.1093/bioinformatics/btn447
[18] Giri, S.J., Dutta, P., Halan, P., and Saha, S. (2020). MultiPredGO: deep multi-modal protein function prediction by amalgamating protein structure, sequence, and interaction information. IEEE J. Biomed. Health Inform. 25: 1832-1838.
[19] Heinzinger, M., Elnaggar, A., Wang, Y., Dallago, C., Nechaev, D., Matthes, F., and Rost, B. (2019). Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinform. 20: 723. doi:10.1186/s12859-019-3220-8. · doi:10.1186/s12859-019-3220-8
[20] Hunter, S., Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Binns, D., Bork, P., Das, U., Daugherty, L., Duquenne, L., et al. (2009). InterPro: the integrative protein signature database. Nucleic Acids Res. 37: D211-D215. doi:10.1093/nar/gkn785. · doi:10.1093/nar/gkn785
[21] Jiang, Y., Oron, T.R., Clark, W.T., Bankapur, A.R., D’Andrea, D., Lepore, R., Funk, C.S., Kahanda, I., Verspoor, K.M., Ben-Hur, A., et al. (2016). An expanded evaluation of protein function prediction methods shows an improvement in accuracy. Genome Biol. 17: 184. doi:10.1186/s13059-016-1037-6. · doi:10.1186/s13059-016-1037-6
[22] Jinbao, T., Weiwei, K., Qiaoxin, T., and Zhaoqian, W. (2021). Text classification method based on LSTM-attention and CNN hybrid model. Comput. Eng. Appl. 57: 154-162.
[23] Kabir, A. and Shehu, A. (2022). Transformer neural networks attending to both sequence and structure for protein prediction tasks, arXiv:2206.11057.
[24] Kabir, A. and Shehu, A. (2022). GOProFormer: a multi-modal transformer method for gene ontology protein function prediction.
[25] Kingma, D.P. and Ba, J. (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[26] Kuang, S., Li, J., Branco, A., Luo, W.-H., and Xiong, D. (2018). Attention focusing for neural machine translation by bridging source and target embeddings. In: Proceedings of the 56th annual meeting of the association for computational linguistics, Vol. 1, Long Papers, pp. 1767-1776.
[27] Kulmanov, M. and Hoehndorf, R. (2020). DeepGOPlus: improved protein function prediction from sequence. Bioinformatics 36: 422-429. doi:10.1093/bioinformatics/btz595. · doi:10.1093/bioinformatics/btz595
[28] Kulmanov, M., Khan, M.A., and Hoehndorf, R. (2018). DeepGO: predicting protein functions from sequence and interactions using a deep ontology-aware classifier. Bioinformatics 34: 660-668. doi:10.1093/bioinformatics/btx624. · doi:10.1093/bioinformatics/btx624
[29] Le, N.Q.K., Yapp, E.K.Y., and Yeh, H.Y. (2019). ET-GRU: using multi-layer gated recurrent units to identify electron transport proteins. BMC Bioinf. 20: 377. doi:10.1186/s12859-019-2972-5. · doi:10.1186/s12859-019-2972-5
[30] LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., and Jackel, L.D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Comput. 1: 541-551. doi:10.1162/neco.1989.1.4.541. · doi:10.1162/neco.1989.1.4.541
[31] Li, Y., Wang, X., and Xu, P. (2018). Chinese text classification model based on deep learning. Future Internet 10: 113, doi:10.3390/fi10110113. · doi:10.3390/fi10110113
[32] Li, J., Wang, L., Zhang, X., Liu, B., and Wang, Y. (2020). GONET: a deep network to annotate proteins via recurrent convolution networks. In: 2020 IEEE international conference on bioinformatics and biomedicine (BIBM). IEEE, pp. 29-34.
[33] Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., dos Santos Costa, A., Fazel-Zarandi, M., Sercu, T., Candido, S., et al. (2022). Language models of protein sequences at the scale of evolution enable accurate structure prediction. bioRxiv 379: 1123-1130, doi:10.1101/2022.07.20.500902. · doi:10.1101/2022.07.20.500902
[34] Marquet, C., Heinzinger, M., Olenyi, T., Dallago, C., Erckert, K., Bernhofer, M., Nechaev, D., and Rost, B. (2022). Embeddings from protein language models predict conservation and variant effects. Hum. Genet. 141: 1629-1647.
[35] Nambiar, A., Heflin, M., Liu, S., Maslov, S., Hopkins, M., and Ritz, A. (2020). Transforming the language of life: transformer neural networks for protein prediction tasks. In: Proceedings of the international conference on bioinformatics, computational biology, and health informatics (BCB). ACM, pp. 1-8.
[36] Pearson, W.R. (2013). An introduction to sequence similarity (“homology”) searching. Curr. Protoc. Bioinformatics 42: 3-1. doi:10.1002/0471250953.bi0301s42. · doi:10.1002/0471250953.bi0301s42
[37] Piovesan, D., Giollo, M., Leonardi, E., Ferrari, C., and Tosatto, S.C. (2015). INGA: protein function prediction combining interaction networks, domain assignments and sequence similarity. Nucleic Acids Res. 43: W134-W140. doi:10.1093/nar/gkv523. · doi:10.1093/nar/gkv523
[38] Ranjan, A., Fahad, M.S., Fernandez-Baca, D., Deepak, A., and Tripathi, S. (2019). Deep robust framework for protein function prediction using variable-length protein sequences. IEEE ACM Trans. Comput. Biol. Bioinf. 17: 1648-1659, doi:10.1109/tcbb.2019.2911609. · doi:10.1109/tcbb.2019.2911609
[39] Ranjan, A., Fernandez-Baca, D., Tripathi, S., and Deepak, A. (2021a). An ensemble Tf-Idf based approach to protein function prediction via sequence segmentation. IEEE ACM Trans. Comput. Biol. Bioinf. 19: 2685-2696. doi:10.1109/TCBB.2021.3093060. · doi:10.1109/TCBB.2021.3093060
[40] Ranjan, A., Tiwari, A., and Deepak, A. (2021b). A sub-sequence based approach to protein function prediction via multi-attention based multi-aspect network. IEEE ACM Trans. Comput. Biol. Bioinf. 20: 94-105. doi:10.1109/TCBB.2021.3130923. · doi:10.1109/TCBB.2021.3130923
[41] Ranjan, A., Fahad, M.S., Fernandez-Baca, D., Tripathi, S., and Deepak, A. (2022). MCWS-transformers: towards an efficient modeling of protein sequences via multi context-window based scaled self-attention. IEEE ACM Trans. Comput. Biol. Bioinf. 20: 1188-1199, doi:10.1109/TCBB.2022.3173789. · doi:10.1109/TCBB.2022.3173789
[42] Rives, A., Meier, J., Sercu, T., Goyal, S., Lin, Z., Liu, J., Guo, D., Ott, M., Zitnick, C.L., Ma, J., et al. (2021). Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl. Acad. Sci. U. S. A. 118: e2016239118. doi:10.1073/pnas.2016239118. · doi:10.1073/pnas.2016239118
[43] Roy, A., Yang, J., and Zhang, Y. (2012). COFACTOR: an accurate comparative algorithm for structure-based protein function annotation. Nucleic Acids Res. 40: W471-W477, doi:10.1093/nar/gks372. · doi:10.1093/nar/gks372
[44] Sharan, R., Ulitsky, I., and Shamir, R. (2007). Network-based prediction of protein function. Mol. Syst. Biol. 3: 88-100, doi:10.1038/msb4100129. · doi:10.1038/msb4100129
[45] Stark, H., Dallago, C., Heinzinger, M., and Rost, B. (2021). Light attention predicts protein location from the language of life. Bioinform. Adv. 1: vbab035. doi:10.1093/bioadv/vbab035. · doi:10.1093/bioadv/vbab035
[46] Strodthoff, N., Wagner, P., Wenzel, M., and Samek, W. (2020). UDSMProt: universal deep sequence models for protein classification. Bioinformatics 36: 2401-2409. doi:10.1093/bioinformatics/btaa003. · doi:10.1093/bioinformatics/btaa003
[47] Szklarczyk, D., Franceschini, A., Wyder, S., Forslund, K., Heller, D., Huerta-Cepas, J., Simonovic, M., Roth, A., Santos, A., Tsafou, K.P., et al. (2015). STRING v10: protein-protein interaction networks, integrated over the tree of life. Nucleic Acids Res. 43: D447-D452. doi:10.1093/nar/gku1003. · doi:10.1093/nar/gku1003
[48] Wang, H., Yan, L., Huang, H., and Ding, C. (2016). From protein sequence to protein function via multi-label linear discriminant analysis. IEEE ACM Trans. Comput. Biol. Bioinf. 14: 503-513. doi:10.1109/tcbb.2016.2591529. · doi:10.1109/tcbb.2016.2591529
[49] Wang, H., Yan, L., Huang, H., and Ding, C. (2017). From protein sequence to protein function via multi-label linear discriminant analysis. IEEE ACM Trans. Comput. Biol. Bioinf. 14: 503-513, doi:10.1109/tcbb.2016.2591529. · doi:10.1109/tcbb.2016.2591529
[50] Yang, J., Yan, R., Roy, A., Xu, D., Poisson, J., and Zhang, Y. (2015). The I-TASSER Suite: protein structure and function prediction. Nat. Methods 12: 7. doi:10.1038/nmeth.3213. · doi:10.1038/nmeth.3213
[51] Yang, Z., Yang, D., Dyer, C., He, X., Smola, A.J., and Hovy, E.H. (2016). Hierarchical attention networks for document classification. In: Proc. HLT-NAACL, pp. 1480-1489.
[52] Yang, L., Wei, P., Zhong, C., Li, X., and Tang, Y. Y. (2020). Protein structure prediction based on BN-GRU method. Int. J. Wavelets Multiresolut. Inf. Process. 18: 2050045, doi:10.1142/s0219691320500459. · Zbl 1538.92034 · doi:10.1142/s0219691320500459
[53] You, R., Yao, S., Xiong, Y., Huang, X., Sun, F., Mamitsuka, H., and Zhu, S. (2019). NetGO: improving large-scale protein function prediction with massive network information. Nucleic Acids Res. 47: W379-W387. doi:10.1093/nar/gkz388. · doi:10.1093/nar/gkz388
[54] Zhang, Y., Yuan, H., Wang, J., and Zhang, X. (2017). Using a CNN-LSTM model for sentiment intensity prediction. In: Proceedings of the 8th workshop on computational approaches to subjectivity, sentiment and social media analysis. Association for Computational Linguistics, pp. 200-204.
[55] Zhang, C., Zheng, W., Freddolino, P.L., and Zhang, Y. (2018). MetaGO: predicting gene ontology of non-homologous proteins through low-resolution protein structure prediction and protein-protein network mapping. J. Mol. Biol. 430: 2256-2265. doi:10.1016/j.jmb.2018.03.004. · doi:10.1016/j.jmb.2018.03.004