account of in silico identification tools of secreted effector proteins in bacteria and future challenges | Briefings in Bioinformatics

Abstract

Bacterial pathogens secrete numerous effector proteins via six secretion systems, type I to type VI secretion systems, to adapt to new environments or to promote virulence by bacterium–host interactions. Many computational approaches have been used to identify effector proteins before subsequent experimental verification because they avoid laborious biological procedures and are genome scale, automated and highly efficient. Prevalent examples include machine learning methods and statistical techniques. In this article, we summarize the computational progress toward predicting secreted effector proteins in bacteria, opening with an introduction to the features that are used to discriminate effectors from non-effectors. The mechanisms, contributions and deficiencies of previously developed detection tools are presented, and the tools are further benchmarked on a curated testing data set. According to the benchmarking results, potential improvements of the prediction performance are discussed, which include (1) more informative features for discriminating effectors from non-effectors; (2) the construction of comprehensive training data sets for the machine learning algorithms; (3) the advancement of reliable prediction methods and (4) a better interpretation of the mechanisms behind the molecular processes. The future of in silico identification of bacterial secreted effectors includes both opportunities and challenges.

in silico identification, machine learning algorithms, characteristic features, benchmarking, bacterial secreted effectors

Introduction

Bacteria have adopted diverse secretion systems to translocate numerous effector proteins into the extracellular environment to interact with and defeat the host cells [1].

The investigation of secretion systems has drawn substantial attention to their phylogenetic distribution, gene content, organization and evolution. However, beyond the identification of secretion systems, many challenging questions remain to be answered before the pathogenesis of bacterial pathogens is understood. Examples include the mechanisms by which secretion systems recognize and target effectors. Specifically, the identification of secreted effector proteins, the arch-criminal of bacterial virulence, and the interpretation of their pathogenesis mechanisms are the critical pioneering steps toward understanding molecular bacterium-host interactions.

In this section, we briefly introduce the background of secretion systems and effectors and their identification. The ‘Features for identifying secreted effectors’ section discusses the features of effector proteins that can be used for their identification, and in silico prediction methods of bacterial effectors are elaborated in the ‘In silico identification methods’ section. Available online resources are collected as tables in the ‘Available online resources’ section.

Secretion systems and effectors

Gram-positive bacteria produce a single cytoplasmic membrane, followed by a thick cell wall layer, while gram-negative bacteria produce a double-membrane system, with the cytoplasmic membrane as the inner membrane and an additional outer membrane.

We define ‘secretion’ as the extracellular release of proteins via secretion systems within this article, although secreted proteins can also be cell-surface localized or be part of cell-surface appendages, as reported in [2]. The secreted proteins of gram-negative bacteria that are released extracellularly are also referred to as bacterial ‘effectors’.

Several secretion systems are responsible for the translocation of proteins across the bacterial cytoplasmic membrane, such as the Sec (general secretion pathway), SRP (signal-recognition particle), Tat (twin-arginine translocation), FEA (flagella export apparatus) and holin (hole forming) pathways [2, 3]. In addition, six further types of secretion systems are known in gram-negative bacteria and act as protein translocation systems across the bacterial outer membrane [2, 4]. They follow the prevalent nomenclature from type I secretion systems to type VI secretion systems (T1SS–T6SS). There is controversy about how these systems recognize and target effector molecules, namely, how specific effectors are distinguished from all other proteins that are not secreted via a given secretion system. For further details on these secretion systems, multiple excellent reviews are recommended [1, 5–8].

To translocate/secrete proteins extracellularly, gram-negative bacteria principally use two strategies. In the first, cytoplasmic membrane protein translocation systems, i.e. the ubiquitous Sec, Tat and others, translocate proteins to the periplasm of gram-negative bacterial cells. Subsequently, the proteins are engaged in an additional secretion pathway to be secreted across the outer membrane. The additional secretion pathways involved in this two-step translocation are referred to as the Sec-dependent secretion systems, namely T2SS and T5SS [2, 9, 10]. The proteins secreted via Sec-dependent secretion systems harbor an N-terminal signal that is first recognized by the Sec machinery and subsequently removed by the peptidase, with the remainder released into the periplasm and translocated across the outer membrane with the assistance of accessory proteins. Additionally, the Sec-dependent signal contains no specific sequence motif for secretion but a pattern of charged residues and a hydrophobic domain that enable the recognition of secreted substrates [11]. Alternatively, multiple secretion systems can secrete proteins directly across both membranes to the exterior of the bacterial cell in a single step, bypassing the periplasmic compartment. These one-step secretion systems are classified as the Sec-independent T1SS, T3SS, T4SS and T6SS, owing to their independence of the Sec system and the absence of such a signal [2, 10].

Specifically, T1SS contain only three proteins: an outer membrane protein and two cytoplasmic membrane proteins, namely an ATP-binding cassette (ABC) protein and a membrane fusion protein. The secretion signal is usually located at the C-terminal end of the proteins secreted via T1SS and interacts specifically with the ABC protein. The initial interaction between the C-terminal secretion signal and the ABC protein triggers the sequential assembly of the secretion complex by generating interactions between the three component proteins [9]. The effectors secreted via T1SS are translocated into the extracellular milieu, whereas the three remaining Sec-independent secretion systems all make direct contact, of different types, with the membranes of the recipient cell and are dedicated to translocating effectors that influence the physiology of host cells [1]; these three systems are therefore the main focus of our attention.

T3SS are composed of approximately 20 proteins, most of which are located in the inner membrane [8]. Furthermore, T3SS in gram-negative bacteria construct a ‘bridge’ connecting the pathogen and host cells [12, 13]. This needle-like structure of the secretion machinery spans the inner and outer bacterial membranes and supports the injection of protein effectors directly into the cytoplasm of the host cell [12, 14, 15]. Two N-terminal domains of T3SS effectors are reported to be essential for secretion, although their exact boundaries are not completely agreed upon [8, 13, 16–19]. One proposal suggests that residues 1–25 contain a region specialized as a secretion signal that targets the effectors to the secretion apparatus, and residues 25–100 correspond to a chaperone-binding domain. The translocation of many effectors depends on the chaperone-binding domains, where the chaperones are considered to stabilize the effectors and maintain them in an unfolded state before secretion [16, 20, 21].

T4SS are specialized protein complexes that are used by many bacterial pathogens to deliver type IV effector proteins directly into the host cells. In contrast to the above two types of Sec-independent secretion systems, T4SS are unique among gram-negative bacterial secretion systems owing to their ability to transfer effector proteins, DNA and nucleoprotein complexes [22–27]. Genetic studies have discovered mutants/subtypes of type IV secretion systems [27–29], which divide the T4SS in gram-negative bacteria into two main subtypes called IVA and IVB [1, 23, 24, 30]. The type IVA systems are composed of subunits homologous to those of the Agrobacterium tumefaciens VirB/VirD4 system [31]. The type IVB systems are assembled from subunits homologous to those of the Legionella pneumophila Dot/Icm system [32]. Moreover, the effector proteins secreted via T4SS harbor a C-terminal signal sequence, which is useful for their translocation and identification [29, 33–38].

T6SS are an emerging type of secretion system that is widespread in the bacterial world [4]. The minimal T6SS apparatus consists of 13 proteins [39, 40]. The main component is a ‘Hemolysin co-regulated protein’ (Hcp) inner tube, with a tip complex composed of ‘valine-glycine repeat protein G’ (VgrG) and ‘proline-alanine-alanine-arginine’ (PAAR) proteins, which pierces the membrane of the host cell in a spike-like fashion. The presence of specific VgrG and PAAR proteins is crucial for the T6SS to deliver effectors directly into the target cell [41, 42]. The target of T6SS can be either eukaryotic cells or rival bacteria [43]. Antieukaryotic and antibacterial effectors promote the survival of pathogens such as Pseudomonas aeruginosa, Vibrio cholerae, Burkholderia species, Serratia marcescens and Acinetobacter baumannii in new environments. Antibacterial T6SS effectors are paired with specific cognate immunity proteins encoded downstream of the effector gene, which protect the secreting cell from self-intoxication/self-killing by its own effectors [44–54]. No clear signal sequence of T6SS effectors is commonly accepted. Recently, Salomon et al. [55] reported an N-terminal motif, MIX (marker for type VI effectors), which helps to identify novel T6SS effectors.

The main focus of this article is the progress of in silico identification of bacterial effector proteins that are secreted via the Sec-independent pathways T3SS, T4SS and T6SS. These three sophisticated types of secretion systems inject virulent effectors directly into the recipient cell, which is particularly important for the virulence of gram-negative bacteria and, in selected cases, is associated with human disease, such as the human innate immune/inflammatory response and typhoid induced by infection with Salmonella [56, 57], respiratory tract infections caused by Bordetella [58], cat-scratch disease and trench fever caused by Bartonella [59] and sexually transmitted diseases caused by Chlamydia [60].

The identification of effector proteins

The discovery of secretion systems is supported by the comparative analysis of coding genes, assuming the observation of homologous components in one secretion system is a good indicator of the appearance of a corresponding secretion system [3]. For example, T346Hunter is a computational tool for the prediction of T3SS, T4SS and T6SS based on the comparative similarity of the three secretion systems [61]. Particularly, a hidden Markov model (HMM) is used to construct the protein profiles of the core components of these three secretion systems, and new sequences are predicted to harbor the secretion system based on the enriched similarity to the constructed profiles.

Compared with the high conservation of secretion apparatus across bacterial pathogens, effector proteins evolve quickly to facilitate their adaptation to different host environments and participate in diverse molecular processes, such as inhibiting the host inflammatory response, killing immune cells, interfering with cellular activities and facilitating bacterial entry [30, 62–65].

The identification of secreted effector proteins is partially supported by biological experiments, namely, functional screening with a reporter gene [19, 66–73]. For example, Salmonella uses the T3SS to inject virulence effector proteins into host cells [74]. A calmodulin-dependent adenylate cyclase can serve as a reporter for this injection: its activity is entirely dependent on host cell calmodulin, and because it produces cyclic AMP (cAMP), translocation of the adenylate cyclase domain into host cells can be detected by assaying cAMP levels. Based on these considerations, Geddes et al. generated translational fusions between Salmonella chromosomal genes and a fragment of the calmodulin-dependent adenylate cyclase gene and monitored the secretion of the fusion proteins; that is, they identified secreted effector proteins by measuring the levels of cAMP in the infected host cells [74]. AvrRpt2 is another good in vivo reporter for type III secretion in Pseudomonas syringae [62, 75]. The secreted effectors can be detected by a hypersensitive response in Arabidopsis thaliana caused by the functional domain of AvrRpt2, as AvrRpt2 is fused to the candidate proteins and secreted into host cells along with unknown effectors.

The experimental techniques exemplified by reporter fusions are accurate; however, they are quite time-consuming and laborious. Furthermore, this type of identification method is limited by both a priori knowledge about biological mechanisms and the sophisticated construction of molecular experiments. Additionally, most effectors are scattered throughout the genome rather than clustered in a narrow genomic region [66]. In contrast, computational identification of effector proteins demands less biological background from the researcher, and the genome-scale, automated detection of effectors greatly accelerates the process with respect to time and efficiency.

In fact, as sequencing techniques have improved in the past decade, the number of complete bacterial genomes available for genome-scale analysis has increased. In an era with such a large amount of data, we should take advantage of bioinformatics tools to efficiently detect the potential targets. Instead of a pure ‘wet’ experiment, the combination of both ‘wet’ and ‘dry’ is becoming prevalent and effective. Bioinformatics methods successfully limit the number of candidates for subsequent experimental validation by selecting putative ones based on plausible predictions, providing a reasonable starting point for experimental investigation, and promoting both the efficiency and reliability of the detection of novel secreted effectors [11, 37, 66, 69–71, 76–78]. In addition, the applications of bioinformatics techniques to many other molecular domains have achieved good success, such as the classical problem of predicting the secondary and tertiary structures of proteins [79, 80], which have proved the feasibility and significance of bioinformatics methods. Moreover, the characteristic information of a considerable number of validated bacterial secreted effectors has been observed, such as the preference of particular amino acids or certain structural motifs. These types of features can be grasped as ‘rules’ by in silico approaches and can further assist the efficient, automated, genome-scale mining of novel effectors in the huge wave of genomic data.

Major efforts have been devoted to identifying secreted effectors in silico, including multiple genome-scale studies [69, 70, 76, 81]. Two principal strategies have been adopted: machine learning approaches, which predict new effectors based on the extracted features of known effectors [11, 77, 82–87], and statistical methods, which calculate the probability of a protein being an effector [78, 88, 89].

The progress of computational identification of effectors secreted by fungi and nematodes has been reviewed [90]. For the bacterial secreted effectors, Greenberg et al. have presented the computational identification of T3SS effectors in P. syringae [91]. Segal et al. have addressed the bioinformatics approaches for the identification of T4SS effectors in L. pneumophila, including the homology-based, promoter-based and signal-based methods [37]. McDermott et al. have described four machine learning-based prediction methods for type III and IV effectors. Specifically, based on three prime elements of machine learning algorithms, as introduced above, McDermott et al. compared the components of the methods, the features of the considered effectors, the training and testing data and the prediction performance on an independent set of validated effectors known before the development of these methods [92]. Recently, An et al. have discussed the current computational studies for predicting effector proteins secreted via T3SS, T4SS and T6SS. Their article mainly evaluates the algorithms, feature selection and software utilities based on curated testing data sets, which are quite informative and useful for the future development of prediction methods [93]. In detail, An et al. construct three positive data sets, which consist of extracted T3SS, T4SS and T6SS effectors, respectively. However, when a certain group of secreted effectors is considered, the negative data set is selected from the other two positive data sets. Although it is important to distinguish one type of effector from those secreted via other secretion systems, the ability to identify particular groups of secreted effectors against the genomic background deserves more attention. Ignoring the ‘noise’ of true non-effectors may make the reported prediction performance less convincing biologically.

Moreover, An et al.’s work summarizes the feature selection approaches but pays less attention to the informative features of secreted effectors used for their identification. In fact, plenty of features, represented by the amino acid composition (AAC), sequential motifs and structural and physicochemical information, have been used to discriminate effectors from other proteins. For example, the evolutionary conservation of secreted effectors is encoded as position-specific scoring matrix (PSSM) profiles, which are used in several prediction methods to efficiently identify secreted effectors [11, 38, 87].

We have been motivated to investigate the computational progress of identifying bacterial effectors secreted via T3SS, T4SS and T6SS from a distinctive aspect. We first review the informative features of secreted effectors, which are prevalently used for their identification. We then present an in-depth study of the mechanisms of prediction methods for secreted effectors, especially the machine learning approaches. Additionally, we mention the basic but classical BLAST and HMMs, which are prevalently applied to identify sequential motifs. A comparison of the reviewed methods is carried out based on our comprehensive testing data sets. After analyzing our benchmarking results, we propose potential improvements of the prediction performance.

Features for identifying secreted effectors

The purpose of bacterial effector prediction methods is to discriminate effectors from non-effectors. The common features of known effectors are used to identify new members according to the prediction ‘rules’. Hence, before addressing the topic of bioinformatics progress in predicting bacterial effectors, we first enumerate the features of effector proteins that can be, or have been, used for their identification.

Aligning candidates with known effectors for homology search would be the most straightforward way to identify effectors, where in selected cases the homology can be found by PSI-BLAST [66, 69, 71, 72, 76, 94–97]. This type of prediction method, based on traditional sequence alignment, generally gives reliable predictions, but low sequence similarity between secreted proteins leads to poor prediction performance, as the effectors evolve quickly to facilitate their adaptation to different host environments [62, 64, 65]. PSI-BLAST is short for ‘Position-Specific Iterated BLAST’, which was proposed by Altschul et al. in 1997 to identify related sequences with weak similarity [98]. As an essential component of PSI-BLAST, PSSM profiles are a representative form of sequence profile and have proved powerful in identifying bacterial effectors. PSSM profiles are used to represent the evolutionary conservation profiles of the T3SS effectors [11], their 100 N-terminal residues [87] and the evolutionary features of T4SS effectors [38]. A transformation of the PSSM, i.e. PSSM_AC, is also calculated to discriminate T4SS effectors; it represents the correlation of evolutionary conservation of the 20 residues between two positions separated by a predefined distance along the sequence [38]. Beyond the evolutionary conservation within a certain type of effector, the relationships of effectors to other genomes may help to identify new members; these relationships are referred to as the phylogenetic profiles of effectors. Typically, phylogenetic profiles are lists of significant sequence similarity found between effector proteins in a series of organisms that provide information about the distribution of effector proteins over a range of different organisms with diverse evolutionary histories [11].
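The PSSM_AC transformation described above can be sketched as an auto-covariance over the columns of a PSSM. The sketch below is only an illustration under stated assumptions: the formula (mean-centred products averaged over all position pairs at a given lag) is a commonly used form of auto-covariance and is not quoted from [38], the function name pssm_auto_covariance is hypothetical, and the toy matrix has two columns instead of the 20 of a real PSSM.

```python
def pssm_auto_covariance(pssm, max_lag):
    """Return auto-covariance features, ordered by lag and then by column.

    pssm is a list of rows (one per sequence position); each row holds one
    score per residue type. Assumed form, not taken from the cited method:
    AC(j, lag) = sum_i (P[i][j] - mean_j) * (P[i+lag][j] - mean_j) / (L - lag)
    """
    length = len(pssm)
    n_cols = len(pssm[0])
    if max_lag >= length:
        raise ValueError("max_lag must be smaller than the sequence length")
    # Mean score of each column over the whole sequence.
    means = [sum(row[j] for row in pssm) / length for j in range(n_cols)]
    features = []
    for lag in range(1, max_lag + 1):
        for j in range(n_cols):
            total = sum(
                (pssm[i][j] - means[j]) * (pssm[i + lag][j] - means[j])
                for i in range(length - lag)
            )
            features.append(total / (length - lag))
    return features


# Toy 4-position matrix with only 2 columns (hypothetical values).
toy_pssm = [[1.0, 0.0], [3.0, 0.0], [1.0, 2.0], [3.0, 2.0]]
toy_features = pssm_auto_covariance(toy_pssm, max_lag=2)
```

The result has a fixed length of max_lag times the number of columns regardless of protein length, which is what makes such a transformation convenient as classifier input.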

In addition to sequence similarity and conservation, detecting AAC biases (significant enrichment or depletion of certain amino acids) in known effectors is a straightforward and prevalent way to identify novel secreted proteins [11, 38, 77, 85]. For example, Asn, Glu and Lys are more abundant in T4SS IVB effectors, and Ala, Glu and Ser occur more frequently in IVA effectors [38]. AAC biases can also be detected in the N-termini of T3SS effectors [19, 77, 83], the C-termini of T4SS effectors [99] and the N-termini of T6SS effectors [55], as these termini are assumed to be the most informative regions for identifying secreted effectors. These particular regions of effectors are referred to as secretion signals and can help to identify the effectors that are to be secreted by the secretion apparatus [85]. Previous research shows that the first 100 amino acids in the N-terminal region of T3SS effectors may harbor the signal sequences and the chaperone-binding sequences that are required to guide the secretion of T3SS effector proteins [8, 13, 16–19]. Specifically, an in silico stepwise deletion analysis of N-terminal amino acids suggests the 6th–10th N-terminal amino acids as the important region for identifying T3SS effectors [83]. In P. syringae, the AAC biases and patterns in the N-termini of T3SS effectors have been identified as a characteristic of effector proteins and have been used for their identification [62, 76, 91, 100, 101]. Arnold et al. showed that selecting residues 0–15 gives good discriminative power between T3SS effectors and non-effectors, and that the most significant enrichment in the N-termini of T3SS effectors is Ser, while Thr and Pro are significantly enriched in the effectors of animal pathogens, and Leu is depleted in both animal and plant effector proteins [85]. This is consistent with the results indicating that the 50 N-terminal amino acids of P. syringae T3SS effectors show significant biases of high Ser and low Asp [62, 76, 83].
The C-terminal ‘E-block’ modules are used as criteria to identify T4SS effectors [96], where the E-block is a glutamate-rich region located within the 25 C-terminal amino acids of the secreted effectors and is required for efficient translocation of effectors [102]. Salomon et al. [55] reported an N-terminal motif for identifying novel T6SS effectors, the MIX, which was identified by using comparative proteomics. Typically, MIX is genetically linked to the T6SS core components but is not an essential structural element, as MIX-containing proteins are not required for the antibacterial activity. Meanwhile, MIX proteins contain cytotoxic effector domains, many of which are confirmed T6SS effectors.
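The composition-based features discussed above (AAC of a terminal region and the glutamate-rich C-terminal E-block) can be illustrated with a few lines of code. This is a minimal sketch, not the procedure of any cited tool: the function names are hypothetical, the window sizes are parameters chosen by the caller, and no threshold for calling a region ‘glutamate-rich’ is implied.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence, start=0, end=None):
    """Fraction of each of the 20 standard residues in sequence[start:end]."""
    region = sequence.upper()[start:end]
    # Ignore ambiguous or non-standard symbols such as X.
    counts = Counter(res for res in region if res in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("region contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def c_terminal_glutamate_fraction(sequence, window=25):
    """Fraction of Glu (E) among the last `window` residues."""
    tail = sequence.upper()[-window:]
    if not tail:
        raise ValueError("empty sequence")
    return tail.count("E") / len(tail)


# Example: Ser content of the first 50 residues of a made-up sequence.
example = "MSSTSPLNSSQSLSSAIKLPTSSSVGSNTLSESSAKLQE" * 2
n_terminal_ser = amino_acid_composition(example, end=50)["S"]
```

Comparing such fractions between known effectors and a background set is one simple way to expose the enrichment or depletion of particular residues; the statistical test used for that comparison is left open here.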

In addition to the AAC biases, the characteristic features of the specific sequence fragments in the signal region can be used to identify corresponding effectors. Sequence similarities to the 50–100 N-terminal residues of known effectors are used to identify new T3SS effectors [19]. Analysis of the collection of T3SS effector proteins in P. syringae revealed an export-associated pattern of equivalent solvent-exposed amino acids in the 5 N-terminal positions and amphipathicity and richness in polar amino acids in the 50 N-terminal positions [76]. A hydrophobic residue near the C-terminus is reported to be critical for translocating T4SS effectors [36], whereas positively and negatively charged residues in the C-terminal signal sequence also play an important role in the translocation of T4SS effectors [33, 69, 96]. Meanwhile, two studies supported the conclusion that there is a preference for short polar amino acids located in the 20 C-terminal residues of T4SS effectors [69, 103]. C-terminal ‘basicity’ and ‘hydrophilicity’ are also used as criteria to identify T4SS effectors [96]. In addition to positively charged residues, other physico-chemical properties are considered to discriminate effectors, such as acidic residues and polar amino acids [76] and the molecular weight and pI estimated from the amino acid sequences [104]. Moreover, as most of the known secreted proteins are injected into the cytoplasm of host cells, hydrophilic residues would be expected to be more abundant in effectors than in non-effectors. Hence, the global hydrophilicity of protein sequences is used to identify novel T4SS effectors [38, 71, 96].
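As an illustration of the physico-chemical criteria above, the following sketch computes a global hydropathy score and counts charged residues near the C-terminus. The Kyte-Doolittle scale is an assumption made here for concreteness; the cited studies may use other hydrophilicity scales. The function names and the default window of 20 residues are likewise hypothetical choices.

```python
# Kyte-Doolittle hydropathy values (assumed scale; positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(sequence):
    """Average hydropathy over the sequence; lower means more hydrophilic."""
    values = [KYTE_DOOLITTLE[r] for r in sequence.upper() if r in KYTE_DOOLITTLE]
    if not values:
        raise ValueError("sequence contains no standard amino acids")
    return sum(values) / len(values)


def c_terminal_charge(sequence, window=20):
    """Return (positive, negative) residue counts in the last `window` residues.

    Lys/Arg are counted as positive and Asp/Glu as negative; His is ignored.
    """
    tail = sequence.upper()[-window:]
    positive = sum(tail.count(r) for r in "KR")
    negative = sum(tail.count(r) for r in "DE")
    return positive, negative
```

A strongly negative mean hydropathy indicates a hydrophilic protein overall; how such a score is thresholded or combined with other features is a modelling decision that this sketch does not address.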

Structural motifs of the effectors, including coils, helices and strands, may discriminate effectors from non-effectors [77, 83, 96]. Solvent accessibility represents the property of a side chain to be exposed to or buried in the solvent and is commonly used to identify effectors [76, 77, 83, 87]. For example, in addition to the significant contribution of the 6th–10th N-terminal amino acids, the secondary structure and solvent accessibility are reported to make important contributions to the identification of type III secretion signals [83]. With enriched Ser compared with other amino acids, T3SS effectors prefer to stay in unfolded coils and to be exposed to the solvent rather than be buried. The joint profiles analysis shows that the ‘Ser-coil-exposed’ preference is most frequently observed at most positions of T3SS effectors [83, 104].

Furthermore, many T3SS effectors share a commonality of interacting with cognate chaperones before secretion. Hence, the characteristic features of the chaperones can be used to identify cognate effectors that are encoded in the vicinity of chaperone genes [105]. The biophysical features of chaperones, such as the small molecular weight and acidic pI [106], provide markers for potential cognate effector loci in bacterial genomes [105]. Similarly, the antibacterial T6SS effectors are paired with specific cognate immunity proteins encoded in the vicinity of effector genes. The selected immunity proteins also exhibit low pI values and contain several highly conserved residues, making them markers for identifying new downstream cognate T6SS effectors [65]. The occurrence of conserved domains is another criterion to detect effectors, such as the eukaryotic domain [37, 69, 71, 72, 96, 107], which implies a potential imitation of eukaryotic host cell functionality of such domains. The occurrence of the prokaryotic domain, nuclear localization signal (NLS), mitochondrial localization signal (MLS) and prenylation domain are considered in [96] to identify novel T4SS effectors.

The characteristics of transcription regulatory sequences, the promoters, can also be used to identify the downstream open reading frames (ORFs) that encode putative effectors [62, 67, 69, 76, 77, 81, 96, 100, 108]. A match with a promoter pattern shared by known effector genes may indicate an encoded effector downstream. However, this approach is limited to known and detectable motifs and is specific to the conserved effector families. Moreover, G + C content analysis of genes encoding effectors helps to identify new effectors [11, 66, 69, 76, 104]. Studies have shown that effectors commonly have a relatively low G + C content [11, 62, 66, 69, 96, 109], supporting the hypothesis that a large number of effectors originate via horizontal gene transfer.
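The G + C content criterion is simple to compute. The sketch below is a minimal illustration with hypothetical function names; how far below the genome average a gene must fall to be flagged is not specified in the text and is left to the user.

```python
def gc_content(dna):
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    bases = [b for b in dna.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no unambiguous bases")
    return sum(b in "GC" for b in bases) / len(bases)


def gc_deviation(gene, genome_gc):
    """Difference between a gene's G + C content and the genome-wide value.

    Negative values mean the gene is poorer in G + C than the genome average.
    """
    return gc_content(gene) - genome_gc
```

For instance, gc_deviation(gene_sequence, genome_gc) returning a clearly negative number would mark the gene as a candidate worth inspecting with the other features described in this section.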

In silico identification methods

In general, the basis of prediction methods for bacterial effectors relies on the similarity between putative predictions and the known effectors. In other words, as the sequential, structural, physical or chemical collections of features of the known effectors are given/obtained, novel effectors are identified among the plausible predictions according to acceptable similarities to known effectors.

Computational prediction methods of bacterial effectors use multiple strategies to execute the prediction, such as machine learning algorithms and statistical approaches.

As one of the most common strategies, machine learning algorithms are well integrated into the exploration of bacterial effector proteins in silico when information about both effectors and non-effectors is available. Machine learning algorithms are a class of computational methods for binary ‘classification’ issues, where in the context of predicting effector proteins, the purpose of the algorithms is to discriminate effectors from non-effectors [11, 38, 77, 82–87]. Generally, the known effectors and non-effectors are fed into the machine learning algorithm as input, and the algorithm ‘learns’ or ‘is trained’ to discriminate effectors from non-effectors. The expectation is that the trained algorithms can identify new effector proteins based on the provided information.

The second major strategy to predict bacterial effectors is based on probability distributions. Statistical methods, such as Markov models, make novel predictions about effector proteins according to their probabilities [78, 88, 89], which are elaborated in the ‘Markov model’ section.
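To make the probabilistic idea concrete, the sketch below scores a sequence by the log-odds between two first-order Markov chains, one estimated from effectors and one from background proteins. This is a deliberately simplified stand-in chosen for illustration: it is not a profile HMM and not the model of any cited tool, and the function names, the pseudocount and the toy training sequences are all assumptions.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def train_markov_chain(sequences, pseudocount=1.0):
    """Estimate log transition probabilities log P(current | previous)."""
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for seq in sequences:
        seq = seq.upper()
        for prev, cur in zip(seq, seq[1:]):
            if prev in counts and cur in counts[prev]:
                counts[prev][cur] += 1
    model = {}
    for prev, row in counts.items():
        total = sum(row.values())
        model[prev] = {cur: math.log(c / total) for cur, c in row.items()}
    return model


def log_odds(sequence, effector_model, background_model):
    """Sum of per-transition log-odds; positive favours the effector model."""
    seq = sequence.upper()
    score = 0.0
    for prev, cur in zip(seq, seq[1:]):
        if prev in effector_model and cur in effector_model[prev]:
            score += effector_model[prev][cur] - background_model[prev][cur]
    return score


# Toy training sets (made-up sequences, for illustration only).
effector_model = train_markov_chain(["SSSSSS", "SSTSSS"])
background_model = train_markov_chain(["LLLLLL", "LLALLL"])
```

A sequence is then ranked by its score, with higher values meaning that it resembles the effector training set more than the background under this simple model.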

Specifically, a training data set for machine learning algorithms is first constructed with both ‘positive’ and ‘negative’ examples. The positive examples are already known effectors, which are experimentally validated in most cases, or putative effector proteins. The negative examples are proteins that are identified or supposed to be non-effectors. Each example of the two groups is represented as a set of particular information called ‘features’, including the sequence patterns, physical characteristics and other features of the proteins. As the second step, a computational model is built using the constructed training data set. Both the positive and negative examples are input into the machine learning algorithm to enable the algorithm to learn the difference between the given effectors and non-effectors. This process is also referred to as the ‘training/learning’ phase of machine learning algorithms. Finally, the trained algorithms serve as predictive models that are able to make novel predictions about effectors based on the input of new bacterial genomes or unknown sequences. Well-supported predictions with a plausible evaluation performance are considered for further experimental validation. The features of the validated results may be used to iteratively refine the predictive model. This process is referred to as ‘testing’ the algorithms. Instead of testing on independent data sets, the testing phase of machine learning algorithms can also be executed via ‘k-fold’ cross-validation, where the entire training data set is separated into k parts. Each part is used in turn to test the model, which is trained on the remaining k-1 parts. This kind of testing is used to assess the predictive model, rather than to predict effectors from distinct organisms.
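The k-fold cross-validation procedure described above can be sketched as follows. The nearest-centroid classifier is only a small stand-in so that the example runs without external libraries; it is not one of the methods reviewed here. All function names and the toy two-feature data are hypothetical.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and deal them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def train_centroids(xs, ys):
    """Stand-in 'training': one mean feature vector per class label."""
    centroids = {}
    for label in set(ys):
        rows = [x for x, y in zip(xs, ys) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict_centroid(centroids, x):
    """Assign x to the class whose centroid is closest (squared distance)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist(centroids[label]))


def cross_validate(features, labels, k, seed=0):
    """Return the accuracy on each held-out fold."""
    accuracies = []
    for test_idx in k_fold_indices(len(features), k, seed):
        held_out = set(test_idx)
        train_x = [features[i] for i in range(len(features)) if i not in held_out]
        train_y = [labels[i] for i in range(len(labels)) if i not in held_out]
        model = train_centroids(train_x, train_y)
        correct = sum(predict_centroid(model, features[i]) == labels[i]
                      for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return accuracies


# Toy data: label 1 = effector, 0 = non-effector (made-up feature values).
features = [[0.9, 0.1], [0.8, 0.2], [0.85, 0.15], [0.95, 0.05], [0.7, 0.3],
            [0.9, 0.2], [0.1, 0.9], [0.2, 0.8], [0.15, 0.85], [0.05, 0.95],
            [0.3, 0.7], [0.2, 0.9]]
labels = [1] * 6 + [0] * 6
fold_accuracies = cross_validate(features, labels, k=3)
```

Each sample is held out exactly once, so the fold accuracies together give an estimate of how the model behaves on data it was not trained on. Real studies usually also stratify the folds by class and report measures beyond accuracy, which this sketch omits.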

Figure 1 shows the flowchart of the bioinformatics identification of bacterial effectors based on machine learning algorithms. Knowledge of the previously known effectors and non-effectors is used in the prediction methods as ‘features’, which are used to train the model to make it able to discriminate effectors from non-effectors. The ‘trained’ or ‘learned’ model can make further novel predictions about new unclassified sequences, according to the knowledge it learned. Putative prediction candidates of effectors are considered for further experimental validation and the features of validated effectors help to refine the predictive model.

Figure 1

Flowchart of in silico prediction of bacterial secreted effectors based on machine learning algorithms. During the training phase, diverse features of known effectors and non-effectors are extracted and combined into feature vectors $x = (x_{1}, x_{2}, \dots, x_{N})$, where N denotes the number of considered features. Feature vectors are passed to the machine learning algorithms to construct a predictive model that is able to discriminate effectors from non-effectors. New sequences are classified as either putative effectors or non-effectors by the constructed model according to the similarity between their features and those of known examples. The fidelity of putative effectors can be verified by experimental validation, and the validated ones may further refine the predictive model. This process is referred to as independent testing; together with cross-validation, it constitutes the testing phase.


The prediction methods of bacterial effector proteins secreted via certain types of secretion systems cannot be exhaustively elaborated owing to limited space in this article. Hence, the mechanisms of the commonly adopted prediction techniques are exemplified using selected examples, including support vector machine (SVM) (Figure 2A) [11, 38, 77], naive Bayes (NB) (Figure 2B) [85], artificial neural network (ANN) (Figure 2C) [86] and random forest (RF) (Figure 2D) [87], which have achieved different levels of success. The foremost differences between these methods are the mechanisms used to extract statistical commonalities in the data and their principles of processing data and making novel predictions.

Figure 2

In silico prediction methods of bacterial secreted effectors at work. (A–D) Four widely used machine learning algorithms for identifying secreted effectors in bacteria, which are well known as binary classification tools; (E) a statistical approach based on the Markov model, which is prevalently used for searching for homologous domains in bacterial effectors.


Support vector machine

Diverse characteristic features of effectors and non-effectors are the input of SVMs, from which SVMs learn the differences between these two groups. Specifically, the features depicting each example are numerically encoded as a feature vector, that is, a sequence of numbers that can be thought of as a point in N-dimensional space, where N is the number of features considered. Given two opposite groups of effectors and non-effectors, SVMs separate the corresponding points in this N-dimensional space by constructing a hyperplane between them. Unclassified examples are predicted to be in either group according to which side of the hyperplane they fall on [11, 38, 77, 82–84, 110]. A brief description of how SVMs work is shown in Figure 2A.
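The final classification step, deciding on which side of the hyperplane a point falls, can be illustrated as follows. This sketch covers prediction only, not the training that finds the hyperplane; the weight vector `w`, the offset `b` and the two-dimensional points are hypothetical values chosen for illustration, not parameters of any published tool.

```python
def decision_value(w, b, x):
    """Signed score of point x relative to the hyperplane w.x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Assign the class according to the side of the hyperplane."""
    return "effector" if decision_value(w, b, x) >= 0 else "non-effector"


# Hypothetical hyperplane in a 2-dimensional feature space.
w = [1.0, -1.0]
b = 0.0

label_a = classify(w, b, [2.0, 0.5])  # score  1.5
label_b = classify(w, b, [0.5, 2.0])  # score -1.5
```

In a trained SVM, `w` and `b` are chosen to maximize the margin between the two groups, and kernel functions allow the same idea to be applied to boundaries that are not linear in the original feature space.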

SIEVE

The detection approach of T3SS effectors, SVM-based identification and evaluation of virulence effectors (SIEVE), is trained with multiple features of the known effectors. The feature vectors of SIEVE represent points in 711-dimensional space that are contributed by consideration of the G + C content, AAC, the 30 N-terminal residues of both effectors and non-effectors and their evolutionary relationships to other genomes [11]. The predictive power is tested by making novel predictions based on distantly related organisms.

In SIEVE, the construction of the training data set includes the removal of effector homologs within one particular organism, which are identified by BLAST. However, the evolutionary conservation of each effector protein across organisms is still considered as a feature for SIEVE, in the form of PSSM and phylogenetic profiles.

SIEVE is trained on P. syringae effector proteins and evaluated by predicting Salmonella enterica serovar typhimurium effectors, which are evolutionarily distant from the training data set. A reverse experiment is conducted by swapping the training and testing data sets of effectors from the two organisms. Novel effectors are also predicted in a third organism, the human pathogen Chlamydia trachomatis, and high-ranked predictions are experimentally investigated.

In addition, the identifications of SIEVE indicate a set of conserved sequence biases within a majority of the effectors from both organisms, suggesting a putative secretion signal in the 30 N-terminal residues for the identification of novel T3SS effectors.

SSE-ACC

SSE-ACC, an SVM model, was developed to identify T3SS effectors based on the AAC, secondary structure (SSE) and solvent accessibility (ACC) of known effectors and non-effectors in P. syringae, precisely within the 100 N-terminal residues rather than the full length [77].

The 100-dimensional feature vectors are constructed according to the AAC in different secondary structure elements and solvent accessibility states. The first 60 dimensions describe the frequency of each of the 20 amino acids in each of the three types of protein secondary structure elements, i.e. strand, helix and coil (20*3). The last 40 dimensions represent the frequency of each of the 20 amino acids in each of the two types of protein solvent accessibility states, i.e. buried or exposed (20*2).
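A minimal sketch of how such a 100-dimensional vector could be assembled is given below. The one-letter state codes (`E`, `H`, `C` for strand, helix and coil; `b`, `e` for buried and exposed), the alphabetical amino acid order and the normalization by sequence length are assumptions made for illustration; the exact encoding used by SSE-ACC is not specified here.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SS_STATES = "EHC"   # strand, helix, coil (assumed one-letter codes)
ACC_STATES = "be"   # buried, exposed (assumed one-letter codes)


def sse_acc_features(seq, ss, acc):
    """Return 60 (20*3) secondary-structure and 40 (20*2) accessibility frequencies."""
    if not (len(seq) == len(ss) == len(acc)):
        raise ValueError("seq, ss and acc must have the same length")
    ss_counts = {(s, a): 0 for s in SS_STATES for a in AMINO_ACIDS}
    acc_counts = {(c, a): 0 for c in ACC_STATES for a in AMINO_ACIDS}
    for a, s, c in zip(seq, ss, acc):
        if (s, a) in ss_counts:
            ss_counts[(s, a)] += 1
        if (c, a) in acc_counts:
            acc_counts[(c, a)] += 1
    n = max(len(seq), 1)  # normalize counts to frequencies
    return [v / n for v in ss_counts.values()] + [v / n for v in acc_counts.values()]


# Toy 4-residue example; real input would be the 100 N-terminal residues
# with states predicted by tools such as PSIPRED and ACCpro.
features = sse_acc_features("MKLA", "CHHE", "ebbe")
```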

SSE-ACC is tested both by 5-fold cross-validation and on independent rhizobial genomes. Combined with a promoter search using an HMM, which is described in detail in the ‘Markov model’ section, multiple putative effectors are predicted and confirmed by wet-bench experiments [77].

Additionally, the structural motifs, which are presented as input features, are predicted by PSIPRED [79], and the solvent accessibility states are predicted by ACCpro within the SCRATCH server [111].

T4EffPred

As in our previous work, T4EffPred is an SVM model for predicting bacterial effectors secreted via T4SS based on the AAC and evolutionary conservation of effectors and non-effectors represented by PSSM profiles [38].

PSSM profiles are created by using PSI-BLAST to search the NCBI’s non-redundant (NR) protein database for multiple alignment against the query sequence. In detail, PSI-BLAST first constructs a multiple alignment from the original BLAST output data sought by the query sequence, where all the database sequence segments aligned to the query with an E-value below a predefined threshold (closely related) are chosen. Next, PSI-BLAST processes this multiple alignment into a PSSM, which holds the conservation sequence pattern among the alignment. Finally, PSI-BLAST feeds the PSSM into a new round of BLAST search against the database, rather than the query sequence. Newly identified sequence segments with low E-values are aligned and used to refine the PSSM iteratively until no additional related sequences are detected [38, 98].

The critical difference between the PSI-BLAST and pairwise alignment in BLAST is that the score for aligning a letter with the pattern position is given by the PSSM, rather than with reference to an amino acid substitution matrix. As PSSM greatly increases the sensitivity to weak but biologically relevant sequence relationships, PSSM is commonly used for molecular motif/pattern discovery, such as protein secondary structure [79], transporter targets [112] and bacterial secreted T3SS effectors [11, 82, 87].

In particular, for a given query sequence of length n, the corresponding PSSM profile contains $n \times 20$ elements, and the (i, j)th entry represents the score of the amino acid in position i of the query sequence being mutated to amino acid type j during the evolution process. In practice, the authors transform the PSSM profile into the PSSM composition profile by summing all rows of the PSSM that correspond to the same amino acid type in the query sequence, which contributes $20 \times 20 = 400$ dimensions to the feature vectors for the SVM. In addition, the auto covariance transformation of the PSSM, PSSM_AC, is also calculated by the authors to discriminate T4SS effectors; it represents the correlation of evolutionary conservation between residues separated by an interval of up to 10 along the sequence.
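The row-summing step that turns an $n \times 20$ PSSM into a fixed-length 400-dimensional vector can be sketched as follows. This is an illustration under stated assumptions: the PSSM is passed in as a plain list of rows, the amino acid order is alphabetical, and no scaling is applied, whereas the published tool may normalize the scores.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed row order of the composition


def pssm_composition(seq, pssm):
    """Sum PSSM rows by the amino acid type of the query residue (20 x 20 = 400)."""
    index = {a: i for i, a in enumerate(AMINO_ACIDS)}
    comp = [[0.0] * 20 for _ in range(20)]
    for residue, row in zip(seq, pssm):
        i = index.get(residue)
        if i is None:
            continue  # skip non-standard residues
        for j, score in enumerate(row):
            comp[i][j] += score
    # Flatten the 20 x 20 matrix into one feature vector.
    return [value for row in comp for value in row]


# Toy PSSM for the 3-residue query "AAC" with constant rows.
toy_pssm = [[1] * 20, [2] * 20, [5] * 20]
vector = pssm_composition("AAC", toy_pssm)
```

Because the output length no longer depends on n, proteins of different lengths can be compared in the same feature space.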

In contrast to the previous SVM-based models, four types of feature vectors are used by T4EffPred, constructed based on AAC (20D), amino acid pairs (400D), PSSM composition (400D) and PSSM_AC (200D). Tenfold cross-validation on a training data set consisting of proteins from multiple pathogens suggests that the PSSM-based feature vectors are more helpful for discriminating effectors from non-effectors. These four individual feature vectors and their ensemble are used to expand the set of putative T4SS effectors in an independent organism.
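The two sequence-only feature types, AAC (20D) and amino acid pair composition (400D), can be computed directly from a protein sequence. The sketch below assumes an alphabetical amino acid order and normalization to frequencies; both are illustrative choices rather than the documented T4EffPred encoding.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(seq):
    """20-dimensional amino acid composition (relative frequencies)."""
    n = max(len(seq), 1)
    return [seq.count(a) / n for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """400-dimensional composition of adjacent amino acid pairs."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    n = max(len(pairs), 1)
    return [pairs.count(a + b) / n for a in AMINO_ACIDS for b in AMINO_ACIDS]


# Toy sequence: pairs are AC, CA, AC.
aac_vector = aac("ACAC")
pair_vector = dipeptide_composition("ACAC")
```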

Naive Bayes

NB algorithms are probabilistic classifiers. When applied to a binary classification problem, the algorithms are trained using a positive and negative set of examples, with each example represented by a vector of features. NB algorithms classify unknown examples according to their probability calculated by applying Bayes’ theorem with an assumption of conditional independence between features [113]. A brief introduction to constructing NB classifier is shown in Figure 2B.
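As a minimal illustration of this principle, the sketch below implements a Gaussian NB classifier, in which each feature is modeled independently per class by a normal distribution. The Gaussian form, the small variance floor and the two-feature toy data are assumptions made for this example; they are not taken from any of the reviewed tools.

```python
import math


def fit_gaussian_nb(X, y):
    """Estimate the class prior and per-feature mean and variance for each class."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # A small constant keeps the variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(zip(*rows), means)]
        model[label] = (n / len(y), means, variances)
    return model


def predict(model, x):
    """Return the class with the highest log prior plus summed log likelihoods."""
    scores = {}
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        # Conditional independence: per-feature log likelihoods simply add up.
        for v, m, var in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical two-feature training data.
X = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
y = ["effector", "effector", "non-effector", "non-effector"]
model = fit_gaussian_nb(X, y)
```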

The method ‘EffectiveT3’ uses the NB algorithm as a classifier to model the AAC and secondary structures of the N-terminal residues of the T3SS effectors [85].

This machine learning approach takes 100 experimentally verified effector proteins from the animal pathogens Chlamydia, Salmonella, Yersinia and Escherichia and the plant pathogen Pseudomonas as the positive training data. The negative data set comes from randomly selected non-effectors.

The authors first analyze the known effectors in the training data set, and extract a specific AAC in the N-termini of the effectors as the general signal of T3SS-mediated transport. Then, they build the EffectiveT3 for this N-terminal signal using NB algorithms to separate the positive and negative data. After 10-fold cross-validation, EffectiveT3 is generalized to predict novel T3SS effectors from 739 genomes, which are with or without T3SS.

In contrast to other methods, the 20 amino acids are mapped into two reduced alphabets according to their biophysical properties and hydrophobic/hydrophilic characteristics. In other words, the 20 amino acids are separated into several groups hierarchically. The frequencies of each group of amino acids and the frequencies of 20 amino acids (AAC) make up the feature vector of EffectiveT3.

The features identified by EffectiveT3 are argued to be taxonomically universal owing to the inclusion of T3SS effectors from multiple pathogens in the training data set, where there is no organism-specific information required.

Additionally, the homologues in the training data set are clustered by a global alignment, and the N-terminal conserved region is surveyed through multiple alignments. The structural motifs, which are presented as input features, are predicted by PSIPRED [79].

Artificial neural network

ANNs are computer-based algorithms inspired by the structure and behavior of connected neurons in the human brain. Similar to other machine learning algorithms, ANNs can be trained to recognize and classify complex patterns by analyzing the rules of data within a training data set and can make novel classifications based on the learned knowledge [114]. The basic idea behind ANNs is shown in Figure 2C.

‘T3SS_prediction’ is an ANN-based model that is used to identify potential T3SS effectors based on the N-terminal amino acid sequence using a sliding-window procedure [86].

In particular, after the removal of redundant and short sequences (<100 aa), 575 putative T3SS effectors and 685 non-effectors from P. syringae and other species make up the training data set of T3SS_prediction. Within the 30 N-terminal amino acids, a window with a width of 25 slides along the sequence fragment toward the C-terminus with an increment of one amino acid. The generated overlapping subsequences are the input of the ANN classifier, within which each amino acid is encoded as a string of length 20. The output neuron of the ANN classifies each subsequence by estimating its probability of being a part of a T3SS effector signal, according to a predefined threshold.
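The sliding-window encoding can be sketched as follows. With a 30-residue fragment, a window of width 25 and a step of one residue, six overlapping windows are produced. Interpreting the "string of length 20" as a one-hot (binary) code is an assumption made here, as are the function names.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(residue):
    """Encode one amino acid as 20 binary values (assumed encoding)."""
    return [1 if residue == a else 0 for a in AMINO_ACIDS]


def sliding_windows(seq, n_term=30, width=25, step=1):
    """Cut overlapping windows from the N-terminal fragment of a sequence."""
    fragment = seq[:n_term]
    return [fragment[i:i + width]
            for i in range(0, len(fragment) - width + 1, step)]


def encode_window(window):
    """Concatenate the per-residue codes: 25 residues x 20 = 500 inputs."""
    return [bit for residue in window for bit in one_hot(residue)]


# Toy 40-residue sequence.
toy_seq = AMINO_ACIDS * 2
windows = sliding_windows(toy_seq)
encoded = encode_window(windows[0])
```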

Seven hidden neurons are constructed in the final ANN model based on a series of systematic optimizations to make independent predictions of potential T3SS effectors for 918 bacterial genomes.

Moreover, based on the sequence patterns of the N-terminal secretion signals, similarity-based comparison with the known effectors, chaperone homologues and AAC, an SVM is trained as an alternative classifier.

Random forests

The RFs classifier is constructed using multiple decision trees, each of which individually makes a classification for the given data set. The final prediction of RFs corresponds to the classification with the most votes over all the decision trees in the forest [115]. The mechanism of RFs is introduced briefly in Figure 2D.
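The voting step can be illustrated with a toy forest. The three "trees" below are single-threshold rules on hypothetical features (`serine_fraction`, `coil_fraction`, `exposed_fraction`), invented purely to show the majority vote; real RFs grow many deeper trees on random subsets of the training examples and features.

```python
from collections import Counter


def forest_predict(trees, x):
    """Collect one vote per tree and return the most frequent class."""
    votes = [tree(x) for tree in trees]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical decision stumps standing in for trained decision trees.
trees = [
    lambda x: "effector" if x["serine_fraction"] > 0.10 else "non-effector",
    lambda x: "effector" if x["coil_fraction"] > 0.50 else "non-effector",
    lambda x: "effector" if x["exposed_fraction"] > 0.60 else "non-effector",
]

example = {"serine_fraction": 0.15, "coil_fraction": 0.40, "exposed_fraction": 0.70}
prediction = forest_predict(trees, example)  # two of three trees vote "effector"
```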

A prediction model of T3SS effectors, ‘T3SPs’, is developed based on the algorithm of RFs [87].

In particular, PSSM profiles are calculated to show position-specific conservation differences between the 283 T3SS effectors and 313 non-T3SS effectors across multiple species, which are generated by using PSI-BLAST to search the Swiss-Prot for multiple alignment against the query sequence. Each residue at one position is characterized by 20 values corresponding to the log-likelihood of mutation of 20 standard amino acids. The significance of the conservation differences between effectors and non-effectors at the same position is assessed statistically, and 52 relatively distinct positions for discrimination of T3SS effectors are retained, which represent the original input data for the RFs.

Based on the consideration of the AAC, secondary structure, solvent accessibility and six physicochemical properties of the 100 N-terminal amino acids of the collected data, 62 features are selected to construct the feature vectors in the RFs, according to their individual contribution to the T3SS prediction.

The selection of relatively optimal positions and features for identifying novel T3SS effectors is based on the argument that some residues with little conservation have poor contributions to the effector identification. Therefore, redundant position and feature information is removed to improve the predictive power of T3SPs.

Additionally, the redundancy in the initial data set is removed by clustering the sequence fragments of the 100 N-terminal amino acids; the homology is removed through multiple alignments. The structural motifs and the relative solvent accessibility, which are included in the feature vectors, are predicted by SABLE [116], and the six physicochemical properties are provided by Protparam in the ExPASy server [117].

Markov model

The features of bacterial effectors can be extracted statistically, where a sequence is predicted to be an effector protein according to its probability of being one, based on the application of Markov models [77, 78, 81, 88, 89, 100].

A Markov model is a stochastic model used to describe a randomly changing process where what happens next depends merely on the current state of the system. In general, ‘Markov chains’ and ‘HMMs’ are commonly used, where the sequential state can be either observable or not.

As the simplest model, a Markov chain contains only one state sequence describing the process with sequential dependence between adjacent states. In the context of predicting bacterial effectors, the sequential dependence assumption of Markov models can be generalized as a probabilistic dependence between adjacent amino acids [88]. Hence, the probability of generating a sequence $S = A_{1}, A_{2}, \dots, A_{n}$ of length n can be described as a Markov chain and calculated as the product of the conditional probabilities of each adjacent amino acid pair, as follows:

$$P(S) = P(A_{1}) P(A_{2} | A_{1}) \dots P(A_{i + 1} | A_{i}) \dots P(A_{n} | A_{n - 1}) \tag{1}$$

where $A_{i}$ denotes the ith amino acid, and $P(A_{i + 1} | A_{i})$ describes the probability of $A_{i + 1}$ conditioned on the amino acid at the sequentially preceding position, $A_{i}$.

‘T3_MM’, a Markov-chain-based prediction method of T3SS effectors, is constructed [88] based on the observation of T3SS effector-specific AAC conditional dependence.

Precisely, the probability of each amino acid conditional on the amino acid in the preceding position within the 100 N-terminal amino acids in the T3SS effectors is different from that of the non-T3SS effectors and the theoretical random probability distribution of each amino acid. In other words, given an amino acid, the sequentially following amino acid shows a significantly biased composition in T3SS effectors compared with that of non-T3SS effectors and the theoretical random distribution of AAC.

In particular, protein sequences are modeled as Markov chains with each state described by the amino acids in the protein sequence. Given a training data set composed of 154 T3SS effectors and 308 non-T3SS effectors, the AAC conditional probabilistic modeling results of each group of motifs are calculated. T3_MM is constructed based on the distribution of the conditional probability difference between the two groups to predict novel T3SS effectors, which corresponds to the likelihood ratio of the sequence being a T3SS effector or a non-effector.
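A likelihood-ratio score of this kind can be sketched as follows. Transition probabilities between adjacent amino acids are estimated separately from effectors and non-effectors, and a query is scored by the summed log ratio over its adjacent residue pairs. The pseudocount, the omission of the initial-residue term and the toy sequences are assumptions made for illustration and are not the published T3_MM parameters.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def transition_probs(sequences, pseudocount=1.0):
    """Estimate P(next residue | current residue) with additive smoothing."""
    counts = {a: {b: pseudocount for b in AMINO_ACIDS} for a in AMINO_ACIDS}
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            if a in counts and b in counts[a]:
                counts[a][b] += 1
    probs = {}
    for a, row in counts.items():
        total = sum(row.values())
        probs[a] = {b: c / total for b, c in row.items()}
    return probs


def log_likelihood_ratio(seq, p_eff, p_non, n_term=100):
    """Positive scores favor the effector model, negative the non-effector model."""
    fragment = seq[:n_term]
    score = 0.0
    for a, b in zip(fragment, fragment[1:]):
        if a in p_eff and b in p_eff[a]:
            score += math.log(p_eff[a][b] / p_non[a][b])
    return score


# Hypothetical toy training sequences.
p_eff = transition_probs(["MSSSSPSS", "MSPSSSSP"])
p_non = transition_probs(["MKLLKLLK", "MLKKLLKL"])
score_a = log_likelihood_ratio("MSSPSS", p_eff, p_non)
score_b = log_likelihood_ratio("MKLLKL", p_eff, p_non)
```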

In addition to 5-fold cross-validation, two independent testing data sets are included to evaluate the performance of T3_MM for predicting T3SS effectors.

Compared with the single-state sequence in Markov chains, HMMs are composed of a hidden-state sequence and an observed-symbol sequence, where the state sequence is a Markov chain that cannot be observed directly but is inferred probabilistically from the observed symbol sequence [118].

HMMs are commonly applied as sequence homology search tools in molecular domains based on probabilistic inference methods [119–121]. HMMs of a protein domain are typically built based on multiple sequence alignment of homologous domains, with the states representing the match, insertion and deletion of amino acids (the symbols) in the sequences, as shown in Figure 2E [120, 121].

In the context of predicting bacterial effectors, HMMs are used to model a particular region related to the effector proteins, such as the promoter patterns in the upstream region that encodes T3SS effectors [81, 77, 100], the full-length T3SS effectors [84] and phosphorylated ‘EPIYA’ motif sequences in both T3SS and T4SS effectors [89]. With additional consideration of the duration of states, a hidden semi-Markov model (HSMM) is constructed to describe the C-terminal patterns of T4SS effectors and to make novel predictions [78]. The constructed HMMs and HSMMs are further applied to search for homologous motifs in larger data sets to expand the inventory of bacterial effectors.

Other prediction strategies

There are exceptions of prediction methods that do not adopt machine learning algorithms or statistical probabilities, such as ‘searching algorithm for type IV secretion system effectors (S4TE)’, which predicts T4SS effectors using all positive examples to make novel predictions [96]. S4TE considers 13 features depicting known T4SS effectors as criteria for detection and returns putative predictions with a satisfactory score that is summed over the individual absence or presence of each feature. The 13 features are the promoter pattern of known T4SS effectors, sequence homology, the occurrence of eukaryotic and prokaryotic domains, NLS, MLS, prenylation domain and structural motifs, the basicity, charge, hydrophilicity and E-block module in the C-terminus, as well as global hydrophilicity of the considered protein.

The exceptions also include the classic prediction methods based on comparative genomics. They thrived during the early stage of this field and were promoted by the increasing number of available genomic sequences from pathogens, but they are limited by the poor conservation between effectors.

As a classic bioinformatics tool to search homologs, the ‘basic local alignment search tool (BLAST)’ [122] has been extensively used to search homologs based on sequence similarity [19, 44, 66, 76, 81].

For example, validated and predicted T3SS effectors in different organisms are used to search for novel effectors in Escherichia coli based on the BLAST-determined sequence similarity [66]. Putative homologous hits are subjected to a second round of PSI-BLAST searches against the NCBI’s NR database to identify more distantly related homologs.

New putative ORFs encoding T3SS effectors are identified based on the BLAST-determined similarity to the 100 N-terminal residues of known effectors in Salmonella typhimurium [19]. The alignment of newly identified known effectors suggests that a consensus motif of WEK(I/M)XXFF is located in the 50 N-terminal residues, which functions as a conserved region for the translocation of effectors in T3SS.

Based on the HMM-identified promoters, downstream ORFs encoding T3SS effectors are searched by BLAST to identify putative T3SS effectors that are homologous to known ones [81].

Based on the identification of promoters above, the 50 N-terminal residues and homology to known effectors are also used to identify novel T3SS effectors [76]. Two motifs of ORFs are searched along the genome of P. syringae for novel T3SS effectors, which are defined according to the likelihood of certain amino acids appearing within the N-terminal region. The putative ORF hits returned by BLAST searches are subsequently eliminated by heuristic rules, such as a starting residue of Met and a minimal length of 150 aa.

Similarly, a phylogenetically dispersed superfamily of homologous T6SS effectors is identified using physical information derived from experimentally validated bacterial T6SS effectors and effector-immunity pairs [44].

Performance evaluation

The fidelity of predicted bacterial secreted effectors can be verified by experimental confirmation, which reveals the exact number of reliable predictions [66, 69, 76]. However, experimental confirmation is laborious and does not lend itself to fast, efficient and automated assessment. As with the identification of secreted effectors, in silico assessment techniques are therefore required.

As the in silico prediction of secreted effector proteins in bacteria is generally framed as a binary classification task of discriminating effectors from non-effectors, the prevalent statistical measures of classification are adopted to assess the performance of each prediction method, namely, the ‘sensitivity’, ‘specificity’, ‘Matthews correlation coefficient’ (MCC), ‘receiver operating characteristic (ROC) curve’ and the ‘area under the curve’ (AUC). Sensitivity (TP/(TP + FN)) and specificity (TN/(TN + FP)) assess the ability to correctly identify effectors and to correctly reject non-effectors, respectively. MCC combines sensitivity and specificity and is generally regarded as a comprehensive and balanced measure that can be used even if the positive and negative samples are of different sizes [123]. The coefficient values range from −1 to +1, with +1 representing perfect prediction, 0 indicating random guessing and −1 representing total disagreement between prediction and observation.

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{2}$$
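The three count-based measures can be computed directly from the numbers of true and false predictions. The counts in the usage lines are hypothetical and serve only as a worked example; returning 0 when the MCC denominator vanishes is a common convention adopted here as an assumption.

```python
import math


def sensitivity(tp, fn):
    """Fraction of true effectors that are correctly identified."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of true non-effectors that are correctly rejected."""
    return tn / (tn + fp)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient as defined in Equation (2)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical test set: 50 effectors and 100 non-effectors.
sens = sensitivity(tp=45, fn=5)
spec = specificity(tn=80, fp=20)
coef = mcc(tp=45, tn=80, fp=20, fn=5)
```

For these counts the sensitivity is 0.9, the specificity is 0.8 and the MCC is approximately 0.67, which shows how the MCC penalizes the 20 false positives more visibly than either single measure does.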

The ROC curve is produced by plotting the sensitivity (y-axis) of the prediction method versus the ‘fall-out’ [x-axis, equal to (1 − specificity)]. A perfect prediction with neither FN nor FP corresponds to point (0,1) in the ROC space, and a completely random guess yields a point along the diagonal line from (0,0) to (1,1). Points below the diagonal line indicate results that are worse than random guessing. Correspondingly, the AUC is 1 when all the samples are classified correctly and is 0.5 when the classification is random.
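The AUC can be computed without drawing the curve, because it equals the probability that a randomly chosen effector receives a higher score than a randomly chosen non-effector, with ties counted as one half. The sketch below uses this rank-based formulation; the scores and labels are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [1, 1, 0, 0]                          # 1 = effector, 0 = non-effector
perfect = auc([0.9, 0.8, 0.2, 0.1], labels)    # every effector outranks every non-effector
uninformative = auc([0.5, 0.5, 0.5, 0.5], labels)
inverted = auc([0.1, 0.2, 0.8, 0.9], labels)
```

The pairwise comparison is quadratic in the number of samples, which is acceptable for small benchmarks; sorting-based implementations are preferable for genome-scale score lists.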

The measures mentioned above are commonly applied to assess the performance of a prediction method when the numbers of effectors and non-effectors in the testing data set are known. Under this requirement, the results can be classified as correct predictions [true positives (TPs) and true negatives (TNs)] or incorrect predictions [false positives (FPs) and false negatives (FNs)]. For example, given 275 known L. pneumophila T4SS effectors, the fidelity of S4TE is measured as the number of correct predictions, which is reflected by the above parameters [96]. During the reverse experiments in SIEVE, which swap the training and testing data sets of the effectors from P. syringae and S. typhimurium, the respective sensitivity and specificity values are calculated [11]. Particularly, at a sensitivity of 90%, the specificity is 88% when SIEVE is trained using the P. syringae effectors and then used to predict S. typhimurium effectors and is 87% in the opposite case. The corresponding ROC AUC values are 0.95 and 0.96 [11]. The performance of different classification algorithms for predicting certain types of effectors is often compared [85–88]. The evaluation values are also adopted to assess the contribution of individual features to discriminating effectors from non-effectors. For example, PSSM is the most effective single feature for representing T4SS effectors as it has the highest sensitivity (89.4% for identifying IVB effectors and 73.3% for identifying IVA effectors), MCC values (0.874 and 0.782, respectively) and AUC (0.970 for identifying IVB effectors) [38]. The optimal MCC values are obtained when the AACs are combined with the PSSM profiles (0.878 and 0.784 for identifying IVB and IVA effectors, respectively) [38].

Available online resources

With the rapid development of high-throughput sequencing technologies, an increasing number of bacterial genomes has been fully sequenced and made available for the analysis of bacterial secretion systems and effectors. There is an unprecedented requirement for bioinformatics tools/resources that are able to identify secreted effectors conveniently and reliably from these genomic data.

Representative prediction methods of bacterial effector proteins secreted via certain types of secretion systems are introduced above. In this section, we present more specific tools developed for this domain, which are mostly available online (Table 1), as well as databases built for the secreted effectors (Table 2). We hope the tables illustrate our respect for the computational efforts that have been made, although we cannot exhaustively enumerate all of them within this article.

Table 1

Computational prediction methods of secreted effector proteins in bacteria

Method	Secretion system	Prediction algorithm	Feature	Training data	Size	Description	Accessibility
SIEVE [11]	T3SS	SVM	AAC, SEQ, GC, CONS and PHYL	S. typhimurium and P. syringae	65e unfixed	SIEVE learns the difference between T3SS effectors and non-effectors and detects novel effectors based on SVM, with the information of G+C content, AAC, the 30 N-terminal residues of both effectors and non-effectors, and their evolutionary relationships to other genomes.	http://www.sysbep.org/sieve/
EffectiveT3 [85]	T3SS	NB	AAC, SEQ GC, PHYL	5 species	100e 200ne	EffectiveT3 uses the NB classifier to model the AAC and secondary structures of the N-terminal residues of the T3SS effectors and non-effectors, and makes novel predictions.	http://www.effectors.org/method/effectivet3
T3SS_prediction [86]	T3SS	ANN SVM	N-AAC	P. syringae and others	575e 685ne	T3SS_prediction is composed of ANNs that identify potential T3SS effectors based on the N-terminal amino acid sequences using a sliding-window procedure.	http://gecco.org.chemie.uni-frankfurt.de/T3SS_prediction/T3SS_prediction.html
T3SEdb [124]	T3SS	NB	PP	28 species	100e 100ne	T3SEdb detects novel T3SS effectors based on the hydrophobicity, polarity and β-turn profiles extracted from 100 N-terminal residues of known effectors.	http://effectors.bic.nus.edu.sg/T3SEdb/predict.php
SSE-ACC [77]	T3SS	SVM	AAC, SS and ACC	P. syringae	108e 3424ne	An SVM-based model developed to identify T3SS effectors based on the AAC, SS and ACC of known effectors from the 100 N-terminal residues.	On request
BPBAac [82]	T3SS	SVM	AAC	unclear	154e 308ne	An SVM-based method for identifying T3SS effectors with the information of the AAC features within the 100 N-terminal residues, which are extracted using a Bi-Profile Bayesian (BPB) model*.	http://biocomputer.bio.cuhk.edu.hk/T3DB/BPBAac.php
T3SPs [87]	T3SS	RF	AAC, SS ACC, PP	16 species	283e 313ne	Based on the AAC, SS, ACC and six physico-chemical properties of the 100 N-terminal amino acids, T3SPs uses RFs to predict T3SS effectors.	Unavailable
T3_MM [88]	T3SS	Markov model	Conditional dependence of AAC	Unclear	154e 308ne	T3_MM is constructed to predict novel T3SS effectors based on the distribution of the conditional probability difference between the effectors and non-effectors, which corresponds to the likelihood ratio of the sequence being either motif.	http://biocomputer.bio.cuhk.edu.hk/T3DB/T3_MM.php
T3SEpre [83]	T3SS	SVM	AAC, SS and ACC	Unclear	189e 385ne	T3SEpre is an SVM-based model for predicting T3SS effectors, with position-specific AAC, SS and ACC extracted from known effectors and non-effectors.	http://biocomputer.bio.cuhk.edu.hk/T3DB/T3SEpre.php
BEAN [84]	T3SS	SVM	AAC PSSM	Unclear	154e 1540ne	BEAN uses a PSSM-based k-spaced amino acid pair composition method extracted within N-terminal sequences to compute the feature vectors of known effectors and non-effectors, which are separated by SVM.	Unavailable
BEAN2.0 [97]	T3SS	BLAST SVM	S,D,AAC PSSM	Unclear	243e 486ne	BEAN2.0 consists of three predictors, the first two predictors detect novel T3SS effectors or non-effectors based on similarity in sequence and domains to known effectors, and the third predictor adopts a similar strategy to BEAN but considers longer sequences.	http://systbio.cau.edu.cn/bean/
pEffect [125]	T3SS	BLAST SVM	S, PSSM	43 species	115e 3460ne	pEffect uses two predictors to make predictions of novel T3SS effectors, where the first PSI-BLAST-based predictor is used to identify sequential similarity to known effectors, and the second SVM-based predictor is used when no accepted similarity is available.	http://services.bromberglab.org/peffect/
S4TE [96]	T4SS	BLAST	S, P, D SS and PP	unclear	unclear	S4TE predicts novel T4SS effectors according to 13 features depicting known ones by computing a score summed over the individual absence or presence of features.	http://sate.cirad.fr/
T4EffPred [38]	T4SS	SVM	AAC PSSM	unclear	340e 1132ne	T4EffPred uses SVM to predict T4SS effectors based on the AAC, dipeptide composition, PSSM composition profiles and their auto covariance transformation.	http://bioinfo.tmmu.edu.cn/T4EffPred/
T4SEpre [99]	T4SS	SVM	C-AAC,C-SS and C-ACC	10 species	347e 694ne	T4SEpre is an SVM-based model for predicting T4SS effectors, based on the information of the C-terminal AAC, SS and ACC, extracted from the 100 C-terminal residues of known effectors.	http://biocomputer.bio.cuhk.edu.hk/T4DB/T4SEpre.php


For each type of secreted effector, the prediction methods are listed in chronological order. The machine learning algorithms mentioned here are SVM, support vector machine; NB, naive Bayes; ANN, artificial neural networks; and RF, random forests. ‘Features’ describes the features of the effector sequences considered by the prediction programs: AAC, amino acid composition; SEQ, 30 N-terminal residues; GC, G + C content; PHYL, phylogenetic profile; CONS, evolutionary conservation; SS, secondary structure; ACC, solvent accessibility; PP, physico-chemical properties; S, sequential similarity; D, similar domains; and P, promoter pattern. ‘Training Data’ indicates where the known effectors and non-effectors originate from, and their respective numbers are shown in ‘Size’ (as indicated in publications, e: effector, ne: non-effector). *BPB and Single-Profile Bayesian (SPB) are two methods of feature extraction [82, 126], where BPB considers the features of both positive and negative effectors in the training data sets and SPB considers only the features of positive examples.
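As an illustration of the most common feature in Table 1, the following minimal sketch computes the amino acid composition (AAC) of an N-terminal window. The function name `n_terminal_aac`, the default window of 100 residues and the toy sequence are assumptions made for this example; none of the listed tools is claimed to compute AAC in exactly this way.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def n_terminal_aac(sequence, n_residues=100):
    """Return the fraction of each standard residue in the first n_residues."""
    window = sequence.upper()[:n_residues]
    # Ignore ambiguous or non-standard symbols such as X or B.
    counts = Counter(res for res in window if res in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acids in the N-terminal window")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


# Hypothetical toy sequence, not a real effector.
toy = "MSSLKTTQAASSNPLS"
aac = n_terminal_aac(toy, n_residues=10)
print({aa: frac for aa, frac in aac.items() if frac > 0})
```

The resulting 20-dimensional vector is the kind of input that a classifier such as an SVM or NB model could be trained on.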


Table 2

Warehouse of predicted secreted effector proteins in bacteria

Method	Secretion System	Organism	Size	Description	Accessibility
T3SEdb [124]	T3SS	46 species	504 effectors 572 predictions 13 unknown	In addition to being a predictor of T3SS effectors, T3SEdb also collects the effectors, manually annotates them and enables the assessment of sequence diversity among them.	http://effectors.bic.nus.edu.sg/T3SEdb/index.php
T3DB [127]	T3SS	35 species	Unclear	T3SS-related Database (T3DB) annotates the T3SS-related information, including the apparatus, chaperones, effectors and regulators. It also integrates multiple programs predicting T3SS effectors to provide the user with online prediction.	http://biocomputer.bio.cuhk.edu.hk/T3DB/
BEAN2.0 [97]	T3SS	221 species	1215 effectors	In addition to being a predictor of T3SS effectors, BEAN2.0 also integrates a warehouse of effectors, two functional relationship networks constructed among them and multiple sequence analysis tools, such as the subcellular location predictors.	http://systbio.cau.edu.cn/bean/
SecReT4 [128]	T4SS	289 species	239 effectors 1645 predictions	SecReT4 is a resource that specifically stores both effectors and core components of T4SS. It also provides access to multiple functional analysis tools, such as similarity search tools.	http://db-mml.sjtu.edu.cn/SecReT4/
SecReT6 [129]	T6SS	240 species	92 effectors 1248 predictions	Similar to SecReT4, SecReT6 is constructed as a specific resource for storing data on the T6SS, cognate effectors and immunity proteins. Multiple functional analysis tools are also supported, such as the detection of T6SS gene clusters.	http://db-mml.sjtu.edu.cn/SecReT6/index.php
Effective [107]	Multiple systems	587 species	421 774 predictions	Effective is a warehouse of putative effectors predicted by two integrated components, which are based on either the detection of Sec and T3SS pathway signals or the identification of eukaryotic-like domains.	Unavailable
EffectiveDB [130]	Multiple systems	1677 species	Unclear	As the updated version of Effective, EffectiveDB enables the prediction of specific T3SS and T4SS effectors or more general secreted effectors based on the identification of secretion signals, binding domains of T3SS chaperones or eukaryotic-like domains. EffectiveDB also supports the prediction of T3SS, T4SS and T6SS.	http://effectors.org/


For each type of secreted effector, the corresponding warehouses are listed in chronological order. The number of ‘Organisms’ and corresponding ‘Size’ are taken from the publications, where ‘effectors’ indicate experimentally verified ones, ‘predictions’ refer to hypothetical ones and ‘unknown’ indicates those with incomplete information.


Benchmarking the prediction methods based on curated data sets

The machine learning approaches behave like a black box: they do not imitate the unknown biological mechanism but model the statistical characteristics of the effectors in their training data sets, and each algorithm has its own weaknesses. SVMs separate effectors from non-effectors by a hyperplane, which depends strongly on the dimension of the feature space and requires a large amount of computation. NB classifiers assume conditional independence between features, which limits their use for broad prediction of closely related secreted effectors. ANNs have many parameters in each layer of neurons, which makes it difficult for users to inspect how a prediction is reached.
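For reference, the NB independence assumption and the SVM hyperplane mentioned above can be written in their standard textbook form. The symbols (class c, features x_1 to x_n, weight vector w and offset b) are generic notation introduced here, not taken from any of the cited tools, and the SVM is shown in its linear form only.

```latex
% Naive Bayes: features are assumed conditionally independent given the class c
P(c \mid x_1, \dots, x_n) \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c)

% Linear SVM: a feature vector x is labelled by the side of the hyperplane it falls on
f(\mathbf{x}) = \operatorname{sign}\!\left(\mathbf{w}^{\top}\mathbf{x} + b\right)
```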

To assess how well the prediction methods identify secreted effectors, we benchmark, whenever possible, the methods in Table 1, and treat each algorithm within a detection tool as a separate method. We skip the assessment of T6SS effector prediction, as An et al.’s work has reported the limitations of the available approaches for identifying T6SS effectors and no more effective methods are available.

The benchmarked methods are BPBAac [82], EffectiveT3-SEL [85], EffectiveT3-SEN [85], pEffect [125], SIEVE [11], T3_MM [88], T3SS_prediction-ANN [86] and T3SS_prediction-SVM [86] for predicting T3SS effectors, and T4EffPred [38], T4SEprebpbAac [99], T4SEpreJoint [99] and T4SEprepsAac [99] for predicting T4SS effectors. The updated predictive model of EffectiveT3 makes predictions under two score cutoffs, ‘selective’ and ‘sensitive’. The ‘selective’ setting is the default minimal score of 0.9999 from the naive Bayes classifier for the class of secreted effectors, whereas the ‘sensitive’ setting corresponds to a threshold score of 0.95; we therefore assess them separately as EffectiveT3-SEL and EffectiveT3-SEN [85]. T3SS_prediction is an ANN-based predictive model that can also make predictions with an SVM, so we assess the two separately as T3SS_prediction-ANN and T3SS_prediction-SVM [86]. T4SEpre constructs predictive models from three combinations of features: the BPB position-specific AAC features, the SPB position-specific and sequence-based AAC features, and the joint features of the position-specific AAC, SS and ACC, corresponding to T4SEprebpbAac, T4SEprepsAac and T4SEpreJoint, respectively.
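The effect of the two EffectiveT3 cutoffs can be sketched as follows. The cutoff values are those quoted in the text; the helper `call_effector`, the use of a greater-than-or-equal comparison and the example scores are assumptions for illustration and do not reproduce the tool's own code.

```python
SELECTIVE_CUTOFF = 0.9999  # 'selective' setting quoted in the text
SENSITIVE_CUTOFF = 0.95    # 'sensitive' setting quoted in the text


def call_effector(nb_score, cutoff):
    """Return True if the classifier score reaches the chosen cutoff.

    Whether the real tool uses >= or > at the boundary is an assumption here.
    """
    return nb_score >= cutoff


# Hypothetical scores for three proteins (illustrative only).
scores = {"protA": 0.99995, "protB": 0.97, "protC": 0.40}
for name, score in scores.items():
    sel = call_effector(score, SELECTIVE_CUTOFF)
    sen = call_effector(score, SENSITIVE_CUTOFF)
    print(name, "SEL:", sel, "SEN:", sen)
```

A protein such as the hypothetical `protB` is called an effector only under the 'sensitive' setting, which is why the two settings can differ in sensitivity and specificity.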

Construction of the testing data sets

We collect both positive and negative examples of secreted effectors to construct a comprehensive testing data set for the prediction methods.

The positives are known T3SS and T4SS effectors, collected from the databases listed in Table 2 and, whenever possible, from the positive training data sets used to develop the original methods. They are referred to as the T3P and T4P data sets, respectively. We search the NCBI (http://www.ncbi.nlm.nih.gov/protein/) and UniProt (http://www.uniprot.org) databases for the complete effector sequences, only parts of which are used by some prediction methods [82, 88, 99]. The negatives are known non-effectors or artificial sequences, collected from the negative training data sets of the benchmarked methods. They are referred to as the T3N and T4N data sets, respectively. We also assess the ability of each method to distinguish one type of effector from the other types of secreted effectors. Hence, for the prediction of T3SS effectors, the negative examples are collected from T3N, T4P and T6P, a data set composed of known T6SS effectors, whereas for the prediction of T4SS effectors, the negative examples are collected from T4N, T3P and T6P, as shown in Table 3.

Table 3

Positives and negatives in the testing data set of prediction methods

Methods	Positives	Negatives
Prediction methods of T3SS effectors
	T3P	T3N	T4P	T6P
BPBAac	230	1415	371	65
EffectiveT3-SEL	230	1414	371	65
EffectiveT3-SEN	230	1414	371	65
pEffect	230	1415	371	65
SIEVE	230	1415	371	65
T3_MM	230	1415	371	65
T3SS_prediction-ANN	230	1414	371	65
T3SS_prediction-SVM	230	1414	371	65
Prediction methods of T4SS effectors
	T4P	T4N	T3P	T6P
T4EffPred	371	1567	230	65
T4SEprebpbAac	371	1567	230	65
T4SEpreJoint	371	1567	230	65
T4SEprepsAac	371	1567	230	65


CD-HIT is used to cluster and remove redundant sequences in each data set, with a sequence identity cutoff of 0.3 [131]. Removing protein sequences with high sequence similarity helps to exclude homology-driven dependencies within the data sets [11, 38, 85]. After filtering, T3P, T4P, T6P, T3N and T4N contain 230, 371, 65, 1415 and 1567 sequences, respectively, and the corresponding numbers of positive and negative examples for each method are shown in Table 3. These curated data sets are available from http://bioinfo.tmmu.edu.cn/BenchmarkPSE/.

Result

Table 4 and Figure 3 show the performance of each prediction method for predicting T3SS or T4SS effectors based on the curated data sets. The evaluation parameters of sensitivity, specificity and MCC are used to assess the performance, with the best values highlighted in bold.
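The three evaluation parameters can be computed from a confusion matrix as in the following sketch, which assumes the standard definitions of sensitivity, specificity and the Matthews correlation coefficient (MCC). The function names and the example counts are hypothetical and are not taken from the benchmark.

```python
import math


def sensitivity(tp, fn):
    """Fraction of true effectors that are predicted as effectors."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of non-effectors that are correctly rejected."""
    return tn / (tn + fp)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical counts: 10 effectors and 90 non-effectors.
tp, fn, tn, fp = 8, 2, 85, 5
print(round(sensitivity(tp, fn), 3))
print(round(specificity(tn, fp), 3))
print(round(mcc(tp, tn, fp, fn), 3))
```

MCC uses all four cells of the confusion matrix, so a method cannot score well on it merely by rejecting almost everything in a data set dominated by negatives.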

Table 4

Performance of the prediction methods based on curated data sets

Methods	Sensitivity	Specificity	MCC
Prediction methods of T3SS effectors
BPBAac	0.6	0.991	0.708
EffectiveT3-SEL	0.652	0.901	0.472
EffectiveT3-SEN	0.73	0.837	0.426
pEffect	0.883	0.868	0.572
SIEVE	0.478	0.983	0.573
T3_MM	0.822	0.835	0.484
T3SS_prediction-ANN	0.77	0.903	0.559
T3SS_prediction-SVM	0.7	0.985	0.75
Prediction methods of T4SS effectors
T4EffPred	0.919	0.943	0.802
T4SEprebpbAac	0.911	0.975	0.874
T4SEpreJoint	0.21	0.995	0.392
T4SEprepsAac	0.892	0.99	0.905


The corresponding highest values of each parameter are highlighted in bold.


Figure 3

Performance of the prediction methods based on the curated data sets.

For the prediction of T3SS effectors, T3SS_prediction-SVM achieved the highest MCC value (0.75), suggesting that it is the best prediction method on the curated data sets. BPBAac also performed well: it achieved a high MCC value (0.708) and the highest specificity (0.991), meaning that it correctly rejected the largest proportion of non-effectors. SIEVE had the lowest sensitivity (0.478), suggesting that it is the least sensitive to T3SS effectors on the curated data set. In contrast, pEffect had the highest sensitivity (0.883) and thus correctly identified the largest proportion of effectors. Overall, EffectiveT3-SEL, EffectiveT3-SEN and T3_MM achieved moderate performance.

For the prediction of T4SS effectors, T4SEprepsAac achieved the highest MCC value (0.905), suggesting that it is the best predictor of T4SS effectors on the curated data sets. T4EffPred also performed reasonably well, as it had the highest sensitivity (0.919) and thus correctly identified the largest proportion of T4SS effectors. Overall, T4SEpreJoint performed the worst, with the lowest MCC value (0.392) and the lowest sensitivity (0.21).
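
The three evaluation parameters reported in Table 4 can be derived from the counts of true positives, false positives, true negatives and false negatives. The following is a minimal sketch using the standard definitions of sensitivity, specificity and MCC; the function name and the example counts are hypothetical and are not taken from the benchmark itself.

```python
import math


def confusion_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, MCC) from confusion-matrix counts."""
    # Sensitivity: fraction of true effectors that were predicted as effectors.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of non-effectors that were correctly rejected.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Matthews correlation coefficient; defined as 0 when the denominator is 0.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc


# Hypothetical counts: 100 effectors and 500 non-effectors in a test set.
sn, sp, mcc = confusion_metrics(tp=60, fp=5, tn=495, fn=40)
print(f"sensitivity={sn:.3f} specificity={sp:.3f} MCC={mcc:.3f}")
```

Because MCC uses all four counts, it is less easily inflated than sensitivity or specificity alone when effectors and non-effectors are unevenly represented, which is one reason to rank the methods by MCC.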

Discussion

Computational identification methods for bacterial effector proteins have been developed during the past decades. The core is to capture the statistical ‘rules’ of known effectors and to make novel predictions based on an acceptable similarity to the extracted characteristics.

In this article, we focus on reporting the progress of in silico identification of bacterial effectors secreted via T3SS, T4SS and T6SS and on the features of effectors that help to discriminate them from other proteins. Most approaches are effective for identifying certain effectors, but they generalize poorly to broader predictions because the general features of the effectors are derived from a few particularly well-characterized bacterial species. Different methods focus on particular features of the effector sequences, including the sequential, structural and physicochemical characteristics, or the regulatory elements, secretion signals, cognate chaperones, solvent accessibility and conservation profiles of the effectors.

Detecting the AAC biases in known effectors is a straightforward and prevalent way to identify novel secreted proteins. Significant enrichment or depletion of certain amino acids has been observed in effector sequences both globally [85, 88] and at specific positions [82, 87]. Global AAC counts the occurrences of each amino acid and divides by the length of the full sequence or of the N-terminal fragment, while position-specific AAC information focuses on the bias of certain amino acids at specific positions. Both types of bias show discriminative power to identify novel effectors.
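
The two kinds of AAC information described above can be illustrated with a short sketch. It assumes plain one-letter protein sequences and the 20 standard amino acids; the function names, the `n_terminal` parameter and the toy sequences are hypothetical and do not reproduce any specific published tool.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def global_aac(sequence, n_terminal=None):
    """Frequency of each amino acid over the full sequence,
    or over the first `n_terminal` residues if that is given."""
    region = sequence[:n_terminal] if n_terminal else sequence
    counts = Counter(region)
    length = len(region)
    return {aa: (counts[aa] / length if length else 0.0) for aa in AMINO_ACIDS}


def position_specific_aac(sequences, position):
    """Frequency of each amino acid at one 0-based position
    across a set of sequences (shorter sequences are skipped)."""
    residues = [s[position] for s in sequences if len(s) > position]
    counts = Counter(residues)
    total = len(residues)
    return {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}


# Toy sequences, for illustration only.
toy_effectors = ["MSSAKK", "MTSAKL", "MSPLKK"]
print(global_aac(toy_effectors[0], n_terminal=4)["S"])
print(position_specific_aac(toy_effectors, 1)["S"])
```

Comparing such frequency vectors between known effectors and non-effectors is what exposes the enrichment or depletion of individual amino acids.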

The sequence-general and position-specific information are combined in selected studies [11, 38], where the position-specific preferences of amino acids are encoded into PSSM profiles. PSSM profiles commonly represent evolutionary conservation within multiple sequences, which is helpful for the discrimination of effectors [11, 38, 87]. In particular, T4EffPred calculates the PSSM composition profiles and the PSSM_AC profiles, which represent the auto covariance transformation of the PSSM [38]. Its performance test shows that these two classes of PSSM-based features are more helpful for identifying novel T4SS effectors than AAC, as PSSM profiles may retain conserved signals shared by functionally related proteins, which may fade from individual sequences as they evolve. This argument is consistent with how PSSM profiles are constructed: they are obtained from a multiple sequence alignment rather than from a single sequence, which carries relatively less information. Moreover, sequence-derived characteristics such as AAC and PSSM can generally be calculated for effectors from diverse organisms; they can be extracted from any effector, whether in a sequence-general or a position-specific manner. In contrast, prediction methods based on the typical signal sequence of certain secretion systems are limited by how well that signal is conserved across species.
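
As an illustration of an auto covariance transformation, the sketch below turns a PSSM of variable length into a fixed-length feature vector. It uses a common formulation, AC(j, lag) = (1 / (L - lag)) * sum over i of (P[i][j] - mean_j) * (P[i + lag][j] - mean_j), which is an assumption here and may differ in detail from the exact PSSM_AC definition used by T4EffPred. The function name and the toy matrix are hypothetical.

```python
def pssm_auto_covariance(pssm, max_lag):
    """Auto covariance features of a PSSM.

    pssm    : list of rows, one per residue, each with the same number
              of columns (20 for a real PSSM).
    max_lag : largest distance between residue positions to consider.
    Returns a flat list of max_lag * n_columns values.
    """
    length = len(pssm)
    if length == 0 or max_lag >= length:
        raise ValueError("max_lag must be smaller than the sequence length")
    n_cols = len(pssm[0])
    # Column means over the whole sequence.
    means = [sum(row[j] for row in pssm) / length for j in range(n_cols)]
    features = []
    for lag in range(1, max_lag + 1):
        for j in range(n_cols):
            total = sum(
                (pssm[i][j] - means[j]) * (pssm[i + lag][j] - means[j])
                for i in range(length - lag)
            )
            features.append(total / (length - lag))
    return features


# Toy 4-residue, 2-column matrix (a real PSSM has 20 columns).
toy_pssm = [[1, 2], [3, 2], [1, 2], [3, 2]]
print(pssm_auto_covariance(toy_pssm, max_lag=2))
```

The output length depends only on `max_lag` and the number of columns, not on the sequence length, which is what allows proteins of different lengths to be fed to the same classifier.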

Feature combination is a frequently used approach for improving classification performance [38, 69, 83, 88, 96]; it combines not only the sequence-general and position-specific information but also other diverse features describing the effectors. This is consistent with the argument that combining the secondary structure and relative solvent accessibility could improve the predictive performance of identifying novel T3SS effectors [77].

Several machine learning approaches have been applied to the identification of bacterial effectors and have shown good performance. The choice of method depends on the specific case, although all of them recast the prediction of novel secreted effectors as the problem of discriminating effectors from non-effectors. We benchmarked the available prediction methods for T3SS and T4SS effectors on our curated data sets. On these data sets, T3SS_prediction-SVM was the best T3SS effector predictor and T4SEprepsAac was the best T4SS effector predictor. T4EffPred performed worse than expected, as it makes novel predictions based on the PSSM profiles calculated by new models.

No prediction method achieved overwhelming success in our benchmarking. We therefore suggest combining the methods, so that an input sequence is classified according to a voting scheme over the individual algorithms. As pioneers, Burstein et al. and Chen et al. selected NB, Bayesian networks, SVM and ANN to construct a voting classifier to discriminate T4SS effectors in the genomes of L. pneumophila and Coxiella burnetii, respectively [69, 70].
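
A voting scheme of this kind can be sketched in a few lines. The example below assumes that each individual classifier has already produced a yes/no call for one protein; the function name, the `min_votes` parameter and the example calls are hypothetical, and the sketch is not the implementation used in [69, 70].

```python
def majority_vote(predictions, min_votes=None):
    """Combine binary calls from several classifiers.

    predictions : dict mapping classifier name -> True (effector) / False.
    min_votes   : number of positive calls required; defaults to a
                  strict majority of the classifiers.
    """
    votes = sum(1 for call in predictions.values() if call)
    needed = min_votes if min_votes is not None else len(predictions) // 2 + 1
    return votes >= needed


# Hypothetical calls for one candidate protein.
calls = {"NB": True, "BayesNet": False, "SVM": True, "ANN": True}
print(majority_vote(calls))               # strict majority of four classifiers
print(majority_vote(calls, min_votes=4))  # require a unanimous call
```

Raising `min_votes` trades sensitivity for specificity, so the threshold can be tuned to how many candidates can realistically be validated experimentally.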

A sensible construction of training data sets, especially of the negatives, will greatly improve the prediction performance; taxonomic, characteristic, numerical and functional biases should be avoided. For example, to avoid taxonomic bias, non-effectors from many pathogens should be collected [85]. To reduce the bias toward the characteristics of the N-terminal signal sequence in T3SS effectors and other signals, non-effectors with Sec-dependent translocation signals, cytoplasmic proteins and proteins exported by unknown pathways are included in the negative training data set [86]. A sensible size of the training data set and a proper ratio between positives and negatives lend higher statistical significance to the derived features. The growing number of credibly validated effectors should also contribute substantially to the prediction performance.
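
One simple way to control both the composition and the size of the negative set is to draw the same number of non-effectors from each category until a chosen negative-to-positive ratio is reached. The sketch below illustrates that idea only; the category names, the default ratio of 1.0 and the function name are assumptions, and the source does not prescribe a specific ratio or sampling procedure.

```python
import random


def build_negative_set(n_positives, negatives_by_category, ratio=1.0, seed=0):
    """Draw an equal number of negatives from every category.

    n_positives           : number of positive (effector) examples.
    negatives_by_category : dict mapping category name -> list of protein IDs.
    ratio                 : desired negatives per positive (assumed value).
    seed                  : seed for reproducible sampling.
    """
    if not negatives_by_category:
        raise ValueError("at least one negative category is required")
    rng = random.Random(seed)
    target = int(round(n_positives * ratio))
    categories = sorted(negatives_by_category)
    per_category = target // len(categories)
    selected = []
    for name in categories:
        pool = negatives_by_category[name]
        # Never request more proteins than the category contains.
        selected.extend(rng.sample(pool, min(per_category, len(pool))))
    return selected


# Hypothetical pools of non-effector protein IDs.
pools = {
    "sec_dependent": [f"sec_{i}" for i in range(10)],
    "cytoplasmic": [f"cyt_{i}" for i in range(10)],
    "unknown_pathway": [f"unk_{i}" for i in range(10)],
}
negatives = build_negative_set(n_positives=6, negatives_by_category=pools)
print(len(negatives))
```

The same stratification idea could be applied across source organisms to limit taxonomic bias.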

Given the limited accuracy of individual prediction methods, subsequent filtering or further ranking of predictions before experimental validation is highly recommended. To achieve reliable and useful predictions, more informative and discriminative features are needed in addition to more comprehensive training data. In previous work, the AAC, signal information, sequential, physicochemical, evolutionary conservation, eukaryotic domain and similar attributes of the secreted effector proteins have been considered by multiple prediction methods. We propose that information on PPI between secreted effectors and host cells, the evolutionary relationships between effectors, GO annotations, pathways and 3-D structural information of effectors should be incorporated in the future. The discovery of new biomarkers and advances in metagenomics should be helpful as well.

As mentioned above, we envision a future in which as many features as possible are considered to discriminate secreted effectors from non-effectors. However, more reliable and efficient methods of feature extraction and representation, which encode the proteins in informative ways, are still needed. For example, SPB and BPB are two ways to extract features, and a recursive feature elimination policy has also been applied to select important features. We believe that advances in this kind of method will also benefit the identification of secreted effectors in bacteria. The involvement of more promising algorithms, such as deep learning and logistic regression, may further strengthen the ability and efficiency of prediction methods for secreted effectors in bacteria.

Genome sequencing has resulted in an explosion of knowledge about bacterial secretion systems. We know that gram-negative bacteria use multiple secretion systems to translocate effectors extracellularly, especially the T3SS, T4SS and T6SS, via which the effectors are secreted directly into the host cells. A clearer understanding of the mechanisms behind the molecular processes, such as effector targeting and secretion, is still beyond our current capacity. More comprehensive observation and better interpretation of these molecular processes will also promote a higher level of success in predicting secreted effectors in bacteria.

Furthermore, many of the in silico methods are implemented with existing software packages: SVMs are run with LIBSVM [132], Gist [133] or the WEKA toolbox [134], which also supports the NB calculation in EffectiveT3 [85]; ANNs are run in Matlab [135]; RFs are run with the RF package in R; and HMMs are run with HMMER [136]. The availability of these computational tools makes the methods easy to access and use, which has led to the widespread and well-characterized application of in silico methods to the prediction of bacterial effectors.

S4TE, one method that was not benchmarked, predicts T4SS effectors based on the summed score of the presence of features that discriminate known effectors [96]. This type of prediction approach is complementary to machine learning algorithms and statistical approaches. It is especially practical when the known samples are insufficient to construct both the positive and negative training data sets for machine learning algorithms. It may also avoid a large amount of the mathematical computation required by statistical approaches.
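
The idea of scoring a protein by the summed presence of discriminating features can be sketched as follows. The feature names, weights and threshold below are hypothetical placeholders chosen for illustration; they are not the features or values used by S4TE.

```python
def summed_feature_score(features_present, weights):
    """Sum the weights of all features detected in a protein.
    Features without a defined weight contribute nothing."""
    return sum(weights[name] for name in features_present if name in weights)


def is_candidate(features_present, weights, threshold):
    """Flag a protein as an effector candidate if its score reaches the threshold."""
    return summed_feature_score(features_present, weights) >= threshold


# Hypothetical weights for illustration only.
weights = {
    "eukaryotic_like_domain": 3,
    "c_terminal_signal": 2,
    "promoter_motif": 2,
    "coiled_coil": 1,
}
protein_features = {"eukaryotic_like_domain", "coiled_coil"}
print(summed_feature_score(protein_features, weights))
print(is_candidate(protein_features, weights, threshold=5))
```

Such a rule-based scheme needs no negative training set, because the weights and the threshold are set from prior knowledge of known effectors rather than learned from labelled examples.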

However, none of these methods is exhaustive or generally applicable. Homology-based approaches can only identify effectors that are close members of known effector families, and these are mostly specific to, and hence limited to, certain well-known bacterial species. This type of method may also perform poorly on novel effectors, which evolve quickly and therefore share low sequence similarity with known ones. The success of the T346hunter software depends on the high conservation of the secretion apparatus [61]. Although the promoter search is an efficient method for identifying downstream genes encoding effectors, it is limited in identifying true effectors in P. syringae, either because true effectors may not be preceded by known promoters or because the ORF hits downstream of the promoter may have relatively poor values, so that they are hardly detected or encode unknown proteins [76, 77, 81]. Moreover, specific promoter information is available for only a small group of bacterial pathogens. The prediction methods based on the signal sequence located in the effectors have demonstrated effective discriminative power. However, a protein containing a signal sequence is not necessarily a secreted effector, as some proteins in bacteria without protein-translocation secretion systems may also have signal sequences [83, 85, 86, 99]. Hence, the discriminative power of signal-based prediction methods should not be overstated. Amino acid composition biases, whether in entire effectors or in particular regions, have been observed in certain species and provide limited discrimination in other bacterial genomes [11]. Studies have successfully identified a number of T6SS effectors based on the physical characteristics of known effectors [44] and an N-terminal sequence marker [55], but neither method could identify the T6SS effector TseL in V. cholerae [65].

On the other hand, although computational prediction methods identify bacterial effectors rapidly, efficiently and automatically, FP predictions, in which the predictive model labels a non-effector as an effector candidate, are a persistent flaw of machine learning approaches. For example, T3SS secretion signals have been predicted in gram-positive bacteria and yeasts [83], and cytoplasmic proteins have been predicted to have a T3SS export signal [86]. Hence, it should be remembered that the machine learning algorithms are trained on sequence features derived from a small number of known effectors from a limited set of species and cannot be expected to generalize to fully accurate effector discovery in other bacterial species. In addition, the structural motifs considered in [77, 83, 85] are themselves based on the predictions of other methods, such as PSIPRED [79]. Although using a greater number of features improves the power to identify effectors, we cannot rule out FPs introduced by these additional prediction programs.
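
The practical weight of FPs becomes clearer when sensitivity and specificity are projected onto a whole proteome, where non-effectors vastly outnumber effectors. The sketch below does this arithmetic; the proteome size (4000 proteins) and the number of true effectors (40) are hypothetical, while the sensitivity and specificity are those reported for T3SS_prediction-SVM in Table 4.

```python
def expected_screen_outcome(n_proteins, n_effectors, sensitivity, specificity):
    """Expected true positives, false positives and precision of a
    genome-wide screen, given a predictor's sensitivity and specificity."""
    n_non_effectors = n_proteins - n_effectors
    true_pos = sensitivity * n_effectors
    false_pos = (1.0 - specificity) * n_non_effectors
    total_pos = true_pos + false_pos
    precision = true_pos / total_pos if total_pos else 0.0
    return true_pos, false_pos, precision


# Hypothetical proteome; sensitivity and specificity taken from Table 4.
tp, fp, precision = expected_screen_outcome(
    n_proteins=4000, n_effectors=40, sensitivity=0.7, specificity=0.985
)
print(f"expected TP={tp:.1f}, expected FP={fp:.1f}, precision={precision:.2f}")
```

Under these assumptions, even a specificity of 0.985 yields more expected FPs than true effectors among the candidates, which is why downstream filtering or ranking before experimental validation matters.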

Furthermore, studies identifying T3SS effectors are far more abundant than those identifying T4SS effectors, and the imbalance is even greater for studies predicting T6SS effectors. Methods that have demonstrated good performance for identifying T3SS effectors lose competitiveness when identifying T4SS and T6SS effectors. This may be owing to the distinctive features of the three families of effectors and to the small data set of validated T4SS effectors and the even smaller pool of known T6SS effectors, which directly affect the accuracy of machine learning approaches that depend on the number of authentic negative and positive examples. Currently, the most effective way to identify new T4SS and T6SS effectors is to validate predicted candidates according to the common features of known effectors encoded by the same or closely related bacteria [69, 70]. However, SecRet4, for example, contains <300 experimentally validated effectors among the 1884 effectors listed in the database. This demonstrates how slowly biological techniques identify effectors compared with computational methods. Hence, in silico methods must improve their overall performance, including reliability, speed, efficiency and generality, which is the responsibility of the community.

Wang et al. reported the C-terminal preference of AAC, SS and ACC in T4SS effectors, which is quite similar to that of the N-terminal region in T3SS effectors [82, 99]. If this commonality holds generally, should the prediction methods developed for the N-terminal signal region of T3SS effectors show some power in predicting T4SS effectors? Xu et al. reported an HMM-based method to predict and evaluate putative T3SS and T4SS effectors with EPIYA motifs. Their results showed that the predicted T3SS effectors were scattered across a broad range of biological species, but that this motif was not widely distributed in T4SS secreted proteins [89], which means that this characteristic motif may not successfully discriminate T4SS effectors from T3SS effectors [38]. Hence, it is unclear how to use this type of commonality to mine more T4SS effectors. In one study to identify T6SS effectors, a conserved domain that functions similarly to the T3SS chaperone proteins was used to identify the associated downstream effectors rather than relying on the diverse sequential, structural and functional features of effector sequences [65]. This conserved domain has several features that are highly similar to T3SS chaperones, such as a low molecular weight and low pI values. These similarities are expected to help the future identification of T6SS effectors and may generalize to methods for predicting T3SS effectors. Li et al. reported that the immunity proteins accompanying T6SS effectors, which protect the bacterial cell from self-intoxication by the secreted toxin effectors, are more stable than their cognate effectors. This result may suggest a way to predict T6SS effectors by considering the structural information of immunity proteins [137].

Inspired by the observations and discussions above, we wish to present the current progress toward the development of our new prediction method for T6SS effectors and of a more general, powerful and large-scale method that is able to discriminate the three families of secreted effectors from a large amount of genetic data. We cannot help but envision a ‘super-powerful’ prediction method for bacterial effectors that combines the advantages of the available methods, enabling it to predict effectors secreted by any secretion system in most cases. However, such a method would lack the sensitivity to recognize one particular type of effector that is secreted by a specific secretion system. Secreted effectors can be identified based on their similarity to eukaryotic domains. The phylogenetic profiles between effectors and host cells, or among effectors, may help us to interpret the specific evolutionary processes. These valuable questions remain open and may provide insights into the nature of host–bacteria interactions.

Key Points

Bacterial secreted effectors play vital roles in pathogen–host interactions. Computational approaches have accelerated the process of identifying secreted effector proteins in bacteria.
We first reviewed the informative features of known effectors that contribute to their identification, including sequential, structural, genomic and other information. In doing so, we attempted to highlight the biological background of these informative features.
We then summarized the available resources and examined the strengths and weaknesses of multiple types of machine learning algorithms and statistical methods for predicting secreted effectors. This is demonstrated by benchmarking the available methods on our curated data sets.
We propose a future in which the in silico identification of secreted effectors will be much more reliable and beneficial. This may be achieved through the construction of more balanced data sets of known effectors without taxonomic, characteristic, numerical and functional biases; more informative and discriminative features and more efficient methods of feature extraction/representation; improved reliability of the bioinformatic prediction tools; and a better interpretation of the mechanisms behind the molecular pathogen–host interactions.

Funding

This work was supported by the National Natural Science Foundation of China (Grant No. 31571352).

Cong Zeng received her PhD in Computer Science in 2015 from University of Paris-Sud, France. She now works as a lecturer in the Bioinformatics Center at the Third Military Medical University (TMMU), China. Her research interests include machine learning and data mining.

Lingyun Zou received his PhD in 2008 from National University of Defense Technology, China. He is an Associate Professor and PI at the Bioinformatics Center of TMMU. His research interests are machine learning, pattern recognition, omics-data mining and complex diseases. He is currently in charge of two projects funded by the National Natural Science Foundation of China, ‘The study of feature mining, computational prediction and experimental validation of bacterial effectors secreted via type IV secretion systems (Grant No. 31301097)’ and ‘Feature mining and machine learning based study of functional classification, computational prediction and experimental validation of bacterial effectors secreted via type VI secretion systems (Grant No. 31571352)’.

References

Chang, Desveaux, Creason AL. The ABCs and 123s of bacterial secretion systems in plant pathogenesis. Annu Rev Phytopathol 2014; 317–
