Abstract

Bacterial pathogens secrete numerous effector proteins via six secretion systems, type I to type VI secretion systems, to adapt to new environments or to promote virulence by bacterium–host interactions. Many computational approaches have been used to identify effector proteins ahead of experimental verification because they avoid laborious biological procedures and are genome scale, automated and highly efficient. Prevalent examples include machine learning methods and statistical techniques. In this article, we summarize the computational progress toward predicting secreted effector proteins in bacteria, beginning with an introduction to the features that are used to discriminate effectors from non-effectors. The mechanisms, contributions and deficiencies of previously developed detection tools are presented, and the tools are further benchmarked on a curated testing data set. According to the results of benchmarking, potential improvements of the prediction performance are discussed, which include (1) more informative features for discriminating effectors from non-effectors; (2) the construction of comprehensive training data sets for the machine learning algorithms; (3) the advancement of reliable prediction methods and (4) a better interpretation of the mechanisms behind the molecular processes. The future of in silico identification of bacterial secreted effectors includes both opportunities and challenges.

Introduction

Bacteria have adopted diverse secretion systems to translocate numerous effector proteins into the extracellular environment to interact with and defeat the host cells [1].

The investigation of secretion systems has attracted substantial attention to the phylogenetic distribution, gene content, organization and evolution. However, in addition to the identification of secretion systems, there remains a large number of challenging questions to answer to understand the pathogenesis systems of bacterial pathogens. Examples are the studies of the mechanisms of recognition and targeting effectors by secretion systems. Specifically, the identification of secreted effector proteins, the arch-criminal of bacterial virulence, and the interpretation of their pathogenesis mechanisms are the critical pioneering steps toward understanding of molecular bacterium-host interactions.

In this section, we briefly introduce the background of secretion systems and effectors and their identification. The ‘Features for identifying secreted effectors’ section discusses the features of effector proteins that can be used for their identification, and in silico prediction methods of bacterial effectors are elaborated in the ‘In silico identification methods’ section. Available online resources are collected as tables in the ‘Available online resources’ section.

Secretion systems and effectors

Gram-positive bacteria possess a single cytoplasmic membrane surrounded by a thick cell wall layer, while gram-negative bacteria possess a double-membrane system consisting of a cytoplasmic (inner) membrane and an outer membrane.

We define ‘secretion’ as the extracellular release of proteins via secretion systems within this article, although secreted proteins can also be cell-surface localized or be part of cell-surface appendages, as reported in [2]. The secreted proteins of gram-negative bacteria that are released extracellularly are also referred to as bacterial ‘effectors’.

Several secretion systems are responsible for the translocation of proteins across the bacterial cytoplasmic membrane, such as the Sec (general secretion pathway), SRP (signal-recognition particle), Tat (twin-arginine translocation), FEA (flagella export apparatus) and holin (hole forming) pathways [2, 3]. In addition, six additional types of secretion systems in gram-negative bacteria are also known and act as protein translocation systems across the bacterial outer membrane [2, 4]. They follow the prevalent nomenclature from type I secretion systems to type VI secretion systems (T1SS–T6SS). There is controversy about these systems with respect to the mechanism of effector molecule recognition and targeting of the secretion systems, namely, the recognition of specific effectors from all other proteins not secreted via a certain secretion system. For these secretion systems multiple excellent reviews are recommended [1, 5–8].

To translocate/secrete proteins extracellularly, gram-negative bacteria principally use two strategies. In the first, cytoplasmic membrane protein translocation systems, i.e. the ubiquitous Sec, Tat and others, translocate proteins to the periplasm of gram-negative bacterial cells. Subsequently, the proteins are engaged by an additional secretion pathway to be secreted across the outer membrane. The additional secretion pathways involved in this two-step translocation are referred to as the Sec-dependent secretion systems. The proteins secreted via Sec-dependent secretion systems harbor an N-terminal signal that is first recognized by the Sec machinery and subsequently removed by a peptidase, with the remainder released into the periplasm and translocated across the outer membrane with the assistance of accessory proteins. The Sec-dependent systems are T2SS and T5SS [2, 9, 10]. Additionally, the Sec-dependent signal contains no specific sequence motif for secretion but rather a pattern of charged residues and a hydrophobic domain that enable the recognition of secreted substrates [11]. Alternatively, multiple secretion systems can secrete proteins directly across both membranes to the exterior of the bacterial cell in a single step, bypassing the periplasmic compartment. These one-step secretion systems are classified as the Sec-independent T1SS, T3SS, T4SS and T6SS, owing to their independence of the Sec system and the absence of such a signal [2, 10].

Specifically, T1SS contain only three proteins: an outer membrane protein and two cytoplasmic membrane proteins, namely an ATP-binding cassette (ABC) protein and a membrane fusion protein. The secretion signal is usually located at the C-terminal end of the proteins secreted via T1SS and is specifically recognized by the ABC protein. The initial interaction between the C-terminal secretion signal and the ABC protein triggers the sequential assembly of the secretion complex by generating interactions between the three component proteins [9]. The effectors secreted via T1SS are translocated into the extracellular milieu, whereas the three remaining Sec-independent secretion systems all make direct contact, of different types, with the membranes of the recipient cell and are dedicated to translocating effectors that influence the physiology of host cells [1]; these three systems have attracted the majority of our attention.

T3SS are composed of approximately 20 proteins, most of which are located in the inner membrane [8]. Furthermore, T3SS in gram-negative bacteria construct a ‘bridge’ connecting the pathogen and host cells [12, 13]. This needle-like structure of the secretion machinery spans the inner and outer bacterial membranes and supports the injection of protein effectors directly into the cytoplasm of the host cell [12, 14, 15]. Two N-terminal domains of T3SS effectors are reported to be essential for secretion, although their exact boundary is not completely agreed [8, 13, 16–19]. One proposal suggests that residues 1–25 contain a region specialized as a secretion signal that targets the effectors to the secretion apparatus, and residues 25–100 correspond to a chaperone binding domain. The translocation of many effectors depends on the chaperone-binding domains, where the chaperones are considered to stabilize and maintain the effectors in an unfolded state before secretion [16, 20, 21].

T4SS are specialized protein complexes that are used by many bacterial pathogens to deliver type IV effector proteins directly into the host cells. In contrast to the above two types of Sec-independent secretion systems, T4SS are unique among other gram-negative bacterial secretion systems owing to their ability to transfer effector proteins, DNA and nucleoprotein complexes [22–27]. Genetic studies have discovered mutants/subtypes of type IV secretion systems [27–29], which divide the T4SS in gram-negative bacteria into two main subtypes called IVA and IVB [1, 23, 24, 30]. The type IVA systems are composed of subunits homologous to those of the Agrobacterium tumefaciens VirB/VirD4 system [31]. The type IVB systems are assembled from subunits homologous to the Legionella pneumophila Dot/Icm system [32]. Moreover, the effector proteins secreted via T4SS harbor a C-terminal signal sequence, which is useful for their translocation and identification [29, 33–38].

T6SS are an emerging type of secretion system that is widespread in the bacterial world [4]. The minimal T6SS apparatus consists of 13 proteins [39, 40]. The main component is a ‘Hemolysin co-regulated protein’ (Hcp) inner tube, with a tip complex composed of ‘valine-glycine repeat protein G’ (VgrG) and ‘proline-alanine-alanine-arginine’ (PAAR) proteins, which pierces the membrane of the host cell in a spike-like fashion. The presence of specific VgrG and PAAR proteins is crucial for the T6SS to deliver the effectors directly into the target cell [41, 42]. The target of T6SS can be either eukaryotic cells or other rival bacteria [43]. Antieukaryotic and antibacterial effectors promote the survival of pathogens such as Pseudomonas aeruginosa, Vibrio cholerae, Burkholderia species, Serratia marcescens and Acinetobacter baumannii in new environments. Antibacterial T6SS effectors are paired with specific cognate immunity proteins encoded downstream of the effector gene, which are able to protect the producing cell from self-intoxication/self-killing by the secreted effectors [44–54]. No clear signal sequence of T6SS effectors is commonly accepted. Recently, Salomon et al. [55] reported an N-terminal motif, the MIX (marker for type VI effectors), which helps to identify novel T6SS effectors.

The main focus of this article is the progress of in silico identification of bacterial effector proteins that are secreted via the Sec-independent pathways T3SS, T4SS and T6SS. These three sophisticated types of secretion systems inject virulent effectors directly into the recipient cell, which is particularly important in the virulence of gram-negative bacteria and, in selected cases, is associated with human disease, such as the human innate immune/inflammatory response and typhoid induced by infection with Salmonella [56, 57], respiratory tract infections caused by Bordetella [58], cat-scratch disease and trench fever caused by Bartonella [59] and sexually transmitted diseases caused by Chlamydia [60].

The identification of effector proteins

The discovery of secretion systems is supported by the comparative analysis of coding genes, assuming the observation of homologous components in one secretion system is a good indicator of the appearance of a corresponding secretion system [3]. For example, T346Hunter is a computational tool for the prediction of T3SS, T4SS and T6SS based on the comparative similarity of the three secretion systems [61]. Particularly, a hidden Markov model (HMM) is used to construct the protein profiles of the core components of these three secretion systems, and new sequences are predicted to harbor the secretion system based on the enriched similarity to the constructed profiles.

Compared with the high conservation of secretion apparatus across bacterial pathogens, effector proteins evolve quickly to facilitate their adaptation to different host environments and participate in diverse molecular processes, such as inhibiting the host inflammatory response, killing immune cells, interfering with cellular activities and facilitating bacterial entry [30, 62–65].

The identification of secreted effector proteins is partially supported by biological experiments, namely, functional screening with a reporter gene [19, 66–73]. For example, Salmonella uses the T3SS to inject virulence effector proteins into host cells [74]. On the other hand, adenylate cyclase activity is entirely dependent on host cell calmodulin, and translocation of the adenylate cyclase domain into host cells can be detected by assaying cyclic AMP (cAMP) levels because it produces cAMP. Thus, based on these considerations, Geddes et al. generated translational fusions between Salmonella chromosomal genes and a fragment of the calmodulin-dependent adenylate cyclase genes and monitored the secretion of fusion proteins, namely, they identified secreted effector proteins by measuring the levels of cAMP in the infected host cells [74]. AvrRpt2 is another good in vivo reporter for type III secretion in Pseudomonas syringae [62, 75]. The secreted effectors can be detected by a hypersensitive response in Arabidopsis thaliana caused by the functional domain of AvrRpt2, as AvrRpt2 is fused to the candidate proteins and secreted into host cells along with unknown effectors.

The experimental techniques, exemplified by reporter fusions that reveal secreted effectors, are accurate; however, they are quite time-consuming and laborious. Furthermore, this type of identification method is limited by both a priori knowledge about biological mechanisms and the sophisticated construction of molecular experiments. Additionally, most effectors are scattered throughout the genome rather than clustered in a narrow genomic region [66]. In contrast, computational identification of effector proteins depends less on the researchers’ biological background knowledge, and genome-scale, automated detection of effectors greatly accelerates the process.

In fact, as sequencing techniques have improved in the past decade, the number of complete bacterial genomes available for genome-scale analysis has increased. In an era with such a large amount of data, we should take advantage of bioinformatics tools to efficiently detect the potential targets. Instead of a pure ‘wet’ experiment, the combination of both ‘wet’ and ‘dry’ is becoming prevalent and effective. Bioinformatics methods successfully limit the number of candidates for subsequent experimental validation by selecting putative ones based on plausible predictions, providing a reasonable starting point for experimental investigation, and promoting both the efficiency and reliability of the detection of novel secreted effectors [11, 37, 66, 69–71, 76–78]. In addition, the applications of bioinformatics techniques to many other molecular domains have achieved good success, such as the classical problem of predicting the secondary and tertiary structures of proteins [79, 80], which have proved the feasibility and significance of bioinformatics methods. Moreover, the characteristic information of a considerable number of validated bacterial secreted effectors has been observed, such as the preference of particular amino acids or certain structural motifs. These types of features can be grasped as ‘rules’ by in silico approaches and can further assist the efficient, automated, genome-scale mining of novel effectors in the huge wave of genomic data.

Major efforts have been devoted to identifying secreted effectors in silico, including multiple genome-scale studies [69, 70, 76, 81]. Two principal strategies have been adopted, machine learning approaches, which predict new effectors based on the extracted features of known effectors [11, 77, 82–87], and statistical methods, which calculate the probability of a protein being an effector [78, 88, 89].

The progress of computational identification of effectors secreted by fungi and nematodes has been reviewed [90]. For the bacterial secreted effectors, Greenberg et al. have presented the computational identification of T3SS effectors in P. syringae [91]. Segal et al. have addressed the bioinformatics approaches for identification of T4SS effectors in L. pneumophila, including the homology-based, promoter-based and signal-based methods [37]. McDermott et al. have described four machine learning-based prediction methods for type III and IV effectors. Specifically, based on three prime elements of machine learning algorithms, as introduced above, McDermott et al. compared the components of the methods, features of considered effectors, training and testing data and the performance of prediction based on an independent set of validated effectors known before the development of these methods [92]. Recently, An et al. have discussed the current computational studies for predicting effector proteins secreted via T3SS, T4SS and T6SS. Their article mainly evaluates the algorithms, feature selection and software utilities based on curated testing data sets, which is quite informative and useful for future development of prediction methods [93]. In detail, An et al. construct three positive data sets, which consist of extracted T3SS, T4SS and T6SS effectors, respectively. However, when a certain group of secreted effectors is considered, the negative data set is selected from the other two positive data sets. Although it is important to distinguish one type of effector from those secreted via other secretion systems, the ability to identify a particular group of secreted effectors against the genomic background deserves more attention. Ignoring the ‘noise’ of true non-effectors may weaken the biological persuasiveness of the reported prediction performance. Moreover, An et al.’s work summarizes the feature selection approaches but pays less attention to the informative features of secreted effectors used for their identification. In fact, plenty of features, represented by the amino acid composition (AAC), sequential motifs, and structural and physicochemical information, have been used to discriminate effectors from other proteins. For example, the evolutionary conservation of secreted effectors is encoded as position-specific scoring matrix (PSSM) profiles, which are used in several prediction methods to efficiently identify secreted effectors [11, 38, 87].

We have been motivated to investigate the computational progress of identifying bacterial effectors secreted via T3SS, T4SS and T6SS from a different perspective. We first review the informative features of secreted effectors, which are prevalently used for their identification. We then present our in-depth study of the mechanisms of prediction methods for secreted effectors, especially the machine learning approaches. Additionally, we also mention the basic but classical BLAST and HMMs, which are prevalently applied to identify sequential motifs. A comparison of the reviewed methods is carried out based on our comprehensive testing data sets. After analyzing our benchmarking results, we propose potential improvements of the prediction performance.

Features for identifying secreted effectors

The purpose of bacterial effector prediction methods is to discriminate effectors from non-effectors. The common features of known effectors are always used to identify new members according to the prediction ‘rules’. Hence, before addressing the topic of bioinformatics progress in predicting bacterial effectors, we first enumerate the features of effector proteins that can be/have been used for their identification.

Aligning candidates with known effectors for homology search would be the most straightforward way to identify effectors, where in selected cases the homology can be found by PSI-BLAST [66, 69, 71, 72, 76, 94–97]. This type of prediction method based on traditional sequence alignment generally gives reliable predictions, but low sequence similarity between secreted proteins leads to poor prediction performance, as the effectors evolve quickly to facilitate their adaptation to different host environments [62, 64, 65]. PSI-BLAST is short for ‘Position-Specific Iterated BLAST’, which was proposed by Altschul et al. in 1997 to identify related sequences with weak similarity [98]. As an essential component of PSI-BLAST, PSSM profiles are a representative form of sequence motif/profile and have been applied effectively to the identification of bacterial effectors. PSSM profiles are used to represent the evolutionary conservation profiles of the T3SS effectors [11], their 100 N-terminal residues [87] and the evolutionary features of T4SS effectors [38]. A transformation of the PSSM, i.e. PSSM_AC, is also calculated to discriminate T4SS effectors and represents the correlation of evolutionary conservation of the 20 residues between two positions separated by a predefined distance along the sequence [38]. Beyond the evolutionary conservation within a certain type of effector, the relationships of effectors to other genomes, referred to as the phylogenetic profiles of effectors, may help to identify new members. Typically, phylogenetic profiles are lists of significant sequence similarity found between effector proteins in a series of organisms that provide information about the distribution of effector proteins over a range of different organisms with diverse evolutionary histories [11].
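As a rough illustration of the PSSM_AC idea described above, the following sketch computes an auto-covariance for each PSSM column between positions separated by a fixed lag. The function name `pssm_auto_covariance`, the toy matrix and the normalization by (length - lag) are assumptions made for illustration; the exact formulation used in [38] may differ.

```python
from typing import List


def pssm_auto_covariance(pssm: List[List[float]], lag: int) -> List[float]:
    """Auto-covariance of each PSSM column at a given lag (hypothetical helper).

    `pssm` holds one row per sequence position and one score per residue
    type in each row (20 columns for a real PSSM). One value is returned
    per column.
    """
    length = len(pssm)
    if not 0 < lag < length:
        raise ValueError("lag must satisfy 0 < lag < sequence length")
    n_cols = len(pssm[0])
    # Mean score of every column over the whole sequence.
    means = [sum(row[j] for row in pssm) / length for j in range(n_cols)]
    result = []
    for j in range(n_cols):
        # Products of mean-centred scores at positions i and i + lag.
        total = sum(
            (pssm[i][j] - means[j]) * (pssm[i + lag][j] - means[j])
            for i in range(length - lag)
        )
        result.append(total / (length - lag))
    return result


# Toy matrix with 4 positions and 2 columns (assumed data, not a real PSSM).
toy = [[1.0, 0.0], [3.0, 0.0], [1.0, 0.0], [3.0, 0.0]]
print(pssm_auto_covariance(toy, lag=1))
```

Concatenating the values obtained for several lags would give a fixed-length vector regardless of protein length, which is one reason such transformations are convenient as machine learning input.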

In addition to sequence similarity and conservation, detecting AAC biases (significant enrichment or depletion of certain amino acids) in known effectors is a straightforward and prevalent way to identify novel secreted proteins [11, 38, 77, 85]. For example, Asn, Glu and Lys show higher compositions in T4SS IVB effectors, and Ala, Glu and Ser occur more frequently in IVA effectors [38]. AAC biases can also be detected in the N-termini of T3SS effectors [19, 77, 83], the C-termini of T4SS effectors [99] and the N-termini of T6SS effectors [55], as they are assumed to be the most informative regions for identifying secreted effectors. These particular regions of effectors are referred to as secretion signals and can help to identify the effectors that are to be secreted by the secretion apparatus [85]. Previous research shows that the first 100 amino acids in the N-terminal region of T3SS effectors may harbor the signal sequences and the chaperone-binding sequences that are required to guide the secretion of T3SS effector proteins [8, 13, 16–19]. Specifically, an in silico stepwise deletion analysis of N-terminal amino acids suggests the 6th–10th N-terminal amino acids as the important region for identifying T3SS effectors [83]. In P. syringae, the AAC biases and patterns in the N-termini of T3SS effectors have been identified as a characteristic of effector proteins and have been used for their identification [62, 76, 91, 100, 101]. Arnold et al. showed that selecting residues 0–15 gives good discriminative power between T3SS effectors and non-effectors, and that the most significantly enriched residue in the N-termini of T3SS effectors is Ser, while Thr and Pro are significantly enriched in the effectors of animal pathogens, and Leu is depleted in both animal and plant effector proteins [85]. This supports the results indicating that the 50 N-terminal amino acids of P. syringae T3SS effectors show significant biases of high Ser and low Asp [62, 76, 83]. The C-terminal ‘E-block’ modules are used as criteria to identify T4SS effectors [96], where the E-block is a glutamate-rich region located within the 25 C-terminal amino acids of the secreted effectors and is required for efficient translocation of effectors [102]. Salomon et al. [55] reported an N-terminal motif for identifying novel T6SS effectors, the MIX, which was identified by using comparative proteomics. Typically, MIX is genetically linked to the T6SS core components but is not an essential structural element, as MIX-containing proteins are not required for the antibacterial activity. Meanwhile, MIX proteins contain cytotoxic effector domains, and many of these proteins are confirmed T6SS effectors.
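The AAC feature discussed above can be sketched in a few lines. The function name `amino_acid_composition`, its `n_terminal` parameter and the example sequence are hypothetical and chosen only for illustration; published tools may handle ambiguous residues and window boundaries differently.

```python
from collections import Counter
from typing import Dict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(seq: str, n_terminal: int = 0) -> Dict[str, float]:
    """Fraction of each standard amino acid in a sequence.

    If `n_terminal` is positive, only the first `n_terminal` residues are
    considered, mimicking an N-terminal signal window. Non-standard letters
    are ignored.
    """
    region = seq[:n_terminal] if n_terminal > 0 else seq
    counts = Counter(aa for aa in region.upper() if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acids found in the selected region")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


# Hypothetical sequence; composition of its first 4 residues ("MSSS").
composition = amino_acid_composition("MSSSAKDEL", n_terminal=4)
print(composition["S"], composition["M"])
```

Comparing such compositions between known effectors and a background set, position window by position window, is one simple way to expose enrichments such as the Ser bias reported for T3SS N-termini.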

In addition to the AAC biases, the characteristic features of the specific sequence fragments in the signal region can be used to identify corresponding effectors. Sequence similarities to the 50–100 N-terminal residues of known effectors are used to identify new T3SS effectors [19]. Analysis of the collection of T3SS effector proteins in P.syringae revealed an export-associated pattern of equivalent solvent-exposed amino acids in the 5 N-terminal positions and amphipathicity and richness in polar amino acids in the 50 N-terminal positions [76]. A hydrophobic residue near the C-terminus is reported to be critical for translocating T4SS effectors [36], whereas positively and negatively charged residues in the C-terminal signal sequence also play an important role in the translocation of T4SS effectors [33, 69, 96]. Meanwhile, two studies supported the conclusion that there is a preference for short polar amino acids located in the 20 C-terminal residues of T4SS effectors [69, 103]. C-terminal ‘basicity’ and ‘hydrophilicity’ are also used as criteria to identify T4SS effectors [96]. In addition to positively charged residues, other physico-chemical properties are considered to discriminate effectors, such as acidic residues and polar amino acids [76] and the molecular weight and pI estimated from the amino acid sequences [104]. Moreover, as most of the known secreted proteins are injected into the cytoplasm of host cells, hydrophilic residues would have higher compositions in effectors than in non-effectors. Hence, the global hydrophilicity of protein sequences is used to identify novel T4SS effectors [38, 71, 96].

Structural motifs of the effectors, including coils, helices and strands, may discriminate effectors from non-effectors [77, 83, 96]. Solvent accessibility represents the property of a side chain to be exposed to or buried in the solvent and is commonly used to identify effectors [76, 77, 83, 87]. For example, in addition to the significant contribution of the 6th–10th N-terminal amino acids, the secondary structure and solvent accessibility are reported to make important contributions to the identification of type III secretion signals [83]. With enriched Ser compared with other amino acids, T3SS effectors prefer to stay in unfolded coils and to be exposed to the solvent rather than be buried. The joint profiles analysis shows that the ‘Ser-coil-exposed’ preference is most frequently observed at most positions of T3SS effectors [83, 104].

Furthermore, many T3SS effectors share a commonality of interacting with cognate chaperones before secretion. Hence, the characteristic features of the chaperones can be used to identify cognate effectors that are encoded in the vicinity of chaperone genes [105]. The biophysical features of chaperones, such as the small molecular weight and acidic pI [106], provide markers for potential cognate effector loci in bacterial genomes [105]. Similarly, the antibacterial T6SS effectors are paired with specific cognate immunity proteins encoded in the vicinity of effector genes. The selected immunity proteins also exhibit low pI values and contain several highly conserved residues, making them markers for identifying new downstream cognate T6SS effectors [65]. The occurrence of conserved domains is another criterion to detect effectors, such as the eukaryotic domain [37, 69, 71, 72, 96, 107], which implies a potential imitation of eukaryotic host cell functionality of such domains. The occurrence of the prokaryotic domain, nuclear localization signal (NLS), mitochondrial localization signal (MLS) and prenylation domain are considered in [96] to identify novel T4SS effectors.

The characteristics of transcription regulatory sequences, the promoters, can also be used to identify the downstream open reading frames (ORFs) that encode putative effectors [62, 67, 69, 76, 77, 81, 96, 100, 108]. A match with the commonality of promoters may indicate an encoded effector downstream. But this approach is limited to known and detectable motifs and is specific to the conserved effector families. Moreover, G + C content analysis of genes encoding effectors helps to identify new effectors [11, 66, 69, 76, 104]. Studies have shown that effectors commonly have a relatively low G + C content [11, 62, 66, 69, 96, 109], supporting the hypothesis that a large number of effectors originate via horizontal gene transfer.
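The G + C content feature can be computed directly from nucleotide sequences. In the sketch below, `gc_content` and `gc_deviation` are hypothetical helper names, and comparing a gene against a genome-wide background is our assumption about how such a feature might be used; the cited studies do not prescribe a specific threshold here.

```python
def gc_content(dna: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    dna = dna.upper()
    informative = sum(dna.count(base) for base in "ACGT")
    if informative == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return (dna.count("G") + dna.count("C")) / informative


def gc_deviation(gene: str, genome: str) -> float:
    """G + C content of a gene minus that of the genomic background.

    Negative values indicate a gene that is less GC-rich than the genome.
    """
    return gc_content(gene) - gc_content(genome)


# Toy sequences (assumed data): an AT-rich gene in a GC-rich background.
print(gc_deviation("ATATATGC", "GCGCGCATGC"))
```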

In silico identification methods

In general, the basis of prediction methods for bacterial effectors relies on the similarity between putative predictions and the known effectors. In other words, as the sequential, structural, physical or chemical collections of features of the known effectors are given/obtained, novel effectors are identified among the plausible predictions according to acceptable similarities to known effectors.

Computational prediction methods of bacterial effectors use multiple strategies to execute the prediction, such as machine learning algorithms and statistical approaches.

As one of the most common strategies, machine learning algorithms are well integrated into the exploration of bacterial effector proteins in silico when information about both effectors and non-effectors is available. Machine learning algorithms are a class of computational methods for binary ‘classification’ issues, where in the context of predicting effector proteins, the purpose of the algorithms is to discriminate effectors from non-effectors [11, 38, 77, 82–87]. Generally, the known effectors and non-effectors are fed into the machine learning algorithm as input, and the algorithm ‘learns’ or ‘is trained’ to discriminate effectors from non-effectors. The expectation is that the trained algorithms can identify new effector proteins based on the provided information.

The second major strategy to predict bacterial effectors is based on probability distributions. Statistical methods, such as Markov models, make novel predictions about effector proteins according to their probabilities [78, 88, 89], which are elaborated in the ‘Markov model’ section.
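The probability-based strategy can be illustrated with a generic first-order Markov chain scored as a log-odds ratio against a background model. This is a minimal sketch under our own assumptions (add-one smoothing, transitions only, toy training sequences) and is not the specific model of [78, 88, 89]; the names `train_markov` and `log_odds` are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
Model = Dict[Tuple[str, str], float]


def train_markov(seqs: Iterable[str]) -> Model:
    """Estimate transition probabilities P(b | a) with add-one smoothing."""
    counts: Dict[Tuple[str, str], int] = defaultdict(int)
    for seq in seqs:
        for a, b in zip(seq, seq[1:]):
            if a in ALPHABET and b in ALPHABET:
                counts[(a, b)] += 1
    probs: Model = {}
    for a in ALPHABET:
        # Add-one smoothing contributes len(ALPHABET) pseudo-counts per row.
        row_total = sum(counts[(a, b)] for b in ALPHABET) + len(ALPHABET)
        for b in ALPHABET:
            probs[(a, b)] = (counts[(a, b)] + 1) / row_total
    return probs


def log_odds(seq: str, effector_model: Model, background_model: Model) -> float:
    """Sum of log ratios of transition probabilities; positive favours 'effector'."""
    return sum(
        math.log(effector_model[(a, b)] / background_model[(a, b)])
        for a, b in zip(seq, seq[1:])
        if (a, b) in effector_model
    )


# Toy training sets (assumed data, far too small for real use).
effector_model = train_markov(["SSSSTS", "MSSSPS"])
background_model = train_markov(["LLALLV", "MLLDLL"])
print(log_odds("SSSS", effector_model, background_model))
```

A sequence is then ranked by its score, so the decision threshold can be tuned on validated effectors rather than fixed in advance.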

Specifically, a training data set for machine learning algorithms is first constructed with both ‘positive’ and ‘negative’ examples. The positive examples are already known effectors, which are experimentally validated in most cases, or putative effector proteins. The negative examples are proteins that are identified or supposed to be non-effectors. Each example of the two groups is represented as a set of particular information called ‘features’, including the sequence patterns, physical characteristics and other features of the proteins. As the second step, a computational model is built using the constructed training data set. Both the positive and negative examples are input into the machine learning algorithm to enable the algorithm to learn the difference between the given effectors and non-effectors. This process is also referred to as the ‘training/learning’ phase of machine learning algorithms. Finally, the trained algorithms serve as predictive models that are able to make novel predictions about effectors from new bacterial genomes or unknown sequences. Predictions with plausible evaluation scores are usually taken forward for experimental validation, and the features of the validated results may be used to iteratively refine the predictive model. This process is referred to as ‘testing’ the algorithms. Instead of testing on independent data sets, the testing phase of machine learning algorithms can also be executed via ‘k-fold’ cross-validation, where the entire training data set is separated into k parts. Each part in turn is used to test a model trained on the remaining k-1 parts. This kind of testing is used to assess the predictive model, rather than to predict effectors from distinct organisms.
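The k-fold procedure described above can be sketched as an index-splitting routine. The function name `k_fold_indices`, the shuffling with a fixed seed and the interleaved assignment of samples to folds are our own choices for illustration; the reviewed tools may partition their data differently (for example, with stratification by class).

```python
import random
from typing import List, Tuple


def k_fold_indices(n_samples: int, k: int, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, test_indices) pairs covering all samples.

    Every sample appears in exactly one test fold and in the training set
    of the other k - 1 folds.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


# Ten labelled proteins split into five folds (sizes only; data are hypothetical).
for train, test in k_fold_indices(10, k=5):
    print(len(train), len(test))
```

Averaging a performance measure over the k test folds gives an estimate of how the model behaves on unseen data without requiring a separate independent data set.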

Figure 1 shows the flowchart of the bioinformatics identification of bacterial effectors based on machine learning algorithms. Knowledge of the previously known effectors and non-effectors is used in the prediction methods as ‘features’, which are used to train the model to make it able to discriminate effectors from non-effectors. The ‘trained’ or ‘learned’ model can make further novel predictions about new unclassified sequences, according to the knowledge it learned. Putative prediction candidates of effectors are considered for further experimental validation and the features of validated effectors help to refine the predictive model.

Figure 1. Flowchart of in silico prediction of bacterial secreted effectors based on machine learning algorithms. During the training phase, diverse features of known effectors and non-effectors are extracted and combined into feature vectors x = (x1, x2, …, xN), where N denotes the number of features considered. The feature vectors are passed to the machine learning algorithm to construct a predictive model that is able to discriminate effectors from non-effectors. New sequences are classified by the constructed model as either putative effectors or non-effectors according to the similarity between their features and those of known examples. The fidelity of putative effectors can be verified by experimental validation, and the validated ones may refine the predictive model further. This process is referred to as independent testing; together with cross-validation, it makes up the testing phase.

The prediction methods of bacterial effector proteins secreted via certain types of secretion systems cannot be exhaustively elaborated owing to limited space in this article. Hence, the mechanisms of the commonly adopted prediction techniques are exemplified using selected examples, including support vector machine (SVM) (Figure 2A) [11, 38, 77], naive Bayes (NB) (Figure 2B) [85], artificial neural network (ANN) (Figure 2C) [86] and random forest (RF) (Figure 2D) [87], which have achieved different levels of success. The foremost differences between these methods are the mechanisms used to extract statistical commonalities in the data and their principles of processing data and making novel predictions.

Figure 2. In silico prediction methods of bacterial secreted effectors at work. (A–D) Four widely used machine learning algorithms for identifying secreted effectors in bacteria, all well known as binary classification tools; (E) a statistical approach based on the Markov model, which is widely used to search for homologous domains in bacterial effectors.

Support vector machine

Diverse characteristic features of effectors and non-effectors form the input of SVMs, from which SVMs learn the differences between the two groups. Specifically, the features describing each example are numerically encoded as a feature vector, a sequence of numbers that can be thought of as a point in N-dimensional space, where N is the number of features considered. Given the two opposing groups of effectors and non-effectors, SVMs separate the corresponding points in this N-dimensional space by constructing a hyperplane between them. Unclassified examples are assigned to either group according to the side of the hyperplane on which they fall [11, 38, 77, 82–84, 110]. A brief description of how SVMs work is shown in Figure 2A.
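The classification step, deciding on which side of the hyperplane a point falls, can be illustrated as follows. This sketch assumes a linear hyperplane w·x + b = 0 whose weights are already known. The values of `w` and `b` and the three-dimensional toy vectors are hypothetical, and the training step that finds the hyperplane, as well as the kernel functions that real SVMs often use, are omitted.

```python
def svm_decision(x, w, b):
    """Signed score w.x + b; its sign tells on which side of the hyperplane x lies."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(x, w, b):
    """Assign the point to a group according to the side of the hyperplane."""
    return "effector" if svm_decision(x, w, b) > 0 else "non-effector"


# Hypothetical hyperplane in a 3-dimensional feature space (N = 3).
w = [0.8, -0.5, 0.3]
b = -0.1

label_a = classify([0.9, 0.1, 0.4], w, b)  # score is positive
label_b = classify([0.1, 0.9, 0.2], w, b)  # score is negative
```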

SIEVE

The T3SS effector detection approach SIEVE (SVM-based identification and evaluation of virulence effectors) is trained with multiple features of the known effectors. The feature vectors of SIEVE represent points in a 711-dimensional space derived from the G + C content, the AAC, the 30 N-terminal residues of both effectors and non-effectors and their evolutionary relationships to other genomes [11]. Its predictive power is tested by making novel predictions in distantly related organisms.

In SIEVE, homologs of each effector within the same organism, identified by BLAST, are removed when the training data set is constructed. However, the evolutionary conservation of each effector protein across organisms, captured for example by PSSM and phylogenetic profiles, is retained as part of the feature vectors.

SIEVE is trained on P. syringae effector proteins and evaluated by predicting Salmonella enterica serovar typhimurium effectors, which are evolutionarily distant from the training data set. A reverse experiment is conducted by swapping the training and testing data sets of effectors from the two organisms. Novel effectors are also predicted in a third organism, the human pathogen Chlamydia trachomatis, and the high-ranking predictions are investigated experimentally.

In addition, the identifications of SIEVE indicate a set of conserved sequence biases within a majority of the effectors from both organisms, suggesting a putative secretion signal in the 30 N-terminal residues for the identification of novel T3SS effectors.

SSE-ACC

SSE-ACC, an SVM model, was developed to identify T3SS effectors based on the AAC, secondary structure (SSE) and solvent accessibility (ACC) of known effectors and non-effectors in P. syringae, specifically within the 100 N-terminal residues rather than the full-length sequence [77].

The 100-dimensional feature vectors are constructed according to the AAC in different secondary structure elements and solvent accessibility states. The first 60 dimensions describe the frequency of each of the 20 amino acids in each of the three types of protein secondary structure elements, i.e. strand, helix and coil (20*3). The last 40 dimensions represent the frequency of each of the 20 amino acids in each of the two types of protein solvent accessibility states, i.e. buried or exposed (20*2).
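A feature vector with this 60 + 40 layout could be assembled along the following lines. This is a sketch under stated assumptions rather than the SSE-ACC implementation: the function name, the state letters (`E`, `H`, `C` for strand, helix and coil; `b`, `e` for buried and exposed), the ordering within each block and the normalization by fragment length are all hypothetical choices, and the per-residue annotations are taken as already predicted.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SS_STATES = "EHC"   # strand, helix, coil
ACC_STATES = "be"   # buried, exposed


def sse_acc_features(seq, ss, acc):
    """Return 60 SSE-specific plus 40 ACC-specific amino acid frequencies."""
    if not seq or not (len(seq) == len(ss) == len(acc)):
        raise ValueError("seq, ss and acc must be non-empty and equally long")
    ss_counts = {(s, a): 0 for s in SS_STATES for a in AMINO_ACIDS}
    acc_counts = {(c, a): 0 for c in ACC_STATES for a in AMINO_ACIDS}
    for a, s, c in zip(seq, ss, acc):
        # Unknown residues or state letters are silently skipped.
        if (s, a) in ss_counts:
            ss_counts[(s, a)] += 1
        if (c, a) in acc_counts:
            acc_counts[(c, a)] += 1
    n = len(seq)
    return [v / n for v in ss_counts.values()] + [v / n for v in acc_counts.values()]


# Hypothetical 8-residue fragment with per-residue structure annotations.
vector = sse_acc_features("MKLSTAVE", "CCHHHHEE", "eebbbeee")
```

In practice the input would be the 100 N-terminal residues, with one secondary structure state and one accessibility state predicted for each residue.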

SSE-ACC is tested both by 5-fold cross-validation and on independent rhizobial genomes. Combined with a promoter search using an HMM, which is described in detail in the ‘Markov model’ section, multiple putative effectors are predicted and confirmed by wet-bench experiments [77].

Additionally, the structural motifs, which are presented as input features, are predicted by PSIPRED [79], and the solvent accessibility states are predicted by ACCpro within the SCRATCH server [111].

T4EffPred

As described in our previous work, T4EffPred is an SVM model for predicting bacterial effectors secreted via T4SS based on the AAC and evolutionary conservation of effectors and non-effectors represented by PSSM profiles [38].

PSSM profiles are created by using PSI-BLAST to search the NCBI’s non-redundant (NR) protein database for multiple alignment against the query sequence. In detail, PSI-BLAST first constructs a multiple alignment from the original BLAST output data sought by the query sequence, where all the database sequence segments aligned to the query with an E-value below a predefined threshold (closely related) are chosen. Next, PSI-BLAST processes this multiple alignment into a PSSM, which holds the conservation sequence pattern among the alignment. Finally, PSI-BLAST feeds the PSSM into a new round of BLAST search against the database, rather than the query sequence. Newly identified sequence segments with low E-values are aligned and used to refine the PSSM iteratively until no additional related sequences are detected [38, 98].

The critical difference between the PSI-BLAST and pairwise alignment in BLAST is that the score for aligning a letter with the pattern position is given by the PSSM, rather than with reference to an amino acid substitution matrix. As PSSM greatly increases the sensitivity to weak but biologically relevant sequence relationships, PSSM is commonly used for molecular motif/pattern discovery, such as protein secondary structure [79], transporter targets [112] and bacterial secreted T3SS effectors [11, 82, 87].

In particular, for a given query sequence of length n, the corresponding PSSM profiles contain n*20 elements, and the (i, j)th entry represents the score of the amino acid in position i of the query sequence mutated to amino acid type j during the evolution process. In practice, the authors transfer the PSSM profiles into the PSSM composition profile by summing the corresponding amino acid rows in the PSSM, which contributes 400 dimensions to the feature vectors for the SVM. In addition, the auto covariance transformation of PSSM, PSSM_AC is also calculated to discriminate T4SS effectors by the authors, which represents the correlation of evolutionary conservation of the residues with an interval of 10 along the sequence.
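The PSSM composition step, summing the rows that belong to the same amino acid type, can be sketched as below. This is an illustration under assumptions, not the T4EffPred code: the function name is hypothetical, the column order `ARNDCQEGHILKMFPSTWYV` is assumed to match the order in which the PSSM was written, and any scaling or normalization the authors may apply is omitted.

```python
PSSM_COLUMNS = "ARNDCQEGHILKMFPSTWYV"  # assumed column order of the PSSM


def pssm_composition(seq, pssm):
    """Sum PSSM rows by the residue type at each position (20 x 20 = 400 values)."""
    if len(seq) != len(pssm):
        raise ValueError("the PSSM needs one row per residue")
    sums = {a: [0.0] * 20 for a in PSSM_COLUMNS}
    for residue, row in zip(seq, pssm):
        if residue in sums:  # non-standard residues are skipped
            for j, score in enumerate(row):
                sums[residue][j] += score
    return [value for a in PSSM_COLUMNS for value in sums[a]]


# Hypothetical 3-residue query with constant rows, for illustration only.
toy_pssm = [[1] * 20, [2] * 20, [5] * 20]
vector = pssm_composition("AAR", toy_pssm)
```

Because the result always has 400 entries, proteins of different lengths n are mapped to feature vectors of the same size, which is what a fixed-dimension classifier such as an SVM requires.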

In contrast to the previous SVM-based models, four types of feature vectors are used by T4EffPred, constructed based on AAC (20-D), amino acid pairs (400-D), PSSM composition (400-D) and PSSM_AC (200-D). Tenfold cross-validation on a training data set consisting of proteins from multiple pathogens suggests that the PSSM-based feature vectors are the most helpful for discriminating effectors from non-effectors. The four individual feature vectors and their ensemble are used to expand the set of putative T4SS effectors in an independent organism.

Naive Bayes

NB algorithms are probabilistic classifiers. When applied to a binary classification problem, the algorithms are trained using a positive and negative set of examples, with each example represented by a vector of features. NB algorithms classify unknown examples according to their probability calculated by applying Bayes’ theorem with an assumption of conditional independence between features [113]. A brief introduction to constructing NB classifier is shown in Figure 2B.
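Written out, the decision rule implied by Bayes’ theorem with conditionally independent features takes the standard form below. The symbols c (the class, effector or non-effector) and x_i (the ith of N features) are notation assumed here for illustration and are not taken from the cited tools.

```latex
\[
P(c \mid x_1, \dots, x_N) \;\propto\; P(c) \prod_{i=1}^{N} P(x_i \mid c),
\qquad
\hat{c} \;=\; \arg\max_{c} \; P(c) \prod_{i=1}^{N} P(x_i \mid c)
\]
```

The prior P(c) and the per-feature likelihoods P(x_i | c) are estimated from the positive and negative training examples, and an unknown example is assigned to the class with the larger posterior probability.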

The method ‘EffectiveT3’ uses the NB algorithm as a classifier to model the AAC and secondary structures of the N-terminal residues of the T3SS effectors [85].

This machine learning approach takes 100 experimentally verified effector proteins from the animal pathogens Chlamydia, Salmonella, Yersinia and Escherichia and the plant pathogen Pseudomonas as the positive training data. The negative data set comes from randomly selected non-effectors.

The authors first analyze the known effectors in the training data set and extract a specific AAC in the N-termini of the effectors as the general signal of T3SS-mediated transport. They then build EffectiveT3 for this N-terminal signal, using NB algorithms to separate the positive and negative data. After 10-fold cross-validation, EffectiveT3 is applied to predict novel T3SS effectors in 739 genomes, both with and without a T3SS.

In contrast to other methods, the 20 amino acids are mapped into two reduced alphabets according to their biophysical properties and hydrophobic/hydrophilic characteristics. In other words, the 20 amino acids are separated into several groups hierarchically. The frequencies of each group of amino acids and the frequencies of 20 amino acids (AAC) make up the feature vector of EffectiveT3.

The features identified by EffectiveT3 are argued to be taxonomically universal owing to the inclusion of T3SS effectors from multiple pathogens in the training data set, where there is no organism-specific information required.

Additionally, the homologues in the training data set are clustered by a global alignment, and the N-terminal conserved region is surveyed through multiple alignments. The structural motifs, which are presented as input features, are predicted by PSIPRED [79].

Artificial neural network

ANNs are computer-based algorithms inspired by the structure and behavior of connected neurons in the human brain. Similar to other machine learning algorithms, ANNs can be trained to recognize and classify complex patterns by analyzing the rules of data within a training data set and can make novel classifications based on the learned knowledge [114]. The basic idea behind ANNs is shown in Figure 2C.

‘T3SS_prediction’ is an ANN-based model that is used to identify potential T3SS effectors based on the N-terminal amino acid sequence using a sliding-window procedure [86].

In particular, after the removal of redundant and short sequences (<100 aa), 575 putative T3SS effectors and 685 non-effectors from P. syringae and other species make up the training data set of T3SS_prediction. Starting from the 30 N-terminal amino acids, a window with a width of 25 slides along the sequence fragment toward the C-terminus in increments of one amino acid. The resulting overlapping subsequences are the input of the ANN classifier, within which each amino acid is encoded as a string of length 20. The output neuron of the ANN classifies each subsequence by estimating its probability of being part of a T3SS effector signal and comparing it with a predefined threshold.
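The windowing and encoding described above can be sketched as follows. This is a minimal illustration, not the T3SS_prediction code: the function names and the toy sequence are hypothetical, and a one-hot code (a single 1 among 20 positions) is assumed as one natural reading of ‘a string of length 20’.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(residue):
    """Encode one amino acid as 20 values with a single 1 (assumed encoding)."""
    return [1 if residue == a else 0 for a in AMINO_ACIDS]


def sliding_windows(sequence, n_terminal=30, width=25, step=1):
    """Cut overlapping windows from the N-terminal fragment, moving toward the C-terminus."""
    fragment = sequence[:n_terminal]
    return [fragment[i:i + width] for i in range(0, len(fragment) - width + 1, step)]


def encode_window(window):
    """Concatenate the per-residue codes into one ANN input vector."""
    return [bit for residue in window for bit in one_hot(residue)]


# Hypothetical 40-residue protein; only its first 30 residues are used.
toy_sequence = "MKLSTAVEQRNDCGHIPFWYMKLSTAVEQRNDCGHIPFWY"
windows = sliding_windows(toy_sequence)
inputs = [encode_window(w) for w in windows]
```

With these parameters a 30-residue fragment yields six windows of 25 residues, and under the one-hot assumption each window becomes a vector of 25 × 20 = 500 input values.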

Seven hidden neurons are constructed in the final ANN model based on a series of systematic optimizations to make independent predictions of potential T3SS effectors for 918 bacterial genomes.

Moreover, based on the sequence patterns of the N-terminal secretion signals, similarity-based comparison with the known effectors, chaperone homologues and AAC, an SVM is trained as an alternative classifier.

Random forests

The RFs classifier is constructed using multiple decision trees, each of which individually makes a classification for the given data set. The final prediction of RFs corresponds to the classification with the most votes over all the decision trees in the forest [115]. The mechanism of RFs is introduced briefly in Figure 2D.
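The voting step can be sketched as below. This is a toy illustration under assumptions: the three ‘trees’ are hypothetical single-threshold rules standing in for full decision trees, and the growing of each tree, typically on a bootstrap sample with random feature subsets, is omitted.

```python
from collections import Counter


def forest_predict(trees, x):
    """Return the class that receives the most votes over all trees."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


# Hypothetical decision "trees", each reduced to one threshold rule on a feature.
trees = [
    lambda x: "effector" if x[0] > 0.5 else "non-effector",
    lambda x: "effector" if x[1] < 0.3 else "non-effector",
    lambda x: "effector" if x[2] > 0.7 else "non-effector",
]

prediction = forest_predict(trees, [0.8, 0.1, 0.2])  # two of three trees vote "effector"
```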

A prediction model of T3SS effectors, ‘T3SPs’, is developed based on the algorithm of RFs [87].

In particular, PSSM profiles are calculated to show position-specific conservation differences between the 283 T3SS effectors and 313 non-T3SS effectors across multiple species, which are generated by using PSI-BLAST to search the Swiss-Prot for multiple alignment against the query sequence. Each residue at one position is characterized by 20 values corresponding to the log-likelihood of mutation of 20 standard amino acids. The significance of the conservation differences between effectors and non-effectors at the same position is assessed statistically, and 52 relatively distinct positions for discrimination of T3SS effectors are retained, which represent the original input data for the RFs.

Based on the consideration of the AAC, secondary structure, solvent accessibility and six physicochemical properties of the 100 N-terminal amino acids of the collected data, 62 features are selected to construct the feature vectors in the RFs, according to their individual contribution to the T3SS prediction.

The selection of relatively optimal positions and features for identifying novel T3SS effectors is based on the argument that some residues with little conservation have poor contributions to the effector identification. Therefore, redundant position and feature information is removed to improve the predictive power of T3SPs.

Additionally, the redundancy in the initial data set is removed by clustering the sequence fragments of the 100 N-terminal amino acids; the homology is removed through multiple alignments. The structural motifs and the relative solvent accessibility, which are included in the feature vectors, are predicted by SABLE [116], and the six physicochemical properties are provided by Protparam in the ExPASy server [117].

Markov model

The features of bacterial effectors can be extracted statistically, where a sequence is predicted to be an effector protein according to its probability of being one, based on the application of Markov models [77, 78, 81, 88, 89, 100].

A Markov model is a stochastic model used to describe a randomly changing process where what happens next depends merely on the current state of the system. In general, ‘Markov chains’ and ‘HMMs’ are commonly used, where the sequential state can be either observable or not.

As the simplest model, a Markov chain contains only one state sequence describing the process with sequential dependence between adjacent states. In the context of predicting bacterial effectors, the sequential dependence assumption of Markov models can be generalized as a probabilistic dependence between adjacent amino acids [88]. Hence, the probability of generating a sequence S = A_1, A_2, …, A_n of length n can be described as a Markov chain and calculated as the product of the conditional probabilities of each adjacent amino acid pair, as follows:
\[ P(S) = P(A_1) \prod_{i=1}^{n-1} P(A_{i+1} \mid A_i) \qquad (1) \]
where A_i denotes the ith amino acid, P(A_1) is the probability of the first amino acid and P(A_{i+1} | A_i) describes the probability of A_{i+1} conditioned on the amino acid at the sequentially preceding position, A_i.

‘T3_MM’, a Markov-chain-based prediction method of T3SS effectors, is constructed [88] based on the observation of T3SS effector-specific AAC conditional dependence.

Precisely, the probability of each amino acid conditional on the amino acid in the preceding position within the 100 N-terminal amino acids in the T3SS effectors is different from that of the non-T3SS effectors and the theoretical random probability distribution of each amino acid. In other words, given an amino acid, the sequentially following amino acid shows a significantly biased composition in T3SS effectors compared with that of non-T3SS effectors and the theoretical random distribution of AAC.

In particular, protein sequences are modeled as Markov chains with each state described by the amino acids in the protein sequence. Given a training data set composed of 154 T3SS effectors and 308 non-T3SS effectors, the AAC conditional probabilistic modeling results of each group of motifs are calculated. T3_MM is constructed based on the distribution of the conditional probability difference between the two groups to predict novel T3SS effectors, which corresponds to the likelihood ratio of the sequence being a T3SS effector or a non-effector.
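A likelihood-ratio score of this kind can be sketched as follows. This is a minimal illustration under assumptions and not the T3_MM implementation: the function names, the additive pseudocount and the four toy training sequences are hypothetical, and the term for the first residue is left out for brevity.

```python
import math
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def transition_probs(sequences, pseudocount=1.0):
    """Estimate P(next | current) from adjacent residue pairs, with additive smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    probs = {}
    for a in AMINO_ACIDS:
        total = sum(counts[a][b] for b in AMINO_ACIDS) + pseudocount * len(AMINO_ACIDS)
        probs[a] = {b: (counts[a][b] + pseudocount) / total for b in AMINO_ACIDS}
    return probs


def log_likelihood_ratio(seq, p_eff, p_non):
    """Sum of log[P_eff(b|a) / P_non(b|a)] over adjacent pairs; > 0 favors 'effector'."""
    return sum(
        math.log(p_eff[a][b] / p_non[a][b])
        for a, b in zip(seq, seq[1:])
        if a in p_eff and b in p_eff[a]  # skip non-standard residues
    )


# Hypothetical toy training groups (real models use the 100 N-terminal residues).
p_eff = transition_probs(["SSSSPSSS", "SPSSSSPS"])
p_non = transition_probs(["LLLLALLL", "LALLLLAL"])

score_a = log_likelihood_ratio("SSPSS", p_eff, p_non)
score_b = log_likelihood_ratio("LLALL", p_eff, p_non)
```

Working with logarithms turns the product of conditional probabilities into a sum, which avoids numerical underflow for long sequences.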

In addition to 5-fold cross-validation, two independent testing data sets are included to evaluate the performance of T3_MM for predicting T3SS effectors.

Compared with the single-state sequence in Markov chains, HMMs are composed of a hidden-state sequence and an observed-symbol sequence, where the state sequence is a Markov chain that cannot be observed directly but is inferred probabilistically from the observed symbol sequence [118].

HMMs are commonly applied as sequence homology search tools in molecular domains based on probabilistic inference methods [119–121]. HMMs of a protein domain are typically built based on multiple sequence alignment of homologous domains, with the states representing the match, insertion and deletion of amino acids (the symbols) in the sequences, as shown in Figure 2E [120, 121].

In the context of predicting bacterial effectors, HMMs are used to model a particular region related to the effector proteins, such as the promoter patterns in the upstream region that encodes T3SS effectors [81, 77, 100], the full-length T3SS effectors [84] and phosphorylated ‘EPIYA’ motif sequences in both T3SS and T4SS effectors [89]. With additional consideration of the duration of states, a hidden semi-Markov model (HSMM) is constructed to describe the C-terminal patterns of T4SS effectors and to make novel predictions [78]. The constructed HMMs and HSMMs are further applied to search for homologous motifs in larger data sets to expand the inventory of bacterial effectors.

Other prediction strategies

There are exceptions of prediction methods that do not adopt machine learning algorithms or statistical probabilities, such as ‘searching algorithm for type IV secretion system effectors (S4TE)’, which predicts T4SS effectors using all positive examples to make novel predictions [96]. S4TE considers 13 features depicting known T4SS effectors as criteria for detection and returns putative predictions with a satisfactory score that is summed over the individual absence or presence of each feature. The 13 features are the promoter pattern of known T4SS effectors, sequence homology, the occurrence of eukaryotic and prokaryotic domains, NLS, MLS, prenylation domain and structural motifs, the basicity, charge, hydrophilicity and E-block module in the C-terminus, as well as global hydrophilicity of the considered protein.
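A presence/absence score of this kind can be sketched as below. This is an illustration only and not the S4TE scoring scheme: the feature names, the unit weights and the threshold are hypothetical, and only five of the 13 features are shown.

```python
def feature_score(presence, weights):
    """Sum the weights of all features that are present in the candidate."""
    return sum(weights[name] for name, present in presence.items() if present)


def is_putative_effector(presence, weights, threshold):
    """Report the candidate when its summed score reaches the threshold."""
    return feature_score(presence, weights) >= threshold


# Hypothetical unit weights for a subset of the features.
weights = {
    "promoter_motif": 1,
    "homology": 1,
    "eukaryotic_domain": 1,
    "nls": 1,
    "c_terminal_basicity": 1,
}

# Hypothetical candidate protein: which features were detected.
candidate = {
    "promoter_motif": True,
    "homology": False,
    "eukaryotic_domain": True,
    "nls": True,
    "c_terminal_basicity": False,
}

score = feature_score(candidate, weights)
reported = is_putative_effector(candidate, weights, threshold=3)
```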

The exceptions also include the classic prediction methods that use comparative genomics to predict effectors. They thrived during the early stage of the field, promoted by the increasing number of available genomic sequences from pathogens, but are limited by the poor conservation between effectors.

As a classic bioinformatics tool to search homologs, the ‘basic local alignment search tool (BLAST)’ [122] has been extensively used to search homologs based on sequence similarity [19, 44, 66, 76, 81].

For example, validated and predicted T3SS effectors in different organisms are used to search for novel effectors in Escherichia coli based on the BLAST-determined sequence similarity [66]. Putative homologous hits are subjected to a second round of PSI-BLAST searches against the NCBI’s NR database to identify more distantly related homologs.

New putative ORFs encoding T3SS effectors are identified based on the BLAST-determined similarity to the 100 N-terminal residues of known effectors in Salmonella typhimurium [19]. The alignment of newly identified known effectors suggests that a consensus motif of WEK(I/M)XXFF is located in the 50 N-terminal residues, which functions as a conserved region for the translocation of effectors in T3SS.

Based on the HMM-identified promoters, downstream ORFs encoding T3SS effectors are searched by BLAST to identify putative T3SS effectors that are homologous to known ones [81].

Based on the identification of promoters above, the 50 N-terminal residues and homology to known effectors are also used to identify novel T3SS effectors [76]. Two motifs of ORFs are searched along the genome of P. syringae for novel T3SS effectors, which are defined according to the likelihood of certain amino acids appearing within the N-terminal region. The putative ORF hits returned by BLAST searches are subsequently eliminated by heuristic rules, such as a starting residue of Met and a minimal length of 150 aa.

Similarly, a phylogenetically dispersed superfamily of homologous T6SS effectors is identified using physical information derived from experimentally validated bacterial T6SS effectors and effector-immunity pairs [44].

Performance evaluation

The fidelity of predicted bacterial secreted effectors can be verified by experimental confirmation, which establishes exactly how many of the predictions are reliable [66, 69, 76]. However, experimental approaches are laborious and do not lend themselves to fast, efficient automation. As with the identification of secreted effectors itself, in silico assessment techniques are therefore required.

As the in silico prediction of secreted effector proteins in bacteria is usually treated as a binary classification task of discriminating effectors from non-effectors, the prevalent statistical measures of classification are adopted to assess the performance of each prediction method, namely, the ‘sensitivity’, ‘specificity’, ‘Matthews correlation coefficient’ (MCC), ‘receiver operating characteristic (ROC) curve’ and the ‘area under the curve’ (AUC). Sensitivity (TP/(TP + FN)) and specificity (TN/(TN + FP)) assess the ability to correctly identify effectors and to correctly reject non-effectors, respectively. MCC combines sensitivity and specificity and is generally regarded as a comprehensive and balanced measure that can be used even if the positive and negative samples are of different sizes [123]. The coefficient values range from −1 to +1, with +1 representing perfect prediction, 0 indicating random guessing and −1 representing total disagreement between prediction and observation.
\[ \mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \qquad (2) \]
where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively.
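These three measures can be computed directly from the four counts of a confusion matrix, as sketched below. The function names and the toy counts are hypothetical, and returning 0 when the MCC denominator vanishes is a common convention assumed here rather than something stated in the source.

```python
import math


def sensitivity(tp, fn):
    """Fraction of true effectors that are correctly identified."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of true non-effectors that are correctly rejected."""
    return tn / (tn + fp)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 if any marginal total is zero (assumed convention)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical test set: 100 effectors and 1000 non-effectors.
tp, fn, tn, fp = 90, 10, 880, 120
sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
coefficient = mcc(tp, tn, fp, fn)
```

With unbalanced groups such as these, sensitivity and specificity can both look high while the MCC stays moderate, because the false positives outnumber the true positives.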

The ROC curve is produced by plotting the sensitivity (y-axis) of the prediction method versus the ‘fall-out’ [x-axis, equal to (1 − specificity)]. A perfect prediction with neither FN nor FP corresponds to point (0,1) in the ROC space, and a completely random guess yields a point along the diagonal line from (0,0) to (1,1). Points below the diagonal line indicate results that are worse than random guessing. Correspondingly, the AUC is 1 when all the samples are classified correctly and is 0.5 when the classification is random.
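The AUC can also be computed without drawing the curve, because it equals the probability that a randomly chosen effector receives a higher score than a randomly chosen non-effector, with ties counted as one half. The sketch below uses this pairwise formulation; the function name and the toy scores are hypothetical, and the quadratic loop is meant for clarity rather than for large data sets.

```python
def auc(scores_pos, scores_neg):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    if not scores_pos or not scores_neg:
        raise ValueError("both groups must contain at least one score")
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical classifier scores for known effectors and non-effectors.
perfect = auc([0.9, 0.8], [0.2, 0.1])
uninformative = auc([0.5, 0.5], [0.5, 0.5])
inverted = auc([0.1], [0.9])
```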

The measures described above can be applied to assess the performance of a prediction method whenever the numbers of effectors and non-effectors in the testing data set are known. Under this condition, the results can be classified as correct predictions [true positives (TPs) and true negatives (TNs)] or incorrect predictions [false positives (FPs) and false negatives (FNs)]. For example, given 275 known L. pneumophila T4SS effectors, the fidelity of S4TE is measured as the number of correct predictions, which is reflected by the above parameters [96]. During the reverse experiments in SIEVE, which swap the training and testing data sets of the effectors from P. syringae and S. typhimurium, the respective sensitivity and specificity values are calculated [11]. In particular, at a sensitivity of 90%, the specificity is 88% when SIEVE is trained using the P. syringae effectors and then used to predict S. typhimurium effectors, and 87% in the opposite case. The corresponding ROC values are 0.95 and 0.96 [11]. The performance of different classification algorithms for predicting certain types of effectors is commonly compared using these measures [85–88]. The evaluation values are also used to assess the contribution of individual features to discriminating effectors from non-effectors. For example, PSSM is the most effective single feature for representing T4SS effectors, as it has the highest sensitivity (89.4% for identifying IVB effectors and 73.3% for identifying IVA effectors), MCC values (0.874 and 0.782, respectively) and AUC (0.970 for identifying IVB effectors) [38]. The optimal MCC values are obtained when the AACs are combined with the PSSM profiles (0.878 and 0.784 for identifying IVB and IVA effectors, respectively) [38].

Available online resources

With the rapid development of high-throughput sequencing technologies, an increasing number of bacterial genomes has been fully sequenced and made available for the analysis of bacterial secretion systems and effectors. There is an unprecedented requirement for bioinformatics tools/resources that are able to identify secreted effectors conveniently and reliably from these genomic data.

Representative prediction methods of bacterial effector proteins secreted via certain types of secretion systems are introduced above. In this section, we present more specific tools developed for this domain, which are mostly available online (Table 1), as well as databases built for the secreted effectors (Table 2). We hope the tables illustrate our respect for the computational efforts that have been made, although we cannot exhaustively enumerate all of them within this article.

Table 1

Computational prediction methods of secreted effector proteins in bacteria

| Method | Secretion system | Prediction algorithm | Feature | Training data | Size | Description | Accessibility |
|---|---|---|---|---|---|---|---|
| SIEVE [11] | T3SS | SVM | AAC, SEQ, GC, CONS and PHYL | S. typhimurium and P. syringae | 65e, unfixed | SIEVE learns the difference between T3SS effectors and non-effectors and detects novel effectors based on SVM, with the information of G+C content, AAC, the 30 N-terminal residues of both effectors and non-effectors, and their evolutionary relationships to other genomes. | http://www.sysbep.org/sieve/ |
| EffectiveT3 [85] | T3SS | NB | AAC, SEQ, GC, PHYL | 5 species | 100e, 200ne | EffectiveT3 uses the NB classifier to model the AAC and secondary structures of the N-terminal residues of the T3SS effectors and non-effectors, and makes novel predictions. | http://www.effectors.org/method/effectivet3 |
| T3SS_prediction [86] | T3SS | ANN, SVM | N-AAC | P. syringae and others | 575e, 685ne | T3SS_prediction is composed of ANNs that identify potential T3SS effectors based on the N-terminal amino acid sequences using a sliding-window procedure. | http://gecco.org.chemie.uni-frankfurt.de/T3SS_prediction/T3SS_prediction.html |
| T3SEdb [124] | T3SS | NB | PP | 28 species | 100e, 100ne | T3SEdb detects novel T3SS effectors based on the hydrophobicity, polarity and β-turn profiles extracted from 100 N-terminal residues of known effectors. | http://effectors.bic.nus.edu.sg/T3SEdb/predict.php |
| SSE-ACC [77] | T3SS | SVM | AAC, SS and ACC | P. syringae | 108e, 3424ne | An SVM-based model developed to identify T3SS effectors based on the AAC, SS and ACC of known effectors from the 100 N-terminal residues. | On request |
| BPBAac [82] | T3SS | SVM | AAC | Unclear | 154e, 308ne | An SVM-based method for identifying T3SS effectors with the information of the AAC features within the 100 N-terminal residues, which are extracted using a Bi-Profile Bayesian (BPB) model*. | http://biocomputer.bio.cuhk.edu.hk/T3DB/BPBAac.php |
| T3SPs [87] | T3SS | RF | AAC, SS, ACC, PP | 16 species | 283e, 313ne | Based on the AAC, SS, ACC and six physico-chemical properties of the 100 N-terminal amino acids, T3SPs uses RFs to predict T3SS effectors. | Unavailable |
| T3_MM [88] | T3SS | Markov model | Conditional dependence of AAC | Unclear | 154e, 308ne | T3_MM is constructed to predict novel T3SS effectors based on the distribution of the conditional probability difference between the effectors and non-effectors, which corresponds to the likelihood ratio of the sequence being either motif. | http://biocomputer.bio.cuhk.edu.hk/T3DB/T3_MM.php |
| T3SEpre [83] | T3SS | SVM | AAC, SS and ACC | Unclear | 189e, 385ne | T3SEpre is an SVM-based model for predicting T3SS effectors, with position-specific AAC, SS and ACC extracted from known effectors and non-effectors. | http://biocomputer.bio.cuhk.edu.hk/T3DB/T3SEpre.php |
| BEAN [84] | T3SS | SVM | AAC, PSSM | Unclear | 154e, 1540ne | BEAN uses a PSSM-based k-spaced amino acid pair composition method extracted within N-terminal sequences to compute the feature vectors of known effectors and non-effectors, which are separated by SVM. | Unavailable |
| BEAN2.0 [97] | T3SS | BLAST, SVM | S, D, AAC, PSSM | Unclear | 243e, 486ne | BEAN2.0 consists of three predictors, the first two predictors detect novel T3SS effectors or non-effectors based on similarity in sequence and domains to known effectors, and the third predictor adopts a similar strategy to BEAN but considers longer sequences. | http://systbio.cau.edu.cn/bean/ |
| pEffect [125] | T3SS | BLAST, SVM | S, PSSM | 43 species | 115e, 3460ne | pEffect uses two predictors to make predictions of novel T3SS effectors, where the first PSI-BLAST-based predictor is used to identify sequential similarity to known effectors, and the second SVM-based predictor is used when no accepted similarity is available. | http://services.bromberglab.org/peffect/ |
| S4TE [96] | T4SS | BLAST | S, P, D, SS and PP | Unclear | Unclear | S4TE predicts novel T4SS effectors according to 13 features depicting known ones by computing a score summed over the individual absence or presence of features. | http://sate.cirad.fr/ |
| T4EffPred [38] | T4SS | SVM | AAC, PSSM | Unclear | 340e, 1132ne | T4EffPred uses SVM to predict T4SS effectors based on the AAC, dipeptide composition, PSSM composition profiles and their auto covariance transformation. | http://bioinfo.tmmu.edu.cn/T4EffPred/ |
| T4SEpre [99] | T4SS | SVM | C-AAC, C-SS and C-ACC | 10 species | 347e, 694ne | T4SEpre is an SVM-based model for predicting T4SS effectors, based on the information of the C-terminal AAC, SS and ACC, extracted from the 100 N-terminal residues of known effectors. | http://biocomputer.bio.cuhk.edu.hk/T4DB/T4SEpre.php |

Table 1

Computational prediction methods of secreted effector proteins in bacteria

| Method | Secretion system | Prediction algorithm | Feature | Training data | Size | Description | Accessibility |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SIEVE [11] | T3SS | SVM | AAC, SEQ, GC, CONS and PHYL | S. typhimurium and P. syringae | 65e, unfixed | SIEVE learns the difference between T3SS effectors and non-effectors and detects novel effectors based on SVM, with the information of G+C content, AAC, the 30 N-terminal residues of both effectors and non-effectors, and their evolutionary relationships to other genomes. | http://www.sysbep.org/sieve/ |
| EffectiveT3 [85] | T3SS | NB | AAC, SEQ, GC, PHYL | 5 species | 100e, 200ne | EffectiveT3 uses the NB classifier to model the AAC and secondary structures of the N-terminal residues of the T3SS effectors and non-effectors, and makes novel predictions. | http://www.effectors.org/method/effectivet3 |
| T3SS_prediction [86] | T3SS | ANN, SVM | N-AAC | P. syringae and others | 575e, 685ne | T3SS_prediction is composed of ANNs that identify potential T3SS effectors based on the N-terminal amino acid sequences using a sliding-window procedure. | http://gecco.org.chemie.uni-frankfurt.de/T3SS_prediction/T3SS_prediction.html |
| T3SEdb [124] | T3SS | NB | PP | 28 species | 100e, 100ne | T3SEdb detects novel T3SS effectors based on the hydrophobicity, polarity and β-turn profiles extracted from 100 N-terminal residues of known effectors. | http://effectors.bic.nus.edu.sg/T3SEdb/predict.php |
| SSE-ACC [77] | T3SS | SVM | AAC, SS and ACC | P. syringae | 108e, 3424ne | An SVM-based model developed to identify T3SS effectors based on the AAC, SS and ACC of known effectors from the 100 N-terminal residues. | On request |
| BPBAac [82] | T3SS | SVM | AAC | Unclear | 154e, 308ne | An SVM-based method for identifying T3SS effectors with the information of the AAC features within the 100 N-terminal residues, which are extracted using a Bi-Profile Bayesian (BPB) model*. | http://biocomputer.bio.cuhk.edu.hk/T3DB/BPBAac.php |
| T3SPs [87] | T3SS | RF | AAC, SS, ACC, PP | 16 species | 283e, 313ne | Based on the AAC, SS, ACC and six physico-chemical properties of the 100 N-terminal amino acids, T3SPs uses RFs to predict T3SS effectors. | Unavailable |
| T3_MM [88] | T3SS | Markov model | Conditional dependence of AAC | Unclear | 154e, 308ne | T3_MM is constructed to predict novel T3SS effectors based on the distribution of the conditional probability difference between the effectors and non-effectors, which corresponds to the likelihood ratio of the sequence being either motif. | http://biocomputer.bio.cuhk.edu.hk/T3DB/T3_MM.php |
| T3SEpre [83] | T3SS | SVM | AAC, SS and ACC | Unclear | 189e, 385ne | T3SEpre is an SVM-based model for predicting T3SS effectors, with position-specific AAC, SS and ACC extracted from known effectors and non-effectors. | http://biocomputer.bio.cuhk.edu.hk/T3DB/T3SEpre.php |
| BEAN [84] | T3SS | SVM | AAC, PSSM | Unclear | 154e, 1540ne | BEAN uses a PSSM-based k-spaced amino acid pair composition method extracted within N-terminal sequences to compute the feature vectors of known effectors and non-effectors, which are separated by SVM. | Unavailable |
| BEAN2.0 [97] | T3SS | BLAST, SVM | S, D, AAC, PSSM | Unclear | 243e, 486ne | BEAN2.0 consists of three predictors: the first two detect novel T3SS effectors or non-effectors based on similarity in sequence and domains to known effectors, and the third adopts a similar strategy to BEAN but considers longer sequences. | http://systbio.cau.edu.cn/bean/ |
| pEffect [125] | T3SS | BLAST, SVM | S, PSSM | 43 species | 115e, 3460ne | pEffect uses two predictors to make predictions of novel T3SS effectors, where the first PSI-BLAST-based predictor is used to identify sequential similarity to known effectors, and the second SVM-based predictor is used when no accepted similarity is available. | http://services.bromberglab.org/peffect/ |
| S4TE [96] | T4SS | BLAST | S, P, D, SS and PP | Unclear | Unclear | S4TE predicts novel T4SS effectors according to 13 features depicting known ones by computing a score summed over the individual absence or presence of features. | http://sate.cirad.fr/ |
| T4EffPred [38] | T4SS | SVM | AAC, PSSM | Unclear | 340e, 1132ne | T4EffPred uses SVM to predict T4SS effectors based on the AAC, dipeptide composition, PSSM composition profiles and their auto covariance transformation. | http://bioinfo.tmmu.edu.cn/T4EffPred/ |
| T4SEpre [99] | T4SS | SVM | C-AAC, C-SS and C-ACC | 10 species | 347e, 694ne | T4SEpre is an SVM-based model for predicting T4SS effectors, based on the information of the C-terminal AAC, SS and ACC, extracted from the 100 C-terminal residues of known effectors. | http://biocomputer.bio.cuhk.edu.hk/T4DB/T4SEpre.php |

For each type of secreted effector, the prediction methods are listed in chronological order. The machine learning algorithms mentioned here are SVM, support vector machine; NB, naive Bayes; ANN, artificial neural networks; and RF, random forests. ‘Features’ describes the features of the effector sequences considered by the prediction programs: AAC, amino acid composition; SEQ, 30 N-terminal residues; GC, G + C content; PHYL, phylogenetic profile; CONS, evolutionary conservation; SS, secondary structure; ACC, solvent accessibility; PP, physico-chemical properties; S, sequential similarity; D, similar domains; and P, promoter pattern. ‘Training Data’ indicates where the known effectors and non-effectors originate from, and their respective numbers are shown in ‘Size’ (as indicated in publications, e: effector, ne: non-effector). *BPB and Single-Profile Bayesian (SPB) are two methods of feature extraction [82, 126], where BPB considers the features of both positive and negative effectors in the training data sets and SPB considers only the features of positive examples.

Table 2

Warehouse of predicted secreted effector proteins in bacteria

| Method | Secretion system | Organism | Size | Description | Accessibility |
| --- | --- | --- | --- | --- | --- |
| T3SEdb [124] | T3SS | 46 species | 504 effectors, 572 predictions, 13 unknown | In addition to being a predictor of T3SS effectors, T3SEdb also collects the effectors, manually annotates them and enables the assessment of sequence diversity among them. | http://effectors.bic.nus.edu.sg/T3SEdb/index.php |
| T3DB [127] | T3SS | 35 species | Unclear | T3SS-related Database (T3DB) annotates the T3SS-related information, including the apparatus, chaperones, effectors and regulators. It also integrates multiple programs predicting T3SS effectors to provide the user with online prediction. | http://biocomputer.bio.cuhk.edu.hk/T3DB/ |
| BEAN2.0 [97] | T3SS | 221 species | 1215 effectors | In addition to being a predictor of T3SS effectors, BEAN2.0 also integrates a warehouse of effectors, two functional relationship networks constructed among them and multiple sequence analysis tools, such as subcellular location predictors. | http://systbio.cau.edu.cn/bean/ |
| SecReT4 [128] | T4SS | 289 species | 239 effectors, 1645 predictions | SecReT4 is a resource that specifically stores both effectors and core components of T4SS; it also provides access to multiple functional analysis tools, such as similarity search tools. | http://db-mml.sjtu.edu.cn/SecReT4/ |
| SecReT6 [129] | T6SS | 240 species | 92 effectors, 1248 predictions | Similar to SecReT4, SecReT6 is constructed as a specific resource for storing data on the T6SS, cognate effectors and immunity proteins. Multiple functional analysis tools are also supported, such as the detection of T6SS gene clusters. | http://db-mml.sjtu.edu.cn/SecReT6/index.php |
| Effective [107] | Multiple systems | 587 species | 421 774 predictions | Effective is a warehouse of putative effectors predicted by two integrated components, which are based on either the detection of Sec and T3SS pathway signals or the identification of eukaryotic-like domains. | Unavailable |
| EffectiveDB [130] | Multiple systems | 1677 species | Unclear | As the updated version of Effective, EffectiveDB enables the prediction of specific T3SS and T4SS effectors or more general secreted effectors based on the identification of secretion signals, binding domains of T3SS chaperones or eukaryotic-like domains. EffectiveDB also supports the prediction of T3SS, T4SS and T6SS. | http://effectors.org/ |

For each type of secreted effector, the corresponding warehouses are listed in chronological order. The number of ‘Organisms’ and corresponding ‘Size’ are taken from the publications, where ‘effectors’ indicate experimentally verified ones, ‘predictions’ refer to hypothetical ones and ‘unknown’ indicates those with incomplete information.

Benchmarking the prediction methods based on curated data sets

Machine learning approaches behave like black boxes: they do not imitate the unknown biological mechanism but model the statistical characteristics of the effectors in their training data sets, so each algorithm carries its own weaknesses. SVMs separate effectors from non-effectors by a hyperplane, which depends strongly on the dimension of the features and requires a large amount of computation. NB classifiers rest on the assumption of conditional independence between features, which limits their use for the broad prediction of secreted effectors whose features are closely connected. ANNs have many parameters in each layer of neurons, which makes it difficult for users to inspect how a prediction is reached.

To assess how well the prediction methods identify secreted effectors, we benchmark, whenever possible, the methods in Table 1, treating each algorithm of a detection tool as a separate method. We skip the assessment of T6SS effector prediction, as An et al. have reported the limitations of the available approaches for identifying T6SS effectors and no more effective methods are available.

The benchmarked methods are BPBAac [82], EffectiveT3-SEL [85], EffectiveT3-SEN [85], pEffect [125], SIEVE [11], T3_MM [88], T3SS_prediction-ANN [86] and T3SS_prediction-SVM [86] for predicting T3SS effectors, and T4EffPred [38], T4SEprebpbAac [99], T4SEpreJoint [99] and T4SEprepsAac [99] for predicting T4SS effectors. The updated predictive model of EffectiveT3 makes predictions under two score thresholds, ‘selective’ and ‘sensitive’. The ‘selective’ setting is the default minimal score of 0.9999 from the naive Bayes classifier for the class of secreted effectors, whereas the ‘sensitive’ setting corresponds to a threshold score of 0.95; we therefore assess the two settings separately as EffectiveT3-SEL and EffectiveT3-SEN [85]. T3SS_prediction is primarily an ANN-based predictive model, but it can also make predictions based on an SVM, so we assess the two separately as T3SS_prediction-ANN and T3SS_prediction-SVM [86]. T4SEpre constructs predictive models from three combinations of features: the BPB position-specific AAC features, the SPB position-specific and sequence-based AAC features, and the joint features of the position-specific AAC, SS and ACC, corresponding to T4SEprebpbAac, T4SEprepsAac and T4SEpreJoint, respectively.
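
The two EffectiveT3 operating points can be pictured as two cutoffs applied to the same classifier score. The following is a minimal sketch, not EffectiveT3's own code: the function name `call_effector` is hypothetical, and treating a score exactly equal to a cutoff as a positive call is an assumption.

```python
# Cutoffs quoted in the text for the two EffectiveT3 settings.
SELECTIVE_CUTOFF = 0.9999  # 'selective' (default) setting
SENSITIVE_CUTOFF = 0.95    # 'sensitive' setting


def call_effector(score: float) -> dict:
    """Hypothetical helper: turn one NB score into two binary calls.

    Assumption: a score equal to the cutoff counts as an effector call.
    """
    return {
        "EffectiveT3-SEL": score >= SELECTIVE_CUTOFF,
        "EffectiveT3-SEN": score >= SENSITIVE_CUTOFF,
    }


# A protein scoring 0.97 is called an effector only by the sensitive setting.
print(call_effector(0.97))
```

Because the selective cutoff is the stricter of the two, every protein called by EffectiveT3-SEL is also called by EffectiveT3-SEN, which is why the two settings trade specificity against sensitivity.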

Construction of the testing data sets

We collect both positive and negative examples of secreted effectors to construct a comprehensive testing data set for the prediction methods.

The positives are known T3SS and T4SS effectors, collected from the databases listed in Table 2 and, whenever possible, from the positive training data sets used to develop the original methods. They are referred to as the T3P and T4P data sets, respectively. We search the NCBI (http://www.ncbi.nlm.nih.gov/protein/) and UniProt (http://www.uniprot.org) databases for the complete effector sequences, of which only parts are used by some prediction methods [82, 88, 99]. The negatives are known non-effectors or artificial sequences, collected from the negative training data sets of the benchmarked methods. They are referred to as the T3N and T4N data sets, respectively. We also consider the ability of each method to distinguish one type of effector from the other types of secreted effectors. Hence, for the prediction of T3SS effectors, the negative examples are collected from T3N, T4P and T6P, a data set composed of known T6SS effectors. For the prediction of T4SS effectors, the negative examples are collected from T4N, T3P and T6P, as shown in Table 3.

Table 3

Positives and negatives in the testing data set of prediction methods

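Prediction methods of T3SS effectors

| Method | Positives (T3P) | Negatives (T3N) | Negatives (T4P) | Negatives (T6P) |
| --- | --- | --- | --- | --- |
| BPBAac | 230 | 1415 | 371 | 65 |
| EffectiveT3-SEL | 230 | 1414 | 371 | 65 |
| EffectiveT3-SEN | 230 | 1414 | 371 | 65 |
| pEffect | 230 | 1415 | 371 | 65 |
| SIEVE | 230 | 1415 | 371 | 65 |
| T3_MM | 230 | 1415 | 371 | 65 |
| T3SS_prediction-ANN | 230 | 1414 | 371 | 65 |
| T3SS_prediction-SVM | 230 | 1414 | 371 | 65 |

Prediction methods of T4SS effectors

| Method | Positives (T4P) | Negatives (T4N) | Negatives (T3P) | Negatives (T6P) |
| --- | --- | --- | --- | --- |
| T4EffPred | 371 | 1567 | 230 | 65 |
| T4SEprebpbAac | 371 | 1567 | 230 | 65 |
| T4SEpreJoint | 371 | 1567 | 230 | 65 |
| T4SEprepsAac | 371 | 1567 | 230 | 65 |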

CD-HIT is used to cluster the sequences and remove redundant ones in each data set, with a sequence identity cutoff of 0.3 [131]. Removing protein sequences with high sequence similarity helps to exclude the effects of homology-driven dependency within the data sets [11, 38, 85]. After filtering, T3P, T4P, T6P, T3N and T4N retain 230, 371, 65, 1415 and 1567 sequences, respectively, and the corresponding numbers of positive and negative examples for each method are shown in Table 3. These curated data sets are available from http://bioinfo.tmmu.edu.cn/BenchmarkPSE/.
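
The redundancy-removal step can be pictured as a greedy filter over length-sorted sequences. The sketch below is not CD-HIT and does not reproduce its results: `toy_identity` is a hypothetical stand-in based on Python's `difflib`, whereas CD-HIT derives identity from alignments, and the sequences are invented toy data. It only illustrates the idea of keeping a sequence when it falls below the identity cutoff against every representative kept so far.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def toy_identity(a: str, b: str) -> float:
    """Crude stand-in for alignment-based sequence identity (assumption).

    difflib's similarity ratio is used only so the sketch runs without
    external tools; it is not suitable for real protein data.
    """
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def remove_redundant(seqs: Dict[str, str], cutoff: float = 0.3) -> List[str]:
    """Greedy filter: keep a sequence only if its identity to every
    representative kept so far is below `cutoff` (longest first)."""
    kept: List[str] = []
    # Longest sequences are visited first so they become representatives.
    ordered = sorted(seqs.items(), key=lambda kv: len(kv[1]), reverse=True)
    for name, seq in ordered:
        if all(toy_identity(seq, seqs[rep]) < cutoff for rep in kept):
            kept.append(name)
    return kept


# Toy data (hypothetical sequences, not taken from T3P or T4P).
toy = {
    "a": "MKTLLVAGGSS",
    "b": "MKTLLVAGGS",   # near-duplicate of "a"
    "c": "QWERYHPDNCF",  # shares no residues with "a"
}
print(remove_redundant(toy))
```

On this toy input the near-duplicate "b" is dropped and "a" and "c" are retained. With real proteins the stand-in similarity would have to be replaced by a proper alignment-based identity.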

Result

Table 4 and Figure 3 show the performance of each prediction method for predicting T3SS or T4SS effectors on the curated data sets. Sensitivity, specificity and the Matthews correlation coefficient (MCC) are used to assess the performance, with the best values highlighted in bold.

Table 4

Performance of the prediction methods based on curated data sets

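Prediction methods of T3SS effectors

| Method | Sensitivity | Specificity | MCC |
| --- | --- | --- | --- |
| BPBAac | 0.6 | 0.991 | 0.708 |
| EffectiveT3-SEL | 0.652 | 0.901 | 0.472 |
| EffectiveT3-SEN | 0.73 | 0.837 | 0.426 |
| pEffect | 0.883 | 0.868 | 0.572 |
| SIEVE | 0.478 | 0.983 | 0.573 |
| T3_MM | 0.822 | 0.835 | 0.484 |
| T3SS_prediction-ANN | 0.77 | 0.903 | 0.559 |
| T3SS_prediction-SVM | 0.7 | 0.985 | 0.75 |

Prediction methods of T4SS effectors

| Method | Sensitivity | Specificity | MCC |
| --- | --- | --- | --- |
| T4EffPred | 0.919 | 0.943 | 0.802 |
| T4SEprebpbAac | 0.911 | 0.975 | 0.874 |
| T4SEpreJoint | 0.21 | 0.995 | 0.392 |
| T4SEprepsAac | 0.892 | 0.99 | 0.905 |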
MethodsSensitivitySpecificityMCC
Prediction methods of T3SS effectors
BPBAac0.60.9910.708
EffectiveT3-SEL0.6520.9010.472
EffectiveT3-SEN0.730.8370.426
pEffect0.8830.8680.572
SIEVE0.4780.9830.573
T3_MM0.8220.8350.484
T3SS_prediction-ANN0.770.9030.559
T3SS_prediction-SVM0.70.9850.75
Prediction methods of T4SS effectors
T4EffPred0.9190.9430.802
T4SEprebpbAac0.9110.9750.874
T4SEpreJoint0.210.9950.392
T4SEprepsAac0.8920.990.905

The corresponding highest values of each parameter are highlighted in bold.

Figure 3

Performance of the prediction methods based on the curated data sets.

For the prediction of T3SS effectors, T3SS_prediction-SVM achieved the highest MCC value (0.75), suggesting that it is the best prediction method on the curated data sets. BPBAac also performed well: it achieved a high MCC value (0.708) and the highest specificity (0.991), that is, it correctly rejected the largest proportion of non-effectors. SIEVE had the lowest sensitivity (0.478), suggesting that it is the least sensitive to the T3SS effectors in the curated data set. In contrast, pEffect correctly identified the largest proportion of effectors (sensitivity 0.883). Overall, EffectiveT3-SEL, EffectiveT3-SEN and T3_MM achieved moderate performance.

For the prediction of T4SS effectors, T4SEprepsAac achieved the highest MCC value (0.905), suggesting that it is the best predictor of T4SS effectors on the curated data sets. T4EffPred also performed reasonably well, as it correctly identified the largest number of T4SS effectors (sensitivity 0.919). Overall, T4SEpreJoint performed the worst, with the lowest MCC value (0.392) and the lowest sensitivity (0.21).

Discussion

Computational identification methods for bacterial effector proteins have been developed over the past decades. The core idea is to capture the statistical ‘rules’ of known effectors and to make novel predictions based on an acceptable similarity to the extracted characteristics.

In this article, we report the progress of in silico identification of bacterial effectors secreted via T3SS, T4SS and T6SS and the features of effectors that help to discriminate them from other proteins. Most approaches are effective for identifying certain effectors, but they generalize poorly to broader predictions because the general features of the effectors are derived from particularly well-characterized bacterial species. Different methods focus on particular features of the effector sequences, including the sequential, structural and physicochemical characteristics, or the regulatory elements, secretion signals, cognate chaperones, solvent accessibility and conservation profiles of the effectors.

Detecting the AAC biases in known effectors is a straightforward and prevalent way to identify novel secreted proteins. Significant enrichment or depletion of certain amino acids has been observed in effector sequences both globally [85, 88] and at specific positions [82, 87]. Global AAC is computed as the number of occurrences of each amino acid divided by the length of the full sequence or of the N-terminal fragment, while position-specific AAC information focuses on the bias of certain amino acids at specific positions. Both types of bias show discriminative power to identify novel effectors.
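
The two kinds of AAC feature described above can be sketched as follows. This is a minimal illustration rather than the implementation of any cited tool; the function names, the 20-letter alphabet constant and the handling of non-standard residues (they count toward the length but get no feature of their own) are assumptions.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def global_aac(sequence, n_terminal=None):
    """Frequency of each amino acid in the full sequence or, if n_terminal
    is given, in the first n_terminal residues only."""
    fragment = sequence[:n_terminal] if n_terminal else sequence
    counts = Counter(fragment)
    length = len(fragment)
    return {aa: (counts[aa] / length if length else 0.0) for aa in AMINO_ACIDS}


def position_specific_aac(sequences, n_positions):
    """For each of the first n_positions, the frequency of each amino acid
    across all sequences long enough to cover that position."""
    profile = []
    for i in range(n_positions):
        column = [seq[i] for seq in sequences if len(seq) > i]
        counts = Counter(column)
        total = len(column)
        profile.append(
            {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}
        )
    return profile


if __name__ == "__main__":
    print(global_aac("MSAAGC", n_terminal=4)["A"])
    print(position_specific_aac(["MSA", "MTA"], 2)[1]["S"])
```

Comparing such frequencies between known effectors and non-effectors is what exposes the enrichment or depletion that the classifiers then exploit.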

The sequence-general and position-specific information are combined in selected studies [11, 38], where the position-specific preferences of amino acids are encoded into PSSM profiles. PSSM profiles commonly represent evolutionary conservation across multiple sequences, which is helpful for discriminating effectors [11, 38, 87]. In particular, T4EffPred calculates PSSM composition profiles and PSSM_AC profiles, the latter representing the auto covariance transformation of the PSSM [38]. Its performance test shows that these two classes of PSSM-based features are more helpful in identifying novel T4SS effectors than AAC, as PSSM profiles may retain the conserved relationships among functionally related proteins, which may fade in individual sequences as they evolve. This argument is supported by the way PSSM profiles are constructed: they are obtained from a multiple sequence alignment rather than from a single sequence, which carries less information. Moreover, sequence-derived characteristics such as AAC and PSSM can be calculated for effectors from diverse organisms, whether in a sequence-general or a position-specific manner. In contrast, prediction methods based on the typical signal sequence of certain secretion systems are limited by how well that signal is conserved across species.
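
As an illustration of the auto covariance idea, the sketch below assumes a commonly used definition: for each PSSM column j and lag g, the average over positions i of (S[i][j] - mean_j) * (S[i+g][j] - mean_j). The exact normalization used by T4EffPred may differ, and the function name and the list-of-rows input format are hypothetical.

```python
def pssm_auto_covariance(pssm, max_lag):
    """Auto covariance features of a PSSM given as a list of L equal-width rows.

    Returns one value per (lag, column) pair: lags 1..max_lag, columns in order.
    """
    length = len(pssm)
    if max_lag >= length:
        raise ValueError("max_lag must be smaller than the sequence length")
    n_cols = len(pssm[0])
    # Column means over the whole sequence.
    means = [sum(row[j] for row in pssm) / length for j in range(n_cols)]
    features = []
    for lag in range(1, max_lag + 1):
        for j in range(n_cols):
            total = sum(
                (pssm[i][j] - means[j]) * (pssm[i + lag][j] - means[j])
                for i in range(length - lag)
            )
            features.append(total / (length - lag))
    return features


if __name__ == "__main__":
    toy_pssm = [[1, 0], [3, 0], [5, 0], [7, 0]]  # hypothetical 4 x 2 matrix
    print(pssm_auto_covariance(toy_pssm, max_lag=2))
```

The transformation turns a PSSM of variable length into a fixed-length vector (number of columns times `max_lag`), which is what allows proteins of different lengths to be fed to the same classifier.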

Feature combination is a frequently used approach for improving classification performance [38, 69, 83, 88, 96] by combining not only the sequence-general and position-specific information but also the divergent features depicting the effectors. This is consistent with the argument that combining both the secondary structure and relative solvent accessibility could improve the predictive performance of identifying novel T3SS effectors [77].

During the computational progress of identifying bacterial effectors, selected machine learning approaches have been applied and have shown good performance. The method is chosen according to the specific case, although all of them turn the problem of predicting novel secreted effectors into one of discriminating them from non-effectors. We benchmarked the available prediction methods for T3SS and T4SS effectors on our curated data sets. We observed that T3SS_prediction-SVM is the best T3SS effector predictor, while T4SEprepsAac is the best T4SS effector predictor on these data sets. T4EffPred performed worse than expected, as it makes novel predictions based on PSSM profiles calculated by new models.

No prediction method achieved an overwhelming success in our benchmarking. We therefore suggest combining them, classifying each input sequence according to a voting scheme over the individual algorithms. As pioneers, Burstein et al. and Chen et al. selected NB, Bayesian networks, SVM and ANN to construct a voting classifier to discriminate T4SS effectors in the genomes of L. pneumophila and Coxiella burnetii, respectively [69, 70].
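
A voting scheme of this kind can be sketched in a few lines. The example assumes that each individual method is wrapped as a function returning True for a predicted effector; the strict-majority rule and the three toy predictors are hypothetical and do not reproduce the classifiers of [69, 70].

```python
def majority_vote(predictors, sequence):
    """Return True if more than half of the predictors call the sequence an effector."""
    votes = [bool(predict(sequence)) for predict in predictors]
    return sum(votes) * 2 > len(votes)


# Hypothetical stand-ins for trained classifiers (not real prediction rules).
toy_predictors = [
    lambda seq: seq.count("S") / len(seq) > 0.1,  # serine-rich sequence
    lambda seq: len(seq) > 50,                    # long sequence
    lambda seq: seq.startswith("M"),              # starts with methionine
]

if __name__ == "__main__":
    print(majority_vote(toy_predictors, "MSSSA"))
```

With an even number of predictors a tie counts as a negative call here; a weighted vote based on each method's benchmarked MCC would be a natural alternative.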

A sensible construction of training data sets, especially the negatives, will greatly improve the prediction performance; taxonomic, characteristic, numerical and functional biases should be avoided. For example, to avoid taxonomic bias, non-effectors from many pathogens should be collected [85]. To reduce the bias toward the characteristics of the N-terminal signal sequence in T3SS effectors and other signals, non-effectors with Sec-dependent translocation signals, cytoplasmic proteins and proteins exported by unknown pathways are included in the negative training data set [86]. A sensible size of the training data set and a proper ratio between positives and negatives improve the statistical significance of the derived features. The growing number of credibly validated effectors should also contribute substantially to the prediction performance.

Given the limited accuracy of specific prediction methods, subsequent filtering or further ranking of predictions for experimental validation is highly recommended. To achieve reliable and beneficial predictions, in addition to more comprehensive training data, more informative and discriminative features are desired. Most approaches are effective for identifying certain effectors, but they generalize poorly to broader predictions because the general features of the effectors are derived from particularly well-characterized bacterial species. In previous work, the AAC, signal information, sequential, physicochemical, evolutionary conservation, eukaryotic domains and similar attributes of the secreted effector proteins have been considered by multiple prediction methods. We propose that information on PPIs between secreted effectors and host cells, the evolutionary relationships between effectors, GO annotations, pathways and 3-D structural information of effectors should be incorporated in the future. The observation of new biomarkers and the advancement of metagenomics should be helpful as well.

As mentioned above, we envision a future in which as many features as possible are considered to discriminate secreted effectors from non-effectors. We also need more reliable and efficient methods of feature extraction and representation that encode the proteins effectively. For example, SPB and BPB are two ways to extract features, and recursive feature elimination is applied to select important features. We believe that advances in this kind of method will also promote the identification of secreted effectors in bacteria. The involvement of more promising algorithms, such as deep learning and logistic regression, may further strengthen the ability and efficiency of prediction methods for secreted effectors in bacteria.

Genome sequencing has resulted in an explosion of knowledge about bacterial secretion systems. We know that Gram-negative bacteria use multiple secretion systems to translocate effectors extracellularly, especially the T3SS, T4SS and T6SS, via which the effectors are secreted directly into the host cells. A clearer understanding of the mechanisms behind the molecular processes, such as effector targeting and secretion, is still far beyond our current capacity. More comprehensive observation and better interpretation of these molecular processes will also promote a higher level of success in predicting secreted effectors in bacteria.

Furthermore, many of the in silico methods rely on existing software packages: SVMs are run with LIBSVM [132], Gist [133] or the WEKA toolbox [134], which also supports the calculation of NB in EffectiveT3 [85]; ANNs are run in Matlab [135]; RFs are run with the RF package in R; and HMMs are run with HMMER [136]. The availability of these computational tools provides easy access and a friendly experience for users, resulting in the prevalent and well-characterized application of in silico methods to the prediction of bacterial effectors.

S4TE, one method that is not benchmarked, predicts T4SS effectors based on the summed score of the presence of features discriminating known effectors [96]. This type of prediction approach is complementary to machine learning algorithms and statistical approaches. It is especially practical when the known samples are insufficient to construct both the positive and negative training data sets for machine learning algorithms. It may also avoid much of the mathematical computation required by statistical approaches.
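
The summed-score idea can be sketched as follows. The feature names, weights and threshold are purely hypothetical placeholders chosen for illustration; they are not the features or scores used by S4TE.

```python
def summed_feature_score(features_present, weights):
    """Sum the weights of all features detected in a protein."""
    return sum(weights.get(name, 0.0) for name in features_present)


def is_candidate(features_present, weights, threshold):
    """Call a protein a candidate effector if its summed score reaches the threshold."""
    return summed_feature_score(features_present, weights) >= threshold


# Hypothetical feature weights (assumed values, for illustration only).
TOY_WEIGHTS = {
    "eukaryotic_like_domain": 2.0,
    "c_terminal_charge": 1.0,
    "coiled_coil": 1.0,
}

if __name__ == "__main__":
    detected = {"eukaryotic_like_domain", "coiled_coil"}
    print(summed_feature_score(detected, TOY_WEIGHTS))
    print(is_candidate(detected, TOY_WEIGHTS, threshold=2.5))
```

No negative training set is needed for such a scheme: only the list of features and a threshold have to be chosen, which is why it remains usable when few validated effectors are known.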

However, none of these methods is exhaustive or generally applicable. Homology-based approaches can only identify effectors that are close members of known effector families, and these are mostly specific for, and hence limited to, certain well-known bacterial species. This type of method may also perform poorly on novel effectors, as effectors evolve and mutate quickly and therefore share low sequence similarity. The success of the T346Hunter software depends on the high conservation of the secretion apparatus [61]. Although the promoter search is an efficient method for identifying downstream genes encoding effectors, it is limited in identifying true effectors in P. syringae, because true effectors may not be preceded by known promoters, or the ORF hits downstream of the promoter may have relatively poor values and thus are hardly detected or encode unknown proteins [76, 77, 81]. Moreover, specific promoter information is available for only a small group of bacterial pathogens. The prediction methods based on the signal sequence located in the effectors have demonstrated effective discriminative power. However, a protein containing a signal sequence is not necessarily a secreted effector, as some proteins in bacteria without protein-translocation secretion systems may have signal sequences [83, 85, 86, 99]. Hence, the discriminative power of signal-based prediction methods should not be overstated. Amino acid composition biases exist in either the entire effectors or particular regions observed in certain species, providing limited discrimination in other bacterial genomes [11]. Studies have successfully identified a number of T6SS effectors based on the physical characteristics of known effectors [44] and an N-terminal sequence marker [55], but neither method could identify the T6SS effector TseL in V. cholerae [65].

On the other hand, although computational prediction methods identify bacterial effectors rapidly, efficiently and automatically, FP predictions, in which the predictive model labels a non-effector as an effector candidate, remain a flaw of machine learning approaches. For example, the secretion signals of T3SS effectors are predicted to exist in Gram-positive bacteria and yeasts [83], and cytoplasmic proteins are predicted to have a T3SS export signal [86]. Hence, it should be remembered that the machine learning algorithms are trained on sequence features derived from a small number of known effectors from limited species and cannot be expected to generalize to fully accurate effector discovery in other bacterial species. In addition, the structural motifs considered in [77, 83, 85] are themselves based on the predictions of other methods, such as PSIPRED [79]. Although using a greater number of features improves the power to identify effectors, we cannot exclude the possibility of FPs introduced by these additional prediction programs.

Furthermore, studies identifying T3SS effectors are abundant compared with those identifying T4SS effectors, and the imbalance is even more substantial for studies predicting T6SS effectors. Methods that have demonstrated good performance for identifying T3SS effectors lose competitiveness when identifying T4SS and T6SS effectors. This may be owing to the distinctive features of the three families of effectors, and to the small data set of validated T4SS effectors and the even smaller pool of known T6SS effectors, which directly affect the accuracy of machine learning approaches that depend on the quantity of authentic negative and positive examples. Currently, the most effective way to identify new T4SS and T6SS effectors is to validate predicted candidates according to the common features of known effectors encoded by the same or closely related bacteria [69, 70]. However, SecRet4, for example, contains <300 experimentally validated effectors among the 1884 effectors listed in the database. This demonstrates how slowly biological techniques identify effectors compared with computational methods. Hence, in silico methods must improve their overall performance, including reliability, promptness, efficiency and generality, which is the responsibility of the community.

Wang et al. reported the C-terminal preferences of AAC, SS and ACC in T4SS effectors, which are quite similar to those of the N-terminal region in T3SS effectors [82, 99]. If this commonality exists generally, should the prediction methods developed for the N-terminal signal region of T3SS effectors show some power in the prediction of T4SS effectors? Xu et al. reported an HMM-based method to predict and evaluate putative T3SS and T4SS effectors with EPIYA motifs. Their results showed that the predicted T3SS effectors were scattered across a broad range of biological species, but this motif was not widely distributed in T4SS secreted proteins [89], which means that this characteristic motif may not successfully discriminate T4SS effectors from T3SS effectors [38]. Hence, it is unclear how to use this type of commonality to mine more T4SS effectors. In one study to identify T6SS effectors, a conserved domain that functions similarly to the T3SS chaperone proteins was used to identify the associated downstream effectors rather than relying on the diverse sequential, structural and functional features of effector sequences [65]. This conserved domain has several features that are highly similar to T3SS chaperones, such as low molecular weight and pI values. These similarities are expected to help the future identification of T6SS effectors, with generalizability to methods for predicting T3SS effectors. Li et al. reported the immunity proteins accompanying T6SS effectors, which prevent the bacterial cell from self-intoxication owing to the secreted toxin effectors being more stable than the cognate effectors. This result may suggest a way to predict T6SS effectors by considering the structural information of immunity proteins [137].

Inspired by the observations and discussions above, we wish to present the current progress toward the development of our new prediction method for T6SS effectors and of a more general, powerful and large-scale method that is able to discriminate the three families of secreted effectors from a large amount of genetic data. We cannot help but dream of a ‘super-powerful’ prediction method for bacterial effectors that combines the advantages of the available methods, enabling it to predict effectors secreted by any secretion system in most cases. However, such a method would lack the sensitivity to recognize one particular type of effector that is secreted by a specific secretion system. Secreted effectors can also be identified based on their similarity to eukaryotic domains, and the phylogenetic profiles between effectors and host cells or within effectors may help us to interpret the specific evolutionary processes. These valuable questions remain open, and their investigation should provide insights into the nature of host–bacteria interactions.

Key Points

  • Bacterial secreted effectors play vital roles in pathogen–host interactions. Computational approaches have accelerated the process of identifying secreted effector proteins in bacteria.

  • We first reviewed informative features of known effectors that contribute to their identification, comprising the sequential, structural and genomic information of the effectors, among others. In doing so, we attempted to highlight the biological background of these informative features.

  • We then surveyed the available resources and carefully studied the strengths and weaknesses of multiple types of machine learning algorithms and statistical methods for predicting secreted effectors. This is demonstrated by benchmarking the available methods on our curated data sets.

  • We propose a future where the in silico identification of secreted effectors will be much more reliable and beneficial. This may be achieved through the construction of more balanced data sets of known effectors without taxonomic, characteristic, numerical and functional biases; more informative and discriminative features and more efficient methods of feature extraction/representation; improved reliability of the bioinformatic prediction tools; and a better interpretation of the mechanisms behind the molecular pathogen–host interactions.

Funding

This work was supported by the National Natural Science Foundation of China (Grant No. 31571352).

Cong Zeng received her PhD in Computer Science in 2015 from University of Paris-Sud, France. She now works as a lecturer in the Bioinformatics Center at the Third Military Medical University (TMMU), China. Her research interests include machine learning and data mining.

Lingyun Zou received his PhD in 2008 from National University of Defense Technology, China. He is an Associate Professor and PI at the Bioinformatics Center of TMMU. His research interests are machine learning, pattern recognition, omics-data mining and complex diseases. He is currently in charge of two projects funded by the National Natural Science Foundation of China, ‘The study of feature mining, computational prediction and experimental validation of bacterial effectors secreted via type IV secretion systems (Grant No. 31301097)’ and ‘Feature mining and machine learning based study of functional classification, computational prediction and experimental validation of bacterial effectors secreted via type VI secretion systems (Grant No. 31571352)’.

References

1. Chang JH, Desveaux D, Creason AL. The ABCs and 123s of bacterial secretion systems in plant pathogenesis. Annu Rev Phytopathol 2014;52(1):317–45.

2. Desvaux M, Hébraud M, Talon R, et al. Secretion and subcellular localizations of bacterial proteins: a semantic awareness issue. Trends Microbiol 2009;17(4):139–45.

3. Pallen MJ, Chaudhuri RR, Henderson IR. Genomic analysis of secretion systems. Curr Opin Microbiol 2003;6(5):519–27.

4. Bingle LE, Bailey CM, Pallen MJ. Type VI secretion: a beginner’s guide. Curr Opin Microbiol 2008;11(1):3–8.

5. Tseng TT, Tyler BM, Setubal JC. Protein secretion systems in bacterial-host associations, and their description in the gene ontology. BMC Microbiol 2009;9(1):S2.

6. Thanassi DG, Hultgren SJ. Multiple pathways allow protein secretion across the bacterial outer membrane. Curr Opin Cell Biol 2000;12(4):420–30.

7. Henderson IR, Navarro-Garcia F, Desvaux M, et al. Type V protein secretion pathway: the autotransporter story. Microbiol Mol Biol Rev 2004;68(4):692–744.

8. Hueck CJ. Type III protein secretion systems in bacterial pathogens of animals and plants. Microbiol Mol Biol Rev 1998;62(2):379–433.

9. Delepelaire P. Type I secretion in gram-negative bacteria. Biochim Biophys Acta 2004;1694(1–3):149–61.

10. Natale P, Brüser T, Driessen AJM. Sec- and Tat-mediated protein secretion across the bacterial cytoplasmic membrane: distinct translocases and mechanisms. Biochim Biophys Acta 2008;1778(9):1735–56.

11. Samudrala R, Heffron F, McDermott JE. Accurate prediction of secreted substrates and identification of a conserved putative secretion signal for type III secretion systems. PLoS Pathog 2009;5(4):e1000375.

12. Galán JE, Collmer A. Type III secretion machines: bacterial devices for protein delivery into host cells. Science 1999;284(5418):1322–8.

13. Galán JE, Wolf-Watz H. Protein delivery into eukaryotic cells by type III secretion machines. Nature 2006;444(7119):567–73.

14. Yip CK, Kimbrough TG, Felise HB, et al. Structural characterization of the molecular platform for type III secretion system assembly. Nature 2005;435(7042):702–7.

15. Haraga A, Ohlson MB, Miller SI. Salmonellae interplay with host cells. Nat Rev Microbiol 2008;6(1):53–66.

16. Ghosh P. Process of protein transport by the type III secretion system. Microbiol Mol Biol Rev 2004;68(4):771–95.

17. Lloyd SA, Sjöström M, Andersson S, et al. Molecular characterization of type III secretion signals via analysis of synthetic N-terminal amino acid sequences. Mol Microbiol 2002;43(1):51–9.

18. Lloyd SA, Norman M, Rosqvist R, et al. Yersinia YopE is targeted for type III secretion by N-terminal, not mRNA, signals. Mol Microbiol 2001;39(2):520–32.

19. Miao EA, Miller SI. A conserved amino acid sequence directing intracellular type III secretion by Salmonella typhimurium. Proc Natl Acad Sci USA 2000;97(13):7539–44.

20. Stebbins CE, Galán JE. Maintenance of an unfolded polypeptide by a cognate chaperone in bacterial type III secretion. Nature 2001;414(6859):77–81.

21. Akeda Y, Galán JE. Chaperone release and unfolding of substrates in type III secretion. Nature 2005;437(7060):911–5.

22. Juhas M, Crook DW, Hood DW. Type IV secretion systems: tools of bacterial horizontal gene transfer and virulence. Cell Microbiol 2008;10(12):2377–86.

23. Christie PJ, Atmakuri K, Krishnamoorthy V, et al. Biogenesis, architecture, and function of bacterial type IV secretion systems. Annu Rev Microbiol 2005;59:451–85.

24. Christie PJ, Vogel JP. Bacterial type IV secretion: conjugation systems adapted to deliver effector molecules to host cells. Trends Microbiol 2000;8(8):354–60.

25. Cascales E, Christie PJ. Definition of a bacterial type IV secretion pathway for a DNA substrate. Science 2004;304(5674):1170–3.

26. Cascales E, Christie PJ. The versatile bacterial type IV secretion systems. Nat Rev Microbiol 2003;1(2):137–49.

27. Souza RC, del Rosario Quispe Saji G, Costa MO, et al. AtlasT4SS: a curated database for type IV secretion systems. BMC Microbiol 2012;12:172.

28. Vogel JP, Andrews HL, Wong SK, et al. Conjugative transfer by the virulence system of Legionella pneumophila. Science 1998;279(5352):873–6.

29. Backert S, Meyer TF. Type IV secretion systems and their effectors in bacterial pathogenesis. Curr Opin Microbiol 2006;9(2):207–17.

30. Llosa M, Roy C, Dehio C. Bacterial type IV secretion systems in human disease. Mol Microbiol 2009;73(2):141–51.

31. Christie PJ. Type IV secretion: the agrobacterium VirB/D4 and related conjugation systems. Biochim Biophys Acta 2004;1694(1–3):219–34.

32. Segal G, Feldman M, Zusman T. The Icm/Dot type-IV secretion systems of Legionella pneumophila and Coxiella burnetii. FEMS Microbiol Rev 2005;29(1):65–81.

33. Vergunst AC, van Lier MCM, den Dulk-Ras A, et al. Positive charge is an important feature of the C-terminal transport signal of the VirB/D4-translocated proteins of Agrobacterium. Proc Natl Acad Sci USA 2005;102(3):832–7.

34. Simone M, McCullen CA, Stahl LE, et al. The carboxy-terminus of VirE2 from Agrobacterium tumefaciens is required for its transport to host cells by the VirB-encoded type IV transport system. Mol Microbiol 2001;41(6):1283–93.

35. Schulein R, Guye P, Rhomberg TA, et al. A bipartite signal mediates the transfer of type IV secretion substrates of Bartonella henselae into human cells. Proc Natl Acad Sci USA 2005;102(3):856–61.

36. Nagai H, Cambronne ED, Kagan JC, et al. A C-terminal translocation signal required for Dot/Icm-dependent delivery of the Legionella RalF protein to host cells. Proc Natl Acad Sci USA 2005;102(3):826–31.

37. Segal G. Identification of Legionella effectors using Bioinformatic approaches. Methods Mol Biol 2013;954:595–602.

38. Zou L, Nan C, Hu F. Accurate prediction of bacterial type IV secreted effectors using amino acid composition and PSSM profiles. Bioinformatics 2013;29(24):3135–42.

39. Coulthurst SJ. The Type VI secretion system-a widespread and versatile cell targeting system. Res Microbiol 2013;164(6):640–54.

40. Records AR. The type VI secretion system: a multipurpose delivery system with a phage-like machinery. Mol Plant Microbe Interact 2011;24(7):751–7.

41. Cianfanelli FR, Diniz JA, Guo M, et al. VgrG and PAAR proteins define distinct versions of a functional type VI secretion system. PLoS Pathog 2016;12(6):e1005735.

42. Shneider MM, Buth SA, Ho BT, et al. PAAR-repeat proteins sharpen and diversify the type VI secretion system spike. Nature 2013;500(7462):350–3.

43. Silverman JM, Brunet YR, Cascales E, et al. Structure and regulation of the type VI secretion system. Annu Rev Microbiol 2012;66:453–72.

44. Russell AB, Singh P, Brittnacher M, et al. A widespread bacterial type VI secretion effector superfamily identified using a heuristic approach. Cell Host Microbe 2012;11(5):538–49.

45. Russell AB, Hood RD, Bui NK, et al. Type VI secretion delivers bacteriolytic effectors to target cells. Nature 2011;475(7356):343–7.

46. Carruthers MD, Nicholson PA, Tracy EN, et al. Acinetobacter baumannii utilizes a type VI secretion system for bacterial competition. PLoS One 2013;8(3):e59388.

47. Hood RD, Singh P, Hsu F, et al. A type VI secretion system of Pseudomonas aeruginosa targets a toxin to bacteria. Cell Host Microbe 2010;7(1):25–37.

48. MacIntyre DL, Miyata ST, Kitaoka M, et al. The Vibrio cholerae type VI secretion system displays antimicrobial properties. Proc Natl Acad Sci USA 2010;107(45):19520–4.

49. Murdoch SL, Trunk K, English G, et al. The opportunistic pathogen Serratia marcescens utilizes type VI secretion to target bacterial competitors. J Bacteriol 2011;193(21):6057–69.

50. Russell AB, Peterson SB, Mougous JD. Type VI secretion system effectors: poisons with a purpose. Nat Rev Microbiol 2014;12(2):137–48.

51. Hachani A, Wood TE, Filloux A. Type VI secretion and anti-host effectors. Curr Opin Microbiol 2016;29:81–93.

52. Alcoforado Diniz J, Liu YC, Coulthurst SJ. Molecular weaponry: diverse effectors delivered by the type VI secretion system. Cell Microbiol 2015;17(12):1742–51.

53. Jiang F, Waterfield NR, Yang J, et al. A Pseudomonas aeruginosa type VI secretion phospholipase D Effector targets both prokaryotic and eukaryotic cells. Cell Host Microbe 2014;15(5):600–10.

54. Ma AT, McAuley S, Pukatzki S, et al. Translocation of a Vibrio cholerae type VI secretion effector requires bacterial endocytosis by host cells. Cell Host Microbe 2009;5(3):234–43.

55. Salomon D, Kinch LN, Trudgian DC, et al. Marker for type VI secretion system effectors. Proc Natl Acad Sci USA 2014;111(25):9271–6.

56. Coburn B, Grassl GA, Finlay BB. Salmonella, the host and disease: a brief review. Immunol Cell Biol 2006;85(2):112–8.

57. McClelland M, Sanderson KE, Clifton SW, et al. Comparison of genome degradation in Paratyphi A and Typhi, human-restricted serovars of Salmonella enterica that cause typhoid. Nat Genet 2004;36(12):1268–74.

58. Harvill ET, Preston A, Cotter PA, et al. Multiple roles for Bordetella lipopolysaccharide molecules during respiratory tract infection. Infect Immun 2000;68(12):6720–8.

59. Rolain JM, Brouqui P, Koehler JE, et al. Recommendations for treatment of human infections caused by Bartonella species. Antimicrob Agents Chemother 2004;48(6):1921–33.

60. Paavonen J, Eggert-Kruse W. Chlamydia trachomatis: impact on human reproduction. Hum Reproduc Update 1999;5(5):433–47.

61

Martínez-García
PM
,
Ramos
C
,
Rodríguez-Palenzuela
P.
T346Hunter: a novel web-based tool for the prediction of type III, type IV and type VI secretion systems in bacterial genomes
.
PLoS One
2015
;
10
(
4
):
e0119317
April
.

62

Guttman
DS
,
Vinatzer
BA
,
Sarkar
SF
.
A functional screen for the type III (Hrp) secretome of the plant pathogen Pseudomonas syringae
.
Science
2002
;
295
(
5560
):
1722
6
.

63

Cornelis
GR.
The Yersinia Ysc-Yop ‘type III’ weaponry
.
Nat Rev Mol Cell Biol
2002
;
3
(
10
):
742
54
.

64

Stavrinides
J
,
McCann
HC
,
Guttman
DS.
Host-pathogen interplay and the evolution of bacterial effectors
.
Cell Microbiol
2008
;
10
(
2
):
285
92
.

65

Liang
X
,
Moore
R
,
Wilton
M
.
Identification of divergent type VI secretion effectors using a conserved chaperone domain
.
Proc Natl Acad Sci USA
2015
;
112
(
29
):
9106
11
.

66

Tobe
T
,
Beatson
SA
,
Taniguchi
H
.
An extensive repertoire of type III secretion effectors in Escherichia coli O157 and the role of lambdoid phages in their dissemination
.
Proc Natl Acad Sci USA
2006
;
103
(
40
):
14941
6
.

67

Schechter
LM
,
Vencato
M
,
Jordan
KL
.
Multiple approaches to a complete inventory of Pseudomonas syringae pv. tomato DC3000 type III secretion system effector proteins
.
Mol Plant Microbe Interact
2006
;
19
(
11
):
1180
92
.

68

Schechter
LM
,
Roberts
KA
,
Jamir
Y
.
Pseudomonas syringae type III secretion system targeting signals and novel effectors studied with a Cya translocation reporter
.
J Bacteriol
2004
;
186
(
2
):
543
55
.

69

Burstein
D
,
Zusman
T
,
Degtyar
E
.
Genome-scale identification of Legionella pneumophila effectors using a machine learning approach
.
PLoS Pathog
2009
;
5
(
7
):
e1000508
.

70

Chen
C
,
Banga
S
,
Mertens
K
.
Large-scale identification and translocation of type IV secretion substrates by Coxiella burnetii
.
Proc Natl Acad Sci USA
2010
;
107
(
50
):
21755
60
.

71

Lockwood
S
,
Voth
DE
,
Brayton
KA
.
Identification of Anaplasma marginale type IV secretion system effector proteins
.
PLoS One
2011
;
6
(
11
):
e27724
.

72

Marchesini
MI
,
Herrmann
CK
,
Salcedo
SP
.
In search of Brucella abortus type IV secretion substrates: screening and identification of four proteins translocated into host cells through VirB system
.
Cell Microbiol
2011
;
13
(
8
):
1261
74
.

73

Zhu
W
,
Banga
S
,
Tan
Y
.
Comprehensive identification of protein substrates of the Dot/Icm type IV transporter of Legionella pneumophila
.
PLoS One
2011
;
6
(
3
).

74

Geddes
K
,
Worley
M
,
Niemann
G
.
Identification of new secreted effectors in Salmonella enterica serovar typhimurium
.
Infect Immun
2005
;
73
(
10
):
6260
71
.

75

Chang
JH
,
Urbach
JM
,
Law
TF
.
A high-throughput, near-saturating screen for type III effector genes from Pseudomonas syringae
.
Proc Natl Acad Sci USA
2005
;
102
(
7
):
2549
54
.

76

Petnicki-Ocwieja
T
,
Schneider
DJ
,
Tam
VC
.
Genomewide identification of proteins secreted by the Hrp type III protein secretion system of Pseudomonas syringae pv. tomato DC3000
.
Proc Natl Acad Sci USA
2002
;
99
(
11
):
7652
7
.

77

Yang
Y
,
Zhao
J
,
Morgan
RL
.
Computational prediction of type III secreted proteins from gram-negative bacteria
.
BMC Bioinformatics
2010
;
11
(
1
):
1
10
.

78

Lifshitz
Z
,
Burstein
D
,
Peeri
M
.
Computational modeling and experimental validation of the Legionella and Coxiella virulence-related type-IVB secretion signal
.
Proc Natl Acad Sci USA
2013
;
110
(
8
):
E707
15
.

79

Jones
DT.
Protein secondary structure prediction based on position-specific scoring matrices
.
J Mol Biol
1999
;
292
(
2
):
195
202
.

80

Schwede
T
,
Kopp
J
,
Guex
N
.
SWISS-MODEL: an automated protein homology-modeling server
.
Nucleic Acids Res
2003
;
31
(
13
):
3381
5
.

81

Fouts
DE
,
Abramovitch
RB
,
Alfano
JR
.
Genomewide identification of Pseudomonas syringae pv. tomato DC3000 promoters controlled by the HrpL alternative sigma factor
.
Proc Natl Acad Sci USA
2002
;
99
(
4
):
2275
80
.

82

Wang
Y
,
Zhang
Q
,
Sun
M
.
High-accuracy prediction of bacterial type III secreted effectors based on position-specific amino acid composition profiles
.
Bioinformatics
2011
;
27
(
6
):
777
84
.

83

Wang
Y
,
Sun
M
,
Bao
H
.
Effective identification of bacterial type III secretion signals using joint element features
.
PLoS One
2013
;
8
(
4
):
e59754
.

84

Dong
X
,
Zhang
YJ
,
Zhang
Z.
Using weakly conserved motifs hidden in secretion signals to identify type-III effectors from bacterial pathogen genomes
.
PLoS One
2013
;
8
(
2
):
e56632
.

85

Arnold
R
,
Brandmaier
S
,
Kleine
F
.
Sequence-based prediction of type III secreted proteins
.
PLoS Pathog
2009
;
5
(
4
):
e1000376
.

86

Löwer
M
,
Schneider
G.
Prediction of type III secretion signals in genomes of Gram-negative bacteria
.
PLoS One
2009
;
4
(
6
):
e5917.

87

Yang
X
,
Guo
Y
,
Luo
J
.
Effective identification of Gram-negative bacterial type III secreted effectors using position-specific residue conservation profiles
.
PLoS One
2013
;
8
(
12
):
e84439
.

88

Wang
Y
,
Sun
M
,
Bao
H
.
T3_MM: a Markov model effectively classifies bacterial type III secretion signals
.
PLoS One
2013
;
8
(
3
):
e58173
.

89

Xu
S
,
Zhang
C
,
Miao
Y
.
Effector prediction in host-pathogen interaction based on a Markov model of a ubiquitous EPIYA motif
.
BMC Genomics
2010
;
11
(
3
):
S1
.

90

Sonah
H
,
Deshmukh
RK
,
Bélanger
RR.
Computational prediction of effector proteins in fungi: opportunities and challenges
.
Front Plant Sci
2016
;
7
:
126
.

91

Greenberg
JT
,
Vinatzer
BA.
Identifying type III effectors of plant pathogens and analyzing their interaction with plant cells
.
Curr Opin Microbiol
2003
;
6
(
1
):
20
8
.

92

McDermott
JE
,
Corrigan
A
,
Peterson
E
.
Computational prediction of type III and IV secreted effectors in Gram-negative bacteria
.
Infect Immun
2011
;
79
(
1
):
23
32
.

93

An
Y
,
Wang
J
,
Li
C
.
Comprehensive assessment and performance improvement of effector protein predictors for bacterial secretion systems III, IV and VI
.
Brief Bioinform
2016
; doi:10.1093/bib/bbw100.

94

Pallen
MJ
,
Beatson
SA
,
Bailey
CM.
Bioinformatics analysis of the locus for enterocyte effacement provides novel insights into type-III secretion
.
BMC Microbiol
2005
;
5
:
9.

95

Pallen
MJ
,
Beatson
SA
,
Bailey
CM.
Bioinformatics, genomics and evolution of non-flagellar type-III secretion systems: a Darwinian perpective
.
FEMS Microbiol Rev
2005
;
29
(
2
):
201
29
.

96

Meyer
DF
,
Noroy
C
,
Moumène
A
.
Searching algorithm for type IV secretion system effectors 1.0: a tool for predicting type IV effectors and exploring their genomic context
.
Nucleic Acids Res
2013
;
41
(
20
):
9218
29
.

97

Dong
X
,
Lu
X
,
Zhang
Z.
BEAN 2.0: an integrated web resource for the identification and functional analysis of type III secreted effectors
.
Database
2015
;
2015
:
bav064.

98

Altschul
SF
,
Madden
TL
,
Schäffer
AA
.
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs
.
Nucleic Acids Res
1997
;
25
(
17
):
3389
402
.

99

Wang
Y
,
Wei
X
,
Bao
H
.
Prediction of bacterial type IV secreted effectors by C-terminal features
.
BMC Genomics
2014
;
15
:
50
.

100

Vencato
M
,
Tian
F
,
Alfano
JR
.
Bioinformatics-enabled identification of the HrpL regulon and type III secretion system effector proteins of Pseudomonas syringae pv. phaseolicola 1448A
.
Mol Plant Microbe Interact
2006
;
19
(
11
):
1193
206
.

101

Vinatzer
BA
,
Jelenska
J
,
Greenberg
JT.
Bioinformatics correctly identifies many type III secretion substrates in the plant pathogen Pseudomonas syringae and the biocontrol isolate P. fluorescens SBW25
.
Mol Plant Microbe Interact
2005
;
18
(
8
):
877
88
.

102

Huang
L
,
Boyd
D
,
Amyot
WM
.
The E Block motif is associated with Legionella pneumophila translocated substrates
.
Cell Microbiol
2011
;
13
(
2
):
227
45
.

103

Kubori
T
,
Hyakutake
A
,
Nagai
H.
Legionella translocates an E3 ubiquitin ligase that has multiple U-boxes with distinct functions
.
Mol Microbiol
2008
;
67
(
6
):
1307
19
.

104

Sato
Y
,
Takaya
A
,
Yamamoto
T.
Meta-analytic approach to the accurate prediction of secreted virulence effectors in gram-negative bacteria
.
BMC Bioinformatics
2011
;
12
:
442.

105

Panina
EM
,
Mattoo
S
,
Griffith
N
.
A genome-wide screen identifies a Bordetella type III secretion effector and candidate effectors in other species
.
Mol Microbiol
2005
;
58
(
1
):
267
79
.

106

Page
AL
,
Parsot
C.
Chaperones of the type III secretion pathway: jacks of all trades
.
Mol Microbiol
2002
;
46
(
1
):
1
11
.

107

Jehl
MA
,
Arnold
R
,
Rattei
T.
Effective-a database of predicted secreted bacterial proteins
.
Nucleic Acids Res
2011
;
39 (Suppl 1)
:
D591
5
.

108

Collmer
A
,
Lindeberg
M
,
Petnicki-Ocwieja
T
.
Genomic mining type III secretion system effectors in Pseudomonas syringae yields new picks for all TTSS prospectors
.
Trends Microbiol
2002
;
10
(
10
):
462
9
.

109

Rohmer
L
,
Guttman
DS
,
Dangl
JL.
Diverse evolutionary mechanisms shape the type III effector virulence factor repertoire in the plant pathogen Pseudomonas syringae
.
Genetics
2004
;
167
(
3
):
1341
60
.

110

Noble
WS.
What is a support vector machine?
Nat Biotechnol
2006
;
24
(
12
):
1565
7
.

111

Cheng
J
,
Randall
AZ
,
Sweredoski
MJ
.
SCRATCH: a protein structure and structural feature prediction server
.
Nucleic Acids Res
2005
;
33
:
W72
6
.

112

Chen
SA
,
Ou
YY
,
Lee
TY
.
Prediction of transporter targets using efficient RBF networks with PSSM profiles and biochemical properties
.
Bioinformatics
2011
;
27
(
15
):
2062
7
.

113

John
GH
,
Langley
P.
Estimating continuous distributions in Bayesian classifiers. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, UAI’95. Morgan Kaufmann Publishers Inc, San Francisco, CA,
1995
, 338–345.

114

Basheer
IA
,
Hajmeer
M.
Artificial neural networks: fundamentals, computing, design, and application
.
J Microbiol Methods
2000
;
43
(
1
):
3
31
.

115

Breiman
L.
Random forests
.
Mach Learn
2001
;
45
(
1
):
5
32
.

116

Adamczak
R
,
Porollo
A
,
Meller
J.
Combining prediction of secondary structure and solvent accessibility in proteins
.
Proteins
2005
;
59
(
3
):
467
75
.

117

Gasteiger
E
,
Hoogland
C
,
Gattiker
A
. Protein identification and analysis tools on the ExPASy server. In:
Walker
J
(ed).
The Proteomics Protocols Handbook
.
Humana Press, Totowa, NJ, USA
,
2005
,
571
607
.

118

Eddy
SR.
Profile hidden Markov models
.
Bioinformatics
1998
;
14
(
9
):
755
63
.

119

Eddy
SR.
A new generation of homology search tools based on probabilistic inference
.
Genome Inform
2009
;
23
(
1
):
205
11
.

120

Johnson
S
,
Mitra
R
,
Schedl
T
.
Remote Protein Homology Detection Using Hidden Markov Models
.
St. Louis
:
Washington University
,
2006
.

121

Karplus
K
,
Barrett
C
,
Hughey
R.
Hidden Markov models for detecting remote protein homologies
.
Bioinformatics
1998
;
14
(
10
):
846
56
.

122

Altschul
SF
,
Gish
W
,
Miller
W
.
Basic local alignment search tool
.
J Mol Biol
1990
;
215
(
3
):
403
10
.

123

Matthews
BW.
Comparison of the predicted and observed secondary structure of T4 phage lysozyme
.
Biochimica Biophys Acta
1975
;
405
(
2
):
442
51
.

124

Tay
DMM
,
Govindarajan
KR
,
Khan
AM
.
T3SEdb: data warehousing of virulence effectors secreted by the bacterial type III secretion system
.
BMC Bioinformatics
2010
;
11
(
7
):
1
7
.

125

Goldberg
T
,
Rost
B
,
Bromberg
Y.
Computational prediction shines light on type III secretion origins
.
Scientific Reports
2016
;doi:10.1038/srep34516.

126

Shao
J
,
Xu
D
,
Tsai
SN
.
Computational identification of protein methylation sites through Bi-Profile Bayes feature extraction
.
PLoS One
2009
;
4
(
3
):
e4920
.

127

Wang
Y
,
Huang
H
,
Sun
M
.
T3DB: an integrated database for bacterial type III secretion system
.
BMC Bioinformatics
2012
;
13
:
66
.

128

Bi
D
,
Liu
L
,
Tai
C
.
SecReT4: a web-based bacterial type IV secretion system resource
.
Nucleic Acids Res
2013
;
41
(
D1
):
D660
5
.

129

Li
J
,
Yao
Y
,
Xu
HH
.
SecReT6: a web-based resource for type VI secretion systems found in bacteria
.
Environ Microbiol
2015
;
17
(
7
):
2196
202
.

130

Eichinger
V
,
Nussbaumer
T
,
Platzer
A
.
EffectiveDB-updates and novel features for a better annotation of bacterial secreted proteins and type III, IV, VI secretion systems
.
Nucleic Acids Res
2016
;
44
(
D1
):
D669
74
.

131

Huang
Y
,
Niu
B
,
Gao
Y
.
CD-HIT Suite: a web server for clustering and comparing biological sequences
.
Bioinformatics
2010
;
26
(
5
):
680
2
.

132

Chang
CC
,
Lin
CJ.
LIBSVM: a library for support vector machines
.
ACM Trans Intell Syst Technol
2011
;
2
(
3
):
27
.

133

Pavlidis
P
,
Wapinski
I
,
Noble
WS.
Support vector machine classification on the web
.
Bioinformatics
2004
;
20
(
4
):
586
7
.

134

Holmes
G
,
Donkin
A
,
Witten
IH.
WEKA: a machine learning workbench. In Proceedings of the 1994 Second Australian and New Zealand Conference on Intelligent Information Systems, 1994.
1994
, 357–361.

135

Demuth
H
,
Beale
M.
Neural Network Toolbox For Use with Matlab - User’S Guide Verion 3.0.
1993
.

136

Finn
RD
,
Clements
J
,
Eddy
SR.
HMMER web server: interactive sequence similarity searching
.
Nucleic Acids Res
2011
;
39 (Suppl 2)
:
W29
37
.

137

Li
M
,
Trong
IL
,
Carl
MA
.
Structural basis for type VI secretion effector recognition by a cognate immunity protein
.
PLoS Pathog
2012
;
8
(
4
):
e1002613
.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)