- Split View
-
Views
-
Cite
Cite
Anuj R. Shah, Christopher S. Oehmen, Bobbie-Jo Webb-Robertson, SVM-HUSTLE—an iterative semi-supervised machine learning approach for pairwise protein remote homology detection, Bioinformatics, Volume 24, Issue 6, March 2008, Pages 783–790, https://doi.org/10.1093/bioinformatics/btn028
- Share Icon Share
Abstract
Motivation: As the amount of biological sequence data continues to grow exponentially we face the increasing challenge of assigning function to this enormous molecular ‘parts list’. The most popular approaches to this challenge make use of the simplifying assumption that similar functional molecules, or proteins, sometimes have similar composition, or sequence. However, these algorithms often fail to identify remote homologs (proteins with similar function but dissimilar sequence) which often are a significant fraction of the total homolog collection for a given sequence. We introduce a Support Vector Machine (SVM)-based tool to detect homology using semi-supervised iterative learning (SVM-HUSTLE) that identifies significantly more remote homologs than current state-of-the-art sequence or cluster-based methods. As opposed to building profiles or position specific scoring matrices, SVM-HUSTLE builds an SVM classifier for a query sequence by training on a collection of representative high-confidence training sets, recruits additional sequences and assigns a statistical measure of homology between a pair of sequences. SVM-HUSTLE combines principles of semi-supervised learning theory with statistical sampling to create many concurrent classifiers to iteratively detect and refine, on-the-fly, patterns indicating homology.
Results: When compared against existing methods for identifying protein homologs (BLAST, PSI-BLAST, COMPASS, PROF_SIM, RANKPROP and their variants) on two different benchmark datasets SVM-HUSTLE significantly outperforms each of the above methods using the most stringent ROC1 statistic with P-values less than 1e-20. SVM-HUSTLE also yields results comparable to HHSearch but at a substantially reduced computational cost since we do not require the construction of HMMs.
Availability: The software executable to run SVM-HUSTLE can be downloaded from http://www.sysbio.org/sysbio/networkbio/svm_hustle
Contact: anuj.shah@pnl.gov
1 INTRODUCTION
During the past 30 years, numerous efforts have been focused on computationally predicting the functions of novel proteins. A well-established practice is to compare a query sequence of unknown function with a well-characterized database of sequences. If two similar sequences are derived from a common ancestor, they often preserve a related function. Inferring this common ancestry between similar sequences, termed homology detection, provides a kind of boot-strapping method by which newly sequenced genomes can be characterized based on existing annotations.
One method of quantifying the degree of homology is finding an optimal alignment between two sequences. In practice, many proteins can be thought of as comprising a series of functional regions joined by natively disordered links (Dunker et al., 2002). Mutations in the functional regions would tend to degrade the behavior of a protein, whereas mutations in disordered regions may not impact the function at all. As a result, one would expect that homologous proteins can be more rapidly detected by aligning these smaller pieces or ‘local’ regions. Alignment methods such as the Smith–Waterman algorithm (Smith and Waterman, 1981), the fast approximation to Smith–Waterman (FASTA) (Pearson, 1985) and Basic Local Alignment Search Tool (BLAST) (Altschul et al., 1990) take advantage of local sequence similarity in proteins. However, these methods often fail to detect common ancestry when proteins share low residue similarity, i.e. when proteins are remote homologs, which occurs commonly (Rost, 1999).
The concept of protein families was introduced into the problem of remote homology detection as an approach to overcome the lack of sensitivity that arises with algorithms comparing sequences based on residue similarity. A protein family is a group of sequences that share a common evolutionary origin. A statistical/computational model can be built for each of these families or super-families. A protein of unknown function can then be compared to each of the protein families producing a measure of likelihood of the sequence originating from this family. These family-based methods help computational biologists identify nearly three times the number of homology relationships as identified by simple pairwise alignments (Park et al., 1998). PSI-BLAST (Altschul et al., 1997) and hidden Markov models (HMM) (Baldi et al., 1994) are two examples of such generative approaches. PSI-BLAST builds an alignment-based statistical model from a set of proteins by defining a position specific scoring matrix (PSSM). It then adopts a semi-supervised algorithm to use this query-specific PSSM model to search the space to identify additional homologous sequences iteratively from the database, refining the model at each step. This comparison of profiles of a family of proteins against a database still fails to detect a large number of distant/weak homologs. Methods such as PROF_SIM (Yona and Levitt, 2002), COMPASS (Sadreyev et al., 2003) and HHSearch (Soeding, 2005) take the approach of PSI-BLAST one step further. Instead of comparing a profile against individual sequences within a database, these methods compare two profiles in a pairwise setting, yielding significantly more homologs than PSI-BLAST, HMM or profile-based methods. However one major short-coming of these methods is the computationally expensive multiple alignments that needs to be generated in order to generate profiles for a set of sequences.
Discriminative approaches use a machine learning technique such as Artificial Neural Networks, Genetic Algorithms and Support vector machines (SVM) (Vapnik, 1995, 1998) to learn a set of classification rules based on given training data. Protein sequences are represented as fixed-length vectors in a machine learning framework to classify positive and negative examples. A number of family-based discriminative approaches have been published in literature and the only difference is in the representation of the protein vectors (Atalay and Cetin-Atalay, 2005; Ben-Hur and Brutlag, 2003; Busuttil et al., 2004; Kuang et al., 2005a; Jaakkola et al., 2000; Leslie et al., 2003; Liao and Noble, 2003; Lingner and Meinicke, 2006; Ogul and Mumcuoglu, 2006; Rangwala and Karypis, 2005; Shah et al., 2007; Webb-Robertson et al., 2005; Hou et al., 2003, 2004). It has successfully been demonstrated that these discriminative approaches outperform the current generative approaches on a limited benchmark dataset consisting of typically 54 families (Zaki et al., 2003). Despite improved sensitivity for the task of remote homology detection, there are caveats to this approach: since the protein families used in family-based descriminative models are pre-determined, only proteins that belong to a known and trained family can be classified. Most discriminative methods answer the question ‘Whether the given protein belongs to an existing family?’ Thus new protein families or homologous sequence pairs outside these defined families, i.e. families for which models are pre-built, cannot be discovered. However, these approaches illustrate the power of applying machine learning methods to the problem of homology detection. The goal of this work is to develop the first non-family-based SVM method for sensitive homology detection.
Most biologists and informaticists to date use BLAST and PSI-BLAST as their preferred method for homology detection. Both of these algorithms use string matching to identify sequences from a database which are similar to a query sequence. The input query sequence is searched against a large gene/protein database to determine closest matches and a ranked list of sequences is produced as output. One of the defining characteristics of such a problem is the large ratio of negative examples (non-homologous sequences) to positive examples (homologous sequences). Recent algorithms such as RANKPROP (Weston et al., 2004) and MOTIFPROP (Kuang et al., 2005b) produce a ranking of proteins that are homologous to the query protein by starting with an initial similarity network produced by a base algorithm (BLAST/PSI-BLAST) and identify additional relationships via network propagation. The initial network has edges between similar proteins labeled with E-values (or a function of E-values) resulting from the base algorithm while the network propagation is achieved by means of a diffusion operation. These methods have demonstrated increased sensitivity over BLAST and PSI-BLAST.
We present an iterative semi-supervised approach to the problem of pairwise remote homology detection. This approach is analogous to PSI-BLAST, which iteratively builds a sequence position specific scoring matrix on a set of homologs (positive examples). But SVM-HUSTLE accounts for both positive and negative examples, so it has a substantially improved performance compared to PSI-BLAST. SVM-HUSTLE achieves this by using concurrent SVMs to iteratively search for homologs against a database with pre-computed sequence similarity scores and provides the answer to the question ‘Which sequences are homologous to a given query?’ Adopting a semi-supervised approach, SVM-HUSTLE treats the database as a source of both labeled and unlabeled data, progressively growing the set of positive training examples. At each step, the database is reclassified. SVM-Hustle does not need family or super family classification knowledge for a given query sequence and in this respect is similar to BLAST and PSI-BLAST and different from existing state-of-the-art SVM homology detection methods. Classification models within SVM-HUSTLE are built on-the-fly for the given input query using unlabeled data to refine the model. This semi-supervised learning approach leads to a set of classifiers that have higher sensitivity when detecting distant homologs compared to PSI-BLAST. In addition, we use Bayesian statistics to build a statistical measure of confidence related to a pair of sequences that accounts for both the homologous and non-homologous models.
2 METHODS
2.1 Representation of protein sequences in a support vector machine
Vapnik developed the theory behind the SVM as the basis for statistical binary classification for a wide range of application areas. (Vapnik, 1995, 1998). SVMs are machine learning algorithms designed with the intention of ‘generalizing better’. The problem SVM algorithms address relates to the efficient learning of a classification rule from a set of exemplars. The goal is the classification of each candidate as being either ‘member’ or ‘not a member’ of a given class. In our case, we train the SVM to recognize patterns of sequence similarity which are indicative of homologous proteins. The SVM requires that input data be represented as a collection of fixed-length vectors.
2.2 Semi-supervised learning
In supervised learning, the SVM ‘learns’ classification rules from both positive and negative exemplars. In unsupervised learning, patterns are formed based on clustering techniques. Semi-supervised learning is a training strategy that combines information from labeled data (i.e. data that is known to fall into one category or the other) and the unlabeled data which often clusters around labeled data. Semi-supervised learning functions are based on so-called cluster assumption: classification rules remain unchanged in areas of dense exemplars. Using clustering to associate unlabeled data with the labeled training data often improves coverage of the vector space improving the accuracy of classification.
Figure 1 illustrates the potential difference between learning using a semi-supervised technique as opposed to a supervised technique. The supervised algorithm may develop a classification rule that could cause incorrect classification of incoming exemplars when labeled data sparsely covers the vector space. Adding unlabeled data refines the detail of the separating hyperplane placing the decision boundary in areas of low density generating a more accurate classifier.
Semi-supervised learning is generally applied when unlabeled data is abundant while labeled data is small e.g. web page classification. For a detailed literature survey of semi-supervised learning interested readers can refer to Zhu (2006). Weston et al. (2006), recently extended the RANKPROP algorithm employing a semi-supervised learning technique to the problem of remote homology detection. RANKPROP exploits the global structure of similarity relationships among proteins in a database by performing a diffusion operation on a similarity network with weighted edges.
The initial similarity network, which is given as input to the algorithm, is formed by connecting proteins that are homologous using an auxiliary approach such as PSI-BLAST and the edge labels are based on a transfer function applied to the probabilistic E-values from PSI-BLAST searches. Multiple strategies to incorporate knowledge from unlabeled examples in the RANKPROP algorithm are suggested by (Weston et al., 2006). In each case they achieve better rankings than a supervised approach.
2.3 SVM-HUSTLE: combining semi-supervised learning and SVM methodology
SVM-HUSTLE is a new algorithm that employs a semi-supervised SVM model to iteratively identify homologs to a query sequence from a database. The SVM-HUSTLE methodology can be described by six steps detailed in the flow chart of the algorithm in Figure 2. A novel component of this algorithm is the selection of the negative class (i.e. non-homologs) by sampling, to better cover the vector space. The number of non-homologs is much larger than the homologous class and as such we employ a random sampling algorithm to attain a representative negative class of proportionate size to the positive class. For example, in the benchmark dataset used for demonstration purposes there are 7329 protein sequences, for which we know the homologies. This dataset contains more than 53 million protein pairs. Of these pairs only 1%, (591 383), are homologous. To account for this imbalance in positive and negative examples, we iteratively repeat the entire sampling and classification step k number of times and use the average score for the subsequent iteration. SVM-HUSTLE uses semi-supervised learning to build multiple query-specific models of homology for the incoming query sequence. These models are then subjected to supervised learning using SVMs. The database is re-created using these query-specific models and re-classified using each distinct SVM. Finally, an averaging mechanism is used that helps select the best possible matches for a protein sequence at each step. This combination of semi-supervised learning and supervised learning gives SVM-HUSTLE much of its classification power.
Identify Seed Positive Training Set. An initial set of n homologous proteins are identified using an auxiliary method (BLAST in this case) against the database D consisting of 7329 proteins. This produces a basis set consisting of protein sequences that are similar to the original sequence with high statistical confidence (E-value <0.1). These n protein sequences form the initial positive training set (Pos) for SVM-HUSTLE. This is the only stage in the algorithm where sequence alignment scores are used.
Randomly sample k Negative Training Sets. Select k (number of concurrent SVM classifiers) random sets of proteins from D of size that is an integral multiple of n for k SVM constructions. These negative training examples are represented as an array Neg [k]. Proteins in the negative training sets must be (i) not in the high- or medium-confidence BLAST output, and (ii) only in one negative training set.
Train SVM Classifiers. Train a set of SVM [k] classifiers where each training set has the same positive training examples (Pos) and a different set of negative training examples as defined by Neg [k]. The train and test set protein sequences are now vectorized using the approach discussed in Section 2.1.
Vectorize the database. The database D is vectorized k times, once for each trained SVM. For each classifier, the proteins in D are converted to vector form using the positive train set and one of the negative train sets (Neg [k]). There will be k different vectorized databases—one for each of the classifiers. Each instance of the vectorized database is different since they all have different negative sequences.
Classify protein pairs. Classify each protein in D and use the SVM discriminant score as a measure of classification confidence, select the top (best average score) a*n homologs for the given query protein q. The parameter ‘a’ corresponds to the ratio of protein sequences selected as positive training examples per iteration.
Iterate. Repeat steps 2–5 until no new samples are classified as positive or for a maximum number of iterations (m).
Figure 3 represents a schematic of the internal operation of SVM-HUSTLE after kernel transformations. The labeled and unlabeled examples can be laid out in two dimensions as shown. Constructing a single hyperplane based on the initial set of positive examples and randomly selected negative examples may not lead to correct classification in all cases. However, constructing multiple hyperplanes (k) would lead to an aggregate classification plane; one that will yield more accurate and sensitive results. It is possible that a true homolog can be recruited as a negative training example (as indicated by an underlined triangle in Fig. 3B and C) in one instance of the classification runs, however, the chances are slim since the ratio of positive to negative examples is very small. Additionally, using the averaging approach will also account for such discrepancies.
As its final output, SVM-HUSTLE produces a ranked listing of protein sequences that are homologous to a given query sequence with statistical scores signifying the similarity. SVM-HUSTLE does not require prior knowledge of protein family memberships for classification, so it can accurately and sensitively detect homologs which do not fall into any known protein family.
3 RESULTS AND DISCUSSION
3.1 Performance comparison with supervised and semi-supervised homology detection methods
We compare SVM-HUSTLE with PSI-BLAST, MotifProp, RankProp and their variants (k-mer MotifProp, E-Motif MotifProp etc). Comparisons between the pairwise approach of SVM-HUSTLE and family-based classification methods such as SVM-pairwise and its descendents cannot be performed as these methods rely on a family-based classification that identifies homologies within a set of pre-defined families. The comparison is made on a set of 7329 sequences, which were extracted from version 1.59 of the SCOP database, purged by using the website http://astral.berkeley.edu so that no pair of sequences shares more than 95% identity. This benchmark dataset has been used in a number of previous studies (Kuang et al., 2005b; Weston et al., 2004, 2005). Following a similar procedure, the dataset was divided into 4246 training sequences and 3083 test sequences. SVM-HUSTLE identifies remote homologs on-the-fly and as such does not require a pre-defined training set. The SCOP 1.59 training set of 4246 sequences has no overlap in superfamily membership with the 3083 test sequences and as such inclusion in the database D does not provide more positive examples, only negative examples. Similar to PSI-BLAST, the training data is included in the database used to build the model, but homology metrics are only generated for the test data to allow comparison with the current state-of-the-art. No prior knowledge of family or superfamily is used in the training of SVM-HUSTLE. Initially all data is considered unlabeled and the initial positive training set is constructed using BLAST results.
The performance of RANKPROP, MOTIFPROP and their variants were assessed by their ROC scores which were directly obtained from the authors of the algorithms. PSI-BLAST was allowed to run for a maximum of six iterations, using a default E-value threshold of 0.005 and the BLOSUM62 scoring matrix. As suggested by (Schaffer et al., 2001) this results in the most optimal performance of PSI-BLAST. Too few iterations results in an incomplete position-specific scoring matrix. Too many iterations result in the recruitment of large numbers of false positives, skewing the scoring matrix in favor of false positives.
In order to compare the performance of SVM-HUSTLE against the above methods, SVM-HUSTLE was run on each of the 3083 test set protein sequences on a LINUX cluster using four nodes with two processors per node with different values of parameters defined in Section 2.3 namely: SVM-HUSTLE could not be run on three sequences out of the total 3083 sequences in the test set as BLAST failed to produce homolog hits with E-values better than 0.10. In its current form SVM-HUSTLE cannot be run on sequences with no BLAST hits, however an option would be to run an exhaustive sequence similarity algorithm such as Smith–Waterman or Needleman–Wunsch to extract a starting homologous set of protein sequences. For the current experiments, these three sequences were eliminated from all methods for comparison with SVM-HUSTLE. The network propagation methods overcome this issue by building a network despite the discovery of only a self-homolog. However, a single positive example of a sequence being homologous to itself is not adequate to train a SVM.
k—number of concurrent SVM classifiers,
a—ratio of protein sequences selected as positive training examples at each iteration and
m—maximum number of iterations.
The primary statistic used for the statistical comparison of homology detection methods is a Receiver Operating Characteristic (ROC) (Hanley and McNeil, 1982) curve. A ROC curve yields a measure of the rate of false positives versus the true positives i.e. it is a plot of the true positive rate against the false positive rate for the different possible cut-offs of a diagnostic test. The area under a ROC plot is 1 for an ideal classifier, indicating that one can infer all of the correct protein homologies (identified from the SCOP classification) without having to tolerate any false positives.
The ROCn score computes this score up to the n-th false positive (Gribskov and Robinson, 1996). Figure 5 outlines the pseudo code for an ROC score calculation. The ROC1 values were calculated for each of the test set sequences to gage the relative performance of homology detection methods in the context of a database search where the vast majority of sequences are not homologous. In this case, it is important to get most of the correct homologs before a small number (rather than a small percentage) of false positives are classified as homologs. Essentially, the ROC1 value equates to less than one error per query. Figure 4 shows the comparison of SVM-HUSTLE with the other approaches using the ROC1 curve. The graph plots the total number of queries for which a given method exceeds a ROC score threshold. As shown by Figure 4, our results suggest that SVM-HUSTLE outperforms each of the algorithms with P-values less than 1e-20 in a two-tailed signed rank test (Salzberg, 1997).
Table 1 reports the average ROCn scores across all test set queries for a given method. All results of SVM-HUSTLE are reported for k = 100, a = 1, m = 10 and a BLAST cut-off of 0.10 for initial homolog set. The rationale behind selection of a generous cut-off for the BLAST search is to allow more diverse sequences to be recruited in the initial positive train set.
Method . | SVM-HUSTLE . | MotifProp . | RankProp . | PSI-BLAST . |
---|---|---|---|---|
ROC 1 | 0.7445 | 0.6405 | 0.5927 | 0.5978 |
ROC 10 | 0.7818 | 0.6629 | 0.6674 | 0.6172 |
ROC 50 | 0.8122 | 0.6876 | 0.7249 | 0.6411 |
Method . | SVM-HUSTLE . | MotifProp . | RankProp . | PSI-BLAST . |
---|---|---|---|---|
ROC 1 | 0.7445 | 0.6405 | 0.5927 | 0.5978 |
ROC 10 | 0.7818 | 0.6629 | 0.6674 | 0.6172 |
ROC 50 | 0.8122 | 0.6876 | 0.7249 | 0.6411 |
Method . | SVM-HUSTLE . | MotifProp . | RankProp . | PSI-BLAST . |
---|---|---|---|---|
ROC 1 | 0.7445 | 0.6405 | 0.5927 | 0.5978 |
ROC 10 | 0.7818 | 0.6629 | 0.6674 | 0.6172 |
ROC 50 | 0.8122 | 0.6876 | 0.7249 | 0.6411 |
Method . | SVM-HUSTLE . | MotifProp . | RankProp . | PSI-BLAST . |
---|---|---|---|---|
ROC 1 | 0.7445 | 0.6405 | 0.5927 | 0.5978 |
ROC 10 | 0.7818 | 0.6629 | 0.6674 | 0.6172 |
ROC 50 | 0.8122 | 0.6876 | 0.7249 | 0.6411 |
3.2 Performance comparison with profile-profile methods
In this experiment, we compare SVM-HUSTLE against profile-profile based remote homology detection methods such as COMPASS (Sadreyev et al., 2003), PROF_SIM (Yona and Levitt, 2002) and HHSearch (Soeding, 2005). These programs are shown to be significantly more sensitive than PSI-BLAST and have been applied to identify novel homologies.
To compare the results sequences were extracted from the SCOP 1.63 database using the website http://astral.berkeley.edu so that no pair of sequences shares more than 20% identity (SCOP-20). This results in a total of 3691 sequences, the homologies for which are obtained from their SCOP classifications. This benchmark dataset, previously used in (Soeding, 2005), has only 41 505 true positive homolog pairs while there are 1.08 × 107 false positives. The results for the above profile-profile comparison algorithms are extracted from (Soeding, 2005) while SVM-HUSTLE was set up to run each of the 3691 sequences with varying parameters. We report the results of SVM-HUSTLE for k = 60, a = 1 and m = 10 for this experiment. To include sufficient number of homologs in the initial training set, the BLAST cut-off was increased to 1.0 since this dataset has hardly any sequence similarity. In order to recruit sequences in a semi-supervised technique, we augment the 3691 protein sequences with sequences from the previous 7329 benchmark dataset. This results in a total of 8272 unique sequences. However, the test statistics are calculated only on the original 3691 SCOP-20 dataset.
As previously report in (Soeding, 2005) at a 10% error rate (FP/(TP + FP) = 10%) BLAST detects only 908 homologous pairs which is 2.2% of the total number of homologs. PSI-BLAST performs slightly better and results in a correct identification of 17.7% while HMMER finds 18.7%. PROF_SIM and COMPASS find 24.9 and 34%, respectively. HHSearch algorithms perform much better than any of the existing methods identifying 40–50% of the homologs, based on usage of additional knowledge such as the correlation score and predicted secondary structure. SVM-HUSTLE achieves an identification rate of 48.5%, correctly identifying 20 153 of the 41 505 homologs. Modification of the configuration parameters may result in higher percentage of homologs being found, e.g. increasing the BLAST cut-off from 1 to 10 and increasing the number of concurrent SVM's results in an additional 5% increase in homolog detection. These results are comparable to HHSearch while they outperform both PROF_SIM and COMPASS yielding almost twice the number of homolog detection.
Our results indicate the importance of using semi-supervised learning in combination with a supervised approach. First, Using the SCOP 7329 dataset as a source of unlabeled data (in a semi-supervised setting) increases the homology detection capability of SVM-HUSTLE. Second, SVM-HUSTLE can be easily reproduced and does not require the complicated profile–profile alignments or the alignments of a large number of variable length sequences in order to generate HMM profiles.
3.3 Assessing statistical significance of a SVM-HUSTLE score
3.4 Computational costs and scalability
Our results clearly demonstrate the improved performance characteristics of SVM-HUSTLE when detecting remote homologs. This increased sensitivity in homology detection does however come with increase in computational cost with respect to PSI-BLAST. A query specific training set is generated with similar positive exemplars and randomly sampled multiple negative exemplars. Since a query specific model is built for each incoming query, the target database D has to be reconstructed (re-vectorized) using this model, for each concurrent SVM classifier. The reconstruction cost of the database is the biggest bottleneck for SVM-HUSTLE.
The database reconstruction cost is directly impacted by the parameter k; the number of concurrent SVM classifiers, which linearly increases the time-to-solution for a given query. Increasing the number of concurrent classifiers introduces greater variability in the negative set that is randomly selected. As a result, we can better represent the space of negative exemplars in the training space while a smaller value of k only covers part of the database. We experimented with multiple values of this parameter and report results for k = 60. Our evaluation indicates a statistically significant increase in accuracy values when moving from k = 4 to k = 100. The transition from k = 40 to k = 80 is significant and as a result we settle for the average case where k = 60. A formal estimation of the best value of k would be part of future work on SVM-HUSTLE.
Depending on configuration parameters for SVM-HUSTLE, the database would be reconstructed in the worst case (complete 10 iterations, 60 samples) for m (number of iterations) × k (concurrent SVM classifiers) times, 600 in our case. The best case scenario for SVM-HUSTLE would be when m = 1 which would amount to k reconstructions. If a computing facility is chosen such that there are multiple processors and the processor to classifier ratio is unity, the computation can be effectively expected to complete in time proportional to the size of the database. As the database grows in size (number of sequences), the database reconstruction cost can be prohibitive in nature and can slow down classification. Currently, SVM-Hustle requires specialized hardware to scale to large databases, such as nr. An additional challenge with large datasets is that in the case where a query has a large number of homologs, e.g. over 10 000, the size of the SVM training set would be prohibitive. However this is easily circumvented by setting a maximum limit on the initial positive train set as well as the BLAST cut-off for selection of training examples. We are currently developing a web service that would allow searches against various large sequence databases, public or user-defined. This would offer access to the specialized hardware required to run the algorithm on larger datasets. Additionally, an executable that can be used on datasets of less than 10 000 proteins is available at http://www.sysbio.org/sysbio/networkbio/svm_hustle.
The average time to solution for a single query in both experiments using SVM-HUSTLE is in seconds once the initial setup is completed. A large number of the queries from the test set (85%) complete well within the average time however queries that have a large number of training examples require more SVM optimization cycles and take around 1–2 min or more for convergence. Depending on the size of the database, users looking for a small set of homologs may be able to retrieve their results in a very short time.
As mentioned, the primary computational bottleneck is in creating the protein vectors on the fly using the Smith-Waterman algorithm to reconstruct the database. An alternative and efficient way of building the SVM classifier would be to use a protein representation that is independent of the training set sequences. We are currently investigating the potential of using fixed-length motif representation of protein sequences based on motif content (Ben-Hur and Brutlag, 2003; Hou et al., 2003). These motif-based representations would allow SVM-HUSTLE to produce rank-lists of proteins within seconds without requiring the reconstruction phase.
4 CONCLUSIONS
The accurate annotation of newly sequenced proteins precedes much of the meaningful work that can be done in understanding molecular systems. For example, the analysis of microarray data and co-regulation in general, is much more meaningful when the genes are observed in a functional context. This relies heavily on one's ability to accurately and sensitively identify homologous proteins which are of distant evolutionary origin and hence, have a low degree of sequence similarity.
Our approach to incorporate semi-supervised learning coupled with concurrent training of SVM classifiers, though computationally expensive compared to PSI-BLAST, improves the sensitivity of remote homology detection by a significant margin using stricter thresholds of the ROC measure. More importantly, SVM-HUSTLE does not rely on knowing the superfamily classifications of query sequences making it possible to find homologs of proteins which do not fit into any known family.
SVM-HUSTLE is a pairwise sequence homology detection algorithm that builds models from representative high-confidence training sequences on the fly in a semi-supervised fashion. A typical setting for the use of SVM-HUSTLE would be to search for homologous proteins from a database for a given query. SVM-HUSTLE yields significantly better results than network propagation algorithms such as the RANKPROP and MOTIFPROP on our benchmark datasets. The performance is also comparable to HHSearch which requires the computation of multiple alignments and/or structural models of the database.
ACKNOWLEDGEMENTS
This research was supported by the Data-intensive Computing for Complex Biological Systems project funded by the Office of Advanced Scientific Computing Research, and under the Laboratory Directed Research and Development Program at the Pacific Northwest National Laboratory, (PNNL). PNNL is a multi-program national laboratory operated by Battelle for the US DOE under contract DE-AC06-76RL01830. This research was performed in part using the Molecular Science Computing Facility (MSCF) in the William R. Wiley Environmental Molecular Science Laboratory (EMSL), a national scientific user facility sponsored by the US DOE, OBER and located at PNNL. In addition, the authors would like to thank the Dr Jason Weston, Dr Christina Leslie, Dr Rui Kuang, Dr Li Liao and Dr William Stafford Noble for the public access to their data and results.
Conflict of Interest: none declared.
REFERENCES
Author notes
Associate Editor: Burkhard Rost