Skip to main content
Top
Published in: BMC Medical Informatics and Decision Making 1/2013

Open Access 01-04-2013 | Proceedings

Efficient protein structure search using indexing methods

Authors: Sungchul Kim, Lee Sael, Hwanjo Yu

Published in: BMC Medical Informatics and Decision Making | Special Issue 1/2013

Login to get access

Abstract

Understanding functions of proteins is one of the most important challenges in many studies of biological processes. The function of a protein can be predicted by analyzing the functions of structurally similar proteins, thus finding structurally similar proteins accurately and efficiently from a large set of proteins is crucial. A protein structure can be represented as a vector by 3D-Zernike Descriptor (3DZD) which compactly represents the surface shape of the protein tertiary structure. This simplified representation accelerates the searching process. However, computing the similarity of two protein structures is still computationally expensive, thus it is hard to efficiently process many simultaneous requests of structurally similar protein search. This paper proposes indexing techniques which substantially reduce the search time to find structurally similar proteins. In particular, we first exploit two indexing techniques, i.e., iDistance and iKernel, on the 3DZDs. After that, we extend the techniques to further improve the search speed for protein structures. The extended indexing techniques build and utilize an reduced index constructed from the first few attributes of 3DZDs of protein structures. To retrieve top-k similar structures, top-10 × k similar structures are first found using the reduced index, and top-k structures are selected among them. We also modify the indexing techniques to support θ-based nearest neighbor search, which returns data points less than θ to the query point. The results show that both iDistance and iKernel significantly enhance the searching speed. In top-k nearest neighbor search, the searching time is reduced 69.6%, 77%, 77.4% and 87.9%, respectively using iDistance, iKernel, the extended iDistance, and the extended iKernel. In θ-based nearest neighbor serach, the searching time is reduced 80%, 81%, 95.6% and 95.6% using iDistance, iKernel, the extended iDistance, and the extended iKernel, respectively.
Literature
1.
go back to reference Sael L, Li B, La D, Fang Y, Ramani K, Rustamov R, Kihara D: Fast protein tertiary structure retrieval based on global surface shape similarity. Proteins. 2008, 72: 1259-1273. 10.1002/prot.22030.CrossRefPubMed Sael L, Li B, La D, Fang Y, Ramani K, Rustamov R, Kihara D: Fast protein tertiary structure retrieval based on global surface shape similarity. Proteins. 2008, 72: 1259-1273. 10.1002/prot.22030.CrossRefPubMed
2.
go back to reference Deng K, Zhou X, Shen HT, Liu Q, Xu K, Lin X: A multi-resolution surface distance model for k-NN query processing. VLDB. 2008, 1101-1119. Deng K, Zhou X, Shen HT, Liu Q, Xu K, Lin X: A multi-resolution surface distance model for k-NN query processing. VLDB. 2008, 1101-1119.
3.
go back to reference Shen HT, Huang Z, Cao J, Zhou X: High-dimensional indexing with oriented cluster representation for multimedia databases. VLDB. 2009, ICME, 1628-1631. Shen HT, Huang Z, Cao J, Zhou X: High-dimensional indexing with oriented cluster representation for multimedia databases. VLDB. 2009, ICME, 1628-1631.
4.
go back to reference Kim S, Lee S, Yu H: Indexing methods for efficient protein 3D surface search. Proceedings of the ACM Sixth International Workshop on Data and Text Mining in Biomedical Informatics. 2012, New York: ACM, 41-48. 10.1145/2390068.2390078. Kim S, Lee S, Yu H: Indexing methods for efficient protein 3D surface search. Proceedings of the ACM Sixth International Workshop on Data and Text Mining in Biomedical Informatics. 2012, New York: ACM, 41-48. 10.1145/2390068.2390078.
5.
go back to reference Gibrat JF, Madej T, Bryant SH: Surprising similarities in structure comparison. Curr Opin Struct Biol. 1996, 6: 377-385. 10.1016/S0959-440X(96)80058-3.CrossRefPubMed Gibrat JF, Madej T, Bryant SH: Surprising similarities in structure comparison. Curr Opin Struct Biol. 1996, 6: 377-385. 10.1016/S0959-440X(96)80058-3.CrossRefPubMed
6.
go back to reference Shindyalov IN, Bourne PE: Protein structure alighment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 1998, 11: 739-747. 10.1093/protein/11.9.739.CrossRefPubMed Shindyalov IN, Bourne PE: Protein structure alighment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 1998, 11: 739-747. 10.1093/protein/11.9.739.CrossRefPubMed
7.
go back to reference Singh AP, Brutlag DL: Hierarchical protein structure superposition using both secondary structure and atomic representations. Intl Syst for Mol Biol (ISMB). 2008, 1013-1022. Singh AP, Brutlag DL: Hierarchical protein structure superposition using both secondary structure and atomic representations. Intl Syst for Mol Biol (ISMB). 2008, 1013-1022.
8.
go back to reference Holm L, Sander C: Protein structure comparison by alighment of distance matrices. J Mol Biol. 1993, 233: 123-138. 10.1006/jmbi.1993.1489.CrossRefPubMed Holm L, Sander C: Protein structure comparison by alighment of distance matrices. J Mol Biol. 1993, 233: 123-138. 10.1006/jmbi.1993.1489.CrossRefPubMed
9.
go back to reference Martin A: The ups and downs of protein topology: rapid comparison of protein structure. Protein Eng. 2000, 13: 829-837. 10.1093/protein/13.12.829.CrossRefPubMed Martin A: The ups and downs of protein topology: rapid comparison of protein structure. Protein Eng. 2000, 13: 829-837. 10.1093/protein/13.12.829.CrossRefPubMed
10.
go back to reference Mizuguchi K, Go N: Seeking significance in three-dimensional rotein structure comparisons. Curr Opin Struct Biol. 1995, 5: 377-382. 10.1016/0959-440X(95)80100-6.CrossRefPubMed Mizuguchi K, Go N: Seeking significance in three-dimensional rotein structure comparisons. Curr Opin Struct Biol. 1995, 5: 377-382. 10.1016/0959-440X(95)80100-6.CrossRefPubMed
11.
go back to reference Kolodny R, Petrey D, Honig B: Protein structure comparison: implications for the nature of 'fold space', and structure and function prediction. Curr Opin Struct Biol. 2006, 16: 393-398. 10.1016/j.sbi.2006.04.007.CrossRefPubMed Kolodny R, Petrey D, Honig B: Protein structure comparison: implications for the nature of 'fold space', and structure and function prediction. Curr Opin Struct Biol. 2006, 16: 393-398. 10.1016/j.sbi.2006.04.007.CrossRefPubMed
12.
go back to reference Kihara D, Skolnick J: The PDB is a covering set of amall protein structures. J Mol Biol. 2003, 334: 793-802. 10.1016/j.jmb.2003.10.027.CrossRefPubMed Kihara D, Skolnick J: The PDB is a covering set of amall protein structures. J Mol Biol. 2003, 334: 793-802. 10.1016/j.jmb.2003.10.027.CrossRefPubMed
13.
go back to reference Gerstein M, Levitt M: Using iterative dynamic programming to obtain accurate pairwise and multiple alignments of ptotein structures. Proc Int Conf Intell Syst Mol Biol. 1996, 4: 59-67.PubMed Gerstein M, Levitt M: Using iterative dynamic programming to obtain accurate pairwise and multiple alignments of ptotein structures. Proc Int Conf Intell Syst Mol Biol. 1996, 4: 59-67.PubMed
14.
go back to reference Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM: CATH-a hierarchic classification of protein domain structures. Structure. 1997, 5: 1093-1108. 10.1016/S0969-2126(97)00260-8.CrossRefPubMed Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM: CATH-a hierarchic classification of protein domain structures. Structure. 1997, 5: 1093-1108. 10.1016/S0969-2126(97)00260-8.CrossRefPubMed
15.
go back to reference Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: A structural classification of proteins database for the investigation of sequences and structures. J Mol Biol. 1995, 247: 536-540.PubMed Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: A structural classification of proteins database for the investigation of sequences and structures. J Mol Biol. 1995, 247: 536-540.PubMed
17.
go back to reference Madej T, Gibrat JF, Bryant SH: Threading a database of protein cores. Proteins. 1995, 23: 356-369. 10.1002/prot.340230309.CrossRefPubMed Madej T, Gibrat JF, Bryant SH: Threading a database of protein cores. Proteins. 1995, 23: 356-369. 10.1002/prot.340230309.CrossRefPubMed
18.
go back to reference Kinoshita K, Nakamura H: Identification of protein biochemical functions by similarity search using the molecular surface database eF-site. Protein Sci. 2003, 12: 1589-1595. 10.1110/ps.0368703.PubMedCentralCrossRefPubMed Kinoshita K, Nakamura H: Identification of protein biochemical functions by similarity search using the molecular surface database eF-site. Protein Sci. 2003, 12: 1589-1595. 10.1110/ps.0368703.PubMedCentralCrossRefPubMed
19.
go back to reference Aung Z, Fu W, lee Tan K: An efficient index-based protein structure database searching method. DASFAA. 2003 Aung Z, Fu W, lee Tan K: An efficient index-based protein structure database searching method. DASFAA. 2003
20.
go back to reference Conte LL, Brenner SE, Hubbard TJ, Chothia C, Murzin AG: SCOP database in 2002: refinements accommodate structural genomics. Nucleic Acids Res. 2002, 30: 264-267. 10.1093/nar/30.1.264.PubMedCentralCrossRefPubMed Conte LL, Brenner SE, Hubbard TJ, Chothia C, Murzin AG: SCOP database in 2002: refinements accommodate structural genomics. Nucleic Acids Res. 2002, 30: 264-267. 10.1093/nar/30.1.264.PubMedCentralCrossRefPubMed
21.
go back to reference Ciaccia P, Patella M, Zezula P: M-tree: an efficient access method for similarity search in metric spaces. Nucleic Acids Res. 1997, VLDB Ciaccia P, Patella M, Zezula P: M-tree: an efficient access method for similarity search in metric spaces. Nucleic Acids Res. 1997, VLDB
22.
go back to reference Keim D: Tutorial on high-dimensional index structures: Database support for next decades applications. Nucleic Acids Res. 2000, ICDE Keim D: Tutorial on high-dimensional index structures: Database support for next decades applications. Nucleic Acids Res. 2000, ICDE
23.
go back to reference Bruno N, Gravano L, Marian A: Evaluating top-k queries over web-accessible databases. Nucleic Acids Res. 2002, ICDE Bruno N, Gravano L, Marian A: Evaluating top-k queries over web-accessible databases. Nucleic Acids Res. 2002, ICDE
24.
go back to reference Venkatraman V, Chakravarthy PR, Kihara D: Application of 3D Zernike descriptors to shape-based ligand similarity searching. J Cheminform. 2009, 1: 19-10.1186/1758-2946-1-19.PubMedCentralCrossRefPubMed Venkatraman V, Chakravarthy PR, Kihara D: Application of 3D Zernike descriptors to shape-based ligand similarity searching. J Cheminform. 2009, 1: 19-10.1186/1758-2946-1-19.PubMedCentralCrossRefPubMed
25.
go back to reference Canterakis N: 3D Zernike Moments and Zernike Affine Invariants for 3D Image Analysis and Recognition. Image Analysis. 1999, 85-93. Canterakis N: 3D Zernike Moments and Zernike Affine Invariants for 3D Image Analysis and Recognition. Image Analysis. 1999, 85-93.
26.
go back to reference Novotni M, Klein R: 3D zernike descriptors for content based shape retrieval. SM. 2003 Novotni M, Klein R: 3D zernike descriptors for content based shape retrieval. SM. 2003
27.
go back to reference La D, Esquivel-Rodríguez J, Venkatraman V, Li B, Sael L, Ueng S, Ahrendt S, Kihara D: 3D-SURFER: software for high-throughput protein surface comparison and analysis. Bioinformatics. 2009, 25: 2843-2844. 10.1093/bioinformatics/btp542.PubMedCentralCrossRefPubMed La D, Esquivel-Rodríguez J, Venkatraman V, Li B, Sael L, Ueng S, Ahrendt S, Kihara D: 3D-SURFER: software for high-throughput protein surface comparison and analysis. Bioinformatics. 2009, 25: 2843-2844. 10.1093/bioinformatics/btp542.PubMedCentralCrossRefPubMed
28.
go back to reference Connolly ML: The molecular surface package. J Mol Graph. 1993, 11 (2): 139-141. 10.1016/0263-7855(93)87010-3.CrossRefPubMed Connolly ML: The molecular surface package. J Mol Graph. 1993, 11 (2): 139-141. 10.1016/0263-7855(93)87010-3.CrossRefPubMed
29.
go back to reference Jagadish HV, Ooi BC, Tan KL, Yu C, Zhang R: iDistance: An adaptive B-tree based indexing method for nearest neighbor search. Database Syst. 2005, 364-397. Jagadish HV, Ooi BC, Tan KL, Yu C, Zhang R: iDistance: An adaptive B-tree based indexing method for nearest neighbor search. Database Syst. 2005, 364-397.
30.
go back to reference Yu H, Ko I, Kim Y, Hwang S, Han WS: Exact indexing for support vector machines. Management of data. Yu H, Ko I, Kim Y, Hwang S, Han WS: Exact indexing for support vector machines. Management of data.
31.
go back to reference MacQueen JB: Some methods for classification and analysis of multivariate observations. Management of data Math statist and Prob. 1967 MacQueen JB: Some methods for classification and analysis of multivariate observations. Management of data Math statist and Prob. 1967
32.
go back to reference nad J Lozano JP, Larranaga P: An empirical comparison of four initializatoin methods for the k-means algorithm. Management of data. 1999 nad J Lozano JP, Larranaga P: An empirical comparison of four initializatoin methods for the k-means algorithm. Management of data. 1999
Metadata
Title
Efficient protein structure search using indexing methods
Authors
Sungchul Kim
Lee Sael
Hwanjo Yu
Publication date
01-04-2013
Publisher
BioMed Central
DOI
https://doi.org/10.1186/1472-6947-13-S1-S8

Other articles of this Special Issue 1/2013

BMC Medical Informatics and Decision Making 1/2013 Go to the issue