Investigating Protein Semantic Similarity Measurement and its Correlation with Sequence Similarity

Ikram, Najmul.

DSpace Home
→
Engineering and Technology
→
Thesis
→
View Item

dc.contributor.author	Ikram, Najmul.
dc.date.accessioned	2019-01-03T07:35:22Z
dc.date.accessioned	2020-04-11T15:34:38Z
dc.date.available	2020-04-11T15:34:38Z
dc.date.issued	2018
dc.identifier.uri	http://142.54.178.187:9060/xmlui/handle/123456789/4988
dc.description.abstract	Protein sequence similarity is commonly used to compare proteins, and to search for proteins similar to a query protein. With the growing use of biomedical ontologies, especially Gene Ontology (GO), semantic similarity between ontology terms, proteins and genes is getting attention of researchers. Protein semantic similarity measurement has many applications in bioinformatics, including protein function prediction and protein-protein interactions. Semantic similarity measures were proposed by Resnik, Jiang and Conrath, and Lin. Recent measures include Wang and AIC. The question whether the semantic similarity has a strong correlation with sequence similarity, has been addressed by some authors. It has been reported that such correlation exists, and it has been used for the evaluation of semantic similarity computation methods as well as for protein function prediction. We investigate the correlation between semantic similarity and sequence similarity using graphs, Pearson's correlation coe cient and example proteins. We nd that there is no strong correlation between the two similarity measures. Pearson's correlation coef- cient is not su cient to explain the nature of this relationship, if not accompanied by graph analysis. We nd that there are several pairs with low sequence similarity and high semantic similarity, but very few pairs with high sequence similarity and low semantic similarity. Interestingly, the correlation coe cient depends only on the number of common GO terms in proteins under comparison. We propose a novel method SemSim for semantic similarity measurement. It addresses the limitations of existing methods, and computes similarity in two steps. In the rst step, SimGIC like approach is used where contribution of common ancestors is divided by contribution of all ancestors. In the second step, we use two new factors: Speci city computed from ontology based information content, and Uniqueness computed from annotation based information content. The nal result, after applying these two factors, makes clear distinction between the generalized and specialized terms. We conducted experiments on protein pairs having evidence of high similarity, and the ones having evidence of low similarity. Experiments show that SemSim performs better than the previous measures in both cases. When semantic similarity is used for searching proteins from large databases, the speed issue becomes signi cant. To search for proteins similar to a query protein having m annotations, from the database of p proteins, p m n g comparisons would be required. Here n is the average annotations per protein, g is the complexity of GO term similarity computation algorithm, and it is assumed that each term of one protein is compared with each term of the other. We propose a method SimExact that is suitable for high speed searching of semantically similar proteins. Although SimExact works on common terms only, our experiments show that it gives correct results required for protein semantic searching. SimExact can be used as a pre processor, generating candidate list for the existing methods, which proceed for further computation. Such arrangement will gain high speed while retaining the accuracy of the given method. We provide online tool that generates a ranked list of the proteins similar to a query protein, with a response time of less than 8 seconds in our setup. We use SimExact to search for protein pairs having high disparity between semantic similarity and sequence similarity. SimExact makes such searches possible, which would be NP-hard otherwise.	en_US
dc.description.sponsorship	CUST University.	en_US
dc.language.iso	en_US	en_US
dc.publisher	CAPITAL UNIVERSITY OF SCIENCE AND TECHNOLOGY, ISLAMABAD	en_US
dc.subject	Technological Sciences	en_US
dc.title	Investigating Protein Semantic Similarity Measurement and its Correlation with Sequence Similarity	en_US
dc.type	Thesis	en_US