The Parallel Maximal Cliques Algorithm for Protein Sequence Clustering
Abstract
Problem statement: Protein sequence clustering is a method used to discover relations between proteins. This method groups the proteins based on their common features. It is a core process in protein sequence classification. Graph theory has been used in protein sequence clustering as a means of partitioning the data into groups, where each group constitutes a cluster. Mohseni-Zadeh introduced a maximal cliques algorithm for protein clustering. Approach: In this study we adapted the maximal cliques algorithm of Mohseni-Zadeh to find cliques in protein sequences and we then parallelized the algorithm to improve computation times and allowed large protein databases to be processed. We used the N-Gram Hirschberg approach proposed by Abdul Rashid to calculate the distance between protein sequences. The task farming parallel program model was used to parallelize the enhanced cliques algorithm. Results: Our parallel maximal cliques algorithm was implemented on the stealth cluster using the C programming language and a hybrid approach that includes both the Message Passing Interface (MPI) library and POSIX threads (PThread) to accelerate protein sequence clustering. Conclusion: Our results showed a good speedup over sequential algorithms for cliques in protein sequences.
DOI: https://doi.org/10.3844/ajassp.2009.1368.1372
Copyright: © 2009 Khalid Jaber, Nur’Aini Abdul Rashid and Rosni Abdullah. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
- 3,308 Views
- 2,737 Downloads
- 3 Citations
Download
Keywords
- Maximal cliques algorithm
- parallel processing
- protein sequence clustering
- MPI
- pthread