Background: Knowing a protein's cellular localization helps elucidate
its function, its role in both healthy processes and the onset of
disease, and its potential use as a drug target. Experimental
characterization of protein localization is accurate but slow and
labor-intensive. The amino acid sequence of a protein usually provides
crucial clues to its cellular localization site. Meanwhile, the volume of
sequenced genomic data has grown exponentially in recent
years as high-throughput sequencing techniques have matured. Many
computational methods have therefore been proposed to link a protein
sequence and its cellular location. These include McGeoch's method for signal
sequence recognition and discriminant analysis of the amino acid content of
outer membrane and periplasmic proteins, among others. Generally, these approaches try to detect the sequence patterns
associated with the set of proteins located in the same cellular
compartment. However, each of these methods handles only one protein
category: it gives the probability that a sequence is a membrane protein,
or decides whether or not it is a nuclear protein. Thus, for a new protein
sequence about which nothing is known in advance, the only way to determine its
localization site is to run every available method and compare the outputs.
Even then, someone must judge which prediction is
more reliable and what cutoff probability justifies assigning a
protein to one cellular localization site rather than another.
There is therefore a strong need for a comprehensive system that integrates
protein sequence-derived data with the prediction results from all the methods
described above. It has been shown that a variety of machine learning
methods can be used for this purpose.

Build a machine learning model to predict the cellular
localization site of proteins!

Abstract: Predicting the Cellular Localization Sites of Proteins
Property | Value |
---|---|
Data Set Characteristics | Multivariate |
Number of Instances | 1484 |
Area | Life |
Attribute Characteristics | Real |
Number of Attributes | 8 |
Date Donated | 1996-09-01 |
Associated Tasks | Classification |
Missing Values? | No |
Number of Web Hits | 187184 |
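
As a rough illustration of the task, the sketch below classifies a protein from 8 real-valued attributes and reports a per-site vote fraction, which is one simple way to get a single multi-class answer with a confidence attached. It uses a k-nearest-neighbour vote, chosen here only for brevity; it is not a method prescribed by this page. The feature values and the labels `site_A` and `site_B` are made-up placeholders, not rows from the data set, and the function names are hypothetical.

```python
import math
from collections import Counter

# Toy training data (ASSUMPTION: invented values, 8 real-valued attributes per row).
train_X = [
    [0.10] * 8, [0.20] * 8, [0.15] * 8,   # placeholder class "site_A"
    [0.90] * 8, [0.80] * 8, [0.85] * 8,   # placeholder class "site_B"
]
train_y = ["site_A", "site_A", "site_A", "site_B", "site_B", "site_B"]


def knn_predict_proba(X, y, query, k=3):
    """Return {label: fraction of the k nearest training rows with that label}."""
    # Euclidean distance from the query to every training row, nearest first.
    ranked = sorted((math.dist(row, query), label) for row, label in zip(X, y))
    nearest = ranked[:k]
    votes = Counter(label for _, label in nearest)
    return {label: count / len(nearest) for label, count in votes.items()}


def knn_predict(X, y, query, k=3):
    """Return the label with the largest vote fraction."""
    proba = knn_predict_proba(X, y, query, k)
    return max(proba, key=proba.get)


if __name__ == "__main__":
    query = [0.6] * 8  # hypothetical unseen protein
    print(knn_predict(train_X, train_y, query, k=5))
    print(knn_predict_proba(train_X, train_y, query, k=5))
```

With the real data, `train_X` and `train_y` would be replaced by the 1484 labelled instances, and the vote fractions could serve as the cutoff evidence the background discusses.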
Relevant Papers:
Paul Horton & Kenta Nakai, "A Probabilistic
Classification System for Predicting the Cellular Localization Sites of
Proteins", Intelligent Systems in Molecular Biology, 109-115. St. Louis,
USA, 1996.
Kenta Nakai & Minoru Kanehisa, "A Knowledge Base for Predicting
Protein Localization Sites in Eukaryotic Cells", Genomics 14:897-911,
1992.