Yeast

Predict the Cellular Localization Sites of Proteins in Yeast



Background:

Knowing a protein's cellular localization helps elucidate its function, its role in both healthy processes and the onset of disease, and its potential use as a drug target. Experimental characterization of protein localization is accurate but slow and labor-intensive. The amino acid sequence of a protein usually provides a crucial indication of its cellular localization site. Meanwhile, the volume of sequenced genomic data has grown exponentially in recent years as high-throughput sequencing techniques have matured. Many computational methods have therefore been proposed to link a protein's sequence to its cellular location. These include McGeoch's method for signal sequence recognition and discriminant analysis of the amino acid content of outer membrane and periplasmic proteins.

Generally, these approaches try to detect the sequence patterns associated with the set of proteins located in the same cellular compartment. However, each of these methods handles only one protein category, e.g. giving the probability that a sequence is a membrane protein, or deciding whether or not it is a nuclear protein. Thus, for a new protein sequence about which nothing is known in advance, the only way to decide its localization site is to run all available methods and compare their outputs. Even then, one still has to judge which prediction is more reliable, and what cutoff probability makes it safe to say that a protein is in one cellular localization site and not in the others. There is therefore a clear need for a comprehensive system that integrates protein sequence-derived data with the prediction results from all the methods described above. It has been shown that a variety of machine learning methods can be used for this purpose.
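To make the cutoff question concrete, here is a minimal sketch of how the outputs of several single-category predictors might be combined by hand. The function name `pick_site`, the category names, and the scores are all hypothetical, and the sketch assumes the scores are calibrated probabilities that can be compared across methods, which is not guaranteed in practice.

```python
from typing import Dict, Optional


def pick_site(scores: Dict[str, float], cutoff: float = 0.5) -> Optional[str]:
    """Return the best-scoring site, or None if no score reaches the cutoff.

    `scores` maps a site name to the probability reported by the
    single-category predictor for that site (assumed comparable).
    """
    if not scores:
        return None
    best_site = max(scores, key=scores.get)
    return best_site if scores[best_site] >= cutoff else None


# Hypothetical outputs of three single-category predictors for one protein.
example_scores = {"membrane": 0.35, "nucleus": 0.72, "mitochondrion": 0.41}

print(pick_site(example_scores))        # best score passes the default cutoff
print(pick_site(example_scores, 0.9))   # nothing passes a stricter cutoff
```

The answer changes with the chosen cutoff, which is the judgment call described above; a single multi-class model avoids having to reconcile separate predictors in this way.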

 

Build a machine learning model to predict the cellular localization site of proteins!
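As one possible starting point, the sketch below implements a small k-nearest-neighbours classifier using only the Python standard library. It is an illustrative baseline and not the method of the papers listed under Relevant Papers; the function name `knn_predict`, the two-feature toy points, and the labels `site_A` and `site_B` are made up for the example.

```python
import math
from collections import Counter
from typing import Sequence, Tuple

Example = Tuple[Sequence[float], str]  # (feature vector, class label)


def knn_predict(train: Sequence[Example], query: Sequence[float], k: int = 3) -> str:
    """Predict a label by majority vote among the k nearest training points."""
    # Rank training examples by Euclidean distance to the query point.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    # On a tie, Counter keeps insertion order, so the label seen first
    # among the nearest neighbours wins.
    return votes.most_common(1)[0][0]


# Toy training set with made-up values (not taken from the real data).
toy_train = [
    ((0.10, 0.20), "site_A"),
    ((0.20, 0.10), "site_A"),
    ((0.15, 0.25), "site_A"),
    ((0.80, 0.90), "site_B"),
    ((0.90, 0.80), "site_B"),
    ((0.85, 0.75), "site_B"),
]

print(knn_predict(toy_train, (0.2, 0.2)))
print(knn_predict(toy_train, (0.8, 0.8)))
```

With the real data, each feature vector would hold the eight real-valued attributes of one protein, and a held-out split would be needed to estimate accuracy.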

 

Abstract: Predicting the Cellular Localization Sites of Proteins

 

Data Set Characteristics: Multivariate
Number of Instances: 1484
Area: Life
Attribute Characteristics: Real
Number of Attributes: 8
Date Donated: 1996-09-01
Associated Tasks: Classification
Missing Values? No
Number of Web Hits: 187184
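The summary above can be turned into a small loading routine. The sketch below assumes a file layout that this page does not state: one whitespace-separated record per line, made of a sequence identifier, eight real-valued attributes, and a class label. The names `Record` and `parse_records` and the sample rows are hypothetical.

```python
from typing import List, NamedTuple, Tuple

N_FEATURES = 8  # "Number of Attributes: 8" in the summary


class Record(NamedTuple):
    name: str                    # sequence identifier (assumed first field)
    features: Tuple[float, ...]  # the eight real-valued attributes
    label: str                   # localization site (assumed last field)


def parse_records(text: str) -> List[Record]:
    """Parse whitespace-separated records in the assumed layout."""
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != N_FEATURES + 2:
            raise ValueError(
                f"line {line_no}: expected {N_FEATURES + 2} fields, got {len(fields)}"
            )
        name, *values, label = fields
        records.append(Record(name, tuple(float(v) for v in values), label))
    return records


# Made-up rows in the assumed layout (not real data).
sample = """\
PROT_A 0.50 0.60 0.40 0.10 0.50 0.00 0.50 0.20 site_A
PROT_B 0.40 0.70 0.50 0.20 0.50 0.00 0.50 0.30 site_B
"""

records = parse_records(sample)
print(len(records), records[0].label)
```

Because the data set reports no missing values, the parser treats any line with the wrong number of fields as an error instead of trying to fill gaps. On the full file, `len(records)` would be expected to equal the 1484 instances listed above.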

 

 

Relevant Papers:

Paul Horton & Kenta Nakai, "A Probabilistic Classification System for Predicting the Cellular Localization Sites of Proteins", Intelligent Systems in Molecular Biology, 109-115. St. Louis, USA 1996.

Kenta Nakai & Minoru Kanehisa, "A Knowledge Base for Predicting Protein Localization Sites in Eukaryotic Cells", Genomics 14:897-911, 1992. 

 


Data License


Citation Request:

Please refer to the Machine Learning Repository's citation policy