Institute for Computer Science

Machine Learning and Natural Language Processing Lab


Student's Project

KDD Cup 2002 - Experiments in Yeast Gene Regulation Prediction

Stefan Mutter, 2002


The task of KDD Cup 2002 is a classical classification problem: does a specific gene influence a hidden system in Saccharomyces cerevisiae or not? The hidden system turned out to be the AHR (Aryl Hydrocarbon Receptor) signaling pathway. The data set is typical of biological data: there are only a few positive training instances (less than 3%), and much information is missing. Where available, a gene is described by nominal, hierarchically ordered features. The hierarchy was represented as a strict inheritance network, and attributes were created as paths in the network from the root to the respective property of the gene under consideration. In addition, the challenge organizers provided textual features in the form of MEDLINE abstracts. A bag-of-words representation was used to extract information from the abstracts. Rearranging the training instances turned out to be fruitful: new training sets were built by deleting information, that is, by removing a well-defined part of the negative training instances from the training set. The results were validated using the entire set of instances. Deleting information resulted in better ROC curves. Support Vector Machines are a promising approach to this task because they can handle large data sets, tolerate outliers, and cope with correlations between attribute values.
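The three preprocessing ideas named above (path attributes, bag of words, and removal of negative instances) can be illustrated with a minimal Python sketch. It is not the project's actual code. The hierarchy in PARENT, all function names, and the tokenization are hypothetical. Emitting every prefix of a path as its own attribute is an assumption, and so is drawing the removed negatives at random with a fixed seed, since the abstract does not state which part of the negatives was removed.

```python
import random

# Hypothetical child -> parent map standing in for a strict inheritance
# network: every node has exactly one parent, the root has None.
PARENT = {
    "root": None,
    "localization": "root",
    "nucleus": "localization",
    "nucleolus": "nucleus",
    "function": "root",
    "transcription": "function",
}


def path_attributes(node, parent=PARENT):
    """Return one attribute per prefix of the root-to-node path.

    Assumption: emitting every prefix lets genes that share an ancestor
    also share an attribute; the source only mentions root-to-property paths.
    """
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    path.reverse()
    return ["/".join(path[:i]) for i in range(1, len(path) + 1)]


def bag_of_words(text):
    """Count lowercase tokens, ignoring surrounding punctuation."""
    counts = {}
    for token in text.lower().split():
        token = token.strip(".,;:()")
        if token:
            counts[token] = counts.get(token, 0) + 1
    return counts


def undersample_negatives(instances, keep_fraction, seed=0):
    """Keep all positives and a fixed fraction of the negatives.

    instances: list of (features, label) pairs with label 1 or 0.
    Assumption: negatives are drawn at random with a fixed seed.
    """
    positives = [inst for inst in instances if inst[1] == 1]
    negatives = [inst for inst in instances if inst[1] == 0]
    rng = random.Random(seed)
    n_keep = int(len(negatives) * keep_fraction)
    return positives + rng.sample(negatives, n_keep)
```

A reduced training set built this way would be used only for fitting the classifier. As described above, validation still runs on the entire set of instances.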