First Award: Identify Informative Genes for Cancer Classification
Project Award Date: 05-01-2004
With the completion of the Human Genome Project and advances in microarray technologies, it is now possible to explore the whole genome both systematically and comprehensively. Microarrays have been used extensively to screen gene expression and to uncover important clues to the role of genes and the underlying gene regulatory networks. The use of microarrays is rapidly generating large amounts of data (typically terabytes) that create both opportunities and challenging problems. Conventional methods are increasingly unable to deal with this volume of data. For example, when applied to cancer classification, microarray data overwhelm conventional machine learning algorithms because the number of samples is much smaller than the number of features (genes). A major challenge is the identification of informative genes for cancer classification from gene expression measurements. In fact, it has been demonstrated that only a small number of genes are relevant to a specific cancer classification problem. Identifying these relevant genes is important in numerous microarray-based applications such as drug discovery, early disease detection, and proper treatment guidance.
This project addresses the problem of identifying informative genes for cancer classification. The main objective of this work is to perform a preliminary investigation of a new feature-selection algorithm based on margins and genetic algorithms. In addition, we will conduct a comprehensive comparative study of several fundamental gene selection algorithms in microarray-based cancer classification problems to assess their performance on different data sets on an equal footing.
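The abstract does not specify how the margin and the genetic algorithm are combined, so the following is only an illustrative sketch under stated assumptions, not the project's algorithm. It assumes a nearest-hit/nearest-miss (hypothesis) margin as the fitness of a gene subset, an L1 distance, a small per-gene penalty, and a plain genetic algorithm with tournament selection, uniform crossover, and bit-flip mutation. The names `margin_fitness` and `select_genes` and all parameter values are hypothetical.

```python
import random


def margin_fitness(mask, X, y, penalty=0.01):
    """Mean nearest-miss minus nearest-hit distance on the selected genes.

    Assumes at least two classes and at least two samples per class.
    The per-gene penalty (an assumption) favours smaller subsets.
    """
    selected = [j for j, bit in enumerate(mask) if bit]
    if not selected:
        return float("-inf")  # an empty gene subset is never acceptable

    def dist(a, b):
        return sum(abs(a[j] - b[j]) for j in selected)

    total = 0.0
    for i, xi in enumerate(X):
        hits = [dist(xi, X[k]) for k in range(len(X)) if k != i and y[k] == y[i]]
        misses = [dist(xi, X[k]) for k in range(len(X)) if y[k] != y[i]]
        total += min(misses) - min(hits)
    return total / len(X) - penalty * len(selected)


def select_genes(X, y, pop_size=30, generations=40, mutation_rate=0.05, seed=0):
    """Return sorted indices of the genes in the best mask found by the GA."""
    rng = random.Random(seed)
    n_genes = len(X[0])

    def fitness(mask):
        return margin_fitness(mask, X, y)

    # Each individual is a boolean mask over genes.
    population = [[rng.random() < 0.5 for _ in range(n_genes)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)

    for _ in range(generations):
        scored = [(fitness(m), m) for m in population]

        def tournament():
            a, b = rng.sample(scored, 2)
            return a[1] if a[0] >= b[0] else b[1]

        children = [best[:]]  # elitism: keep the best mask unchanged
        while len(children) < pop_size:
            p1, p2 = tournament(), tournament()
            # Uniform crossover followed by bit-flip mutation.
            child = [p1[j] if rng.random() < 0.5 else p2[j] for j in range(n_genes)]
            child = [(not bit) if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        population = children
        best = max(population, key=fitness)

    return [j for j, bit in enumerate(best) if bit]
```

Here `X` is a list of samples, each a list of expression values, and `y` is the list of class labels. Because the fitness is computed on the training samples only, a subset chosen this way would still need to be evaluated on held-out data.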
The intellectual merit of this project includes major progress on the informative gene identification problem, which is a primary challenge in microarray data analysis; a better understanding of feature selection algorithms in small-sample problems; and potential solutions for choosing suitable gene selection algorithms for given problems. The new gene selection algorithm is expected to perform equally well on both training and test data in classification problems. When combined with support vector machines, the new algorithm will be able to predict the class of data that are unseen during training, even for small training samples.
The broader impacts resulting from project activities include a robust method for extracting information from large datasets; the potential integration of the small number of identified genes into the cancer diagnosis process; applications to gene function discovery; and the integration of research activities into a new bioinformatics course, Machine Learning with Life Science Applications. The class will be offered to graduate and senior undergraduate students in EECS and to students in other departments, such as Biology, who are interested in bioinformatics.
Primary Sponsor(s): NSF and KTEC