K-INBRE Cellular Pathogen Gene Identification via Graph Data Mining
Project Award Date: 06-27-2007
Genomics efforts continue to yield a myriad of new protein sequences. These sequences offer unprecedented opportunities for knowledge-based sequence annotation, which aims to automatically transfer experimentally gained biological knowledge from model organisms to newly sequenced genomes and thereby expedite biological discovery. Applying rigorous data mining methods to large, sequence-diverse, and clinically important protein families, such as the immunological proteins, can yield reliable, intuitively predictive models that are readily extensible to annotating novel sequences. Such models would enable rational experimental design that may lead to improved treatments for refractory pathogens. Specifically, we plan to devise, refine, and disseminate statistical geometric analysis methods for characterizing and annotating immunological proteins. These methods will include (1) rigorous representation of protein structures as geometric graphs, (2) identification of conserved substructure patterns in protein structures through graph database mining, (3) mapping of structure patterns to sequence motifs, and (4) annotation of genes from the resulting sequence motifs using advanced statistical learning methods such as support vector machines.
Our choice of foci is based on strong preliminary results for each of the above objectives, including (1) the development of Delaunay tessellation and almost-Delaunay tessellation for statistical geometric analysis of protein structures, (2) the development of a state-of-the-art subgraph mining algorithm that retrieves recurring subgraphs from a group of 3D protein structures represented as graphs, and (3) the application of machine learning techniques, such as support vector machines, to build annotation models with high specificity and sensitivity.
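As a rough illustration of the first step, the sketch below builds a graph from 3D points by Delaunay tessellation. It is not the project's implementation: the function names (`det3`, `circumsphere`, `delaunay_edges`), the tolerance values, and the example coordinates are all hypothetical, and the brute-force empty-circumsphere test used here is only practical for very small point sets. It assumes each protein is reduced to one representative point per residue and that two residues are joined by an edge whenever they share a Delaunay tetrahedron.

```python
from itertools import combinations


def det3(m):
    """Determinant of a 3x3 matrix given as three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def circumsphere(p0, p1, p2, p3):
    """Return (center, squared radius) of the sphere through four points,
    or None if the points are (nearly) coplanar."""
    # Equidistance from p0 and each other point gives three linear equations.
    rows = [[2.0 * (p[k] - p0[k]) for k in range(3)] for p in (p1, p2, p3)]
    rhs = [sum(p[k] ** 2 - p0[k] ** 2 for k in range(3)) for p in (p1, p2, p3)]
    d = det3(rows)
    if abs(d) < 1e-12:
        return None
    center = []
    for col in range(3):  # Cramer's rule, one coordinate at a time
        m = [row[:] for row in rows]
        for r in range(3):
            m[r][col] = rhs[r]
        center.append(det3(m) / d)
    r2 = sum((center[k] - p0[k]) ** 2 for k in range(3))
    return center, r2


def delaunay_edges(points, eps=1e-9):
    """Brute-force Delaunay tessellation: keep every tetrahedron whose
    circumsphere contains no other point, and return its edges as index pairs."""
    edges = set()
    n = len(points)
    for tet in combinations(range(n), 4):
        sphere = circumsphere(*(points[i] for i in tet))
        if sphere is None:
            continue
        center, r2 = sphere
        empty = all(
            sum((points[j][k] - center[k]) ** 2 for k in range(3)) >= r2 - eps
            for j in range(n)
            if j not in tet
        )
        if empty:
            edges.update(combinations(tet, 2))
    return edges


# Hypothetical coordinates standing in for five residue positions.
residue_points = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (5, 5, 5)]
contact_graph = delaunay_edges(residue_points)
print(sorted(contact_graph))
```

The resulting edge set, with residue types attached as node labels, is the kind of graph a subgraph mining step could search for recurring patterns. A production pipeline would more likely rely on an established computational geometry library than on this exhaustive search.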
Primary Sponsor(s): KUMCRI (flow-through from NIH)