Statistical Machine Learning Methods for Bioinformatics

Instructor: Dr. Jianlin Cheng

Location: EBW 105, Time: TuTh 2:00 pm - 3:15 pm, Office Hours: TuTh 1:00 - 2:00, Semester: Spring 2008

Syllabus

Lecture Notes

1. HMM Theory

2. HMM Applications in Bioinformatics

3. Neural Network Theory

4. Neural Network Applications in Bioinformatics

5. Support Vector Machine Theory

6. Support Vector Machine Applications in Bioinformatics

7. Bayesian Network Theory and Applications in Bioinformatics (Introduction)

Reading Assignments

(1) Hidden Markov Model Theory and its Application in Bioinformatics (e.g. sequence and profile alignment)

    L. R. Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE. 1989.  (week 1: pages 257-266)

    A. Krogh, M. Brown, I. S. Mian, K. Sjolander, and D. Haussler. Hidden Markov Models in Computational Biology (Applications to Protein Modeling). Journal of Molecular Biology. 1993. (week  2)

    P. Baldi, Y. Chauvin, T. Hunkapiller, and M. A. McClure. Hidden Markov Models of Biological Primary Sequence Information. PNAS. 1994.  (week 3)

    S. R. Eddy. Profile Hidden Markov Models. Bioinformatics. 1998. (week 3)

    K. Karplus, C. Barrett, and R. Hughey. Hidden Markov Models for Detecting Remote Protein Homologies. Bioinformatics. 1998. (week 4)

    R. C. Edgar and K. Sjolander. COACH: Profile-Profile Alignment of Protein Families Using Hidden Markov Models. Bioinformatics. 2004. (week 4)

    J. Soeding. Protein Homology Detection by HMM-HMM Comparison. Bioinformatics. 2005. (week 4)

    J. D. Thompson, F. Plewniak, and O. Poch. BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics. 1999. (Project)

(2) Neural Network Theory and its Application in Bioinformatics (e.g. protein secondary structure prediction)

    D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature. 1986. (week 5)

    Excellent Lectures about Neural Network Theory: Hinton's Class at University of Toronto: http://www.cs.toronto.edu/~hinton/csc321/lectures.html.

    B. Rost and C. Sander. Combining Evolutionary Information and Neural Networks to Predict Protein Secondary Structure. Proteins. 1994.  (week 6)

    D. T. Jones. Protein Secondary Structure Prediction Based on Position-Specific Scoring Matrices. Journal of Molecular Biology. 1999. (week 7)

    G. Pollastri, D. Przybylski, B. Rost, and P. Baldi. Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins. 2002. (week 8)

(3) Support Vector Machine Theory and its Application in Bioinformatics (e.g. protein fold recognition)

    C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery. 1998. (weeks 9-10)

    W. S. Noble. What is a support vector machine? Nature Biotechnology. 2006. (week 10)

    J. Cheng and P. Baldi. A Machine Learning Information Retrieval Approach to Protein Fold Recognition. Bioinformatics. 2006. (week 11)

    J. Cheng, A. Randall, and P. Baldi. Prediction of Protein Stability Changes for Single-Site Mutations Using Support Vector Machines. Proteins. 2006. (week 12)

    A. Smola and B. Scholkopf. A Tutorial on Support Vector Regression. 1998. (week 12)

(4) EM algorithm, Gibbs Sampling, Bayesian Networks and their Application in Bioinformatics (e.g. inference of gene regulatory networks)

    A. Dempster, N. Laird, and D. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society. 1977. (week 13)

    S. Geman and D. Geman. Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1984. (week 14)

    C. E. Lawrence and A. A. Reilly. An Expectation Maximization Algorithm for the Identification and Characterization of Common Sites in Unaligned Biopolymer Sequences. Proteins. 1990. (week 15)

    C. E. Lawrence, S. F. Altschul, M. S. Boguski, J. S. Liu, A. F. Neuwald, and J. C. Wootton. Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignment. Science. 1993. (week 15)

    J. S. Liu and C. E. Lawrence. Bayesian Inference on Biopolymer Models. Bioinformatics. 1999.  (week 16)

    T. L. Bailey and C. Elkan. Unsupervised Learning of Multiple Motifs in Biopolymers Using Expectation Maximization. Machine Learning. 1995.  (week 16)

    D. Heckerman. A Tutorial on Learning with Bayesian Networks. Microsoft Research. 1995. (week 17)

    K. Murphy. A Brief Introduction to Graphical Models and Bayesian Networks (online) (week 18)

    E. Segal, M. Shapira, A. Regev, D. Pe'er, D. Botstein, D. Koller, and N. Friedman. Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nature Genetics. 2003. (week 19)

    N. Friedman. Inferring cellular networks using probabilistic graphical models. Science. 2004. (week 19)

    C. J. Needham, J. R. Bradford, A. J. Bulpitt, and D. R. Westhead. A Primer on Learning in Bayesian Networks for Computational Biology. PLoS Computational Biology. 2007.

Project

Select one project from the following options:

(1) Multiple sequence alignment using HMM

(2) Profile-profile alignment using HMM

(3) Secondary structure prediction using neural network

(4) Protein fold recognition using support vector machine

Presentation

Each group or person has 15 minutes to present the selected project (about 13 minutes for the presentation and 2 minutes for questions). The following days are reserved for presentations: May 1 (Thu), May 6 (Tue), May 8 (Thu), May 13 (Tue), May 15 (optional). The final project report is due on May 15 (midnight).

Bioinformatics Background

Introduction to Bioinformatics (taught by Jianlin Cheng, Fall 2006).

Machine Learning Background

A Set of Statistical Machine Learning Tutorials (taught by Andrew Moore at CMU and Google).

Reference Books

1. Baldi and Brunak. Bioinformatics: the Machine Learning Approach (second edition). MIT Press, 2001. (textbook)

2. Durbin, Eddy, Krogh, and Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1999. (textbook)

3. A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis (second edition). Chapman and Hall. 2003.

4. Baxevanis and Ouellette (editors). Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins (third edition). John Wiley & Sons, 2004.

5. C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1996.

6. John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

7. F. V. Jensen. Bayesian Networks and Decision Graphs. Springer, 2001.