Machine Learning Approaches to Bioinformatics


SCIENCE, ENGINEERING, AND BIOLOGY INFORMATICS
Series Editor: Jason T. L. Wang (New Jersey Institute of Technology, USA)

Published:
Vol. 1: Advanced Analysis of Gene Expression Microarray Data (Aidong Zhang)
Vol. 2: Life Science Data Mining (Stephen T. C. Wong & Chung-Sheng Li)
Vol. 3: Analysis of Biological Data: A Soft Computing Approach (Sanghamitra Bandyopadhyay, Ujjwal Maulik & Jason T. L. Wang)
Vol. 4: Machine Learning Approaches to Bioinformatics (Zheng Rong Yang)

Forthcoming:
Vol. 5: Biodata Mining and Visualization: Novel Approaches (Ilkka Havukkala)

Machine Learning Approaches to Bioinformatics
Zheng Rong Yang
University of Exeter, UK

World Scientific
New Jersey, London, Singapore, Beijing, Shanghai, Hong Kong, Taipei, Chennai

Published by
World Scientific Publishing Co. Pte. Ltd.
5 Toh Tuck Link, Singapore 596224
USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.

Science, Engineering, and Biology Informatics Vol. 4
MACHINE LEARNING APPROACHES TO BIOINFORMATICS
Copyright © 2010 by World Scientific Publishing Co. Pte. Ltd.
All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.

For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.

ISBN-13 978-981-4287-30-2
ISBN-10 981-4287-30-X
Printed in Singapore.

PREFACE

Bioinformatics has been one of the most important multidisciplinary subjects in the last century. Initially, the major task of bioinformatics research was to handle large genomic data for knowledge extraction and for making predictions. More recently, the practice of bioinformatics has extended from genomics to proteomics, metabolomics, and most importantly systems biology. In addition to the traditional bioinformatics exercises, which focus on large database management and sequence homology alignment for molecular structure prediction and function annotation, modelling biological data using statistical/machine learning has been an important trend. This part of the exercise has gained great attention because it can help carry out efficient, effective, and accurate knowledge extraction and prediction model construction.

However, the application of machine learning approaches in bioinformatics research and practice faces a series of challenges compared with other applications. The challenges include data size, data quality, and the imbalance between different data resources. These challenges are particularly obvious in systems biology research. For instance, genomics data size has a scale of around 25K, but proteomics data size can reach up to a scale of millions. Currently, it is hard to use modern computers to handle such large scale data in one machine learning model. Furthermore, due to experimental variation, tissue corruption, and equipment resolution, most metabolite data suffer from a problem of data quality. This poses a challenge in machine learning model construction in terms of data noise and missing data. In using next generation sequencing equipment such as Illumina, we are faced with terabytes of sequence fragments. The challenge is how to assemble these fragments accurately without any reference sequences. An urgent requirement in systems biology is to use different sources of data for analysing systems behaviour. This then poses a challenge about how to efficiently incorporate these data, with different resolutions, different data formats, different data quality, and different data dimensionalities, in one machine learning model. This book therefore tries to discuss some of these challenges.

This book is written based on my teaching and research notes in bioinformatics over the past ten years. I thank Prof Jason Wang and the publisher for inviting me to write this book. The book is written mainly for postgraduates and researchers at the start of their bioinformatics research and practice. The pre-requisite to using this book is some basic knowledge of linear algebra and statistics. The book can be used as a teaching reference for both advanced undergraduate and postgraduate courses. Readers are encouraged to be familiar with basic R programming before using this book, as most case studies presented in the book are implemented in R.

The book is composed of three parts. The first part covers several unsupervised learning approaches which can be used in bioinformatics. For instance, multidimensional scaling is commonly used in bioinformatics for biological data visualisation. Various cluster analysis approaches as well as the self-organising map have been used for biological pattern recognition. After data partitioning, molecules can then be clustered, leading to prototype pattern discovery and new hypothesis generation.

The second part mainly discusses supervised learning approaches. In many bioinformatics projects, a typical question is how to accurately predict unknowns based on experimental data. For instance, how can we identify the most important genes for the most efficient and accurate disease diagnosis? Additionally, given a huge number of molecular sequence data in which most functions are still unknown, how can we make prediction models based on limited information of known functions in sequence data? This part therefore introduces several commonly used supervised learning algorithms as well as their applications to bioinformatics.

The third part of this book introduces the concepts relevant to computational systems biology, which is now the most important research target in bioinformatics. Computational systems biology research mainly focuses on large biological systems, aiming to reveal the complex interplay between molecules and molecular entities. Gene networks, systems dynamics and pathway recognition have been of much interest in recent years. The third part then demonstrates how machine learning algorithms can be used for these issues.

As mentioned above, this book is based on the revision of my teaching and research notes. It is therefore important to name several research collaborators. My key research collaborators include T Charlie Hodgman, Andrew Dalby, Murray Grant, Richard Titball, Nick Smirnoff, and Tom Richards. The students who have contributed to the improvement of my teaching of bioinformatics at the University of Exeter are Rebecca Hamer, Jon Dry, Emily Berry, Dave Trudgun, Hanieh Yaghootkar and Susie Clark. I am very grateful to Susie Clark for proof-reading the book.

Finally, I would like to thank my parents, wife and daughter for their great support. During the writing of this book, I regret not being able to spend more time with them. I hope the publication of this book will make up for the sacrifice.

Zheng Rong Yang
29 November 2009
Exeter, England, UK

CONTENTS

Preface
1 Introduction
    1.1 Brief history of bioinformatics
    1.2 Database application in bioinformatics
    1.3 Web tools and services for sequence homology alignment
        1.3.1 Web tools and services for protein functional site identification
        1.3.2 Web tools and services for other biological data
    1.4 Pattern analysis
    1.5 The contribution of information technology
    1.6 Chapters
2 Introduction to Unsupervised Learning
3 Probability Density Estimation Approaches
    3.1 Histogram approach
    3.2 Parametric approach
    3.3 Non-parametric approach
        3.3.1 K-nearest neighbour approach
        3.3.2 Kernel approach
    Summary
4 Dimension Reduction
    4.1 General
    4.2 Principal component analysis
    4.3 An application of PCA
    4.4 Multi-dimensional scaling
    4.5 Application of the Sammon algorithm to gene data
    Summary
5 Cluster Analysis
    5.1 Hierarchical clustering
    5.2 K-means
    5.3 Fuzzy C-means
    5.4 Gaussian mixture models
    5.5 Application of clustering algorithms to the Burkholderia pseudomallei gene expression data
    Summary
6 Self-organising Map
    6.1 Vector quantization
    6.2 SOM structure
    6.3 SOM learning algorithm
    6.4 Using SOM for classification
    6.5 Bioinformatics applications of VQ and SOM
        6.5.1 Sequence analysis
        6.5.2 Gene expression data analysis
        6.5.3 Metabolite data analysis
    6.6 A case study of gene expression data analysis
    6.7 A case study of sequence data analysis
    Summary
7 Introduction to Supervised Learning
    7.1 General concepts
    7.2 General definition
    7.3 Model evaluation
    7.4 Data organisation
    7.5 Bayes rule for classification
    Summary
8 Linear/Quadratic Discriminant Analysis and K-nearest Neighbour
    8.1 Linear discriminant analysis
    8.2 Generalised discriminant analysis
    8.3 K-nearest neighbour
    8.4 KNN for gene data analysis
    Summary
9 Classification and Regression Trees, Random Forest Algorithm
    9.1 Introduction
    9.2 Basic principle for constructing a classification tree
    9.3 Classification and regression tree
    9.4 CART for compound pathway involvement prediction
    9.5 The random forest algorithm
    9.6 RF for analyzing Burkholderia pseudomallei gene expression profiles
    Summary
10 Multi-layer Perceptron
    10.1 Introduction
    10.2 Learning theory
        10.2.1 Parameterization of a neural network
        10.2.2 Learning rules
    10.3 Learning algorithms
        10.3.1 Regression
        10.3.2 Classification
        10.3.3 Procedure
    10.4 Applications to bioinformatics
        10.4.1 Bio-chemical data analysis
        10.4.2 Gene expression data analysis
        10.4.3 Protein structure data analysis
        10.4.4 Bio-marker identification
    10.5 A case study on Burkholderia pseudomallei gene expression data
    Summary
11 Basis Function Approach and Vector Machines
    11.1 Introduction
    11.2 Radial-basis function neural network (RBFNN)
    11.3 Bio-basis function neural network
    11.4 Support vector machine
    11.5 Relevance vector machine
    Summary
12 Hidden Markov Model
    12.1 Markov model
    12.2 Hidden Markov model
        12.2.1 General definition
        12.2.2 Handling HMM
        12.2.3 Evaluation
        12.2.4 Decoding
        12.2.5 Learning
    12.3 HMM for sequence classification
    Summary
13 Feature Selection
    13.1 Built-in strategy
        13.1.1 Lasso regression
        13.1.2 Ridge regression
        13.1.3 Partial least square regression (PLS) algorithm
    13.2 Exhaustive strategy
    13.3 Heuristic strategy orthogonal least square approach
    13.4 Criteria for feature selection
        13.4.1 Correlation measure
        13.4.2 Fisher ratio measure
        13.4.3 Mutual information approach
    Summary
14 Feature Extraction (Biological Data Coding)
    14.1 Molecular sequences
    14.2 Chemical compounds
    14.3 General definition
    14.4 Sequence analysis
        14.4.1 Peptide feature extraction
        14.4.2 Whole sequence feature extraction
    Summary
15 Sequence/Structural Bioinformatics Foundation Peptide Classification
    15.1 Nitration site prediction
    15.2 Plant promoter region prediction
    Summary
16 Gene Network Causal Network and Bayesian Networks
    16.1 Gene regulatory network
    16.2 Causal networks, networks, graphs
    16.3 A brief review of the probability
    16.4 Discrete Bayesian network
    16.5 Inference with discrete Bayesian network
    16.6 Learning discrete Bayesian network
    16.7 Bayesian networks for gene regulatory networks
    16.8 Bayesian networks for discovering peptide patterns
    16.9 Bayesian networks for analysing Burkholderia pseudomallei gene data
    Summary
17 S-Systems
    17.1 Michaelis-Menten change law
    17.2 S-system
    17.3 Simplification of an S-system
    17.4 Approaches for structure identification and parameter estimation
        17.4.1 Neural network approach
        17.4.2 Simulated annealing approach
        17.4.3 Evolutionary computation approach
    17.5 Steady-state analysis of an S-system
    17.6 Sensitivity of an S-system
    Summary
18 Future Directions
    18.1 Multi-source data
    18.2 Gene regulatory network construction
    18.3 Building models using incomplete data
    18.4 Biomarker detection from gene expression data
    Summary
References
Index

Chapter 1
Introduction

Bioinformatics has been in action for at least three decades. However, there is still general confusion as to the function of bioinformatics. Some biologists still treat bioinformatics as a set of tools. Some informatists regard bioinformatics as a career of developing novel algorithms and systems. Because of this, there is a slight difference in definitions. In the literature, one fundamental concept is also missing: that information is a natural, inherent, and dynamic component in all biological systems.

We first examine how bioinformatics is defined in various textbooks. In Attwood and Parry-Smith's book [1], bioinformatics is defined as “the application of computers in biology sciences and especially analysis of biological sequence data”. In Baxevanis and Ouellette's book [2], bioinformatics is “a field inte
