(2.5.1)--学术英语阅读第五讲讲义.pdf-得力文库

资源描述

《(2.5.1)--学术英语阅读第五讲讲义.pdf》由会员分享，可在线阅读，更多相关《(2.5.1)--学术英语阅读第五讲讲义.pdf（9页珍藏版）》请在得力文库 - 分享文档赚钱的网站上搜索。

1、Velvet:Algorithms for de novo short read assemblyusing de Bruijn graphsDaniel R.Zerbino and Ewan Birney1EMBL-European Bioinformatics Institute,Wellcome Trust Genome Campus,Hinxton,Cambridge CB10 1SD,United KingdomWe have developed a new set of algorithms,collectively called“Velvet,”to manipulate de

2、Bruijn graphs for genomicsequence assembly.A de Bruijn graph is a compact representation based on short words(k-mers)that is ideal forhigh coverage,very short read(2550 bp)data sets.Applying Velvet to very short reads and paired-endsinformation only,one can produce contigs of significant length,up t

3、o 50-kb N50 length in simulations ofprokaryotic data and 3-kb N50 on simulated mammalian BACs.When applied to real Solexa data sets without readpairs,Velvet generated contigs of 8 kb in a prokaryote and 2 kb in a mammalian BAC,in close agreement with oursimulated results without read-pair informatio

4、n.Velvet represents a new approach to assembly that can leveragevery short reads in combination with read pairs to produce useful assemblies.Supplemental material is available online at www.genome.org.The code for Velvet is freely available,under theGNU Public License,at http:/www.ebi.ac.uk/zerbino/

5、velvet.Sequencing remains at the core of genomics.Applications in-clude determining the genome sequence of a new species,deter-mining the genome sequence of an individual within a popula-tion,sequencing RNA molecules from a particular sample,andusing DNA sequence as a readout assay in molecular biol

6、ogytechniques.Determining the complete genome sequence of aspecies remains an important application of sequencing,and de-spite the success in determining the human(International Hu-man Genome Sequencing Consortium 2001;Venter et al.2001),mouse(Waterston et al.2002),and numerous other genomes,this is

7、 a tiny sample of the millions of species in the biosphere.Recently,new sequencing technologies have emerged(Metzker 2005),for example,pyrosequencing(454 Sequencing)(Margulies et al.2005)and sequencing by synthesis(Solexa)(Bentley 2006),both commercially available.Compared to tradi-tional Sanger met

8、hods,these technologies produce shorter reads,currently 200 bp for pyrosequencing and 35 bp for Solexa.Untilrecently,very short read information was only used in the con-text of a known reference assembly,either for sequencing indi-viduals of the same species as the reference,or readout assaysfor ex

9、ample,STAGE(Kim et al.2005)and ChIPSeq(Johnson et al.2007).A critical stage in genome sequencing is the assembly ofshotgun reads,or piecing together fragments randomly extractedfrom the sample,to form a set of contiguous sequences(contigs)representing the DNA in the sample.Algorithms are available f

10、orwhole-genome shotgun(WGS)fragment assembly,including:Atlas(Havlak et al.2004),ARACHNE(Batzoglou et al.2002),Celera(Myers et al.2000),PCAP(Huang et al.2003),phrap(P.Green,http:/www.phrap.org),or Phusion(Mullikin and Ning2003).All these programs rely on the overlap-layout-consensusapproach(Batzoglou

11、 2005),representing each read as a node andeach detected overlap as an arc between the appropriate nodes.These methods have proved their use through numerous de novogenome assemblies.Very short reads are not well suited to this traditional ap-proach.Because of their length,they must be produced in l

12、argequantities and at greater coverage depths than traditional Sangersequencing projects.The sheer number of reads makes the over-lap graph,with one node per read,extremely large and lengthy tocompute.With long reads,repeats in the data are disambiguatedby careful metrics over long overlaps that dis

13、tinguish repeatmatches from real overlaps,using,for example,high-quality basedisagreements.With short reads,and correspondingly shortoverlaps to judge from,many reads in repeats will have only asingle or no base differences.This leads to many more ambiguousconnections in the assembly.The EULER assem

14、bler(Pevzner et al.2001)adopted a fun-damentally different approach using de Bruijn graphs.In thisrepresentation of data,elements are not organized around reads,but around words of k nucleotides,or k-mers.Reads are mappedas paths through the graph,going from one word to the next ina determined order

15、.Several teams(Shah et al.2004;Bokhari andSauer 2005;Myers 2005;Jiang et al.2007)have since expandedon the use of de Bruijn graphs for sequence assembly.The fun-damental data structure in the de Bruijn graph is based on k-mers,not reads,thus high redundancy is naturally handled bythe graph without a

16、ffecting the number of nodes.In addition,each repeat is present only once in the graph with explicit linksto the different start and end points.Depending on availableinformation,it can be either resolvable or not,but it is readilyidentifiable.Mis-assembly errors are therefore more easilyavoided than

17、 with overlap graphs.Finally,searches for overlapsare simplified,as overlapping reads are mapped onto the samearcs and can easily be followed simultaneously.Despite the attractiveness of the de Bruijn graph data struc-ture for short read assemblies,it has not been used extensively incurrent producti

18、on-based assembly methods.Chaisson et al.(2004)and Sundquist et al.(2007)suggested ways of using thesegraphs specifically for short read assembly(100200 bp),but notfor very short reads(2550 bp).More recently,programs such asSSAKE(Warren et al.2007),SHARCGS(Dohm et al.2007),andVCAKE(Jeck et al.2007)i

19、mplicitly use this framework,but at alocal level.With the advent of highly cost effective very shortreads,de Bruijn graph-based methods will grow in utility.How-ever,it is necessary to develop efficient and robust methods tomanage experimental errors and repeats.1Corresponding author.E-mail birneyeb

20、i.ac.uk;fax 44-1223-494-468.Article published online before print.Article and publication date are at http:/www.genome.org/cgi/doi/10.1101/gr.074492.107Resource18:821829 2008 by Cold Spring Harbor Laboratory Press;ISSN 1088-9051/08;www.genome.orgGenome Research821www.genome.orgIn this study,we prese

21、nt a set of algorithms,collectivelynamed“Velvet,”that manipulates these de Bruijn graphs effi-ciently to both eliminate errors and resolve repeats.These twotasks are done separately:first,the error correction algorithmmerges sequences that belong together,then the repeat solverseparates paths sharin

22、g local overlaps.We have assessed Velveton both simulated and real data.Using only very short pairedsimulated reads,Velvet is capable of assembling bacterial ge-nomes,with N50 contig lengths of up to 50 kb,and simulationson 5-Mb regions of large mammalian genomes,with contigs of3 kb.ResultsThe de Br

23、uijn graphStructureIn the de Bruijn graph,each node N represents a series of over-lapping k-mers(cf.Fig.1 for a small example).Adjacent k-mersoverlap by k?1 nucleotides.The marginal information con-tained by a k-mer is its last nucleotide.The sequence of thosefinal nucleotides is called the sequence

24、 of the node,or s(N).Each node N is attached to a twin node N,which representsthe reverse series of reverse complement k-mers.This ensures thatoverlaps between reads from opposite strands are taken into ac-count.Note that the sequences attached to a node and its twin donot need to be reverse complem

25、ents of each other.The union of a node N and its twin Nis called a“block.”From now on,any change to a node is implicitly applied sym-metrically to its twin.A block therefore has two distinguishablesides,in analogy to the“k-mer edges”described in Pevzner et al.s2001 paper.Nodes can be connected by a

26、directed“arc.”In that case,thelast k-mer of an arcs origin node overlaps with the first of itsdestination node.Because of the symmetry of the blocks,if an arcgoes from node A to B,a symmetric arc goes from Bto A.Anymodification of one arc is implicitly applied symmetrically to itspaired arc.On these

27、 nodes and arcs,reads are mapped as“paths”tra-versing the graph.Extracting the nucleotide sequence from apath is straightforward given the initial k-mer of the first nodeand the sequences of all the nodes in the path.ConstructionThe reads are first hashed according to a predefined k-mer length.This

28、variable k is limited on the upper side by the length of thereads being hashed,to allow for a small amount of overlap,usu-ally k=21 for 25-bp reads.Smaller k-mers increase the connec-tivity of the graph by simultaneously increasing the chance ofobserving an overlap between two reads and the number o

29、f am-biguous repeats in the graph.There is therefore a balance be-tween sensitivity and specificity determined by k(cf.Methods).For each k-mer observed in the set of reads,the hash tablerecords the ID of the first read encountered containing that k-merand the position of its occurrence within that r

30、ead.Each k-mer isrecorded simultaneously to its reverse complement.To ensurethat each k-mer cannot be its own reverse complement,k must beodd.This first scan allows us to rewrite each read as a set oforiginal k-mers combined with overlaps with previously hashedreads.We call this new representation o

31、f the reads sequence the“roadmap.”A second database is created with the opposite information.It records,for each read,which of its original k-mers are over-lapped by subsequent reads.The ordered set of original k-mers ofthat read is cut each time an overlap with another read begins orends.For each u

32、ninterrupted sequence of original k-mers,a nodeis created.Finally,reads are traced through the graph using the road-maps.Knowing the correspondence between original k-mers andthe newly created nodes,Velvet proceeds from one node to thenext,creating a new directed arc or incrementing an existing onea

33、s appropriate at each step.SimplificationAfter constructing the graph,it is generally possible to simplify itwithout any loss of information.Blocks are interrupted each timea read starts or ends.This leads to the formation of“chains”ofblocks,or linear connected subgraphs.This fragmentation of thegra

34、ph costs memory space and lengthens calculation times.These chains can be easily simplified.Whenever a node Ahas only one outgoing arc that points to another node B that hasonly one ingoing arc,the two nodes(and their twins)are merged.Iteratively,chains of blocks are collapsed into single blocks.The

35、 simplification of two nodes into one is analogous to theconventional concatenation of two character strings,and also tosome string graph based methods(Myers 2005).This straightfor-ward transformation involves transferring arc,read,and se-quence information as appropriate.Error removalErrors are cor

36、rected after graph creation to allow for simulta-neous operations over the whole set of reads.In our framework,errors can be due to both the sequencing process or to the bio-logical sample,for example,polymorphisms.Distinguishingpolymorphisms from errors is a post-assembly task.A naive ap-proach to

37、error removal would be to use the difference betweenFigure 1.Schematic representation of our implementation of the deBruijn graph.Each node,represented by a single rectangle,represents aseries of overlapping k-mers(in this case,k=5),listed directly above orbelow.(Red)The last nucleotide of each k-me

38、r.The sequence of thosefinal nucleotides,copied in large letters in the rectangle,is the sequenceof the node.The twin node,directly attached to the node,either below orabove,represents the reverse series of reverse complement k-mers.Arcsare represented as arrows between nodes.The last k-mer of an ar

39、cs originoverlaps with the first of its destination.Each arc has a symmetric arc.Note that the two nodes on the left could be merged into one withoutloss of information,because they form a chain.Zerbino and Birney822Genome Researchwww.genome.orgthe expected coverage of genuine sequences and that of

40、randomerrors.Therefore removing all the low coverage nodes(and theircorresponding arcs)would remove the errors.However,this relieson the differences being due to genuine errors and not to bio-logical variants present at a reasonable frequency in the sample,and errors being randomly distributed in th

41、e reads.Instead,Velvet focuses on topological features.Erroneousdata create three types of structures:“tips”due to errors at theedges of reads,“bulges”due to internal read errors or to nearbytips connecting,and erroneous connections due to cloning errorsor to distant merging tips.The three features

42、are removed con-secutively.Removing tipsA“tip”is a chain of nodes that is disconnected on one end.Removing these tips is a straightforward task.Discarding thisinformation results in only local effects,as no connectivity isdisrupted.Nonetheless,some restraint must be applied to theprocess to avoid er

43、oding genuine sequences that are merely in-terrupted by a coverage gap.To deal with this issue,we definetwo criteria:length and minority count.A tip will only be removed if it is shorter than 2k.Thisarbitrary cutoff length was chosen because it is greater than thelength in k-mers of an individual ve

44、ry short read.Erroneous con-structs involving entire short reads are presumably extremelyrare.In the case of long reads,this cutoff is set to be the maxi-mum length tip that can be created by two nearby mistakes.A tiplonger than 2k therefore represents either genuine sequence oran accumulation of er

45、rors that is hard to distinguish from novelsequence.In the latter case,clipping the read tips more strin-gently might be necessary.We define“minority count”as the property that,at thenode where the tip connects to the rest of the graph,the arcleading to that tip has a multiplicity inferior to at lea

46、st one of theother arcs radiating out of that junction node.In other words,starting from that node,going through the tip is an alternative toa more common path.This ensures that,at the local level,tips are removed inincreasing order of multiplicity.Velvet progressively uncoverschains of high coverag

47、e nodes that are not destroyed by virtue ofthe previous criteria,thus preserving the graph from completeerosion.Velvet iteratively removes tips from the graph under thesetwo criteria.When there are no more tips to remove,the graph issimplified once again.Removing bubbles with the Tour Bus algorithmW

48、e consider two paths redundant if they start and end at thesame nodes(forming a“bubble”)and contain similar sequences.Such bubbles can be created by errors or biological variants,suchas SNPs or cloning artifacts prior to sequencing.Erroneousbubbles are removed by an algorithm called“Tour Bus.”Thecri

49、teria for deciding whether two paths justify simplification canbe complex,taking into account error models of the sequence or(for the case of mixed haplotype samples)other features of thesequence and graph,such as coverage.Currently,we applysimple sequence identity and length thresholds.Detection of

50、 redundant paths is done through a Dijkstra-likebreadth-first search.The algorithm starts from an arbitrary nodeand progresses along the graph,visiting nodes in order of increas-ing distance from the origin.In this application,the distancebetween two consecutive nodes A and B is the length of s(B)di

展开阅读全文