[Biopython-dev] Fwd: ACE files at Biopython-dev

Gordon Robertson agrobertson at telus.net
Wed Jul 1 00:27:21 UTC 2009


I flagged the current ACE discussion to the Consed author, David  
Gordon, and am forwarding his response.

G

Begin forwarded message:

> From: David Gordon <dgordon at u.washington.edu>
> Date: June 30, 2009 9:38:10 AM PDT
> To: Gordon Robertson <agrobertson at telus.net>
> Cc: Yaron Butterfield <ybutterf at bcgsc.ca>, David Gordon <dgordon at u.washington.edu 
> >
> Subject: Re: ACE files at Biopython-dev
>
> Hi, Gordon,
>
> Could you add a comment from me to this thread, please?  Here is the
> text to add:
>
>
> I am the author of Consed and briefly read this thread.
>
> I have one suggestion on this tool: that it create phd ball files
> instead of phd files, particularly if the number of reads is more than
> a few hundred.  phd files are a leftover from the days of sequencing
> when there were only a few thousand reads at most.  The linux
> operating system and software cannot handle millions of phd files in
> the same directory, so Consed now typically uses a small number of phd
> balls.  Here is an example of a phd ball file that contains 2 reads
> (typically a phd ball file will contain up to a million--more than
> that becomes difficult to copy).  Notice that there is a comment at
> the beginning starting with "#" at the beginning of the line.  Also
> notice that the BEGIN_SEQUENCE line is slightly different due to the
> "1" at the end--this is the version, which corresponds to the
> extension on the end of a phd file name such as
> HWI-EAS94_4_1_1_537_446.phd.1
>
> Notice also that peak positions (which normally form a 3rd column
> after the quality) are now optional, which helps keep the file size
> down.  For reads that you want to see the traces, you will need to
> have peak positions.  A 454 example follows:
>
>
>
> # solexa file ../solexa_dir/solexa_reads.fastq (beginning)
>
> BEGIN_SEQUENCE HWI-EAS94_4_1_1_537_446 1
> BEGIN_COMMENT
> TIME: Wed Dec 24 11:21:50 2008
> CHEM: solexa
> END_COMMENT
> BEGIN_DNA
> g 30
> c 30
> c 30
> a 30
> a 30
> t 30
> c 30
> a 30
> g 30
> g 30
> t 30
> t 30
> t 30
> c 30
> t 30
> c 30
> t 30
> g 30
> c 30
> a 30
> a 28
> g 23
> c 30
> c 30
> c 30
> c 30
> t 30
> t 30
> t 28
> a 22
> g 8
> c 22
> a 7
> g 15
> c 15
> t 15
> g 10
> a 10
> g 11
> c 15
> END_DNA
> END_SEQUENCE
>
> BEGIN_SEQUENCE HWI-EAS94_4_1_1_602_99 1
> BEGIN_COMMENT
> TIME: Wed Dec 24 11:21:50 2008
> CHEM: solexa
> END_COMMENT
> BEGIN_DNA
> g 30
> c 30
> c 30
> a 30
> t 30
> g 30
> g 30
> c 30
> a 30
> c 30
> a 30
> t 30
> a 30
> t 30
> a 30
> t 30
> g 30
> a 30
> a 30
> g 30
> g 30
> t 30
> c 30
> a 30
> g 30
> a 30
> g 16
> g 30
> a 28
> c 22
> a 22
> a 22
> c 14
> t 15
> t 15
> g 5
> c 10
> t 15
> g 10
> t 5
> END_DNA
> END_SEQUENCE
>
>
> phd ball files for 454 reads (in which traces are displayed) have more
> information.  Here is an example:
>
> BEGIN_SEQUENCE EBE03TV04IHLTF.77-243 1
>
> BEGIN_COMMENT
>
> CHROMAT_FILE: sff:reads.sff:EBE03TV04IHLTF
> QUALITY_LEVELS: 99
> TIME: Thu Jul 27 12:33:48 2000
> TRACE_ARRAY_MIN_INDEX: 0
> TRACE_ARRAY_MAX_INDEX: 4723
> CHEM: 454
>
> END_COMMENT
>
> BEGIN_DNA
> g 37 91
> g 37 110
> g 37 129
> g 37 148
> a 37 167
> t 37 186
> g 37 205
> a 37 224
> a 37 243
> a 37 262
> g 37 281
> g 37 300
> g 37 319
> .
> .
> .
> a 26 4385
> t 26 4404
> c 26 4423
> t 30 4442
> c 33 4461
> g 33 4480
> g 33 4499
> t 33 4518
> g 33 4537
> g 36 4556
> t 36 4575
> a 33 4594
> g 33 4613
> g 33 4632
> t 36 4651
> g 26 4670
> a 22 4689
> END_DNA
>
> END_SEQUENCE
>
> (more BEGIN_SEQUENCE/END_SEQUENCE blocks to follow)
>
>
> The line:
> CHROMAT_FILE: sff:reads.sff:EBE03TV04IHLTF
> indicates both the sff file that the read came from as well as the
> read name.
>
> When creating ace files, BS lines are now optional.  BS lines really
> only make sense when the assembly is phrap
>
>
>
> David Gordon
>
>
>
> On Tue, 30 Jun 2009, Gordon Robertson wrote:
>
>> David
>>
>> I thought I should flag with you that code for ACE files are being  
>> discussed now in BioPython.
>>
>> G
>>
>> Begin forwarded message:
>>
>>> From: biopython-dev-request at lists.open-bio.org
>>> Date: June 30, 2009 1:39:04 AM PDT
>>> To: biopython-dev at lists.open-bio.org
>>> Subject: Biopython-dev Digest, Vol 77, Issue 30
>>> Reply-To: biopython-dev at lists.open-bio.org
>>> Send Biopython-dev mailing list submissions to
>>> 	biopython-dev at lists.open-bio.org
>>> To subscribe or unsubscribe via the World Wide Web, visit
>>> 	http://lists.open-bio.org/mailman/listinfo/biopython-dev
>>> or, via email, send a message with subject or body 'help' to
>>> 	biopython-dev-request at lists.open-bio.org
>>> You can reach the person managing the list at
>>> 	biopython-dev-owner at lists.open-bio.org
>>> When replying, please edit your Subject line so it is more specific
>>> than "Re: Contents of Biopython-dev digest..."
>>> Today's Topics:
>>>
>>> 1. GSoC Weekly Update 6: PhyloXML for Biopython (Eric Talevich)
>>> 2. Re: GSoC Weekly Update 6: PhyloXML for Biopython (Peter)
>>> 3. Re: [Biopython] Bio.Sequencing.Ace (Peter Cock)
>>> 4. Re: Bio.Sequencing (Peter)
>>> 5. Re: [Biopython] Bio.Sequencing.Ace (Jose Blanca)
>>> 6. Re: GSoC Weekly Update 6: PhyloXML for Biopython
>>>    (Bartek Wilczynski)
>>> ----------------------------------------------------------------------
>>> <snip>
>>> ------------------------------
>>> Message: 3
>>> Date: Tue, 30 Jun 2009 09:01:28 +0100
>>> From: Peter Cock <p.j.a.cock at googlemail.com>
>>> Subject: Re: [Biopython-dev] [Biopython] Bio.Sequencing.Ace
>>> To: Jose Blanca <jblanca at btc.upv.es>
>>> Cc: biopython-dev at lists.open-bio.org
>>> Message-ID:
>>> 	<320fb6e00906300101r3e3faa37l6a47295bd5e12538 at mail.gmail.com>
>>> Content-Type: text/plain; charset=ISO-8859-1
>>> On Mon, Jun 29, 2009 at 4:16 PM, Jose Blanca<jblanca at btc.upv.es>  
>>> wrote:
>>>>> Are you using Bio.Sequencing.Ace in your code, or did you write  
>>>>> a whole
>>>>> new parser instead?
>>>> I wrote one, because I wanted to be able to get one particular  
>>>> contig or just
>>>> the contig or the read names. But I don't think that is a  
>>>> problem. I gues
>>>> that the biopyhon parser could be easily adapted to that.
>>> I see. This touches on the indexing discussion - the same idea on
>>> this thread would probably work on Ace files too:
>>> http://lists.open-bio.org/pipermail/biopython/2009-June/005275.html
>>>>> Now that I have been using Ace files in my own work, I've been  
>>>>> meaning
>>>>> to look over your stuff. In some ways, a contig class can be  
>>>>> seen as a
>>>>> generalisation of a multiple sequence alignment class. Certainly  
>>>>> this is
>>>>> something we should improve in Biopython (as you might gather from
>>>>> some of the enhancement bugs on bugzilla, I have lots of ideas  
>>>>> for the
>>>>> current alignment class), and I'm sure you have some great ideas  
>>>>> too.
>>>> I think that here is the main deviation from Biopython. The  
>>>> contig class is
>>>> similar to an alignment class, in fact my contig classes shoud be  
>>>> compatible
>>>> with your new alignment proporsal api.
>>> That's good. I agree that a specialised contig class that works like
>>> the traditional multiple sequence alignment class would be nice.
>>> It would then make sense to have Bio.AlignIO handle contigs as
>>> well as traditional multiple sequence alignments.
>>>> alignment.
>>>> seq1 +++++++++>
>>>> seq2 +++++++++>
>>>> seq3 +++++++++>
>>>> contig
>>>> seq1 ++++>
>>>> seq2 ? ?+++++>
>>>> seq3 ? ? ? ?++++++>
>>>> Basically every read has a different coordinate system in the  
>>>> contig case.
>>>> What I've done is to create a class named LocatableSequence that  
>>>> is a
>>>> container for sequence objects. It works like:
>>>>>>> seq1 = 'ATCG'
>>>>>>> locseq1 = locate_sequence(seq1, location=10)
>>>>>>> locseq1[10] == A
>>>> In that way the contig is a list of LocatableSequences and the  
>>>> coordinate
>>>> system transformations are done by the LocatableSequences, not by  
>>>> the contig.
>>>> The LocatableSequences also allow for masks.
>>>> The LocatableSequence works with any sequence like objects, strs,  
>>>> Seq,
>>>> SeqRecord, lists, etc.
>>>> There's also a Location class that represents a fragment of a  
>>>> sequence. My
>>>> Location class is more limited than the one in the Biopython  
>>>> SeqFeature. In
>>>> my case the start and end should be integers. I use this class to  
>>>> represent
>>>> the region not masked in the sequence and the Location of the  
>>>> sequence inside
>>>> the LocatableSequence.
>>>> Take a look at Contig.py and at LocatableSequence.py, these are  
>>>> the most
>>>> relevant classes for this.
>>>> Best regards,
>>> I'll have to make some time for looking at your code.
>>> What I was thinking of was a contig class as an alignment subclass,
>>> holding a list of SeqRecord objects and offsets. The consensus might
>>> just be one element of this list - but could be handled specially.  
>>> This
>>> sounds simpler than having to introduce a whole new object system,
>>> related to but different to SeqFeature objects. However, I don't yet
>>> have a sample implementation to demonstrate this.
>>> One important thing I think we should do BEFORE adding any contig
>>> class to Biopython, is get it working with at least one other  
>>> contig file
>>> format in addition to Ace. I don't want to end up with a class which
>>> is too specialised for how ace contigs work.
>>> Peter
>>> ------------------------------
>>> Message: 4
>>> Date: Tue, 30 Jun 2009 09:18:44 +0100
>>> From: Peter <biopython at maubp.freeserve.co.uk>
>>> Subject: Re: [Biopython-dev] Bio.Sequencing
>>> To: Cymon Cox <cy at cymon.org>
>>> Cc: BioPython-Dev Mailing List <biopython-dev at lists.open-bio.org>
>>> Message-ID:
>>> 	<320fb6e00906300118l78ca2a98kc25278e24ad433a1 at mail.gmail.com>
>>> Content-Type: text/plain; charset=ISO-8859-1
>>> On Mon, Jun 29, 2009 at 4:47 PM, Cymon Cox<cy at cymon.org> wrote:
>>>> Hi Peter,
>>>> 2009/6/29 Peter
>>>>> Hi Cymon,
>>>>> I've checked in some of your patch on Bug 2865 already,
>>>>> recording the per-letter-annotation which I was planning to
>>>>> do but hadn't got round to yet - thank you:
>>>>> http://bugzilla.open-bio.org/show_bug.cgi?id=2865
>>>>> This means with the latest code you can now use Biopython
>>>>> to convert a PHD output file into a FASTQ file (or a QUAL
>>>>> file) which could be handy for doing meta assemblies.
>>>> Yeah, that's nice. Conversely, the reason I wrote the Phd writer  
>>>> is that I
>>>> want to 'fake' some Phd files from FASTA and QUAL files - should/ 
>>>> might be
>>>> possible by using the default headers and equally spaced peak  
>>>> locations. The
>>>> use-case is to fool Consed into displaying the trace (which it  
>>>> 'fakes') from
>>>> a 454 Mira assembly ACE file output, but which it will only do if  
>>>> the Phd
>>>> files are available. So I'm hoping to write the Phd files from  
>>>> the original
>>>> FASTA/QUAL input files. Not sure if this is going to work, or if  
>>>> its a
>>>> sensible thing to be trying...
>>> That sounds reasonable - as long as you know you are faking it ;)
>>>>> I did relatively recently update SeqIO for the Ace format to
>>>>> record the qualities - but there is an issue here. Only the
>>>>> nucleotides get given quality scores, but not the insertions
>>>>> (gaps, shown as "*" in the Ace file consensus sequence).
>>>>> Currently the Bio.SeqIO parser gives the gapped sequence.
>>>>> This means to record the quality scores, we need to give
>>>>> some null value to the gap characters (and I used None).
>>>>> What I am wondering about is making the Bio.SeqIO Ace
>>>>> parser just return the ungapped sequence (and the
>>>>> associated PHRED quality scores). This means we could
>>>>> then convert Ace files into FASTQ or QUAL files, and also
>>>>> a simple Ace to FASTA conversion would give something
>>>>> useful for downstream analysis (the ungapped consensus).
>>>>> The gaps *are* important if you want to see how the
>>>>> consensus was built up - in which case it makes sense to
>>>>> think about each Ace contig as a kind of multiple sequence
>>>>> alignment. See this earlier discussion with David Winter:
>>>>> http://lists.open-bio.org/pipermail/biopython/2009-April/005125.html
>>>>> http://lists.open-bio.org/pipermail/biopython/2009-April/005128.html
>>>>> Any thoughts?
>>>> I think it's probably unwise to return an ungapped sequence/qual  
>>>> by default
>>>> if the contig in the ACE assembly is gapped. It would be nice if  
>>>> the parser
>>>> had a switch ungapped=True, but thats not going to work with the  
>>>> SeqIO
>>>> interface.
>>> We can certainly add a ungapped optional argument to the parser
>>> in Bio.SeqIO.AceIO - that would be a small improvement, meaning
>>> the functionality would be there if you needed it (all be it a bit  
>>> hidden).
>>> Several of the Bio.SeqIO parsers already have optional arguments.
>>> I have sometimes wondered about letting the SeqIO functions take
>>> a **kwargs argument, and passing these arbitrary options to the
>>> underlying parser. This would allow for example passing wrap options
>>> to the FASTA writer, or skiping the features when parsing GenBank
>>> and EBML. On the other hand, it gets very complicated, and detracts
>>> from the current simplicity of Bio.SeqIO (which I like).
>>>> Second best option would be to have an easy way of getting the
>>>> ungapped SeqRecord from the gapped SeqRecord - a function
>>>> somewhere in Bio.Sequencing?
>>> I've already suggested some kind of "ungapped" method for Seq
>>> objects, and yes, having this at the SeqRecord level too would
>>> solve this particular use case. Removing the per-letter-annotations
>>> associated with the gaps would be straight forward. I'm not sure
>>> what we would want to do with any features in the SeqRecord
>>> (perhaps a corner case), but most likely any SeqFeature covering
>>> a region containing a gap would be lost.
>>>> Anyway, I assume (havent checked) that currently if all the
>>>> contigs are free of gaps then the SeqIO.AceIO will parse
>>>> them into an Ungapped alphabet which can then be written
>>>> to FASTA/QUAL etc. I think this is the right way to go, if
>>>> the contigs have gaps the user needs to decide how to deal
>>>> with them explicitly.
>>> Yes, if the Ace contig has no gaps, it will have a nice integer
>>> PHRED quality for each base, and could be saved as FASTQ
>>> or QUAL (or FASTA).
>>> The thing about "gaps" in contigs is that the consensus is
>>> really the ungapped sequence. I'd have to check but I think
>>> Newbler and CAP3 will output both FASTA and ACE files,
>>> and in the FASTA files there are no insertions/gaps in the
>>> contig sequences.
>>> What I am thinking is Bio.SeqIO could return the ungapped
>>> consensus sequences as SeqRecord objects (which can then
>>> be saved as FASTA, FASTQ, QUAL) while Bio.AlignIO
>>> could return contig-alignment objects (with the gaps, like
>>> David's cookbook but in the long run with a contig class).
>>> This has some merit, but breaks my current convention that
>>> parsing an alignment file with SeqIO works by giving each
>>> gapped sequence in each alignment in turn.
>>> Peter
>>> ------------------------------
>>> Message: 5
>>> Date: Tue, 30 Jun 2009 10:31:06 +0200
>>> From: Jose Blanca <jblanca at btc.upv.es>
>>> Subject: Re: [Biopython-dev] [Biopython] Bio.Sequencing.Ace
>>> To: biopython-dev at lists.open-bio.org
>>> Message-ID: <200906301031.06273.jblanca at btc.upv.es>
>>> Content-Type: text/plain;  charset="iso-8859-1"
>>>> What I was thinking of was a contig class as an alignment subclass,
>>>> holding a list of SeqRecord objects and offsets. The consensus  
>>>> might
>>>> just be one element of this list - but could be handled  
>>>> specially. This
>>>> sounds simpler than having to introduce a whole new object system,
>>>> related to but different to SeqFeature objects. However, I don't  
>>>> yet
>>>> have a sample implementation to demonstrate this.
>>> I thought about that implementation and I created some code. The  
>>> problem I
>>> found with that approach is that the contig class code got too  
>>> messy.  Take
>>> into account that besides the offset you also need the masks and  
>>> that some
>>> sequences could be reversed. That's why I decided to split the  
>>> part that
>>> calculates the offset and the mask into a separate class.
>>>> One important thing I think we should do BEFORE adding any contig
>>>> class to Biopython, is get it working with at least one other  
>>>> contig file
>>>> format in addition to Ace. I don't want to end up with a class  
>>>> which
>>>> is too specialised for how ace contigs work.
>>>> Peter
>>> Well, In fact my contig class is modeled after the caf file  
>>> format. The ace
>>> parsing was just an afterthought, my primary interest was the caf  
>>> format.
>>> -- 
>>> Jose M. Blanca Postigo
>>> Instituto Universitario de Conservacion y
>>> Mejora de la Agrodiversidad Valenciana (COMAV)
>>> Universidad Politecnica de Valencia (UPV)
>>> Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
>>> 46022 Valencia (SPAIN)
>>> Tlf.:+34-96-3877000 (ext 88473)
>>> ------------------------------
>>> <snip>
>>> _______________________________________________
>>> Biopython-dev mailing list
>>> Biopython-dev at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>>> End of Biopython-dev Digest, Vol 77, Issue 30
>>> *********************************************
>>
>> --
>> Gordon Robertson
>> Canada's Michael Smith Genome Sciences Centre
>> Vancouver BC Canada
>>
>>
>>
>

--
Gordon Robertson
Canada's Michael Smith Genome Sciences Centre
Vancouver BC Canada





More information about the Biopython-dev mailing list