[Biopython-dev] Iterating over Ace contig files

Tue Jun 17 13:06:34 UTC 2008

Hi Peter,

That makes total sense to me. Feel free to make the changes as you see fit.

Frank

Peter wrote:
> Hello Frank,
>
> I wanted to get your opinion on iterating over the Ace file contig by
> contig, and what is lost in the WA, CT, RT and WR tags at the end of
> the file by doing this.  As large sequencing runs become more common,
> iterating over the file in a single pass WITHOUT keeping everything in
> memory does seem to be desirable.
>
> Similar past discussions:
> http://portal.open-bio.org/pipermail/biopython/2004-February/001825.html
> http://portal.open-bio.org/pipermail/biopython/2005-May/002661.html
>
> Would you object to me rewording your module's header comment to say
> that the Ace Iterator is NOT deprecated, but rather that it has
> certain drawbacks?
>
> [The context for this is my recent thread on the Biopython dev mailing
> list about integrating your Bio.Sequencing.Ace parser into Bio.SeqIO
> and/or Bio.AlignIO - I've included a little context below.]
>
> Thanks,
>
> Peter
>
> --
>
> Peter wrote:
>   
>>> So integrating the "ace" format into Bio.SeqIO representing the
>>> consensus sequence of each contig as a SeqRecord would be useful.
>>> Initially I would try and represent the aligned reads as SeqFeature
>>> objects (much like when reading a genome from a GenBank file you get
>>> CDS features with their amino acid translation).
>>>
>>> Note that for memory reasons, I would be inclined to scan over the Ace
>>> file in one pass (using the existing Iterator in the
>>> Bio.Sequencing.Ace parser) returning SeqRecords as we go.  As Frank
>>> points out in the code comments, this means we can't easily include
>>> the WA, CT, RT and WR tags found in the Ace file footer.  Do you use
>>> this information, Jose?
>>>       
>
> Jose replied,
>   
>> I haven't used the iterator because of the deprecation warning in the code. I
>> tried with about 40000 alignments and it worked on a computer with 8 GB of RAM.
>> If there are more sequences, and there will be with the 454 sequencer, we will
>> have trouble reading them all at once. I vote for the iterator approach. I have
>> not used the information in these tags, and I also don't know what they mean.
>> I've been looking for documentation about this format, but I've found none. Do
>> you have any good Ace documentation?
>>     
>
>
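
To make the trade-off discussed above concrete, here is a minimal sketch of
single-pass, contig-by-contig iteration in plain Python. It is not the
Bio.Sequencing.Ace implementation: the function name iter_contig_blocks and
the sample data are hypothetical, and the sample is a heavily simplified
stand-in for a real Ace file. The only assumption about the format is that
each contig begins with a line starting with "CO ".

```python
import io
from typing import Iterator, List, TextIO

# Hypothetical, heavily simplified Ace-like data (not a complete Ace file).
SAMPLE = """AS 2 3

CO Contig1 4 2 1 U
ACGT

CO Contig2 4 1 1 U
TTGA

CT{
Contig1 comment consed 1 4 080617:130634
}
"""


def iter_contig_blocks(handle: TextIO) -> Iterator[List[str]]:
    """Yield the lines of one contig at a time, holding only that contig in memory.

    Assumes every contig starts with a line beginning with "CO ".
    Anything before the first "CO " line (e.g. the "AS" header) is skipped.
    """
    block: List[str] = []
    for line in handle:
        if line.startswith("CO "):
            if block:
                yield block  # previous contig is complete
            block = [line.rstrip("\n")]
        elif block:
            block.append(line.rstrip("\n"))
    if block:
        yield block  # last contig, plus any trailing footer lines


blocks = list(iter_contig_blocks(io.StringIO(SAMPLE)))
names = [block[0].split()[1] for block in blocks]
print(names)
```

Note that in this sketch the trailing CT tag, which refers to Contig1, ends
up in the block for Contig2. By the time it is read, Contig1 has already been
yielded, which is the drawback of the one-pass approach for the footer tags.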