[Biopython-dev] Bio.Sequencing

Cymon Cox cy at cymon.org
Tue Jun 30 09:02:04 UTC 2009


2009/6/30 Peter <biopython at maubp.freeserve.co.uk>

> On Mon, Jun 29, 2009 at 4:47 PM, Cymon Cox<cy at cymon.org> wrote:
> > Hi Peter,
> >
> > 2009/6/29 Peter
> >>
> >> Hi Cymon,
> >>
>
[...]

>
> Several of the Bio.SeqIO parsers already have optional arguments.
> I have sometimes wondered about letting the SeqIO functions take
> a **kwargs argument, and passing these arbitrary options to the
> underlying parser. This would allow for example passing wrap options
> to the FASTA writer, or skipping the features when parsing GenBank
> and EMBL. On the other hand, it gets very complicated, and detracts
> from the current simplicity of Bio.SeqIO (which I like).


It's a bit of a slippery slope, but it would have been nice to have a
"useDefaults" switch in the PhdWriter.
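
To make the **kwargs idea quoted above concrete, here is a minimal sketch in
plain Python. It is not how Bio.SeqIO is implemented: write_records,
fasta_writer, _WRITERS and the wrap option are all hypothetical names made up
for this illustration, and records are just (name, sequence) tuples.

```python
import io


def fasta_writer(records, handle, wrap=60):
    """Hypothetical FASTA writer; wrap=None (or 0) disables line wrapping."""
    count = 0
    for name, seq in records:
        handle.write(">%s\n" % name)
        # With wrapping disabled, emit the whole sequence on one line.
        width = wrap if wrap else max(len(seq), 1)
        for start in range(0, len(seq), width):
            handle.write(seq[start:start + width] + "\n")
        count += 1
    return count


# Hypothetical format registry, standing in for a per-format lookup.
_WRITERS = {"fasta": fasta_writer}


def write_records(records, handle, fmt, **kwargs):
    """Look up the writer for fmt and forward any extra options to it."""
    try:
        writer = _WRITERS[fmt]
    except KeyError:
        raise ValueError("Unknown format %r" % fmt)
    return writer(records, handle, **kwargs)


handle = io.StringIO()
n = write_records([("contig1", "ACGTACGTAC")], handle, "fasta", wrap=4)
```

Note the downside this exposes: an option that a given writer does not
understand only fails inside that writer (here as a TypeError), so the set of
valid options depends on the format. That is the sort of complication
mentioned above.
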


> > Anyway, I assume (haven't checked) that currently if all the
> > contigs are free of gaps then the SeqIO.AceIO will parse
> > them into an Ungapped alphabet which can then be written
> > to FASTA/QUAL etc. I think this is the right way to go, if
> > the contigs have gaps the user needs to decide how to deal
> > with them explicitly.
>
> Yes, if the Ace contig has no gaps, it will have a nice integer
> PHRED quality for each base, and could be saved as FASTQ
> or QUAL (or FASTA).
>
> The thing about "gaps" in contigs is that the consensus is
> really the ungapped sequence.


Yes, but... there is still some ambiguity over the consensus sequence which
is lost in the ungapped sequence. OK, so this isn't such a big deal with the
massive coverages achieved by 454 tech, but I can imagine cases of hybrid
Sanger/454 where this might be an issue (might be scraping the bottom of the
barrel a bit here...).

> I'd have to check but I think
> Newbler and CAP3 will output both FASTA and ACE files,
> and in the FASTA files there are no insertions/gaps in the
> contig sequences.


For comparison, Mira outputs ACE, plus X.gapped.fasta, and X.ungapped.fasta


> What I am thinking is Bio.SeqIO could return the ungapped
> consensus sequences as SeqRecord objects (which can then
> be saved as FASTA, FASTQ, QUAL) while Bio.AlignIO
> could return contig-alignment objects (with the gaps, like
> David's cookbook but in the long run with a contig class).


Yeah, I like this. Although, I'm not sure how intuitive it is that SeqIO
would necessarily return the ungapped rather than gapped sequences - but it
kinda makes sense...
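
For what it's worth, the gapped/ungapped relationship discussed above can be
sketched in plain Python. This is only an illustration with a hypothetical
helper name (ungap_consensus). It assumes the padded consensus marks gaps with
"*" and that quality scores are supplied only for the non-gap bases, which is
my understanding of ACE files but should be checked against real data.

```python
def ungap_consensus(padded, qualities, pad="*"):
    """Split a padded consensus into ungapped sequence and gapped qualities.

    padded    -- consensus string including pad characters (assumed "*")
    qualities -- one integer score per non-pad base (an assumption)

    Returns (ungapped_sequence, gapped_qualities), where gapped_qualities
    is aligned to the padded consensus and holds None at each pad.
    """
    bases = [base for base in padded if base != pad]
    if len(bases) != len(qualities):
        raise ValueError(
            "Expected %i quality scores, got %i" % (len(bases), len(qualities))
        )
    scores = iter(qualities)
    gapped_qualities = [None if base == pad else next(scores) for base in padded]
    return "".join(bases), gapped_qualities


seq, gapped_quals = ungap_consensus("AC*GT*A", [30, 40, 20, 50, 10])
```

The ungapped string plus the original scores is what could be written as
FASTQ or QUAL, while the gapped list shows why the padded form has no
"nice integer" quality for every column.
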

Cheers, C.