[Biopython-dev] Bio.Sequencing

Tue Jun 30 08:18:44 UTC 2009

On Mon, Jun 29, 2009 at 4:47 PM, Cymon Cox<cy at cymon.org> wrote:
> Hi Peter,
>
> 2009/6/29 Peter
>>
>> Hi Cymon,
>>
>> I've checked in some of your patch on Bug 2865 already,
>> recording the per-letter-annotation which I was planning to
>> do but hadn't got round to yet - thank you:
>> http://bugzilla.open-bio.org/show_bug.cgi?id=2865
>>
>> This means with the latest code you can now use Biopython
>> to convert a PHD output file into a FASTQ file (or a QUAL
>> file) which could be handy for doing meta assemblies.
>
> Yeah, that's nice. Conversely, the reason I wrote the Phd writer is that I
> want to 'fake' some Phd files from FASTA and QUAL files - should/might be
> possible by using the default headers and equally spaced peak locations. The
> use-case is to fool Consed into displaying the trace (which it 'fakes') from
> a 454 Mira assembly ACE file output, but which it will only do if the Phd
> files are available. So I'm hoping to write the Phd files from the original
> FASTA/QUAL input files. Not sure if this is going to work, or if its a
> sensible thing to be trying...

That sounds reasonable - as long as you know you are faking it ;)

>> I did relatively recently update SeqIO for the Ace format to
>> record the qualities - but there is an issue here. Only the
>> nucleotides get given quality scores, but not the insertions
>> (gaps, shown as "*" in the Ace file consensus sequence).
>> Currently the Bio.SeqIO parser gives the gapped sequence.
>> This means to record the quality scores, we need to give
>> some null value to the gap characters (and I used None).
>>
>> What I am wondering about is making the Bio.SeqIO Ace
>> parser just return the ungapped sequence (and the
>> associated PHRED quality scores). This means we could
>> then convert Ace files into FASTQ or QUAL files, and also
>> a simple Ace to FASTA conversion would give something
>> useful for downstream analysis (the ungapped consensus).
>>
>> The gaps *are* important if you want to see how the
>> consensus was built up - in which case it makes sense to
>> think about each Ace contig as a kind of multiple sequence
>> alignment. See this earlier discussion with David Winter:
>> http://lists.open-bio.org/pipermail/biopython/2009-April/005125.html
>> http://lists.open-bio.org/pipermail/biopython/2009-April/005128.html
>>
>> Any thoughts?
>
> I think it's probably unwise to return an ungapped sequence/qual by default
> if the contig in the ACE assembly is gapped. It would be nice if the parser
> had a switch ungapped=True, but thats not going to work with the SeqIO
> interface.

We can certainly add a ungapped optional argument to the parser
in Bio.SeqIO.AceIO - that would be a small improvement, meaning
the functionality would be there if you needed it (all be it a bit hidden).

Several of the Bio.SeqIO parsers already have optional arguments.
I have sometimes wondered about letting the SeqIO functions take
a **kwargs argument, and passing these arbitrary options to the
underlying parser. This would allow for example passing wrap options
to the FASTA writer, or skiping the features when parsing GenBank
and EBML. On the other hand, it gets very complicated, and detracts
from the current simplicity of Bio.SeqIO (which I like).

> Second best option would be to have an easy way of getting the
> ungapped SeqRecord from the gapped SeqRecord - a function
> somewhere in Bio.Sequencing?

I've already suggested some kind of "ungapped" method for Seq
objects, and yes, having this at the SeqRecord level too would
solve this particular use case. Removing the per-letter-annotations
associated with the gaps would be straight forward. I'm not sure
what we would want to do with any features in the SeqRecord
(perhaps a corner case), but most likely any SeqFeature covering
a region containing a gap would be lost.

> Anyway, I assume (havent checked) that currently if all the
> contigs are free of gaps then the SeqIO.AceIO will parse
> them into an Ungapped alphabet which can then be written
> to FASTA/QUAL etc. I think this is the right way to go, if
> the contigs have gaps the user needs to decide how to deal
> with them explicitly.

Yes, if the Ace contig has no gaps, it will have a nice integer
PHRED quality for each base, and could be saved as FASTQ
or QUAL (or FASTA).

The thing about "gaps" in contigs is that the consensus is
really the ungapped sequence. I'd have to check but I think
Newbler and CAP3 will output both FASTA and ACE files,
and in the FASTA files there are no insertions/gaps in the
contig sequences.

What I am thinking is Bio.SeqIO could return the ungapped
consensus sequences as SeqRecord objects (which can then
be saved as FASTA, FASTQ, QUAL) while Bio.AlignIO
could return contig-alignment objects (with the gaps, like
David's cookbook but in the long run with a contig class).
This has some merit, but breaks my current convention that
parsing an alignment file with SeqIO works by giving each
gapped sequence in each alignment in turn.

Peter