[Biopython-dev] [BioPython] annotations in an Alignment object

Mon Nov 10 11:42:31 UTC 2008

On Mon, Nov 10, 2008 at 12:28 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Mon, Nov 10, 2008 at 11:04 AM, Giovanni Marco Dall'Olio
> <dalloliogm at gmail.com> wrote:
>> Is there any way to store some annotations in an Alignment object?
>> For example: the alignment tool used, its parameters, its version, the
>> date, and the type of sequences aligned.
>
> Not officially, no.  This is on my mental list of things to do with
> the alignment object (after Biopython 1.49 is done).  I've CC'd the
> dev-mailing list which is probably a better place to discuss the
> details.
>
> If you look at Bio/AlignIO/StockholmIO.py or the
> Bio/AlignIO/FastaIO.py code you'll see I've recorded this kind of
> information in a private dictionary, i.e. alignment._annotations.
> This makes the data available if anyone really needs it, but signals
> that this is not part of the public API and is likely to change.
>
> As part of an alignment annotation enhancement, we should try and
> establish some agreed standards for naming annotation entries (and
> also counting systems).

OK, I will use the private dictionary in my own implementation.
Unfortunately, I don't have any useful suggestions on the naming standards.
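
Until annotations are part of the public API, one option that does not
depend on the private alignment._annotations dictionary is to keep the
metadata in a small wrapper of your own. A minimal sketch, assuming a
hypothetical AnnotatedAlignment class (not part of Biopython) and made-up
key names and values, since no naming standard has been agreed:

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class AnnotatedAlignment:
    """Hypothetical stand-in: aligned sequences plus free-form metadata."""

    # (name, sequence) tuples, all sequences of the same length
    records: list
    # Metadata about how the alignment was produced
    annotations: dict = field(default_factory=dict)


alignment = AnnotatedAlignment(
    records=[("SampleA", "TCCGC??RTT"), ("SampleB", "TACGC??GTA")],
)

# Key names and values are illustrative only, not an agreed convention.
alignment.annotations.update(
    tool="example-aligner",
    tool_version="0.0",
    parameters={"gap_open": 10},
    date=date(2008, 11, 10).isoformat(),
    sequence_type="DNA",
)
```

Keeping the metadata in a plain dictionary means it can later be copied
into whatever public attribute the Alignment object ends up providing.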

>> I am asking this because I would like to write a module to create
>> LDhat input files from the output of an alignment program.
>> An LDhat file (http://www.stats.ox.ac.uk/~mcvean/LDhat/instructions.html)
>> is very similar to a FASTA file; the only difference is that its
>> first line contains three numbers, one of which can't always be
>> inferred from the data.
>
> Why go to the trouble of making a new Bio.AlignIO module?  For this
> example from the LDhat manual, it looks like a FASTA file with an
> extra header:

Yes, of course :)
I am mostly playing with Biopython's code to understand it better.
Since I am going to use this function many times, I will have to write
a module for it anyway.
The first number in the LDhat file is the number of sequences, the
second is their length, and the third should usually be 1 for an
alignment object, I suppose.

>
> 4 10 1
> >SampleA
> TCCGC??RTT
> >SampleB
> TACGC??GTA
> >SampleC
> TC?-CTTGTA
> >SampleD
> TCC-CTTGTT
>
> Rather than writing support for a whole new file format, wouldn't it
> be easier to do something like this:
>
> alignment = ...
> number_a = 4
> number_b = 10
> number_c = 1
>
> handle = open("example.txt","w")
> handle.write("%i %i %i\n" % (number_a, number_b, number_c))
> handle.write(alignment.format("fasta"))
> handle.close()
>
> Peter
>
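
Wrapped up as a reusable function, the suggestion above might look like
the sketch below. It is plain Python with no Biopython dependency: the
function name format_ldhat and its flag parameter are hypothetical, the
records are given as (name, sequence) tuples, and the first two header
numbers are computed from the data while the third is supplied by the
caller because it cannot be inferred.

```python
def format_ldhat(records, flag=1):
    """Return LDhat-style text: a 'count length flag' header, then FASTA.

    records: list of (name, sequence) tuples, all sequences the same length.
    flag: third header number; not inferable from the data, so it is passed in.
    """
    if not records:
        raise ValueError("no sequences given")
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("sequences differ in length; not an alignment")

    # Header line: number of sequences, alignment length, caller-supplied flag
    lines = ["%i %i %i" % (len(records), lengths.pop(), flag)]
    for name, seq in records:
        lines.append(">" + name)
        lines.append(seq)
    return "\n".join(lines) + "\n"


# Sample data from the LDhat manual, as quoted above
sample = [
    ("SampleA", "TCCGC??RTT"),
    ("SampleB", "TACGC??GTA"),
    ("SampleC", "TC?-CTTGTA"),
    ("SampleD", "TCC-CTTGTT"),
]
text = format_ldhat(sample)
print(text, end="")
```

With a real Alignment object, the same header could be written first and
alignment.format("fasta") appended, as in the quoted snippet.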

-- 
-----------------------------------------------------------

My blog on bioinformatics (Italian): http://bioinfoblog.it