[Biopython-dev] SeqRecord and Alignment inconsistencies

Peter biopython-dev at maubp.freeserve.co.uk
Fri Feb 18 13:55:08 EST 2005


Matt Dimmic wrote:
> In the storm of fixes leading up to the next BioPython release, I
> have a basic design question. There seems to be some inconsistency
> when it comes to the use of the name, id, and description fields of
> the FASTA format in the Alignment class.

You are probably right, and not just about the Alignment class. However,
I suspect it's too late to change this for the new release: any
last-minute change like this would cause lots of backwards-compatibility
problems for existing code using BioPython.

Maybe for a "big version jump" like a hypothetical BioPython 2.0 this
would make more sense.

> The FASTA format itself is ambiguous, but in general I expect the
> title of the sequence to be a variant of one of these:
> 
> id|description
> id description
> id
>

Your first example with a "|" is not a good idea - the "important" bit
is up to the first space.  This first word is often subdivided with the
"|" (pipe or bar) character.

For some examples, read the "FASTA Defline Format" section of this:

ftp://ftp.ncbi.nih.gov/blast/documents/formatdb.txt
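
To make the "first word, subdivided with |" point concrete, here is a
rough sketch in plain Python. It makes no Biopython calls; the function
name split_first_word and the example identifier are made up for
illustration and are not taken from any real database entry.

```python
def split_first_word(title):
    """Return the "|"-separated fields of the first word of a FASTA title.

    Hypothetical helper, not part of Biopython. The first word is taken
    to be everything up to the first whitespace, as described above.
    """
    words = title.lstrip(">").split(None, 1)
    if not words:
        # Empty title line: there is no first word to subdivide.
        return []
    # Empty fields are kept (a trailing "|" gives a trailing ""), since
    # an empty field may be meaningful to whoever wrote the title.
    return words[0].split("|")


# Made-up example title, for illustration only.
fields = split_first_word(">gi|12345|gb|AAA00001.1| some description text")
print(fields)
```

Note that the description after the first space is ignored here; only
the first word is split.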

> The important thing is that the description is optional, and the 
> SeqRecord() object can be instantiated with id and/or description
> (and also a 'name', which makes things even more confusing!).

For yet more terminology, the FASTA record parser uses the terms "title"
and "sequence", where the first word of the title is the important part
(which I personally call the name or id) and the rest is optional.

i.e. title = id + optional description?
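
As a sketch of that reading (title = id + optional description), something
like the following would do the split. This is plain Python under my own
assumptions, not what the existing parser does: the function name
split_title is hypothetical, and I am assuming the id ends at the first
whitespace character.

```python
def split_title(title):
    """Split a FASTA title into (id, description).

    Hypothetical helper, not part of Biopython. The id is assumed to be
    everything up to the first whitespace; the description is whatever
    follows, or an empty string if there is nothing after the id.
    """
    # Drop the leading ">" if the caller passed the raw line.
    title = title.lstrip(">").strip()
    parts = title.split(None, 1)
    if not parts:
        return "", ""
    seq_id = parts[0]
    description = parts[1] if len(parts) > 1 else ""
    return seq_id, description


# Made-up titles covering the "id description" and "id" cases.
print(split_title(">seq1 an example description"))
print(split_title(">seq1"))
```

With this reading an "id|description" title would come back as a single
id with an empty description, which is the ambiguity under discussion.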

On the other hand, I'm sure I have seen the term "description" applied
to the whole ">" line of a FASTA file...

Peter


