[Biopython-dev] Start/end co-ordinates in SeqRecord objects?

Mon May 21 16:37:21 UTC 2012

Hello all,

This is something I talked to Bow a little about during our last
weekly meeting for his GSoC meeting, but it is broader than
just SearchIO...

When describing  BLAST results, or FASTA alignments, or
indeed many other local alignments you typically have a
(gapped) query sequence and match sequence fragment,
and the co-ordinates describing which part of the full query
and matched sequence this is. i.e. You are told the start
and end of the subsequence (and perhaps strand).

The same essentially applies to some multiple alignment
formats in AlignIO as well, including Stockholm/PFAM
(where this is encoded into the record name as identifier
slash start-end), FASTA output (which will be handled via
SearchIO in future) and MAF.

http://biopython.org/wiki/Multiple_Alignment_Format

Indeed thinking about how best to handle this was the
main reason I haven't merged Andrew's MAF branch yet.

(There are subtleties, for instance how is the strand given
in the file, do you get start+end explicitly or must the end
be inferred from the start and the sequence, etc).

Currently recording these in the SeqRecord's annotation
dictionary 'works', but does not exploit the structure. In
particular, if the SeqRecord is sliced to get a fragment
of the alignment, this co-ordinate information is lost. It
would be nice if this preserved the start/end/strand and
updated it accordingly.

One idea for doing this is to introduce a new location
property to the SeqRecord (defaulting to None), which
would be a FeatureLocation object normally used for
SeqFeature objects.

If an operation couldn't preserve or update the location,
it would become None. Note that slicing we will generally
need to know the gap characters of the sequence (in
order to recalculate the sub-sequence's start/end), which
for the parsers may mean some minor updates to ensure
the default alphabet specifies the '-' gap character.

On the order hand, perhaps this 'location' property idea
is overly complicated?

Maybe all we need is a common convention about which
keys to use in the annotation dictionary, and how to store
the information (e.g. Python counting, start < end, and
strand as +1 or -1 if present)?

Thoughts or feedback please? Would a worked example
help with my explanation?

Thanks,

Peter