[Biopython] SeqIO feature.location.start and end for genes spanning origin

Thu May 8 23:20:22 UTC 2014

On Thu, May 8, 2014 at 6:47 PM, Richard Llewellyn wrote:

>> > What numbers are you hoping to get out of this location?

>

> Great question.  I can see that having 0,end is useful as a flag for
> origin

> spanning.  However, it is also the least informative, as neither 0 or the

> end are actual locations of the gene starting/ending.

>

Your view "least informative" is subjective. They are end points of

the "fake" exons used to describe the feature, and more importantly

represent the min/max of the region spanned (very useful for drawing

features or considering intersections etc).

Ok.  But I do pause on the statement that these 'represent the min/max of
the region spanned,' as from my perspective I don't think of a gene as
spanning the entire chromosome.  Regardless, if I check for this special
case, no big deal for me.  I realize you have many other constraints I
haven't needed to worry about.

> My code would have

> expected the start and end to be the sequence locations (so start >>
> end),

> and it would have marked this as a special case of origin spanning.  But
> it

> does require special handling.  I currently use negative numbers for the

> start in this situation, though this has its own problems.

> You mean the biological start and end?

Well, I didn't expect the biological start and end, but I did expect that
the start and end would represent the start and end of the genes.

You didn't answer my

question

I sensed quicksand ;-)

but I am guessing you wanted start 879 (adjusted for

Python counting?) and end 490883 given this location string:

complement(join(490883..490885,1..879))

 I would have been fine with start 490883, and end 879.  But I'm not
pushing for such.  I do think most users would naively assume that the set
of start and end of a gene feature would contain the start and end,
regardless of order or strand.  At least I did.

> That would be possible for any stranded feature, although it

is less well defined for strand-less features (which in GenBank

and EMBL are by default forward strand).

Anyway, using NC_005213 as the example:

ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Nanoarchaeum_equitans_Kin4_M_uid58009/NC_005213.gbk

Sample code:

  >>> from Bio import SeqIO

  >>> record = SeqIO.read("NC_005213.gbk", "genbank")

  >>> feature = record.features[4] # the CDS record spanning origin

  >>> print(feature.location)

  join{[0:879](-), [490882:490885](-)}

  >>> print(feature.location.start)

  0

  >>> print(feature.location.end)

  490885

Perhaps there is a case for "biological" start and end properties

(calculated from the current data structure on demand)? i.e.

something like this:

  >>> feature.location.parts[0].end if

feature.location.parts[0].strand == -1 else

feature.location.parts[0].start

  ExactPosition(879)

  >>> feature.location.parts[-1].start if

feature.location.parts[-1].strand == -1 else

feature.location.parts[-1].end

  ExactPosition(490882)

Note as a convenience, even basic non-compound locations have

a parts property - returning a list of one entry, themselves. So that

code should work in general :)

Maybe so.  I use left and right for the biopython start and end locations
myself, to distinguish between biological starts and ends, which I
calculate with something like your logic, unless around origin.

I appreciate that len(feature) is correct, even for around origin.

The potential enhancement would be to define these are new

properties, feature.location.bio_start and feature.location.bio_end

or similar naming? Would that be useful? What would name them?

I do rather like that, especially for cases around origin.  bio_start/end
or transcription_start/end -- nah too long.  I like bio_.  If nothing else,
would serve as a reminder that start and end are not necessarily the
transcribed start and end.

> > In the code I'm currently debugging I treat DNA molecules as partial
> order

> graphs in order to deal with overlapping or mutually-exclusive genes [or

> gene predictions ;-)  ].  The graphs are cyclical for circular molecules.

> The graphs are responsible for generating distances between genes, so each

> node in the graph (a gene) contains the entire molecule length if
> circular.

> The definition of 'distance' here is also an issue -- typically I want the

> shortest intergenic distance.

Have you found any mixed-strand trans-spliced genes in your

dataset yet? They are really special cases which cause pain!

Ugh no!

Thanks again,

Richard

On Thu, May 8, 2014 at 4:14 PM, Peter Cock <p.j.a.cock at googlemail.com>wrote:

> Do you mind sending this again but including the list?  Thanks!
>
> Peter
>
> On Thu, May 8, 2014 at 8:55 PM, Richard Llewellyn <llewelr at gmail.com>
> wrote:
> >> On Thu, May 8, 2014 at 6:47 PM, Richard Llewellyn wrote:
> >>
> >> >> > What numbers are you hoping to get out of this location?
> >>
> >> >
> >>
> >> > Great question.  I can see that having 0,end is useful as a flag for
> >> > origin
> >>
> >> > spanning.  However, it is also the least informative, as neither 0 or
> >> > the
> >>
> >> > end are actual locations of the gene starting/ending.
> >
> >
> >>
> >>
> >>
> >> Your view "least informative" is subjective. They are end points of
> >>
> >> the "fake" exons used to describe the feature, and more importantly
> >>
> >> represent the min/max of the region spanned (very useful for drawing
> >>
> >> features or considering intersections etc).
> >
> >
> > Ok.  But I do pause on the statement that these 'represent the min/max of
> > the region spanned,' as from my perspective I don't think of a gene as
> > spanning the entire chromosome.  Regardless, if I check for this special
> > case, no big deal for me.  I realize you have many other constraints I
> > haven't needed to worry about.
> >
> >
> >>
> >> > My code would have
> >>
> >> > expected the start and end to be the sequence locations (so start >>
> >> > end),
> >>
> >> > and it would have marked this as a special case of origin spanning.
>  But
> >> > it
> >>
> >> > does require special handling.  I currently use negative numbers for
> the
> >>
> >> > start in this situation, though this has its own problems.
> >
> >
> >>
> >> You mean the biological start and end?
> >
> >
> > Well, I didn't expect the biological start and end, but I did expect that
> > the start and end would represent the start and end of the genes.
> >
> >
> >>
> >> You didn't answer my
> >>
> >> question
> >
> >
> > I sensed quicksand ;-)
> >
> >
> >>
> >> but I am guessing you wanted start 879 (adjusted for
> >>
> >> Python counting?) and end 490883 given this location string:
> >>
> >> complement(join(490883..490885,1..879))
> >
> >
> >  I would have been fine with start 490883, and end 879.  But I'm not
> pushing
> > for such.  I do think most users would naively assume that the set of
> start
> > and end of a gene feature would contain the start and end, regardless of
> > order or strand.  At least I did.
> >
> >>
> >> That would be possible for any stranded feature, although it
> >>
> >> is less well defined for strand-less features (which in GenBank
> >>
> >> and EMBL are by default forward strand).
> >>
> >> Anyway, using NC_005213 as the example:
> >>
> >>
> >>
> ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Nanoarchaeum_equitans_Kin4_M_uid58009/NC_005213.gbk
> >>
> >> Sample code:
> >>
> >>   >>> from Bio import SeqIO
> >>
> >>   >>> record = SeqIO.read("NC_005213.gbk", "genbank")
> >>
> >>   >>> feature = record.features[4] # the CDS record spanning origin
> >>
> >>   >>> print(feature.location)
> >>
> >>   join{[0:879](-), [490882:490885](-)}
> >>
> >>   >>> print(feature.location.start)
> >>
> >>   0
> >>
> >>   >>> print(feature.location.end)
> >>
> >>   490885
> >>
> >> Perhaps there is a case for "biological" start and end properties
> >>
> >> (calculated from the current data structure on demand)? i.e.
> >>
> >> something like this:
> >>
> >>   >>> feature.location.parts[0].end if
> >>
> >> feature.location.parts[0].strand == -1 else
> >>
> >> feature.location.parts[0].start
> >>
> >>   ExactPosition(879)
> >>
> >>   >>> feature.location.parts[-1].start if
> >>
> >> feature.location.parts[-1].strand == -1 else
> >>
> >> feature.location.parts[-1].end
> >>
> >>   ExactPosition(490882)
> >>
> >> Note as a convenience, even basic non-compound locations have
> >>
> >> a parts property - returning a list of one entry, themselves. So that
> >>
> >> code should work in general :)
> >
> >
> > Maybe so.  I use left and right for the biopython start and end locations
> > myself, to distinguish between biological starts and ends, which I
> calculate
> > with something like your logic, unless around origin.
> >
> > I appreciate that len(feature) is correct, even for around origin.
> >
> >> The potential enhancement would be to define these are new
> >>
> >> properties, feature.location.bio_start and feature.location.bio_end
> >>
> >> or similar naming? Would that be useful? What would name them?
> >
> >
> > I do rather like that, especially for cases around origin.
>  bio_start/end or
> > transcription_start/end -- nah too long.  I like bio_.  If nothing else,
> > would serve as a reminder that start and end are not necessarily the
> > transcribed start and end.
> >
> >
> >>
> >> > In the code I'm currently debugging I treat DNA molecules as partial
> >> > order
> >>
> >> > graphs in order to deal with overlapping or mutually-exclusive genes
> [or
> >>
> >> > gene predictions ;-)  ].  The graphs are cyclical for circular
> >> > molecules.
> >>
> >> > The graphs are responsible for generating distances between genes, so
> >> > each
> >>
> >> > node in the graph (a gene) contains the entire molecule length if
> >> > circular.
> >>
> >> > The definition of 'distance' here is also an issue -- typically I want
> >> > the
> >>
> >> > shortest intergenic distance.
> >>
> >> Have you found any mixed-strand trans-spliced genes in your
> >
> > dataset yet? They are really special cases which cause pain!
> >
> > Ugh no!
> >
> > Thanks again,
> >
> > Richard
> >
> >
> >
> > On Thu, May 8, 2014 at 12:25 PM, Peter Cock <p.j.a.cock at googlemail.com>
> > wrote:
> >>
> >> On Thu, May 8, 2014 at 6:47 PM, Richard Llewellyn wrote:
> >> >> > What numbers are you hoping to get out of this location?
> >> >
> >> > Great question.  I can see that having 0,end is useful as a flag for
> >> > origin
> >> > spanning.  However, it is also the least informative, as neither 0 or
> >> > the
> >> > end are actual locations of the gene starting/ending.
> >>
> >> Your view "least informative" is subjective. They are end points of
> >> the "fake" exons used to describe the feature, and more importantly
> >> represent the min/max of the region spanned (very useful for drawing
> >> features or considering intersections etc).
> >>
> >> > My code would have
> >> > expected the start and end to be the sequence locations (so start >>
> >> > end),
> >> > and it would have marked this as a special case of origin spanning.
>  But
> >> > it
> >> > does require special handling.  I currently use negative numbers for
> the
> >> > start in this situation, though this has its own problems.
> >>
> >> You mean the biological start and end? You didn't answer my
> >> question but I am guessing you wanted start 879 (adjusted for
> >> Python counting?) and end 490883 given this location string:
> >>
> >> complement(join(490883..490885,1..879))
> >>
> >> That would be possible for any stranded feature, although it
> >> is less well defined for strand-less features (which in GenBank
> >> and EMBL are by default forward strand).
> >>
> >> Anyway, using NC_005213 as the example:
> >>
> >>
> >>
> ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Nanoarchaeum_equitans_Kin4_M_uid58009/NC_005213.gbk
> >>
> >> Sample code:
> >>
> >>   >>> from Bio import SeqIO
> >>   >>> record = SeqIO.read("NC_005213.gbk", "genbank")
> >>   >>> feature = record.features[4] # the CDS record spanning origin
> >>   >>> print(feature.location)
> >>   join{[0:879](-), [490882:490885](-)}
> >>   >>> print(feature.location.start)
> >>   0
> >>   >>> print(feature.location.end)
> >>   490885
> >>
> >> Perhaps there is a case for "biological" start and end properties
> >> (calculated from the current data structure on demand)? i.e.
> >> something like this:
> >>
> >>   >>> feature.location.parts[0].end if
> >> feature.location.parts[0].strand == -1 else
> >> feature.location.parts[0].start
> >>   ExactPosition(879)
> >>
> >>   >>> feature.location.parts[-1].start if
> >> feature.location.parts[-1].strand == -1 else
> >> feature.location.parts[-1].end
> >>   ExactPosition(490882)
> >>
> >> Note as a convenience, even basic non-compound locations have
> >> a parts property - returning a list of one entry, themselves. So that
> >> code should work in general :)
> >>
> >> The potential enhancement would be to define these are new
> >> properties, feature.location.bio_start and feature.location.bio_end
> >> or similar naming? Would that be useful? What would name them?
> >>
> >> > In the code I'm currently debugging I treat DNA molecules as partial
> >> > order
> >> > graphs in order to deal with overlapping or mutually-exclusive genes
> [or
> >> > gene predictions ;-)  ].  The graphs are cyclical for circular
> >> > molecules.
> >> > The graphs are responsible for generating distances between genes, so
> >> > each
> >> > node in the graph (a gene) contains the entire molecule length if
> >> > circular.
> >> > The definition of 'distance' here is also an issue -- typically I want
> >> > the
> >> > shortest intergenic distance.
> >>
> >> Have you found any mixed-strand trans-spliced genes in your
> >> dataset yet? They are really special cases which cause pain!
> >>
> >> Peter
> >
> >
>