[Biopython] SeqIO feature.location.start and end for genes spanning origin
Richard Llewellyn
llewelr at gmail.com
Thu May 8 23:20:22 UTC 2014
On Thu, May 8, 2014 at 6:47 PM, Richard Llewellyn wrote:
>> > What numbers are you hoping to get out of this location?
>
> Great question. I can see that having 0,end is useful as a flag for
> origin
> spanning. However, it is also the least informative, as neither 0 or the
> end are actual locations of the gene starting/ending.
>
Your view "least informative" is subjective. They are end points of
the "fake" exons used to describe the feature, and more importantly
represent the min/max of the region spanned (very useful for drawing
features or considering intersections etc).
Ok. But I do pause on the statement that these 'represent the min/max of
the region spanned,' as from my perspective I don't think of a gene as
spanning the entire chromosome. Regardless, if I check for this special
case, no big deal for me. I realize you have many other constraints I
haven't needed to worry about.
> My code would have
> expected the start and end to be the sequence locations (so start >>
> end),
> and it would have marked this as a special case of origin spanning. But
> it
> does require special handling. I currently use negative numbers for the
> start in this situation, though this has its own problems.
> You mean the biological start and end?
Well, I didn't expect the biological start and end, but I did expect that
the start and end would represent the start and end of the genes.
You didn't answer my
question
I sensed quicksand ;-)
but I am guessing you wanted start 879 (adjusted for
Python counting?) and end 490883 given this location string:
complement(join(490883..490885,1..879))
I would have been fine with start 490883, and end 879. But I'm not
pushing for such. I do think most users would naively assume that the set
of start and end of a gene feature would contain the start and end,
regardless of order or strand. At least I did.
> That would be possible for any stranded feature, although it
is less well defined for strand-less features (which in GenBank
and EMBL are by default forward strand).
Anyway, using NC_005213 as the example:
ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Nanoarchaeum_equitans_Kin4_M_uid58009/NC_005213.gbk
Sample code:
>>> from Bio import SeqIO
>>> record = SeqIO.read("NC_005213.gbk", "genbank")
>>> feature = record.features[4] # the CDS record spanning origin
>>> print(feature.location)
join{[0:879](-), [490882:490885](-)}
>>> print(feature.location.start)
0
>>> print(feature.location.end)
490885
Perhaps there is a case for "biological" start and end properties
(calculated from the current data structure on demand)? i.e.
something like this:
>>> feature.location.parts[0].end if
feature.location.parts[0].strand == -1 else
feature.location.parts[0].start
ExactPosition(879)
>>> feature.location.parts[-1].start if
feature.location.parts[-1].strand == -1 else
feature.location.parts[-1].end
ExactPosition(490882)
Note as a convenience, even basic non-compound locations have
a parts property - returning a list of one entry, themselves. So that
code should work in general :)
Maybe so. I use left and right for the biopython start and end locations
myself, to distinguish between biological starts and ends, which I
calculate with something like your logic, unless around origin.
I appreciate that len(feature) is correct, even for around origin.
The potential enhancement would be to define these are new
properties, feature.location.bio_start and feature.location.bio_end
or similar naming? Would that be useful? What would name them?
I do rather like that, especially for cases around origin. bio_start/end
or transcription_start/end -- nah too long. I like bio_. If nothing else,
would serve as a reminder that start and end are not necessarily the
transcribed start and end.
> > In the code I'm currently debugging I treat DNA molecules as partial
> order
> graphs in order to deal with overlapping or mutually-exclusive genes [or
> gene predictions ;-) ]. The graphs are cyclical for circular molecules.
> The graphs are responsible for generating distances between genes, so each
> node in the graph (a gene) contains the entire molecule length if
> circular.
> The definition of 'distance' here is also an issue -- typically I want the
> shortest intergenic distance.
Have you found any mixed-strand trans-spliced genes in your
dataset yet? They are really special cases which cause pain!
Ugh no!
Thanks again,
Richard
On Thu, May 8, 2014 at 4:14 PM, Peter Cock <p.j.a.cock at googlemail.com>wrote:
> Do you mind sending this again but including the list? Thanks!
>
> Peter
>
> On Thu, May 8, 2014 at 8:55 PM, Richard Llewellyn <llewelr at gmail.com>
> wrote:
> >> On Thu, May 8, 2014 at 6:47 PM, Richard Llewellyn wrote:
> >>
> >> >> > What numbers are you hoping to get out of this location?
> >>
> >> >
> >>
> >> > Great question. I can see that having 0,end is useful as a flag for
> >> > origin
> >>
> >> > spanning. However, it is also the least informative, as neither 0 or
> >> > the
> >>
> >> > end are actual locations of the gene starting/ending.
> >
> >
> >>
> >>
> >>
> >> Your view "least informative" is subjective. They are end points of
> >>
> >> the "fake" exons used to describe the feature, and more importantly
> >>
> >> represent the min/max of the region spanned (very useful for drawing
> >>
> >> features or considering intersections etc).
> >
> >
> > Ok. But I do pause on the statement that these 'represent the min/max of
> > the region spanned,' as from my perspective I don't think of a gene as
> > spanning the entire chromosome. Regardless, if I check for this special
> > case, no big deal for me. I realize you have many other constraints I
> > haven't needed to worry about.
> >
> >
> >>
> >> > My code would have
> >>
> >> > expected the start and end to be the sequence locations (so start >>
> >> > end),
> >>
> >> > and it would have marked this as a special case of origin spanning.
> But
> >> > it
> >>
> >> > does require special handling. I currently use negative numbers for
> the
> >>
> >> > start in this situation, though this has its own problems.
> >
> >
> >>
> >> You mean the biological start and end?
> >
> >
> > Well, I didn't expect the biological start and end, but I did expect that
> > the start and end would represent the start and end of the genes.
> >
> >
> >>
> >> You didn't answer my
> >>
> >> question
> >
> >
> > I sensed quicksand ;-)
> >
> >
> >>
> >> but I am guessing you wanted start 879 (adjusted for
> >>
> >> Python counting?) and end 490883 given this location string:
> >>
> >> complement(join(490883..490885,1..879))
> >
> >
> > I would have been fine with start 490883, and end 879. But I'm not
> pushing
> > for such. I do think most users would naively assume that the set of
> start
> > and end of a gene feature would contain the start and end, regardless of
> > order or strand. At least I did.
> >
> >>
> >> That would be possible for any stranded feature, although it
> >>
> >> is less well defined for strand-less features (which in GenBank
> >>
> >> and EMBL are by default forward strand).
> >>
> >> Anyway, using NC_005213 as the example:
> >>
> >>
> >>
> ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Nanoarchaeum_equitans_Kin4_M_uid58009/NC_005213.gbk
> >>
> >> Sample code:
> >>
> >> >>> from Bio import SeqIO
> >>
> >> >>> record = SeqIO.read("NC_005213.gbk", "genbank")
> >>
> >> >>> feature = record.features[4] # the CDS record spanning origin
> >>
> >> >>> print(feature.location)
> >>
> >> join{[0:879](-), [490882:490885](-)}
> >>
> >> >>> print(feature.location.start)
> >>
> >> 0
> >>
> >> >>> print(feature.location.end)
> >>
> >> 490885
> >>
> >> Perhaps there is a case for "biological" start and end properties
> >>
> >> (calculated from the current data structure on demand)? i.e.
> >>
> >> something like this:
> >>
> >> >>> feature.location.parts[0].end if
> >>
> >> feature.location.parts[0].strand == -1 else
> >>
> >> feature.location.parts[0].start
> >>
> >> ExactPosition(879)
> >>
> >> >>> feature.location.parts[-1].start if
> >>
> >> feature.location.parts[-1].strand == -1 else
> >>
> >> feature.location.parts[-1].end
> >>
> >> ExactPosition(490882)
> >>
> >> Note as a convenience, even basic non-compound locations have
> >>
> >> a parts property - returning a list of one entry, themselves. So that
> >>
> >> code should work in general :)
> >
> >
> > Maybe so. I use left and right for the biopython start and end locations
> > myself, to distinguish between biological starts and ends, which I
> calculate
> > with something like your logic, unless around origin.
> >
> > I appreciate that len(feature) is correct, even for around origin.
> >
> >> The potential enhancement would be to define these are new
> >>
> >> properties, feature.location.bio_start and feature.location.bio_end
> >>
> >> or similar naming? Would that be useful? What would name them?
> >
> >
> > I do rather like that, especially for cases around origin.
> bio_start/end or
> > transcription_start/end -- nah too long. I like bio_. If nothing else,
> > would serve as a reminder that start and end are not necessarily the
> > transcribed start and end.
> >
> >
> >>
> >> > In the code I'm currently debugging I treat DNA molecules as partial
> >> > order
> >>
> >> > graphs in order to deal with overlapping or mutually-exclusive genes
> [or
> >>
> >> > gene predictions ;-) ]. The graphs are cyclical for circular
> >> > molecules.
> >>
> >> > The graphs are responsible for generating distances between genes, so
> >> > each
> >>
> >> > node in the graph (a gene) contains the entire molecule length if
> >> > circular.
> >>
> >> > The definition of 'distance' here is also an issue -- typically I want
> >> > the
> >>
> >> > shortest intergenic distance.
> >>
> >> Have you found any mixed-strand trans-spliced genes in your
> >
> > dataset yet? They are really special cases which cause pain!
> >
> > Ugh no!
> >
> > Thanks again,
> >
> > Richard
> >
> >
> >
> > On Thu, May 8, 2014 at 12:25 PM, Peter Cock <p.j.a.cock at googlemail.com>
> > wrote:
> >>
> >> On Thu, May 8, 2014 at 6:47 PM, Richard Llewellyn wrote:
> >> >> > What numbers are you hoping to get out of this location?
> >> >
> >> > Great question. I can see that having 0,end is useful as a flag for
> >> > origin
> >> > spanning. However, it is also the least informative, as neither 0 or
> >> > the
> >> > end are actual locations of the gene starting/ending.
> >>
> >> Your view "least informative" is subjective. They are end points of
> >> the "fake" exons used to describe the feature, and more importantly
> >> represent the min/max of the region spanned (very useful for drawing
> >> features or considering intersections etc).
> >>
> >> > My code would have
> >> > expected the start and end to be the sequence locations (so start >>
> >> > end),
> >> > and it would have marked this as a special case of origin spanning.
> But
> >> > it
> >> > does require special handling. I currently use negative numbers for
> the
> >> > start in this situation, though this has its own problems.
> >>
> >> You mean the biological start and end? You didn't answer my
> >> question but I am guessing you wanted start 879 (adjusted for
> >> Python counting?) and end 490883 given this location string:
> >>
> >> complement(join(490883..490885,1..879))
> >>
> >> That would be possible for any stranded feature, although it
> >> is less well defined for strand-less features (which in GenBank
> >> and EMBL are by default forward strand).
> >>
> >> Anyway, using NC_005213 as the example:
> >>
> >>
> >>
> ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Nanoarchaeum_equitans_Kin4_M_uid58009/NC_005213.gbk
> >>
> >> Sample code:
> >>
> >> >>> from Bio import SeqIO
> >> >>> record = SeqIO.read("NC_005213.gbk", "genbank")
> >> >>> feature = record.features[4] # the CDS record spanning origin
> >> >>> print(feature.location)
> >> join{[0:879](-), [490882:490885](-)}
> >> >>> print(feature.location.start)
> >> 0
> >> >>> print(feature.location.end)
> >> 490885
> >>
> >> Perhaps there is a case for "biological" start and end properties
> >> (calculated from the current data structure on demand)? i.e.
> >> something like this:
> >>
> >> >>> feature.location.parts[0].end if
> >> feature.location.parts[0].strand == -1 else
> >> feature.location.parts[0].start
> >> ExactPosition(879)
> >>
> >> >>> feature.location.parts[-1].start if
> >> feature.location.parts[-1].strand == -1 else
> >> feature.location.parts[-1].end
> >> ExactPosition(490882)
> >>
> >> Note as a convenience, even basic non-compound locations have
> >> a parts property - returning a list of one entry, themselves. So that
> >> code should work in general :)
> >>
> >> The potential enhancement would be to define these are new
> >> properties, feature.location.bio_start and feature.location.bio_end
> >> or similar naming? Would that be useful? What would name them?
> >>
> >> > In the code I'm currently debugging I treat DNA molecules as partial
> >> > order
> >> > graphs in order to deal with overlapping or mutually-exclusive genes
> [or
> >> > gene predictions ;-) ]. The graphs are cyclical for circular
> >> > molecules.
> >> > The graphs are responsible for generating distances between genes, so
> >> > each
> >> > node in the graph (a gene) contains the entire molecule length if
> >> > circular.
> >> > The definition of 'distance' here is also an issue -- typically I want
> >> > the
> >> > shortest intergenic distance.
> >>
> >> Have you found any mixed-strand trans-spliced genes in your
> >> dataset yet? They are really special cases which cause pain!
> >>
> >> Peter
> >
> >
>
More information about the Biopython
mailing list