[Biopython-dev] [Bug 2507] New: Adding __getitem__ to SeqRecord for element access and slicing

Mon Jun 2 13:26:28 UTC 2008

http://bugzilla.open-bio.org/show_bug.cgi?id=2507

           Summary: Adding __getitem__ to SeqRecord for element access and
                    slicing
           Product: Biopython
           Version: Not Applicable
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk
OtherBugsDependingOnThis: 1944

With a Seq object, you can access individual letters and create sub-sequences
using slicing.  You can even use a stride to reverse the sequence, or select
every third letter.

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPAC.unambiguous_dna)
>>> print my_seq
GATCGATGGGCCTATATAGGATCGAAAATCGC
>>> my_seq
Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPACUnambiguousDNA())
>>> my_seq[5:10]
Seq('ATGGG', IUPACUnambiguousDNA())
>>> my_seq[::-1]
Seq('CGCTAAAAGCTAGGATATATCCGGGTAGCTAG', IUPACUnambiguousDNA())
>>> my_seq[5]
'A'

Currently, these operations cannot be done with a SeqRecord object.  This
enhancement bug is to allow element access and slicing (perhaps even with a
stride) on SeqRecord objects, where the annotations are taken into
consideration, and preserved as far as reasonably possible.

Looking at the different SeqRecord properties, this is what I think should
happen for creating a sub-sequence:

.id, .name, .description (three strings) - preserve?

Blindly preserving these may not always be meaningful.  For example, if the
description was "Complete plasmid" then it doesn't really apply to a
sub-sequence.  Perhaps we should preserve only the id and name, and set the
description to "sub-sequence"?

.annotations (dictionary) - either preserve or lose?

Some annotation entries will still be valid for a sub-sequence (e.g. "source"
or references).  Others will not (e.g. anything describing its coordinates
within a larger parent sequence).  There is no reliable way to decide on a case
by case basis.

.dbxrefs (list of strings) - preserve?

Any database cross-references would arguably still apply to a sub-sequence or
even a reversed sequence.

.features (list of SeqFeatures) - select only those features still in the new
sub-sequence, and adjust their locations for the new coordinates.  Supporting
strides other than +1 would be complicated!  For simplicity, I would say any
feature only partially within the sub-sequence should be discarded.
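
As a rough illustration of the feature handling proposed above, here is a
minimal sketch for the simple case of a stride of +1.  It is not Biopython
code: the function name slice_features and the (start, end, label) tuples are
hypothetical stand-ins for SeqFeature objects, and start and end are assumed
to be already normalised to non-negative, half-open Python coordinates.

```python
def slice_features(features, start, end):
    """Return the features lying wholly inside [start, end), re-coordinated.

    `features` is a list of (feature_start, feature_end, label) tuples in
    half-open Python coordinates.  This layout is an assumption made for
    illustration only; it is not how SeqFeature stores locations.
    """
    kept = []
    for f_start, f_end, label in features:
        # Discard any feature only partially within the sub-sequence.
        if f_start >= start and f_end <= end:
            # Shift the location so it is relative to the new sub-sequence.
            kept.append((f_start - start, f_end - start, label))
    return kept


features = [(0, 4, "promoter"), (5, 10, "gene"), (8, 20, "repeat")]
sliced = slice_features(features, 5, 15)
print(sliced)
```

Here only the "gene" feature survives the slice [5:15], and it is reported at
positions 0 to 5 of the new sub-sequence.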

In summary, one clearly defined set of actions on creating a sub-sequence could
be to preserve all the annotation data except the SeqFeatures, which would be
restricted to those lying wholly within the sub-sequence and shifted to the new
coordinates.

[If we later support "per-letter-annotation" in either a Seq or SeqRecord
subclass, then this too should be sliced]

Adding a __getitem__ method to the SeqRecord as outlined above should be
compatible with the suggestion that the SeqRecord subclasses the Seq object
(see bug 2351).

A related point: when accessing single letters, e.g. record[0], should a
single-letter string be returned (which lacks any annotation), as currently
happens with the Seq object?
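
To make the question concrete, here is a minimal sketch of one possible
answer, in which an integer index returns a bare letter and a slice returns a
new record.  The MiniRecord class is a hypothetical stand-in written for this
example, holding a plain string rather than a Seq, and it is not the proposed
SeqRecord implementation.  Features are left out for brevity, and preserving
the id, name, description, annotations and dbxrefs unchanged is just one of
the options discussed above.

```python
class MiniRecord(object):
    """Hypothetical stand-in for SeqRecord, holding a plain string sequence."""

    def __init__(self, seq, id="", name="", description="",
                 annotations=None, dbxrefs=None):
        self.seq = seq
        self.id = id
        self.name = name
        self.description = description
        # Copy the containers so that records never share mutable state.
        self.annotations = dict(annotations or {})
        self.dbxrefs = list(dbxrefs or [])

    def __getitem__(self, index):
        if isinstance(index, int):
            # Single position: return the bare letter, with no annotation.
            return self.seq[index]
        if isinstance(index, slice):
            # Slice (with or without a stride): wrap the sub-sequence in a
            # new record, preserving the record-level annotation as-is.
            return MiniRecord(self.seq[index], id=self.id, name=self.name,
                              description=self.description,
                              annotations=self.annotations,
                              dbxrefs=self.dbxrefs)
        raise TypeError("index must be an integer or a slice")


record = MiniRecord("GATCGATGGGCCTATATAGGATCGAAAATCGC", id="example",
                    dbxrefs=["Project:1"])
letter = record[5]
sub = record[5:10]
print(letter)
print(sub.seq)
```

With this design record[5] gives the string 'A', while record[5:10] gives a
new record whose sequence is 'ATGGG' and whose id and dbxrefs match the
parent's.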

P.S. I'm marking this new enhancement bug as blocking bug 1944.  Once SeqRecord
objects support slicing, this would make annotation-preserving slicing of
alignment objects much more straightforward.

