[Biopython-dev] [Bug 2578] New: The GenBank SeqRecord parser does not record molecule type or if circular

bugzilla-daemon at portal.open-bio.org bugzilla-daemon at portal.open-bio.org
Wed Sep 3 16:46:30 UTC 2008


http://bugzilla.open-bio.org/show_bug.cgi?id=2578

           Summary: The GenBank SeqRecord parser does not record molecule type
                    or if circular
           Product: Biopython
           Version: 1.47
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: minor
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk


Filing this bug after discussion on the mailing list, where the issue was
raised by Chris Lasher:
http://lists.open-bio.org/pipermail/biopython/2008-September/004474.html
http://lists.open-bio.org/pipermail/biopython/2008-September/004475.html
http://lists.open-bio.org/pipermail/biopython/2008-September/004476.html

The LOCUS line at the start of a GenBank record can record the molecule type
(DNA, RNA, mRNA, protein, etc.) and whether the sequence is linear or circular,
e.g.

LOCUS       NC_002678            7036071 bp    DNA     circular BCT 22-JUL-2008
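
As a rough illustration of where the two values sit on that line, here is a
minimal sketch using only the standard library.  The helper name
parse_locus_line is hypothetical (it is not part of Biopython), and it assumes
a whitespace-separated LOCUS line that always ends with the division code and
the date; real LOCUS lines are more varied than this.

```python
TOPOLOGIES = ("linear", "circular")


def parse_locus_line(line):
    """Pull name, length, molecule type and topology out of a LOCUS line.

    Hypothetical helper; assumes the last two fields are division and date.
    """
    fields = line.split()
    if len(fields) < 4 or fields[0] != "LOCUS":
        raise ValueError("not a recognisable LOCUS line: %r" % line)
    info = {
        "name": fields[1],
        "length": int(fields[2]),
        "molecule_type": None,
        "topology": None,
    }
    # fields[3] is the unit ("bp" or "aa"); whatever lies between it and the
    # trailing division/date fields is molecule type and/or topology.
    for token in fields[4:-2]:
        if token in TOPOLOGIES:
            info["topology"] = token
        else:
            info["molecule_type"] = token
    return info


example = ("LOCUS       NC_002678            7036071 bp    "
           "DNA     circular BCT 22-JUL-2008")
info = parse_locus_line(example)
print(info["molecule_type"], info["topology"])  # DNA circular
```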

Currently Bio.SeqIO (and Bio.GenBank.FeatureParser if called directly) does
not record either of these two pieces of information in the SeqRecord.

Bio.SeqIO uses the Bio.GenBank.FeatureParser, which gets passed this
information from the Scanner via the residue_type event.  This is a single
combined string containing both the sequence type (DNA, RNA etc) and whether
the sequence is linear or circular.  It is currently used only to determine
the Seq alphabet, and is never stored.  So in addition to not recording whether
the LOCUS line said the sequence was circular, any finer distinction in the
LOCUS line (cDNA, mRNA, ...) is also lost in the SeqRecord representation.  On
the other hand, the Bio.GenBank.RecordParser stores all this as the record's
residue_type property (a single combined field, presumably reflecting the
layout of early GenBank files).
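
To make the information loss concrete, here is a small sketch.  The function
coarse_sequence_class is a hypothetical stand-in for an alphabet choice and is
not Biopython's actual logic; the combined "type topology" strings are also
assumed for illustration.  Once only a coarse class is kept, distinct LOCUS
values become indistinguishable:

```python
def coarse_sequence_class(residue_type):
    """Reduce a combined residue_type string to a coarse class.

    Hypothetical stand-in for picking an alphabet; not Biopython's own code.
    """
    text = residue_type.upper()
    if "DNA" in text:
        return "DNA"
    if "RNA" in text:
        return "RNA"
    if "PROTEIN" in text:
        return "protein"
    return "unknown"


# Assumed example values for the combined field.
for residue_type in ("DNA circular", "cDNA linear", "mRNA linear", "RNA circular"):
    # The topology and the cDNA/mRNA detail do not survive the reduction.
    print("%-14s -> %s" % (residue_type, coarse_sequence_class(residue_type)))
```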

It would be a logical improvement to record this information (molecule type
and whether the sequence is circular) in the SeqRecord's annotations
dictionary - perhaps as two fields, but we'd need to check if that would be
straightforward for EMBL files too.  Alternatively, if Biopython included a
native CircularSeq object, we could use that explicitly when the sequence is
declared as circular.  Getting a different kind of sequence object only for
circular records might be considered a little surprising though.
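
A minimal sketch of the two-field idea, under stated assumptions: a plain dict
stands in for the SeqRecord's annotations dictionary, the key names
"molecule_type" and "topology" are only suggestions, and the combined
residue_type string is assumed to be whitespace-separated.  The helper
record_molecule_info is hypothetical.

```python
def record_molecule_info(annotations, residue_type):
    """Split a combined residue_type string into two annotation entries.

    Hypothetical helper; key names and input layout are assumptions.
    """
    topology = None
    molecule_parts = []
    for token in residue_type.split():
        if token.lower() in ("linear", "circular"):
            topology = token.lower()
        else:
            molecule_parts.append(token)
    # Only add a key when the LOCUS line actually supplied that value.
    if molecule_parts:
        annotations["molecule_type"] = " ".join(molecule_parts)
    if topology is not None:
        annotations["topology"] = topology
    return annotations


annotations = {}  # stand-in for a SeqRecord's annotations dictionary
record_molecule_info(annotations, "DNA     circular")
print(annotations)
```

Leaving a key out when the value is absent, rather than guessing "linear",
keeps the annotations faithful to what the file actually said.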




