[Biopython-dev] [Bug 2544] Bio.GenBank and SeqFeature improvements

Wed Jul 16 20:35:13 UTC 2008

http://bugzilla.open-bio.org/show_bug.cgi?id=2544

------- Comment #2 from mmokrejs at ribosome.natur.cuni.cz  2008-07-16 16:35 EST -------
(In reply to comment #1)

> 
> I'm guessing the missing line here was something like:
> seq_record = SeqIO.read(handle, "genbank")

Yes, I forgot to paste one line, sorry.

> > I do not see how I could access the value 'DNA' from the LOCUS line:
> > LOCUS       EF452680                 260 bp    DNA     linear   SYN 11-JUN-2008
> 
> Currently the sequence type (DNA, RNA, Protein) is used internally by the
> GenBank parser to determine the alphabet.  It is not currently recorded in the
> SeqRecord object's annotation but could be.  How about something like this?:
> 
> seq_record.annotations["seq_type"]

I am not much familiar with the official naming of the fields in LOCUS line
by Genbank but hope you are. Yes, it would be fine for me. I hope all other
values from LOCUS line can be accessed similarly as well.

> > Could seq_record.features have a repr() function to give me something useful
> > instead of this?
> > 
> > >>> print seq_record.features
> > [<Bio.SeqFeature.SeqFeature instance at 0x837bc2c>, <Bio.SeqFeature.SeqFeature
> > instance at 0x837b9cc>, <Bio.SeqFeature.SeqFeature instance at 0x837bd4c>]
> 
> Yes we could add that, but you wouldn't want to do that on a typical genome
> with thousands of features.  Adding a repr method for the Reference object is
> also something I had wondered about doing.

I think it could be there even for large records. It not up to the programmer
to use repr() or not, and while testing/learning it would be really useful. Or
at least internally the routine could check for number of features and in case
there would be thousands it could print some first and then stop with a clear
message how to force for full listing.

> > I don't see documented anywhere in the biopython docs access the features,
> > pasting something like the following into docs would give a user clue where to
> > look for for values:
> > 
> > >>> print seq_record.features[0].qualifiers
> > {'db_xref': ['taxon:32630'], 'mol_type': ['other DNA'], 'organism': ['synthetic
> > construct'], 'chromosome': ['Ib'], 'PCR_primers': ['fwd_seq:
> > aggcctctgcttgccgtttgtttcg, rev_seq: cgccggcggcacacgctcaactaattac']}
> > >>> print seq_record.features[1].qualifiers
> > {'gene': ['NOS']}
> > >>> print seq_record.features[2].qualifiers
> > {'product': ['nitric oxide synthase'], 'codon_start': ['2'], 'EC_number':
> > ['1.14.13.39'], 'transl_table': ['11'], 'note': ['derived from Toxoplasma
> > gondii'], 'db_xref': ['GI:166831529'], 'translation':
> > ['RPLLAVCFVSDFYSLSLLHFASVPFHESDGCVGRSHWLPGKHANYVKPAGARKRPEVGCRSSCLLRSVCCDILSPVRTRGN'],
> > 'gene': ['NOS'], 'protein_id': ['ABP65329.2']}
> 
> There is a minimal bit of text in what is currently Chapter 10 of the tutorial
> on the SeqFeature object. I agree, this is an area that needs improvement.

Yes I read that before but it is too short, even after reading 2.4.2, 4.2.1,
9.2 and http://biopython.org/wiki/SeqIO.

> 
> Perhaps a full example of parsing a simple GenBank file in the SeqIO chapter
> would help?

Definitely, you should pick some exceptional record having different fields,
I think the one I have shown is quite OK.

> 
> > >>> print seq_record.features[3].qualifiers
> > Traceback (most recent call last):
> >   File "<stdin>", line 1, in <module>
> > IndexError: list index out of range
> 
> You must have only three features (indexed 0, 1 and 2) which explains the 
> index error.

I knew, it was intentional. ;-)

> 
> > I wonder if I could access the above dicts as seq_record.features['source']
> > or seq_record.features['CDS']. Where is the 'source', 'gene', 'CDS' gone?
> 
> As the .type attribute, try this:
> 
> for feature in seq_record.features:
>    print feature.type

>>> for feature in seq_record.features:
...    print feature.type
... 
source
gene
CDS
>>>

> 
> You can't access the features by type (e.g. seq_record.features['CDS']) 
> because there is generally more than one feature of each type.

Yes, but how about seq_record.features['CDS'][index]? Could that be provided?

> P.S. Most of your comments are not on Bio.SeqIO itself, but actually about the
> underlying Bio.GenBank parser or the SeqFeature object.  I have therefore
> changed the title.

Thanks!

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.