[Bioperl-l] Bio::SeqIO, genbank -> fasta, protein only?

Chris Fields cjfields at uiuc.edu
Sun Oct 15 00:44:04 UTC 2006


...
> Huh. That's what I assumed when I stumbled into the -alphabet parameter.
> So I thought this would read the protein sequences out of my genbank file
> and write a fasta file for me:

You have to think about it this way: the GenBank record you are using is for
the nucleotide sequence only, and all other information in that record
describes the sequence.  Similarly, if you used a 'GenPept' sequence, the
focus would be the protein sequence.  Both normally contain annotations
which describe the sequence globally, such as references, organism info,
etc.  Both also may contain features (or SeqFeatures), which describe a
feature bound to a particular location on the sequence.  However, features
are not an absolute requirement for a sequence; they're sort of 'window
dressing', albeit almost always essential for describing the main sequence.

I would do exactly as Brian suggests.  See the Feature/Annotation HOWTO for
ideas on how to screen out the particular features you want and either grab
the 'translation' tag data or get the sequence object from the feature and
translate it directly.  You should get the same result either way though
getting the tag may be faster.

...

> It didn't. Would it be a Good Thing if it did what I was expecting? (Like
> I said I rolled my own, but I'm always looking for ways to enhance BioPerl
> that other people might find useful... Someday I will contribute something
> useful, by golly. -grin-)
> 
> (Background: I'm doing protein BLASTs from genbank files. To make formatdb
> happy I have to have fasta files full of the protein sequences.)
> 
> j

You could, theoretically, write up a method to only retrieve features which
correspond to coding regions only (CDS).  You may want to optionally screen
out pseudogenes but that's up to you.

Chris





More information about the Bioperl-l mailing list