[Bioperl-l] parse cds and peptide from genbank file with identifier

Wed May 28 11:53:17 EDT 2003

Yep, what name would you like them to have?

In the scripts bioperl/scripts/seq/extrac_feature_seq.PLS you can see an
example of you to set the sequence names.  Here is basic code to do so:
  ...
  my $cdsct = 1;
  foreach my $f ( @cds ) {
    my $featureseq = $f->spliced_seq();
    my ($id) = $f->has_tag('gene') ?  $f->each_tag_value('gene') :
               $f->has_tag('product') ? $f->each_tag_value('product') :
  	       'CDS'.$cdsct++;

   $featureseq->display_id($id);
   ...
 }

  You might also add other fields to check depending on how your features
are annotated.  More sophisticated logic could be written as well and is
left as an excercise for the reader....

-jason

On Sun, 25 May 2003 Guoqi4 at aol.com wrote:

> Dear group,
> I am interested in parsing cds and peptide from genbank file. I run the
> following script:
>
> #!/usr/bin/perl -w
> # Contributed by Jason Stajich <^S HYPERLINK "mailto:jason at bioperl.org"
> ^Tjason at bioperl.org^U>
>
> # simple extract the CDS features from a genbank file and
> # write out the CDS and Peptide sequences
>
> use strict;
> use Bio::SeqIO;
> my $filename = shift || die("pass in a genbank filename on the cmd line");
> my $in = new Bio::SeqIO(-file => $filename, -format => 'genbank');
> my $out = new Bio::SeqIO(-file => ">$filename.cds");
> my $outpep = new Bio::SeqIO(-file => ">$filename.pep");
>
> while( my $seq = $in->next_seq ) {
>    my @cds = grep { $_->primary_tag eq 'CDS' } $seq->get_SeqFeatures();
>    foreach my $feature ( @cds ) {
>      my $featureseq = $feature->spliced_seq;
>      #print 'featureseq:'.$featureseq;
>      my $featureseqid=$feature->display_name;
>          #print my $featureseqid;
>     # $out->write_seq($featureseq);
>      $out->write_seq($featureseqid);
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^--- this is wrong, go back to the previous
                                         code, you can't write out an id only
> $outpep->write_seq($featureseq->translate);
>    }
> }
>
> There are two problems found. The first lines are all the same in   for One
> is the fasta format
>

--
Jason Stajich
Duke University
jason at cgt.mc.duke.edu