[Bioperl-l] Re: help extracting CDS

Jason Stajich jason@cgt.mc.duke.edu
Wed, 18 Dec 2002 11:08:57 -0500 (EST)


oh well if you wanted to get the translation from the genbank record
you just do this - I wasn't assuming you had a well annotated record from
your mail, I thought you only had CDS annotated.

Of course given what you've done, I might have just generated the list of
protein accession numbers and downloaded those from genbank rather than
trying to extract this out of the sequence record, but whatever works.

This WILL work on bioperl 1.0.2

my $in = new Bio::SeqIO(-file => 'myfile.gb', -format => 'genbank');
my $out = new Bio::SeqIO(-file => '>myfile.pep');

while( my $seq = $in->next_seq ) {

    foreach my $f ( grep { $_->primary_tag eq 'CDS' }
       $seq->top_SeqFeatures ) {
    my ($gname);
    if( $f->has_tag('product') ) {
        ($gname) = $f->each_tag_value('product');
    } elsif( $f->has_tag('name') ) {
        ($gname) = $f->each_tag_value('name');
    }

    my ($ref) = $f->has_tag('protein_id') && $f->each_tag_value('protein_id');
    my ($gi)  = $f->has_tag('db_xref') && $f->each_tag_value('db_xref');

    my ($translation) = $f->has_tag('translation') &&
                          $f->each_tag_value('translation');

    unless( $gi && $ref && $gname && $translation ) {
      print STDERR "not fully annotated CDS ($gi,$ref,$gname), skipping...\n";
      next;
    }
    my $outseq = Bio::PrimarySeq->new(-seq => $translation,
                                      -display_id =>
            sprintf("gi|%s|gb|%s|%s",$gi,$gname,$ref));
    $out->write_seq($outseq);
  }
}

On Wed, 18 Dec 2002, Pedro Antonio Reche wrote:

> Hi, the ways of programing with bioperl remains misterious to me.
> Despite the always excellent help of Jason Stajich I could not find
> simple way to get the CDS from a genbank record using bioperl. At least
> for now.
> Anyway, in case someone was interested in my original post, this code in
> straight perl does just that.
> Regards.
> #!/usr/sbin/perl -w
> use strict;
> $/ = "\n     CDS";
> <>;	# to skeep header
> while ( <> ) {
>     my ($gname) = /product="([^"]+)"/;#sometimes /product= is replace by
> /name=
>     $gname      =~ s/\s+//g;
>     my ($ref)   = /protein_id="([\w.]+)"/;
>     my ($gid)   = /db_xref="(GI:\w+)"/;
>     my ($seq)   = /translation="([A-Z\s]+)"/;
>     $seq        =~ s/\s+//g;
>
>     print ">$gid|$gname|$ref\n$seq\n";
> }
> >
> > Hi, I need to extract the CDS from a genbank genome record, saving them
> > into file in  fasta format, and I wonder if someone can let me know how
> > to do this using bioperl.
> > Tanks in advance for any positive consideration.
> >
> > pedro
> >
> > *******************************************************************
> > PEDRO A. RECHE , pHD            TL: 617 632 3824
> > Dana-Farber Cancer Institute,   FX: 617 632 4569
> > Harvard Medical School,         EM: reche@research.dfci.harvard.edu
> > 44 Binney Street, D1510A,       EM: reche@mifoundation.org
> > Boston, MA 02115                URL: http://www.reche.org
> > *******************************************************************
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l@bioperl.org
> > http://bioperl.org/mailman/listinfo/bioperl-l
>
>

--
Jason Stajich
Duke University
jason at cgt.mc.duke.edu