[Bioperl-l] help converting seq objects.

Pedro Antonio Reche reche at research.dfci.harvard.edu
Mon Aug 16 18:30:44 EDT 2004


Hi all,
I have combined two scripts kindly provided by Jason Stajich to 1)  
Retrieve a list of seq records from genbank saving them in GenBank  
format; and 2)  get the translation feature from the file previously  
saved. The program -shown below- is works nicely, but I will nicer if I  
could parse the sequence object obtained from the GenBank database  
without printing  it. Does anyone knows how to do this? Thanks in  
advance for any help.
#!/usr/sbin/perl -w
# How to retrieve GenBank entries over the Web
# by Jason Stajich

use Bio::DB::GenBank;
use Bio::SeqIO;
use Bio::Seq;

$in = shift @ARGV;

###### acc numbes from file an array ##########
open (F, "$in") || die;
while(<F>){
         next unless /(NM_\d+)\s/;
         $acc = $1;
         chomp($acc);
         push @AC, $acc;
}
close(F);
$list = join " ", @AC;
print "$list\n";

##### get genbank records into a file  
#######################################################
my $gb = new Bio::DB::GenBank;
my $seqio = $gb->get_Stream_by_acc(\@AC);
my $seqout = new Bio::SeqIO(-file => ">$in.gb", -format => 'GenBank');
#my $seqout = new Bio::SeqIO(-file => ">$ARGV[0].gb", -format =>  
'GenBank');
while( defined ($seq = $seqio->next_seq )) {
         $seqout->write_seq($seq);
}
###########get translation from file just created  
############################################
my $tmp  = "$in.gb";
my $temp = "$in.tfa";

my $in = new Bio::SeqIO(-file => "<$tmp", -format => 'genbank');
my $out = new Bio::SeqIO(-file => ">$temp", -format => 'fasta');
while ($seq = $in->next_seq ) {
         foreach my $f ( grep { $_->primary_tag eq 'CDS' }  
$seq->top_SeqFeatures ) {
                 my ($gname);
                 if ( $f->has_tag('gene') ) {
                         ($gname) = $f->each_tag_value('gene');
                 } elsif ( $f->has_tag('product') ) {
                         ($gname) = $f->each_tag_value('product');
                 }
                 ($gname) =~ s/\s+/_/g;
                 my ($ref) = $f->has_tag('protein_id') &&  
$f->each_tag_value('protein_id');
                 my ($gi)  = $f->has_tag('db_xref') &&  
$f->each_tag_value('db_xref');
                 my ($translation) = $f->has_tag('translation') &&  
$f->each_tag_value('translation');
                 unless( $gi && $ref && $gname && $translation ) {
                         print STDERR "not fully annotated CDS  
($gi,$ref,$gname), skipping...\n";
                         next;
                 }
                 my $tfa = Bio::PrimarySeq->new (-seq => $translation,  
-display_id =>
                                 sprintf("%s|%s|%s",$gi,$ref,$gname));
                         $out->write_seq($tfa);
     }
}
======================================================================== 
=====================
Dr. Pedro A Reche, Instructor of Medicine
Dana-Farber Cancer Institute, Harvard Medical School                     
      TL: 617 632 3824
HIM Building, Room 401                                                   
      FX: 617 632 3351
77 Avenue Louis Pasteur                                   EM:  
reche at research.dfci.harvard.edu
Boston, MA 02115, USA                                                 
W3: www.mifoundation.org

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: text/enriched
Size: 3382 bytes
Desc: not available
Url : http://portal.open-bio.org/pipermail/bioperl-l/attachments/20040816/9256957c/attachment.bin


More information about the Bioperl-l mailing list