[Bioperl-l] Problem when retrieving protein DESCRIPTION with Bio::DB::GenBank
Juan Jovel
jovel_juan at hotmail.com
Thu Mar 15 15:13:03 UTC 2012
Dear All,
I have done assemblies and blastx on few dozens of libraries and then extracted the tabular report from Blast. As you know such table provides the 'gi' for each protein, but not the description. I have written the small script (shown below) to retrieve the Description of each hit, and append it to the tabular blast report.
>From typical 'gi's (gi|13022215|V|gb|AAK11700.1|AF345523_1), I extract the 'gi|13022215|' and use it to feed Bio::DB::GenBank. It works in most instances, however, it collapses when a gi, like gi|6178084|, matches more than one entry. I have also tried using the 'version' (|AF345523_1) but similar problems are encountered in some instances.
I can guess there are easier/more efficient solutions to this simple task, but unfortunately it is as far as my BioPerl skills go.
Any help will be appreciated. The referred script follows:
#!/usr/bin/perl -w
use Bio::DB::GenBank;
chomp($dir = $ARGV[0]);opendir(DIR, $dir) or die "$!";@files = grep {/\.blast$/} readdir DIR;close DIR;@files = sort(@files);
foreach $file(@files){ open(IN, "$dir/$file") or die "$!"; print "$file \n";
($out = $file) =~ s/\.blast$/\.xls/; open(OUT, ">$dir/$out");
while($line = <IN>){ chomp($line); @temp = split(/\t/, $line); $left = index($temp[1], 'gi|'); $right = index($temp[1], '|V|'); $id = substr($temp[1], $left + 2, ($right - $left) - 1);
my $db_obj = Bio::DB::GenBank->new; my $seq_obj = $db_obj->get_Seq_by_acc($id);
print OUT $line, "\t", $seq_obj->desc, "\n"; print $seq_obj->desc, "\n"; }close IN;close OUT;}exit;
More information about the Bioperl-l
mailing list