[Bioperl-l] extracting GI number from BLAST hit
Joshua Orvis
jorvis at gmail.com
Thu Sep 16 11:55:01 EDT 2004
How can one extract the GI number from hits when doing BLAST against
an NCBI-formatted BLAST database?
Each entry in the original multi-FASTA file was like this:
>gi|30260195|ref|NC_003997.3| Bacillus anthracis str. Ames, complete genome
[sequence .....]
and formatting was done like:
# formatdb -i filename.fna -p F -o T
When I BLAST and parse the hit section I cannot see how to get the GI
number out of each hit. This code:
    ## returns a Bio::SearchIO::blast object
    $report = $fact->blastall($seq);

    ## returns a Bio::Search::Result::BlastResult object
    while( my $result = $report->next_result ) {

        ## returns a Bio::Search::Hit::BlastHit object
        while( my $hit = $result->next_hit ) {
            my $acc   = $hit->accession   || 'NOACC';
            my $desc  = $hit->description || 'NODESC';
            my $name  = $hit->name        || 'NONAME';
            my $locus = $hit->locus       || 'NOLOC';

            print "$acc - $desc - $name - $locus\n";

            ## returns a Bio::Search::HSP::GenericHSP object
            while( my $hsp = $hit->next_hsp ) {
                ## TODO, grab the alignments in a bit
            }
        }
    }
generates output like this:
NC_002940 - Haemophilus ducreyi 35000HP, complete genome - ref|NC_002940.2| - NOLOC
NC_004088 - Yersinia pestis KIM, complete genome - ref|NC_004088.1| - NOLOC
NC_003143 - Yersinia pestis strain CO92, complete genome - ref|NC_003143.1| - NOLOC
NC_002516 - Pseudomonas aeruginosa PA01, complete genome - ref|NC_002516.1| - NOLOC
NC_002677 - Mycobacterium leprae strain TN complete genome - ref|NC_002677.1| - NOLOC
I expected to be able to parse the GI out of the description line, but
it appears to be stripped at some earlier stage. I'm probably just
missing a method somewhere in the docs. Any suggestions?
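
For reference, here is a minimal sketch of the parsing step described above, assuming the "gi|<digits>" field is still present in whatever string is being inspected (for example the value returned by $hit->name). In the output shown above the names are of the form ref|NC_002940.2| with no GI field, so the helper would return undef for them. Legacy blastall has a -I option ("Show GI's in deflines", default F) that controls whether the GI is written into the report; how that option is passed through the factory used here is not shown and would need checking. The helper name extract_gi is hypothetical and not part of BioPerl.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): return the GI number found in an
# NCBI-style identifier string, or undef when no "gi|<digits>" field is present.
sub extract_gi {
    my ($id) = @_;
    return undef unless defined $id;
    return $id =~ /\bgi\|(\d+)/ ? $1 : undef;
}

# One identifier in the style of the original FASTA defline, and one in the
# style of the hit names printed by the loop above (no GI field).
for my $id ('gi|30260195|ref|NC_003997.3|', 'ref|NC_002940.2|') {
    my $gi = extract_gi($id);
    print defined $gi ? "$id => $gi\n" : "$id => no GI present\n";
}
```

Inside the hit loop this could be called as extract_gi($hit->name), with the same caveat that it only finds a GI if the report actually contains one.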
Joshua