[Bioperl-l] extracting GI number from BLAST hit

Fri Sep 17 09:35:19 EDT 2004

Well, is the GI number actually in the hit line of the report, or only 
in the description further down in the HSP section?

We only report what is in the report. Can you send a sample report 
that has the GI number in it?

You may want to run your blast with -I T
   -I  Show GI's in deflines [T/F]
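
If the GI does end up in the defline, I would expect the hit name to look 
like the ID in the original FASTA header, in which case a small regex is 
enough to pull the number out. A minimal sketch, assuming the name has the 
form gi|<number>|...; extract_gi is a hypothetical helper written for this 
example, not a BioPerl method, and the two sample names are taken from the 
FASTA header and the output quoted below:

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): pull the GI number out of an
# NCBI-style identifier such as "gi|30260195|ref|NC_003997.3|".
# Returns undef when the string carries no "gi|<digits>" field.
sub extract_gi {
    my ($id) = @_;
    return undef unless defined $id;
    return $id =~ /(?:^|\|)gi\|(\d+)/ ? $1 : undef;
}

# Assumed hit names: the first is the form expected when GIs are shown in
# the deflines, the second is the form seen in the output without them.
for my $name ('gi|30260195|ref|NC_003997.3|', 'ref|NC_002940.2|') {
    my $gi = extract_gi($name);
    print "$name => ", (defined $gi ? $gi : 'NOGI'), "\n";
}
```

Inside the hit loop this would be called as extract_gi($hit->name), falling 
back to something like 'NOGI' when nothing matches.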

-jason

On Sep 16, 2004, at 11:55 AM, Joshua Orvis wrote:

> How can one extract the GI number from hits when doing BLAST against
> an NCBI-formatted BLAST database?
>
> Each entry in the original multi-FASTA file was like this:
>
> >gi|30260195|ref|NC_003997.3| Bacillus anthracis str. Ames, complete genome
> [sequence .....]
>
> and formatting was done like:
>
> # formatdb -i filename.fna -p F -o T
>
> When I BLAST and parse the hit section I cannot see how to get the GI
> number out of each hit.  This code:
>
>         ## returns a Bio::SearchIO::blast object
>         $report = $fact->blastall($seq);
>
>         ## returns a Bio::Search::Result::BlastResult object
>         while( my $result = $report->next_result ) {
>
>             ## returns a Bio::Search::Hit::BlastHit object
>             while( my $hit = $result->next_hit ) {
>
>                 my $acc  = $hit->accession || 'NOACC';
>                 my $desc = $hit->description || 'NODESC';
>                 my $name = $hit->name || 'NONAME';
>                 my $locus = $hit->locus || 'NOLOC';
>
>                 print "$acc - $desc - $name - $locus\n";
>
>                 ## returns a Bio::Search::HSP::GenericHSP object
>                 while( my $hsp = $hit->next_hsp ) {
>                     ## TODO, grab the alignments in a bit
>                 }
>             }
>         }
>
> generates output like this:
>
> NC_002940 - Haemophilus ducreyi 35000HP, complete genome - ref|NC_002940.2| - NOLOC
> NC_004088 - Yersinia pestis KIM, complete genome - ref|NC_004088.1| - NOLOC
> NC_003143 - Yersinia pestis strain CO92, complete genome - ref|NC_003143.1| - NOLOC
> NC_002516 - Pseudomonas aeruginosa PA01, complete genome - ref|NC_002516.1| - NOLOC
> NC_002677 - Mycobacterium leprae strain TN complete genome - ref|NC_002677.1| - NOLOC
>
>
> I expected that I could parse it out of the description line, but the
> GI seems to be stripped at some earlier stage.  I'm probably just
> missing a method somewhere in the docs.  Any suggestions?
>
> Joshua
--
Jason Stajich
jason.stajich at duke.edu
http://www.duke.edu/~jes12/