[Bioperl-l] Get nucleotide sequence when expecting proteinfromgenpept

Chris Fields cjfields at uiuc.edu
Wed Jul 12 15:39:39 UTC 2006


Problem is, you may or may not have GIs for a BLAST hit depending on how you
retrieve the BLAST report, what interface you use, etc.  NCBI is pretty
ambiguous when it comes to GI vs. accession; the sequence database guys want
you to use the GI for searches (since that's the unique ID for NCBI's
databases) and don't promise getting the correct sequence using the
accession.  

However, the BLAST interface guys have set up the BLAST CGI server to not
return the GI by default(accessible through Bio::Tools::Run::RemoteBlast).
Even more confusing, if you use the NCBI BLAST web interface, this option is
turned on by default.  Don't know what blastcl3 or blastall does, haven't
checked in a while.  

Anyway, this could be why there is no $hit->gi method for
GenericHit/BlastHit.  It could be added; I will need to look at
SearchIO::blast/blastxml/blasttable to see how this is parsed out.

BTW, what I do as a work-around, when using RemoteBlast, is below (you could
use the while loop to grab the GIs using SearchIO::blast if they are present
in the BLAST report).  This grabs all the GI's from the description line
(not just the best hit).

# sets retrieval header to include the GI always
$Bio::Tools::Run::RemoteBlast::RETRIEVALHEADER{'NCBI_GI'} = 'yes';

...

    while ( my $hit = $result->next_hit) {
        my $description = $hit->description;
        while ($description =~ /gi\|(.*?)\|/g) {
            my $gi = $1;
            push @gis, $gi;
        }
    }



Chris




 
> -----Original Message-----
> From: Frederick Partridge [mailto:frederick.partridge at st-
> johns.oxford.ac.uk]
> Sent: Wednesday, July 12, 2006 10:17 AM
> To: Chris Fields
> Cc: bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] Get nucleotide sequence when expecting
> proteinfromgenpept
> 
> 
> 
> On Tue, 11 Jul 2006, Chris Fields wrote:
> > This returns both the nucleotide sequence and the correct protein
> sequence;
> > the protein was returned second for some reason, so get_Seq_by_acc
> misses it
> > while get_Stream_by_acc doesn't.  I have notified NCBI about this issue,
> but
> > they will likely just tell me to use the GI number for searches as they
> are
> > unique.  Probably a good warning for anyone using accessions for all
> their
> > work (I use the GI myself).
> 
> 
> Thank you both for your help, I have converted to GIs and it works much
> better.
> 
> As an aside, it might be nice to have a $hit->gi method as well as
> $hit->accession for parsing blast reports. (I now realise that you can
> derive the gi from $hit->name, but this might have encouraged me to start
> off using gi instead of accession numbers).
> 
> 
> Freddie Partridge
> 
> University of Oxford
> 





More information about the Bioperl-l mailing list