[Bioperl-l] Taxonomy DB problem

Thu Sep 2 16:21:48 UTC 2010

Chris,

There are a few things wrong with the original script, so I'll fix them.
Basically, it makes the assumption that every ID in the original list is
found.  The problem: eutils only reports back data it finds, silently
discarding IDs that don't match.  So, using the original ID list when
building the hashes needs a bit more error checking.
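
To make that concrete, here is a minimal sketch of the kind of checking I
mean. It is not the revised script itself, and it makes no network calls:
%taxa and %names are hypothetical stand-ins for what elink and esummary
hand back, and the taxon IDs (tax_a and so on) are placeholders rather
than real NCBI values. The idea is to build everything from the IDs that
actually came back, not from the full submitted list.

```perl
use strict;
use warnings;

# IDs submitted to elink (same list as the original script).
my @ids = qw(1621261 89318838 68536103 20807972 730439);

# Hypothetical stand-in for the elink results: only IDs that eutils
# found appear as keys. Taxon IDs here are placeholders.
my %taxa = (
    1621261  => 'tax_a',
    68536103 => 'tax_b',
    20807972 => 'tax_c',
    730439   => 'tax_d',
);

# Hypothetical stand-in for the esummary results (taxon ID => name).
my %names = (
    tax_a => 'Mycobacterium tuberculosis H37Rv',
    tax_b => 'Corynebacterium jeikeium K411',
    tax_c => 'Thermoanaerobacter tengcongensis MB4',
    tax_d => 'Bacillus caldolyticus',
);

# Split the submitted IDs into those with a taxon ID and those without,
# so no undef ever reaches the second query.
my @found     = grep {  defined $taxa{$_} } @ids;
my @missing   = grep { !defined $taxa{$_} } @ids;
my @taxon_ids = @taxa{@found};

# Build the final map; IDs with no hit are recorded as undef on purpose.
my %idmap = map { $_ => undef } @missing;
$idmap{$_} = $names{ $taxa{$_} } for @found;

warn "No taxonomy link for: @missing\n" if @missing;
```

With the data above, 89318838 ends up in @missing and the other four IDs
map to their names.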

Here's the revised script that works for me.

https://gist.github.com/f5db90a432fed68548d4

I'm also adding a check to ensure all IDs are defined prior to adding
them to the param string, just in case.
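
Roughly, that check amounts to something like the following. This is only
a sketch: id_param is a hypothetical helper name, not the actual code in
EUtilParameters.pm, and the comma-separated join is an assumption made
for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not the real EUtilParameters code): drop
# undefined or empty IDs, then join the rest into one string.
sub id_param {
    my @ids = grep { defined($_) && length($_) } @_;
    return join ',', @ids;
}

# An undef in the middle of the list no longer triggers the
# "uninitialized value" warning in join.
my $param = id_param(1621261, undef, 730439);   # "1621261,730439"
```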

chris

On Thu, 2010-09-02 at 10:53 -0400, J. Christopher Ellis wrote:
> Chris have you had any luck with this?
> 
> Thanks,
> Chris
> 
> On Tue 08/31/10 11:01 , "Chris Fields" cjfields at illinois.edu sent:
>         Yes, I see that one. It may be that the ID hash being
>         returned is empty. I'll look into it.
>         
>         -c 
>         
>         On Aug 31, 2010, at 6:57 AM, J. Christopher Ellis wrote:
>         
>         > Hi Chris,
>         > 
>         > The error is...
>         > 
>         > "Use of uninitialized value $id in join or string at
>         C:/Perl64/site/lib/Bio/Tools/EUtilities/EUtilParameters.pm
>         line 363."
>         > 
>         > The script from
>         > http://bioperl.org/wiki/Species_names_from_accession_numbers
>         > is as follows:
>         > 
>         > use Bio::DB::EUtilities;
>         >
>         > my (%taxa, @taxa);
>         > my (%names, %idmap);
>         >
>         > # these are protein ids; nuc ids will work by changing
>         > # -dbfrom => 'nucleotide',
>         > # (probably)
>         > my @ids = qw(1621261 89318838 68536103 20807972 730439);
>         >
>         > my $factory = Bio::DB::EUtilities->new(
>         >     -eutil          => 'elink',
>         >     -db             => 'taxonomy',
>         >     -dbfrom         => 'protein',
>         >     -correspondence => 1,
>         >     -id             => \@ids);
>         >
>         > # iterate through the LinkSet objects
>         > while (my $ds = $factory->next_LinkSet) {
>         >     $taxa{($ds->get_submitted_ids)[0]} = ($ds->get_ids)[0]
>         > }
>         >
>         > @taxa = @taxa{@ids};
>         >
>         > $factory = Bio::DB::EUtilities->new(
>         >     -eutil => 'esummary',
>         >     -db    => 'taxonomy',
>         >     -id    => \@taxa );
>         >
>         > while (local $_ = $factory->next_DocSum) {
>         >     $names{($_->get_contents_by_name('TaxId'))[0]} =
>         >         ($_->get_contents_by_name('ScientificName'))[0];
>         > }
>         >
>         > foreach (@ids) {
>         >     $idmap{$_} = $names{$taxa{$_}};
>         > }
>         >
>         > # %idmap is
>         > # 1621261 => 'Mycobacterium tuberculosis H37Rv'
>         > # 20807972 => 'Thermoanaerobacter tengcongensis MB4'
>         > # 68536103 => 'Corynebacterium jeikeium K411'
>         > # 730439 => 'Bacillus caldolyticus'
>         > # 89318838 => undef (this record has been removed from the db)
>         >
>         > 1;
>         > 
>         > 
>         > Thanks,
>         > Chris
>         > 
>         > 
>         > On Mon 08/30/10 09:36 , "Chris Fields" cjfields at illinois.edu
>         sent:
>         > Chris,
>         > 
>         > Regarding a fix for that script, we would have to see your
>         modified script and the error. However, there are modules
>         within BioPerl to essentially do what you want, in particular,
>         Bio::DB::Taxonomy.
>         > 
>         > chris
>         > 
>         > On Aug 30, 2010, at 7:55 AM, J. Christopher Ellis wrote:
>         > 
>         > > Hi All,
>         > > 
>         > > I am trying to extract the entire taxonomy of an organism
>         > > including the classifications. Something like...
>         > > 
>         > > Phylum:Proteobacteria, Class:Gammaproteobacteria,
>         Order:Enterobacteriales, Family:Enterobacteriaceae,
>         Genus:Escherichia
>         > > 
>         > > I am not worried about format, just that I get the
>         > > information and the associated level of hierarchy. The script
>         > > found at
>         > > http://bioperl.org/wiki/Species_names_from_accession_numbers
>         > > seemed like a good starting point, so I copied it and tried to
>         > > run it but got an error.
>         > > 
>         > > My first question is "Is there a known fix for this?" and
>         my second question is how do I get the full hierarchical
>         information (as seen above) with the taxonomy db?
>         > > 
>         > > Thanks for all your help in advance!
>         > > 
>         > > Chris 
>         > > 
>         > > 
>         > > _______________________________________________
>         > > Bioperl-l mailing list
>         > > Bioperl-l at lists.open-bio.org
>         > > http://lists.open-bio.org/mailman/listinfo/bioperl-l
>         > 
>         > 
>         
>