[Bioperl-l] how to retrieve organism name from accession number?

Smithies, Russell Russell.Smithies at agresearch.co.nz
Wed Jan 27 02:45:58 UTC 2010


Batch-entrez http://www.ncbi.nlm.nih.gov/portal/utils/batchentrez_p.cgi still works if you don't mind a bit of manual button clicking. It's handling chunks of 100,000 records OK (today).

--Russell

> -----Original Message-----
> From: Chris Fields [mailto:cjfields at illinois.edu]
> Sent: Wednesday, 27 January 2010 3:42 p.m.
> To: Smithies, Russell
> Cc: 'bioperl-l at lists.open-bio.org'; 'Mark A. Jensen'
> Subject: Re: [Bioperl-l] how to retrieve organism name from accession
> number?
> 
> Makes me wonder if they're pushing more users towards the SOAP-based
> services and away from eutils.
> 
> chris
> 
> On Jan 26, 2010, at 7:59 PM, Smithies, Russell wrote:
> 
> > I've had a wide selection of errors lately:
> >
> > ------------- EXCEPTION: Bio::Root::Exception -------------
> > MSG: NCBI esearch fatal error: Search Backend failed: Error 11 (Resource temporarily unavailable)
> > STACK: Error::throw
> > STACK: Bio::Root::Root::throw /usr/lib/perl5/site_perl/5.8.8/Bio/Root/Root.pm:357
> > STACK: Bio::Tools::EUtilities::parse_data /usr/lib/perl5/site_perl/5.8.8/Bio/Tools/EUtilities.pm:332
> > STACK: Bio::Tools::EUtilities::get_ids /usr/lib/perl5/site_perl/5.8.8/Bio/Tools/EUtilities.pm:441
> > STACK: Bio::DB::EUtilities::get_ids /usr/lib/perl5/site_perl/5.8.8/Bio/DB/EUtilities.pm:363
> > STACK: get_desc.pl:32
> > -----------------------------------------------------------
> >
> > And I never get a good explanation from NCBI or suggestions on how to
> > avoid it.
> >
> >
> > --Russell
> >
> >
> >> -----Original Message-----
> >> From: Chris Fields [mailto:cjfields at illinois.edu]
> >> Sent: Wednesday, 27 January 2010 2:46 p.m.
> >> To: Smithies, Russell
> >> Cc: 'Mark A. Jensen'; 'bioperl-l at lists.open-bio.org'
> >> Subject: Re: [Bioperl-l] how to retrieve organism name from accession
> >> number?
> >>
> >> It's unfortunate but I have heard this problem popping up quite a bit
> >> more frequently lately.  Not to push too many buttons but NCBI isn't very
> >> forthcoming with help these days; they have become quite insular.  Not
> >> sure if they're short-staffed due to budget or if there are other issues.
> >>
> >> chris
> >>
> >> On Jan 26, 2010, at 7:40 PM, Smithies, Russell wrote:
> >>
> >>> Grrrrrr, I hate eutils!!!!
> >>>
> >>> ------------- EXCEPTION: Bio::Root::Exception -------------
> >>> MSG: NCBI esearch fatal error: Search Backend failed: Error 111 (Connection refused)
> >>> STACK: Error::throw
> >>> STACK: Bio::Root::Root::throw /usr/lib/perl5/site_perl/5.8.8/Bio/Root/Root.pm:357
> >>> STACK: Bio::Tools::EUtilities::parse_data /usr/lib/perl5/site_perl/5.8.8/Bio/Tools/EUtilities.pm:332
> >>> STACK: Bio::Tools::EUtilities::get_ids /usr/lib/perl5/site_perl/5.8.8/Bio/Tools/EUtilities.pm:441
> >>> STACK: Bio::DB::EUtilities::get_ids /usr/lib/perl5/site_perl/5.8.8/Bio/DB/EUtilities.pm:363
> >>> STACK: get_desc.pl:32
> >>> -----------------------------------------------------------
> >>>
> >>>
> >>> Nice error message though :-)
> >>>
> >>>
> >>> --Russell
> >>>
> >>>> -----Original Message-----
> >>>> From: bioperl-l-bounces at lists.open-bio.org
> >>>> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Smithies, Russell
> >>>> Sent: Monday, 11 January 2010 10:05 a.m.
> >>>> To: 'Chris Fields'
> >>>> Cc: 'Bhakti Dwivedi'; 'Mark A. Jensen'; 'bioperl-l at lists.open-bio.org'
> >>>> Subject: Re: [Bioperl-l] how to retrieve organism name from accession
> >>>> number?
> >>>>
> >>>> I've started to go off eUtils recently (not BioPerl's fault), as I've
> >>>> often found that with large queries, chunks of the resulting data are
> >>>> missing.
> >>>> For example, before Xmas I was creating species-specific databases by
> >>>> using eUtils to get a list of GI numbers back for a taxid, then
> >>>> retrieving the fasta sequences in chunks of 500.
> >>>> Very regularly, in the middle of the fasta there would be a message
> >>>> about a resource being unavailable, e.g.
> >>>> >test_sequence_1
> >>>> TACGATCATCGCTResource UnavailableTACGACTCTGCT
> >>>> >test_sequence_2
> >>>> TACGTACTACGATCGATCATCACTATCGTCATACTACTACTGACT
> >>>>
> >>>> Often this wasn't detected until formatdb complained about invalid
> >>>> characters.
> >>>> Inquiries to NCBI as to why this was happening and what to do about it
> >>>> returned stupid answers ("do each sequence manually thru the web
> >>>> interface", or "use eUtils").
> >>>> As we have a nice fast network connection, I now prefer to download
> >>>> very large gzip files (i.e. all of refseq) and extract what I need.
> >>>>
> >>>> I can't help but think that NCBI could solve a lot of problems if they
> >>>> gzipped the output from eUtils queries - it's something I've requested
> >>>> regularly for the last 5 years or so!!
> >>>>
> >>>> --Russell
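One way to catch the embedded "Resource Unavailable" text described above before formatdb does is to scan each downloaded chunk for sequence lines that contain characters outside the expected alphabet. What follows is only a minimal sketch, not anything provided by NCBI or BioPerl: the sub name find_bad_fasta_lines and the default alphabet (IUPAC nucleotide codes plus the gap character) are assumptions made for illustration, and the sample data is the example from the message.

```perl
use strict;
use warnings;

# Scan FASTA text and report sequence lines that contain characters outside
# the allowed alphabet. Returns a list of [line_number, header, line] triples.
# The sub name and the default alphabet are illustrative assumptions.
sub find_bad_fasta_lines {
    my ($fh, $allowed) = @_;
    # Default: IUPAC nucleotide codes plus the gap character, case-insensitive.
    $allowed ||= qr/^[ACGTURYSWKMBDHVN-]*$/i;
    my $header = '';
    my @problems;
    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;      # strip Unix or DOS line endings
        if ($line =~ /^>/) {       # header line: remember it, do not check it
            $header = $line;
            next;
        }
        push @problems, [ $., $header, $line ] unless $line =~ $allowed;
    }
    return @problems;
}

# Demo on the example from the message, read from an in-memory filehandle.
my $fasta = join '',
    ">test_sequence_1\n",
    "TACGATCATCGCTResource UnavailableTACGACTCTGCT\n",
    ">test_sequence_2\n",
    "TACGTACTACGATCGATCATCACTATCGTCATACTACTACTGACT\n";
open my $fh, '<', \$fasta or die "in-memory open failed: $!";
for my $problem (find_bad_fasta_lines($fh)) {
    my ($lineno, $header, $line) = @$problem;
    print "line $lineno ($header): unexpected characters in '$line'\n";
}
close $fh;
```

For protein data a different pattern would have to be passed as the second argument, and the check is weaker there: most letters are valid residue codes, so it is mainly the embedded space that gives the message away. A chunk that fails the check could simply be fetched again.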
> >>>>
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Chris Fields [mailto:cjfields at illinois.edu]
> >>>>> Sent: Monday, 11 January 2010 9:50 a.m.
> >>>>> To: Smithies, Russell
> >>>>> Cc: 'Mark A. Jensen'; 'Bhakti Dwivedi'; 'bioperl-l at lists.open-bio.org'
> >>>>> Subject: Re: [Bioperl-l] how to retrieve organism name from accession
> >>>>> number?
> >>>>>
> >>>>> One could also use Bio::DB::Taxonomy, which indexes the same files or
> >>>>> (alternatively) makes the eutil calls (see Bio::DB::Taxonomy POD for
> >>>>> the details).
> >>>>>
> >>>>> chris
> >>>>>
> >>>>> On Jan 10, 2010, at 2:34 PM, Smithies, Russell wrote:
> >>>>>
> >>>>>> An alternate non-BioPerly way (that may be faster given NCBI's
> >>>>>> flakiness lately) would be to download the gi_taxid_nucl.zip or
> >>>>>> gi_taxid_prot.zip files from ftp://ftp.ncbi.nih.gov/pub/taxonomy/,
> >>>>>> load them into a hash and do lookups.
> >>>>>> In that same dir, taxdump.tar.gz contains a file called names.dmp
> >>>>>> which lists taxids and descriptions (and synonyms).
> >>>>>>
> >>>>>> If it was me, I'd split gi_taxid_nucl and names.dmp into hashes so I
> >>>>>> could do this:
> >>>>>>
> >>>>>> my $taxid    = $gi_taxid_nucl{$gi};   # keyed by GI number, not accession
> >>>>>> my $org_name = $names{$taxid};
> >>>>>>
> >>>>>> --Russell
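The local-lookup idea above could be sketched roughly as follows. This is an illustration under stated assumptions rather than a tested pipeline: it assumes the unpacked gi_taxid dump holds one "gi<TAB>taxid" pair per line (so it is keyed by GI number and accessions would need converting to GIs first), and that names.dmp separates fields with "<TAB>|<TAB>" and ends each row with "<TAB>|". The sub names are made up, and the GI numbers in the demo are invented; only the taxid/name pair is real.

```perl
use strict;
use warnings;

# Read "gi<TAB>taxid" lines and keep only the GIs listed in %$wanted,
# so the (very large) dump never has to sit in memory as a whole.
sub load_gi_taxid {
    my ($fh, $wanted) = @_;
    my %taxid_for;
    while (my $line = <$fh>) {
        chomp $line;
        my ($gi, $taxid) = split /\t/, $line;
        next unless defined $taxid && $wanted->{$gi};
        $taxid_for{$gi} = $taxid;
    }
    return \%taxid_for;
}

# Read names.dmp rows (taxid, name, unique name, name class) and keep only
# the rows whose name class is "scientific name", skipping synonyms.
sub load_scientific_names {
    my ($fh) = @_;
    my %name_for;
    while (my $line = <$fh>) {
        chomp $line;
        $line =~ s/\t\|$//;                  # drop the row terminator
        my ($taxid, $name, undef, $class) = split /\t\|\t/, $line;
        next unless defined $class && $class eq 'scientific name';
        $name_for{$taxid} = $name;
    }
    return \%name_for;
}

# Demo with in-memory data; real use would open the unpacked dump files.
my $gi_dmp    = "1001\t9606\n1002\t10090\n";    # made-up GI numbers
my $names_dmp = join '',
    "9606\t|\thuman\t|\t\t|\tgenbank common name\t|\n",
    "9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n";
open my $gi_fh,    '<', \$gi_dmp    or die "in-memory open failed: $!";
open my $names_fh, '<', \$names_dmp or die "in-memory open failed: $!";

my $taxid_for = load_gi_taxid($gi_fh, { 1001 => 1 });
my $name_for  = load_scientific_names($names_fh);
for my $gi (sort keys %$taxid_for) {
    my $name = $name_for->{ $taxid_for->{$gi} };
    print "$gi\t", (defined $name ? $name : 'unknown'), "\n";
}
```

Filtering on a set of wanted GIs while reading is a design choice for memory: loading the whole gi_taxid file into a Perl hash may need more RAM than is comfortable, whereas 30,000 entries are trivial.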
> >>>>>>
> >>>>>>
> >>>>>>> -----Original Message-----
> >>>>>>> From: bioperl-l-bounces at lists.open-bio.org
> >>>>>>> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Mark A. Jensen
> >>>>>>> Sent: Saturday, 26 December 2009 4:52 p.m.
> >>>>>>> To: Bhakti Dwivedi; bioperl-l at lists.open-bio.org
> >>>>>>> Subject: Re: [Bioperl-l] how to retrieve organism name from accession
> >>>>>>> number?
> >>>>>>>
> >>>>>>> Bhakti,
> >>>>>>> The following example (using EUtilities) may serve your purpose:
> >>>>>>>
> >>>>>>> use Bio::DB::EUtilities;
> >>>>>>>
> >>>>>>> my (%taxa, @taxa);
> >>>>>>> my (%names, %idmap);
> >>>>>>>
> >>>>>>> # these are protein ids; nuc ids will work by changing
> >>>>>>> # -dbfrom => 'nucleotide' (probably)
> >>>>>>>
> >>>>>>> my @ids = qw(1621261 89318838 68536103 20807972 730439);
> >>>>>>>
> >>>>>>> my $factory = Bio::DB::EUtilities->new(-eutil => 'elink',
> >>>>>>>                                     -db => 'taxonomy',
> >>>>>>>                                     -dbfrom => 'protein',
> >>>>>>>                                     -correspondence => 1,
> >>>>>>>                                     -id => \@ids);
> >>>>>>>
> >>>>>>> # iterate through the LinkSet objects
> >>>>>>> while (my $ds = $factory->next_LinkSet) {
> >>>>>>>  $taxa{($ds->get_submitted_ids)[0]} = ($ds->get_ids)[0]
> >>>>>>> }
> >>>>>>>
> >>>>>>> @taxa = grep { defined } @taxa{@ids};   # skip ids with no taxonomy link
> >>>>>>>
> >>>>>>> $factory = Bio::DB::EUtilities->new(-eutil => 'esummary',
> >>>>>>>      -db    => 'taxonomy',
> >>>>>>>      -id    => \@taxa );
> >>>>>>>
> >>>>>>> while (local $_ = $factory->next_DocSum) {
> >>>>>>>  $names{($_->get_contents_by_name('TaxId'))[0]} =
> >>>>>>> ($_->get_contents_by_name('ScientificName'))[0];
> >>>>>>> }
> >>>>>>>
> >>>>>>> foreach (@ids) {
> >>>>>>>  $idmap{$_} = $names{$taxa{$_}};
> >>>>>>> }
> >>>>>>>
> >>>>>>> # %idmap is
> >>>>>>> #    1621261 => 'Mycobacterium tuberculosis H37Rv'
> >>>>>>> #    20807972 => 'Thermoanaerobacter tengcongensis MB4'
> >>>>>>> #    68536103 => 'Corynebacterium jeikeium K411'
> >>>>>>> #    730439 => 'Bacillus caldolyticus'
> >>>>>>> #    89318838 => undef    (this record has been removed from the db)
> >>>>>>>
> >>>>>>> 1;
> >>>>>>>
> >>>>>>> You probably will need to break up your 30000 into chunks
> >>>>>>> (say, 1000-3000 each), and do the above on each chunk with a
> >>>>>>>
> >>>>>>> sleep 3;
> >>>>>>>
> >>>>>>> or so separating the queries.
> >>>>>>> MAJ
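The chunk-and-sleep advice above, combined with the transient eutils failures reported elsewhere in this thread, might be wrapped up along these lines. This is a hedged sketch: chunk_ids and with_retries are hypothetical helper names, the retry count and delay are arbitrary, and the code reference passed to with_retries would have to contain the elink/esummary calls from the example above (replaced here by a stand-in so the block runs without network access). It relies on exceptions raised by BioPerl being catchable with eval.

```perl
use strict;
use warnings;

# Split a list of IDs into array refs of at most $size elements each.
sub chunk_ids {
    my ($ids, $size) = @_;
    die "chunk size must be a positive number\n" unless $size && $size > 0;
    my @remaining = @$ids;                   # work on a copy
    my @chunks;
    push @chunks, [ splice(@remaining, 0, $size) ] while @remaining;
    return @chunks;
}

# Call $code->() and return its result; if it dies, wait $delay seconds and
# try again, up to $tries attempts in total. The last error is re-thrown.
sub with_retries {
    my ($code, $tries, $delay) = @_;
    for my $attempt (1 .. $tries) {
        my @result;
        my $ok = eval { @result = $code->(); 1 };
        return @result if $ok;
        my $err = $@ || "unknown error\n";
        die $err if $attempt == $tries;      # out of attempts: re-throw
        warn "attempt $attempt failed, retrying: $err";
        sleep $delay if $delay;
    }
    return;
}

# Demo: the stand-in callback only counts IDs. In real use it would run the
# eutils queries for the chunk, with a pause (e.g. sleep 3) between chunks.
my @ids = (1 .. 7);
for my $chunk (chunk_ids(\@ids, 3)) {
    my ($count) = with_retries(sub { scalar @$chunk }, 3, 0);
    print "processed chunk of $count ids\n";
}
```

Retrying only helps with failures that surface as exceptions; a response that arrives but is silently incomplete would still need a separate sanity check on the returned data.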
> >>>>>>> ----- Original Message -----
> >>>>>>> From: "Bhakti Dwivedi" <bhakti.dwivedi at gmail.com>
> >>>>>>> To: <bioperl-l at lists.open-bio.org>
> >>>>>>> Sent: Friday, December 25, 2009 9:46 PM
> >>>>>>> Subject: [Bioperl-l] how to retrieve organism name from accession number?
> >>>>>>>
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> Does anyone know how to retrieve the "Source" or the "Species name"
> >>>>>>>> given the accession number using Bioperl?  I have 30,000 accession
> >>>>>>>> numbers for which I need to get the source organisms.  Any kind of
> >>>>>>>> help will be appreciated.
> >>>>>>>>
> >>>>>>>> Thanks
> >>>>>>>>
> >>>>>>>> BD
> >>>>>>>> _______________________________________________
> >>>>>>>> Bioperl-l mailing list
> >>>>>>>> Bioperl-l at lists.open-bio.org
> >>>>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >>>>>>>>
> >>>>>>>>




