[Bioperl-l] retrieve refseq ids from UIDs
Smithies, Russell
Russell.Smithies at agresearch.co.nz
Tue Jun 28 20:54:57 UTC 2011
It's fairly common for NCBI to return partial or incomplete data: often half a record is missing, or requests time out at random.
If you have a lot of records, it may be better to download all the data from the FTP site and then parse it locally. This is what we tend to do if there are more than a few hundred queries. I'd like to point out that it's NCBI's problem, not the BioPerl code at fault. You'll run into the same problems if you use NCBI's Perl API (http://www.ncbi.nlm.nih.gov/books/NBK1058/) directly.
Take a look at the gene2accession, gene2refseq, and gene_info data at ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/ and at the taxonomy data at ftp://ftp.ncbi.nih.gov/pub/taxonomy/ if you need to decode the taxids without doing web queries.
It's much easier and faster to download these files, index them, then search them locally than to run queries against NCBI.
And as all the data is local, you don't need to worry about connection problems.
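A minimal sketch of that index-then-search approach follows. It assumes the tab-delimited gene2refseq layout in which the second column is the GeneID, the fourth the RNA accession.version and the sixth the protein accession.version, with "-" marking a missing value; check the README in the same FTP directory before relying on this. The sub name index_gene2refseq and the sample rows are invented for illustration only.

```perl
use strict;
use warnings;

# Assumed column layout (0-based): 1 = GeneID, 3 = RNA accession.version,
# 5 = protein accession.version; "-" marks a missing value.
# index_gene2refseq is a hypothetical helper, not part of BioPerl.
sub index_gene2refseq {
    my ($fh, $wanted) = @_;        # $wanted: hashref of GeneIDs to keep
    my %index;
    while (my $line = <$fh>) {
        next if $line =~ /^#/;     # skip the header line
        chomp $line;
        my @col = split /\t/, $line;
        next if @col < 6;          # ignore short or malformed rows
        my ($gene_id, $rna, $protein) = @col[1, 3, 5];
        next unless $wanted->{$gene_id};
        next if $rna eq '-';       # rows without a transcript
        $index{$gene_id}{$rna} = $protein;
    }
    return \%index;
}

# Invented sample rows, only to show the shape of the data.
my $sample = join "\n",
    "#tax_id\tGeneID\tstatus\tRNA_acc\tRNA_gi\tprotein_acc\tprotein_gi",
    "9606\t1111\tREVIEWED\tNM_TEST0001.1\t-\tNP_TEST0001.1\t-",
    "9606\t1111\tREVIEWED\tNM_TEST0002.1\t-\tNP_TEST0002.1\t-",
    "9606\t1111\tREVIEWED\t-\t-\t-\t-",
    "9606\t2222\tREVIEWED\tNM_TEST0003.1\t-\tNP_TEST0003.1\t-",
    "";

open my $fh, '<', \$sample or die "Cannot open in-memory sample: $!";
my $index = index_gene2refseq($fh, { 1111 => 1 });
close $fh;

# Print GeneID, transcript and protein accession, one pair per line.
for my $gene_id (sort keys %$index) {
    for my $rna (sort keys %{ $index->{$gene_id} }) {
        print join("\t", $gene_id, $rna, $index->{$gene_id}{$rna}), "\n";
    }
}
```

In practice you would open the decompressed gene2refseq file instead of the in-memory string. Because each transcript stays attached to its GeneID, there is no need to query one ID at a time to keep track of where each result came from.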
--Russell
> -----Original Message-----
> From: carandraug at gmail.com [mailto:carandraug at gmail.com] On Behalf Of
> Carnë Draug
> Sent: Tuesday, 28 June 2011 11:41 p.m.
> To: Smithies, Russell
> Cc: bioperl mailing list
> Subject: Re: [Bioperl-l] retrieve refseq ids from UIDs
>
> On 28 June 2011 04:20, Smithies, Russell
> <Russell.Smithies at agresearch.co.nz> wrote:
> > I assume you've had a look at the cookbook
> http://www.bioperl.org/wiki/HOWTO:EUtilities_Cookbook
> > Also take a look at elink, it might do what you are after
> http://www.bioperl.org/wiki/HOWTO:EUtilities_Cookbook#I_want_a_list_of_database_.27x.27_UIDs_that_are_linked_from_a_list_of_database_.27y.27_UIDs
> > The Scrapbook is a good place to get ideas as well
> http://www.bioperl.org/wiki/Category:Scrapbook
>
> Hi Russell,
>
> thank you for your answer. I had indeed looked at the cookbook. I had
> never tried elink, and it works sometimes. I have a couple of problems
> with it, though.
>
> Basically, using that approach, I have to get the UID from gene and
> use elink to get the transcripts by searching for what links to
> 'nucleotide' (with link name gene_nuccore_refseqrna). Then I have to
> search for where each of them links to in the protein db. Also, if I
> search with an array of UIDs, I get all the linked UIDs back in one
> list, so I have to query a single UID at a time to know where each
> result comes from. This holds both for finding which nucleotides come
> from a gene and which proteins come from a nucleotide. This implies a
> lot of connections, which may be why I sometimes get the warning
>
> --------------------- WARNING ---------------------
> MSG: No linksets returned
> ---------------------------------------------------
>
> Does NCBI have some sort of mechanism to avoid flooding with requests?
> Here's the code I used http://pastebin.com/DsCh2JuL
>
> Also, the many connections make it slower. There must be a simpler
> way, since one of the pieces of code I showed in my first mail
>
> use 5.010;                 # enables say
> use Bio::DB::EUtilities;
>
> my @ids = qw(9555);        # Gene UID(s) to fetch
> my $factory = Bio::DB::EUtilities->new(-eutil => 'efetch',
>                                        -db    => 'gene',
>                                        -id    => \@ids,
>                                        );
> say $factory->get_Response->content;
>
> does retrieve a weird structure with all that info. Isn't there a
> method to access this data properly? Or maybe use some other module?
> Thanks,
> Carnë