[Bioperl-l] get CDS start site for entry in NCBI

Thu Apr 18 14:13:05 UTC 2013

I am a noob with BioPerl, so I don't know exactly how to implement this
there, but from an NCBI E-utilities perspective, you can get many
records at once.  First, use ESearch to get a list of IDs:
  http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nuccore&term=AT4g08500&retmode=xml
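
As a rough sketch of that ESearch step (in Python with only the standard
library, not BioPerl): the helper names esearch_url and parse_id_list are
made up for illustration, and SAMPLE_RESPONSE is a trimmed, hand-written
stand-in for an eSearchResult reply, not output captured from NCBI.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

EUTILS_BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, db="nuccore"):
    """Build the ESearch URL for a search term (hypothetical helper)."""
    query = urlencode({"db": db, "term": term, "retmode": "xml"})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


def parse_id_list(esearch_xml):
    """Return the IDs found under <IdList> in an eSearchResult document."""
    root = ET.fromstring(esearch_xml)
    return [el.text for el in root.findall("./IdList/Id")]


# Hand-written, trimmed example of the reply format (an assumption for
# illustration; a real reply carries more elements than this).
SAMPLE_RESPONSE = """<eSearchResult>
  <Count>2</Count>
  <IdList>
    <Id>332656411</Id>
    <Id>240256243</Id>
  </IdList>
</eSearchResult>"""

if __name__ == "__main__":
    print(esearch_url("AT4g08500"))
    print(parse_id_list(SAMPLE_RESPONSE))
    # To query NCBI for real, something like
    # urllib.request.urlopen(esearch_url("AT4g08500")).read()
    # would supply the XML to parse_id_list.
```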

Then, with EFetch
(http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EFetch), you can
give a list of IDs right in the request:
  http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=332656411,240256243&retmode=xml
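
A minimal sketch of building that kind of multi-ID EFetch request, again
in plain Python. The names efetch_url and batches are hypothetical, and
the default batch size of 200 is only an assumption about a sensible
upper bound for IDs in one GET URL; adjust it to whatever NCBI currently
recommends.

```python
from urllib.parse import urlencode

EUTILS_BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(ids, db="nuccore"):
    """Build one EFetch URL that asks for every ID in `ids` at once."""
    params = {"db": db, "id": ",".join(str(i) for i in ids), "retmode": "xml"}
    # safe="," leaves the commas between IDs unescaped (no %2C).
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params, safe=',')}"


def batches(ids, size=200):
    """Split a long ID list into chunks of at most `size` IDs.

    The default of 200 is an assumed limit, not a documented constant.
    """
    return [ids[i:i + size] for i in range(0, len(ids), size)]


if __name__ == "__main__":
    for chunk in batches([332656411, 240256243]):
        print(efetch_url(chunk))
```

Collecting all the IDs first and fetching them in a few batches keeps the
number of round trips to NCBI small, instead of one request per link.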

If the list is long, then set usehistory=y in your esearch:
  http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nuccore&term=AT4g08500&retmode=xml&usehistory=y
and from that result, grab the WebEnv and QueryKey values and pass them
to your efetch request as WebEnv and query_key, in place of the id list.
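
The history route might look roughly like this in Python. Assumptions to
note: the helper names are invented, the sample reply is hand-written
with a placeholder WebEnv string, usehistory=y is the value given in the
E-utilities documentation, and EFetch is given the QueryKey from the
search result (as query_key) together with the WebEnv.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

EUTILS_BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_history_url(term, db="nuccore"):
    """ESearch URL that stores its results on the NCBI history server."""
    params = {"db": db, "term": term, "retmode": "xml", "usehistory": "y"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def parse_history(esearch_xml):
    """Return (WebEnv, QueryKey) from an eSearchResult document."""
    root = ET.fromstring(esearch_xml)
    return root.findtext("WebEnv"), root.findtext("QueryKey")


def efetch_history_url(web_env, query_key, db="nuccore", retstart=0, retmax=100):
    """EFetch URL that pulls one page of the stored result set."""
    params = {
        "db": db,
        "query_key": query_key,
        "WebEnv": web_env,
        "retstart": retstart,  # index of the first record in this page
        "retmax": retmax,      # page size (100 is an arbitrary choice)
        "retmode": "xml",
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# Hand-written example reply; EXAMPLE_WEBENV is a placeholder, not a real
# WebEnv string.
SAMPLE_RESPONSE = """<eSearchResult>
  <Count>2</Count>
  <QueryKey>1</QueryKey>
  <WebEnv>EXAMPLE_WEBENV</WebEnv>
</eSearchResult>"""

if __name__ == "__main__":
    web_env, query_key = parse_history(SAMPLE_RESPONSE)
    print(efetch_history_url(web_env, query_key))
```

Stepping retstart forward by retmax on each call pages through a long
result set without ever putting the ID list in the URL.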

I think all of these should be well supported by BioPerl.  Probably
look at https://metacpan.org/module/Bio::Tools::EUtilities.

Hope that helps!

Chris Maloney

On Wed, Apr 17, 2013 at 7:08 PM, Matthew McCormack
<mccormack at molbio.mgh.harvard.edu> wrote:
> I am not much of a Perl coder and I have a few questions.
>
>      First, I would like to write a script that will go to NCBI GenBank and
> get the base number for the start of the CDS region, e.g. 235 (given a
> particular accession number). I have looked at HOWTOs and documentation for
> Bio::SeqIO and Bio::DB::GenBank and I can cut and paste the examples and
> they work, but I cannot figure out how to get what I want: the CDS start
> site. I have difficulty knowing what all the methods and their options are
> for the seqio object and seq_object. Most of the examples seem to be using a
> file to get information and not a website.
>
>    Actually, what I have to start with is a TAIR locus number such as
> AT4g08500, but I cannot search on this at NCBI and come up with a unique
> entry. I may have to have a table of conversions from TAIR locus number to
> accession numbers.
>
>   Also, I was looking for a bit of advice. What I am doing is getting data
> off another web site. I have a script using the WWW::Mechanize module in
> which I can input a link and go to that webpage, and then go down a line of
> links (over 100) getting information from each link. Part of the
> information that I am getting is the base number of a binding site, and I
> want to know if that binding site is in the CDS. The start number is the
> start of the gene, so say if the binding site is 235, then I want to know if
> this is in the CDS. This data is not provided by the website, which is why I
> want to go to NCBI and get the start of the CDS. The data at NCBI for 'gene'
> has the same length as the first webpage, but also contains the beginning of
> the CDS, say 299, so with this information I can tell if the binding site is
> in the CDS. Do you think the best way to do this is to extract the info from
> the link on the first web page, then go to NCBI and extract the CDS, then
> back to the original web page and the next link, and so on, for a couple of
> hundred links? Or is there a better way? I am concerned about a script
> that will keep going back to NCBI.
>
> Matthew