Bioperl: NCBI Entrez queries and Perl file handling
Francis Ouellette
francis@cmmt.ubc.ca
Wed, 2 Jun 1999 09:55:18 -0700 (PDT)
> Not strictly a Bioperl question, I know, but I didn't manage to get a
> helpful answer from NCBI, so I thought I'd ask here.
Not sure what NCBI told you (but I'm sure they had a good answer for
this one :-)
The problem with this query is that L34657 is the last segment of a
segmented set, and what you are getting back is all of the segments. I
seem to recall that there is a flag in the Entrez query engine to
indicate that you don't want the whole seg-set (which is the way this
data is stored) but only this specific sequence.
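Until someone confirms that flag, one workaround is to filter the
returned FASTA text yourself. Below is a minimal Perl sketch, assuming
the accession appears in each record's header line. The sub name
extract_fasta_record is a hypothetical helper (not part of Bioperl or
LWP), and the sample records are invented placeholders rather than real
Entrez output.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bioperl or LWP): return the one FASTA
# record whose header line mentions $accession, or nothing if none does.
sub extract_fasta_record {
    my ($fasta, $accession) = @_;

    # Split just before every line that starts with '>', one chunk per record.
    for my $record (split /^(?=>)/m, $fasta) {
        next unless $record =~ /^>/;    # skip any text before the first header

        my ($header) = $record =~ /^(>[^\n]*)/;

        # Require non-alphanumeric neighbours so L34657 cannot match L346570.
        return $record
            if $header =~ /(?<![A-Za-z0-9])\Q$accession\E(?![A-Za-z0-9])/;
    }
    return;
}

# Invented placeholder records, NOT real Entrez output.
my $response = <<'END';
>gi|1|gb|X00001|placeholder segment one
ACGTACGT
>gi|2|gb|L34657|placeholder segment two
GGCCTTAA
TTAA
END

my $wanted = extract_fasta_record($response, 'L34657');
print defined $wanted ? $wanted : "no record for L34657\n";
```

In a real script, $response would be the body that LWP hands back for
the dopt=f query, so the rest of the retrieval code stays as it is.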
What I would ask our friends at NCBI is:
"Can I get a FASTA file for a single record that is part of a seg-set?"
I'm sure you know that L34657 is exon 16 of the set, so it holds only
part of the sequence that is encoded by the CDS on this record.
The full set is available (in GenBank flat file format, GBFF) from:
http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?uid=598222&form=6&db=n&Dopt=g
(note that www4 can now simply be www)
but to really appreciate this record, look at the graphical view:
http://www.ncbi.nlm.nih.gov/cgi-bin/Entrez/referer?/htbin-post/Entrez/query%3fdb=n&form=6&uid=598222&Dopt=z
(but I digress :-)
cheers,
f.
--
| B.F. Francis Ouellette tel: (604) 875-3815 |
| Director, Bioinformatics Core Facility fax: (604) 875-3840 |
| Centre for Molecular Medicine and Therapeutics, UBC, Canada |
| francis@cmmt.ubc.ca http://www.cmmt.ubc.ca |
> I have a Perl script and I'm using LWP to handle the retrieval of
> sequences from NCBI. One problem I'm finding is that I don't always get
> just the one sequence I request; I get a load of associated ones I don't
> want. For example, using the Entrez query below to try to get the
> nucleotide sequence for L34657:
>
> http://www4.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=n&form=6&uid=L34657&dopt=f
>
> What I get back is about 15+ sequences (introns and exons, etc.) with
> L34657 at the end. How can I configure this to give me just the single
> sequence I requested and not all the other associated introns and
> exons? I tried things like dispmax=1 to no avail; with FASTA format I
> always get all of the sequences.
>
> If I change the output to GenBank using the dopt=g option then I get
> just the sequence I want. I could always parse the GenBank format
> instead, but I'd rather not unless it's really necessary. Is there a
> simple way to get just the one specified sequence and not everything
> else? Am I missing some query options here?
>