Bioperl: NCBI Entrez queries and Perl file handling

Francis Ouellette francis@cmmt.ubc.ca
Wed, 2 Jun 1999 09:55:18 -0700 (PDT)


> Not strictly a bio-perl question, I know, but I didn't manage to get a
> helpful answer from NCBI, so I thought I'd ask here.

Not sure what NCBI told you (but I'm sure they had a good answer for
this one :-)

The problem with this query is that L34657 is the last segment of a
segmented set, and what you are getting back is every segment in that
set. I seem to recall that there is a flag in the Entrez query engine
to indicate that you want only this specific sequence rather than the
whole seg-set (which is the way this data is stored).

What I would ask our friends at NCBI is:

"Can I get a FASTA file for a single record that is part of a seg-set?"

I'm sure you know that L34657 is exon 16 of the set, so it covers only
part of the sequence encoded by the CDS on this record.
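
Until that flag turns up, one stopgap would be to filter the FASTA text
on the Perl side and keep only the record you asked for. A minimal
sketch, assuming the accession appears as a whole word in each FASTA
header line (the sub name pick_fasta_record and the sample records are
made up for illustration, and the header layout is a guess, not
something NCBI guarantees):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the single FASTA record whose header line mentions $accession,
# or nothing (undef in scalar context) if no record matches.
# $fasta is the full multi-record text as one string.
sub pick_fasta_record {
    my ($fasta, $accession) = @_;

    # Split in front of every '>' that starts a line, so each chunk is
    # one record; grep drops any leading text that is not a record.
    my @records = grep { /^>/ } split /^(?=>)/m, $fasta;

    for my $record (@records) {
        my ($header) = $record =~ /^(>[^\n]*)/;
        # \b stops L34657 from matching a longer ID such as L346570.
        return $record if $header =~ /\b\Q$accession\E\b/;
    }
    return;
}

# Made-up sample data standing in for the text fetched with LWP.
my $sample = <<'END';
>gi|1|gb|X00001|EXAMPLE1 hypothetical segment 1
ACGTACGT
>gi|2|gb|L34657|EXAMPLE2 hypothetical segment 2
GGCCTTAA
TTAA
END

my $wanted = pick_fasta_record($sample, 'L34657');
print defined $wanted ? $wanted : "no record found\n";
```

In the real script, $sample would be replaced by the body of the LWP
response (for instance $response->content). If the reply wraps the
FASTA in HTML, that would need stripping before the split.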

The full set is available (in GenBank flat file format, GBFF) from:

http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?uid=598222&form=6&db=n&Dopt=g

(note that www4 can now simply be www) 

but to really appreciate this record, look at the graphical view:

http://www.ncbi.nlm.nih.gov/cgi-bin/Entrez/referer?/htbin-post/Entrez/query%3fdb=n&form=6&uid=598222&Dopt=z

(but I digress :-)

cheers,

f.


--
| B.F. Francis Ouellette                     tel: (604) 875-3815  | 
| Director, Bioinformatics Core Facility     fax: (604) 875-3840  | 
| Centre for Molecular Medicine and Therapeutics, UBC, Canada     |
| francis@cmmt.ubc.ca                     http://www.cmmt.ubc.ca  |



> I have a Perl script and I'm using LWP to handle the retrieval of
> sequences from NCBI. One problem I'm finding is that I don't always get
> just the one sequence I request; I get a load of associated ones I don't
> want. For example, using the Entrez query below to try to get the
> nucleotide sequence for L34657:
> 
> http://www4.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=n&form=6&uid=L34657&dopt=f
> 
> What I get back is 15 or more sequences (introns, exons, etc.) with
> L34657 at the end. How can I configure this to give me just the single
> sequence I requested and not all the other associated introns and
> exons? I tried things like dispmax=1 to no avail; with FASTA format I
> always get all of the sequences.
> 
> If I change the output to GenBank using the dopt=g option, then I get
> just the sequence I want. I could always just parse the GenBank format
> instead, but I'd rather not have to unless it's really necessary. Is
> there a simple way I can get just the one specified sequence and not
> everything else? Am I missing some options here?
> 


=========== Bioperl Project Mailing List Message Footer =======
Project URL: http://bio.perl.org/
For info about how to (un)subscribe, where messages are archived, etc:
http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/vsns-bcd-perl.html
====================================================================