[Bioperl-l] Fetching genomic sequences based on HUGO names or GeneIDs

Chris Fields cjfields at uiuc.edu
Fri Feb 17 23:02:02 UTC 2006


Brian,

I added some sample code to the page.  See what you think.

Christopher Fields
Postdoctoral Researcher - Switzer Lab
Dept. of Biochemistry
University of Illinois Urbana-Champaign 

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Chris Fields
> Sent: Thursday, February 16, 2006 4:46 PM
> To: 'Brian Osborne'
> Cc: 'Harry Mangalam'; 'bioperl-l'
> Subject: Re: [Bioperl-l] Fetching genomic sequences based on HUGO names
> or GeneIDs
> 
> If I know the start, end, and strand info for a list of features (personal
> preference, since I use Bio::SeqFeature::Generic with the RNAMotif I drew
> up), couldn't I try pulling out the surrounding region?  My thought is
> this, though I haven't coded it yet:
> 
> 1)  Draw up a list of SeqFeatures, with accession, start, stop coordinates
> (array of hashes) based off what I get from RNAMotif objects.
> 2)  Pull the sequence from NCBI using Bio::DB::GenBank with x bp upstream
> and downstream, one at a time, using get_Seq_by_id().  I could add a sleep
> in there somewhere to not tick off the NCBI curators.
> 
> Reason I'm interested in this is b/c I want to know where the RNA motif is
> in context to surrounding features. If it is very close to a coding region,
> then the motif likely indicates translational regulation.  Further away may
> indicate transcriptional termination or another mechanism.
> 
> The files returned should have the features included as long as they are in
> the full length GenBank record.  I tried it out using the web form but not
> through Bio::DB::GenBank yet.  If I can get it to work I'll add it to the
> page.
> 
> Christopher Fields
> Postdoctoral Researcher - Switzer Lab
> Dept. of Biochemistry
> University of Illinois Urbana-Champaign
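
The two steps above could be sketched roughly as follows. This is only a
sketch: it assumes the installed Bio::DB::GenBank accepts the -seq_start and
-seq_stop options, and the names padded_range and fetch_regions, the
'ACCESSION_1' placeholder, and the flank and pause values are all made up for
illustration. The network call is left commented out.

```perl
use strict;
use warnings;

# Hypothetical feature list: accession plus 1-based start/stop of each hit.
my @hits = (
    { acc => 'ACCESSION_1', start => 1200, stop => 1260 },
);

# Widen a 1-based range by $flank bp on each side, never going below 1.
# The upper bound is not clamped because the record length is unknown here.
sub padded_range {
    my ($start, $stop, $flank) = @_;
    ($start, $stop) = ($stop, $start) if $start > $stop;
    my $from = $start - $flank;
    $from = 1 if $from < 1;
    return ($from, $stop + $flank);
}

# Fetch each padded region as a GenBank record, pausing between requests.
sub fetch_regions {
    my ($flank, $pause, @features) = @_;
    require Bio::DB::GenBank;    # loaded lazily; padded_range() needs no BioPerl
    my @records;
    for my $f (@features) {
        my ($from, $to) = padded_range($f->{start}, $f->{stop}, $flank);
        # Assumption: this BioPerl version supports -seq_start / -seq_stop.
        my $gb = Bio::DB::GenBank->new(
            -format    => 'genbank',
            -seq_start => $from,
            -seq_stop  => $to,
        );
        push @records, $gb->get_Seq_by_acc($f->{acc});
        sleep $pause;            # be polite to NCBI
    }
    return @records;
}

# my @seqs = fetch_regions(500, 3, @hits);   # needs BioPerl and network access
```

Each returned record should then carry whatever features fall inside the
padded window, which is what the distance-to-coding-region check needs.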
> 
> 
> > -----Original Message-----
> > From: Brian Osborne [mailto:osborne1 at optonline.net]
> > Sent: Thursday, February 16, 2006 4:19 PM
> > To: Chris Fields
> > Cc: Harry Mangalam; bioperl-l
> > Subject: Re: [Bioperl-l] Fetching genomic sequences based on HUGO names
> > or GeneIDs
> >
> > Chris,
> >
> > Yes. The question now is where to easily get the coordinates.
> >
> > Brian O.
> >
> >
> > On 2/16/06 7:52 AM, "Chris Fields" <cjfields at uiuc.edu> wrote:
> >
> > > I think a method was recently implemented in Bio::DB::GenBank to
> > > retrieve a segment of DNA given start and end coordinates in GenBank
> > > format; that should contain the features you need.  I requested it
> > > ~Nov-Dec in the mailing list but didn't get a chance to test it.
> > > Would that help?
> > >
> > > On Feb 15, 2006, at 11:16 PM, Brian Osborne wrote:
> > >
> > >> Harry,
> > >>
> > >> It's not clear to me that NCBI's eutils offers this capability
> > >> directly. You can probably download Entrez Gene entries and parse
> > >> them for coordinates, but I know of no way to remotely retrieve
> > >> genomic sequences like this from NCBI (ENSEMBL API perhaps?). What I
> > >> had in mind uses the local approach that some of us favor. To prove
> > >> to myself that this is simple to do, I wrote a script that I just
> > >> added to examples/tools. It's called extract_genes.pl and it's based
> > >> on Bio::DB::Fasta. Download the sequence files for a given species to
> > >> some dir, download Entrez Gene's gene2accession file, and run. It
> > >> creates and stores a hash for lookups, so it won't read gene2accession
> > >> each time it runs.
> > >>
> > >> Brian O.
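
The lookup-hash idea described above could look something like this. It is
not the code in extract_genes.pl, just an illustration. The column positions
in %COL are an assumption about the gene2accession layout and should be
checked against the README that ships with the file (including whether its
positions are 0-based); index_gene2accession, load_index, and the cache path
argument are hypothetical names. Keeping only the first genomic placement per
GeneID is a simplification.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);

# Assumed 0-based column positions in a tab-delimited gene2accession line.
my %COL = (gene_id => 1, genomic_acc => 7, start => 9, end => 10,
           orientation => 11);

# Build GeneID => { acc, start, end, strand } from an open filehandle,
# skipping comment lines and rows without a genomic placement ('-').
sub index_gene2accession {
    my ($fh) = @_;
    my %genes;
    while (my $line = <$fh>) {
        next if $line =~ /^#/;
        chomp $line;
        my @f = split /\t/, $line;
        next if @f <= $COL{orientation};    # too few columns
        my ($acc, $start, $end) = @f[ @COL{qw(genomic_acc start end)} ];
        next if grep { $_ eq '-' } $acc, $start, $end;
        # Keep the first placement seen for each GeneID.
        $genes{ $f[ $COL{gene_id} ] } ||= {
            acc    => $acc,
            start  => $start,
            end    => $end,
            strand => $f[ $COL{orientation} ],
        };
    }
    return \%genes;
}

# Reuse a stored index when present; otherwise parse once and store it.
sub load_index {
    my ($table, $cache) = @_;
    return retrieve($cache) if -e $cache;
    open my $fh, '<', $table or die "Cannot open $table: $!";
    my $genes = index_gene2accession($fh);
    close $fh;
    store($genes, $cache);
    return $genes;
}
```

The stored index is what avoids re-reading the large gene2accession file on
every run; delete the cache file to force a rebuild after a fresh download.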
> > >>
> > >>
> > >> On 2/14/06 12:15 PM, "Harry Mangalam" <hjm at tacgi.com> wrote:
> > >>
> > >>> Hi Brian,
> > >>>
> > >>> Thanks very much for the pointers and the speed of your reply, and
> > >>> apologies for the speed of mine.
> > >>>
> > >>> This looks good, but what I was looking for was a bioP approach for
> > >>> hooking to an API at NCBI or EBI so I could get this info and seqs
> > >>> from them.  In this case, speed of retrieval is not critical and I'd
> > >>> rather not download the entirety of the sequences to a local disk to
> > >>> hack at them.
> > >>>
> > >>> I've determined a screen-scraping approach to get them and could
> > >>> script that, but I thought that bioP had a method for using NCBI's
> > >>> external APIs, tho it may be that my memory is faulty or the approach
> > >>> is no longer supported due to overload.
> > >>>
> > >>> Does NCBI make such APIs available anymore?  I searched a bit for docs
> > >>> on them but couldn't find anything (unless it's buried in the NCBI
> > >>> toolkit, which I haven't started to excavate).
> > >>>
> > >>> Failing that, would SEALS provide such a service? Any PerlPinipeds
> > >>> listening?
> > >>>
> > >>> Harry
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> On Sunday 12 February 2006 08:37, Brian Osborne wrote:
> > >>>> Harry,
> > >>>>
> > >>>> Hope you're doing well. The approach could be based on
> > >>>> Bio::DB::Fasta. So, from its documentation:
> > >>>>
> > >>>>   use Bio::DB::Fasta;
> > >>>>
> > >>>>   # create database from directory of fasta files
> > >>>>   my $db      = Bio::DB::Fasta->new('/path/to/fasta/files');
> > >>>>
> > >>>>   # simple access (for those without Bioperl)
> > >>>>   my $seq      = $db->seq('CHROMOSOME_I',4_000_000 => 4_100_000);
> > >>>>   my $revseq   = $db->seq('CHROMOSOME_I',4_100_000 => 4_000_000);
> > >>>>   my @ids     = $db->ids;
> > >>>>   my $length   = $db->length('CHROMOSOME_I');
> > >>>>   my $alphabet = $db->alphabet('CHROMOSOME_I');
> > >>>>   my $header   = $db->header('CHROMOSOME_I');
> > >>>>
> > >>>>   # Bioperl-style access (alternative to the above, reusing $db)
> > >>>>   my $obj     = $db->get_Seq_by_id('CHROMOSOME_I');
> > >>>>   my $fullseq = $obj->seq;
> > >>>>   my $subseq  = $obj->subseq(4_000_000 => 4_100_000);
> > >>>>
> > >>>> Do you already have the offsets?
> > >>>>
> > >>>> Brian O.
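
For the "plus & minus some offset" part, here is a minimal sketch built on
the seq() and length() calls shown above. flanked_seq and its arguments are
hypothetical; it assumes 1-based inclusive coordinates and a strand of '+' or
'-', and relies on the documented behaviour that seq() returns the reverse
complement when start is greater than stop.

```perl
use strict;
use warnings;

# Return the locus plus $offset bp on each side, in the orientation of the
# gene. $db is anything with length($id) and seq($id, $start, $stop) methods,
# such as a Bio::DB::Fasta handle.
sub flanked_seq {
    my ($db, $id, $start, $end, $strand, $offset) = @_;
    my $len = $db->length($id);
    die "Unknown sequence id '$id'\n" unless defined $len;

    # Extend the window, then clamp it to the ends of the sequence.
    my $from = $start - $offset;
    my $to   = $end + $offset;
    $from = 1    if $from < 1;
    $to   = $len if $to > $len;

    # Passing start > stop asks for the reverse complement.
    return $strand eq '-' ? $db->seq($id, $to, $from)
                          : $db->seq($id, $from, $to);
}
```

With a real index this would be called along the lines of
flanked_seq($db, 'CHROMOSOME_I', 4_000_000, 4_100_000, '-', 1000), where $db
comes from Bio::DB::Fasta->new('/path/to/fasta/files').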
> > >>>>
> > >>>> On 2/12/06 1:46 AM, "Harry Mangalam" <hjm at tacgi.com> wrote:
> > >>>>> Hi All,
> > >>>>>
> > >>>>> After perusing the tutorial and other docs for an evening, I still
> > >>>>> can't find the answer to this.  Forgive me if I've missed something
> > >>>>> obvious.
> > >>>>>
> > >>>>> This should not be a novel request, but I've not found it answered.
> > >>>>> If bioperl isn't the best way to do this, I'd be grateful for a
> > >>>>> pointer to a better way, especially if it includes an illuminating
> > >>>>> bit of code.
> > >>>>>
> > >>>>> The problem is to retrieve genomic sequences plus & minus some offset
> > >>>>> from a locus determined by HUGO keyword or GeneID.  This would be a
> > >>>>> common followup chore for some extra analysis from a gene expression
> > >>>>> expt.  Or maybe this is in the DBFetch routines, but I've missed the
> > >>>>> sequence type to specify...?
> > >>>>>
> > >>>>>
> > >>>>> TIA!
> > >>
> > >>
> > >> _______________________________________________
> > >> Bioperl-l mailing list
> > >> Bioperl-l at lists.open-bio.org
> > >> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> > >
> > > Christopher Fields
> > > Postdoctoral Researcher
> > > Lab of Dr. Robert Switzer
> > > Dept of Biochemistry
> > > University of Illinois Urbana-Champaign
> > >
> > >
> > >



