[Bioperl-l] Fetching genomic sequences based on HUGO names or GeneIDs

Harry Mangalam hjm at tacgi.com
Thu Feb 16 23:10:59 UTC 2006


This is essentially what I want to do and my [only in pseudocode] approach is 
basically what you describe, except that currently I only have HUGO 
descriptors, not GenBank UIDs.  If you know of an index that lists both, that 
would be the entire shot.
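
For the GeneID side at least, the gene2accession file mentioned further down
the thread looks like a candidate index. A minimal sketch of turning it into
a lookup hash, assuming a tab-delimited layout with the GeneID in column 2,
the genomic accession.version in column 8, start and end in columns 10 and
11, orientation in column 12, and '-' for missing values (worth checking
against the README that ships with the file, including whether positions are
0- or 1-based). The function name and the sample rows are made up, and
mapping HUGO names to GeneIDs is not covered here.

```perl
use strict;
use warnings;

# Build GeneID => [ { acc, start, end, strand }, ... ] from gene2accession-style
# lines. The column positions are an assumption, not checked against the file.
sub index_gene2accession {
    my ($fh) = @_;
    my %by_gene;
    while (my $line = <$fh>) {
        next if $line =~ /^#/;                 # skip header/comment lines
        chomp $line;
        my @col = split /\t/, $line;
        next if @col < 12;                     # too short to hold a placement
        my ($gene_id, $acc, $start, $end, $orient) = @col[1, 7, 9, 10, 11];
        # Rows without a genomic placement carry '-' in these columns.
        next if $acc eq '-' || $start eq '-' || $end eq '-';
        push @{ $by_gene{$gene_id} }, {
            acc    => $acc,
            start  => $start,
            end    => $end,
            strand => $orient eq '-' ? -1 : 1, # anything but '-' is treated as plus
        };
    }
    return \%by_gene;
}

# Made-up rows standing in for the real file; none of these are real records.
my $data = join '', map { join("\t", @$_) . "\n" } (
    [ '#tax_id', 'GeneID', 'status', 'RNA_acc', 'RNA_gi', 'prot_acc',
      'prot_gi', 'genomic_acc', 'genomic_gi', 'start', 'end', 'orientation' ],
    [ 9606, 1001, '-', '-', '-', '-', '-', 'ACC_0001.1', '-', 1199, 1349, '+' ],
    [ 9606, 1002, '-', '-', '-', '-', '-', 'ACC_0002.1', '-', 149,  399,  '-' ],
    [ 9606, 1003, '-', '-', '-', '-', '-', '-',          '-', '-',  '-',  '?' ],
);
open my $fh, '<', \$data or die "cannot open in-memory data: $!";
my $index = index_gene2accession($fh);
close $fh;

for my $id (sort { $a <=> $b } keys %$index) {
    for my $loc (@{ $index->{$id} }) {
        print join("\t", $id, @{$loc}{qw(acc start end strand)}), "\n";
    }
}
```

A GeneID can appear on several rows, so each hash entry holds a list of
placements rather than a single one.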

I'm also interested in tracking transcriptional control elements and 
cross-correlating them, which is why I wrote the 'rules' chunk of the 
recently (self-)promoted tacg.

Best
Harry


On Thursday 16 February 2006 14:45, Chris Fields wrote:
> If I know the start, end, and strand info for a list of features (personal
> preference, since I use Bio::SeqFeature::Generic with the RNAMotif I drew
> up), couldn't I try pulling out the surrounding region?  My thought is
> this, though I haven't coded it yet:
>
> 1)  Draw up a list of Seqfeatures, with accession, start, stop coordinates
> (array of hashes) based off what I get from RNAMotif objects.
> 2)  Pull the sequence from NCBI using Bio::DB::GenBank with x bp upstream
> and downstream, one at a time, using get_Seq_by_id().  I could add a sleep
> in there somewhere to not tick off the NCBI curators.
>
> Reason I'm interested in this is b/c I want to know where the RNA motif is
> in relation to surrounding features. If it is very close to a coding region,
> then the motif likely indicates translational regulation.  Further away may
> indicate transcriptional termination or another mechanism.
>
> The files returned should have the features included as long as they are in
> the full length GenBank record.  I tried it out using the web form but not
> through Bio::DB::GenBank yet.  If I can get it to work I'll add it to the
> page.
>
> Christopher Fields
> Postdoctoral Researcher - Switzer Lab
> Dept. of Biochemistry
> University of Illinois Urbana-Champaign
>
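
A rough sketch of the coordinate step in the plan above, in plain Perl with
no BioPerl calls: widen each feature by a fixed flank and clamp the result to
the record. The helper name flanking_window, the accessions, the coordinates,
the 500 bp flank and the 5000 bp record length are all made up for
illustration, and the remote fetch itself is left as a comment.

```perl
use strict;
use warnings;

# Hypothetical helper: widen a feature's 1-based, inclusive coordinates by
# $flank bp on each side, clamped to 1 .. $seq_len (if a length is given).
sub flanking_window {
    my ($start, $end, $flank, $seq_len) = @_;
    ($start, $end) = ($end, $start) if $start > $end;   # normalise order
    my $from = $start - $flank;
    $from = 1 if $from < 1;
    my $to = $end + $flank;
    $to = $seq_len if defined $seq_len && $to > $seq_len;
    return ($from, $to);
}

# Made-up feature list standing in for the array of hashes in step 1.
my @features = (
    { acc => 'ACC_0001', start => 1200, end => 1350, strand =>  1 },
    { acc => 'ACC_0002', start =>  150, end =>  400, strand => -1 },
);

for my $f (@features) {
    my ($from, $to) = flanking_window($f->{start}, $f->{end}, 500, 5000);
    printf "%s\t%d\t%d\t%d\n", $f->{acc}, $from, $to, $f->{strand};
    # Step 2 would fetch $from..$to for $f->{acc} here, with a sleep()
    # between requests to stay polite to NCBI.
}
```

Clamping at 1 matters for features near the start of a record, where a naive
subtraction would give a zero or negative start.
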
> > -----Original Message-----
> > From: Brian Osborne [mailto:osborne1 at optonline.net]
> > Sent: Thursday, February 16, 2006 4:19 PM
> > To: Chris Fields
> > Cc: Harry Mangalam; bioperl-l
> > Subject: Re: [Bioperl-l] Fetching genomic sequences based on HUGO names
> > or GeneIDs
> >
> > Chris,
> >
> > Yes. The question now is where to easily get the coordinates.
> >
> > Brian O.
> >
> > On 2/16/06 7:52 AM, "Chris Fields" <cjfields at uiuc.edu> wrote:
> > > I think a method was recently implemented in Bio::DB::GenBank to
> > > retrieve a segment of DNA given start and end coordinates in GenBank
> > > format; that should contain the features you need.  I requested it
> > > ~Nov-Dec in the mailing list but didn't get a chance to test it.
> > > Would that help?
> > >
> > > On Feb 15, 2006, at 11:16 PM, Brian Osborne wrote:
> > >> Harry,
> > >>
> > >> It's not clear to me that NCBI's eutils offers this capability
> > >> directly. You can probably download Entrez Gene entries and parse
> > >> them for coordinates, but I know of no way to remotely retrieve
> > >> genomic sequences like this from NCBI (ENSEMBL API perhaps?). What I
> > >> had in mind uses the local approach that some of us favor, and to
> > >> prove to myself that this is simple to do I wrote a script that I
> > >> just added to examples/tools. It's called extract_genes.pl and it's
> > >> based on Bio::DB::Fasta. Download the sequence files for a given
> > >> species to some dir, download Entrez Gene's gene2accession file, and
> > >> run. It creates and stores a hash for lookups, so it won't read
> > >> gene2accession each time it runs.
> > >>
> > >> Brian O.
> > >>
> > >> On 2/14/06 12:15 PM, "Harry Mangalam" <hjm at tacgi.com> wrote:
> > >>> Hi Brian,
> > >>>
> > >>> Thanks very much for the pointers and the speed of your reply and
> > >>> apologies for the speed of mine.
> > >>>
> > >>> This looks good, but what I was looking for was a bioP approach for
> > >>> hooking to an API at NCBI or EBI so I could get this info and seqs
> > >>> from them.  In this case, speed of retrieval is not critical and I'd
> > >>> rather not download the entirety of the sequences to a local disk to
> > >>> hack at them.
> > >>>
> > >>> I've determined a screen-scraping approach to get them and could
> > >>> script that, but I thought that bioP had a method for using NCBI's
> > >>> external APIs, tho it may be that my memory is faulty or the approach
> > >>> is no longer supported due to overload.
> > >>>
> > >>> Does NCBI make such APIs available anymore?  I searched a bit for
> > >>> docs on them but couldn't find anything (unless it's buried in the
> > >>> NCBI toolkit, which I haven't started to excavate).
> > >>>
> > >>> Failing that, would SEALS provide such a service? Any PerlPinipeds
> > >>> listening?
> > >>>
> > >>> Harry
> > >>>
> > >>> On Sunday 12 February 2006 08:37, Brian Osborne wrote:
> > >>>> Harry,
> > >>>>
> > >>>> Hope you're doing well. The approach could be based on
> > >>>> Bio::DB::Fasta. So, from its documentation:
> > >>>>
> > >>>>   use Bio::DB::Fasta;
> > >>>>
> > >>>>   # create database from directory of fasta files
> > >>>>   my $db      = Bio::DB::Fasta->new('/path/to/fasta/files');
> > >>>>
> > >>>>   # simple access (for those without Bioperl)
> > >>>>   my $seq      = $db->seq('CHROMOSOME_I',4_000_000 => 4_100_000);
> > >>>>   my $revseq   = $db->seq('CHROMOSOME_I',4_100_000 => 4_000_000);
> > >>>>   my @ids     = $db->ids;
> > >>>>   my $length   = $db->length('CHROMOSOME_I');
> > >>>>   my $alphabet = $db->alphabet('CHROMOSOME_I');
> > >>>>   my $header   = $db->header('CHROMOSOME_I');
> > >>>>
> > >>>>   # Bioperl-style access
> > >>>>   my $db      = Bio::DB::Fasta->new('/path/to/fasta/files');
> > >>>>
> > >>>>   my $obj     = $db->get_Seq_by_id('CHROMOSOME_I');
> > >>>>   my $seq     = $obj->seq;
> > >>>>   my $subseq  = $obj->subseq(4_000_000 => 4_100_000);
> > >>>>
> > >>>> Do you already have the offsets?
> > >>>>
> > >>>> Brian O.
> > >>>>
> > >>>> On 2/12/06 1:46 AM, "Harry Mangalam" <hjm at tacgi.com> wrote:
> > >>>>> Hi All,
> > >>>>>
> > >>>>> After perusing the tutorial and other docs for an evening, I still
> > >>>>> can't find the answer to this.  Forgive me if I've missed something
> > >>>>> obvious.
> > >>>>>
> > >>>>> This should not be a novel request, but I've not found it answered.
> > >>>>> If bioperl isn't the best way to do this, I'd be grateful for a
> > >>>>> pointer to a better way, especially if it includes an illuminating
> > >>>>> bit of code.
> > >>>>>
> > >>>>> The problem is to retrieve genomic sequences plus & minus some
> > >>>>> offset from a locus determined by HUGO keyword or GeneID.  This would
> > >>>>> be a common followup chore for some extra analysis from a gene
> > >>>>> expression expt.  Or maybe this is in the DBFetch routines, but I've
> > >>>>> missed the sequence type to specify...?
> > >>>>>
> > >>>>>
> > >>>>> TIA!
> > >>
> > >> _______________________________________________
> > >> Bioperl-l mailing list
> > >> Bioperl-l at lists.open-bio.org
> > >> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> > >
> > > Christopher Fields
> > > Postdoctoral Researcher
> > > Lab of Dr. Robert Switzer
> > > Dept of Biochemistry
> > > University of Illinois Urbana-Champaign
> > >

-- 
Cheers, Harry
Harry J Mangalam - 949 856 2847 (vox; email for fax) - hjm at tacgi.com 
            <<plain text preferred>>


