[Bioperl-l] Fetching genomic sequences based on HUGO names or GeneIDs

Thu Feb 16 12:52:31 UTC 2006

I think a method was recently implemented in Bio::DB::GenBank to
retrieve a segment of DNA, given start and end coordinates, in GenBank
format; the returned record should contain the features you need.  I
requested it around Nov-Dec on the mailing list but haven't had a chance
to test it.  Would that help?
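
For illustration, here is a minimal, untested sketch of that idea. It assumes the -seq_start and -seq_stop options shown in the Bio::DB::GenBank synopsis, and that you already have an accession and gene coordinates on it (for example from Entrez Gene). The helper names flank_window and fetch_region are made up for this example, and the fetch itself needs BioPerl and network access.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Widen a 1-based, inclusive [start, end] interval by $flank bases on each
# side, never going below position 1. (Hypothetical helper for this sketch.)
sub flank_window {
    my ($start, $end, $flank) = @_;
    ($start, $end) = ($end, $start) if $start > $end;   # normalise order
    my $from = $start - $flank;
    $from = 1 if $from < 1;
    return ($from, $end + $flank);
}

# Fetch only the requested region of an accession from NCBI, as a GenBank
# record. Bio::DB::GenBank is loaded lazily so that the rest of the script
# still compiles without BioPerl installed. (Hypothetical helper.)
sub fetch_region {
    my ($acc, $start, $end, $flank) = @_;
    require Bio::DB::GenBank;
    my ($from, $to) = flank_window($start, $end, $flank);
    my $gb = Bio::DB::GenBank->new(
        -format    => 'genbank',
        -seq_start => $from,     # assumed option, per the module synopsis
        -seq_stop  => $to,       # assumed option, per the module synopsis
    );
    return $gb->get_Seq_by_acc($acc);
}

# Usage: perl fetch_region.pl ACCESSION START END [FLANK]
if (@ARGV >= 3) {
    my ($acc, $start, $end, $flank) = @ARGV;
    my $seq   = fetch_region($acc, $start, $end, $flank // 0);
    my @feats = $seq->get_SeqFeatures;
    printf "%s: %d bp, %d features\n",
        $seq->display_id, $seq->length, scalar @feats;
}
```

The window arithmetic is kept separate from the network call so it can be checked on its own; whether NCBI returns the features for the sub-region in the form you need is something to verify against a real record.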

On Feb 15, 2006, at 11:16 PM, Brian Osborne wrote:

> Harry,
>
> It's not clear to me that NCBI's eutils offers this capability directly.
> You can probably download Entrez Gene entries and parse them for
> coordinates, but I know of no way to remotely retrieve genomic sequences
> like this from NCBI (ENSEMBL API perhaps?). What I had in mind uses the
> local approach that some of us favor, and to prove to myself that this is
> simple to do I wrote a script that I just added to examples/tools. It's
> called extract_genes.pl and it's based on Bio::DB::Fasta. Download the
> sequence files for a given species to some dir, download Entrez Gene's
> gene2accession file, and run. It creates and stores a hash for lookups, so
> it won't read gene2accession each time it runs.
>
> Brian O.
>
>
> On 2/14/06 12:15 PM, "Harry Mangalam" <hjm at tacgi.com> wrote:
>
>> Hi Brian,
>>
>> Thanks very much for the pointers and the speed of your reply, and
>> apologies for the speed of mine.
>>
>> This looks good, but what I was looking for was a bioP approach for
>> hooking to an API at NCBI or EBI so I could get this info and seqs from
>> them.  In this case, speed of retrieval is not critical and I'd rather
>> not download the entirety of the sequences to a local disk to hack at
>> them.
>>
>> I've determined a screen-scraping approach to get them and could script
>> that, but I thought that bioP had a method for using NCBI's external
>> APIs, tho it may be that my memory is faulty or the approach is no longer
>> supported due to overload.
>>
>> Does NCBI make such APIs available anymore?  I searched a bit for docs on
>> them but couldn't find anything (unless it's buried in the NCBI toolkit,
>> which I haven't started to excavate).
>>
>> Failing that, would SEALS provide such a service? Any PerlPinipeds
>> listening?
>>
>> Harry
>>
>> On Sunday 12 February 2006 08:37, Brian Osborne wrote:
>>> Harry,
>>>
>>> Hope you're doing well. The approach could be based on Bio::DB::Fasta.
>>> So, from its documentation:
>>>
>>>   use Bio::DB::Fasta;
>>>
>>>   # create database from directory of fasta files
>>>   my $db      = Bio::DB::Fasta->new('/path/to/fasta/files');
>>>
>>>   # simple access (for those without Bioperl)
>>>   my $seq      = $db->seq('CHROMOSOME_I',4_000_000 => 4_100_000);
>>>   my $revseq   = $db->seq('CHROMOSOME_I',4_100_000 => 4_000_000);
>>>   my @ids     = $db->ids;
>>>   my $length   = $db->length('CHROMOSOME_I');
>>>   my $alphabet = $db->alphabet('CHROMOSOME_I');
>>>   my $header   = $db->header('CHROMOSOME_I');
>>>
>>>   # Bioperl-style access (reusing the $db handle created above)
>>>   my $obj     = $db->get_Seq_by_id('CHROMOSOME_I');
>>>   my $fullseq = $obj->seq;
>>>   my $subseq  = $obj->subseq(4_000_000 => 4_100_000);
>>>
>>> Do you already have the offsets?
>>>
>>> Brian O.
>>>
>>> On 2/12/06 1:46 AM, "Harry Mangalam" <hjm at tacgi.com> wrote:
>>>> Hi All,
>>>>
>>>> After perusing the tutorial and other docs for an evening, I still
>>>> can't find the answer to this.  Forgive me if I've missed something
>>>> obvious.
>>>>
>>>> This should not be a novel request, but I've not found it answered.  If
>>>> bioperl isn't the best way to do this, I'd be grateful for a pointer to
>>>> a better way, especially if it includes an illuminating bit of code.
>>>>
>>>> The problem is to retrieve genomic sequences plus & minus some offset
>>>> from a locus determined by HUGO keyword or GeneID.  This would be a
>>>> common followup chore for some extra analysis from a gene expression
>>>> expt.  Or maybe this is in the DBFetch routines, but I've missed the
>>>> sequence type to specify...?
>>>>
>>>>
>>>> TIA!

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign