[Bioperl-l] Fetching genomic sequences based on HUGO names or GeneIDs

Thu Feb 16 05:16:04 UTC 2006

Harry,

It's not clear to me that NCBI's eutils offers this capability directly. You
can probably download Entrez Gene entries and parse them for coordinates, but
I know of no way to remotely retrieve genomic sequences like this from NCBI
(the ENSEMBL API, perhaps?). What I had in mind uses the local approach that
some of us favor. To prove to myself that this is simple to do, I wrote a
script and just added it to examples/tools. It's called extract_genes.pl and
it's based on Bio::DB::Fasta. Download the sequence files for a given
species to some dir, download Entrez Gene's gene2accession file, and run. It
creates and stores a hash for lookups, so it won't read gene2accession each
time it runs.
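
For anyone who wants to see the shape of the local approach, here is a
minimal sketch. It is not the extract_genes.pl script itself, only an
illustration under several assumptions: that gene2accession is tab-delimited
with GeneID, genomic accession.version, start, end and orientation in columns
2, 8, 10, 11 and 12, that its positions are 0-based, and that the sequence IDs
in the FASTA files match those accessions exactly. The helper names
(index_gene2accession, flanked_range) and the ".idx" cache file name are made
up for this example.

```perl
#!/usr/bin/perl
# Sketch only; this is NOT the extract_genes.pl script mentioned above.
use strict;
use warnings;
use Storable qw(store retrieve);

# Build GeneID => [accession, start, end, strand] from gene2accession.
# Assumed columns (0-based index): 1 GeneID, 7 genomic accession.version,
# 9 start, 10 end, 11 orientation. Assumed: 0-based positions, '-' = no value.
sub index_gene2accession {
    my ($fh) = @_;
    my %loc;
    while (my $line = <$fh>) {
        next if $line =~ /^#/;
        chomp $line;
        my @f = split /\t/, $line;
        next if @f < 12;
        my ($gene_id, $acc, $start, $end, $strand) = @f[1, 7, 9, 10, 11];
        next if grep { $_ eq '-' } $acc, $start, $end;
        # Keep the first placement seen; convert to 1-based coordinates.
        $loc{$gene_id} ||= [$acc, $start + 1, $end + 1, $strand];
    }
    return \%loc;
}

# Widen a 1-based range by $flank on both sides, clipped to the sequence.
sub flanked_range {
    my ($start, $end, $flank, $seq_length) = @_;
    my $from = $start - $flank;
    $from = 1 if $from < 1;
    my $to = $end + $flank;
    $to = $seq_length if $to > $seq_length;
    return ($from, $to);
}

# Usage: perl sketch.pl /path/to/fasta/files gene2accession GeneID flank
if (@ARGV == 4) {
    my ($fasta_dir, $g2a_file, $gene_id, $flank) = @ARGV;
    require Bio::DB::Fasta;

    # Cache the lookup hash so gene2accession is parsed only once.
    my $cache = "$g2a_file.idx";    # hypothetical cache file name
    my $loc;
    if (-e $cache) {
        $loc = retrieve($cache);
    } else {
        open my $fh, '<', $g2a_file or die "Cannot open $g2a_file: $!";
        $loc = index_gene2accession($fh);
        close $fh;
        store($loc, $cache);
    }

    my $hit = $loc->{$gene_id}
        or die "No genomic placement for GeneID $gene_id\n";
    my ($acc, $start, $end, $strand) = @$hit;

    my $db  = Bio::DB::Fasta->new($fasta_dir);
    my $len = $db->length($acc) or die "$acc not found in $fasta_dir\n";
    my ($from, $to) = flanked_range($start, $end, $flank, $len);

    # Bio::DB::Fasta returns the reverse complement when start > stop.
    my $seq = $strand eq '-' ? $db->seq($acc, $to => $from)
                             : $db->seq($acc, $from => $to);
    print ">GeneID_$gene_id $acc:$from-$to($strand)\n$seq\n";
}
```

The sketch keeps only the first genomic placement listed for each GeneID. A
gene can appear on several accessions (contigs as well as chromosomes), so a
real script would need a rule for choosing among them.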

Brian O.

On 2/14/06 12:15 PM, "Harry Mangalam" <hjm at tacgi.com> wrote:

> Hi Brian,
> 
> Thanks very much for the pointers and the speed of your reply and apologies
> for the speed of mine.
> 
> This looks good, but what I was looking for was a bioP approach for hooking to
> an API at NCBI or EBI so I could get this info and seqs from them.  In this
> case, speed of retrieval is not critical and I'd rather not download the
> entirety of the sequences to a local disk to hack at them.
> 
> I've determined a screen-scraping approach to get them and could script that,
> but I thought that bioP had a method for using NCBI's external APIs, though it
> may be that my memory is faulty or the approach is no longer supported due to
> overload.  
> 
> Does NCBI make such APIs available anymore?  I searched a bit for docs on them
> but couldn't find anything (unless it's buried in the NCBI toolkit, which I
> haven't started to excavate).
> 
> Failing that, would SEALS provide such a service? Any PerlPinipeds listening?
> 
> Harry
> 
> On Sunday 12 February 2006 08:37, Brian Osborne wrote:
>> Harry,
>> 
>> Hope you're doing well. The approach could be based on Bio::DB::Fasta. So,
>> from its documentation:
>> 
>>   use Bio::DB::Fasta;
>> 
>>   # create database from directory of fasta files
>>   my $db      = Bio::DB::Fasta->new('/path/to/fasta/files');
>> 
>>   # simple access (for those without Bioperl)
>>   my $seq      = $db->seq('CHROMOSOME_I',4_000_000 => 4_100_000);
>>   my $revseq   = $db->seq('CHROMOSOME_I',4_100_000 => 4_000_000);
>>   my @ids     = $db->ids;
>>   my $length   = $db->length('CHROMOSOME_I');
>>   my $alphabet = $db->alphabet('CHROMOSOME_I');
>>   my $header   = $db->header('CHROMOSOME_I');
>> 
>>   # Bioperl-style access (reuses the $db handle created above)
>>   my $obj     = $db->get_Seq_by_id('CHROMOSOME_I');
>>   my $fullseq = $obj->seq;
>>   my $subseq  = $obj->subseq(4_000_000 => 4_100_000);
>> 
>> Do you already have the offsets?
>> 
>> Brian O.
>> 
>> On 2/12/06 1:46 AM, "Harry Mangalam" <hjm at tacgi.com> wrote:
>>> Hi All,
>>> 
>>> After perusing the tutorial and other docs for an evening, I still
>>> can't find the answer to this.  Forgive me if I've missed something
>>> obvious.
>>> 
>>> This should not be a novel request, but I've not found it answered.  If
>>> bioperl isn't the best way to do this, I'd be grateful for a pointer to a
>>> better way, especially if it includes an illuminating bit of code.
>>> 
>>> The problem is to retrieve genomic sequences plus & minus some offset
>>> from a locus determined by HUGO keyword or GeneID.  This would be a
>>> common followup chore for some extra analysis from a gene expression
>>> expt.  Or maybe this is in the DBFetch routines, but I've missed the
>>> sequence type to specify...?
>>> 
>>> 
>>> TIA!