[Bioperl-l] Fetching genomic sequences based on HUGO names or GeneIDs

Thu Feb 16 16:27:13 UTC 2006

Harry,

I've long suspected, but never demonstrated, that the easiest way to do
something like this is through ENSEMBL, and Jason hinted at this as well. In
fact your question is something of a FAQ, and my previous responses always
included a plea to some anonymous ENSEMBL API expert, always unheeded. At
any rate, here is an example script I made:

#!/usr/bin/perl

use strict;

use lib "/Users/bosborne/ensembl/modules";

use DBI;

use Getopt::Long;

use Bio::EnsEMBL::DBSQL::DBAdaptor;

my $name;

GetOptions( "n=s" => \$name );

my $db = Bio::EnsEMBL::DBSQL::DBAdaptor->new(
-user   => "anonymous",

-dbname => "homo_sapiens_core_37_35j",

-host   => "ensembldb.ensembl.org",

-pass   => "",                 

-driver => 'mysql'

);

my $gene_adaptor = $db->get_GeneAdaptor;

my $slice_adaptor = $db->get_SliceAdaptor;

my @genes = @{$gene_adaptor->fetch_all_by_external_name($name)};

for my $gene (@genes) {

  for my $trans (@{$gene->get_all_Transcripts}) {

      my $seq = $slice_adaptor->fetch_by_region("chromosome",

             $trans->seq_region_name,

             $trans->start,

             $trans->end);

      print "\n",$seq->seq,"\n";

  }

}

There are some issues, the largest of which is that though this script
prints out big sequences it's completely untested! Another is that it makes
assumptions about transcripts, you should verify for yourself that ENSEMBL's
definition of transcript fits yours. Finally that
fetch_all_by_external_name() method does not seem to accept a second
argument, i.e. namespace. I found this surprising. Anyway, if more than one
gene is retrieved using some name or id you're in a quandary.

For more on this API see:

http://www.ensembl.org/info/software/core/core_tutorial.html

There are tons of modules and methods in this API, I've barely scratched the
surface here.

Brian O.

On 2/14/06 12:15 PM, "Harry Mangalam" <hjm at tacgi.com> wrote:

> Hi Brian,
> 
> Thanks very much for the pointers and the speed of your reply and apologies
> for the speed of mine.
> 
> This looks good, but what I was looking for was a bioP approach for hooking to
> an API at NCBI or EBI so I could get this info and seqs from them.  In this
> case, speed of retrieval is not critical and I'd rather not download the
> entirety of the sequences to a local disk to hack at them.
> 
> I've determined a screen-scraping approach to get them and could script that,
> but I thought that bioP had a method for using NCBI's external API's, tho it
> may be that my memory is faulty or the approach is no longer supported due to
> overload.  
> 
> Does NCBI make such APIs available anymore?  I searched a bit for docs on them
> but couldn't find anything (unless it's buried in the NCBI tookit, which I
> haven't started to excavate).
> 
> Failing that, would SEALS provide such a service? Any PerlPinipeds listening?
> 
> Harry
> 
> 
> 
> 
> 
> 
> On Sunday 12 February 2006 08:37, Brian Osborne wrote:
>> Harry,
>> 
>> Hope you're doing well. The approach could be based on Bio::DB::Fasta. So,
>> from its documentation:
>> 
>>   use Bio::DB::Fasta;
>> 
>>   # create database from directory of fasta files
>>   my $db      = Bio::DB::Fasta->new('/path/to/fasta/files');
>> 
>>   # simple access (for those without Bioperl)
>>   my $seq      = $db->seq('CHROMOSOME_I',4_000_000 => 4_100_000);
>>   my $revseq   = $db->seq('CHROMOSOME_I',4_100_000 => 4_000_000);
>>   my @ids     = $db->ids;
>>   my $length   = $db->length('CHROMOSOME_I');
>>   my $alphabet = $db->alphabet('CHROMOSOME_I');
>>   my $header   = $db->header('CHROMOSOME_I');
>> 
>>   # Bioperl-style access
>>   my $db      = Bio::DB::Fasta->new('/path/to/fasta/files');
>> 
>>   my $obj     = $db->get_Seq_by_id('CHROMOSOME_I');
>>   my $seq     = $obj->seq;
>>   my $subseq  = $obj->subseq(4_000_000 => 4_100_000);
>> 
>> Do you already have the offsets?
>> 
>> Brian O.
>> 
>> On 2/12/06 1:46 AM, "Harry Mangalam" <hjm at tacgi.com> wrote:
>>> Hi All,
>>> 
>>> After perusing the tutorial and other docs for a an evening, I still
>>> can't find the answer to this.  Forgive me if I've missed something
>>> obvious.
>>> 
>>> This should not be a novel request, but I've not found it answered.  If
>>> bioperl isn't the best way to do this, I'd be grateful to a pointer to a
>>> better way, especially if it includes an illuminating bit of code.
>>> 
>>> The problem is to retrieve genomic sequences plus & minus some offset
>>> from a locus determined by HUGO keyword or GeneID.  This would be a
>>> common followup chore for some extra analysis from a gene expression
>>> expt.  Or maybe this is in the DBFetch routines, but I've missed the
>>> sequence type to specify...?
>>> 
>>> 
>>> TIA!