[Bioperl-l] Fetching genomic sequences based on HUGO names or GeneIDs

Sun Feb 19 05:47:44 UTC 2006

Chris and Harry,

OK, I've put the missing link in place. This is Bio::DB::EntrezGene, so you
can get NCBI Genes as objects, perfectly analogous to Bio::DB::GenBank and
the related modules:

use strict;
use warnings;
use Bio::DB::EntrezGene;
# Retrieve the Entrez Gene entry with Gene ID 2
my $db  = Bio::DB::EntrezGene->new;
my $seq = $db->get_Seq_by_id(2);

So starting with just a Gene ID, then using Bio::DB::GenBank as Chris
showed, you can get the sequence. What's a little odd is how Entrez Gene
stores the positional information and the sequence identifier. You might
have thought they'd create a special set of fields for this, but no: as far
as I can tell it's only available as part of a URL:

Bio::Annotation::DBLink=HASH()
   '_root_verbose' => 0
   'database' => 'Evidence Viewer'
   'primary_id' => 4693
   'url' => 'http://www.ncbi.nlm.nih.gov/sutils/evv.cgi?taxid=9606&contig=NT_079573.2&gene=NDP&lid=4693&from=6657835&to=6682559'
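
Since the contig and coordinates only appear inside that URL, one way to get at them is to parse its query string. This is only a rough sketch: parse_evv_url is a name made up for illustration, not part of Bioperl, and it assumes the 'contig', 'from' and 'to' parameters always look the way they do in the one example above.

```perl
use strict;
use warnings;

# Hypothetical helper (not a Bioperl method): pull the contig accession and
# coordinates out of an Evidence Viewer URL. Returns a hash ref, or undef
# if any of the three expected parameters is missing.
sub parse_evv_url {
    my ($url) = @_;
    my ($query) = $url =~ /\?(.*)\z/s or return;
    my %param;
    for my $pair (split /[&;]/, $query) {
        my ($key, $value) = split /=/, $pair, 2;
        $param{$key} = $value if defined $value;
    }
    for my $need (qw(contig from to)) {
        return unless defined $param{$need};
    }
    return { contig => $param{contig}, start => $param{from}, end => $param{to} };
}

my $url = 'http://www.ncbi.nlm.nih.gov/sutils/evv.cgi?taxid=9606'
        . '&contig=NT_079573.2&gene=NDP&lid=4693&from=6657835&to=6682559';
my $loc = parse_evv_url($url);
print "$loc->{contig}:$loc->{start}-$loc->{end}\n";   # NT_079573.2:6657835-6682559
```

The resulting accession, start and end are the values one would then hand to a sequence database for a subregion fetch.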

Question: are NT_* sequences going to be a problem for Bio::DB::GenBank? I
see this in NCBIHelper:

# NT contigs can not be retrieved
$self->throw("NT_ contigs are whole chromosome files which are not part of regular".
    "database distributions. Go to ftp://ftp.ncbi.nih.gov/genomes/.")
      if $ids =~ /NT_/;

Perhaps we can modify this so there's no throw() when a seq_start and
seq_stop are specified.
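
To make the idea concrete, here is a standalone sketch of the relaxed check. It is not the actual NCBIHelper code: nt_request_allowed and its option hash are invented for illustration, and only the seq_start and seq_stop names come from the suggestion above.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the check in NCBIHelper: refuse NT_ contigs
# only when the caller has not restricted the request to a subregion.
sub nt_request_allowed {
    my ($ids, %opt) = @_;
    return 1 unless $ids =~ /NT_/;    # not a contig, nothing to check
    return (defined $opt{seq_start} && defined $opt{seq_stop}) ? 1 : 0;
}

# Whole contig: still refused
print nt_request_allowed('NT_079573.2') ? "fetch\n" : "refuse\n";

# Contig with both coordinates: allowed
print nt_request_allowed('NT_079573.2', seq_start => 6657835, seq_stop => 6682559)
    ? "fetch\n" : "refuse\n";
```

In the real module the throw() would stay in place for the first case and simply be skipped for the second.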

Brian O.

On 2/17/06 6:02 PM, "Chris Fields" <cjfields at uiuc.edu> wrote:

> Brian,
> 
> I added some sample code to the page.  See what you think.
> 
> Christopher Fields
> Postdoctoral Researcher - Switzer Lab
> Dept. of Biochemistry
> University of Illinois Urbana-Champaign
> 
>> -----Original Message-----
>> From: bioperl-l-bounces at lists.open-bio.org
>> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Chris Fields
>> Sent: Thursday, February 16, 2006 4:46 PM
>> To: 'Brian Osborne'
>> Cc: 'Harry Mangalam'; 'bioperl-l'
>> Subject: Re: [Bioperl-l] Fetching genomic sequences based on HUGO names
>> or GeneIDs
>> 
>> If I know the start, end, and strand info for a list of features (personal
>> preference, since I use Bio::SeqFeature::Generic with the RNAMotif I drew
>> up), couldn't I try pulling out the surrounding region?  My thought is
>> this, though I haven't coded it yet:
>> 
>> 1)  Draw up a list of Seqfeatures, with accession, start, stop coordinates
>> (array of hashes) based off what I get from RNAMotif objects.
>> 2)  Pull the sequence from NCBI using Bio::DB::GenBank with x bp upstream
>> and downstream, one at a time, using get_Seq_by_id().  I could add a sleep
>> in there somewhere to not tick off the NCBI curators.
>> 
>> Reason I'm interested in this is b/c I want to know where the RNA motif is
>> in context to surrounding features. If it is very close to a coding region,
>> then the motif likely indicates translational regulation.  Further away may
>> indicate transcriptional termination or another mechanism.
>> 
>> The files returned should have the features included as long as they are in
>> the full length GenBank record.  I tried it out using the web form but not
>> through Bio::DB::GenBank yet.  If I can get it to work I'll add it to the
>> page.
>> 
>> Christopher Fields
>> Postdoctoral Researcher - Switzer Lab
>> Dept. of Biochemistry
>> University of Illinois Urbana-Champaign
>> 
>> 
>>> -----Original Message-----
>>> From: Brian Osborne [mailto:osborne1 at optonline.net]
>>> Sent: Thursday, February 16, 2006 4:19 PM
>>> To: Chris Fields
>>> Cc: Harry Mangalam; bioperl-l
>>> Subject: Re: [Bioperl-l] Fetching genomic sequences based on HUGO names or
>>> GeneIDs
>>> 
>>> Chris,
>>> 
>>> Yes. The question now is where to easily get the coordinates.
>>> 
>>> Brian O.
>>> 
>>> 
>>> On 2/16/06 7:52 AM, "Chris Fields" <cjfields at uiuc.edu> wrote:
>>> 
>>>> I think a method was recently implemented in Bio::DB::GenBank to
>>>> retrieve a segment of DNA given start and end coordinates in GenBank
>>>> format; that should contain the features you need.  I requested it
>>>> ~Nov-Dec in the mailing list but didn't get a chance to test it.
>>>> Would that help?
>>>> 
>>>> On Feb 15, 2006, at 11:16 PM, Brian Osborne wrote:
>>>> 
>>>>> Harry,
>>>>> 
>>>>> It's not clear to me that NCBI's eutils offers this capability directly.
>>>>> You can probably download Entrez Gene entries and parse them for
>>>>> coordinates, but I know of no way to remotely retrieve genomic sequences
>>>>> like this from NCBI (ENSEMBL API perhaps?). What I had in mind uses the
>>>>> local approach that some of us favor, and to prove to myself that this is
>>>>> simple to do I wrote a script that I just added to examples/tools. It's
>>>>> called extract_genes.pl and it's based on Bio::DB::Fasta. Download the
>>>>> sequence files for a given species to some dir, download Entrez Gene's
>>>>> gene2accession file, and run. It creates and stores a hash for lookups, so
>>>>> it won't read gene2accession each time it runs.
>>>>> 
>>>>> Brian O.
>>>>> 
>>>>> 
>>>>> On 2/14/06 12:15 PM, "Harry Mangalam" <hjm at tacgi.com> wrote:
>>>>> 
>>>>>> Hi Brian,
>>>>>> 
>>>>>> Thanks very much for the pointers and the speed of your reply, and
>>>>>> apologies for the speed of mine.
>>>>>> 
>>>>>> This looks good, but what I was looking for was a bioP approach for
>>>>>> hooking to an API at NCBI or EBI so I could get this info and seqs from
>>>>>> them.  In this case, speed of retrieval is not critical and I'd rather
>>>>>> not download the entirety of the sequences to a local disk to hack at
>>>>>> them.
>>>>>> 
>>>>>> I've determined a screen-scraping approach to get them and could script
>>>>>> that, but I thought that bioP had a method for using NCBI's external
>>>>>> APIs, tho it may be that my memory is faulty or the approach is no
>>>>>> longer supported due to overload.
>>>>>> 
>>>>>> Does NCBI make such APIs available anymore?  I searched a bit for docs
>>>>>> on them but couldn't find anything (unless it's buried in the NCBI
>>>>>> toolkit, which I haven't started to excavate).
>>>>>> 
>>>>>> Failing that, would SEALS provide such a service? Any PerlPinipeds
>>>>>> listening?
>>>>>> 
>>>>>> Harry
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Sunday 12 February 2006 08:37, Brian Osborne wrote:
>>>>>>> Harry,
>>>>>>> 
>>>>>>> Hope you're doing well. The approach could be based on Bio::DB::Fasta.
>>>>>>> So, from its documentation:
>>>>>>> 
>>>>>>>   use Bio::DB::Fasta;
>>>>>>> 
>>>>>>>   # create database from directory of fasta files
>>>>>>>   my $db      = Bio::DB::Fasta->new('/path/to/fasta/files');
>>>>>>> 
>>>>>>>   # simple access (for those without Bioperl)
>>>>>>>   my $seq      = $db->seq('CHROMOSOME_I',4_000_000 => 4_100_000);
>>>>>>>   my $revseq   = $db->seq('CHROMOSOME_I',4_100_000 => 4_000_000);
>>>>>>>   my @ids     = $db->ids;
>>>>>>>   my $length   = $db->length('CHROMOSOME_I');
>>>>>>>   my $alphabet = $db->alphabet('CHROMOSOME_I');
>>>>>>>   my $header   = $db->header('CHROMOSOME_I');
>>>>>>> 
>>>>>>>   # Bioperl-style access
>>>>>>>   my $db      = Bio::DB::Fasta->new('/path/to/fasta/files');
>>>>>>> 
>>>>>>>   my $obj     = $db->get_Seq_by_id('CHROMOSOME_I');
>>>>>>>   my $seq     = $obj->seq;
>>>>>>>   my $subseq  = $obj->subseq(4_000_000 => 4_100_000);
>>>>>>> 
>>>>>>> Do you already have the offsets?
>>>>>>> 
>>>>>>> Brian O.
>>>>>>> 
>>>>>>> On 2/12/06 1:46 AM, "Harry Mangalam" <hjm at tacgi.com> wrote:
>>>>>>>> Hi All,
>>>>>>>> 
>>>>>>>> After perusing the tutorial and other docs for an evening, I still
>>>>>>>> can't find the answer to this.  Forgive me if I've missed something
>>>>>>>> obvious.
>>>>>>>> 
>>>>>>>> This should not be a novel request, but I've not found it answered.
>>>>>>>> If bioperl isn't the best way to do this, I'd be grateful for a pointer
>>>>>>>> to a better way, especially if it includes an illuminating bit of code.
>>>>>>>> 
>>>>>>>> The problem is to retrieve genomic sequences plus & minus some offset
>>>>>>>> from a locus determined by HUGO keyword or GeneID.  This would be a
>>>>>>>> common followup chore for some extra analysis from a gene expression
>>>>>>>> expt.  Or maybe this is in the DBFetch routines, but I've missed the
>>>>>>>> sequence type to specify...?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> TIA!
>>>>> 
>>>>> 
>>>>> _______________________________________________
>>>>> Bioperl-l mailing list
>>>>> Bioperl-l at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>> 
>>>> Christopher Fields
>>>> Postdoctoral Researcher
>>>> Lab of Dr. Robert Switzer
>>>> Dept of Biochemistry
>>>> University of Illinois Urbana-Champaign
>>>> 
>>>> 
>>>> 
>> 
>> 
>