[Bioperl-l] Fetching genomic sequences based on HUGO names orGeneIDs
Brian Osborne
osborne1 at optonline.net
Sun Feb 19 05:47:44 UTC 2006
Chris and Harry,
OK, I've put the missing link in place. This is Bio::DB::EntrezGene, so you
can get NCBI Genes as objects, perfectly analogous to Bio::DB::GenBank and
the related modules:
use Bio::DB::EntrezGene;
$db = new Bio::DB::EntrezGene;
$seq = $db->get_Seq_by_id(2);
So starting with just a Gene id, then using Bio::DB::GenBank as Chris
showed, you can get the sequence. What's a little odd is how Entrez Gene has
stored positional information and Sequence identifier, you may have thought
that they'd create a special set of fields for this but no, it's only
available as part of a URL as far as I can tell:
Bio::Annotation::DBLink=HASH()
'_root_verbose' => 0
'database' => 'Evidence Viewer'
'primary_id' => 4693
'url' =>
'http://www.ncbi.nlm.nih.gov/sutils/evv.cgi?taxid=9606&contig=NT_079573.2&ge
ne=NDP&lid=4693&from=6657835&to=6682559'
Question: are NT_* sequences going to be a problem for Bio::DB::GenBank? I
see this in NCBIHelper:
# NT contigs can not be retrieved
$self->throw("NT_ contigs are whole chromosome files which are not part of
regular".
"database distributions. Go to ftp://ftp.ncbi.nih.gov/genomes/.")
if $ids =~ /NT_/;
Perhaps we can modify this so there's no throw() when a seq_start and
seq_stop are specified.
Brian O.
On 2/17/06 6:02 PM, "Chris Fields" <cjfields at uiuc.edu> wrote:
> Brian,
>
> I added some sample code to the page. See what you think.
>
> Christopher Fields
> Postdoctoral Researcher - Switzer Lab
> Dept. of Biochemistry
> University of Illinois Urbana-Champaign
>
>> -----Original Message-----
>> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
>> bounces at lists.open-bio.org] On Behalf Of Chris Fields
>> Sent: Thursday, February 16, 2006 4:46 PM
>> To: 'Brian Osborne'
>> Cc: 'Harry Mangalam'; 'bioperl-l'
>> Subject: Re: [Bioperl-l] Fetching genomic sequences based on HUGO names
>> orGeneIDs
>>
>> If I know the start, end, and strand info for a list of features (personal
>> preference, since I use Bio::SeqFeature::Generic with the RNAMotif I drew
>> up), couldn't I try pulling out the surrounding region? My thought is
>> this,
>> though I haven't coded it yet:
>>
>> 1) Draw up a list of Seqfeatures, with accession, start, stop coordinates
>> (array of hashes) based off what I get from RNAMotif objects.
>> 2) Pull the sequence from NCBI using Bio::DB::GenBank with x bp upstream
>> and downstream, one at a time, using get_Seq_by_ID(). I could add a sleep
>> in there somewhere to not tick off the NCBI curators.
>>
>> Reason I'm interested in this is b/c I want to know where the RNA motif is
>> in context to surrounding features. If it is very close to a coding
>> region,
>> then the motif likely indicates translational regulation. Further away
>> may
>> indicate transcriptional termination or another mechanism.
>>
>> The files returned should have the features included as long as they are
>> in
>> the full length GenBank record. I tried it out using the web form but not
>> through Bio::DB::GenBank yet. If I can get it to work I'll add it to the
>> page.
>>
>> Christopher Fields
>> Postdoctoral Researcher - Switzer Lab
>> Dept. of Biochemistry
>> University of Illinois Urbana-Champaign
>>
>>
>>> -----Original Message-----
>>> From: Brian Osborne [mailto:osborne1 at optonline.net]
>>> Sent: Thursday, February 16, 2006 4:19 PM
>>> To: Chris Fields
>>> Cc: Harry Mangalam; bioperl-l
>>> Subject: Re: [Bioperl-l] Fetching genomic sequences based on HUGO names
>> or
>>> GeneIDs
>>>
>>> Chris,
>>>
>>> Yes. The question now is where to easily get the coordinates.
>>>
>>> Brian O.
>>>
>>>
>>> On 2/16/06 7:52 AM, "Chris Fields" <cjfields at uiuc.edu> wrote:
>>>
>>>> I think a method was recently implemented in Bio::DB::GenBank to
>>>> retrieve a segment of DNA given start and end coordinates in GenBank
>>>> format; that should contain the features you need. I requested it
>>>> ~Nov-Dec in the mailing list but didn't get a chance to test it.
>>>> Would that help?
>>>>
>>>> On Feb 15, 2006, at 11:16 PM, Brian Osborne wrote:
>>>>
>>>>> Harry,
>>>>>
>>>>> It's not clear to me that NCBI's eutils offers this capability
>>>>> directly. You
>>>>> can probably download Entrez Gene entries and parse them for
>>>>> coordinates but
>>>>> I know of no way to remotely retrieve genomic sequences like this
>>>>> from NCBI
>>>>> (ENSEMBL API perhaps?). What I had in mind uses the local approach
>>>>> that some
>>>>> of us favor and to prove to myself that this is simple to do I wrote
>> a
>>>>> script that I just added to examples/tools, it's called
>>>>> extract_genes.pl and
>>>>> it's based on Bio::DB::Fasta. Download the sequence files for a given
>>>>> species to some dir, download Entrez Gene's gene2accession file,
>>>>> and run. It
>>>>> creates and stores a hash for lookups, it won't read gene2accession
>>>>> each
>>>>> time it runs.
>>>>>
>>>>> Brian O.
>>>>>
>>>>>
>>>>> On 2/14/06 12:15 PM, "Harry Mangalam" <hjm at tacgi.com> wrote:
>>>>>
>>>>>> Hi Brian,
>>>>>>
>>>>>> Thanks very much for the pointers and the speed of your reply and
>>>>>> apologies
>>>>>> for the speed of mine.
>>>>>>
>>>>>> This looks good, but what I was looking for was a bioP approach
>>>>>> for hooking to
>>>>>> an API at NCBI or EBI so I could get this info and seqs from
>>>>>> them. In this
>>>>>> case, speed of retrieval is not critical and I'd rather not
>>>>>> download the
>>>>>> entirety of the sequences to a local disk to hack at them.
>>>>>>
>>>>>> I've determined a screen-scraping approach to get them and could
>>>>>> script that,
>>>>>> but I thought that bioP had a method for using NCBI's external
>>>>>> API's, tho it
>>>>>> may be that my memory is faulty or the approach is no longer
>>>>>> supported due to
>>>>>> overload.
>>>>>>
>>>>>> Does NCBI make such APIs available anymore? I searched a bit for
>>>>>> docs on them
>>>>>> but couldn't find anything (unless it's buried in the NCBI tookit,
>>>>>> which I
>>>>>> haven't started to excavate).
>>>>>>
>>>>>> Failing that, would SEALS provide such a service? Any PerlPinipeds
>>>>>> listening?
>>>>>>
>>>>>> Harry
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sunday 12 February 2006 08:37, Brian Osborne wrote:
>>>>>>> Harry,
>>>>>>>
>>>>>>> Hope you're doing well. The approach could be based on
>>>>>>> Bio::DB::Fasta. So,
>>>>>>> from its documentation:
>>>>>>>
>>>>>>> use Bio::DB::Fasta;
>>>>>>>
>>>>>>> # create database from directory of fasta files
>>>>>>> my $db = Bio::DB::Fasta->new('/path/to/fasta/files');
>>>>>>>
>>>>>>> # simple access (for those without Bioperl)
>>>>>>> my $seq = $db->seq('CHROMOSOME_I',4_000_000 => 4_100_000);
>>>>>>> my $revseq = $db->seq('CHROMOSOME_I',4_100_000 => 4_000_000);
>>>>>>> my @ids = $db->ids;
>>>>>>> my $length = $db->length('CHROMOSOME_I');
>>>>>>> my $alphabet = $db->alphabet('CHROMOSOME_I');
>>>>>>> my $header = $db->header('CHROMOSOME_I');
>>>>>>>
>>>>>>> # Bioperl-style access
>>>>>>> my $db = Bio::DB::Fasta->new('/path/to/fasta/files');
>>>>>>>
>>>>>>> my $obj = $db->get_Seq_by_id('CHROMOSOME_I');
>>>>>>> my $seq = $obj->seq;
>>>>>>> my $subseq = $obj->subseq(4_000_000 => 4_100_000);
>>>>>>>
>>>>>>> Do you already have the offsets?
>>>>>>>
>>>>>>> Brian O.
>>>>>>>
>>>>>>> On 2/12/06 1:46 AM, "Harry Mangalam" <hjm at tacgi.com> wrote:
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> After perusing the tutorial and other docs for a an evening, I
>>>>>>>> still
>>>>>>>> can't find the answer to this. Forgive me if I've missed
>> something
>>>>>>>> obvious.
>>>>>>>>
>>>>>>>> This should not be a novel request, but I've not found it
>>>>>>>> answered. If
>>>>>>>> bioperl isn't the best way to do this, I'd be grateful to a
>>>>>>>> pointer to a
>>>>>>>> better way, especially if it includes an illuminating bit of code.
>>>>>>>>
>>>>>>>> The problem is to retrieve genomic sequences plus & minus some
>>>>>>>> offset
>>>>>>>> from a locus determined by HUGO keyword or GeneID. This would be
>> a
>>>>>>>> common followup chore for some extra analysis from a gene
>>>>>>>> expression
>>>>>>>> expt. Or maybe this is in the DBFetch routines, but I've missed
>>>>>>>> the
>>>>>>>> sequence type to specify...?
>>>>>>>>
>>>>>>>>
>>>>>>>> TIA!
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Bioperl-l mailing list
>>>>> Bioperl-l at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>
>>>> Christopher Fields
>>>> Postdoctoral Researcher
>>>> Lab of Dr. Robert Switzer
>>>> Dept of Biochemistry
>>>> University of Illinois Urbana-Champaign
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
More information about the Bioperl-l
mailing list