[Bioperl-l] Fetching genomic sequences based on HUGO names or GeneIDs
Brian Osborne
osborne1 at optonline.net
Thu Feb 16 17:59:37 UTC 2006
Chris and Harry,
I'm writing a Wiki page on this. It's linked to the FAQ because the Wiki is
complaining that the FAQ is getting too big. I'll fill in the ENSEMBL API
and Bio::DB::Fasta approaches; if you could comment on the BioPerl/eutils
approach at some point, that would be superb:
http://bioperl.open-bio.org/wiki/Getting_Genomic_Sequences
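
For the Bio::DB::Fasta approach, here is a minimal sketch of the coordinate
bookkeeping; it is not the extract_genes.pl script itself. It assumes
gene2accession is tab-delimited with GeneID in column 2, the genomic
accession in column 8, 0-based start and end in columns 10 and 11, and
orientation in column 12. Check the README that accompanies the file, since
the layout may differ between releases. The sub names
(parse_gene2accession_line, flanked_window) and the sample record are made
up for illustration.

```perl
use strict;
use warnings;

# Parse one gene2accession line into a hash ref, or return nothing if the
# line is a comment or has no genomic placement.
# ASSUMPTION: column layout as described above (GeneID = field 2, genomic
# accession = field 8, start/end = fields 10/11, orientation = field 12).
sub parse_gene2accession_line {
    my ($line) = @_;
    chomp $line;
    return if $line =~ /^#/;
    my @f = split /\t/, $line;
    return if @f < 12;
    my ($gene_id, $acc, $start, $end, $strand) = @f[1, 7, 9, 10, 11];
    return if grep { $_ eq '-' } ($acc, $start, $end);
    return {
        gene_id => $gene_id,
        acc     => $acc,
        start   => $start + 1,   # ASSUMPTION: file is 0-based, Bio::DB::Fasta is 1-based
        end     => $end + 1,
        strand  => $strand,
    };
}

# Widen [start, end] by $flank bases on each side, clamped to the sequence.
sub flanked_window {
    my ($start, $end, $flank, $seq_length) = @_;
    my $from = $start - $flank;
    my $to   = $end + $flank;
    $from = 1           if $from < 1;
    $to   = $seq_length if $to > $seq_length;
    return ($from, $to);
}

# Hypothetical record, for illustration only (not real coordinates).
my $line = join "\t", 9606, 12345, 'REVIEWED', '-', '-', '-', '-',
                      'NC_TEST.1', '-', 999, 1999, '-';
my $rec = parse_gene2accession_line($line);
my ($from, $to) = flanked_window($rec->{start}, $rec->{end}, 500, 2200);
print "$rec->{acc} $from..$to strand $rec->{strand}\n";
```

With a Bio::DB::Fasta handle, $db->length($rec->{acc}) would supply the
sequence length, and for a minus-strand gene the coordinates can be passed
in reverse order, $db->seq($rec->{acc}, $to => $from), to get the reverse
complement. This only works if the IDs in the index match the accessions in
gene2accession, which may require a custom ID parser when the index is built.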
Brian O.
On 2/16/06 11:23 AM, "Harry Mangalam" <hjm at tacgi.com> wrote:
> Yes, I'm going to try this 1st. Also the pointer to the NCBI eutils page was
> helpful. They describe the same thing and I think that API will give me what
> I need. I'll post back to report.
>
> Sorry for the delay in answering - this is a side project and as such is going
> slow.
>
> Many thanks to you guys, especially Brian for the example code - much more
> than I had a right to expect. Virtual Beers all round and real ones should
> we ever meet up.
>
> Harry
>
>
> On Thursday 16 February 2006 04:52, Chris Fields wrote:
>> I think a method was recently implemented in Bio::DB::GenBank to
>> retrieve a segment of DNA given start and end coordinates in GenBank
>> format; that should contain the features you need. I requested it
>> ~Nov-Dec in the mailing list but didn't get a chance to test it.
>> Would that help?
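
At the eutils level, a coordinate-based fetch might look like the sketch
below. It only builds an efetch URL and does not go through
Bio::DB::GenBank, so it says nothing about how that method behaves. It
assumes efetch accepts seq_start, seq_stop and strand (1 for plus, 2 for
minus) for the nuccore database. The sub name efetch_region_url and the
accession are made up for illustration.

```perl
use strict;
use warnings;

# Build an efetch URL for a sub-region of a nucleotide record.
# ASSUMPTION: efetch parameters seq_start, seq_stop and strand as described
# above; values are not URL-escaped, which is fine for plain accessions.
sub efetch_region_url {
    my ($acc, $from, $to, $strand) = @_;
    my %param = (
        db        => 'nuccore',
        id        => $acc,
        seq_start => $from,
        seq_stop  => $to,
        strand    => ($strand eq '-' ? 2 : 1),
        rettype   => 'fasta',
        retmode   => 'text',
    );
    my $base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi';
    my @pairs;
    push @pairs, "$_=$param{$_}" for sort keys %param;   # sorted for a stable URL
    return "$base?" . join('&', @pairs);
}

# Hypothetical accession and coordinates, for illustration only.
my $url = efetch_region_url('NC_TEST.1', 500, 2200, '-');
print "$url\n";
```

The resulting URL can be fetched with any HTTP client; the coordinates would
come from wherever the gene's placement was looked up.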
>>
>> On Feb 15, 2006, at 11:16 PM, Brian Osborne wrote:
>>> Harry,
>>>
>>> It's not clear to me that NCBI's eutils offers this capability directly.
>>> You can probably download Entrez Gene entries and parse them for
>>> coordinates, but I know of no way to remotely retrieve genomic sequences
>>> like this from NCBI (ENSEMBL API perhaps?). What I had in mind uses the
>>> local approach that some of us favor, and to prove to myself that this is
>>> simple to do I wrote a script that I just added to examples/tools. It's
>>> called extract_genes.pl and it's based on Bio::DB::Fasta. Download the
>>> sequence files for a given species to some dir, download Entrez Gene's
>>> gene2accession file, and run. It creates and stores a hash for lookups,
>>> so it won't read gene2accession each time it runs.
>>>
>>> Brian O.
>>>
>>> On 2/14/06 12:15 PM, "Harry Mangalam" <hjm at tacgi.com> wrote:
>>>> Hi Brian,
>>>>
>>>> Thanks very much for the pointers and the speed of your reply, and
>>>> apologies for the speed of mine.
>>>>
>>>> This looks good, but what I was looking for was a bioP approach for
>>>> hooking to an API at NCBI or EBI so I could get this info and seqs from
>>>> them. In this case, speed of retrieval is not critical and I'd rather
>>>> not download the entirety of the sequences to a local disk to hack at
>>>> them.
>>>>
>>>> I've determined a screen-scraping approach to get them and could script
>>>> that, but I thought that bioP had a method for using NCBI's external
>>>> APIs, tho it may be that my memory is faulty or the approach is no
>>>> longer supported due to overload.
>>>>
>>>> Does NCBI make such APIs available anymore? I searched a bit for docs on
>>>> them but couldn't find anything (unless it's buried in the NCBI toolkit,
>>>> which I haven't started to excavate).
>>>>
>>>> Failing that, would SEALS provide such a service? Any PerlPinipeds
>>>> listening?
>>>>
>>>> Harry
>>>>
>>>> On Sunday 12 February 2006 08:37, Brian Osborne wrote:
>>>>> Harry,
>>>>>
>>>>> Hope you're doing well. The approach could be based on Bio::DB::Fasta.
>>>>> So, from its documentation:
>>>>>
>>>>> use Bio::DB::Fasta;
>>>>>
>>>>> # create database from directory of fasta files
>>>>> my $db = Bio::DB::Fasta->new('/path/to/fasta/files');
>>>>>
>>>>> # simple access (for those without Bioperl)
>>>>> my $seq = $db->seq('CHROMOSOME_I',4_000_000 => 4_100_000);
>>>>> my $revseq = $db->seq('CHROMOSOME_I',4_100_000 => 4_000_000);
>>>>> my @ids = $db->ids;
>>>>> my $length = $db->length('CHROMOSOME_I');
>>>>> my $alphabet = $db->alphabet('CHROMOSOME_I');
>>>>> my $header = $db->header('CHROMOSOME_I');
>>>>>
>>>>> # Bioperl-style access (reusing the $db handle created above)
>>>>> my $obj = $db->get_Seq_by_id('CHROMOSOME_I');
>>>>> my $fullseq = $obj->seq;
>>>>> my $subseq = $obj->subseq(4_000_000 => 4_100_000);
>>>>>
>>>>> Do you already have the offsets?
>>>>>
>>>>> Brian O.
>>>>>
>>>>> On 2/12/06 1:46 AM, "Harry Mangalam" <hjm at tacgi.com> wrote:
>>>>>> Hi All,
>>>>>>
>>>>>> After perusing the tutorial and other docs for an evening, I still
>>>>>> can't find the answer to this. Forgive me if I've missed something
>>>>>> obvious.
>>>>>>
>>>>>> This should not be a novel request, but I've not found it answered. If
>>>>>> bioperl isn't the best way to do this, I'd be grateful for a pointer to
>>>>>> a better way, especially if it includes an illuminating bit of code.
>>>>>>
>>>>>> The problem is to retrieve genomic sequences plus & minus some offset
>>>>>> from a locus determined by HUGO keyword or GeneID. This would be a
>>>>>> common followup chore for some extra analysis from a gene expression
>>>>>> expt. Or maybe this is in the DBFetch routines, but I've missed the
>>>>>> sequence type to specify...?
>>>>>>
>>>>>>
>>>>>> TIA!
>>>
>>
>> Christopher Fields
>> Postdoctoral Researcher
>> Lab of Dr. Robert Switzer
>> Dept of Biochemistry
>> University of Illinois Urbana-Champaign