[Bioperl-l] How to get from gi/ref/gb to genomic coordinates ?

Wed Jan 31 22:00:01 UTC 2007

Rainer -

You probably want to download each whole genome as a GenBank file,
parse the file, and generate the coordinates of all the genes. AFAIK,
it is non-trivial to get the genomic coordinates starting from a gene
record using only the NCBI interface, since you really want to see the
sequence in genomic context.
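
To make the "parse the file and generate the coordinates" step concrete,
here is a rough sketch in plain Python (standard library only, not
BioPerl). The function name parse_cds_coordinates is made up for
illustration. It assumes a standard GenBank flat file whose CDS features
carry /protein_id and /db_xref="GI:..." qualifiers, and it ignores remote
locations and mixed-strand joins, so treat it as a starting point rather
than a complete parser.

```python
import re

def parse_cds_coordinates(genbank_text):
    """Map each CDS protein_id and GI to (record, start, end, strand).

    Assumes standard GenBank layout: feature keys begin in column 6 and
    locations/qualifiers begin in column 22. Coordinates are 1-based and
    span the whole CDS, including introns.
    """
    features = []
    record = None
    in_table = False
    for line in genbank_text.splitlines():
        if line.startswith("LOCUS"):
            record = line.split()[1]
            continue
        if line.startswith("FEATURES"):
            in_table = True
            continue
        if line.startswith(("ORIGIN", "CONTIG", "//")):
            in_table = False
            continue
        if not in_table:
            continue
        key, rest = line[:21].strip(), line[21:].strip()
        if key:
            # A new feature starts on this line.
            features.append({"key": key, "loc": rest, "quals": [],
                             "record": record})
        elif features:
            feat = features[-1]
            if rest.startswith("/") or feat["quals"]:
                feat["quals"].append(rest)   # qualifier or its continuation
            else:
                feat["loc"] += rest          # location wrapped onto next line

    coords = {}
    for feat in features:
        if feat["key"] != "CDS":
            continue
        # Simplification: every number in the location is a position.
        positions = [int(n) for n in re.findall(r"\d+", feat["loc"])]
        if not positions:
            continue
        strand = "-" if feat["loc"].startswith("complement") else "+"
        entry = (feat["record"], min(positions), max(positions), strand)
        quals = " ".join(feat["quals"])
        for pattern in (r'/protein_id="([^"]+)"', r'/db_xref="GI:(\d+)"'):
            match = re.search(pattern, quals)
            if match:
                coords[match.group(1)] = entry
    return coords
```

Calling parse_cds_coordinates(open(path).read()) on a downloaded genome
file would give a dictionary you can query with either the accession or
the gi number taken from the Blast FASTA headers.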

I wrote a GenBank-to-GFF3 formatter that worked okay for my needs:
http://fungal.genome.duke.edu/~jes12/software/scripts/gbk2gff3.perl.txt

You can also try Chris Mungall's Unflattener module:
http://bioperl.org/wiki/Module:Bio::SeqFeature::Tools::Unflattener
http://search.cpan.org/~sendu/bioperl/Bio/SeqFeature/Tools/Unflattener.pm

Some annotations deviate from the standardized formats, so you have
to add some special-casing if you really want to pull the data in
consistently every time.

There are tools in BioPerl for managing this kind of data. Typically
the data can be represented in GFF format for simplicity, and there
are database implementations for fast access to the data. See
Bio::DB::GFF and Bio::DB::SeqFeature.
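
As an illustration of why GFF is convenient here, the sketch below (plain
Python again, with an in-memory dictionary standing in for a real
database such as Bio::DB::GFF) indexes gene features by their ID
attribute and computes a strand-aware upstream window. The function
names, the default feature type "gene", and the 1000 bp default are all
assumptions for the example.

```python
def load_gff3_genes(lines, feature_type="gene"):
    """Index GFF3 features of one type by their ID attribute.

    Returns {id: (seqid, start, end, strand)} with 1-based inclusive
    coordinates, exactly as stored in the GFF3 columns.
    """
    genes = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != feature_type:
            continue
        attrs = dict(p.split("=", 1) for p in cols[8].split(";") if "=" in p)
        gene_id = attrs.get("ID")
        if gene_id:
            genes[gene_id] = (cols[0], int(cols[3]), int(cols[4]), cols[6])
    return genes


def upstream_window(seqid, start, end, strand, length=1000):
    """Return (seqid, from, to) for the region upstream of a feature.

    On the minus strand, "upstream" lies after the feature's end
    coordinate. The window is not clipped to the contig length, and it is
    empty (to < from) for a plus-strand feature starting at position 1.
    """
    if strand == "-":
        return (seqid, end + 1, end + length)
    return (seqid, max(1, start - length), start - 1)
```

With the genes loaded, upstream_window(*genes[some_id]) gives the
interval to cut from the genome sequence; for minus-strand genes the
extracted sequence still has to be reverse-complemented.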

I did make GFF files during my graduate work for most of the (then)  
available fungal genomes - http://fungal.genome.duke.edu/ which may  
be useful to you as well.

-jason
--
Jason Stajich
Miller Research Fellow
University of California, Berkeley
lab: 510.642.8441
http://pmb.berkeley.edu/~taylor/people/js.html
http://fungalgenomes.org/

On Jan 31, 2007, at 1:09 PM, Rainer Machne wrote:

> Dear Bioperl list,
>
> hoping not to be on the wrong email list, I have a short question:
>
> Is there a standard way, or are there nice (Bioperl) tools, to get from a
> gene id (gi) or other ids (see below) to the genomic coordinates of the
> respective gene?
>
> We have Fasta files retrieved from NCBI protein Blast in fungal genomes:
>
>> gi|46100068|gb|EAK85301.1| hypothetical protein UM04252.1 [Ustilago
> maydis 521]
> or
>> gi|50292953|ref|XP_448909.1| unnamed protein product [Candida  
>> glabrata]
>
> (we only have gi, ref and gb in my set).
>
> I retrieved all my fasta files from whole fungal genomes with available
> protein sequences at
> http://www.ncbi.nlm.nih.gov/sutils/genom_table.cgi?organism=fungi
>
> As I only searched whole finished genomes (not shotgun), I thought it
> would then be easy to get the genomic coordinates and retrieve upstream
> sequences, but we have failed so far to find a consistent way to do this
> automatically. Many of the gi entries refer to mRNAs or partial mRNAs
> and the way to the coordinates seems to differ for each case.
>
> Any suggestions would be appreciated.
>
> with kind regards,
> Rainer Machne
>
> University of Vienna
> Department for Theoretical Chemistry
> Theoretical Biochemistry Group