[Bioperl-l] How to get from gi/ref/gb to genomic coordinates ?

Thu Feb 1 12:54:21 UTC 2007

Barry and Jason,

thanks for your quick and very helpful replies.

I guess we should have run (or should repeat) our BLAST search at
http://fungal.genome.duke.edu/
to get a better mapping from proteins to genomes?

As I retrieved all my proteins via whole-genome BLAST searches, we should
find (most of) them in the GenBank files. That is a good opportunity for
me to learn some Bioperl and the other packages you mentioned, in case we
want to do more complex analyses later :-)
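
The lookup mentioned above (protein accession from the BLAST hit header,
then the matching CDS in the GenBank file) could be sketched roughly as
follows. This is plain Python rather than Bioperl, and only an
illustration: the function names `parse_defline` and `cds_coordinates` are
made up here, and the code assumes `gi|...|gb|...|`-style headers, CDS
features that carry a `/protein_id` qualifier, and locations that do not
reference other records.

```python
import re

# Matches headers such as ">gi|46100068|gb|EAK85301.1| hypothetical protein ..."
DEFLINE_RE = re.compile(r"^>?gi\|(\d+)\|([a-z]+)\|([^|]+)\|\s*(.*)$")


def parse_defline(line):
    """Split an NCBI-style FASTA header into gi, db tag, accession, description."""
    match = DEFLINE_RE.match(line.strip())
    if match is None:
        raise ValueError("unrecognised FASTA header: %r" % line)
    gi, db, accession, description = match.groups()
    return {"gi": gi, "db": db, "accession": accession,
            "description": description}


def cds_coordinates(genbank_text):
    """Map each /protein_id to (record, start, end, strand) of its CDS.

    Assumes the standard flat-file layout: feature keys start in column 6,
    locations and qualifiers start in column 22.
    """
    coords = {}
    record = None        # name taken from the current LOCUS line
    location = None      # location string of the CDS being read, if any
    in_location = False  # True while a wrapped location is still continuing
    for line in genbank_text.splitlines():
        if line.startswith("LOCUS"):
            record = line.split()[1]
            continue
        key, value = line[5:21].strip(), line[21:].strip()
        if line.startswith("     ") and key:
            # A new feature begins; remember its location only if it is a CDS.
            location = value if key == "CDS" else None
            in_location = location is not None
        elif location is not None and value.startswith("/"):
            in_location = False
            match = re.match(r'/protein_id="([^"]+)"', value)
            if match:
                positions = [int(p) for p in re.findall(r"\d+", location)]
                strand = -1 if location.startswith("complement") else 1
                coords[match.group(1)] = (record, min(positions),
                                          max(positions), strand)
        elif in_location:
            # Continuation line of a location wrapped over several lines.
            location += value
    return coords
```

With `hit = parse_defline(header)` and `coords = cds_coordinates(text)`,
`coords.get(hit["accession"])` gives the record name from the LOCUS line,
the outermost start and end of the CDS (1-based, as in the GenBank file)
and the strand. Upstream sequence would then lie before `start` for a
plus-strand gene and after `end` for a minus-strand gene.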

Thank you very much!

Rainer

Barry Moore wrote:
> Rainer,
> 
> We use a Perl library called CGL, written by Mark Yandell and colleagues
> (which in turn uses Chris Mungall's BioChaos and the Unflattener.pm
> referred to by Jason), for this type of task. The basic pipeline is to
> convert GenBank files to Chaos XML, then use CGL with those XML files to
> get nice object-oriented access to exons, transcripts, proteins,
> coordinates and more for those genes. I am currently using this with
> good success on most GenBank genomes (unfortunately I haven't been
> working with the fungal genomes, but it should work fine). The Ensembl
> API provides similar functionality for Ensembl genomes, but there are
> not very many fungi there.
> 
> http://www.yandell-lab.org/cgl/
> http://www.ensembl.org/info/software/core/core_tutorial.html
> 
> Feel free to contact Mark or me directly if you are interested in
> using CGL.
> 
> Barry
> 
> On Jan 31, 2007, at 2:09 PM, Rainer Machne wrote:
> 
>> Dear Bioperl list,
>>
>> Hoping I am not on the wrong mailing list, I have a short question:
>>
>> Is there a standard way, or are there nice (Bioperl) tools, to get from a
>> gene id (gi) or other ids (see below) to the genomic coordinates of the
>> respective gene?
>>
>> We have FASTA files retrieved from NCBI protein BLAST searches against
>> fungal genomes:
>>
>> >gi|46100068|gb|EAK85301.1| hypothetical protein UM04252.1 [Ustilago maydis 521]
>>
>> or
>>
>> >gi|50292953|ref|XP_448909.1| unnamed protein product [Candida glabrata]
>>
>> (we only have gi, ref and gb in my set).
>>
>> I retrieved all my FASTA files from whole fungal genomes with available
>> protein sequences at
>> http://www.ncbi.nlm.nih.gov/sutils/genom_table.cgi?organism=fungi
>>
>> As I only searched whole finished genomes (not shotgun), I thought it
>> would then be easy to get the genomic coordinates and retrieve upstream
>> sequences, but so far we have failed to find a consistent way to do this
>> automatically. Many of the gi entries refer to mRNAs or partial mRNAs,
>> and the route to the coordinates seems to differ in each case.
>>
>> Any suggestions would be appreciated.
>>
>> with kind regards,
>> Rainer Machne
>>
>> University of Vienna
>> Department for Theoretical Chemistry
>> Theoretical Biochemistry Group