[Bioperl-l] Re: Bio::Tools::Genscan

James Gilbert jgrg@sanger.ac.uk
Fri, 4 Aug 2000 10:43:40 +0100 (BST)


On Thu, 3 Aug 2000, Hilmar Lapp wrote:

> Dear all,
> 
> there is now a Genscan parser in Bio::Tools::Genscan.
> 
> After initialization with a file handle or a file the central method is
> 
> 	$gene = $genscan->next_prediction();
> 
> which returns a Bio::Tools::Prediction::Gene object, which in turn is-a
> Bio::SeqFeature::GeneStructure.
> 
> As a technical detail, the input file is not parsed completely first, but only
> up to the first predicted sequence entry. Assuming that these make up for a
> major part of the file, even parsing prediction results for very large input
> sequences should allow to keep the memory consumption reasonably low if you
> can manage to dispose each returned object before requesting the next one.
> 
> The predicted protein, and if available (i.e., if -cds was given to Genscan)
> the predicted CDS are both stored, too, although as far as I have been able to
> test it calling $gene->cds() would return mostly the same result (and
> translating it essentially yields the predicted peptide, but see below).
> 
> The exception is that if the initial exon is missing from a prediction there
> is obviously no way to know the phase of it, and consequently you don't know
> the frame of the first base of the first exon (the frame of the first base of
> the initial exon is always 0). 

We've been through the same problems in Ensembl.  
The only effective solution we found is to parse
the peptide sequences at the bottom of the
report.  Unfortunately, this would mean that you
could no longer delay parsing the whole file.

> To automatically select the right frame I have added a method correct_phase()
> to GeneStructure, which tries to adjust the frame (and phase) of a given DNA
> sequence by prepending 1 or 2 Ns (in lower-case) to it until the DNA sequence
> does no longer yield intervening termination codons in the frame starting with
> the first base.

This won't necessarily get you the correct
translation, because there may be more than one
full-length open reading frame.

I think there should be a frame or offset property
on the object which shows where to start the
translation.

> The automatic correction can be enabled by $gene->cds() with an argument of
> TRUE, and indeed comparing the prediction to $gene->cds(1) so far always
> showed no difference in the sequences. So far I haven't encountered a 'nasty'
> phase problem as Ewan mentioned it for Ensembl, but if anyone has a 'hard'
> example I'd be happy to test it.

The phase or frame numbers which genscan gives
don't make sense for reverse strand transcripts.

> Note that prepending Ns is exactly what Genscan does in its predicted CDS, so
> you could even easily determine the frame by examining the number of Ns
> prepended to the predicted CDS (provided you have it).
> 
> Bio::SeqFeature::GeneStructure is probably not the best place for
> correct_phase() to stay, because it's rather generic to DNA sequences. At
> present I'm not sure where to move it to. Has anyone any idea?
> 
> 	Hilmar
> 
> BTW the Ensembl Genscan parser is bound to bump at least on the version of
> Genscan we have (1.0), because Genscan prints a lot of explanations at the end
> of the file, thereby rendering the final predicted sequence invalid in FASTA
> format (the sequences are read with the SeqIO::fasta.pm module).
> 
> -- 
> -----------------------------------------------------------------------
> Hilmar Lapp                                      email: hlapp@gmx.net
> NFI Vienna, IFD/Bioinformatics                   phone: +43 1 86634 631
> A-1235 Vienna                                      fax: +43 1 86634 727
> -----------------------------------------------------------------------
> 

James G.R. Gilbert
The Sanger Centre
Wellcome Trust Genome Campus
Hinxton
Cambridge                        Tel: 01223 494906
CB10 1SA                         Fax: 01223 494919