[Bioperl-l] est2genome

Hilmar Lapp hlapp@gnf.org
Fri, 11 Oct 2002 10:30:04 -0700


On Friday, October 11, 2002, at 07:28 AM, Jason Stajich wrote:

> I wrote a very basic est2genome parser in Bio::Tools::Est2Genome and a
> test in t/est2genome.
>
> Now, I didn't really do this the way I'd like as I'm returning an array
> of either Bio::SeqFeature::SimilarityPair (exons) or 
> Bio::SeqFeature::Generic (introns)
> and next_feature isn't supported yet because I don't think the current
> gene objects fit properly with this data.
>
> They do not allow attachment of evidence, or the fact that the exon might
> contain a pair of information for the genomic and cdna/pep information.
>

Good point. I thought I had written them inheriting from 
SeqFeature::SimilarityPair too, but that's not true apparently.

What about defining SeqFeature::SimilarityPairI, which 
Tools::Prediction::Exon would then implement too?

Evidence - yeah we need that. ChrisM proposed Ontology::EvidenceI, 
which made sense to me.

In the process of sorting out the ontology stuff here I was going to 
roll in ChrisM's stuff as well -- ChrisM, if you read this and you've 
got an update, let me know - I can try to help you resolve the 
commit problem you had, or commit the stuff for you.

>
> Additionally, we don't really seem to do a good job of serializing 
> (GFF,
> GAME, GenBank/EMBL/Swissprot) Bio::SeqFeature objects which aren't
> Bio::SeqFeature::Generic.
>
> I think we need to add the hooks to make this simpler so one can, for
> example, parse with Est2Genome and output as annotation in GFF or
> GenBank/EMBL formats.  We can use tag/value pairs to output the
> score and alignment information in either of these formats, and allow
> the user to override this if they have a specialized way they want to
> output this.

Tools::Prediction::Exon does use the tag/value system for storing 
additional scores, so they should be included in GFF automatically.

>
> The problem comes in the composite objects (FeaturePair, 
> SimilarityPair) -
> these can't be properly written out because one never sees the
> feature2()/hit() component of the data, nor the extra fields like
> significance when being written out by genbank/embl or gff writers.

Hm. I've hit this before, I'll see whether I can dig up that piece 
of code. My approach was pretty simple as I remember, it probably 
wasn't bidirectional. Maybe not even worth looking at as I think of 
it.

>   So we need a better way to register what the available outputs are, in
> a sort of recursive fashion, which can be available as tag/values and may
> have non-unique tag names.
>
> Does anyone have good ideas of how to structure this? Some sort of 'get
> all the tag values and all of your children's tag/values pairs and any
> registered data functions'.

Like $self->register_gff_tag('donor_signal'), and then the GFF 
writer gathers all registered tags?
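
A rough sketch of that idea, as a toy only: My::Feature and all of its 
methods (register_gff_tag, add_tag_value, gff_tag_values) are made-up 
names for illustration and are not existing bioperl API. Each feature 
registers the tags it wants written out, and the writer collects them 
recursively from the feature and its children as [tag, value] pairs, so 
non-unique tag names survive.

```perl
use strict;
use warnings;

# Hypothetical feature class; none of these names exist in bioperl.
package My::Feature;

sub new {
    my ($class, %args) = @_;
    return bless {
        tags     => {},                     # tag name => array ref of values
        gff_tags => [],                     # tags registered for output, in order
        children => $args{children} || [],  # e.g. the feature2()/hit() component
    }, $class;
}

sub add_tag_value {
    my ($self, $tag, $value) = @_;
    push @{ $self->{tags}{$tag} }, $value;
}

# Mark a tag as one the GFF writer should emit (registering twice is a no-op).
sub register_gff_tag {
    my ($self, $tag) = @_;
    push @{ $self->{gff_tags} }, $tag
        unless grep { $_ eq $tag } @{ $self->{gff_tags} };
}

# Return [tag, value] pairs for this feature, then recursively for children.
# Pairs rather than a hash, so the same tag name may appear more than once.
sub gff_tag_values {
    my ($self) = @_;
    my @pairs;
    for my $tag (@{ $self->{gff_tags} }) {
        push @pairs, map { [$tag, $_] } @{ $self->{tags}{$tag} || [] };
    }
    push @pairs, $_->gff_tag_values for @{ $self->{children} };
    return @pairs;
}

package main;

my $hit = My::Feature->new;
$hit->add_tag_value(significance => '1e-20');
$hit->register_gff_tag('significance');

my $exon = My::Feature->new(children => [$hit]);
$exon->add_tag_value(donor_signal  => '0.93');
$exon->add_tag_value(internal_note => 'not registered, so not written');
$exon->register_gff_tag('donor_signal');

my @pairs = $exon->gff_tag_values;
print join(';', map { "$_->[0]=$_->[1]" } @pairs), "\n";
```

Unregistered tags such as internal_note stay out of the output, and a 
user who wants something specialized could override gff_tag_values in a 
subclass.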

>
> Also, in a final note, Ensembl is starting to standardize their 
> function
> names from each_XX to get_all_XX

Excellent - that's very similar to how we're moving, right? The 
difference is that we have get_XXXXs, which returns an array, and 
get_all_XXXXs, which returns a possibly flattened array (e.g., for 
SeqI this would be get_SeqFeatures() instead of top_SeqFeatures(), 
and get_all_SeqFeatures() instead of all_SeqFeatures()).
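
To make the distinction concrete, here is a toy sketch. Toy::Feature is 
a made-up class, not the real SeqI implementation; only the two method 
names come from the convention above. get_SeqFeatures() returns the 
direct children, while get_all_SeqFeatures() also descends into 
sub-features and flattens the result.

```perl
use strict;
use warnings;

# Toy feature holder (hypothetical); only illustrates the naming split.
package Toy::Feature;

sub new {
    my ($class, $name, @subfeatures) = @_;
    return bless { name => $name, subfeatures => [@subfeatures] }, $class;
}

sub name { $_[0]{name} }

# get_XXXXs: the direct (top-level) children only.
sub get_SeqFeatures { @{ $_[0]{subfeatures} } }

# get_all_XXXXs: children plus all their descendants, flattened depth-first.
sub get_all_SeqFeatures {
    my ($self) = @_;
    return map { ($_, $_->get_all_SeqFeatures) } $self->get_SeqFeatures;
}

package main;

my $gene = Toy::Feature->new('gene',
    Toy::Feature->new('exon1'),
    Toy::Feature->new('exon2'),
);
my $seq = Toy::Feature->new('seq', $gene);

my @top = map { $_->name } $seq->get_SeqFeatures;
my @all = map { $_->name } $seq->get_all_SeqFeatures;
print "top: @top\n";
print "all: @all\n";
```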

Ewan what's the plan for this distinction in Ensembl?

	-hilmar
--
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------