[Bioperl-l] est2genome
Hilmar Lapp
hlapp@gnf.org
Fri, 11 Oct 2002 10:30:04 -0700
On Friday, October 11, 2002, at 07:28 AM, Jason Stajich wrote:
> I wrote a very basic est2genome parser in Bio::Tools::Est2Genome and a
> test in t/est2genome.
>
> Now, I didn't really do this the way I'd like as I'm returning an array
> of either Bio::SeqFeature::SimilarityPair (exons) or
> Bio::SeqFeature::Generic (introns)
> and next_feature isn't supported yet because I don't think the current
> gene objects fit properly with this data.
>
> They do not allow attachment of evidence or the fact that the exon might
> contain a pair of information for the genomic and cDNA/pep information.
>
Good point. I thought I had written them inheriting from
SeqFeature::SimilarityPair too, but that's not true apparently.
What about defining SeqFeature::SimilarityPairI, which
Tools::Prediction::Exon would then implement too?
Evidence - yeah we need that. ChrisM proposed Ontology::EvidenceI,
which made sense to me.
In the process of sorting out the ontology stuff here I was going to
roll in ChrisM's stuff as well -- ChrisM, if you read this and you've
got an update, let me know - I can try to help you resolve the
commit problem you had, or commit the stuff for you.
>
> Additionally, we don't really seem to do a good job of serializing
> (GFF, GAME, GenBank/EMBL/Swissprot) Bio::SeqFeature objects which aren't
> Bio::SeqFeature::Generic.
>
> I think we need to add the hooks to make this simpler so one can, for
> example, parse with Est2Genome and output as annotation in GFF or
> GenBank/EMBL formats. We can use tag/value pairs to output the
> score and alignment information in either of these formats, and allow
> the user to override this if they have a specialized way they want to
> output this.
Tools::Prediction::Exon does use the tag/value system for storing
additional scores, so they should be included in GFF automatically.
>
> The problem comes in the composite objects (FeaturePair, SimilarityPair) -
> these can't be properly written out because one never sees the
> feature2()/hit() component of the data, nor the extra fields like
> significance when being written out by genbank/embl or gff writers.
Hm. I've hit this before, I'll see whether I can dig up that piece
of code. My approach was pretty simple as I remember, it probably
wasn't bidirectional. Maybe not even worth looking at as I think of
it.
> So we need a better way to register what the available outputs are in
> a sort of recursive fashion which can be available as tag/values and
> may have non-unique tag names.
>
> Does anyone have good ideas of how to structure this? Some sort of 'get
> all the tag values and all of your children's tag/values pairs and any
> registered data functions'.
Like $self->register_gff_tag('donor_signal'), and then the GFF
writer gathers all registered tags?
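To make that concrete, here is a rough sketch of what I have in mind. Everything in it is made up for illustration: My::Feature, add_tag_value(), add_child() and gff_tag_values() are hypothetical and not existing BioPerl API, and the attribute string at the end only approximates what a GFF writer would emit. It keeps [tag, value] pairs rather than a hash so that non-unique tag names survive, and it recurses into children so the hit side of a pair is not lost.

```perl
use strict;
use warnings;

# Hypothetical feature class for illustration only; not part of BioPerl.
package My::Feature;

sub new {
    my ($class) = @_;
    return bless {
        tags     => {},   # tag name => array ref of values
        gff_tags => [],   # tag names registered for GFF output, in order
        children => [],   # sub-features, e.g. the hit side of a pair
    }, $class;
}

sub add_tag_value {
    my ($self, $tag, $value) = @_;
    push @{ $self->{tags}{$tag} }, $value;
}

# Mark a tag as one the GFF writer should pick up (registered only once).
sub register_gff_tag {
    my ($self, $tag) = @_;
    push @{ $self->{gff_tags} }, $tag
        unless grep { $_ eq $tag } @{ $self->{gff_tags} };
}

sub add_child {
    my ($self, $child) = @_;
    push @{ $self->{children} }, $child;
}

# Collect [tag, value] pairs from this feature and, recursively, its children.
sub gff_tag_values {
    my ($self) = @_;
    my @pairs;
    for my $tag (@{ $self->{gff_tags} }) {
        push @pairs, map { [$tag, $_] } @{ $self->{tags}{$tag} || [] };
    }
    push @pairs, $_->gff_tag_values for @{ $self->{children} };
    return @pairs;
}

package main;

my $exon = My::Feature->new;
$exon->add_tag_value(donor_signal => '0.92');
$exon->register_gff_tag('donor_signal');

my $hit = My::Feature->new;
$hit->add_tag_value(significance => '1e-20');
$hit->register_gff_tag('significance');
$exon->add_child($hit);

# The writer only asks the top-level feature; the children come along.
my $attributes = join ' ; ', map { "$_->[0] $_->[1]" } $exon->gff_tag_values;
print "$attributes\n";
```

The writer then never needs to know which concrete feature class it is dealing with; it just asks for the registered pairs.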
>
> Also, in a final note, Ensembl is starting to standardize their function
> names from each_XX to get_all_XX
Excellent - that's very similar to how we're moving, right? The
difference is that we have get_XXXXs, which returns an array, and
get_all_XXXXs, which returns a possibly flattened array (e.g., for
SeqI this would be get_SeqFeatures() instead of top_SeqFeatures(),
and get_all_SeqFeatures() instead of all_SeqFeatures()).
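A toy illustration of the distinction, in case it helps. My::Node is a made-up container and not the real SeqI implementation; only the two method names are taken from the naming above, and the depth-first order is my assumption about what "flattened" would mean.

```perl
use strict;
use warnings;

# Toy container for illustration only; not the real Bio::SeqI implementation.
package My::Node;

sub new {
    my ($class, $name, @children) = @_;
    return bless { name => $name, features => [@children] }, $class;
}

sub name { $_[0]->{name} }

# Top-level features only.
sub get_SeqFeatures { @{ $_[0]->{features} } }

# Every feature, depth first: each feature followed by its sub-features.
sub get_all_SeqFeatures {
    my ($self) = @_;
    return map { ($_, $_->get_all_SeqFeatures) } $self->get_SeqFeatures;
}

package main;

my $gene = My::Node->new('gene',
    My::Node->new('exon1'),
    My::Node->new('exon2'),
);
my $seq = My::Node->new('seq', $gene);

my @top = map { $_->name } $seq->get_SeqFeatures;      # just the gene
my @all = map { $_->name } $seq->get_all_SeqFeatures;  # gene plus its exons
print "top: @top\n";
print "all: @all\n";
```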
Ewan, what's the plan for this distinction in Ensembl?
-hilmar
--
-------------------------------------------------------------
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
-------------------------------------------------------------