[Bioperl-l] Proposal: SemanticMapping and call for info on Gene Objects

Jason Stajich jason@cgt.mc.duke.edu
Mon, 13 May 2002 12:15:10 -0400 (EDT)


On Mon, 13 May 2002, Chris Mungall wrote:

>
> Sounds sensible; you know my opinions on biospecific classes but if folks
> want them this seems a good way to do it.
>
I think we need to at least try this out - no one has proposed any other
mechanisms for properly mapping the bioperl seqfeature subclasses into the
flatfile formats and the hard coded FTHelper is way too complicated for
someone to jump in and add their custom features.

> I would venture that map_to_generic_features() isn't really necessary, as
> I strongly feel that the Gene/Transcript/Exon/etc classes should be
> lightweight wrappers on top of the generic seqfeatures, with class
> specific attribute accessors mapped onto the seqfeature tag/value system.
>

They are in fact lightweight wrappers - my reasoning here is that there
isn't a sanctioned way to encode similarity features, computation, or
other custom defined objects and someone may want to plug in their own
specialized way of outputting the data.  Say for example that a program
like vectorNTI only recognizes a certain set of gene related feature types,
while another program recognizes a separate, perhaps overlapping set.
It's much easier to plug in a different mapper than to reimplement the
objects.  I'm just shooting for more flexibility but maybe that is
giving enough rope to hang oneself?
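
To make the "plug in a different mapper" idea concrete, here is a minimal
sketch, not the prototype itself. The features are plain hashes rather
than real bioperl objects, and StrictGeneMapper, PassThroughMapper and
feature_table_keys are hypothetical names made up for illustration; only
the method name map_to_generic_features comes from the proposal.

```perl
use strict;
use warnings;

# Hypothetical mapper: passes through only the feature types that an
# (imaginary) consuming program is assumed to recognize.
package StrictGeneMapper;
sub new { return bless {}, shift }
sub map_to_generic_features {
    my ($self, %args) = @_;
    my %known = map { $_ => 1 } qw(gene mRNA CDS);
    return grep { $known{ $_->{primary_tag} } } @{ $args{-features} };
}

# Hypothetical mapper: passes every feature through untouched.
package PassThroughMapper;
sub new { return bless {}, shift }
sub map_to_generic_features {
    my ($self, %args) = @_;
    return @{ $args{-features} };
}

package main;

# The writer only depends on the mapper interface, not on the
# specialized feature classes.
sub feature_table_keys {
    my ($mapper, @features) = @_;
    return map { $_->{primary_tag} }
           $mapper->map_to_generic_features(-features => \@features);
}

# Plain-hash stand-ins for features (an assumption of this sketch).
my @features = map { +{ primary_tag => $_ } } qw(gene mRNA CDS similarity);
my @strict   = feature_table_keys(StrictGeneMapper->new,  @features);
my @all      = feature_table_keys(PassThroughMapper->new, @features);
```

Swapping the mapper changes what gets written without touching the
feature objects or the writer.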

> eg in gadfly, calling $gene->transcript_list([@trs]) actually maps to
>
> $seqfeature->set_subfeatures_by_type("transcript", [@trs])
>
> this keeps everything working for applications that just want to use the
> objects at the generic seqfeature level.
>
They can still be USED as generic seqfeatures - but a transcript object
would have to know how to uncoil itself in such a way as to be output
in the genbank/embl feature table, and I don't like putting format specific
code in the objects (admittedly I put it in the Location objects). Otherwise
the SeqIO::FTHelper class has to uncoil the data, and I am trying to provide
a pluggable system for that step so one can provide a custom mapper instead
of a hardcoded FTHelper object.

But maybe I'm just looking at it from one side - it may be possible to
properly encode the detanglement in a Transcript object and have all the
subclasses behave properly.

> Not sure about recording translation starts to the exons - what about
> doubly encoded genes in retroviral genomes? also, my faves - dicistronic
> genes.
>
> I'm happy to provide a tricky test-set of genbank files to test this.
>
It would be good to build a test set that goes from simple to complex, if
you could prepare some files for this it would be great - maybe we'll
make a specific dir w/in t/data/ for this - I already added one file
AB077698.gb in there with some annotation but can move it to the new
directory if you want to push some files into the repository.

> This part is a bit less fleshed out... but it would be really nice if the
> biology encoded in the object model is both as flexible as possible, and
> open to introspection.
>
> E.g. let's take a small part of SO and turn it into a lispy perl
> datastructure:
>
> [schema=>
>   [gene=>[[isa=>"seqfeature"],
>           [coding=>1],
>           [class=>"Bio::GeneI"]]],
>   [noncoding-gene=>[[isa=>"gene"],
>           [coding=>0],
>           [class=>"Bio::NcGeneI"]]],
>   [transcript=>[[isa=>"seqfeature"],
>                 [partof=>"gene"],
>                 [class=>"Bio::TranscriptI"]]],
> ]
>
> Ewan will hate this... but it would be nice to have as much of the
> implementation specified dynamically by a "language" such as the above. Or
> at least have it as an implementation option. If not, at least let's try
> and keep the OM to SO mapping clean.
>

My local prototype contains a hash with essentially that format. I think
it is a reasonable way to go, but some programmed logic is going to have
to be applied to handle cases where annotation is not complete (no /gene
tag) or possibly even inconsistent.
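
As a rough illustration of that kind of hash (a sketch only, not my
actual prototype): the entries below just restate the schema quoted
above, and lineage() and class_for() are hypothetical helper names.
The fallback to Bio::SeqFeature::Generic for unknown types is an
assumption about how incomplete annotation might be handled.

```perl
use strict;
use warnings;

# Hypothetical schema keyed by feature type, mirroring the quoted example.
my %schema = (
    'gene'           => { isa => 'seqfeature', coding => 1, class => 'Bio::GeneI' },
    'noncoding-gene' => { isa => 'gene',       coding => 0, class => 'Bio::NcGeneI' },
    'transcript'     => { isa => 'seqfeature', partof => 'gene', class => 'Bio::TranscriptI' },
);

# Walk the isa chain from a type up to its root, for introspection.
sub lineage {
    my ($type) = @_;
    my @chain = ($type);
    my %seen  = ($type => 1);
    while (my $entry = $schema{$type}) {
        $type = $entry->{isa};
        last if !defined $type || $seen{$type}++;    # guard against cycles
        push @chain, $type;
    }
    return @chain;
}

# Pick the class for a type, falling back to the generic feature class
# when the type is not described by the schema.
sub class_for {
    my ($type) = @_;
    return exists $schema{$type} ? $schema{$type}{class}
                                 : 'Bio::SeqFeature::Generic';
}

my @chain = lineage('noncoding-gene');
```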

> Here's SO:
> ftp://ftp.geneontology.org/pub/go/gobo/sequence.ontology/
>
> Flexibility is important; on the one hand there are some who want to write
> robust software that deals with pc genes with the minimum of fuss, writing
> to a simple (and possibly biologically restrictive) GeneI, TranscriptI etc
> interface. On the other hand some of us want more plasticky objects,
> possibly conforming to the pc-gene interfaces.
>
> Regarding the logic for the actual mapping; this seems kind of tricky. Is
> it robust to use the /gene field in genbank records to collect alternately
> spliced mRNAs into a gene object?
>

My first reaction is - this is up to the implementer.  There are going to
be tradeoffs as the seq format is not very well enforced since those
fields are "optional".  I think that alternate transcripts may also have
to be deduced.  I was hoping by starting some objects and discussion we
can get some opinions from folks and try a couple of different tradeoffs.
I would like to see a strict mapper where only those with /gene are handled
and then a more flexible one which will do its best guessing.
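
A minimal sketch of the two behaviours, assuming features are plain
hashes whose 'gene' key holds the /gene value when present. The function
name group_by_gene_tag and the 'guess' option are hypothetical, and the
overlap-based guess is only one deliberately crude heuristic.

```perl
use strict;
use warnings;

sub group_by_gene_tag {
    my ($features, %opt) = @_;
    my (%genes, @unassigned);

    # Pass 1: features with an explicit /gene value define the groups.
    push @{ $genes{ $_->{gene} } }, $_
        for grep { defined $_->{gene} } @$features;

    # Pass 2: untagged features are left alone (strict) or guessed (lenient).
    for my $f (grep { !defined $_->{gene} } @$features) {
        my $hit;
        if ($opt{guess}) {
            # Crude guess: first gene (by name) with an overlapping member.
            ($hit) = grep {
                my $members = $genes{$_};
                scalar grep { $_->{start} <= $f->{end} && $f->{start} <= $_->{end} } @$members;
            } sort keys %genes;
        }
        if (defined $hit) { push @{ $genes{$hit} }, $f }
        else              { push @unassigned, $f }
    }
    return (\%genes, \@unassigned);
}

# Plain-hash stand-ins for feature-table entries (an assumption).
my @features = (
    { primary_tag => 'mRNA', gene => 'abc1', start => 100,  end => 500  },
    { primary_tag => 'CDS',                  start => 150,  end => 450  },
    { primary_tag => 'CDS',                  start => 2000, end => 2100 },
);
my ($strict_genes,  $strict_rest)  = group_by_gene_tag(\@features);
my ($lenient_genes, $lenient_rest) = group_by_gene_tag(\@features, guess => 1);
```

The strict call leaves both untagged CDS features unassigned; the
lenient call attaches the overlapping one to abc1.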

> Would the semantic mapper do stuff like create intron objects from exons,
> etc?

Yes - but what happens when the file has introns annotated - we should use
their calls - the current transcript object deduces introns from the
contained exons - but it doesn't support explicitly setting introns or
attaching annotation to them, which could be useful.
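
Something along these lines is what I have in mind, as a sketch only:
exons and introns are plain hashes with 1-based inclusive start/end on a
single strand, and introns_for() is a hypothetical name, not a method on
the existing transcript object.

```perl
use strict;
use warnings;

sub introns_for {
    my ($exons, $annotated) = @_;

    # Prefer the introns the file annotates explicitly, if any.
    return @$annotated if $annotated && @$annotated;

    # Otherwise deduce them from the gaps between sorted exons.
    my @sorted = sort { $a->{start} <=> $b->{start} } @$exons;
    my @introns;
    for my $i (1 .. $#sorted) {
        my $start = $sorted[$i - 1]{end} + 1;
        my $end   = $sorted[$i]{start} - 1;
        next if $end < $start;    # adjacent or overlapping exons: no intron
        push @introns, { primary_tag => 'intron', start => $start, end => $end };
    }
    return @introns;
}

my @exons = (
    { start => 401, end => 500 },
    { start => 1,   end => 100 },
    { start => 201, end => 300 },
);
my @deduced  = introns_for(\@exons);
my @explicit = introns_for(\@exons, [ { primary_tag => 'intron', start => 101, end => 190 } ]);
```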

>
> It seems the mapping must be in 2 parts; the first will manipulate the
> seqfeature / subseqfeature hierarchy, eg to fix genbank split location
> mRNA features into 3 level gene/transcript/exon/translation/cds objects.
> The second part would go through and "bless" the objects appropriately. It
> would be nice to separate those.
>
Yep - exactly how I've prototyped it...  I imagined blessing the objects
first though and then building up the transcript and gene objects.  Will
probably need to get that IntersectionGraph object implemented too so we
can easily query for overlapping features within an annotated seq range.
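
Until something like that exists, the query itself can be sketched as a
naive linear scan. This is not the IntersectionGraph, just a stand-in
over plain-hash features with 1-based inclusive coordinates, and
features_overlapping() is a hypothetical name.

```perl
use strict;
use warnings;

# Return every feature that shares at least one base with [start, end].
sub features_overlapping {
    my ($features, $start, $end) = @_;
    return grep { $_->{start} <= $end && $_->{end} >= $start } @$features;
}

my @features = (
    { primary_tag => 'exon', start => 1,   end => 100 },
    { primary_tag => 'exon', start => 201, end => 300 },
    { primary_tag => 'CDS',  start => 50,  end => 250 },
);
my @hits = features_overlapping(\@features, 90, 150);
```

A linear scan is fine for a single record but would get slow for
repeated queries on a heavily annotated sequence, which is the case an
indexed structure would be for.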


-j

> On Sat, 11 May 2002, Jason Stajich wrote:
>
> > I'm starting to try and build the semantic mapper for building
> > Bio::SeqFeature::Gene objects from a list of Bio::SeqFeatureI objects.
> > Dave/Hilmar any chance you guys can walk us through the ideas behind the
> > Gene objects and the assumptions that have been made.
> > I am wondering if we have a rich enough set of objects for truly
> > representing all the information one might have for a gene.
> >
> > I think we probably need a CDS object or a little richer exon object to
> > note where translation starts.  I'm not sure what is appropriate - to
> > build objects towards the way data is organized in a genbank/embl file, or
> > build them a little more generically and have to do some acrobatics to go
> > in seqfile -> GeneStructure -> out seqfile format.
> >
> > Anyone who has opinions or ideas here, I would encourage you to look over
> > the existing objects and help propose some directions.  I'd perhaps like
> > to adopt what we can from the Gadfly & Ensembl models as well - any
> > guidance and lessons learned would be great Ewan/Michele/Chris M.
> >
> > As for the actual semantic mapping part - here is a simple interface
> > I've started.
> >
> > Bio::SeqFeature::SemanticMapperI
> > (or should it be a Bio::Factory::SemanticMapperI ???)
> > (happy to hear better suggestions for names)
> >
> > =head2 map_from_generic_features
> >
> >  Title   : map_from_generic_features
> >  Usage   : my @features = $mapper->map_from_generic_features(-features => \@generic);
> >  Function: Will build new Bio::SeqFeatureI object(s) from a set of
> >            Bio::SeqFeatureI objects based on implemented logic.
> >  Returns : List of Bio::SeqFeatureI objects
> >  Args    : -features => \@generic  # Feature list
> >
> > =head2 map_to_generic_features
> >
> >  Title   : map_to_generic_features
> >  Usage   : my @features = $mapper->map_to_generic_features(-features => \@specialized);
> >  Function: Will build generic Bio::SeqFeature::Generic objects from
> >            specialized Bio::SeqFeature:: objects useful for outputting
> >            GenBank/EMBL Feature Tables.
> >  Returns : List of Bio::SeqFeatureI
> >  Args    : -features => \@specialized  # array ref of features to map
> >                                         # to generic objects
> >
> > =cut
> >
> > The first implementing class would be Bio::SeqFeature::GeneSemanticMapper,
> > which would work to build Bio::SeqFeature::Gene::GeneStructure objects or
> > at least Exon/Intron objects depending on the depth of the annotated
> > data.
> >
> > A second implementing class would be
> > Bio::SeqFeature::AnalysisSemanticMapper. (name up for debate!) This would
> > allow us to expand/collapse SeqFeature::Computational/FeaturePair/HSP etc
> > objects to/from a set of SeqFeatureI(s).
> >
> > This class would also provide a means for simplifying objects from high
> > level bioperl SeqFeature classes down to the Generic object level suitable
> > for outputting.
> >
> > I would then propose adding methods to Bio::SeqIO - add_SemanticMapper(),
> > each_SemanticMapper(), remove_SemanticMappers() to deal with having a set of
> > semantic mappers to process sequence features once they have been created.
> > Perhaps add a boolean state to the SeqIO class as to whether or not to use
> > SemanticMapping as there is going to be a serious performance cost.  One
> > can always process features after the sequence is read in so we gain
> > flexibility without always paying the performance cost.  By delegating
> > this to a separate factory we can still reimplement the sequence parsing
> > later on without affecting this behavior.
> >
> >
> > Comments, ideas, & volunteers welcomed.
> >
> > -jason
> >
>

-- 
Jason Stajich
Duke University
jason at cgt.mc.duke.edu