[Bioperl-l] EnsEMBL-Bioperl converters proposal
Juguang Xiao
juguang at fugu-sg.org
Thu Mar 6 17:20:28 EST 2003
Hi Ewan and Michele,
It was nice to talk with you at the Bio Hackathon in Singapore. I am
writing the proposal for the converter between bioperl and EnsEMBL objects.
TERMS
In this document, 'users' means the programmers who use this converter
system, while 'developer' refers to the programmer who develops this
converter system, currently just me. :)
PACKAGE
With your agreement, I will use the Bio::EnsEMBL::Utils::Converter module
as the converter factory, and the Bio::EnsEMBL::Utils::Converter package
namespace will be dedicated to the converter instances.
FRAMEWORK
The only module that users need to use directly is
Bio::EnsEMBL::Utils::Converter, the converter factory. In the POD of
this module, the developer is responsible for announcing which converter
instances have been implemented. Users need to perform two steps:
(1) constructing a converter instance and (2) converting. See the
example code below.
use Bio::EnsEMBL::Utils::Converter;

my $ens_analysis; # a Bio::EnsEMBL::Analysis object
my $ens_contig;   # a Bio::EnsEMBL::RawContig object
my $converter = Bio::EnsEMBL::Utils::Converter->new(
    -in       => 'Bio::Search::Hit::GenericHit',
    -out      => 'Bio::EnsEMBL::DnaPepAlignFeature',
    -analysis => $ens_analysis,
    -contig   => $ens_contig,
);
# NOTE: By convention, the convert method accepts an array ref
# and returns an array ref.
my @objs; # an array of original objects.
my @converted_obj = @{ $converter->convert(\@objs) };
NOTES
1. In Converter::new, a user needs to specify at least the '-in' and
'-out' module names of the conversion, say
-in => 'Bio::SeqFeature::Generic', -out => 'Bio::EnsEMBL::SimpleFeature'.
When converting features from bioperl to ensembl, as you know about
ensembl, you also need to supply the analysis and rawcontig information.
2. By convention, Converter::convert accepts an array ref of objects and
returns an array ref of objects too. To be friendly, my implementation
also accepts a single object and returns a single object, but gives the
user a warning.
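The convention in note 2 can be sketched roughly as follows. This is only an illustration, not the committed code: the package name Sketch::Converter is hypothetical, and _convert_single here is a stand-in that just wraps its input in a hash.

```perl
use strict;
use warnings;

package Sketch::Converter;    # hypothetical name, not the committed module

sub new { my ($class) = @_; return bless {}, $class; }

# Stand-in for the real per-object conversion: wraps the input in a hash.
sub _convert_single {
    my ($self, $obj) = @_;
    return { converted_from => $obj };
}

sub convert {
    my ($self, $input) = @_;
    if (ref($input) eq 'ARRAY') {
        # The conventional path: array ref in, array ref out.
        return [ map { $self->_convert_single($_) } @$input ];
    }
    # The friendly path: one object in, one object out, plus a warning.
    warn "convert() expects an array ref; converting a single object\n";
    return $self->_convert_single($input);
}

package main;

my $converter = Sketch::Converter->new;
my $many = $converter->convert([ 'a', 'b' ]);   # array ref of two results
my $one  = $converter->convert('c');            # one result, with a warning
```

Checking ref($input) keeps the array-ref path as the normal case, while the single-object path stays available but noisy.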
INSIDE
The hierarchy of the converter modules looks like this:
Converter
  bio_ens
    bio_ens_seq
    bio_ens_seqFeature
    bio_ens_featurePair (converting to Bio::EnsEMBL::FeaturePair /
      RepeatFeature, for repeatmasker results)
    bio_ens_hit (Bio::Search::Hit::GenericHit / HSP::GenericHSP,
      generated by Blast)
  ens_bio
    ens_bio_seq (an EnsEMBL feature object actually attaches a bioperl
      seq object)
    ens_bio_seqFeature
    ens_bio_featurePair
As you can see, the design is borrowed from bioperl's SeqIO and the
like, but with the variation of multiple layers. Hopefully no copyright
or legal issue is involved. :)
The top two levels mainly marshal the request to find the right
converter instance based on -in and -out. Generally,
Bio::EnsEMBL::Utils::Converter judges whether the conversion is from
bioperl to ensembl or in the opposite direction, and calls the
constructor of either bio_ens or ens_bio. Then
Bio::EnsEMBL::Utils::Converter::bio_ens, for example, tries to find the
more detailed implementor, also based on -in and -out. The method that
does this is called _guess_module.
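A rough sketch of this two-step lookup is given here. The lookup table, the function names guess_direction and guess_module, and the rule of deciding the direction from the Bio::EnsEMBL:: prefix are all assumptions made for illustration; the real _guess_module may work differently.

```perl
use strict;
use warnings;

# Hypothetical lookup table: bioperl -in class => third-level module name.
my %BIO_ENS_MODULE = (
    'Bio::SeqFeature::Generic'     => 'bio_ens_seqFeature',
    'Bio::SeqFeature::FeaturePair' => 'bio_ens_featurePair',
    'Bio::Search::Hit::GenericHit' => 'bio_ens_hit',
);

# First level: decide the direction from the -in and -out class names.
sub guess_direction {
    my ($in, $out) = @_;
    my $in_is_ens  = $in  =~ /^Bio::EnsEMBL::/;
    my $out_is_ens = $out =~ /^Bio::EnsEMBL::/;
    return 'bio_ens' if !$in_is_ens &&  $out_is_ens;
    return 'ens_bio' if  $in_is_ens && !$out_is_ens;
    die "Cannot tell the direction for $in -> $out\n";
}

# Second level: pick the detailed implementor (bio_ens direction only here).
sub guess_module {
    my ($in, $out) = @_;
    my $direction = guess_direction($in, $out);
    die "Only bio_ens is sketched here\n" unless $direction eq 'bio_ens';
    my $suffix = $BIO_ENS_MODULE{$in}
        or die "No converter implemented for $in\n";
    return "Bio::EnsEMBL::Utils::Converter::$suffix";
}

my $module = guess_module('Bio::Search::Hit::GenericHit',
                          'Bio::EnsEMBL::DnaPepAlignFeature');
print "$module\n";
```

With a table like this, announcing a new converter instance would amount to adding one entry and one module.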
Each third-level module, such as
Bio::EnsEMBL::Utils::Converter::bio_ens_seqFeature, is a hidden hero
that implements two *internal* methods, _initialize and _convert_single.
Converter::convert dereferences the array of original objects, calls
_convert_single of the converter instance module on each of them, and
returns a reference to the array of converted objects.
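To make the division of labour concrete, here is a minimal sketch of a base class plus one third-level module. The Sketch:: package names are hypothetical, and plain hashes and strings stand in for the bioperl and EnsEMBL objects, so this shows only the shape of the design, not the committed code.

```perl
use strict;
use warnings;

package Sketch::Base;    # stands in for the shared converter machinery

sub new {
    my ($class, %args) = @_;
    my $self = bless {}, $class;
    $self->_initialize(%args);    # hook for the third-level module
    return $self;
}

sub convert {
    my ($self, $objs) = @_;
    # Dereference the input, convert one by one, return a new array ref.
    return [ map { $self->_convert_single($_) } @$objs ];
}

package Sketch::bio_ens_seqFeature;    # hypothetical third-level module
our @ISA = ('Sketch::Base');

sub _initialize {
    my ($self, %args) = @_;
    $self->{analysis} = $args{-analysis};
    $self->{contig}   = $args{-contig};
}

sub _convert_single {
    my ($self, $feature) = @_;
    # Plain hashes stand in for the bioperl and EnsEMBL feature objects.
    return {
        start    => $feature->{start},
        end      => $feature->{end},
        analysis => $self->{analysis},
        contig   => $self->{contig},
    };
}

package main;

my $converter = Sketch::bio_ens_seqFeature->new(
    -analysis => 'my_analysis',
    -contig   => 'my_contig',
);
my $converted = $converter->convert([ { start => 1, end => 10 } ]);
print "$converted->[0]{end} on $converted->[0]{contig}\n";
```

The third-level module never touches the array handling; it only says how to set itself up and how to convert one object.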
DEVELOPMENT TEST
There will be a converter.t file in the module/t directory of the
ensembl CVS repository. It is in charge of testing all implemented
converter instances.
Question: I did not find a Makefile.PL in ensembl CVS like the one in
bioperl, so I do not know how to batch-run all the test files, like
'make test' in bioperl. However, I do not think my converter breaks
anyone else's code. The converter.t test passes against the other code
currently live in CVS, which I think is EnsEMBL 11.
THE END
Did I miss something?
I have committed the code and the test file to ensembl CVS. What I have
done so far is the framework, and the instances to convert between
1. Bio::SeqFeature::Generic <-> Bio::EnsEMBL::SeqFeature, SimpleFeature
2. Bio::SeqFeature::FeaturePair -> Bio::EnsEMBL::RepeatFeature and
RepeatConsensus (for repeatmasker results)
Soon there will also be
1. Bio::Search::Hit::GenericHit, Bio::Search::HSP::GenericHSP ->
Bio::EnsEMBL::BaseAlignFeature, sub-categorized into
DnaDnaAlignFeature, DnaPepAlignFeature, and PepDnaAlignFeature, based
on which blastall program is used.
2. Bio::Tools::Prediction::Gene -> Bio::EnsEMBL::PredictionTranscript
(for genscan, etc.)
3. For genewise results:
Bio::SeqFeature::Gene::GeneStructure -> Bio::EnsEMBL::Gene
Bio::SeqFeature::Gene::Transcript -> Bio::EnsEMBL::Transcript
Bio::SeqFeature::Gene::Exon -> Bio::EnsEMBL::Exon
Did I match the object types correctly? Do you have suggestions for
more conversions? Thanks.
As for the special case of converting RawContig <-> Seq in bioperl, I am
wondering whether it is necessary work, because RawContig already
lazy-loads and auto-saves its Seq. See Bio::EnsEMBL::RawContig::subseq
or Bio::EnsEMBL::DBSQL::RawContigAdaptor::fetch_by_name for getting
the Seq, and Bio::EnsEMBL::RawContig::seq for setting the Seq.
Any comments are most welcome! Thanks
Juguang