[Bioperl-l] Re: [Bioperl-announce-l] an extension to Bio::SeqIO

Wed Jun 18 07:30:53 EDT 2003

Chris,

       Sounds way cool!
       For those of us too strapped to track down the pod, can you give
a quick explanation of how you unflatten?

Cheers, -Stan

In a message dated 6/17/2003 11:06:09 PM Eastern Daylight Time, 
cjm at fruitfly.org writes:

> Subj: Re: [Bioperl-announce-l] an extension to Bio::SeqIO 
>  Date: 6/17/2003 11:06:09 PM Eastern Daylight Time
>  From: cjm at fruitfly.org
>  To: jason at cgt.duhs.duke.edu
>  CC: peili at morgan.harvard.edu, bioperl-l at bioperl.org,
>  alldev at morgan.harvard.edu, pavel at morgan.harvard.edu
>  Sent from the Internet 
> 
> 
> 
> Hi Jason
> 
> I agree the chadoxml writer should be kept fairly dumb
> 
> Here's how I think things should work:
> 
> genbank -> unflatten -> GeneStructure objects
> OR
> genbank -> unflatten -> map types to SO -> chadoxml
> 
> (I think this is pretty much exactly the same as what you are proposing)
> 
> thus the chadoxml part would not depend on the GeneStructure objects; but
> the same 'unflatten' component could be used for both GeneStructure
> objects, chadoxml, GFF3 etc.
> 
> the good news is i have written the unflatten component. it is tentatively
> called Bio::Tools::GenBankCollector
> 
> (not sure about the name - really it should be GenBankEMBLDDBJCollector??)
> 
> the task of this module is relatively simple; it takes a flat list of
> features like this:
> 
>  gene
>  mRNA CG4491-RA
>  CDS CG4491-PA
>  gene
>  tRNA tRNA-Pro
>  gene
>  mRNA CG32954-RA
>  mRNA CG32954-RC
>  mRNA CG32954-RB
>  CDS CG32954-PA
>  CDS CG32954-PB
>  CDS CG32954-PC
> 
> and turns them into this:
> 
>  gene
>   mRNA CG4491-RA
>    CDS CG4491-PA
>  gene
>   tRNA tRNA-Pro
>  gene
>   mRNA CG32954-RA
>    CDS CG32954-PA
>   mRNA CG32954-RC
>    CDS CG32954-PC
>   mRNA CG32954-RB
>    CDS CG32954-PB
> 
> Of course, as anyone who has attempted to get gene models out of genbank
> records knows, it's a wee bit harder than this.
> 
> full details in the pod docs, which i am about to send out to the bioperl
> list...
> 
> On Tue, 17 Jun 2003, Jason Stajich wrote:
> 
> >There is a bit of a chicken-and-egg problem in that most of the data sets
> >Bioperl has tried to interface with are not as rich as chado; the
> >genbank->gene->chado way is not going to work for all genbank records
> >(which I personally can live with).  I would like to see if we can define
> >the least common denominator so people can understand what is needed to
> >get a chado db populated.
> >
> >
> >As we've been discussing in different venues I think we'd like to see a
> >general purpose system which can take a collection of sequence features,
> >relate them in a graph based on an identifiable grouping (the /gene field
> >or perhaps mapped into a general slot like 'group' ala Lincoln's
> >Bio::DB::GFF system), and then using SO map these into objects.  For genes
> >I'd like to see these be Bio::SeqFeature::Gene::GeneStructure objects
> >(the object model of which might need some work) because there are
> >additional methods already built in like intron inferences and ability to
> >loop through the transcripts, etc.
> >
> >So my request is that we make the chado writer dumb, it should not try to
> >group anything, but should just obey however the objects are built.  An
> >intermediate set of factories can take lists of features and assign
> >'group' fields to them, a second factory could relate them into a graph
> >based on SO and the group fields.  This graph can now be written out to
> >chadoxml.  Another factory (I was calling Bio::SeqFeature::Transmogrifier
> >for the calvin and hobbes fans) could build the appropriate composite
> >objects from the graph (Genes, HSPs, where appropriate) and deal with
> >multiple coordinate systems (in the case of features attached to the
> >annotated protein product).   The 'Transmogrifier' could also turn these
> >composite objects back into simple feature graphs so that they can be
> >written to chado simply and (finally) fully written out to a genbank
> >record with a controlled vocab of /tag=value fields.
> >
> >These are my ideas anyways, perhaps too much?  I know other people (Shawn
> >Hoon, Chris Mungall) have volunteered ideas and coding to this as well so
> >we'd like to see if we can perhaps work together on it.
> >
> >For examples of some minimal gene objects, the easiest way to get them
> >right now is from any of the gene prediction parsers
> >(Bio::Tools::Genewise, Bio::Tools::Genomewise, Bio::Tools::Genscan,
> >Bio::Tools::Glimmer).
> >
> >-jason
> >On Tue, 17 Jun 2003, Peili Zhang wrote:
> >
> >>Hi,
> >>
> >>here at FlyBase, we implement the chado database schema to store sequence,
> >>annotation, genetic, controlled vocabulary, publication and other types
> >>of data (for detailed information about the chado schema, please visit
> >>http://www.gmod.org and read the schema documentation and scripts in
> >>its CVS).  we have developed tools to dump FlyBase data into chadoxml
> >>and load data in chadoxml format into FlyBase (for chadoxml dtd, please
> >>see
> >>http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/gmod/schema/chado/dat/chado.dtd),
> >>to facilitate data communication among the different sites of FlyBase
> >>and between FlyBase and the rest of the world. a need has arisen for a tool to
> >>convert external data in other formats into chadoxml. I'm coding a perl
> >>module chadoxml.pm to write out a Bio::Seq object into chadoxml. we'd
> >>like to get your feedback on whether it's useful to add this module into
> >>bioperl as an extension to the Bio::SeqIO package. if you already have
> >>working code for the same purpose, maybe we can discuss how to merge our
> >>code to produce a better version.
> >>
> >>thanks for your input.
> >>
> >>regards,
> >>Peili Zhang
> >>FlyBase-Harvard
> >>
> >>_______________________________________________
> >>Bioperl-announce-l mailing list
> >>Bioperl-announce-l at portal.open-bio.org
> >>http://portal.open-bio.org/mailman/listinfo/bioperl-announce-l
> >>
> >
> >--
> >Jason Stajich
> >Duke University
> >jason at cgt.mc.duke.edu
> >
> 
> 
> 
> 
>
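
For readers who want something concrete to poke at, the flat-to-nested
step described in Chris's quoted message (the two feature listings) can
be sketched roughly as below. This is only an illustration and not the
code of the module he mentions: it is written in Python rather than Perl,
and it does not use any Bioperl API. The names Feature, isoform_key,
unflatten, render and FLAT are all hypothetical. The pairing rule is also
an assumption made for this example: a CDS is attached to the mRNA of the
same gene whose name shares the same stem and isoform letter (CG32954-RA
with CG32954-PA). The message does not say how the real component pairs
them, and it notes that real GenBank records are harder than this.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Feature:
    """One node in the nested feature tree (hypothetical type)."""
    type: str                    # e.g. "gene", "mRNA", "CDS", "tRNA"
    name: Optional[str] = None   # e.g. "CG4491-RA"; genes are unnamed here
    children: List["Feature"] = field(default_factory=list)


def isoform_key(name: str) -> Tuple[str, str]:
    """Assumed pairing key: 'CG32954-RA' and 'CG32954-PA' both map to
    ('CG32954', 'A'), i.e. the stem plus the isoform letter(s)."""
    base, _, suffix = name.rpartition("-")
    return (base, suffix[1:])


def unflatten(flat: List[Tuple[str, Optional[str]]]) -> List[Feature]:
    """Nest a flat list of (type, name) pairs as gene -> transcript -> CDS.

    Each 'gene' entry opens a new group; later entries belong to the most
    recent gene. A named CDS goes under the mRNA with a matching isoform
    key, and falls back to the gene itself if no such mRNA was seen yet.
    """
    genes: List[Feature] = []
    for ftype, name in flat:
        if ftype == "gene":
            genes.append(Feature("gene", name))
            continue
        if not genes:
            raise ValueError(f"{ftype} feature appears before any gene")
        gene = genes[-1]
        node = Feature(ftype, name)
        parent = gene
        if ftype == "CDS" and name is not None:
            for sibling in gene.children:
                if (sibling.type == "mRNA" and sibling.name is not None
                        and isoform_key(sibling.name) == isoform_key(name)):
                    parent = sibling
                    break
        parent.children.append(node)
    return genes


def render(nodes: List[Feature], depth: int = 1) -> List[str]:
    """Return one indented line per feature, one space per nesting level."""
    lines: List[str] = []
    for n in nodes:
        label = n.type if n.name is None else f"{n.type} {n.name}"
        lines.append(" " * depth + label)
        lines.extend(render(n.children, depth + 1))
    return lines


# The flat feature list from the quoted message.
FLAT = [
    ("gene", None), ("mRNA", "CG4491-RA"), ("CDS", "CG4491-PA"),
    ("gene", None), ("tRNA", "tRNA-Pro"),
    ("gene", None),
    ("mRNA", "CG32954-RA"), ("mRNA", "CG32954-RC"), ("mRNA", "CG32954-RB"),
    ("CDS", "CG32954-PA"), ("CDS", "CG32954-PB"), ("CDS", "CG32954-PC"),
]

if __name__ == "__main__":
    print("\n".join(render(unflatten(FLAT))))
```

Run on FLAT, this reproduces the nested listing from the message. Pairing
by name only works where such a naming convention holds, so anything
meant for arbitrary GenBank/EMBL/DDBJ records would need a different
pairing rule.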