[Bioperl-l] Re: [Bioperl-announce-l] an extension to Bio::SeqIO

Peili Zhang peili at morgan.harvard.edu
Wed Jun 18 10:17:18 EDT 2003

here's my 0.002:

it would be very cool if we had tools for the following workflow.

sequence data in any format
         | via Bio::SeqIO
         v
   Bio::Seq object
         | via a module, be it GeneStructureGenerator/Transmogrifier
         v                    /GenBankCollector/anything
         | via Bio::SeqIO::chadoxml
         v
      chadoxml

if non-gene data (e.g. protein records) is to be converted to chadoxml, the
Bio::Seq object generated by Bio::SeqIO from that data file will be passed directly
to Bio::SeqIO::chadoxml to produce the chadoxml.

>Date: Tue, 17 Jun 2003 20:04:49 -0700 (PDT)
>From: Chris Mungall <cjm at fruitfly.org>
>To: Jason Stajich <jason at cgt.duhs.duke.edu>
>Cc: Peili Zhang <peili at morgan.harvard.edu>, Bioperl <bioperl-l at bioperl.org>,
>    <alldev at morgan.harvard.edu>, <pavel at morgan.harvard.edu>
>Subject: Re: [Bioperl-announce-l] an extension to Bio::SeqIO
>
>Hi Jason
>
>I agree the chadoxml writer should be kept fairly dumb
>
>Here's how I think things should work:
>
>genbank -> unflatten -> GeneStructure objects
>genbank -> unflatten -> map types to SO -> chadoxml
>
>(I think this is pretty much exactly the same as what you are proposing)
>
>thus the chadoxml part would not depend on the GeneStructure objects; but
>the same 'unflatten' component could be used for both GeneStructure
>objects, chadoxml, GFF3 etc.
>
>the good news is i have written the unflatten component. it is tentatively
>called Bio::Tools::GenBankCollector
>(not sure about the name - really it should be GenBankEMBLDDBJCollector??)
>
>the task of this module is relatively simple; it takes a flat list of
>features like this:
>
>  gene
>  mRNA CG4491-RA
>  CDS CG4491-PA
>  gene
>  tRNA tRNA-Pro
>  gene
>  mRNA CG32954-RA
>  mRNA CG32954-RC
>  mRNA CG32954-RB
>  CDS CG32954-PA
>  CDS CG32954-PB
>  CDS CG32954-PC
>
>and turns them into this:
>
>  gene
>    mRNA CG4491-RA
>      CDS CG4491-PA
>  gene
>    tRNA tRNA-Pro
>  gene
>    mRNA CG32954-RA
>      CDS CG32954-PA
>    mRNA CG32954-RC
>      CDS CG32954-PC
>    mRNA CG32954-RB
>      CDS CG32954-PB
>Of course, as anyone who has attempted to get gene models out of genbank
>records knows, it's a wee bit harder than this.
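
For illustration only, the simple case above can be sketched as follows. This is a hypothetical Python sketch, not the Bio::Tools::GenBankCollector code (which is not shown in this thread). It assumes each 'gene' line opens a new group, and that a CDS named X-P<suffix> belongs to the mRNA named X-R<suffix>. That naming rule is an assumption read off the example names; real GenBank records need more than this. The names unflatten and render are made up for the sketch.

```python
def unflatten(features):
    """Nest a flat list of (type, name) pairs into gene -> transcript -> CDS."""
    genes = []
    for ftype, name in features:
        if ftype == "gene":
            # Assumption: every 'gene' line starts a new group.
            genes.append({"type": "gene", "name": name, "children": []})
            continue
        if not genes:
            raise ValueError(f"{ftype} {name!r} appears before any gene")
        node = {"type": ftype, "name": name, "children": []}
        gene = genes[-1]
        parent = gene  # default: hang the feature directly off the gene
        if ftype == "CDS":
            # Assumed naming rule: CDS 'X-PA' pairs with mRNA 'X-RA'.
            stem, sep, suffix = name.rpartition("-P")
            wanted = f"{stem}-R{suffix}" if sep else None
            for transcript in gene["children"]:
                if transcript["name"] == wanted:
                    parent = transcript
                    break
        parent["children"].append(node)
    return genes


def render(nodes, depth=1):
    """Return the nested features as indented lines, two spaces per level."""
    lines = []
    for node in nodes:
        label = f"{node['type']} {node['name']}".rstrip()
        lines.append("  " * depth + label)
        lines.extend(render(node["children"], depth + 1))
    return lines


flat = [
    ("gene", ""), ("mRNA", "CG4491-RA"), ("CDS", "CG4491-PA"),
    ("gene", ""), ("tRNA", "tRNA-Pro"),
    ("gene", ""), ("mRNA", "CG32954-RA"), ("mRNA", "CG32954-RC"),
    ("mRNA", "CG32954-RB"), ("CDS", "CG32954-PA"), ("CDS", "CG32954-PB"),
    ("CDS", "CG32954-PC"),
]
print("\n".join(render(unflatten(flat))))
```

A CDS with no matching transcript name is attached directly to its gene, which is one possible fallback and not something the thread specifies.
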
>
>full details in the pod docs, which i am about to send out to the bioperl
>
>On Tue, 17 Jun 2003, Jason Stajich wrote:
>> There is a bit of a chicken-and-egg problem in that most of the data sets
>> Bioperl has tried to interface with are not as rich as chado, the
>> genbank->gene->chado way is not going to work for all genbank records
>> (which I personally can live with).  I would like to see if we can define
>> the least common denominator for people to understand what is needed to get a
>> chado db populated.
>>
>> As we've been discussing in different venues I think we'd like to see a
>> general purpose system which can take a collection of sequence features,
>> relate them in a graph based on an identifiable grouping (the /gene field
>> or perhaps mapped into a general slot like 'group' a la Lincoln's
>> Bio::DB::GFF system), and then using SO map these into objects.  For genes
>> I'd like to see these be Bio::SeqFeature::Gene::GeneStructure objects
>> (the object model of which might need some work) because there are
>> additional methods already built in like intron inferences and ability to
>> loop through the transcripts, etc.
>>
>> So my request is that we make the chado writer dumb: it should not try to
>> group anything, but should just obey however the objects are built.  An
>> intermediate set of factories can take lists of features and assign
>> 'group' fields to them; a second factory could relate them into a graph
>> based on SO and the group fields.  This graph can now be written out to
>> chadoxml.  Another factory (I was calling it Bio::SeqFeature::Transmogrifier
>> for the Calvin and Hobbes fans) could build the appropriate composite
>> objects from the graph (Genes, HSPs, where appropriate) and deal with
>> multiple coordinate systems (in the case of features attached to the
>> annotated protein product).   The 'Transmogrifier' could also turn these
>> composite objects back into simple feature graphs so that they can be
>> written to chado simply and (finally) fully written out to a genbank
>> record with a controlled vocab of /tag=value fields.
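
As a rough illustration of the two-factory idea above, here is a hypothetical Python sketch. It is not Bioperl code: the feature dictionaries, the function names assign_groups and build_graph, and the small PART_OF table (a simplified stand-in for Sequence Ontology relations) are all assumptions made for the example.

```python
from collections import defaultdict

# Assumed, heavily simplified stand-in for SO part_of relations.
PART_OF = {"mRNA": "gene", "tRNA": "gene", "CDS": "mRNA"}


def assign_groups(features, tag="gene"):
    """First factory: copy a chosen tag (e.g. /gene) into a 'group' slot."""
    return [dict(f, group=f["tags"].get(tag)) for f in features]


def build_graph(features):
    """Second factory: within each group, link a feature to every member
    whose type is its parent type according to PART_OF."""
    by_group = defaultdict(list)
    for f in features:
        if f["group"] is not None:  # ungrouped features are left unlinked
            by_group[f["group"]].append(f)
    edges = []
    for members in by_group.values():
        for child in members:
            parent_type = PART_OF.get(child["type"])
            if parent_type is None:
                continue  # e.g. a gene has no parent in this table
            for parent in members:
                if parent["type"] == parent_type:
                    edges.append((child["id"], "part_of", parent["id"]))
    return edges


features = [
    {"id": "g1", "type": "gene", "tags": {"gene": "CG4491"}},
    {"id": "m1", "type": "mRNA", "tags": {"gene": "CG4491"}},
    {"id": "c1", "type": "CDS", "tags": {"gene": "CG4491"}},
    {"id": "g2", "type": "gene", "tags": {"gene": "tRNA-Pro"}},
    {"id": "t1", "type": "tRNA", "tags": {"gene": "tRNA-Pro"}},
]
edges = build_graph(assign_groups(features))
```

A writer that stays dumb would only need to walk these edges. Note that when a group holds several mRNAs, this sketch links each CDS to all of them; choosing the right transcript would need extra evidence such as locations or names.
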
>> These are my ideas anyways, perhaps too much?  I know other people (Shawn
>> Hoon, Chris Mungall) have volunteered ideas and coding to this as well so
>> we'd like to see if we can perhaps work together on it.
>>
>> For examples of some minimal gene objects, the easiest way to get them
>> right now is from any of the gene prediction parsers
>> (Bio::Tools::Genewise, Bio::Tools::Genomewise, Bio::Tools::Genscan,
>> Bio::Tools::Glimmer).
>>
>> -jason
>> On Tue, 17 Jun 2003, Peili Zhang wrote:
>> > Hi,
>> >
>> > here at FlyBase, we implement the chado database schema to store sequence,
>> > annotation, genetic, controlled vocabulary, publication and other types
>> > of data (for detailed information about the chado schema, please visit
>> > http://www.gmod.org and read the schema documentation and scripts in
>> > its CVS).  we have developed tools to dump FlyBase data into chadoxml
>> > and load data in chadoxml format into FlyBase (for chadoxml dtd, please
>> > see
>> > 
>> > to facilitate data communication among the different sites of FlyBase
>> > and between FlyBase and the rest of the world. a need has arisen for a tool to
>> > convert external data in other formats into chadoxml. I'm coding a perl
>> > module chadoxml.pm to write out a Bio::Seq object into chadoxml. we'd
>> > like to get your feedback on whether it's useful to add this module into
>> > bioperl as an extension to the Bio::SeqIO package. if you already have
>> > working code for the same purpose, maybe we can discuss how to merge our
>> > code to produce a better version.
>> >
>> > thanks for your input.
>> >
>> > regards,
>> > Peili Zhang
>> > FlyBase-Harvard
>> >
>> >
>> --
>> Jason Stajich
>> Duke University
>> jason at cgt.mc.duke.edu
