[Bioperl-l] Re: [Bioperl-announce-l] an extension to Bio::SeqIO

Jason Stajich jason at cgt.duhs.duke.edu
Wed Jun 18 10:29:47 EDT 2003


On Wed, 18 Jun 2003, Peili Zhang wrote:

>
> here's my 0.002:
>
> it'd be very cool if we had tools for the following workflow.
>
> sequence data in any (rich) formats
> 	 |
> 	 | via Bio::SeqIO
> 	 v
     Bio::Seq->get_SeqFeatures()
         OR
      Collection of Bio::SeqFeatureI

[Actually what's cooler I think is that you don't need Bio::Seq objects or
anything, just a set of Bio::SeqFeatureI objects. This would mean that
people could take their GFF files and turn them into chado IFF they are
rich enough.]

>       Bio::Seq object
>       	 |
>       	 | via a module, be it GeneStructureGenerator/Transmogrifier
>       	 v			/GenBankCollector/anything
>       Bio::SeqFeature::Gene::GeneStructure
>          |
>          | via Bio::SeqIO::chadoxml
>          v
>       chadoxml
>
> if non-gene data (eg protein records) is to be converted to chadoxml, the
> Bio::Seq object generated by Bio::SeqIO on that datafile will be passed directly
> to Bio::SeqIO::chadoxml to produce the chadoxml.
>
> >Date: Tue, 17 Jun 2003 20:04:49 -0700 (PDT)
> >From: Chris Mungall <cjm at fruitfly.org>
> >X-X-Sender: <cjm at sos.lbl.gov>
> >To: Jason Stajich <jason at cgt.duhs.duke.edu>
> >Cc: Peili Zhang <peili at morgan.harvard.edu>, Bioperl <bioperl-l at bioperl.org>,
> > <alldev at morgan.harvard.edu>, <pavel at morgan.harvard.edu>
> >Subject: Re: [Bioperl-announce-l] an extension to Bio::SeqIO
> >
> >Hi Jason
> >
> >I agree the chadoxml writer should be kept fairly dumb
> >
> >Here's how I think things should work:
> >
> >genbank -> unflatten -> GeneStructure objects
> >OR
> >genbank -> unflatten -> map types to SO -> chadoxml
> >
> >(I think this is pretty much exactly the same as what you are proposing)
> >
> >thus the chadoxml part would not depend on the GeneStructure objects; but
> >the same 'unflatten' component could be used for both GeneStructure
> >objects, chadoxml, GFF3 etc.
> >
> >the good news is i have written the unflatten component. it is tentatively
> >called Bio::Tools::GenBankCollector
> >
> >(not sure about the name - really it should be GenBankEMBLDDBJCollector??)
> >
> >the task of this module is relatively simple; it takes a flat list of
> >features like this:
> >
> >  gene
> >  mRNA CG4491-RA
> >  CDS CG4491-PA
> >  gene
> >  tRNA tRNA-Pro
> >  gene
> >  mRNA CG32954-RA
> >  mRNA CG32954-RC
> >  mRNA CG32954-RB
> >  CDS CG32954-PA
> >  CDS CG32954-PB
> >  CDS CG32954-PC
> >
> >and turns them into this:
> >
> >  gene
> >    mRNA CG4491-RA
> >      CDS CG4491-PA
> >  gene
> >    tRNA tRNA-Pro
> >  gene
> >    mRNA CG32954-RA
> >      CDS CG32954-PA
> >    mRNA CG32954-RC
> >      CDS CG32954-PC
> >    mRNA CG32954-RB
> >      CDS CG32954-PB
> >
> >Of course, as anyone who has attempted to get gene models out of genbank
> >records knows, it's a wee bit harder than this.
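
A minimal sketch of the toy case above, in Python rather than Perl and
not the actual Bio::Tools::GenBankCollector code. The Feature class and
the unflatten/render functions are hypothetical names. It assumes each
"gene" line opens a new group and that a CDS pairs with the transcript
whose name matches once the "-P" suffix is swapped for "-R" (CG4491-PA
with CG4491-RA). Real records need far more than this, as noted above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Feature:
    """Toy stand-in for a sequence feature (not a BioPerl class)."""
    type: str                      # e.g. "gene", "mRNA", "CDS", "tRNA"
    name: str = ""                 # e.g. "CG4491-RA"; genes are unnamed here
    children: List["Feature"] = field(default_factory=list)


def unflatten(flat: List[Feature]) -> List[Feature]:
    """Nest a flat, gene-delimited feature list into gene > RNA > CDS."""
    genes: List[Feature] = []
    for feat in flat:
        if feat.type == "gene":
            genes.append(Feature("gene", feat.name))
            continue
        if not genes:
            raise ValueError(f"{feat.type} {feat.name} precedes any gene")
        gene = genes[-1]
        if feat.type == "CDS":
            # Assumed naming convention: CDS "X-PA" belongs to mRNA "X-RA".
            stem, sep, suffix = feat.name.rpartition("-P")
            wanted = f"{stem}-R{suffix}" if sep else feat.name
            parent = next((t for t in gene.children if t.name == wanted), None)
            # Fall back to the gene itself when no transcript matches.
            target = parent if parent is not None else gene
            target.children.append(Feature("CDS", feat.name))
        else:
            gene.children.append(Feature(feat.type, feat.name))
    return genes


def render(features: List[Feature], depth: int = 0) -> List[str]:
    """Return the tree as indented lines, two spaces per level."""
    lines = []
    for f in features:
        label = " ".join(part for part in (f.type, f.name) if part)
        lines.append("  " * depth + label)
        lines.extend(render(f.children, depth + 1))
    return lines


# The flat list from the message above.
FLAT = [
    Feature("gene"), Feature("mRNA", "CG4491-RA"), Feature("CDS", "CG4491-PA"),
    Feature("gene"), Feature("tRNA", "tRNA-Pro"),
    Feature("gene"),
    Feature("mRNA", "CG32954-RA"), Feature("mRNA", "CG32954-RC"),
    Feature("mRNA", "CG32954-RB"),
    Feature("CDS", "CG32954-PA"), Feature("CDS", "CG32954-PB"),
    Feature("CDS", "CG32954-PC"),
]

if __name__ == "__main__":
    print("\n".join(render(unflatten(FLAT))))
```

Pairing by name is only one possible heuristic; records without such a
naming convention would need location-based matching instead.
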
> >
> >full details in the pod docs, which i am about to send out to the bioperl
> >list...
> >
> >On Tue, 17 Jun 2003, Jason Stajich wrote:
> >
> >> There is a bit of a chicken-and-egg problem: most of the data sets
> >> Bioperl has tried to interface with are not as rich as chado, so the
> >> genbank->gene->chado route is not going to work for all genbank records
> >> (which I personally can live with).  I would like to see if we can define
> >> the least common denominator, so people understand what is needed to get a
> >> chado db populated.
> >>
> >>
> >> As we've been discussing in different venues I think we'd like to see a
> >> general purpose system which can take a collection of sequence features,
> >> relate them in a graph based on an identifiable grouping (the /gene field
> >> or perhaps mapped into a general slot like 'group' ala Lincoln's
> >> Bio::DB::GFF system), and then using SO map these into objects.  For genes
> >> I'd like to see these be Bio::SeqFeature::Gene::GeneStructure objects
> >> (the object model of which might need some work) because there are
> >> additional methods already built in like intron inferences and ability to
> >> loop through the transcripts, etc.
> >>
> >> So my request is that we make the chado writer dumb: it should not try to
> >> group anything, but should just obey however the objects are built.  An
> >> intermediate set of factories can take lists of features and assign
> >> 'group' fields to them; a second factory could relate them into a graph
> >> based on SO and the group fields.  This graph can then be written out to
> >> chadoxml.  Another factory (I was calling it Bio::SeqFeature::Transmogrifier
> >> for the Calvin and Hobbes fans) could build the appropriate composite
> >> objects from the graph (Genes, HSPs, where appropriate) and deal with
> >> multiple coordinate systems (in the case of features attached to the
> >> annotated protein product).   The 'Transmogrifier' could also turn these
> >> composite objects back into simple feature graphs so that they can be
> >> written to chado simply and (finally) fully written out to a genbank
> >> record with a controlled vocab of /tag=value fields.
> >>
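
A rough sketch of the first three stages described above (assign groups,
relate into a graph, write without regrouping), in Python for
illustration only. Everything here is an assumption: the function names,
the dict-based feature records, the PART_OF table standing in for a real
SO lookup, and the XML element names, which are placeholders and not the
chado DTD.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical part-of rules standing in for a real SO lookup.
PART_OF = {"mRNA": "gene", "CDS": "mRNA"}


def assign_groups(features, tag="gene"):
    """Factory 1: copy an identifying tag (here /gene) into a 'group' slot."""
    return [dict(f, group=f["tags"].get(tag)) for f in features]


def build_graph(features):
    """Factory 2: link each feature to same-group features of its parent type.

    Returns a list of (child_id, parent_id) edges. With several mRNAs in
    one group a CDS is linked to all of them; resolving that is left out.
    """
    by_group = defaultdict(list)
    for f in features:
        if f["group"] is not None:
            by_group[f["group"]].append(f)
    edges = []
    for members in by_group.values():
        for child in members:
            parent_type = PART_OF.get(child["type"])
            for parent in members:
                if parent["type"] == parent_type:
                    edges.append((child["id"], parent["id"]))
    return edges


def write_xml(features, edges):
    """'Dumb' writer: emits exactly the nodes and edges it is handed."""
    root = ET.Element("features")
    for f in features:
        ET.SubElement(root, "feature", id=f["id"], type=f["type"])
    for child, parent in edges:
        ET.SubElement(root, "relationship",
                      subject=child, object=parent, type="part_of")
    return ET.tostring(root, encoding="unicode")


# Made-up example records; ids and tag values are illustrative only.
FEATURES = [
    {"id": "gene1", "type": "gene", "tags": {"gene": "CG4491"}},
    {"id": "mrna1", "type": "mRNA", "tags": {"gene": "CG4491"}},
    {"id": "cds1", "type": "CDS", "tags": {"gene": "CG4491"}},
]

if __name__ == "__main__":
    grouped = assign_groups(FEATURES)
    print(write_xml(grouped, build_graph(grouped)))
```

The point of the split is that write_xml never inspects tags or types to
decide structure, so the same graph could be handed to a different
writer unchanged.
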
> >> These are my ideas anyways, perhaps too much?  I know other people (Shawn
> >> Hoon, Chris Mungall) have volunteered ideas and coding to this as well so
> >> we'd like to see if we can perhaps work together on it.
> >>
> >> For examples of some minimal gene objects, the easiest way to get them
> >> right now is from any of the gene prediction parsers
> >> (Bio::Tools::Genewise, Bio::Tools::Genomewise, Bio::Tools::Genscan,
> >> Bio::Tools::Glimmer).
> >>
> >> -jason
> >> On Tue, 17 Jun 2003, Peili Zhang wrote:
> >>
> >> > Hi,
> >> >
> >> > here at FlyBase, we implement the chado database schema to store sequence,
> >> > annotation, genetic, controlled vocabulary, publication and other types
> >> > of data (for detailed information about the chado schema, please visit
> >> > http://www.gmod.org and read the schema documentation and scripts in
> >> > its CVS).  we have developed tools to dump FlyBase data into chadoxml
> >> > and load data in chadoxml format into FlyBase (for the chadoxml dtd,
> >> > please see
> >> > http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/gmod/schema/chado/dat/chado.dtd),
> >> > to facilitate data communication among the different sites of FlyBase
> >> > and between FlyBase and the rest of the world. a need has arisen for a
> >> > tool to convert external data in other formats into chadoxml. I'm coding a perl
> >> > module chadoxml.pm to write out a Bio::Seq object into chadoxml. we'd
> >> > like to get your feedback on whether it's useful to add this module into
> >> > bioperl as an extension to the Bio::SeqIO package. if you already have
> >> > working code for the same purpose, maybe we can discuss how to merge our
> >> > code to produce a better version.
> >> >
> >> > thanks for your input.
> >> >
> >> > regards,
> >> > Peili Zhang
> >> > FlyBase-Harvard
> >> >
> >>
> >> --
> >> Jason Stajich
> >> Duke University
> >> jason at cgt.mc.duke.edu
> >>
> >
> >
> >
>

--
Jason Stajich
Duke University
jason at cgt.mc.duke.edu

