[Bioperl-l] Re: [Bioperl-announce-l] an extension to Bio::SeqIO

Tue Jun 17 21:04:49 EDT 2003

Hi Jason

I agree the chadoxml writer should be kept fairly dumb.

Here's how I think things should work:

genbank -> unflatten -> GeneStructure objects
OR
genbank -> unflatten -> map types to SO -> chadoxml

(I think this is pretty much exactly the same as what you are proposing)

Thus the chadoxml part would not depend on the GeneStructure objects, but
the same 'unflatten' component could be used for GeneStructure objects,
chadoxml, GFF3, etc.

The good news is that I have written the unflatten component. It is
tentatively called Bio::Tools::GenBankCollector

(not sure about the name - really it should be GenBankEMBLDDBJCollector??)

The task of this module is relatively simple. It takes a flat list of
features like this:

  gene
  mRNA CG4491-RA
  CDS CG4491-PA
  gene
  tRNA tRNA-Pro
  gene
  mRNA CG32954-RA
  mRNA CG32954-RC
  mRNA CG32954-RB
  CDS CG32954-PA
  CDS CG32954-PB
  CDS CG32954-PC

and turns them into this:

  gene
    mRNA CG4491-RA
      CDS CG4491-PA
  gene
    tRNA tRNA-Pro
  gene
    mRNA CG32954-RA
      CDS CG32954-PA
    mRNA CG32954-RC
      CDS CG32954-PC
    mRNA CG32954-RB
      CDS CG32954-PB

Of course, as anyone who has attempted to get gene models out of genbank
records knows, it's a wee bit harder than this.
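To make the nesting above concrete, here is a toy sketch. It is in Python rather than Perl, and it is not the Bio::Tools::GenBankCollector code. It assumes every feature after a 'gene' line belongs to that gene, that mRNAs are listed before their CDSs, and that a CDS pairs with the mRNA sharing its name apart from the R/P letter (CG32954-PA goes with CG32954-RA). That naming rule is only an assumption that happens to fit this example; real records need more than this. Node, isoform_key, unflatten and render are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """One feature in the nested result (hypothetical type, illustration only)."""
    type: str
    name: str = ""
    children: List["Node"] = field(default_factory=list)


def isoform_key(name: str) -> Tuple[str, str]:
    """Map 'CG32954-RA' and 'CG32954-PA' to the same key ('CG32954', 'A')."""
    base, _, suffix = name.rpartition("-")
    return base, suffix[1:]  # drop the leading R (transcript) or P (protein)


def unflatten(flat: List[Tuple[str, str]]) -> List[Node]:
    """Nest a flat (type, name) list under genes; assumes mRNAs precede CDSs."""
    genes: List[Node] = []
    for ftype, name in flat:
        if ftype == "gene":
            genes.append(Node("gene", name))
            continue
        if not genes:
            raise ValueError(f"{ftype} {name!r} appears before any gene")
        gene = genes[-1]
        if ftype == "CDS":
            # Attach to the matching mRNA, or to the gene itself if none matches.
            parent = next(
                (t for t in gene.children
                 if t.type == "mRNA" and isoform_key(t.name) == isoform_key(name)),
                gene,
            )
            parent.children.append(Node(ftype, name))
        else:
            gene.children.append(Node(ftype, name))
    return genes


def render(nodes: List[Node], depth: int = 1) -> List[str]:
    """Return indented text lines, two spaces per nesting level."""
    lines: List[str] = []
    for node in nodes:
        lines.append("  " * depth + f"{node.type} {node.name}".rstrip())
        lines.extend(render(node.children, depth + 1))
    return lines


FLAT = [
    ("gene", ""), ("mRNA", "CG4491-RA"), ("CDS", "CG4491-PA"),
    ("gene", ""), ("tRNA", "tRNA-Pro"),
    ("gene", ""), ("mRNA", "CG32954-RA"), ("mRNA", "CG32954-RC"),
    ("mRNA", "CG32954-RB"), ("CDS", "CG32954-PA"), ("CDS", "CG32954-PB"),
    ("CDS", "CG32954-PC"),
]

print("\n".join(render(unflatten(FLAT))))
```

Run on the flat list above, this should print the nested listing shown earlier. Matching on names keeps the sketch short, but it is exactly the kind of shortcut that breaks on real genbank records.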

Full details are in the pod docs, which I am about to send out to the
bioperl list...

On Tue, 17 Jun 2003, Jason Stajich wrote:

> There is a bit of a chicken-and-egg problem, in that most of the data
> sets Bioperl has tried to interface with are not as rich as chado; the
> genbank->gene->chado way is not going to work for all genbank records
> (which I personally can live with).  I would like to see if we can define
> the least common denominator, so people understand what is needed to get
> a chado db populated.
>
>
> As we've been discussing in different venues I think we'd like to see a
> general purpose system which can take a collection of sequence features,
> relate them in a graph based on an identifiable grouping (the /gene field
> or perhaps mapped into a general slot like 'group' ala Lincoln's
> Bio::DB::GFF system), and then using SO map these into objects.  For genes
> I'd like to see these be Bio::SeqFeature::Gene::GeneStructure objects
> (the object model of which might need some work) because there are
> additional methods already built in like intron inferences and ability to
> loop through the transcripts, etc.
>
> So my request is that we make the chado writer dumb, it should not try to
> group anything, but should just obey however the objects are built.  An
> intermediate set of factories can take lists of features and assign
> 'group' fields to them, a second factory could relate them into a graph
> based on SO and the group fields.  This graph can now be written out to
> chadoxml.  Another factory (which I was calling
> Bio::SeqFeature::Transmogrifier, for the Calvin and Hobbes fans) could
> build the appropriate composite
> objects from the graph (Genes, HSPs, where appropriate) and deal with
> multiple coordinate systems (in the case of features attached to the
> annotated protein product).   The 'Transmogrifier' could also turn these
> composite objects back into simple feature graphs so that they can be
> written to chado simply and (finally) fully written out to a genbank
> record with a controlled vocab of /tag=value fields.
>
> These are my ideas anyways, perhaps too much?  I know other people (Shawn
> Hoon, Chris Mungall) have volunteered ideas and coding to this as well so
> we'd like to see if we can perhaps work together on it.
>
> For examples of some minimal gene objects, the easiest way to get them
> right now is from any of the gene prediction parsers (
> Bio::Tools::Genewise, Bio::Tools::Genomewise, Bio::Tools::Genscan,
> Bio::Tools::Glimmer).
>
> -jason
> On Tue, 17 Jun 2003, Peili Zhang wrote:
>
> > Hi,
> >
> > here at FlyBase, we implement the chado database schema to store sequence,
> > annotation, genetic, controlled vocabulary, publication and other types
> > of data (for detailed information about chado schema, please visit
> > http://www.gmod.org and read the schema documentation and scripts in
> > its CVS).  we have developed tools to dump FlyBase data into chadoxml
> > and load data in chadoxml format into FlyBase (for chadoxml dtd, please
> > see
> > http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/gmod/schema/chado/dat/chado.dtd),
> > to facilitate data communication among the different sites of FlyBase
> > and between FlyBase and the rest of the world. A need has arisen for a tool to
> > convert external data in other formats into chadoxml. I'm coding a perl
> > module chadoxml.pm to write out a Bio::Seq object into chadoxml. we'd
> > like to get your feedback on whether it's useful to add this module into
> > bioperl as an extension to the Bio::SeqIO package. if you already have
> > working code for the same purpose, maybe we can discuss how to merge our
> > code to produce a better version.
> >
> > thanks for your input.
> >
> > regards,
> > Peili Zhang
> > FlyBase-Harvard
> >
> > _______________________________________________
> > Bioperl-announce-l mailing list
> > Bioperl-announce-l at portal.open-bio.org
> > http://portal.open-bio.org/mailman/listinfo/bioperl-announce-l
> >
>
> --
> Jason Stajich
> Duke University
> jason at cgt.mc.duke.edu
>