[Bioperl-l] Re: problems with Bio::Tools::GFF

Lincoln Stein lstein at cshl.edu
Wed Nov 5 09:39:20 EST 2003


> I am a bit wary of splitting on space wrt the last column, so we'll
> have to cook up some test cases to make sure it goes through okay.

Split on space?  You shouldn't need to do that.
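
To make the point concrete, here is a minimal sketch of a tab-only split. It is written in Python for illustration and is not the Bio::Tools::GFF code; the function name parse_gff3_line is hypothetical. It assumes nine tab-delimited columns, with the last column holding tag=value pairs separated by ";", multiple values separated by ",", and reserved characters URL-escaped.

```python
from urllib.parse import unquote


def parse_gff3_line(line):
    """Split one GFF3 feature line into its fixed columns and attributes.

    Hypothetical helper for illustration only. Columns are split on tabs,
    never on spaces, so a raw space inside the last column is preserved.
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-delimited columns, got %d" % len(cols))

    attrs = {}
    # Split the last column on ';' only; whitespace is not a separator here.
    for pair in cols[8].split(";"):
        if not pair:
            continue
        tag, _, value = pair.partition("=")
        # Split multiple values on ',' first, then undo the URL escaping.
        attrs[unquote(tag)] = [unquote(v) for v in value.split(",")]
    return cols[:8], attrs


line = "ctg123\t.\texon\t1300\t1500\t.\t+\t.\tID=exon00001;Note=two words\n"
cols, attrs = parse_gff3_line(line)
print(cols[2], attrs["Note"])
```

Because only tabs separate columns, the space in the Note value never reaches the column splitter, which is why a \s+ split should not be needed.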

> > > Note that I have also made no attempt to parse/write the Gap or
> > > Alignment stuff in any sort of special way - I basically made it so it
> > > supports what GFF2 currently looks like only in GFF3 flavor.  Perhaps
> > > it makes sense to do all of that work on Chris's Unflattener though
> > > rather than in Tools::GFF.  A SeqFeature::Tools::Flattener is probably
> > > in order as well to turn HSPs and other paired sequences into GFF3
> > > Alignments.
> >
> > I'm not sure it's necessary to move to Unflattener.  Since the format is
> > fairly simple, it is only really necessary to split the information in
> > the groups column into tag/value pairs and let the user decide what to do
> > with the information.  The only thing that I am somewhat at a loss to
> > deal with is CIGAR line info, but I don't think that is being parsed by
> > Bio::DB::GFF yet either.

Every CIGAR line can be turned into a set of nongapped HSPs and vice versa.  
The main issue is that if you represent a gapped alignment as a CIGAR, you 
can't give each of its HSPs a separate score!  I think this is a big problem.  
Therefore you can either ignore CIGARs completely, or mix CIGARs with HSPs.
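
The round trip can be sketched as follows. This is an illustrative Python sketch, not BioPerl code, and the function names cigar_to_blocks and blocks_to_cigar are hypothetical. It assumes the letter-then-count form used in the GFF3 Gap attribute (for example "M8 D3 M6 I1 M6"), where M advances both sequences, D advances only the reference, and I advances only the target; other CIGAR dialects put the count first and would need a different pattern.

```python
import re


def cigar_to_blocks(cigar, ref_start=1, tgt_start=1):
    """Turn a gapped alignment into ungapped blocks.

    Each block is (ref_start, ref_end, tgt_start, tgt_end), 1-based inclusive.
    Assumed convention: M = match, D = reference only, I = target only.
    """
    blocks = []
    ref, tgt = ref_start, tgt_start
    for op, count in re.findall(r"([MID])(\d+)", cigar):
        n = int(count)
        if op == "M":
            blocks.append((ref, ref + n - 1, tgt, tgt + n - 1))
            ref += n
            tgt += n
        elif op == "D":
            ref += n  # bases present in the reference but not in the target
        else:
            tgt += n  # "I": bases present in the target but not in the reference
    return blocks


def blocks_to_cigar(blocks):
    """Rebuild the gap string from ordered, non-overlapping ungapped blocks."""
    ops = []
    prev = None
    for block in blocks:
        r0, r1, t0, t1 = block
        if prev is not None:
            ref_gap = r0 - prev[1] - 1
            tgt_gap = t0 - prev[3] - 1
            if ref_gap:
                ops.append("D%d" % ref_gap)
            if tgt_gap:
                ops.append("I%d" % tgt_gap)
        ops.append("M%d" % (r1 - r0 + 1))
        prev = block
    return " ".join(ops)


blocks = cigar_to_blocks("M8 D3 M6 I1 M6")
print(blocks)                   # [(1, 8, 1, 8), (12, 17, 9, 14), (18, 23, 16, 21)]
print(blocks_to_cigar(blocks))  # M8 D3 M6 I1 M6
```

Note that the gap string carries coordinates only: once the three blocks are folded into it, there is nowhere left to attach a separate score to each one.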

Lincoln


>
> One day I could imagine us building Gene/Transcript objects from the GFF3.
> Actually I was thinking we'd need a Flattener to turn the Gene object back
> into flattened features.  Likewise with HSP objects and alignments.  I
> can't produce CIGAR lines currently from HSPs - I'm still a little
> confused about how to construct them but it means I need to read the spec
> a little more probably.
>
> > > As for the seq stuff - will likely need a Bio::SeqIO::gff3 for that.
> >
> > Ouch--I was afraid you were going to suggest that.  I suppose if we make
> > it a read-only module, I guess that should be ok.  The thought of making
> > it write makes my head hurt.
>
> For writing multiple sequences, could be pretty ugly.  Either some
> caching OR a special write_seq which takes an arrayref.  Maybe not a SeqIO
> after all....  unless GFF3 lets a new set start with
> ##gff-version 3
> so you could interleave them?
> ##gff-version 3
> ..
> ##FASTA
>
> >oneseq.1
>
> CAGT
> ##gff-version 3
> ..
> ##FASTA
>
> >oneseq.2
>
> GATC
>
>
> For reading sequences, next_seq will have to parse the entire GFF file
> at once and then iterate through an internal array, I
> guess.  Not that hard I hope...
>
> > > Anyone is welcome to add these changes - I don't think I'll be able to
> > > make many contributions until December so it would be best if someone
> > > else took it on.
> > >
> > > -jason
> > >
> > > On Mon, 3 Nov 2003, Scott Cain wrote:
> > > > Hi Jason and Lincoln,
> > > >
> > > > I have a few concerns with Bio::Tools::GFF. The first is with the
> > > > method _from_gff3_string, which does a split on \t to separate
> > > > columns.  I think the GFF3 spec says it can be space delimited, so
> > > > that should probably be \s+.  Additionally, to split the groups
> > > > column, it uses \s*;\s*, but I think that spaces have to be escaped,
> > > > therefore, it should only split on ; and spaces would indicate a
> > > > problem (especially if one splits on spaces as indicated above).
> > > >
> > > > Finally, it doesn't provide a method of accessing the sequence that
> > > > is optionally at the bottom of the file.  I am not exactly sure how
> > > > to implement that (or I would), but I suspect it will have to be
> > > > handled in the next_feature method.  Of course, the problem with
> > > > handling it there is that it is not a feature.
> > > >
> > > > Scott
> > >
> > > --
> > > Jason Stajich
> > > Duke University
> > > jason at cgt.mc.duke.edu
>
> --
> Jason Stajich
> Duke University
> jason at cgt.mc.duke.edu

-- 
Lincoln Stein
lstein at cshl.edu
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
(516) 367-8380 (voice)
(516) 367-8389 (fax)

