[Bioperl-l] cross-project parser fun

Jason Stajich jason@cgt.mc.duke.edu
Thu, 8 Aug 2002 09:33:34 -0400 (EDT)


Sounds great - I'm not sure we're as worried about testing the parsers
across the projects as we are about making sure we get the same
input/output from a given file.  This just means read-in, write-out and
diff.  I think a better use of time is writing a framework which makes it
easy to score a particular parser on its ability to roundtrip a set of
sequence files.

In perl, I've actually done this before as I really did spend a long time
testing the genbank/embl/swissprot parsers/writers for compatible diffs.
The problem comes in linewraps behaving slightly differently, so that the
diffs are not perfectly clean and thus can't be scored automatically by
software.
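
One way around the linewrap noise is to compare the two files with all
whitespace stripped, so that only real content changes count. A rough
Python sketch of that idea follows; the function names are made up for
illustration and are not part of Bioperl or Biopython, and dropping
whitespace entirely is an assumption that will also hide genuine spacing
differences.

```python
import difflib


def strip_whitespace(text):
    """Drop every whitespace character so wrap position cannot matter."""
    return "".join(text.split())


def roundtrip_score(original, rewritten):
    """Similarity in [0, 1]; 1.0 means identical once whitespace is ignored."""
    a = strip_whitespace(original)
    b = strip_whitespace(rewritten)
    # autojunk=False keeps difflib from discarding frequent characters,
    # which matters for long sequence records.
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()
```

A score of exactly 1.0 can be counted as a clean roundtrip, and anything
lower flagged for a manual look at the diff.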

So a program framework, which may or may not be cross-language compatible,
that lets you plug in a script to read & write a file and then compares
the two automatically would be great.
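
As a minimal sketch of what such a framework could look like, assuming
each parser is wrapped in a small script (in any language) that takes a
file name as its last argument and writes the re-serialized record to
stdout. The names roundtrip and score_parser are hypothetical, and the
comparison here ignores whitespace differences between words only.

```python
import subprocess
import sys
from pathlib import Path


def roundtrip(command, path):
    """Run the plug-in script on one file and return what it wrote to stdout."""
    result = subprocess.run(
        list(command) + [str(path)],
        capture_output=True,
        text=True,
        check=True,  # a crashing plug-in raises instead of scoring silently
    )
    return result.stdout


def score_parser(command, paths):
    """Return (passed, total) for a plug-in over a set of sequence files."""
    passed = 0
    for path in paths:
        original = Path(path).read_text()
        # Compare whitespace-separated tokens so line wrapping is ignored.
        if roundtrip(command, path).split() == original.split():
            passed += 1
    return passed, len(paths)


if __name__ == "__main__" and len(sys.argv) > 2:
    # Hypothetical usage: score.py "perl roundtrip.pl" file1.gb file2.gb
    ok, total = score_parser(sys.argv[1].split(), sys.argv[2:])
    print(f"{ok}/{total} files roundtripped cleanly")
```

Because the plug-in is just a command line, the same harness could drive a
Perl, Python or Java script without knowing anything about the parser.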

-jason

On Wed, 7 Aug 2002, Danny Yoo wrote:

> Hi everyone,
>
> There's been a lot of activity on the Bioperl mailing list recently about
> their Genbank location parser, and there was a call to compare the
> behavior of it versus that of the other Bio projects.
>
>
> Perhaps it might be a good thing to collaborate with them?  Someone can
> write a set of common tests to make sure that sublocation parsing and
> sequence extraction is being done consistently between Biopython and
> Bioperl.  I'll try writing something this afternoon.
>
>
>
> ---------- Forwarded message ----------
> Date: Wed, 7 Aug 2002 11:39:19 -0700 (PDT)
> From: Chris Mungall <cjm@fruitfly.org>
> To: Hilmar Lapp <hlapp@gnf.org>
> Cc: Elia Stupka <elia@fugu-sg.org>, Jason Stajich <jason@cgt.mc.duke.edu>,
>      bioperl-l@bioperl.org
> Subject: Re: [Bioperl-l] *major* error in genbank parser or am i just
>     insane?
>
>
> i would have thought the sublocations' strand should be -1, as they
> represent exons on the reverse strand. but i don't really understand the
> whole bioperl location+seqfeature semantics/model; when outside the
> bioperl world i just have one class that rolls seqfeature and location
> into one.
>
> i'm happy to have hilmar revoke my fix and instead go with checking the
> parent location strand rather than the sublocation strand (if someone
> could fix the genbank dumper to print the complement correctly that would
> be great). if we go this route i will fix bioperl-db so that the parent
> location strand goes into the seqfeature_location table. note that this
> will introduce a slight disjunction between biosql and bioperl (in biosql
> we absolutely must represent -ve strand exons as
> seqfeature_location.strand = -1). hmm, how does biojava handle this.
>
> On Wed, 7 Aug 2002, Hilmar Lapp wrote:
>
> > After looking at Chris' fix, it appears to be wrong: it would set
> > the sublocs' strand to -1. The problem lies elsewhere, I'm going to
> > revoke that fix.
> >
> > 	-hilmar
> >
> > On Wednesday, August 7, 2002, at 10:10  AM, Hilmar Lapp wrote:
> >
> > > I have no idea what the present status on that is, but my reply was
> > > generally not about a long-term/high-level/design/it would
> > > be much better if/ discussion. I basically asked the question what
> > > complement(join(1..100,201..300)) exactly means, and whether it has
> > > been decided how exactly it shall be translated into strand()
> > > attributes of the parent and sub-locations. This hasn't been
> > > answered yet ...
> > >
> > > Quoting from the FT definition:
> > >
> > > > complement(join(2691..4571,4918..5163))
> > > >     Joins regions 2691 to 4571 and 4918 to 5163, then complements
> > > >     the joined segments (the feature is on the strand
> > > >     complementary to the presented strand)
> > > >
> > > > join(complement(4918..5163),complement(2691..4571))
> > > >     Complements regions 4918 to 5163 and 2691 to 4571, then joins
> > > >     the complemented segments (the feature is on the strand
> > > >     complementary to the presented strand)
> > >
> > > The case in question is the first example. To translate this
> > > properly to Bioperl locations, this means the parent SplitLoc is
> > > strand -1, whereas the subs are strand +1. Right?
> > >
> > > 	-hilmar
> > >
> > >
> > > On Tuesday, August 6, 2002, at 10:24  PM, Chris Mungall wrote:
> > >
> > > >> ok, committed - it seems to have had some weird knock on effect
> > > >> breaking other tests - i can uncommit if this is bad
> > >>
> > >> On Wed, 7 Aug 2002, Elia Stupka wrote:
> > >>
> > > >>>> we need a short term fix for the standard situation even more -
> > > >>>> shall i commit my change or will this mess things up more?
> > >>>
> > > >>> Please commit it, I cannot stand when long-term/high-level/design/
> > > >>> it would be much better if/ discussions get in the way of
> > > >>> production improvement fixes.
> > >>>
> > > >>> Once it's committed I can set off a script for the diffing of in/out
> > > >>> genbank so you can be comfortable that it's not screwing up the
> > > >>> rest of genbank parsing ;)
> > >>>
> > >>> Elia
> > >>>
> > >>> ********************************
> > >>> * http://www.fugu-sg.org/~elia *
> > >>> * tel:    +65 6874 1467        *
> > >>> * mobile: +65 9030 7613        *
> > >>> * fax:    +65 6777 0402        *
> > >>> ********************************
> > >>>
> > >>>
> > >>>
> > >>
> > >>
> > > --
> > > -------------------------------------------------------------
> > > Hilmar Lapp                            email: lapp at gnf.org
> > > GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
> > > -------------------------------------------------------------
> > >
> > >
> > --
> > -------------------------------------------------------------
> > Hilmar Lapp                            email: lapp at gnf.org
> > GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
> > -------------------------------------------------------------
> >
> >
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-l
>

-- 
Jason Stajich
Duke University
jason at cgt.mc.duke.edu