[Bioperl-l] Re: [SO-devel] GFF3 preliminary

Chris Mungall cjm at fruitfly.org
Wed Feb 19 01:45:38 EST 2003



On Wed, 19 Feb 2003, Ewan Birney wrote:

>
>
> On Tue, 18 Feb 2003, Mark Yandell wrote:
>
> > Hi All,
> >
> >
> > ".  When asked why they
> > > have modified the published Sanger specification, bioinformaticists
> > > frequently answer that the format was insufficient for their needs...",
> >
> >
> > So why not just use XML? you know, with like a real DTD, like the rest of the
> > world and be done with it ?
> >
>
> that's what NCBI Seq XML or GAME XML or (new and shiny...talk to Michele)
> Otter XML is for, and they solve specific problems.

the syntax (e.g. XML vs S-Expressions vs denormalised relations) is mostly
orthogonal to the semantics (e.g. GFF2 = feature bags, GFF3 = feature graphs
+ controlled types, Otter = explicit gene/transcript/exons/etc, i.e. Ensembl
semantics)
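
as a rough illustration of the syntax/semantics split, here is a minimal
Python sketch in which one feature record is written out both as a
tab-delimited row and as XML, and reads back to the same thing. the field
names, the <feature> element and the helper functions are all hypothetical,
made up for this example; they are not taken from GFF3, GAME or Otter.

```python
import xml.etree.ElementTree as ET

# Hypothetical field list: illustrative only, not taken from any spec.
FIELDS = ["seqid", "type", "start", "end", "strand"]

feature = {"seqid": "chr1", "type": "exon", "start": 100, "end": 200, "strand": "+"}


def to_tab_row(feat):
    """Serialise the record as one tab-delimited line."""
    return "\t".join(str(feat[k]) for k in FIELDS)


def to_xml(feat):
    """Serialise the same record as a small XML element."""
    elem = ET.Element("feature")
    for k in FIELDS:
        ET.SubElement(elem, k).text = str(feat[k])
    return ET.tostring(elem, encoding="unicode")


def from_tab_row(row):
    """Read a tab-delimited line back into a dict of strings."""
    return dict(zip(FIELDS, row.split("\t")))


def from_xml(text):
    """Read the XML element back into a dict of strings."""
    return {child.tag: child.text for child in ET.fromstring(text)}


# Two syntaxes, one underlying record.
same = from_tab_row(to_tab_row(feature)) == from_xml(to_xml(feature))
```

the point is only that the record (the semantics) is fixed first and the
serialisation is chosen afterwards.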

there is an implicit DTD underlying the GFF3 spec that Lincoln posted -
maybe it would be better to make this explicit and keep everyone happy?

of course this won't satisfy people who want semantically richer formats
(eg attaching deep curation to genes) but that isn't the problem GFFn sets
out to solve.

> With XML you can't:
>
>   use grep
>   use sort and sort -k and other twisted options of sort
>   use comm
>   use awk

and of course 'perl -ne' which is great

but you can do some really powerful stuff with SAX events too, and it
needn't be that difficult (the main problem with XML is the technology
bloat surrounding it - simpler tools are needed)
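
for what it's worth, a small tag-tolerant SAX handler can be fairly short.
the sketch below uses Python's standard xml.sax module; the <feature>
element layout, the WANTED tag set and the parse_features helper are
assumptions invented for the example, not part of any of the XML formats
mentioned in this thread. unknown tags are simply ignored.

```python
import xml.sax


class FeatureHandler(xml.sax.ContentHandler):
    """Collect <feature> elements into dicts, ignoring unknown tags."""

    # Hypothetical set of child tags we care about.
    WANTED = {"type", "start", "end"}

    def __init__(self):
        super().__init__()
        self.features = []
        self._current = None   # dict for the feature being read
        self._field = None     # name of the wanted child tag being read
        self._buf = []         # text chunks for that child tag

    def startElement(self, name, attrs):
        if name == "feature":
            self._current = {"id": attrs.get("id")}
        elif self._current is not None and name in self.WANTED:
            self._field = name
            self._buf = []

    def characters(self, content):
        # SAX may deliver text in several chunks, so accumulate them.
        if self._field is not None:
            self._buf.append(content)

    def endElement(self, name):
        if name == "feature":
            self.features.append(self._current)
            self._current = None
        elif name == self._field:
            self._current[self._field] = "".join(self._buf).strip()
            self._field = None


def parse_features(xml_text):
    """Parse an XML string and return a list of feature dicts."""
    handler = FeatureHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.features
```

calling parse_features on a document whose <feature> elements carry extra,
unexpected child tags should still return just the wanted fields.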

> With XML you need
>
>   a decent XML SAX parser in your language of choice to read it reliably -
> now this is pretty much there for most languages
>
>
>   enough coding time to write a SAX event to internal data structure
> in a tag-tolerant way (after all, if you are going to be strict on the
> tags and not tolerate additional tags... then why use XML?). Nowhere near

loads of reasons

> impossible, but nowhere near as simple as @fields = split;
>
>   endless discussions with people who are trying to solve related but
> distinct problems to discover that you want to write separate XML formats.
>
>
>
> XML is a bad format, but undoubtedly the best format out there for complex
> data.
>
>
> XML simply doesn't replace tab delimited formats and we shouldn't mandate
> the death of GFF and friends (eg, GTF) due to XML formats being used for
> complex data transfer.

i think a GFF3-table and GFF3-xml can live together happily

the great thing about GFF3 is the ability to properly represent feature
graphs / hierarchies and the use of an ontology for feature types. the
faster this replaces other GFFs the better (despite the remaining flaws in
the semantics that others have already alluded to) - i'm not so concerned
about the underlying syntax
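
to make the feature graph idea concrete, here is a minimal Python sketch
that reads tab-delimited lines and links features to their parents. it
assumes the nine-column layout with ID= and Parent= attributes in the last
column, as in the GFF3 specification as later published; the preliminary
draft discussed here may differ in detail. the function names are
hypothetical, and for brevity the sketch only indexes features that carry
an ID.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Split 'ID=x;Parent=a,b' into {'ID': ['x'], 'Parent': ['a', 'b']}."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attrs[key] = value.split(",")
    return attrs


def build_feature_graph(lines):
    """Return (features by ID, parent ID -> list of child IDs)."""
    features = {}
    children = defaultdict(list)
    for line in lines:
        # Skip blank lines, comments and '##' directives.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # not a well-formed feature line
        attrs = parse_attributes(fields[8])
        fid = attrs.get("ID", [None])[0]
        if fid is None:
            continue  # simplification: only features with an ID are indexed
        features[fid] = {
            "type": fields[2],
            "start": int(fields[3]),
            "end": int(fields[4]),
        }
        # A feature may list several parents, so this is a graph, not a tree.
        for parent in attrs.get("Parent", []):
            children[parent].append(fid)
    return features, dict(children)
```

a gene / mRNA / exon example then comes out as a chain of parent to child
links, which is the structure a flat GFF2-style feature bag cannot express
directly.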


