[Biojava-l] SAX, DOM, XPath and Flat files

Mark Schreiber markjschreiber at gmail.com
Wed Nov 28 03:34:38 UTC 2007


Hi -

I think in most cases huge XML files in bioinformatics result from a
single XML document containing many repeated elements, e.g. a BLAST XML
output with several hits or a GenBankXML file with many sequences.  A
nice approach I have seen for dealing with these is to use SAX to read
over the file and, every time it reaches one of the repeated elements,
delegate that element to a DOM object.  You then parse the bits of the
DOM you want with XPath, or convert them to objects, and when you are
finished with that entry everything gets garbage collected and the SAX
parser moves to the next element and repeats the whole process.  This
is a hybrid of event-based parsing and object-model-based parsing which
could let you deal with huge files efficiently, since only one entry
needs to be held in memory at a time.
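
A minimal sketch of that hybrid, assuming the repeated element is called
"Hit" and each one has a "Hit_id" child, as in BLAST XML. The class name
RecordSplitter and the helper hitIds are made up for illustration and are
not part of BioJava. The SAX handler builds a small DOM only while it is
inside a record, hands it to a callback, then drops it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: streams with SAX, builds one DOM per record element.
class RecordSplitter extends DefaultHandler {
    private final String recordName;
    private final Consumer<Element> callback;
    private final DocumentBuilder builder;
    private Document doc;   // non-null only while inside a record
    private Node current;   // node currently being filled

    RecordSplitter(String recordName, Consumer<Element> callback)
            throws ParserConfigurationException {
        this.recordName = recordName;
        this.callback = callback;
        this.builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if (doc == null) {
            if (!qName.equals(recordName)) return;  // outside a record: skip
            doc = builder.newDocument();            // start a fresh DOM
            current = doc;
        }
        Element e = doc.createElement(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            e.setAttribute(atts.getQName(i), atts.getValue(i));
        }
        current.appendChild(e);
        current = e;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (doc != null) {
            current.appendChild(doc.createTextNode(new String(ch, start, length)));
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (doc == null) return;
        current = current.getParentNode();
        if (current == doc) {                       // the record just closed
            callback.accept(doc.getDocumentElement());
            doc = null;                             // let the DOM be collected
            current = null;
        }
    }

    // Example use: collect the Hit_id of every Hit with XPath.
    static List<String> hitIds(String xml) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        List<String> ids = new ArrayList<>();
        RecordSplitter handler = new RecordSplitter("Hit", hit -> {
            try {
                ids.add(xpath.evaluate("Hit_id", hit));
            } catch (XPathExpressionException ex) {
                throw new IllegalStateException(ex);
            }
        });
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return ids;
    }

    public static void main(String[] args) throws Exception {
        // Made-up input, far smaller than a real BLAST report.
        String xml = "<BlastOutput><Hit><Hit_id>id1</Hit_id></Hit>"
                   + "<Hit><Hit_id>id2</Hit_id></Hit></BlastOutput>";
        System.out.println(hitIds(xml));
    }
}
```

For a real file you would pass a File or InputStream to the SAX parser
instead of a string, and the callback would convert each record into
objects rather than collecting IDs.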

I think the BLAST XML has improved substantially, at least in terms of
validating against its own DTD.  The DTD itself may not be the best
design, but that is always a matter of taste, and if you are using
XPath to get the relevant bits you don't need to make a SAX parser jump
through hoops to get them.

I agree we will have to keep flat file parsers, but we should strongly
encourage the use of XML where possible. It is simply easier to deal
with. Most biological flat files were designed for Fortran and mainly
for human consumption, and there is no obvious validation mechanism.
Notably, everything in NCBI is derived from ASN.1; what you see in the
flat file is produced from there. I tend to think this means that the
ASN.1 is the holy gospel and what you get in the flat file is some
translation of it.  Ideally NCBI files should be parsed from the ASN.1,
where you can guarantee validation. The more practical alternative is
to use the XML, which you can at least validate against a DTD.
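
As a rough illustration of validating against a DTD, here is a sketch
using the validating SAX parser that ships with the JDK. The class name
DtdCheck and the tiny "seqs" DTD in the usage note are invented for the
example; a real document would point at its own DTD through its DOCTYPE.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: returns the DTD validity errors found in a document.
class DtdCheck {
    static List<String> validationErrors(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(true);   // validate against the DOCTYPE's DTD
        List<String> errors = new ArrayList<>();
        factory.newSAXParser().parse(
            new InputSource(new StringReader(xml)),
            new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    // Validity problems arrive here; record them, don't abort.
                    errors.add(e.getMessage());
                }
            });
        return errors;
    }
}
```

With an inline DTD that declares a required "id" attribute on "seq", a
document that supplies the attribute should give an empty list and one
that omits it should give at least one error. A caller could then refuse
to parse anything whose list is not empty.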

With XML we (Biojava) can say that if it validates we will parse it,
and if it doesn't we may not.  With flat files there are so many dodgy
variants that we cannot say anything.  Because XML DTDs (or XSDs) have
versions, it is also much easier to have parsers for different
versions, and the parsing machinery can figure out which one is needed.
With flat files it is anyone's guess what version you are dealing with.
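
One way the machinery might figure out the version is to read the
DOCTYPE's system ID before choosing a parser. The sketch below does this
with a SAX LexicalHandler. The class name DoctypeSniffer, the example.org
DTD URLs and the parser names are all placeholders, not real BioJava or
NCBI identifiers.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical helper: picks a parser name from the DTD named in the DOCTYPE.
class DoctypeSniffer {
    // Placeholder mapping from DTD system ID to parser name.
    static final Map<String, String> PARSERS = new HashMap<>();
    static {
        PARSERS.put("http://example.org/dtd/records-1.0.dtd", "RecordsV1Parser");
        PARSERS.put("http://example.org/dtd/records-2.0.dtd", "RecordsV2Parser");
    }

    // Returns the system ID from the DOCTYPE, or null if there is none.
    static String systemId(String xml) throws Exception {
        final String[] found = new String[1];
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setProperty("http://xml.org/sax/properties/lexical-handler",
            new DefaultHandler2() {
                @Override
                public void startDTD(String name, String publicId, String sysId) {
                    found[0] = sysId;
                }
            });
        // Supply an empty DTD so nothing is fetched over the network.
        reader.setEntityResolver((publicId, sysId) ->
            new InputSource(new StringReader("")));
        reader.parse(new InputSource(new StringReader(xml)));
        return found[0];
    }

    static String parserFor(String xml) throws Exception {
        String id = systemId(xml);
        return PARSERS.getOrDefault(id, "unsupported");
    }
}
```

This version reads the whole document just to find the DOCTYPE, which is
wasteful for large files. A real implementation would stop as soon as the
DTD declaration has been seen.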

Finally, parsers can be auto-generated for XML if you have the DTD or
XSD. This often doesn't give you an ideal parser, but it can be a
useful starting point for rapid development.

For Biojava v 3 I think we should concentrate on XML parsers first and
flat files second. <sigh>if only Fasta had an XML format</sigh>

- Mark

On Nov 27, 2007 11:16 PM, Andy Yates <ayates at ebi.ac.uk> wrote:
> I was always under the impression that blast's XML output was nearly as
> hard to parse as the flat file format, but I do agree that using XML
> whenever we can would make writing parsers a lot easier (especially if
> there are SAX-based XPath libraries available). Actually this brings up
> a good question about development of this type of parser. The majority
> of XPath-supporting libraries are DOM based, which will mean large
> memory usage in some situations but overall provides an easier coding
> experience (and hopefully reduces our chances of creating bugs). Or
> should we code to the edge cases of someone trying to parse a 1GB XML?
> Personally I'd favour the former.
>
> Going back to the original topic, there are going to be situations where
> people want the flat file parsers/writers & I think it's a valid point
> to say this is where BioJava is meant to come in & help a developer.
> After all, XML is a computer science problem whereas parsing an EMBL flat
> file or blast output is a bioinformatics problem.
>
> Andy
>
>
> Mark Schreiber wrote:
> > For a long time now my feeling has been that we should *only* support
> > the XML version of blast output.  The other formats are too brittle to
> > be easy to parse.  I also feel similarly about Genbank, EMBL, etc. That
> > may be an extreme view, but the power of generic XML parsers and things
> > like XPath really makes these formats look very attractive.
> >
> > - Mark
> >
> >
> > On Nov 27, 2007 7:47 PM, Andy Yates <ayates at ebi.ac.uk> wrote:
> >> I think Groovy have adopted a similar system recently & have guidelines
> >> for how each module should behave (dependencies, build system etc). This
> >> enforces the idea that a module whilst not part of the core project must
> >> behave in the same manner the core does. I do like the idea that we can
> >> have a core biojava & things get added around it & it might encourage
> >> other users to start developing their own modules for any
> >> formats/purpose they want.
> >>
> >> Richard Holland wrote:
> >>> -----BEGIN PGP SIGNED MESSAGE-----
> >>> Hash: SHA1
> >>>
> >>>> What format options are there from blast? Just thinking if it supports
> >>>> CIGAR or something like that are we better providing a parser for that
> >>>> format & saying that we do not support the traditional blast output?
> >>>> That said it doesn't help us when that format changes so maybe what is
> >>>> needed is a way to push out parser changes without requiring a full
> >>>> biojava release (v3 discussion) ...
> >>> Exactly! So the modular idea would work nicely here - we could have a
> >>> blast module and only update that single module (which would be its own
> >>> JAR) whenever the format changes. In a way, BioJava releases as such
> >>> would no longer happen, except maybe for some kind of core BioJava
> >>> module. Everything would be done in terms of individual module+JAR
> >>> releases instead - one for Genbank, one for BioSQL, one for NEXUS, one
> >>> for Phylogenetic tools, one for translation/transcription, etc. etc.
> >>>
> >>> cheers,
> >>> Richard
> >> _______________________________________________
> >> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/biojava-l
> >>
>


