[Biojava-l] SAX, DOM, XPath and Flat files

Mark Schreiber markjschreiber at gmail.com
Fri Nov 30 02:28:58 UTC 2007


The Java 5 SDK ships both SAX and DOM as standard. It also includes
XPath (javax.xml.xpath, part of JAXP 1.3) but not XQuery, although
XPath is probably the more important of the two for this use.
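
To make the hybrid idea from earlier in the thread concrete (SAX walks
the file, each repeated element is handed to a small DOM, XPath pulls
out the relevant bits, then the DOM is dropped), here is a rough sketch
using only classes that ship with the Java 5 SDK. The class name
HybridRecordParser, the extract method and the sample data are made up
for illustration; this is not BioJava code, just one way it might look.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: SAX drives, one small DOM is built per record element.
class HybridRecordParser extends DefaultHandler {
    private final String recordName;   // element that delimits one record
    private final String xpathExpr;    // evaluated against each record's DOM
    private final XPath xpath = XPathFactory.newInstance().newXPath();
    private final DocumentBuilder builder;
    private final List<String> results = new ArrayList<String>();
    private Document doc;              // DOM of the record being read, or null
    private Node current;              // insertion point inside that DOM

    HybridRecordParser(String recordName, String xpathExpr)
            throws ParserConfigurationException {
        this.recordName = recordName;
        this.xpathExpr = xpathExpr;
        this.builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if (doc == null) {
            if (!qName.equals(recordName)) {
                return;                // outside any record: nothing to build
            }
            doc = builder.newDocument();
            current = doc;
        }
        Element el = doc.createElement(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            el.setAttribute(atts.getQName(i), atts.getValue(i));
        }
        current.appendChild(el);
        current = el;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (doc != null) {
            current.appendChild(doc.createTextNode(new String(ch, start, length)));
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if (doc == null) {
            return;
        }
        current = current.getParentNode();
        if (current == doc) {          // the record's root element just closed
            try {
                results.add(xpath.evaluate(xpathExpr, doc));
            } catch (XPathExpressionException e) {
                throw new SAXException(e);
            }
            doc = null;                // drop the DOM so it can be collected
            current = null;
        }
    }

    // Returns the string value of xpathExpr for every record in the input.
    static List<String> extract(String xml, String recordName, String xpathExpr) {
        try {
            HybridRecordParser handler = new HybridRecordParser(recordName, xpathExpr);
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.results;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample that only mimics the shape of BLAST XML output.
        String xml = "<BlastOutput><Hit><Hit_id>gi|1</Hit_id></Hit>"
                   + "<Hit><Hit_id>gi|2</Hit_id></Hit></BlastOutput>";
        System.out.println(extract(xml, "Hit", "/Hit/Hit_id"));
    }
}
```

Only one record's DOM is alive at a time, so memory use should follow
the size of the largest record rather than the size of the file. A real
parser would read from a stream and hand each record to a callback
instead of collecting strings in a list.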

The DOM model is a direct implementation of the W3C standard, which
makes it a little awkward from a Java point of view, but it is usable.

Java 6 adds StAX, the pull-style streaming API.
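
For comparison, a minimal StAX sketch: the code asks the reader for the
next event instead of receiving callbacks as it would with SAX. The
class name StaxTextReader, the readText method and the sample data are
hypothetical; the javax.xml.stream calls are the standard Java 6 ones.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: collect the text of every element with a given name.
class StaxTextReader {
    static List<String> readText(String xml, String elementName) {
        List<String> out = new ArrayList<String>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                // Pull the next event; only matching start tags are of interest.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && elementName.equals(reader.getLocalName())) {
                    // getElementText() assumes a text-only element and
                    // leaves the reader on the matching end tag.
                    out.add(reader.getElementText());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample that only mimics the shape of GenBank XML.
        String xml = "<GBSet><GBSeq><GBSeq_locus>AB000001</GBSeq_locus></GBSeq>"
                   + "<GBSeq><GBSeq_locus>AB000002</GBSeq_locus></GBSeq></GBSet>";
        System.out.println(readText(xml, "GBSeq_locus"));
    }
}
```

Like SAX this never holds the whole document in memory, but the loop
stays in ordinary control flow, which tends to be easier to follow than
a handler full of state flags.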

There are a few Java APIs for parsing ASN.1, mostly developed for the
telco industry. I've never really looked into which is best (anyone
experienced with this?) but we could probably use one to work directly
off the NCBI ASN.1.

- Mark

On Nov 28, 2007 10:29 PM, Andy Yates <ayates at ebi.ac.uk> wrote:
> Hi Mark,
>
> Okay that sounds like a perfectly sensible way to deal with this. Is
> this kind of parsing model supported in Java5? I only ask as I've not
> done a lot of XML parsing with Java5; more with things like XOM (which I
> think offers a DOM only representation but I'm probably wrong).
>
> That's good. There's not a huge point to have a format & a DTD/XSD and
> then have your files not conform to it.
>
> I was thinking the exact same thing about ASN.1 (well that & it looks
> bleeding horrible to parse but that is an un-educated look at the format
> which I'm sure is as parsable as JSON & the like).
>
> When it comes to flat file parsers I would be happier to provide
> implementations of the more common formats where a viable alternative is
> not available e.g. UniProt, EMBL, Genbank etc. Then groups which provide
> similar output to the above have a chance to write their own
> parsers/formatters. This is very similar to the current situation but we
> just need to remove dependencies on statically located data structures
> (don't get rid of them completely just give users an option to not use
> them).
>
> I'm not sure how much automatically generated parsers would help us. I
> guess it depends on the data model(s) we use if they are auto-parser
> friendly (which normally means POJO/JavaBean conventions including the
> no-args constructor).
>
> Cool I don't want to exclude flat file parsers completely (if only
> because my group has an interest in BioJava being able to read & write
> flat files) :)
>
> They decided to have HUPO-PSI Format instead :)
>
> Andy
>
>
> Mark Schreiber wrote:
> > Hi -
> >
> > I think in most cases huge XML files in bioinformatics result from a
> > single XML containing multiple repetitive elements. Eg a BLAST XML
> > output with several hits or a GenBankXML with many Sequences.  A nice
> > approach I have seen for dealing with these is to use SAX to read over
> > the file and every time it comes to an element it delegates to a DOM
> > object.  You then parse the bits of the DOM you want with XPath or
> > convert to objects or something and then when you are finished with
> > that entry everything gets garbage collected and the SAX parser moves
> > to the next element and repeats the whole process.  This is a hybrid
> > of event based parsing and object-model based parsing which could let
> > you efficiently deal with huge files.
> >
> > I think the BLAST XML has improved substantially, at least in terms of
> > validating against its own DTD.  The DTD itself may not be the best
> > design but that is always a matter of taste and if you are using XPath
> > to get the relevant bits you don't need to make a SAX parser jump
> > through hoops to get them.
> >
> > I agree we will have to keep flat file parsers but we should strongly
> > encourage the use of XML where possible. It is simply easier to deal
> > with. Most biological flat-files were designed for Fortran and mainly
> > for human consumption. There is no obvious validation mechanism.
> > Notably everything in NCBI is derived from ASN.1, what you see in the
> > flatfile is produced from there. I tend to think this means that the
> > ASN.1 is the holy gospel and what you get in the flat file is some
> > translation.  Ideally NCBI files should be parsed from the ASN.1 where
> > you can guarantee validation, the more practical alternative is to use
> > the XML which you can at least validate against a DTD.
> >
> > With XML we (Biojava) can say if it validates we will parse it and if
> > it doesn't we may not.  With flat files there are so many dodgy
> > variants we cannot say anything.  Because XML DTDs (or XSDs) have
> > versions it also makes it much easier to have parsers for different
> > versions and the parsing machinery can figure out which is needed.
> > With flat files it is anyone's guess what version you are dealing with.
> >
> > Finally parsers can be auto-generated for XML if you have the DTD or
> > XSD. This often doesn't give you an ideal parser but it can be a
> > useful starting point for rapid development.
> >
> > For Biojava v 3 I think we should concentrate on XML parsers first and
> > flat files second. <sigh>if only Fasta had an XML format</sigh>
> >
> > - Mark
> >
> > On Nov 27, 2007 11:16 PM, Andy Yates <ayates at ebi.ac.uk> wrote:
> >> I was always under the impression that blast's XML output was nearly as
> >> hard to parse as the flat file format but I do agree that if we can use
> >> XML whenever we can it would make writing parsers a lot easier
> >> (especially if there are SAX based XPath libraries available). Actually
> >> this brings up a good question about development of this type of parser.
> >> The majority of XPath supporting libraries are DOM based which will mean
> >> large memory usage in some situations but overall providing an easier
> >> coding experience (and hopefully reduce our chances of creating bugs).
> >> Or should we code to the edge cases of someone trying to parse a 1GB
> >> XML? Personally I'd favour the former.
> >>
> >> Going back to the original topic there are going to be situations where
> >> people want the flat file parsers/writers & I think it's a valid point
> >> to say this is where BioJava is meant to come in & help a developer.
> >> After all XML is a computer science problem whereas parsing an EMBL flat
> >> file or blast output is a bioinformatics problem.
> >>
> >> Andy
> >>
> >>
> >> Mark Schreiber wrote:
> >>> For a long time now my feeling has been that we should *only* support
> >>> the XML version of blast output.  The other formats are too brittle to
> >>> be easy to parse.  I also feel similarly about Genbank, EMBL, etc that
> >>> may be an extreme view but the power of generic XML parsers and things
> >>> like XPath etc really make these formats look very attractive.
> >>>
> >>> - Mark
> >>>
> >>>
> >>> On Nov 27, 2007 7:47 PM, Andy Yates <ayates at ebi.ac.uk> wrote:
> >>>> I think Groovy have adopted a similar system recently & have guidelines
> >>>> for how each module should behave (dependencies, build system etc). This
> >>>> enforces the idea that a module whilst not part of the core project must
> >>>> behave in the same manner the core does. I do like the idea that we can
> >>>> have a core biojava & things get added around it & it might encourage
> >>>> other users to start developing their own modules for any
> >>>> formats/purpose they want.
> >>>>
> >>>> Richard Holland wrote:
> >>>>> -----BEGIN PGP SIGNED MESSAGE-----
> >>>>> Hash: SHA1
> >>>>>
> >>>>>> What format options are there from blast? Just thinking if it supports
> >>>>>> CIGAR or something like that are we better providing a parser for that
> >>>>>> format & saying that we do not support the traditional blast output?
> >>>>>> That said it doesn't help us when that format changes so maybe what is
> >>>>>> needed is a way to push out parser changes without requiring a full
> >>>>>> biojava release (v3 discussion) ...
> >>>>> Exactly! So the modular idea would work nicely here - we could have a
> >>>>> blast module and only update that single module (which would be its own
> >>>>> JAR) whenever the format changes. In a way, BioJava releases as such
> >>>>> would no longer happen, except maybe for some kind of core BioJava
> >>>>> module. Everything would be done in terms of individual module+JAR
> >>>>> releases instead - one for Genbank, one for BioSQL, one for NEXUS, one
> >>>>> for Phylogenetic tools, one for translation/transcription, etc. etc.
> >>>>>
> >>>>> cheers,
> >>>>> Richard
> >>>> _______________________________________________
> >>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> >>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
> >>>>
>


