[Biojava-l] SAX, DOM, XPath and Flat files

jimmy Zhang crackeur at comcast.net
Thu Dec 6 09:46:25 UTC 2007


VTD-XML is also worth mentioning:
http://vtd-xml.sf.net

----- Original Message ----- 
From: "Mark Schreiber" <markjschreiber at gmail.com>
To: "Andy Yates" <ayates at ebi.ac.uk>
Cc: "biojava-1 mailing list" <biojava-l at lists.open-bio.org>
Sent: Thursday, November 29, 2007 6:28 PM
Subject: Re: [Biojava-l] SAX, DOM, XPath and Flat files


> Java 5 SDK has both SAX and DOM as standard. I think it has XPath but
> not XQuery although XPath is probably more important for this use.
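
For reference, the javax.xml.xpath package has shipped with the JDK since Java 5, so XPath over a standard DOM needs no extra libraries. A minimal sketch follows; the class name XPathDemo, the helper select() and the sample element names are placeholders chosen for illustration, not anything from BioJava.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper: load a whole document into a DOM and query it with XPath. */
final class XPathDemo {

    /** Returns the text content of every node matched by expr, in the order returned. */
    static List<String> select(String xml, String expr) {
        try {
            // Build the complete DOM in memory (fine for small inputs only).
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // Evaluate the expression against the document root as a node set.
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(expr, doc, XPathConstants.NODESET);
            List<String> out = new ArrayList<String>();
            for (int i = 0; i < nodes.getLength(); i++) {
                out.add(nodes.item(i).getTextContent());
            }
            return out;
        } catch (Exception e) {
            // Wrapped so callers of this sketch need no checked-exception handling.
            throw new IllegalArgumentException("could not parse or query input", e);
        }
    }
}
```

Calling XPathDemo.select(xml, "/GBSet/GBSeq/GBSeq_locus") would collect one string per matching element. The whole document is held in memory, which is the trade-off raised further down the thread.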
>
> The DOM model is a direct implementation of the W3C standard which
> makes it a little awkward from a Java point of view but it is usable.
>
> Java 6 has StAX (the other one).
>
> There are a few Java APIs for parsing ASN.1, mostly developed for the
> telco industry. I've never really looked into which is best (anyone
> experienced with this?) but we could probably use one to work directly
> off NCBI ASN.1
>
> - Mark
>
> On Nov 28, 2007 10:29 PM, Andy Yates <ayates at ebi.ac.uk> wrote:
>> Hi Mark,
>>
>> Okay that sounds like a perfectly sensible way to deal with this. Is
>> this kind of parsing model supported in Java5? I only ask as I've not
>> done a lot of XML parsing with Java5; more with things like XOM (which I
>> think offers a DOM only representation but I'm probably wrong).
>>
>> That's good. There's not a huge point to have a format & a DTD/XSD and
>> then have your files not conform to it.
>>
>> I was thinking the exact same thing about ASN.1 (well that & it looks
>> bleeding horrible to parse but that is an un-educated look at the format
>> which I'm sure is as parsable as JSON & the like).
>>
>> When it comes to flat file parsers I would be happier to provide
>> implementations of the more common formats where a viable alternative is
>> not available e.g. UniProt, EMBL, Genbank etc. Then groups which provide
>> similar output to the above have a chance to write their own
>> parsers/formatters. This is very similar to the current situation but we
>> just need to remove dependencies on statically located data structures
>> (don't get rid of them completely just give users an option to not use
>> them).
>>
>> I'm not sure how much automatically generated parsers would help us. I
>> guess it depends on the data model(s) we use if they are auto-parser
>> friendly (which normally means POJO/JavaBean conventions including the
>> no-args constructor).
>>
>> Cool I don't want to exclude flat file parsers completely (if only
>> because my group has an interest in BioJava being able to read & write
>> flat files) :)
>>
>> They decided to have HUPO-PSI Format instead :)
>>
>> Andy
>>
>>
>> Mark Schreiber wrote:
>> > Hi -
>> >
>> > I think in most cases huge XML files in bioinformatics result from a
>> > single XML containing multiple repetitive elements. Eg a BLAST XML
>> > output with several hits or a GenBankXML with many Sequences.  A nice
>> > approach I have seen for dealing with these is to use SAX to read over
>> > the file and every time it comes to an element it delegates to a DOM
>> > object.  You then parse the bits of the DOM you want with XPath or
>> > convert to objects or something and then when you are finished with
>> > that entry everything gets garbage collected and the SAX parser moves
>> > to the next element and repeats the whole process.  This is a hybrid
>> > of event based parsing and object-model based parsing which could let
>> > you efficiently deal with huge files.
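
A minimal sketch of the hybrid approach described above, assuming one record per repeated element. A SAX handler skips everything outside the record element, builds a small DOM for each record, hands it to a callback and then drops it so it can be garbage collected. RecordSplitter, RecordCallback and extract() are hypothetical names made up for this example, and the handler is not namespace-aware.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical SAX handler that builds one small DOM per record element. */
final class RecordSplitter extends DefaultHandler {

    /** Hypothetical callback, invoked once for each completed record. */
    interface RecordCallback {
        void onRecord(Element record) throws Exception;
    }

    private final String recordName;
    private final RecordCallback callback;
    private final DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    private Document doc;   // DOM of the record being built, or null outside a record
    private Node current;   // insertion point inside that DOM

    RecordSplitter(String recordName, RecordCallback callback) {
        this.recordName = recordName;
        this.callback = callback;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (doc == null) {
            if (!qName.equals(recordName)) {
                return;                       // outside a record: ignore
            }
            try {
                doc = factory.newDocumentBuilder().newDocument();
            } catch (ParserConfigurationException e) {
                throw new SAXException(e);
            }
            current = doc;
        }
        Element element = doc.createElement(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            element.setAttribute(atts.getQName(i), atts.getValue(i));
        }
        current.appendChild(element);
        current = element;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (doc != null) {
            current.appendChild(doc.createTextNode(new String(ch, start, length)));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (doc == null) {
            return;
        }
        current = current.getParentNode();
        if (current == doc) {                 // the record element just closed
            Element record = doc.getDocumentElement();
            doc = null;                       // drop our reference so the DOM can be collected
            current = null;
            try {
                callback.onRecord(record);
            } catch (Exception e) {
                throw new SAXException(e);
            }
        }
    }

    /** Evaluates xpathExpr (relative to each record element) and collects the string results. */
    static List<String> extract(String xml, String recordName, String xpathExpr) {
        final List<String> out = new ArrayList<String>();
        final XPath xpath = XPathFactory.newInstance().newXPath();
        RecordSplitter handler = new RecordSplitter(
                recordName, record -> out.add(xpath.evaluate(xpathExpr, record)));
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
        return out;
    }
}
```

With a BLAST-like input, RecordSplitter.extract(xml, "Hit", "Hit_id") would yield one string per hit while only ever holding a single hit's DOM in memory. For a real file the InputSource would wrap a stream rather than a String, and a document with a DOCTYPE may cause the parser to try to fetch the DTD.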
>> >
>> > I think the BLAST XML has improved substantially, at least in terms of
>> > validating against its own DTD.  The DTD itself may not be the best
>> > design but that is always a matter of taste and if you are using XPath
>> > to get the relevant bits you don't need to make a SAX parser jump
>> > through hoops to get them.
>> >
>> > I agree we will have to keep flat file parsers but we should strongly
>> > encourage the use of XML where possible. It is simply easier to deal
>> > with. Most biological flat-files were designed for Fortran and mainly
>> > for human consumption. There is no obvious validation mechanism.
>> > Notably everything in NCBI is derived from ASN.1; what you see in the
>> > flatfile is produced from there. I tend to think this means that the
>> > ASN.1 is the holy gospel and what you get in the flat file is some
>> > translation.  Ideally NCBI files should be parsed from the ASN.1 where
>> > you can guarantee validation; the more practical alternative is to use
>> > the XML which you can at least validate against a DTD.
>> >
>> > With XML we (Biojava) can say if it validates we will parse it and if
>> > it doesn't we may not.  With flat files there are so many dodgy
>> > variants we cannot say anything.  Because XML DTDs (or XSDs) have
>> > versions it also makes it much easier to have parsers for different
>> > versions and the parsing machinery can figure out which is needed.
>> > With flat files it is anyone's guess what version you are dealing with.
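
The "if it validates we will parse it" rule can be sketched with the standard JAXP validating parser. This assumes the document declares its DTD in a DOCTYPE; DtdCheck and isValid() are hypothetical names, and nothing here is BioJava code.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical gatekeeper: accept a document only if it validates against its own DTD. */
final class DtdCheck {

    /** Returns true if xml is well formed and valid according to the DTD in its DOCTYPE. */
    static boolean isValid(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);      // turn on DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Validity problems are reported through error(); rethrow so parse() aborts.
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) throws SAXParseException {
                    throw e;
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;                     // not well formed, or not valid
        } catch (Exception e) {
            // I/O or parser configuration trouble, not a verdict on the document.
            throw new IllegalStateException(e);
        }
    }
}
```

A caller could check DtdCheck.isValid(xml) first and decline to parse anything that fails. Choosing a parser per format version could then key off the DOCTYPE's public or system identifier, though that part is left out of this sketch.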
>> >
>> > Finally parsers can be auto-generated for XML if you have the DTD or
>> > XSD. This often doesn't give you an ideal parser but it can be a
>> > useful starting point for rapid development.
>> >
>> > For Biojava v 3 I think we should concentrate on XML parsers first and
>> > flat files second. <sigh>if only Fasta had an XML format</sigh>
>> >
>> > - Mark
>> >
>> > On Nov 27, 2007 11:16 PM, Andy Yates <ayates at ebi.ac.uk> wrote:
>> >> I was always under the impression that blast's XML output was nearly
>> >> as hard to parse as the flat file format but I do agree that if we can
>> >> use XML whenever we can it would make writing parsers a lot easier
>> >> (especially if there are SAX based XPath libraries available).
>> >> Actually this brings up a good question about development of this type
>> >> of parser. The majority of XPath supporting libraries are DOM based
>> >> which will mean large memory usage in some situations but overall
>> >> providing an easier coding experience (and hopefully reduce our
>> >> chances of creating bugs). Or should we code to the edge cases of
>> >> someone trying to parse a 1GB XML? Personally I'd favour the former.
>> >>
>> >> Going back to the original topic there are going to be situations
>> >> where people want the flat file parsers/writers & I think it's a valid
>> >> point to say this is where BioJava is meant to come in & help a
>> >> developer. After all XML is a computer science problem whereas parsing
>> >> an EMBL flat file or blast output is a bioinformatics problem.
>> >>
>> >> Andy
>> >>
>> >>
>> >> Mark Schreiber wrote:
>> >>> For a long time now my feeling has been that we should *only* support
>> >>> the XML version of blast output.  The other formats are too brittle
>> >>> to be easy to parse.  I also feel similarly about Genbank, EMBL, etc.
>> >>> That may be an extreme view but the power of generic XML parsers and
>> >>> things like XPath etc really make these formats look very attractive.
>> >>>
>> >>> - Mark
>> >>>
>> >>>
>> >>> On Nov 27, 2007 7:47 PM, Andy Yates <ayates at ebi.ac.uk> wrote:
>> >>>> I think Groovy have adopted a similar system recently & have
>> >>>> guidelines for how each module should behave (dependencies, build
>> >>>> system etc). This enforces the idea that a module, whilst not part of
>> >>>> the core project, must behave in the same manner the core does. I do
>> >>>> like the idea that we can have a core biojava & things get added
>> >>>> around it & it might encourage other users to start developing their
>> >>>> own modules for any formats/purpose they want.
>> >>>>
>> >>>> Richard Holland wrote:
>> >>>>> -----BEGIN PGP SIGNED MESSAGE-----
>> >>>>> Hash: SHA1
>> >>>>>
>> >>>>>> What format options are there from blast? Just thinking if it
>> >>>>>> supports CIGAR or something like that are we better providing a
>> >>>>>> parser for that format & saying that we do not support the
>> >>>>>> traditional blast output? That said it doesn't help us when that
>> >>>>>> format changes so maybe what is needed is a way to push out parser
>> >>>>>> changes without requiring a full biojava release (v3 discussion) ...
>> >>>>> Exactly! So the modular idea would work nicely here - we could have
>> >>>>> a blast module and only update that single module (which would be
>> >>>>> its own JAR) whenever the format changes. In a way, BioJava releases
>> >>>>> as such would no longer happen, except maybe for some kind of core
>> >>>>> BioJava module. Everything would be done in terms of individual
>> >>>>> module+JAR releases instead - one for Genbank, one for BioSQL, one
>> >>>>> for NEXUS, one for Phylogenetic tools, one for
>> >>>>> translation/transcription, etc. etc.
>> >>>>>
>> >>>>> cheers,
>> >>>>> Richard
>> >>>> _______________________________________________
>> >>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>> >>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>> >>>>
>>