[Biojava-l] SAX, DOM, XPath and Flat files

Andy Yates ayates at ebi.ac.uk
Tue Dec 4 09:12:51 UTC 2007


I think avoiding the jar explosion is a very good idea. If every JAR 
choice has to go through an issue/vote process, it becomes a bit harder 
to introduce a new JAR without others knowing what it is, why the 
submitter has chosen it & why it is better than the alternatives. The 
justification really could be as simple as "I've used this one & its 
API is easier to understand".

Same thing is seen in all libraries. Just looking at the Spring 
synchronized collection factories you can see it testing for Java 
versions & class existence to know what type of synchronized collection 
it can create.
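
A minimal sketch of that kind of check, for illustration only. The names 
below (MapFactorySketch, isPresent, createConcurrentMap) are made up & 
this is not Spring's actual code, just the general "does this class 
exist?" trick with a fallback that every JDK has.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical factory: picks a thread-safe Map implementation at runtime.
final class MapFactorySketch {

    /** Returns true if the named class can be loaded in this JVM. */
    static boolean isPresent(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        } catch (LinkageError e) {
            // Class was found but could not be linked; treat it as unusable.
            return false;
        }
    }

    /** Prefers ConcurrentHashMap if it exists, else a synchronized HashMap. */
    @SuppressWarnings("unchecked")
    static <K, V> Map<K, V> createConcurrentMap() {
        String preferred = "java.util.concurrent.ConcurrentHashMap";
        if (isPresent(preferred)) {
            try {
                // Created reflectively so this class has no hard link to it.
                return (Map<K, V>) Class.forName(preferred)
                        .getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                // Fall through to the fallback below.
            }
        }
        return Collections.synchronizedMap(new HashMap<K, V>());
    }

    public static void main(String[] args) {
        Map<String, String> formats = createConcurrentMap();
        formats.put("format", "EMBL");
        System.out.println(formats.getClass().getName() + " " + formats);
    }
}
```

The same pattern would let a parser module look for an optional XML 
library & quietly fall back to the JDK one if it is not on the classpath.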

Also XML APIs are one of the worst for jar dependency hell since 
everyone has their favourite parser (just try running a program in ant 
without forking & using two XML APIs ... it's fun). Using XPath & a 
generic retrieval system could give us the flexibility we all seem to 
be wanting. It depends more on whether there is a good enough XPath 
implementation to handle the XML files we'll be pushing through it 
(why is it I think the answer is no?).
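
To make the idea a bit more concrete, here is a rough sketch using the 
XPath API that ships with the JDK (javax.xml.xpath). The FieldReader 
interface, the forXml method & the tiny <entry> document are all 
invented for illustration; they are not an existing BioJava API or a 
real file format. The point is only that calling code sees the interface 
& never touches a parser class, so the implementation could be swapped. 
Note it builds a whole DOM in memory, so it says nothing about how this 
would cope with very large files.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class XPathRetrievalSketch {

    /** Hypothetical parser-neutral interface: a path in, a string out. */
    interface FieldReader {
        String read(String path);
    }

    /** Builds a FieldReader backed by the JDK's DOM parser and XPath engine. */
    static FieldReader forXml(String xml) {
        final Document doc;
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            doc = builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        final XPath xpath = XPathFactory.newInstance().newXPath();
        return new FieldReader() {
            public String read(String path) {
                try {
                    // Returns the string value of the first matching node,
                    // or an empty string if nothing matches.
                    return xpath.evaluate(path, doc);
                } catch (XPathExpressionException e) {
                    throw new IllegalArgumentException("Bad XPath: " + path, e);
                }
            }
        };
    }

    public static void main(String[] args) {
        // Made-up record, not a real database entry or schema.
        String xml = "<entry><accession>X00001</accession>"
                + "<organism>Homo sapiens</organism></entry>";
        FieldReader reader = forXml(xml);
        System.out.println(reader.read("/entry/accession"));
        System.out.println(reader.read("/entry/organism"));
    }
}
```

Only forXml knows which parser is in use, so trying a different XPath 
implementation would mean changing that one method.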

Hmmm it does but how many bioinformaticians use the ASN.1 syntax though 
compared to flat file & XML? I'm guessing that flat file is the winner 
here with XML & ASN.1 coming in reasonably equal*. If this is true then 
yes I'd be more tempted to write an ASN.1 parser & then support XML.

Andy (not a Mark in the slightest)


* Please note that this is a finger in the air guess with no actual 
statistical backing one way or another :).

Mark Schreiber wrote:
> The only major advantage to using the JDK DOM/SAX is that everyone has
> them (no new JARs required) and they will never go away.  However I
> can see there is a strong case for something else like XOM or Apache
> alternatives, Saxon etc.  In fact these projects often feature bleeding
> edge technologies before they appear in the JDK.
> 
> To prevent an explosion of JARs I think we should agree on a small few
> XML options.  As Mark mentions a good interface design makes the user
> code completely independent of the XML parser that is used. This makes
> it much easier to change what is used under the hood if something
> better comes along or if one of our project dependencies stops being
> developed.
> 
> This has actually happened before in biojava. We used to rely on
> Xerces or something similar but once SAX and DOM appeared in the JDK
> we swapped out Xerces without too much impact.  Good unit tests help
> to make sure everything still works.
> 
> The occasional problem with NCBI XML is probably the best argument to
> delve into the dark world of ASN.1
> 
> - Mark (Classic Mark, not New Mark)
> 
> On Nov 30, 2007 1:30 PM, Mark Fortner <phidias51 at gmail.com> wrote:
>> There's a potential gotcha involved with XPath parsing.  If you use the
>> current implementation that ships with the Java 5 & 6 JDKs, it performs a
>> DOM parse on the whole document, even if you pass it a specific starting
>> node in the document.  I stumbled across this one the hard way when using
>> the hybrid approach that you mention.  This may be solved with another XPath
>> implementation such as Saxon.
>>
>> One other problem I've noticed is that the NCBI XML doesn't always parse.
>> I've reported this to them, and they've promised to address this. It usually
>> occurs when submitters put non-escaped characters into text fields such as
>> author lists in PubMed. NCBI doesn't always use CDATA blocks around text and
>> as soon as the parser hits one of these characters it throws an exception.
>>
>> I've also noticed a tendency (in other code bases) for developers to use
>> several different parsers; usually, whatever parser they're most familiar
>> with.  The problem with this is that they often introduce parser-specific
>> code into the code base, so you end up with numerous dependencies for
>> different parsers, and a potential configuration problem if you're passing
>> the XML parser as a run-time configuration parameter.  The most frequent
>> external parsers I've seen used are JDOM and DOM4J.  The usual way to get
>> around this is to write to an interface, but that will require some
>> additional vigilance.
>>
>> Just a few things to watch out for as we move forward.
>>
>> Mark (the other one) :-)
>>
>>
>> On Nov 30, 2007 1:26 AM, Andy Yates <ayates at ebi.ac.uk> wrote:
>>
>>> I think I've seen XPath hanging around in other people's code in a 1.5
>>> code-base (in fact one of the guys I work with). I've used Java's DOM
>>> before & it really isn't very nice & quite verbose. I'd prefer if there
>>> was a better alternative/wrapper around the XML parsers just to cut down
>>> on code chatter.
>>>
>>> Wow I've just visited http://asn1.elibel.tm.fr/links/ looking for these
>>> Java tools & I think I've gone cross-eyed with the sheer number of
>>> acronyms! You've gotta love something which seems to add a letter to ER
>>> & that's a new acronym (e.g. BER, DER, PER and XER). Does anyone on the
>>> list know of an ASN.1 parser for Java that's good and should we support
>>> it (considering NCBI generate their DTD & XML from the ASN.1
>>> representation).
>>>
>>> Andy
>>>
>>> Mark Schreiber wrote:
>>>> Java 5 SDK has both SAX and DOM as standard. I think it has XPath but
>>>> not XQuery although XPath is probably more important for this use.
>>>>
>>>> The DOM model is a direct implementation of the W3C standard which
>>>> makes it a little awkward from a java point of view but it is usable.
>>>>
>>>> Java 6 has StAX (the other one).
>>>>
>>>> There are a few Java APIs for parsing ASN.1 mostly developed for the
>>>> telco industry, I've never really looked into which is best (anyone
>>>> experienced with this?) but we could probably use one to work directly
>>>> off NCBI ASN.1
>>>>
>>>> - Mark
>>>>
>>>> On Nov 28, 2007 10:29 PM, Andy Yates <ayates at ebi.ac.uk> wrote:
>>>>> Hi Mark,
>>>>>
>>>>> Okay that sounds like a perfectly sensible way to deal with this. Is
>>>>> this kind of parsing model supported in Java5? I only ask as I've not
>>>>> done a lot of XML parsing with Java5; more with things like XOM (which I
>>>>> think offers a DOM only representation but I'm probably wrong).
>>>>>
>>>>> That's good. There's not a huge point to have a format & a DTD/XSD and
>>>>> then have your files not conform to it.
>>>>>
>>>>> I was thinking the exact same thing about ASN.1 (well that & it looks
>>>>> bleeding horrible to parse but that is an un-educated look at the format
>>>>> which I'm sure is as parsable as JSON & the like).
>>>>>
>>>>> When it comes to flat file parsers I would be happier to provide
>>>>> implementations of the more common formats where a viable alternative is
>>>>> not available e.g. UniProt, EMBL, Genbank etc. Then groups which provide
>>>>> similar output to the above have a chance to write their own
>>>>> parsers/formatters. This is very similar to the current situation but we
>>>>> just need to remove dependencies on statically located data structures
>>>>> (don't get rid of them completely just give users an option to not use
>>>>> them).
>>>>>
>>>>> I'm not sure how much automatically generated parsers would help us. I
>>>>> guess it depends on the data model(s) we use if they are auto-parser
>>>>> friendly (which normally means POJO/JavaBean conventions including the
>>>>> no-args constructor).
>>>>>
>>>>> Cool I don't want to exclude flat file parsers completely (if only
>>>>> because my group has an interest in BioJava being able to read & write
>>>>> flat files) :)
>>>>>
>>>>> They decided to have HUPO-PSI Format instead :)
>>>>>
>>>>> Andy
>>>>>
>>>>>
>>>>> Mark Schreiber wrote:
>>>>>> Hi -
>>>>>>
>>>>>> I think in most cases huge XML files in bioinformatics result from a
>>>>>> single XML containing multiple repetitive elements. Eg a BLAST XML
>>>>>> output with several hits or a GenBankXML with many Sequences.  A nice
>>>>>> approach I have seen for dealing with these is to use SAX to read over
>>>>>> the file and every time it comes to an element it delegates to a DOM
>>>>>> object.  You then parse the bits of the DOM you want with XPath or
>>>>>> convert to objects or something and then when you are finished with
>>>>>> that entry everything gets garbage collected and the SAX parser moves
>>>>>> to the next element and repeats the whole process.  This is a hybrid
>>>>>> of event based parsing and object-model based parsing which could let
>>>>>> you efficiently deal with huge files.
>>>>>>
>>>>>> I think the BLAST XML has improved substantially, at least in terms of
>>>>>> validating against its own DTD.  The DTD itself may not be the best
>>>>>> design but that is always a matter of taste and if you are using XPath
>>>>>> to get the relevant bits you don't need to make a SAX parser jump
>>>>>> through hoops to get them.
>>>>>>
>>>>>> I agree we will have to keep flat file parsers but we should strongly
>>>>>> encourage the use of XML where possible. It is simply easier to deal
>>>>>> with. Most biological flat-files were designed for Fortran and mainly
>>>>>> for human consumption. There is no obvious validation mechanism.
>>>>>> Notably everything in NCBI is derived from ASN.1, what you see in the
>>>>>> flatfile is produced from there. I tend to think this means that the
>>>>>> ASN.1 is the holy gospel and what you get in the flat file is some
>>>>>> translation.  Ideally NCBI files should be parsed from the ASN.1 where
>>>>>> you can guarantee validation, the more practical alternative is to use
>>>>>> the XML which you can at least validate against a DTD.
>>>>>>
>>>>>> With XML we (Biojava) can say if it validates we will parse it and if
>>>>>> it doesn't we may not.  With flat files there are so many dodgy
>>>>>> variants we cannot say anything.  Because XML DTDs (or XSDs) have
>>>>>> versions it also makes it much easier to have parsers for different
>>>>>> versions and the parsing machinery can figure out which is needed.
>>>>>> With flat files it is anyone's guess what version you are dealing with.
>>>>>>
>>>>>> Finally parsers can be auto-generated for XML if you have the DTD or
>>>>>> XSD. This often doesn't give you an ideal parser but it can be a
>>>>>> useful starting point for rapid development.
>>>>>>
>>>>>> For Biojava v 3 I think we should concentrate on XML parsers first and
>>>>>> flat files second. <sigh>if only Fasta had an XML format</sigh>
>>>>>>
>>>>>> - Mark
>>>>>>
>>>>>> On Nov 27, 2007 11:16 PM, Andy Yates <ayates at ebi.ac.uk> wrote:
>>>>>>> I was always under the impression that blast's XML output was nearly as
>>>>>>> hard to parse as the flat file format but I do agree that if we can use
>>>>>>> XML whenever we can it would make writing parsers a lot easier
>>>>>>> (especially if there are SAX based XPath libraries available). Actually
>>>>>>> this brings up a good question about development of this type of parser.
>>>>>>> The majority of XPath supporting libraries are DOM based which will mean
>>>>>>> large memory usage in some situations but overall providing an easier
>>>>>>> coding experience (and hopefully reduce our chances of creating bugs).
>>>>>>> Or should we code to the edge cases of someone trying to parse a 1GB
>>>>>>> XML? Personally I'd favour the former.
>>>>>>>
>>>>>>> Going back to the original topic there are going to be situations where
>>>>>>> people want the flat file parsers/writers & I think it's a valid point
>>>>>>> to say this is where BioJava is meant to come in & help a developer.
>>>>>>> After all XML is a computer science problem whereas parsing an EMBL flat
>>>>>>> file or blast output is a bioinformatics problem.
>>>>>>>
>>>>>>> Andy
>>>>>>>
>>>>>>>
>>>>>>> Mark Schreiber wrote:
>>>>>>>> For a long time now my feeling has been that we should *only* support
>>>>>>>> the XML version of blast output.  The other formats are too brittle to
>>>>>>>> be easy to parse.  I also feel similarly about Genbank, EMBL, etc. That
>>>>>>>> may be an extreme view but the power of generic XML parsers and things
>>>>>>>> like XPath etc really make these formats look very attractive.
>>>>>>>>
>>>>>>>> - Mark
>>>>>>>>
>>>>>>>>
>>>>>>>> On Nov 27, 2007 7:47 PM, Andy Yates <ayates at ebi.ac.uk> wrote:
>>>>>>>>> I think Groovy have adopted a similar system recently & have guidelines
>>>>>>>>> for how each module should behave (dependencies, build system etc). This
>>>>>>>>> enforces the idea that a module whilst not part of the core project must
>>>>>>>>> behave in the same manner the core does. I do like the idea that we can
>>>>>>>>> have a core biojava & things get added around it & it might encourage
>>>>>>>>> other users to start developing their own modules for any
>>>>>>>>> formats/purpose they want.
>>>>>>>>>
>>>>>>>>> Richard Holland wrote:
>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>> Hash: SHA1
>>>>>>>>>>
>>>>>>>>>>> What format options are there from blast? Just thinking if it supports
>>>>>>>>>>> CIGAR or something like that are we better providing a parser for that
>>>>>>>>>>> format & saying that we do not support the traditional blast output?
>>>>>>>>>>> That said it doesn't help us when that format changes so maybe what is
>>>>>>>>>>> needed is a way to push out parser changes without requiring a full
>>>>>>>>>>> biojava release (v3 discussion) ...
>>>>>>>>>> Exactly! So the modular idea would work nicely here - we could have a
>>>>>>>>>> blast module and only update that single module (which would be its own
>>>>>>>>>> JAR) whenever the format changes. In a way, BioJava releases as such
>>>>>>>>>> would no longer happen, except maybe for some kind of core BioJava
>>>>>>>>>> module. Everything would be done in terms of individual module+JAR
>>>>>>>>>> releases instead - one for Genbank, one for BioSQL, one for NEXUS, one
>>>>>>>>>> for Phylogenetic tools, one for translation/transcription, etc. etc.
>>>>>>>>>> cheers,
>>>>>>>>>> Richard
>>>>>>>>> _______________________________________________
>>>>>>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>>>>


