Bioperl: XML

Matthew Pocock mrp@sanger.ac.uk
Thu, 06 May 1999 19:34:17 +0100


XML would be a good thing. I have been using XML a lot from Java, and it is a
very powerful way to transfer and store information. We could easily provide
DTDs for each of the core interfaces pretty much automatically, along with the
to-from XML code. I don't see how providing XML versions of analysis objects
(like alignment factories) would have much value. It is possible to write
extensible DTDs, so that some of the interface implementations can store extra
information while still being validated by a validating parser and having the
core values understood.

I guess wherever we have a field that is a primitive type, it should map to
an attribute. Whenever a field is a reference to a built-in collection type,
it should map to a nested tag. Whenever a field is an object reference, it
should map to that object's unique id, and cause that object to be dumped as
well. If we create a separate name-space for each package of the form
"-//packagename" then we can avoid any name-space clashes elegantly and still
have validation. I would shy away from implementing a to-from XML method for
each class, and rely instead on an XML document builder module to do the
translations. This object can keep track of things like building the DTD,
following references, breaking circularities, etc. when converting to XML,
and can remake objects and the references between them when processing XML.
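
The mapping rules above can be sketched roughly as follows. This is only an
illustration in Python using the standard xml.etree.ElementTree module, not
existing BioPerl code: the XMLDocumentBuilder, Seq and Feature classes, the
"objN" id scheme and the "item"/"ref"/"idref" names are all made up for the
example, and it assumes no field is itself called "id". DTD generation and
name-spaces are left out.

```python
import xml.etree.ElementTree as ET

PRIMITIVES = (str, int, float, bool)


class XMLDocumentBuilder:
    """Hypothetical builder: walks an object graph, one element per object."""

    def __init__(self, root_tag="objects"):
        self.root = ET.Element(root_tag)
        self._ids = {}  # id(obj) -> unique id string already assigned

    def add(self, obj):
        """Dump obj (only once) and return its unique id."""
        key = id(obj)
        if key in self._ids:
            # Already dumped: returning the id here breaks circularities.
            return self._ids[key]
        uid = "obj%d" % (len(self._ids) + 1)
        self._ids[key] = uid
        elem = ET.SubElement(self.root, type(obj).__name__, id=uid)
        for name, value in vars(obj).items():
            if isinstance(value, PRIMITIVES):
                elem.set(name, str(value))         # primitive -> attribute
            elif isinstance(value, (list, tuple)):
                child = ET.SubElement(elem, name)  # collection -> nested tag
                for item in value:
                    self._add_item(child, item)
            elif value is not None:
                # object reference -> unique id; the target is dumped too
                elem.set(name, self.add(value))
        return uid

    def _add_item(self, parent, item):
        if isinstance(item, PRIMITIVES):
            ET.SubElement(parent, "item").text = str(item)
        else:
            ET.SubElement(parent, "ref", idref=self.add(item))


# Made-up classes with a circular reference (feature -> parent sequence).
class Feature:
    def __init__(self, name, parent):
        self.name = name
        self.parent = parent


class Seq:
    def __init__(self, name, residues):
        self.name = name
        self.residues = residues
        self.features = []


seq = Seq("pep1", "GAVLIFYWSTQ")
seq.features.append(Feature("site", seq))

builder = XMLDocumentBuilder()
builder.add(seq)
xml_text = ET.tostring(builder.root, encoding="unicode")
```

Keeping the table of ids in the builder rather than in each class is what
lets it follow references and stop at objects it has already written; the
reverse direction (remaking objects from XML) would need the same table,
keyed by id string instead.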

It would be fun to do.

Vicki Brown wrote:

> The BioPerl list hasn't mentioned XML since January... The message below
> was forwarded to me.  What is the current view/status in the BioPerl
> community as regards XML?  There was talk of a BoulderIO <-> XML converter
> as well as a CGI <-> XML converter.
>
> I can't agree with the assertion that XML will result in
> "(No more perl-parsers for BLAST-output!!)" But I thought this was worthy
> of bringing up on the BioPerl list.
>
> With the permission of Mr. Loeffler:
>
> -----Original Message-----
> >From: Gerald Loeffler <Gerald.Loeffler@vienna.at>
> >To: Computational Chemistry Mailing List <chemistry@infomeister.osc.edu>
> >Date: Friday, April 30, 1999 4:00 AM
> >Subject: CCL:XML for Bioinformtics Data
>
> >Hi!
> >
> >Recently, I've been working a lot with XML (see http://www.w3c.org/xml/
> >and e.g. http://www.ibm.com/xml/), which is a standard, human-readable,
> >extensible markup-language that is rapidly becoming _the_ method of
> >choice for exchange and storage of any kind of data and documents. It
> >seems to me that XML would simply be _perfect_ for data exchange and
> >maybe even data storage in bioinformatics (see end of message for a note
> >on chemistry and CML).
> >
> >E.g. (off the top of my head), a DNA/protein sequence similarity search
> >engine (e.g. NCBI's BLAST server) might return its search results in the
> >form of an XML document that
> >could look like this:
> >
> ><seq-sim-search-results>
> >  <query>
> >    <type>                         protein     </type>
> >    <seq name="My stupid peptide"> GAVLIFYWSTQ </seq>
> >    <algorithm>                    FASTA3      </algorithm>
> >    <db>                           SwissProt   </db>
> >    <gap-open>                    -12          </gap-open>
> >    <gap-extension>               -2           </gap-extension>
> >  </query>
> >  <hits>
> >    <hit>
> >      <accession>      HPS_HUMAN    </accession>
> >      <organism>       homo sapiens </organism>
> >      <overlap>        11           </overlap>
> >      <overlaping-seq> GAEVLFYWTDQ  </overlaping-seq>
> >      <z-score>        129.3        </z-score>
> >    </hit>
> >    <hit>
> >      <accession>      PA24_MOUSE   </accession>
> >      <organism>       mus musculus </organism>
> >      <overlap>        8            </overlap>
> >      <overlaping-seq> VFIFYWTT     </overlaping-seq>
> >      <z-score>        133.3        </z-score>
> >    </hit>
> >  </hits>
> ></seq-sim-search-results>
> >
> >There are several important points here:
> >
> >1) Without knowing what this XML document is about, a program can assert
> >that it is well-formed! These programs exist, are free and are
> >applicable to all XML documents!
> >
> >2) The rules for the nesting and naming of the tags in XML documents of
> >this type can be formally defined in XML. The above document would be of
> >type "seq-sim-search-results" and you could easily write a formal
> >definition (in a DTD file) that says that such a document must contain a
> >"query" and a "hits" tag; the "query" tag in turn must contain exactly
> >one of each "type", "seq", ... The "hits" tag in turn may contain 0 or
> >more "hit" tags which in turn ...
> >
> >3) Having a formal definition of documents of this type, a program can
> >verify that our above XML document complies with the formal definition
> >(is valid). These programs exist, are free and are applicable to all XML
> >documents!
> >
> >4) Free utilities exist (e.g. IBM's xml4j) that can programmatically
> >write and read (parse) any XML document and thus give a program access
> >to the structure and content of the document!! (No more perl-parsers for
> >BLAST-output!!)
> >
> >5) This file is human-readable! (in contrast to a Corba struct or a
> >serialized Java object!)
> >
> >6) Modern WWW-browsers can (if a style-sheet is supplied) directly
> >display this XML document. For old browsers, the XML document can easily
> >be converted to HTML for display.
> >
> >I think you get the idea.
> >
> >Does such an XML-based approach sound reasonable?
> >What does this approach leave to be desired?
> >Are efforts underway in this direction?
> >Wouldn't it be a better world if we all used XML (-:
> >
> >I know that XML is currently being used for chemistry-related data (CML,
> >see http://www.xml-cml.org/), but I haven't heard of any efforts in the
> >area of Bioinformatics. So please view this message as targeted towards
> >the Bioinformatics community that is not served by CML. (CML has a
> >DNA/protein sequence tag.)
> >
> >        cheers,
> >        gerald
> >--
> > Gerald Loeffler
> > Email: Gerald.Loeffler@vienna.at
> > Smail: Apollo Imaging, Marchettigasse 7, A-1060 Vienna, Austria
> > Phone: +43 676 3289588 (+43 1 5952333 27)
> > Fax:   +43 1 5952333 20
> > Keywords: Java, CORBA, OOA&D, Databases, Bioinformatics,
> >           Computational Biology, Computational Biophysics
> -----
>  //=\   Vicki Brown <vlb@deltagen.com>
>  \=//    Journeyman Sourcerer: Scripts & Philtres
>   //=\    (Mac)Perl, awk, sed, *sh..., occasional C
>   \=//     A little web-gardening on the weekends
>    //=\
>    \=//      Deltagen, Inc.
>     //=\     1031 Bing St, San Carlos, CA 94070
> =========== Bioperl Project Mailing List Message Footer =======
> Project URL: http://bio.perl.org/
> For info about how to (un)subscribe, where messages are archived, etc:
> http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/vsns-bcd-perl.html
> ====================================================================
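
Points 1 and 4 of the quoted message (checking well-formedness and reading
the content with a generic parser) might look something like the sketch
below. It uses Python's standard xml.etree.ElementTree module rather than
xml4j, on a shortened copy of the quoted example document; is_well_formed
and read_hits are names invented here. Note that this parser only checks
well-formedness, it does not validate against a DTD.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example document from the quoted message.
RESULTS = """<seq-sim-search-results>
  <query>
    <type> protein </type>
    <seq name="My stupid peptide"> GAVLIFYWSTQ </seq>
    <algorithm> FASTA3 </algorithm>
    <db> SwissProt </db>
  </query>
  <hits>
    <hit>
      <accession> HPS_HUMAN </accession>
      <organism> homo sapiens </organism>
      <overlap> 11 </overlap>
      <z-score> 129.3 </z-score>
    </hit>
    <hit>
      <accession> PA24_MOUSE </accession>
      <organism> mus musculus </organism>
      <overlap> 8 </overlap>
      <z-score> 133.3 </z-score>
    </hit>
  </hits>
</seq-sim-search-results>"""


def is_well_formed(text):
    """Return True if text parses as XML, whatever its tags mean."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def read_hits(text):
    """Return one dict per <hit>, with surrounding whitespace stripped."""
    root = ET.fromstring(text)
    hits = []
    for hit in root.findall("hits/hit"):
        hits.append({
            "accession": hit.findtext("accession").strip(),
            "organism": hit.findtext("organism").strip(),
            "overlap": int(hit.findtext("overlap")),
            "z_score": float(hit.findtext("z-score")),
        })
    return hits


hits = read_hits(RESULTS)
```

The same two functions would work unchanged on any document with this
layout, which is the point being made: the parsing code knows about tag
names, not about column positions in a text report.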
