Bioperl: XML/BioPerl

Lincoln Stein lstein@cshl.org
Thu, 31 Dec 1998 10:31:28 -0500


For what it's worth, I will be adding a BoulderIO <-> XML converter in
the next week or so (thanks to Jaime Prilusky for the code).  Also a
CGI <-> XML converter.  For those of us using Boulder as a biological
data conversion format, the future is already here!
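
To give the flavor of it, here is a sketch of the kind of mapping
involved (just an illustration, not the converter itself): a flat
Boulder stream of TAG=VALUE stanzas becomes one element per tag.

    #!/usr/bin/perl -w
    # sketch only: flat TAG=VALUE Boulder stanzas -> one <Stone> element apiece
    # (nested Boulder subrecords are ignored here)
    use strict;

    print "<Boulder>\n";
    my %rec;
    while (<>) {
        chomp;
        if (/^(\w+)=(.+)$/) {                 # TAG=VALUE line
            $rec{$1} = $2;
        } elsif (/^=$/ && %rec) {             # a bare "=" closes the record
            print "  <Stone>\n";
            print "    <$_>$rec{$_}</$_>\n" for sort keys %rec;
            print "  </Stone>\n";
            %rec = ();
        }
    }
    print "</Boulder>\n";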

Lincoln

Gunther Birznieks writes:
 > On Wed, 30 Dec 1998, David J. States wrote:
 > 
 > [...]
 > > 
 > > Is anyone aware of plans on the part of the database organizations to serve 
 > > XML?
 > > 
 > If you mean general databases, Matthew Seargent has a DBI->XML and
 > XML->DBI converter. If you mean specific biological databases, then I am
 > naive on this aspect but I do have some thoughts on the subject you bring
 > up below.
 > 
 > > An alternative to agreement on a bio DTD is to push the burden of data 
 > > resolution issues onto the client.  In writing an applet for a specific 
 > > display function, you would need to know the relationship of the various 
 > > fields in the data sources that you were referencing.  This seems less 
 > > desirable, but at least it is a way forward.
 > > 
 > > Thoughts or suggestions?
 > > 
 > Your questions are thought-provoking. To me, they reveal much about the
 > subtle tension between XML as a data parsing standard and how much
 > attention should actually be paid to the new markup languages that are
 > vying to become new "standards" under the XML umbrella.
 > 
 > [1] XML Markup "standards"
 > 
 > My feeling is that too many people are focusing too hard on defining
 > standard "all-encompassing" DTDs for their problem domains. My belief
 > is that in your initial prototype stage, you should definitely
 > consider placing the burden on the client to understand your XML
 > structure rather than waiting to conform to something that doesn't
 > exist or isn't well defined yet.
 > 
 > Even if a standard is defined but is too complex for your data needs,
 > you should probably define your own XML markup anyway.  Much of the
 > value of XML lies in [1] being able to easily build structures that
 > map to your specific problem space and [2] making that structure
 > efficient and trivial to parse.
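 > 
 > For instance (the tags here are just made up to illustrate the point,
 > not taken from any real DTD), a structure built around your own data
 > can be as simple as:
 > 
 >   <sequence id="CEESH58" organism="C. elegans">
 >     <description>putative kinase</description>
 >     <dna>GATTTCAGGA</dna>
 >   </sequence>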
 > 
 > If you are exposing a relatively simple interface to your data on the
 > Internet, your users are probably going to be happy just knowing that they
 > can do away with HTML::Parser and instead download the data in a simple
 > format.  In addition, they won't have to wade through a bloated
 > document-standard API.
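 > 
 > Just to be concrete, here is roughly what the client side of such a
 > simple format looks like with XML::Parser (the element names continue
 > the made-up example above, and the sample document is inlined so the
 > snippet stands alone):
 > 
 >   use XML::Parser;
 > 
 >   # a tiny sample document with made-up tags
 >   my $xml = join "\n",
 >       '<sequence id="CEESH58" organism="C. elegans">',
 >       '  <description>putative kinase</description>',
 >       '  <dna>GATTTCAGGA</dna>',
 >       '</sequence>';
 > 
 >   my (%field, $current);
 >   my $parser = XML::Parser->new(
 >       Handlers => {
 >           Start => sub { $current = $_[1] },                 # element name
 >           Char  => sub { $field{$current} .= $_[1] if $current },
 >           End   => sub { undef $current },
 >       }
 >   );
 >   $parser->parse($xml);
 >   print "$field{description} ($field{dna})\n";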
 > 
 > If the data you are presenting is complicated, I would still say the
 > same thing: even then you may want to consider not using a "standard
 > biological markup", for efficiency's sake.  Your XML may actually be
 > more readable if you create it in a way that is centered around your
 > data rather than someone else's idea of how the data should look.
 > 
 > I am very wary of database "standards" since they tend to lock people into
 > inefficiencies and kludges to get around those inefficiencies. In many
 > cases it would have been just as easy to provide a simpler, more readable
 > format without the 20 million "exceptions to the rule" that accumulate as
 > people start using these things in the real world.
 > 
 > My rule of thumb: if you are creating a general tool to do lots of general
 > things, then stick to a standard. If you are creating a specific tool to
 > generate specific data, conforming to the "standard" sometimes won't
 > produce code that is as readable or usable as simply designing a simple
 > API for your specific data set in the first place.
 > 
 > By the way, this rule of thumb may make more sense if you think about
 > the analogy to relational database schemas.  If you want a general
 > database to store lots of different general data types in a problem
 > domain, it makes sense to stick with a standard database schema.
 > However, the more specific your database becomes within that problem
 > domain, the more you will find yourself jumping through hoops to gain
 > efficiency and ease of data access.
 > 
 > I realize my views may be controversial. I have to say up front that I
 > am still struggling with what XML is and is not good for at this early
 > stage, so take what I say with that caveat.
 > 
 > [2] XML Perl object serialization?
 > 
 > To touch on another note, I would not recommend using a tool
 > to auto-generate XML from Perl object structures.  Such a tool is
 > likely to expose too much extraneous stuff to the user.  The only time
 > I would think seriously about doing that is to be able to recreate the
 > object over the network from an XML stream. Sort of like Java object
 > serialization over HTTP.
 > 
 > But I imagine that would not be a good "cross-language" interface to
 > expose to the world.
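 > 
 > To illustrate what I mean by extraneous stuff, a naive dumper (this is
 > just a sketch, not any particular module, and the object is made up)
 > happily emits every internal field of the object:
 > 
 >   use strict;
 > 
 >   # naive sketch: walk a hash-based object, dump every key as an element
 >   sub obj_to_xml {
 >       my ($obj, $tag) = @_;
 >       my $xml = sprintf '<%s class="%s">', $tag, ref $obj;
 >       for my $key (sort keys %$obj) {          # internal fields leak out here
 >           my $val = $obj->{$key};
 >           $xml .= ref $val ? obj_to_xml($val, $key)  # recurse into nested hashes
 >                            : "<$key>$val</$key>";
 >       }
 >       return $xml . "</$tag>";
 >   }
 > 
 >   # a hypothetical hash-based object with private bookkeeping fields
 >   my $seq = bless {
 >       name      => 'CEESH58',
 >       dna       => 'GATTTCAGGA',
 >       _dirty    => 1,                # implementation details the
 >       _db_rowid => 42,               # outside world shouldn't see
 >   }, 'My::Sequence';
 >   print obj_to_xml($seq, 'sequence'), "\n";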
 > 
 > [3] Applets and XML
 > 
 > I think you also mentioned something about applets for display? I would
 > not recommend XML parsing inside applets quite yet.  The Java XML parsers
 > are not that fast yet, and they are still a bit bloated.  Let's put it
 > this way: Aelfred, which is optimized for speed and size, is still 26k
 > uncompressed and 15k as a compressed jar. :(  And that does not even give
 > you any utilities for handling timeouts or the other standard
 > communications chores you would want when talking over HTTP.
 > 
 > If someone wants to write a small applet that displays your data in a
 > novel way, I would tend to suggest they write a CGI/Perl script to fetch
 > and decode the XML data through LWP. The script can then output the data
 > to the Java applet in a trivially parseable comma- or pipe-delimited form.
 > Furthermore, using CGI/Perl (or Servlet, or whatever) middleware allows you
 > to move processing logic to the web server and out of the applet, further
 > minimizing the size of the applet itself.
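 > 
 > A bare-bones version of that middleware might look something like the
 > following (the URL and element names are placeholders, not a real data
 > source):
 > 
 >   #!/usr/bin/perl -w
 >   # sketch: fetch XML from the data source, hand the applet pipe-delimited rows
 >   use strict;
 >   use LWP::Simple qw(get);
 >   use XML::Parser;
 > 
 >   my $xml = get('http://www.example.org/sequences.xml')   # placeholder URL
 >       or die "couldn't fetch data";
 > 
 >   my (@rows, %field, $current);
 >   my $parser = XML::Parser->new(Handlers => {
 >       Start => sub {
 >           my ($p, $el, %attr) = @_;
 >           $current = $el;
 >           %field = (id => $attr{id} || '') if $el eq 'sequence';
 >       },
 >       Char  => sub { $field{$current} .= $_[1] if $current },
 >       End   => sub {
 >           my ($p, $el) = @_;
 >           undef $current;
 >           push @rows, join '|', map { defined $field{$_} ? $field{$_} : '' }
 >                                     qw(id description dna)
 >               if $el eq 'sequence';
 >       },
 >   });
 >   $parser->parse($xml);
 > 
 >   print "Content-type: text/plain\n\n";
 >   print "$_\n" for @rows;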
 > 
 > If you are interested in a library to handle this sort of
 > communications/parsing for you, JavaCGIBridge 2.0 is located at
 > http://www.gunther.web66.com/JavaCGIBridge/. It comes with a small
 > automatic delimited-file->Vector parser, is less than 10k uncompressed for
 > the core classes, and handles communications timeouts and the like
 > automatically for you, so you aren't stuck with the blocking URLConnection
 > JDK class.
 > 
 > Of course, if the applet being developed is already huge for some other
 > reason, such as complicated data-display algorithms, or is destined for
 > deployment on an intranet, then a 26k XML parser becomes less of a
 > concern. But most people tend to want to keep their applets as thin as
 > possible.
 > 
 > Later,
 >   Gunther
 > 
-- 
========================================================================
Lincoln D. Stein                           Cold Spring Harbor Laboratory
lstein@cshl.org			                  Cold Spring Harbor, NY
========================================================================
=========== Bioperl Project Mailing List Message Footer =======
Project URL: http://bio.perl.org/
For info about how to (un)subscribe, where messages are archived, etc:
http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/vsns-bcd-perl.html
====================================================================