Bioperl: XML/BioPerl
Lincoln Stein
lstein@cshl.org
Thu, 31 Dec 1998 10:31:28 -0500
For what it's worth, I will be adding a BoulderIO <-> XML converter in
the next week or so (thanks to Jaime Prilusky for the code). Also a
CGI <-> XML converter. For those of us using Boulder as a biological
data conversion format, the future is already here!
Lincoln
Gunther Birznieks writes:
> On Wed, 30 Dec 1998, David J. States wrote:
>
> [...]
> >
> > Is anyone aware of plans on the part of the database organizations to serve
> > XML?
> >
> If you mean general databases, Matt Sergeant has a DBI->XML and
> XML->DBI converter. If you mean specific biological databases, then I am
> naive on this aspect but I do have some thoughts on the subject you bring
> up below.
>
> > An alternative to agreement on a bio DTD is to push the burden of data
> > resolution issues onto the client. In writing an applet for a specific
> > display function, you would need to know the relationship of the various
> > fields in the data sources that you were referencing. This seems less
> > desirable, but at least it is a way forward.
> >
> > Thoughts or suggestions?
> >
> Your questions are thought-provoking. To me, they reveal much about the
> subtle tension between XML as a data parsing standard and how much
> attention should actually be paid to the new markup languages that are
> vying to get formed as new "standards" under the XML umbrella.
>
> [1] XML Markup "standards"
>
> My feeling is that too many people are trying to focus too hard on
> defining standard "all-encompassing" DTDs for their problem domains. My
> belief is that in your initial prototype stage, you should definitely
> consider placing the burden on the client to understand your XML structure
> rather than waiting to conform to something that doesn't exist or isn't
> well defined yet.
>
> Even if a standard is defined but is too complex for your data needs, you
> probably should define your own XML markup anyway. Much of the value of
> XML lies in [1] being able to build structures that map easily to your
> specific problem space and [2] making that structure efficient and
> trivial to parse.
>
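A minimal sketch of the "define your own markup" idea above. Python is used purely for illustration (the thread itself is about Perl), and the `<sequences>`/`<seq>` element names, their attributes, and the sample records are hypothetical, not taken from any real biological DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical, purpose-built markup: one element per record, with only
# the fields this particular data source needs.
SAMPLE = """\
<sequences>
  <seq id="demo1" organism="E. coli" length="12">ATGGCGTACGTT</seq>
  <seq id="demo2" organism="S. cerevisiae" length="9">ATGTTTGGA</seq>
</sequences>
"""


def parse_sequences(xml_text):
    """Return one plain dict per <seq> element in the document."""
    root = ET.fromstring(xml_text)
    records = []
    for node in root.findall("seq"):
        records.append({
            "id": node.get("id"),
            "organism": node.get("organism"),
            "length": int(node.get("length")),
            "residues": (node.text or "").strip(),
        })
    return records


if __name__ == "__main__":
    for rec in parse_sequences(SAMPLE):
        print(rec["id"], rec["organism"], rec["residues"])
```

Because the markup mirrors the data directly, a client needs only a generic XML parser and a dozen lines of glue, which is the kind of "trivial to parse" structure the paragraph argues for.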
> If you are exposing a relatively simple interface to your data on the
> Internet, your users are probably going to be happy just knowing that they
> can do away with HTML::Parser, and instead download the data in a simple
> format. In addition, they won't have to weed through a bloated Document
> standard API.
>
> If the data you are presenting is complicated, I would still say the same
> thing: you may still want to consider not using a "standard biological
> markup". The reason is efficiency. Your XML may actually be more readable
> if you create it in a way that is centered around your data rather than
> someone else's idea of how the data should look.
>
> I am very wary of database "standards" since they tend to lock people into
> inefficiencies and kludges to get around those inefficiencies. In many
> cases, it would have been just as easy to provide a simpler, more readable
> format without the 20 million "exceptions to the rule" that pile up as
> people start using these things in the real world.
>
> My rule of thumb: if you are creating a general tool to do lots of general
> things, then stick to a standard. If you are creating a specific tool to
> generate specific data, conforming to the "standard" may not produce code
> as readable or usable as simply defining a simple API for your specific
> data set in the first place.
>
> By the way, this rule of thumb might make more sense if you
> think about the analogy to relational database schemas. If you want a
> general database to store lots of different general data types in a
> problem domain, it makes sense to stick with a standard database schema.
> However, the more specific your database becomes within that problem
> domain, the more you will find yourself trying to jump through hoops to
> gain efficiency and ease of data access.
>
> I realize my views may be controversial. In advance I have to say that I
> am struggling with the notion of what XML is and is not good for in this
> early stage of XML. So take what I say with that caveat.
>
> [2] XML Perl object serialization?
>
> To touch on another note, I would not recommend using a tool
> to auto-generate XML from Perl object structures. The likelihood is high
> that such a tool exposes too much extraneous stuff to the user.
> The only time I would really think seriously about doing that is to get
> the ability to potentially recreate the object over the network based on
> an XML stream. Sort of like Java object serialization over HTTP.
>
> But that would not be a good "cross-language" interface to expose to the
> world, I imagine.
>
> [3] Applets and XML
>
> I think you also mentioned something about applets for display? I would
> not recommend XML parsing inside applets quite yet. The Java XML parsers
> are not that fast and they are still a bit bloated. Let's put it
> this way: Aelfred, which is optimized for speed and size, is still 26k
> uncompressed and 15k as a compressed jar. :( And that does not even give you
> any utilities for handling timeouts or any other standard communications
> stuff you would want to do via HTTP communications.
>
> If someone wants to write a small applet that displays your data in a
> novel way, I would tend to suggest they write a CGI/Perl script to fetch
> the XML data through LWP and decode it. Then, the script can output the
> data to the Java applet in a trivially easy-to-parse comma- or
> pipe-delimited format.
> Furthermore, using CGI/Perl (or Servlet or whatever) middleware allows you
> to move processing logic to the web server and out of the applet, further
> minimizing the impact of the size of the applet itself.
>
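The middleware step described above might look roughly like the following. This is a sketch under stated assumptions, not the JavaCGIBridge or LWP approach itself: it is written in Python rather than Perl, it skips the HTTP fetch and starts from XML text already in hand, and the `<seq>` element and the attribute names in `FIELDS` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical attribute names; a real script would list whatever
# fields its own XML source actually provides.
FIELDS = ("id", "organism", "length")


def xml_to_delimited(xml_text, delimiter="|"):
    """Flatten each <seq> element into one delimiter-separated line."""
    root = ET.fromstring(xml_text)
    lines = []
    for node in root.iter("seq"):
        values = [node.get(name, "") for name in FIELDS]
        values.append((node.text or "").strip())
        # Replace any delimiter found inside a value with a space so the
        # thin client can split each line naively.
        lines.append(delimiter.join(v.replace(delimiter, " ") for v in values))
    return "\n".join(lines)


if __name__ == "__main__":
    doc = '<sequences><seq id="a" organism="E. coli" length="3">ATG</seq></sequences>'
    print(xml_to_delimited(doc))
```

The client then needs nothing heavier than a string split per line, which keeps XML parsing on the server side as the paragraph suggests.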
> If you are interested in a library to handle this sort of
> communications/parsing for you, JavaCGIBridge 2.0 is located at
> http://www.gunther.web66.com/JavaCGIBridge/. It comes with a smaller
> automatic delimited file->Vector parser, is less than 10k uncompressed for
> the core classes, and handles communications timeouts and stuff like that
> automatically for you so you aren't stuck with the blocking URLConnection
> JDK class.
>
> Of course, if the applet being developed is already huge for some other
> reason like a complication in the data display algorithms or is set for
> deployment on an intranet, then a 26k XML parser becomes less of a
> concern. But most people tend to want to keep their applets as thin as
> possible.
>
> Later,
> Gunther
>
> =========== Bioperl Project Mailing List Message Footer =======
> Project URL: http://bio.perl.org/
> For info about how to (un)subscribe, where messages are archived, etc:
> http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/vsns-bcd-perl.html
> ====================================================================
--
========================================================================
Lincoln D. Stein Cold Spring Harbor Laboratory
lstein@cshl.org Cold Spring Harbor, NY
========================================================================