Bioperl: XML/BioPerl - design proposal

Andrew Dalke dalke@bioreason.com
Sat, 23 Jan 1999 21:25:18 -0800


David J. States <states@gpc.ibc.wustl.edu> said to Wayne Parrott:
> The discussion of frameworks and design patterns is interesting
> but under emphasizes a critical point, which is the need to develop
> and maintain standards. 
> [...]
> In all the frameworks discussion, I did not see much mention of this.

If I recall correctly from a discussion on bionet.software a month
or so back, Wayne Parrott <wayne@workingobjects.com>'s work actually
addresses a problem I see in the current XML translation work: that
work ignores people who want to see the original representation,
possibly with some interpretation added.  That is a different goal
from setting up a standard to store the content.

Wayne said:
| One of my key design goals was to maximize reuse. Instead of doing
| a straight Blast->XML transformer, I created the Blast Parsing
| Framework to provide basic Blast parsing and element handling
| functionality.

  He only mentioned translating the output to XML, but other outputs
are possible.  Consider what I did for DiscoveryBase
(http://www.mag.com/products/discobase.html or more specifically the
FASTA example at http://www.mag.com/products/dtourpg6.html).  (I no
longer work for MAG.)

  With the callback framework we had, implementing the output
you see in the top image took about 100 lines of Perl.  I'll
assume from my memory of the Usenet post that Wayne's implementation
is functionally equivalent to mine.

  You can associate a tag with each line of the BLAST (or FASTA)
output.  You can also create tags to describe the start/end of
functionally related regions.  For example, I think BLAST alignment
regions were described with something like:

__begin_record
__begin_record_header
record_header *
__end_record_header
__begin_alignment

__begin_subalignment         |
                             |
query_numbers   | repeats    |
query_sequence  | because    |
homology        | alignments |-- could be multiple alignments for a given
match_sequence  | can be     |   matched sequence
match_numbers   | wrapped    |
blank*          |            |
                             |
__end_subalignment           |

__end_record

(This isn't exactly right, since there was also text associated with
each aligned region found in the matched sequence, which I have left
out here.)

These tags are used for callbacks.  Tags that start with "__" are
control callbacks, while the others carry the actual text from the
BLAST output line.  You can see how this maps both to an XML-style
tag description and to SAX-like callback code.
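
To make that mapping concrete, here is a rough sketch in Python (not
the original Perl, and not the DiscoveryBase code).  It assumes each
callback can be modelled as a (tag, text) pair; the function name
events_to_xml and the sample record label are made up for
illustration.

```python
from xml.sax.saxutils import escape

def events_to_xml(events):
    """Turn (tag, text) callback events into indented XML-style lines.

    "__begin_X" / "__end_X" control tags become <X> / </X>;
    every other tag wraps the text of one original output line.
    """
    out = []
    depth = 0
    for tag, text in events:
        if tag.startswith("__begin_"):
            out.append("  " * depth + "<" + tag[len("__begin_"):] + ">")
            depth += 1
        elif tag.startswith("__end_"):
            depth -= 1
            out.append("  " * depth + "</" + tag[len("__end_"):] + ">")
        else:
            out.append("  " * depth + "<%s>%s</%s>" % (tag, escape(text), tag))
    return out

# Hypothetical event stream; the record label text is invented.
events = [
    ("__begin_record", ""),
    ("__begin_record_header", ""),
    ("record_header", ">gi|12345 example <protein>"),
    ("__end_record_header", ""),
    ("__end_record", ""),
]

for line in events_to_xml(events):
    print(line)
```

The same event stream could just as well be fed to a handler object
with one method per tag, which is the SAX-like reading of it.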

A null converter, one that reproduces the original output unchanged,
would just print the text of every callback whose tag does not start
with "__".
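
A minimal sketch of such a null converter, again in Python and again
assuming callbacks arrive as (tag, text) pairs.  The name
null_converter and the alignment lines are invented for the example.

```python
def null_converter(events):
    """Return the original lines: keep data callbacks, drop "__" control ones."""
    return [text for tag, text in events if not tag.startswith("__")]

# Hypothetical events for one wrapped alignment block.
events = [
    ("__begin_subalignment", ""),
    ("query_sequence", "Query: 1 MKVLA 5"),
    ("homology", "         MK+ A"),
    ("match_sequence", "Sbjct: 1 MKIQA 5"),
    ("blank", ""),
    ("__end_subalignment", ""),
]

print("\n".join(null_converter(events)))
```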

With this extra context, adding support for colorized output like
that in the FASTA example linked above is easy.  For example, adding
URL cross references is done by catching the "record_header"
callback, pulling out the record label, and passing it to the routine
which expands NCBI-style record labels into full HTML.  Everything
else can be kept clean.

Adding colorized residues based on the homology means accumulating
the query_sequence, homology, and match_sequence callback text instead
of printing it.  Once the match_sequence is received, add the HTML
tags for color to the three strings, then print them.
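
Here is a hedged sketch of that accumulate-then-print idea in Python.
It makes several simplifying assumptions that are not from the
original code: the three strings hold only the aligned residues
(coordinates and labels already stripped), the homology line uses a
letter for identity, "+" for a similar residue and a space for a
mismatch, and the color scheme and all names are invented.

```python
from html import escape

COLORS = {"identical": "red", "similar": "blue"}  # hypothetical color scheme

def classify(mark):
    """Map one homology-line character to a color class (or None)."""
    if mark == " ":
        return None
    return "similar" if mark == "+" else "identical"

def colorize(seq, homology):
    """Wrap each character of seq in a color tag chosen by the homology line."""
    homology = homology.ljust(len(seq))  # tolerate stripped trailing spaces
    out = []
    for residue, mark in zip(seq, homology):
        kind = classify(mark)
        ch = escape(residue)
        if kind is None:
            out.append(ch)
        else:
            out.append('<font color="%s">%s</font>' % (COLORS[kind], ch))
    return "".join(out)

class ColorizingHandler:
    def __init__(self):
        self.pending = {}   # text held back until match_sequence arrives
        self.output = []

    def callback(self, tag, text):
        if tag in ("query_sequence", "homology"):
            self.pending[tag] = text          # accumulate instead of printing
        elif tag == "match_sequence":
            hom = self.pending["homology"]
            for line in (self.pending["query_sequence"], hom, text):
                self.output.append(colorize(line, hom))
            self.pending.clear()
        elif not tag.startswith("__"):
            self.output.append(escape(text))  # everything else passes through

handler = ColorizingHandler()
for tag, text in [
    ("__begin_subalignment", ""),
    ("query_sequence", "MKVLA"),
    ("homology", "MK+ A"),
    ("match_sequence", "MKIQA"),
    ("__end_subalignment", ""),
]:
    handler.callback(tag, text)

print("\n".join(handler.output))
```

Real BLAST lines carry coordinates on either side of the residues, so
a full version would have to color only the residue columns.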

Adding links from the "index" lines (i.e., the output block under "The
best scores are") to the full record is also quite simple.  The
index lines have a tag of "index", so the callback keeps track of
how many index lines have been reached and prints each one wrapped in
the HTML for a relative link.  Then, when the "__begin_record" tag is
reached, it prints the matching "<A name=" HTML.
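
A small Python sketch of that counting scheme follows.  It assumes
the n-th index line refers to the n-th record in the output, and the
"rec%d" anchor naming, the class name, and the sample lines are all
made up for illustration.

```python
from html import escape

class IndexLinker:
    """Link each "index" line to the record that appears later in the output."""

    def __init__(self):
        self.index_count = 0    # index lines seen so far
        self.record_count = 0   # records started so far
        self.output = []

    def callback(self, tag, text):
        if tag == "index":
            self.index_count += 1
            self.output.append(
                '<A HREF="#rec%d">%s</A>' % (self.index_count, escape(text)))
        elif tag == "__begin_record":
            self.record_count += 1
            self.output.append('<A NAME="rec%d"></A>' % self.record_count)
        elif not tag.startswith("__"):
            self.output.append(escape(text))

linker = IndexLinker()
for tag, text in [
    ("index", "first hit   120"),
    ("index", "second hit   95"),
    ("__begin_record", ""),
    ("record_header", ">first hit"),
    ("__end_record", ""),
    ("__begin_record", ""),
    ("record_header", ">second hit"),
    ("__end_record", ""),
]:
    linker.callback(tag, text)

print("\n".join(linker.output))
```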


The point of all this is that a parsing framework like the one I
think Wayne Parrott is describing lets you parse data into XML while
*also* letting you do other kinds of processing, like converting the
results into HTML while preserving the layout of the original data.
(Yes, this can be done with the right viewer for that DTD, but no one
has done that yet.)

You can use the same method to parse, well, at least every data
type DiscoveryBase supported (FASTA, BLAST, and DSC (not officially
supported) output as well as PDB, PIR, SwissProt, GenBank and
Prosite/Prodoc records).

The result is a very flexible way to get access to the data ("content")
and the layout ("representation") and retarget the output as needed
for different representation styles.  You cannot do that with the
current bioperl BLAST/FASTA parser.

And I cannot work on this for another 5 months because of how I
interpret the non-compete clause in my previous contract :(

						Andrew