[DAS] "XGLT" XML based bioinformatics functional programming language(s) proposal

Mon Jun 2 16:56:25 EDT 2003

I'm posting this naive proposal for an XML based 
functional-programming style of bioinformatics language, 
or collection of languages, to the main open-bio lists 
I am familiar with, to find out whether anybody else 
is interested in thinking about a non-naive proposal, 
knows somebody who might be, or is already doing so.

For very tentative examples of the sort of
thing I have in mind, see Examples 1 and 2 below.

(For one overview of functional programming 
languages and XML, see for example 
http://www.xml.com/pub/a/2001/02/14/functional.html)

In what follows, an XML based functional-programming 
style of bioinformatics language is referred 
to as an XGLT (i.e. XSLT-with-a-G, "Genetic 
Transformation Language", for want of a better term, 
though it's not really related to Genetics specifically, 
so the G is moot).

The main ideas initially are that such a language 
would 

   * provide a high-level implementation-independent 
   interface to the rich Object Oriented (O-O) libraries 
   (BioJava, BioPerl, BioPython and others), more accessible 
   to non-experts and to developers working in other
   environments. XGLT interpreters could 
   be developed using these libraries.

   * provide an alternative "constructive" way
   of representing biological sequence and other 
   data. An XGLT based data packet would in general 
   express how to (re)construct a given piece of 
   biological sequence data - e.g. a sequence, or 
   a consensus alignment of sequences, or a 
   translation - rather than convey the data itself, 
   or any particular model of the data. While 
   initially limited to sequence data, it is possible 
   such a functional programming dual may find 
   application to other biological data.

Such languages would have the following benefits:

1) They would enable reference to and exchange of 
   large complex data structures, such as alignments, 
   in a succinct form that is very suitable for further 
   manipulation.
   (Example 1 below)

2) Because such languages would in most cases 
   exchange statements about how to (re)construct 
   data, rather than the data itself, they would 
   convey valuable information lost when only the 
   end results are transmitted - as an example, 
   any indels made in a DNA sequence read as part of its 
   protein translation. (Example 2 below)

3) Such languages could potentially provide a convenient, 
   higher-level, more declarative style of functional 
   programming interface to Object Oriented libraries 
   such as BioJava, BioPerl, BioPython and others, 
   as these O-O libraries could be used to write the XGLT 
   engines required to actually interpret and execute XGLT 
   statements.

4) A functional programming style lends itself more 
   readily to expressing a chain of processing steps, 
   i.e. a (mini-) pipeline, than does an Object 
   Oriented system, which is more expressive of 
   static structure. See Example 2 below for a 
   very simple/naive example of a micro-translation-pipeline 
   expressed as a nested series of transforms in an XGLT.

5) This point is related to both point 3) and point 4)
   above.

   It is likely that one popular method of making 
   Bioinformatic software libraries such as the Bio* 
   projects accessible to the non-expert and/or 
   non-Java/Perl/Python user will be to build Web 
   Services directories (WSDL), with each service 
   mapping to a static Bio* facade method that 
   internally creates temporary Bio* objects to 
   execute the service method.

   However, this approach is really limited to 
   one-shot services. Where a task calls for a 
   series of services to be invoked in a pipeline, 
   the fact that the underlying Bio* objects do not 
   persist between calls is a problem, as it would 
   require expensive marshalling of output and input 
   between web-service calls.

   The combination of an XGLT language allowing a 
   non-expert user to specify a nested series of 
   processing steps in a high-level 
   implementation-independent manner, with an XGLT 
   interpreter/engine written using one or more of 
   the well engineered rich O-O-based Bio* libraries, 
   would potentially allow the entire pipeline to be 
   executed within the O-O based engine, with objects 
   persisting as and when required, for the entire 
   pipeline process.

6) An advantage of making a functional-programming 
   representation XML based is that in many cases the
   representation would not need to be interpreted by 
   a real XGLT interpreter to be useful.

   For example, it is easy to use XSLT to 
   transform Example 1 below into something 
   like an SVG (http://www.w3.org/TR/SVG/) based display of 
   the patterns of variation in an alignment, without 
   even actually executing the various editing steps
   required to construct the reads.

   An XGLT dual of a protein reference sequence, as in
   Example 2, includes enough information to 
   plot a rich feature track on a genome viewer, without 
   actually executing the translation.
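
   As a rough illustration of point 6, here is a minimal Python 
   sketch (Python standing in for XSLT, which is my substitution) 
   that reads a cut-down copy of the Example 1 markup and marks 
   which alignment columns each read edits, without applying any 
   edit to any sequence. The function name variation_track, the 
   fixed width of 13 columns and the "*" / blank / "." marking 
   scheme are all assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

# Pretend namespace URIs, copied from Example 1.
XGLT = "www.pretend.xglt.org/XGLT-version-1.html"
XSEQEDIT = "www.pretend.xglt.org/xseqedit-version-1.html"

# Cut-down copy of two of the read transforms from Example 1.
DOC = f"""
<mydata xmlns:xglt="{XGLT}" xmlns:xseqedit="{XSEQEDIT}">
   <xglt:transform name="read1">
      <xglt:apply_transform name="../consensus"/>
      <xseqedit:substitute from="6" to="6" value="C"/>
   </xglt:transform>
   <xglt:transform name="read5">
      <xglt:apply_transform name="../consensus"/>
      <xseqedit:substitute from="1" to="2" value="null()"/>
      <xseqedit:substitute from="9" to="13" value="null()"/>
   </xglt:transform>
</mydata>
"""


def variation_track(xml_text, width=13):
    """Map each transform name to a string marking its edited columns.

    "." = same as consensus, "*" = changed, " " = not covered (null()).
    The markup is only inspected; no sequence is ever reconstructed.
    """
    tracks = {}
    root = ET.fromstring(xml_text)
    for transform in root.iter(f"{{{XGLT}}}transform"):
        marks = ["."] * width
        for edit in transform.iter(f"{{{XSEQEDIT}}}substitute"):
            kind = " " if edit.get("value") == "null()" else "*"
            first, last = int(edit.get("from")), int(edit.get("to"))
            for col in range(first, last + 1):
                marks[col - 1] = kind
        tracks[transform.get("name")] = "".join(marks)
    return tracks


tracks = variation_track(DOC)
for name, track in tracks.items():
    print(f"{name:6s}|{track}|")
```

   The same walk over the markup could just as well emit SVG 
   rectangles instead of text; the point is only that the edit 
   list is already a description of the variation.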

Finally, it would be desirable to provide some sort of
theoretical context for the suggestions and examples 
presented here, and so I give a very tentative one.

Comparing the two representations of an alignment
of sequences in Example 1, both contain the same
information, but one (the XGLT version) is projected into 
a space of functions, and the other into a geometric space.

This is analogous to the duality between the time-domain 
and frequency-domain representations of a mathematical
function or data series.

(Another analogy is with the duality between a vector
space and the dual-space of linear functionals defined
on that space)

Others have pointed out a duality relationship between 
Object Oriented and Functional Programming languages.

So the tentative theoretical context is that 
expressions in XGLT languages would amount to almost 
formal duals of the original data and models.

Therefore I would suggest the XGLT representation of
something like an alignment (Example 1) or 
protein translation (Example 2) be referred to as
the "XGLT dual" or "functional programming dual" of the 
original, to emphasize that we are really dealing with 
the same information, but projected into a 
different space - one of functions.

And just as working in the frequency domain can 
sometimes be a productive thing to do with a 
mathematical function or data series, so working in a 
dual functional-programming domain as suggested here may 
be productive for some purposes.

I'd be grateful for any feedback (however harsh!)
on my admittedly very naive proposal.

Cheers

Alan McCulloch

---------------------------------------------------------------------
Example 1
---------------------------------------------------------------------

Set out below is a possible XGLT dual of the 
following alignment fragment:

>Contig1
CGATCGAGCGTG

read1   CGATCCGAGCGTG
read2    GATC-GAGCGTG
read3     GACC-AGGGTT
read4    GACC-GAGCGT
read5     ATC-GA
        -------------
        CGATC-GAGCGTG

<!--
   this is an XGLT functional-programming dual
   of an alignment of reads making up a contig. 
   Rather than literally presenting the contig, consensus 
   and alignments, it gives instructions for how to construct 
   the consensus given the contig, and then for constructing 
   each read from the consensus - i.e. working backwards. 
-->
<mydata
   xmlns:xglt="www.pretend.xglt.org/XGLT-version-1.html"      
   xmlns:xbiopath="www.pretend.xglt.org/xbiopath-version-1.html"
   xmlns:xseqedit="www.pretend.xglt.org/xseqedit-version-1.html"
   xmlns:xprotein="www.pretend.xglt.org/xprotein-version-1.html">
   <!--
      provide the contig starting point
   -->
   <contig1>
      CGATCGAGCGTG
   </contig1>

   <!--
      a transform to obtain the consensus
   -->
   <xglt:transform name="consensus">
      <xbiopath:copy_sequence source="../contig1"/>
      <xseqedit:insert from="5" value="gap()" count="1"/>
   </xglt:transform>

   <!--
      transforms to obtain each read from the 
      consensus - we only need to specify changes from the 
      consensus. Each transform first calls the above consensus 
      transform, to provide its starting point (an XGLT interpreter 
      engine would of course optimise such multiple calls away in 
      the actual execution)
   -->
   <xglt:transform name="read1">
      <xglt:apply_transform name="../consensus"/>
      <xseqedit:substitute from="6" to="6" value="C"/>
   </xglt:transform>
   <xglt:transform name="read2">
      <xglt:apply_transform name="../consensus"/>
      <xseqedit:substitute from="1" to="1" value="null()"/>
   </xglt:transform>
   <xglt:transform name="read3">
      <xglt:apply_transform name="../consensus"/>
      <xseqedit:substitute from="1" to="2" value="null()"/>
      <xseqedit:substitute from="3" to="3" value="G"/>
      <xseqedit:substitute from="4" to="4" value="A"/>
      <xseqedit:substitute from="6" to="6" value="C"/>
      <xseqedit:substitute from="7" to="7" value="gap()"/>
      <xseqedit:substitute from="10" to="10" value="G"/>
      <xseqedit:substitute from="13" to="13" value="T"/>
   </xglt:transform>
   <xglt:transform name="read4">
      <xglt:apply_transform name="../consensus"/>
      <xseqedit:substitute from="1" to="1" value="null()"/>
      <xseqedit:substitute from="4" to="4" value="C"/>
      <xseqedit:substitute from="13" to="13" value="null()"/>
   </xglt:transform>
   <xglt:transform name="read5">
      <xglt:apply_transform name="../consensus"/>
      <xseqedit:substitute from="1" to="2" value="null()"/>
      <xseqedit:substitute from="9" to="13" value="null()"/>
   </xglt:transform>
</mydata>
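
To check that the edit lists above really do reproduce the 
alignment, here is a minimal Python sketch of what an XGLT 
engine might do with Example 1. Everything in it is an 
assumption for illustration: the helper names (insert, 
substitute, render, build), the reading of insert from="5" as 
"insert after position 5", the use of 1-based inclusive 
coordinates, and the use of None to stand in for null(). 
No XML parsing is done; the edits are transcribed by hand.

```python
GAP = "-"


def insert(seq, after, value, count=1):
    """Insert `count` copies of `value` after 1-based position `after`."""
    return seq[:after] + [value] * count + seq[after:]


def substitute(seq, start, stop, value):
    """Replace 1-based inclusive columns start..stop with `value`.

    None stands in for null(): the read does not cover that column.
    """
    out = list(seq)
    for i in range(start - 1, stop):
        out[i] = value
    return out


def render(seq):
    """Drop uncovered columns and join the rest into a string."""
    return "".join(c for c in seq if c is not None)


# The "consensus" transform: copy contig1, then insert one gap.
contig1 = list("CGATCGAGCGTG")
consensus = insert(contig1, 5, GAP)

# Each read is a list of (from, to, value) edits against the consensus,
# transcribed from the xseqedit:substitute elements in Example 1.
reads = {
    "read1": [(6, 6, "C")],
    "read2": [(1, 1, None)],
    "read3": [(1, 2, None), (3, 3, "G"), (4, 4, "A"), (6, 6, "C"),
              (7, 7, GAP), (10, 10, "G"), (13, 13, "T")],
    "read4": [(1, 1, None), (4, 4, "C"), (13, 13, None)],
    "read5": [(1, 2, None), (9, 13, None)],
}


def build(name):
    """Apply a read's edits to the shared consensus and render it."""
    seq = consensus
    for start, stop, value in reads[name]:
        seq = substitute(seq, start, stop, value)
    return render(seq)


print(render(consensus))
for name in reads:
    print(name, build(name))
```

Because substitute returns a new list each time, the one 
consensus value is shared by all five reads without being 
recomputed or modified, which is the simplest form of the 
"optimise such multiple calls away" remark in the comment above.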

---------------------------------------------------------------------
Example 2
---------------------------------------------------------------------

<!--
   this is an XGLT functional-programming dual
   of a hypothetical RefSeq protein sequence, that has 
   undergone a curated translation from an underlying read 
   (hg11 genome say) that contains errors. Rather than 
   presenting the literal end-product sequence, this dual gives 
   instructions for how to construct it. When processed by an XGLT 
   interpreter/engine, the end result would simply be the 
   RefSeq protein sequence
-->
<xglt:transform 
   name="myRefSeqProtein"
   xmlns:xglt="www.pretend.xglt.org/XGLT-version-1.html"      
   xmlns:xbiopath="www.pretend.xglt.org/xbiopath-version-1.html"
   xmlns:xseqedit="www.pretend.xglt.org/xseqedit-version-1.html"
   xmlns:xprotein="www.pretend.xglt.org/xprotein-version-1.html">

   <!-- 
      this transform retrieves 3 exons from hg11 and 
      concatenates them into a single string 
   -->
   <xglt:transform name="getMyRefseqExons">
      <xbiopath:extract_sequence target="hg11">
         <xbiopath:subseq start="chr3.12345" stop="chr3.12545"/>
         <xbiopath:subseq start="chr3.23456" stop="chr3.23656"/>
         <xbiopath:subseq start="chr3.34567" stop="chr3.34667"/>
         <xglt:concatenate xref="./workspace()"/>
      </xbiopath:extract_sequence>
   </xglt:transform>

   <!-- 
      this transform calls the above transform to retrieve
      sequence, and then applies some edits 
   -->
   <xglt:transform name="myCuratedRefSeq">
      <xglt:apply_transform name="../getMyRefseqExons"/>
      <xseqedit:delete from="100" to="110"/>
      <xseqedit:insert from="50" value="G" count="1"/>
      <xseqedit:substitute from="200" to="200" value="G"/>
   </xglt:transform>

   <!--
      this transform calls the above transform to supply a DNA
      sequence, and then translates it
   -->   
   <xglt:transform name="translation">
      <xglt:apply_transform name="../myCuratedRefSeq"/>
      <xprotein:translate species="human"/>
   </xglt:transform>
</xglt:transform>
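
As a very rough sketch of how an engine might run this nested 
pipeline in one process, with intermediate results persisting 
between steps, consider the Python below. It is not tied to any 
Bio* library. The Engine class, the toy chromosome, the exon 
coordinates and the small edit positions are all invented for 
the sketch (the real example refers to hg11 coordinates I cannot 
reproduce here), and I have assumed that edits are applied in 
document order with coordinates relative to the sequence as it 
stands after the previous edit. The codon table is the standard 
genetic code.

```python
from itertools import product

# Hypothetical toy chromosome: flank + exon + intron + exon + intron + exon + flank.
TOY_CHR = "GG" + "ATGCT" + "GTAG" + "TCAAA" + "GTAG" + "CGA" + "GG"

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


class Engine:
    """Evaluates named transforms, caching each result so later steps
    reuse in-memory values instead of re-running earlier steps."""

    def __init__(self, transforms):
        self.transforms = transforms
        self.cache = {}
        self.evaluations = []  # names in the order they were actually run

    def apply(self, name):
        if name not in self.cache:
            self.evaluations.append(name)
            self.cache[name] = self.transforms[name](self)
        return self.cache[name]


def get_exons(engine):
    """Extract three 1-based inclusive ranges and concatenate them."""
    ranges = [(3, 7), (12, 16), (21, 23)]
    return "".join(TOY_CHR[a - 1:b] for a, b in ranges)


def curated(engine):
    """Fetch the raw exons, then apply delete, insert, substitute in order."""
    seq = engine.apply("getMyRefseqExons")
    seq = seq[:4] + seq[6:]          # delete from=5 to=6
    seq = seq[:3] + "G" + seq[3:]    # insert "G" after position 3
    seq = seq[:9] + "T" + seq[10:]   # substitute position 10 with "T"
    return seq


def translation(engine):
    """Translate the curated DNA codon by codon."""
    seq = engine.apply("myCuratedRefSeq")
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


engine = Engine({
    "getMyRefseqExons": get_exons,
    "myCuratedRefSeq": curated,
    "translation": translation,
})
protein = engine.apply("translation")
print(protein)
print(engine.evaluations)
```

Asking the same engine for "myCuratedRefSeq" afterwards returns 
the cached DNA without touching the toy genome again, which is 
the behaviour point 5 above hopes for from an engine built on 
persistent objects.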

---------------------------------------------------------------------
Comment on Above Examples
---------------------------------------------------------------------

In these examples I have...

1) ...tried to suggest a functional style of 
   programming, but an actual XGLT may look 
   quite different.
   Transformations are declared and referenced inside 
   other transformations, in a nested structure. Each 
   transform stands alone, in that it first calls 
   another transform that provides its starting point 
   (and this transform may in turn involve a call to 
   another transform, etc.)

2) ...tried to demonstrate how an XGLT would convey 
   valuable information about (in Example 2) the way 
   the RefSeq was made, not just the sequence of the 
   RefSeq itself. We not only achieve a succinct and in 
   this case compressed expression of the actual sequence 
   of the RefSeq, we also have an audit-trail of how the 
   RefSeq was curated.

3) ...supposed that rather than a single xglt language/name-space, 
   there would be a collection of namespaces such as

xglt: basic language for expressing things in a 
   functional programming manner - defining and 
   referencing transforms etc.

xbiopath: functions for referencing and extracting 
   biological sequences from databases and genomes. The 
   example given in Example 2 is a simple coordinate based 
   extract, but one could also envisage specifying things 
   like similarity based paths, e.g.

<xbiopath:match_sequence query="../myCuratedRefSeq()" 
          method="blast -e 1.0e-30"
          target="hg15" offset="-2000" length="2500"/>

   - this would result in the extraction of 2.5Kb 
   sections of sequence, starting 2Kb upstream 
   of each hg15 hit to the RefSeq that was constructed in 
   Example 2.

xprotein: functions for working with protein primary and 
   secondary structure

xseqedit: basic functions for sequence editing. This 
   example shows indels and changes - one can also envisage, 
   say, masking and quality trimming functions that 
   could be specified in a transform, as part 
   of a pipeline.

4) ...noted that one would also want to be able to use 
   XPath-ish (http://www.w3.org/TR/xpath) references to 
   other parts of the current or other XGLT documents.
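
   To make the "XPath-ish" idea in point 4 a little more 
   concrete, here is a small Python sketch of how a reference 
   such as "../consensus" or "../contig1" might be resolved 
   relative to the transform that contains it. The resolve 
   function and its matching rule (the last path step matches 
   either an element's tag or its name attribute) are my own 
   guesses at reasonable semantics, not part of any defined 
   XGLT. Function-style references such as "./workspace()" are 
   not handled.

```python
import xml.etree.ElementTree as ET

# Pretend namespace URI, copied from Example 1.
XGLT = "www.pretend.xglt.org/XGLT-version-1.html"

# Cut-down copy of Example 1, keeping only what the references need.
DOC = f"""
<mydata xmlns:xglt="{XGLT}">
   <contig1>CGATCGAGCGTG</contig1>
   <xglt:transform name="consensus"/>
   <xglt:transform name="read1">
      <xglt:apply_transform name="../consensus"/>
   </xglt:transform>
</mydata>
"""


def resolve(root, transform, ref):
    """Resolve a reference like '../consensus' relative to `transform`.

    Each '..' step moves to the parent element and '.' stays put; the
    final step matches a child by tag or by its name attribute.
    Returns None when nothing matches.
    """
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    node = transform
    *steps, target = ref.split("/")
    for step in steps:
        if step == "..":
            node = parents[node]
    for child in node:
        if child.tag == target or child.get("name") == target:
            return child
    return None


root = ET.fromstring(DOC)
read1 = [t for t in root if t.get("name") == "read1"][0]
print(resolve(root, read1, "../consensus").get("name"))
print(resolve(root, read1, "../contig1").text)
```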
