[Biojava-l] "XGLT" XML based bioinformatics functional programming
language(s) proposal
McCulloch, Alan
alan.mcculloch at agresearch.co.nz
Mon Jun 2 16:56:25 EDT 2003
I'm posting this naive proposal for an XML based
functional-programming style of bioinformatics language,
or collection of languages, to the main open-bio lists
I am familiar with, to try to find out if anybody else
is interested in thinking about a non-naive proposal,
or knows somebody who might be, or is already doing so.
For very tentative examples of the sort of
thing I have in mind, see Examples 1 and 2 below.
(For one overview of functional programming
languages and XML see for example
http://www.xml.com/pub/a/2001/02/14/functional.html)
In what follows an XML based functional-programming
style of bioinformatics language is referred
to as an XGLT (i.e. XSLT-with-a-G, "Genetic
Transformation Language", for want of a better term,
though it's not really related to Genetics specifically,
so the G is moot).
The main ideas initially are that such a language
would:
* provide a high-level implementation-independent
interface to the rich Object Oriented (O-O) libraries
(BioJava, BioPerl, BioPython and others), more accessible
to non-experts, and to developers working in other
environments. XGLT interpreters could
be developed using these libraries.
* provide an alternative "constructive" way
of representing biological sequence and other
data. An XGLT based data packet would in general
express how to (re)construct a given piece of
biological sequence data - e.g. a sequence, or
a consensus alignment of sequences, or a
translation - rather than convey the data itself,
or any particular model of the data. While
initially limited to sequence data, it is possible
such a functional programming dual may find
application to other biological data.
Such languages would have the following benefits:
1) They would enable reference to and exchange of
large complex data structures, such as alignments,
in a succinct form that is very suitable for further
manipulation.
(Example 1 below)
2) Because such languages would in most cases
exchange statements about how to (re)construct
data, rather than the data itself, they would
convey valuable information lost when only the
end results are transmitted - as an example,
any indels made in a DNA sequence read as part of its
protein translation. (Example 2 below)
3) Such languages could potentially provide a convenient
higher-level, more declarative style of functional
programming interface to Object Oriented libraries,
such as BioJava, BioPerl, BioPython and others,
as these O-O libraries could be used to write the XGLT
engines required to actually interpret and execute XGLT
statements.
4) A functional programming style lends itself more
readily to expressing a chain of processing steps,
i.e. a (mini-) pipeline, than does an Object
Oriented system, which is more expressive of
static structure. See Example 2 below for a
very simple/naive example of a micro-translation-pipeline
expressed as a nested series of transforms in an XGLT.
5) This point is related to both point 3) and point 4)
above.
It is likely that one popular method of making
Bioinformatic software libraries such as the Bio*
projects accessible to the non-expert and/or
non-Java/Perl/Python user will be to build Web
Services directories (WSDL), with each service
mapping to a static Bio* facade method, that
internally creates temporary Bio* Objects to
execute the service method.
However, this approach is really limited to
one-shot services. Where a task calls for a
series of services to be invoked in a pipeline,
the fact that the underlying Bio* objects do not
persist between calls is a problem, which would
require expensive marshalling of output and input
between web-service calls.
The combination of an XGLT language allowing a
non-expert user to specify a nested series of
processing steps in a high-level
implementation-independent manner, with an XGLT
interpreter/engine written using one or more of
the well engineered rich O-O-based Bio* libraries,
would potentially allow the entire pipeline to be
executed within the O-O based engine, with objects
persisting as and when required, for the entire
pipeline process.
6) An advantage of making a functional-programming
representation XML based is that in many cases the
representation would not need to be interpreted by
a real XGLT interpreter to be useful.
For example it is easy to use XSLT to
transform Example 1 below into something
like an SVG (http://www.w3.org/TR/SVG/) based display of
the patterns of variation in an alignment, without
even actually executing the various editing steps
required to construct the reads.
An XGLT dual of a protein reference sequence, as in
Example 2, includes enough information to
plot a rich feature track on a genome viewer, without
actually executing the translation.
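
Point 6 can be illustrated without any XGLT engine at all. The following is a minimal Python sketch, assuming the pretend namespace URIs from Example 1 and a cut-down copy of its read1 and read4 transforms. The function name variation_summary is hypothetical. It only reads the declared edits with the standard library XML parser and never applies them.

```python
import xml.etree.ElementTree as ET

# Pretend namespace URIs, copied from the declarations in Example 1.
XGLT = "www.pretend.xglt.org/XGLT-version-1.html"
XSEQEDIT = "www.pretend.xglt.org/xseqedit-version-1.html"

# A cut-down copy of Example 1 (read1 and read4 only).
DOC = """<mydata xmlns:xglt="www.pretend.xglt.org/XGLT-version-1.html"
        xmlns:xseqedit="www.pretend.xglt.org/xseqedit-version-1.html">
  <xglt:transform name="read1">
    <xglt:apply_transform name="../consensus"/>
    <xseqedit:substitute from="6" to="6" value="C"/>
  </xglt:transform>
  <xglt:transform name="read4">
    <xglt:apply_transform name="../consensus"/>
    <xseqedit:substitute from="1" to="1" value="null()"/>
    <xseqedit:substitute from="4" to="4" value="C"/>
    <xseqedit:substitute from="13" to="13" value="null()"/>
  </xglt:transform>
</mydata>"""


def variation_summary(xml_text):
    """Map each transform name to the (from, to, value) edits it declares.

    Nothing is executed: the edits are read straight off the XML.
    """
    root = ET.fromstring(xml_text)
    summary = {}
    for transform in root.findall(f"{{{XGLT}}}transform"):
        summary[transform.get("name")] = [
            (int(e.get("from")), int(e.get("to")), e.get("value"))
            for e in transform.findall(f"{{{XSEQEDIT}}}substitute")
        ]
    return summary


if __name__ == "__main__":
    for name, edits in variation_summary(DOC).items():
        print(name, edits)
```

The resulting per-read lists of positions are the kind of input a display of variation along the consensus would need.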
Finally, it would be desirable to provide some sort of
theoretical context for the suggestions and examples
presented here, and so I give a very tentative one.
Comparing the two representations of an alignment
of sequences in Example 1, both contain the same
information, but one (the XGLT version) is projected into
a space of functions, and the other into a geometric space.
This is analogous to the duality between the time-domain
and frequency-domain representations of a mathematical
function or data series.
(Another analogy is with the duality between a vector
space and the dual-space of linear functionals defined
on that space.)
Others have pointed out a duality relationship between
Object Oriented and Functional Programming languages.
So the tentative theoretical context is that
expressions in XGLT languages would amount to almost
formal duals of the original data and models.
Therefore I would suggest the XGLT representation of
something like an alignment (Example 1) or
protein translation (Example 2) be referred to as
the "XGLT dual" or "functional programming dual" of the
original, to emphasize that we are really dealing with
the same information, but projected into a
different space - one of functions.
And just as working in the frequency domain can
sometimes be a productive thing to do with a
mathematical function or data series, so working in a
dual functional-programming domain as suggested here may
be productive for some purposes.
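
The claim that both representations carry the same information can be checked on a small case. This is a minimal Python sketch, assuming that an aligned read is written as a row the same length as the consensus, with a space standing in for null() (no base at that column). The function names to_edits and from_edits are hypothetical. It converts the read4 row of Example 1 into a list of edits and back again.

```python
def to_edits(consensus, aligned_read):
    """Functional form: 1-based positions where the read departs from the consensus."""
    if len(consensus) != len(aligned_read):
        raise ValueError("row and consensus must have the same length")
    return [
        (pos, r)
        for pos, (c, r) in enumerate(zip(consensus, aligned_read), start=1)
        if c != r
    ]


def from_edits(consensus, edits):
    """Geometric form: rebuild the aligned row by applying the edits to the consensus."""
    row = list(consensus)
    for pos, value in edits:
        row[pos - 1] = value
    return "".join(row)


if __name__ == "__main__":
    consensus = "CGATC-GAGCGTG"
    read4_row = " GACC-GAGCGT "  # read4 from Example 1, padded to consensus length
    edits = to_edits(consensus, read4_row)
    print(edits)
    print(from_edits(consensus, edits) == read4_row)
```

The edit list obtained this way has the same positions and values as the read4 transform in Example 1, and applying it to the consensus returns the original row.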
I'd be grateful for any feedback (however harsh!)
on my admittedly very naive proposal.
Cheers
Alan McCulloch
---------------------------------------------------------------------
Example 1
---------------------------------------------------------------------
Set out below is a possible XGLT dual of the
following alignment fragment:
>Contig1
CGATCGAGCGTG

read1   CGATCCGAGCGTG
read2    GATC-GAGCGTG
read3     GACC-AGGGTT
read4    GACC-GAGCGT
read5     ATC-GA
        -------------
        CGATC-GAGCGTG
<!--
this is an XGLT functional-programming dual
of an alignment of reads making up a contig.
Rather than literally presenting the contig, consensus
and alignments, it gives instructions for how to construct
the consensus given the contig, and then for constructing
each read from the consensus - i.e. working backwards.
-->
<mydata
xmlns:xglt="www.pretend.xglt.org/XGLT-version-1.html"
xmlns:xbiopath="www.pretend.xglt.org/xbiopath-version-1.html"
xmlns:xseqedit="www.pretend.xglt.org/xseqedit-version-1.html"
xmlns:xprotein="www.pretend.xglt.org/xprotein-version-1.html">
<!--
provide the contig starting point
-->
<contig1>
CGATCGAGCGTG
</contig1>
<!--
a transform to obtain the consensus
-->
<xglt:transform name="consensus">
<xbiopath:copy_sequence source="../contig1"/>
<xseqedit:insert from="5" value="gap()" count="1"/>
</xglt:transform>
<!--
transforms to obtain each read from the
consensus - we only need to specify changes from the
consensus. Each transform first calls the above consensus
transform, to provide its starting point (an XGLT interpreter
engine would of course optimise such multiple calls away in
the actual execution)
-->
<xglt:transform name="read1">
<xglt:apply_transform name="../consensus"/>
<xseqedit:substitute from="6" to="6" value="C"/>
</xglt:transform>
<xglt:transform name="read2">
<xglt:apply_transform name="../consensus"/>
<xseqedit:substitute from="1" to="1" value="null()"/>
</xglt:transform>
<xglt:transform name="read3">
<xglt:apply_transform name="../consensus"/>
<xseqedit:substitute from="1" to="2" value="null()"/>
<xseqedit:substitute from="3" to="3" value="G"/>
<xseqedit:substitute from="4" to="4" value="A"/>
<xseqedit:substitute from="6" to="6" value="C"/>
<xseqedit:substitute from="7" to="7" value="gap()"/>
<xseqedit:substitute from="10" to="10" value="G"/>
<xseqedit:substitute from="13" to="13" value="T"/>
</xglt:transform>
<xglt:transform name="read4">
<xglt:apply_transform name="../consensus"/>
<xseqedit:substitute from="1" to="1" value="null()"/>
<xseqedit:substitute from="4" to="4" value="C"/>
<xseqedit:substitute from="13" to="13" value="null()"/>
</xglt:transform>
<xglt:transform name="read5">
<xglt:apply_transform name="../consensus"/>
<xseqedit:substitute from="1" to="2" value="null()"/>
<xseqedit:substitute from="9" to="13" value="null()"/>
</xglt:transform>
</mydata>
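
As a check on Example 1, the edits can be applied by hand. The following is a minimal Python sketch of what an interpreter might do, under assumptions that the proposal does not pin down: positions are 1-based, insert adds its value after the given position, gap() renders as "-", and null() blanks a column without shifting later coordinates. All function names are hypothetical.

```python
GAP, NULL = "-", " "  # assumed renderings of gap() and null()


def value_of(token):
    """Translate an XGLT value token into a single character."""
    return {"gap()": GAP, "null()": NULL}.get(token, token)


def insert(seq, after, value, count=1):
    """Insert `count` copies of value after 1-based position `after`."""
    return seq[:after] + value_of(value) * count + seq[after:]


def substitute(seq, start, stop, value):
    """Overwrite 1-based positions start..stop (inclusive) with value."""
    return seq[:start - 1] + value_of(value) * (stop - start + 1) + seq[stop:]


def consensus(contig):
    """The "consensus" transform: copy the contig, then insert one gap."""
    return insert(contig, 5, "gap()")


# (from, to, value) triples copied from the read transforms in Example 1.
READ_EDITS = {
    "read1": [(6, 6, "C")],
    "read2": [(1, 1, "null()")],
    "read3": [(1, 2, "null()"), (3, 3, "G"), (4, 4, "A"), (6, 6, "C"),
              (7, 7, "gap()"), (10, 10, "G"), (13, 13, "T")],
    "read4": [(1, 1, "null()"), (4, 4, "C"), (13, 13, "null()")],
    "read5": [(1, 2, "null()"), (9, 13, "null()")],
}


def build_read(contig, name):
    """Start from the consensus and apply the named read's edits in order."""
    seq = consensus(contig)
    for start, stop, value in READ_EDITS[name]:
        seq = substitute(seq, start, stop, value)
    return seq


if __name__ == "__main__":
    for name in READ_EDITS:
        print(name, build_read("CGATCGAGCGTG", name).rstrip())
```

Under these assumptions each transform reproduces the corresponding row of the alignment fragment, including its offset from the start of the consensus.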
---------------------------------------------------------------------
Example 2
---------------------------------------------------------------------
<!--
this is an XGLT functional-programming dual
of a hypothetical RefSeq protein sequence, that has
undergone a curated translation from an underlying read
(hg11 genome say) that contains errors. Rather than
presenting the literal end-product sequence, this dual gives
instructions for how to construct it. When processed by an XGLT
interpreter/engine, the end result would simply be the
RefSeq protein sequence
-->
<xglt:transform
name="myRefSeqProtein"
xmlns:xglt="www.pretend.xglt.org/XGLT-version-1.html"
xmlns:xbiopath="www.pretend.xglt.org/xbiopath-version-1.html"
xmlns:xseqedit="www.pretend.xglt.org/xseqedit-version-1.html"
xmlns:xprotein="www.pretend.xglt.org/xprotein-version-1.html">
<!--
this transform retrieves 3 exons from hg11 and
concatenates them into a single string
-->
<xglt:transform name="getMyRefseqExons">
<xbiopath:extract_sequence target="hg11">
<xbiopath:subseq start="chr3.12345" stop="chr3.12545"/>
<xbiopath:subseq start="chr3.23456" stop="chr3.23656"/>
<xbiopath:subseq start="chr3.34567" stop="chr3.34667"/>
<xglt:concatenate xref="./workspace()"/>
</xbiopath:extract_sequence>
</xglt:transform>
<!--
this transform calls the above transform to retrieve
sequence, and then applies some edits
-->
<xglt:transform name="myCuratedRefSeq">
<xglt:apply_transform name="../getMyRefseqExons"/>
<xseqedit:delete from="100" to="110"/>
<xseqedit:insert from="50" value="G" count="1"/>
<xseqedit:substitute from="200" to="200" value="G"/>
</xglt:transform>
<!--
this transform calls the above transform to supply a DNA
sequence, and then translates it
-->
<xglt:transform name="translation">
<xglt:apply_transform name="../myCuratedRefSeq"/>
<xprotein:translate species="human"/>
</xglt:transform>
</xglt:transform>
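
The nested transforms of Example 2 map naturally onto composed functions. Below is a minimal Python sketch, assuming a toy 24-base "genome" and small made-up coordinates in place of hg11 and chr3.12345 etc., and assuming each edit is applied in turn to the current sequence. The codon table is the standard genetic code. functools.lru_cache stands in for an engine that keeps intermediate results alive for the whole pipeline. All names are hypothetical.

```python
from functools import lru_cache

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {a + b + c: AMINO[16 * i + 4 * j + k]
          for i, a in enumerate(BASES)
          for j, b in enumerate(BASES)
          for k, c in enumerate(BASES)}

# Toy stand-in for a genome build; not real hg11 data or coordinates.
GENOME = {"chr3": "CCATGTATGGTGACCTTAATAGCC"}
EXONS = [("chr3", 3, 8), ("chr3", 11, 15), ("chr3", 18, 22)]  # 1-based, inclusive


@lru_cache(maxsize=None)
def get_exons():
    """Like getMyRefseqExons: extract the exons and concatenate them."""
    return "".join(GENOME[chrom][start - 1:stop] for chrom, start, stop in EXONS)


@lru_cache(maxsize=None)
def curated():
    """Like myCuratedRefSeq: fetch the exons, then apply a delete, insert, substitute."""
    seq = get_exons()
    seq = seq[:9] + seq[11:]        # delete from=10 to=11
    seq = seq[:3] + "G" + seq[3:]   # insert from=3 value=G count=1
    seq = seq[:4] + "C" + seq[5:]   # substitute from=5 to=5 value=C
    return seq


@lru_cache(maxsize=None)
def translation():
    """Like the translation transform: translate the curated DNA codon by codon."""
    dna = curated()
    return "".join(CODONS[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


if __name__ == "__main__":
    print(get_exons())
    print(curated())
    print(translation())
```

Because each stage is cached, a second request for the translation, or for the curated sequence on its own, would not re-extract the exons.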
---------------------------------------------------------------------
Comment on Above Examples
---------------------------------------------------------------------
In these examples I have...
1) ...tried to suggest a functional style of
programming, but an actual XGLT may look
quite different.
Transformations are declared and referenced inside
other transformations, in a nested structure. Each
transform stands alone, in that it first calls
another transform that provides its starting point
(and this transform may in turn involve a call to
another transform, etc).
2) ...tried to demonstrate how an XGLT would convey
valuable information about (in this example) the way
the RefSeq was made, not just the sequence of the
RefSeq itself. We not only achieve a succinct and in
this case compressed expression of the actual sequence
of the RefSeq, we also have an audit-trail of how the
RefSeq was curated.
3) ...supposed that rather than a single xglt language/namespace,
there would be a collection of namespaces such as:
xglt: basic language for expressing things in a
functional programming manner - defining and
referencing transforms etc.
xbiopath: functions for referencing and extracting
biological sequences from databases and genomes. The
example given in Example 2 is a simple coordinate based
extract, but one could also envisage specifying things
like similarity based paths....
<xbiopath:match_sequence query="../myCuratedRefSeq()"
method="blast -e 1.0e-30"
target="hg15" offset="-2000" length="2500"/>
- this would result in the extraction of 2.5 kb
sections of sequence, each starting 2 kb upstream
of an hg15 hit to the RefSeq that was constructed in
Example 2.
xprotein: functions for working with protein primary and
secondary structure
xseqedit: basic functions for sequence editing. This
example shows indels and changes - one can also envisage,
say, masking and quality trimming functions that
could be specified in a transform, as part
of a pipeline.
4) noted that one would also want to be able to use
XPath-ish (http://www.w3.org/TR/xpath) references, to
other parts of the current or other XGLT documents.
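
The offset and length attributes in the match_sequence illustration under point 3 amount to simple window arithmetic. Here is a minimal Python sketch, assuming hit positions are 1-based start coordinates on a single forward strand and that windows are clipped at the start of the chromosome. The hit positions used are made up, not BLAST output, and the function name is hypothetical.

```python
def upstream_windows(hit_starts, offset=-2000, length=2500, chrom_length=None):
    """Return a (start, stop) window, 1-based inclusive, for each hit start.

    Each window begins `offset` bases from the hit and spans `length` bases,
    clipped to the chromosome where it would run off either end.
    """
    windows = []
    for hit in hit_starts:
        start = hit + offset
        stop = start + length - 1
        start = max(1, start)
        if chrom_length is not None:
            stop = min(stop, chrom_length)
        windows.append((start, stop))
    return windows


if __name__ == "__main__":
    # Made-up hit positions, for illustration only.
    print(upstream_windows([10000, 500]))
```

A hit at position 10000 would therefore yield the window 8000 to 10499, that is, 2 kb upstream of the hit plus the first 500 bases from the hit onwards.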